Unittest mixture metadata extraction #53

Merged
6 commits merged into main from unittest-mixture-metadata-extraction on Nov 24, 2022

Conversation

soudehMasoudian (Collaborator)

A unit test has been written for the mixture_metadata_extraction script.

@eriktamsen (Member) left a comment

This seems to be a good step, but some improvements will be needed before merging.

First of all, please make sure the tests are working within your branch. It might be a good idea to pull from main, as some things have changed; maybe this will solve the problems in the actions.

Then, please add your script to the doit workflow.
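A minimal sketch of what such a doit task could look like (the task name, module path, and all file locations below are assumptions, not the project's actual layout):

# dodo.py -- illustrative sketch only; module path, CLI arguments and file
# locations are placeholders and would need to match the real project layout.
def task_extract_mixture_metadata():
    """Extract mixture metadata from the raw Excel sheet into a yaml file."""
    raw_file = "raw_data/mixture/example_mix.xlsx"          # placeholder input
    metadata_file = "processed_data/mixture_metadata.yaml"  # placeholder output

    return {
        "actions": [
            f"python -m lebedigital.raw_data_processing.mixture_metadata_extraction {raw_file} {metadata_file}"
        ],
        "file_dep": [raw_file],
        "targets": [metadata_file],
    }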

I would suggest changing the folder structure slightly in the "raw_data_processing" directory: a folder "mixture_data" for this script. I will create a new folder (in a new branch), "emodul_data", where "emodul_metadata_extraction.py" and "emodul_generate_processed_data.py" will be moved.

Apart from some of the comments in the code, I think we (@joergfunger: the people who know concrete and the people who need to work with the data) need to think of a better target structure for the data. Probably a similar split as with the experimental files into "metadata - yaml file" and "processed data - csv table".

The excelfile that contains the translations, columns are "german" and "english"
"""

translation = pd.read_excel(dic)
Member

It would be better to use csv or yaml for this data.
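A minimal sketch of reading the same mapping from a plain CSV instead (assuming a hypothetical translation.csv with "german" and "english" columns, mirroring the current translation.xlsx layout):

import pandas as pd

def load_translation(path="translation.csv"):
    # Two-column CSV with headers 'german' and 'english' (assumed layout);
    # returned as a plain dict for german -> english lookups.
    translation = pd.read_csv(path)
    return dict(zip(translation["german"], translation["english"]))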

Member

I think this is to simplify reading in the raw Excel sheet (which cannot be changed) and to translate the file. However, I'm not sure why we need the translation of the original file. IMO, all the raw Excel sheets are German, and I think it would be good to have a mapping such that when printing something we could always change the printed output via a dictionary, e.g. translate[german, specimen] = Probekörper, but internally I would only use a single (English) representation. That makes it much easier to work with.
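A minimal sketch of the mapping described here (the specimen/Probekörper pair comes from the comment above; everything else is illustrative):

# Internally only the English keys are used; the dictionary is consulted
# only when printed output should appear in another language.
translate = {
    "german": {
        "specimen": "Probekörper",
        # further pairs would be added as needed
    },
}

def localized(key, language=None):
    # Fall back to the internal English key if no translation is requested/known.
    if language is None:
        return key
    return translate.get(language, {}).get(key, key)

print(localized("specimen", "german"))  # -> Probekörper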

Member

No, this specifically is for the translation file: translation.xlsx. If we want/need a translation file, I would prefer it not to be an Excel file.

Member

Yes, you are right, my mistake.

Comment on lines 81 to 111
if english == True:
exceltodf = translate(exceltodf,"translation.xlsx")
specimeninfo = {
"Date" : str(exceltodf.columns[9])[:10],
"Editor" : exceltodf.iloc[0,10],
"Requester" : exceltodf.iloc[2,3],
"Project number": exceltodf.iloc[3,3],
"Specimen" : exceltodf.iloc[4,3],
"Betonsorte u -festigkeitsklasse" : exceltodf.iloc[6,4],
"Wasserzementwert" : exceltodf.iloc[7,2],
"Konsistenz" : exceltodf.iloc[7,7],
"Sieblinie n. DIN 1045" : exceltodf.iloc[8,2],
"Körnungsziffer" : exceltodf.iloc[8,8],
"Vol Leim/Zuschlag" : exceltodf.iloc[10,10]
}
word = "Supplementary cementious materials"
else:
specimeninfo = {
"Datum" : str(exceltodf.columns[9])[:10],
"Bearbeiter" : exceltodf.iloc[0,10],
"Antragsteller" : exceltodf.iloc[2,3],
"Antrags-/Projektnummer": exceltodf.iloc[3,3],
"Bezeichnung der Proben" : exceltodf.iloc[4,3],
"Betonsorte u -festigkeitsklasse" : exceltodf.iloc[6,4],
"Wasserzementwert" : exceltodf.iloc[7,2],
"Konsistenz" : exceltodf.iloc[7,7],
"Sieblinie n. DIN 1045" : exceltodf.iloc[8,2],
"Körnungsziffer" : exceltodf.iloc[8,8],
"Vol Leim/Zuschlag" : exceltodf.iloc[10,10]
}
word = "Zusatzstoff"
Member

Why do we need this if we have a "translate" function? Can't we add these pairs there as well?

Member

See above. I agree that I would not switch the language of the metadata.yaml, but rather have a translate function that could convert the meta.yaml to a meta_ge.yaml, though I'm not sure why we would need that.

}
word = "Zusatzstoff"
specimeninfo = pd.Series(specimeninfo)
specimeninfo = specimeninfo.fillna("---") # fill nans
Member

Adding None to the fields might be better than a "---" string.

Member

Why do we want to add NaN at all? I would just leave out the corresponding entry. IMO there is no huge difference between checking whether a value exists and checking whether it is NaN or "---". By leaving it out, we also condense the information, since some information might only be given for a small subset of the data.
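A minimal sketch of leaving missing entries out entirely (assuming specimeninfo is still the plain dict built above, before the pd.Series conversion; the values below are dummy data for illustration):

import pandas as pd

specimeninfo = {"Requester": "example requester", "Konsistenz": float("nan")}  # dummy values

# Drop keys whose cells were empty in the Excel sheet instead of writing
# NaN or a "---" placeholder into the metadata.
specimeninfo = {key: value for key, value in specimeninfo.items() if pd.notna(value)}
print(specimeninfo)  # {'Requester': 'example requester'}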

Comment on lines 11 to 17
target_data = {'Cement -- Quantity in mix': 330, 'Cement -- Specific gravity': 3.123,
'Cement -- Annotation': 'CEM I 42,5 R', 'Water (total) -- Quantity in mix':
175, 'Water (total) -- Specific gravity': 1, 'Water (total) -- Annotation':
'---', 'Water (effective) -- Quantity in mix':
'---', 'Water (effective) -- Specific gravity':
'---', 'Water (effective) -- Annotation': '---', 'Air -- Quantity in mix':
'---', 'Air -- Specific gravity': '---', 'Air -- Annotation': '---',
Member

There seem to be values missing.
Where is "Cement -- Quantity in mix" or "Water (total) -- Specific gravity"?

'4,0 / 8,0 -- Specific gravity': 2.63, '4,0 / 8,0 -- Annotation': 'Okrilla ',
'8,0 / 16,0 -- Quantity in mix': 708, '8,0 / 16,0 -- Specific gravity': 2.63,
'8,0 / 16,0 -- Annotation': 'Okrilla ', 'nan -- Quantity in mix': '---',
'nan -- Specific gravity': '---', 'nan -- Annotation': '---',
Member

These empty lines should not be included in the data.

@joergfunger (Member) left a comment

Thanks for the nice work. There are a couple of suggested changes. In addition, add the extraction not only to the test but also to the doit script, such that all the mixes are extracted for the minimum working example. Also make sure that the GitHub actions do not fail, otherwise a merge is not possible.


# Some files have multiple entries of Zusatzstoffe, which are not further labeled, so this adds
# the type of Zusatzstoff
howmanyzusatzstoffe = [True if i == word else False for i in exceltodf["Stoffart"] ]
Member

Could we have the naming in the code in a consistent language?


Member

See above, my suggestion is to use a dict such as
exceltodf[translate[language]['type']], with language being set to german if the file being read in is a German raw data file.
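A minimal sketch of this suggestion (only the "Stoffart" header comes from the code above; the English header and the dummy data are illustrative):

import pandas as pd

# Column-name lookup per raw-data language; the code itself always uses
# the English keys, the dict resolves them to the actual sheet headers.
translate = {
    "german": {"type": "Stoffart"},
    "english": {"type": "Type"},  # illustrative header name
}

language = "german"  # would be set according to the raw data file being read
exceltodf = pd.DataFrame({"Stoffart": ["Zement", "Zusatzstoff", "Zusatzstoff"]})  # dummy data
word = "Zusatzstoff"

is_zusatzstoff = exceltodf[translate[language]["type"]] == word
print(is_zusatzstoff.tolist())  # [False, True, True]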

'---', 'Water (effective) -- Specific gravity':
'---', 'Water (effective) -- Annotation': '---', 'Air -- Quantity in mix':
'---', 'Air -- Specific gravity': '---', 'Air -- Annotation': '---',
'Supplementary cementious materials Kalksteinmehl -- Quantity in mix': 147,
Member

Kalksteinmehl is German, could we make the yaml file language consistent?

@@ -0,0 +1,55 @@
import pytest
from lebedigital.raw_data_processing.metadata_extraction \
Member

Adopt the suggested changes to main from @eriktamsen in #55.

@joergfunger (Member)

@soudehMasoudian What is the status of this merge request? The mixture metadata extraction is merged, but a test is missing. In addition, it would be good to reflect the structure of lebedigital also in the tests (instead of having a folder in tests that is called data extraction).

@soudehMasoudian force-pushed the unittest-mixture-metadata-extraction branch from 3abc48f to 6d17ad6 on November 22, 2022 at 22:15
@soudehMasoudian (Collaborator, Author)

@soudehMasoudian What is the status of this merge request? The mixture metadata extraction is merged, but a test is missing. In addition, it would be good to reflect the structure of lebedigital also in the tests (instead of having a folder in tests that is called data extraction).

I committed mixture_metadata_extraction and the other unit tests located inside data_extraction. The tests didn't run properly because an extra package had been added to the tests package; in that case pytest doesn't work and exceptions occur. I've changed some parts of the unit tests and added a class definition based on unittest.
Everything works now, but please give me more specific details if you want me to change the packages. Do you want the tests package changed to raw_data_processing>mixture for Test_Mixture_Metadata_Extraction, mapping>emodul_mapping for test_mapping_script, and so on?

@eriktamsen (Member)

@soudehMasoudian, I don't fully follow what the problem with pytest is/was.
I don't think mixing two testing packages is a good idea. All our other tests are run with pytest.
I will have a quick look and try to rewrite the function for pytest. Maybe I can understand why there should be a problem.
@joergfunger: I agree with the idea that the tests and the lebedigital directory should have the same structure. Fortunately, changing the location of the test scripts is a simple thing. We should maybe have a discussion at some point about whether the lebedigital directory could have a better structure, but this is not the place ;)

@eriktamsen (Member)

Actually, I noticed that we have at least one more test written with unittest. I am not sure whether these tests are currently automatically included in the CI pytest run. Maybe it is not a problem. I would still prefer to stick to one package.

@eriktamsen (Member)

I restructured the tests directory, not only for the extraction scripts. After merging, I will make sure all current test scripts in main are in the right place.
I rewrote the function for pytest, and I guess the problem was the "nan" value; however, I checked and unittest tests work fine in combination with the pytest package. We can use both packages if we want to and do not necessarily need to change the others.

@eriktamsen (Member) left a comment

Thanks for your work, looks good.

@soudehMasoudian (Collaborator, Author)

I restructured the tests directory, not only for the extraction scripts. After merging, I will make sure all current test scripts in main are in the right place. I rewrote the function for pytest, and I guess the problem was the "nan" value; however, I checked and unittest tests work fine in combination with the pytest package. We can use both packages if we want to and do not necessarily need to change the others.

If you're speaking about the lebedigital scripts, that's true. JSON doesn't know nan; it has to be written as NaN. Apart from that, the keys should be surrounded with double quotes instead of single quotes to be readable as JSON for the test. Besides that, it's much better to compare the equality of two JSON objects than of two strings, as it's more accurate and less error-prone. But it works perfectly when we define the service result as a string.
I'll change the structure of the unit test packaging once you confirm it.
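A minimal sketch of the comparison being described (the keys and values below are illustrative, not the project's actual metadata):

import json
import math

expected = {"Cement -- Quantity in mix": 330, "Körnungsziffer": float("nan")}
produced = '{"Cement -- Quantity in mix": 330, "Körnungsziffer": NaN}'

# Python's json.loads accepts the non-standard NaN token and turns it into
# float('nan'); comparing parsed dicts is less error-prone than comparing raw
# strings, but NaN != NaN, so missing values need special handling.
parsed = json.loads(produced)
assert parsed.keys() == expected.keys()
for key, value in expected.items():
    if isinstance(value, float) and math.isnan(value):
        assert math.isnan(parsed[key])
    else:
        assert parsed[key] == value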

@joergfunger (Member) left a comment

Thanks for the revision. I would stick to a single testing framework (pytest). In addition, I have some comments on the query tests.


from lebedigital.mapping.emodul_mapping import generate_knowledge_graph, get_date_time_value

class TestMappingScript(unittest.TestCase):
Member

Mapping is done for different experiments, in this case the Young's modulus test (which is why it is called emodul_mapping). Thus, I would suggest having a file in test/mapping that is called test_emodul_mapping.py. I would also suggest we stick to a single testing framework (as Erik has already pointed out) and use pytest instead of unittest.
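A minimal sketch of what the suggested tests/mapping/test_emodul_mapping.py could look like as a plain pytest function (the call signature of generate_knowledge_graph and the example metadata file are assumptions):

# tests/mapping/test_emodul_mapping.py -- pytest discovers plain test_* functions,
# so no unittest.TestCase class is needed.
from rdflib import Graph

from lebedigital.mapping.emodul_mapping import generate_knowledge_graph


def test_emodul_mapping(tmp_path):
    ttl_file = tmp_path / "emodul_test.ttl"
    # Assumed call signature (metadata file in, ttl file out); adjust to the real one.
    generate_knowledge_graph("tests/mapping/example_metadata.yaml", str(ttl_file))

    graph = Graph()
    graph.parse(str(ttl_file), format="turtle")
    assert len(graph) > 0  # the mapping produced at least some triples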

class TestMappingScript(unittest.TestCase):

    def query_and_test_result(self, typeOfQuery, predicate, prop, g):
        print(f"""Query for {typeOfQuery}:""")
Member

It would be advantageous not to use print but rather the loguru logger, such that we can define (when running the tests) the level of output we are interested in.
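A minimal sketch of the suggestion (loguru's default handler level can then be controlled when running the tests, e.g. via the LOGURU_LEVEL environment variable):

from loguru import logger

type_of_query = "diameter"  # illustrative value

# logger.debug instead of print(): the message only shows up when the
# configured log level includes DEBUG.
logger.debug(f"Query for {type_of_query}:")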

tests/mapping/test_mapping_script.py (resolved conversation)

self.print_first()

result = self.query_and_test_result('diameter', 'ns1:informationbearingentity1',
Member

This only checks whether there is a single informationbearingentity1 with the value. How do we know that this is the diameter? I think it might be relevant to start from the id of the test and then extract its diameter by traversing the complete graph. This might look redundant (and does not allow having a query_and_test method), but IMO it is advantageous (since this is how we are supposed to use the knowledge graph afterwards). This is true for all the subsequent tests.

Collaborator Author

Could you please give me more information about this issue? How should the function change to cover the new requirements? It seems a bit confusing; do you mean the query needs to change?

@joergfunger (Member) Dec 5, 2022

Yes, the query is supposed to search a complete triple store with all tests (potentially multiple tests, each having a diameter). As such, we would need to start from the id of the test and then traverse the graph to get to the diameter.

Member

@joergfunger are you sure? I thought this test is supposed to check the mapping of the extracted data to the ttl file.
The connection to the triple store is one or two steps further, right?
That is still an open issue, but not related to this test.

Member

The data structure that is generated is a graph. We are currently just checking the last leaf of the graph, without any understanding of the connections. IMO, the test should traverse the complete graph, such that the graph itself is tested, not only the final leaf. Thus, the question to be answered would be: return the diameter of a Young's modulus test (with a given id) by traversing from the test to the specimen and then to the diameter.
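A minimal sketch of this kind of traversal as a SPARQL query via rdflib; every prefix, property name, the test id and the file name below are placeholders, since the actual ontology terms used by emodul_mapping are not shown here:

from rdflib import Graph, Namespace

EX = Namespace("https://example.org/concrete#")  # placeholder namespace

g = Graph()
g.parse("emodul_test.ttl", format="turtle")  # placeholder file name

# Start from the id of the Young's modulus test and walk test -> specimen
# -> diameter, instead of matching any information bearing entity that
# happens to carry the expected value.
query = """
SELECT ?diameter WHERE {
    ?test     ex:hasID       "example-test-id" .
    ?test     ex:hasSpecimen ?specimen .
    ?specimen ex:hasDiameter ?node .
    ?node     ex:hasValue    ?diameter .
}
"""
for row in g.query(query, initNs={"ex": EX}):
    print(row.diameter)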

@eriktamsen (Member)

@joergfunger: I would suggest opening a new issue to improve the query tests and refactoring them to use pytest only.
These files were not changed (as far as I know) by @soudehMasoudian now. It only looks like it, because I have moved them to a new place.

@eriktamsen (Member)

eriktamsen commented Nov 23, 2022

I copied your (@joergfunger) comments to a new issue: #84
I hope to assign someone to the issue in the next knowledge graph meeting.
I would suggest merging this PR.

@eriktamsen merged commit 30a6be5 into main on Nov 24, 2022
@eriktamsen deleted the unittest-mixture-metadata-extraction branch on November 24, 2022 at 08:26