Unittest mixture metadata extraction #53
Conversation
This seems to be a good step, but some improvements will be needed before merging.
First of all, please make sure the tests are working within your branch. It might be a good idea to pull from main, as some things have changed; maybe this will solve the problems in Actions.
Then, please add your script to the doit workflow (a sketch follows below).
I would suggest changing the folder structure slightly in the "raw_data_processing" directory: a folder "mixture_data" for this script. I will create a new folder (in a new branch) "emodul_data" where "emodul_metadata_extraction.py" and "emodul_generate_processed_data.py" will be moved.
Apart from some of the comments in the code, I think we (@joergfunger: the people who know concrete and the people who need to work with the data) need to think of a better target structure for the data. Probably do a similar split as with the experimental files into "metadata - yaml file" and "processed data - csv table".
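To make the doit point concrete, here is a minimal sketch of what such a task could look like in dodo.py. The import path, the entry point `extract_metadata`, and all file locations are assumptions, not the actual project layout:

```python
# dodo.py -- sketch only; `extract_metadata` and the paths are hypothetical
# placeholders for the real script and data layout.
from lebedigital.raw_data_processing.mixture_data.mixture_metadata_extraction import extract_metadata

def task_extract_mixture_metadata():
    """Extract the mixture metadata yaml from the raw Excel sheet."""
    raw_file = "raw_data/mixture/mix1.xls"       # assumed input location
    target = "processed_data/mixture/mix1.yaml"  # assumed output location
    return {
        "actions": [(extract_metadata, [raw_file, target])],
        "file_dep": [raw_file],
        "targets": [target],
    }
```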
```python
        The excelfile that contains the translations, columns are "german" and "english"
    """
    translation = pd.read_excel(dic)
```
It would be better to use csv or yaml for this data.
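For illustration, a minimal sketch of loading such a translation table from YAML instead of Excel (translation.yaml and its entries are hypothetical):

```python
import yaml

# translation.yaml, a hypothetical replacement for translation.xlsx, could be
# a plain German -> English mapping, e.g.:
#   Probekörper: specimen
#   Stoffart: material type
with open("translation.yaml", encoding="utf-8") as f:
    translation = yaml.safe_load(f)  # a plain dict, no Excel reader required

print(translation["Probekörper"])  # -> "specimen"
```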
I think this is to simplify the reading in of the raw excel sheet (which cannot be changed) and to translate the file. However, I'm not sure why we need the translation of the original file. IMO, all the raw excel sheets are German, and I think it would be good to have a mapping such that when printing something, we could always change the printed output via a dictionary, e.g. translate[German, specimen] = Probekörper, but internally I would only use a single (English) representation. That makes it much easier to work with the data.
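A minimal sketch of that suggestion (the entries are illustrative, not the real translation table): the internal keys stay English, and the dictionary is only consulted at output time.

```python
# Illustrative only: internal representation is English, translation is
# applied when printing.
translate = {
    "german": {
        "specimen": "Probekörper",
        "water-cement ratio": "Wasserzementwert",
    },
}

def label(key, language="english"):
    """Return the printable label for an internal (English) key."""
    if language == "english":
        return key
    return translate[language].get(key, key)  # fall back to the English key

print(label("specimen", "german"))  # -> Probekörper
```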
No, this specifically is for the translation file: translation.xlsx. If we want/need a translation file, I would prefer it not to be an Excel file.
Yes, you are right, my mistake.
```python
if english == True:
    exceltodf = translate(exceltodf, "translation.xlsx")
    specimeninfo = {
        "Date": str(exceltodf.columns[9])[:10],
        "Editor": exceltodf.iloc[0, 10],
        "Requester": exceltodf.iloc[2, 3],
        "Project number": exceltodf.iloc[3, 3],
        "Specimen": exceltodf.iloc[4, 3],
        "Betonsorte u -festigkeitsklasse": exceltodf.iloc[6, 4],
        "Wasserzementwert": exceltodf.iloc[7, 2],
        "Konsistenz": exceltodf.iloc[7, 7],
        "Sieblinie n. DIN 1045": exceltodf.iloc[8, 2],
        "Körnungsziffer": exceltodf.iloc[8, 8],
        "Vol Leim/Zuschlag": exceltodf.iloc[10, 10],
    }
    word = "Supplementary cementious materials"
else:
    specimeninfo = {
        "Datum": str(exceltodf.columns[9])[:10],
        "Bearbeiter": exceltodf.iloc[0, 10],
        "Antragsteller": exceltodf.iloc[2, 3],
        "Antrags-/Projektnummer": exceltodf.iloc[3, 3],
        "Bezeichnung der Proben": exceltodf.iloc[4, 3],
        "Betonsorte u -festigkeitsklasse": exceltodf.iloc[6, 4],
        "Wasserzementwert": exceltodf.iloc[7, 2],
        "Konsistenz": exceltodf.iloc[7, 7],
        "Sieblinie n. DIN 1045": exceltodf.iloc[8, 2],
        "Körnungsziffer": exceltodf.iloc[8, 8],
        "Vol Leim/Zuschlag": exceltodf.iloc[10, 10],
    }
    word = "Zusatzstoff"
```
Why do we need this if we have a "translate" function? Can't we add these pairs there as well?
See above. I agree that I would not switch the language of the metadata.yaml, but rather have a translate function that could convert the meta.yaml to meta_ge.yaml, though I'm not sure why we would need that.
```python
    }
    word = "Zusatzstoff"
specimeninfo = pd.Series(specimeninfo)
specimeninfo = specimeninfo.fillna("---")  # fill nans
```
Adding None to the fields might be better than a "---" string.
Why do we want to add NaN at all? I would just leave out the corresponding entry. IMO there is no huge difference between checking if a value exists or whether it is NaN or "---". By leaving it out, we also allow the information to be condensed, since some information might only be given for a small subset of the data. See the sketch below.
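A sketch of what leaving out missing entries could look like (with a made-up specimeninfo dict):

```python
import pandas as pd

specimeninfo = {"Specimen": "BA-Losert MI", "Konsistenz": float("nan")}

# Drop missing entries instead of filling them with a "---" placeholder;
# consumers then simply check for the presence of a key.
specimeninfo = {k: v for k, v in specimeninfo.items() if pd.notna(v)}
print(specimeninfo)  # -> {'Specimen': 'BA-Losert MI'}
```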
```python
target_data = {
    'Cement -- Quantity in mix': 330,
    'Cement -- Specific gravity': 3.123,
    'Cement -- Annotation': 'CEM I 42,5 R',
    'Water (total) -- Quantity in mix': 175,
    'Water (total) -- Specific gravity': 1,
    'Water (total) -- Annotation': '---',
    'Water (effective) -- Quantity in mix': '---',
    'Water (effective) -- Specific gravity': '---',
    'Water (effective) -- Annotation': '---',
    'Air -- Quantity in mix': '---',
    'Air -- Specific gravity': '---',
    'Air -- Annotation': '---',
```
There seem to be values missing.
Where is "Cement -- Quantity in mix" or "Water (total) -- Specific gravity"?
```python
    '4,0 / 8,0 -- Specific gravity': 2.63,
    '4,0 / 8,0 -- Annotation': 'Okrilla ',
    '8,0 / 16,0 -- Quantity in mix': 708,
    '8,0 / 16,0 -- Specific gravity': 2.63,
    '8,0 / 16,0 -- Annotation': 'Okrilla ',
    'nan -- Quantity in mix': '---',
    'nan -- Specific gravity': '---',
    'nan -- Annotation': '---',
```
These empty lines should not be included in the data.
Thanks for the nice work. There are a couple of suggested changes. In addition, add the extraction not only to the test but also to the doit script, such that all the mixes are extracted for the minimum working example. Also, make sure that the GitHub Actions do not fail; otherwise a merge is not possible.
```python
# Some files have multiple entries of Zusatzstoffe, which are not further
# labeled, so this adds the type of Zusatzstoff
howmanyzusatzstoffe = [True if i == word else False for i in exceltodf["Stoffart"]]
```
Could we have the naming in the code with a consistent language?
See above; my suggestion is to use a dict such as exceltopdf[translate[language]['type']], with language being set to german if the read-in file is a German raw data file. A sketch follows below.
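Spelled out as a runnable sketch (the mapping entries and the stand-in dataframe are illustrative):

```python
import pandas as pd

# Illustrative per-language term mapping, as suggested above.
translate = {
    "german": {"type": "Stoffart", "additive": "Zusatzstoff"},
    "english": {"type": "Material type", "additive": "Supplementary cementious materials"},
}

# Stand-in for the dataframe read from a German raw Excel sheet.
exceltodf = pd.DataFrame({"Stoffart": ["Zement", "Zusatzstoff", "Zusatzstoff"]})

language = "german"  # set according to the language of the raw data file
word = translate[language]["additive"]
howmanyzusatzstoffe = exceltodf[translate[language]["type"]] == word  # boolean Series
print(howmanyzusatzstoffe.sum())  # -> 2
```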
```python
    'Water (effective) -- Specific gravity': '---',
    'Water (effective) -- Annotation': '---',
    'Air -- Quantity in mix': '---',
    'Air -- Specific gravity': '---',
    'Air -- Annotation': '---',
    'Supplementary cementious materials Kalksteinmehl -- Quantity in mix': 147,
```
Kalksteinmehl is German, could we make the yaml file language consistent?
```python
import pytest
from lebedigital.raw_data_processing.metadata_extraction \
```
adopt the suggested changes in main from @eriktamsen in #55
@soudehMasoudian What is the status of this merge request? The mixture metadata extraction is merged, but a test is missing. In addition, it would be good to reflect the structure of lebedigital also in the tests (instead of having a folder in tests that is called data extraction).
Force-pushed from 3abc48f to 6d17ad6 (commit: "…e expected output is equal to the actual output").
I committed mixture_metadata_extraction and the other unittests located inside data_extraction. The tests didn't run properly because an extra package was added to the tests package; in that case pytest doesn't work and exceptions occur. I've changed some parts of the unittests and added a class definition with unittest as the input parameter.
@soudehMasoudian, I don't fully follow what the problem with pytest is/was.
Actually, I noticed that we have at least one other test done with unittest. I am not sure if these tests are currently automatically included in the CI pytest run. Maybe it is not a problem. I would still prefer to stick to one package.
I restructured the tests directory, not only for the extraction scripts. After merging, I will make sure all current test scripts in main are in the right place.
Thanks for your work, looks good.
If you're speaking about the lebedigital scripts, that's true. JSON doesn't know nan; it should rather be camel case, like NaN. Apart from that, the keys should be surrounded with double quotation marks instead of single ones to be readable as JSON for the test. Besides that, it's much better to compare the equality of two JSON objects than of two strings, as it's more accurate and less error prone (see the sketch below). But it works perfectly when we define the service result as a string.
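A small sketch of the string-vs-object comparison point:

```python
import json

expected = '{"diameter": 98.6, "length": 300.2}'
actual = '{"length": 300.2, "diameter": 98.6}'  # same data, different key order

# String comparison fails on key order and whitespace; comparing parsed
# objects does not. Python's json.loads also accepts the non-standard
# NaN literal by default.
assert expected != actual
assert json.loads(expected) == json.loads(actual)
```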
Thanks for the revision. I would stick to a single testing framework (pytest). In addition, I have some comments on the query tests.
```python
from lebedigital.mapping.emodul_mapping import generate_knowledge_graph, get_date_time_value

class TestMappingScript(unittest.TestCase):
```
Mapping is done for different experiments, in this case the Young's modulus test (which is called emodul_mapping). Thus, I would suggest having a file in test/mapping that is called test_emodul_mapping.py. I would also suggest we stick to a single testing framework (as Erik has already pointed out) and use pytest instead of unittest; a sketch follows below.
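As a sketch, the same test as a plain pytest function in test/mapping/test_emodul_mapping.py might look like this; the signature of generate_knowledge_graph and the file names are assumptions:

```python
from lebedigital.mapping.emodul_mapping import generate_knowledge_graph

def test_emodul_mapping(tmp_path):
    # tmp_path is pytest's built-in fixture for a per-test temporary directory.
    output = tmp_path / "graph.ttl"
    generate_knowledge_graph("emodul_metadata.yaml", str(output))  # assumed signature
    assert output.exists()
```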
```python
class TestMappingScript(unittest.TestCase):

    def query_and_test_result(self, typeOfQuery, predicate, prop, g):
        print(f"""Query for {typeOfQuery}:""")
```
It would be advantageous not to use print but rather the loguru logger, such that we can define (when running the tests) the level of output we are interested in. A minimal example follows below.
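A minimal loguru example (the INFO threshold is just an illustration):

```python
import sys
from loguru import logger

# Configure once (e.g. in conftest.py); the level controls test verbosity.
logger.remove()                      # drop the default sink
logger.add(sys.stderr, level="INFO")

logger.debug("Query for diameter:")  # suppressed at INFO level
logger.info("query returned 1 row")  # shown
```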
```python
        self.print_first()

        result = self.query_and_test_result('diameter', 'ns1:informationbearingentity1',
```
This only checks if there is a single informationbearingentity1 with the value. How do we know that this is the diameter? I would think it might be relevant to start from the id of the test and then extract its diameter by traversing the complete graph. This might look redundant (and does not allow having a query_and_test method) but is IMO advantageous (since this is how we are supposed to use the knowledge graph afterwards). This is true for all the subsequent tests.
Could you please give me more info related to this issue? How should the function change in order to cover the new requirements? It seems a bit confusing; do you mean the query needs to change?
Yes, the query is supposed to search in a complete triple store with all tests (potentially multiple tests, each having a diameter). As such, we would need to start from the id of the test and then traverse the graph to get to the diameter.
@joergfunger are you sure? I thought this test is supposed to check the mapping of extracted data to the ttl file.
The connection to the triple store is one or two steps further, right?
This is still an open issue, but not related to this test.
The data structure that is generated is a graph. We are currently just checking the last leaf of the graph, without any understanding of the connections. IMO, the test should traverse the complete graph, such that the graph itself is tested, not only the final leaf. Thus, the question to be answered would be to return the diameter of a Young's modulus test (with a given id) by traversing from the test to the specimen and then to the diameter; a sketch follows below.
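A sketch of such a traversal with rdflib; the prefix, the property names, and the id are placeholders, not the actual ontology terms:

```python
import rdflib

g = rdflib.Graph()
g.parse("emodul_graph.ttl", format="turtle")  # graph produced by the mapping

# Start from the test id and walk test -> specimen -> diameter -> value,
# instead of matching a bare informationbearingentity.
query = """
PREFIX ns1: <https://example.org/emodul#>
SELECT ?value WHERE {
    ?test ns1:hasID "BA-Losert-E-Modul-1" .
    ?test ns1:hasSpecimen ?specimen .
    ?specimen ns1:hasDiameter ?diameter .
    ?diameter ns1:hasValue ?value .
}
"""
for row in g.query(query):
    print(row.value)
```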
@joergfunger: I would suggest opening a new issue to improve the query tests and refactor them to use pytest only.
I copied your (@joergfunger) comments to a new issue: #84
A unittest has been written for the mixture_metadata_extraction script.