Unittest mixture metadata extraction #53
Conversation
This seems to be a good step, but some improvements will be needed before merging.
First of all, please make sure the tests are working within your branch. It might be a good idea to pull from main, as some things have changed; maybe this will solve the problems in Actions.
Then, please add your script to the doit workflow (a sketch follows below).
I would suggest changing the folder structure slightly in the "raw_data_processing" directory: a folder "mixture_data" for this script. I will create a new folder (in a new branch) "emodul_data" where "emodul_metadata_extraction.py" and "emodul_generate_processed_data.py" will be moved.
Apart from some of the comments in the code, I think we (@joergfunger: the people who know concrete and the people who need to work with the data) need to think of a better target structure for the data. Probably do a similar split as with the experimental files into "metadata - yaml file" and "processed data - csv table".
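To make the doit point concrete, here is a minimal sketch of what such a task could look like in dodo.py. The import path, the entry point `extract_metadata`, and all file locations are assumptions, not the actual project layout:

```python
# dodo.py -- sketch only; `extract_metadata` and the paths are hypothetical
# placeholders for the real script and data layout.
from lebedigital.raw_data_processing.mixture_data.mixture_metadata_extraction import extract_metadata

def task_extract_mixture_metadata():
    """Extract the mixture metadata yaml from the raw Excel sheet."""
    raw_file = "raw_data/mixture/mix1.xls"       # assumed input location
    target = "processed_data/mixture/mix1.yaml"  # assumed output location
    return {
        "actions": [(extract_metadata, [raw_file, target])],
        "file_dep": [raw_file],
        "targets": [target],
    }
```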
```python
        The excelfile that contains the translations, columns are "german" and "english"
    """
    translation = pd.read_excel(dic)
```
It would be better to use csv or yaml for this data.
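For illustration, a minimal sketch of loading such a translation table from YAML instead of Excel (translation.yaml and its entries are hypothetical):

```python
import yaml

# translation.yaml, a hypothetical replacement for translation.xlsx, could be
# a plain German -> English mapping, e.g.:
#   Probekörper: specimen
#   Stoffart: material type
with open("translation.yaml", encoding="utf-8") as f:
    translation = yaml.safe_load(f)  # a plain dict, no Excel reader required

print(translation["Probekörper"])  # -> "specimen"
```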
I think this is to simplify the reading in of the raw excel sheet (which cannot be changed) and to translate the file. However, I'm not sure why we need the translation of the original file. IMO, all the raw excel sheets are German, and I think it would be good to have a mapping such that when printing something, we could always change the printed output via a dictionary, e.g. translate[German, specimen] = Probekörper, but internally I would only use a single (English) representation. That makes it much easier to work with the data.
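A minimal sketch of that suggestion (the entries are illustrative, not the real translation table): the internal keys stay English, and the dictionary is only consulted at output time.

```python
# Illustrative only: internal representation is English, translation is
# applied when printing.
translate = {
    "german": {
        "specimen": "Probekörper",
        "water-cement ratio": "Wasserzementwert",
    },
}

def label(key, language="english"):
    """Return the printable label for an internal (English) key."""
    if language == "english":
        return key
    return translate[language].get(key, key)  # fall back to the English key

print(label("specimen", "german"))  # -> Probekörper
```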
No, this specifically is for the translation file: translation.xlsx. If we want/need a translation file, I would prefer it not to be an Excel file.
Yes, you are right, my mistake.
```python
if english == True:
    exceltodf = translate(exceltodf, "translation.xlsx")
    specimeninfo = {
        "Date": str(exceltodf.columns[9])[:10],
        "Editor": exceltodf.iloc[0, 10],
        "Requester": exceltodf.iloc[2, 3],
        "Project number": exceltodf.iloc[3, 3],
        "Specimen": exceltodf.iloc[4, 3],
        "Betonsorte u -festigkeitsklasse": exceltodf.iloc[6, 4],
        "Wasserzementwert": exceltodf.iloc[7, 2],
        "Konsistenz": exceltodf.iloc[7, 7],
        "Sieblinie n. DIN 1045": exceltodf.iloc[8, 2],
        "Körnungsziffer": exceltodf.iloc[8, 8],
        "Vol Leim/Zuschlag": exceltodf.iloc[10, 10],
    }
    word = "Supplementary cementious materials"
else:
    specimeninfo = {
        "Datum": str(exceltodf.columns[9])[:10],
        "Bearbeiter": exceltodf.iloc[0, 10],
        "Antragsteller": exceltodf.iloc[2, 3],
        "Antrags-/Projektnummer": exceltodf.iloc[3, 3],
        "Bezeichnung der Proben": exceltodf.iloc[4, 3],
        "Betonsorte u -festigkeitsklasse": exceltodf.iloc[6, 4],
        "Wasserzementwert": exceltodf.iloc[7, 2],
        "Konsistenz": exceltodf.iloc[7, 7],
        "Sieblinie n. DIN 1045": exceltodf.iloc[8, 2],
        "Körnungsziffer": exceltodf.iloc[8, 8],
        "Vol Leim/Zuschlag": exceltodf.iloc[10, 10],
    }
    word = "Zusatzstoff"
```
Why do we need this if we have a "translate" function? Can't we add these pairs there as well?
See above. I agree that I would not switch the language of the metadata.yaml, but rather have a translate function that could convert the meta.yaml to meta_ge.yaml, though I'm not sure why we would need that.
```python
    }
    word = "Zusatzstoff"
specimeninfo = pd.Series(specimeninfo)
specimeninfo = specimeninfo.fillna("---")  # fill nans
```
Adding None to the fields might be better than a "---" string.
Why do we want to add NaN at all? I would just leave out the corresponding entry. IMO there is no huge difference between checking if a value exists or whether it is NaN or "---". By leaving it out, we also allow the information to be condensed, since some information might only be given for a small subset of the data. See the sketch below.
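A sketch of what leaving out missing entries could look like (with a made-up specimeninfo dict):

```python
import pandas as pd

specimeninfo = {"Specimen": "BA-Losert MI", "Konsistenz": float("nan")}

# Drop missing entries instead of filling them with a "---" placeholder;
# consumers then simply check for the presence of a key.
specimeninfo = {k: v for k, v in specimeninfo.items() if pd.notna(v)}
print(specimeninfo)  # -> {'Specimen': 'BA-Losert MI'}
```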
```python
target_data = {
    'Cement -- Quantity in mix': 330,
    'Cement -- Specific gravity': 3.123,
    'Cement -- Annotation': 'CEM I 42,5 R',
    'Water (total) -- Quantity in mix': 175,
    'Water (total) -- Specific gravity': 1,
    'Water (total) -- Annotation': '---',
    'Water (effective) -- Quantity in mix': '---',
    'Water (effective) -- Specific gravity': '---',
    'Water (effective) -- Annotation': '---',
    'Air -- Quantity in mix': '---',
    'Air -- Specific gravity': '---',
    'Air -- Annotation': '---',
```
There seem to be values missing.
Where is "Cement -- Quantity in mix" or "Water (total) -- Specific gravity"?
```python
    '4,0 / 8,0 -- Specific gravity': 2.63,
    '4,0 / 8,0 -- Annotation': 'Okrilla ',
    '8,0 / 16,0 -- Quantity in mix': 708,
    '8,0 / 16,0 -- Specific gravity': 2.63,
    '8,0 / 16,0 -- Annotation': 'Okrilla ',
    'nan -- Quantity in mix': '---',
    'nan -- Specific gravity': '---',
    'nan -- Annotation': '---',
```
These empty lines should not be included in the data.
Thanks for the nice work. There are a couple of suggested changes. In addition, add the extraction not only to the test but also to the doit script, such that all the mixes are extracted for the minimum working example. Also, make sure that the GitHub Actions do not fail; otherwise a merge is not possible.
```python
# Some files have multiple entries of Zusatzstoffe, which are not further
# labeled, so this adds the type of Zusatzstoff
howmanyzusatzstoffe = [True if i == word else False for i in exceltodf["Stoffart"]]
```
Could we have the naming in the code with a consistent language?
See above; my suggestion is to use a dict such as exceltopdf[translate[language]['type']], with language being set to german if the read-in file is a German raw data file. A sketch follows below.
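Spelled out as a runnable sketch (the mapping entries and the stand-in dataframe are illustrative):

```python
import pandas as pd

# Illustrative per-language term mapping, as suggested above.
translate = {
    "german": {"type": "Stoffart", "additive": "Zusatzstoff"},
    "english": {"type": "Material type", "additive": "Supplementary cementious materials"},
}

# Stand-in for the dataframe read from a German raw Excel sheet.
exceltodf = pd.DataFrame({"Stoffart": ["Zement", "Zusatzstoff", "Zusatzstoff"]})

language = "german"  # set according to the language of the raw data file
word = translate[language]["additive"]
howmanyzusatzstoffe = exceltodf[translate[language]["type"]] == word  # boolean Series
print(howmanyzusatzstoffe.sum())  # -> 2
```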
```python
    'Water (effective) -- Specific gravity': '---',
    'Water (effective) -- Annotation': '---',
    'Air -- Quantity in mix': '---',
    'Air -- Specific gravity': '---',
    'Air -- Annotation': '---',
    'Supplementary cementious materials Kalksteinmehl -- Quantity in mix': 147,
```
Kalksteinmehl is German, could we make the yaml file language consistent?
```python
import pytest
from lebedigital.raw_data_processing.metadata_extraction \
```
adopt the suggested changes in main from @eriktamsen in #55
@soudehMasoudian What is the status of this merge request? The mixture metadata extraction is merged, but a test is missing. In addition, it would be good to reflect the structure of lebedigital also in the tests (instead of having a folder in tests that is called data extraction).
Force-pushed from 3abc48f to 6d17ad6 (commit: "…e expected output is equal to the actual output").
I committed mixture_metadata_extraction and the other unittests located inside data_extraction. The tests didn't run properly because an extra package was added to the tests package; in that case pytest doesn't work and exceptions occur. I've changed some parts of the unittests and added a class definition with unittest as the input parameter.
@soudehMasoudian, I don't fully follow what the problem with pytest is/was.
Actually, I noticed that we have at least one other test done with unittest. I am not sure if these tests are currently automatically included in the CI pytest run. Maybe it is not a problem. I would still prefer to stick to one package.
I restructured the tests directory, not only for the extraction scripts. After merging, I will make sure all current test scripts in main are in the right place.
Thanks for your work, looks good.
If you're speaking about the lebedigital scripts, that's true. JSON doesn't know nan; it should rather be camel case, like NaN. Apart from that, the keys should be surrounded with double quotation marks instead of single ones to be readable as JSON for the test. Besides that, it's much better to compare the equality of two JSON objects than of two strings, as it's more accurate and less error prone (see the sketch below). But it works perfectly when we define the service result as a string.
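A small sketch of the string-vs-object comparison point:

```python
import json

expected = '{"diameter": 98.6, "length": 300.2}'
actual = '{"length": 300.2, "diameter": 98.6}'  # same data, different key order

# String comparison fails on key order and whitespace; comparing parsed
# objects does not. Python's json.loads also accepts the non-standard
# NaN literal by default.
assert expected != actual
assert json.loads(expected) == json.loads(actual)
```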
Thanks for the revision. I would stick to a single testing framework (pytest). In addition, I have some comments on the query tests.
```python
from lebedigital.mapping.emodul_mapping import generate_knowledge_graph, get_date_time_value

class TestMappingScript(unittest.TestCase):
```
Mapping is done for different experiments, in this case the Young's modulus test (which is called emodul_mapping). Thus, I would suggest having a file in test/mapping that is called test_emodul_mapping.py. I would also suggest we stick to a single testing framework (as Erik has already pointed out) and use pytest instead of unittest; a sketch follows below.
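As a sketch, the same test as a plain pytest function in test/mapping/test_emodul_mapping.py might look like this; the signature of generate_knowledge_graph and the file names are assumptions:

```python
from lebedigital.mapping.emodul_mapping import generate_knowledge_graph

def test_emodul_mapping(tmp_path):
    # tmp_path is pytest's built-in fixture for a per-test temporary directory.
    output = tmp_path / "graph.ttl"
    generate_knowledge_graph("emodul_metadata.yaml", str(output))  # assumed signature
    assert output.exists()
```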
```python
class TestMappingScript(unittest.TestCase):

    def query_and_test_result(self, typeOfQuery, predicate, prop, g):
        print(f"""Query for {typeOfQuery}:""")
```
It would be advantageous not to use print but rather the loguru logger, such that we can define (when running the tests) the level of output we are interested in. A minimal example follows below.
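A minimal loguru example (the INFO threshold is just an illustration):

```python
import sys
from loguru import logger

# Configure once (e.g. in conftest.py); the level controls test verbosity.
logger.remove()                      # drop the default sink
logger.add(sys.stderr, level="INFO")

logger.debug("Query for diameter:")  # suppressed at INFO level
logger.info("query returned 1 row")  # shown
```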
```python
        self.print_first()

        result = self.query_and_test_result('diameter', 'ns1:informationbearingentity1',
```
This only checks if there is a single informationbearingentity1 with the value. How do we know that this is the diameter? I would think it might be relevant to start from the id of the test and then extract its diameter by traversing the complete graph. This might look redundant (and does not allow having a query_and_test method) but is IMO advantageous (since this is how we are supposed to use the knowledge graph afterwards). This is true for all the subsequent tests.
Could you please give me more info related to this issue? How should the function change in order to cover the new requirements? It seems a bit confusing; do you mean the query needs to change?
Yes, the query is supposed to search in a complete triple store with all tests (potentially multiple tests, each having a diameter). As such, we would need to start from the id of the test and then traverse the graph to get to the diameter.
@joergfunger are you sure? I thought this test is supposed to check the mapping of extracted data to the ttl file.
The connection to the triple store is one or two steps further, right?
This is still an open issue, but not related to this test.
The data structure that is generated is a graph. We are currently just checking the last leaf of the graph, without any understanding of the connections. IMO, the test should traverse the complete graph, such that the graph itself is tested, not only the final leaf. Thus, the question to be answered would be to return the diameter of a Young's modulus test (with a given id) by traversing from the test to the specimen and then to the diameter; a sketch follows below.
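A sketch of such a traversal with rdflib; the prefix, the property names, and the id are placeholders, not the actual ontology terms:

```python
import rdflib

g = rdflib.Graph()
g.parse("emodul_graph.ttl", format="turtle")  # graph produced by the mapping

# Start from the test id and walk test -> specimen -> diameter -> value,
# instead of matching a bare informationbearingentity.
query = """
PREFIX ns1: <https://example.org/emodul#>
SELECT ?value WHERE {
    ?test ns1:hasID "BA-Losert-E-Modul-1" .
    ?test ns1:hasSpecimen ?specimen .
    ?specimen ns1:hasDiameter ?diameter .
    ?diameter ns1:hasValue ?value .
}
"""
for row in g.query(query):
    print(row.value)
```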
@joergfunger: I would suggest opening a new issue to improve the query tests and refactor them to use pytest only.
I copied your (@joergfunger) comments to a new issue: #84
A unittest has been written for the mixture_metadata_extraction script.