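Problem
Currently we ask users to map the table titles to our fields.
Then, to determine which column corresponds to each title, we fuzzy match the given title against the values in each column.
Unfortunately, fuzzy matching is not reliable enough in our case because the title values are often badly broken up: strict matching fails, and matching that is loose enough to succeed yields a lot of mismatches.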
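For example, the title "Financial headcount(**) as at 31 December" is split across multiple lines, and one of the parsed values is "headcount(**) as at 31 December 2014". However, the title will not be matched to that column, because another column contains the value "1", which scores higher: partial_ratio compares the shorter string against the best-matching substring of the longer one, and "1" occurs in "31".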
In [1]: from fuzzywuzzy import fuzz
In [2]: fuzz.partial_ratio('Financial headcount(**) as at 31 December', 'headcount(**) as at 31 December 2014')
Out[2]: 91
In [3]: fuzz.partial_ratio('Financial headcount(**) as at 31 December', '1')
Out[3]: 100
If we try to replace partial_ratio with the stricter ratio, we end up missing many matches because the headers are parsed into multiple values on different lines.
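To make the trade-off concrete, here is a minimal sketch using only the standard library. It does not call fuzzywuzzy: `strict_ratio` and `partial_ratio_sketch` are hypothetical stand-ins built on `difflib.SequenceMatcher`, and the sliding-window partial match is a simplification, so the scores will differ somewhat from fuzzywuzzy's. The qualitative behaviour should be the same.

```python
from difflib import SequenceMatcher


def strict_ratio(a, b):
    """Whole-string similarity on a 0-100 scale (stand-in for a strict ratio)."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    """Best score of the shorter string against every same-length window
    of the longer string (simplified stand-in for a partial ratio)."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    n = len(shorter)
    return max(
        strict_ratio(shorter, longer[i:i + n])
        for i in range(len(longer) - n + 1)
    )


title = 'Financial headcount(**) as at 31 December'

# A one-character cell matches perfectly as soon as it occurs anywhere in the title.
print(partial_ratio_sketch(title, '1'))
# The real header fragment contains "2014", which is not in the title, so it scores lower.
print(partial_ratio_sketch(title, 'headcount(**) as at 31 December 2014'))
# Strict matching rejects the stray "1", but also scores a single header line poorly.
print(strict_ratio(title, '1'))
print(strict_ratio(title, 'headcount(**)'))
```

The loose comparison rewards any short cell that happens to occur inside the title, while the strict one penalises each header line for being only a fragment of the title.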
Possible solution
We should determine on which row the titles end and the values start so that we can concatenate the title values from the same column into a single string and match strictly against that.
E.g.:
,,,,Current,,,headcount(**)
,,Public,,tax,Defer-,Corporate,as at
,,subsidies,Income,ex-,red,income,31 December
,Revenues,received,before tax,pense,taxes,tax,2014
European Union member,,,,,,,
States,,,,,,,
Germany,"1,114",0,297,(91),(19),(110),"4,163"
Austria,14,0,5,(1),0,(1),170
Belgium,"4,514",0,"1,393",(31),(441),(472),"16,383"
Bulgaria,54,0,17,(1),0,(1),"1,055"
First 5 rows are headers, merge the values from the same column:
column 2: Revenues
column 3: Public subsidies received
column 4: Income before tax
etc.
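The idea above could be sketched roughly as follows. Everything here is an assumption rather than existing project code: the helper names (`first_value_row`, `merge_titles`, `best_column`) are hypothetical, the title/value boundary is guessed as the first row in which most cells after the first look numeric, words hyphenated across lines are re-joined by dropping the trailing hyphen, and `difflib.SequenceMatcher` stands in for the strict ratio.

```python
import csv
import io
import re
from difflib import SequenceMatcher

# Shortened copy of the sample table from this issue.
SAMPLE = '''\
,,,,Current,,,headcount(**)
,,Public,,tax,Defer-,Corporate,as at
,,subsidies,Income,ex-,red,income,31 December
,Revenues,received,before tax,pense,taxes,tax,2014
European Union member,,,,,,,
States,,,,,,,
Germany,"1,114",0,297,(91),(19),(110),"4,163"
Austria,14,0,5,(1),0,(1),170
'''

# Matches cells such as 0, 297, "1,114" and (91).
NUMERIC = re.compile(r'^\(?-?\d[\d,.]*\)?$')


def first_value_row(rows):
    """Index of the first row where most cells after the first look numeric.

    Assumed heuristic: a lone numeric title cell such as 2014 is not enough
    to make a row count as a value row.
    """
    for i, row in enumerate(rows):
        cells = row[1:]
        numeric = sum(1 for c in cells if NUMERIC.match(c.strip()))
        if cells and numeric > len(cells) / 2:
            return i
    return len(rows)


def merge_titles(rows, n_header):
    """Concatenate the non-empty cells of each column over the title rows."""
    width = max((len(r) for r in rows[:n_header]), default=0)
    titles = []
    for col in range(width):
        merged = ''
        for row in rows[:n_header]:
            part = row[col].strip() if col < len(row) else ''
            if not part:
                continue
            if merged.endswith('-'):
                # Re-join a word hyphenated across lines ("ex-" + "pense").
                merged = merged[:-1] + part
            elif merged:
                merged += ' ' + part
            else:
                merged = part
        titles.append(merged)
    return titles


def best_column(title, titles):
    """0-based index of the merged title most similar to the given title."""
    scores = [
        SequenceMatcher(None, title.lower(), t.lower()).ratio()
        for t in titles
    ]
    return max(range(len(titles)), key=scores.__getitem__)


rows = list(csv.reader(io.StringIO(SAMPLE)))
n_header = first_value_row(rows)
titles = merge_titles(rows, n_header)
print(titles)
print(best_column('Financial headcount(**) as at 31 December', titles))
```

On the sample as shown, this treats every row before Germany as a title row, so the first column picks up the "European Union member States" label as well. Columns 2 to 4 come out as Revenues, Public subsidies received and Income before tax, matching the listing above. Dropping every trailing hyphen would also mangle titles that legitimately end a line with a hyphen, so that step may need refining.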