Parse multi-line titles #31

georgiana-b · 2017-08-23T12:29:40Z

Problem

Currently we ask users for a mapping of the table titles to our fields.
Then, to determine which column corresponds to each title, we fuzzy match the given title against the values in that column.

Unfortunately, fuzzy matching is not reliable enough in our case: the title values are often badly broken, so strict matching fails, while matching loose enough to catch them yields a lot of mismatches.

For example, the title Financial headcount(**) as at 31 December is split across multiple lines, one of which has the value headcount(**) as at 31 December 2014. However, the title will not be matched to that column, because another column contains the value 1, which scores higher:

In [1]: from fuzzywuzzy import fuzz

In [2]: fuzz.partial_ratio('Financial headcount(**) as at 31 December', 'headcount(**) as at 31  December 2014')
Out[2]: 91

In [3]: fuzz.partial_ratio('Financial headcount(**) as at 31 December', '1')
Out[3]: 100

If we try to replace partial_ratio with the stricter ratio, we end up missing many matches because the headers are parsed into multiple values on different lines.
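The perfect score for '1' comes from how partial matching works: the shorter string is scored against the best-matching substring of the longer one, and '1' occurs inside '31'. Below is a minimal sketch of that idea using only the standard library. `simple_partial_ratio` is a hypothetical helper written for illustration, not fuzzywuzzy's implementation, and its scores will not always agree with `fuzz.partial_ratio`.

```python
from difflib import SequenceMatcher


def simple_partial_ratio(a, b):
    """Score the shorter string against every same-length window of the longer one.

    Simplified illustration only; this is NOT fuzzywuzzy's actual algorithm.
    """
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


def simple_ratio(a, b):
    """Strict whole-string similarity on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


title = 'Financial headcount(**) as at 31 December'
partial_score = simple_partial_ratio(title, '1')  # '1' is a substring of '31'
strict_score = simple_ratio(title, '1')           # whole strings barely overlap
```

Any single character that appears anywhere in the title gets a perfect partial score, which is why short cell values such as 1 win over the real header fragment.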

Possible solution

We should determine on which row the titles end and the values start so that we can concatenate the title values from the same column into a single string and match strictly against that.

E.g.:

,,,,Current,,,headcount(**)
,,Public,,tax,Defer-,Corporate,as at
,,subsidies,Income,ex-,red,income,31 December
,Revenues,received,before tax,pense,taxes,tax,2014
European Union member,,,,,,,
States,,,,,,,
Germany,"1,114",0,297,(91),(19),(110),"4,163"
Austria,14,0,5,(1),0,(1),170
Belgium,"4,514",0,"1,393",(31),(441),(472),"16,383"
Bulgaria,54,0,17,(1),0,(1),"1,055"

The first 5 rows are headers; merge the values from the same column:

column 2: Revenues
column 3: Public subsidies received
column 4: Income before tax
etc.
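The merge step could be sketched roughly as follows, assuming the number of header rows is already known (finding that boundary is the open question here). `merge_header_rows` and `n_header_rows` are hypothetical names, and gluing a fragment that ends in a hyphen onto the next one is an assumption that happens to fit this sample (ex- / pense, Defer- / red).

```python
import csv
import io

# Header rows and the first few data rows from the sample table above.
SAMPLE = '''\
,,,,Current,,,headcount(**)
,,Public,,tax,Defer-,Corporate,as at
,,subsidies,Income,ex-,red,income,31 December
,Revenues,received,before tax,pense,taxes,tax,2014
European Union member,,,,,,,
States,,,,,,,
Germany,"1,114",0,297,(91),(19),(110),"4,163"
'''


def merge_header_rows(rows, n_header_rows):
    """Concatenate the header cells of each column into a single title."""
    header = rows[:n_header_rows]
    width = max(len(row) for row in header)
    titles = []
    for col in range(width):
        title = ''
        for row in header:
            cell = row[col].strip() if col < len(row) else ''
            if not cell:
                continue
            if title.endswith('-'):
                # Assumption: a trailing hyphen marks a word broken across lines.
                title = title[:-1] + cell
            elif title:
                title += ' ' + cell
            else:
                title = cell
        titles.append(title)
    return titles


rows = list(csv.reader(io.StringIO(SAMPLE)))
titles = merge_header_rows(rows, n_header_rows=5)
```

The merged titles can then be compared with the stricter ratio instead of partial_ratio. The hyphen rule would also glue together genuinely hyphenated words, so it may need refining on other tables.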

pwalsh · 2017-08-23T15:54:07Z

Excellent analysis, @georgiana-b!

georgiana-b added the bug label Aug 23, 2017

georgiana-b mentioned this issue Aug 23, 2017: Extract title from whole table #30 (Merged)
