Parse multi-line titles #31

Open
georgiana-b opened this issue Aug 23, 2017 · 1 comment

@georgiana-b
Contributor

Problem

Currently we ask users for a mapping from the table titles to our fields.
Then, to determine which column corresponds to each title, we fuzzy match the given title against the values in that column.

Unfortunately, fuzzy matching is not reliable enough in our case. The title values are often badly broken up, so strict matching fails, while matching that is loose enough to catch them yields a lot of mismatches.

For example, the title Financial headcount(**) as at 31 December is split across multiple lines, one of which has the value headcount(**) as at 31 December 2014. However, it will not be matched to that column, because another column contains the value 1, which scores higher:

In [1]: from fuzzywuzzy import fuzz

In [2]: fuzz.partial_ratio('Financial headcount(**) as at 31 December', 'headcount(**) as at 31  December 2014')
Out[2]: 91

In [3]: fuzz.partial_ratio('Financial headcount(**) as at 31 December', '1')
Out[3]: 100

If we replace partial_ratio with the stricter ratio, we end up missing many matches, because the headers are parsed into multiple values on different lines.
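
To make the trade-off concrete, here is a rough sketch using only the standard library. `naive_partial_ratio` is a hypothetical sliding-window approximation of partial matching and `strict_ratio` a whole-string comparison; neither is fuzzywuzzy's actual implementation, so the scores will not match fuzzywuzzy's exactly.

```python
from difflib import SequenceMatcher


def strict_ratio(a, b):
    """Similarity of the two whole strings, scaled to 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def naive_partial_ratio(a, b):
    """Best similarity between the shorter string and any same-length
    window of the longer one, scaled to 0-100 (simplified illustration)."""
    shorter, longer = sorted((a, b), key=len)
    if not shorter:
        return 0
    best = 0.0
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


title = 'Financial headcount(**) as at 31 December'

# A one-character cell is a perfect partial match as soon as that
# character occurs anywhere in the title ("31" contains "1").
print(naive_partial_ratio(title, '1'))  # 100

# Whole-string comparison rejects '1', but it also scores a genuine
# header fragment poorly, because the fragment is only part of the title.
print(strict_ratio(title, '1'))
print(strict_ratio(title, 'headcount(**)'))
```

Any partial scorer of this kind gives very short cell values a perfect score, which is why loosening the match produces false positives while tightening it drops the real, fragmented headers.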

Possible solution

We should determine the row at which the titles end and the values start, so that we can concatenate the title values in each column into a single string and match strictly against that.

E.g.:

,,,,Current,,,headcount(**)
,,Public,,tax,Defer-,Corporate,as at
,,subsidies,Income,ex-,red,income,31 December
,Revenues,received,before tax,pense,taxes,tax,2014
European Union member,,,,,,,
States,,,,,,,
Germany,"1,114",0,297,(91),(19),(110),"4,163"
Austria,14,0,5,(1),0,(1),170
Belgium,"4,514",0,"1,393",(31),(441),(472),"16,383"
Bulgaria,54,0,17,(1),0,(1),"1,055"

The first 5 rows are headers; merge the values from the same column:

  • column 2: Revenues
  • column 3: Public subsidies received
  • column 4: Income before tax
    etc.
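
A minimal sketch of this idea, under stated assumptions. The boundary heuristic (the first row in which at least half of the non-empty cells look numeric), the `NUMERIC` pattern, and the function names `first_data_row` and `merge_header_rows` are all hypothetical and not part of the existing code. Fragments ending in a hyphen are assumed to be words broken across lines and are joined without a space.

```python
import csv
import io
import re

SAMPLE = '''\
,,,,Current,,,headcount(**)
,,Public,,tax,Defer-,Corporate,as at
,,subsidies,Income,ex-,red,income,31 December
,Revenues,received,before tax,pense,taxes,tax,2014
European Union member,,,,,,,
States,,,,,,,
Germany,"1,114",0,297,(91),(19),(110),"4,163"
Austria,14,0,5,(1),0,(1),170
'''

# Assumption: values look like 297, 1,114 or (91); anything else is text.
NUMERIC = re.compile(r'^\(?-?\d[\d,.]*\)?$')


def first_data_row(rows, threshold=0.5):
    """Index of the first row where at least `threshold` of the
    non-empty cells look numeric; len(rows) if there is none."""
    for i, row in enumerate(rows):
        cells = [c.strip() for c in row if c.strip()]
        if not cells:
            continue
        numeric = sum(bool(NUMERIC.match(c)) for c in cells)
        if numeric / len(cells) >= threshold:
            return i
    return len(rows)


def merge_header_rows(rows, n_header_rows):
    """Concatenate the header fragments of each column, top to bottom."""
    header = rows[:n_header_rows]
    width = max((len(r) for r in header), default=0)
    titles = []
    for col in range(width):
        title = ''
        for row in header:
            part = row[col].strip() if col < len(row) else ''
            if not part:
                continue
            if title.endswith('-'):
                # Assumed line-break hyphenation: "ex-" + "pense".
                title = title[:-1] + part
            elif title:
                title += ' ' + part
            else:
                title = part
        titles.append(title)
    return titles


rows = list(csv.reader(io.StringIO(SAMPLE)))
boundary = first_data_row(rows)
titles = merge_header_rows(rows, boundary)
for number, title in enumerate(titles, start=1):
    print(f'column {number}: {title}')
```

On this sample the heuristic places the boundary at the Germany row, so the two region-label rows are also swept into the title of column 1. The other columns are unaffected. The merged strings can then be compared to the user-supplied titles with a strict whole-string ratio instead of a partial one.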
@pwalsh
Member

pwalsh commented Aug 23, 2017

Excellent analysis @georgiana-b !
