return unknown/leftover tokens #36

trescube · 2017-06-30T18:38:29Z

It's becoming increasingly apparent that the Pelias API will have to call both libpostal and placeholder to come up with the correct answer. For example, libpostal parsed "Fort Hood, TX" as:

> Fort Hood, TX

Result:

{
  "road": "fort",
  "city": "hood",
  "state": "tx"
}

There's no reasonable non-hacky way to correct for this, so the idea is to call both placeholder and libpostal for each input, then work out an answer from the two responses.

For example, for the input "30 W 26th St, New York, NY", placeholder throws away the "30 W 26th St". For the above strategy to work, the API would need to know which tokens are unknown. In the case of "Fort Hood, TX", if the API knew that placeholder had no leftover tokens, then it could be reasonably assured that the input was only admin data and could disregard the libpostal response (which is incorrect in this case).
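The decision described above might be sketched roughly as follows. This is only an illustration, not Pelias API code: the function name `choose_parse` and the shape of its arguments (a list of leftover tokens from placeholder, a dict of labels from libpostal) are assumptions made for the example.

```python
from typing import Dict, List, Optional


def choose_parse(
    placeholder_leftovers: List[str],
    libpostal_parse: Dict[str, str],
) -> Optional[Dict[str, str]]:
    """Decide whether the libpostal parse should be consulted at all.

    Hypothetical helper: returns None when placeholder matched every
    token (input treated as admin-only), otherwise returns the
    libpostal parse so the caller can use it for the leftover part.
    """
    if not placeholder_leftovers:
        # placeholder consumed the whole input, e.g. "Fort Hood, TX":
        # disregard libpostal, whose parse may be wrong for such inputs.
        return None
    # placeholder left something unmatched, e.g. "30 W 26th St":
    # the libpostal parse is still needed.
    return libpostal_parse
```

With the "Fort Hood, TX" parse shown earlier and an empty leftover list, this returns `None`, so the API would fall back to the placeholder result alone.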

missinglink · 2017-07-03T12:08:24Z

the tokenize endpoint returns the token 'groups', so it would be a simple matter of working from left to right through this array to find the tokens which didn't match:

http://parser.wiz.co.nz/parser/tokenize?text=Example+Street+Neutral+Bay+North+Sydney+New+South+Wales+9999+AU

[
  [
    "street",
    "neutral bay",
    "north sydney",
    "new south wales",
    "au"
  ]
]

it might actually be better to do this on the placeholder end, as these tokens have been normalized and so may not match the input tokens verbatim.

would you like the tokens returned as normalized values or verbatim as they were input by the user? what about punctuation such as commas, periods etc?

note: yes there is actually a place called Street
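A minimal sketch of the left-to-right walk, assuming the caller has the original text and one group array from the tokenize response. The `normalize` function here is a stand-in (lowercase, ASCII alphanumerics only) and is not placeholder's actual normalization, which is exactly why doing this on the placeholder side may be more reliable. The names `normalize` and `leftover_tokens` are hypothetical.

```python
import re
from typing import List


def normalize(text: str) -> List[str]:
    # Assumed normalization: lowercase and split on anything that is
    # not an ASCII letter or digit (so commas and periods are dropped).
    return re.findall(r"[a-z0-9]+", text.lower())


def leftover_tokens(text: str, groups: List[str]) -> List[str]:
    """Return input tokens that were not consumed by any matched group."""
    # Flatten the groups into single words, preserving their order.
    expected = [word for group in groups for word in group.split()]
    leftovers = []
    position = 0  # index of the next expected word
    for token in normalize(text):
        if position < len(expected) and token == expected[position]:
            position += 1  # token belongs to a matched group
        else:
            leftovers.append(token)  # token was not matched
    return leftovers


groups = ["street", "neutral bay", "north sydney", "new south wales", "au"]
text = "Example Street Neutral Bay North Sydney New South Wales 9999 AU"
print(leftover_tokens(text, groups))  # ['example', '9999']
```

The leftovers come back in normalized form; returning them verbatim would need the original token offsets to be tracked as well.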

trescube · 2017-07-04T01:18:36Z

At this point, I'm not terribly concerned with the format of the unknown tokens, just whether there were any. An array of them would be nice in case the API wants to know what they are, but for the initial steps, the API would just make decisions based on whether or not there were any.

missinglink linked a pull request Jul 5, 2017 that will close this issue

issue 36: first pass at returning unparsed prefix #37

Open

missinglink self-assigned this Jul 5, 2017

missinglink added the in review label Jul 5, 2017
