return unknown/leftover tokens #36

Open
trescube opened this issue Jun 30, 2017 · 2 comments · May be fixed by #37

@trescube
Contributor

It's becoming increasingly apparent that the Pelias API will have to call both libpostal and placeholder to come up with the correct answer. For example, libpostal parsed "Fort Hood, TX" as:

> Fort Hood, TX

Result:

{
  "road": "fort",
  "city": "hood",
  "state": "tx"
}

There's no reasonable, non-hacky way to correct for this, so the idea is to call both placeholder and libpostal for each input, then figure out an answer from both responses.

For example, for the input 30 W 26th St, New York, NY, placeholder throws away the 30 W 26th St. For the above strategy to work, the API would need to know which tokens are unknown. In the case of Fort Hood, TX, if the API knew that placeholder had no leftover tokens, then it could be reasonably assured that the input was only admin data and could disregard the libpostal result (which is incorrect in this case).
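
A rough sketch of the decision described above, in Python for brevity. The function name `choose_strategy` and the strategy labels are hypothetical, and it assumes placeholder's leftover tokens arrive as a plain list of strings, which is not something either service provides today:

```python
def choose_strategy(leftover_tokens):
    """Decide how to combine the two parsers for one input.

    leftover_tokens: tokens placeholder could not match
    (hypothetical field, assumed to be a list of strings).
    """
    if not leftover_tokens:
        # Everything matched as admin data, so the libpostal parse
        # can be disregarded (the "Fort Hood, TX" case).
        return "placeholder_only"
    # Some tokens were unknown to placeholder (e.g. "30 W 26th St"),
    # so the libpostal parse is needed as well.
    return "placeholder_and_libpostal"
```

With no leftovers, as assumed for Fort Hood, TX, this returns `"placeholder_only"`. With leftovers such as `["30", "w", "26th", "st"]` it returns `"placeholder_and_libpostal"`.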

@missinglink
Member

missinglink commented Jul 3, 2017

the tokenize endpoint returns the token 'groups', it would be a simple matter of working from left-to-right through this array to find the tokens which didn't match:

http://parser.wiz.co.nz/parser/tokenize?text=Example+Street+Neutral+Bay+North+Sydney+New+South+Wales+9999+AU

[
  [
    "street",
    "neutral bay",
    "north sydney",
    "new south wales",
    "au"
  ]
]

it might actually be better to do this on the placeholder end as these tokens have been normalized and so may not match verbatim the input tokens.

would you like the tokens returned as normalized values or verbatim as they were input by the user? what about punctuation such as commas, periods etc?

note: yes there is actually a place called Street
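
The left-to-right walk could look roughly like the following Python sketch. The helper names `normalize` and `leftover_tokens` are made up for illustration, and normalization is assumed to be nothing more than lowercasing and dropping punctuation. Placeholder's real normalization may differ, which is exactly the verbatim-matching caveat raised above:

```python
import re


def normalize(text):
    """Lowercase the text and split it into word tokens, dropping punctuation."""
    return re.findall(r"[^\W_]+", text.lower())


def leftover_tokens(text, group):
    """Return the input tokens that no phrase in `group` accounts for.

    `group` is one token group as returned by the tokenize endpoint,
    e.g. ["street", "neutral bay", ...]. Phrases are consumed in order.
    """
    tokens = normalize(text)
    leftovers = []
    pos = 0
    for phrase in group:
        words = phrase.split()
        # Find the next position at or after `pos` where the phrase occurs.
        match = next(
            (i for i in range(pos, len(tokens) - len(words) + 1)
             if tokens[i:i + len(words)] == words),
            None,
        )
        if match is None:
            continue  # phrase not found verbatim; skip it
        leftovers.extend(tokens[pos:match])  # tokens skipped over were unmatched
        pos = match + len(words)
    leftovers.extend(tokens[pos:])  # anything after the last match
    return leftovers
```

For the example request above, this reports `example` and `9999` as the leftover tokens, in normalized (lowercased) form.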

@trescube
Contributor Author

trescube commented Jul 4, 2017

At this point, I'm not terribly concerned with the format of the unknown tokens, just whether there were any. An array of them would be nice in case the API would want to know what they are, but for the initial steps, the API would just make decisions on the condition that there were or weren't.

@missinglink missinglink linked a pull request Jul 5, 2017 that will close this issue
@missinglink missinglink self-assigned this Jul 5, 2017