Melbourne and Bristol coming up as US only... #16

Open
rdlou opened this issue Oct 18, 2018 · 6 comments

Comments


rdlou commented Oct 18, 2018

Hi, I am running single cities through the country_mentions property and both of them come back with only "OrderedDict([('US', 1)])"

from geotext import GeoText

cities = ['Melbourne', 'Bristol']

for city in cities:
    # country_mentions maps country codes to mention counts
    country_dict = GeoText(city.title()).country_mentions
    print(country_dict)

I understand that these are places in the US, but obviously Melbourne is pretty significant in Australia, as is Bristol in the UK. Shouldn't the dict come back with several country mentions?

Thanks!


rdlou commented Oct 31, 2018

Paris comes up as United States, Sydney comes up as Canada....


iwpnd commented Nov 2, 2018

Think of geotext as a general framework for extracting named entities (a low-level approach) that are then looked up in an example table of cities. If you want to be able to distinguish between cities in the US, Canada or Australia, you could always provide the proper logic in separate lookup tables of your own.


rdlou commented Nov 2, 2018

Thanks @iwpnd. I've ended up doing that using geocache, so it comes back with a list that has city, country and confidence score.

So if you said "I live in London" it would come back with:

[{"city":"London","country":"United Kingdom","confidence": 50},{"city":"London","country":"Canada","confidence": 25}]

London, UK gets a higher score because it has a higher population, that sort of thing. If "Ontario" or "Canada" were in the sentence, then London, Canada would get the better score. Might upload the code.
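For anyone who wants to try the idea, here is a minimal sketch of that kind of scoring. It is not the actual code: the `CANDIDATES` table, the `score_city` name, the rough population figures and the context multiplier are all assumptions for illustration, and the confidence values are normalised to sum to 100, so they will not match the numbers in the example above.

```python
from typing import Dict, List

# Hypothetical candidate table: (city, country, region, population).
# Populations are rough illustrative values, not authoritative data.
CANDIDATES = [
    ("London", "United Kingdom", "England", 8_900_000),
    ("London", "Canada", "Ontario", 400_000),
]


def score_city(city: str, sentence: str) -> List[Dict[str, object]]:
    """Rank every known place called `city`, most likely first."""
    matches = [c for c in CANDIDATES if c[0] == city]
    if not matches:
        return []

    text = sentence.lower()
    weights = []
    for _, country, region, population in matches:
        weight = float(population)
        # Assumed heuristic: a country or region name in the sentence
        # outweighs any population difference.
        if country.lower() in text or region.lower() in text:
            weight *= 1000
        weights.append(weight)

    total = sum(weights)
    results = [
        {"city": name, "country": country, "confidence": round(100 * w / total)}
        for (name, country, _, _), w in zip(matches, weights)
    ]
    return sorted(results, key=lambda r: r["confidence"], reverse=True)
```

With this sketch, `score_city("London", "I live in London")` ranks the United Kingdom first, while `score_city("London", "I live in London, Ontario")` ranks Canada first.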

Thanks for your response, appreciate it.


iwpnd commented Nov 2, 2018

I like the idea, thanks for sharing!

@VanessaVanG

rdlou, your idea seems great! This is what I ended up doing: I made a text doc like this:
Dublin: Cork,
Paris: Dijon,
Moscow: Vladivostok,
...
where the first city is the one that's mistaken and the second city is a city that returns the correct country (as in there isn't another city by that name in the US).
I used regex and made replacements. Here's my code: https://github.com/MAVRYK/GW-Project3/blob/master/data_prep/location_extractor.ipynb

(In case you're wondering about the stopwords I removed, they're words like Franklin, Harrison, Liberal, Helena and Defiance that clearly aren't city names.)
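A minimal sketch of the replace-then-extract idea, for readers who don't want to open the notebook. This is not the notebook's code: the `PROXIES` dict simply mirrors the text file above, and `prepare_text` is a hypothetical helper name.

```python
import re

# Ambiguous city -> stand-in city that resolves to the intended country
# (same pairs as the text file above).
PROXIES = {"Dublin": "Cork", "Paris": "Dijon", "Moscow": "Vladivostok"}

# Words dropped before extraction (from the stopword list above).
STOPWORDS = {"Franklin", "Harrison", "Liberal", "Helena", "Defiance"}


def _word_pattern(words):
    # Whole-word, case-sensitive alternation, so "Parisian" is left alone.
    return re.compile(r"\b(?:" + "|".join(map(re.escape, sorted(words))) + r")\b")


PROXY_RE = _word_pattern(PROXIES)
STOPWORD_RE = _word_pattern(STOPWORDS)


def prepare_text(text: str) -> str:
    """Remove stopwords, then swap ambiguous cities for their stand-ins."""
    text = STOPWORD_RE.sub("", text)
    return PROXY_RE.sub(lambda m: PROXIES[m.group(0)], text)
```

The prepared string can then be passed to `GeoText(...)` as usual. Note that the extracted city names will be the stand-ins, so this only helps when you care about the country rather than the city.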

@albertc1
Contributor

I was having the same problem. My simple solution was to sort the cities15000.txt datafile by ascending population, so that the biggest cities get processed later and overwrite the smaller cities in GeoText.index.cities.

#18
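A rough sketch of the sorting step, assuming cities15000.txt follows the tab-separated GeoNames dump layout, where population is the 15th column (index 14), name is index 1 and country code is index 8. `sort_by_population` and `build_city_index` are hypothetical names; the second is only a toy last-one-wins index to show the effect, not geotext's actual loader.

```python
POPULATION_COLUMN = 14  # 0-based "population" column in the GeoNames layout
NAME_COLUMN = 1
COUNTRY_COLUMN = 8


def sort_by_population(tsv_text: str) -> str:
    """Return the rows sorted so the most populous cities come last."""
    rows = [line.split("\t") for line in tsv_text.splitlines() if line.strip()]
    # Treat a blank population field as 0 so such rows sort first.
    rows.sort(key=lambda row: int(row[POPULATION_COLUMN] or 0))
    return "\n".join("\t".join(row) for row in rows) + "\n"


def build_city_index(tsv_text: str) -> dict:
    """Toy index (city name -> country code) where later rows overwrite earlier ones."""
    index = {}
    for line in tsv_text.splitlines():
        if not line.strip():
            continue
        cols = line.split("\t")
        index[cols[NAME_COLUMN]] = cols[COUNTRY_COLUMN]
    return index
```

To apply it, read the data file as UTF-8 text, pass the contents through `sort_by_population`, and write the result back. Because the sort is ascending, the largest city for each name is the last one written into the index.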


4 participants