-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Overview query breakout #89
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏽 ... Looks good for a draft.
}, | ||
"lang": {"terms": {"field": "language.keyword", "size": 100}}, | ||
"domain": {"terms": {"field": "canonical_domain", "size": 100}}, | ||
"tld": {"terms": {"field": "tld", "size": 100}}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we dropping tld
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of the other layers above this point in the API use TLD- we had been calculating it here, I think, so that we had a backup for when the total count was above the 10,000 document limit imposed by elasticsearch- and we can replace that function with the domain field- so I removed it here in the spirit of speeding up the query! But, happy to consider leaving it in- @rahulbot any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I can't think of a researc huse of TLD in our current practices. I'd support removal.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to put this snippet of a conversation I had with @ibnesayeed many months ago here for context-
Basically, we had an aggregation on TLDs, so it would report how many matching documents are there for each TLD, which we can sum up to get the better total count. While we had aggregation on other aspects, such as language, date, etc., which I could have used for summation, but I chose TLD for two reasons: 1) it is finite in number, so there are very few entries to add and will never overflow the usual search result limit, and 2) almost all documents will have TLDs associated, unlike language or publication date, which might be empty, hence some documents would go uncounted.
I think this property of TLDs, where it is always filled, is also true of canonical_domain- so we can drop that in instead.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Implemented this backup in the most recent commit
} | ||
TOP_LANGS = {"toplangs": {"terms": {"field": "language.keyword", "size": 100}}} | ||
TOP_DOMAINS = { | ||
"topdomains": {"terms": {"field": "canonical_domain.keyword", "size": 100}} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Field changing from canonical_domain
to canonical_domain.keyword
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! In trying to benchmark the latency of these operations I noticed that aggregating on the .keywords field of canonical domain made the query return significantly faster- ultimately it made the overview query almost twice as fast.
Breaking out the overview query into several smaller queries for daily counts, top languages, etc.
The new routes have not been tested, hence draft status!