Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Overview query breakout #89

Merged
merged 7 commits into from
Sep 5, 2024
Merged

Overview query breakout #89

merged 7 commits into from
Sep 5, 2024

Conversation

pgulley
Copy link
Member

@pgulley pgulley commented Jul 23, 2024

Breaking out the overview query into several smaller queries for daily counts, top languages, etc.
The new routes have not been tested, hence draft status!

@pgulley pgulley marked this pull request as ready for review July 25, 2024 20:43
@pgulley pgulley requested a review from kilemensi July 25, 2024 20:43
Copy link
Contributor

@kilemensi kilemensi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍🏽 ... Looks good for a draft.

},
"lang": {"terms": {"field": "language.keyword", "size": 100}},
"domain": {"terms": {"field": "canonical_domain", "size": 100}},
"tld": {"terms": {"field": "tld", "size": 100}},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we dropping tld?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of the other layers above this point in the API use TLD- we had been calculating it here, I think, so that we had a backup for when the total count was above the 10,000 document limit imposed by elasticsearch- and we can replace that function with the domain field- so I removed it here in the spirit of speeding up the query! But, happy to consider leaving it in- @rahulbot any thoughts?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can't think of a researc huse of TLD in our current practices. I'd support removal.

Copy link
Member Author

@pgulley pgulley Jul 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to put this snippet of a conversation I had with @ibnesayeed many months ago here for context-

Basically, we had an aggregation on TLDs, so it would report how many matching documents are there for each TLD, which we can sum up to get the better total count. While we had aggregation on other aspects, such as language, date, etc., which I could have used for summation, but I chose TLD for two reasons: 1) it is finite in number, so there are very few entries to add and will never overflow the usual search result limit, and 2) almost all documents will have TLDs associated, unlike language or publication date, which might be empty, hence some documents would go uncounted.

I think this property of TLDs, where it is always filled, is also true of canonical_domain- so we can drop that in instead.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implemented this backup in the most recent commit

}
TOP_LANGS = {"toplangs": {"terms": {"field": "language.keyword", "size": 100}}}
TOP_DOMAINS = {
"topdomains": {"terms": {"field": "canonical_domain.keyword", "size": 100}}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Field changing from canonical_domain to canonical_domain.keyword?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! In trying to benchmark the latency of these operations I noticed that aggregating on the .keywords field of canonical domain made the query return significantly faster- ultimately it made the overview query almost twice as fast.

client.py Outdated Show resolved Hide resolved
@pgulley pgulley merged commit c22f9e7 into main Sep 5, 2024
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants