Overview query breakout #89

pgulley · 2024-07-23T20:16:22Z

Breaking out the overview query into several smaller queries for daily counts, top languages, etc.
The new routes have not been tested, hence draft status!

kilemensi

👍🏽 ... Looks good for a draft.

kilemensi · 2024-07-26T14:12:06Z

client.py

-                    },
-                    "lang": {"terms": {"field": "language.keyword", "size": 100}},
-                    "domain": {"terms": {"field": "canonical_domain", "size": 100}},
-                    "tld": {"terms": {"field": "tld", "size": 100}},


Are we dropping tld?

None of the other layers above this point in the API use TLD. We had been calculating it here, I think, so that we had a backup for when the total count was above the 10,000-document limit imposed by Elasticsearch, and we can replace that function with the domain field, so I removed it here in the spirit of speeding up the query! But I'm happy to consider leaving it in. @rahulbot, any thoughts?

Yeah, I can't think of a research use of TLD in our current practices. I'd support removal.

I'm going to put this snippet of a conversation I had with @ibnesayeed many months ago here for context:

Basically, we had an aggregation on TLDs, so it would report how many matching documents there are for each TLD, which we can sum to get a better total count. We also had aggregations on other aspects, such as language and date, which I could have used for the summation, but I chose TLD for two reasons: 1) TLDs are finite in number, so there are very few entries to add and they will never overflow the usual search result limit, and 2) almost all documents will have a TLD associated, unlike language or publication date, which might be empty, in which case some documents would go uncounted.

I think this property of TLDs, that the field is always filled, is also true of canonical_domain, so we can drop that in instead.

Implemented this backup in the most recent commit
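
For readers following the thread, here is a minimal sketch of the backup idea discussed above. It is not the code from the commit. The function name and the `topdomains` default are assumptions for illustration; `hits.total.relation`, `buckets`, `doc_count`, and `sum_other_doc_count` are standard fields in an Elasticsearch search response. Because the aggregation only returns the top 100 domains, the sketch also adds `sum_other_doc_count` so documents in less common domains are still counted.

```python
def total_from_terms_agg(response: dict, agg_name: str = "topdomains") -> int:
    """Best-effort total count, falling back to a terms aggregation.

    `agg_name` is a hypothetical default; use whatever name the query gave
    the aggregation.
    """
    hits_total = response["hits"]["total"]
    # relation == "eq" means the reported hit count is exact, so use it.
    if hits_total.get("relation") == "eq":
        return hits_total["value"]

    # Otherwise the hit count is only a lower bound (capped at 10,000 by
    # default), so sum the per-bucket document counts instead.
    agg = response["aggregations"][agg_name]
    bucketed = sum(bucket["doc_count"] for bucket in agg["buckets"])
    # Documents whose term fell outside the returned buckets.
    others = agg.get("sum_other_doc_count", 0)
    return max(hits_total["value"], bucketed + others)


# Hand-written response fragment, for illustration only.
capped_response = {
    "hits": {"total": {"value": 10000, "relation": "gte"}},
    "aggregations": {
        "topdomains": {
            "sum_other_doc_count": 300,
            "buckets": [
                {"key": "example.com", "doc_count": 7000},
                {"key": "example.org", "doc_count": 5000},
            ],
        }
    },
}
backup_total = total_from_terms_agg(capped_response)
```

Documents with no value in the aggregated field would still go uncounted, which is why the field needs to be one that is always filled.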

kilemensi · 2024-07-26T14:14:21Z

client.py

+        }
+        TOP_LANGS = {"toplangs": {"terms": {"field": "language.keyword", "size": 100}}}
+        TOP_DOMAINS = {
+            "topdomains": {"terms": {"field": "canonical_domain.keyword", "size": 100}}


Field changing from canonical_domain to canonical_domain.keyword?

Yes! While benchmarking the latency of these operations, I noticed that aggregating on the .keyword sub-field of canonical_domain made the query return significantly faster. Ultimately it made the overview query almost twice as fast.
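
As a rough illustration of the breakout, each aggregation can be sent as its own request body. This is a sketch under assumptions, not the code in client.py: `build_agg_body` is a hypothetical helper, and wrapping the user query in a `query_string` clause is a guess at how the query is passed. The two aggregation definitions are copied from the diff above.

```python
import copy

# Aggregation definitions as they appear in the diff.
TOP_LANGS = {"toplangs": {"terms": {"field": "language.keyword", "size": 100}}}
TOP_DOMAINS = {
    "topdomains": {"terms": {"field": "canonical_domain.keyword", "size": 100}}
}


def build_agg_body(query_string: str, aggs: dict) -> dict:
    """Build a request body that returns only aggregation results.

    Hypothetical helper: "size": 0 asks Elasticsearch to skip returning
    matching documents, so each small query carries just one aggregation.
    """
    return {
        "size": 0,
        "query": {"query_string": {"query": query_string}},
        # Copy so callers cannot mutate the shared module-level definitions.
        "aggs": copy.deepcopy(aggs),
    }


langs_body = build_agg_body("climate", TOP_LANGS)
domains_body = build_agg_body("climate", TOP_DOMAINS)
```

Each body can then be issued independently, which is what allows the smaller queries to run in parallel.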


Overview query breakout, untested

a509e41

pgulley mentioned this pull request Jul 23, 2024

split overview to make more parallel queries? #73

Open

Paige Gulley added 2 commits July 24, 2024 14:45

Adding tests for new endpoints

2eba97e

Adding .keyword significantly improves performance

1351ed0

pgulley marked this pull request as ready for review July 25, 2024 20:43

pgulley requested a review from kilemensi July 25, 2024 20:43

kilemensi reviewed Jul 26, 2024


Paige Gulley and others added 4 commits July 26, 2024 14:07

more grokkable

cf465e5

add back in aggregation total count backup using top_domains

fafa04f

comment to explain the top_domains aggregation in overview

3dbaef0

Merge branch 'main' into overview-breakout

7fb1cad

pgulley merged commit c22f9e7 into main Sep 5, 2024
1 check passed
