
Add filter_map, dedup and some variants to the stdlib #2120

Merged: 4 commits merged into master from stdlib/dedup-and-filter-map on Dec 20, 2024

Conversation

yannham (Member) commented Dec 4, 2024

Closes #1958. Adds filter_map and dedup, as well as two variants of dedup that avoid the quadratic behavior of the purely equality-based vanilla version: sort_dedup and hash_dedup.
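A rough usage sketch of the two simpler additions (a sketch only: the 'Some/'None encoding of filter_map's callback and dedup's order preservation are assumptions here, not quotes from the PR):

```nickel
# Hypothetical usage of std.array.filter_map and std.array.dedup.
# Assumes the callback returns 'Some y to keep a mapped element and 'None to
# drop it, and that dedup removes duplicates using plain equality.
{
  # Keep only the numbers above 2, multiplied by 10.
  # Expected: [30, 40]
  mapped =
    [1, 2, 3, 4]
    |> std.array.filter_map (fun x => if x > 2 then 'Some (x * 10) else 'None),

  # Remove duplicates by equality (quadratic in the array length).
  # Expected: [1, 2, 3]
  uniques = [1, 1, 2, 3, 3] |> std.array.dedup
}
```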

@yannham yannham requested a review from jneem December 4, 2024 23:08
@yannham yannham force-pushed the stdlib/dedup-and-filter-map branch from 554ffc9 to 800a949 Compare December 5, 2024 09:43

yannham (Member, Author) commented Dec 11, 2024

It seems we now blow the stack on Windows (1 MB), and it can't really be tweaked from the tests. I tried wrapping all the main tests in a scoped thread with a larger stack, but that didn't help, so I suspect the culprit is the separate LSP command. This is annoying, and it probably also means the standalone LSP binary overflows the stack on Windows.

@yannham yannham force-pushed the stdlib/dedup-and-filter-map branch from 5b1c106 to d8e18b4 Compare December 11, 2024 14:15
jneem (Member) left a comment

Sorry, I just realized this has been hanging for weeks.

I think filter_map and dedup are totally fine. sort_dedup is fine too, but I proposed an alternative just for fun.

I'm not totally sold on hash_dedup, though...

The following review thread is on this excerpt of the `hash_dedup` documentation:

> quadratic for `std.array.dedup`).
>
> The hash function must separate distinct elements with very high
> probability. If two elements have the same hash, they will be considered
jneem (Member):

I find the mention of probability misleading, especially when talking about a "hash" function: in a hash table, having a low collision probability only matters for performance, but here you absolutely need two distinct elements to have different "hash"es or else you'll get the wrong answer.

I wonder if it's worth baking in a notion of hashes and sets directly into nickel? It seems like almost as "core" a notion as equality, and we've had a feature request for sets for a while now...
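To make the concern concrete, here is a hypothetical failure mode. It assumes the interface under review takes the key function first and the array last, and keeps the first element of each collision group; none of that is confirmed by the excerpt above.

```nickel
# If the provided "hash" function is not injective, distinct elements are
# wrongly treated as duplicates: a correctness bug, not just a performance
# problem.
let bad_key = fun x => "constant" in # every element collides
["a", "b", "c"] |> std.array.hash_dedup bad_key
# => ["a"]: "b" and "c" are silently dropped as if they were duplicates
```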

yannham (Member, Author) replied:

You're right, it's a logic error to not separate unequal elements. In some sense, the hash function is the notion of equality at hand. Maybe I shouldn't call this a hash function at all?

yannham (Member, Author) added:

I don't have a better name in mind that would be discoverable as a function name, though (mathematically, what we want is an injection T -> String), and dedup_inject doesn't sound great.

As for the motivation, this function is taken from a real use-case: it's how we check for duplicate uids in our build machine profiles. It's also the only way to deduplicate an array efficiently (in this context meaning O(n log n) or less) without re-ordering it.
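A hedged sketch of that uid use-case, under the same interface assumptions as above; the profile shape is invented for the example.

```nickel
# Deduplicate build machine profiles by uid without re-ordering them.
# std.to_string on a number is injective, so it is a valid "hash" here.
let profiles = [
  { name = "builder-1", uid = 1000 },
  { name = "builder-2", uid = 1001 },
  # duplicate uid below
  { name = "builder-3", uid = 1000 }
] in
profiles |> std.array.hash_dedup (fun p => std.to_string p.uid)
# => one profile per distinct uid, avoiding the quadratic cost of plain dedup
```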

jneem (Member) commented on the `sort_dedup` implementation:

Just brainstorming here, but what if we had dedup_sorted : forall a. Array a -> Array a instead (which would basically do what the inner go function does)? Then you'd use it like xs |> sort cmp |> dedup_sorted. It feels a bit more composable than sort_dedup, because if you know that your array is already sorted then you don't need to sort it again.
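A minimal sketch of what such a dedup_sorted could look like, together with the proposed composition (an illustration of the idea, not the code in the PR):

```nickel
# Drop consecutive equal elements; correct when the input is sorted (or at
# least has its duplicates grouped together).
let dedup_sorted = fun xs =>
  std.array.fold_right
    (fun x acc =>
      if std.array.length acc > 0 && std.array.first acc == x then
        acc
      else
        [x] @ acc)
    []
    xs
in
# The composition suggested above: sort first, then deduplicate neighbors.
let cmp = fun x y =>
  if x < y then 'Lesser else if x == y then 'Equal else 'Greater
in
[3, 1, 3, 2] |> std.array.sort cmp |> dedup_sorted
# => [1, 2, 3]
```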

yannham (Member, Author) replied:

Sure, we can factor out dedup_sorted. I would still keep sort_dedup as a handy alias, though.
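In that factored form, sort_dedup could be little more than a thin wrapper; a sketch under the assumption that it takes the comparison function first, with dedup_sorted inlined from the sketch above to keep this runnable:

```nickel
# Hypothetical: with dedup_sorted factored out, sort_dedup reduces to a
# composition of sorting and neighbor deduplication.
let dedup_sorted = fun xs =>
  std.array.fold_right
    (fun x acc =>
      if std.array.length acc > 0 && std.array.first acc == x then acc else [x] @ acc)
    []
    xs
in
let sort_dedup = fun cmp xs => xs |> std.array.sort cmp |> dedup_sorted in
sort_dedup
  (fun x y => if x < y then 'Lesser else if x == y then 'Equal else 'Greater)
  [2, 1, 2]
# => [1, 2]
```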

yannham (Member, Author) commented Dec 19, 2024

After discussing with @jneem through other channels, we agreed that hash_dedup has a strange interface and will be split off from this pull request. It should make a comeback, but without requiring an explicit hash function: either using a built-in that we'll add to Nickel or, as a middle ground until then, using std.serialize automatically.
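A sketch of that "middle ground" idea (illustration only; neither the helper below nor its use in a future hash_dedup exists yet):

```nickel
# Derive the deduplication key automatically by serializing the element to
# JSON instead of asking the caller for a hash function. For serializable
# values, distinct values yield distinct strings, which is exactly the
# injectivity requirement discussed above.
let value_key = fun x => std.serialize 'Json x in
value_key { uid = 1000, name = "builder-1" }
# => a JSON string usable as a deduplication key
```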

@yannham yannham requested a review from jneem December 19, 2024 17:19
@yannham yannham added this pull request to the merge queue Dec 20, 2024
Merged via the queue into master with commit c412b24 Dec 20, 2024
5 checks passed
@yannham yannham deleted the stdlib/dedup-and-filter-map branch December 20, 2024 08:53
Successfully merging this pull request may close these issues:

- Some utilities in std.array: remove_duplicates and filter_map (#1958)