
Add filter_map, dedup and some variants to the stdlib #2120

Merged: 4 commits merged into master from stdlib/dedup-and-filter-map on Dec 20, 2024

Conversation

yannham (Member) commented Dec 4, 2024

Closes #1958. Adds filter_map and dedup, as well as two variants of dedup that avoid the quadratic behavior of the purely equality-based vanilla version: sort_dedup and hash_dedup.
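A rough usage sketch of the two simpler additions (a sketch only: the 'Some/'None encoding of filter_map's callback and dedup's order preservation are assumptions here, not quotes from the PR):

```nickel
# Hypothetical usage of std.array.filter_map and std.array.dedup.
# Assumes the callback returns 'Some y to keep a mapped element and 'None to
# drop it, and that dedup removes duplicates using plain equality.
{
  # Keep only the numbers above 2, multiplied by 10.
  # Expected: [30, 40]
  mapped =
    [1, 2, 3, 4]
    |> std.array.filter_map (fun x => if x > 2 then 'Some (x * 10) else 'None),

  # Remove duplicates by equality (quadratic in the array length).
  # Expected: [1, 2, 3]
  uniques = [1, 1, 2, 3, 3] |> std.array.dedup
}
```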

@yannham yannham requested a review from jneem December 4, 2024 23:08
@yannham yannham force-pushed the stdlib/dedup-and-filter-map branch from 554ffc9 to 800a949 Compare December 5, 2024 09:43

yannham (Member, Author) commented Dec 11, 2024

It seems we now blow the stack on Windows (1 MB), and it can't really be tweaked from the tests. I tried wrapping all the main tests in a scoped thread with a larger stack, but that didn't help, so I suspect the culprit is the separate LSP command. This is annoying, and it probably also means the standalone LSP binary overflows the stack on Windows.

@yannham yannham force-pushed the stdlib/dedup-and-filter-map branch from 5b1c106 to d8e18b4 Compare December 11, 2024 14:15
jneem (Member) left a comment

Sorry, I just realized this has been hanging for weeks.

I think filter_map and dedup are totally fine. sort_dedup is fine too, but I proposed an alternative just for fun.

I'm not totally sold on hash_dedup, though...

The following review thread is on this excerpt of the `hash_dedup` documentation:

> quadratic for `std.array.dedup`).
>
> The hash function must separate distinct elements with very high
> probability. If two elements have the same hash, they will be considered
jneem (Member):

I find the mention of probability misleading, especially when talking about a "hash" function: in a hash table, having a low collision probability only matters for performance, but here you absolutely need two distinct elements to have different "hash"es or else you'll get the wrong answer.

I wonder if it's worth baking in a notion of hashes and sets directly into nickel? It seems like almost as "core" a notion as equality, and we've had a feature request for sets for a while now...
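To make the concern concrete, here is a hypothetical failure mode. It assumes the interface under review takes the key function first and the array last, and keeps the first element of each collision group; none of that is confirmed by the excerpt above.

```nickel
# If the provided "hash" function is not injective, distinct elements are
# wrongly treated as duplicates: a correctness bug, not just a performance
# problem.
let bad_key = fun x => "constant" in # every element collides
["a", "b", "c"] |> std.array.hash_dedup bad_key
# => ["a"]: "b" and "c" are silently dropped as if they were duplicates
```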

yannham (Member, Author) replied:

You're right, it's a logic error to not separate unequal elements. In some sense, the hash function is the notion of equality at hand. Maybe I shouldn't call this a hash function at all?

yannham (Member, Author) added:

I don't have a better name in mind that would be discoverable as a function name, though (mathematically, what we want is an injection T -> String), and dedup_inject doesn't sound great.

As for the motivation, this function is taken from a real use-case: it's how we check for duplicate uids in our build machine profiles. It's also the only way to deduplicate an array efficiently (in this context meaning O(n log n) or less) without re-ordering it.
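A hedged sketch of that uid use-case, under the same interface assumptions as above; the profile shape is invented for the example.

```nickel
# Deduplicate build machine profiles by uid without re-ordering them.
# std.to_string on a number is injective, so it is a valid "hash" here.
let profiles = [
  { name = "builder-1", uid = 1000 },
  { name = "builder-2", uid = 1001 },
  # duplicate uid below
  { name = "builder-3", uid = 1000 }
] in
profiles |> std.array.hash_dedup (fun p => std.to_string p.uid)
# => one profile per distinct uid, avoiding the quadratic cost of plain dedup
```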

jneem (Member) commented on the `sort_dedup` implementation:

Just brainstorming here, but what if we had dedup_sorted : forall a. Array a -> Array a instead (which would basically do what the inner go function does)? Then you'd use it like xs |> sort cmp |> dedup_sorted. It feels a bit more composable than sort_dedup, because if you know that your array is already sorted then you don't need to sort it again.
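A minimal sketch of what such a dedup_sorted could look like, together with the proposed composition (an illustration of the idea, not the code in the PR):

```nickel
# Drop consecutive equal elements; correct when the input is sorted (or at
# least has its duplicates grouped together).
let dedup_sorted = fun xs =>
  std.array.fold_right
    (fun x acc =>
      if std.array.length acc > 0 && std.array.first acc == x then
        acc
      else
        [x] @ acc)
    []
    xs
in
# The composition suggested above: sort first, then deduplicate neighbors.
let cmp = fun x y =>
  if x < y then 'Lesser else if x == y then 'Equal else 'Greater
in
[3, 1, 3, 2] |> std.array.sort cmp |> dedup_sorted
# => [1, 2, 3]
```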

yannham (Member, Author) replied:

Sure, we can factor out dedup_sorted. I would still keep sort_dedup as a handy alias, though.
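In that factored form, sort_dedup could be little more than a thin wrapper; a sketch under the assumption that it takes the comparison function first, with dedup_sorted inlined from the sketch above to keep this runnable:

```nickel
# Hypothetical: with dedup_sorted factored out, sort_dedup reduces to a
# composition of sorting and neighbor deduplication.
let dedup_sorted = fun xs =>
  std.array.fold_right
    (fun x acc =>
      if std.array.length acc > 0 && std.array.first acc == x then acc else [x] @ acc)
    []
    xs
in
let sort_dedup = fun cmp xs => xs |> std.array.sort cmp |> dedup_sorted in
sort_dedup
  (fun x y => if x < y then 'Lesser else if x == y then 'Equal else 'Greater)
  [2, 1, 2]
# => [1, 2]
```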

yannham (Member, Author) commented Dec 19, 2024

After discussing with @jneem through other channels, we agreed that hash_dedup has a strange interface and will be split off from this pull request. It should make a comeback, but without requiring an explicit hash function: either using a built-in that we'll add to Nickel or, as a middle ground until then, using std.serialize automatically.
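A sketch of that "middle ground" idea (illustration only; neither the helper below nor its use in a future hash_dedup exists yet):

```nickel
# Derive the deduplication key automatically by serializing the element to
# JSON instead of asking the caller for a hash function. For serializable
# values, distinct values yield distinct strings, which is exactly the
# injectivity requirement discussed above.
let value_key = fun x => std.serialize 'Json x in
value_key { uid = 1000, name = "builder-1" }
# => a JSON string usable as a deduplication key
```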

@yannham yannham requested a review from jneem December 19, 2024 17:19
@yannham yannham added this pull request to the merge queue Dec 20, 2024
Merged via the queue into master with commit c412b24 Dec 20, 2024
5 checks passed
@yannham yannham deleted the stdlib/dedup-and-filter-map branch December 20, 2024 08:53
Successfully merging this pull request may close these issues:

- Some utilities in std.array: remove_duplicates and filter_map (#1958)