Added helper function call_with_deduplication and use it to speed up path_file and path_dir for vectors with repeats. #425
See #424.
This PR incurs a slight time cost for fully unique vectors, but I believe the majority of use cases involving long vectors involve many repeated values (e.g. `readr::read_csv(x, id = "file_path")`). For a vector with significant duplication, the time savings is 2x on Mac and 40x on Windows (see below). In the tests below there is a ~5 ms overhead cost for fully unique vectors.
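A minimal sketch of the deduplication pattern this PR describes (the helper name matches the PR title, but this body is an illustrative assumption, not the merged code):

```r
# Hypothetical sketch: run `f` once per distinct value of `x`,
# then expand the results back onto the original positions.
call_with_deduplication <- function(f, x, ...) {
  unique_x <- unique(x)
  # match() maps each element of x to its position in unique_x
  f(unique_x, ...)[match(x, unique_x)]
}

# Usage: an operation over a vector with many repeats is computed
# only once per distinct value.
paths <- c("a/b.csv", "a/b.csv", "c/d.csv")
call_with_deduplication(basename, paths)
#> [1] "b.csv" "b.csv" "d.csv"
```

The speedup comes from calling `f` on `length(unique(x))` elements instead of `length(x)`; for fully unique input, the extra `unique()` and `match()` passes are pure overhead, which matches the timing note above.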
Some ways to speed this up would be:

- Skip the `match()` call if `length(unique(x)) == length(x)` (or close) -- note: I tried this, but by that point a chunk of the work (`unique(x)`) has already been done and becomes sunk cost.
- Use a faster `unique` and/or `match` (e.g. `collapse::funique` or `data.table::chmatch`).
- Use `vctrs::vec_duplicate_id` and/or `vctrs::vec_unique_loc`, since the combined action of `unique(x)` and `match(x, unique_x)` is redundant. I tried this but couldn't figure it out -- it might need to be a new function in `vctrs`. I will submit an issue separately.

Timing details
Mac
Created on 2023-07-05 with reprex v2.0.2
Session info
Windows
Created on 2023-07-05 with reprex v2.0.2
Session info