Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fsspec integration #63
fsspec integration #63
Changes from 10 commits
0a6d056
ed0073e
18bde17
a9b0d51
bf4f5e0
fdd2a79
6495836
68a2d1f
ec9c559
956b1c0
a5a8e82
0a0a2fc
032c976
d26a651
bf1e368
3072e4e
71d9ef7
31fbb58
79449fd
abe7a44
4a05dcb
2f70443
d8bba78
fca4619
3fc017e
f02fa2b
c1ee71d
cdfa2ab
5b9b0f7
45868c9
ecc9399
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the right type for
loop
?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If
None
is the only valid argument toloop
, then we should remove it from__init__
and manually passloop=None
tosuper().__init__
below.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove
loop
too?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest leaving it here even if we don't use it, to match the normal AsyncFileSystem signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to add some tests for this? I think this function is the least stable of everything because it relies on custom code to split the ranges into per-file ranges and piece the outputs together again.
We should have some tests for multiple ranges in a single file, single range for multiple files, multiple ranges for multiple files, and maybe overlapping ranges too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this. I strongly suspect that it will "just work" with the upstream fsspec code, actually, rather than having to carve out the per-file requests and repeat the code in rust. The only downside of that would be bytes copies in python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it would just work with the upstream code, but the benefit here is that we get request merging in the underlying Rust code. We should absolutely use the underlying
get_ranges
if we can; we just have to map between multi-file and single-file ranges.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as_opposed to doing it in python, you mean? I wonder if it makes a practical difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. My position is mostly: we should use whatever tools
object_store
provides us, because the Rust code I'd guess will be slightly faster, which could make a difference for large-scale data loading withget_ranges
, like in zarr.I don't think it's too much work to have this helper function, and it would also be useful in places that don't depend on
fsspec
, like in the zarr PR: https://github.com/zarr-developers/zarr-python/pull/1661/files#diff-6f6bc4f5bf8c4e9eb8f9bb486d3699e17e4a4d6efc90b922c63a75d45cd7a9a6R47-R51There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I can agree with that reasoning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream says that the type here is
int | list[int]
: https://github.com/fsspec/filesystem_spec/blob/9a161714f0bbfe44ee769f259420f2f7db975471/fsspec/asyn.py#L495-L497We should update here for symmetry, and then we'll need to fix line 89 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can any element of
starts
orends
beNone
? Or these are always bounded ranges?I updated the zarr PR to handle non-bounded ranges in the multi-request path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can choose to support that. For kerchunk use, we will only have full files (called via cat()) and all known ranges (called via cat_ranges()). Having all that and partial/suffix ranges in one place would be convenient, but I don't think urgent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we can handle that in a follow up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this what's going to have a default implementation when fsspec/filesystem_spec#1732 is merged? Or this is separate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It could have been done with that PR, once release.
I was worried about passing Buffers around rather than bytes and/or whether read buffering might be implemented on the rust side (without copies). I can confirm that AbstractBufferedFile works directly including the fsspec change.
However, we will end up making a class to deal with writing data unless we only every write whole memory buffers in a single go as opposed to multi-part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can wrap a
BufWriter
similarly to what we do now withBufReader
, which will use a multi part upload under the hood.