-
Notifications
You must be signed in to change notification settings - Fork 4
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fsspec integration #63
fsspec integration #63
Conversation
…entseed/object-store-rs into kyle/fsspec-integration
I don't know why the conftest is not being picked up (works locally). I could put the fixtures into a utils module ... |
I'm afraid I don't know enough about pytest to understand why the test setup isn't working correctly. Does it work if you move it into |
Co-authored-by: Kyle Barron <[email protected]>
|
||
raise NotImplementedError("todo: handle open-ended ranges") | ||
|
||
async def _cat_ranges( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be able to add some tests for this? I think this function is the least stable of everything because it relies on custom code to split the ranges into per-file ranges and piece the outputs together again.
We should have some tests for multiple ranges in a single file, single range for multiple files, multiple ranges for multiple files, and maybe overlapping ranges too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll look into this. I strongly suspect that it will "just work" with the upstream fsspec code, actually, rather than having to carve out the per-file requests and repeat the code in rust. The only downside of that would be bytes copies in python.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I strongly suspect that it will "just work" with the upstream fsspec code, actually,
Yes, it would just work with the upstream code, but the benefit here is that we get request merging in the underlying Rust code. We should absolutely use the underlying get_ranges
if we can; we just have to map between multi-file and single-file ranges.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
request merging in the underlying Rust code
as_opposed to doing it in python, you mean? I wonder if it makes a practical difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
True. My position is mostly: we should use whatever tools object_store
provides us, because the Rust code I'd guess will be slightly faster, which could make a difference for large-scale data loading with get_ranges
, like in zarr.
I don't think it's too much work to have this helper function, and it would also be useful in places that don't depend on fsspec
, like in the zarr PR: https://github.com/zarr-developers/zarr-python/pull/1661/files#diff-6f6bc4f5bf8c4e9eb8f9bb486d3699e17e4a4d6efc90b922c63a75d45cd7a9a6R47-R51
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I can agree with that reasoning.
store, | ||
*args, | ||
asynchronous=False, | ||
loop=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the right type for loop
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
loop: since both fsspec/python and tokio/rust may be using loops, this should
be kept None for now
If None
is the only valid argument to loop
, then we should remove it from __init__
and manually pass loop=None
to super().__init__
below.
Actually, all the tests are failing in CI, not what I see locally :| |
(this fails because he _fetch_range fix in fsspec isn't released yet) |
OK, so the prefix case doesn't work ATM (I wasn't sure if you were implying it should) and "create" doesn't raise for moto. I haven't tried real s3 yet. |
I don't think |
OK, I can make that one xfail, and I suppose we wait for fsspec's release before merging here? |
If this doesn't work until the next fsspec release, we should wait to merge until that is released and then probably do an fsspec version check on module import. |
It would work now with the subclass I had before, which might prove convenient in the long run... up to you. |
I'm ok with the subclass as long as it's not exposed in a public API. It's essentially just an implementation detail for now, right? We could remove it in the future easily when the new fsspec version is released? |
Let's keep to this formalism for now, then. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a couple more comments, but overall looks good
obstore/python/obstore/fsspec.py
Outdated
kwargs: not currently supported; extra configuration for the backend should be | ||
done to the Store passed in the first argument. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there are no valid parameters to put in **kwargs
we should not include it in the function signature.
obstore/python/obstore/fsspec.py
Outdated
starts: List[int], | ||
ends: List[int], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Upstream says that the type here is int | list[int]
: https://github.com/fsspec/filesystem_spec/blob/9a161714f0bbfe44ee769f259420f2f7db975471/fsspec/asyn.py#L495-L497
We should update here for symmetry, and then we'll need to fix line 89 as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can any element of starts
or ends
be None
? Or these are always bounded ranges?
I updated the zarr PR to handle non-bounded ranges in the multi-request path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can any element of starts or ends be None?
We can choose to support that. For kerchunk use, we will only have full files (called via cat()) and all known ranges (called via cat_ranges()). Having all that and partial/suffix ranges in one place would be convenient, but I don't think urgent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, we can handle that in a follow up.
store: obs.store.ObjectStore, | ||
*args, | ||
asynchronous: bool = False, | ||
loop=None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove loop
too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would suggest leaving it here even if we don't use it, to match the normal AsyncFileSystem signature.
Publishing v0.3.0-beta.6 |
We still need to add documentation about this to the website |
Replaces #60