Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fsspec integration #63

Merged
merged 31 commits into from
Nov 13, 2024

Conversation

martindurant
Copy link
Contributor

Replaces #60

@martindurant
Copy link
Contributor Author

I don't know why the conftest is not being picked up (works locally). I could put the fixtures into a utils module ...

@kylebarron
Copy link
Member

kylebarron commented Oct 31, 2024

I'm afraid I don't know enough about pytest to understand why the test setup isn't working correctly. Does it work if you move it into conftest.py?

tests/test_fsspec.py Outdated Show resolved Hide resolved
@kylebarron kylebarron changed the title Kyle/fsspec integration fsspec integration Oct 31, 2024
@kylebarron kylebarron mentioned this pull request Oct 31, 2024

raise NotImplementedError("todo: handle open-ended ranges")

async def _cat_ranges(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you be able to add some tests for this? I think this function is the least stable of everything because it relies on custom code to split the ranges into per-file ranges and piece the outputs together again.

We should have some tests for multiple ranges in a single file, single range for multiple files, multiple ranges for multiple files, and maybe overlapping ranges too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll look into this. I strongly suspect that it will "just work" with the upstream fsspec code, actually, rather than having to carve out the per-file requests and repeat the code in rust. The only downside of that would be bytes copies in python.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I strongly suspect that it will "just work" with the upstream fsspec code, actually,

Yes, it would just work with the upstream code, but the benefit here is that we get request merging in the underlying Rust code. We should absolutely use the underlying get_ranges if we can; we just have to map between multi-file and single-file ranges.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

request merging in the underlying Rust code

as_opposed to doing it in python, you mean? I wonder if it makes a practical difference.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True. My position is mostly: we should use whatever tools object_store provides us, because the Rust code I'd guess will be slightly faster, which could make a difference for large-scale data loading with get_ranges, like in zarr.

I don't think it's too much work to have this helper function, and it would also be useful in places that don't depend on fsspec, like in the zarr PR: https://github.com/zarr-developers/zarr-python/pull/1661/files#diff-6f6bc4f5bf8c4e9eb8f9bb486d3699e17e4a4d6efc90b922c63a75d45cd7a9a6R47-R51

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I can agree with that reasoning.

store,
*args,
asynchronous=False,
loop=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the right type for loop?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

   loop: since both fsspec/python and tokio/rust may be using loops, this should
        be kept None for now

If None is the only valid argument to loop, then we should remove it from __init__ and manually pass loop=None to super().__init__ below.

@martindurant
Copy link
Contributor Author

Actually, all the tests are failing in CI, not what I see locally :|

@martindurant
Copy link
Contributor Author

(this fails because he _fetch_range fix in fsspec isn't released yet)

@martindurant
Copy link
Contributor Author

OK, so the prefix case doesn't work ATM (I wasn't sure if you were implying it should) and "create" doesn't raise for moto. I haven't tried real s3 yet.

@kylebarron
Copy link
Member

I don't think "create" will work until apache/arrow-rs#6682

@martindurant
Copy link
Contributor Author

OK, I can make that one xfail, and I suppose we wait for fsspec's release before merging here?

@kylebarron
Copy link
Member

If this doesn't work until the next fsspec release, we should wait to merge until that is released and then probably do an fsspec version check on module import.

@martindurant
Copy link
Contributor Author

It would work now with the subclass I had before, which might prove convenient in the long run... up to you.

@kylebarron
Copy link
Member

kylebarron commented Nov 6, 2024

I'm ok with the subclass as long as it's not exposed in a public API. It's essentially just an implementation detail for now, right? We could remove it in the future easily when the new fsspec version is released?

@martindurant
Copy link
Contributor Author

martindurant commented Nov 8, 2024

Let's keep to this formalism for now, then.

Copy link
Member

@kylebarron kylebarron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple more comments, but overall looks good

obstore/python/obstore/fsspec.py Outdated Show resolved Hide resolved
Comment on lines 54 to 55
kwargs: not currently supported; extra configuration for the backend should be
done to the Store passed in the first argument.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there are no valid parameters to put in **kwargs we should not include it in the function signature.

Comment on lines 82 to 83
starts: List[int],
ends: List[int],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Upstream says that the type here is int | list[int]: https://github.com/fsspec/filesystem_spec/blob/9a161714f0bbfe44ee769f259420f2f7db975471/fsspec/asyn.py#L495-L497

We should update here for symmetry, and then we'll need to fix line 89 as well.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can any element of starts or ends be None? Or these are always bounded ranges?

I updated the zarr PR to handle non-bounded ranges in the multi-request path.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can any element of starts or ends be None?

We can choose to support that. For kerchunk use, we will only have full files (called via cat()) and all known ranges (called via cat_ranges()). Having all that and partial/suffix ranges in one place would be convenient, but I don't think urgent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, we can handle that in a follow up.

store: obs.store.ObjectStore,
*args,
asynchronous: bool = False,
loop=None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove loop too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest leaving it here even if we don't use it, to match the normal AsyncFileSystem signature.

@kylebarron kylebarron enabled auto-merge (squash) November 13, 2024 16:00
@kylebarron kylebarron merged commit 6ec315f into developmentseed:main Nov 13, 2024
4 checks passed
@kylebarron
Copy link
Member

Publishing v0.3.0-beta.6

@kylebarron
Copy link
Member

We still need to add documentation about this to the website

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants