Read Parquet exceptions thrown with PyArrow S3FileSystem #1005
Comments
Hi, thanks for your report. We will look into those (I already put up a PR to fix the list case). For context: the Arrow FS now leverages a rewrite of the Parquet implementation that's a lot faster than the legacy implementation (a few rough edges are still expected, unfortunately).
Great, thanks! I believe glob is not supported in Arrow's S3 FS, so that may be the issue for the glob case. The list case fix is much appreciated 😀
Yeah, glob patterns are not supported by Arrow FS. I don't have a solution to this yet but at the same time I'm not entirely convinced this is even necessary. For example, instead of
Thanks. Yes, the glob pattern example was a trivial one to show the exception. In our real use cases, it is useful to load only a targeted subset of files, e.g., all files for a specific year, month, and various identifiers. But the list path case fix is great and resolves the main issue.
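For readers who land here with the same use case: since Arrow FS does not expand glob patterns, one possible workaround is to expand the pattern on the client side and pass the resulting list of paths to read_parquet, relying on the list case fix mentioned above. This is a minimal sketch and not something proposed by the maintainers in this thread. The helper name select_paths, the bucket layout, and the pattern are all hypothetical and only for illustration.

```python
from fnmatch import fnmatchcase


def select_paths(paths, pattern):
    """Return the sorted subset of `paths` matching a glob-style `pattern`.

    Hypothetical helper: fnmatchcase is case-sensitive on every platform,
    and its `*` also matches `/`, so the pattern is applied to full keys.
    """
    return sorted(p for p in paths if fnmatchcase(p, pattern))


# Hypothetical bucket layout, for illustration only.
listing = [
    "my-bucket/data/year=2023/month=12/part-0.parquet",
    "my-bucket/data/year=2024/month=01/part-0.parquet",
    "my-bucket/data/year=2024/month=02/part-0.parquet",
    "my-bucket/data/year=2024/month=02/_SUCCESS",
]

# Keep only the Parquet files for 2024.
selected = select_paths(listing, "my-bucket/data/year=2024/month=*/*.parquet")

# With a real bucket, the listing could come from PyArrow and the result
# could be handed to Dask as a list of paths (assumes credentials are set up):
#   import dask.dataframe as dd
#   import pyarrow.fs as pafs
#   fs = pafs.S3FileSystem()
#   infos = fs.get_file_info(pafs.FileSelector("my-bucket/data", recursive=True))
#   listing = [i.path for i in infos if i.type == pafs.FileType.File]
#   ddf = dd.read_parquet(select_paths(listing, pattern), filesystem=fs)
```

Listing the prefix once and filtering locally keeps the pattern logic outside the filesystem layer, at the cost of one recursive listing call, which may be slow for very large prefixes.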
Describe the issue:
Hi team,
We are intensive users of Dask and it's a great product!
We use Apache Arrow's pyarrow.fs.S3FileSystem in our ecosystem rather than s3fs.S3FileSystem due to performance and deadlock issues that were found during multiprocessing.
We're able to retrieve single files and entire directories of many files successfully. But when using either a glob path or an iterable of paths, the read_parquet API throws maximum recursion depth exceptions.
It would be really helpful if the team could investigate. Many thanks!
Minimal Verifiable Example:
Exception:
Environment:
OS: Amazon Linux release 2 (Karoo)
Linux: 4.14.336-257.566.amzn2.x86_64
Python: 3.12.2
Packages:
arrow: 1.3.0
dask: 2024.3.1
dask-expr: 1.0.4
numpy: 1.26.4
pandas: 2.2.1
pyarrow: 15.0.2
pyarrow-hotfix: 0.6
Install method: pip