[BUG] Issue reading hive partitioned dataset with NativeExecutionEngine #288
The problem is due to the dtype of the partition column, which is set to pd.CategoricalDtype. By converting the dtype to str, the partitioned dataset is read correctly; the user then has to convert the type of the partition column back if necessary. A dirty hack to test this is in triad/collections/schema.py, in the append function (line 232), where obj is a List:
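The workaround being described can be sketched in isolation (a minimal illustration with made-up data, not the actual triad patch):

```python
import pandas as pd

# When a hive-partitioned parquet dataset is read back (via pyarrow),
# the partition column arrives as a pandas categorical dtype.
df = pd.DataFrame({"DAY": pd.Categorical([6, 7, 6]), "value": [1.0, 2.0, 3.0]})
assert isinstance(df["DAY"].dtype, pd.CategoricalDtype)

# The "dirty hack": cast the categorical partition column to str so
# schema inference sees a plain string column instead of a categorical.
df["DAY"] = df["DAY"].astype(str)
print(df["DAY"].dtype)  # object
```

The cost, as noted above, is that the user must cast the column back to its real type (here integer) after reading.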
Ah, good catch. So in pyarrow, the dictionary type is the categorical type, but supporting it could be very hard to implement. Converting categorical to string may be a more practical way? I am not sure yet; maybe we should spend the effort to support the pyarrow dictionary type. I will need to think about it.
Hi, I agree with you, it is better to try to support dictionary first; also, the value type in the dictionary seems to be well inferred.
What makes me think that pyarrow correctly infers types:
I then also tried with a column containing integers:
We need to add this in triad, and then in Fugue.
And then we also need to add tons of unit tests and make it work for all backends.
We will try to solve it in #296 |
I don't think this issue can be closed with #306, because we still can't read a hive-partitioned dataset with the native or Dask execution engines.
Sorry, let me reopen |
I have a pandas dataframe with a column DAY representing the day number in the month (e.g. values from 1 to 31 for December).
I save this dataframe with hive partitioning on DAY.
The result folder has a format similar to this:
```
! tree output_path
output_path
├── DAY=6
│   └── 02b4a05c12fa4791aca2931e47659ecc.parquet
└── DAY=7
    └── bd17a05a5bd948cc824e4730fd03b473.parquet
```
When I try to read the dataset using the Spark execution engine, there is no problem, but the same code fails using the native execution engine.
I also observed that if the list of columns you specify to read does not include the partition column, it works fine:
Environment: