[WIP] Fix issue #285 : save hive partitioned dataset using NativeExecutionEngine and DaskExecutionEngine #306
Conversation
I looked at the tests here and from what I see, the only failure is in the
So the issue here appears to be that the added tests expect partitioned files as output (which is what this PR is for), but duckdb fails the test because it doesn't support partitioning yet. How do we proceed with this, @goodwanghan?
fugue_test/builtin_suite.py
Outdated
# TODO: in test below, once issue #288 is fixed, use dag.load instead of pd.read_parquet
pd.testing.assert_frame_equal(
    pd.read_parquet(path3).sort_values('a').reset_index(drop=True),
    pd.DataFrame({'c': pd.Categorical([6, 2]), 'a': [1, 7]}).reset_index(drop=True),
Should we assert that c is int type?
The failure on the duckdb test is expected, because duckdb itself does not have a partitioning feature. But it is quite straightforward to convert the duckdb dataframe to an arrow table (
I apologize for the delay; my daughter was born in February, so I have been extremely busy recently. The PR looks great, we just need to change the duckdb io function to pass the test suite. You can see we have other work items to redesign the IO part for all engines, because the current design is a bit messy. But please continue working on this PR and merge it first; the code refactoring can happen later. As long as we have good unit tests, we can ship this feature first. Thank you!
Hi! No problem at all with the delay.
Congrats on the recent birth of your daughter!
I'll try to fix the duckdb io function that is failing the test.
I did some prototyping this week to try to fix issue #288, related to reading a hive partitioned dataset with the Native and Dask execution engines. I think I'm close to a solution in native mode, but it seems more difficult to deal with the dask categorical type, as category values are set to string type, so the inference of the partition column type is more tricky.
Laurent
ecosystem and downloads badge
* plugin * update
@LaurentErreca do you plan to finalize this PR soon?
Hi, yes, I hope I'll find the needed time soon!
Laurent
Upgrading black version
I've been working on this issue. For duckdb to pass the test, it must be able to write a hive partitioned dataset. I propose to use arrow with param
Yes, exactly: we should just convert it to arrow and save (only if a partition key is specified). It is straightforward to convert the duckdb dataframe to an arrow table (
Codecov Report
@@ Coverage Diff @@
## master #306 +/- ##
=========================================
Coverage 100.00% 100.00%
=========================================
Files 101 102 +1
Lines 9480 9511 +31
=========================================
+ Hits 9480 9511 +31
Continue to review full report at Codecov.
It looks good; you only have a linting issue left, @LaurentErreca.
This is done by replacing:
I think you should use
fugue_duckdb/_io.py
Outdated
@@ -70,7 +70,7 @@ def save_df(
         NotImplementedError(f"{mode} is not supported"),
     )
     p = FileParser(uri, format_hint).assert_no_glob()
-    if p.file_format not in self._format_save:
+    if (p.file_format not in self._format_save) or ("partition_cols" in kwargs):
         self._fs.makedirs(os.path.dirname(uri), recreate=True)
         ldf = ArrowDataFrame(df.native.arrow())
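The routing decision in the diff above can be restated as a standalone sketch (function name and arguments are hypothetical, for illustration only): fall back to the Arrow-based writer when the format has no native DuckDB save support, or when partitioning was requested.

```python
# Standalone sketch of the condition added in the diff: duckdb's native
# writer is skipped either when it cannot handle the format at all, or
# when the caller asked for hive partitioning (which duckdb lacks).
def should_use_arrow_fallback(file_format, format_save, kwargs):
    return (file_format not in format_save) or ("partition_cols" in kwargs)

print(should_use_arrow_fallback("parquet", {"parquet", "csv"}, {}))  # False
print(should_use_arrow_fallback("avro", {"parquet", "csv"}, {}))  # True
print(should_use_arrow_fallback("parquet", {"parquet", "csv"},
                                {"partition_cols": ["c"]}))  # True
```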
Please use ArrowDataFrame(df.as_arrow()) instead.
Ok
Enable saving a hive partitioned dataset using the Native (pandas) or Dask execution engine. Work is still in progress, as test_take with Dask is failing.
Related to #285