[Backport 3.3] Update liquid clustering docs (#3959)
#### Which Delta project/connector is this regarding?

- [ ] Spark
- [ ] Standalone
- [ ] Flink
- [ ] Kernel
- [x] Other (docs)

## Description

This is a manual backport of #3958 to branch 3.3.

Add docs for `OPTIMIZE FULL`, in-place migration, and creating a table from an external location.
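
For reviewers, the three documented features can be sketched together as follows. This is a minimal illustration rather than text from the docs: the table names `events` and `events_copy`, the column `event_date`, and the path are hypothetical placeholders, and `events` is assumed to be an existing unpartitioned Delta table.

```sql
-- Hypothetical names: `events`, `events_copy`, `event_date`, and the path are placeholders.

-- In-place migration: enable liquid clustering on an existing unpartitioned Delta table.
ALTER TABLE events CLUSTER BY (event_date);

-- Data written before the ALTER is not reclustered by default.
-- OPTIMIZE FULL (Delta 3.3 and above) forces reclustering of all records.
OPTIMIZE events FULL;

-- Create a clustered table at an external location with a CTAS statement.
CREATE EXTERNAL TABLE events_copy CLUSTER BY (event_date)
LOCATION '/path/to/events_copy'
AS SELECT * FROM events;
```
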
## How was this patch tested?

![127 0 0 1_8000_delta-clustering html (6)](https://github.com/user-attachments/assets/93ecca31-5d37-41e0-b118-35b93a42cb75)

## Does this PR introduce _any_ user-facing changes?

No
zedtang authored Dec 12, 2024
1 parent c655a2a commit 9ca7f0c
Showing 1 changed file with 22 additions and 2 deletions.
24 changes: 22 additions & 2 deletions docs/source/delta-clustering.md
@@ -21,7 +21,7 @@ The following are examples of scenarios that benefit from clustering:

## Enable liquid clustering

-You must enable liquid clustering when creating a table. Clustering is not compatible with partitioning or `ZORDER`. Once enabled, run `OPTIMIZE` jobs as normal to incrementally cluster data. See [_](#optimize).
+You can enable liquid clustering on an existing table or during table creation. Clustering is not compatible with partitioning or `ZORDER`. Once enabled, run `OPTIMIZE` jobs as usual to incrementally cluster data. See [_](#optimize).

To enable liquid clustering, add the `CLUSTER BY` phrase to a table creation statement, as in the examples below:

@@ -34,7 +34,8 @@ To enable liquid clustering, add the `CLUSTER BY` phrase to a table creation sta
CREATE TABLE table1(col0 int, col1 string) USING DELTA CLUSTER BY (col0);

-- Using a CTAS statement
-CREATE TABLE table2 CLUSTER BY (col0) -- specify clustering after table name, not in subquery
+CREATE EXTERNAL TABLE table2 CLUSTER BY (col0) -- specify clustering after table name, not in subquery
+LOCATION 'table_location'
AS SELECT * FROM table1;
```

@@ -60,6 +61,15 @@ To enable liquid clustering, add the `CLUSTER BY` phrase to a table creation sta

.. warning:: Tables created with liquid clustering have `Clustering` and `DomainMetadata` table features enabled (both writer features) and use Delta writer version 7 and reader version 1. Table protocol versions cannot be downgraded. See [_](/versioning.md).

You can enable liquid clustering on an existing unpartitioned Delta table using the following syntax:

```sql
ALTER TABLE <table_name>
CLUSTER BY (<clustering_columns>)
```

.. important:: Default behavior does not apply clustering to previously written data. To force reclustering for all records, you must use `OPTIMIZE FULL`. See [_](#optimize-full).

## Choose clustering columns

Clustering columns can be defined in any order. If two columns are correlated, you only need to add one of them as a clustering column.
@@ -87,6 +97,16 @@ OPTIMIZE table_name;

Liquid clustering is incremental, meaning that data is only rewritten as necessary to accommodate data that needs to be clustered. Already clustered data files with different clustering columns are not rewritten.

In <Delta> 3.3 and above, you can force reclustering of all records in a table with the following syntax:

```sql
OPTIMIZE table_name FULL;
```

.. important:: Running `OPTIMIZE FULL` reclusters all existing data as necessary. For large tables that have not previously been clustered on the specified columns, this operation might take hours.

Run `OPTIMIZE FULL` when you change clustering columns. If you have previously run `OPTIMIZE FULL` and there has been no change to clustering columns, `OPTIMIZE FULL` runs the same as `OPTIMIZE`. Always use `OPTIMIZE FULL` to ensure that data layout reflects the current clustering columns.

## Read data from a clustered table

You can read data in a clustered table using any <Delta> client. For best query results, include clustering columns in your query filters, as in the following example: