Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROTOCOL] Allow CDC actions to register data files #3285

Merged
merged 4 commits into from
Jun 25, 2024

Conversation

cstavr
Copy link
Contributor

@cstavr cstavr commented Jun 20, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (PROTOCOL)

Description

Update the Delta PROTOCOL to allow AddCDCFile actions that do not add Change Data Files, but instead register Data Files that are also added by AddFile actions. In this case the _change_type column in the Data Files might not be null. Non-change data readers should disregard this column and only process columns defined within the table's schema.

How was this patch tested?

N/A

Does this PR introduce any user-facing changes?

Protocol clarification.

@cstavr cstavr force-pushed the protocol-cdf-note branch from 2c5cc40 to 8229e51 Compare June 20, 2024 11:41
PROTOCOL.md Outdated
@@ -477,7 +477,7 @@ The following is an example `remove` action.
```

### Add CDC File
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
The `cdc` action is used to register a file containing only the data that was changed as part of the transaction. The `cdc` action can either add a [Change Data File](#change-data-files) or register a [Data File](#data-files) that is also added by an `add` action. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont get what this means "The cdc action can either add a Change Data File or register a Data File that is also added by an add action."

"cdc action" is 1 action - AddCdcFile. How can it "register" another action?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdas I mean that an AddCDCFile action can contain the path for a Parquet file that is also added by an AddFile action.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdas I update the PR to make this more clear.

@cstavr cstavr requested a review from tdas June 21, 2024 18:01
PROTOCOL.md Outdated
@@ -477,7 +477,7 @@ The following is an example `remove` action.
```

### Add CDC File
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
The `cdc` action is used to add a file containing only the data that was changed as part of the transaction. The file might be a [Change Data File](#change-data-files) or a [Data File](#data-file) that does not contain any copied rows. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i still dont get it. is it your goal to say that the same parquet file can be put as both AddFile and AddCDCFile? if so, then this is not clear.

Also if that is the goal, then its important also write down when a parquet data file can be considered as a CDC file? what constraints does that parquet file need to satisfy.

Copy link
Contributor Author

@cstavr cstavr Jun 25, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @tdas. I reworded this section to make it more clear and clarify the requirements for the file added by the AddCDCFile action.

In general the protocol does not refer to "Parquet files". It defines that:

  • An AddFile action adds a Data File which is a file that is stored in the root directory.
  • An AddCDCFile action adds a Change Data File which is a file stored in the _change_data directory, contains only data that was changes as part of a transaction (no copied rows) and has an extra _change_type column.

I believe that this is unnecessarily restrictive. So we relax this requirement by specifying that:
"The cdc action is allowed to add a Data File that is also added by an add action, when it does not contain any copied rows and the _change_type column is filled for all rows."

Copy link
Collaborator

@bart-samwel bart-samwel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm OK with this. I'm pretty sure that no existing readers would mind this. All modern readers that I know will just ignore columns that are not in the schema (because that's what they need to do for Column Mapping anyway). And the only non-modern readers out there are EITHER well-known (old Delta Spark) and do the right thing, OR they are already reading and ignoring the _change_type column.

@tdas
Copy link
Contributor

tdas commented Jun 25, 2024

I agree with @bart-samwel on this change being safe and backward compatible. LGTM. Merging it.

@tdas tdas merged commit e36829b into delta-io:master Jun 25, 2024
9 of 10 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants