-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROTOCOL] Allow CDC actions to register data files #3285
Conversation
2c5cc40
to
8229e51
Compare
PROTOCOL.md
Outdated
@@ -477,7 +477,7 @@ The following is an example `remove` action. | |||
``` | |||
|
|||
### Add CDC File | |||
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. | |||
The `cdc` action is used to register a file containing only the data that was changed as part of the transaction. The `cdc` action can either add a [Change Data File](#change-data-files) or register a [Data File](#data-files) that is also added by an `add` action. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont get what this means "The cdc
action can either add a Change Data File or register a Data File that is also added by an add
action."
"cdc action" is 1 action - AddCdcFile. How can it "register" another action?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdas I mean that an AddCDCFile
action can contain the path for a Parquet file that is also added by an AddFile
action.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tdas I update the PR to make this more clear.
PROTOCOL.md
Outdated
@@ -477,7 +477,7 @@ The following is an example `remove` action. | |||
``` | |||
|
|||
### Add CDC File | |||
The `cdc` action is used to add a [file](#change-data-files) containing only the data that was changed as part of the transaction. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. | |||
The `cdc` action is used to add a file containing only the data that was changed as part of the transaction. The file might be a [Change Data File](#change-data-files) or a [Data File](#data-file) that does not contain any copied rows. When change data readers encounter a `cdc` action in a particular Delta table version, they must read the changes made in that version exclusively using the `cdc` files. If a version has no `cdc` action, then the data in `add` and `remove` actions are read as inserted and deleted rows, respectively. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i still dont get it. is it your goal to say that the same parquet file can be put as both AddFile and AddCDCFile? if so, then this is not clear.
Also if that is the goal, then its important also write down when a parquet data file can be considered as a CDC file? what constraints does that parquet file need to satisfy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @tdas. I reworded this section to make it more clear and clarify the requirements for the file added by the AddCDCFile
action.
In general the protocol does not refer to "Parquet files". It defines that:
- An
AddFile
action adds aData File
which is a file that is stored in the root directory. - An
AddCDCFile
action adds aChange Data File
which is a file stored in the_change_data
directory, contains only data that was changes as part of a transaction (no copied rows) and has an extra_change_type
column.
I believe that this is unnecessarily restrictive. So we relax this requirement by specifying that:
"The cdc
action is allowed to add a Data File that is also added by an add
action, when it does not contain any copied rows and the _change_type
column is filled for all rows."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm OK with this. I'm pretty sure that no existing readers would mind this. All modern readers that I know will just ignore columns that are not in the schema (because that's what they need to do for Column Mapping anyway). And the only non-modern readers out there are EITHER well-known (old Delta Spark) and do the right thing, OR they are already reading and ignoring the _change_type column.
I agree with @bart-samwel on this change being safe and backward compatible. LGTM. Merging it. |
Which Delta project/connector is this regarding?
Description
Update the Delta PROTOCOL to allow
AddCDCFile
actions that do not add Change Data Files, but instead register Data Files that are also added byAddFile
actions. In this case the_change_type
column in the Data Files might not benull
. Non-change data readers should disregard this column and only process columns defined within the table's schema.How was this patch tested?
N/A
Does this PR introduce any user-facing changes?
Protocol clarification.