-
Notifications
You must be signed in to change notification settings - Fork 1.7k
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Spark][3.3] Make Identity Column High Water Mark updates consistent (#…
…3990) #### Which Delta project/connector is this regarding? <!-- Please add the component selected below to the beginning of the pull request title For example: [Spark] Title of my pull request --> - [x] Spark - [ ] Standalone - [ ] Flink - [ ] Kernel - [ ] Other (fill in here) ## Description Currently: - When we do a MERGE, we will always call `setTrackHighWaterMarks` on the transaction. This will have an effect if there is an INSERT clause in the MERGE. - If we `setTrackHighWaterMarks`, we collect the max/min of the column using `DeltaIdentityColumnStatsTracker`. This stats tracker is only invoked on files that are written/rewritten. These min/max values are compared with the existing high watermark. If the high watermark doesn't exist, we will keep as high watermark the largest of the max or the lowest of the min without checking against the starting value of the identity column. - If an identity column did not generate a value yet, the high watermark is None and isn't stored in the table. This is true for GENERATED ALWAYS AS IDENTITY tables when it is empty and true for GENERATED BY DEFAULT AS IDENTITY tables when it only has user inserted values for the identity column. - If you run a MERGE UPSERT that only ends up updating values in a GENERATED BY DEFAULT table that doesn't have a high watermark yet, we will write a new high watermark that is the highest for the updated file, which may be lower than the starting value specified for the identity column. Proposal: - This PR makes all high water mark updates go through the same validation function by default. It will not update the high watermark if it violates the start or the existing high watermark. Exception is if the table already has a corrupted high water mark. - This does NOT prevent the scenario where we automatically set the high watermark for a generated by default column based on user inserted values when it does respect the start. - Previously, we did not do high water mark rounding on the `updateSchema` path. This seems erroneous as the min/max values can be user inserted. We fix that in this PR. - Previously, we did not validate that on SYNC identity, the result of max can be below the existing high water mark. Now, we also do check this invariant and block it by default. A SQLConf has been introduced to allow reducing the high water mark if the user wants. - We add logging to catch bad high water mark. ## How was this patch tested? New tests that were failing prior to this change. ## Does this PR introduce _any_ user-facing changes? No
- Loading branch information
Showing
9 changed files
with
770 additions
and
44 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.