-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Kernel] Implement AddFile class using DelegateRow #3993
base: master
Are you sure you want to change the base?
[Kernel] Implement AddFile class using DelegateRow #3993
Conversation
1533153
to
e8b758f
Compare
e8b758f
to
058587a
Compare
/** | ||
* The underlying {@link Row} that represents an 'AddFile' action and contains all its field | ||
* values. Can be either a {@link GenericRow} or a {@link DelegateRow}. | ||
*/ | ||
private final Row row; | ||
|
||
public AddFile(Row row) { | ||
this.row = row; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make the concept of type backed by a Row
generic so that it can eventually be reused for all data that is passed between the engine and kernel - e.g. other actions - without having to reimplement the same core functionality every time
At the minimum, something that enforces the contract that the type is backed by a Row:
public class AddFile extends RowBacked {}
abstract class RowBacked {
private final Row row;
public Row toRow() {
return row;
}
}
Although it would be interesting to see how much common functionality we can put into RowBacked
, e.g. the concrete type provides a schema, and we have a generic implementation for common methods based on that schema:
public class AddFile extends RowBacked {
public static final StructType FULL_SCHEMA = SCHEMA_WITH_STATS;
protected StructType getFullSchema() {
return FULL_SCHEMA;
}
}
abstract class RowBacked {
abstract protected StructType getFullSchema();
public boolean equals(Object obj) {
// Probably not correct but to give an idea
if (this == obj) return true;
if (!(obj instanceof RowBacked)) return false;
RowBacked other = (RowBacked) obj;
return getFullSchema().fields().stream().allMatch(
field -> row.get(field.name()).equals(other.toRow().get(field.name())));
}
private def generateTestAddFileRow( | ||
path: String = "path", | ||
partitionValues: Map[String, String] = Map.empty, | ||
size: Long = 10L, | ||
modificationTime: Long = 20L, | ||
dataChange: Boolean = true, | ||
deletionVector: Option[DeletionVectorDescriptor] = Option.empty, | ||
tags: Option[Map[String, String]] = Option.empty, | ||
baseRowId: Option[Long] = Option.empty, | ||
defaultRowCommitVersion: Option[Long] = Option.empty, | ||
stats: Option[String] = Option.empty |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would actually make a good constructor inside AddFile
for future use
It can already be used for convertDataFileStatus
|
||
public static final StructType SCHEMA_WITH_STATS = SCHEMA_WITHOUT_STATS.add(JSON_STATS_FIELD); | ||
|
||
/** Full schema of the {@code add} action in the Delta Log. */ | ||
public static final StructType FULL_SCHEMA = SCHEMA_WITH_STATS; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't leave a comment on line 81 so putting it here:
public static Row convertDataFileStatus(
This is a good example of what I mentioned in the slack thread:
- All public kernel APIs should use Row as this is what engines understand
- All internal kernel APIs should use concrete, internal types such as AddFile
(Optional?)We should convert from Row to concrete type at the boundary with engines so that all the internal kernel code doesn't need to use Row
This is an internal API, so it should return an AddFile
.
It's then used in generateAppendActions
which is a public API, which is where we'll want to turn into a row:
static CloseableIterator<Row> generateAppendActions(...) {
AddFile addFile =
AddFile.convertDataFileStatus(
tableRoot,
dataFileStatus,
((DataWriteContextImpl) dataWriteContext).getPartitionValues(),
true /* dataChange */);
return SingleAction.createAddFileSingleAction(addFile.toRow());
Which Delta project/connector is this regarding?
Description
This PR implements the
AddFile
class, which represents an 'AddFile' Delta log action. The implementation includes the constructor, field getters, necessary overrides, and APIs for updating specific fields to create a newAddFile
instance. It also add two fieldsbaseRowId
anddefaultRowCommitVersion
btw, which will be used in Row Tracking later.An
AddFile
instance is created from aRow
that represents an 'AddFile' action and contains all its field values. We avoid materializing all fields from theRow
when creating anAddFile
instance. Instead, we only maintain a reference to the underlyingRow
, and the getters retrieve values directly from it.The backing
Row
can be aGenericRow
or aDelegateRow
, the latter being newly introduced in this PR. WhileRow
is immutable, theDelegateRow
class wraps an existingRow
and allows overriding values for specific ordinals. In this way, we can create a modified view of aRow
without mutating the original one. This is used in this PR to update a certain field of anAddFile
instance, where it returns a newAddFile
backed by aDelegateRow
.How was this patch tested?
Unit tests were added in
AddFileSuite.scala
.Does this PR introduce any user-facing changes?
No.