Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Kernel] Implement AddFile class using DelegateRow #3993

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

qiyuandong-db
Copy link
Contributor

@qiyuandong-db qiyuandong-db commented Dec 19, 2024

Which Delta project/connector is this regarding?

  • Spark
  • Standalone
  • Flink
  • Kernel
  • Other (fill in here)

Description

This PR implements the AddFile class, which represents an 'AddFile' Delta log action. The implementation includes the constructor, field getters, necessary overrides, and APIs for updating specific fields to create a new AddFile instance. It also add two fields baseRowId and defaultRowCommitVersion btw, which will be used in Row Tracking later.

An AddFile instance is created from a Row that represents an 'AddFile' action and contains all its field values. We avoid materializing all fields from the Row when creating an AddFile instance. Instead, we only maintain a reference to the underlying Row, and the getters retrieve values directly from it.

The backing Row can be a GenericRow or a DelegateRow, the latter being newly introduced in this PR. While Row is immutable, the DelegateRow class wraps an existing Row and allows overriding values for specific ordinals. In this way, we can create a modified view of a Row without mutating the original one. This is used in this PR to update a certain field of an AddFile instance, where it returns a new AddFile backed by a DelegateRow.

How was this patch tested?

Unit tests were added in AddFileSuite.scala.

Does this PR introduce any user-facing changes?

No.

@qiyuandong-db qiyuandong-db force-pushed the delta-kernel-implement-AddFile branch from 1533153 to e8b758f Compare December 19, 2024 22:38
@qiyuandong-db qiyuandong-db force-pushed the delta-kernel-implement-AddFile branch from e8b758f to 058587a Compare December 20, 2024 10:24
@johanl-db johanl-db self-requested a review December 20, 2024 10:37
Comment on lines +103 to +111
/**
* The underlying {@link Row} that represents an 'AddFile' action and contains all its field
* values. Can be either a {@link GenericRow} or a {@link DelegateRow}.
*/
private final Row row;

public AddFile(Row row) {
this.row = row;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make the concept of type backed by a Row generic so that it can eventually be reused for all data that is passed between the engine and kernel - e.g. other actions - without having to reimplement the same core functionality every time

At the minimum, something that enforces the contract that the type is backed by a Row:

public class AddFile extends RowBacked {}

abstract class RowBacked {
  private final Row row;

  public Row toRow() {
    return row;
  }
}

Although it would be interesting to see how much common functionality we can put into RowBacked, e.g. the concrete type provides a schema, and we have a generic implementation for common methods based on that schema:

public class AddFile extends RowBacked {
  public static final StructType FULL_SCHEMA = SCHEMA_WITH_STATS;
  
  protected StructType getFullSchema() {
    return FULL_SCHEMA;
  }
}

abstract class RowBacked {
  abstract protected StructType getFullSchema();

  public boolean equals(Object obj) {
    // Probably not correct but to give an idea
    if (this == obj) return true;
    if (!(obj instanceof RowBacked)) return false;
    RowBacked other = (RowBacked) obj;
    return getFullSchema().fields().stream().allMatch(
      field -> row.get(field.name()).equals(other.toRow().get(field.name())));
    }

Comment on lines +33 to +43
private def generateTestAddFileRow(
path: String = "path",
partitionValues: Map[String, String] = Map.empty,
size: Long = 10L,
modificationTime: Long = 20L,
dataChange: Boolean = true,
deletionVector: Option[DeletionVectorDescriptor] = Option.empty,
tags: Option[Map[String, String]] = Option.empty,
baseRowId: Option[Long] = Option.empty,
defaultRowCommitVersion: Option[Long] = Option.empty,
stats: Option[String] = Option.empty
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would actually make a good constructor inside AddFile for future use
It can already be used for convertDataFileStatus


public static final StructType SCHEMA_WITH_STATS = SCHEMA_WITHOUT_STATS.add(JSON_STATS_FIELD);

/** Full schema of the {@code add} action in the Delta Log. */
public static final StructType FULL_SCHEMA = SCHEMA_WITH_STATS;

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't leave a comment on line 81 so putting it here:

  public static Row convertDataFileStatus(

This is a good example of what I mentioned in the slack thread:

  • All public kernel APIs should use Row as this is what engines understand
  • All internal kernel APIs should use concrete, internal types such as AddFile
  • (Optional?) We should convert from Row to concrete type at the boundary with engines so that all the internal kernel code doesn't need to use Row

This is an internal API, so it should return an AddFile.
It's then used in generateAppendActions which is a public API, which is where we'll want to turn into a row:

  static CloseableIterator<Row> generateAppendActions(...) {
          AddFile addFile =
              AddFile.convertDataFileStatus(
                  tableRoot,
                  dataFileStatus,
                  ((DataWriteContextImpl) dataWriteContext).getPartitionValues(),
                  true /* dataChange */);
          return SingleAction.createAddFileSingleAction(addFile.toRow());

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants