
feat: support check equality of schema and arrow schema #10903

Merged · 1 commit merged into main from zj/schema on Jul 17, 2023
Conversation

@ZENOTME (Contributor) commented Jul 12, 2023

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

To support the iceberg sink (#10875), we need to check schema equality before creating the writer.
Icelake supports converting its schema to an Arrow schema, so this PR implements a function to check equality between our schema and an Arrow schema.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

Click here for Documentation

Types of user-facing changes

Please keep the types that apply to your changes, and remove the others.

  • Installation and deployment
  • Connector (sources & sinks)
  • SQL commands, functions, and operators
  • RisingWave cluster configuration changes
  • Other (please specify in the release note below)

Release note

    .iter()
    .zip_eq_fast(arrow_schema.fields())
    .all(|(field, arrow_field)| {
        field
@ZENOTME (Contributor, Author):
We only check that the types in the schema match, and ignore the other field attributes (such as the field name). Do we need to check those? 🤔 cc @liurenjie1024

Contributor:
You can provide a parameter in this function to let users choose whether to check names or not.

Contributor:
For the current use case, I think we should also check field names. Fields in a table schema are identified by column name, and there is no enforced order of fields. We can add a version without comparing field names when necessary.

@codecov bot commented Jul 12, 2023

Codecov Report

Merging #10903 (24b0d08) into main (4cdf329) will increase coverage by 0.00%.
The diff coverage is 96.15%.

@@           Coverage Diff           @@
##             main   #10903   +/-   ##
=======================================
  Coverage   69.91%   69.91%           
=======================================
  Files        1309     1309           
  Lines      223864   223916   +52     
=======================================
+ Hits       156512   156556   +44     
- Misses      67352    67360    +8     
Flag Coverage Δ
rust 69.91% <96.15%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files Coverage Δ
src/common/src/catalog/schema.rs 83.20% <96.15%> (+3.40%) ⬆️

... and 4 files with indirect coverage changes


@xiangjinwu (Contributor) commented Jul 12, 2023

(This discusses arrow types in general, not necessarily iceberg.)

FYI there is an existing mapping between our DataType and arrow as part of UDF:
https://github.com/risingwavelabs/risingwave/blob/main/src/common/src/array/arrow.rs

To sink an internal DataType into an arrow one, there are some additional questions to answer:

  • Is it okay to sink a Time (always microsecond as i64) into Time64(nano)?
    • Maybe ok, by * 1000 without overflow
  • Is it okay to sink a Time (always microsecond as i64) into Time64(milli or sec)?
    • Maybe ok, by rounding
  • Is it okay to sink a Time (always microsecond as i64) into Time32(milli or sec)?
    • Maybe ok, by rounding without overflow
  • Is it okay to sink a Time (always microsecond as i64) into Time32(micro or nano)?
    • Can overflow

Note that following the same rationale above, ingesting from a Time32(micro or nano) source is totally ok. That is, the mapping here should not be described with symmetric words like "equality" or "similar", but with directional words like from/source and into/sink.
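
To make the directionality concrete, here is a minimal standalone sketch (not code from this PR; the function names are made up) of the microsecond-to-other-unit conversions discussed above:

```rust
/// Sink a time-of-day stored as microseconds (i64) into Time64(nano):
/// multiply by 1000, reporting overflow instead of wrapping.
fn time_us_to_ns(us: i64) -> Option<i64> {
    us.checked_mul(1_000)
}

/// Sink microseconds into a millisecond column: divide with rounding.
/// (Time-of-day values are non-negative, so this simple form suffices.)
fn time_us_to_ms(us: i64) -> i64 {
    (us + 500) / 1_000
}

fn main() {
    assert_eq!(time_us_to_ns(1_500), Some(1_500_000));
    assert_eq!(time_us_to_ms(1_499), 1); // 1.499 ms rounds down
    assert_eq!(time_us_to_ms(1_500), 2); // 1.5 ms rounds up
}
```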

Some other types to note:

  • Timestamp is equivalent to Timestamp(micro, None)
  • Timestamptz is equivalent to Timestamp(micro, Some("UTC")), or we can also support timezone conversion if non-UTC is given
  • Interval is closest to Interval(MonthDayNano) rather than Duration
  • Bytea to FixedSizeBinary may lead to truncation (also FixedSizeList)
  • Varchar can be mapped to Utf8
  • the current dynamic-scale Decimal needs a static-scale Decimal256(56, 28) to avoid loss of precision. The UDF implementation is just experimental
  • UDF uses a trick to handle jsonb and rw_int256 (sketched below):
    • varchar is Utf8 and jsonb is LargeUtf8
    • decimal is Decimal128 and rw_int256 is Decimal256
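
As a rough illustration of that trick, with local stand-in enums (the real mapping lives in src/common/src/array/arrow.rs; the precision/scale values below are placeholders, not the actual ones):

```rust
// Stand-ins for the risingwave DataType and arrow_schema::DataType enums.
enum RwType { Varchar, Jsonb, Decimal, Int256 }

#[derive(Debug, PartialEq)]
enum ArrowType { Utf8, LargeUtf8, Decimal128(u8, i8), Decimal256(u8, i8) }

fn to_arrow(t: &RwType) -> ArrowType {
    match t {
        RwType::Varchar => ArrowType::Utf8,
        // jsonb piggybacks on LargeUtf8 so it stays distinguishable from varchar.
        RwType::Jsonb => ArrowType::LargeUtf8,
        // Placeholder precision/scale for the sketch.
        RwType::Decimal => ArrowType::Decimal128(38, 28),
        RwType::Int256 => ArrowType::Decimal256(76, 0),
    }
}

fn main() {
    assert_eq!(to_arrow(&RwType::Jsonb), ArrowType::LargeUtf8);
}
```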

@ZENOTME (Contributor, Author) commented Jul 12, 2023

FYI there is an existing mapping between our DataType and arrow as part of UDF:
https://github.com/risingwavelabs/risingwave/blob/main/src/common/src/array/arrow.rs

Thanks! We can directly reuse it.

To sink an internal DataType into an arrow one, there are some additional questions to answer:
Is it okay to sink a Time (always microsecond as i64) into Time64(nano)?
Maybe ok, by * 1000 without overflow
Is it okay to sink a Time (always microsecond as i64) into Time64(milli or sec)?
Maybe ok, by rounding
Is it okay to sink a Time (always microsecond as i64) into Time32(milli or sec)?
Maybe ok, by rounding without overflow
Is it okay to sink a Time (always microsecond as i64) into Time32(micro or nano)?
Can overflow

Good question. As you say, when we convert internal data to arrow data, the result may not be exactly the same (e.g. due to rounding or overflow). I think we can let the user choose to omit or insert the inconsistent data in this case.
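
A tiny sketch of what such a user-facing choice could look like (purely hypothetical; nothing like this is in the PR):

```rust
/// Hypothetical per-sink policy for values that cannot round-trip exactly
/// (e.g. a time that would overflow, or a decimal that must be rounded).
#[derive(Clone, Copy, Debug)]
enum OnLossyValue {
    /// Omit the offending row.
    Skip,
    /// Insert the rounded/converted value anyway.
    InsertConverted,
}
```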

@liurenjie1024 (Contributor) left a comment:
Generally LGTM

/// Check if the schema can convert to the iceberg table schema.
///
/// If `is_check_name` is enabled, the name of the field will be checked.
pub fn check_to_arrow_schema(&self, arrow_schema: &ArrowSchema, is_check_name: bool) -> bool {
Contributor:
Suggested change
- pub fn check_to_arrow_schema(&self, arrow_schema: &ArrowSchema, is_check_name: bool) -> bool {
+ pub fn same_as_arrow_schema(&self, arrow_schema: &ArrowSchema, is_check_name: bool) -> bool {

Contributor:
I think this check should not compare fields in order, but should compare field names and data types.

Contributor:
I agree matching fields by name is more reasonable here. Just for context, the cast between two structs is by position rather than by name (#9694).
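
For illustration, an order-insensitive comparison might look like this sketch (simplified stand-in types, not the PR's implementation):

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
enum DataType { Int32, Varchar }

struct Field { name: String, data_type: DataType }

/// Compare field sets by name + type, ignoring declaration order.
/// Assumes column names are unique within a schema.
fn same_fields_by_name(ours: &[Field], theirs: &[Field]) -> bool {
    if ours.len() != theirs.len() {
        return false;
    }
    // Index our fields by name, then look each of their fields up,
    // so field order no longer matters.
    let by_name: HashMap<&str, &DataType> = ours
        .iter()
        .map(|f| (f.name.as_str(), &f.data_type))
        .collect();
    theirs
        .iter()
        .all(|f| by_name.get(f.name.as_str()) == Some(&&f.data_type))
}

fn main() {
    let ours = vec![
        Field { name: "id".into(), data_type: DataType::Int32 },
        Field { name: "name".into(), data_type: DataType::Varchar },
    ];
    let theirs = vec![
        Field { name: "name".into(), data_type: DataType::Varchar },
        Field { name: "id".into(), data_type: DataType::Int32 },
    ];
    assert!(same_fields_by_name(&ours, &theirs)); // order differs, still ok
}
```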

@xiangjinwu (Contributor) commented Jul 12, 2023

Linking to the types supported by iceberg here:
https://iceberg.apache.org/docs/latest/schemas/
https://iceberg.apache.org/spec/#schemas-and-data-types
icelake-arrow-mapping
official-arrow-mapping

It seems the following types correspond well: boolean, int32, int64, float32, float64, date, time, timestamp, timestamptz, string, binary, list, struct

Then some types only exist on our side:

  • int16 may be stored in int32 (or even int64? do we intend to support or reject sinking from int32 into int64?)
  • interval seems not representable. Options: (a) report error (b) as string (c) as fixed(16) or binary
  • jsonb maybe as string

Some types only exist in iceberg:

  • fixed(L): (a) report error (b) bytea with truncation
  • map<K, V>: report error because hstore is not supported

Similar but different type: decimal.
Maybe allow sinking into any decimal(p, s), rounding when there is not enough scale, and raising a runtime overflow error / skipping when there is not enough precision.
Example: we may have 1e28 and 1e-28 in the same column, which is impossible in iceberg (made concrete in the sketch after this list):

  • When the iceberg column is decimal(38, <=9), it can hold 1e28 but needs to round 1e-28 to 0
  • When the iceberg column is decimal(38, >=28), it can hold 1e-28 losslessly, but 1e28 would be an overflow error
  • For scale 10..=27, it would overflow for 1e28 and round 1e-28 to 0
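
A back-of-the-envelope sketch (not RisingWave code) of fitting a dynamic-scale value into decimal(p, s) with p <= 38, reproducing the cases above:

```rust
/// A value is `mantissa * 10^(-scale)`. Returns the rescaled mantissa if it
/// fits in `p` digits at scale `s`, otherwise an overflow error.
fn fit_decimal(mantissa: i128, scale: u32, p: u32, s: u32) -> Result<i128, &'static str> {
    let rescaled = if s >= scale {
        // Widening the scale can itself overflow (e.g. 1e28 into (38, 28)).
        mantissa.checked_mul(10i128.pow(s - scale)).ok_or("overflow")?
    } else {
        // Narrowing the scale rounds half-up (simplified: assumes mantissa >= 0).
        let div = 10i128.pow(scale - s);
        (mantissa + div / 2) / div
    };
    if rescaled.unsigned_abs() < 10u128.pow(p) {
        Ok(rescaled)
    } else {
        Err("overflow")
    }
}

fn main() {
    let e28 = 10i128.pow(28); // 1e28 as (mantissa = 10^28, scale = 0)
    assert_eq!(fit_decimal(e28, 0, 38, 9), Ok(10i128.pow(37))); // holds 1e28
    assert_eq!(fit_decimal(1, 28, 38, 9), Ok(0));               // 1e-28 rounds to 0
    assert!(fit_decimal(e28, 0, 38, 28).is_err());              // 1e28 overflows
    assert_eq!(fit_decimal(1, 28, 38, 28), Ok(1));              // 1e-28 lossless
}
```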

@liurenjie1024 (Contributor):
@xiangjinwu Good point. Data type matching is quite important for compatibility, and we do need careful testing and documentation for it. My suggestion is that in the first version we only support the well-matched types listed above and reject the others. Complex types can be resolved one by one later.

schema_fields
    .get(arrow_field.name())
    .and_then(|data_type| {
        if *data_type == &arrow_field.data_type().into() {
@xiangjinwu (Contributor) commented Jul 12, 2023:
This into currently panics on unsupported types. We need to update it to a TryFrom and handle those cases.

Also note that, among the well-matched types listed above:

  • RW and official Java iceberg use Binary, but icelake uses LargeBinary
  • RW and official Java iceberg use Time64(Micro), but icelake uses Time32(Micro), likely a bug
  • RW allows Timestamp(Micro, Some(_)), official Java iceberg uses Timestamp(Micro, Some("UTC")), but icelake uses Timestamp(Micro, None)

After updating `into` to `try_into`, this PR should be good to go. The 3 inconsistencies above may be resolved on the icelake side.
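
A sketch of the suggested fallible direction, with local stand-in enums (the real conversion would sit beside the existing mapping in src/common/src/array/arrow.rs):

```rust
// Stand-ins for the real risingwave and arrow_schema type enums.
#[derive(Debug, PartialEq)]
enum DataType { Int64, Varchar }

#[derive(Debug)]
enum ArrowDataType { Int64, Utf8, Duration }

impl TryFrom<&ArrowDataType> for DataType {
    type Error = String;

    /// Fallible mapping: unsupported arrow types become an error
    /// instead of a panic in the infallible `From`/`into` path.
    fn try_from(t: &ArrowDataType) -> Result<Self, Self::Error> {
        match t {
            ArrowDataType::Int64 => Ok(DataType::Int64),
            ArrowDataType::Utf8 => Ok(DataType::Varchar),
            other => Err(format!("unsupported arrow type: {other:?}")),
        }
    }
}

fn main() {
    // The schema check can then treat "unsupported" as "not matching".
    let ok = DataType::try_from(&ArrowDataType::Utf8)
        .map_or(false, |dt| dt == DataType::Varchar);
    assert!(ok);
    assert!(DataType::try_from(&ArrowDataType::Duration).is_err());
}
```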

@ZENOTME (Contributor, Author):
I find that here it should be the data type converted to the arrow type, so that we can convert the internal data to arrow data and sink it into icelake:
*data_type.into() == &arrow_field.data_type()

For converting data_type into arrow_type, only DataType::Serial is unsupported; can we directly convert DataType::Serial to an int64? 🤔

Contributor:
I agree with @xiangjinwu that it's safer to use try_into here so that we will not panic when we have some data types not completely compatible with arrow/iceberg.

Member:
Just some nits:

  • What about putting this function in arrow.rs so that we have all the code for the arrow compatibility layer in one place?
  • Why not something like `impl PartialEq<arrow_schema::Schema> for Schema`?

@ZENOTME (Contributor, Author):
Why not something like `impl PartialEq<arrow_schema::Schema> for Schema`?

Because I'm not sure whether the semantics of this function is PartialEq. I think the semantics is that the schema can be converted to the arrow schema, but that doesn't mean they are equal. E.g.:

Some types only exist in iceberg:
fixed(L): (a) report error (b) bytea with truncation

For this type, maybe we can let bytea convert to it, but bytea and fixed are not equal.
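
A small sketch of that distinction (hypothetical stand-in types): the relation is a one-way "can sink into", not a symmetric equality, which is why PartialEq vocabulary fits poorly:

```rust
#[derive(Debug)]
enum RwType { Bytea, Int16, Int32 }

#[derive(Debug)]
enum IcebergType { Fixed(usize), Int32 }

/// Directional check: "can a column of `source` be written into `target`?"
/// Bytea -> fixed(L) may be acceptable (with truncation), but claiming the
/// two types are *equal* would be wrong.
fn can_sink_into(source: &RwType, target: &IcebergType) -> bool {
    matches!(
        (source, target),
        (RwType::Bytea, IcebergType::Fixed(_))
            | (RwType::Int16, IcebergType::Int32)
            | (RwType::Int32, IcebergType::Int32)
    )
}

fn main() {
    assert!(can_sink_into(&RwType::Int16, &IcebergType::Int32));
    assert!(!can_sink_into(&RwType::Int32, &IcebergType::Fixed(4)));
}
```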

@@ -197,6 +198,32 @@ impl Schema {
true
}
}

/// Check if the schema is same as the iceberg table schema.
Member:
Suggested change
- /// Check if the schema is same as the iceberg table schema.
+ /// Check if the schema is same as an Arrow schema.

@ZENOTME requested a review from liurenjie1024 on July 17, 2023 04:45
@github-actions bot added the breaking-change and user-facing-changes (Contains changes that are visible to users) labels on Jul 17, 2023
@ZENOTME removed the user-facing-changes and breaking-change labels on Jul 17, 2023
@liurenjie1024 (Contributor) left a comment:
LGTM

@liurenjie1024 added this pull request to the merge queue on Jul 17, 2023
Merged via the queue into main with commit 9587646 on Jul 17, 2023
@liurenjie1024 deleted the zj/schema branch on July 17, 2023 06:57