A 69-bit integer value in JSON could cause data loss when using kafka source #13047

ly9chee · 2023-10-25T06:23:10Z

Describe the bug

Hi there, recently I found some parse error logs in cn node like below

2023-10-24T15:48:19.469303701Z ERROR actor{otel.name="Actor 2740" actor_id=2740 prev_epoch=5303626091593728 curr_epoch=5303626134585344}:executor{otel.name="DmlExecutor AB400000001 (actor 2740)" actor_id=2740}:executor{otel.name="SourceExecutor AB400002716 (actor 2740)" actor_id=2740}:source_parser{actor_id=2740 source_id=2085}:parse_one{split_id="0" offset="8"}: risingwave_connector::parser: failed to parse message, skipping error=Protocol error: InvalidNumber at character 776 ('0')

the JSON data mentioned in log at partition 0 offset 8 in kafka topic is skipped by kafka source and result in data loss eventually.

after some invistigation of the source code I realized that this is simd-json parse error caused by a field has a large integer number value, so I wrote a mimimum rust snippet to reproduce this error:

use simd_json;

fn main() {
    let mut d = br#"{"some": 327267367867168100000}"#.to_vec();
    let v: simd_json::BorrowedValue = simd_json::to_borrowed_value(&mut d).unwrap();
    println!("{}", v);
}

output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { index: 28, character: Some('0'), error: InvalidNumber }

In our kafka upstream, JSON data may contains large integer value inside nested object, it is very hard to detect those values and correct them before ingestion.

{
    "fields":[
        {
            "field_id": "field1",
            "float_value": 327267367867168100000
        }
    ]
}

so, is this an expected behavior? or this field should placed by NULL instead of a parse error?

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

Kubernetes

The version of RisingWave

PostgreSQL 9.5-RisingWave-1.3.0 (c4c31bd)

Additional context

No response

The text was updated successfully, but these errors were encountered:

xiangjinwu · 2023-10-25T07:32:16Z

https://datatracker.ietf.org/doc/html/rfc8259#section-6

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
available and widely used, good interoperability can be achieved by
implementations that expect no more precision or range than these
provide, in the sense that implementations will approximate JSON
numbers within the expected precision. A JSON number such as 1E400
or 3.141592653589793238462643383279 may indicate potential
interoperability problems, since it suggests that the software that
created it expects receiving software to have greater capabilities
for numeric magnitude and precision than is widely available.

Note that when such software is used, numbers that are integers and
are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

Also note that proto3 maps 64-bit integer to json string rather than json number:
https://protobuf.dev/programming-guides/proto3/#:~:text=int64%2C%20fixed64%2C%20uint64,string

In the long term we do not want to be the bottleneck of limiting json numbers to only 2^53. But for good interoperability it is better not to assume all tools in the data pipeline can support it.

or this field should placed by NULL instead of a parse error?

Currently a parse error during ingestion (source) causes the whole row to be skipped, while an expr error (incl casting from string to some type) during computation (stream) fills a null value to that field. This may feel inconsistent, and thanks for bringing it up.

neverchanje · 2024-03-14T03:07:46Z

As the user suggests, simd-json 0.13.5 now enables us to convert very-large-numbers into f64, with a potential loss of precisions.

big-int-as-float
The big-int-as-float feature flag treats very large integers that won't fit into u64 as f64 floats. This prevents parsing errors if the JSON you are parsing contains very large numbers. Keep in mind that f64 loses some precision when representing very large numbers.

But I still think it's not a good way to deal with such numbers. Converting them to decimal https://crates.io/crates/bigdecimal may be better.

ly9chee added the type/bug Something isn't working label Oct 25, 2023

github-actions bot added this to the release-1.4 milestone Oct 25, 2023

xiangjinwu self-assigned this Oct 25, 2023

xiangjinwu changed the title ~~A large integer value in JSON could cause data loss when using kafka source~~ A 69-bit integer value in JSON could cause data loss when using kafka source Oct 25, 2023

xiangjinwu modified the milestones: release-1.4, future-release-1.5 Oct 27, 2023

xiangjinwu modified the milestones: release-1.5, future-release-1.7 Dec 6, 2023

xiangjinwu modified the milestones: future-release-1.7, future-release-1.8 Dec 29, 2023

xiangjinwu modified the milestones: release-1.8, future-release-1.9 Mar 6, 2024

xiangjinwu modified the milestones: release-1.9, future-release-1.10 Apr 8, 2024

xiangjinwu modified the milestones: release-1.10, future-release-1.11 May 13, 2024

xiangjinwu modified the milestones: release-1.11, future-release-1.12 Jul 10, 2024

xiangjinwu modified the milestones: future-release-2.1, future-release-2.2 Aug 19, 2024

xiangjinwu modified the milestones: release-2.2, future-release-2.3 Oct 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

A 69-bit integer value in JSON could cause data loss when using kafka source #13047

A 69-bit integer value in JSON could cause data loss when using kafka source #13047

ly9chee commented Oct 25, 2023

xiangjinwu commented Oct 25, 2023 •

edited

Loading

neverchanje commented Mar 14, 2024

A 69-bit integer value in JSON could cause data loss when using kafka source #13047

A 69-bit integer value in JSON could cause data loss when using kafka source #13047

Comments

ly9chee commented Oct 25, 2023

Describe the bug

Error message/log

To Reproduce

Expected behavior

How did you deploy RisingWave?

The version of RisingWave

Additional context

xiangjinwu commented Oct 25, 2023 • edited Loading

neverchanje commented Mar 14, 2024

xiangjinwu commented Oct 25, 2023 •

edited

Loading