
A 69-bit integer value in JSON could cause data loss when using kafka source #13047

Open · ly9chee opened this issue Oct 25, 2023 · 2 comments
Labels: type/bug (Something isn't working)

ly9chee (Contributor) commented Oct 25, 2023

Describe the bug

Hi there, recently I found some parse error logs on a CN node like the one below:

2023-10-24T15:48:19.469303701Z ERROR actor{otel.name="Actor 2740" actor_id=2740 prev_epoch=5303626091593728 curr_epoch=5303626134585344}:executor{otel.name="DmlExecutor AB400000001 (actor 2740)" actor_id=2740}:executor{otel.name="SourceExecutor AB400002716 (actor 2740)" actor_id=2740}:source_parser{actor_id=2740 source_id=2085}:parse_one{split_id="0" offset="8"}: risingwave_connector::parser: failed to parse message, skipping error=Protocol error: InvalidNumber at character 776 ('0')

The JSON message referenced in the log (partition 0, offset 8 of the Kafka topic) is skipped by the Kafka source, which eventually results in data loss.

After some investigation of the source code, I realized this is a simd-json parse error caused by a field holding a large integer value, so I wrote a minimal Rust snippet to reproduce the error:

use simd_json;

fn main() {
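    // 327267367867168100000 needs 69 bits, so it does not fit in u64.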
    let mut d = br#"{"some": 327267367867168100000}"#.to_vec();
    let v: simd_json::BorrowedValue = simd_json::to_borrowed_value(&mut d).unwrap();
    println!("{}", v);
}

output:

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error { index: 28, character: Some('0'), error: InvalidNumber }

In our Kafka upstream, JSON data may contain large integer values inside nested objects, and it is very hard to detect and correct those values before ingestion. For example:

{
    "fields":[
        {
            "field_id": "field1",
            "float_value": 327267367867168100000
        }
    ]
}

So, is this expected behavior? Or should this field be replaced by NULL instead of causing a parse error?

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

Kubernetes

The version of RisingWave

PostgreSQL 9.5-RisingWave-1.3.0 (c4c31bd)

Additional context

No response

ly9chee added the type/bug (Something isn't working) label on Oct 25, 2023
github-actions bot added this to the release-1.4 milestone on Oct 25, 2023
xiangjinwu (Contributor) commented Oct 25, 2023

https://datatracker.ietf.org/doc/html/rfc8259#section-6

This specification allows implementations to set limits on the range
and precision of numbers accepted. Since software that implements
IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
available and widely used, good interoperability can be achieved by
implementations that expect no more precision or range than these
provide, in the sense that implementations will approximate JSON
numbers within the expected precision. A JSON number such as 1E400
or 3.141592653589793238462643383279 may indicate potential
interoperability problems, since it suggests that the software that
created it expects receiving software to have greater capabilities
for numeric magnitude and precision than is widely available.

Note that when such software is used, numbers that are integers and
are in the range [-(2**53)+1, (2**53)-1] are interoperable in the
sense that implementations will agree exactly on their numeric
values.

Also note that proto3 maps 64-bit integers to JSON strings rather than JSON numbers:
https://protobuf.dev/programming-guides/proto3/#:~:text=int64%2C%20fixed64%2C%20uint64,string

In the long term we do not want to be the bottleneck that limits JSON numbers to only 2^53, but for good interoperability it is better not to assume that every tool in the data pipeline can handle more than that.
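
To make that range concrete, here is a minimal standalone Rust sketch (independent of RisingWave; the value is the one from this report) showing that the number needs 69 bits and is only approximated by binary64:

fn main() {
    // RFC 8259: integers in [-(2^53)+1, (2^53)-1] are interoperable because
    // implementations based on IEEE 754 binary64 represent them exactly.
    let max_interoperable: i64 = (1i64 << 53) - 1; // 9_007_199_254_740_991

    // The value from the bug report needs 69 bits, so it does not even fit in u64 ...
    let original = "327267367867168100000";
    assert!(original.parse::<u64>().is_err());

    // ... while u128 and f64 both accept it, the f64 result is only an approximation.
    let exact: u128 = original.parse().unwrap();
    let approx: f64 = original.parse().unwrap();

    println!("max interoperable integer: {}", max_interoperable);
    println!("exact (u128)             : {}", exact);
    println!("rounded through binary64 : {}", approx as u128); // differs in the low digits
}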

Or should this field be replaced by NULL instead of causing a parse error?

Currently a parse error during ingestion (source) causes the whole row to be skipped, while an expression error (including casting from string to some type) during computation (stream) fills that field with NULL. This may feel inconsistent; thanks for bringing it up.

xiangjinwu self-assigned this on Oct 25, 2023
xiangjinwu changed the title from "A large integer value in JSON could cause data loss when using kafka source" to "A 69-bit integer value in JSON could cause data loss when using kafka source" on Oct 25, 2023
neverchanje (Contributor) commented

As the user suggests, simd-json 0.13.5 now lets us convert very large numbers into f64, with a potential loss of precision.

big-int-as-float
The big-int-as-float feature flag treats very large integers that won't fit into u64 as f64 floats. This prevents parsing errors if the JSON you are parsing contains very large numbers. Keep in mind that f64 loses some precision when representing very large numbers.
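
If we relied on that flag, a minimal sketch based on the feature description above (assuming simd-json is pulled in with that feature enabled in Cargo.toml) is just the original reproduction, which would then parse instead of erroring:

// Cargo.toml (sketch): simd-json = { version = "0.13.5", features = ["big-int-as-float"] }

fn main() {
    // Same payload that currently fails; with the feature enabled, the
    // out-of-range integer is parsed as an f64 approximation instead of an InvalidNumber error.
    let mut d = br#"{"some": 327267367867168100000}"#.to_vec();
    let v: simd_json::BorrowedValue = simd_json::to_borrowed_value(&mut d).unwrap();
    println!("{}", v);
}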

But I still think that's not a good way to deal with such numbers. Converting them to a decimal type (https://crates.io/crates/bigdecimal) may be better.
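
For comparison, a minimal sketch of that lossless route using the bigdecimal crate linked above, parsing the numeric text directly instead of going through f64:

use std::str::FromStr;

use bigdecimal::BigDecimal;

fn main() {
    // Parsing the raw numeric text keeps every digit, unlike an f64 conversion.
    let value = BigDecimal::from_str("327267367867168100000").unwrap();
    println!("{}", value); // prints 327267367867168100000 exactly
}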
