Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug(source): datagen source generate data skew #19469

Open
Li0k opened this issue Nov 20, 2024 · 0 comments
Open

Bug(source): datagen source generate data skew #19469

Li0k opened this issue Nov 20, 2024 · 0 comments
Labels
help wanted Issues that need help from contributors type/bug Something isn't working
Milestone

Comments

@Li0k
Copy link
Contributor

Li0k commented Nov 20, 2024

Test case 1:

create table t_large_ckpt_with_pk (k bigint, v varchar, primary key(k)) with (
        connector = 'datagen',
        fields.v.length = 100000,
        datagen.rows.per.second='100000',
        datagen.split.num = '24',
        fields.k.kind = 'random',
        fields.k.min = 1,
        fields.k.max = 50000000,
    );
create materialized view mv_large_ckpt_with_pk as select * from t_large_ckpt_with_pk;

checkpoint_frequency = 60

image

The metrics show that the actual data written to the mv is less than 10% of the data written to the table, which means that there are a lot of duplicates in the generated pk(k), and the duplicates are cleaned up in the shared buffer compact phase.

Test case2 :

test by @tabVersion

dev=> create source  t_large_ckpt_with_pk (k bigint, v varchar) with (
        connector = 'datagen',
        fields.v.length = 100000,
        datagen.rows.per.second='100000',
        datagen.split.num = '24',
        fields.k.kind = 'random',
        fields.k.min = 1,
        fields.k.max = 50000000,
    );
CREATE_SOURCE
dev=> create materialized view mv as ( select k, count(*) from t_large_ckpt_with_pk group by k ) ;
CREATE_MATERIALIZED_VIEW
dev=> create table t as select * from mv ;
INSERT 0 3072
dev=>
dev=> create table t1 as select * from mv ;
INSERT 0 4184
select * from t1 order by count  desc limit 10 ;
    k    | count
---------+-------
 1626013 |    12
  153605 |    12
  957843 |    12
   15215 |    12
 1689808 |    12
 1885161 |    12
  623498 |    12
  592407 |    12
 1572091 |    12
  384375 |    12
(10 rows)
select * from t join t1 on t.k = t1.k order by t1.count desc limit 10 ;
    k    | count |    k    | count
---------+-------+---------+-------
 9199828 |    12 | 9199828 |    12
 2322857 |    12 | 2322857 |    12
 8216541 |    12 | 8216541 |    12
 1157410 |    12 | 1157410 |    12
 9581087 |    12 | 9581087 |    12
 9753606 |    12 | 9753606 |    12
 5211605 |    12 | 5211605 |    12
 3506222 |    12 | 3506222 |    12
 8834964 |    12 | 8834964 |    12
 3109476 |    12 | 3109476 |    12
(10 rows)
dev=> select sum(count) from t;
  sum
-------
 36864
(1 row)

dev=> select sum(count) from t1;
  sum
-------
 50002
(1 row)
@github-actions github-actions bot added this to the release-2.2 milestone Nov 20, 2024
@Li0k Li0k added type/bug Something isn't working and removed type/feature labels Nov 20, 2024
@Li0k Li0k added the help wanted Issues that need help from contributors label Dec 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues that need help from contributors type/bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant