bug: Redpanda source startup mode = latest doesn't work in batch query #12361

Closed
fuyufjh opened this issue Sep 17, 2023 · 2 comments

fuyufjh commented Sep 17, 2023

Describe the bug

I configured a data source connected to a Redpanda topic with the startup mode set to 'latest'. However, I ran into a problem when querying the data. Despite having three days of data in my topic, queries against the source consistently return data starting from the earliest records, not the latest ones. I'm puzzled about the purpose of specifying 'latest' as the startup mode in this scenario.

Error message/log

No response

To Reproduce

No response

Expected behavior

No response

How did you deploy RisingWave?

No response

The version of RisingWave

No response

Additional context

Feedback from users.

@fuyufjh fuyufjh added type/bug Something isn't working priority/high labels Sep 17, 2023
@github-actions github-actions bot added this to the release-1.3 milestone Sep 17, 2023

ZENOTME commented Sep 17, 2023

I don't think we support the semantics of latest when a source is queried in batch mode.

If my understanding is correct, in streaming, latest means that we only see data that arrives after the materialized view is created. E.g.
data: 1 | create source | data: 2 | create materialized view | data: 3
The materialized view can only see data 3, because we fetch the partition offset when we actually create the materialized view.

According to #6725, in batch, we fetch the partition offset every time a batch query comes in. We can't directly apply the streaming meaning of latest here, because every query would then start at the current end of the partition and return empty data. So to support latest, I think we first need to define its semantics for batch queries on a source.

E.g. latest in a batch source query could mean that we only see data that arrived after the source was created. In the example above, that means we see data 2 and data 3.
data: 1 | create source | data: 2 | create materialized view | data: 3
To support these semantics, we would probably need to store the partition offset when we create the source.
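
The three readings of latest discussed above can be compared with a toy model. This is only an illustrative sketch in Python, not RisingWave code: the `Partition` class, the offset numbering, and all variable names are assumptions made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Partition:
    """Toy stand-in for one topic partition (hypothetical, not RisingWave code)."""
    messages: list = field(default_factory=list)

    def append(self, value):
        self.messages.append(value)

    def end_offset(self):
        # Offset that the next appended message would receive.
        return len(self.messages)

    def read_from(self, offset):
        # Return every message at or after the given offset.
        return self.messages[offset:]


p = Partition()
p.append("data 1")
offset_at_create_source = p.end_offset()  # | create source |
p.append("data 2")
offset_at_create_mv = p.end_offset()      # | create materialized view |
p.append("data 3")

# Streaming latest: the offset is captured once, when the materialized view is created.
streaming_latest = p.read_from(offset_at_create_mv)

# Naive batch latest: the offset is fetched when the query arrives,
# so there is never anything after it to read.
naive_batch_latest = p.read_from(p.end_offset())

# Proposed batch latest: the offset is stored when the source is created.
proposed_batch_latest = p.read_from(offset_at_create_source)

print(streaming_latest)
print(naive_batch_latest)
print(proposed_batch_latest)
```

In this model the naive batch reading is always empty, which is why the proposal relies on an offset remembered from source creation rather than one fetched per query.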

cc @fuyufjh @tabVersion @liurenjie1024

@liurenjie1024
Contributor

It's by design. latest is meaningless in a batch query. Users are supposed to use _rw_kafka_timestamp to filter messages: https://docs.risingwave.com/docs/current/create-source-kafka/#query-kafka-timestamp
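
As a rough illustration of that suggestion, the helper below builds a batch query that restricts a source to recent messages through `_rw_kafka_timestamp`. The function name, the source name `my_source`, and the ten-minute window are placeholders invented for this sketch; only the `_rw_kafka_timestamp` column comes from the comment above, and the exact SQL should be checked against the linked documentation.

```python
def recent_messages_query(source_name: str, minutes: int) -> str:
    """Build a batch query that keeps only messages from the last `minutes` minutes.

    Hypothetical helper for illustration; it only composes SQL text and does not
    connect to RisingWave.
    """
    # Reject anything that is not a plain identifier, since the name is
    # interpolated directly into the SQL text.
    if not source_name.isidentifier():
        raise ValueError(f"unexpected source name: {source_name!r}")
    if minutes <= 0:
        raise ValueError("minutes must be positive")
    return (
        f"SELECT * FROM {source_name} "
        f"WHERE _rw_kafka_timestamp > now() - interval '{minutes} minute';"
    )


# `my_source` is a placeholder source name.
query = recent_messages_query("my_source", 10)
print(query)
```

The resulting text can be run from any PostgreSQL-compatible client connected to RisingWave.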
