Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve: stream key derive: the hidden join key can be replaced by its vnode #12845

Open
fuyufjh opened this issue Oct 14, 2023 · 5 comments
Open
Assignees

Comments

@fuyufjh
Copy link
Member

fuyufjh commented Oct 14, 2023

As we know, the join key must be included in the downstream stream key. The reason is:

The join key must be included the stream key under any circumstance.

Taking fact table-dimension table join as an example

select * from orders left join customers on orders.customer_id = customers.id

If orders.customer_id is updated, it would generate a pair of U- and U+, which then becomes - and + and sent to 2 actors of the next fragment, saying, a Materialize.

Only by including the customer_id in the stream key can it prevent the + from arriving at Materialize before -, causing a sanity check panic.

However, actually we can use vnode(join_key) as a substitute, which is much shorter (only 16-bit) than the join_key itself. This is because that the disordering only happen between different vnodes (or more accurately - different actors, but actor may be scaled up/down from time to time)

@github-actions github-actions bot added this to the release-1.4 milestone Oct 14, 2023
@BugenZhao
Copy link
Member

This is really a good catch. We may also apply this to over-window and group top-n.

@fuyufjh
Copy link
Member Author

fuyufjh commented Oct 20, 2023

This is really a good catch. We may also apply this to over-window and group top-n.

Do they have this problem? I think GroupTopN's dist key must be explicit i.e. the Group key

@BugenZhao
Copy link
Member

This is really a good catch. We may also apply this to over-window and group top-n.

Do they have this problem? I think GroupTopN's dist key must be explicit i.e. the Group key

Yeah. Ideally, these operators may keep the primary key unchanged. We have to also include the group key / partition key to prevent disordering. BTW, what do you mean by "explicit"? 👀

@chenzl25
Copy link
Contributor

chenzl25 commented Nov 6, 2023

When I was trying to implement this feature, I found that our current representation of stream_key fn stream_key(&self) -> Option<Vec<usize>> is hard to express the stream_key we want to construct for joins.

Join schema (for inner join type) is simply a concatenation of LHS and RHS and there is no place for the vnode column, so we can't express stream_key with the vnode column.

Another way to achieve this goal is by using a project and overwriting its stream key instead of deriving, however, the project could get merged by optimizer rules and we would easily lose this overwriting. It is too invasive to implement.

I haven't got a good idea to implement this feature elegantly. Any better ideas?

@st1page st1page self-assigned this Nov 10, 2023
@st1page st1page modified the milestones: release-1.5, release-1.6 Dec 5, 2023
@st1page st1page modified the milestones: release-1.6, release-1.7 Jan 9, 2024
@chenzl25 chenzl25 removed this from the release-1.7 milestone Mar 6, 2024
Copy link
Contributor

This issue has been open for 60 days with no activity. Could you please update the status? Feel free to continue discussion or close as not planned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants