Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: Backward Compatibility of Stream Plan #43

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
39 changes: 39 additions & 0 deletions rfcs/0043-backward-compatibility-of-stream-plan.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
---
feature: Backward Compatibility of Stream Plan
authors:
- "st1page"
start_date: "2023/1/31"
---

# Backward Compatibility of Stream Plan

## Summary

- Distinguish the nightly and stable for SQL features and stream plan node protobuf.
- Use a Copy-on-Write style on changing the stable stream plan node protobuf.

## Motivation

In https://github.com/risingwavelabs/rfcs/issues/41, we discussed the backward compatibility. The protobuf structure of stream plan nodes is a special part.
- The plan node's structure is usually modified more frequently than other protobuf structures, such as the catalog, especially when new SQL features are being developed. The changes to the plan node are not only adding optional fields (which can be solved by protobuf), but also its meaning and behaviors of the operator. For example, the state table information of streamAgg underwent breaking changes in 0.1.13 and in 0.1.16, the source executor is no longer responsible for generating row_id. The format of sort and overAgg has not been confirmed so far.
- In other databases, the plan node is just used as a communication protocol between the frontend and compute node, so the compute node can only support the latest version of the plan node format and reject all requests with an unknown plan node. But our stream plan must be persistent in the meta store, meaning that a compute node must be compatible with all versions of the old plan's protobuf format.

In conclusion, we need to find a way to balance rapid development and backward compatibility, especially for the stream plan node.

## Design

### Nightly and Stable SQL Features
Distinguish the nightly and stable features when releasing versions. RW will not ensure compatibility for streaming jobs with the nightly features in subsequent releases. For example, if the "emit on close" feature is released as a nightly feature in release v0.1.17 and a user creates an MV with that feature on a v0.1.17 cluster, RW in v0.1.18 and subsequent versions cannot ensure that the existing streaming job will run successfully. Users can drop the MVs with the nightly feature before upgrading the cluster. For those who really want to upgrade the nightly feature, we can write helper scripts. The stable features will be tested with new released compute nodes on old version streaming plans. Also, with the list of stable features, we can more easily test backward compatibility.

### Nightly and Stable Stream Plan Node
How to determine if a SQL Feature is stable? Developers should comment the compatibility annotation on the protobuf struct of each stream plan node (similar to annotations in Java). The annotation contains: "nightly v0.1.14", "stable v0.1.15", "deprecated v0.1.16". A plan node will initially have a nightly annotation. When the developer ensures that the plan node structure is stable, a stable annotation should be added to the protobuf struct. When the developer ensures that the frontend will not generate the plan node, a deprecated annotation should be added to the protobuf struct. A SQL feature is considered stable if all stream plan nodes generated by any version of the optimizer are stable.

To be discussed: What is the proper format of these comments in the proto files and how to check that all plan nodes have one in CI checks?

### Copy-on-Write Style Changes on Stable Plan Node Protobuf
Copy link
Member

@xxchan xxchan Feb 2, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So basically you mean immutable and versioned protobuf messages? 🤔

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW we can assume all fields to be required afterwards?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think so. And I think our current stream execution expect that in fact(so many unwrap in from_proto).

How to maintain compatibility of the plan node's protobuf? If a developer wants to make changes to a stable plan node, he should add a new plan node protobuf definition. For example, if they want to add a new field in `StreamHashAgg`, they must define a new protobuf struct `StreamHashAggV2` and add the field to it. Note that there can be multiple versions of protobuf, but they can share the same implementation.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the only change allowed is to add new fields, we can just add a version field in the protobuf message and add new fields to the same message defintion instead of creating multiple protobuf definitions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It’s discussed here #43 (comment)

adding field is just one of the cases🤔 Any change is allowed


Why is it so complicated and why not just rely on protobuf's compatibility? To achieve compatibility, protobuf actually makes the struct such that all fields are optional. When a protobuf struct is used as a RPC interface, the caller provides a combination of optional fields, and the callee should try their best to handle all meaningful combinations or return an error. Based on the following facts, I believe the Copy-on-Write Style Changes is better.
- Changes to stable plan nodes are limited. We can make breaking changes within the same release, and a stable plan node will not be modified too often. As a result, the number of duplicated plan node definitions will not be excessive.
- Here, "returning an error" is unacceptable because if we cannot resolve the stored streaming plan, the cluster cannot boot up. We must ensure that the compute node can accept any combination of fields from historical versions. Storing all these combinations in different version's plan node definitions helps maintain compatibility, or it will simply exist in the compute node's code and be easily forgotten.