diff --git a/rfcs/0043-backward-compatibility-of-stream-plan.md b/rfcs/0043-backward-compatibility-of-stream-plan.md new file mode 100644 index 00000000..9c44e2e5 --- /dev/null +++ b/rfcs/0043-backward-compatibility-of-stream-plan.md @@ -0,0 +1,39 @@ +--- +feature: Backward Compatibility of Stream Plan +authors: + - "st1page" +start_date: "2023/1/31" +--- + +# Backward Compatibility of Stream Plan + +## Summary + +- Distinguish the nightly and stable for SQL features and stream plan node protobuf. +- Use a Copy-on-Write style on changing the stable stream plan node protobuf. + +## Motivation + +In https://github.com/risingwavelabs/rfcs/issues/41, we discussed the backward compatibility. The protobuf structure of stream plan nodes is a special part. +- The plan node's structure is usually modified more frequently than other protobuf structures, such as the catalog, especially when new SQL features are being developed. The changes to the plan node are not only adding optional fields (which can be solved by protobuf), but also its meaning and behaviors of the operator. For example, the state table information of streamAgg underwent breaking changes in 0.1.13 and in 0.1.16, the source executor is no longer responsible for generating row_id. The format of sort and overAgg has not been confirmed so far. +- In other databases, the plan node is just used as a communication protocol between the frontend and compute node, so the compute node can only support the latest version of the plan node format and reject all requests with an unknown plan node. But our stream plan must be persistent in the meta store, meaning that a compute node must be compatible with all versions of the old plan's protobuf format. + +In conclusion, we need to find a way to balance rapid development and backward compatibility, especially for the stream plan node. + +## Design + +### Nightly and Stable SQL Features +Distinguish the nightly and stable features when releasing versions. RW will not ensure compatibility for streaming jobs with the nightly features in subsequent releases. For example, if the "emit on close" feature is released as a nightly feature in release v0.1.17 and a user creates an MV with that feature on a v0.1.17 cluster, RW in v0.1.18 and subsequent versions cannot ensure that the existing streaming job will run successfully. Users can drop the MVs with the nightly feature before upgrading the cluster. For those who really want to upgrade the nightly feature, we can write helper scripts. The stable features will be tested with new released compute nodes on old version streaming plans. Also, with the list of stable features, we can more easily test backward compatibility. + +### Nightly and Stable Stream Plan Node +How to determine if a SQL Feature is stable? Developers should comment the compatibility annotation on the protobuf struct of each stream plan node (similar to annotations in Java). The annotation contains: "nightly v0.1.14", "stable v0.1.15", "deprecated v0.1.16". A plan node will initially have a nightly annotation. When the developer ensures that the plan node structure is stable, a stable annotation should be added to the protobuf struct. When the developer ensures that the frontend will not generate the plan node, a deprecated annotation should be added to the protobuf struct. A SQL feature is considered stable if all stream plan nodes generated by any version of the optimizer are stable. + +To be discussed: What is the proper format of these comments in the proto files and how to check that all plan nodes have one in CI checks? + +### Copy-on-Write Style Changes on Stable Plan Node Protobuf +How to maintain compatibility of the plan node's protobuf? If a developer wants to make changes to a stable plan node, he should add a new plan node protobuf definition. For example, if they want to add a new field in `StreamHashAgg`, they must define a new protobuf struct `StreamHashAggV2` and add the field to it. Note that there can be multiple versions of protobuf, but they can share the same implementation. + +Why is it so complicated and why not just rely on protobuf's compatibility? To achieve compatibility, protobuf actually makes the struct such that all fields are optional. When a protobuf struct is used as a RPC interface, the caller provides a combination of optional fields, and the callee should try their best to handle all meaningful combinations or return an error. Based on the following facts, I believe the Copy-on-Write Style Changes is better. +- Changes to stable plan nodes are limited. We can make breaking changes within the same release, and a stable plan node will not be modified too often. As a result, the number of duplicated plan node definitions will not be excessive. +- Here, "returning an error" is unacceptable because if we cannot resolve the stored streaming plan, the cluster cannot boot up. We must ensure that the compute node can accept any combination of fields from historical versions. Storing all these combinations in different version's plan node definitions helps maintain compatibility, or it will simply exist in the compute node's code and be easily forgotten. +