[Doc] V3.4 Doc - Merge Commit #54279

Open · wants to merge 1 commit into `main`
`docs/en/loading/StreamLoad.md` (68 additions, 0 deletions)

@@ -239,6 +239,74 @@ SELECT * FROM table2;
4 rows in set (0.01 sec)
```

#### Merge Stream Load requests

From v3.4.0, the system supports merging multiple Stream Load requests.

Merge Commit is an optimization for Stream Load, designed for high-concurrency, small-batch (from KB to tens of MB) real-time loading scenarios. In earlier versions, each Stream Load request would generate a transaction and a data version, which led to the following issues in high-concurrency loading scenarios:

- Excessive data versions impact query performance, and limiting the number of versions may cause `too many versions` errors.
- Data version merging through Compaction increases resource consumption.
- High-concurrency loading generates many small files, increasing IOPS and I/O latency. In shared-data clusters, it also raises cloud object storage costs.
- The Leader FE node, acting as the transaction manager, may become a single-point bottleneck.

Merge Commit mitigates these issues by merging multiple concurrent Stream Load requests within a time window into a single transaction. This reduces the number of transactions and versions generated by high-concurrency requests, thereby improving loading performance.
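
For illustration (the figures below are hypothetical, not from the original document): at 200 small requests per second with a `5000`-millisecond merge window, Merge Commit commits on the order of one transaction per window instead of roughly 1,000, cutting the number of transactions and data versions by about three orders of magnitude.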

Merge Commit supports both synchronous and asynchronous modes. Each mode has its own trade-offs; choose based on your use case.

- **Synchronous mode**

The server returns only after the merged transaction is committed, ensuring the loading is successful and visible.

- **Asynchronous mode**

The server returns immediately after receiving the data. This mode does not ensure the loading is successful.

| **Mode** | **Advantages** | **Disadvantages** |
| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Synchronous | <ul><li>Ensures data persistence and visibility upon request return.</li><li>Guarantees that multiple sequential loading requests from the same client are executed in order.</li></ul> | Each loading request from the client is blocked until the server closes the merge window. It may reduce the data processing capability of a single client if the window is excessively large. |
| Asynchronous | Allows a single client to send subsequent loading requests without waiting for the server to close the merge window, improving loading throughput. | <ul><li>Does not guarantee data persistence or visibility upon return. The client must later verify the transaction status.</li><li>Does not guarantee that multiple sequential loading requests from the same client are executed in order.</li></ul> |
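
The following is a minimal sketch (not part of the original document) of the asynchronous pattern: a client fires several small batches in parallel without waiting for the merge window to close, so requests received within the same window can be merged into one transaction. The batch file names and the request count are illustrative; the Merge Commit headers are introduced in the next subsection.

```Bash
# Hypothetical example (batch file names and count are illustrative): send 10
# small CSV batches concurrently in asynchronous mode. All requests use
# identical parameters ("homogeneous"), so those received within the same
# 5000 ms window can be merged into a single transaction.
for i in $(seq 1 10); do
  curl --location-trusted -u <username>:<password> \
    -H "Expect:100-continue" \
    -H "column_separator:," \
    -H "columns: id, name, score" \
    -H "enable_merge_commit:true" \
    -H "merge_commit_async:true" \
    -H "merge_commit_interval_ms:5000" \
    -H "merge_commit_parallel:2" \
    -T batch_${i}.csv -XPUT \
    http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load &
done
wait  # wait for all background curl processes to finish sending
```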

##### Start a Stream Load

- Run the following command to start a Stream Load job with Merge Commit enabled in synchronous mode, setting the merge window to `5000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_interval_ms:5000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```

- Run the following command to start a Stream Load job with Merge Commit enabled in asynchronous mode, setting the merge window to `60000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_async:true" \
-H "merge_commit_interval_ms:60000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```
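
In asynchronous mode, a successful response does not mean the data is committed; the client must verify the transaction status afterwards. Below is a hedged sketch of that check. It assumes the response carries the server-generated `Label` field and that the FE exposes the `get_load_state` API; the `jq` dependency and the exact response wording are assumptions, not part of the original document.

```Bash
# Hypothetical verification loop: capture the server-generated label from the
# async response, then poll the FE until the transaction becomes visible.
RESPONSE=$(curl -s --location-trusted -u <username>:<password> \
  -H "Expect:100-continue" \
  -H "enable_merge_commit:true" \
  -H "merge_commit_async:true" \
  -H "merge_commit_interval_ms:60000" \
  -H "merge_commit_parallel:2" \
  -T example1.csv -XPUT \
  http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load)

LABEL=$(echo "$RESPONSE" | jq -r '.Label')  # label is generated by the server

# Poll the load state; VISIBLE means the transaction is committed and
# queryable. Production code should also break out on failure states
# (for example, CANCELLED) instead of looping forever.
until curl -s --location-trusted -u <username>:<password> \
      "http://<fe_host>:<fe_http_port>/api/mydatabase/get_load_state?label=${LABEL}" \
      | grep -q 'VISIBLE'; do
  sleep 1
done
```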

:::note

- Merge Commit only supports merging **homogeneous** loading requests into a single database and table. "Homogeneous" means that the Stream Load parameters are identical, including common parameters, JSON format parameters, CSV format parameters, `opt_properties`, and Merge Commit parameters.
- When you load CSV-formatted data, you must ensure that each row ends with a line separator. `skip_header` is not supported.
- The server automatically generates transaction labels; any label specified in the request is ignored.
- Merge Commit merges multiple loading requests into a single transaction. If one request contains data quality issues, all requests in that transaction fail.

:::

#### Check Stream Load progress

After a load job is complete, StarRocks returns the result of the job in JSON format. For more information, see the "Return value" section in [STREAM LOAD](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md).
`docs/en/sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md`

@@ -14,6 +14,8 @@ StarRocks provides the loading method HTTP-based Stream Load to help you load data

Since v3.2.7, Stream Load supports compressing JSON data during transmission, reducing network bandwidth overhead. Users can specify different compression algorithms using the parameters `compression` and `Content-Encoding`. Supported compression algorithms include GZIP, BZIP2, LZ4_FRAME, and ZSTD. For more information, see [data_desc](#data_desc).

From v3.4.0, the system supports merging multiple Stream Load requests. For more information, see [Merge Commit parameters](#merge-commit-parameters).

> **NOTICE**
>
> - After you load data into a StarRocks table by using Stream Load, the data of the materialized views that are created on that table is also updated.
@@ -132,6 +134,26 @@ The parameters in the `data_desc` descriptor can be divided into three types

When you load JSON data, also note that the size per JSON object cannot exceed 4 GB. If an individual JSON object in the JSON data file exceeds 4 GB in size, an error "This parser can't support a document that big." is reported.

### Merge Commit parameters

These parameters enable Merge Commit, which merges multiple concurrent Stream Load requests received within a specified time window into a single transaction.

| **Parameter**            | **Required** | **Description**                                              |
| ------------------------ | ------------ | ------------------------------------------------------------ |
| enable_merge_commit      | No           | Whether to enable Merge Commit for the loading request. Valid values: `true` and `false` (default). |
| merge_commit_async       | No           | The server's return mode. Valid values:<ul><li>`true`: Enables asynchronous mode, where the server returns immediately after receiving the data. This mode does not ensure that the loading succeeds.</li><li>`false` (default): Enables synchronous mode, where the server returns only after the merged transaction is committed, ensuring that the loading is successful and visible.</li></ul> |
| merge_commit_interval_ms | Yes          | The size of the merge time window. Unit: milliseconds. Merge Commit attempts to merge loading requests received within this window into a single transaction. A larger window improves merging efficiency but increases latency. |
| merge_commit_parallel    | Yes          | The degree of parallelism of the loading plan created for each merge window. Adjust this value based on the ingestion load: increase it if there are many requests and/or a large amount of data to load. The effective parallelism is capped by the number of BE nodes, that is, `min(merge_commit_parallel, number of BE nodes)`; for example, with `merge_commit_parallel` set to `5` on a cluster of 3 BE nodes, the effective parallelism is `3`. |

:::note

- Merge Commit only supports merging **homogeneous** loading requests into a single database and table. "Homogeneous" means that the Stream Load parameters are identical, including common parameters, JSON format parameters, CSV format parameters, `opt_properties`, and Merge Commit parameters.
- When you load CSV-formatted data, you must ensure that each row ends with a line separator. `skip_header` is not supported.
- The server automatically generates transaction labels; any label specified in the request is ignored.
- Merge Commit merges multiple loading requests into a single transaction. If one request contains data quality issues, all requests in that transaction fail.

:::

### opt_properties

Specifies some optional parameters, which are applied to the entire load job. Syntax:
@@ -552,3 +574,34 @@ curl --location-trusted -u <username>:<password> \
> **NOTE**
>
> In the preceding example, the outermost layer of the JSON data is an array structure as indicated by a pair of square brackets `[]`. The array structure consists of multiple JSON objects that each represent a data record. Therefore, you need to set `strip_outer_array` to `true` to strip the outermost array structure. The keys `title` and `timestamp` that you do not want to load are ignored during loading. Additionally, the `json_root` parameter is used to specify the root element, which is an array, of the JSON data.

### Merge Stream Load requests

- Run the following command to start a Stream Load job with Merge Commit enabled in synchronous mode, setting the merge window to `5000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_interval_ms:5000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```

- Run the following command to start a Stream Load job with Merge Commit enabled in asynchronous mode, setting the merge window to `60000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_async:true" \
-H "merge_commit_interval_ms:60000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```
`docs/zh/loading/StreamLoad.md` (68 additions, 0 deletions)

@@ -239,6 +239,74 @@ SELECT * FROM table2;
4 rows in set (0.01 sec)
```

#### Merge Stream Load requests

From v3.4.0, the system supports merging multiple Stream Load requests.

Merge Commit is an optimization for Stream Load, designed for high-concurrency, small-batch (KB to tens of MB) real-time loading scenarios. In earlier versions, each Stream Load request generated a transaction and a data version, which led to the following issues in high-concurrency loading scenarios:

- Excessive data versions impact query performance, and limiting the number of versions may cause `too many versions` errors.
- Merging data versions through Compaction increases resource consumption.
- High-concurrency loading generates many small files, increasing IOPS and I/O latency. In shared-data clusters, it also raises cloud object storage costs.
- The Leader FE node, acting as the transaction manager, may become a single-point bottleneck.

Merge Commit mitigates these issues by merging multiple concurrent Stream Load requests within a time window into a single transaction. This reduces the number of transactions and versions generated by high-concurrency requests, thereby improving loading performance.

Merge Commit supports both synchronous and asynchronous modes. Each mode has its own trade-offs; choose based on your use case.

- **Synchronous mode**

  The server returns only after the merged transaction is committed, ensuring that the loading is successful and the data is visible.

- **Asynchronous mode**

  The server returns immediately after receiving the data, without guaranteeing that the loading succeeds.

| **Mode**     | **Advantages**                                               | **Disadvantages**                                            |
| ------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |
| Synchronous  | <ul><li>Ensures that data is persistent and visible when the request returns.</li><li>Guarantees that multiple sequential loading requests from the same client are executed in order.</li></ul> | Each request from a client is blocked until the server closes the merge window. An excessively large window may reduce the data processing capability of a single client. |
| Asynchronous | A client can send subsequent loading requests without waiting for the server to close the merge window, improving loading throughput. | <ul><li>Does not guarantee data persistence or visibility upon return. The client must verify the transaction status later.</li><li>Does not guarantee that multiple sequential loading requests from the same client are executed in order.</li></ul> |

##### Submit a load job

- Run the following command to start a Stream Load job with Merge Commit enabled in synchronous mode, setting the merge window to `5000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_interval_ms:5000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```

- Run the following command to start a Stream Load job with Merge Commit enabled in asynchronous mode, setting the merge window to `60000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_async:true" \
-H "merge_commit_interval_ms:60000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```

:::note

- Merge Commit only supports merging **homogeneous** loading requests into a single database and table. "Homogeneous" means that the Stream Load parameters are identical, including common parameters, JSON format parameters, CSV format parameters, `opt_properties`, and Merge Commit parameters.
- When you load CSV-formatted data, you must ensure that each row ends with a line separator. `skip_header` is not supported.
- The server automatically generates transaction labels; any label specified in the request is ignored.
- Merge Commit merges multiple loading requests into a single transaction. If one request contains data quality issues, all requests in that transaction fail.

:::

#### Check Stream Load progress

After a load job is complete, StarRocks returns the result of the job in JSON format. For more information, see the "[Return value](../sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md#返回值)" section in the STREAM LOAD documentation.
`docs/zh/sql-reference/sql-statements/loading_unloading/STREAM_LOAD.md`

@@ -11,6 +11,8 @@ Stream Load is a synchronous loading method based on the HTTP protocol

Since v3.2.7, STREAM LOAD supports compressing JSON data during transmission, reducing network bandwidth overhead. Users can specify different compression algorithms using the `compression` or `Content-Encoding` parameter. Supported compression algorithms include GZIP, BZIP2, LZ4_FRAME, and ZSTD. For more information, see the [related syntax](#data_desc).

From v3.4.0, the system supports merging multiple Stream Load requests. For more information, see [Merge Commit parameters](#merge-commit-parameters).

> **NOTICE**
>
> - A Stream Load operation also updates the data of the materialized views that are created on the target StarRocks table.
@@ -131,6 +133,26 @@ http://<fe_host>:<fe_http_port>/api/<database_name>/<table_name>/_stream_load

In addition, when you load JSON-formatted data, note that the size of a single JSON object cannot exceed 4 GB. If a single JSON object in the JSON file exceeds 4 GB, the error "This parser can't support a document that big." is reported.

### Merge Commit parameters

These parameters enable Merge Commit, which merges multiple concurrent Stream Load requests received within a specified time window into a single transaction.

| **Parameter**            | **Required** | **Description**                                              |
| ------------------------ | ------------ | ------------------------------------------------------------ |
| enable_merge_commit      | No           | Whether to enable Merge Commit for the loading request. Valid values: `true` and `false` (default). |
| merge_commit_async       | No           | The server's return mode. Valid values:<ul><li>`true`: Enables asynchronous mode, where the server returns immediately after receiving the data, without guaranteeing that the loading succeeds.</li><li>`false` (default): Enables synchronous mode, where the server returns only after the merged transaction is committed, ensuring that the loading is successful and visible.</li></ul> |
| merge_commit_interval_ms | Yes          | The size of the merge time window. Unit: milliseconds. Merge Commit attempts to merge loading requests received within this window into a single transaction. A larger window improves merging efficiency but increases latency. |
| merge_commit_parallel    | Yes          | The degree of parallelism of the loading plan created for each merge window. Adjust this value based on the ingestion load: increase it if there are many requests and/or a large amount of data to load. The effective parallelism is capped by the number of BE nodes, that is, `min(merge_commit_parallel, number of BE nodes)`. |

:::note

- Merge Commit only supports merging **homogeneous** loading requests into a single database and table. "Homogeneous" means that the Stream Load parameters are identical, including common parameters, JSON format parameters, CSV format parameters, `opt_properties`, and Merge Commit parameters.
- When you load CSV-formatted data, you must ensure that each row ends with a line separator. `skip_header` is not supported.
- The server automatically generates transaction labels; any label specified in the request is ignored.
- Merge Commit merges multiple loading requests into a single transaction. If one request contains data quality issues, all requests in that transaction fail.

:::

### opt_properties

Specifies optional parameters that apply to the entire load job. Syntax:
@@ -568,3 +590,34 @@ curl --location-trusted -u <username>:<password> \
> **NOTE**
>
> In the preceding example, the outermost layer of the JSON data is an array structure indicated by a pair of square brackets `[]`, and each JSON object in the array represents a data record. Therefore, you need to set `strip_outer_array` to `true` to strip the outermost array structure. The unspecified fields `title` and `timestamp` are ignored during loading. In addition, the `json_root` parameter specifies that the data to be loaded is the value of the `RECORDS` field, which is a JSON array.

### Merge Stream Load requests

- Run the following command to start a Stream Load job with Merge Commit enabled in synchronous mode, setting the merge window to `5000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_interval_ms:5000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```

- Run the following command to start a Stream Load job with Merge Commit enabled in asynchronous mode, setting the merge window to `60000` milliseconds and the degree of parallelism to `2`:

```Bash
curl --location-trusted -u <username>:<password> \
-H "Expect:100-continue" \
-H "column_separator:," \
-H "columns: id, name, score" \
-H "enable_merge_commit:true" \
-H "merge_commit_async:true" \
-H "merge_commit_interval_ms:60000" \
-H "merge_commit_parallel:2" \
-T example1.csv -XPUT \
http://<fe_host>:<fe_http_port>/api/mydatabase/table1/_stream_load
```