
Update data ingestion overview about batch #55

Merged
merged 16 commits into from
Nov 25, 2024

Conversation

WanYixian
Collaborator

@WanYixian WanYixian commented Nov 15, 2024

Description

Update data ingestion overview

Related Doc issue

Resolve #53


## Batching strategy for file sink

The batching strategy ensures that the contents within a chunk are not further split, which enables the decoupling of the file sink. Currently, the batching strategy is available for Parquet encode.
Collaborator Author

Hi @wcy-fdu, want to double-check: is the batching strategy available only for Parquet, or for both Parquet and JSON? Thanks!

Contributor

For all supported encode types: JSON, CSV, and Parquet.

@WanYixian WanYixian marked this pull request as ready for review November 18, 2024 08:05
@WanYixian WanYixian requested a review from wcy-fdu November 18, 2024 08:06
Contributor

@wcy-fdu wcy-fdu left a comment


Rest LGTM.

We can highlight two things in the document:

  1. The purpose of batching is to prevent the file sink from generating many small files.
  2. The batching conditions are not guaranteed to be exact, because the real goal is point 1.


## Batching strategy for file sink

The batching strategy ensures that the contents within a chunk are not further split, which enables the decoupling of the file sink. Currently, the batching strategy is available for Parquet encode.
Contributor

For all supported encode types: JSON, CSV, and Parquet.

For batching based on row count, RisingWave checks whether the maximum row count threshold has been reached after each chunk is written (`sink_writer.write_batch()`). If the threshold is met, the writing of the file is completed.

- **Batching based on rollover interval**:
For batching based on the time interval, RisingWave checks the threshold each time a chunk is about to be written (`sink_writer.write_batch()`) and when a barrier is encountered (`sink_writer.barrier()`). Note that if a barrier gets stuck, batching may not strictly adhere to the preset rollover interval, leading to possible delays in writing. Future implementations will optimize this process by monitoring the writer itself, rather than relying solely on barriers or chunks.
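The two checks described above can be sketched as follows. This is a minimal Python sketch with invented names (`BatchingFileWriter` and its fields are illustrations, not RisingWave's actual Rust implementation), showing a row-count check after each written chunk and a rollover-interval check on both chunks and barriers:

```python
import time

class BatchingFileWriter:
    """Illustrative sketch of the two batching checks described above.
    Class and method names are invented; the real sink writer is Rust
    (`sink_writer.write_batch()` / `sink_writer.barrier()`)."""

    def __init__(self, max_row_count, rollover_interval_s):
        self.max_row_count = max_row_count
        self.rollover_interval_s = rollover_interval_s
        self.rows_in_file = 0
        self.file_opened_at = time.monotonic()
        self.finished_files = 0

    def _rollover_due(self):
        return time.monotonic() - self.file_opened_at >= self.rollover_interval_s

    def _finish_file(self):
        # Complete the current file and start a new one.
        self.finished_files += 1
        self.rows_in_file = 0
        self.file_opened_at = time.monotonic()

    def write_batch(self, chunk_rows):
        # A chunk is written whole; it is never split across files.
        self.rows_in_file += chunk_rows
        # Thresholds are checked only after the chunk is written, so a
        # file can overshoot max_row_count by up to chunk_size - 1 rows.
        if self.rows_in_file >= self.max_row_count or self._rollover_due():
            self._finish_file()

    def barrier(self):
        # On a barrier, only the rollover interval is checked; if barriers
        # stall, the interval can be overshot.
        if self.rows_in_file > 0 and self._rollover_due():
            self._finish_file()
```

Note how the sketch also captures the caveat above: because the interval is only rechecked on the next chunk or barrier, a stalled barrier delays the rollover.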
Contributor

Note that if a barrier gets stuck, batching may not strictly adhere to the preset rollover interval, leading to possible delays in writing. Future implementations will optimize this process by monitoring the writer itself, rather than relying solely on barriers or chunks.

This is an implementation detail; we don't need to expose it to users.

- **Batching based on rollover interval**:
For batching based on the time interval, RisingWave checks the threshold each time a chunk is about to be written (`sink_writer.write_batch()`) and when a barrier is encountered (`sink_writer.barrier()`). Note that if a barrier gets stuck, batching may not strictly adhere to the preset rollover interval, leading to possible delays in writing. Future implementations will optimize this process by monitoring the writer itself, rather than relying solely on barriers or chunks.

The actual number of rows in a file may slightly exceed the set maximum row count. In extreme cases, it may exceed the threshold by up to `chunk_size - 1` rows (currently 255). However, this slight excess is generally acceptable given the typically large row counts in Parquet files.
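The worst-case overshoot follows directly from the arithmetic above: the open file can already hold one row fewer than the maximum without triggering completion, and the next chunk is then written whole. A small illustrative calculation (the thread's `chunk_size - 1 = 255` implies a chunk size of 256; the row limit here is a made-up example):

```python
def worst_case_rows(max_row_count: int, chunk_size: int) -> int:
    # The threshold is checked only after a whole chunk is written:
    # the file may already hold max_row_count - 1 rows, and the next
    # full chunk of chunk_size rows still goes into the same file.
    return (max_row_count - 1) + chunk_size

# With a hypothetical limit of 10,000 rows and a chunk size of 256,
# the overshoot is at most chunk_size - 1 = 255 rows.
print(worst_case_rows(10_000, 256) - 10_000)  # → 255
```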
Contributor

I am wondering whether it is necessary to expose the concept of a chunk to users. I suggest keeping it vague and just saying that the batching condition is relatively coarse-grained: setting a certain row count for batching does not mean the output file will have exactly that many rows.

Comment on lines 146 to 148
With the batching strategy for file sink, file writing is no longer dependent on the arrival of barriers. `BatchingLogSinkOf` determines when to truncate the log store. Once a file is written, it will be truncated. However, if batching occurs across barriers and no writing has occurred by the time a barrier arrives, the barrier will not trigger truncation.

If no batching strategy is defined, the previous logic will still apply, meaning that a file will be forcefully written upon the arrival of a checkpoint barrier.
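The truncation rule quoted above can be sketched like this. Names are invented for illustration (the real `BatchingLogSinkOf` is Rust and tracks epochs rather than plain integer offsets):

```python
class BatchingLogSink:
    """Sketch of the truncation rule: the log store is truncated up to
    the offset of the last fully written file, never up to a barrier
    that arrives mid-batch."""

    def __init__(self):
        self.truncated_up_to = 0  # log offset already truncated
        self.consumed_up_to = 0   # log offset buffered into the open file

    def write_chunk(self, chunk_end_offset, file_completed):
        self.consumed_up_to = chunk_end_offset
        if file_completed:
            # Once a file is written, the log can be truncated up to it.
            self.truncated_up_to = self.consumed_up_to

    def barrier(self, barrier_offset):
        # If batching spans this barrier (no file completed yet), the
        # barrier does not trigger truncation; truncated_up_to stays put.
        return self.truncated_up_to
```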
Contributor

Implementation detail, remove it.

We just need to tell the user that if no batching conditions are set, a default batching strategy will be applied.

Contributor

@wcy-fdu wcy-fdu left a comment


risingwavelabs/risingwave#18472 also introduces `partition_by`, which is used to partition files by time. Please help add a description of this parameter (perhaps in the file naming part): if it is set, there will be subdirectories in the file path.

delivery/overview.mdx (outdated review thread, resolved)
delivery/overview.mdx (outdated review thread, resolved)

<Note>The condition for batching is relatively coarse-grained. The actual number of rows or exact timing of file completion may vary from the specified thresholds, as this function is intentionally flexible to prioritize efficient file management.</Note>

If no conditions for batch collection are set, RisingWave will apply a default batching strategy to ensure proper file writing and data consistency.
Contributor

Better to indicate what the default policy is (IIRC 10s; please check the original PR).

Contributor

@wcy-fdu wcy-fdu left a comment


Thanks for the efforts!

@WanYixian WanYixian merged commit 4471321 into main Nov 25, 2024
2 checks passed
@WanYixian WanYixian deleted the wyx/resolve_51 branch November 25, 2024 05:11

Successfully merging this pull request may close these issues.

Document: feat(sink): introduce batching strategy for file sink