Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add shared source #69

Merged
merged 5 commits into from
Nov 22, 2024
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions changelog/product-lifecycle.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -22,6 +22,7 @@ Below is a list of all features in the public preview phase:

| Feature name | Start version |
| :-- | :-- |
| [Shared source](/sql/commands/sql-create-source/#shared-source) | 2.1 |
| [ASOF join](/docs/current/query-syntax-join-clause/#asof-joins) | 2.1 |
| [Partitioned Postgres CDC table](/docs/current/ingest-from-postgres-cdc/) | 2.1 |
| [Map type](/docs/current/data-type-map/) | 2.0 |
Expand Down
Binary file added images/non-shared-source.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/shared-source.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added images/table-with-connectors.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
80 changes: 80 additions & 0 deletions sql/commands/sql-create-source.mdx
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,86 @@ If Kafka is part of your technical stack, you can also use the Kafka connector i

For complete step-to-step guides about ingesting MySQL and PostgreSQL data using both approaches, see [Ingest data from MySQL](/docs/current/ingest-from-mysql-cdc/) and [Ingest data from PostgreSQL](/docs/current/ingest-from-postgres-cdc/).

## Shared source

Shared source improves resource utilization and data consistency when working with Kafka sources in RisingWave. This will only affect Kafka sources created after the version updated and will not affect any existing Kafka sources.

<Note>
**PUBLIC PREVIEW**

This feature is in the public preview stage, meaning it's nearing the final product but is not yet fully stable. If you encounter any issues or have feedback, please contact us through our [Slack channel](https://www.risingwave.com/slack). Your input is valuable in helping us improve the feature. For more information, see our [Public preview feature list](/product-lifecycle/#features-in-the-public-preview-stage).
</Note>

### Configure

Shared source is enabled by default. You can also set the session variable `streaming_use_shared_source` to control whether to enable it.

```sql
# change the config in the current session
SET streaming_use_shared_source=[true|false];

# change the default value of the session variable in the cluster
# (the current session is not affected)
ALTER SYSTEM SET streaming_use_shared_source=[true|false];
```

To completely disable it at the cluster level, go to [`risingwave.toml`](https://github.com/risingwavelabs/risingwave/blob/main/src/config/example.toml#L146) configuration file, and set the `stream_enable_shared_source` to `false`.

### Compared with non-shared source

With non-shared sources, when using the `CREATE SOURCE` statement:
- No streaming jobs would be instantiated. A source is just a set of metadata stored in the catalog.
- Only when a materialized view references the source, a `SourceExecutor` will be created to start the process of data ingestion.
WanYixian marked this conversation as resolved.
Show resolved Hide resolved

This leads to increased resource usage and potential inconsistencies:
- Each `SourceExecutor` consumed Kafka resources independently, adding pressure to both the Kafka broker and RisingWave.
- Independent `SourceExecutor` instances could result in different consumption progress, causing temporary inconsistencies when joining materialized views.

<Frame>
<img src="/images/non-shared-source.png"/>
</Frame>

With shared sources, when using the `CREATE SOURCE` statement:
- It will instantiate a single `SourceExecutor` immediately.
- All materialized views referencing the same source share the `SourceExecutor`.
- The downstream materialized views will only forwards data from the upstream sources, instead of consuming from Kafka independently.

This improves resource utilization and consistency.

<Frame>
<img src="/images/shared-source.png"/>
</Frame>

When creating a materialized view, RisingWave backfills historical data from Kafka. The process blocks the DDL statement until backfill completes.

- To configure this behavior, use the [SET BACKGROUND_DDL](/sql/commands/sql-set-background-ddl) command. This is similar to the backfilling procedure when creating a materialized view on tables and materialized views.

- To monitoring backfill progress, use the [SHOW JOBS](/sql/commands/sql-show-jobs) command or check `Kafka Consumer Lag Size` in the Grafana dashboard (under `Streaming`).


<Note>If you set up a retention policy or if the external system can only be accessed once (like message queues), and the data is no longer available, any newly created materialized views won’t be able to backfill the complete historical data. This can lead to inconsistencies with earlier materialized views.</Note>


### Compared with table

A `CREATE TABLE` statement can provide similar benefits to shared sources, except that it needs to persist all consumed data.

For table with connector, downstream materialized views backfill historical data from the table instead of external sources, which may be more efficient and cause less pressure to the external system. This also gives table stronger consistency guarantee, as historical data will be ensured to be present.

Tables offer other features that enhance their utility in data ingestion workflows. See [Table with connectors](/ingestion/overview#table-with-connectors).

<Frame>
<img src="/images/table-with-connectors.png"/>
</Frame>

<Note>
**LIMITATION**

Currently, shared source is only applicable to Kafka sources. Other sources are unaffected.
WanYixian marked this conversation as resolved.
Show resolved Hide resolved

Shared sources do not support `ALTER SOURCE`. Use non-shared sources if you require this functionality.
</Note>

## See also

<CardGroup>
Expand Down