(website) Update shape docs to include details of where clause optimisation (#2185)

Following on from #2182, on @thruflo's suggestion, this PR adds where
clause optimisation documentation to the main docs and links to it from
various places. It also documents which types of where clauses are optimised.

---------

Co-authored-by: James Arthur <[email protected]>
robacourt and thruflo authored Dec 19, 2024
1 parent 0fab338 commit 88c77cd
Showing 4 changed files with 78 additions and 47 deletions.
107 changes: 66 additions & 41 deletions website/docs/guides/shapes.md
@@ -41,60 +41,62 @@ A client can choose to sync one shape, or lots of shapes. Many clients can sync

Shapes are defined by:

- a `table`, such as `projects`
- a `where` clause, used to filter the rows in that table, such as `status='active'`
- a `columns` clause, used to only sync a subset of the columns in that table, such as `columns=id,title,status`
- a [table](#table), such as `items`
- an optional [where clause](#where-clause) to filter which rows are included in the shape
- an optional [columns](#columns) clause to select which columns are included

> [!IMPORTANT] Limitations
> Shapes are currently single table, whole row only. You can sync all the rows in a table, or a subset of the rows in that table. You can't yet sync an [include tree](#single-table) without filtering or joining in the client.
A shape contains all of the rows in the table that match the where clause, if provided. If a columns clause is provided, the synced rows will only contain those selected columns.

### `table`
> [!Warning] Limitations
> Shapes are currently [single table](#single-table). Shape definitions are [immutable](#immutable).
This is the root table of the shape. It must match a table in your Postgres database.
### Table

This is the root table of the shape. All shapes must specify a table and it must match a table in your Postgres database.

The value can be just a tablename like `projects`, or can be a qualified tablename prefixed by the database schema using a `.` delimiter, such as `foo.projects`. If you don't provide a schema prefix, then the table is assumed to be in the `public.` schema.

### `where` clause
### Where clause

Optional where clause to filter rows in the `table`.
Shapes can define an optional where clause to filter out which rows from the table are included in the shape. Only rows that match the where clause will be included.

This must be a valid [PostgreSQL WHERE clause](https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-WHERE) using SQL syntax, e.g.:
The where clause must be a valid [PostgreSQL query expression](https://www.postgresql.org/docs/current/queries-table-expressions.html#QUERIES-WHERE) in SQL syntax, e.g.:

- `title='Electric'`
- `status IN ('backlog', 'todo')`

You can use logical operators like `AND` and `OR` to group multiple conditions, e.g.:
Where clauses support:

1. columns of numerical types, `boolean`, `uuid`, `text`, `interval`, date and time types (with the exception of `timetz`), [Arrays](https://github.com/electric-sql/electric/issues/1767) (but not yet [Enums](https://github.com/electric-sql/electric/issues/1709))
1. operators that work on those types: arithmetics, comparisons, logical/boolean operators like `OR`, string operators like `LIKE`, etc.

You can use `AND` and `OR` to group multiple conditions, e.g.:

- `title='Electric' OR title='SQL'`
- `title='Electric' AND status='todo'`

> [!WARNING] Limitations
> Electric needs to be able to evaluate where clauses outside of Postgres, so it supports a subset of SQL types and expressions.
> 1. you can use columns of numerical types, `boolean`, `uuid`, `text`, `interval`, date and time types (with the exception of `timetz`)
> 1. operators that work on those types: arithmetics, comparisons, boolean operators like `OR`, string operators like `LIKE`, etc.
> 1. [Arrays](https://github.com/electric-sql/electric/issues/1767) and [Enums](https://github.com/electric-sql/electric/issues/1709) are not yet supported in where clauses
>
> For the full and up-to-date list of supported types, operators and functions, see their implementation in [`known_functions.ex`](https://github.com/electric-sql/electric/blob/main/packages/sync-service/lib/electric/replication/eval/env/known_functions.ex), while some expressions are handled in the [parser](https://github.com/electric-sql/electric/blob/main/packages/sync-service/lib/electric/replication/eval/parser.ex) (look for the `do_parse_and_validate_tree()` function).
>
> ---
>
> Some general rules that shape where clauses abide by are:
> 1. where clauses can only refer to columns in the target row
> 1. where clauses can't perform joins or refer to other tables
> 1. where clauses can't use non-deterministic SQL functions like `count()` or `now()`
>
> If you need to use a data type or where clause feature that isn't yet supported, please feel free to [raise a Feature Request](https://github.com/electric-sql/electric/discussions/categories/feature-requests) on GitHub.
Where clauses are limited in that they:

### `columns`
1. can only refer to columns in the target row
1. can't perform joins or refer to other tables
1. can't use non-deterministic SQL functions like `count()` or `now()`

Optional list of columns to include in the rows from the table, e.g.:
See [`known_functions.ex`](https://github.com/electric-sql/electric/blob/main/packages/sync-service/lib/electric/replication/eval/env/known_functions.ex) and [`parser.ex`](https://github.com/electric-sql/electric/blob/main/packages/sync-service/lib/electric/replication/eval/parser.ex) for the source of truth on which types, operators and functions are currently supported. If you need a feature that isn't supported yet, please [raise a feature request](https://github.com/electric-sql/electric/discussions/categories/feature-requests).

- `columns=id,title,status` - Only include the id, title, and status columns.
- `columns=id,"Status-Check"` - Only include id and Status-Check columns, quoting the identifiers where necessary.
> [!Warning] Throughput
> Where clause evaluation impacts [data throughput](#throughput). Some where clauses are [optimized](#optimized-where-clauses).
They should always include the primary key columns, and should be formed as a comma separated list of column names exactly as they are in the database schema. When not specified all columns are synced.
### Columns

This is an optional list of columns to select. When specified, only the columns listed are synced. When not specified all columns are synced.

For example:

- `columns=id,title,status` - only include the `id`, `title` and `status` columns
- `columns=id,"Status-Check"` - only include `id` and `Status-Check` columns, quoting the identifiers where necessary

The specified columns must always include the primary key column(s), and should be formed as a comma separated list of column names &mdash; exactly as they are in the database schema. If the identifier was defined as case sensitive and/or with special characters, then you must quote it.

If the identifier was defined as case sensitive and/or with special characters, then you must quote it in the `columns` parameter as well.
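As a minimal sketch of how the three parts of a shape definition fit together, the snippet below turns a `table`, an optional `where` clause and an optional `columns` list into a request URL. The `buildShapeUrl` helper, the `ShapeDefinition` type, the `http://localhost:3000` base URL and the `/v1/shape` path are assumptions made for illustration; check the HTTP API docs for the actual endpoint and any additional required parameters.

```typescript
// Hypothetical shape definition type, mirroring the three parts described above.
interface ShapeDefinition {
  table: string;      // e.g. "items", or schema-qualified like "foo.projects"
  where?: string;     // optional SQL where clause
  columns?: string[]; // optional; must include the primary key column(s)
}

// Hypothetical helper: the "/v1/shape" path is an assumption, not confirmed here.
function buildShapeUrl(baseUrl: string, shape: ShapeDefinition): string {
  const url = new URL("/v1/shape", baseUrl);
  url.searchParams.set("table", shape.table);
  if (shape.where !== undefined) {
    // URLSearchParams takes care of percent-encoding quotes, spaces and commas.
    url.searchParams.set("where", shape.where);
  }
  if (shape.columns !== undefined && shape.columns.length > 0) {
    // Callers are responsible for quoting identifiers such as "Status-Check".
    url.searchParams.set("columns", shape.columns.join(","));
  }
  return url.toString();
}

const shapeUrl = buildShapeUrl("http://localhost:3000", {
  table: "items",
  where: "status IN ('backlog', 'todo')",
  columns: ["id", "title", "status"],
});
console.log(shapeUrl);
```

Building the query string with `URLSearchParams` rather than string concatenation avoids hand-escaping the quotes and spaces that where clauses usually contain.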

## Subscribing to shapes

@@ -160,6 +162,36 @@ Or you can use framework integrations like the [`useShape`](/docs/integrations/r

See the [Quickstart](/docs/quickstart) and [HTTP API](/docs/api/http) docs for more information.

## Throughput

Electric evaluates [where clauses](#where-clause) when processing changes from Postgres and matching them to [shape logs](/docs/api/http#shape-log). If there are lots of shapes, this means we have to evaluate lots of where clauses. This has an impact on data throughput.

There are two kinds of where clauses:

1. [optimized where clauses](#optimized-where-clauses): a subset of clauses that we've optimized the evaluation of
1. non-optimized where clauses: all other where clauses

With non-optimized where clauses, throughput is inversely proportional to the number of shapes. If you have 10 shapes, Electric can process 1,400 changes per second. If you have 100 shapes, throughput drops to 140 changes per second.

With optimized where clauses, Electric can evaluate millions of clauses at once and maintain a consistent throughput of ~5,000 row changes per second **no matter how many shapes you have**. If you have 10 shapes, Electric can process 5,000 changes per second. If you have 1,000 shapes, throughput remains at 5,000 changes per second.

For more details see the [benchmarks](/docs/reference/benchmarks#_7-write-throughput-with-optimized-where-clauses).
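The figures quoted above can be turned into a rough back-of-the-envelope model. This is only a sketch fitted to those example numbers (1,400 changes per second at 10 shapes, roughly 5,000 for optimized clauses); the `estimateThroughput` function and its constants are illustrative assumptions, not measurements or part of Electric.

```typescript
// Constants derived from the example figures in the text above (assumptions).
const NON_OPTIMIZED_BUDGET = 14_000; // 1,400 changes/s x 10 shapes
const OPTIMIZED_THROUGHPUT = 5_000;  // roughly constant, per the text

// Hypothetical estimator: changes per second for a given number of shapes.
function estimateThroughput(shapeCount: number, optimized: boolean): number {
  if (!Number.isInteger(shapeCount) || shapeCount < 1) {
    throw new RangeError("shapeCount must be a positive integer");
  }
  // Non-optimized clauses are evaluated once per shape, so throughput
  // falls in inverse proportion to the number of shapes.
  return optimized ? OPTIMIZED_THROUGHPUT : NON_OPTIMIZED_BUDGET / shapeCount;
}

console.log(estimateThroughput(10, false));   // 1400
console.log(estimateThroughput(100, false));  // 140
console.log(estimateThroughput(1_000, true)); // 5000
```

The model is only meant for the range of shape counts quoted above; it should not be extrapolated to very small shape counts, where other costs dominate.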

### Optimized where clauses

We currently optimize the evaluation of the following clauses:

- `field = constant` - literal equality checks against a constant value.
We optimize this by indexing shapes by their constant, allowing a single lookup to retrieve all
shapes for that constant instead of evaluating the where clause for each shape.
Note that this index is internal to Electric and unrelated to Postgres indexes.
- `field = constant AND another_condition` - the `field = constant` part of the where clause is optimized as above, and any shapes that match are iterated through to check the other condition. Providing the first condition is enough to filter out most of the shapes, the write processing will be fast. If however `field = const` matches for a large number of shapes, then the write processing will be slower since each of the shapes will need to be iterated through.
- `a_non_optimized_condition AND field = constant` - as above. The order of the clauses is not important (Electric will filter by optimized clauses first).
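The indexing idea in the list above can be sketched as a map from constant values to the shapes that use them. This is an illustrative TypeScript model, not Electric's implementation; the `EqualityIndex` class, the `IndexedShape` type and the `residual` callback (standing in for `another_condition`) are all hypothetical names.

```typescript
type Row = Record<string, unknown>;

// Hypothetical representation of a shape whose where clause is
// `field = constant`, optionally followed by `AND another_condition`.
interface IndexedShape {
  id: string;
  field: string;
  constant: string | number | boolean;
  residual?: (row: Row) => boolean; // the non-optimized remainder, if any
}

class EqualityIndex {
  // field name -> constant value -> shapes registered for that value
  private byField = new Map<string, Map<unknown, IndexedShape[]>>();

  add(shape: IndexedShape): void {
    let byConstant = this.byField.get(shape.field);
    if (!byConstant) {
      byConstant = new Map();
      this.byField.set(shape.field, byConstant);
    }
    const bucket = byConstant.get(shape.constant) ?? [];
    bucket.push(shape);
    byConstant.set(shape.constant, bucket);
  }

  // One lookup per indexed field, then only the matching bucket is
  // iterated to check any residual condition.
  affectedShapes(row: Row): string[] {
    const ids: string[] = [];
    for (const [field, byConstant] of this.byField) {
      const candidates = byConstant.get(row[field]) ?? [];
      for (const shape of candidates) {
        if (!shape.residual || shape.residual(row)) ids.push(shape.id);
      }
    }
    return ids;
  }
}

const index = new EqualityIndex();
index.add({ id: "shape-1", field: "status", constant: "todo" });
index.add({ id: "shape-2", field: "status", constant: "done" });
index.add({
  id: "shape-3",
  field: "status",
  constant: "todo",
  residual: (row) => row["priority"] === "high",
});

const matched = index.affectedShapes({ id: 1, status: "todo", priority: "low" });
console.log(matched); // ["shape-1"]
```

The cost of a change depends on the size of the matching bucket rather than the total number of shapes, which is why a `field = constant` condition that matches many shapes slows write processing down again.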

> [!Warning] Need additional where clause optimization?
> We plan to optimize a much larger subset of Postgres where clauses. If you need a particular clause optimized, please [raise an issue on GitHub](https://github.com/electric-sql/electric) or [let us know on Discord](https://discord.electric-sql.com).

## Limitations

### Single table
@@ -183,21 +215,14 @@ You can upvote and discuss adding support for include trees here:
### Immutable

Shapes are currently immutable.
Shape definitions are currently immutable.

Once a shape subscription has been started, it's definition cannot be changed. If you want to change the data in a shape, you need to start a new subscription.

You can upvote and discuss adding support for mutable shapes here:

- [Editable shapes #1677](https://github.com/electric-sql/electric/discussions/1677)

<!--
## Performance
... add links to benchmarks here ...
-->

### Dropping tables

When dropping a table from Postgres you need to *manually* delete all shapes that are defined on that table.
15 changes: 9 additions & 6 deletions website/docs/reference/benchmarks.md
@@ -59,7 +59,7 @@ The first two benchmarks measure a client's initial sync time:

The next four measure how long it takes for clients to recieve an update after a write:

3. [many disjoint shapes](#_3-many-disjoint-shapes)
3. [many independent shapes](#_3-many-independent-shapes)
4. [one shape with many clients](#_4-one-shape-with-many-clients)
5. [many overlapping shapes, each with a single client](#_5-many-overlapping-shapes-each-with-a-single-client)
6. [many overlapping shapes, one client](#_6-many-overlapping-shapes-one-client)
@@ -98,7 +98,7 @@ This measures a single client syncing a single large shape of up-to 1M rows. The

### Live updates

#### 3. Many disjoint shapes
#### 3. Many independent shapes

<figure>
<a :href="UnrelatedShapesOneClientLatency">
@@ -111,12 +111,13 @@ This measures a single client syncing a single large shape of up-to 1M rows. The
This benchmark evaluates the time it takes for a write operation to reach a client subscribed to the relevant shape. On the x-axis, the number of active shapes is shown.
Each shape in this benchmark is independent, ensuring that a write operation affects only one shape at a time.

The two graphs differ based on the type of WHERE clause used for the shapes:
- **Top Graph:** The WHERE clause is in the form `field = constant`, where each shape is assigned a unique constant. These types of WHERE clause, along with similar patterns,
The two graphs differ based on the type of where clause used for the shapes:
- **Top Graph:** The where clause is in the form `field = constant`, where each shape is assigned a unique constant. These types of where clause, along with
[other patterns](/docs/guides/shapes#optimised-where-clauses),
are optimised for high performance regardless of the number of shapes — analogous to having an index on the field. As shown in the graph, the latency remains consistently
flat at 6ms as the number of shapes increases. This 6ms latency includes 3ms for PostgreSQL to process the write operation and 3ms for Electric to propagate it.
We are actively working to optimise additional WHERE clause types in the future.
- **Bottom Graph:** The WHERE clause is in the form `field LIKE constant`, an example of a non-optimised query type.
We are actively working to optimise additional where clause types in the future.
- **Bottom Graph:** The where clause is in the form `field ILIKE constant`, an example of a non-optimised query type.
In this case, the latency increases linearly with the number of shapes because Electric must evaluate each shape individually to determine if it is affected by the write.
Despite this, the response times remain low, a tenth of a second for 10,000 shapes.

@@ -192,6 +193,8 @@ is using an optimised where clause, specifically `field = constant`.
> value for each shape. This index is internal to Electric, and nothing to do with Postgres indexes. It's a hashmap if you're interested.
> `field = const AND another_condition` is another pattern we optimise. We aim to optimise a large subset of Postgres where clauses in the future.
> Optimised where clauses mean that we can process writes in a quarter of a millisecond, regardless of how many shapes there are.
>
> For more information on optimised where clauses, see the [shape API](/docs/guides/shapes#optimised-where-clauses).
The top graph shows throughput for Postgres 14, the bottom graph for Postgres 15.

3 changes: 3 additions & 0 deletions website/electric-api.yaml
@@ -120,6 +120,9 @@ paths:
Optional where clause to filter rows in the `table`.
This should be a valid PostgreSQL WHERE clause using SQL syntax.
For more details on what is supported and what is optimal,
see the [where clause documentation](https://electric-sql.com/docs/guides/shapes#where-clause).
examples:
title_filter:
value: '"title=''Electric''"'
Binary file modified website/public/img/guides/sync-shape.jpg