How to support StreamingWindow operator in Velox #5925

JkSelf · 2023-07-31T07:23:11Z

Collaborator

A window function in Spark needs all rows that share the same PARTITION BY key to be processed together, in ORDER BY order. To achieve this, Spark employs a two-step process:

1. Partitioning: Spark first repartitions the input data across multiple nodes so that rows with the same partition key land in the same partition. Each partition contains a subset of the data.
2. Sorting: After partitioning, Spark sorts the rows within each partition. This is necessary because the Window operator evaluates each function over ranges of rows (window frames) defined by the window boundaries. With the data inside each partition already sorted, Spark can compute the window functions efficiently.

Once the data is partitioned and sorted, Spark can evaluate the Window operator in parallel on different sets of partitions across the cluster. This allows for efficient and scalable computation of window functions, as each node can independently process its assigned partitions and then combine the results as needed.

Because the input arrives already sorted, the Window operator itself typically has no need to sort. The Velox Window operator, however, does sort the data. To reduce the memory footprint in stream processing scenarios, we can consider using a StreamingWindow operator instead of the regular Window operator. With StreamingWindow, we can apply window functions to each group of rows as soon as it becomes available, without materializing and sorting the entire input first. This allows for more efficient, lower-latency processing while minimizing memory requirements.
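To make the memory difference concrete, here is a minimal Python sketch (not Velox code) that computes row_number() both ways. The function names, the (partition key, order key) row layout and the peak counter are assumptions made for illustration. The streaming variant also assumes its input is already sorted by partition key and then by order key.

```python
from itertools import groupby


def sorting_window(rows):
    """Materialize all input, sort it, then compute row_number()."""
    buffered = sorted(rows)  # the whole input is held in memory
    peak = len(buffered)
    out = []
    for _, part in groupby(buffered, key=lambda r: r[0]):
        out.extend((p, o, i) for i, (p, o) in enumerate(part, start=1))
    return out, peak


def streaming_window(sorted_rows):
    """Assumes input already sorted by (partition key, order key)."""
    peak = 0
    out = []
    # groupby merges only adjacent equal keys, which is exactly the
    # guarantee that pre-sorted input provides.
    for _, part in groupby(sorted_rows, key=lambda r: r[0]):
        buffered = list(part)  # only one window partition is held at a time
        peak = max(peak, len(buffered))
        out.extend((p, o, i) for i, (p, o) in enumerate(buffered, start=1))
    return out, peak


rows = [(1, 10), (1, 20), (2, 5), (2, 7), (2, 9)]
sorted_out, sorted_peak = sorting_window(rows)
stream_out, stream_peak = streaming_window(rows)
```

Both functions return the same rows, but the sorting variant buffers all five input rows while the streaming variant never buffers more than the largest window partition (three rows here).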

There are two possible approaches:

1. Implement a separate StreamingWindow operator: A new operator designed specifically for streaming processing is implemented. This operator would not inherit directly from the regular Window operator, although the two may share common code for the windowing logic. See Optimize WindowOperator for pre-sorted input #5437.
2. Integrate the StreamingWindow functionality into the Window operator: Alternatively, the streaming logic can be incorporated directly into the existing Window operator.

JkSelf · 2023-07-31T07:24:18Z

Collaborator Author

@mbasmanova @aditi-pandit We can discuss the final solution in this discussion.

0 replies

JkSelf · 2023-07-31T07:29:44Z

Collaborator Author

I would recommend using method 1 due to the differing logic between StreamingWindow and Window. Combining them into a single implementation would require a large if-else branch to distinguish between the two logics. By keeping StreamingWindow and Window as separate implementations, we can ensure clarity and separation of concerns. It allows for easier understanding of each operator's specific behavior and facilitates independent modifications or optimizations in the future. For the spill case, both StreamingWindow and Window operators require spill logic. It would be beneficial to consolidate the spill logic into a common code section if method 1 is chosen.

2 replies

aditi-pandit Aug 1, 2023
Collaborator

@JkSelf: Did you try prototyping this?

In my mind, you would determine during initialization whether the operator runs in streaming mode. In addInput and getOutput, the streaming and sort-based paths would each have their own functions, and the common parts would be separate functions as well, so there aren't really any big if-else conditions.

JkSelf Aug 1, 2023
Collaborator Author

The two modes differ not only in addInput and getOutput but also in three other methods: isFinished, noMoreInput, and createPeerAndFrameBuffers. That may require five if-else conditions.

mbasmanova · 2023-07-31T16:00:44Z

Collaborator

@JkSelf Thank you for opening an issue. First, let's clarify why Spark pre-sorts the inputs before executing the Window operator. My guess is that sorting can be done in a distributed manner, e.g. many nodes can partially sort the data, then a single node can merge-sort the results. In addition, I imagine that Spark can partition the data first and evaluate the Window operator on different sets of partitions on multiple nodes in parallel. Is this how Spark works? Let's make sure to add this information to the description of this issue to explain why we want Velox to support Window over pre-sorted inputs.

0 replies

JkSelf · 2023-08-01T02:13:44Z

Collaborator Author

@mbasmanova
Yes, you are correct in your understanding of how Spark works with respect to the Window operator and sorting.

A window function in Spark needs all rows that share the same PARTITION BY key to be processed together, in ORDER BY order. To achieve this, Spark employs a two-step process:

1. Partitioning: Spark first repartitions the input data across multiple nodes so that rows with the same partition key land in the same partition. Each partition contains a subset of the data.
2. Sorting: After partitioning, Spark sorts the rows within each partition. This is necessary because the Window operator evaluates each function over ranges of rows (window frames) defined by the window boundaries. With the data inside each partition already sorted, Spark can compute the window functions efficiently.

Once the data is partitioned and sorted, Spark can evaluate the Window operator in parallel on different sets of partitions across the cluster. This allows for efficient and scalable computation of window functions, as each node can independently process its assigned partitions and then combine the results as needed.

0 replies

JkSelf · 2023-08-01T02:31:52Z

Collaborator Author

@mbasmanova
I have updated the issue description. Which approach do you prefer?

0 replies

JkSelf · 2023-08-01T02:32:15Z

Collaborator Author

@aditi-pandit

0 replies

mbasmanova · 2023-08-01T09:37:53Z

Collaborator

@JkSelf Thank you for clarifying. Would you share some queries and excerpts of Spark query plans that show the partitioning, partial sorting, final sorting and window plan nodes? I'd like to double check my understanding to make sure it is complete.

5 replies

aditi-pandit Aug 1, 2023
Collaborator

@JkSelf: That would really help.

JkSelf Aug 2, 2023
Collaborator Author

Take the lineitem table in TPC-H as an example, with the query select row_number() over (partition by l_suppkey order by l_orderkey) from lineitem. Spark inserts a Sort operator before the Window operator, using l_suppkey and l_orderkey as the sort keys. This sorting operation ensures that the data is sorted within each partition, rather than across the entire dataset.

mbasmanova Aug 2, 2023
Collaborator

@JkSelf It is a bit difficult to work with images. Any chance you can share a text version of the plan?

> This sorting operation ensures that the data is sorted within each partition, rather than across the entire dataset.

I wonder how this happens. Which plan node partitions by l_suppkey?

JkSelf Aug 2, 2023
Collaborator Author

The physical plan includes a shuffle Exchange node, which repartitions the data based on the l_suppkey column. Notably, the Sort operator's global flag is false (as shown in the following physical plan). This means the data is sorted within each partition rather than across the entire dataset.

```
== Physical Plan ==
*(3) Project [row_number() OVER (PARTITION BY l_suppkey ORDER BY l_orderkey ASC NULLS FIRST unspecifiedframe$()) AS ROW_NUMBER#5]
+- Window [row_number() windowspecdefinition(l_suppkey#0, l_orderkey ASC NULLS FIRST, specifiedwindowframe(RowFrame, unboundedpreceding$(), currentrow$()))], [_we0#6], [l_suppkey#0 ASC NULLS FIRST, l_orderkey#1 ASC NULLS FIRST]
   +- *(2) Sort [l_suppkey#0 ASC NULLS FIRST, l_orderkey#1 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(l_suppkey#0, 200), true, [id=#13]
         +- *(1) Scan ExistingRDD[l_suppkey#0,l_orderkey#1]
```
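
The data flow of this plan can be imitated in a few lines of Python. This is only a sketch: exchange, local_sort and window_row_number are hypothetical helpers named after the plan nodes, the modulo stands in for Spark's real hash function, the sample rows are made up, and 4 partitions are used instead of 200 to keep the result readable.

```python
from itertools import groupby

NUM_PARTITIONS = 4  # the plan uses 200; a small number keeps the demo readable


def exchange(rows, num_partitions):
    """Exchange hashpartitioning(l_suppkey): a given key always lands in one partition."""
    parts = [[] for _ in range(num_partitions)]
    for suppkey, orderkey in rows:
        # Modulo is a stand-in for Spark's hash function (assumption).
        parts[suppkey % num_partitions].append((suppkey, orderkey))
    return parts


def local_sort(part):
    """Sort with global=false: the ordering holds inside one partition only."""
    return sorted(part)  # sorts by (l_suppkey, l_orderkey)


def window_row_number(sorted_part):
    """Window: row_number() restarts whenever l_suppkey changes."""
    out = []
    for _, group in groupby(sorted_part, key=lambda r: r[0]):
        out.extend((s, o, n) for n, (s, o) in enumerate(group, start=1))
    return out


# (l_suppkey, l_orderkey) pairs; made-up sample data
lineitem = [(5, 30), (1, 12), (5, 7), (9, 3), (1, 4), (2, 8)]
partitions = [
    window_row_number(local_sort(p)) for p in exchange(lineitem, NUM_PARTITIONS)
]
```

Note that one shuffle partition can hold several l_suppkey values (1, 5 and 9 share a partition here). The local sort is what makes the rows of each l_suppkey contiguous, so the Window step can process them one group at a time.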

mbasmanova Aug 2, 2023
Collaborator

@JkSelf Thanks. This is super helpful.

mbasmanova · 2023-08-01T09:45:11Z

Collaborator

Looking at Presto WindowNode, I see that it supports specifying whether inputs are partially partitioned (prePartitionedInputs) and sorted (preSortedOrderPrefix).

```java
public class WindowNode
        extends InternalPlanNode
{
    private final PlanNode source;
    private final Set<VariableReferenceExpression> prePartitionedInputs;
    private final Specification specification;
    private final int preSortedOrderPrefix;
    private final Map<VariableReferenceExpression, Function> windowFunctions;
    private final Optional<VariableReferenceExpression> hashVariable;
    // ... (excerpt; rest of the class omitted)
}
```

This seems to be a more general case of what you have in Spark. I suggest we modify WindowNode in Velox in a similar manner to allow support for both Presto and Spark. This would also be similar to how we define streaming aggregations by specifying that input is clustered on a subset of grouping keys.
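
As a rough illustration of how the two fields could drive the operator, the Python sketch below classifies the work that remains. WindowSpec, remaining_work and the three mode names are hypothetical. The interpretation of the fields (prePartitionedInputs as the partition keys the input is already grouped on, preSortedOrderPrefix as the number of leading ORDER BY keys the input is already sorted on) is an assumption based on the field names, not something confirmed in this thread.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class WindowSpec:
    partition_keys: List[str]
    order_keys: List[str]
    pre_partitioned_inputs: Set[str]  # partition keys the input is already grouped on
    pre_sorted_order_prefix: int      # leading order keys the input is already sorted on


def remaining_work(spec: WindowSpec) -> str:
    """Classify how much grouping and sorting is left for the operator."""
    fully_partitioned = set(spec.partition_keys) <= spec.pre_partitioned_inputs
    fully_sorted = spec.pre_sorted_order_prefix == len(spec.order_keys)
    if fully_partitioned and fully_sorted:
        return "streaming"              # the Spark case: nothing left to sort
    if fully_partitioned:
        return "sort-within-partition"  # sort only the remaining order keys
    return "full-sort"                  # the current behaviour: sort everything


spark_case = WindowSpec(["l_suppkey"], ["l_orderkey"], {"l_suppkey"}, 1)
grouped_only = WindowSpec(["l_suppkey"], ["l_orderkey"], {"l_suppkey"}, 0)
no_hints = WindowSpec(["l_suppkey"], ["l_orderkey"], set(), 0)
modes = [remaining_work(s) for s in (spark_case, grouped_only, no_hints)]
```

Under these assumptions the Spark plan maps to the first case, and the other two cases show where partial optimizations could apply.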

1 reply

aditi-pandit Aug 1, 2023
Collaborator

Yes, at some point we will implement the optimization that uses prePartitionedInputs and preSortedOrderPrefix to reduce the sorting done in the Window operator.

This style is more generic, and the Spark setup is its most optimized case: no sorting is required because the input stream is already completely sorted on the partition keys and sorting keys.

Given this line of optimizations, combining the streaming into the current Window logic seems most suitable to me as well.

mbasmanova · 2023-08-01T09:50:31Z

Collaborator

I don't have a strong opinion on whether streaming window should be a separate operator or integrated into existing Window operator. For aggregations, we have a separate operator for the case when input is clustered on all grouping keys while HashAggregation operator has support for partially grouped inputs.

If you decide to implement a separate StreamingWindow operator, do make sure not to make it inherit from the Window operator class. Instead, extract shared logic into a separate class and use it directly.

It might be helpful to experiment with extending the existing Window operator to add support for the streaming case. For example, it is not clear to me that the following statement is accurate:

> Combining them into a single implementation would require a large if-else branch to distinguish between the two logics.

0 replies

aditi-pandit · 2023-08-01T17:17:12Z

Collaborator

If we want to separate StreamingWindow there is also another design in my mind.

We could split the Window operator into two separate operators (let's call them WindowBuild and WindowOutput). WindowBuild is where the current Window and StreamingWindow logic differ. Window accumulates all input rows and starts pushing rows out in partition + sort order after all input is received. There might not be an active "pushing rows out", but rather just sharing of the RowContainer. StreamingWindow identifies the rows of each partition and batches them, one partition at a time, for the WindowOutput operator.

WindowOutput does the job of invoking the WindowFunction and sending rows downstream. This is common to both implementations.

This separation could also just be a logical class/code separation within the Window operator and not a separate physical operator in the plan per se.
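
A minimal Python sketch of this split, under stated assumptions: the pull-style interface (add_input, no_more_input, next_partition), the run driver and the (partition key, order key) row layout are all made up for illustration and are not taken from Velox. row_number() stands in for an arbitrary window function.

```python
from collections import deque
from itertools import groupby


class SortWindowBuild:
    """Accumulates all input; partitions become available after no_more_input."""

    def __init__(self):
        self._rows = []
        self._partitions = deque()

    def add_input(self, batch):
        self._rows.extend(batch)

    def no_more_input(self):
        self._rows.sort()  # sort by (partition key, order key)
        for _, part in groupby(self._rows, key=lambda r: r[0]):
            self._partitions.append(list(part))
        self._rows = []

    def next_partition(self):
        return self._partitions.popleft() if self._partitions else None


class StreamingWindowBuild:
    """Assumes pre-sorted input; a partition is available once its key changes."""

    def __init__(self):
        self._current = []
        self._partitions = deque()

    def add_input(self, batch):
        for row in batch:
            if self._current and self._current[0][0] != row[0]:
                self._partitions.append(self._current)
                self._current = []
            self._current.append(row)

    def no_more_input(self):
        if self._current:
            self._partitions.append(self._current)
            self._current = []

    def next_partition(self):
        return self._partitions.popleft() if self._partitions else None


def window_output(build):
    """Shared output side: evaluate row_number() on each available partition."""
    out = []
    partition = build.next_partition()
    while partition is not None:
        out.extend(row + (i,) for i, row in enumerate(partition, start=1))
        partition = build.next_partition()
    return out


def run(build, batches):
    out = []
    for batch in batches:
        build.add_input(batch)
        out.extend(window_output(build))  # only the streaming build has output here
    build.no_more_input()
    out.extend(window_output(build))
    return out


batches = [[(1, 10), (1, 20), (2, 5)], [(2, 7), (3, 1)]]
sorting_result = run(SortWindowBuild(), batches)
streaming_result = run(StreamingWindowBuild(), batches)
```

The output side and the driver never check which build they are talking to. The only mode-specific code lives in the two build classes, which is the point of the proposed separation.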

@mbasmanova, @JkSelf: What do you think of this proposal?

6 replies

aditi-pandit Aug 3, 2023
Collaborator

@JkSelf: The PR for the streaming window accumulates all input rows in addInput(...) into the RowContainer. During input processing it determines whether the partition has changed. If it has, it sets up sortedRows, which is used for output when getOutput is called. Once all rows of the partition have been output, the data is cleaned from the RowContainer.

Since window functions work over window frames that could cover the entire partition, we can't really process individual rows at a time without seeing the entire partition.

StreamingWindowBuild here will do the same: on addInput it puts rows into the RowContainer and sets up sortedRows when the partition changes. StreamingWindowBuild::getOutput will pass the rows in sortedRows to WindowOutput for output.
We would need to add an API between WindowBuild and WindowOutput to clean the RowContainer.
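
The flow described above, including the cleanup call, can be sketched in Python as follows. The RowContainer class here is a hypothetical stand-in (a dictionary from row id to row), not the Velox class, and the method names next_partition and release are assumptions chosen for the sketch.

```python
from collections import deque


class RowContainer:
    """Hypothetical stand-in for a row store: maps a row id to a row."""

    def __init__(self):
        self._rows = {}
        self._next_id = 0

    def store(self, row):
        row_id = self._next_id
        self._next_id += 1
        self._rows[row_id] = row
        return row_id

    def get(self, row_id):
        return self._rows[row_id]

    def erase(self, row_ids):
        for row_id in row_ids:
            del self._rows[row_id]

    def __len__(self):
        return len(self._rows)


class StreamingWindowBuild:
    """Buffers pre-sorted rows and hands out one complete partition at a time."""

    def __init__(self):
        self.container = RowContainer()
        self._pending = []        # row ids of the partition still being read
        self._complete = deque()  # row ids of partitions ready for output

    def add_input(self, batch):
        for row in batch:
            if self._pending and self.container.get(self._pending[0])[0] != row[0]:
                # Partition key changed: the pending partition is complete.
                self._complete.append(self._pending)
                self._pending = []
            self._pending.append(self.container.store(row))

    def no_more_input(self):
        if self._pending:
            self._complete.append(self._pending)
            self._pending = []

    def next_partition(self):
        return self._complete.popleft() if self._complete else None

    def release(self, row_ids):
        # Cleanup API: called by the output side once a partition is emitted.
        self.container.erase(row_ids)


def window_output(build):
    """Emit row_number() for every complete partition, then release its rows."""
    out = []
    row_ids = build.next_partition()
    while row_ids is not None:
        for number, row_id in enumerate(row_ids, start=1):
            out.append(build.container.get(row_id) + (number,))
        build.release(row_ids)
        row_ids = build.next_partition()
    return out


build = StreamingWindowBuild()
build.add_input([(1, 10), (1, 20), (2, 5)])  # partition 2 is still open
first = window_output(build)
held_after_first = len(build.container)
build.add_input([(2, 7), (3, 1)])            # closes partition 2
second = window_output(build)
build.no_more_input()                        # closes partition 3
third = window_output(build)
```

A partition that spans two input batches (key 2 here) stays in the container until the key changes, and the container shrinks again as soon as the output side calls release.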

In general, it seems like just doing a code separation (without separate operators) between WindowBuild and WindowOutput could give us the separation we want.

@mbasmanova: What do you think? I might prototype this idea and share a PR if you are in agreement as well.

JkSelf Aug 3, 2023
Collaborator Author

@aditi-pandit Thanks for the detailed explanation.

Do we still need #5437 for adding the StreamingWindow operator? @aditi-pandit @mbasmanova

aditi-pandit Aug 3, 2023
Collaborator

@JkSelf: Let's keep it around. I'll try to share the prototype by the end of the week and then we can figure out how to handle the code.

JkSelf Aug 18, 2023
Collaborator Author

@aditi-pandit Is there any further progress on StreamingWindow?

aditi-pandit Aug 18, 2023
Collaborator

@JkSelf: I started #6011 for the refactoring. Could you please take a look too and send me any questions you have about using it for StreamingWindow?
