
Enable parallel backfill in run.py #672

Merged
merged 8 commits into master from pengyu--parallel-run on Feb 7, 2024

Conversation

pengyu-hou
Collaborator

Summary

This PR adds a new param, parallelism, to run.py, to be used together with start-ds and end-ds. The purpose of this param is to allow users to submit multiple spark jobs to backfill data in parallel.
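The date-range split could look something like the following minimal sketch (a hypothetical helper, not the PR's actual code), assuming an inclusive [start-ds, end-ds] range divided into contiguous sub-ranges, one per parallel job:

```python
# Hypothetical sketch: split an inclusive [start_ds, end_ds] date range into
# `parallelism` contiguous sub-ranges, one per parallel backfill job.
from datetime import date, timedelta

def split_date_range(start_ds: str, end_ds: str, parallelism: int):
    start = date.fromisoformat(start_ds)
    end = date.fromisoformat(end_ds)
    total_days = (end - start).days + 1  # range is inclusive of both ends
    if parallelism < 1 or parallelism > total_days:
        raise ValueError("parallelism must be between 1 and the number of days")
    chunk, remainder = divmod(total_days, parallelism)
    ranges = []
    cursor = start
    for i in range(parallelism):
        days = chunk + (1 if i < remainder else 0)  # spread leftover days evenly
        sub_end = cursor + timedelta(days=days - 1)
        ranges.append((cursor.isoformat(), sub_end.isoformat()))
        cursor = sub_end + timedelta(days=1)
    return ranges

# The range from the test plan below, split six ways:
ranges = split_date_range("2023-12-08", "2023-12-31", 6)
```

With 24 days and parallelism 6, each sub-range covers 4 days, e.g. the first is ('2023-12-08', '2023-12-11').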

Why / Goal

This gives users a better backfilling experience.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • tested on GW: SPARK_VERSION=3.1.1 python3 ./scripts/run.py --mode=backfill --conf=production/staging_queries/team/testing_config.v1 --start-ds 2023-12-08 --end-ds 2023-12-31 --parallelism 6 | tee ~/test/staging_query.log
Checklist

  • [ ] Documentation update

Reviewers

@airbnb/zipline-maintainers

@pengyu-hou pengyu-hou changed the title from "Pengyu parallel run" to "Enable parallel backfill in run.py" on Jan 31, 2024
Comment on lines 494 to 495
first_command = command_list.pop(0)
check_call(first_command)
Collaborator


Curious: why do we run the first command by itself?

Collaborator Author


Actually, it is not necessary. I simplified it.
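The simplified version could treat all commands uniformly, launching them together and waiting on all of them — a minimal sketch with hypothetical names (the actual run.py code may differ):

```python
# Hypothetical sketch: launch every backfill command in parallel and wait
# for all of them, rather than running the first command by itself.
import subprocess

def run_in_parallel(command_list):
    # Start all commands at once; each is a shell command string.
    procs = [subprocess.Popen(cmd, shell=True) for cmd in command_list]
    # Wait for every process and collect the ones that exited non-zero.
    failed = [p.args for p in procs if p.wait() != 0]
    if failed:
        raise RuntimeError(f"{len(failed)} command(s) failed: {failed}")

run_in_parallel(["echo range-1", "echo range-2"])
```

Raising on any non-zero exit keeps the failure visible, though as noted below, recovering a single failed sub-range from a batch of parallel jobs remains awkward.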

@cristianfr
Collaborator

As I understand it, --start-ds --end-ds + parallelism divides the job into parallel date ranges (as opposed to step-days, which is sequential date ranges).

This can have an effect with failing jobs though, right? It may be hard to recover from one range that fails within the set of parallel jobs.

Another concern: some of our jobs at the spark level have hole-filling logic, iirc. So the newer jobs (running in parallel) may find the hole and try to fill it, all at the same time.

So what I'm thinking is we have a few systems that look into the past data:

  • Hole-filling logic
  • Defining compute based on existing data ("partitions to fill" is the name, I think).
  • This would add parallelism at the run.py level.

What's the advantage of doing it at the run.py level vs having the spark job parallelize itself (which would prevent the conflict between multiple spark jobs, unaware of each other, trying to fill the same date range)?

@pengyu-hou
Collaborator Author


Had a chat with @cristianfr and @nikhilsimha offline: this will be our short-term approach for backfill performance. We will continue to work on a long-term approach in the aggregation core to further improve performance.

update description. 

Signed-off-by: Pengyu Hou <[email protected]>
@pengyu-hou pengyu-hou merged commit b59ab4a into master Feb 7, 2024
7 checks passed
@pengyu-hou pengyu-hou deleted the pengyu--parallel-run branch February 7, 2024 19:17