
Enable parallel backfill in run.py #672

Merged
merged 8 commits into master from pengyu--parallel-run on Feb 7, 2024

Conversation

pengyu-hou
Collaborator

Summary

This PR adds a new param, parallelism, to run.py, to be used together with start-ds and end-ds. The purpose of this param is to allow users to submit multiple spark jobs to backfill data in parallel.
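The date-range split could look something like the following minimal sketch (a hypothetical helper, not the PR's actual code), assuming an inclusive [start-ds, end-ds] range divided into contiguous sub-ranges, one per parallel job:

```python
# Hypothetical sketch: split an inclusive [start_ds, end_ds] date range into
# `parallelism` contiguous sub-ranges, one per parallel backfill job.
from datetime import date, timedelta

def split_date_range(start_ds: str, end_ds: str, parallelism: int):
    start = date.fromisoformat(start_ds)
    end = date.fromisoformat(end_ds)
    total_days = (end - start).days + 1  # range is inclusive of both ends
    if parallelism < 1 or parallelism > total_days:
        raise ValueError("parallelism must be between 1 and the number of days")
    chunk, remainder = divmod(total_days, parallelism)
    ranges = []
    cursor = start
    for i in range(parallelism):
        days = chunk + (1 if i < remainder else 0)  # spread leftover days evenly
        sub_end = cursor + timedelta(days=days - 1)
        ranges.append((cursor.isoformat(), sub_end.isoformat()))
        cursor = sub_end + timedelta(days=1)
    return ranges

# The range from the test plan below, split six ways:
ranges = split_date_range("2023-12-08", "2023-12-31", 6)
```

With 24 days and parallelism 6, each sub-range covers 4 days, e.g. the first is ('2023-12-08', '2023-12-11').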

Why / Goal

This gives users a better backfilling experience.

Test Plan

  • Added Unit Tests
  • Covered by existing CI
  • Integration tested
  • tested on GW: SPARK_VERSION=3.1.1 python3 ./scripts/run.py --mode=backfill --conf=production/staging_queries/team/testing_config.v1 --start-ds 2023-12-08 --end-ds 2023-12-31 --parallelism 6 | tee ~/test/staging_query.log
Checklist

  • [ ] Documentation update

Reviewers

@airbnb/zipline-maintainers

@pengyu-hou pengyu-hou changed the title from "Pengyu parallel run" to "Enable parallel backfill in run.py" on Jan 31, 2024
Comment on lines 494 to 495
first_command = command_list.pop(0)
check_call(first_command)
Collaborator


Curious: why do we run the first command by itself?

Collaborator Author


Actually, it is not necessary. I simplified it.
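The simplified version could treat all commands uniformly, launching them together and waiting on all of them — a minimal sketch with hypothetical names (the actual run.py code may differ):

```python
# Hypothetical sketch: launch every backfill command in parallel and wait
# for all of them, rather than running the first command by itself.
import subprocess

def run_in_parallel(command_list):
    # Start all commands at once; each is a shell command string.
    procs = [subprocess.Popen(cmd, shell=True) for cmd in command_list]
    # Wait for every process and collect the ones that exited non-zero.
    failed = [p.args for p in procs if p.wait() != 0]
    if failed:
        raise RuntimeError(f"{len(failed)} command(s) failed: {failed}")

run_in_parallel(["echo range-1", "echo range-2"])
```

Raising on any non-zero exit keeps the failure visible, though as noted below, recovering a single failed sub-range from a batch of parallel jobs remains awkward.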

@cristianfr
Collaborator

As I understand it, --start-ds --end-ds + parallelism divides the job into parallel date ranges (as opposed to step-days, which is sequential date ranges).

This can have an effect with failing jobs though, right? It may be hard to recover from one range that fails within the set of parallel jobs.

Another concern: some of our jobs at the spark level have hole-filling logic, iirc. So the newer jobs (running in parallel) may find the hole and try to fill it, all at the same time.

So what I'm thinking is we have a few systems that look into the past data:

  • Hole-filling logic
  • Defining compute based on existing data ("partitions to fill" is the name, I think).
  • This would add parallelism at the run.py level.

What's the advantage of doing it at the run.py level vs having the spark job parallelize itself (which would prevent the conflict between multiple spark jobs, unaware of each other, trying to fill the same date range)?

@pengyu-hou
Collaborator Author


Had a chat with @cristianfr and @nikhilsimha offline: this will be our short-term approach for backfill performance. We will continue to work on a long-term approach in the aggregation core to further improve performance.

update description. 

Signed-off-by: Pengyu Hou <[email protected]>
@pengyu-hou pengyu-hou merged commit b59ab4a into master Feb 7, 2024
7 checks passed
@pengyu-hou pengyu-hou deleted the pengyu--parallel-run branch February 7, 2024 19:17