Enable parallel backfill in run.py #672
Conversation
api/py/ai/chronon/repo/run.py
Outdated
first_command = command_list.pop(0)
check_call(first_command)
Curious why we run the first command by itself?
Actually, it is not necessary. I made some simplifications.
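For context, a minimal sketch (not the PR's actual code) of how run.py could fan the per-range commands out in parallel. The names `command_list` and `parallelism` mirror the ones discussed in this thread, while `run_commands_in_parallel` is a hypothetical helper:

```python
# Hypothetical sketch: submit each per-range backfill command concurrently.
# Assumes each entry in `command_list` is an argv-style list suitable for
# subprocess.check_call; `parallelism` caps the number of concurrent jobs.
from concurrent.futures import ThreadPoolExecutor
from subprocess import check_call


def run_commands_in_parallel(command_list, parallelism):
    with ThreadPoolExecutor(max_workers=parallelism) as executor:
        futures = [executor.submit(check_call, cmd) for cmd in command_list]
        # Re-raise the first failure only after all jobs have been submitted,
        # so one bad range does not prevent the others from starting.
        for future in futures:
            future.result()
```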
As I understand it, this can have an effect on failing jobs though, right? It may be hard to recover from one range that fails within the sequence of jobs. Another concern: some of our jobs at the Spark level have hole-filling logic, IIRC, so the newer jobs running in parallel may all find the same hole and try to fill it at the same time. So what I'm thinking is we have a few systems that look into the past data:
What's the advantage of doing it at the run.py level vs. having the Spark job parallelize jobs itself, which would prevent the conflict between multiple Spark jobs (unaware of each other) trying to fill the same date range?
Had a chat with @cristianfr and @nikhilsimha offline; this will be our short-term approach for backfill performance. We will continue to work on a long-term approach on the aggregation core to further improve performance.
update description. Signed-off-by: Pengyu Hou <[email protected]>
Summary
This PR adds a new param, `parallelism`, to run.py to use with `start-ds` and `end-ds`. The purpose of this param is to allow users to submit multiple Spark jobs to backfill data in parallel.
Why / Goal
This gives users a faster backfilling experience: a long date range can be processed by multiple Spark jobs in parallel rather than by one sequential job.
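For illustration, a hedged sketch of how `start-ds`/`end-ds` might be split into contiguous sub-ranges, one per parallel job. `split_date_range` is a hypothetical helper, not necessarily what this PR implements:

```python
# Hypothetical sketch: split an inclusive [start_ds, end_ds] window into
# roughly `parallelism` contiguous, non-overlapping (start, end) sub-ranges.
# Uneven day counts may yield one or two extra, smaller ranges at the tail.
from datetime import date, timedelta


def split_date_range(start_ds: str, end_ds: str, parallelism: int):
    start = date.fromisoformat(start_ds)
    end = date.fromisoformat(end_ds)
    total_days = (end - start).days + 1
    chunk = max(1, total_days // parallelism)
    ranges = []
    cursor = start
    while cursor <= end:
        chunk_end = min(cursor + timedelta(days=chunk - 1), end)
        ranges.append((cursor.isoformat(), chunk_end.isoformat()))
        cursor = chunk_end + timedelta(days=1)
    return ranges


# Example matching the test plan below: six jobs over 2023-12-08..2023-12-31
# yields (12-08..12-11), (12-12..12-15), ..., (12-28..12-31).
print(split_date_range("2023-12-08", "2023-12-31", 6))
```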
Test Plan
```
SPARK_VERSION=3.1.1 python3 ./scripts/run.py --mode=backfill --conf=production/staging_queries/team/testing_config.v1 --start-ds 2023-12-08 --end-ds 2023-12-31 --parallelism 6 | tee ~/test/staging_query.log
```
Reviewers
@airbnb/zipline-maintainers