This repo demonstrates running Apache Beam Python application code on different runners, with an emphasis on the PortableRunner backed by Spark. See `main.py` for the supported runners and the code related to each.
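For orientation, below is a minimal sketch of how a Beam Python pipeline typically selects its runner from command-line options. This is not the repo's `main.py`: the `--input-text` flag matches the example further down, but the default value and the trivial transform are illustrative assumptions.

```python
# Minimal, illustrative sketch (not this repo's main.py). The runner is chosen
# entirely by pipeline options passed on the command line, e.g.
#   python main.py --runner=DirectRunner
#   python main.py --runner=PortableRunner --job_endpoint=localhost:8099
import argparse

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions


def run(argv=None):
    parser = argparse.ArgumentParser()
    # Default value is an assumption for the sketch, not the repo's default.
    parser.add_argument("--input-text", default="Hello Beam", dest="input_text")
    args, beam_args = parser.parse_known_args(argv)

    # Any unrecognized flags (--runner, --job_endpoint, ...) flow into Beam.
    options = PipelineOptions(beam_args)
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Create" >> beam.Create([args.input_text])
            | "Print" >> beam.Map(print)
        )


if __name__ == "__main__":
    run()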
Requirements:
- Docker
- Python 3.9

Setup:
- Install Python 3.9 (preferably in a virtual environment).
- Activate the environment (e.g., `source venv/bin/activate`).
- Install the dependencies: `pip install -r requirements.txt`
From `./dev-utils`, bring up the local services (Spark master, Spark workers, and the Beam job server):

```sh
docker-compose up
```
```sh
# Run the script directly.
python main.py

# Run with command-line arguments.
python main.py --input-text="someMultiPart camelCased Words"

# Run the tests.
python -m unittest -v
```
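As a hedged illustration of how a Beam transform can be unit-tested with Beam's testing utilities (`TestPipeline`, `assert_that`), here is a sketch. The camelCase splitting is a hypothetical stand-in inspired by the sample `--input-text` value above, not code taken from this repo's tests.

```python
# Hypothetical example of unit-testing a Beam transform; the camelCase
# splitting mirrors the sample --input-text argument but is NOT this repo's code.
import re
import unittest

import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline
from apache_beam.testing.util import assert_that, equal_to


def split_camel_case(word):
    # "someMultiPart" -> ["some", "Multi", "Part"]
    return re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", word)


class SplitCamelCaseTest(unittest.TestCase):
    def test_words_are_split(self):
        with TestPipeline() as pipeline:
            output = (
                pipeline
                | beam.Create(["someMultiPart camelCased Words"])
                | beam.FlatMap(str.split)          # split on whitespace
                | beam.FlatMap(split_camel_case)   # split each camelCase word
            )
            assert_that(
                output,
                equal_to(["some", "Multi", "Part", "camel", "Cased", "Words"]),
            )


if __name__ == "__main__":
    unittest.main()
```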
Spark executors: http://localhost:8081/
Spark jobs: http://localhost:4040/jobs/ (this endpoint is hosted by the Beam job server and is only available while a job is running)
App log output: stdout of the worker nodes
A few notes on how the pieces fit together (a pipeline-options sketch follows this list):
- The Beam job server and the SDK harness (workers) must share a file volume for artifact staging.
- There is never direct communication between the Beam job server and the SDK harness. Communication flows: Beam job server -> Spark master -> Spark workers -> Beam SDK harness (workers).
- The artifact endpoint only communicates which artifacts are needed; the processes then retrieve those artifacts from the shared volume referenced above.
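For reference, a sketch of the pipeline options typically passed when submitting through a Beam job server in front of Spark. The ports are the job server's common defaults and the worker-pool address is hypothetical; check the docker-compose setup under `./dev-utils` for the values this repo actually uses.

```python
# Sketch only: typical PortableRunner options for a Spark-backed job server.
# Hostnames/ports below are assumptions, not values read from this repo.
from apache_beam.options.pipeline_options import PipelineOptions

options = PipelineOptions([
    "--runner=PortableRunner",
    "--job_endpoint=localhost:8099",       # Beam job server
    "--artifact_endpoint=localhost:8098",  # artifact staging service
    "--environment_type=EXTERNAL",         # SDK harness runs outside the Spark JVM
    "--environment_config=beam-worker-pool:50000",  # hypothetical harness address
])
```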
This software is distributed under the terms of both the MIT license and the Apache License (Version 2.0).
See LICENSE for details.
The Beam application logic in this repo is based on https://github.com/apache/beam-starter-python.
Additional bits were gleaned from:
- https://github.com/sambvfx/beam-flink-k8s
- https://www.mail-archive.com/[email protected]/msg07434.html
- https://lists.apache.org/thread/zq7n1kxbxlnoh8ryrjvfhg68xjcd3lt6
- https://github.com/mosche/beam-portable-spark/blob/main/docker-compose.yml
Feel free to open a PR against this repo.