POC Airflow DAG pipeline with Kubernetes Executor #4479

Open
10 tasks
btylerburton opened this issue Oct 2, 2023 · 0 comments
Labels
H2.0/Harvest-General General Harvesting 2.0 Issues

btylerburton commented Oct 2, 2023

User Story

To meet our production SLAs, our new Harvesting platform will need to perform well at scale. The recommended best practice for that is to employ the Kubernetes Executor, as outlined here.

This ticket will create a POC sample pipeline that we can learn from and tune.

Acceptance Criteria

  • GIVEN we create a pipeline with Airflow
    AND we are using the Kubernetes Executor
    THEN we can begin to understand issues with running these technologies at scale

Background

Operating Airflow at scale presents unique issues for each deployment. We can only begin to understand the nuances of our platform by iterating on a production ETL pipeline using the same conventions that we know we need to operate at scale.

Security Considerations (required)

  • Security controls are not handled properly in the current iteration of SSB with EKS

Sketch

  • Spin up Airflow instance using local KinD cluster
  • Utilize Kubernetes Executor to run tasks in a K8s pod that's part of SSB boundary
  • Configure an ETL pipeline to process DCAT records
    • Determine an appropriate harvest source to use for our baseline
    • Use Astro SDK to grab data from S3 or CKAN API
    • Create a single transformation step in the pipeline using the Kubernetes executor
    • Do some small amount of data processing in the K8s container using Snowflake
    • Return the result to Airflow
  • Determine the best way to monitor a TaskGroup and its accompanying Tasks so that we can use data-driven methods to improve our implementation
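
As a starting point for the "single transformation step" above, here is a minimal sketch of what the code running inside the K8s pod might look like. It is an illustration under stated assumptions, not the planned implementation: the function names `transform_record` and `transform_catalog` are hypothetical, the choice of fields (`identifier`, `title`, `keyword`, `distribution`) is only a guess at what a baseline harvest source would need, and the Snowflake step is left out. It uses only the standard library so it can be tried without an Airflow install. The result is a plain JSON-serializable dict, which is the kind of value a task can return to Airflow.

```python
from __future__ import annotations

import json
from typing import Any

# Assumption: which DCAT fields are mandatory for our baseline is not decided
# in this ticket; these two are placeholders.
REQUIRED_FIELDS = ("identifier", "title")


def transform_record(record: dict[str, Any]) -> dict[str, Any]:
    """Normalize one DCAT dataset record into a flat, JSON-serializable dict."""
    missing = [field for field in REQUIRED_FIELDS if not record.get(field)]
    if missing:
        raise ValueError(f"record is missing required fields: {missing}")

    # Lowercase, trim, and de-duplicate keywords; drop empty strings.
    keywords = sorted(
        {kw.strip().lower() for kw in record.get("keyword", []) if kw.strip()}
    )
    return {
        "identifier": record["identifier"],
        "title": record["title"].strip(),
        "keywords": keywords,
        "distribution_count": len(record.get("distribution", [])),
    }


def transform_catalog(catalog: dict[str, Any]) -> dict[str, Any]:
    """Transform every dataset in a catalog, collecting failures instead of aborting."""
    transformed: list[dict[str, Any]] = []
    errors: list[dict[str, Any]] = []
    for index, record in enumerate(catalog.get("dataset", [])):
        try:
            transformed.append(transform_record(record))
        except ValueError as exc:
            # Keep going so one bad record does not fail the whole harvest.
            errors.append({"index": index, "error": str(exc)})
    return {"records": transformed, "errors": errors}


if __name__ == "__main__":
    # Hypothetical sample data, not taken from a real harvest source.
    sample = {
        "dataset": [
            {
                "identifier": "example-1",
                "title": " Example dataset ",
                "keyword": ["Water", "water ", "Quality"],
                "distribution": [{"mediaType": "text/csv"}],
            },
            {"title": "Record without an identifier"},
        ]
    }
    print(json.dumps(transform_catalog(sample), indent=2))
```

Collecting per-record errors rather than raising on the first failure is a design choice made for this sketch; it also gives us a simple count of failed records per run, which could feed the monitoring item above.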
btylerburton changed the title from "POC Airflow DAG pipeline with Kubernetes Executor & Snowflake" to "POC Airflow DAG pipeline with Kubernetes Executor" on Oct 3, 2023
btylerburton moved this to New Dev in data.gov team board on Oct 3, 2023
btylerburton added the H2.0/Harvest-General (General Harvesting 2.0 Issues) label on Oct 13, 2023
btylerburton moved this from New Dev to 🧊 Icebox in data.gov team board on Nov 27, 2023
btylerburton removed the Epic label on Jan 5, 2024