Getting Started with Data Engineering
Welcome to the "Getting Started in Data Engineering" wiki! This guide is designed to equip aspiring data engineers with the knowledge and tools they need to get started in this exciting field. Whether you're a complete beginner or looking to brush up on your skills, this wiki will provide you with a solid foundation.
And if you haven't read it yet, I'd start with Joe Reis & Matt Housley's book Fundamentals of Data Engineering. That one tends to create lightbulb moments about how everything fits together.
Tip
Just a shameless plug, by the way: If any of this piques your interest, or if you're looking for outside help on this topic, then head over to Aimpoint Digital's Data Engineering & Infrastructure page. Drop us a line! We'd love to help guide you in your data engineering journey!
Why You Need It: SQL is the foundational language for interacting with relational databases, which are the heart of storing and organizing structured data. Data engineers use SQL to query, extract, and manipulate data for various tasks like building pipelines and populating data warehouses
Step 1: Learn SQL
- Start here - roadmap.sh > SQL
- Free BigQuery training: https://cloud.google.com/blog/topics/training-certifications/free-google-cloud-bigquery-training
- Free Snowflake training: https://www.snowflake.com/en/resources/learn/training/
- Free Databricks training: https://www.databricks.com/resources/webinar/learn-databricks-sql-from-the-experts
Step 2: Try using SQL:
- BigQuery - has a pretty generous free tier and lots of public datasets to query
- Snowflake - has a 30-day free trial
- Databricks - has a 14-day free trial
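If you want to try a query before signing up for anything, here is a minimal sketch using Python's built-in `sqlite3` module. The `orders` table and its rows are made up for illustration, and SQLite's dialect differs in places from BigQuery, Snowflake, and Databricks SQL, but the query-and-aggregate pattern carries over.

```python
import sqlite3

# In-memory database: nothing is written to disk.
conn = sqlite3.connect(":memory:")

# Hypothetical table and rows, made up for illustration.
conn.execute("CREATE TABLE orders (order_id INTEGER, customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "alice", 20.0), (2, "bob", 35.5), (3, "alice", 14.5)],
)

# Query: total spend per customer, largest first.
query = """
    SELECT customer, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer
    ORDER BY total_spend DESC
"""
totals = conn.execute(query).fetchall()
conn.close()

for customer, total_spend in totals:
    print(customer, total_spend)
```

`GROUP BY` plus an aggregate like `SUM` is probably the pattern you'll reach for most often when building warehouse tables.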
Why You Need It: Python is a versatile language widely used in data engineering for data cleaning, transformation, scripting, and automation. Its vast ecosystem of libraries like pandas, NumPy, and scikit-learn simplifies data analysis and manipulation. Also, pytest is an awesome framework for unit testing. You should be doing unit testing
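To give a feel for what unit testing a cleaning step looks like, here is a small sketch. The `normalize_email` function is a hypothetical example, not something from a library. pytest collects functions whose names start with `test_` and reports any failing `assert`.

```python
def normalize_email(raw):
    """Trim whitespace and lowercase an email; return None if it is blank."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    return cleaned or None


# pytest collects functions named test_* and reports any failed assert.
def test_normalize_email_trims_and_lowercases():
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"


def test_normalize_email_treats_blank_as_missing():
    assert normalize_email("   ") is None
    assert normalize_email(None) is None
```

Saved as something like `test_cleaning.py` (the filename is an assumption), this should run with a plain `pytest` command from the same folder.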
- roadmap.sh > Python: The site roadmap.sh is an incredible visual learning journey framework (kind of like scikit-learn's Choose the right estimator)
- I've used this Python one, as well as these: DevOps, AI & Data Scientist, MLOps, SQL, Software Design & Architecture
- Al Sweigart's books
- Automate The Boring Stuff With Python - free online book that is an incredible resource for someone brand-new to Python (especially if they're a Windows user)
- Beyond The Boring Stuff With Python - free online book; this is the "SQL" to Automate The Boring Stuff - recommended reading for anyone wanting to learn Object-Oriented Programming in Python, among many more topics. I treat this like "Python 102"
- Python Cookbook - Bunch of worked examples showing how to do more advanced patterns in Python (ever wondered what `*args, **kwargs` means?)
- Go look at how other people do data viz: https://www.kaggle.com/code
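Since `*args, **kwargs` came up above, here is a quick sketch of what they do. `*args` collects extra positional arguments into a tuple, and `**kwargs` collects extra keyword arguments into a dict. Both function names below are hypothetical, chosen only for illustration.

```python
def describe_call(*args, **kwargs):
    """Return the positional and keyword arguments exactly as received."""
    # args is a tuple of positional arguments; kwargs is a dict of keyword arguments.
    return args, kwargs


def run_with_logging(func, *args, **kwargs):
    """Hypothetical wrapper: forward whatever it was given on to func."""
    print(f"calling {func.__name__}")
    return func(*args, **kwargs)


positional, keyword = run_with_logging(describe_call, 1, 2, table="orders")
print(positional)  # (1, 2)
print(keyword)     # {'table': 'orders'}
```

The wrapper shows the common use: passing arguments through untouched without needing to know the wrapped function's signature.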
Why You Need It: Data pipelines involve multiple steps and data sources. Orchestration tools help visualize, schedule, and automate these workflows, ensuring tasks run in the right order and at the right time - leading to higher trust in the end product. Data Engineering pipelines tend to be "DAGs" (Directed Acyclic Graphs): tasks connected by one-way dependencies with no loops, basically a left-to-right branching flow. Orchestrators manage DAGs to make sure every task runs successfully and in order
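To make the DAG idea concrete, here is a minimal sketch using `graphlib` from the Python standard library (Python 3.9+). The task names are hypothetical. Real orchestrators add scheduling, retries, and monitoring on top, so treat this as an illustration of the ordering problem only, not of how any particular tool works internally.

```python
from graphlib import TopologicalSorter

# Hypothetical pipeline: each task maps to the set of tasks it depends on.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_tables": {"extract_orders", "extract_customers"},
    "publish_report": {"join_tables"},
}

# static_order() yields every task after all of its dependencies.
run_order = list(TopologicalSorter(pipeline).static_order())
print(run_order)
```

The two extract tasks don't depend on each other, so either may come first. If the dependencies contained a loop, `static_order()` would raise `graphlib.CycleError`, which is the "acyclic" part of DAG.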
- dbt (Data Build Tool): Write SQL transformations, test data integrity, and document processes. Its core functionality lets you write SQL code to transform your data; those runs can then be scheduled and automated (for example with dbt Cloud or one of the orchestrators below)
- Airbyte: Consolidate data from various sources into one warehouse with pre-built connectors for databases, SaaS apps, and more
- Airflow: Programmatically author, schedule, & monitor complex workflows
- See Airflow Ecosystem for options of hosted Airflow (hint, you're probably not going to want to host it yourself unless you want to go real deep into Software Engineering)
- Dagster: Open-source orchestrator for ML, analytics & ETL pipelines, prioritizing code quality & reusability
- See this "Dagster vs. Others" page for their thoughts on when & why to use it: https://dagster.io/vs
- Prefect: Modern workflow management for complex tasks, emphasizing ease of use & reliability. Has dynamic workflow generation & execution - offers both cloud & on-premise options
Why You Need It: Dirty data leads to downstream chaos and bad decisions. Data engineers use these tools to define, test, and monitor data quality standards, ensuring reliable insights
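As a rough picture of what "define and test data quality standards" means, here is a plain-Python sketch of three common checks: not-null, uniqueness, and freshness. The `check_quality` function, the column names, and the one-day freshness threshold are all assumptions for illustration. Data quality tools express the same ideas declaratively and run them against your warehouse.

```python
from datetime import datetime, timedelta


def check_quality(rows, now, max_age=timedelta(days=1)):
    """Return a list of human-readable failures; an empty list means all checks passed."""
    if not rows:
        return ["table is empty"]

    failures = []
    ids = [row["order_id"] for row in rows]

    # Not-null check: every row needs an order_id.
    if any(value is None for value in ids):
        failures.append("order_id contains nulls")

    # Uniqueness check: no order_id may appear twice.
    non_null = [value for value in ids if value is not None]
    if len(non_null) != len(set(non_null)):
        failures.append("order_id contains duplicates")

    # Freshness check: the newest row must be recent enough.
    newest = max(row["loaded_at"] for row in rows)
    if now - newest > max_age:
        failures.append("data is stale")

    return failures


# Hypothetical sample data with one duplicate and one null order_id.
now = datetime(2024, 1, 2, 12, 0)
rows = [
    {"order_id": 1, "loaded_at": datetime(2024, 1, 2, 9, 0)},
    {"order_id": 1, "loaded_at": datetime(2024, 1, 2, 10, 0)},
    {"order_id": None, "loaded_at": datetime(2024, 1, 2, 11, 0)},
]
print(check_quality(rows, now))
```

Returning a list of failures instead of raising on the first one means a single run reports everything that is wrong.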
Tools:
- dbt (Data Build Tool): As mentioned in the "Orchestration" section, dbt also plays a significant role in data quality. dbt's testing enables automatic verification of data integrity and freshness
- Elementary Data's dbt Test Hub: really nice interface for picking tests given a use case (e.g. Anomaly detection, Schema, Freshness, etc)
- Datafold: Track data changes, compare datasets, ensure code changes don't break things
- Also offers free open-source tool data-diff for table comparisons and migration testing
- Great Expectations: Framework for writing assertions about your data, promoting data testing, documentation, and profiling
- Consider dbt-expectations package for deploying GE-like tests directly in dbt
- Soda: Open-source data reliability tool for monitoring, validating, and documenting data quality with SQL or Python DSL (Domain-Specific Language)
- XKCD #1205: Is it worth the time?: As engineers, we tend to want to try new tools and make our own and our teams' lives easier and more efficient. In typical Randall Munroe fashion, this post is a bit tongue-in-cheek, but I've seriously found myself referencing this pretty often for thinking through whether "the juice is worth the squeeze"
- Atlanta Data Community - obviously a local resource, but if you're in some other location and want to contribute, I'm glad to rename the repo and make Atlanta a subfolder! It's on GitHub... fork & PR if you want
- Book, Flow Engineering by Steve Pereira & Andrew Davis: these guys NAILED it with this one. As a Data Engineer, you'll run into lots of change management challenges. This book gives you exact instructions for how to run a session of a couple of hours with a cross-functional team to do "Value Stream Mapping" with a focus on tech/data. Incredible resource 🤯
- The Karen Martin Group has some really great resources on using "Lean" methodologies in an office setting
- Free Webinars: https://tkmg.com/webinars/
- Karen wrote the foreword to the above book, Flow Engineering, and her book The Outstanding Organization is another great resource for systems thinking & operational excellence; chapter 1 is free to download!
- Aimpoint Digital Blog: Learn from market-leading experts at Aimpoint: an analytics, data engineering, operations research, and Artificial Intelligence advisory and solution engineering firm
- FUAAANG data blogs: Meta (Facebook/Insta), Uber, (let me know if you find Apple's), Amazon, AirBnB, Netflix, Google
- Software Engineering Daily: because Data Engineering is a subset of Software Engineering (come at me, bro)
- benn.substack: Benn Stancil has an absolutely unhinged style of writing and keeps up with the data industry, often poking fun at trends he sees. Get ready to click through to 300 links per blog post (they're pretty fun references)
- Data Engineering Weekly: Solid weekly breakdown of top stories & developments in the data / AI space
- Data [SQL] Patterns: Ergest Xheblati has some seriously innovative takes here...from metric trees, to root cause analysis, to data culture transformation
- Monday Morning Data Chat: Authors of "Fundamentals of Data Engineering", Joe Reis & Matt Housley, chat with data practitioners & leaders about all things data
- Data Engineering Podcast: has solid deep dives on DE topics, with lots of open source founder/contributor types
- ML Ops Community Podcast: Most chill tech-y podcast host you'll come across. Demetrios founded the ML Ops Community and this podcast has been amazing for keeping up with developments in the space (some say ML Ops is a subset of Data Engineering - come at me, bro)
- Joe Reis Show
- Wharton Moneyball Podcast
- Experiencing Data with Brian T. O'Neill
- The Changelog: Software Development, Open Source
- Tableau Tim: My colleague at Aimpoint - he puts out some incredible content, and will help you stay up to date with the Tableau & analytics ecosystem
- Andreas Kretz: Founder of learndataengineering.com
- SuperDataBrothers: Two guys who are actually brothers, with an excellent Mario visual theme. They cover trends and tools in the BI space
- Data Engineering Zoomcamp: Free open source data engineering learning, and keeps up pretty well with modern tooling (great place to get your hands on GCP, Docker, Airflow, dbt, dlt, Python, and Kafka)
- awesome-data-engineering
- awesome-dbt
- Awesome Lists generally