Getting Started with Data Engineering

Brent Brewington edited this page Jun 27, 2024 · 15 revisions

📘 Intro

Welcome to the "Getting Started in Data Engineering" wiki! This guide is designed to equip aspiring data engineers with the knowledge and tools they need to get started in this exciting field. Whether you're a complete beginner or looking to brush up on your skills, this wiki will provide you with a solid foundation.

If you haven't read it yet, I'd start with Joe Reis & Matt Housley's book Fundamentals of Data Engineering. It tends to create lightbulb moments about how everything fits together.

Tip

Just a shameless plug, by the way: If any of this piques your interest, or if you're looking for outside help on this topic, then head over to Aimpoint Digital's Data Engineering & Infrastructure page. Drop us a line! We'd love to help guide you in your data engineering journey!

🛠️ Languages & Tools

💾 SQL

Why You Need It: SQL is the foundational language for interacting with relational databases, the core systems for storing and organizing structured data. Data engineers use SQL to query, extract, and manipulate data for tasks like building pipelines and populating data warehouses.

Step 1: Learn SQL

Step 2: Try using SQL:

  • BigQuery – has a pretty generous free tier and lots of public datasets to query
  • Snowflake – has a 14-day free trial
  • Databricks – has a 14-day free trial
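
If you want to practice before signing up for anything, here's a minimal sketch that runs SQL locally through Python's built-in sqlite3 module. The `orders` table, its columns, and the `top_customers` helper are made up for illustration; they aren't from any of the platforms above.

```python
import sqlite3


def top_customers(conn, min_total):
    """Return (customer, total) rows whose summed order amount is at least min_total."""
    query = """
        SELECT customer, SUM(amount) AS total
        FROM orders
        GROUP BY customer
        HAVING SUM(amount) >= ?
        ORDER BY total DESC
    """
    return conn.execute(query, (min_total,)).fetchall()


# In-memory database with a hypothetical `orders` table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("acme", 120.0), ("acme", 80.0), ("globex", 50.0), ("initech", 300.0)],
)

print(top_customers(conn, 100))
# → [('initech', 300.0), ('acme', 200.0)]
```

The SELECT / GROUP BY / HAVING shape carries over to warehouse SQL dialects, though details like parameter placeholders differ from one engine to the next.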

🐍 Python

Why You Need It: Python is a versatile language widely used in data engineering for data cleaning, transformation, scripting, and automation. Its vast ecosystem of libraries like pandas, NumPy, and scikit-learn simplifies data analysis and manipulation. Also, pytest is an awesome framework for unit testing. You should be writing unit tests.
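
To make the unit testing point concrete, here's a small sketch of a cleaning step with pytest-style tests next to it. The `normalize_email` and `dedupe_emails` functions are hypothetical examples, not part of any library.

```python
def normalize_email(raw):
    """Trim whitespace and lowercase an email; return None for missing or blank input."""
    if raw is None:
        return None
    cleaned = raw.strip().lower()
    return cleaned or None


def dedupe_emails(raw_emails):
    """Normalize emails, drop blanks, and keep the first occurrence of each."""
    seen = set()
    result = []
    for raw in raw_emails:
        email = normalize_email(raw)
        if email is not None and email not in seen:
            seen.add(email)
            result.append(email)
    return result


# pytest collects functions whose names start with `test_`.
def test_normalize_email_strips_and_lowercases():
    assert normalize_email("  Ada@Example.COM ") == "ada@example.com"


def test_normalize_email_blank_is_none():
    assert normalize_email("   ") is None


def test_dedupe_emails_keeps_first_occurrence():
    raw = ["a@x.com", " A@X.com", None, "b@x.com"]
    assert dedupe_emails(raw) == ["a@x.com", "b@x.com"]
```

Assuming you save this as something like `test_cleaning.py` (the filename is just an example), running `pytest test_cleaning.py` should pick up and run the three tests.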

🎼 Orchestration

Why You Need It: Data pipelines involve multiple steps and data sources. Orchestration tools help visualize, schedule, and automate these workflows, ensuring tasks run in the right order and at the right time - leading to higher trust in the end product. Data Engineering pipelines tend to be "DAGs" (Directed Acyclic Graphs) - basically a left-to-right branching flow that never loops back on itself. Orchestrators manage DAGs to make sure every task runs successfully and in order.

  • dbt (Data Build Tool): Write SQL transformations, test data integrity, and document processes. dbt runs your SQL models in dependency order; scheduling and automation come from dbt Cloud or an external orchestrator
  • Airbyte: Consolidate data from various sources into one warehouse with pre-built connectors for databases, SaaS apps, and more
  • Airflow: Programmatically author, schedule, & monitor complex workflows
    • See Airflow Ecosystem for hosted Airflow options (hint: you're probably not going to want to host it yourself unless you want to go real deep into Software Engineering)
  • Dagster: Open-source orchestrator for ML, analytics & ETL pipelines, prioritizing code quality & reusability
  • Prefect: Modern workflow management for complex tasks, emphasizing ease of use & reliability. Supports dynamic workflow generation & execution, and offers both cloud & on-premises options
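
To get a feel for what "running a DAG in order" means, here's a toy sketch using the standard library's graphlib module (Python 3.9+). The task names and the `run_order` helper are invented for illustration, and this is not how any of the tools above are implemented; real orchestrators add scheduling, retries, and monitoring on top.

```python
import graphlib  # standard library, Python 3.9+

# Each key is a task; its value is the set of tasks that must finish first.
pipeline = {
    "extract_orders": set(),
    "extract_customers": set(),
    "join_orders_customers": {"extract_orders", "extract_customers"},
    "build_report": {"join_orders_customers"},
}


def run_order(dag):
    """Return one valid execution order; raises graphlib.CycleError if the graph has a cycle."""
    return list(graphlib.TopologicalSorter(dag).static_order())


order = run_order(pipeline)
print(order)  # both extract tasks come first, then the join, then the report
```

The "acyclic" part matters: if two tasks depend on each other, there is no valid order, and `run_order` raises an error instead of returning one.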

🧹 Data Quality

Why You Need It: Dirty data leads to downstream chaos and bad decisions. Data engineers use these tools to define, test, and monitor data quality standards, ensuring reliable insights

Tools:

  • dbt (Data Build Tool): As mentioned in the "Orchestration" section, dbt also plays a significant role in data quality. dbt's testing enables automatic verification of data integrity and freshness
  • Datafold: Track data changes, compare datasets, ensure code changes don't break things
    • Also offers data-diff, a free open-source tool for table comparisons and migration testing
  • Great Expectations: Framework for writing assertions about your data, promoting data testing, documentation, and profiling
  • Soda: Open-source data reliability tool for monitoring, validating, and documenting data quality with SQL or Python DSL (Domain-Specific Language)
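
Under the hood, most of these tools boil down to running assertions against your data. Here's a bare-bones sketch of two common checks (not-null and uniqueness) in plain Python. The function names and sample rows are hypothetical, and this is not the API of dbt, Great Expectations, Soda, or Datafold.

```python
def check_not_null(rows, column):
    """Return the rows where `column` is missing or None."""
    return [r for r in rows if r.get(column) is None]


def check_unique(rows, column):
    """Return the sorted non-null values of `column` that appear more than once."""
    seen, dupes = set(), set()
    for r in rows:
        value = r.get(column)
        if value is None:
            continue  # nulls are the not-null check's job
        if value in seen:
            dupes.add(value)
        seen.add(value)
    return sorted(dupes)


# Hypothetical sample data with one null customer and one duplicated order_id.
rows = [
    {"order_id": 1, "customer": "acme"},
    {"order_id": 2, "customer": None},
    {"order_id": 2, "customer": "globex"},
]

failures = {
    "customer_not_null": check_not_null(rows, "customer"),
    "order_id_unique": check_unique(rows, "order_id"),
}
print(failures)
# → {'customer_not_null': [{'order_id': 2, 'customer': None}], 'order_id_unique': [2]}
```

The dedicated tools earn their keep by running checks like these on a schedule, against the warehouse, with alerting and documentation attached.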

📝 Other Notes

  • XKCD #1205: Is It Worth the Time?: As engineers, we tend to want to try new tools and make our own and our teams' lives easier and more efficient. In typical Randall Munroe fashion, this comic is a bit tongue-in-cheek, but I seriously find myself referencing it pretty often when thinking through whether "the juice is worth the squeeze"
  • Atlanta Data Community – obviously a local resource, but if you're in some other location and want to contribute, I'm glad to rename the repo and make Atlanta a subfolder! It's on GitHub...fork & PR if you want 😁
  • Book, Flow Engineering by Steve Pereira & Andrew Davis: these guys NAILED it with this one. As a Data Engineer, you'll run into lots of change management challenges. This book gives you exact instructions for how to run a couple-hour session with a cross-functional team to do "Value Stream Mapping" with a focus on tech/data. Incredible resource 🤯
  • The Karen Martin Group has some really great resources on using "Lean" methodologies in an office setting

🚀 List of Content Creators to Follow

📰 Blogs & Newsletters

  • Aimpoint Digital Blog: Learn from market-leading experts at Aimpoint: an analytics, data engineering, operations research, and Artificial Intelligence advisory and solution engineering firm
  • FUAAANG data blogs: Meta (Facebook/Insta), Uber, (let me know if you find Apple's), Amazon, AirBnB, Netflix, Google
  • Software Engineering Daily: because Data Engineering is a subset of Software Engineering (come at me, bro)
  • benn.substack: Benn Stancil has an absolutely unhinged style of writing, and keeps up with the data industry, often poking fun at trends he sees. Get ready to click through to 300 links per blog post (they're pretty fun references)
  • Data Engineering Weekly: Solid weekly breakdown of top stories & developments in the data / AI space
  • Data [SQL] Patterns: Ergest Xheblati has some seriously innovative takes here...from metric trees, to root cause analysis, to data culture transformation

🎧 Podcasts

🎥 YouTube

  • Tableau Tim: My colleague at Aimpoint - he puts out some incredible content, and will help you stay up to date with the Tableau & analytics ecosystem
  • Andreas Kretz: Founder of learndataengineering.com
  • SuperDataBrothers: Two guys who are actually brothers, with an excellent Mario visual theme. They cover trends and tools in the BI space

:octocat: GitHub