This project implements concepts learned from the Data Engineering Zoomcamp course by building a pipeline that ingests data, stores it in a data lake, and processes it into a data warehouse.
The data used in this project comes from FRED Economic Data. The pipeline calls the FRED API with Python, orchestrated by Prefect, stores the raw data in a Google Cloud Storage bucket, uses DBT to transform it into BigQuery, and connects to Looker Studio to find insights in the data.
FRED Economic Data hosts many datasets. I want to create a pipeline that ingests these datasets based on the chosen topics and analyzes them to find any interesting trends. The data is updated at different frequencies (monthly, quarterly, yearly, etc.), and some series go back to the 1990s.
In this project, I aim to create a pipeline to ingest the above data and create reports that show
- The trend of the data
- The breakdown of the data by country
This project uses the tools below.
- Prefect as the data orchestration tool
- DBT Core as the data transformation tool for the data warehouse
- Terraform as infrastructure setup and management
- Docker for hosting Prefect Agent
- Google Cloud Storage Bucket as a data lake.
- Google BigQuery as a Data warehouse
- Google Compute Engine as VM to host the pipeline
- Looker Studio for report and visualization
- Makefile for ease of reproducibility
The data flow for this project:
The data is pulled from the FRED API.
We first need to get the category IDs for all the topics. FRED does not have an endpoint that lists all category IDs, so we scrape them from the Category page using this Python script.
Checking FRED's robots.txt shows that scraping this page is not disallowed.
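A minimal sketch of what that scraping step can look like, assuming the public category listing page and links of the form `/categories/<id>`; the actual script in this repo may differ:

```python
# Sketch: collect FRED category IDs by scraping the public Categories page.
# The URL and link pattern are assumptions; adjust them to match the real page.
import re
import requests
from bs4 import BeautifulSoup

resp = requests.get("https://fred.stlouisfed.org/categories", timeout=30)
resp.raise_for_status()

soup = BeautifulSoup(resp.text, "html.parser")
category_ids = set()
for link in soup.find_all("a", href=True):
    match = re.match(r"^/categories/(\d+)", link["href"])
    if match:
        category_ids.add(int(match.group(1)))

print(sorted(category_ids))
```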
After running the script to get the category IDs, we can use each ID to call the Category Series endpoint to get all the series associated with that category.
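A minimal sketch of that call, assuming the standard `fred/category/series` endpoint and an API key provided through a `FRED_API_KEY` environment variable:

```python
# Sketch: list all series under a FRED category via the Category Series endpoint.
import os
import requests

FRED_API_KEY = os.environ["FRED_API_KEY"]  # assumption: the key is supplied via the environment

def get_category_series(category_id: int) -> list[dict]:
    resp = requests.get(
        "https://api.stlouisfed.org/fred/category/series",
        params={
            "category_id": category_id,
            "api_key": FRED_API_KEY,
            "file_type": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["seriess"]  # FRED returns the series list under the "seriess" key

print([s["id"] for s in get_category_series(125)])  # 125 is the example category ID from the FRED docs
```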
After we get the series IDs, we can call the Maps API - Series Group Info endpoint to get the group ID of each series. Not all series have a group ID, and there is no easy way to tell in advance which ones do. To reduce the complexity, I have extracted the series that have a group ID and saved them in this CSV.
This file uses an Active column to indicate which series should be called daily by the script, and it also records the frequency of each series along with the date the data starts and the current data date.
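A minimal sketch of how the script can use that file; the file name and column names (Active, series_group_id, frequency) here are assumptions, so match them to the real CSV:

```python
# Sketch: load the series-group CSV and keep only the rows flagged as Active.
# File name and column names are assumptions for illustration.
import pandas as pd

series_df = pd.read_csv("fred_series_group.csv")
active_df = series_df[series_df["Active"] == 1]  # assumes a 1/0 flag column

for row in active_df.itertuples():
    print(row.series_group_id, row.frequency)
```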
Then we use the series group ID to call the Maps API - Series Data endpoint, which returns the data by country for the ID we requested.
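A minimal sketch of such a call, assuming the GeoFRED-style regional data endpoint; the exact endpoint path, parameter names, and values below are assumptions based on the public FRED Maps API docs and may differ from the script in this repo:

```python
# Sketch: fetch cross-country data from the FRED Maps (GeoFRED) API for one series group.
# Endpoint path and parameter values are assumptions; check the FRED Maps API docs.
import os
import requests

FRED_API_KEY = os.environ["FRED_API_KEY"]

def get_regional_data(series_group: str, date: str) -> dict:
    resp = requests.get(
        "https://api.stlouisfed.org/geofred/regional/data",
        params={
            "series_group": series_group,
            "region_type": "country",
            "date": date,
            "units": "lin",        # placeholder; must match the series group's units
            "frequency": "a",      # placeholder frequency
            "season": "NSA",
            "api_key": FRED_API_KEY,
            "file_type": "json",
        },
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

data = get_regional_data("1223", "2022-01-01")  # hypothetical series group ID and date
```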
The data is stored in Google Cloud Storage, with the most recent data kept in a staging folder.
When new data is pulled, the previous data is moved to an archive folder to keep a log, and it is deleted after 30 days.
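A minimal sketch of that staging-to-archive move, assuming hypothetical bucket and prefix names; the 30-day cleanup itself can be handled by a GCS lifecycle rule on the archive prefix:

```python
# Sketch: move the previous staging objects into archive/ before writing new data.
# Bucket name and prefixes are assumptions; the real project may organise paths differently.
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("fred-data-lake")  # hypothetical bucket name

for blob in list(client.list_blobs(bucket, prefix="staging/")):
    archive_name = blob.name.replace("staging/", "archive/", 1)
    bucket.copy_blob(blob, bucket, archive_name)  # copy into the archive folder
    blob.delete()                                 # remove the original staging object

# A bucket lifecycle rule (delete objects under archive/ older than 30 days) can handle the cleanup.
```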
We use Google BigQuery to connect to the data through external table data sources; this connection is defined in the Terraform files. The data is separated into two datasets, one for development and one for production.
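In this project the external tables are declared in Terraform; purely as an illustration of what that connection amounts to, here is a hedged Python sketch using the BigQuery client, with hypothetical project, dataset, table, and bucket names:

```python
# Sketch: define a BigQuery external table over the files in the GCS staging folder.
# All names here are hypothetical; the project itself declares this in Terraform instead.
from google.cloud import bigquery

client = bigquery.Client()

external_config = bigquery.ExternalConfig("CSV")
external_config.source_uris = ["gs://fred-data-lake/staging/*.csv"]
external_config.autodetect = True  # let BigQuery infer the schema from the files

table = bigquery.Table("my-project.fred_dev.fred_map_data")
table.external_data_configuration = external_config
client.create_table(table, exists_ok=True)
```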
This project uses DBT for data transformation. The models are separated into two stages, staging and core: the staging models cast the data into the correct types, and the core models join all the tables from staging and from the seed into one table.
Looker Studio connects to the BigQuery data warehouse to create reports for finding trends and insights.
Due to the trial account, the above link may no longer work.
If you can't access the link, please use this video to see the dashboard instead.
Prefect is used as the orchestration tool to schedule the daily run of our ingestion script and the data transformation, using its deployment functionality.
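A minimal sketch of such a deployment, assuming a Prefect 2.x version and hypothetical flow, deployment, and queue names; the actual deployments are created by the make deployment-* targets described below:

```python
# Sketch: build a scheduled Prefect 2.x deployment for an ingestion flow.
# Flow, deployment, and queue names are assumptions for illustration.
from prefect import flow
from prefect.deployments import Deployment
from prefect.server.schemas.schedules import CronSchedule

@flow(log_prints=True)
def fred_ingest():
    print("Call the FRED API and load the results into the data lake here.")

deployment = Deployment.build_from_flow(
    flow=fred_ingest,
    name="fred-ingest",
    schedule=CronSchedule(cron="0 0 * * *"),  # daily at midnight; adjust to the real cadence
    work_queue_name="default",
)

if __name__ == "__main__":
    deployment.apply()  # register the deployment with the Prefect API
```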
Prerequisite:
To reproduce this project you will need the accounts below:
- Google Cloud Account And Service Account
- Prefect Cloud Account And API Key
- FRED Account And API Key
You also need the packages below:
- Makefile
pip install make
- Gcloud CLI
pip install gcloud
- Terraform
- DotEnv
pip install python-dotenv
Clone this project
git clone https://github.com/Chalermdej-l/Final_Project_FredETE
Access the cloned directory
cd Final_Project_FredETE
Input the credentials created in the Prerequisite step into the .env file.
Place the credential.json created from your Google Cloud service account into the cred folder.
Run one of the following commands with Make, depending on your system:
make update-yml-window
make update-yml-linix
This command will populate the credentials in the YAML files using the values provided in the .env file.
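A minimal sketch of what that population step can look like, assuming python-dotenv and a simple placeholder-replacement approach; the placeholder convention, variable names, and file names are assumptions, and the real Makefile target may work differently:

```python
# Sketch: fill placeholders in a YAML template with values from the .env file.
# The placeholder convention, variable names, and file paths are assumptions.
from dotenv import dotenv_values

env = dotenv_values(".env")  # e.g. {"FRED_API_KEY": "...", "PREFECT_API_KEY": "..."}

with open("config-template.yml") as src:
    content = src.read()

for key, value in env.items():
    content = content.replace(f"<{key}>", value or "")

with open("config.yml", "w") as dst:
    dst.write(content)
```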
!! Please note: if you ran this command before inputting the credentials, please re-clone the project, as the YAML file will have been populated with incorrect data.
Next, let's set up the infrastructure:
make infra-setup
This command will set up Terraform and run a plan to check for any errors.
To create the resources, please run:
make infra-create
This will create the BigQuery datasets, the Google Cloud Storage bucket, and the VM instance.
Once the command finishes, go to the VM instance and copy its external IP.
Input the external IP into the .env file; we will need it to connect to the VM.
Open a new terminal and navigate to the cloned directory; we will use this terminal to connect to the VM we created. Then run the command below:
make vm-connect
This script will connect to the VM. You may be asked whether to save the host to the known hosts; please select yes.
Once inside the VM, please clone the repository again:
git clone https://github.com/Chalermdej-l/Final_Project_FredETE
Then navigate to the cloned folder:
cd Final_Project_FredETE
Run the commands below to install pip, Make, and dotenv:
sudo apt-get update -y
sudo apt install python3-pip -y
sudo pip install make python-dotenv
Then go back to the local terminal and run:
make vm-copycred
This will copy the credentials we input in .env and the downloaded credential.json to the VM.
Go back to the VM terminal and run the command below to set up the credentials:
make update-yml-linix
!! Please note: if you ran this command before copying the credentials over, please re-clone the project in the VM and start again from the previous step.
Then run the commands below to install Docker and Docker Compose:
make vm-setup
sudo curl -L "https://github.com/docker/compose/releases/download/1.29.2/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
docker-compose --version
If docker-compose installed correctly, the current version will be printed.
After it finishes, run the command below to build the Docker image:
make docker-build
Then run this command to spin up the container, which will host our Prefect agent in the background:
make docker-up
Once Docker is running, run this command to deploy our Python script to Prefect with a monthly schedule:
make deployment-create
This command will deploy the script that runs DBT to transform our data warehouse, also on a monthly schedule, as the data is only updated monthly:
make deployment-dbtprod
Then please run the command below to set up the data needed by the script:
make dbt-ingest
If you run this command and receive the error [Errno 2] No such file or directory: 'cred/credential.json', run the command below:
nano profiles.yml
Then change the keyfile to ../cred/credential.json and re-run the above command.
This will run the DBT seed and set up the data for the script to run
Please go to Prefect and run the jobs in the order below to start ingesting the data:
- Fred-Category
- Fred-Series
- Fred-MapAPI
- DBT-Daily-prod
After all the jobs have finished, the data will be ingested into BigQuery.
When you are done with the project, you can remove all of its resources by running the command below in the local terminal:
make infra-down
This command will run terraform destroy on all the resources we created in this project.
- Implement CI/CD
- Explore other endpoints of the API