(To perform this lab using SageMaker Notebooks and Dev Endpoints instead, go here)
A Glue Development Endpoint is an environment for you to develop and test your Glue scripts and jobs. Configuring a Development Endpoint spins up the necessary network resources and machines to simplify ETL scripting with AWS resources in a VPC.
In this lab, you will be joining two separate datasets: the raw dataset delivered by Firehose and a manually uploaded reference dataset. The raw dataset contains a list of tracks, devices, and activities from Firehose. The reference dataset contains a list of tracks, track titles, and artist names.
You will be using Glue to perform basic transformations such as filtering and joining.
In this step, you will manually upload a JSON reference file and crawl it into a new Glue dataset.
- Open your S3 Bucket YOUR_USERNAME-datalake-demo-bucket: https://s3.console.aws.amazon.com/s3/home?region=us-east-1#
- Open the subfolder data, and create a subfolder called reference_data. Your bucket should look like this:

```
YOUR_USERNAME-datalake-demo-bucket
│
├── data/
│   ├── raw/
│   └── reference_data/
│
└── (..other project assets: code etc.)
```
- Download the following file tracks_list.json, and upload it into the reference_data/ folder.
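If you prefer to script this step, here is a minimal sketch using boto3 (the AWS SDK for Python); the bucket name and local file path are placeholders for your own values.

```python
import boto3

s3 = boto3.client("s3")

# Placeholders: substitute your own bucket name and local download location.
bucket = "YOUR_USERNAME-datalake-demo-bucket"
local_file = "tracks_list.json"

# Uploading under the data/reference_data/ prefix implicitly "creates" the folder,
# since S3 folders are simply key prefixes.
s3.upload_file(local_file, bucket, "data/reference_data/tracks_list.json")
```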
- Open the Glue crawler console. Select the crawler you have created, CrawlDataFromKDG, and Run crawler.
- The crawler picks up new data in the S3 bucket and automatically creates new tables in the database.
- Notice how this creates two new Glue tables for raw and reference_data.
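If you'd like to drive this step from code instead of the console, a minimal sketch using boto3 is shown below; the database name analyticsworkshopdb is an assumption and should be replaced with whichever Glue database your crawler writes to.

```python
import boto3

glue = boto3.client("glue")

# Start the crawler created earlier in the lab.
glue.start_crawler(Name="CrawlDataFromKDG")

# After the crawler finishes, list the tables it created.
# "analyticsworkshopdb" is a placeholder database name.
tables = glue.get_tables(DatabaseName="analyticsworkshopdb")
for table in tables["TableList"]:
    print(table["Name"])
```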
In this step, you will be creating a transformation to join and filter the raw and reference data using Glue ETL Jobs.
- Go to the jobs tab in Glue ETL: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=jobs
- Click on Add job
- Create a Job with the following properties:
- Name: JoinRawWithReference
- IAM role: AWSGlueServiceRoleLab
- Type: Spark
- Glue version: Spark 2.4, Python 3 (Glue 1.0)
- This job runs: A proposed script generated by AWS Glue
- Script file name: join_data.py
- S3 path where the script is stored: s3://<YOUR_USERNAME>-datalake-demo-bucket/jobs/
- Temporary directory: s3://<YOUR_USERNAME>-datalake-demo-bucket/tmp/
- Monitoring options: ☑️ Job metrics, Continuous logging and Spark UI
- Amazon S3 prefix for Spark event logs: s3://<YOUR_USERNAME>-datalake-demo-bucket/logs
- Security configuration, script libraries and job parameters:
  - Maximum capacity: 2
- Click on Next
- Choose a data source: raw
- Choose a transform type: Change schema
- Choose a data target:
- Create tables in your data target
- Data store: Amazon S3
- Format: Parquet
- Target path: s3://<YOUR_USERNAME>-datalake-demo-bucket/data/processed/
- Map the source columns to target columns
- Delete ❌ the columns for partition_0, partition_1, partition_2, partition_3
- Add column: add columns for track_name and artist_name. Select string as the column type.
- Save job and edit script
- You are now taken to the template builder for your job. In this workspace, you can use the options on the top right (Source, Target, Target Location, Transform and Spigot) to build your ETL Job.
- View the contents of the generated Python script.
- View the generated diagram on the left of the code panel. This diagram is generated using annotations in the script. Click on each step of the workflow to see the corresponding annotation.
- You can use the following steps to update your script. Alternatively, you can copy the following code snippet join.py to replace the script.
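For orientation, below is a minimal sketch of what such a join script might look like, assuming the Glue database is named analyticsworkshopdb and that the raw and reference_data tables share a track_id column; prefer the generated script (or the provided join.py) where they differ.

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import DropFields, Join
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Load both tables from the Glue Data Catalog.
# "analyticsworkshopdb" is a placeholder database name.
raw = glueContext.create_dynamic_frame.from_catalog(
    database="analyticsworkshopdb", table_name="raw", transformation_ctx="raw"
)
reference = glueContext.create_dynamic_frame.from_catalog(
    database="analyticsworkshopdb", table_name="reference_data", transformation_ctx="reference"
)

# Join raw events with the reference data; "track_id" is an assumed shared key.
joined = Join.apply(raw, reference, "track_id", "track_id")

# Drop the partition columns that were removed in the column mapping step.
cleaned = DropFields.apply(
    frame=joined,
    paths=["partition_0", "partition_1", "partition_2", "partition_3"],
)

# Write the joined result as Parquet to the processed data location.
glueContext.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://<YOUR_USERNAME>-datalake-demo-bucket/data/processed/"},
    format="parquet",
    transformation_ctx="sink",
)

job.commit()
```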
- Click on Generate diagram to verify the workflow. You should see a similar diagram:
- Click on Save and Run job.
- Review and override the job parameters if needed. In this demo, we can skip this and Run job immediately.
- Observe the Progress bar and logs for changes.
- The job may take approximately 8 minutes to complete.
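If you prefer to monitor the run from code rather than the console, here is a small polling sketch with boto3; it starts a run itself, but you could equally take the run id from the job's run history in the console.

```python
import time

import boto3

glue = boto3.client("glue")

# Start a run of the job (equivalent to clicking "Run job" in the console).
run_id = glue.start_job_run(JobName="JoinRawWithReference")["JobRunId"]

# Poll until the run reaches a terminal state.
while True:
    state = glue.get_job_run(JobName="JoinRawWithReference", RunId=run_id)["JobRun"]["JobRunState"]
    print(state)
    if state in ("SUCCEEDED", "FAILED", "STOPPED", "TIMEOUT"):
        break
    time.sleep(30)
```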
- After the job has completed, we can add the processed data to our Glue Data Catalog by re-running our CrawlDataFromKDG crawler:
  - Go to Crawlers. Check CrawlDataFromKDG and Run crawler.
  - Go to Tables. You should now have a new table processed with the correct S3 location and classification: parquet.
Once the ETL script has run successfully, you can inspect the output of the Glue job:
- Look into your S3 Bucket: YOUR_USERNAME-datalake-demo-bucket/data/processed_data
- Inspect the new Glue table processed_data using Athena.
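To query the new table programmatically rather than through the Athena console, you could use a sketch like the one below; the database name and the query-results location are placeholders for your own values.

```python
import boto3

athena = boto3.client("athena")

# Placeholders: the Glue database your crawler populates and an S3 location
# where Athena may write query results.
response = athena.start_query_execution(
    QueryString="SELECT * FROM processed_data LIMIT 10",
    QueryExecutionContext={"Database": "analyticsworkshopdb"},
    ResultConfiguration={"OutputLocation": "s3://<YOUR_USERNAME>-datalake-demo-bucket/athena-results/"},
)
print("Query execution id:", response["QueryExecutionId"])
```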
In this step, we will create a Workflow to execute our Job.
- Go to Glue Workflows: https://console.aws.amazon.com/glue/home?region=us-east-1#etl:tab=workflows and Add workflow
- Use a workflow name JoinProcessedDataHourly and Add workflow
- Select your workflow JoinProcessedDataHourly and look at the workspace in the bottom panel.
- Click on Add trigger
- We must first add a trigger to start the job. We can create and reuse triggers for different Workflows.
- Click on Add new
- Fill in the following details for a trigger:
- Name: Hourly
- Trigger type: Schedule
- Frequency: Hourly
- Start minute: 00 (or any nearest minute you'd like to observe)
- Click on Add node
  - Select your Job: JoinRawWithReference
- You are done! Wait for the time to pass until your scheduled minute arrives.
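As an aside, the same workflow and hourly trigger could also be created with boto3 instead of the console; the sketch below is a rough equivalent, with the cron expression firing at minute 00 of every hour to match the trigger configured above.

```python
import boto3

glue = boto3.client("glue")

# Create the workflow container.
glue.create_workflow(Name="JoinProcessedDataHourly")

# Attach a scheduled trigger that starts the join job at minute 00 of every hour.
glue.create_trigger(
    Name="Hourly",
    WorkflowName="JoinProcessedDataHourly",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"JobName": "JoinRawWithReference"}],
    StartOnCreation=True,
)
```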
- Explore more built-in transformations provided by Glue: Built-in Transforms
- Explore other Glue configurations such as Job Bookmarks, which persist job state information between runs.
- Explore the Glue developer guide here: PDF