TLDR
In this tutorial, we integrate dbt with Mage to create a data pipeline, moving data from a source to a PostgreSQL database and performing SQL transformations through staged models. By setting up Docker and PostgreSQL, and following a step-by-step process, we effectively manage data orchestration and analytics using Mage and dbt.
Outline
Introduction to dbt
Mage quickstart
dbt project setup
Setting up dbt packages
Customize the dbt_project.yml file
Let’s create some data
dbt integration and staging layer
SQL transformations
Conclusion
Introduction to dbt
Mage is a magical data engineering tool, and its integration with data build tool (dbt) makes it even more magical. Using dbt core blocks within a Mage data pipeline allows developers to easily generate SQL transformations and load data into the different schemas of a database. This demonstration requires an understanding of some YAML and other analytics engineering best practices for transforming and loading data.
For this project you will need Docker and a local PostgreSQL database. Your goal is to move data from your source location to the target PostgreSQL database. If you need to stand up the database first, see the sketch below.
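If you do not already have PostgreSQL running locally, a minimal sketch is to start it in a Docker container; the container name, password placeholder, and image tag below are assumptions, so substitute your own values:

# assumed setup: official postgres image, default port 5432
docker run -d --name mage-postgres -p 5432:5432 -e POSTGRES_PASSWORD=<your db password> postgres:16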
Let’s get started on our basic project.
Mage Quickstart
After you install Docker and set up a PostgreSQL database, you can follow the steps below to start a Mage instance in a Docker container.
Run the following command to start your Mage project with dbt integration (the command below is for Mac; see the Mage documentation for other operating systems):
docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start [project_name]
If you are using Docker Desktop, you can open Mage from the GUI by clicking the link in the Ports column; it should look like this: 6789:6789. If you are not using Docker Desktop, open the Mage overview page at http://localhost:6789.
Click the New pipeline button, then click the Standard (batch) option located at the top left of the GUI to begin the pipeline.
Either use the name provided by Mage or delete it and customize a name for your pipeline in the popup screen provided. You can give the pipeline a description if you want.
Hit create and you will be taken to the Pipeline editor GUI in Mage.
dbt project setup
Setting up a dbt project is straightforward in Mage, which prompts you to build your profiles.yml file during setup. First, navigate to the dbt folder within your project and begin the initial setup from there. Follow the steps below:
From the src folder navigate to your dbt folder by running the command below.
cd <your mage project name>/dbt
Once inside the dbt folder, run the command below to create your dbt project. After you run the command, you should see the dbt folder structure populate (a sketch of the typical layout follows the command).
dbt init <dbt project name>
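After dbt init runs, the generated project typically looks something like the sketch below; the exact folders and starter files vary by dbt version:

dbt_tutorial/
├── analyses/
├── macros/
├── models/
│   └── example/
├── seeds/
├── snapshots/
├── tests/
└── dbt_project.yml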
After the folder structure populates, you will be prompted to enter some information in the terminal to build out the profiles.yml file in your dbt folder. The profiles.yml file should look like the code below:
dbt_tutorial:
  outputs:
    dev:
      dbname: <your db name> # your postgres database name, remove < >
      host: host.docker.internal
      pass: <your db password> # your postgres database password, remove < >
      port: 5432
      schema: public
      threads: 1
      type: postgres
      user: postgres
  target: dev
💡 NOTE: Copy the profiles.yml file above and include it in the hierarchy of your dbt project folder by right clicking on the folder name and selecting New file. Name the file profiles.yml.
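Before moving on, you can confirm the profile reaches your database with dbt's built-in connection check; run it from inside your dbt project folder, and it should report that all checks passed:

dbt debug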
Setting Up dbt Packages
While this project does not use any dbt packages, knowing how to install them is necessary if you want to take your new magical powers a step further and build a more detailed project. Use dbt packages in Mage by following the instructions below:
Right click on your dbt project folder (dbt_tutorial) and select New file from the dropdown menu
Name the file packages.yml in the popup menu and click Create file. After this you should see a packages.yml file populate in your dbt_tutorial folder.
Enter the YAML code below to set up the dbt packages for installation
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1
  - package: calogica/dbt_date
    version: 0.10.0
Hover over the right pop-out and select the terminal
From the terminal, navigate to your dbt project folder and run the command below
dbt deps
With a wave of your wand and your magical dbt powers within Mage, you installed two popular dbt dependency packages, dbt_utils and dbt_date (the latter from calogica); a hypothetical example of what these macros enable follows below. Next we will customize the dbt_project.yml file and be ready to start extracting, transforming, and loading data.
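As a hypothetical illustration (this model is not part of the tutorial), the installed packages expose macros you can call inside any dbt model, such as dbt_utils.generate_surrogate_key and dbt_date.today:

-- hypothetical model showing package macros; not used in this tutorial
SELECT
    {{ dbt_utils.generate_surrogate_key(['patient_id', 'dob']) }} AS patient_key
    , {{ dbt_date.today() }} AS load_date
FROM
    {{ ref('bronze_medical') }}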
Customize the dbt_project.yml File
Let's go ahead and create the formulation to generate the dbt_project.yml file. There are a few simple additions you need to wave your magic wand over to move the data into your PostgreSQL database. Follow the formula below for the magic touch:
Click the dbt_project.yml file located in your dbt project folder, the text editor popup should appear.
Edit the file based on the picture below (if you are familiar with YAML and how it interacts with dbt SQL files, you can customize this file to your liking); a sketch of the key additions appears after these steps.
Save the file and close the text editor, and get ready for the real magic to start.
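Since the exact edits live in a screenshot in the original post, here is a hedged sketch of what the models section of dbt_project.yml typically ends up looking like for this tutorial; the bronze/silver/gold keys assume the folders created later, and the materializations match the views and tables described at the end, so adapt it to your own project:

name: 'dbt_tutorial'
version: '1.0.0'
profile: 'dbt_tutorial'   # must match the profile name in profiles.yml

models:
  dbt_tutorial:
    bronze:
      +materialized: view   # staging layer builds as views
    silver:
      +materialized: view   # refined layer builds as views
    gold:
      +materialized: table  # serving layer builds as tables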
Let’s create some data
Tutorials are best when they mimic real-life examples. In the wild, Mage is used to orchestrate data and can perform all the ETL operations. If your company already uses dbt, Mage has you covered with a managed dbt Core integration.
The data we will formulate for this tutorial mimics a medical risk score analysis. Our goal is to conjure up an analytics engineering pipeline that has a staging layer, a refined layer, and a serving layer.
Let's get started:
From the pipeline editor page, select the Data loader block
Choose Python, and then from the expanded choices choose Generic (no template).
Replace the boilerplate code in the Mage Data Loader with the code below. This will create a medical records table that somewhat resembles real life.
import random

import pandas as pd
from faker import Faker

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data(*args, **kwargs):
    """
    Generate a mock medical records dataset.

    Returns:
        A pandas DataFrame of synthetic patient records.
    """
    # Initialize Faker
    fake = Faker()

    # Number of samples to generate
    num_samples = 20000

    # Generate data
    data = {
        'patient_id': [f'PID{str(i).zfill(5)}' for i in range(1, num_samples + 1)],
        'name': [fake.name() for _ in range(num_samples)],
        'date_of_birth': [fake.date_of_birth(minimum_age=18, maximum_age=90).strftime('%Y-%m-%d') for _ in range(num_samples)],
        'diastolic_bp': [random.randint(60, 90) for _ in range(num_samples)],
        'systolic_bp': [random.randint(90, 140) for _ in range(num_samples)],
        'pulse': [random.randint(50, 100) for _ in range(num_samples)],
        'height_in_inches': [round(random.uniform(57, 84), 1) for _ in range(num_samples)],
        'weight_in_lbs': [round(random.uniform(100, 350), 1) for _ in range(num_samples)]
    }

    # Create DataFrame
    df = pd.DataFrame(data)
    return df


@test
def test_output(output, *args) -> None:
    """
    Test the output of the block.
    """
    assert output is not None, 'The output is undefined'
Run the code by hitting the blue right arrow button at the top right of the data loader block. You should see a sample output below the block.
After the block runs, Mage prints out a sample for you to evaluate. You should now have 20,000 rows and 8 columns of data consisting of people's mock personal information and their medical test results.
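If you want the block's @test to assert more than bare existence, a small sketch based on the generator above (20,000 rows and a patient_id column) could look like this:

@test
def test_output(output, *args) -> None:
    # Extended checks based on the generator above: row count and key column
    assert output is not None, 'The output is undefined'
    assert len(output) == 20000, 'Expected 20,000 rows'
    assert 'patient_id' in output.columns, 'Missing patient_id column'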
dbt Integration and Staging Layer
Mage helps manage some of the manual processes dbt Core presents the user with, such as creating the sources.yml file. Let's get started on creating the dbt project folders and files by completing the following steps:
First, create a bronze, silver, and gold folder structure under the models folder within the dbt project folder. You can also go ahead and add the files to the folders.
Create folders and files by right clicking on their respective folder locations; to create a folder select New folder, and to create a new file select New file.
In the popup window give the folder or file a name and click the create button.
From the pipeline editor page click the dbt Model button and then select Single Model or Snapshot (from file)
The popup below will appear; choose the file name under the bronze folder, since that is our staging layer
The pipeline editor page should now have your data loader block and a dbt model block
Next, click the edit parents button located at the top of the block; a pop-out will appear from the right side.
Either click inside your parent block in the dependency tree or write in the name of your block and click save dependencies.
Completing this step will create a mage_sources.yml file in your project folder. To see this file, move your mouse to your file structure's search bar and click the refresh button located to the right of the search bar.
The mage_sources.yml file should look similar to the YAML code below
sources:
- description: Dataframes Mage upstream blocks
  loader: mage
  name: mage_dbt_tutorial
  schema: public
  tables:
  - description: Dataframe for block `get_data` of the `dbt_tutorial` mage pipeline.
    identifier: mage_dbt_tutorial_get_data
    meta:
      block_uuid: get_data
      pipeline_uuid: dbt_tutorial
    name: dbt_tutorial_get_data
version: 2
Begin writing your SQL code in the first Mage dbt block (see example below). The source code comes from the mage_sources.yml file. The first argument within the parentheses is the source's name, the second argument is the table's name.
{{
    config(
        materialized='view'
    )
}}

SELECT
    patient_id  -- kept so downstream models can identify patients
    , name
    , date_of_birth AS dob
    , diastolic_bp
    , systolic_bp
    , pulse
    , height_in_inches AS height
    , weight_in_lbs AS weight
FROM
    {{ source('mage_dbt_tutorial', 'dbt_tutorial_get_data') }}
Execute the code within Mage by hitting the run button at the top of the block or pressing Command + Enter
You should see the logs begin running and a sample of the query results when the block completes
Execute the command below in the terminal
dbt run --select bronze_medical
Executing the dbt run command builds the model and exports it to your target database. Check your PostgreSQL database and you will see a view in the Public schema called bronze_medical, or whatever name you gave your first model.
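To spot-check the result directly, you could run a quick query in psql or any SQL client; this assumes the public schema from profiles.yml and the bronze_medical model name:

-- quick sanity check on the newly built view
SELECT * FROM public.bronze_medical LIMIT 5;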
SQL Transformations
You just landed your data as a staging model in your bronze folder. Each time we run the dbt files, this staging model is pushed to your PostgreSQL database as well. But first, let's run some SQL transformations on the project.
Our magical medical office gave us a requirement that we should remove people's names and dates of birth, but generate their age and include it as a column in the model. For the silver model file, complete the following:
Write the following SQL code to generate the age of a person, while removing their name and date of birth from the data:
{{
    config(
        materialized='view'
    )
}}

WITH bronze_silver AS (
    SELECT
        *
        -- cast dob in case it was loaded as text
        , FLOOR(EXTRACT(YEAR FROM AGE(CURRENT_DATE, dob::date))) AS age
    FROM
        {{ ref('bronze_medical') }}
)

SELECT
    patient_id
    , age
    , diastolic_bp
    , systolic_bp
    , pulse
    , height
    , weight
FROM
    bronze_silver
Execute the dbt run command to build the silver model
dbt run --select silver_medical
Next, the medical analysis team gave us some business rules about how they score patients' heart health risk. The following requirements help patients and doctors monitor the risk of heart issues:
Develop a score rating for age, diastolic_bp, systolic_bp, pulse, and bmi
Sum the scores in a new column
Develop a rating for low, medium, and high risk
Make sure to include the business rules in the gold model by taking the following steps:
Write your SQL code as shown below with the rating system, then compile and run the model.
WITH patient_bmi AS (
    SELECT
        patient_id
        , age
        , diastolic_bp
        , systolic_bp
        , pulse
        , height
        , weight
        , ROUND(CAST((weight / (height * height) * 703) AS numeric), 2) AS bmi
    FROM
        {{ ref('silver_medical') }}
),

risk_score_calculation AS (
    SELECT
        *
        -- Calculate age score (0 = bad, 1 = medium, 2 = good)
        , CASE
            WHEN age < 40 THEN 2
            WHEN age BETWEEN 40 AND 60 THEN 1
            ELSE 0
        END AS age_score
        -- Calculate diastolic_bp score (0 = bad, 2 = good)
        , CASE
            WHEN diastolic_bp < 60 OR diastolic_bp > 80 THEN 0
            ELSE 2
        END AS diastolic_bp_score
        -- Calculate systolic_bp score (0 = bad, 2 = good)
        , CASE
            WHEN systolic_bp < 90 OR systolic_bp > 120 THEN 0
            ELSE 2
        END AS systolic_bp_score
        -- Calculate pulse score (0 = bad, 2 = good)
        , CASE
            WHEN pulse < 60 OR pulse > 100 THEN 0
            ELSE 2
        END AS pulse_score
        -- Calculate bmi score (0 = bad, 1 = medium, 2 = good)
        , CASE
            WHEN bmi < 18.5 OR bmi >= 30 THEN 0
            WHEN bmi >= 25 THEN 1  -- covers 25 up to (but not including) 30
            ELSE 2
        END AS bmi_score
    FROM
        patient_bmi
),

total_risk_score AS (
    SELECT
        *
        , age_score + diastolic_bp_score + systolic_bp_score + pulse_score + bmi_score AS cardiac_risk_score
    FROM
        risk_score_calculation
)

SELECT
    *
    , CASE
        WHEN cardiac_risk_score <= 5 THEN 'high risk'
        WHEN cardiac_risk_score BETWEEN 6 AND 8 THEN 'medium risk'
        ELSE 'low risk'
    END AS cardiac_risk
FROM
    total_risk_score
Execute the dbt run command to run all models
dbt run
This command runs all the models. In your PostgreSQL database you should see the bronze and silver models in the Views dropdown and the gold model in the Tables dropdown. Your data pipeline should look similar to the picture below.
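As a final spot check, you could summarize the risk distribution directly in PostgreSQL; this assumes you named the gold model gold_medical, so substitute your own model name:

-- distribution of patients across risk buckets
SELECT
    cardiac_risk
    , COUNT(*) AS patients
FROM
    public.gold_medical
GROUP BY
    cardiac_risk
ORDER BY
    patients DESC;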
Conclusion
In this tutorial, we successfully integrated Mage, a powerful data engineering tool, with dbt to create an efficient data pipeline. We set up a development environment using Docker and PostgreSQL, initialized a new Mage project, and configured dbt for SQL transformations. By generating mock medical data and applying transformations through a structured bronze, silver, and gold model approach, we effectively demonstrated the capabilities of both Mage and dbt in managing and transforming data. This integration simplifies the orchestration of complex data workflows, enhancing your ability to perform advanced analytics and data-driven decision-making.