Developing business intelligence datasets with Mage part 1: Mastering medallion architecture

First published on July 31, 2024

 

12 minute read

Cole Freeman

TLDR

This article explains how to implement medallion architecture using Mage, a data pipeline tool. Medallion architecture organizes data into three layers: Bronze (raw), Silver (cleaned), and Gold (analytics-ready). The tutorial guides readers through setting up Mage, generating sample medical data, and building each layer using SQL transformations, demonstrating how to progress from raw data ingestion to creating analytics-ready datasets.

Outline

  • 1.0 Medallion architecture: an overview

  • 2.0 The three layers of medallion architecture

  • 2.1 Bronze layer — raw data

  • 2.2 Silver layer — cleansed data

  • 2.3 Gold layer — analytics and serving data

  • 3.0 Medallion architecture tutorial

  • 3.1 Mage quickstart

  • 3.2 Create some secrets

  • 3.3 Configure your Mage project

  • 3.4 Generate data

  • 3.5 Building the bronze layer

  • 3.6 Building the silver layer

  • 3.7 Building the gold layer

  • 4.0 Conclusion

1.0 Medallion architecture: an overview

In the ever-evolving landscape of data engineering, organizations are constantly seeking efficient and scalable ways to manage their data pipelines. One architectural pattern that has gained significant traction in recent years is the medallion architecture. This approach, popularized by Databricks, offers a structured and flexible framework for organizing data workflows, from ingestion to analytics-ready datasets. In this article, we’ll discuss medallion architecture, exploring its principles, benefits, and implementation strategies.

Medallion architecture is a data engineering design pattern that organizes data workflows into three distinct layers, each representing a different stage of data refinement. These layers are commonly referred to as Bronze, Silver, and Gold.

This architecture addresses several common challenges in data engineering, including maintaining data quality through various processing stages, scaling with growing data volumes and variety, adapting to changing business requirements, and optimizing query performance. By adopting a layered approach, medallion architecture provides a systematic way to tackle these challenges, offering a clear path from raw data ingestion to analytics-ready datasets.


2.0 The three layers of medallion architecture

2.1 Bronze layer — raw data

The Bronze layer serves as the initial landing zone for raw data. It captures data in its original, unaltered form from various sources such as databases, APIs, file systems, and streaming platforms. The data is stored in its native format, whether it’s structured, semi-structured, or unstructured. Along with the raw data, important metadata such as source information and ingestion timestamp are recorded.

This layer acts as a historical record, preserving the original state of the data for compliance, auditing, and potential reprocessing needs. When implementing the Bronze layer, organizations often leverage object storage solutions and employ data partitioning strategies to optimize data retrieval.
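As a minimal illustration, with hypothetical table and column names, a bronze step can land source rows untouched while stamping them with lineage metadata:

-- Land raw rows as-is, adding source and ingestion metadata (hypothetical names)
CREATE TABLE bronze.patient_records AS
SELECT
    src.*
    , 'emr_export' AS source_system   -- where the rows came from
    , NOW() AS ingested_at            -- when they landed
FROM raw_patient_feed AS src;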

2.2 Silver layer — cleansed data

The Silver layer represents the first stage of data refinement. It takes the raw data from the Bronze layer and applies various transformations to create a more standardized and cleaned dataset. In this layer, data undergoes cleansing operations, including handling missing values, removing duplicates, and correcting known errors or inconsistencies.

Data validation processes are applied to ensure adherence to predefined business rules and quality standards. A consistent schema is imposed across different data sources, which may involve type casting, column renaming, and structural transformations. Normalization is also performed, converting data into standardized formats to simplify integration and analysis.

To implement the Silver layer effectively, organizations often utilize distributed processing frameworks for efficient data transformation at scale. Data quality rules are typically implemented as part of the ETL process, and schema management tools are employed to handle schema evolution.
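A sketch of such a transformation, again with hypothetical names and Postgres-flavored SQL, might deduplicate on a business key, enforce a basic quality rule, and standardize types:

-- Cleanse bronze rows into a standardized silver table (hypothetical names)
CREATE TABLE silver.patient_records AS
SELECT DISTINCT ON (patient_id)        -- remove duplicate patient rows
    patient_id
    , CAST(updated_at AS timestamp) AS updated_at
    , NULLIF(TRIM(name), '') AS name   -- normalize empty strings to NULL
FROM bronze.patient_records
WHERE patient_id IS NOT NULL           -- basic data quality rule
ORDER BY patient_id, ingested_at DESC; -- keep the newest row per patient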

2.3 Gold layer — analytics and serving data

The Gold layer represents the highest level of data refinement in the Medallion architecture. It contains analytics-ready datasets tailored for specific business use cases and optimized for query performance. In this layer, data undergoes aggregation to create pre-aggregated datasets that support common analytical queries and reporting needs.

Dimensional modeling is often applied, organizing data into fact and dimension tables to facilitate business intelligence and OLAP operations. Feature engineering is another crucial aspect, where complex features or metrics that directly support business KPIs or machine learning models are derived.

The Gold layer is also where query optimization takes place. Data is structured to optimize for common access patterns, ensuring that end-users can access the data they need quickly and efficiently. When implementing the Gold layer, organizations often use columnar storage formats, implement caching mechanisms, and apply advanced indexing techniques to optimize data retrieval.
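For example, with hypothetical names, a gold table often pre-aggregates silver data around a common reporting question:

-- Pre-aggregate vitals for a daily reporting dashboard (hypothetical names)
CREATE TABLE gold.daily_patient_vitals AS
SELECT
    CAST(updated_at AS date) AS report_date
    , COUNT(*) AS patients_measured
    , ROUND(AVG(systolic_bp), 1) AS avg_systolic_bp
    , ROUND(AVG(diastolic_bp), 1) AS avg_diastolic_bp
FROM silver.patient_records
GROUP BY CAST(updated_at AS date);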


3.0 Medallion architecture tutorial

For this tutorial you will need Docker, a Mage instance (the quickstart below creates one), and a PostgreSQL database. Follow the PostgreSQL documentation to install the database.

3.1 Mage quickstart

After you install Docker and a PostgreSQL database, you can follow the steps below to start a Mage instance in a Docker container.

  • Run the following command to start your Mage project (the command below is for a Mac; see the Mage documentation for other operating systems). You can substitute a custom project name at the end of the command.

docker run -it -p 6789:6789 -v $(pwd):/home/src mageai/mageai /app/run_app.sh mage start [project_name]
  • If you are using Docker Desktop you can open Mage from the GUI by clicking the link in the Ports column; it should look like 6789:6789. If you are not using Docker Desktop, open the Mage overview page at http://localhost:6789.

  • Click the new pipeline button, then click the Standard (Batch) option located at the top left of the GUI to begin the pipeline.

  • Either keep the name Mage provides or replace it with a custom name of your own in the popup screen. You can give the pipeline a description if you want.

  • Hit create and you will be taken to the Pipeline editor GUI in Mage.

3.2 Create some secrets

Mage gives developers three options to handle project secrets such as API keys, passwords, and cloud access keys. The secret manager hides sensitive information from other users when projects are shared or collaborated on. Developers can use environment variables to keep secrets out of the source code, integrate with AWS Secrets Manager for secure storage and retrieval, or utilize Mage’s built-in secret management system. These options ensure that sensitive information remains protected and access-controlled.

For this project, one secret needs to be created:

  • PostgreSQL User Password

Take the following steps to create a secret:

  • From the Pipeline Editor page choose Secrets from the right navigation menu.

  • Click the new button to enter a secret.

  • Enter your secret name and secret value and press enter to save your new secret.

  • You will see a green secret name appear when the Secret is saved.

  • Use this secret in your project based on the instructions listed below the Secret in the Mage UI.

Make sure to create the database password secret. You can also store any other secrets you are adding to this project in the same manner.
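As a quick sketch of the third option (Mage’s built-in secret manager), a stored secret can also be read from Python blocks; the snippet below assumes the secret was named postgres_password:

from mage_ai.data_preparation.shared.secrets import get_secret_value

# Fetch the secret created in the Mage UI (the name is an assumption)
password = get_secret_value('postgres_password')

The same secret is referenced from io_config.yml in the next section.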

3.3 Configure your Mage project

To load your data in Mage you will need to configure the Postgres settings in your io_config.yml file so that your Mage pipeline can communicate with the PostgreSQL database.
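A minimal sketch of the Postgres profile in io_config.yml follows; the database name, user, and host are placeholders, and the password pulls from the Mage secret created in section 3.2 (assuming it was named postgres_password):

version: 0.1.1
default:
  POSTGRES_CONNECT_TIMEOUT: 10
  POSTGRES_DBNAME: your_database                                   # placeholder
  POSTGRES_SCHEMA: public
  POSTGRES_USER: your_user                                         # placeholder
  POSTGRES_PASSWORD: "{{ mage_secret_var('postgres_password') }}"  # secret from section 3.2
  POSTGRES_HOST: host.docker.internal                              # Postgres on the host, Mage in Docker
  POSTGRES_PORT: 5432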

3.4 Generate data

Tutorials are best when they mimic real-life examples. In the wild, Mage is used to orchestrate data and can perform all the ETL operations. If your company already uses dbt, Mage has you covered with a managed dbt Core integration.

The data we will generate for this tutorial mimics a medical risk-score analysis. Our goal is to build an analytics engineering pipeline with a staging layer, a refined layer, and a serving layer.

Let’s get started:

  • From the edit pipeline page click All blocks

  • Choose the Base template (generic) under the Python choices

  • Replace the boilerplate code in the Mage data loader block with the code below. This generates a mock medical records dataset that somewhat resembles real life.

import pandas as pd
from faker import Faker
import random
from datetime import datetime

if 'data_loader' not in globals():
    from mage_ai.data_preparation.decorators import data_loader
if 'test' not in globals():
    from mage_ai.data_preparation.decorators import test


@data_loader
def load_data(*args, **kwargs):
    """
    Generate a mock medical records dataset.

    Returns:
        pd.DataFrame with one row of vitals per patient.
    """
    # Initialize Faker
    fake = Faker()

    # Number of samples to generate
    num_samples = 20000

    # Generate data
    data = {
        'patient_id': [f'PID{str(i).zfill(5)}' for i in range(1, num_samples + 1)],
        'name': [fake.name() for _ in range(num_samples)],
        'date_of_birth': [fake.date_of_birth(minimum_age=18, maximum_age=90).strftime('%Y-%m-%d') for _ in range(num_samples)],
        'diastolic_bp': [random.randint(60, 90) for _ in range(num_samples)],
        'systolic_bp': [random.randint(90, 140) for _ in range(num_samples)],
        'pulse': [random.randint(50, 100) for _ in range(num_samples)],
        'height_in_inches': [round(random.uniform(57, 84), 1) for _ in range(num_samples)],
        'weight_in_lbs': [round(random.uniform(100, 350), 1) for _ in range(num_samples)],
        'updated_at': [datetime.now().strftime('%Y-%m-%d %H:%M:%S') for _ in range(num_samples)]
    }

    # Create DataFrame
    df = pd.DataFrame(data)
    return df


@test
def test_output(output, *args) -> None:
    """
    Test the output of the block.
    """
    assert output is not None, 'The output is undefined'
  • Run the code by hitting the blue right arrow button at the top right of the data loader block. You should see a sample output below the block.

After the block runs, Mage prints out a sample for you to evaluate. You should now have 20,000 rows and 9 columns of data consisting of people’s mock personal information and their medical test results.

3.5 Building the bronze layer

The bronze layer is our landing layer, where data in its raw form is stored. To create the bronze layer complete the following steps.

  • From the edit pipeline page click “All Blocks”

  • Hover over Transformer and then select the Base template (generic) block from the SQL blocks

  • Enter a schema name in the ‘Schema’ text box. This should match a schema name in your Postgres database; Postgres databases have a default ‘public’ schema.

  • Enter a table name in the ‘Table (optional)’ text box. This is the name the new table will be given.

  • Finally, enter your SELECT statement and run the block by hitting the purple arrow button at the top of the block. A minimal example follows this list.
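Since the bronze layer lands the data unaltered, a minimal statement is a pass-through select of the upstream data loader’s output, which Mage SQL blocks expose as {{ df_1 }}:

-- Land the generated data as-is in the bronze table
SELECT
    *
FROM {{ df_1 }}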

3.6 Building the silver layer

Follow the same initial set of instructions from the bronze layer for the silver layer. For the silver layer, the data team has requested that we remove all PII from the dataset prior to visualization. They want to know the age of each subject, but not their name or date of birth. Enter the code below into the SQL block and then run it.
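A query along these lines satisfies the request (a sketch assuming Postgres date functions): it drops name and date_of_birth, derives age, and renames the measurement columns to the names the gold layer expects. Here {{ df_1 }} refers to the bronze block’s output.

-- Remove PII and derive age from date_of_birth
SELECT
    patient_id
    , DATE_PART('year', AGE(CURRENT_DATE, CAST(date_of_birth AS date))) AS age
    , diastolic_bp
    , systolic_bp
    , pulse
    , height_in_inches AS height
    , weight_in_lbs AS weight
    , updated_at
FROM {{ df_1 }}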

3.7 Building the gold layer

Follow the same initial set of instructions from the bronze layer for the gold layer. For the gold layer, the data team has requested that we calculate a risk score indicating whether a person is at low, medium, or high risk of heart issues. Enter the code below into the SQL block and then run it.

WITH patient_bmi AS (
    SELECT
        patient_id
        , age
        , diastolic_bp
        , systolic_bp
        , pulse
        , height
        , weight
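        -- BMI (imperial formula): weight in lbs / (height in inches)^2 * 703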
        , ROUND(CAST((weight / (height * height) * 703) AS numeric), 2) AS bmi
    FROM 
        {{ df_1 }}
),
risk_score_calculation AS (
    SELECT
        *
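        -- Calculate age score (0 = bad, 1 = medium, 2 = good)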
        , CASE 
            WHEN age < 40 THEN 2 
            WHEN age BETWEEN 40 AND 60 THEN 1 
            ELSE 0 
        END AS age_score
        -- Calculate diastolic_bp score (0 = bad, 2 = good)
        , CASE 
            WHEN diastolic_bp < 60 OR diastolic_bp > 80 THEN 0 
            ELSE 2 
        END AS diastolic_bp_score
        -- Calculate systolic_bp score (0 = bad, 2 = good)
        , CASE 
            WHEN systolic_bp < 90 OR systolic_bp > 120 THEN 0 
            ELSE 2 
        END AS systolic_bp_score
        -- Calculate pulse score (0 = bad, 2 = good)
        , CASE 
            WHEN pulse < 60 OR pulse > 100 THEN 0 
            ELSE 2 
        END AS pulse_score
        -- Calculate bmi score (0 = bad, 1 = medium, 2 = good)
        , CASE 
            WHEN bmi < 18.5 OR bmi >= 30 THEN 0 
            WHEN bmi >= 25 THEN 1  -- covers 25 up to 30, closing the gap left by BETWEEN 25 AND 29.9
            ELSE 2 
        END AS bmi_score
    FROM 
        patient_bmi
),
total_risk_score AS (
    SELECT
        *
        , age_score + diastolic_bp_score + systolic_bp_score + pulse_score + bmi_score AS cardiac_risk_score
    FROM risk_score_calculation
)
SELECT
    *
    , CASE
        WHEN cardiac_risk_score <= 5 THEN 'high risk'
        WHEN cardiac_risk_score BETWEEN 6 AND 8 THEN 'medium risk'
        ELSE 'low risk'
    END AS cardiac_risk
FROM 
    total_risk_score

4.0 Conclusion

Medallion architecture offers a robust and flexible approach to organizing data workflows in modern data engineering. By structuring data processing into Bronze, Silver, and Gold layers, it provides a clear path from raw data to business-ready insights. As organizations continue to grapple with increasing data volumes and complexity, medallion architecture is a valuable framework for building data platforms that can adapt to changing needs while delivering reliable and timely insights.