TLDR
In this Mage Academy lesson on feature engineering, we’ll learn about the aggregate functions min() and max(), and see how they’re helpful in analyzing and understanding the data.
Glossary
Data Aggregation
Why is it necessary
Definition
Example
How to code
Data Aggregation
Data aggregation is the process of summarizing data. Some of the most common aggregate functions are min(), max(), mean(), count(), and sum().
Why is it necessary
Data aggregation is part of the data analysis process. Data analysis is the first and most critical step of model building. It lets us delve deeper into the data and helps us understand it better.
Definition
In this lesson, we’ll explore min() and max() functions in detail.
min(): This function helps us find the minimum or least value in a feature or column.
max(): This function helps us find the maximum or highest value in a feature or column.
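As a minimal sketch of these two definitions, Python's built-in min() and max() can be applied to a plain list of values. The prices below are made up purely for illustration.

```python
# Hypothetical prices, used only for illustration.
prices = [1200.0, 350.0, 90.0, 980.0]

lowest = min(prices)   # smallest value in the list
highest = max(prices)  # largest value in the list

print(lowest, highest)
```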
We can apply aggregate functions in 2 different ways:
Case-1: Apply aggregate functions on a single feature or column, i.e., analyzing each column individually.
Case-2: Apply aggregate functions on groups, i.e., we’ll group rows and analyze each group individually.
Example
Consider a dataset with 2 columns, “Product” and “Price”. Let’s apply the aggregate functions min() and max() to find the minimum and maximum values in the “Price” column.
Case-1: Find minimum and maximum price in the “Price” column.
Find minimum price
Find maximum price
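Here is a minimal sketch of Case-1 in pandas. The product names follow the example above, but the prices are assumed sample values, not the lesson's original data.

```python
import pandas as pd

# Hypothetical data: product names follow the lesson, prices are made up.
df = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair", "Laptop", "Desk", "Chair"],
    "Price": [1200, 350, 90, 980, 410, 120],
})

# Aggregate the whole "Price" column, ignoring which product each row belongs to.
min_price = df["Price"].min()
max_price = df["Price"].max()

print(min_price, max_price)
```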
Case-2: Group rows i.e., group products of the same category and find minimum and maximum price of each category.
Grouping is a 3 step process as shown below:
Step-1: Split the rows into groups based on the “Product” column.
There are 3 unique products (Laptop, Desk, Chair) in the “Product” column, so the rows are split into 3 groups.
Step-2: Find the minimum price of each unique product.
Step-3: Display the output. For this, we’ll combine each group’s output to form a data frame and display the data frame.
Steps to find minimum value of each unique product
Steps to find maximum value of each unique product
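The three steps can be spelled out one at a time, as in the sketch below. It uses the same assumed sample prices as a stand-in for the lesson's data, and the variable names (groups, min_per_product, max_per_product, result) are only illustrative.

```python
import pandas as pd

# Hypothetical data: product names follow the lesson, prices are made up.
df = pd.DataFrame({
    "Product": ["Laptop", "Desk", "Chair", "Laptop", "Desk", "Chair"],
    "Price": [1200, 350, 90, 980, 410, 120],
})

# Step-1: split the rows into one group per unique product.
groups = {product: rows for product, rows in df.groupby("Product")}

# Step-2: apply min() and max() to the "Price" column of each group.
min_per_product = {product: rows["Price"].min() for product, rows in groups.items()}
max_per_product = {product: rows["Price"].max() for product, rows in groups.items()}

# Step-3: combine each group's output into a single data frame and display it.
result = pd.DataFrame({
    "min_price": pd.Series(min_per_product),
    "max_price": pd.Series(max_per_product),
})
result.index.name = "Product"

print(result)
```

In everyday use, pandas performs all three steps in a single groupby() call; writing them out separately is only meant to mirror the explanation above.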
How to code
In recent years, the popularity of ridesharing has skyrocketed. Its key benefits are that it’s inexpensive, convenient, and lets anyone easily travel from one location to another.
Image by mohamed Hassan from Pixabay
Service providers frequently change prices based on time, traffic, the number of cabs available, and other factors. As costs fluctuate, it's beneficial to offer users a range of prices for a specific route. So, with the help of data, let’s find the minimum and maximum prices for each unique route.
Python
Find the minimum and maximum price of each unique route.
Step-1: First, let’s group rides by source and then by destination. To do this, we’ll iterate through the rows of the rides data and save each “source” as a key of a dictionary. The final result should be as shown below.
Output format: {'sourceA': [(destination1, price1), (destination1, price2), ...], 'sourceB': [(destination1, price1), (destination1, price2), ...], ...}
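A minimal sketch of this grouping step follows. The rides list is a hypothetical stand-in for the rides data: the station names come from this lesson, but the rows and prices are made up, and the (source, destination, price) tuple layout is an assumption.

```python
from collections import defaultdict

# Hypothetical rows in the form (source, destination, price).
rides = [
    ("Haymarket Square", "North Station", 7.0),
    ("Haymarket Square", "West End", 9.5),
    ("Haymarket Square", "North Station", 13.5),
    ("North Station", "West End", 5.0),
    ("Haymarket Square", "West End", 16.0),
    ("North Station", "West End", 11.0),
]

# Each source becomes a dictionary key that maps to a list of
# (destination, price) tuples, matching the output format above.
rides_by_source = defaultdict(list)
for source, destination, price in rides:
    rides_by_source[source].append((destination, price))

print(dict(rides_by_source))
```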
Step-2: Find the minimum and maximum price of each unique route.
Find minimum price
By comparing the prices of routes with the same starting location and destination, we'll find the minimum price of each route.
Lowest price of each unique route
Find maximum price
By comparing the prices of routes with the same starting point and destination, we'll find the highest price for each route.
Highest price of each unique route
From the output, we see that the price from “Haymarket Square” to “North Station” ranges between 3.0 and 32.5, “Haymarket Square” to “West End” ranges between 3.0 and 27.5, etc.
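The comparison in Step-2 can be sketched as follows. The dictionary is a small hypothetical sample in the Step-1 output format, so its price ranges will not match the figures quoted above, which come from the lesson's full dataset.

```python
# Hypothetical Step-1 result: source -> list of (destination, price) tuples.
rides_by_source = {
    "Haymarket Square": [
        ("North Station", 7.0),
        ("West End", 9.5),
        ("North Station", 13.5),
        ("West End", 16.0),
    ],
    "North Station": [("West End", 5.0), ("West End", 11.0)],
}

min_price = {}  # (source, destination) -> lowest price seen so far
max_price = {}  # (source, destination) -> highest price seen so far

for source, trips in rides_by_source.items():
    for destination, price in trips:
        route = (source, destination)
        # Keep the price if the route is new or the price beats the current record.
        if route not in min_price or price < min_price[route]:
            min_price[route] = price
        if route not in max_price or price > max_price[route]:
            max_price[route] = price

for route in min_price:
    print(route, min_price[route], max_price[route])
```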
Pandas
Group rows of the same route, and find the minimum and maximum price of each individual route.
Pandas provides a groupby() method that’s used to group rows in a dataset. It’s used along with the min() and max() functions to find the minimum and maximum values of each unique group.
Find minimum price
Find maximum price
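A minimal pandas sketch of this approach follows. The column names "source", "destination", and "price" are assumptions, and the rows are made-up sample values rather than the lesson's dataset.

```python
import pandas as pd

# Hypothetical rides data; column names and prices are assumptions.
rides = pd.DataFrame({
    "source": ["Haymarket Square", "Haymarket Square", "Haymarket Square",
               "North Station", "Haymarket Square", "North Station"],
    "destination": ["North Station", "West End", "North Station",
                    "West End", "West End", "West End"],
    "price": [7.0, 9.5, 13.5, 5.0, 16.0, 11.0],
})

# Group rows that share the same source and destination, i.e. the same route.
by_route = rides.groupby(["source", "destination"])["price"]

min_price = by_route.min()  # lowest price of each unique route
max_price = by_route.max()  # highest price of each unique route

# Both aggregates side by side, one row per route.
price_range = by_route.agg(["min", "max"])

print(min_price)
print(max_price)
print(price_range)
```

Using agg(["min", "max"]) is optional; it simply places both results in one table so the price range of each route is visible at a glance.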
Magical no code solution
For quick analysis and results, try our product, Mage. Our service features an "Edit data" area with multiple aggregation options. Besides analyzing the data, you can create a new column to store the aggregation results for further analysis.