TLDR
In this Mage Academy lesson on feature engineering, we’ll learn how we’d count the number of distinct values in a column by group.
Glossary
How it works
Example code
Magical no-code solution ✨🔮
How it works
In large datasets, it’s useful to utilize at-a-glance metrics like the amount of unique values in a column to gain familiarity with the kind of data we will be working with.
Cropped Mr. Sun menu
To better explain what an aggregate function is, I’ll explain by example. If we owned this Mr. Sun boba shop, at the end of the day, we can use
group by
to group all the boba sold today into the categories detailed in the menu above. Then, we can use the aggregate function,
count unique values
, to tally the number of unique drinks sold per category.
With this analysis, would we see that every drink had been sold at least once today? If we see that we didn’t sell even one Mango, Strawberry, and Honey Kiwi smoothie for more than a couple days, should we consider making these drinks seasonal, so we don’t need to supply our store with fresh fruits during the winter months?
Thus, count unique values, as the name suggests, is an aggregate function that inspects a column and returns a count of the number of different values per group.
Example code
From scratch
In this section, I’ll detail the steps to group the boba drinks by a unique value in the column “Category”. Once we have the groups, we’ll total the amount of unique drinks that were sold in each category to obtain the count of unique values.
1
2
3
4
import pandas as pd
df = pd.read_csv('boba-history.csv')
df
Dictionaries are Python’s most straightforward way of storing and counting unique string values. Thus, we’ll use a dictionary key to keep track of the unique category names, like “Mr. Sun Specialties” or “Milk Tea.” We’ll iterate through our dataset, saving the drink names as list values with the group that the drink is categorized as. For example, under the key– or group– Milk Tea, the drinks Earl Grey and Angel Caramel are listed as values:
{ ‘Milk Tea’: [‘Royal Earl Grey Milk Tea, ‘Angel Caramel Milk Tea’]}
1
2
3
4
5
6
7
8
9
10
11
12
13
14
categories = {}
for _, row in df.iterrows():
key = row['Category']
if key not in categories:
# save drink name to category
categories[key] = [row['Drink']]
else:
# check if drink name is already added
if row['Drink'] not in categories[key]:
categories[key].append(row['Drink'])
categories
Lastly, to get the number of unique drinks in each category, we can use the Python
function since all the drinks for each category are stored in the corresponding list.
And here are the results! We have 1 unique drink ordered from the “Fresh Tea” category, 2 from “Milk Foam,” etc. Looking at the extensive menu Mr. Sun has, if only 1 or 2 drinks are ordered from each category of our menu each day, maybe we can consider reducing our menu so we can minimize the storage space of all the raw ingredients used to make the less popular drinks.
Using Pandas
For our next example, I’d like to pose a different question from the previous section. Let’s say I want to open a new Mr. Sun branch in a different Bay Area city. It would be useful to get a count of how many competitors I have per city!
Using these two Pandas built-in aggregate functions,
and
, we’ll be analyzing this Bay Area boba spots
to find how many boba shops (count) are in each city (group by).
1
2
3
4
import pandas as pd
df = pd.read_csv('bayarea_boba_spots.csv')
df
Then, we’d use groupby() a built-in Pandas function that returns an object similar to the dictionary we created in the “From scratch” section. It’s messy to view the object, but I wrote a helper function “
” that prints the unique list of cities in the Bay that have at least one boba shop:
Now that we have the groups, we can use .count() to count the number of unique boba shops by city.
1
2
gp = df.groupby(['city'])['name'].count()
gp
Thus, the complete code needed to perform our tally of boba shops per city is:
Magical no-code solution ✨🔮
Additionally, our product, Mage, has functions that can help you find the distinct count in of a feature based on the groupings of another.
After uploading your data, you can find the no-code function using Edit data > Aggregate. Then, select the column you’d like to group by from the drop-down menu (“city” for our example). Make sure you select “distinct count” to count the unique values. Finally, select the column you’re counting unique values in (I selected “name” of the boba shop”).
Want to learn more about machine learning (ML)? Visit
! ✨🔮