TLDR
In this Mage Academy lesson on data cleaning, we’ll learn how to remove rows that repeat a value in a given column using Pandas.
Outline
When’s it necessary?
How to code
Magical no-code solution 🪄
When’s it necessary?
Duplicate data can skew prediction results.
Thus, for columns that should contain unique values, it’s important to search for and exclude any duplicate rows so that the model generalizes better and predicts more accurately.
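Before removing anything, it helps to check whether a column actually contains repeats. Here is a minimal sketch using pandas' `duplicated()` and `value_counts()`. The `sample` DataFrame and its "Form" column are made up for illustration and are not taken from the real dataset.

```python
import pandas as pd

# Hypothetical sample rows (made up for illustration, not the real dataset)
sample = pd.DataFrame({
    "Name": ["Bulbasaur", "Venusaur", "Venusaur", "Charizard", "Charizard", "Charizard"],
    "Form": ["Base", "Base", "Mega", "Base", "Mega X", "Mega Y"],
})

# True for every row whose "Name" already appeared in an earlier row
is_repeat = sample["Name"].duplicated()

# Number of rows that repeat an earlier name
num_repeats = int(is_repeat.sum())

# Names that occur more than once in the column
counts = sample["Name"].value_counts()
repeated_names = sorted(counts[counts > 1].index)

print(num_repeats)
print(repeated_names)
```

If `num_repeats` comes back as zero, the column is already unique and there is nothing to clean.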
How to code
Take Kaggle’s Pokemon dataset, for example: the extra rows containing “Mega” versions of Pokemon aren’t needed to analyze the entire Pokemon index, since Megas are simply beefier copies of the same Pokemon.
Thiagoazen’s Pokemon dataset, ft. 3 Charizards
From scratch
While a built-in function (see next section) gets the job done, we will also present an algorithm that filters unique values of a column using a dictionary, just in case it shows up in an assignment or exam. 😉
By looking at the first ten rows of data, we can see several duplicates in the “Name” column that we need to remove (like Venusaur).
import pandas as pd
data = pd.read_csv("PokemonDb.csv")
data
To do so, we store only the first occurrence of each Pokemon’s name in a dictionary. As we go through the rows one by one (using iterrows()), we check whether the name is already in the dictionary and drop the row if it is.
uniqueNames = {}
# Keeps only the first occurrence of each name
for i, row in data.iterrows():
    if row["Name"] in uniqueNames:
        # Name already seen, so this row is a duplicate
        data.drop(i, inplace=True)
    uniqueNames[row["Name"]] = True
data
The complete code if you’d like to try it yourself:
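Here is a self-contained sketch of the dictionary approach that runs without the CSV file. Two things are assumptions of this sketch rather than part of the lesson's code: the small `data` DataFrame is a made-up stand-in for `PokemonDb.csv`, and the duplicate row labels are collected in a hypothetical `rows_to_drop` list and dropped once at the end, instead of dropping inside the loop.

```python
import pandas as pd

# Hypothetical stand-in for pd.read_csv("PokemonDb.csv"); the rows are made up
data = pd.DataFrame({
    "Name": ["Bulbasaur", "Ivysaur", "Venusaur", "Venusaur",
             "Charizard", "Charizard", "Charizard"],
    "Form": ["Base", "Base", "Base", "Mega", "Base", "Mega X", "Mega Y"],
})

unique_names = {}   # names seen so far
rows_to_drop = []   # index labels of rows whose name was already seen

for i, row in data.iterrows():
    if row["Name"] in unique_names:
        # Duplicate: remember its label so it can be dropped later
        rows_to_drop.append(i)
    else:
        # First occurrence: record the name and keep the row
        unique_names[row["Name"]] = True

# Drop all duplicates in one call; the original `data` is left untouched
deduped = data.drop(rows_to_drop)
print(deduped)
```

Dropping once at the end avoids modifying the DataFrame while looping over it, and it keeps the original data around in case you want to compare before and after.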
Built-in Pandas Function
The promised built-in function, drop_duplicates(), deletes rows based on duplicates in the column name(s) that you list in the subset parameter. By default, it keeps the first occurrence of each value and returns a new DataFrame.
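As a rough sketch of how pandas' `DataFrame.drop_duplicates()` can be called with `subset`, here is a small example. The `data` DataFrame and its "Form" column are made up for illustration; with the real dataset you would load the CSV instead.

```python
import pandas as pd

# Hypothetical sample rows (made up for illustration, not the real dataset)
data = pd.DataFrame({
    "Name": ["Bulbasaur", "Venusaur", "Venusaur", "Charizard", "Charizard", "Charizard"],
    "Form": ["Base", "Base", "Mega", "Base", "Mega X", "Mega Y"],
})

# Keep the first row for each name (keep="first" is the default)
deduped = data.drop_duplicates(subset=["Name"])

# Keep the last row for each name instead
last_kept = data.drop_duplicates(subset=["Name"], keep="last")

print(deduped)
print(last_kept)
```

Both calls return a new DataFrame and leave `data` unchanged. Which row survives depends on `keep`, so it is worth checking that the first (or last) occurrence is really the one you want.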
Magical no-code solution 🪄
Last, but definitely not least, Mage has a row transformation action that removes duplicates from your dataset! Try this if you’d like to leverage AI without learning the ins and outs of Pandas.
Want to learn more about machine learning (ML)? Visit Mage Academy! ✨🔮