TLDR
In this Mage Academy lesson on data cleaning, we’ll learn how to remove columns and discuss various scenarios in which it's necessary to remove columns from a dataset.
Glossary
Why is it necessary
Scenarios
How to code
Why is it necessary?
“Garbage in, garbage out”
When you’re building machine learning models, the quality of data is always a concern. It’s because when bad data is fed into machine learning algorithms, the predictions are bad. Therefore, most of the time is spent on cleaning the data. There are several ways to clean the data, and
removing columns
is a part of the data cleaning process.
Scenarios
Let's look at some of the most common scenarios when removing columns is essential.
Duplicated columns
Columns with high percentage of missing values
Columns with 1 unique value
columns with unique values
1. Duplicated columns:
If the same column appears twice or more, we can keep only one of the columns and remove the others as the data is the same in both columns.
For example, let’s take a look at the
dataset.
Step-1: Load the dataset using Python’s pandas library. We use the
read_csv()
function to read files that have the
.csv
extension.
Step-2: Look if there are any duplicate columns (i.e., the columns with the same name or if the data is the same in any 2 columns).
Let’s take a closer look at the columns “Lat Long,” “Latitude,” & “Longitude”.
The data in the “Lat Long” column is the same as data in the “Latitude” and “Longitude” columns.
Step-3: If there are duplicate columns, we can remove such columns from the dataset.
As the “Lat Long” column is a duplicated column we can remove it from the dataset.
2. Columns with high percentage of missing values:
The columns with a high percentage (i.e., >50) of missing values will be of no use while predicting the target or output, as they don’t contain any information about the data. So, we can remove such columns from the dataset.
Step-1: Load the
using Python’s pandas library. We use the
read_csv()
function to read files that have the
.csv
extension.
Step-2: Calculate the total count of missing values in each column. We’ll use Python’s pandas library to perform this operation. We’ll use
isnull() and sum()
methods to get the total count of missing values.
Step-3: Calculate the percentage of null values in each column.
Let's calculate the percentage of null values in the "weight" column, as that's the only column that has null values.
Step-4: If the percentage of null values is greater than 50, remove such columns from the dataset.
As the “weight” column has 90% missing values, we can remove the “weight” column from the dataset.
3. Columns with constant values:
If the count of unique values is 1, this indicates that there isn’t much variation or all the values in the column are constant. Such columns don’t help in predicting the output, so we remove them from the dataset.
Step-1: Load the dataset using Python’s pandas library. We use the
read_csv()
function to read files that have the
.csv
extension.
Step-2: List all the column labels in the dataset using the
columns
method.
Step-3: Calculate the count of unique values in all columns using
nunique()
function.
Step-3: Remove the columns if the count of unique values is 1.
We can remove “Count,” “Country,” and “State” columns from the dataset as these columns each have only 1 unique value.
4. Categorical columns with unique values:
If all the values in the
columns are unique, i.e. If the count of unique values is the same as the number of records/data points, remove such columns from the dataset as they don’t help to predict the output.
Note: Don’t remove the column if it’s an ID column.
Step-1: Load the
using Python’s pandas library. We use the
read_csv()
function to read files that have the
.csv
extension.
Step-2: Check the number of records/data points in a dataset by using
shape
function
Step-3: List all the column labels in the dataset using the
columns
method.
Step-5: Filter categorical columns from the dataset using
dtypes
function. Categorical columns are of object data type.
Step-4: Calculate the count of unique values in categorical columns using the nunique
()
function.
Step-5: Remove columns that have the same number of unique values as the number of records.
Count of unique values in the “name” column is the same as the number of records in the dataset. So we can remove the “name” column from the dataset.
How to code
Using pandas library
Step-1: Load
using
.read_csv()
function.
Step-2: Let’s remove “name” and “weight” columns using the
drop()
method.
Dataset after removing columns
Magical no code solution
If you're building models with Mage, removing columns is a breeze! The good news is that Mage also calculates the percentage of missing values by default and suggests removing a column that has a high percentage of missing values.
GIF
Want to learn more about machine learning (ML)? Visit
! ✨🔮