Data Cleaning - Remove columns

First published on February 17, 2022

Last updated at March 9, 2022

 

9 minute read

Jahnavi C.

TLDR

In this Mage Academy lesson on data cleaning, we’ll learn how to remove columns and discuss various scenarios in which it's necessary to remove columns from a dataset.

Glossary

  • Why is it necessary

  • Scenarios

  • How to code

Why is it necessary?

“Garbage in, garbage out”

When you’re building machine learning models, the quality of data is always a concern. It’s because when bad data is fed into machine learning algorithms, the predictions are bad. Therefore, most of the time is spent on cleaning the data. There are several ways to clean the data, and 

removing columns

is a part of the data cleaning process.

Scenarios

Let's look at some of the most common scenarios when removing columns is essential.

  1. Duplicated columns

  2. Columns with high percentage of missing values

  3. Columns with 1 unique value

  4. columns with unique values

1. Duplicated columns:

If the same column appears twice or more, we can keep only one of the columns and remove the others as the data is the same in both columns.

For example, let’s take a look at the 

dataset. 

Step-1: Load the dataset using Python’s pandas library. We use the 

read_csv()

function to read files that have the 

.csv

extension.

Step-2: Look if there are any duplicate columns (i.e., the columns with the same name or if the data is the same in any 2 columns). 

Let’s take a closer look at the columns “Lat Long,” “Latitude,” & “Longitude”. 

The data in the “Lat Long” column is the same as data in the “Latitude” and “Longitude” columns. 

Step-3: If there are duplicate columns, we can remove such columns from the dataset. 

As the “Lat Long” column is a duplicated column we can remove it from the dataset.

2. Columns with high percentage of missing values:

The columns with a high percentage (i.e., >50) of missing values will be of no use while predicting the target or output, as they don’t contain any information about the data. So, we can remove such columns from the dataset. 

Step-1: Load the 

using Python’s pandas library. We use the 

read_csv()

function to read files that have the 

.csv

extension.

Step-2: Calculate the total count of missing values in each column. We’ll use Python’s pandas library to perform this operation. We’ll use 

isnull() and sum() 

methods to get the total count of missing values.

Step-3: Calculate the percentage of null values in each column.

Let's calculate the percentage of null values in the "weight" column, as that's the only column that has null values.

Step-4: If the percentage of null values is greater than 50, remove such columns from the dataset.

As the “weight” column has 90% missing values, we can remove the “weight” column from the dataset.

3. Columns with constant values:

If the count of unique values is 1, this indicates that there isn’t much variation or all the values in the column are constant. Such columns don’t help in predicting the output, so we remove them from the dataset.

Step-1:  Load the dataset using Python’s pandas library. We use the 

read_csv()

function to read files that have the 

.csv

extension.

Step-2: List all the column labels in the dataset using the 

columns 

method.

Step-3: Calculate the count of unique values in all columns using 

nunique()

function. 

Step-3: Remove the columns if the count of unique values is 1.

We can remove “Count,” “Country,” and “State” columns from the dataset as these columns each have only 1 unique value.

4. Categorical columns with unique values:

If all the values in the 

columns are unique, i.e. If the count of unique values is the same as the number of records/data points, remove such columns from the dataset as they don’t help to predict the output.

Note: Don’t remove the column if it’s an ID column.

Step-1: Load the 

using Python’s pandas library. We use the 

read_csv()

function to read files that have the 

.csv

extension.

Step-2: Check the number of records/data points in a dataset by using 

shape

function

Step-3: List all the column labels in the dataset using the 

columns 

method.

Step-5: Filter categorical columns from the dataset using 

dtypes

function. Categorical columns are of object data type. 

Step-4: Calculate the count of unique values in categorical columns using the nunique

()

function. 

Step-5: Remove columns that have the same number of unique values as the number of records.

Count of unique values in the “name” column is the same as the number of records in the dataset. So we can remove the “name” column from the dataset. 

How to code

Using pandas library

Step-1: Load 

using

.read_csv()

function.

Step-2: Let’s remove “name” and “weight” columns using the 

drop()

method.

Dataset after removing columns

Magical no code solution

If you're building models with Mage, removing columns is a breeze! The good news is that Mage also calculates the percentage of missing values by default and suggests removing a column that has a high percentage of missing values.

GIF

Want to learn more about machine learning (ML)? Visit 

! ✨🔮