TLDR
In this Mage Academy lesson on data cleaning, we’ll go over variance in detail and see how to identify and remove low variance columns from a dataset.
Glossary
Why is it necessary
Variance
How to code
Why is it necessary?
Data distribution
We can remove columns from a dataset if they aren’t useful for predicting the output. Low variance columns are one such case: their values barely change from row to row, so they carry little information and contribute little to the prediction. It's therefore recommended to remove low variance columns.
Variance
Variance measures the spread of data, i.e., it measures how far each data point is from the mean.
Variance is calculated using the following formula:
Mathematically we can write variance as shown below:
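For reference, the standard textbook definitions can be written as the sketch below. Which divisor the lesson intends is an assumption: the first form is the population variance (divide by N), the second is the sample variance (divide by n − 1), which is what pandas' .var() computes by default.

```latex
% Population variance: mean of the squared deviations from the mean \mu
\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2

% Sample variance: same sum of squared deviations, divided by n - 1
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```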
Zero variance indicates that all the values in the column are constant.
Low variance indicates that most of the values in the column are similar and are very close to the mean.
High variance indicates that the values in the column are not similar and are spread far from the mean.
Numerical data
Usually we calculate variance by using a formula. But for categorical data columns we don’t use a formula; instead, we visualize the distribution of the categories with Python visualization libraries such as seaborn and matplotlib.
Zero variance indicates that every value in the column belongs to the same category.
Low variance indicates that nearly all the values in the column belong to one category, so the distribution is dominated by a single bar.
High variance indicates that the values in the column are spread across several categories.
Calculate variance
From scratch
Let’s take one column and see how we calculate variance for numerical data.
Step-1: Calculate mean
Step-2: Find the difference between each data point and mean
Step-3: Square the difference values
Step-4: Sum all the squared difference values
Step-5: Calculate variance
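The five steps above can be sketched in plain Python as follows. The scores list is a hypothetical column made up for illustration, and because the lesson's exact divisor is not shown here, the sketch computes both the population variance (divide by n) and the sample variance (divide by n − 1).

```python
# Hypothetical values for one numerical column (illustrative only).
scores = [4, 8, 6, 5, 7]

# Step-1: Calculate mean
mean = sum(scores) / len(scores)

# Step-2: Find the difference between each data point and mean
differences = [x - mean for x in scores]

# Step-3: Square the difference values
squared_differences = [d ** 2 for d in differences]

# Step-4: Sum all the squared difference values
sum_of_squares = sum(squared_differences)

# Step-5: Calculate variance
# Population variance divides by n; sample variance divides by n - 1.
population_variance = sum_of_squares / len(scores)
sample_variance = sum_of_squares / (len(scores) - 1)

print("mean:", mean)
print("population variance:", population_variance)
print("sample variance:", sample_variance)
```

The sample variance is the one that matches the default output of pandas' .var().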
Using pandas library (for numerical data)
Let’s calculate variance for all the columns in the dataset that hold numerical data.
Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.
Step-2: Calculate the variance of each column using the .var() method.
Step-3: Remove columns if variance is low.
The variance of the “history” and “physics” columns is low compared with that of the “english” and “math” columns, so we can remove “history” and “physics” from the dataset.
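A minimal sketch of these three steps is shown below. The CSV contents, the scores in it, and the threshold value are all hypothetical stand-ins chosen for illustration; only the column names come from the lesson. The CSV text is read from memory through io.StringIO so the example runs without a file on disk.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for the lesson's dataset.
csv_text = """english,math,history,physics
55,92,70,64
91,48,71,65
68,75,69,66
79,60,70,65
"""

# Step-1: Load the dataset with read_csv (here from an in-memory string).
df = pd.read_csv(io.StringIO(csv_text))

# Step-2: Calculate the variance of each numerical column.
# .var() returns the sample variance (ddof=1) by default.
variances = df.var(numeric_only=True)
print(variances)

# Step-3: Remove columns whose variance is low.
# The cutoff is an assumption; pick one that suits the scale of your data.
threshold = 1.0
low_variance_cols = variances[variances < threshold].index.tolist()
df_reduced = df.drop(columns=low_variance_cols)

print("dropped:", low_variance_cols)
print("remaining:", list(df_reduced.columns))
```

Because variance depends on the scale of each column, a fixed cutoff like this only makes sense when the columns are on comparable scales, as exam scores are here.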
Using pandas library (for categorical data)
Step-1: Load the dataset using Python’s pandas library. We use the read_csv() function to read files that have the .csv extension.
Step-2: Plot the distribution of categorical columns using Python’s seaborn library. We’ll use the countplot() function to visualize the distribution.
Step-3: Drop columns if variance is low.
The variance of the “school” and “pass” columns is low, so we can remove these columns from the dataset.
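The sketch below illustrates the categorical case. The data values and the “grade” column are hypothetical; only the “school” and “pass” column names come from the lesson. The helper plot_category_counts draws one seaborn countplot per column, and top_category_share is an added numeric check (not part of the lesson) that reports how much of each column is taken up by its most frequent category.

```python
import pandas as pd

# Hypothetical categorical data (illustrative values only).
df = pd.DataFrame({
    "school": ["A"] * 9 + ["B"],   # dominated by one category
    "pass": ["yes"] * 10,          # a single category throughout
    "grade": ["A", "B", "C", "A", "B", "C", "A", "B", "C", "A"],
})


def top_category_share(frame, columns):
    """Return the fraction of rows held by the most frequent category per column."""
    # value_counts sorts in descending order, so the first entry is the top category.
    return {col: frame[col].value_counts(normalize=True).iloc[0] for col in columns}


def plot_category_counts(frame, columns):
    """Draw one countplot per categorical column, side by side."""
    # Plotting libraries are imported here so the numeric check above
    # still works in environments without them installed.
    import matplotlib.pyplot as plt
    import seaborn as sns

    fig, axes = plt.subplots(1, len(columns), figsize=(4 * len(columns), 3), squeeze=False)
    for ax, col in zip(axes[0], columns):
        sns.countplot(data=frame, x=col, ax=ax)
        ax.set_title(col)
    fig.tight_layout()
    return fig


shares = top_category_share(df, ["school", "pass", "grade"])
print(shares)
```

Calling plot_category_counts(df, ["school", "pass", "grade"]) and then matplotlib's plt.show() displays the bars; a column where one bar towers over the rest is a candidate for removal.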
How to code
Using Pandas library:
We’ve seen that the “history,” “physics,” “school” and “pass” columns have low variance. So, we’ll use the .drop() method to remove these columns.
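As a minimal sketch, assuming a DataFrame that contains the six columns named in this lesson (the row values below are hypothetical), the removal looks like this:

```python
import pandas as pd

# Hypothetical rows; only the column names come from the lesson.
df = pd.DataFrame({
    "english": [55, 91, 68, 79],
    "math": [92, 48, 75, 60],
    "history": [70, 71, 69, 70],
    "physics": [64, 65, 66, 65],
    "school": ["A", "A", "A", "B"],
    "pass": ["yes", "yes", "yes", "yes"],
})

# Columns identified as low variance in the previous sections.
low_variance_cols = ["history", "physics", "school", "pass"]

# .drop() returns a new DataFrame; the original df is left unchanged.
df_clean = df.drop(columns=low_variance_cols)

print(list(df_clean.columns))
```

Assigning the result to a new name keeps the original data available in case a dropped column turns out to be needed later.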
Magical no code solution
When you're building models with Mage, it’s easy to remove columns.
Want to learn more about machine learning (ML)? Visit
! ✨🔮