Guide to Churn Prediction : Part 4 — Graphical analysis

First published on January 26, 2022

Last updated on February 9, 2022

10 minute read

Jahnavi C.

TLDR

In this blog, we’ll explore and unlock the mysteries of the Telco Customer Churn dataset using descriptive graphical methods.

Outline

Recap
Before we begin
Statistical concepts
Descriptive graphical analysis
Conclusion

Recap

In part 3 of the series, we analyzed and explored the Telco Customer Churn dataset using the descriptive statistical analysis method and gained an overview of the data.

Before we begin

This guide assumes that you are familiar with data types. If you’re unfamiliar, please read up on data types before continuing.

Statistical concepts

Let’s understand some statistical concepts that help us in further analysis of the data.

Distribution

A distribution shows how often each unique value appears in a dataset. We visualize distributions by plotting various graphs such as histograms, density plots, bar charts, pie charts, etc.
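To make the definition concrete, here is a minimal sketch that counts how often each unique value appears. The `contracts` series is hypothetical toy data made up for illustration; it is not taken from the Telco dataset.

```python
import pandas as pd

# Hypothetical toy data (an assumption for illustration only)
contracts = pd.Series(
    ["Monthly", "Yearly", "Monthly", "Monthly", "Two-year", "Yearly"]
)

# value_counts() returns how often each unique value appears
frequency = contracts.value_counts()
print(frequency)

# normalize=True converts the counts into proportions that sum to 1
share = contracts.value_counts(normalize=True)
print(share)
```

A bar chart of `frequency` would be one way to draw this distribution.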

Distribution graphs

These are graphs that are used to visualize distributions. We’ll use histograms or density plots to visualize continuous data distributions.

Normal distribution

Normal distribution graph

In a normal distribution, data is symmetrically distributed, i.e., the data distribution graph follows a bell shape and is symmetric about the mean. The normal distribution is also known as the Gaussian distribution.
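As a rough illustration of this symmetry, the sketch below draws synthetic samples from a normal distribution and checks two well-known properties: the mean and median nearly coincide, and roughly 68% of the values fall within one standard deviation of the mean. The mean of 50 and standard deviation of 10 are arbitrary choices made for this example.

```python
import numpy as np

# Synthetic data: 100,000 draws from a normal distribution.
# loc (mean) and scale (standard deviation) are arbitrary example values.
rng = np.random.default_rng(seed=0)
samples = rng.normal(loc=50.0, scale=10.0, size=100_000)

mean = samples.mean()
median = np.median(samples)

# Fraction of samples within one standard deviation of the mean
within_one_std = np.mean(np.abs(samples - mean) <= samples.std())

print(f"mean={mean:.2f}, median={median:.2f}")
print(f"share within 1 std: {within_one_std:.3f}")
```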

Continuous data distribution shapes


Continuous data is often assumed to follow a normal distribution. However, in practice, continuous data is frequently not normally distributed, and its distribution graph can take any of the following shapes:

Positive skew: This is also known as a right-skewed distribution. The distribution graph has a long tail to the right and a peak to the left.
Symmetrical: This is also known as a normal or Gaussian distribution. The distribution graph resembles a bell shape, and the shape of the distribution is precisely the same on both sides of the dotted line.
Negative skew: This is also known as a left-skewed distribution. The distribution graph has a long tail to the left and a peak to the right.
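The three shapes can also be told apart numerically. The sketch below uses synthetic data (not the Telco dataset) and the pandas `skew()` method, which returns a positive number for right-skewed data, a value near zero for symmetric data, and a negative number for left-skewed data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)

# Synthetic examples of each shape (assumptions for illustration only)
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
left_skewed = -right_skewed  # mirroring the data flips the tail to the left

for name, series in [
    ("right-skewed", right_skewed),
    ("symmetric", symmetric),
    ("left-skewed", left_skewed),
]:
    print(f"{name}: skewness = {series.skew():.2f}")

# In right-skewed data, the long tail pulls the mean above the median
print(right_skewed.mean() > right_skewed.median())
```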

Descriptive graphical analysis

Descriptive graphical analysis is yet another method of exploratory data analysis. It’s the process of analyzing data with the aid of graphs. This analysis provides us with in-depth knowledge of the sample data.

Descriptive graphical analysis is further divided into two types:

Univariate graphical analysis: Uni means 1, so the process of analyzing 1 feature is known as univariate graphical analysis.
Multivariate graphical analysis: Multi means 2 or more, so the process of analyzing 2 or more features is known as multivariate graphical analysis.

In this blog, we’ll go over univariate graphical analysis.

Univariate graphical analysis


The main purpose of univariate graphical analysis is to understand the distribution patterns of features. To visualize these distributions, we’ll utilize Python libraries like matplotlib and seaborn. These libraries contain a variety of graphical methods (such as histograms, count plots, KDE plots, violin plots, etc.) that help us visualize distributions in different styles.

Now, let’s perform univariate graphical analysis on continuous data features.

Import libraries and load dataset

Let’s start with importing the necessary libraries and loading the cleaned dataset. Check out the earlier blog in this series to see how we cleaned the dataset.

import pandas as pd
import matplotlib.pyplot as plt # Python library to plot graphs
import seaborn as sns # Python library to plot graphs, built on matplotlib
# displays graphs inline in a Jupyter notebook (a comment on the same line as the magic causes an error)
%matplotlib inline
df = pd.read_csv('cleaned_dataset.csv')
df # displays the dataset

Cleaned dataset

Identify continuous data features

Continuous data features are usually stored with the float data type. So let’s check the data types of the features using the dtypes attribute and identify the continuous data features.

df.dtypes

Data types of features

Observations:

“Latitude,” “Longitude,” “Monthly Charges,” and “Total Charges” features are of float data type, so they are continuous data features.

Create a new dataset, df_cont, containing all the continuous data features, and display the first 5 records using the head() method.

df_cont = df[['Latitude','Longitude','Monthly Charges','Total Charges']]
df_cont.head()

Continuous data features
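Typing the column names by hand works well for four features. As a possible alternative, the sketch below selects float columns automatically with the pandas `select_dtypes` method. The `toy` frame and its values are hypothetical and only mimic the layout of the dataset.

```python
import pandas as pd

# Hypothetical mini-frame (values are made up for illustration)
toy = pd.DataFrame(
    {
        "Tenure Months": [1, 24, 60],                  # integer column
        "Monthly Charges": [29.85, 56.95, 103.70],     # float column
        "Total Charges": [29.85, 1889.50, 7362.90],    # float column
        "Contract": ["Month-to-month", "One year", "Two year"],  # text column
    }
)

# Keep only the columns whose data type is float
toy_cont = toy.select_dtypes(include="float")
print(list(toy_cont.columns))
```

Note that this relies on the data type alone, so a float column that is not really a measurement would be selected as well.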

Distribution graphs

We can visualize continuous data feature distributions using graphical methods like histograms, displots, KDE plots, etc.

Histogram plots: These are graphical representations of the frequency of values in a dataset. Each bar corresponds to a bin (a range of values), and its height represents the count of observations that fall within that bin.

fig = plt.figure(figsize=(14, 8)) # sets the size of the figure with width as 14 and height as 8
for i, column in enumerate(df_cont.columns, 1):
    ax = plt.subplot(1, 4, i) # selects the i-th of 4 subplots in one single row
    sns.histplot(x=df_cont[column], ax=ax) # creates a histogram plot for the feature
    ax.set_xlabel('') # removes the label on the x-axis
    ax.set_title(f'Distribution of {column}') # adds a title to the subplot
plt.tight_layout(w_pad=3) # adds padding between the subplots (called once, after the loop)
plt.show() # displays the plots

Histogram plots
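To see what a histogram computes before anything is drawn, here is a small sketch using NumPy’s `histogram` function. The `charges` values and the bin edges are made up for this example.

```python
import numpy as np

# Hypothetical monthly charges (made-up values for illustration)
charges = np.array([20.0, 25.0, 45.0, 50.0, 55.0, 80.0, 85.0, 90.0, 95.0, 110.0])

# Three bins: [0, 40), [40, 80), and [80, 120]
counts, edges = np.histogram(charges, bins=[0, 40, 80, 120])

# Each count is the height of one bar in the histogram
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:>5.0f} to {right:<5.0f}: {count} observations")
```

Changing the bin edges changes the bar heights, which is why the same data can look different with a different number of bins.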

KDE plots: Kernel density estimate (KDE) plots are smoothed versions of histograms that help us understand the overall shape of distributions.

fig = plt.figure(figsize=(14, 8)) # sets the size of the figure with width as 14 and height as 8
for i, column in enumerate(df_cont.columns, 1):
    ax = plt.subplot(1, 4, i) # selects the i-th of 4 subplots in one single row
    sns.kdeplot(x=df_cont[column], ax=ax) # creates a KDE plot for the feature
    ax.set_xlabel('') # removes the label on the x-axis
    ax.set_title(f'Distribution of {column}') # adds a title to the subplot
plt.tight_layout(w_pad=3) # adds padding between the subplots (called once, after the loop)
plt.show() # displays the plots

KDE plots
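For intuition about what a KDE curve represents, the sketch below estimates a density directly with SciPy’s `gaussian_kde` on synthetic data and checks that the area under the curve is close to 1. This is an illustration under stated assumptions and is not necessarily identical to what `sns.kdeplot` draws, since bandwidth settings may differ.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic data (an assumption for illustration only)
rng = np.random.default_rng(seed=2)
values = rng.normal(loc=0.0, scale=1.0, size=2_000)

# Fit a Gaussian KDE; the bandwidth defaults to Scott's rule
kde = gaussian_kde(values)

# Evaluate the smooth density on a grid slightly wider than the data
grid = np.linspace(values.min() - 3.0, values.max() + 3.0, 1_000)
density = kde(grid)

# Approximate the area under the curve with a simple Riemann sum
area = float(np.sum(density) * (grid[1] - grid[0]))
print(f"area under the KDE curve: {area:.3f}")
```

Because each observation is smoothed outward, the curve extends a little beyond the smallest and largest observed values, so the ends of a KDE plot are best read as approximate.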

Observations:

None of the features are normally distributed.

Now, let’s take a closer look at all distributions.

KDE plots of “Latitude” and “Longitude”

Observations:

“Latitude” and “Longitude” data distribution shapes show two peaks; therefore, their distributions are bimodal.

KDE plot of “Monthly Charges”

Observations:

Customers’ current monthly charges vary between $0 and ~$120.
The data distribution shape shows 3 peaks, so it’s a multimodal distribution. This indicates that there may be 3 distinct customer groups. We can divide customers into groups based on the amount they pay. For example, customers who paid less than $40 can form one group.
Approximately 75% of the customers paid more than $40.
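Peaks were counted by eye above. As a hedged sketch of how the count could be checked programmatically, the example below builds synthetic data with three well-separated groups, estimates its density, and locates the peaks with SciPy’s `find_peaks`. The group centers (20, 60, 100) and the prominence threshold are assumptions chosen for this example, not values derived from the Telco dataset.

```python
import numpy as np
from scipy.signal import find_peaks
from scipy.stats import gaussian_kde

# Synthetic charges with three customer groups (made-up centers)
rng = np.random.default_rng(seed=4)
charges = np.concatenate(
    [
        rng.normal(loc=20.0, scale=2.0, size=2_000),
        rng.normal(loc=60.0, scale=2.0, size=2_000),
        rng.normal(loc=100.0, scale=2.0, size=2_000),
    ]
)

# Estimate the density on an evenly spaced grid
grid = np.linspace(0.0, 120.0, 601)
density = gaussian_kde(charges)(grid)

# Keep only peaks that stand out clearly; the 5% prominence cutoff is an assumption
peak_idx, _ = find_peaks(density, prominence=0.05 * density.max())
peak_locations = grid[peak_idx]

print(f"number of peaks: {len(peak_locations)}")
print(f"peak locations: {np.round(peak_locations, 1)}")
```

On real data, the number of detected peaks depends on the KDE bandwidth and the prominence cutoff, so the result is worth comparing against the plot.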

KDE plot of “Total Charges”

Observations:

Customers’ last quarter total charges vary between $0 and ~$8000.
The distribution has a tail to the right, so it’s a
right-skewed
distribution.
The dotted region’s area is large. This indicates that in the last quarter, most of the customers paid less than $2500.
The blue-shaded area is very small, which indicates that very few customers paid more than $5000.
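Statements such as “most customers paid less than $2500” can be backed with a quick calculation. The sketch below shows one way to do that with a boolean mask. The `total_charges` values are made up, so the printed shares say nothing about the real dataset.

```python
import pandas as pd

# Hypothetical total charges (made-up values for illustration)
total_charges = pd.Series(
    [120.5, 890.0, 1500.0, 2400.0, 3100.0, 4800.0, 5600.0, 7900.0]
)

# A comparison yields True/False per customer; the mean of that mask is the share
share_below_2500 = (total_charges < 2500).mean()
share_above_5000 = (total_charges > 5000).mean()

print(f"share below $2500: {share_below_2500:.0%}")
print(f"share above $5000: {share_above_5000:.0%}")
```

Applied to the real data, the same pattern would look like `(df['Total Charges'] < 2500).mean()`.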

Conclusion

Many machine learning algorithms tend to perform better when continuous data features are normally distributed.


Therefore, before feeding data into machine learning algorithms, it’s recommended to perform univariate graphical analysis to check the distribution shapes of continuous data features.
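A graphical check can be complemented with a numerical one. As a rough sketch, the example below runs SciPy’s `normaltest` (the D’Agostino-Pearson test) on two synthetic samples. The test is an addition for illustration and is not part of the analysis above; a very small p-value suggests that the data is unlikely to be normally distributed.

```python
import numpy as np
from scipy.stats import normaltest

# Synthetic samples (assumptions for illustration only)
rng = np.random.default_rng(seed=3)
samples = {
    "roughly normal": rng.normal(loc=65.0, scale=10.0, size=2_000),
    "right-skewed": rng.exponential(scale=2_000.0, size=2_000),
}

# Run the test on each sample and keep the results
results = {name: normaltest(values) for name, values in samples.items()}

for name, result in results.items():
    print(f"{name}: statistic={result.statistic:.1f}, p-value={result.pvalue:.3g}")
```

With large samples, such tests flag even small departures from normality, so the plots remain the primary guide.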

That’s it for this blog. Next in the series, we’ll perform univariate graphical analysis on discrete and categorical data.

Thanks for reading!!

2024 Mage Technologies, Inc.