Guide to Churn Prediction : Part 4 — Graphical analysis

First published on January 26, 2022

Last updated on February 9, 2022

 

10 minute read

Jahnavi C.

TLDR

In this blog, we’ll explore and unlock the mysteries of the Telco Customer Churn dataset using descriptive graphical methods.

Outline

  • Recap

  • Before we begin

  • Statistical concepts

  • Descriptive graphical analysis

  • Conclusion

Recap

In part 3 of the series, we analyzed and explored the Telco Customer Churn dataset using descriptive statistical analysis and gained an overview of the data.

Before we begin

This guide assumes that you are familiar with data types. If you're not, please read up on data types before continuing.

Statistical concepts

Let’s understand some statistical concepts that help us in further analysis of the data.

Distribution

A distribution shows how often each unique value appears in a dataset. We visualize distributions by plotting graphs such as histograms, density plots, bar charts, and pie charts.
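As a rough illustration of "how often each unique value appears", here is a minimal sketch that counts values with pandas. The `contracts` series and its values are made up for the example and are not taken from the Telco dataset.

```python
import pandas as pd

# Hypothetical toy data, not taken from the Telco dataset.
contracts = pd.Series(
    ["Monthly", "Yearly", "Monthly", "Two-year", "Monthly", "Yearly"]
)

# value_counts() returns how often each unique value appears,
# sorted from most to least frequent.
frequency = contracts.value_counts()
print(frequency)
```

A bar chart or pie chart of a categorical feature is essentially a picture of this frequency table.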

Distribution graphs

These are graphs that are used to visualize distributions. We’ll use histograms or density plots to visualize continuous data distributions.

Normal distribution

Normal distribution graph

In a normal distribution, data is symmetrically distributed, i.e., the data distribution graph follows a bell shape and is symmetric about the mean. The normal distribution is also known as the Gaussian distribution.
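The symmetry about the mean can be checked numerically. The sketch below draws a synthetic normal sample with NumPy (the mean of 50, standard deviation of 10, and the seed are arbitrary assumptions) and confirms that the mean and median nearly coincide and that about half of the values fall below the mean.

```python
import numpy as np

rng = np.random.default_rng(seed=42)  # fixed seed so the run is repeatable
sample = rng.normal(loc=50, scale=10, size=100_000)  # synthetic data

mean = sample.mean()
median = np.median(sample)

# Share of values that fall below the mean; close to 0.5 for symmetric data.
share_below_mean = (sample < mean).mean()

print(f"mean={mean:.2f}, median={median:.2f}, "
      f"share below mean={share_below_mean:.3f}")
```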

Continuous data distribution shapes


Continuous data is often assumed to follow a normal distribution. In practice, however, real-world continuous data is frequently not normally distributed, and its distribution graph can take any of the following shapes:

  • Positive skew: This is also known as a right-skewed distribution. The distribution graph has a long tail to the right and a peak to the left.

  • Symmetrical: This is also known as a normal or Gaussian distribution. The distribution graph resembles a bell shape, and the shape of the distribution is precisely the same on both sides of the dotted line.

  • Negative skew: This is also known as a left-skewed distribution. The distribution graph has a long tail to the left and a peak to the right.
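The three shapes above can also be told apart with a single number, the skewness. The following is a minimal sketch on synthetic data (the samples and seed are assumptions for illustration): pandas' `Series.skew()` returns a clearly positive value for a right-skewed sample, a value near zero for a symmetric one, and a clearly negative value for a left-skewed one.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=0)

# Synthetic samples, one per shape.
right_skewed = pd.Series(rng.exponential(scale=1.0, size=10_000))
symmetric = pd.Series(rng.normal(loc=0.0, scale=1.0, size=10_000))
left_skewed = -right_skewed  # mirroring flips the long tail to the left

samples = [
    ("right-skewed", right_skewed),
    ("symmetric", symmetric),
    ("left-skewed", left_skewed),
]
for name, series in samples:
    print(f"{name}: skewness = {series.skew():.2f}")
```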

Descriptive graphical analysis

Descriptive graphical analysis is yet another method of exploratory data analysis. It's the process of analyzing data with the aid of graphs. This analysis provides us with in-depth knowledge of the sample data.

Descriptive graphical analysis is further divided into 2 types:

  1. Univariate graphical analysis: Uni means 1, so the process of analyzing 1 feature is known as univariate graphical analysis.

  2. Multivariate graphical analysis: Multi means 2 or more, so the process of analyzing 2 or more features is known as multivariate graphical analysis.

In this blog, we’ll go over univariate graphical analysis.

Univariate graphical analysis


The main purpose of univariate graphical analysis is to understand the distribution patterns of features. To visualize these distributions, we'll utilize Python libraries like matplotlib and seaborn. These libraries contain a variety of graphical methods (such as histograms, count plots, KDE plots, violin plots, etc.) that help us visualize distributions in different styles.

Now, let’s perform univariate graphical analysis on continuous data features.

Import libraries and load dataset

Let’s start with importing the necessary libraries and loading the cleaned dataset. The dataset was cleaned in an earlier post of this series.

import pandas as pd
import matplotlib.pyplot as plt  # Python library to plot graphs
import seaborn as sns  # Python library to plot graphs

# Display graphs inline in the Jupyter notebook
%matplotlib inline

df = pd.read_csv('cleaned_dataset.csv')
df  # displays the dataset in a notebook cell

Cleaned dataset

Identify continuous data features

Continuous data features are of float data type. So let’s check the data types of features using the dtypes attribute and identify continuous data features.

df.dtypes

Data types of features

Observations:

“Latitude,” “Longitude,” “Monthly Charges,” and “Total Charges” features are of float data type, so they are continuous data features.
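Instead of reading the dtypes output by eye, the float columns can be picked out programmatically. This is a minimal sketch using pandas' `select_dtypes`; the `toy_df` frame, its values, and the non-float columns “Tenure Months” and “Churn Label” are hypothetical stand-ins, not the actual cleaned dataset.

```python
import pandas as pd

# Toy stand-in for the cleaned dataset; columns and values are made up.
toy_df = pd.DataFrame({
    "Latitude": [33.96, 34.05],
    "Monthly Charges": [53.85, 70.70],
    "Tenure Months": [2, 8],
    "Churn Label": ["Yes", "No"],
})

# select_dtypes keeps only the columns whose dtype matches.
float_columns = toy_df.select_dtypes(include="float").columns.tolist()
print(float_columns)
```

On the real data, the same call on `df` should list the four float features named in the observation above.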

Create a new dataset

Create a new dataset, df_cont, containing all the continuous data features, and display the first 5 records using the head() method.

df_cont = df[['Latitude', 'Longitude', 'Monthly Charges', 'Total Charges']]
df_cont.head()

Continuous data features

Distribution graphs

We can visualize continuous data feature distributions using graphical methods like histograms, displots, KDE plots, etc.

Histogram plots: These are graphical representations of the frequency of values in a dataset. The range of values is split into bins, and the height of each bar represents the count of observations that fall within that bin.

fig = plt.figure(figsize=(14, 8))  # sets the figure size: width 14, height 8
for i, column in enumerate(df_cont.columns, 1):
    ax = plt.subplot(1, 4, i)  # places the i-th of 4 subplots in a single row
    sns.histplot(x=df_cont[column], ax=ax)  # draws a histogram of the feature
    ax.set_xlabel('')  # removes the x-axis label
    ax.set_title(f'Distribution of {column}')  # adds a title to the subplot
plt.tight_layout(w_pad=3)  # adds padding between the subplots
plt.show()  # displays the plots

Histogram plots
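The binning behind a histogram can be sketched numerically with NumPy's `np.histogram`, which returns the count per bin and the bin edges. The `charges` values, the choice of 3 bins, and the 0 to 120 range are assumptions made for illustration only.

```python
import numpy as np

# Made-up charges, not taken from the Telco dataset.
charges = np.array([20, 25, 30, 45, 50, 55, 70, 80, 95, 110])

# Split the range 0..120 into 3 equal-width bins and count values per bin.
counts, bin_edges = np.histogram(charges, bins=3, range=(0, 120))

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(bin_edges[:-1], bin_edges[1:], counts):
    print(f"{left:.0f} to {right:.0f}: {count} observations")
```

Each printed count corresponds to the height of one bar in a histogram drawn with the same bins.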

KDE plots: Kernel density estimate (KDE) plots are smoothed versions of histograms that help us understand the overall shape of distributions.

fig = plt.figure(figsize=(14, 8))  # sets the figure size: width 14, height 8
for i, column in enumerate(df_cont.columns, 1):
    ax = plt.subplot(1, 4, i)  # places the i-th of 4 subplots in a single row
    sns.kdeplot(x=df_cont[column], ax=ax)  # draws a KDE plot of the feature
    ax.set_xlabel('')  # removes the x-axis label
    ax.set_title(f'Distribution of {column}')  # adds a title to the subplot
plt.tight_layout(w_pad=3)  # adds padding between the subplots
plt.show()  # displays the plots

KDE plots
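To make the "smoothed histogram" idea concrete, here is a minimal hand-rolled Gaussian KDE in NumPy. It is a sketch of the general technique, not seaborn's implementation: the helper `gaussian_kde_estimate`, the sample values, and the fixed bandwidth of 5 are assumptions, whereas seaborn chooses a bandwidth automatically.

```python
import numpy as np

def gaussian_kde_estimate(data, grid, bandwidth):
    """Average one Gaussian bump per observation, evaluated on `grid`."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # z[i, j] = scaled distance between grid point i and observation j
    z = (grid[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

values = np.array([20.0, 22.0, 25.0, 60.0, 65.0, 90.0])  # made-up values
grid = np.linspace(-20, 130, 601)
density = gaussian_kde_estimate(values, grid, bandwidth=5.0)

# A density curve should enclose an area of about 1.
step = grid[1] - grid[0]
area = density.sum() * step
print(f"area under the curve = {area:.3f}")
```

A larger bandwidth gives a smoother curve with fewer bumps, while a smaller one follows the individual observations more closely.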

Observations:

None of the features are normally distributed.

Now, let’s take a closer look at all distributions.

KDE plots of “Latitude” and “Longitude”

Observations:

“Latitude” and “Longitude” data distribution shapes show 2 peaks, therefore their distributions are bimodal.
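Counting peaks by eye can be backed up with a simple check on the density curve. The sketch below is an assumption-laden illustration: `count_peaks` is a hypothetical helper, the two-cluster data is synthetic, and the bandwidth and the 10% height threshold are arbitrary choices. It counts local maxima of a Gaussian KDE that are tall enough to matter.

```python
import numpy as np

def count_peaks(density, min_relative_height=0.1):
    """Count local maxima that reach a given fraction of the tallest point."""
    density = np.asarray(density, dtype=float)
    inner = density[1:-1]
    is_local_max = (inner > density[:-2]) & (inner > density[2:])
    is_tall_enough = inner >= min_relative_height * density.max()
    return int(np.sum(is_local_max & is_tall_enough))

rng = np.random.default_rng(seed=1)
# Synthetic data with two well-separated clusters.
values = np.concatenate([rng.normal(20, 2, 500), rng.normal(60, 2, 500)])

# Gaussian KDE evaluated on a grid, with a hand-picked bandwidth.
grid = np.linspace(0, 80, 401)
bandwidth = 3.0
z = (grid[:, None] - values[None, :]) / bandwidth
density = np.exp(-0.5 * z**2).mean(axis=1) / (bandwidth * np.sqrt(2 * np.pi))

n_peaks = count_peaks(density)
print(f"number of peaks: {n_peaks}")
```

The result depends on the bandwidth, so a count like this is a sanity check on the plot rather than a replacement for looking at it.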

KDE plot of “Monthly Charges”

Observations:

  1. Customers’ current monthly charges vary between $0 and ~$120.

  2. The data distribution shape shows 3 peaks, so it’s a multimodal distribution. This indicates that there may be 3 distinct customer groups. We can divide customers into groups based on the amount they pay. For example, customers who paid less than $40 can be formed into a group.

  3. Approximately 75% of the customers paid more than $40.

KDE plot of “Total Charges”

Observations:

  1. Customers’ last quarter total charges vary between $0 and ~$8000.

  2. The distribution has a tail to the right, so it’s a right-skewed distribution.

  3. The dotted region’s area is large. This indicates that in the last quarter, most of the customers paid less than $2500.

  4. The blue-shaded area is very small, which indicates that very few customers paid more than $5000.
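Statements such as "most customers paid less than $2500" or "approximately 75% paid more than $40" can be verified directly from the data rather than read off the plot. The sketch below uses made-up values; on the real data, `total_charges` would be replaced by `df_cont['Total Charges']` (or `df_cont['Monthly Charges']` with a $40 threshold).

```python
import pandas as pd

# Made-up charges; swap in df_cont['Total Charges'] for the real figures.
total_charges = pd.Series(
    [120.0, 450.0, 990.0, 1800.0, 2400.0, 3100.0, 5200.0, 7900.0]
)

# The mean of a boolean mask is the share of rows where the condition holds.
share_below_2500 = (total_charges < 2500).mean()
share_above_5000 = (total_charges > 5000).mean()

print(f"paid less than $2500: {share_below_2500:.1%}")
print(f"paid more than $5000: {share_above_5000:.1%}")
```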

Conclusion

Some machine learning algorithms tend to perform better when continuous data features are close to normally distributed.


Therefore, before feeding data into machine learning algorithms, it’s recommended to perform univariate graphical analysis to check the distribution shapes of continuous data features.

That’s it for this blog. Next in the series, we’ll perform univariate graphical analysis on discrete and categorical data.

Thanks for reading!!