Guide to model training: Part 1 - Qualitative data

First published on November 4, 2021

Last updated on November 22, 2021

 

6 minute read

Nathaniel Tjandra

TLDR

Machines have a hard time understanding data that humans can read. Encode qualitative data to train models more efficiently.

Outline

  • Introduction

  • Before we begin

  • Qualitative data

  • Encoding data

  • Conclusion

Introduction

Most applications will have collected a large amount of data over their lifetime of usage. This data is extremely useful to the business for making improvements, and it is also applicable to developing AI solutions. Machine learning is an AI solution that works best when a company has enough data to detect patterns and behaviors, from ranking the present to predicting the future. In this guide, we’ll look at how to clean big data so machines can interpret it faster and more accurately while it remains human interpretable.

Before we begin

This guide will use a dataset collected for a marketing campaign. It contains data on each customer’s personal life, which can be used to analyze customers and replicate their decision making.

That’s a lot of data

If you’re interested in how to create datasets like this, read our guide on Data Preparation/customize data. As a precursor, read our introductory guide if you aren’t familiar with Pandas or machine learning basics.

Qualitative data

To start off with our big data, we’ll choose data points that are letters, not numbers, because machines do not understand letters as well as they understand numbers. The columns “Education” and “Marital_Status” hold qualitative textual data. Qualitative data, sometimes also called categorical data, has two forms: nominal and ordinal.
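As a rough sketch of how the text columns could be picked out with Pandas: the sample rows below are made up for illustration, and the `Income` column is a hypothetical numeric column added only for contrast. Only the `Education` and `Marital_Status` column names come from the article.

```python
import pandas as pd

# Hypothetical sample rows; only the column names "Education" and
# "Marital_Status" come from the article, the values are illustrative.
df = pd.DataFrame({
    "Education": ["Graduation", "PhD", "Master", "Basic"],
    "Marital_Status": ["Single", "Married", "Divorced", "Married"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
})

# Keep every column that is not numeric; these are the qualitative candidates.
text_columns = df.select_dtypes(exclude="number").columns.tolist()

# List the distinct values of each text column before deciding how to encode it.
unique_values = {col: sorted(df[col].unique()) for col in text_columns}
print(unique_values)
```

Looking at the distinct values first makes it easier to judge whether a column has a natural order (ordinal) or not (nominal).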

Source: Towards Data Science

Ordinal data

One form of qualitative data is ordinal data. This type of data requires more thought because its values can be ranked against each other. Encoding that ranking gives the machine insight into how the values relate to one another. For instance, in the data above, the “Education” column is ordinal because the amount of investment varies: a high school diploma (“Basic”), an associate’s degree (“2n Cycle”), a bachelor’s degree (“Graduation”), a master’s degree (“Master”), and a PhD differ in their level of expertise.
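One way to see what "ordered" means in practice is Pandas’ ordered categorical type. This is not the approach the guide takes later; it is a small sketch, with made-up sample values, that assumes the ordering Basic < 2n Cycle < Graduation < Master < PhD described above.

```python
import pandas as pd

# Order of the education levels as described in the article, lowest first.
education_order = ["Basic", "2n Cycle", "Graduation", "Master", "PhD"]

# Hypothetical sample values for illustration.
education = pd.Series(["Graduation", "PhD", "Basic", "2n Cycle"])

# An ordered categorical dtype tells Pandas that the labels can be ranked.
ordered = education.astype(
    pd.CategoricalDtype(categories=education_order, ordered=True)
)

# Ordered categories support comparisons against a label.
above_bachelor = ordered > "Graduation"

# The integer code of each value is its position in education_order.
ranks = ordered.cat.codes
print(above_bachelor.tolist(), ranks.tolist())
```

The codes here are evenly spaced positions, which is why a hand-picked scale can be preferable when the gaps between levels are not equal.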

Nominal data

A simpler form of qualitative data is nominal data. Its values are names or labels with no inherent order, so it doesn’t require many changes to become machine interpretable. Put simply, all that is required is a mapping from each non-numerical value to a numeric value that does not rank one value above another. The “Marital_Status” column is an example of nominal data: whether a person is married, divorced, or widowed doesn’t make them better, so the values cannot be weighed against each other. In fact, it’s dangerous to use weights here, as they can make models discriminatory and biased.

Encoding data

Now that we understand the differences between qualitative data types, we can look at how to encode them, adding a scale for ordinal data and removing it for nominal data. The scikit-learn library supports two powerful encoding methods for ordinal and nominal data, but we won’t be using them. Instead, we’ll use only Pandas to truly understand the finer steps of what’s happening behind the function.

Weighing labels

For ordinal data, we’ll begin by assigning weights to the labels based on their priority or value. Our dataset’s “Education” column contains several levels of education. We start by creating a map that ranks each level relative to the others on a 10-point scale.

The pinnacle of education, the PhD, ranks as a 10. On the other hand, the lowest value in the dataset, “Basic”, ranks as a 1. Traditionally, students who continue into higher education complete up to a bachelor’s degree before seeking a job, so we weigh it as a 5. I placed the 2n Cycle, or associate’s degree, 2 points below the bachelor’s at a 3, and the master’s degree between the bachelor’s and the PhD at an 8. As a result, our final weights are <1, 3, 5, 8, 10> for <‘Basic’, ‘2n Cycle’, ‘Graduation’, ‘Master’, ‘PhD’>.

We store this as a map
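A minimal sketch of what such a map might look like in Pandas, using the final weights listed above. The sample rows and the `Education_Weight` column name are assumptions made for illustration; the original code may have overwritten the `Education` column instead.

```python
import pandas as pd

# Weights from the article: <1, 3, 5, 8, 10> on a 10-point scale.
education_weights = {
    "Basic": 1,
    "2n Cycle": 3,
    "Graduation": 5,
    "Master": 8,
    "PhD": 10,
}

# Hypothetical sample rows for illustration.
df = pd.DataFrame(
    {"Education": ["Graduation", "PhD", "Master", "Basic", "2n Cycle"]}
)

# Series.map looks up each label in the dictionary.
# Labels missing from the dictionary would become NaN.
df["Education_Weight"] = df["Education"].map(education_weights)
print(df)
```

Checking the result for NaN values afterwards is a quick way to catch labels that were left out of the map.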

Choosing these values gives the machine insight into how each level of education is proportioned relative to the others.

Result of the new map

One Hot Encoding

Nominal data, on the other hand, should be weighed equally across all values, so we cannot assign them unequal numbers. To map the non-numerical values, we create a new column for each unique value and record whether that value is present in each row.
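The idea of one column per unique value can be sketched by hand with a short loop. The sample rows are made up, and the `Marital_Status_<value>` column naming is an assumption chosen for readability.

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame(
    {"Marital_Status": ["Single", "Married", "Divorced", "Married"]}
)

# Create one new column per unique value: 1 where the row holds
# that value, 0 everywhere else.
for value in df["Marital_Status"].unique():
    df[f"Marital_Status_{value}"] = (df["Marital_Status"] == value).astype(int)

print(df)
```

Every row ends up with exactly one 1 across the new columns, so no status is ranked above another.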

Looks like this dataset has some funny answers

Each unique value in “Marital_Status” requires a new column. One shortcut in Pandas is to use a prefix to name the new column created for each value in the original column.
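The prefix shortcut is most likely `pd.get_dummies`, though that is an assumption since the original code is not shown here. A minimal sketch with made-up sample rows and a hypothetical `Income` column:

```python
import pandas as pd

# Hypothetical sample rows for illustration.
df = pd.DataFrame({
    "Marital_Status": ["Single", "Married", "Divorced", "Married"],
    "Income": [58138.0, 46344.0, 71613.0, 26646.0],
})

# get_dummies builds one column per unique value and names each one
# "<prefix>_<value>"; dtype=int yields 1/0 instead of True/False.
dummies = pd.get_dummies(
    df["Marital_Status"], prefix="Marital_Status", dtype=int
)

# Swap the original text column for the new indicator columns.
encoded = pd.concat([df.drop(columns="Marital_Status"), dummies], axis=1)
print(encoded)
```

Dropping the original text column after encoding keeps only numeric columns in the frame that is passed on to training.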

Every present value has been changed to a 1, and every non-present value to a 0.

Conclusion

We now understand the differences between qualitative data types and how to encode our big data into machine-understandable numbers. We also learned best practices for reducing bias when converting textual data. In the next section, we’ll take a look at quantitative numerical data and explore the differences between continuous and discrete data, and the approaches to each, through scaling.