TLDR
We’ll show you how to build a basic machine learning model. For a quick background on AI/ML, please check out this
for an overview.
In this tutorial, we’ll use the Titanic dataset to predict which passengers survived the sinking. The dataset includes information on each passenger, such as the cabin they stayed in, their gender, and more.
High level steps for building an AI/ML model
Data preparation
Choose algorithm
Hyperparameter tuning
Train model
Evaluate performance
Deploy/Integrate model
Setup
Go to Google Colaboratory, click “File” in the top left, and click “New notebook”.
Download the file called “titanic_survival.csv”.
In your “New notebook” on Google Colaboratory, click the folder icon in the top left and then drag the file you just downloaded called “titanic_survival.csv” and drop it into that area.
Data preparation
Here are the steps when preparing data:
Download and split data
Add columns
Remove columns
Impute values
Scale values
Encode values
Select features
1. Download and split data
We first need to download the data from its source (a website, a database, a data warehouse, a SaaS tool, etc.) and load it into memory so we can operate on it quickly. Before we split the data, we need to determine which column we want to predict. In this tutorial, we’ll predict which passengers survived.
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv('/content/titanic_survival.csv')
label_feature_name = 'Survived'
X = df.drop(columns=[label_feature_name])
y = df[label_feature_name]
After that, we need to split the data into 2 parts: 1 for training the AI/ML model (aka train set) and 1 for evaluating the performance of the model (aka test set). The train set will have 80% of the rows from the original data and the test set the remaining 20%. There are different strategies for splitting the data; however, a common method is to stratify the split on the label so that survivors and non-survivors appear in the same proportion in both the train set and test set.
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    X,
    y,
    stratify=y,
    test_size=0.2,
)
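To see what stratifying does, here is a minimal sketch on toy data (not the Titanic file). The names `toy_X` and `toy_y` and the fixed `random_state` are assumptions made for illustration; the point is only that the share of positive labels should come out the same in both splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 40% of them labelled 1 (hypothetical, not the Titanic file)
toy_X = pd.DataFrame({'feature': range(100)})
toy_y = pd.Series([1] * 40 + [0] * 60, name='Survived')

toy_X_train, toy_X_test, toy_y_train, toy_y_test = train_test_split(
    toy_X,
    toy_y,
    stratify=toy_y,   # keep the label proportions the same in both splits
    test_size=0.2,
    random_state=42,  # fixed seed so the split is reproducible
)

# Share of positive labels in each split
train_positive_rate = toy_y_train.mean()
test_positive_rate = toy_y_test.mean()
print(f'Train positive rate: {train_positive_rate}')
print(f'Test positive rate : {test_positive_rate}')
```

Without `stratify`, a small test set could end up with noticeably more or fewer survivors than the train set purely by chance.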
2. Add columns
The data you downloaded may not have all the columns you need. You may want to add a few more columns by combining existing columns or performing some sort of calculation. For example, you may want to create a column called “year” which extracts the year of a date value from the birthday column.
df = X_train_raw.copy()
# Add a column to determine if the person can vote
df['can_vote'] = df['Age'].apply(lambda age: 1 if age >= 18 else 0)
# Count how many passengers can vote (18 or older) versus not
df['can_vote'].value_counts()
# Cabin letter: a cabin can be denoted as B123. The cabin letter will be B.
df.loc[:, 'cabin_letter'] = df['Cabin'].apply(
    lambda cabin: cabin[0] if cabin and type(cabin) is str else None,
)
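The “year” example mentioned in this step could look something like the sketch below. The Titanic data has no birthday column, so the `people` dataframe and its values are made up purely for illustration.

```python
import pandas as pd

# Hypothetical data with a birthday column (not part of the Titanic dataset)
people = pd.DataFrame({'birthday': ['1985-03-14', '1990-11-02', None]})

# Parse the strings into datetimes; missing values become NaT
people['birthday'] = pd.to_datetime(people['birthday'])

# Extract the year into a new column; rows with no birthday get NaN
people['year'] = people['birthday'].dt.year
print(people)
```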
3. Remove columns
There may be columns that you don’t think the model should learn from. For example, the model may not care about specific user IDs or email addresses (the email domain might matter). In these cases, we want to remove these columns from the data. By removing these columns, we help the model focus on what matters instead of trying to make sense of data that has no impact on the prediction. For example, a passenger’s ID probably has very little impact on whether they survived the sinking of the Titanic.
df = df.drop(columns=['Name', 'PassengerId'])
# Name and PassengerId are no longer columns
df.columns.tolist()
4. Impute values
Your data may have missing values in a particular column. The AI/ML model has a hard time knowing what to do with missing values. We can help it by filling in those missing values using some heuristic. For example, there are a lot of missing values in the “Cabin” column. For those with no known cabin, we’ll fill in the value “somewhere out of sight”. For those with missing age, we’ll use the median age to fill in those missing values.
from sklearn.impute import SimpleImputer
print(f'Missing values in "Cabin": {len(df[df["Cabin"].isna()].index)}')
df.loc[df['Cabin'].isna(), 'Cabin'] = 'somewhere out of sight'
df.loc[df['cabin_letter'].isna(), 'cabin_letter'] = 'ZZZ'
print(f'Missing values in "Age": {len(df[df["Age"].isna()].index)}')
age_imputer = SimpleImputer(strategy='median')
df.loc[:, ['Age']] = age_imputer.fit_transform(df[['Age']])
print(f'Missing values in "Embarked": {len(df[df["Embarked"].isna()].index)}')
df.loc[df['Embarked'].isna(), 'Embarked'] = 'no idea'
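As a small illustration of what median imputation does, here is a sketch on a handful of made-up ages (the `ages` array is an assumption, not taken from the dataset). The median is less sensitive to an outlier such as the 80-year-old than the mean would be.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up ages with one missing value
ages = np.array([[22.0], [38.0], [np.nan], [26.0], [80.0]])

imputer = SimpleImputer(strategy='median')
filled = imputer.fit_transform(ages)

# The median of the known ages (22, 26, 38, 80) is 32
learned_median = imputer.statistics_[0]
print(f'Learned median: {learned_median}')
print(filled.ravel())
```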
5. Scale values
Adjust the values of numeric columns to fall within similar ranges so that columns with large numbers (such as seconds since epoch) don’t influence the prediction disproportionately compared to columns with smaller values (such as age).
For example, if you have a column that is in seconds and a column that is in days, the difference in seconds between today and last week is 604,800 seconds. The difference in days between today and last week is 7. If we don’t scale these values, then the model will think the column with seconds has a greater distance between 2 numbers than the column with days.
There are multiple scaling strategies such as standard scaler and normalizer. For more information, check out this
.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df.loc[:, ['Age']] = scaler.fit_transform(df[['Age']])
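The seconds-versus-days example above can be sketched as follows. The two arrays are made up for illustration; they hold the same durations in different units, and after standard scaling they become identical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# The same three durations expressed in days and in seconds
days = np.array([[0.0], [7.0], [14.0]])
seconds = days * 86_400  # 7 days is 604,800 seconds

# Each column is scaled independently to mean 0 and standard deviation 1
scaled_days = StandardScaler().fit_transform(days)
scaled_seconds = StandardScaler().fit_transform(seconds)

print(scaled_days.ravel())
print(scaled_seconds.ravel())
```

Once scaled, the unit a column was recorded in no longer changes how far apart its values look to the model.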
6. Encode values
AI/ML algorithms perform mathematical operations using numbers. We must convert columns that contain strings into a number representation. A common technique is to encode categorical values. For example, we can convert the value “male” to 0 and “female” to 1. Note: we’re going to use one-hot encoding to convert these strings into numbers. For further explanation why, check out this
.
from sklearn.preprocessing import OneHotEncoder
categorical_columns = ['Pclass', 'Sex', 'Embarked', 'cabin_letter']
categorical_encoder = OneHotEncoder(handle_unknown='ignore')
categorical_encoder.fit(df[categorical_columns])
# Add the new columns to the data
new_column_names = []
for idx, cat_column_name in enumerate(categorical_columns):
    values = categorical_encoder.categories_[idx]
    new_column_names += [f'{cat_column_name}_{value}' for value in values]
df.loc[:, new_column_names] = \
    categorical_encoder.transform(df[categorical_columns]).toarray()
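To make the output of one-hot encoding concrete, here is a minimal sketch on a toy column (the `toy` dataframe and the value 'unknown' are assumptions for illustration). Each category gets its own 0/1 column, and with `handle_unknown='ignore'` a value the encoder never saw is encoded as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

toy = pd.DataFrame({'Sex': ['male', 'female', 'female']})

encoder = OneHotEncoder(handle_unknown='ignore')
encoded = encoder.fit_transform(toy[['Sex']]).toarray()

# Categories are stored in sorted order, so the columns are: female, male
categories = list(encoder.categories_[0])
print(categories)
print(encoded)

# A value that was not present during fit becomes a row of zeros
unseen = encoder.transform(pd.DataFrame({'Sex': ['unknown']})).toarray()
print(unseen)
```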
7. Select features
Now that we’ve prepared our data, we need to select the features we want our model to learn from. There are many techniques for doing this (
’s tool handles this automatically for you). For this tutorial, we’ll simply select the remaining numeric columns plus the features we manually added, scaled, or encoded.
features_to_use = [
    'Age',
    'SibSp',
    'Parch',
    'Fare',
    'can_vote',
] + new_column_names
X_train = df[features_to_use].copy()
Choose algorithm
Once our data is in a state that is ready to be trained on, we must choose an algorithm to use. Different algorithms are best suited for different types of problems and different types of data. For this tutorial, we’ll use a basic algorithm called logistic regression that’ll help us classify which passengers survived the Titanic crash.
from sklearn.linear_model import LogisticRegression
classifier = LogisticRegression(max_iter=10000)
Hyperparameter tuning
An AI/ML model has parameters that aren’t related to the features (aka columns in the data). These “hyper” parameters control how the model behaves throughout its training. When improving AI/ML models, it’s common to try many different combinations of hyperparameters to find the one that yields the best results. We’ll skip this optimization for this tutorial (keep an eye out for a future article on this topic).
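Although this tutorial skips tuning, a rough sketch of what trying several hyperparameter values could look like is shown below. It uses scikit-learn’s `GridSearchCV` on synthetic data; the grid of `C` values (the regularization strength of logistic regression) and the variable names are assumptions chosen for illustration, not recommended settings.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for a prepared train set
demo_X, demo_y = make_classification(
    n_samples=200, n_features=5, random_state=0,
)

# Candidate values to try for the regularization strength C
param_grid = {'C': [0.01, 0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=10000),
    param_grid,
    cv=5,                # 5-fold cross-validation for each candidate
    scoring='accuracy',
)
search.fit(demo_X, demo_y)

print(f'Best hyperparameters: {search.best_params_}')
print(f'Best cross-validated accuracy: {search.best_score_}')
```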
Train model
We take the data that was prepared (X_train) and the actual results (y_train) for each row (e.g. whether the passenger survived the Titanic) and feed it into the model. The model will learn from looking at the values in each column and seeing what result it produces (1 for survived, 0 for not survived). Once the model learns from all the data, it will finish training and can be used to make predictions on unseen data.
classifier.fit(X_train, y_train)
Evaluate performance
Prepare test data
Use model to predict on test data
Calculate model accuracy
Determine baseline performance and compare
1. Prepare test data
First, we’ll prepare our test data (e.g. add columns, remove columns, impute values, scale values, encode values, and select features) in the same way we did for our train set. One caveat is that we won’t “fit” our imputer, our standard scaler, or our encoder again because we only want to “fit” those on the train set.
Note: the code below is an exact copy of the code written above during data preparation for the train set, except we are calling functions on the variable containing the test data. A better engineering practice would be to refactor the code by creating a reusable function that accepts a Pandas dataframe as an argument, calls all the data preparation steps on that dataframe, and returns it.
Here is the code, written out and not refactored for clarity’s sake:
X_test = X_test_raw.copy()
# Add columns
X_test['can_vote'] = X_test['Age'].apply(lambda age: 1 if age >= 18 else 0)
X_test.loc[:, 'cabin_letter'] = X_test['Cabin'].apply(
    lambda cabin: cabin[0] if cabin and type(cabin) is str else None,
)
# Remove columns
X_test = X_test.drop(columns=['Name', 'PassengerId'])
# Impute values
X_test.loc[X_test['Cabin'].isna(), 'Cabin'] = 'somewhere out of sight'
X_test.loc[X_test['cabin_letter'].isna(), 'cabin_letter'] = 'ZZZ'
X_test.loc[:, ['Age']] = age_imputer.transform(X_test[['Age']])
X_test.loc[X_test['Embarked'].isna(), 'Embarked'] = 'no idea'
# Scale columns
X_test.loc[:, ['Age']] = scaler.transform(X_test[['Age']])
# Encode values
X_test.loc[:, new_column_names] = categorical_encoder.transform(
    X_test[categorical_columns],
).toarray()
# Select features
X_test = X_test[features_to_use].copy()
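The refactoring suggested in the note above might look roughly like this sketch. To stay short it only covers the “Age” steps; the function name `prepare_features`, its `fit` flag, and the toy dataframes are hypothetical and not part of the original notebook.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler


def prepare_features(raw, age_imputer, scaler, fit=False):
    """Apply the same Age preparation to any split (hypothetical helper)."""
    prepared = raw.copy()
    # Add column: missing ages compare as False, so they get 0
    prepared['can_vote'] = (prepared['Age'] >= 18).astype(int)
    # Impute: learn the median only when fit=True (train set)
    if fit:
        age_imputer.fit(prepared[['Age']])
    prepared['Age'] = age_imputer.transform(prepared[['Age']]).ravel()
    # Scale: learn mean and standard deviation only when fit=True
    if fit:
        scaler.fit(prepared[['Age']])
    prepared['Age'] = scaler.transform(prepared[['Age']]).ravel()
    return prepared


# Toy train and test splits (made-up values)
toy_train = pd.DataFrame({'Age': [10.0, 30.0, None, 50.0]})
toy_test = pd.DataFrame({'Age': [None, 20.0]})

toy_imputer = SimpleImputer(strategy='median')
toy_scaler = StandardScaler()

prepared_train = prepare_features(toy_train, toy_imputer, toy_scaler, fit=True)
prepared_test = prepare_features(toy_test, toy_imputer, toy_scaler)
print(prepared_test)
```

The test split reuses the median and scaling statistics learned from the train split, which is the same caveat described at the start of this step.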
2. Use model to predict on test data
Next, we use the model to predict who survives from the test data (remember we split the data earlier during data preparation).
y_pred = classifier.predict(X_test)
3. Calculate model accuracy
Regression and classification models have different metrics that are used to evaluate the performance of the model. Since we’re using a classification model (even though it’s called logistic regression, it can be used for classifying), we’ll use accuracy as our metric. If we were predicting multiple categories, we’d also want to use
and
as a metric.
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy score: {accuracy}')
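Accuracy alone can hide which kind of mistakes a model makes. As an illustration, and assuming precision and recall as the additional metrics (a common choice, not necessarily the ones this article had in mind), here is a sketch on made-up labels.

```python
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
)

# Made-up true labels and predictions (1 = survived, 0 = did not survive)
toy_true = [0, 0, 0, 0, 1, 1, 1, 1]
toy_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
matrix = confusion_matrix(toy_true, toy_pred)

toy_accuracy = accuracy_score(toy_true, toy_pred)    # correct / total
toy_precision = precision_score(toy_true, toy_pred)  # of predicted 1s, share truly 1
toy_recall = recall_score(toy_true, toy_pred)        # of true 1s, share predicted 1

print(matrix)
print(f'Accuracy : {toy_accuracy}')
print(f'Precision: {toy_precision}')
print(f'Recall   : {toy_recall}')
```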
4. Determine baseline performance and compare
In order for us to understand how good this accuracy is, we need to establish a baseline. In this specific example, the baseline is the accuracy we’d get by always predicting the most common outcome: the number of people who didn’t survive within the test set divided by the number of rows in the test set.
baseline_accuracy_score = y_test.value_counts()[0] / len(y_test)
print(f'Model performance   : {accuracy}')
print(f'Baseline performance: {baseline_accuracy_score}')
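A similar baseline can be obtained with scikit-learn’s `DummyClassifier`, which always predicts the most frequent class seen during training. The sketch below uses made-up labels. Note that it takes the majority class from the train labels, so it matches the manual calculation only when the same class is the most common in both splits.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# Made-up labels; the features are ignored by the dummy model
toy_y_train = pd.Series([0] * 6 + [1] * 4)
toy_y_test = pd.Series([0, 0, 0, 1, 1])
toy_X_train = pd.DataFrame({'unused': range(10)})
toy_X_test = pd.DataFrame({'unused': range(5)})

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(toy_X_train, toy_y_train)

# Accuracy of always predicting the majority class (0) on the test labels
dummy_accuracy = baseline.score(toy_X_test, toy_y_test)

# The manual calculation used in this tutorial, applied to the toy labels
manual_accuracy = toy_y_test.value_counts()[0] / len(toy_y_test)

print(f'Dummy baseline : {dummy_accuracy}')
print(f'Manual baseline: {manual_accuracy}')
```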
Deploy/Integrate model
Once you’ve trained the model and fine-tuned it to your business needs, it’s time to integrate it into your product or business operations. There are several ways of doing this: you can deploy the model to an online server where the model’s prediction can be accessed via an API request, or you can set up your model to perform batch predictions and export those predictions to your data warehouse, data lake, etc.
Deploying your model, maintaining the model, keeping it up-to-date so that it makes relevant predictions, and making sure your feature data is fresh and readily available to retrieve for online predictions is time-consuming, costly, extremely complex, and a non-differentiating skillset. Instead of focusing your energy on this particular aspect, it’s common to rely on other tools for this service. A tool like
not only helps you prepare your data and train your model, it also helps you access your model from an API endpoint and keeps the model relevant by retraining it regularly.
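As a very small sketch of the batch-prediction route mentioned in this section, a trained model can be saved to disk and loaded later to score a batch of rows. This assumes `joblib` (installed alongside scikit-learn) for persistence and uses synthetic data; the file name and variable names are hypothetical, and a real deployment involves much more than this.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data standing in for prepared features and labels
demo_X, demo_y = make_classification(
    n_samples=100, n_features=4, random_state=0,
)
model = LogisticRegression(max_iter=10000).fit(demo_X, demo_y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, 'model.joblib')
    joblib.dump(model, model_path)     # save the trained model to a file
    restored = joblib.load(model_path)  # later: load it back for scoring

# Score a batch of rows with the restored model
batch_predictions = restored.predict(demo_X[:10])
print(batch_predictions)
```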
Conclusion
Here is the link to the
.
Additional resources
Best book for breaking into machine learning: