TLDR
Once you've created several models, keeping track of them gets difficult, especially in a collaborative work environment. Learn how to save and load your models using Pickle!
Outline
Recap
Before we begin
Model differences
Saving the model
Loading the model
Conclusion
Recap
In our last part, we successfully created a model for a holiday remarketing campaign. To review the model, we'll need to share its results with our cross-functional team and staff. We'll want our data scientists and data analysts to be able to access the model without rebuilding it every time they close their computer.
Lost progress
Before we begin
In this guide, we'll cover how to export our machine learning model and import it back in using Python. No prior knowledge is required. In part 6, we finished training a classification model, so now we'll export it. The dataset can also be found
.
Model differences
When creating a model, it's worth noting that the results may change from run to run because parts of the process are randomized. That randomness is controlled by a starting value known as a seed. Without a fixed seed, the model may give different results even when you run it on the same data. Our model is trained with Logistic Regression, which is a discriminative machine learning algorithm.
Create a model using Logistic Regression
Discriminative algorithms
This doesn't mean that the algorithm is discriminatory. Rather, it tries to draw a line through our data to represent a boundary between classes. This line is also referred to as the decision boundary. The algorithm then classifies each data point by which side of the boundary it falls on. In our case, the two sides are whether or not a user will click on the remarketing email.
Decision boundary (Source: Vidhya)
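To make the idea of a decision boundary concrete, here is a minimal sketch of how a logistic-regression-style classifier picks a side. The weights, bias, and feature values are hypothetical and chosen only for illustration; a trained model would learn its own values from the data.

```python
import math

# Hypothetical weights and bias (assumptions, not values from our model).
WEIGHTS = [1.5, -2.0]
BIAS = 0.25

def click_probability(features):
    """Return the estimated probability that a user clicks the email."""
    score = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    return 1.0 / (1.0 + math.exp(-score))

def will_click(features):
    """Classify by which side of the decision boundary the point falls on."""
    # A probability of 0.5 corresponds to score == 0, the boundary itself.
    return click_probability(features) >= 0.5

print(will_click([1.0, 0.2]))  # positive score, so on the "click" side
print(will_click([0.1, 1.0]))  # negative score, so on the "no click" side
```

Points whose weighted score is above zero land on one side of the line, and points below zero land on the other.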
Splitting the data
When data is split into a train and test set without a fixed seed, the rows that land in each set are not guaranteed to be the same each time. Every time the model is trained, the split is driven by a new pseudo-random value, so two runs are highly unlikely to produce the same split.
As unlikely as it seems, repeats do happen (Source: Dilbert)
To keep the split consistent, we set random_state to a constant value. For this model I've chosen 3493 as my seed, so the splits come out the same on every run and the results are easier to replicate.
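The effect of a fixed seed can be sketched with the standard library alone. The `seeded_split` helper below is a hypothetical stand-in written for this illustration, not the function used in the series; scikit-learn's `train_test_split` accepts a `random_state` argument that plays the same role.

```python
import random

def seeded_split(rows, test_fraction=0.25, seed=None):
    """Shuffle a copy of rows using the seed, then cut it into train and test lists."""
    rng = random.Random(seed)      # seed=None means a different shuffle each run
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(20))             # placeholder data standing in for our dataset

# The same seed gives the same split every time.
train_a, test_a = seeded_split(rows, seed=3493)
train_b, test_b = seeded_split(rows, seed=3493)
print(train_a == train_b and test_a == test_b)
```

Leaving `seed` unset would let the two calls shuffle differently, which is the run-to-run variation described above.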
Saving the model
To save our model, we'll use Python's pickle module. It works by pickling the object, which means converting it through serialization into a byte stream. That byte stream records the structure and contents of the original model so it can be rebuilt later. Most built-in Python types can be pickled, including booleans, integers, strings, lists, dictionaries, and functions and classes defined at the top level of a module. NumPy objects can be pickled as well, although for models that hold large arrays it can be more efficient to use
, which has a similar syntax.
As for why it’s called Pickle… (Source: PngItem)
Remember the Pickle
Why the module is called pickle remains something of a mystery, but a fun way to remember the name is to think about why people pickle food. Traditionally, many cultures have practiced pickling as a form of preservation and storage. A longer shelf life means they can come back to the food later without it spoiling. Likewise, data scientists won't finish optimizing a model in one sitting, and developers won't all share one computer.
Pickling isn’t only for pickles. There’s kimchi too! (Source: ABCNews)
Dumping Pickle
The simplest way to save a model is as a bytes object assigned to a variable. This is useful if you don't need a file or want to experiment with different models in the same session. When using pickle.dumps (with an s), the model is serialized into a bytes object.
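As a minimal sketch of `pickle.dumps`, the example below serializes a plain dictionary that stands in for the trained model. The dictionary and its values are hypothetical; a fitted scikit-learn estimator would be passed to `pickle.dumps` in the same way.

```python
import pickle

# Stand-in for a trained model: hypothetical parameters in a plain dict.
model = {"algorithm": "LogisticRegression", "coef": [1.5, -2.0], "intercept": 0.25}

# Serialize the object into a bytes object held in memory (no file involved).
pickled_model = pickle.dumps(model)

print(type(pickled_model))  # <class 'bytes'>
```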
The command to create a pickle file is pickle.dump, which serializes the model and writes it into a file. First, we'll open the file with write access, write our pickle into it, and then close it.
Remember to specify wb so the file is opened for writing in binary mode.
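The file-based steps might look like the sketch below. The file name `model.pkl`, the temporary directory, and the dictionary standing in for the model are all assumptions made for illustration. The `with` block closes the file automatically once the pickle is written.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: hypothetical parameters in a plain dict.
model = {"algorithm": "LogisticRegression", "coef": [1.5, -2.0], "intercept": 0.25}

# Hypothetical location; in practice this would be a path your team can access.
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")

# "wb" opens the file for writing in binary mode, which pickle requires.
with open(model_path, "wb") as f:
    pickle.dump(model, f)

print(os.path.getsize(model_path) > 0)  # the file now holds the pickled bytes
```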
Loading the model
Similar to saving the model, we'll use pickle again to load it, this time with the load and loads functions. Once you have a pickle file, you can open it and retrieve the original object using pickle.load. Likewise, pickle.loads takes a bytes object instead of a file.
From variable
Remember to specify rb so the file is opened for reading in binary mode.
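Both loading routes can be sketched together. As before, the dictionary stands in for the trained model, and the `model.pkl` file name is hypothetical. Only unpickle data from sources you trust, since loading a pickle can execute arbitrary code.

```python
import os
import pickle
import tempfile

# Stand-in for a trained model: hypothetical parameters in a plain dict.
model = {"algorithm": "LogisticRegression", "coef": [1.5, -2.0], "intercept": 0.25}

# Save it both ways first so that there is something to load.
pickled_model = pickle.dumps(model)
model_path = os.path.join(tempfile.mkdtemp(), "model.pkl")  # hypothetical path
with open(model_path, "wb") as f:
    pickle.dump(model, f)

# From variable: loads (with an s) takes the bytes object directly.
from_variable = pickle.loads(pickled_model)

# From file: "rb" opens the file for reading in binary mode.
with open(model_path, "rb") as f:
    from_file = pickle.load(f)

print(from_variable == model, from_file == model)
```

Each call returns a new object that is equal to the original but is a separate copy.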
Checking the Pickle
Next, we'll run the loaded model on the same split of X_train and X_test and evaluate the scores. A loaded model carries the same fitted parameters as the one we saved, so on the same data it gives the same results.
The scores all match, so the pickle is correct.
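A check along these lines can be sketched without any third-party libraries. The `predict` helper, the parameter dictionary, and the `X_test` rows are hypothetical stand-ins; with a scikit-learn estimator you would instead compare something like `model.score(X_test, y_test)` before and after loading.

```python
import math
import pickle

def predict(params, rows):
    """Return 1 (click) or 0 (no click) for each row, using logistic-style scoring."""
    preds = []
    for row in rows:
        score = sum(w * x for w, x in zip(params["coef"], row)) + params["intercept"]
        prob = 1.0 / (1.0 + math.exp(-score))
        preds.append(1 if prob >= 0.5 else 0)
    return preds

# Hypothetical parameters and test rows, standing in for our model and X_test.
model = {"coef": [1.5, -2.0], "intercept": 0.25}
X_test = [[1.0, 0.2], [0.1, 1.0], [0.5, 0.1]]

# Round-trip the model through pickle, as if it were saved and loaded again.
loaded_model = pickle.loads(pickle.dumps(model))

original_preds = predict(model, X_test)
loaded_preds = predict(loaded_model, X_test)
print(original_preds == loaded_preds)
```

If the comparison ever came out different, that would point to a mismatch in the data or the file being loaded rather than to pickling itself.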
Conclusion
That wraps up this segment on saving and loading your machine learning model. We hope you'll remember the pickle and the process of pickling, and that you'll pickle and share your own machine learning models. In the next series, we'll export our results stored as
to evaluate our model metrics more thoroughly.