TLDR
Apply feature engineering by converting time series data to numerical values for training machine learning models.
Outline
Recap
Before we begin
The datetime data type
Converting to date
What’s next?
Recap
In our series so far, we've gone over scaling data to prepare for model training. We started with a dataset filled with categorical and numerical values and scaled them so that a computer could understand them. With most of the dataset handled, we're almost ready to begin model training; we just need to convert our dates.
Before we begin
In this section, we'll be revisiting the data types of numerical and categorical values. Please read the earlier parts of this series before proceeding if you're unfamiliar with those terms. We'll be using the same dataset used throughout the model training guides.
Importance of dates
When collecting data to feed into machine learning models, it's common to record when a user signed up. The model can use this information to find hidden correlations between users. Maybe there was a sign-up bonus or an event for users who created an account around a certain date. The data would then reflect whether that promotion succeeded or failed, which is worth considering when reviewing the model.
Modern day standards
Dates are critical, especially when collaborating across different locations or countries. Because dates can be written in so many ways and across multiple time zones, an international standard, ISO 8601 (last updated in 2019), defines a common way to write them. It represents dates and times with numerical values in a fixed order, from the largest unit to the smallest (year, then month, then day), which is the basis of what's known as the datetime format.
The datetime data types
Our dates are formatted like 2021-11-30, following a year, month, day order. But it's hard to say for sure what data type that is. A computer reads it as an object or string at first, while a human looking at it sees a date made up of numbers. So what is the actual data type?
strftime format
In Pandas, the to_datetime function converts a column to the datetime data type. It usually needs a format string that specifies how to parse the input: year, month, day, day of week, month name, hour, minute, second, and even 12-hour time or time zones. Pandas uses the strftime format codes, which come from the C standard library and are familiar from UNIX.
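As a rough illustration of how a format string drives to_datetime, here is a minimal sketch. The sample strings and the variable names raw and parsed are made up for this example and are not part of the guide's dataset.

```python
import pandas as pd

# Hypothetical sample values, written as day/month/year plus a 24-hour time.
raw = pd.Series(["30/11/2021 14:05", "01/12/2021 09:30"])

# %d = day, %m = month, %Y = four-digit year, %H = hour (24h), %M = minute
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M")

print(raw.dtype)     # the original strings
print(parsed.dtype)  # a datetime dtype after parsing
print(parsed)
```

Each code in the format string has to line up with the separators and the order of the input, otherwise Pandas raises an error instead of guessing.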
Datetime abbreviations and outputs cheat sheet
Converting dates
In our current dataset we have one date column, Dt_Customer, logged when a user first signs up for an account. Upon inspection, it's stored as a string (object) data type.
String to datetime
Looking at the output, we see 21-08-2021, which shows that it is in day, month, year format (21 can't be a month). Comparing with the cheat sheet, the format string that matches it is %d-%m-%Y.
The output follows the standard YYYY-MM-DD format.
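The conversion described here might look like the sketch below. The two-row DataFrame is a hypothetical stand-in for the guide's dataset; only the column name Dt_Customer and the format string come from the text.

```python
import pandas as pd

# Hypothetical stand-in for the real dataset, with dates as day-month-year strings.
df = pd.DataFrame({"Dt_Customer": ["21-08-2021", "30-11-2021"]})

# Parse the strings using the day-month-year format identified above.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"], format="%d-%m-%Y")

# Pandas displays datetimes in year-month-day order.
print(df["Dt_Customer"])
```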
Datetime to Integer
But we aren't done yet. Even though the column is now in datetime format, most machine learning models still can't use it directly. To finish the conversion, we'll break the datetime down into separate numerical columns for year, month, and day.
Because the column is now a proper datetime, Pandas exposes accessors that pull out specific parts of each date. We'll be using the dt.year, dt.month, and dt.day attributes.
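A minimal sketch of splitting a datetime column into integer parts follows. The new column names Year, Month, and Day are assumptions chosen for illustration, and the sample rows are hypothetical.

```python
import pandas as pd

# Hypothetical sample: Dt_Customer already converted to datetime.
df = pd.DataFrame({"Dt_Customer": pd.to_datetime(["2021-08-21", "2021-11-30"])})

# The .dt accessor exposes each component as a column of integers.
df["Year"] = df["Dt_Customer"].dt.year
df["Month"] = df["Dt_Customer"].dt.month
df["Day"] = df["Dt_Customer"].dt.day

print(df)
```

Note that dt.year, dt.month, and dt.day are attributes, so they are written without parentheses.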
Once we are sure that the values match, let's remove the original column so the dataset contains only machine-readable values.
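One possible way to check the values before dropping the column is to rebuild the dates from the three parts and compare them with the original. This is a sketch under assumptions: the Year, Month, and Day column names and the sample rows are hypothetical, and the round-trip check is just one option for verifying the match.

```python
import pandas as pd

# Hypothetical sample with the original datetime column and its extracted parts.
df = pd.DataFrame({"Dt_Customer": pd.to_datetime(["2021-08-21", "2021-11-30"])})
df["Year"] = df["Dt_Customer"].dt.year
df["Month"] = df["Dt_Customer"].dt.month
df["Day"] = df["Dt_Customer"].dt.day

# to_datetime can assemble dates from columns named year, month, and day.
parts = df[["Year", "Month", "Day"]].rename(columns=str.lower)
rebuilt = pd.to_datetime(parts)

# Stop early if any rebuilt date differs from the original.
assert (rebuilt == df["Dt_Customer"]).all()

# Drop the original column, leaving only the numerical ones.
df = df.drop(columns=["Dt_Customer"])
print(df.dtypes)
```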
What’s next
All of our data has now been simplified to the point where a computer can understand it and use it to generate models. Throughout the series we've covered scaling data, filling in missing values, and now converting dates. For our finale, we'll take all of our finished datasets from parts 1 through 4 and combine them to begin training a classification model for remarketing, which will predict whether or not we should send another email to our customers.