Feature Engineering - Combine datasets

First published on March 7, 2022

8 minute read

Rakhee D.

TLDR

When you are working on a project, generally you collect data from several sources. Suppose you need to combine two datasets (tables, dataframes, etc.), like rows and columns of a table. To do this we will use the “join” operation of “Pandas”.

Outline

Introduction
Before we begin
Relationships
Join dataframes
Conclusion

Introduction

Often this is a challenge with every aspiring developer to combine datasets from multiple sources into rows and columns. This may sound simple to those who are familiar with Python programming but may be terrifying to those who are new to it. In this guide, we’ll see a few ” join” techniques using “Pandas'' which can do this job in a very simple manner.

Before we begin

We need to first understand how to combine information from multiple files. For this, we’ll use 3 files that I have created to demonstrate the function of “join” in Python. The “merge” function combines multiple datasets into a nice orderly fashion in a single dataframe for further analysis.

Various subjects are being taught in school that are assigned to different teachers. They update their own files to evaluate students’ overall performance for examinations. 1 of the evaluation techniques is by assigning a percentage out of 100 for each subject which is also termed as “exam points” or “marks”. Let’s look at the dataset with the following explanation on each dataframe.

- This file contains the students’ percentage of marks for the 10th standard examination with student id.

-This file contains the students’ percentage of marks for the 12th standard examination with student id.

- This file contains the data of Student ID with first_name and last_name.

Like all other guides, we’ll be using Google Collab and start by importing those datasets into dataframes.

Relationships

Previously, we covered the “merge” function in dataframe, “relative joins”, ”left” and “right” joins with “union” and “intersection'' for the email content dataset, using “Pandas”.

Join dataframes

We’ll use the “join” function in 3 steps:

Step-1

Loading Datasets in Python.

We'll be using 3 different datasets and loading these datasets into 3 different dataframes.

Let us now use the “head” function to display the first few rows of each dataframe.

Step-2

Combining 2 similar dataframes using “Append”.

The “Append” function combines the 2 similar dataframes vertically.

The resultant dataframe is “allMarks” and we have used “shape” to compare the shape of all 3 dataframes.

The final output after “append” is as below.

Step-3

Combining information from 2 dataframes using “merge”.

The “merge” function requires the necessary attributes on which 2 dataframes will be merged. You can explicitly specify the name of the key column using the “on” keyword which takes a column name or a list of column names.

Another important argument of “merge'' is “how”. This specifies the type of “join” you want to perform on the dataframes. Here are the different “join” types you can perform.

Inner Join

“Inner join” returns a dataframe for both left and right dataframes, on their common column names, into a single dataframe in the merged data. It is performed by default if you don’t provide any argument.

We’ll identify the names of students here with their percentage of marks in the 10th standard. In our case “id” is common for both “marks10th” and “IdandName” dataframe.

After” inner join'' it displays a single dataframe by combining both left and right dataframes on the common column name “id”.

Finally, the resultant dataframe “inner'' has a list of student names mapped along with their marks.

Left Join

A teacher of a particular school wants to know the performance of students in an examination. In this case, we can use the concept of ”left join''. ” ``Left join” is also called ”left outer join”.

“Left join” returns all the rows of the left dataframe and all the non-matching rows of the left dataframe contain “NaN'' for the column in the right dataframe.

We have a dataframe “allMarks'' which has combined the marks for both 10th and 12th standard students’ vertically. It has “id” in common for both “allMarks” and “IdandName” dataframes. In the “how” argument we specify “left” to perform “left outer join”.

At times you may wish to merge 2 dataframes with different column names. But the argument “on” only works for the same column in the left and right dataframes. Therefore we use the “left_on” and “right_on” to specify the 2 column names as shown below.

Right Join

Now suppose there is another task to identify the names of students with their percentage of marks in 12th Standard. So we can use the concept of “right join” with dataframe “Idandname” and “marks12th” as we have “Id” is common for both the dataframes.

“Right join” returns all the rows of the right dataframe. All the non-matching rows of the right dataframe contain “NaN” for the column in the left dataframe.

Full Join

“Full join” is also called “Full outer join” which returns the rows that either have a match in the left dataframe or right dataframe.

Another interesting task is to combine dataframes to identify the names of students with their percentages in both the 10th and 12th standards.

We'll use the “allMarks” dataframe since it has all the percentages of marks for both the 10th and 12th standards. In our case, “id” is common for both “allMarks” and “IdandName” dataframe.

For the rows which don’t match in both the dataframes, the resulting dataframe will return “NaN” for no matched values in the row. We have used “indicator” to understand where the rows originate from.

As we can see in the output the rows from 0-4 come from both the dataframes as they have the same values of “id”.

The rows from 50-57 come from the left dataframe and it returns “NaN” for the column names that lack a matching row in the right dataframe.

Conclusion

Isn’t this interesting? You can now get your tables in the form of rows and columns without any queries in a very simple and quick way.

We can perform ” join” operations with Mage by selecting a use case to build a model. Mage allows us to upload datasets or use existing datasets to clean the data by following the suggestions and enhancing the data to prepare our model for training. The columns we select are to optimize, to rank, to specify the optimization direction and select the columns with a unique identifier and we start training the model.

After the model is trained we can review the model’s performance. The model gives us the accuracy and also displays the top features which infer the predictions of the model. We can also retrain the model to improve the performance and finally deploy the model using API.

Below is the gif of the whole process for training the model to combine datasets using Mage.

Thanks for reading.

Want to learn more about machine learning (ML)? Visit

! ✨🔮

GET STARTED

COMMUNITY

LEARN

Magical powers for data teams

GET STARTED

COMMUNITY

LEARN

2024 Mage Technologies, Inc.