Data Cleaning - Sort values

First published on March 9, 2022

Last updated at April 19, 2022

 

6 minute read

Felicia Kuan

TLDR

In this Mage Academy lesson on data cleaning, we’ll learn how to sort out our mess– I mean– row data in one or more columns by ascending and descending order.

Glossary

  • Best time to sort?

  • Dataset

  • Sort by…

  • “Abra” - kadabra! ✨🔮

Best time to sort?

By definition, sorting seems obvious– given text or numerical data, if you display text data in alphabetical order, you’re sorting by ascending values. Conversely,  if you display the biggest to smallest numbers, you’re sorting by descending values.

Rather than asking what sorting is, the worthier question is whether your data pre-processing is optimized by sorting specific columns. Thus, what kinds of insights are you searching for in your data? In which situations, with what type of data, does sorting columns benefit your analysis? 

Fanart of Psyduck

Below are scenarios described by our co-founders where it was immensely useful to leverage sorting:

  • Viewing a user’s first and last page visit: when given user traffic data on a website (like our Mage blog) we can 

    sort by timestamp

    per user. The first page tells us which page brought them to the site. The last page would say what made them drop off, and hopefully it’s not one of my articles 😛 

  • Discern what kind of content users like to consume in the morning: As a newsfeed app, if we 

    sort by timestamp

    per user, we can gather a list of articles they ❤️ to store in a column. With this data, we can train a model that decides what to show the user in the morning and keep them engaged with the app. 

As shown in these examples, sort is rarely used on its own, and it’s most powerful when used with other operations, like to 

group by

user. 

Dataset

For this lesson, we’ll be using ThiagoAZen’s 

to visualize the functionality of sort. 

However, since only three features are relevant to us, we will drop all but the columns Name, Type1, and Attack. For additional data cleaning, we’ll also 

1
2
3
4
data = df[["Name", "Type1", "Attack"]]
pokedex= data.drop_duplicates(subset=["Name"], 
keep="first")
pokedex

That’s a lot of Pokemon to work with! Thankfully, we have some Pokemon who also know 

. Maybe he can come with us and help! 

Squirtle doing a bubble sort 

Sort by

Ascending

As the name suggests, sorting by ascending values means that values in the column at the beginning are smaller than the values at the end. For alphabetical text entries, this means “A” is the first (ASCII value: 65) and “Z” is the last value (ASCII: 90). For ASCII value look-up, you can refer to 

.

Thus, if we wanted the Pokedex in alphabetical order, we sort the “Name” column using:

1
pokedex.sort_values(by=["Name"])

Now that it’s sorted, in the “Name” column, “Abra” has a smaller value than “Absol,” so “Abra” is placed before. 

Pokemon images from Bulbapedia

For string comparisons between two strings, what determines which word is positioned first is by the ASCII values of the first non-matching character. When comparing “Abra” with “Absol,” the ASCII values of the first two characters are the same, so we are not yet able to determine which Pokemon appears first in the list.

Thus the only comparison that matters is between the third character of each word– “R” and “S.” Since the ASCII value of “R,” 114, is smaller than S’ 115, “Abra” gets to be positioned before “Absol.” 

Descending

Sorting by descending values means that the largest values appear at the beginning of the set. 

If we sort by descending “Attack”, this means the strongest Pokemon are at the top of the list, so we should be sure to catch ‘em 😉

1
pokedex.sort_values(by=["Attack"], ascending = False)

Since it’s descending, the Pokemon with the higher attack is placed first. This is the opposite behavior from the above section. Therefore, since the Pokemon Zacian has an attack of 170, which is larger than Rampardos’ 165, Zacian is sorted into a higher position. 

Images from Bulbapedia

Multiple columns

Arguably, sort is even more useful when we sort by multiple columns. If we first sort by descending value on “Type1,” all the Pokemon are sorted into groups where all the Water-type Pokemon are kept together:

Thus, when we sort again by “Attack,” the strongest Pokemon appear at the top of a group of types. 

1
pokedex.sort_values(by=["Type1", 'Attack'], ascending = False)

This is useful as we can now find the largest and smallest values, not only across the entire dataset, but within groups as well. Now, you have the knowledge to look out for and capture the strongest Pokemon of each type! 

“Abra” - kadabra!  ✨🔮

Data cleaning is the most tedious yet important aspect of data analysis. Let Mage handle all the fuss of sorting so you can re-allocate that energy to derive value from your data (or rest because you’ve been working hard).

Source: Blue Bandana Jake

There’s no code involved, but we need you to specify which column needs sorting, and which order. Go to Edit Data > Sort and enter the column and the order you’ll be sorting by. 

Since I wanted to match the example above, I sorted by the two columns, “type1” and “attack,” in descending order.

Abrakadabra! Your data is all sorted with just a couple clicks and no code! 

Want to learn more about machine learning (ML)? Visit 

! ✨🔮