Nothing Special   »   [go: up one dir, main page]

How to Reset a pandas DataFrame Index

How to Reset a pandas DataFrame Index

by Ian Eyre Nov 06, 2024 intermediate best-practices data-science python

In this tutorial, you’ll learn how to reset a pandas DataFrame index, the reasons why you might want to do this, and the problems that could occur if you don’t.

Before you start your learning journey, you should familiarize yourself with how to create a pandas DataFrame. Knowing the difference between a DataFrame and a pandas Series will also prove useful to you.

In addition, you may want to use the data analysis tool Jupyter Notebook as you work through the examples in this tutorial. Alternatively, JupyterLab will give you an enhanced notebook experience, but feel free to use any Python environment you wish.

As a starting point, you’ll need some data. To begin with, you’ll use the band_members.csv file included in the downloadable materials that you can access by clicking the link below:

The table below describes the data from band_members.csv that you’ll begin with:

Column Name PyArrow Data Type Description
first_name string First name of member
last_name string Last name of member
instrument string Main instrument played
date_of_birth string Member’s date of birth

As you’ll see, the data has details of the members of the rock band The Beach Boys. Each row contains information about its various members both past and present.

Throughout this tutorial, you’ll be using the pandas library to allow you to work with DataFrames, as well as the newer PyArrow library. The PyArrow library provides pandas with its own optimized data types, which are faster and less memory-intensive than the traditional NumPy types that pandas uses by default.

If you’re working at the command line, you can install both pandas and pyarrow using the single command python -m pip install pandas pyarrow. If you’re working in a Jupyter Notebook, you should use !python -m pip install pandas pyarrow. Regardless, you should do this within a virtual environment to avoid clashes with the libraries you use in your global environment.

Once you have the libraries in place, it’s time to read your data into a DataFrame:

Python
>>> import pandas as pd

>>> beach_boys = pd.read_csv(
...     "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")

First, you used import pandas to make the library available within your code. To construct the DataFrame and read it into the beach_boys variable, you used pandas’ read_csv() function, passing band_members.csv as the file to read. Finally, by passing dtype_backend="pyarrow" to .convert_dtypes() you convert all columns to pyarrow types.

If you want to verify that pyarrow data types are indeed being used, then beach_boys.dtypes will satisfy your curiosity:

Python
>>> beach_boys.dtypes
first_name            string[pyarrow]
last_name             string[pyarrow]
instrument            string[pyarrow]
date_of_birth         string[pyarrow]
dtype: object

As you can see, each data type contains [pyarrow] in its name.

If you wanted to analyze the date information thoroughly, then you would parse the date_of_birth column to make sure dates are read as a suitable pyarrow date type. This would allow you to analyze by specific days, months or years, and so on, as commonly found in pivot tables.

The date_of_birth column is not analyzed in this tutorial, so the string data type it’s being read as will do. Later on, you’ll get the chance to hone your skills with some exercises. The solutions include the date parsing code if you want to see how it’s done.

Now that the file has been loaded into a DataFrame, you’ll probably want to take a look at it:

Python
>>> beach_boys
  first_name last_name instrument date_of_birth
0      Brian    Wilson       Bass   20-Jun-1942
1       Mike      Love  Saxophone   15-Mar-1941
2         Al   Jardine     Guitar   03-Sep-1942
3      Bruce  Johnston       Bass   27-Jun-1942
4       Carl    Wilson     Guitar   21-Dec-1946
5     Dennis    Wilson      Drums   04-Dec-1944
6      David     Marks     Guitar   22-Aug-1948
7      Ricky    Fataar      Drums   05-Sep-1952
8    Blondie   Chaplin     Guitar   07-Jul-1951

DataFrames are two-dimensional data structures similar to spreadsheets or database tables. A pandas DataFrame can be considered a set of columns, with each column being a pandas Series. Each column also has a heading, which is the name property of the Series, and each row has a label, which is referred to as an element of its associated index object.

The DataFrame’s index is shown to the left of the DataFrame. It’s not part of the original band_members.csv source file, but is added as part of the DataFrame creation process. It’s this index object you’re learning to reset.

The index of a DataFrame is an additional column of labels that helps you identify rows. When used in combination with column headings, it allows you to access specific data within your DataFrame. The default index labels are a sequence of integers, but you can use strings to make them more meaningful. You can actually use any hashable type for your index, but integers, strings, and timestamps are the most common.

With these preliminaries out of the way, it’s time to take your Little Deuce Coupe to the beach, do a pop-up, and surf through some learning. You’ll now investigate the main ways of re-indexing a DataFrame. You could apply an Index object directly to the DataFrame’s .index property, or use the DataFrame’s .set_axis() method. To begin with, you’ll use the DataFrame’s .reset_index() method.

Take the Quiz: Test your knowledge with our interactive “How to Reset a pandas DataFrame Index” quiz. You’ll receive a score upon completion to help you track your learning progress:


Interactive Quiz

How to Reset a pandas DataFrame Index

This quiz will challenge your knowledge of resetting indexes in pandas DataFrames. You won't find all the answers in the tutorial, so you'll need to do some extra investigating. By finding all the answers, you're sure to learn some interesting things along the way.

How to Reset a pandas DataFrame Index With .reset_index()

The DataFrame.reset_index() method, as its name suggests, is used to reset the index of a pandas DataFrame back to its default. You can also use this method to reset the MultiIndex on your DataFrame if it has one. You’ll learn more about how to reset multi-indexes later.

Suppose you’ve been performing some data cleansing on your DataFrame and have decided to sort it alphabetically by first_name. You can do this using .sort_values():

Python
>>> beach_boys.sort_values(by="first_name")
  first_name last_name instrument date_of_birth
2         Al   Jardine     Guitar   03-Sep-1942
8    Blondie   Chaplin     Guitar   07-Jul-1951
0      Brian    Wilson       Bass   20-Jun-1942
3      Bruce  Johnston       Bass   27-Jun-1942
4       Carl    Wilson     Guitar   21-Dec-1946
6      David     Marks     Guitar   22-Aug-1948
5     Dennis    Wilson      Drums   04-Dec-1944
1       Mike      Love  Saxophone   15-Mar-1941
7      Ricky    Fataar      Drums   05-Sep-1952

While you’re happy that the sort has worked perfectly, you’re not happy with the current state of its index. The index no longer bears any resemblance to the updated order of the rows.

To fix this, your first thought might be to use .reset_index() with its default parameters, but the result might not be what you want:

Python
>>> beach_boys.sort_values(by="first_name").reset_index()
   index first_name last_name instrument date_of_birth
0      2         Al   Jardine     Guitar   03-Sep-1942
1      8    Blondie   Chaplin     Guitar   07-Jul-1951
2      0      Brian    Wilson       Bass   20-Jun-1942
3      3      Bruce  Johnston       Bass   27-Jun-1942
4      4       Carl    Wilson     Guitar   21-Dec-1946
5      6      David     Marks     Guitar   22-Aug-1948
6      5     Dennis    Wilson      Drums   04-Dec-1944
7      1       Mike      Love  Saxophone   15-Mar-1941
8      7      Ricky    Fataar      Drums   05-Sep-1952

You now have a new default index, which is a sequential series of numbers starting at zero that clearly defines the row order. However, the original index has remained and has been moved to a new column that’s confusingly named index. This may be acceptable if you need to return the DataFrame to its original sort order in the future, but more often than not, its inclusion just wastes memory. You’ll usually want to get rid of it.

Fortunately, this is easy to do. By default, .reset_index() makes a copy of the original DataFrame, applies the default index to this second copy, and returns that second copy back to you. This means that the index on the original DataFrame remains unchanged, so no harm has been done to it.

You can instruct .reset_index() to drop the original index completely and replace it with a fresh default one by setting its drop parameter to True. You can also re-assign the original reference to the new DataFrame. The updated version will then be the only one that exists from this point forward:

Python
>>> beach_boys = (
...     beach_boys.sort_values(by="first_name")
...     .reset_index(drop=True)
... )
>>> beach_boys
  first_name last_name instrument date_of_birth
0         Al   Jardine     Guitar   03-Sep-1942
1    Blondie   Chaplin     Guitar   07-Jul-1951
2      Brian    Wilson       Bass   20-Jun-1942
3      Bruce  Johnston       Bass   27-Jun-1942
4       Carl    Wilson     Guitar   21-Dec-1946
5      David     Marks     Guitar   22-Aug-1948
6     Dennis    Wilson      Drums   04-Dec-1944
7       Mike      Love  Saxophone   15-Mar-1941
8      Ricky    Fataar      Drums   05-Sep-1952

You’ve now tidied up your DataFrame the way you want it. You’ve also re-assigned the original reference to the new DataFrame, meaning that irritating index column has been sunk without a trace.

Now it’s time for you to have some fun fun fun and try the following challenge:

You’ve already seen how it’s possible to reset an index while retaining the old one. When you do this, pandas adds it as a new column with the title index. Take a look at the documentation for .reset_index() and see if you can figure out how to customize this new column’s name.

Now re-read the data from band_members.csv into a fresh DataFrame, and use beach_boys.index = range(1, 10) to update its index. Finally, use .reset_index() to see if you can customize your DataFrame and make it look like this:

Python
>>> beach_boys
   old_index first_name last_name instrument date_of_birth
0          1      Brian    Wilson       Bass   20-Jun-1942
1          2       Mike      Love  Saxophone   15-Mar-1941
2          3         Al   Jardine     Guitar   03-Sep-1942
3          4      Bruce  Johnston       Bass   27-Jun-1942
4          5       Carl    Wilson     Guitar   21-Dec-1946
5          6     Dennis    Wilson      Drums   04-Dec-1944
6          7      David     Marks     Guitar   22-Aug-1948
7          8      Ricky    Fataar      Drums   05-Sep-1952
8          9    Blondie   Chaplin     Guitar   07-Jul-1951

You’re aiming to make sure the existing index isn’t only retained, but renamed as well.

You’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.

Although the .reset_index() method you’ve just learned about is the most customizable way to reset a pandas DataFrame index, allowing you to handle multi-indexes, it’s certainly not the only way. It’s possible to do it the gremmie way by using first principles.

Reset an Index Directly With .index

When working with DataFrames, you can identify rows using .loc[] or .iloc[]. Each has its use case, although either can be used on any DataFrame.

A DataFrame’s index is referenced using its .index property. To replace the current index with one of your own, you assign .index an iterable containing your new index labels. This allows you lots of scope to customize your DataFrame indexes beyond the default incrementing numbers.

Before you explore .loc[] and .iloc[], you’ll add a custom index to your DataFrame that contains the initials of the Beach Boys band members. First, you’ll read the band members’ data from a CSV file and convert it to pyarrow data types:

Python
>>> beach_boys = pd.read_csv(
...     "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")

>>> initials = ["BW", "ML", "AJ", "BJ", "CW", "DW", "DM", "RF", "BC"]
>>> beach_boys.index = initials
>>> beach_boys
   first_name last_name instrument date_of_birth
BW      Brian    Wilson       Bass   20-Jun-1942
ML       Mike      Love  Saxophone   15-Mar-1941
AJ         Al   Jardine     Guitar   03-Sep-1942
BJ      Bruce  Johnston       Bass   27-Jun-1942
CW       Carl    Wilson     Guitar   21-Dec-1946
DW     Dennis    Wilson      Drums   04-Dec-1944
DM      David     Marks     Guitar   22-Aug-1948
RF      Ricky    Fataar      Drums   05-Sep-1952
BC    Blondie   Chaplin     Guitar   07-Jul-1951

You’ll take a look at an alternative way of resetting an index in a short while, but first, you’ll dive into row selection.

Select Rows Using .loc[] and .iloc[]

The .loc[] attribute allows you to index rows by their index label. This is particularly useful when the underlying index labels have intrinsic meaning within the DataFrame, such as usernames or timestamps.

If you slice with .loc[], it uses a closed interval, meaning that both the start and end index values you specify will appear in the output. So by selecting a slice of rows by username, your output will show the first one, the last one, and everything in between.

The .iloc[] attribute allows you to find rows based on their index position regardless of the index label’s value. When you use .iloc[], you use zero-based indexing. In other words, the first row is at location 0, not 1 as you might expect. .iloc[] uses an open interval, which means that while the start index value you specify will appear in the output, the end one won’t.

While these concepts will be very familiar to you if you’ve ever sliced a string or list, it’s important to note that the final item .iloc[] appears to select won’t actually be selected.

Of course, it doesn’t matter whether you use .iloc[] or .loc[] if you’re dealing with default indexes because these also use zero-based sequences, so their labels and positions match. However, if your index contains strings or a non-default numeric sequence, the parameters passed to .iloc[] will not resemble the actual index value of the row you’re accessing, making your code more difficult to read.

Now that your DataFrame has been updated with a fresh index, you can use it to select some rows using both .loc[] and .iloc[]:

Python
 1>>> beach_boys = pd.read_csv(
 2...     "band_members.csv"
 3... ).convert_dtypes(dtype_backend="pyarrow")
 4
 5>>> initials = ["BW", "ML", "AJ", "BJ", "CW", "DW", "DM", "RF", "BC"]
 6>>> beach_boys.index = initials
 7
 8>>> beach_boys.loc[["BW"]]
 9   first_name last_name instrument date_of_birth
10BW      Brian    Wilson       Bass   20-Jun-1942
11
12>>> beach_boys.iloc[[1]]
13   first_name last_name instrument date_of_birth
14ML       Mike      Love  Saxophone   15-Mar-1941
15
16>>> beach_boys.loc["BW":"BJ"]
17   first_name last_name instrument date_of_birth
18BW      Brian    Wilson       Bass   20-Jun-1942
19ML       Mike      Love  Saxophone   15-Mar-1941
20AJ         Al   Jardine     Guitar   03-Sep-1942
21BJ      Bruce  Johnston       Bass   27-Jun-1942
22
23>>> beach_boys.iloc[1:4]
24   first_name last_name instrument date_of_birth
25ML       Mike      Love  Saxophone   15-Mar-1941
26AJ         Al   Jardine     Guitar   03-Sep-1942
27BJ      Bruce  Johnston       Bass   27-Jun-1942

In line 8, you accessed the first row whose index value is BW by passing this as a string into .loc[]. By passing it as a list to .loc[["BW"]], your output becomes a DataFrame. You could have passed it in directly, but this would have produced a pandas Series which behaves differently. The main point to note is that by passing an index value into .loc[], you returned the row whose index label is that value.

Compare this to the output after line 12. The code accesses row number 1, this time with .iloc[[1]]. Notice how the output is different. Using .iloc[[1]] means that you’ve selected the second row of the DataFrame. Remember, .iloc[] treats the DataFrame’s first row as 0, regardless of the actual index label.

In line 16, you selected the first four rows by passing the labels of the first and fourth rows into .loc[].

Finally, when you passed the slice 1:4 into .iloc[] in line 23, you selected the rows starting at index position one—in other words, the second row but ending at index position three. Again, this is due to the zero-based numbering effect of rows and because the last slice parameter of position 4 is excluded. Remember .iloc[] assumes an open interval.

Now, it’s time for you to try another challenge. If you get it wrong, you can always Do It Again:

Once again, re-read the data from band_members.csv into a fresh DataFrame. Now, see if you can create the DataFrame shown below by assigning an appropriate list to beach_boys.index, then select the bottom two rows using both .loc[] and .iloc[]:

Python
>>> beach_boys
   first_name last_name instrument date_of_birth
2       Brian    Wilson       Bass   20-Jun-1942
4        Mike      Love  Saxophone   15-Mar-1941
6          Al   Jardine     Guitar   03-Sep-1942
8       Bruce  Johnston       Bass   27-Jun-1942
10       Carl    Wilson     Guitar   21-Dec-1946
12     Dennis    Wilson      Drums   04-Dec-1944
14      David     Marks     Guitar   22-Aug-1948
16      Ricky    Fataar      Drums   05-Sep-1952
18    Blondie   Chaplin     Guitar   07-Jul-1951

This time, your index contains even numbers starting at 2.

You’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.

So far, you’ve used .reset_index() and .index to reset a pandas DataFrame index. Now it’s time to take a carve and look at yet another alternative.

Reset an Index Directly With .set_axis()

A third way to reset an index is to use the DataFrame’s .set_axis() method. This method allows you to assign a new RangeIndex object to your DataFrame, and also allows you to change the column labels.

To alter the existing index of a DataFrame, you can pass .set_axis() a range object using the built-in range() constructor. This will assign an interval of ascending integers to the index that start at zero:

Python
>>> beach_boys = pd.read_csv(
...     "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")

>>> beach_boys.set_axis(range(len(beach_boys)))
  first_name last_name instrument date_of_birth
0      Brian    Wilson       Bass   20-Jun-1942
1       Mike      Love  Saxophone   15-Mar-1941
2         Al   Jardine     Guitar   03-Sep-1942
3      Bruce  Johnston       Bass   27-Jun-1942
4       Carl    Wilson     Guitar   21-Dec-1946
5     Dennis    Wilson      Drums   04-Dec-1944
6      David     Marks     Guitar   22-Aug-1948
7      Ricky    Fataar      Drums   05-Sep-1952
8    Blondie   Chaplin     Guitar   07-Jul-1951

Here, you used .set_axis() to reset the index back to its default value. To do this, you passed .set_axis() a range whose length was equal to that of the original beach_boys DataFrame. Using len() ensures there are the correct amount of numbers for each row in the DataFrame. The numbers generated will be zero-based and be just sufficient to cover each row. As you can see, the index has now been reset to its default value.

Once again it’s time to check your understanding. Feel free to ask a friend for help, but only if they’re called Rhonda:

Re-read the data from band_members.csv Now, use .set_axis() to see if you can make it look like this:

Python
>>> beach_boys
   first_name last_name instrument date_of_birth
0       Brian    Wilson       Bass   20-Jun-1942
1        Mike      Love  Saxophone   15-Mar-1941
4          Al   Jardine     Guitar   03-Sep-1942
9       Bruce  Johnston       Bass   27-Jun-1942
16       Carl    Wilson     Guitar   21-Dec-1946
25     Dennis    Wilson      Drums   04-Dec-1944
36      David     Marks     Guitar   22-Aug-1948
49      Ricky    Fataar      Drums   05-Sep-1952
64    Blondie   Chaplin     Guitar   07-Jul-1951

This time you want each value of the default index to be squared.

As usual, you’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.

Now that you know how to reset a DataFrame’s index, you’ll move on and learn why you’d want to do this. It’s not quite time to hang loose just yet.

Restore a Sequential Index

When you cleanse data prior to analysis, you often need to remove certain rows from a DataFrame. For example, you may need to remove duplicate or other unwanted rows. When you remove these, the index values of the rows are removed as well. This could lead to gaps appearing in the index. By resetting the index, you can fix this.

Later, in the section about aligning indexes of several DataFrames, you’ll learn how missing values in a DataFrame’s index can wreak havoc when you need to analyze its data. For now, you’ll mess up the index of your lovely beach_boys DataFrame:

Python
 1>>> beach_boys = pd.read_csv(
 2...     "band_members.csv"
 3... ).convert_dtypes(dtype_backend="pyarrow")
 4
 5>>> beach_boys
 6  first_name last_name instrument date_of_birth
 70      Brian    Wilson       Bass   20-Jun-1942
 81       Mike      Love  Saxophone   15-Mar-1941
 92         Al   Jardine     Guitar   03-Sep-1942
103      Bruce  Johnston       Bass   27-Jun-1942
114       Carl    Wilson     Guitar   21-Dec-1946
125     Dennis    Wilson      Drums   04-Dec-1944
136      David     Marks     Guitar   22-Aug-1948
147      Ricky    Fataar      Drums   05-Sep-1952
158    Blondie   Chaplin     Guitar   07-Jul-1951
16
17>>> beach_boys.drop(labels=[3, 5])
18  first_name last_name instrument date_of_birth
190      Brian    Wilson       Bass   20-Jun-1942
201       Mike      Love  Saxophone   15-Mar-1941
212         Al   Jardine     Guitar   03-Sep-1942
224       Carl    Wilson     Guitar   21-Dec-1946
236      David     Marks     Guitar   22-Aug-1948
247      Ricky    Fataar      Drums   05-Sep-1952
258    Blondie   Chaplin     Guitar   07-Jul-1951

As you can see, the DataFrame contains the original nine rows of band members with a default index. To remove some rows, you used the DataFrame’s .drop() method. The rows to be dropped were defined in the Python list provided to its labels parameter. In this case, the rows whose index label is 3 or 5 were both removed. Look carefully at the index in the output produced below line 17 and you’ll see that the index is no longer sequential.

To fix this, you can use any of the techniques you’ve learned so far. Here, you’ll use .reset_index() which, because it doesn’t change the underlying DataFrame, means you can reuse the original cleanly-indexed DataFrame in later examples:

Python
>>> (
...     beach_boys
...     .drop(labels=[3, 5])
...     .reset_index()
... )
   index first_name last_name instrument date_of_birth
0      0      Brian    Wilson       Bass   20-Jun-1942
1      1       Mike      Love  Saxophone   15-Mar-1941
2      2         Al   Jardine     Guitar   03-Sep-1942
3      4       Carl    Wilson     Guitar   21-Dec-1946
4      6      David     Marks     Guitar   22-Aug-1948
5      7      Ricky    Fataar      Drums   05-Sep-1952
6      8    Blondie   Chaplin     Guitar   07-Jul-1951

As you saw earlier, .reset_index() has added a new default index to your DataFrame. However, you’re not happy just yet because your original index still remains, albeit in a new column. Once again, you need to adjust the drop parameter of .reset_index() to complete the job:

Python
>>> (
...     beach_boys
...     .drop(labels=[3, 5])
...     .reset_index(drop=True)
... )
  first_name last_name instrument date_of_birth
0      Brian    Wilson       Bass   20-Jun-1942
1       Mike      Love  Saxophone   15-Mar-1941
2         Al   Jardine     Guitar   03-Sep-1942
3       Carl    Wilson     Guitar   21-Dec-1946
4      David     Marks     Guitar   22-Aug-1948
5      Ricky    Fataar      Drums   05-Sep-1952
6    Blondie   Chaplin     Guitar   07-Jul-1951

As you can see, by setting drop=True, the original index is now nowhere to be seen, but your shiny new one looks, well, new and shiny.

Once again, it’s time to check your understanding. Oh, and if you can’t figure it out, Don’t Worry Baby:

To start, re-read the data from band_members.csv, then use beach_boys.drop(labels=[3, 5]) to remove some of the rows. You’ve just learned how to reset the index to its default value using .reset_index(drop=True). See if you can do this again using either of the other two techniques you’ve learned. Your answer should look like this:

Python
>>> beach_boys
  first_name last_name instrument date_of_birth
0      Brian    Wilson       Bass   20-Jun-1942
1       Mike      Love  Saxophone   15-Mar-1941
2         Al   Jardine     Guitar   03-Sep-1942
3       Carl    Wilson     Guitar   21-Dec-1946
4      David     Marks     Guitar   22-Aug-1948
5      Ricky    Fataar      Drums   05-Sep-1952
6    Blondie   Chaplin     Guitar   07-Jul-1951

As you can see, the default index has been restored, and the original is nowhere to be found.

As with the other exercises, you’ll find a solution in the Solutions.ipynb notebook provided in the downloadable materials.

All your efforts so far may mean you’re now seeing a twin fin. Next, you’ll get back on your board and get rid of those duplicates.

Remove Duplicate Index Values

You may be surprised to learn that indexes can sometimes have duplicate values. They don’t necessarily need to be unique identifiers. However, duplicates are usually something you want to avoid because they can cause issues. Fortunately, .reset_index() can handle this for you.

Duplicate index values often arise when two DataFrames are merged. The duplication can cause problems with incorrect row selection, slicing, and filtering. Before you can see these problems, you first need to apply some duplicate values to your index:

Python
 1>>> beach_boys = pd.read_csv(
 2...     "band_members.csv"
 3... ).convert_dtypes(dtype_backend="pyarrow")
 4
 5>>> guitar_players = beach_boys.query(
 6...     "instrument == 'Guitar'"
 7... ).reset_index(drop=True)
 8
 9>>> guitar_players
10  first_name last_name instrument date_of_birth
110         Al   Jardine     Guitar   03-Sep-1942
121       Carl    Wilson     Guitar   21-Dec-1946
132      David     Marks     Guitar   22-Aug-1948
143    Blondie   Chaplin     Guitar   07-Jul-1951
15
16>>> others = beach_boys.query(
17...     "instrument != 'Guitar'"
18... ).reset_index(drop=True)
19
20>>> others
21  first_name last_name instrument date_of_birth
220      Brian    Wilson       Bass   20-Jun-1942
231       Mike      Love  Saxophone   15-Mar-1941
242      Bruce  Johnston       Bass   27-Jun-1942
253     Dennis    Wilson      Drums   04-Dec-1944
264      Ricky    Fataar      Drums   05-Sep-1952

Here, you’ve split the original DataFrame into two new ones. The guitar_players DataFrame, shown below line 9, contains the records of those group members who play guitar. The others DataFrame, shown below line 20, contains the rest of the members.

To select the guitar players, you passed the "instrument == 'Guitar'" query string into .query() in lines 5 through 7, which extracts all rows where the instrument column values match "Guitar".

Lines 16 through 18 use similar code that creates a second DataFrame containing the other rows. In this case, musicians who aren’t marked as guitar players.

In both cases, .reset_index() was used to make sure the index in both new DataFrames was sequential. This ensured that some identical index values appear across both DataFrames. When you merge both of these together, you might think you’ll get back to your original DataFrame, but you won’t:

Python
>>> all_beach_boys = pd.concat([guitar_players, others])

>>> all_beach_boys
  first_name last_name instrument date_of_birth
0         Al   Jardine     Guitar   03-Sep-1942
1       Carl    Wilson     Guitar   21-Dec-1946
2      David     Marks     Guitar   22-Aug-1948
3    Blondie   Chaplin     Guitar   07-Jul-1951
0      Brian    Wilson       Bass   20-Jun-1942
1       Mike      Love  Saxophone   15-Mar-1941
2      Bruce  Johnston       Bass   27-Jun-1942
3     Dennis    Wilson      Drums   04-Dec-1944
4      Ricky    Fataar      Drums   05-Sep-1952

You’ve created a single new DataFrame using concat(). By passing it both guitar_players and others, your new all_beach_boys DataFrame shows the original nine band members once more, but the index contains duplicates. Now that you’ve a DataFrame with duplicate indexes, you’ll investigate the problems this can cause.

Suppose you want to select the fourth row—the row whose index position is 3. You can’t use .loc[] to do this because the duplicate index causes issues. Run the code below and you’ll see the problem:

Python
>>> all_beach_boys.loc[3]
  first_name last_name instrument date_of_birth
3    Blondie   Chaplin     Guitar   07-Jul-1951
3     Dennis    Wilson      Drums   04-Dec-1944

>>> all_beach_boys.iloc[[3]]
  first_name last_name instrument date_of_birth
3    Blondie   Chaplin     Guitar   07-Jul-1951

Take a look at the highlighted lines. As you can see, because .loc[] selected rows whose index label is 3, two records were returned. To fix this, you’d need to use .iloc[] to select the required single row.

The duplicate index values also cause havoc when you try to select contiguous rows using slicing. Suppose you wanted to see the rows at index positions three and four. Your first attempt might be to try .loc[]:

Python
>>> all_beach_boys.loc[3:4]
Traceback (most recent call last):
  ...
KeyError: 'Cannot get left slice bound for non-unique label: 3'

>>> all_beach_boys.iloc[3:5]
  first_name last_name instrument date_of_birth
3    Blondie   Chaplin     Guitar   07-Jul-1951
0      Brian    Wilson       Bass   20-Jun-1942

As you can see, when you try to pass the required index positions into .loc[], it throws a KeyError exception because you have a non-unique label. To fix this, you’d need to resort to using .iloc[] instead.

Suppose you now want to see those elements with index labels 1 and 3. This time, you use the DataFrame’s .filter() method:

Python
>>> all_beach_boys.filter(items=[1, 3], axis="index")
Traceback (most recent call last):
  ...
ValueError: cannot reindex on an axis with duplicate labels

You’ve tried to filter the DataFrame with index labels 1 and 3 by passing these in as a list to the items parameter of .filter(). You’ve also set the axis parameter to "index" to apply the filter to the index. Despite your code being technically correct, the result is a ValueError exception because of the duplicate labels.

At the moment, you’re probably feeling a bit rag dolled with all of these setbacks. It’s time to get back on your board again and solve the problems:

See if you can fix this code so that both .loc[] and .iloc[] produce the same results, and so .filter() works as expected when applied to all_beach_boys. Also, make sure the problematic old index has been removed:

Python
>>> all_beach_boys.loc[[3]]
  first_name last_name instrument date_of_birth
3    Blondie   Chaplin     Guitar   07-Jul-1951

>>> all_beach_boys.iloc[[3]]
  first_name last_name instrument date_of_birth
3    Blondie   Chaplin     Guitar   07-Jul-1951

>>> all_beach_boys.filter(items=[1, 3], axis="index")
  first_name last_name instrument date_of_birth
1       Carl    Wilson     Guitar   21-Dec-1946
3    Blondie   Chaplin     Guitar   07-Jul-1951

Well done. Everything works!

Now suppose you want to promote an existing column to an index. Is this possible? It’s time to drop in, catch a wave, and see.

Use an Existing Column as an Index

While the default sequential numeric index provides a unique accessor to the rows within your DataFrame, it’s unlikely that it’ll have any inherent meaning. For example, the numbers assigned to each row of the DataFrames you’ve used up to this point have no meaning relative to the data they’re indexing.

In the DataFrame read in from band_members.csv, band member Brian Wilson has an index value of 0 simply because he appears first in the file. This has no inherent meaning, even though it may offend you if you’re a fan and think he should be number 1.

While the concept of using an unrelated sequential index will be familiar to you if you’ve worked with keys in a relational database, you may want something more meaningful in your DataFrames.

If you want more user-friendly index labels, you could use an existing column and promote it to the index. To do this, you use the .set_index() method. Although you can promote any existing column to become the index, keep in mind that unless your intended column contains unique values, you’ll still run into the same problems with duplicates that you saw earlier.

Suppose you want to reset your index to contain the first_name labels. You could do this as shown here:

Python
>>> beach_boys = pd.read_csv(
...     "band_members.csv"
... ).convert_dtypes(dtype_backend="pyarrow")

>>> (
...     beach_boys
...     .set_index("first_name")
...     .loc[["Brian", "Carl"]]
... )
           last_name instrument date_of_birth
first_name
Brian         Wilson       Bass   20-Jun-1942
Carl          Wilson     Guitar   21-Dec-1946

When you call .set_index() on beach_boys and pass in "first_name", this column gets promoted to the index. You can then use .loc[] to select one or more rows using the first names of the musicians you’re interested in.

In some cases, you may wish to reset your index so that its existing values become more meaningful to the data they’re indexing. While you can’t use .reset_index() for this, you could apply something more suitable to the DataFrame’s .index attribute.

A common example is using employee identifiers instead of plain sequential integers. The code below updates the existing sequential index with more contextual values:

Python
>>> beach_boys.index = [f"Employee_{x + 1}" for x in range(len(beach_boys))]

>>> beach_boys
           first_name last_name instrument date_of_birth
Employee_1      Brian    Wilson       Bass   20-Jun-1942
Employee_2       Mike      Love  Saxophone   15-Mar-1941
Employee_3         Al   Jardine     Guitar   03-Sep-1942
Employee_4      Bruce  Johnston       Bass   27-Jun-1942
Employee_5       Carl    Wilson     Guitar   21-Dec-1946
Employee_6     Dennis    Wilson      Drums   04-Dec-1944
Employee_7      David     Marks     Guitar   22-Aug-1948
Employee_8      Ricky    Fataar      Drums   05-Sep-1952
Employee_9    Blondie   Chaplin     Guitar   07-Jul-1951

In this code, you used a list comprehension to create a list of strings containing ["Employee_1", "Employee_2", ...]. This then gets assigned to the .index attribute of beach_boys.

Now that you have a more meaningful index, you can use it in the same way you used the default numeric index. For example, you can select rows by their new Employee_ values:

Python
>>> beach_boys.loc[["Employee_4"]]
           first_name last_name instrument date_of_birth
Employee_4      Bruce  Johnston       Bass   27-Jun-1942

In the code above, you used .loc[] to select the record of Employee_4.

Now for a bit of a mind-bender. Wouldn’t it be nice if you could do this:

Using the original beach_boys DataFrame read in from band_members.csv, create an index consisting of the usernames of staff in the format <last_name><initial>. Your final result should look like this:

Python
>>> beach_boys
          first_name last_name instrument date_of_birth
WilsonB        Brian    Wilson       Bass   20-Jun-1942
LoveM           Mike      Love  Saxophone   15-Mar-1941
JardineA          Al   Jardine     Guitar   03-Sep-1942
JohnstonB      Bruce  Johnston       Bass   27-Jun-1942
WilsonC         Carl    Wilson     Guitar   21-Dec-1946
WilsonD       Dennis    Wilson      Drums   04-Dec-1944
MarksD         David     Marks     Guitar   22-Aug-1948
FataarR        Ricky    Fataar      Drums   05-Sep-1952
ChaplinB     Blondie   Chaplin     Guitar   07-Jul-1951

As you can see, the index now contains a common username format, which will make it straightforward to select users by their usernames.

Next, you’ll gain some experience tandem surfing. It’s time to use not one, but two DataFrames together.

Align Indexes of Several DataFrames

One of the great features of working with pandas DataFrames is that you can use the basic arithmetic operators to add their data together. Unfortunately, you’ll only be allowed this convenience if their indexes align. Otherwise, you’ll run into problems.

Suppose you’re analyzing the weekly sales of records from a record store. Two weeks of sales data are stored inside two CSV files named week1_record_sales.csv and week2_record_sales.csv. For demonstration purposes, both files contain identical sales data but their indexes are different:

Python
>>> week1_sales = pd.read_csv(
...     "week1_record_sales.csv"
... ).set_index("index")

>>> week2_sales = pd.read_csv(
...     "week2_record_sales.csv"
... ).set_index("index")

>>> week1_sales
       day  sales
index
0      Mon    100
1      Tue    150
2      Wed    200
3      Thu    250
4      Fri    300

>>> week2_sales
       day  sales
index
1      Mon    100
2      Tue    150
3      Wed    200
4      Thu    250
5      Fri    300

Each file is read into a DataFrame and contains daily sales information. Every row is identified by its index column, which has been set as the DataFrame’s index with .set_index().

Suppose you now want to find the total sales of both weeks. This should be achievable with little more than a simple arithmetic operation:

Python
>>> week1_sales["sales"] + week2_sales["sales"]
index
0      NaN
1    250.0
2    350.0
3    450.0
4    550.0
5      NaN
Name: sales, dtype: float64

As you can see, something’s gone wrong. Because both DataFrames contain the same data, you’d expect the answers to be twice their original values. Instead, the first and last answers are NaN, meaning an arithmetic calculation couldn’t be performed due to missing values. Also, the remaining results are incorrect.

Both of these issues were caused by mismatched indexes. The NaN values have appeared because neither index 0 nor index 5 appear in both of your DataFrames. The calculations are wrong because, for example, the Wednesday sales figure of 200 is indexed as 2 in your first DataFrame, while index 2 refers to the Tuesday sales of 150 in your second DataFrame. When you add these, the result is meaningless.

You can also merge two DataFrames in the same way that relational database tables can be merged. This allows you to see matching data from both DataFrames in the same place. Again, if you join on index, each index value must refer to the related data in both DataFrames.

For example, say you wanted to see all data for both sales weeks. To do this, you can use the DataFrame’s .merge() method:

Python
>>> week1_sales.merge(week2_sales, left_index=True, right_index=True)
      day_x  sales_x day_y  sales_y
index
1       Tue      150   Mon      100
2       Wed      200   Tue      150
3       Thu      250   Wed      200
4       Fri      300   Thu      250

Right away you can see some problems. Neither the records with index 0 nor index 5 are anywhere to be seen. The daily figures don’t align either.

Here, you’ve performed an inner join of both DataFrames, meaning only records whose index values appear in both DataFrames will be merged. Because index values 0 and 5 don’t appear in both, they’re not included in the merge. The days are mismatched because the same index refers to different days in each DataFrame.

To fix both of these problems, you need to make sure both DataFrames are using the same index. One way would be to reset the index on week2_sales to its default. This will then match that used by week1_sales, but only because the daily data for both DataFrames are already in the same order:

Python
>>> week2_sales = week2_sales.reset_index(drop=True)

As before, to reset the index back to its default, you use .reset_index() and pass True to its drop parameter to remove the problematic original index. Now when you run both of the previous pieces of code, the results are far more palatable:

Python
>>> week1_sales["sales"] + week2_sales["sales"]
index
0    200
1    300
2    400
3    500
4    600
Name: sales, dtype: int64

>>> week1_sales.merge(week2_sales, left_index=True, right_index=True)
      day_x  sales_x day_y  sales_y
index
0       Mon      100   Mon      100
1       Tue      150   Tue      150
2       Wed      200   Wed      200
3       Thu      250   Thu      250
4       Fri      300   Fri      300

As you can see, everything now matches and nothing is missing. Aligning the indexes has solved both issues in one fell swoop. However, this example only worked because the rows within the DataFrames were in the correct order to begin with. This won’t always be the case.

This time you’re going to think laterally. So give your brain a Wipeout and see if you can solve this exercise:

Suppose you’re happy with the index of week2_sales, but not with that of the week1_sales DataFrame. See if you can use one of the techniques you learned about earlier to apply the index from week2_sales to that of week1_sales. Don’t forget to make sure both the addition and .merge() still produce the correct output.

So far, you’ve merged the DataFrames on numerical indexes. Can you think of a better alternative that would still work even if both original DataFrames had their rows in a different order? Again, make sure that both the addition and .merge() produce the correct result.

Remember, you can always take a look at the solution in your downloaded materials.

To round off your learning experience, you’ll finish up by learning how to reset multi-level indexes on DataFrames. Not something surfers normally do, unless they’re Pythonistas like you.

Reset Multi-Indexes

Each of the DataFrames you’ve been working with so far has consisted of single-column Index objects. DataFrames also support MultiIndex objects, which provide hierarchical, or multi-level indexes for your DataFrames.

In this section, you start to wake up from your California Dreaming of the Beach Boys and decide to have breakfast. You’ll use the cereals.csv file to help you decide which cereal to eat. This file contains data about various popular breakfast cereals from a range of manufacturers. The original data comes from Kaggle and is freely available under the Creative Commons License. Here, you’re using a cut-down version of it.

The first thing that you’ll need to do is read the cereals data into a DataFrame:

Python
>>> cereals = pd.read_csv("cereals.csv").convert_dtypes(
...     dtype_backend="pyarrow"
... )
>>> cereals.head()
                        name    manufacturer  type  fiber
0                  100% Bran         Nabisco  Cold   10.0
1          100% Natural Bran     Quaker Oats  Cold    2.0
2                   All-Bran        Kelloggs  Cold    9.0
3  All-Bran with Extra Fiber        Kelloggs  Cold   14.0
4             Almond Delight  Ralston Purina  Cold    1.0

As you can see, the file contains details of different breakfast cereals. When you call the DataFrame’s .head() method, you see the first five records that reveal the cereal’s name and manufacturer. You can also see its type, which tells you whether the cereal is to be eaten hot or cold, as well as its fiber content.

Not surprisingly, this DataFrame has a simple index. One quick way to create a MultiIndex is to create a pivot table from it:

Python
>>> cereals.pivot_table(
...     values="fiber",
...     index=["manufacturer", "type"],
...     aggfunc="mean",
... )
                                     fiber
manufacturer                type
American Home Food Products Hot        0.0
General Mills               Cold  1.272727
Kelloggs                    Cold   2.73913
Nabisco                     Cold       4.6
                            Hot        1.0
Post                        Cold  2.777778
Quaker Oats                 Cold  1.142857
                            Hot        2.7
Ralston Purina              Cold     1.875

This pivot table, which is actually another DataFrame, analyzes the raw data by working out the average fiber content for each type of cereal for each manufacturer. The index of this DataFrame is a bit different from what you’re used to seeing:

Python
>>> cereals.pivot_table(
...     values="fiber",
...     index=["manufacturer", "type"],
...     aggfunc="mean",
... ).index
MultiIndex([('American Home Food Products',  'Hot'),
            (              'General Mills', 'Cold'),
            (                   'Kelloggs', 'Cold'),
            (                    'Nabisco', 'Cold'),
            (                    'Nabisco',  'Hot'),
            (                       'Post', 'Cold'),
            (                'Quaker Oats', 'Cold'),
            (                'Quaker Oats',  'Hot'),
            (             'Ralston Purina', 'Cold')],
           names=['manufacturer', 'type'])

The MultiIndex in this pivot_table consists of both manufacturer and type columns. Instead of the simple single columns you’ve seen so far, you now have a more complex multi-level structure.

In this example, you’ve created a MultiIndex consisting of two levels. These are defined by passing "manufacturer" and "type" to the index parameter of .pivot_table().

Within the MultiIndex object, manufacturer is known as level 0, while type is level 1. These level numbers are important if you need to reset the index because they allow you to do so on a specific level, or even reset it completely.

Suppose you wanted to reset only level 1, corresponding to type, which relegates it to a separate column before removing it completely:

Python
>>> cereals = pd.read_csv("cereals.csv").convert_dtypes(dtype_backend="pyarrow")
>>> cereals.pivot_table(
...     values="fiber",
...     index=["manufacturer", "type"],
...     aggfunc="mean"
... ).reset_index(level=1, drop=True)

>>> cereals
                                fiber
manufacturer
American Home Food Products       0.0
General Mills                1.272727
Kelloggs                      2.73913
Nabisco                           4.6
Nabisco                           1.0
Post                         2.777778
Quaker Oats                  1.142857
Quaker Oats                       2.7
Ralston Purina                  1.875

By passing both level=1 and drop=True to .reset_index(), you drop the type details, retaining only the manufacturer information as a simple Index.

Take care when doing this, because you’ve now created duplicate index values with all of the problems you saw earlier. In addition, your data has now lost some of its meaning. For example, there’s confusion over what both the Nabisco and Quaker Oats labels are showing you. You no longer know which one refers to the hot cereals and which to the cold.

Be aware that resetting part of a MultiIndex can have undesirable side effects that aren’t always obvious.

Time to Catch a Wave and go on a Surfin’ Safari through your penultimate challenge:

Using the previous code for guidance, see if you can produce this exact output to provide clarification to the Nabisco and Quaker Oats confusion:

Python
>>> cereals
                     manufacturer     fiber
type
Cold                General Mills  1.272727
Cold                     Kelloggs   2.73913
Cold                      Nabisco       4.6
Cold                         Post  2.777778
Cold                  Quaker Oats  1.142857
Cold               Ralston Purina     1.875
Hot   American Home Food Products       0.0
Hot                       Nabisco       1.0
Hot                   Quaker Oats       2.7

The type information forms the index while the manufacturer information is back in a column.

Sometimes, it’s better to reset all levels of a MultiIndex but retain all data. When you apply .reset_index() to a MultiIndex using its default parameters, you’ll replace the complete index with a simple default version and create additional columns from the original index inside your DataFrame:

Python
>>> cereals.pivot_table(
...     values="fiber",
...     index=["manufacturer", "type"],
...     aggfunc="mean",
... ).reset_index()
                  manufacturer  type     fiber
0  American Home Food Products   Hot       0.0
1                General Mills  Cold  1.272727
2                     Kelloggs  Cold   2.73913
3                      Nabisco  Cold       4.6
4                      Nabisco   Hot       1.0
5                         Post  Cold  2.777778
6                  Quaker Oats  Cold  1.142857
7                  Quaker Oats   Hot       2.7
8               Ralston Purina  Cold     1.875

This time your DataFrame has a default index applied to it, but it also has new manufacturer and type columns, as well as the aggregated fiber column’s data. Most importantly, the data hasn’t lost any of its meaning.

This final challenge will test those Good Vibrations you now have about your ability to work with DataFrame indexes.

See if you can solve the undesirable side effects of your original pivot table by flattening the MultiIndex. Perhaps it’ll look something like this:

Python
>>> cereals
                                       fiber
(American Home Food Products, Hot)       0.0
(General Mills, Cold)               1.272727
(Kelloggs, Cold)                     2.73913
(Nabisco, Cold)                          4.6
(Nabisco, Hot)                           1.0
(Post, Cold)                        2.777778
(Quaker Oats, Cold)                 1.142857
(Quaker Oats, Hot)                       2.7
(Ralston Purina, Cold)                 1.875

The index now consists of one level instead of two, but each level is a tuple.

That’s it! You should be stoked that you now have a comprehensive understanding of how to reset the index on your DataFrames using several techniques. You’ve also had a bonus lesson on how to incorporate Beach Boys song titles and corny surfing references into a Python tutorial. A truly rad original from your friends at Real Python.

Conclusion

In this tutorial, you learned that while the .reset_index() method is the most customizable way to reset an index, it isn’t the fastest option. However, it is useful when you’re working with MultiIndex resets.

You also learned that directly applying the index to the DataFrame’s .index property is the quickest way to reset an index, and that it changes the original DataFrame in the process. Additionally, you discovered how the .set_axis() method allows you to reset and re-label your index should you want to.

Congratulations on completing this tutorial, and enjoy applying these newfound skills to better prepare your DataFrames for analysis. Happy surfing!

Take the Quiz: Test your knowledge with our interactive “How to Reset a pandas DataFrame Index” quiz. You’ll receive a score upon completion to help you track your learning progress:


Interactive Quiz

How to Reset a pandas DataFrame Index

This quiz will challenge your knowledge of resetting indexes in pandas DataFrames. You won't find all the answers in the tutorial, so you'll need to do some extra investigating. By finding all the answers, you're sure to learn some interesting things along the way.

🐍 Python Tricks 💌

Get a short & sweet Python Trick delivered to your inbox every couple of days. No spam ever. Unsubscribe any time. Curated by the Real Python team.

Python Tricks Dictionary Merge

About Ian Eyre

Ian is an avid Pythonista and Real Python contributor who loves to learn and teach others.

» More about Ian

Each tutorial at Real Python is created by a team of developers so that it meets our high quality standards. The team members who worked on this tutorial are:

Master Real-World Python Skills With Unlimited Access to Real Python

Locked learning resources

Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:

Level Up Your Python Skills »

Master Real-World Python Skills
With Unlimited Access to Real Python

Locked learning resources

Join us and get access to thousands of tutorials, hands-on video courses, and a community of expert Pythonistas:

Level Up Your Python Skills »

What Do You Think?

Rate this article:

What’s your #1 takeaway or favorite thing you learned? How are you going to put your newfound skills to use? Leave a comment below and let us know.

Commenting Tips: The most useful comments are those written with the goal of learning from or helping out other students. Get tips for asking good questions and get answers to common questions in our support portal.


Looking for a real-time conversation? Visit the Real Python Community Chat or join the next “Office Hours” Live Q&A Session. Happy Pythoning!