Basic Pandas: How to add a column to a DataFrame

Pandas is one of my favorite Python libraries, and I use it every day. A very common action is to add a column to a DataFrame. This is a pretty basic task. I’m going to look at a few examples to better show what is happening when we add a column, and how we need to think about the index of our data when we add it.

Let’s start with a very simple DataFrame with 4 columns of random floating point values. Its index is the default, a RangeIndex the same length as the DataFrame. I’ll assume this Python code is run in either a Jupyter notebook or an IPython session with pandas installed. I used version 1.1.0 when I wrote this.

import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.rand(6,4), columns=['a', 'b', 'c', 'd'])

display(df)
          a         b         c         d
0  0.028948  0.613221  0.122755  0.754660
1  0.880772  0.581651  0.968752  0.551583
2  0.107115  0.511918  0.574167  0.871300
3  0.830062  0.622413  0.118231  0.444581
4  0.264822  0.370572  0.001680  0.394488
5  0.749247  0.412359  0.092063  0.350451

The simplest way to add a column is to assign a single value. It will be applied to every row in the DataFrame.

df['e'] = .5

display(df['e'])
0    0.5
1    0.5
2    0.5
3    0.5
4    0.5
5    0.5
Name: e, dtype: float64

Under the hood, pandas is making life easier for you: it takes your scalar value (the 0.5), repeats it into an array, and uses that to build a Series with the index (in this case a RangeIndex) of your DataFrame.

This is roughly equivalent:

df['e_prime'] = pd.Series(.5, index=pd.RangeIndex(6))
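To convince yourself that the two forms really do give the same column, here is a small self-contained sketch. It rebuilds a stand-in DataFrame so it can run on its own, and it uses df.index rather than a hard-coded pd.RangeIndex(6), which is my own substitution and not something the example above does.

```python
import pandas as pd
import numpy as np

# Stand-in frame with the default RangeIndex (the values are arbitrary)
df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# Scalar assignment: pandas repeats the value for every row
df['e'] = .5

# Building the Series explicitly from the frame's own index
df['e_prime'] = pd.Series(.5, index=df.index)

# Both columns hold the same value at every label
same = bool((df['e'] == df['e_prime']).all())
print(same)  # True
```

Using df.index keeps the explicit version working even if the DataFrame does not have the default RangeIndex.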

You can also pass in an array yourself without an index, but its length must match the number of rows in your DataFrame.

df['f'] = np.random.rand(6,1)

If you try to do this with a non-matching shape, it won’t work. An array has no index, so the DataFrame has no way to know which rows the values belong to. You can try it and see the exception that pandas raises.
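Here is a minimal sketch of that failure, again on a stand-in DataFrame. The column name 'bad' is just a placeholder I picked. I catch ValueError, which is what recent pandas versions raise here; the exact message text varies between versions, so I only print it.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

error_message = None
try:
    # 5 values for a 6-row frame, and no index to align on
    df['bad'] = np.random.rand(5)
except ValueError as exc:
    error_message = str(exc)

print(error_message)

# The failed assignment leaves the frame without the new column
print('bad' in df.columns)
```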

Now what happens when the data you want to add doesn’t match your current DataFrame, but it does have an index? Specifically, what if the index is different on the right-hand side?

df['g'] = pd.Series(np.random.rand(50), index=pd.RangeIndex(2,52))

display(df[['e', 'e_prime', 'f', 'g']])
     e  e_prime         f         g
0  0.5      0.5  0.777879       NaN
1  0.5      0.5  0.621390       NaN
2  0.5      0.5  0.294869  0.283777
3  0.5      0.5  0.024411  0.695215
4  0.5      0.5  0.173954  0.585524
5  0.5      0.5  0.276633  0.751469

So what happened here? Our column g only has values at rows 2 through 5, even though we assigned a Series with 50 values. Those were the only rows whose labels matched our index. The other 46 values were dropped, and rows 0 and 1, which had no matching label, got a NaN. You can try doing this where none of the data matches on the index and see what happens. You’ll end up with a full column of NaNs. Another way to think of this is that we could use the loc indexer to select the rows we want to update, but unless the right-hand side carries its own index, its length still has to match the number of rows selected. Note that loc slices by label and includes both endpoints, so 2:5 selects four rows.

df.loc[2:5, 'g_prime'] = np.random.rand(4)
display(df['g_prime'])
0         NaN
1         NaN
2    0.130246
3    0.419122
4    0.312587
5    0.101704
Name: g_prime, dtype: float64
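Going back to column g for a moment, one way to picture the alignment is as a reindex of the right-hand side against the DataFrame’s index. The sketch below checks that this gives the same result as plain assignment, and also tries the no-overlap case mentioned earlier. The column name 'h' and the labels 100 through 105 are made up for illustration.

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# Same shape of example as column g: labels 2..51, only 2..5 overlap
s = pd.Series(np.random.rand(50), index=pd.RangeIndex(2, 52))
df['g'] = s

# Reindexing by hand gives the same values as the assignment did
aligned = s.reindex(df.index)
matches = df['g'].equals(aligned)

# No overlapping labels at all: every row becomes NaN
df['h'] = pd.Series(np.random.rand(6), index=pd.RangeIndex(100, 106))
all_missing = bool(df['h'].isna().all())

print(matches, all_missing)  # True True
```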

The main lesson here is that assigning a column to a DataFrame can lead to some surprising results if you don’t know whether what you are assigning has an index, and whether that index matches.
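If you are not sure which situation you are in, a quick check before assigning can save some head-scratching. This is only a sketch under my own assumptions: the Series named incoming and the column names by_label and by_position are hypothetical. It counts the shared labels with Index.intersection, then contrasts label-aligned assignment with positional assignment via to_numpy().

```python
import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.rand(6, 4), columns=['a', 'b', 'c', 'd'])

# Hypothetical incoming data: the right length, but labelled 10..15
incoming = pd.Series(np.arange(6, dtype=float), index=pd.RangeIndex(10, 16))

# Count how many labels the two indexes share before assigning
shared = df.index.intersection(incoming.index)
print(len(shared))  # 0

# Aligned assignment: no shared labels, so the column is all NaN
df['by_label'] = incoming

# Positional assignment: dropping the index keeps the values in order
df['by_position'] = incoming.to_numpy()

print(df[['by_label', 'by_position']])
```

Which of the two you want depends on whether the labels or the row order carry the meaning in your data.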
