Pandas is great for dealing with both numerical and text data. In most projects you’ll need to clean up and verify your data before analysing it or using it for anything useful. Data might arrive from databases, CSV or other data files, web scraping results, or even manual entry. Once you have loaded data into pandas, you’ll likely need to convert it to the type that makes the most sense for what you are trying to accomplish. In this post, I’m going to review the basic datatypes in pandas and how to safely and accurately convert data.

## DataFrame and Series

First, let’s review the basic container types in pandas, `Series` and `DataFrame`. A `Series` is a one-dimensional labeled array of data, backed by a `NumPy` array. A `DataFrame` is a two-dimensional structure that consists of multiple `Series` columns that share an index. A `Series` has a data type, referenced as `dtype`, and all elements in that `Series` will share the same type.
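As a small illustration of the relationship described above, here is a minimal sketch. The frame, its column names, and its values are made up for the example.

```python
import pandas as pd

# Hypothetical data: two columns that share the same index labels.
df_example = pd.DataFrame(
    {"price": [1.5, 2.0, 3.25], "label": ["a", "b", "c"]},
    index=["x", "y", "z"],
)

price = df_example["price"]    # selecting one column returns a Series
print(type(price).__name__)    # Series
print(price.dtype)             # float64
print(df_example.dtypes)       # one dtype per column
print(price.index.equals(df_example.index))  # True: the index is shared
```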

## But what types?

The data type can be a core `NumPy` datatype, which means it could be a numerical type or a Python object. But the type can also be a pandas extension type, known as an `ExtensionDtype`. Without getting into too much detail, just know that two very common examples are the `CategoricalDtype` and, in pandas 1.0+, the `StringDtype`. For now, what’s important to remember is that all elements in a `Series` share the same type.
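For a quick look at the two extension types mentioned here, the sketch below builds one `Series` of each. The sample values are invented, and the `"string"` dtype assumes pandas 1.0 or later.

```python
import pandas as pd

colors = pd.Series(["Black", "Red", "Black"], dtype="category")
names = pd.Series(["ann", "bob", None], dtype="string")  # needs pandas 1.0+

print(colors.dtype)   # category
print(names.dtype)    # string

# Both dtypes are pandas extension types rather than NumPy dtypes.
base = pd.api.extensions.ExtensionDtype
print(isinstance(colors.dtype, base), isinstance(names.dtype, base))  # True True
```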

When constructing a `Series` or a `DataFrame`, pandas will pick the datatype that can represent all values in the `Series` (or in each column of the `DataFrame`). Let’s look at an example to make this clearer. Note that this example was run using pandas version 1.1.4.

    >>> import pandas as pd
    >>> s = pd.Series([1.0, 'N/A', 2])
    >>> s
    0      1
    1    N/A
    2      2
    dtype: object

As you can see, pandas has chosen the `object` type for my `Series` since it can represent values that are floating point numbers, strings, and integers. The individual items in this `Series` are all of a different type in this case, but can be represented as objects.

    >>> print(type(s[0]))
    <class 'float'>
    >>> print(type(s[1]))
    <class 'str'>
    >>> print(type(s[2]))
    <class 'int'>
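To see how the chosen dtype depends on the values, here is a minimal sketch with a few small inputs (the variable names are only for illustration):

```python
import pandas as pd

all_ints = pd.Series([1, 2])
int_and_float = pd.Series([1, 2.5])
int_and_none = pd.Series([1, None])
int_and_str = pd.Series([1, "N/A"])

print(all_ints.dtype)        # int64: every value is an integer
print(int_and_float.dtype)   # float64: the integer is upcast to a float
print(int_and_none.dtype)    # float64: None becomes NaN, which is a float
print(int_and_str.dtype)     # object: no numeric type can hold the string
```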

## So, what’s the problem?

The problem with using object for everything is that you rarely want to work with your data this way. Looking at this first example, if you had imported this data from a text file you’d most likely want it to be treated as numerical, and perhaps calculate some statistical values from it.

    >>> try:
    ...     s.mean()
    ... except Exception as ex:
    ...     print(ex)
    ...
    unsupported operand type(s) for +: 'float' and 'str'

It’s clear here that the `mean` method fails because it’s trying to add up the values in the `Series` and cannot add the ‘N/A’ to the running sum of values.

## So how do we fix this?

Well, we could inspect the values and convert them by hand or using some other logic, but luckily pandas gives us a few options to do this in a sensible way. Let’s go through them all.
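For comparison, converting by hand might look something like the sketch below. `to_float_or_nan` is a hypothetical helper written for this example, not a pandas function.

```python
import math
import pandas as pd

def to_float_or_nan(value):
    """Hypothetical helper: return float(value), or NaN if that fails."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return float("nan")

s = pd.Series([1.0, "N/A", 2])
by_hand = s.map(to_float_or_nan)

print(by_hand.dtype)    # float64
print(by_hand.mean())   # 1.5, because NaN is skipped by default
```

This works, but it runs a Python function on every element and you have to maintain the error handling yourself.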

## astype

First, you can try to use `astype` to convert values. `astype` is limited, however: if it cannot convert a value, it will either raise an error (the default) or, with `errors='ignore'`, return the original object unchanged. Because of this, it cannot completely help us in this situation.

    >>> try:
    ...     s.astype('float')
    ... except Exception as ex:
    ...     print(ex)
    ...
    could not convert string to float: 'N/A'

But `astype` is very useful, so before moving on, let’s look at a few examples where you would use it. First, if all of your data is convertible to the target type, it does just what you want.

    >>> s2 = pd.Series([1, "2", "3.4", 5.5])
    >>> print(s2)
    0      1
    1      2
    2    3.4
    3    5.5
    dtype: object
    >>> print(s2.astype('float'))
    0    1.0
    1    2.0
    2    3.4
    3    5.5
    dtype: float64

Second, `astype` is useful for saving space in `Series` and `DataFrame`s, especially when you have repeated values that can be expressed as categoricals. Categoricals can save memory and also make data a little more readable during analysis, since the dtype lists all the possible values. For example:

    >>> s3 = pd.Series(["Black", "Red"] * 1000)
    >>> s3.astype('category')
    0       Black
    1         Red
    2       Black
    3         Red
    4       Black
            ...
    1995      Red
    1996    Black
    1997      Red
    1998    Black
    1999      Red
    Length: 2000, dtype: category
    Categories (2, object): ['Black', 'Red']
    >>> print("String:", s3.memory_usage())
    String: 16128
    >>> print("Category:", s3.astype('category').memory_usage())
    Category: 2224
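The saving comes from how a categorical is stored: each distinct value is kept once, and every row holds only a small integer code. A minimal sketch using the `.cat` accessor (the name `cat` is just for this example):

```python
import pandas as pd

s3 = pd.Series(["Black", "Red"] * 1000)
cat = s3.astype("category")

print(list(cat.cat.categories))         # ['Black', 'Red']
print(cat.cat.codes.dtype)              # int8: one byte per row
print(cat.cat.codes.head(4).tolist())   # [0, 1, 0, 1]
```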

You can also save space by using smaller `NumPy` types.

    >>> s4 = pd.Series([22000, 3, 1, 9])
    >>> s4.memory_usage()
    160
    >>> s4.astype('int8').memory_usage()
    132

But note there is an error above! `astype` will happily convert numbers that don’t fit in the new type without reporting the error to you. Here, 22000 is outside the `int8` range of -128 to 127, so it silently wraps around to -16.

    >>> s4.astype('int8')
    0   -16
    1     3
    2     1
    3     9
    dtype: int8
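One way to guard against this is to check the value range before calling `astype`. The sketch below assumes an integer `Series` with no missing values. `fits_in` is a hypothetical helper name, and `np.iinfo` supplies the limits of the target type.

```python
import numpy as np
import pandas as pd

def fits_in(series, dtype):
    """Hypothetical helper: True if all values lie within the integer dtype's range."""
    info = np.iinfo(np.dtype(dtype))
    return bool(series.min() >= info.min and series.max() <= info.max)

s4 = pd.Series([22000, 3, 1, 9])
print(fits_in(s4, "int8"))    # False: 22000 is outside -128..127
print(fits_in(s4, "int16"))   # True: 22000 fits in -32768..32767
```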

Note that you can also use `astype` on `DataFrame`s, even specifying a different type for each column by passing a dictionary. Columns left out of the dictionary keep their existing type.

    >>> df = pd.DataFrame({'a': [1, 2, 3.3, 4], 'b': [4, 5, 2, 3], 'c': ["4", 5.5, "7.09", 1]})
    >>> df.astype('float')
         a    b     c
    0  1.0  4.0  4.00
    1  2.0  5.0  5.50
    2  3.3  2.0  7.09
    3  4.0  3.0  1.00
    >>> df.astype({'a': 'uint', 'b': 'float16'})
       a    b     c
    0  1  4.0     4
    1  2  5.0   5.5
    2  3  2.0  7.09
    3  4  3.0     1

## to_numeric (or to_datetime or to_timedelta)

There are a few better options available in pandas for converting one-dimensional data (i.e. one `Series` at a time). These functions provide better error handling than `astype` through the optional `errors` parameter, and `to_numeric` also accepts a `downcast` parameter. Take a look at how `to_numeric` can deal with the first `Series` created in this post. Using `coerce` for `errors` will turn any values that fail to convert into `NaN`. Passing in `ignore` will get the same behavior we had available in `astype`, returning our original input (note that this option is deprecated in recent pandas releases). Likewise, passing in `raise`, the default, will raise an exception.

    >>> pd.to_numeric(s, errors='coerce')
    0    1.0
    1    NaN
    2    2.0
    dtype: float64
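Coercing hides which values were bad, so it can be worth looking at what was dropped. A minimal sketch: compare the coerced result with the input and keep the rows that were present but became `NaN`.

```python
import pandas as pd

s = pd.Series([1.0, "N/A", 2])
converted = pd.to_numeric(s, errors="coerce")

# Rows that had a value in the input but are NaN afterwards failed to convert.
failed = s[converted.isna() & s.notna()]
print(failed.tolist())   # ['N/A']
```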

And if we want to save some space, we can safely downcast to the minimum size that will hold our data without errors (here we get `int16`, instead of the `int64` we would get without downcasting).

    >>> pd.to_numeric(s4, downcast='integer')
    0    22000
    1        3
    2        1
    3        9
    dtype: int16
    >>> pd.to_numeric(s4).dtype
    dtype('int64')
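As a further sketch of how `downcast` picks a size, the result depends on both the option and the values present (`small` is an extra `Series` made up for this example):

```python
import pandas as pd

s4 = pd.Series([22000, 3, 1, 9])
small = pd.Series([3, 1, 9])

as_int = pd.to_numeric(s4, downcast="integer")
as_unsigned = pd.to_numeric(s4, downcast="unsigned")
small_int = pd.to_numeric(small, downcast="integer")

print(as_int.dtype)        # int16: 22000 does not fit in int8
print(as_unsigned.dtype)   # uint16: 22000 does not fit in uint8
print(small_int.dtype)     # int8: every value fits in one byte
print(as_int.tolist())     # [22000, 3, 1, 9], values are preserved
```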

The `to_datetime` and `to_timedelta` functions behave similarly, but for dates and timedeltas.

    >>> pd.to_timedelta(['2 days', '5 min', '-3s', '4M', '1 parsec'], errors='coerce')
    TimedeltaIndex([  '2 days 00:00:00',   '0 days 00:05:00', '-1 days +23:59:57',
                      '0 days 00:04:00',                 NaT],
                   dtype='timedelta64[ns]', freq=None)
    >>> pd.to_datetime(['11/1/2020', 'Jan 4th 1919', '20200930 08:00:31'])
    DatetimeIndex(['2020-11-01 00:00:00', '1919-01-04 00:00:00',
                   '2020-09-30 08:00:31'],
                  dtype='datetime64[ns]', freq=None)
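The `errors` option works the same way for dates. In this sketch the input strings are invented, and an explicit `format` is passed so that parsing does not depend on format inference. Anything that does not match becomes `NaT`.

```python
import pandas as pd

raw_dates = pd.Series(["2020-11-01", "2020-13-45", "not a date"])
parsed = pd.to_datetime(raw_dates, format="%Y-%m-%d", errors="coerce")

print(parsed.isna().tolist())   # [False, True, True]
print(parsed[0])                # 2020-11-01 00:00:00
```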

Since these functions are all for 1-dimensional data, you will need to use `apply` on a `DataFrame`. For instance, to downcast all the values to the smallest possible floating point size, use the `downcast` parameter.

    >>> from functools import partial
    >>> df.apply(partial(pd.to_numeric, downcast='float')).dtypes
    a    float32
    b    float32
    c    float32
    dtype: object
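`apply` also forwards extra keyword arguments to the function it calls, so `partial` is optional. The sketch below uses a made-up frame (`df_raw` and its columns are hypothetical) and converts only the columns that should be numeric, leaving the identifier column alone.

```python
import pandas as pd

# Hypothetical raw data with a text id column and two messy numeric columns.
df_raw = pd.DataFrame({
    "id": ["a1", "b2", "c3"],
    "qty": ["4", "5", "n/a"],
    "price": ["1.5", "oops", 3],
})

numeric_cols = ["qty", "price"]
converted_cols = df_raw[numeric_cols].apply(pd.to_numeric, errors="coerce")

print(converted_cols.dtypes)            # both columns are float64
print(converted_cols["qty"].tolist())   # [4.0, 5.0, nan]
```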

## infer_objects

If you happen to have a pandas object that consists of objects that haven’t been converted yet, both `Series` and `DataFrame` have a method, `infer_objects`, that will attempt to convert those objects to the most sensible type. To see this, you need a somewhat contrived example, because pandas will attempt to convert objects when you create them. For example:

    >>> pd.Series([1, 2, 3, 4], dtype='object').infer_objects().dtype
    int64
    >>> pd.Series([1, 2, 3, '4'], dtype='object').infer_objects().dtype
    object
    >>> pd.Series([1, 2, 3, 4]).dtype
    int64

You can see here that if the Series happens to have all numerical types (in this case integers) but they are stored as objects, it can figure out how to convert these to integers. But it doesn’t know how to convert the ‘4’ to an integer. For that, you need to use one of the techniques from above.
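A slightly less contrived case, sketched here with made-up data, is a column that was `object` only because of a stray text row. After that row is sliced away, the column stays `object` until `infer_objects` is called.

```python
import pandas as pd

df_mixed = pd.DataFrame({"A": ["header", 1, 2, 3]})
df_numbers = df_mixed.iloc[1:]   # drop the stray text row

before = df_numbers["A"].dtype
after = df_numbers.infer_objects()["A"].dtype
print(before)   # object
print(after)    # int64
```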

## convert_dtypes

This method is new in pandas 1.0, and can convert to the best possible `dtype` that supports `pd.NA`. Note that this will be the pandas dtype rather than the `NumPy` dtype (i.e. `Int64` instead of `int64`).

    >>> pd.Series([1, 2, 3, 4], dtype='object').convert_dtypes().dtype
    Int64
    >>> pd.Series([1, 2, 3, '4'], dtype='object').convert_dtypes().dtype
    object
    >>> pd.Series([1, 2, 3, 4]).convert_dtypes().dtype
    Int64
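The benefit of the `pd.NA`-aware types shows up when values are missing. In this minimal sketch (sample values invented), a gap forces the plain `NumPy` column to `float64`, while `convert_dtypes` gives back a nullable integer column.

```python
import pandas as pd

with_gap = pd.Series([1, None, 3])
print(with_gap.dtype)    # float64: a NumPy integer column cannot hold a missing value

nullable = with_gap.convert_dtypes()
print(nullable.dtype)            # Int64
print(nullable.isna().tolist())  # [False, True, False]
print(nullable[1] is pd.NA)      # True
```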

## What should you use most often then?

What I recommend doing is looking at your raw data once it is imported. Depending on your data source, it may already be in the dtype that you want. But once you need to convert it, you have all the tools you need to do this correctly. For numeric types, the `pd.to_numeric` function is best suited for doing this conversion in a safe way, and with wise use of the `downcast` parameter, you can also save space. Consider using `astype("category")` when you have repeated data to save some space as well. The `convert_dtypes` and `infer_objects` methods are not going to be that helpful in most cases unless you somehow have data stored as objects that is readily convertible to another type. Remember, there’s no magic function in pandas that will ensure you have the best data type for every case; you need to examine and understand your own data to use or analyze it correctly. But knowing the best way to do that conversion is a great start.
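Putting these recommendations together, a cleanup pass might look roughly like the sketch below. The frame and its column names are hypothetical, and the right choices will depend on your own data.

```python
import pandas as pd

# Hypothetical raw import: everything arrives as text.
raw = pd.DataFrame({
    "amount": ["1.5", "N/A", "3"],
    "color": ["Black", "Red", "Black"],
})
print(raw.dtypes)   # inspect first

cleaned = raw.assign(
    amount=pd.to_numeric(raw["amount"], errors="coerce"),  # bad values become NaN
    color=raw["color"].astype("category"),                 # repeated labels
)
print(cleaned.dtypes)
print(cleaned["amount"].mean())   # 2.25
```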