Often times we wish to select columns of one or more data types; for instance to apply a data type specific operation to all of them e.g. select columns with a numeric data type and then round each to 2 decimal places.
See General Operations for a coverage of data types and data type transformation.
We wish to select all columns of one particular data type.
In this example, we wish to select all columns of a numeric data type (i.e. float or integer).
df_2 = df.select_dtypes('number')
Here is how this works:
select_dtypes()
method of a Pandas data frame takes a reference to a data type and returns a new data frame comprised only of columns of the original data frame that are of the specified data type.int8
or float64
or pass the name of a dtype
in the NumPy hierarchy e.g. np.int64
(find a full list here). However, where the select_dtypes()
function stands out is that it accepts references to entire “classes” of data types e.g. number
to refer to all numeric (integer and float) data types which makes it convenient and robust to work with.'number'
to capture all columns of any numeric data type (integer or float).'integer'
to capture all columns of an integer data type e.g. int64
'float'
to capture all columns of a float data type e.g. float64
‘object’
to capture string columns‘boolean’
to capture columns of a boolean data type‘category’
to capture columns of the Pandas Category
data typesum()
to each column of the data frame. For instance, df.sum()
returns the sum of each column of a data frame df
. Usually to successfully use these convenience methods, all columns must be of a compatible data type. Therefore, the coding pattern of selecting columns of a certain data type then applying a Data Frame method is quite common eg df.select_dtype('number').mean()
. See Selecting by Data Criteria for more on this pattern.Alternatively:
There are two methods for selecting columns by data type that via loc[]
. Those are:
By Direct Comparison
df_2 = df.loc[:,(df.dtypes=='int64')]
df.dtypes
returns a Series
of the data types of the columns of the DataFrame
df
(in the same order as the columns).Series
with a given data type to return a Series
of boolean values that is True
where the data type matches and False
otherwise.loc[]
then returns the columns corresponding to Series
elements that are True
. Recall loc[]
works with both boolean vector and column names.loc[]
allows for a consistent scheme for selecting columns in different ways whether it is by name, by name pattern, data type, or data criteria.df.loc[:, (df.dtypes == 'float64') & df.columns.str.startswith('col_')]
df.loc[:, (df.dtypes == 'float64') | (df.dtypes == 'int64')]
(where we assumed that only int64
and float64
are used but there can be many more). While we can use loc[]
for column selection by data type, in situations where we wish to capture a class of data types, it can get quite tedious. In those situations we recommend select_dtypes()
. As much as we care for consistency, this is a case where functionality trumps consistency.Using Pandas.api.types
sub-package
df_2 = df.loc[:, df.apply(lambda x: pd.api.types.is_numeric_dtype(x))]
pd.api.types.is_numeric_dtype()
is one of many functions provided by the Pandas.api.types
sub package that act on one column at a time and return True
if the column is of the data type that the function checks for and False
otherwise.apply()
to iterate over all columns of the DataFrame
df
to check whether the data type of each is numeric. See Section 4.3 for an explanation of selecting dynamically via a function.select_dtypes()
.apply()
to iterate over all columns to check their data types.We wish to select all columns of two or more data types.
In this example, we wish to select all columns of integer or logical data types.
df_2 = df.select_dtypes(["integer", "boolean"])
Here is how this works:
select_dtypes()
(named include
) can take a list of data type references.select_dtypes()
will return a data frame that contains columns with only those data types (columns of other data types are excluded).