Selecting by Data Type

Often times we wish to select columns of one or more data types; for instance to apply a data type specific operation to all of them e.g. select columns with a numeric data type and then round each to 2 decimal places.

See General Operations for a coverage of data types and data type transformation.

Single Data Type

We wish to select all columns of one particular data type.

In this example, we wish to select all columns of a numeric data type (i.e. float or integer).

df_2 = df.select_dtypes('number')

Here is how this works:

  • The select_dtypes() method of a Pandas data frame takes a reference to a data type and returns a new data frame comprised only of columns of the original data frame that are of the specified data type.
  • We can refer to any of the Python data types e.g. int8 or float64 or pass the name of a dtype in the NumPy hierarchy e.g. np.int64 (find a full list here). However, where the select_dtypes() function stands out is that it accepts references to entire “classes” of data types e.g. number to refer to all numeric (integer and float) data types which makes it convenient and robust to work with.
  • Common data type references are:
    • 'number' to capture all columns of any numeric data type (integer or float).
    • 'integer' to capture all columns of an integer data type e.g. int64
    • 'float' to capture all columns of a float data type e.g. float64
    • ‘object’ to capture string columns
    • ‘boolean’ to capture columns of a boolean data type
    • ‘category’ to capture columns of the Pandas Category data type
  • A Pandas data frame has a set of methods that allow for concise code when applying common function e.g. sum() to each column of the data frame. For instance, df.sum() returns the sum of each column of a data frame df. Usually to successfully use these convenience methods, all columns must be of a compatible data type. Therefore, the coding pattern of selecting columns of a certain data type then applying a Data Frame method is quite common eg df.select_dtype('number').mean(). See Selecting by Data Criteria for more on this pattern.

Alternatively:

There are two methods for selecting columns by data type that via loc[]. Those are:

By Direct Comparison

df_2 = df.loc[:,(df.dtypes=='int64')]
  • Here is how this works:
    • df.dtypes returns a Series of the data types of the columns of the DataFrame df (in the same order as the columns).
    • We can compare the Series with a given data type to return a Series of boolean values that is True where the data type matches and False otherwise.
    • loc[] then returns the columns corresponding to Series elements that are True. Recall loc[] works with both boolean vector and column names.
  • The advantages of this approach are:
    • Using loc[] allows for a consistent scheme for selecting columns in different ways whether it is by name, by name pattern, data type, or data criteria.
    • Easy to combine with other column selection logic e.g. df.loc[:, (df.dtypes == 'float64') & df.columns.str.startswith('col_')]
  • The disadvantage is we do not have a way to pick a class of data type e.g. all numeric columns without spelling them out e.g. df.loc[:, (df.dtypes == 'float64') | (df.dtypes == 'int64')] (where we assumed that only int64 and float64 are used but there can be many more). While we can use loc[] for column selection by data type, in situations where we wish to capture a class of data types, it can get quite tedious. In those situations we recommend select_dtypes(). As much as we care for consistency, this is a case where functionality trumps consistency.

Using Pandas.api.types sub-package

df_2 = df.loc[:, df.apply(lambda x: pd.api.types.is_numeric_dtype(x))]
  • Here is how this works:
    • pd.api.types.is_numeric_dtype() is one of many functions provided by the Pandas.api.types sub package that act on one column at a time and return True if the column is of the data type that the function checks for and False otherwise.
    • We use apply() to iterate over all columns of the DataFrame df to check whether the data type of each is numeric. See Section 4.3 for an explanation of selecting dynamically via a function.
  • The advantages of this approach are:
    • The same two advantages of the approach above.
    • Possible to refer to data type “classes” such as all numeric data types like select_dtypes().
  • The disadvantage is it is verbose and potentially expensive to use apply() to iterate over all columns to check their data types.

Multiple Data Types

We wish to select all columns of two or more data types.

In this example, we wish to select all columns of integer or logical data types.

df_2 = df.select_dtypes(["integer", "boolean"])

Here is how this works:

  • The first argument of select_dtypes() (named include) can take a list of data type references.
  • When passed a list of data type references, select_dtypes() will return a data frame that contains columns with only those data types (columns of other data types are excluded).
PYTHON
I/O