Selecting by Name

We wish to select one or more columns from a data frame, identifying the columns to be selected by name.

Single Column

We wish to select and extract a single column from a data frame by specifying its name.

When selecting a single column, we have the choice of returning that as a Series or as a DataFrame. The choice of which is appropriate depends on whether the operation(s) we wish to run afterwards expect a Series or a DataFrame.

As Series

We wish to select a single column from a data frame by specifying its name and to return the selected column as a Series.

df.loc[:, 'col_1']

Here is how this works:

  • We apply the selector loc[] to the data frame df.
  • loc[] can be used to select columns by name.
  • loc[] can subset both rows and columns (row selection is covered in Filtering). The first argument to loc[] describes the rows to select and the second argument describes the columns to select.
    • We pass the name of the desired column (here col_1) as the second argument of loc[].
    • Since we are looking to get all rows for the specified column (here col_1), we pass : as the first argument for loc[] (without passing : as the first argument, loc[] would assume that we are selecting rows).
  • The output is a Series holding the column col_1 of the data frame df.
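As a minimal sketch of the steps above, the snippet below builds a small data frame and extracts one column with loc[]. The data and the column names col_1 and col_2 are placeholders chosen for illustration, not part of any real dataset.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': ['a', 'b', 'c'],
})

# First argument ':' keeps all rows; second argument names the column.
col = df.loc[:, 'col_1']

print(type(col).__name__)  # Series
print(col.name)            # col_1
```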

Alternatively:

df['col_1']

Here is how this works:

  • We use the bracket operator [] to select a single column by name.
  • The output is a Series holding the selected column; here col_1.
  • While the bracket operator [] is popular for column selection, we recommend using loc[]. The disadvantages of using the bracket operator for selecting columns are:
    • It is ambiguous. Because the bracket [] operator can be used for both column selection and row selection, it is at times not obvious whether we are referring to rows or columns. It is in general good practice to stay away from ambiguous constructs. The slightly more verbose df.loc[:, 'col_1'] provides an unambiguous alternative that only ever refers to columns.
    • It doesn’t support all the column selection scenarios that loc[] supports; e.g. selecting a range of columns (which we cover below).
    • Both the bracket operator [] and loc[] can be used in a chain, but loc[] reads more clearly there because it states the row selection and the column selection explicitly.
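The ambiguity described above can be seen in a short sketch. The data frame is hypothetical and uses the default integer row index; with other kinds of index, slicing with the bracket operator may behave differently.

```python
import pandas as pd

# Hypothetical example data with the default integer row index.
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': ['a', 'b', 'c'],
})

# A string label inside [] selects a column and returns a Series.
by_label = df['col_1']

# A slice inside [] selects rows instead and returns a DataFrame.
by_slice = df[0:2]

# loc[] names the rows and the columns explicitly, so the intent is clear.
by_loc = df.loc[:, 'col_1']

print(by_slice.shape)           # (2, 2)
print(by_label.equals(by_loc))  # True
```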

As Data Frame

We wish to select a single column from a data frame by specifying its name and to return the selected column as a data frame of one column.

df.loc[:, ['col_1']]

Here is how this works:

  • We describe the use of loc[] for column selection by name above.
  • To return a data frame we pass the name of the column as a list to the second argument of loc[]; i.e. df.loc[:, ['col_1']].
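To make the difference between the two return types concrete, here is a small sketch that selects the same column once as a Series and once as a one-column data frame. The data and column names are placeholders.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': ['a', 'b', 'c'],
})

as_series = df.loc[:, 'col_1']    # bare name -> Series
as_frame = df.loc[:, ['col_1']]   # name wrapped in a list -> DataFrame

print(as_series.shape)         # (3,)
print(as_frame.shape)          # (3, 1)
print(list(as_frame.columns))  # ['col_1']
```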

Alternatively:

df[['col_1']]

Here is how this works:

  • To use the bracket operator to return one column as a data frame, we pass the name of the column as a list; i.e. ['col_1'].
  • We recommend loc[] over the bracket operator [] for column selection for the reasons described above.

List of Columns

Given a data frame, we wish to return another data frame that is comprised of a subset of the columns of the original data frame. We wish to specify the columns that we wish to select by their name.

df.loc[:, ['col_1', 'col_2']]

Here is how this works:

  • loc[] selects columns by name.
  • loc[] is designed to subset both rows and columns simultaneously. The first argument to loc[] describes the rows to select and the second argument describes the columns to select.
    • We pass the names of the desired columns as a list to the second argument of loc[]; here ['col_1', 'col_2'].
    • Since we are looking to get all rows for the specified columns (here col_1 and col_2), we pass : as the first argument for loc[]. Without passing : as the first argument, loc[] would assume that we are selecting rows.
  • The output is a data frame containing the selected columns in the same order as the column names in the input list.
  • loc[] is great for chaining. We can refer to columns created earlier within the same chain via their name in loc[] (without a need for a lambda function).
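  • We get an error if:
    • We refer to a column that does not exist (or if we mistype a column name). This raises a KeyError.
    • We pass multiple columns that are not wrapped in a list i.e. if we omit the internal square brackets. With loc[], this raises an IndexingError reporting too many indexers, because each extra name is read as an additional indexing argument.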
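The points above can be sketched as follows. The data frame, the column names, and the derived column col_4 are hypothetical. The second try block catches a generic Exception because the exact error type for an unwrapped list of names may depend on the pandas version.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': [4, 5, 6],
    'col_3': [7, 8, 9],
})

# The output columns follow the order of the input list.
subset = df.loc[:, ['col_3', 'col_1']]
print(list(subset.columns))  # ['col_3', 'col_1']

# In a chain, loc[] can refer by name to a column created one step earlier.
chained = (
    df
    .assign(col_4=df['col_1'] + df['col_2'])
    .loc[:, ['col_1', 'col_4']]
)

# A column name that does not exist raises a KeyError.
missing_error = None
try:
    df.loc[:, ['col_1', 'col_9']]
except KeyError as err:
    missing_error = err

# Omitting the inner square brackets also raises an error.
unwrapped_error = None
try:
    df.loc[:, 'col_1', 'col_2']
except Exception as err:
    unwrapped_error = err
```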

Alternatively:

df[['col_1','col_2']]

Here is how this works:

  • To use the bracket operator to return a data frame containing a subset of the columns of the original data frame, we can pass a list containing the column names to the bracket operator.
  • We recommend loc[] over the bracket operator [] for column selection for the reasons described above.
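As a quick check, assuming a small hypothetical data frame, the bracket form and the loc[] form return the same result when given the same list of names. The sketch also shows what happens when the inner list brackets are left out of the bracket form.

```python
import pandas as pd

# Hypothetical example data; the column names are placeholders.
df = pd.DataFrame({
    'col_1': [1, 2],
    'col_2': [3, 4],
    'col_3': [5, 6],
})

with_brackets = df[['col_1', 'col_2']]
with_loc = df.loc[:, ['col_1', 'col_2']]
print(with_brackets.equals(with_loc))  # True

# Without the inner brackets, the names form a tuple that is looked up
# as a single column label, which does not exist here.
tuple_error = None
try:
    df['col_1', 'col_2']
except KeyError as err:
    tuple_error = err
```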

Range of Columns

Given a data frame, we wish to return another data frame that is comprised of a range of columns from the original data frame i.e. we wish to return every column between a given start column and end column including both start and end. We wish to specify the start and end column by their name.

df_2 = df.loc[:, 'col_1':'col_4']

Here is how this works:

  • The : before the comma instructs loc[] to return all rows.
  • The : in 'col_1':'col_4' instructs loc[] to select a range of consecutive columns.
  • Selection by a name range is inclusive i.e. both the start and stop columns as well as every column between them are returned. In this example, col_1, col_4 as well as any columns in between will be returned.
  • Selecting a range of columns by name can only be done with loc[], not with the bracket operator [] (we can pass a range to the bracket operator, but it assumes that the range denotes rows, not columns).
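A small sketch of the inclusive range selection described above, using a hypothetical data frame with six columns named col_0 through col_5 and the default integer row index:

```python
import pandas as pd

# Hypothetical example data: six columns, three rows.
df = pd.DataFrame({
    'col_0': [0, 0, 0],
    'col_1': [1, 1, 1],
    'col_2': [2, 2, 2],
    'col_3': [3, 3, 3],
    'col_4': [4, 4, 4],
    'col_5': [5, 5, 5],
})

# Both endpoints of the name range are included in the result.
df_2 = df.loc[:, 'col_1':'col_4']
print(list(df_2.columns))  # ['col_1', 'col_2', 'col_3', 'col_4']

# A slice passed to the bracket operator selects rows, not columns.
rows = df[1:3]
print(rows.shape)  # (2, 6)
```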