We wish to select one or more columns from a data frame, identifying the columns to select by name.
We wish to select and extract a single column from a data frame by specifying its name.
When selecting a single column, we can return it either as a `Series` or as a `DataFrame`. Which one is appropriate depends on whether the operations we wish to run afterwards expect a `Series` or a `DataFrame`.
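The difference between the two return types can be seen in a minimal sketch. The data frame below, with columns `col_1` and `col_2`, is a hypothetical example made up for illustration and is not part of the original recipe.

```python
import pandas as pd

# Hypothetical example data; the column names mirror those used in this recipe.
df = pd.DataFrame({'col_1': [1, 2, 3], 'col_2': ['a', 'b', 'c']})

# A scalar column name returns a one-dimensional Series.
as_series = df.loc[:, 'col_1']

# A list holding one column name returns a DataFrame with a single column.
as_frame = df.loc[:, ['col_1']]

print(type(as_series).__name__, as_series.shape)
print(type(as_frame).__name__, as_frame.shape)
```

The `Series` has a one-dimensional shape, while the `DataFrame` keeps two dimensions and retains the column label `col_1`.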
As Series
We wish to select a single column from a data frame by specifying its name and to return the selected column as a `Series`.
df.loc[:, 'col_1']
Here is how this works:
- We apply `loc[]` to the data frame `df`.
- `loc[]` can be used to select columns by name.
- Besides selecting columns, `loc[]` can also subset rows (see Filtering). The first argument to `loc[]` describes the rows to select and the second argument describes the columns to select.
- We pass the column name (`col_1`) as the second argument of `loc[]`.
- Because we want every row of the selected column (`col_1`), we pass `:` as the first argument to `loc[]`. Without `:` as the first argument, `loc[]` would assume that we are selecting rows.
- The output is a `Series` holding the column `col_1` of the data frame `df`.
Alternatively:
df['col_1']
Here is how this works:
- We use the bracket operator `[]` to select a single column by name.
- The output is a `Series` holding the selected column; here `col_1`.
- While `[]` is popular for column selection, we recommend using `loc[]`. The disadvantages of using the bracket operator for selecting columns are:
  - Because the `[]` operator can be used for both column selection and row selection, it is at times not obvious whether we are referring to rows or columns. It is in general good practice to stay away from ambiguous constructs. The slightly more verbose `df.loc[:, 'col_1']` provides an unambiguous alternative that only ever refers to columns.
  - `[]` does not offer some of the functionality that `loc[]` supports; e.g. selecting a range of columns (which we cover below).
  - While both `[]` and `loc[]` can be used in a chain, `loc[]` tends to read more clearly.
As Data Frame
We wish to select a single column from a data frame by specifying their name and to return the selected column as a data frame of one column.
df.loc[:, ['col_1']]
Here is how this works:
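- See the description of `loc[]` for column selection by name above.
- To return a data frame rather than a `Series`, we pass the column name inside a list to `loc[]`; i.e. `df.loc[:, ['col_1']]`.
Alternatively: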
df[['col_1']]
Here is how this works:
- To return a data frame, we pass the column name inside a list; i.e. `['col_1']`.
- We recommend `loc[]` over the bracket operator `[]` for column selection for the reasons described above.
Given a data frame, we wish to return another data frame that comprises a subset of the columns of the original data frame. We wish to specify the columns to select by name.
df.loc[:, ['col_1', 'col_2']]
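To try the selection above end to end, here is a minimal sketch. The data frame, the derived column `col_4`, and the missing name `col_9` are all assumptions made up for illustration. The sketch also shows `loc[]` inside a method chain and the `KeyError` that recent pandas versions raise when a requested name is absent.

```python
import pandas as pd

# Hypothetical example data (an assumption for illustration).
df = pd.DataFrame({
    'col_1': [1, 2, 3],
    'col_2': [10, 20, 30],
    'col_3': ['x', 'y', 'z'],
})

# Select two columns by name; ':' keeps every row.
subset = df.loc[:, ['col_1', 'col_2']]

# In a chain, a column created by assign() can be selected by name right away.
chained = (
    df
    .assign(col_4=df['col_1'] + df['col_2'])
    .loc[:, ['col_1', 'col_4']]
)

# Asking for a column name that does not exist raises a KeyError.
try:
    df.loc[:, ['col_1', 'col_9']]
    missing_raises = False
except KeyError:
    missing_raises = True

print(subset.columns.tolist())
print(chained.columns.tolist())
print(missing_raises)
```

The result keeps the columns in the order given in the list, which need not match their order in the original data frame.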
Here is how this works:
- `loc[]` selects columns by name.
- `loc[]` is designed to subset both rows and columns simultaneously. The first argument to `loc[]` describes the rows to select and the second argument describes the columns to select.
- We pass a list of the column names to select as the second argument of `loc[]`; here `['col_1', 'col_2']`.
- Because we want every row of the selected columns (e.g. `col_1`), we pass `:` as the first argument to `loc[]`. Without `:` as the first argument, `loc[]` would assume that we are selecting rows.
- `loc[]` is great for chaining. We can refer to columns created earlier within the same chain via their name in `loc[]` (without a need for a lambda function).
- `loc[]` raises a `KeyError` if, for instance, one of the requested column names does not exist in the data frame.
Alternatively:
df[['col_1','col_2']]
Here is how this works:
- We recommend `loc[]` over the bracket operator `[]` for column selection for the reasons described above.
Given a data frame, we wish to return another data frame that comprises a range of columns from the original data frame; i.e. we wish to return every column between a given start column and end column, including both start and end. We wish to specify the start and end columns by name.
df_2 = df.loc[:, 'col_1':'col_4']
Here is how this works:
- The `:` before the comma instructs `loc[]` to return all rows.
- The `:` in `'col_1':'col_4'` instructs `loc[]` to select a range of consecutive columns.
- Both `col_1` and `col_4`, as well as any columns in between, will be returned.
- This works with `loc[]` but not with the bracket operator `[]` (we can pass a range to the bracket operator, but it assumes that the range denotes rows, not columns).
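The inclusive range behaviour, and the way the bracket operator treats a range as rows, can be checked with a short sketch. The five-column data frame is a hypothetical example made up for illustration.

```python
import pandas as pd

# Hypothetical example data with five columns and two rows.
df = pd.DataFrame({
    'col_1': [1, 2],
    'col_2': [3, 4],
    'col_3': [5, 6],
    'col_4': [7, 8],
    'col_5': [9, 10],
})

# A label slice in loc[] includes both the start and the end column.
df_2 = df.loc[:, 'col_1':'col_4']

# A slice passed to the bracket operator is read as a row slice:
# this returns the first row with all five columns.
rows = df[0:1]

print(df_2.columns.tolist())
print(rows.shape)
```

Note that the range follows the column order of the data frame, so the columns returned are those positioned between the start and end columns, regardless of how their names sort alphabetically.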