We wish to select one or more columns from a data frame, identifying the columns to be selected by name.
We wish to select and extract a single column from a data frame by specifying its name.
When selecting a single column, we can return it either as a vector or as a data frame. Which is appropriate depends on whether the operation(s) we wish to run afterwards expect a vector or a data frame.
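To make the vector versus data frame distinction concrete, here is a minimal sketch. It assumes the dplyr package is installed and uses a small made-up data frame (its values are hypothetical, chosen only for illustration); select() and pull() are the two functions this section goes on to describe.

```r
library(dplyr)

# Hypothetical example data; the column names mirror those used in this section.
df = data.frame(col_1 = c(1, 2, 3), col_2 = c("a", "b", "c"))

# select() keeps the data frame structure: one column, three rows.
as_df = df %>% select(col_1)

# pull() extracts the values themselves as a plain vector.
as_vec = df %>% pull(col_1)

class(as_df)   # "data.frame"
class(as_vec)  # "numeric"

# The pulled vector matches what base R's $ operator returns.
identical(as_vec, df$col_1)  # TRUE

# Functions such as mean() expect a vector, so the pulled column fits directly.
mean(as_vec)  # 2
```

A reasonable rule of thumb is to keep the data frame form when further data frame operations follow in the chain, and to extract a vector when the next step is a function that operates on vectors.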
As Data Frame
We wish to select a single column from a data frame by specifying its name and to return the selected column as a data frame of one column.
df_2 = df %>% select(col_1)
Here is how this works:
We use the pipe %>% to pass the data frame df to the function select().
We pass to select() the name of the column we wish to select; here col_1.
select() comes from the dplyr package, part of the tidyverse, and is the definitive function (referred to as a “verb” in tidyverse vernacular) for column selection in the tidyverse.
As Vector
We wish to select a single column from a data frame by specifying its name and to return the selected column as a vector.
col = df %>% pull(col_1)
Here is how this works:
We use the pipe %>% to pass the data frame df to the function pull().
We pass to pull() the name of the column we wish to select; here col_1.
pull() returns the selected column as a vector (not a data frame, which is what select() returns).
pull() essentially performs the same function as the $ operator from base R; i.e. df %>% pull(col_1) is equivalent to df$col_1. The reason we prefer pull() is that it fits better in a chain (looks better with pipes).
Given a data frame, we wish to return another data frame that consists of a subset of the columns of the original data frame. We wish to specify the columns to select by name.
df_2 = df %>% select(col_1, col_2)
Here is how this works:
We use the pipe %>% to pass the data frame df to the function select().
We pass to select() the names of the columns we wish to select, separated by commas; here col_1, col_2.
select() is the definitive function for column selection in the tidyverse.
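As a small sketch of selecting several columns by name, the example below assumes dplyr is installed and uses a hypothetical three-column data frame. It also illustrates that select() returns the columns in the order they are named in the call, which may differ from their order in the original data frame.

```r
library(dplyr)

# Hypothetical example data.
df = data.frame(
  col_1 = 1:3,
  col_2 = c("a", "b", "c"),
  col_3 = c(TRUE, FALSE, TRUE)
)

# Keep only the two named columns; all rows are retained.
df_2 = df %>% select(col_1, col_2)
names(df_2)  # "col_1" "col_2"

# Columns come back in the order they are named in the call.
df_3 = df %>% select(col_3, col_1)
names(df_3)  # "col_3" "col_1"
```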
Given a data frame, we wish to return another data frame that consists of a range of columns from the original data frame, i.e. we wish to return every column between a given start column and end column, including both start and end. We wish to specify the start and end columns by name.
df_2 = df %>% select(col_1:col_4)
Here is how this works:
We use the pipe %>% to pass the data frame df to the function select().
We pass to select() the names of the start and end columns for the range of columns we wish to extract, separated by a colon; here the start column is col_1 and the end column is col_4.
Both the start and end columns are returned by select() as part of the selected range of columns. So in this case, select() returns col_1, col_2, col_3 and col_4 (assuming that is the order of the columns in the original data frame df).
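The dependence on column order can be seen in a short sketch. It assumes dplyr is installed and uses a hypothetical five-column data frame; the reordered copy df_shuffled is introduced purely for illustration.

```r
library(dplyr)

# Hypothetical example data with five columns in a known order.
df = data.frame(col_0 = 1:2, col_1 = 3:4, col_2 = 5:6, col_3 = 7:8, col_4 = 9:10)

# The range includes both end points and everything positioned between them.
df_2 = df %>% select(col_1:col_4)
names(df_2)  # "col_1" "col_2" "col_3" "col_4"

# The range follows column position, not column name: after reordering
# the columns, the same range expression picks a different set.
df_shuffled = df %>% select(col_1, col_4, col_2, col_3, col_0)
range_shuffled = df_shuffled %>% select(col_1:col_4)
names(range_shuffled)  # "col_1" "col_4"
```

Because the result depends on where the columns sit, it may be worth checking names(df) before relying on a range, particularly when the data frame is produced by an upstream step that could reorder columns.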