Selecting by Name

We wish to select one or more columns from a data frame and to identify the columns to be selected by specifying their name(s).

Single Column

We wish to select and extract a single column from a data frame by specifying its name.

When selecting a single column, we have the choice of returning it as a vector or as a data frame. Which is appropriate depends on whether the operation(s) we wish to run afterwards expect a vector or a data frame.

As Data Frame

We wish to select a single column from a data frame by specifying its name and to return the selected column as a data frame of one column.

df_2 = df %>% select(col_1)

Here is how this works:

  • We pass the data frame df to the function select().
  • We pass to select() the name of the column we wish to select; here col_1.
  • Note that we simply pass the column name without any quotes. This is one of the powerful conveniences provided by the tidyverse.
  • select() is the definitive function (referred to as a “verb” in tidyverse vernacular) for column selection in the tidyverse.
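
The steps above can be tried on a small example. This is a minimal sketch: the data frame df and its contents are made up for illustration, and only the column name col_1 comes from the text.

```r
library(dplyr)

# Hypothetical toy data; only the name col_1 is taken from the text above.
df = data.frame(col_1 = c(1, 2, 3), col_2 = c("a", "b", "c"))

# select() with a single column name still returns a data frame.
df_2 = df %>% select(col_1)

is.data.frame(df_2)  # TRUE
ncol(df_2)           # 1
names(df_2)          # "col_1"
```

The result keeps all rows of df but only the one requested column, so it can be passed on to any function that expects a data frame.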

As Vector

We wish to select a single column from a data frame by specifying its name and to return the selected column as a vector.

col = df %>% pull(col_1)

Here is how this works:

  • We pass the data frame df to the function pull().
  • We pass to pull() the name of the column we wish to select; here col_1.
  • pull() returns the selected column as a vector (not as a data frame, which is what select() returns).
  • pull() essentially performs the same function as the $ operator from base R; i.e. df %>% pull(col_1) is equivalent to df$col_1. The reason we prefer pull() is that it fits better in a chain (it reads better with pipes).
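
To make the difference between pull() and select() concrete, here is a minimal sketch on made-up data. The data frame contents and the filter() step are assumptions added for illustration; they are not part of the original example.

```r
library(dplyr)

# Hypothetical toy data; only the name col_1 is taken from the text above.
df = data.frame(col_1 = c(1, 2, 3), col_2 = c("a", "b", "c"))

# pull() returns the column itself, not a one-column data frame.
col = df %>% pull(col_1)

is.data.frame(col)        # FALSE
identical(col, df$col_1)  # TRUE

# pull() slots into a longer chain, e.g. to feed a vector into mean().
avg = df %>% filter(col_1 > 1) %>% pull(col_1) %>% mean()
avg                       # 2.5
```

The last chain shows the design motivation: with $ we would have to break the pipeline or wrap it in parentheses, whereas pull() keeps the steps reading left to right.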

List of Columns

Given a data frame, we wish to return another data frame that is comprised of a subset of the columns of the original data frame. We wish to specify the columns to select by their names.

df_2 = df %>% select(col_1, col_2)

Here is how this works:

  • We pass the data frame df to the function select().
  • We pass to select() the names of the columns we wish to select, separated by commas; here col_1, col_2.
  • As mentioned above, select() is the definitive function for column selection in the tidyverse.
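
A minimal sketch of selecting several columns by name follows. The data frame and the third column col_3 are hypothetical and exist only so that something is left out of the selection.

```r
library(dplyr)

# Hypothetical toy data with three columns.
df = data.frame(
  col_1 = c(1, 2, 3),
  col_2 = c("a", "b", "c"),
  col_3 = c(TRUE, FALSE, TRUE)
)

# Keep only the two named columns.
df_2 = df %>% select(col_1, col_2)
names(df_2)  # "col_1" "col_2"

# The output follows the order in which the names are given to select().
df_3 = df %>% select(col_2, col_1)
names(df_3)  # "col_2" "col_1"
```

Because the output order follows the argument order, select() can also be used to reorder columns while subsetting.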

Range of Columns

Given a data frame, we wish to return another data frame that is comprised of a range of columns from the original data frame; i.e. we wish to return every column between a given start column and end column, including both the start and the end. We wish to specify the start and end columns by their names.

df_2 = df %>% select(col_1:col_4)

Here is how this works:

  • We pass the data frame df to the function select().
  • We pass to select() the names of the start and end columns of the range we wish to extract, separated by a colon; here the start column is col_1 and the end column is col_4.
  • Note that both the start and end columns are returned by select() as part of the selected range. The range is determined by column position, not by name, so select() returns every column located between col_1 and col_4 in df. In this case that is col_1, col_2, col_3 and col_4 (assuming that is the order in which the columns appear in the original data frame df).
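
The range behaviour can be checked with a small sketch. Both data frames below are made up for illustration; the second one, with a hypothetical column named other, is there to show what happens when the columns are not laid out as assumed above.

```r
library(dplyr)

# Hypothetical toy data with five columns in the assumed order.
df = data.frame(col_1 = 1:2, col_2 = 3:4, col_3 = 5:6, col_4 = 7:8, col_5 = 9:10)

# Both endpoints are included; col_5 lies outside the range and is dropped.
df_2 = df %>% select(col_1:col_4)
names(df_2)  # "col_1" "col_2" "col_3" "col_4"

# A differently ordered data frame: the range covers whatever sits
# between the two endpoints, regardless of the column names.
df_b = data.frame(col_1 = 1:2, other = 3:4, col_4 = 5:6, col_2 = 7:8)
df_b_2 = df_b %>% select(col_1:col_4)
names(df_b_2)  # "col_1" "other" "col_4"
```

If the column order of the input is not guaranteed, listing the columns explicitly may be the safer choice than relying on a range.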