Standardize Names

We wish to change the column names of a data frame to have them conform to a standard naming convention. In particular, we cover converting column names to snake_case and to camelCase.

For general column name cleaning, see Implicit Renaming and String Operations.

Snake Case

We wish to change the names of columns so that they are in snake case, i.e. “words” are separated by underscores ‘_’ and the entire name is in lowercase.

library(janitor)

df_2 = df %>% clean_names()

Here is how this works:

  • We use the clean_names() function from the janitor package.
  • The clean_names() function has an argument case that allows us to specify the naming style that we wish to have the column names conform to. By default case="snake" and, therefore, we left case unspecified.
  • The clean_names() function essentially does two things:
    • Cleans names: It replaces spaces or other delimiters with underscores ‘_’, drops characters that are not letters or numbers, lowers case, and makes names unique.
    • Converts styles: If the original column names were “clean” but in another case, say camel case, they are converted to the style specified by the case argument, which here is case="snake".
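
As a quick illustration of both behaviors, here is a minimal sketch on a made-up data frame. The data frame df_messy and its column names are assumptions introduced only for this example.

```r
library(janitor)

# Hypothetical data frame with inconsistently styled column names.
# check.names = FALSE keeps the names exactly as written.
df_messy = data.frame(
  `First Name` = c("Ada", "Alan"),
  lastName = c("Lovelace", "Turing"),
  `Total.Sales` = c(10, 20),
  check.names = FALSE)

# Default case = "snake": separators are cleaned and camel case is converted.
df_clean = df_messy %>% clean_names()

names(df_clean)
```

With these inputs the resulting names should be first_name, last_name, and total_sales.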

Alternative: Arbitrary Separator to Snake Case via Regex

library(dplyr)
library(stringr)

df_2 = df %>% 
  rename_with(
    ~ str_replace_all(.x, '[\\W_]+', '_') %>% 
      str_to_lower())

Here is how this works:

  • We use rename_with() to rename columns by applying a function to existing column names and using the output as the new column names. See Implicit Renaming.
  • We use str_replace_all() to replace any separators with an underscore. See String Replacing.
  • The regular expression we use to capture separators is '[\\W_]+' and works as follows:
    • \\W captures any character that is not a letter or digit or underscore, i.e. it captures punctuation marks, symbols, and whitespace.
    • _ captures underscores. We need to include it explicitly because \\W does not match the underscore character.
    • [] defines a character class, i.e. it matches any one of the characters listed within.
    • + matches one or more consecutive occurrences of the preceding element, so a run of separators is replaced by a single underscore.
  • The replacement is '_' i.e. we replace what is captured by the pattern with an underscore.
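
To see the pattern in isolation, the sketch below applies it to a plain character vector rather than to a data frame. The vector old_names is a hypothetical set of column names chosen for illustration.

```r
library(stringr)

# Hypothetical column names using different separators.
old_names = c("First Name", "total.sales", "unit--price", "price (USD)")

# Replace each run of separators with one underscore, then lowercase.
new_names = str_to_lower(str_replace_all(old_names, "[\\W_]+", "_"))

new_names
# "first_name" "total_sales" "unit_price" "price_usd_"
```

Note the trailing underscore in the last name: a separator at the start or end of a name is also replaced, so such names may need an extra trimming step.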

Alternative: Camel Case to Snake Case via Regex

library(dplyr)
library(stringr)

df_2 = df %>% 
  rename_with(
    ~ str_replace_all(.x, '(?<!^)(?=[A-Z])', '_') %>% 
      str_to_lower())

Here is how this works:

  • We use rename_with() to rename columns by applying a function to existing column names and using the output as the new column names. See Implicit Renaming.
  • We use str_replace_all() to insert an underscore at every position matched by the regular expression, then str_to_lower() to convert the result to lowercase. See String Replacing.
  • The regular expression we use here identifies a position in a string that is not the start of the string and where the next character is uppercase. It works as follows:
    • The regular expression (?<!^)(?=[A-Z]) combines two zero-width lookaround assertions: a negative look-behind and a positive look-ahead. Neither consumes any characters, so the match is a position rather than a piece of text.
    • The negative look-behind (?<!^) asserts that what precedes the current position does not match the pattern inside it. Here that pattern is ^, the start of the string, so the assertion excludes the position before the first character and no leading underscore is inserted.
    • The positive look-ahead (?=[A-Z]) asserts that what follows the current position matches the pattern inside it. Here that pattern is [A-Z], so the next character must be an uppercase letter.
  • The replacement is '_' i.e. we insert an underscore in each position matched by the regular expression above.
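
The sketch below applies the same pattern to a plain character vector so the effect of the two assertions is easy to inspect. The vector camel_names is a hypothetical example.

```r
library(stringr)

# Hypothetical camel case and Pascal case names.
camel_names = c("firstName", "TotalSales", "userID")

# Insert "_" before every uppercase letter except at the start, then lowercase.
snake_names = str_to_lower(
  str_replace_all(camel_names, "(?<!^)(?=[A-Z])", "_"))

snake_names
# "first_name" "total_sales" "user_i_d"
```

The last result shows a limitation of this approach: every uppercase letter starts a new word, so abbreviations such as ID are split letter by letter.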

Camel Case

We wish to change the names of columns so that they are in camel case, i.e. “words” are not separated by any delimiters and the first letter of each word, except the first, is in upper case.

library(janitor)

df_2 = df %>% 
    clean_names(case = "lower_camel")

Here is how this works:

  • We use the clean_names() function from the janitor package.
  • The clean_names() function has an argument case that allows us to specify the naming style that we wish to have the column names conform to, which in this case is case="lower_camel".
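
A minimal sketch of this call on a made-up data frame follows. The data frame df_snake and its column names are assumptions introduced only for this example.

```r
library(janitor)

# Hypothetical data frame whose columns are already in snake case.
df_snake = data.frame(
  first_name = c("Ada", "Alan"),
  total_sales = c(10, 20))

# Convert the column names to lower camel case.
df_camel = df_snake %>% clean_names(case = "lower_camel")

names(df_camel)
```

With these inputs the resulting names should be firstName and totalSales.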

Extension: Pascal Case

library(janitor)

df_2 = df %>% 
    clean_names(case = "upper_camel")

Here is how this works:

This works similarly to the primary solution except that we set case="upper_camel" so that clean_names() changes column names to Pascal Case.

Extension: Capitalize Abbreviations

We wish to have abbreviations capitalized in the generated camel case column names.

library(janitor)

df_2 = df %>% 
    clean_names(case = "lower_camel", abbreviations = c("ID", "CVR"))

Here is how this works:

  • We use the clean_names() function with the case argument set to case="lower_camel".
  • To keep abbreviations in all caps in the generated column names, we pass them to the abbreviations argument of the clean_names() function.
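
One way to check the effect of the abbreviations argument is to run the conversion with and without it and compare the names. This is only a sketch; the data frame df_ids and its column names are hypothetical.

```r
library(janitor)

# Hypothetical data frame with abbreviations in its column names.
df_ids = data.frame(
  customer_id = c(1, 2),
  cvr_number = c("A1", "B2"))

# Same conversion, without and with the abbreviations argument.
names_plain = df_ids %>%
  clean_names(case = "lower_camel") %>%
  names()
names_abbr = df_ids %>%
  clean_names(case = "lower_camel", abbreviations = c("ID", "CVR")) %>%
  names()

# Print both sets of names side by side for comparison.
data.frame(names_plain, names_abbr)
```

Comparing the two columns of the printed table shows which names were affected by the abbreviations we listed.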

Alternative: via Regular Expression

library(dplyr)
library(stringr)

df_2 = df %>% 
  rename_with(
    ~str_replace_all(
      .x, 
      '[\\W_]+\\w', 
      ~str_to_upper(str_sub(.x, -1))))

Here is how this works:

  • We use rename_with() to rename columns by applying a function to existing column names and using the output as the new column names. See Implicit Renaming.
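  • We use str_replace_all() to replace each run of separators, together with the character that follows it, with that character in upper case. See String Replacing.
  • The regular expression we use to capture a run of separators and the character after it is '[\\W_]+\\w' and works as follows: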
    • \\W captures any character that is not a letter or digit or underscore, i.e. it captures punctuation marks, symbols, and whitespace.
    • _ captures underscores. We need to include it explicitly because \\W does not match the underscore character.
    • [] defines a character class, i.e. it matches any one of the characters listed within.
    • + matches one or more consecutive occurrences of the preceding element.
    • \\w matches a single word character (a letter, digit, or underscore), which here is the first character of the next word.
  • The replacement is ~str_to_upper(str_sub(.x, -1)), an anonymous function that extracts the last character of each match and converts it to upper case. See Extracting by Location.
  • Note that we use an anonymous function inside another anonymous function. Within each one, .x refers to that function's own argument: the column names in the outer function and the matched text in the inner one.
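
The sketch below applies the same pattern to a plain character vector. It uses an ordinary function(m) instead of the formula shorthand, on the assumption that the installed stringr version accepts a function as the replacement. The vector old_names is a hypothetical example.

```r
library(stringr)

# Hypothetical column names with different separators.
old_names = c("first_name", "total sales", "unit-price")

# Each match is a run of separators plus the next character;
# keep only that last character, in upper case.
camel_names = str_replace_all(
  old_names,
  "[\\W_]+\\w",
  function(m) str_to_upper(str_sub(m, -1)))

camel_names
# "firstName" "totalSales" "unitPrice"
```

Only the character after each separator is changed, so any uppercase letters already present elsewhere in a name are left as they are.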