Standardize Names

We wish to change the column names of a data frame to have them conform to a standard naming convention. In particular, we cover converting column names to snake_case and to camelCase.

For general column name cleaning, see Implicit Renaming and String Operations.

Snake Case

From Arbitrary Separator

We wish to change the names of columns that use a non-underscore separator so that they are in snake case i.e. “words” are separated by underscores ‘_’, and the entire name is in lowercase.

import janitor

df_2 = df.clean_names() 

Here is how this works:

  • We use the clean_names() function from the pyjanitor package (which needs to be installed).
  • Once we run import janitor, the package’s functions are registered as part of pandas and can be used as data frame methods.
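  • The clean_names() function has an argument case_type that controls the letter case of the resulting names. In pyjanitor the default is case_type="lower", which lowercases the names; clean_names() also replaces spaces and several common separator characters with underscores, so we left case_type unspecified. Setting case_type="snake" additionally converts camel case names to snake case.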

Alternative: Import Function

from janitor import clean_names

df_2 = df.pipe(clean_names)

Here is how this works:

  • If we prefer to import only the function by name rather than rely on the data frame methods that pyjanitor registers, we can pass the data frame to the function via the pipe() method, which keeps the call inside a method chain.
  • Alternatively, we can use clean_names(df).

Alternative: via Regular Expression

df_2 = df.copy()
df_2.columns = df.columns\
    .str.replace(r'[\W_]+', '_', regex=True)\
    .str.lower()

Here is how this works:

  • We use df.columns to obtain the current column names.
  • We use str.replace() to replace any separators with an underscore. See String Replacing.
  • For most cases, the simple expression '[\W_]+' is enough to match separators. It works as follows:
    • \W matches any character that is not a letter, digit, or underscore, i.e. it matches punctuation marks, symbols, and whitespace.
    • _ matches underscores. We need to list the underscore explicitly because \W does not include it.
    • [] defines a character class, i.e. it matches any one of the characters listed within.
    • + matches one or more consecutive occurrences of the preceding character class, so a run of separators is replaced by a single underscore.
  • The replacement is '_' i.e. we replace what is captured by the pattern with an underscore.
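
To make the behavior of the pattern concrete, the following minimal sketch applies the same expression to plain strings with Python's built-in re module rather than to df.columns. The column names and the helper name to_snake are hypothetical and chosen only for illustration.

```python
import re

# Hypothetical column names, used only to illustrate the pattern.
names = ["First Name", "last-name", "Date.Of.Birth", "zip  code", "Total (USD)"]


def to_snake(name: str) -> str:
    # Replace each run of non-word characters and/or underscores
    # with a single underscore, then lowercase the result.
    return re.sub(r"[\W_]+", "_", name).lower()


snake_names = [to_snake(n) for n in names]
print(snake_names)
# ['first_name', 'last_name', 'date_of_birth', 'zip_code', 'total_usd_']
```

Note that a name ending in a separator, such as 'Total (USD)', keeps a trailing underscore. If that is not wanted, one option is to add a call to str.strip('_') at the end of the chain.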

From Camel Case

We wish to change the names of columns that are in camel case to snake case i.e. “words” are separated by underscores ‘_’, and the entire name is in lowercase.

import inflection

df_2 = df.copy()
df_2.columns = df.columns.map(inflection.underscore)

Here is how this works:

  • We use the underscore() function from the inflection package (which needs to be installed) to convert column names from camel case to snake case.
  • The function underscore() acts on individual strings, so we use map() to apply it to each column name returned by df.columns.
  • Note that the input to underscore() must be in camel case for it to convert it to snake case. For non camel case input, see “From Arbitrary Separator” above.

Alternative: via Regular Expression

df_2 = df.copy()
df_2.columns = df.columns\
  .str.replace(r'(?<!^)(?=[A-Z])', '_', regex=True)\
  .str.lower()

Here is how this works:

  • We use df.columns to obtain the current column names.
  • We use str.replace() to insert an underscore before every uppercase letter except one at the start of the name. See String Replacing.
  • The regular expression we use here identifies a position in a string that is not the start of the string and where the next character is uppercase. It works as follows:
    • The regular expression (?<!^)(?=[A-Z]) combines two lookaround assertions: a negative look-behind and a positive look-ahead. Both are zero-width, i.e. they match a position rather than any characters.
    • The first assertion (?<!^) is a negative look-behind. A negative look-behind (?<!...) asserts that what precedes the current position does not match the pattern inside it. Here the pattern is ^, which matches the start of the string, so the assertion holds at every position except the start.
    • The second assertion (?=[A-Z]) is a positive look-ahead. A positive look-ahead (?=...) asserts that what follows the current position matches the pattern inside it. Here the pattern is [A-Z], so the assertion holds when the next character is an uppercase letter.
  • The replacement is '_' i.e. we insert an underscore in each position matched by the regular expression above.
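
As a quick check of the expression, the sketch below applies it to plain strings with Python's built-in re module. The example names and the helper name camel_to_snake are hypothetical. The last example shows a limitation worth knowing about: because every uppercase letter triggers a match, acronyms are split letter by letter.

```python
import re


def camel_to_snake(name: str) -> str:
    # Insert "_" at every position that is not the start of the string
    # and is immediately followed by an uppercase ASCII letter,
    # then lowercase the whole name.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


# Hypothetical column names, used only for illustration.
examples = ["firstName", "FirstName", "dateOfBirth", "HTTPResponse"]
converted = [camel_to_snake(n) for n in examples]
print(converted)
# ['first_name', 'first_name', 'date_of_birth', 'h_t_t_p_response']
```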

Camel Case

We wish to change the names of columns so that they are in camel case i.e. “words” are not separated by any delimiters and the first letter of each word, except the first, is in upper case.

df_2 = df.copy()
df_2.columns = df.columns\
    .str.replace(r'[\W_]+(\w)',
                 lambda x: x.group(1).upper(), 
                 regex=True)

Here is how this works:

  • We use df.columns to obtain the current column names.
  • We use str.replace() to remove each separator and convert the character that follows it to upper case. See String Replacing.
  • The regular expression we use to match a separator and the character after it is '[\W_]+(\w)' and works as follows:
    • \W matches any character that is not a word character; this includes punctuation marks, symbols, and whitespace.
    • _ matches the underscore character. We need to list the underscore explicitly because \W does not include it.
    • [] defines a character class, i.e. it matches any one of the characters listed within.
    • + matches one or more consecutive occurrences of the preceding character class.
    • \w matches any word character, i.e. a letter, digit, or underscore.
    • () defines a capture group, i.e. a part of the match that can be referred to separately.
  • str.replace() passes each match of the regular expression to the specified lambda function as a match object, and replaces the match with the value the function returns.
  • The lambda function extracts the first capture group via x.group(1) from the match (which contains the first letter after the separator) and converts that to upper case.
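
The same replacement can be tried on plain strings with Python's built-in re module, as in the minimal sketch below. The example names and the helper name to_camel are hypothetical. Note that the pattern never touches the first character of the name, so a name that starts with an uppercase letter keeps it.

```python
import re


def to_camel(name: str) -> str:
    # Drop each run of separators and uppercase the word character
    # that immediately follows it.
    return re.sub(r"[\W_]+(\w)", lambda m: m.group(1).upper(), name)


# Hypothetical column names, used only for illustration.
examples = ["first_name", "date of birth", "zip-code", "First Name"]
camel_names = [to_camel(n) for n in examples]
print(camel_names)
# ['firstName', 'dateOfBirth', 'zipCode', 'FirstName']
```

If names may start with an uppercase letter, one option is to lowercase the first character separately before or after the replacement.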

Extension: Pascal Case

df_2 = df.copy()
df_2.columns = df.columns\
    .str.capitalize()\
    .str.replace(r'[\W_]+(\w)',
                 lambda x: x.group(1).upper(),
                 regex=True)

Here is how this works:

This works similarly to the primary solution above except that we precede the call to str.replace() with a call to str.capitalize() to convert the first letter to upper case. See String Formatting.
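
One detail to be aware of is that capitalize() also converts every character after the first to lower case, so any existing upper case letters inside a name are lost. The sketch below, which uses Python's built-in re module on plain strings, compares the capitalize-based approach with a variant that upper-cases only the first character. The helper names and example names are hypothetical.

```python
import re

SEPARATOR_THEN_CHAR = r"[\W_]+(\w)"


def pascal_via_capitalize(name: str) -> str:
    # Mirrors the approach above: capitalize() first, then drop separators
    # and uppercase the character that follows each one.
    return re.sub(SEPARATOR_THEN_CHAR, lambda m: m.group(1).upper(), name.capitalize())


def pascal_keep_case(name: str) -> str:
    # Variant: drop separators first, then uppercase only the first
    # character, leaving the case of all other characters unchanged.
    joined = re.sub(SEPARATOR_THEN_CHAR, lambda m: m.group(1).upper(), name)
    return joined[:1].upper() + joined[1:]


print(pascal_via_capitalize("first_name"), pascal_keep_case("first_name"))
# FirstName FirstName
print(pascal_via_capitalize("user_ID"), pascal_keep_case("user_ID"))
# UserId UserID
```

Which behavior is preferable depends on whether acronyms and other existing capitals in the column names should be preserved.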

Alternative: Snake Case to Camel Case

import inflection

df_2 = df.copy()
df_2.columns = df.columns.map(
    lambda name: inflection.camelize(name, uppercase_first_letter=False))

Here is how this works:

  • We use the camelize() function from the inflection package (which needs to be installed) to convert column names from snake case to camel case.
  • The function camelize() acts on individual strings, so we use map() to apply it to each column name returned by df.columns.
  • By default, uppercase_first_letter=True, which returns Pascal Case (e.g. 'first_name' becomes 'FirstName'). We therefore wrap the call in a lambda function and pass uppercase_first_letter=False to get camel case ('firstName').
  • Note that the input to camelize() must be in snake case for it to be converted to camel case.
  • To convert to Pascal Case instead, we can keep the default and pass the function directly: df.columns.map(inflection.camelize).