We wish to change the column names of a data frame to have them conform to a standard naming convention. In particular, we cover converting column names to snake_case and to camelCase.
For general column name cleaning, see Implicit Renaming and String Operations.
From Arbitrary Separator
We wish to change the names of columns that use a non-underscore separator so that they are in snake case, i.e. “words” are separated by underscores ‘_’ and the entire name is in lowercase.
import janitor
df_2 = df.clean_names()
Here is how this works:
- We use the clean_names() function from the pyjanitor package (which needs to be installed).
- When we import janitor, the package’s functions are registered as part of pandas and can be used as data frame methods.
- The clean_names() function has an argument case_type that allows us to specify the letter case that we wish the column names to conform to. By default case_type="lower", which, combined with the replacement of separators such as spaces by underscores, gives snake case for names like these; therefore, we left case_type unspecified.
Alternative: Import Function
from janitor import clean_names
df_2 = df.pipe(clean_names)
Here is how this works:
- We import only the clean_names() function and call it as a regular function rather than as a data frame method. We pass the data frame to the function via the pipe() method.
- Alternatively, we could call the function directly as clean_names(df).
Alternative: via Regular Expression
df_2 = df.copy()
df_2.columns = df.columns\
.str.replace(r'[\W_]+', '_', regex=True)\
.str.lower()
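To see what this pattern does to individual names, here is a minimal sketch that applies the same regular expression with Python's built-in re module instead of pandas. The helper to_snake() and the sample names are hypothetical and only serve as an illustration.

```python
import re

# Hypothetical column names that use a variety of separators.
names = ["First Name", "last-name", "Date.Of.Birth", "zip__code", "Total (USD)"]

def to_snake(name: str) -> str:
    # Collapse every run of non-word characters and/or underscores
    # into a single underscore, then convert to lowercase.
    return re.sub(r"[\W_]+", "_", name).lower()

snake_names = [to_snake(n) for n in names]
print(snake_names)
```

Note that a separator at the very end of a name (such as the closing parenthesis in the last sample) is also replaced, which leaves a trailing underscore that we may wish to strip afterwards.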
Here is how this works:
- We use df.columns to obtain the current column names.
- We use str.replace() to replace any separators with an underscore. See String Replacing.
- The regular expression we use is '[\W_]+'. It works as follows:
  - \W captures any character that is not a letter, digit, or underscore, i.e. it captures punctuation marks, symbols, and whitespace.
  - _ captures underscores. We need to capture underscores explicitly because \W doesn't include underscore characters.
  - [] specifies an or relationship between the characters within.
  - + captures one or more of the characters specified previously.
- The replacement is '_', i.e. we replace what is captured by the pattern with an underscore.
- Finally, we use str.lower() to convert the resulting names to lowercase.
From Camel Case
We wish to change the names of columns that are in camel case to snake case, i.e. “words” are separated by underscores ‘_’ and the entire name is in lowercase.
import inflection
df_2 = df.copy()
df_2.columns = df.columns.map(inflection.underscore)
Here is how this works:
- We use the underscore() function from the inflection package (which needs to be installed) to convert column names from camel case to snake case.
- underscore() acts on individual strings, therefore we use map() to iterate on each column name returned by df.columns and apply underscore() to each.
- The input to underscore() must be in camel case for it to convert it to snake case. For non camel case input, see “From Arbitrary Separator” above.
Alternative: via Regular Expression
df_2 = df.copy()
df_2.columns = df.columns\
.str.replace(r'(?<!^)(?=[A-Z])', '_', regex=True)\
.str.lower()
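As a quick check of this pattern on individual names, the sketch below applies the same regular expression with Python's built-in re module instead of pandas. The helper camel_to_snake() and the sample names are hypothetical.

```python
import re

def camel_to_snake(name: str) -> str:
    # Insert "_" at every position that is not the start of the string
    # and that is immediately followed by an uppercase letter, then lowercase.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()

lower_camel = camel_to_snake("firstName")
upper_camel = camel_to_snake("FirstName")
acronym = camel_to_snake("HTTPResponse")
print(lower_camel, upper_camel, acronym)
```

Because the pattern matches before every uppercase letter, a run of capitals such as an acronym is split letter by letter ("HTTPResponse" becomes "h_t_t_p_response"), so this approach suits names without consecutive uppercase letters.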
Here is how this works:
- We use df.columns to obtain the current column names.
- We use str.replace() to insert an underscore before each uppercase letter, except one at the start of the name. See String Replacing.
- The regular expression (?<!^)(?=[A-Z]) is a combination of two lookaround assertions, neither of which consumes any characters:
  - (?<!^) is a negative look-behind. A negative look-behind (?<!...) asserts that what precedes the current position does not match the pattern inside it. Here the pattern is ^, which matches the start of the string, so the assertion holds only when the current position is not the start of the string.
  - (?=[A-Z]) is a positive look-ahead. A positive look-ahead (?=...) asserts that what follows the current position matches the pattern inside it. Here the pattern is [A-Z], so the assertion holds only when the next character is an uppercase letter.
- The replacement is '_', i.e. we insert an underscore in each position matched by the regular expression above.
- Finally, we use str.lower() to convert the resulting names to lowercase.
To Camel Case
We wish to change the names of columns so that they are in camel case, i.e. “words” are not separated by any delimiters and the first letter of each word, except the first, is in upper case.
df_2 = df.copy()
df_2.columns = df.columns\
.str.replace(r'[\W_]+(\w)',
lambda x: x.group(1).upper(),
regex=True)
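To illustrate the effect of this replacement on individual names, here is a minimal sketch that uses the same pattern and callable with Python's built-in re module instead of pandas. The helper to_camel() and the sample names are hypothetical.

```python
import re

def to_camel(name: str) -> str:
    # Replace each run of separators plus the word character that follows it
    # with that character in upper case.
    return re.sub(r"[\W_]+(\w)", lambda m: m.group(1).upper(), name)

from_snake = to_camel("first_name")
from_spaces = to_camel("date of birth")
leading_capital = to_camel("First Name")
print(from_snake, from_spaces, leading_capital)
```

The pattern only touches characters that follow a separator, so the case of the first letter is left as it is: "First Name" becomes "FirstName" rather than "firstName".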
Here is how this works:
- We use df.columns to obtain the current column names.
- We use str.replace() to replace each separator, together with the character that follows it, with that character in upper case. See String Replacing.
- The regular expression we use is '[\W_]+(\w)' and works as follows:
  - \W matches any character that is not a word character; this includes punctuation marks, symbols, and whitespace.
  - _ captures the underscore character. We need to capture underscores explicitly because \W doesn't include underscore characters.
  - [] specifies an or relationship between the characters within.
  - + captures one or more of the characters specified previously.
  - \w captures any word character, i.e. letter, digit, or underscore.
  - () defines a capture group, i.e. one or more characters that can be referred to as a group.
- str.replace() passes each match of the regular expression to the specified lambda function.
- The lambda function extracts x.group(1) from the match (which contains the first character after the separator) and converts that to upper case.
Extension: Pascal Case
df_2 = df.copy()
df_2.columns = df.columns\
.str.capitalize()\
.str.replace(r'[\W_]+(\w)',
lambda x: x.group(1).upper(),
regex=True)
Here is how this works:
This works similarly to the primary solution above except that we precede the call to str.replace() with a call to str.capitalize() to convert the first letter to upper case. Note that str.capitalize() also converts all remaining letters to lower case. See String Formatting.
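The following sketch shows the Pascal case conversion on individual names using Python's built-in str.capitalize() and re.sub() instead of pandas. The helper to_pascal() and the sample names are hypothetical, and the sketch assumes the '[\W_]+(\w)' pattern from the primary solution.

```python
import re

def to_pascal(name: str) -> str:
    # capitalize() upper-cases the first character and lower-cases the rest.
    capitalized = name.capitalize()
    # Then upper-case the word character that follows each run of separators.
    return re.sub(r"[\W_]+(\w)", lambda m: m.group(1).upper(), capitalized)

from_snake = to_pascal("first_name")
from_spaces = to_pascal("date of birth")
mixed_case = to_pascal("customer ID")
print(from_snake, from_spaces, mixed_case)
```

Since capitalize() lower-cases everything after the first character, existing capitals inside a name are lost: "customer ID" becomes "CustomerId", not "CustomerID".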
Alternative: Snake Case to Camel Case
import inflection
df_2 = df.copy()
df_2.columns = df.columns.map(
    lambda name: inflection.camelize(name, uppercase_first_letter=False))
Here is how this works:
- We use the camelize() function from the inflection package (which needs to be installed) to convert column names from snake case to camel case.
- camelize() acts on individual strings, therefore we use map() to iterate on each column name returned by df.columns and apply camelize() to each.
- The input to camelize() must be in snake case for it to be converted to camel case.
- We set the uppercase_first_letter argument of camelize() to False so that the first letter stays in lower case. With the default, uppercase_first_letter=True, the output is in Pascal case instead.