In this section, we cover getting and setting the properties or attributes of a string column or a column that we wish to subsequently treat as a string.
We will look at the classes of properties that we commonly need to deal with when working with string data: data type, length, character type, encoding, and locale.
We wish to set the data type of a literal or a vector to string (character).
In this example, we wish to convert the data type of a numeric column, col_1, to string (character).
df_2 = df %>%
  mutate(col_2 = as.character(col_1))
Here is how this works:
We use the function as.character() from base R to convert the data type of a literal or a vector to a string (character) data type. See Data Type Setting.
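As a quick standalone check of the conversion described above, the following sketch applies as.character() to a small vector and to a literal. The vector x is a hypothetical example, not data from this section.

```r
# Hypothetical numeric vector used only for illustration
x <- c(1, 2.5, 10)
is.numeric(x)        # TRUE

# Convert every element to a string
y <- as.character(x)
y                    # "1" "2.5" "10"
is.character(y)      # TRUE

# The same function works on a single literal
as.character(TRUE)   # "TRUE"
```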
Alternative: Auto Cast
In this example, we wish to concatenate the string ' mins ago.' to each numerical value in the numeric column col_1.
df_2 = df %>%
  mutate(col_2 = str_c(col_1, ' mins ago.'))
Here is how this works:
We use str_c() to concatenate the string ' mins ago.' to each numerical value in the numeric column col_1. See Combining.
Because str_c() converts its inputs to strings, this works whether we append ' mins ago.' or just an empty string ''; either way, the data type of the numeric column will be converted to string just like in the primary solution above.
We wish to obtain the number of characters in a string literal or of each element in a vector of strings.
df_2 = df %>%
  mutate(col_2 = str_length(col_1))
Here is how this works:
We use the function str_length() to compute the number of characters in each element in the column col_1.
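To make the behavior concrete, here is a minimal sketch of str_length() on a literal and on a vector. The vector words is a hypothetical example; it includes an empty string and a missing value to show how those are counted.

```r
library(stringr)

# Length of a single string literal (spaces count as characters)
str_length("hello world")   # 11

# Hypothetical vector with an empty string and a missing value
words <- c("apple", "kiwi", "", NA)
lengths <- str_length(words)
lengths                     # 5 4 0 NA
```

An empty string has length 0, while a missing value stays missing rather than being counted as 0.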
We wish to check the type of characters held in a string.
In this example, we wish to check whether an element in the string column col_1 holds: a sequence of digits, a decimal, a sequence of alphanumeric characters, empty spaces.
df_2 = df %>%
  mutate(
    is_int = str_detect(col_1, '^\\d*$'),
    is_dec = str_detect(col_1, '^\\d*\\.?\\d*$'),
    is_aln = str_detect(col_1, '^[[:alnum:]]*$'),
    is_spc = str_detect(col_1, '^\\s*$'))
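As a quick check of what these four patterns accept, the sketch below applies them to a small vector outside of a data frame. The sample values in x are hypothetical and were chosen to exercise each pattern, including the empty string.

```r
library(stringr)

# Hypothetical sample values, one per case of interest
x <- c("123", "12.5", "abc123", "   ", "", "a b")

checks <- data.frame(
  value  = x,
  is_int = str_detect(x, "^\\d*$"),          # digits only
  is_dec = str_detect(x, "^\\d*\\.?\\d*$"),  # digits with an optional dot
  is_aln = str_detect(x, "^[[:alnum:]]*$"),  # letters and digits only
  is_spc = str_detect(x, "^\\s*$")           # white space only
)
print(checks)
```

Note that the empty string satisfies all four patterns, because each one allows zero occurrences. If empty strings should not count as a match, one option is to replace * with +, which requires at least one occurrence.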
Here is how this works:
We use str_detect() along with an appropriate regular expression to check on the type of characters that make up a string. See Detecting.
The regular expressions above are built from the following pieces:
\\d detects any digit character
[[:alnum:]] detects any alphanumeric character, i.e. a letter or a digit
\\s detects a white space
* detects 0 or more occurrences
? detects 0 or 1 occurrences
^ detects the start of a string
$ detects the end of a string
Get Encoding
We wish to get the encoding used in a string.
uchardet::detect_str_enc(df$col_1)
Here is how this works:
We use the function detect_str_enc() from the uchardet package to identify the encoding of each element in a string column or vector.
List Encodings
We wish to get a list of supported encodings.
stringi::stri_enc_list()
Here is how this works:
The function stri_enc_list() from the package stringi returns a list of supported encodings.
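One practical use of this list is to confirm that an encoding name is supported before relying on it. The sketch below assumes a simple exact-match lookup; is_supported_encoding() is a hypothetical helper written for this illustration and is not part of stringi.

```r
library(stringi)

# Flatten the result so the lookup works whether stri_enc_list()
# returns groups of aliases or a plain character vector
encodings <- unlist(stri_enc_list())

# Hypothetical helper (not part of stringi): exact, case-sensitive lookup
is_supported_encoding <- function(name) name %in% encodings

is_supported_encoding("UTF-8")                 # TRUE
is_supported_encoding("not-a-real-encoding")   # FALSE
```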
Set Encoding
We wish to set the encoding of a given string to a particular encoding.
str_conv(df$col_1, 'Latin-9')
Here is how this works:
We use str_conv() from the package stringr to set (override) the current encoding of a string to a different encoding.
The encoding name we pass should be one of those returned by stri_enc_list(). See “List Encodings” above.
Get Global Locale
We wish to obtain some basic information about the current locale.
stringi::stri_locale_info()
Here is how this works:
The function stri_locale_info() from the package stringi returns a description of the current locale that includes the language, country, and locale name.
List Locales
We wish to get a list of supported locales.
stringi::stri_locale_list()
Here is how this works:
The function stri_locale_list() from the package stringi returns a list of supported locales.
Set Operation Locale
We wish to define a locale to use in a particular operation.
In this example, we wish to sort the rows of the data frame df by the values of the string column col_1 while taking into account that the characters are in Lithuanian, where 'y' comes between 'i' and 'k'.
df_2 = df %>%
  arrange(str_rank(col_1, locale = "lt"))
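To see the effect of the locale outside of a data frame, the sketch below sorts the same three letters under two locales. The vector letters_3 is a hypothetical example. Only the English result is annotated; under "lt" the expectation, based on the Lithuanian ordering described in this example, is that 'y' lands between 'i' and 'k'.

```r
library(stringr)

# Hypothetical vector holding the three letters discussed in this example
letters_3 <- c("k", "y", "i")

# English ordering
sorted_en <- str_sort(letters_3, locale = "en")
sorted_en   # "i" "k" "y"

# Lithuanian ordering of the same values
sorted_lt <- str_sort(letters_3, locale = "lt")
sorted_lt

# Integer ranks under the Lithuanian locale, as used inside arrange()
str_rank(letters_3, locale = "lt")
```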
Here is how this works:
A convention of the stringr library is that if an operation is locale-sensitive, i.e. would return different outputs for the same input depending on the locale (e.g. sorting and case setting), it accepts a locale argument to which we can pass the desired locale.
If no locale is passed, the English locale "en" is assumed.
We use str_rank() from the stringr package to obtain the ranks as integers of the values of the column col_1 in the locale "lt" (Lithuanian). See String Sorting.
Set Global Locale
We wish to change the global locale for the current environment.
In this example, we wish to set the global locale to UAE Arabic.
stringi::stri_locale_set('ar_AE')
Here is how this works:
We use stri_locale_set() from the package stringi to set (override) the current locale.
The locale identifier we pass should be one of those returned by stri_locale_list(). See “List Locales” above.
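Since changing the global locale affects later calls in the same session, it can be useful to remember the previous value and restore it afterwards. The following is a minimal sketch of that pattern using stri_locale_get() from stringi; the variable name previous is an assumption made for illustration.

```r
library(stringi)

# Remember the current default locale so it can be restored later
previous <- stri_locale_get()

# Switch the default locale to UAE Arabic
stri_locale_set("ar_AE")
stri_locale_get()   # reports the locale now in effect

# ... locale-sensitive work would go here ...

# Restore the original default locale
stri_locale_set(previous)
```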