Properties

In this section, we cover getting and setting the properties or attributes of a string column or a column that we wish to subsequently treat as a string.

We will look at five classes of properties that we commonly need to deal with when working with string data:

  • Data Type where we will cover getting and setting the data type of a column that we wish to treat as a string.
  • Length where we cover how to obtain the number of characters in each string in a string column.
  • Content Type where we cover how to check what kind of content a string holds, e.g. whether the string holds an integer value.
  • Encoding where we cover how to get or set the encoding of a string or string column.
  • Locale where we cover how to get the global locale or set the locale used for a particular locale sensitive operation.

Data Type

We wish to set the data type of a literal or a vector to string (character).

In this example, we wish to convert the data type of a numeric column, col_1, to string (character).

df_2 = df %>%
  mutate(col_2 = as.character(col_1))

Here is how this works:

We use the function as.character() from base R to convert the data type of a literal or a vector to a string (character) data type. See Data Type Setting.
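As a minimal sketch of the conversion, the snippet below builds a small hypothetical data frame (the values are made up; only the column name col_1 comes from the example above) and checks the data type before and after.

```r
library(dplyr)

# Hypothetical data for illustration only.
df = tibble(col_1 = c(1, 2.5, 30))

df_2 = df %>%
  mutate(col_2 = as.character(col_1))

class(df_2$col_1)  # "numeric"
class(df_2$col_2)  # "character"
df_2$col_2         # "1" "2.5" "30"
```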

Alternative: Auto Cast

In this example, we wish to concatenate the string ‘ mins ago.’ to each numerical value in the numeric column col_1.

df_2 = df %>%
    mutate(col_2 = str_c(col_1, ' mins ago.'))

Here is how this works:

  • Typically, we rely on R to automatically convert (type-cast) the data type to string (character) when needed by the operation being performed.
  • We use the function str_c() to concatenate the string ‘ mins ago.’ to each numerical value in the numeric column col_1. See Combining.
  • The numeric values of each element in the column col_1 are automatically converted to string before concatenating the string ‘ mins ago.’.
  • If we concatenate an empty string '', the result is simply the values of the numeric column converted to string, just like in the primary solution above.
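The points above can be sketched on a small hypothetical data frame (the values are made up). The column col_3 is an illustrative name, not part of the original example; it shows the empty-string variant.

```r
library(dplyr)
library(stringr)

# Hypothetical data for illustration only.
df = tibble(col_1 = c(5, 12, 45))

df_2 = df %>%
  mutate(
    col_2 = str_c(col_1, ' mins ago.'),  # numbers are cast to string first
    col_3 = str_c(col_1, ''))            # empty string: cast only

df_2$col_2         # "5 mins ago." "12 mins ago." "45 mins ago."
class(df_2$col_3)  # "character"
```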

Length

We wish to obtain the number of characters in a string literal or of each element in a vector of strings.

df_2 = df %>%
  mutate(col_2 = str_length(col_1))

Here is how this works:

We use the function str_length() from the stringr package to compute the number of characters in each element in the column col_1.
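A short sketch on a hypothetical vector shows how str_length() treats a few edge cases: an empty string has length 0, a missing value stays missing, and a non-ASCII character counts as one character.

```r
library(stringr)

# Hypothetical values; 'caf\u00e9' is "café" written with a Unicode escape.
x = c('a', 'hello', '', NA, 'caf\u00e9')

lengths_x = str_length(x)
lengths_x  # 1 5 0 NA 4
```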

Content Type

We wish to check the type of characters held in a string.

In this example, we wish to check whether an element in the string column col_1 holds: a sequence of digits, a decimal number, a sequence of alphanumeric characters, or only whitespace.

df_2 = df %>% 
  mutate(
    is_int = str_detect(col_1, '^\\d*$'),
    is_dec = str_detect(col_1, '^\\d*\\.?\\d*$'),
    is_aln = str_detect(col_1, '^[[:alnum:]]*$'),
    is_spc = str_detect(col_1, '^\\s*$'))

Here is how this works:

  • We use the function str_detect() along with an appropriate regular expression to check on the type of characters that make up a string. See Detecting.
  • The key elements of the regular expressions we use here are:
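    • \\d matches any digit character.
    • \\. matches a literal period.
    • [[:alnum:]] matches any alphanumeric character, i.e. a letter or a digit.
    • \\s matches any whitespace character.
    • * matches 0 or more occurrences of the preceding element.
    • ? matches 0 or 1 occurrences of the preceding element.
    • ^ anchors the match at the start of the string.
    • $ anchors the match at the end of the string.
  • Because * allows zero occurrences, an empty string satisfies all four patterns, and the decimal pattern also accepts a lone '.'. Replacing * with + (1 or more occurrences) requires at least one matching character.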
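To see how these patterns behave, the sketch below applies them to a few hypothetical strings. The last two columns, is_int_strict and is_dec_strict, are illustrative additions that use + instead of * so that at least one digit is required; they are an assumption about what a stricter check might look like, not part of the original solution.

```r
library(dplyr)
library(stringr)

# Hypothetical values chosen to exercise each pattern.
df = tibble(col_1 = c('123', '12.5', 'abc123', '   ', 'a b', '.'))

df_2 = df %>%
  mutate(
    is_int = str_detect(col_1, '^\\d*$'),
    is_dec = str_detect(col_1, '^\\d*\\.?\\d*$'),
    is_aln = str_detect(col_1, '^[[:alnum:]]*$'),
    is_spc = str_detect(col_1, '^\\s*$'),
    # Stricter variants (illustrative): require at least one digit.
    is_int_strict = str_detect(col_1, '^\\d+$'),
    is_dec_strict = str_detect(col_1, '^\\d+(\\.\\d+)?$'))

df_2$is_dec         # TRUE TRUE FALSE FALSE FALSE TRUE  (a lone '.' passes)
df_2$is_dec_strict  # TRUE TRUE FALSE FALSE FALSE FALSE
```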

Encoding

Get Encoding

We wish to get the encoding used in a string.

uchardet::detect_str_enc(df$col_1)

Here is how this works:

We use the function detect_str_enc() from the uchardet package to identify the encoding of each element in a string column or vector.
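detect_str_enc() guesses the encoding from the bytes of each string. As a complementary check that does not rely on uchardet, the sketch below uses a different technique: stringi functions that report the encoding R has declared for each string. It runs on a hypothetical vector and does not attempt any detection.

```r
library(stringi)

# Hypothetical values; the second one contains a non-ASCII character.
x = c('plain ascii', 'caf\u00e9')

marks = stri_enc_mark(x)        # declared encoding of each element
marks                           # "ASCII" "UTF-8"

is_ascii = stri_enc_isascii(x)  # TRUE only for pure ASCII strings
is_ascii                        # TRUE FALSE
```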

List Encodings

We wish to get a list of supported encodings.

stringi::stri_enc_list()

Here is how this works:

The function stri_enc_list() from the package stringi returns a list of supported encodings.

Set Encoding

We wish to set the encoding of a given string to a particular encoding.

str_conv(df$col_1, 'Latin-9')

Here is how this works:

  • We use the function str_conv() from the package stringr to declare (override) the encoding in which the bytes of a string should be interpreted; the result is re-encoded as UTF-8.
  • We can obtain the name to use to refer to a particular encoding by looking at the list returned by stri_enc_list(). See “List Encodings” above.
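As a small sketch of what this does, the snippet below builds a one-byte string by hand and tells str_conv() that the byte is in Latin-9. It uses 'ISO-8859-15', the standard name for Latin-9; the byte value is a made-up input chosen because 0xA4 is the euro sign in that encoding.

```r
library(stringr)
library(stringi)

# A single raw byte, 0xA4, which is the euro sign in ISO-8859-15 (Latin-9).
x = rawToChar(as.raw(0xA4))

# Interpret the byte as Latin-9; the result is a UTF-8 string.
y = str_conv(x, 'ISO-8859-15')
y == '\u20ac'  # TRUE
Encoding(y)    # "UTF-8"

# Check that a name appears among the supported encodings before using it.
'ISO-8859-15' %in% unlist(stri_enc_list())
```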

Locale

Get Global Locale

We wish to obtain some basic information about the current locale.

stringi::stri_locale_info()

Here is how this works:

The function stri_locale_info() from the package stringi returns a description of the current locale as a list with the elements Language, Country, Variant, and Name.
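A brief sketch of inspecting the result follows. The second call passes a locale name explicitly, which describes that locale instead of the current one; 'ar_AE' is used here only as an illustrative input.

```r
library(stringi)

# Describe the current default locale.
info = stri_locale_info()
names(info)  # "Language" "Country" "Variant" "Name"

# Describe a specific locale instead of the current one.
info_ae = stri_locale_info('ar_AE')
info_ae$Language  # "ar"
info_ae$Country   # "AE"
```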

List Locales

We wish to get a list of supported locales.

stringi::stri_locale_list()

Here is how this works:

The function stri_locale_list() from the package stringi returns a list of supported locales.

Set Operation Locale

We wish to define a locale to use in a particular operation.

In this example, we wish to sort the rows of the data frame df by the values of the string column col_1 while taking into account that the characters are in Lithuanian, where 'y' comes between 'i' and 'k'.

df_2 = df %>% 
  arrange(str_rank(col_1, locale = "lt"))

Here is how this works:

  • In stringr, locale-sensitive operations, i.e. operations such as sorting and case conversion that can return different outputs for the same input depending on the locale, accept a locale argument to which we can pass the desired locale.
  • If we do not pass a locale argument, stringr uses "en" (English) rather than the global locale, so that results are consistent across platforms; the underlying stringi functions default to the current global locale instead.
  • In this example, we use the function str_rank() from the stringr package to obtain the ranks as integers of the values of the column col_1 in the locale "lt" (Lithuanian). See String Sorting.
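The effect of the locale can be sketched on a hypothetical three-row data frame. The snippet assumes a version of stringr that provides str_rank() (1.5.0 or later).

```r
library(dplyr)
library(stringr)

# Hypothetical data for illustration only.
df = tibble(col_1 = c('k', 'y', 'i'))

# Same input, different order depending on the locale.
sorted_en = str_sort(df$col_1, locale = 'en')  # "i" "k" "y"
sorted_lt = str_sort(df$col_1, locale = 'lt')  # "i" "y" "k"

# Sort the rows of the data frame using Lithuanian collation.
df_2 = df %>%
  arrange(str_rank(col_1, locale = 'lt'))

df_2$col_1  # "i" "y" "k"
```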

Set Global Locale

We wish to change the global locale for the current environment.

In this example, we wish to set the global locale to UAE Arabic.

stringi::stri_locale_set('ar_AE')

Here is how this works:

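  • We use the function stri_locale_set() from the package stringi to set (override) the default locale that stringi functions use when no locale argument is passed. stringr functions whose locale argument defaults to "en" are not affected unless we pass the locale explicitly.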
  • We can obtain the name to use to refer to a particular locale by looking at the list returned by stri_locale_list(). See “List Locales” above.
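A minimal sketch of changing the default and restoring it afterwards is shown below. It uses Lithuanian ('lt') rather than 'ar_AE' so that the effect is visible in a simple sort; the input vector is made up.

```r
library(stringi)

# Check that a locale name is supported before using it.
'ar_AE' %in% stri_locale_list()

# stri_locale_set() returns the previous default locale, invisibly.
old = stri_locale_set('lt')
current = stri_locale_get()            # "lt"

# stringi functions now use the new default when no locale is passed.
sorted_lt = stri_sort(c('k', 'y', 'i'))  # "i" "y" "k"

# Restore the previous default locale.
stri_locale_set(old)
```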