Combining

In this section, we cover the five common string combining scenarios that we come across in data manipulation contexts:

  1. Concatenating: Append two or more strings to each other.
  2. Collapsing: Concatenate all elements of a string vector into one string. A common application is to aggregate the values of a string column into a single string.
  3. Interpolating: Insert the value of one or more string variables into a string template.
  4. Repeating: Concatenate a string with itself a given number of times.
  5. Padding a string with a given character up to a given width.

Concatenating

We wish to concatenate the values of one column to the corresponding values of another column.

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 with a comma as a separator into a new column col_3.

SELECT col_1,
       col_2,
       CONCAT(col_1, ', ', col_2) AS col_3
FROM table_1;

Here is how this works:

  • We use the CONCAT() function to concatenate the values in the column col_1 with the corresponding values in the column col_2 with a separator ', ‘ in between.
  • The CONCAT() function takes any number of values to concatenate.

Alternative: Using concatenation operator

SELECT col_1,
       col_2,
       col_1 || ', ' || col_2 AS col_3
FROM table_1;

Here is how this works:

  • We use the concatenation operator || to concatenate the values in the column col_1 with the corresponding values in the column col_2 with a separator ', ‘ in between.
  • The concatenation operator combines multiple values into one.

Extension: Handling Missing Values

We wish to concatenate the values of two columns and substitute missing values with a given token ( string).

In this example, we wish to concatenate the values of the column col_1 to the corresponding values of the column col_2 with a comma as a separator into a new column col_3. We wish to substitute missing values with ‘-’.

SELECT col_1,
       col_2,
       CONCAT(IFNULL(col_1, '-'), ', ', IFNULL(col_2, '-')) AS col_3
FROM table_1;

Here is how this works:

  • If we use CONCAT() with columns that have missing values, any concatenation operation where any of the values being concatenated is a missing value would have returned a missing value as the result of the operation.
  • One way to deal with this it to replace missing values with some particular string (which can be an empty string). In this example, we wish to use ‘-’ as a replacement for missing values.
  • We use the function IFNULL() to replace missing values in the columns to be concatenated, which here are col_1 and col_2, with a particular replacement string, which here is '-'.
  • See Missing Values for a general coverage of working with missing values.

Collapsing

We wish to concatenate all values of a string column into one string with an optional separator.

In this example, we wish to concatenate the values of the column col_2 into a single string for each group where the groups are defined by the values of the column col_1.

SELECT col_1,
       STRING_AGG(col_2, ',') col_3
FROM table_1
GROUP BY col_1;

Here is how this works:

  • The GROUP BY clause groups the rows of a table by the values in one or more columns. The rows are then assigned to a group based on the values in the grouping column(s). In this example, the rows are being grouped by the values in the col_1 column.
  • STRING_AGG() function is used to concatenate a list of strings, separated by a specified delimiter (',' in this example).
  • For dealing with missing values, see “Extension: Handling Missing Values” under Concatenating above.

Interpolating

We wish to create a new string column by inserting the value of one or more string variables into a string template.

In this example, we wish to create a new column col_3 from the values of the columns col_1 and col_2 where each value follows the template '{col_1} vs {col_2}' for the corresponding rows.

SELECT col_1,
       col_2,
       FORMAT('%s vs %s', col_1, col_2) AS col_3
FROM table_1;

Here is how this works:

  • We use the FORMAT() function to create a new column col_3 by inserting values of the columns col_1 and col_2 into the template '%s vs %s'.
  • The FORMAT() is quite convenient to work with. We use it as follows:
    • We pass to it the static string template that we wish to use which in this case is '%s vs %s'.
    • We pass the columns we wish to replace in the order of replacement to get the desired output.

Repeating

We wish to create a new string column by repeating the values in an existing string column a set number of times.

In this example, we wish to create a new column col_3 where each value is the repetition ( concatenation to itself) of the corresponding value in the col_2 3 times.

SELECT col_1,
       col_2,
       REPEAT(col_2, 3) AS col_3
FROM table_1;

Here is how this works:

  • We use the REPEAT() function to create a new column col_3 by repeating (concatenating to itself) each value in the column col_2 a specified number of times.
  • The REPEAT() function takes an integer argument, which in this case is 3, specifying the number of times each value of the calling string column will be repeated.
  • The output is the original table table_1 with an additional columncol_3 holding the values from the column col_2 each repeated 3 times.

Padding

We wish to pad a string with a specified character for up to a given width.

In this example, we wish to create a new column col_3 that is a padding of the values of the column col_2 to get each to 4 characters.

SELECT col_1,
       col_2,
       LPAD(col_2, 4, '*') col_3
FROM table_1;

Here is how this works:

  • We use the LPAD() to pad the values of the column col_2 by adding the character * to the left so each string value is 4 characters long.
  • The LPAD method takes three arguments:
    • original_value which is the input we wish to pad in this case col_2.
    • return_length: which is the desired length of the output padded string and which in this case is 4. If return_length is less than or equal to the original_value length, the input will be truncated to match return_length. For example, LPAD('hello world', 7) returns 'hello w'.
    • pattern which specified the pattern to use for padding the input string to match the given return_length and which in this case is '*'.
  • RPAD() achieves the same but by padding on the right hand side.
  • The output is the original table table_1 with an additional columncol_3 holding the values from the column col_2 with padding applied to the left side of each value.
SQL
I/O