Summarize Non-Numeric Column

On this page, we cover how to get a summary of a non-numeric column, so we may gain some insight into the data it holds.

Non-numeric columns include string, Boolean and date-time columns. For the purpose of high-level inspection, they can more or less be summarized in the same way e.g. number of unique values.

Unique Values

We wish to get the unique values that a non-numeric column takes.

SELECT COUNT(DISTINCT col_10) AS result
FROM refcon.dataset.table_1;

Here is how this works:

  • We use COUNT DISTINCT to count the unique values for col_10.

Frequency

We wish to know how many times each of the possible values of a particular non-numeric column of interest exist in a table. In other words, for how many rows of the table does the non-numeric column of interest take each of its possible values?

SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;

Here is how this works:

  • We use the GROUP BY clause to group by col_10.
  • COUNT(1) counts the number of rows for each value of col_10.

Proportion

We wish to know the proportion (percentage or density) of the total number of rows (observations) that take each possible value of the non-numeric column (variable).

SELECT col_10
     , COUNT(1) / (SUM(COUNT(1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;

Here is how this works:

  • SUM(COUNT(1)) OVER () is a window function that counts the number of rows in the table and ignores the GROUP BY clause.
  • COUNT(1) counts the number of rows for each value of col_10.

Alternatively:

WITH table_rows AS
         (SELECT COUNT(1) AS row_count
          FROM refcon.dataset.table_1)
SELECT col_10,
       COUNT(1) / (SELECT row_count FROM table_rows) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;

Here is how this works:

  • First we get the total number of rows in the table in table_rows CTE.
  • We use the GROUP BY clause to group by col_10.
  • COUNT(1) counts the number of rows for each value of col_10.
  • We divide COUNT(1) by the row number using a subquery.

and should we wish to see counts and proportions side by side we could do this:

SELECT col_10
     , COUNT(1)                           AS row_count
     , COUNT(1) / (SUM(COUNT(1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;

Frequency of Frequency

We wish to know how many values occur a particular number of times e.g. how many values occur exactly 1 time.

WITH stats AS (SELECT col_10, COUNT(1) occurrence
               FROM refcon.dataset.table_1
               GROUP BY col_10)
SELECT occurrence, COUNT(col_10) value_count
FROM stats
GROUP BY occurrence;

Here is how this works:

  • We calculate the occurrence of each value in col_10 in stats CTE using GROUP BY clause.
  • We then GROUP BY occurrence and count the values of col_10.

Filter Summary

When dealing with real data sets, the column we may wish to summarize with is likely to have many possible values some of which are infrequent (rarely occurring) and often not of much value to the task at hand (or of all the values if we are looking for rarely occurring values) but could make the output noisy and harder to process. In such situations, it helps to filter the summary down to the parts that are useful to the task at hand which is what we cover in this section.

By Frequency

We wish to see only values of a non-numeric column that have a certain frequency or higher (or lower perhaps to view possible encoding issues).

In this example, we wish to get values of a variable that exists 300 times or more in the data set.

SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10
HAVING result >= 300;

Here is how this works:

  • The aggregation works as explained above.
  • HAVING filters after the aggregation is performed. i.e. it filters out the grouped data.

By Proportion

We wish to see only values of a non-numeric column that have a certain proportion or higher (or lower perhaps to view possible encoding issues).

In this example, we wish to get values of a non-numeric variable that comprises 10% or more of the data set.

WITH table_rows AS
         (SELECT COUNT(1) AS row_count
          FROM refcon.dataset.table_1)
SELECT col_10,
       COUNT(1) / (SELECT row_count FROM table_rows) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10
HAVING proportion > 0.1

Here is how this works:

This works similarly to the “Filtering By Frequency” scenario above except that we are filtering by proportion.

By Value

We wish to get the frequency of only a particular set of a column’s possible values (rather than all of its possible values).

In this example, we wish to get the number of occurrences of two possible values, ‘S’ and ‘M’, of the column col_10 .

SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
WHERE col_10 IN ('S', 'M')
GROUP BY col_10

Here is how this works:

  • We filter the values we wish to keep using the WHERE clause.
  • We then GROUP BY col_10 as explained above in the "Frequency" scenario.

Most Frequent

We wish to limit the output to just the most prevalent (or least prevalent) values of the variable being summarized.

SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10
ORDER BY result DESC
LIMIT 2;

Here is how this works:

  • We GROUP BY col_10 first to get the frequency.
  • We sort the results in descending order using ORDER BY DESC.
  • We get the most prevalent values by using LIMIT clause.
SQL
I/O