On this page, we cover how to get a summary of a non-numeric column, so we may gain some insight into the data it holds.
Non-numeric columns include string, Boolean and date-time columns. For the purpose of high-level inspection, they can more or less be summarized in the same way e.g. number of unique values.
We wish to get the unique values that a non-numeric column takes.
SELECT COUNT(DISTINCT col_10) AS result
FROM refcon.dataset.table_1;
Here is how this works:
COUNT DISTINCT
to count the unique values for col_10.We wish to know how many times each of the possible values of a particular non-numeric column of interest exist in a table. In other words, for how many rows of the table does the non-numeric column of interest take each of its possible values?
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;
Here is how this works:
GROUP BY
clause to group by col_10
.COUNT(1)
counts the number of rows for each value of col_10
.We wish to know the proportion (percentage or density) of the total number of rows (observations) that take each possible value of the non-numeric column (variable).
SELECT col_10
, COUNT(1) / (SUM(COUNT(1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
Here is how this works:
SUM(COUNT(1)) OVER ()
is a window function that counts the number of rows in the table and ignores the GROUP BY
clause.COUNT(1)
counts the number of rows for each value of col_10
.Alternatively:
WITH table_rows AS
(SELECT COUNT(1) AS row_count
FROM refcon.dataset.table_1)
SELECT col_10,
COUNT(1) / (SELECT row_count FROM table_rows) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
Here is how this works:
table_rows
CTE.GROUP BY
clause to group by col_10
.COUNT(1)
counts the number of rows for each value of col_10
.COUNT(1)
by the row number using a subquery.and should we wish to see counts and proportions side by side we could do this:
SELECT col_10
, COUNT(1) AS row_count
, COUNT(1) / (SUM(COUNT(1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
We wish to know how many values occur a particular number of times e.g. how many values occur exactly 1 time.
WITH stats AS (SELECT col_10, COUNT(1) occurrence
FROM refcon.dataset.table_1
GROUP BY col_10)
SELECT occurrence, COUNT(col_10) value_count
FROM stats
GROUP BY occurrence;
Here is how this works:
col_10
in stats CTE using GROUP BY
clause.GROUP BY
occurrence and count the values of col_10
.When dealing with real data sets, the column we may wish to summarize with is likely to have many possible values some of which are infrequent (rarely occurring) and often not of much value to the task at hand (or of all the values if we are looking for rarely occurring values) but could make the output noisy and harder to process. In such situations, it helps to filter the summary down to the parts that are useful to the task at hand which is what we cover in this section.
By Frequency
We wish to see only values of a non-numeric column that have a certain frequency or higher (or lower perhaps to view possible encoding issues).
In this example, we wish to get values of a variable that exists 300 times or more in the data set.
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10
HAVING result >= 300;
Here is how this works:
HAVING
filters after the aggregation is performed. i.e. it filters out the grouped data.By Proportion
We wish to see only values of a non-numeric column that have a certain proportion or higher (or lower perhaps to view possible encoding issues).
In this example, we wish to get values of a non-numeric variable that comprises 10% or more of the data set.
WITH table_rows AS
(SELECT COUNT(1) AS row_count
FROM refcon.dataset.table_1)
SELECT col_10,
COUNT(1) / (SELECT row_count FROM table_rows) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10
HAVING proportion > 0.1
Here is how this works:
This works similarly to the “Filtering By Frequency” scenario above except that we are filtering by proportion.
By Value
We wish to get the frequency of only a particular set of a column’s possible values (rather than all of its possible values).
In this example, we wish to get the number of occurrences of two possible values, ‘S’
and ‘M’
, of the column col_10
.
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
WHERE col_10 IN ('S', 'M')
GROUP BY col_10
Here is how this works:
WHERE
clause.GROUP BY col_10
as explained above in the "Frequency" scenario.Most Frequent
We wish to limit the output to just the most prevalent (or least prevalent) values of the variable being summarized.
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10
ORDER BY result DESC
LIMIT 2;
Here is how this works:
GROUP BY col_10
first to get the frequency.ORDER BY DESC
.LIMIT
clause.