On this page, we cover how to get a summary of a non-numeric column, so we may gain some insight into the data it holds.

Non-numeric columns include string, Boolean and date-time columns. For the purpose of high-level inspection, they can more or less be summarized in the same way e.g. number of unique values.

We wish to get the unique values that a non-numeric column takes.

```
SELECT COUNT(DISTINCT col_10) AS result
FROM refcon.dataset.table_1;
```

Here is how this works:

- We use
`COUNT DISTINCT`

to count the unique values for col_10.

We wish to know how many times each of the possible values of a particular non-numeric column of interest exist in a table. In other words, for how many rows of the table does the non-numeric column of interest take each of its possible values?

```
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10;
```

Here is how this works:

- We use the
`GROUP BY`

clause to group by`col_10`

. `COUNT(1)`

counts the number of rows for each value of`col_10`

.

We wish to know the proportion (percentage or density) of the total number of rows (observations) that take each possible value of the non-numeric column (variable).

```
SELECT col_10
, COUNT(1) / (SUM(COUNT(1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
```

Here is how this works:

`SUM(COUNT(1)) OVER ()`

is a window function that counts the number of rows in the table and ignores the`GROUP BY`

clause.`COUNT(1)`

counts the number of rows for each value of`col_10`

.

*Alternatively*:

```
WITH table_rows AS
(SELECT COUNT(1) AS row_count
FROM refcon.dataset.table_1)
SELECT col_10,
COUNT(1) / (SELECT row_count FROM table_rows) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
```

Here is how this works:

- First we get the total number of rows in the table in
`table_rows`

CTE. - We use the
`GROUP BY`

clause to group by`col_10`

. `COUNT(1)`

counts the number of rows for each value of`col_10`

.- We divide
`COUNT(1)`

by the row number using a subquery.

and should we wish to see counts and proportions side by side we could do this:

```
SELECT col_10
, COUNT(1) AS row_count
, COUNT(1) / (SUM(COUNT(1)) OVER ()) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10;
```

We wish to know how many values occur a particular number of times e.g. how many values occur exactly 1 time.

```
WITH stats AS (SELECT col_10, COUNT(1) occurrence
FROM refcon.dataset.table_1
GROUP BY col_10)
SELECT occurrence, COUNT(col_10) value_count
FROM stats
GROUP BY occurrence;
```

Here is how this works:

- We calculate the occurrence of each value in
`col_10`

in stats CTE using`GROUP BY`

clause. - We then
`GROUP BY`

occurrence and count the values of`col_10`

.

When dealing with real data sets, the column we may wish to summarize with is likely to have many possible values some of which are infrequent (rarely occurring) and often not of much value to the task at hand (or of all the values if we are looking for rarely occurring values) but could make the output noisy and harder to process. In such situations, it helps to filter the summary down to the parts that are useful to the task at hand which is what we cover in this section.

**By Frequency**

We wish to see only values of a non-numeric column that have a certain frequency or higher (or lower perhaps to view possible encoding issues).

In this example, we wish to get values of a variable that exists 300 times or more in the data set.

```
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10
HAVING result >= 300;
```

Here is how this works:

- The aggregation works as explained above.
`HAVING`

filters after the aggregation is performed. i.e. it filters out the grouped data.

**By Proportion**

We wish to see only values of a non-numeric column that have a certain proportion or higher (or lower perhaps to view possible encoding issues).

In this example, we wish to get values of a non-numeric variable that comprises 10% or more of the data set.

```
WITH table_rows AS
(SELECT COUNT(1) AS row_count
FROM refcon.dataset.table_1)
SELECT col_10,
COUNT(1) / (SELECT row_count FROM table_rows) AS proportion
FROM refcon.dataset.table_1
GROUP BY col_10
HAVING proportion > 0.1
```

Here is how this works:

This works similarly to the “Filtering By Frequency” scenario above except that we are filtering by proportion.

**By Value**

We wish to get the frequency of only a particular set of a column’s possible values (rather than all of its possible values).

In this example, we wish to get the number of occurrences of two possible values, `‘S’`

and `‘M’`

, of the column `col_10`

.

```
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
WHERE col_10 IN ('S', 'M')
GROUP BY col_10
```

Here is how this works:

- We filter the values we wish to keep using the
`WHERE`

clause. - We then
`GROUP BY col_10`

as explained above in the "Frequency" scenario.

**Most Frequent**

We wish to limit the output to just the most prevalent (or least prevalent) values of the variable being summarized.

```
SELECT col_10, COUNT(1) AS result
FROM refcon.dataset.table_1
GROUP BY col_10
ORDER BY result DESC
LIMIT 2;
```

Here is how this works:

- We
`GROUP BY col_10`

first to get the frequency. - We sort the results in descending order using
`ORDER BY DESC`

. - We get the most prevalent values by using
`LIMIT`

clause.

SQL