We wish to get a summary of a numeric column so we may gain some insight into the data it holds.

We wish to generate common summary statistics for a numeric column, e.g. the mean or the standard deviation. While we can compute each of those statistics one by one (see Variations below), during data inspection it is more efficient to use a single function that, given a numeric column, computes the common summary statistics.

```
library(tidyverse)  # provides the pipe and the dplyr verbs used on this page
library(skimr)
df %>% skim(col_1)
```

Here is how this works:

- We pass the data frame `df` to the function `skim()` and name the column we wish to summarize, here `col_1`.
- `skim()` is a great convenience. With one command, we get a consolidated report that has the most common summary statistics like row count, mean, standard deviation, minimum value, maximum value, and percentiles.
- `skim()`, from the `skimr` package, is a much more powerful alternative to R’s built-in `summary()` function.
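To make the comparison concrete, here is a minimal sketch that runs `skim()` and base R’s `summary()` on the same column. The data frame and its values are made up for illustration; only `skim()` and `summary()` come from the text above.

```r
library(dplyr)
library(skimr)

# Hypothetical data: one numeric column with a single missing value.
df <- tibble(col_1 = c(2, 4, NA, 6, 8))

# Consolidated report for col_1 only.
report <- df %>% skim(col_1)
print(report)

# Base R's summary() on the same column, for comparison.
summary(df$col_1)
```

The object returned by `skim()` is itself a data frame, so individual statistics can be read from its columns if needed.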

Consolidated summaries are great as an early step in the data inspection process. Often, however, we are interested in a particular summary statistic that the consolidated summary does not cover, or we find the consolidated summary a bit overwhelming. Say we just care about the mean of a particular numeric column.

```
df %>% pull(col_1) %>% mean(na.rm = TRUE)
```

Here is how this works:

- We select the column we wish to summarize via `pull()`, e.g. `pull(col_1)`.
- We then use the built-in `mean()` function to compute the mean.
- We set the argument `na.rm = TRUE` so that `mean()` ignores `NA` values and returns the mean of the rest.
- See Summary Statistics for how to compute all the common summary statistics.
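The effect of `na.rm = TRUE` is easiest to see on a column that contains a missing value. The sketch below uses a small made-up data frame; the values are illustrative only.

```r
library(dplyr)

# Hypothetical data with one missing value.
df <- tibble(col_1 = c(1, 2, NA, 5))

# Without na.rm, a single NA makes the result NA.
mean_with_na <- df %>% pull(col_1) %>% mean()

# With na.rm = TRUE, the mean is taken over the three non-missing values.
mean_without_na <- df %>% pull(col_1) %>% mean(na.rm = TRUE)

print(mean_with_na)
print(mean_without_na)
```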

We wish to obtain the sum of all values of a numeric column.

```
df %>% pull(col_1) %>% sum(na.rm = TRUE)
```

Here is how this works:

- We select the column we wish to summarize via `pull()`, e.g. `pull(col_1)`.
- We then use the built-in `sum()` function to compute the sum.
- We set the argument `na.rm = TRUE` so that `sum()` ignores `NA` values and returns the sum of the rest.
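If we would rather keep the result in a data frame than extract a bare number, one possible variation is to compute the sum inside `summarize()`. This is a sketch on made-up data, and the output column name `total` is an arbitrary choice.

```r
library(dplyr)

# Hypothetical data with one missing value.
df <- tibble(col_1 = c(1, 2, NA, 5))

# summarize() returns a one-row data frame holding the sum.
result <- df %>% summarize(total = sum(col_1, na.rm = TRUE))

print(result)
```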

We can glean an understanding of the distribution of a numerical column by partitioning its range into a set of categories (binning) and then computing the frequency of occurrence of those categories in the data.

```
library(janitor)
df %>% pull(col_1) %>% cut_interval(4) %>% tabyl()
```

Here is how this works:

- We select the column we wish to summarize via `pull()`, e.g. `pull(col_1)`.
- We then pass the column to `cut_interval(4)`, from the `ggplot2` package, to partition the numeric column into four equal-range “bins”. See Binning for coverage of binning techniques.
- We then pass the output of `cut_interval()` to `tabyl()` to compute a frequency distribution, i.e. in how many rows the value of the column `col_1` falls into each bin.
- We recommend the use of `tabyl()` from the `janitor` package instead of base R’s `table()` because it returns a clean data frame, automatically returns the percent, and has enhanced tabulation functionality (which we use extensively in Multivariate Summary).
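As an illustration of the steps above, here is a minimal sketch on made-up data. The ten values are hypothetical, and the packages are loaded explicitly so the block runs on its own.

```r
library(dplyr)    # pull() and the pipe
library(ggplot2)  # cut_interval()
library(janitor)  # tabyl()

# Hypothetical data: the integers 1 to 10.
df <- tibble(col_1 = 1:10)

# Bin the column into four equal-width intervals, then count rows per bin.
freq <- df %>% pull(col_1) %>% cut_interval(4) %>% tabyl()

print(freq)
```

The result is a data frame with one row per bin, a count column `n`, and a `percent` column holding each bin’s share of the rows.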

We wish to check what proportion of the values of a numerical variable is higher or lower than a given value. In this example, we wish to know the proportion of rows where the value of `col_1` is greater than `0`.

```
df %>% summarize(rate = mean(col_1 > 0))
```

Here is how this works:

- We compare the value of `col_1` with the threshold of choice, here `0`, to return a boolean vector that is `TRUE` for rows where `col_1 > 0` and `FALSE` otherwise.
- We use `summarize()` to apply `mean()` to the boolean vector generated by `col_1 > 0`.
- Applying `mean()` to a boolean vector is a way to compute the proportion of values that are true. It is equivalent to dividing the number of `TRUE` values by the total number of values. See Boolean Operations.
- Note that comparing to a missing value `NA` returns `NA`, so a single missing value makes the result of `mean()` `NA` as well. If we have missing values, we may wish to set the argument `na.rm = TRUE` when we call `mean()`; the proportion is then computed over the non-missing values only.
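The sketch below, on made-up data, shows the three points above side by side: the `NA` result when a value is missing, the effect of `na.rm = TRUE`, and the equivalent count-divided-by-total calculation. The output column names are arbitrary.

```r
library(dplyr)

# Hypothetical data: two positive values, two non-positive, one missing.
df <- tibble(col_1 = c(-2, 0, 3, 7, NA))

rates <- df %>%
  summarize(
    # NA, because comparing NA with 0 yields NA.
    rate_with_na = mean(col_1 > 0),
    # Proportion of TRUE among the non-missing values.
    rate = mean(col_1 > 0, na.rm = TRUE),
    # The same proportion computed by hand: count of TRUE / count of non-missing.
    rate_by_hand = sum(col_1 > 0, na.rm = TRUE) / sum(!is.na(col_1))
  )

print(rates)
```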

If we have a numeric column encoded as a string column, we need to convert the data type to numeric before we can run numeric operations such as the summary statistics on this page.

```
df %>% mutate(col_1 = parse_number(col_1)) %>% skim(col_1)
```

Here is how this works:

- We use `parse_number()` from the `readr` package (part of the `tidyverse`) to convert the data type (cast) from string (character) to numeric. See Data Type Setting for more details.
- We use `mutate()` to carry out the data transformation. See Data Transformation for more details.
- Sometimes we have to process a string column to extract the numeric information into a numeric column. If so, that string manipulation often needs to be carried out before data type setting. We cover the string operations needed for data cleaning in String Operations.
