We wish to get a summary of a numeric column so we may gain some insight into the data it holds.

We wish to generate common summary statistics for a numeric column, e.g. the mean or the standard deviation. While we can compute each of those statistics one by one (see Variations below), it is more efficient during data inspection to use a single function that, given a numeric column, computes the common summary statistics.

```
df['col_1'].describe()
```

Here is how this works:

- We select the column of interest via `df['col_1']`.
- We then use `describe()` to return descriptive statistics for the numerical column selected via `df['col_1']`. `describe()` returns: count, mean, std, min, 25%, 50%, 75%, and max.
- We can define which percentiles to include via the `percentiles` argument of `describe()` like so `describe(percentiles=[0.05, 0.5, 0.95])`.
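As a minimal sketch of the `percentiles` argument, the snippet below runs `describe()` on a small made-up column. The data frame and its values are hypothetical and only serve to make the example runnable.

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': [1.0, 2.0, 3.0, 4.0, 5.0]})

# Request the 5th, 50th and 95th percentiles instead of the default quartiles.
summary = df['col_1'].describe(percentiles=[0.05, 0.5, 0.95])
print(summary)

# The result is a Series, so individual statistics can be read by label.
median = summary['50%']
```

Because the result is itself a `Series`, a single statistic can be pulled out by its label (e.g. `summary['mean']`) when the full table is not needed.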

Consolidated summaries are great as an early step in the data inspection process. Often, however, we are interested in a particular summary statistic that may not be covered by the consolidated summary or the consolidated summary may be a bit too overwhelming. Say we just care about knowing the mean of a particular numeric column.

```
df['col_1'].mean()
```

Here is how this works:

- We select the column we wish to summarize via the bracket operator e.g. `df['col_1']`.
- We then use the built in `mean()` function for Pandas `Series` objects to compute the mean.
- See Summary Statistics for how to compute all the common summary statistics.
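A small sketch, using hypothetical data, showing that `mean()` returns the same value as the `mean` entry of the consolidated summary:

```python
import pandas as pd

# Hypothetical data, for illustration only.
df = pd.DataFrame({'col_1': [2.0, 4.0, 6.0, 8.0]})

# A single statistic computed directly.
col_mean = df['col_1'].mean()

# The same statistic read out of the consolidated summary.
described_mean = df['col_1'].describe()['mean']

print(col_mean, described_mean)
```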

We wish to obtain the sum of all values of a numeric column.

```
df['col_1'].sum()
```

Here is how this works:

- We select the column we wish to summarize via the bracket operator e.g. `df['col_1']`.
- We then use the built in `sum()` function for Pandas `Series` objects to compute the sum.
- By default, `sum()` ignores missing values because its `skipna` argument defaults to `True`.
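To illustrate the effect of `skipna`, here is a short sketch on a hypothetical column that contains one missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, for illustration only.
df = pd.DataFrame({'col_1': [1.0, 2.0, np.nan, 4.0]})

# Default behaviour: the missing value is skipped.
total = df['col_1'].sum()

# With skipna=False the missing value propagates and the result is NaN.
total_strict = df['col_1'].sum(skipna=False)

print(total, total_strict)
```

Passing `skipna=False` can be a useful check when a missing value should not silently disappear from a total.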

We can glean an understanding of the distribution of a numerical column by partitioning its range into a set of categories (binning) and then computing the frequency of occurrence of those categories in the data.

```
df['col_1'].value_counts(bins = 4)
```

Here is how this works:

- The `value_counts()` function has an immensely useful `bins` argument which partitions the range of the numeric column into n equal-width “bins” (in this case `bins = 4`). `value_counts()` then computes the number of occurrences (frequency) of each bin.
- This approach is appropriate where we only need the binned variable to compute a frequency distribution. To transform a numeric column into a categorical column we can use `pd.cut()` which we describe in Binning.
- The default ordering is a descending order of frequency. Sometimes it's useful to order by the actual values, and not by frequency (especially when we are using `bins`). To do this, we can use `value_counts(bins = 4, sort=False)`.
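The following sketch applies `value_counts()` with `bins` to a hypothetical column holding the integers 1 to 10, with `sort=False` so that the bins appear in order of their values rather than their frequency:

```python
import pandas as pd

# Hypothetical data: the integers 1 through 10, for illustration only.
df = pd.DataFrame({'col_1': range(1, 11)})

# Four equal-width bins over the range of col_1, listed in bin order.
counts = df['col_1'].value_counts(bins=4, sort=False)
print(counts)

# The index holds the bin intervals; the values hold the frequencies.
frequencies = counts.tolist()
```

Every non-missing value falls into exactly one bin, so the frequencies add up to the number of non-missing rows.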

We wish to check what proportion of the values of a numerical variable is higher or lower than a given value. In this example, we wish to know the percentage of rows where the value of `col_1` is greater than `0`.

```
df['col_1'].dropna().gt(0).mean()
```

Here is how this works:

- `gt()` is a convenient function form of the greater than `>` comparison and is one of a set of functions that Pandas offers that can be used in place of the comparison operators. Their advantage is cleaner chaining.
- The logical expression `df['col_1'].gt(0)` (equivalent to `df['col_1'] > 0`) compares each value in `col_1` to `0` and returns a `Series` of booleans that is `True` wherever the value of `col_1` is greater than `0`.
- The `mean()` of a boolean vector is equivalent to the proportion of `True` values.
- In Python, comparing to a missing value (`np.nan`) returns `False`. Therefore, when there are missing values we should drop them via `dropna()` before comparing; otherwise each missing value counts as `False` and pulls the proportion down. We cover working with missing values in Missing Values. If there are no missing values we could drop the `dropna()` and use the simpler `df['col_1'].gt(0).mean()`.
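A brief sketch, on hypothetical data with one missing value, of how `dropna()` changes the resulting proportion:

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value, for illustration only.
df = pd.DataFrame({'col_1': [-1.0, 2.0, np.nan, 3.0, 0.0]})

# Proportion of non-missing values greater than 0: 2 out of 4.
prop = df['col_1'].dropna().gt(0).mean()

# Without dropna() the missing value compares as False: 2 out of 5.
prop_with_missing = df['col_1'].gt(0).mean()

print(prop, prop_with_missing)
```

Which of the two figures is appropriate depends on whether missing rows should be part of the denominator.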

If we have a numeric column encoded as a string column, we need to convert the data type to numeric before we can run numeric operations such as the summary statistics on this page.

```
df['col_1'].astype(float).describe()
```

Here is how this works:

- We select the column of interest via `df['col_1']`.
- We use `astype()` while passing the argument `float` to convert the data type (cast) from string (character) to numeric. See Data Type Setting for more details.
- Now that the column `col_1` has been transformed to a numeric data type, we can apply any of the summary operations described above e.g. `describe()`.
- Sometimes we have to process a string column to extract the numeric information into a numeric column. If so, that string manipulation often needs to be carried out before data type setting. We cover the string operations needed for data cleaning in String Operations.
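As a sketch of the last point, suppose the strings contain thousands separators. The data and the choice of separator are assumptions made for illustration; real columns may need different cleaning.

```python
import pandas as pd

# Hypothetical string column with thousands separators, for illustration only.
df = pd.DataFrame({'col_1': ['1,200', '350', '2,450']})

# Remove the separators first, then cast the cleaned strings to float.
numeric = df['col_1'].str.replace(',', '', regex=False).astype(float)

# The numeric column now supports the summary operations described above.
print(numeric.describe())
total = numeric.sum()
```

Calling `astype(float)` on the raw strings here would raise an error, because a value such as `'1,200'` cannot be parsed as a number until the comma is removed.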
