Univariate Data Summary

In this section, we look at how to summarize each column individually, e.g. taking the mean of a numeric column.

This section is structured into three parts as follows:

  1. Dataset Summary: Quickly obtaining a basic summary of each column in a data frame at once.
  2. Numeric Columns: Summarizing numeric columns, e.g. obtaining the sum.
  3. Non-Numeric Columns: Summarizing non-numeric columns (which may be of data types such as factor, string, or date), e.g. obtaining the unique values.
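As a rough preview of the three parts, the sketch below runs one operation of each kind on a small made-up data frame. The column names and values are hypothetical, and describe() is just one pandas method that gives a per-column summary; it is an assumption that it matches what the Dataset Summary part uses.

```python
import pandas as pd

# Hypothetical toy data: one numeric column and one string column.
df = pd.DataFrame({
    "price": [1.5, 2.5, 4.0],
    "colour": ["red", "blue", "red"],
})

# 1. Dataset summary: a basic summary of every column at once.
summary = df.describe(include="all")

# 2. Numeric column: the sum of a single column.
total = df["price"].sum()

# 3. Non-numeric column: the distinct values, in order of first appearance.
colours = df["colour"].unique()

print(summary)
print(total)
print(colours)
```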

In Multivariate Summary, we will look at summarizing multiple columns together, e.g. taking the mean of a numeric column for each group of rows, where groups are defined by the value of a factor (categorical) column.
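To make the distinction concrete, here is a minimal sketch contrasting a univariate summary with a grouped one. The data frame and its column names are hypothetical, and a plain string column stands in for the factor column.

```python
import pandas as pd

# Hypothetical toy data: a grouping column and a numeric column.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Univariate: one mean for the whole column.
overall_mean = df["value"].mean()

# Multivariate: one mean per group of rows.
group_means = df.groupby("group")["value"].mean()

print(overall_mean)
print(group_means)
```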

Generally speaking, in Pandas there are two modes of computing aggregate (or summary) operations:

  1. Using the aggregation API agg() (or apply()). The advantages are (a) general applicability and hence (b) consistency. The drawback is that it can be too verbose in a data inspection context.
  2. Using a set of convenience methods that may be applied directly to a data frame, e.g. df.mean(), or to an individual column (Series), e.g. df['col_1'].mean(). When applied to a data frame, these methods act on each column by default but can be made to act on rows by setting axis=1. The advantage is brevity, which makes this a great option for data inspection. The drawback is that this mode is limited to the set of methods that the Pandas authors implemented; if we need an operation that is not covered, we fall back to the base aggregation API. That said, the most common aggregation operations, such as count, sum, mean, and median, are covered.
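The two modes can be compared side by side in a minimal sketch. The data frame is made up for illustration, and value_range is a hypothetical helper standing in for an operation that has no convenience method.

```python
import pandas as pd

# Hypothetical toy data with two numeric columns.
df = pd.DataFrame({
    "col_1": [1.0, 2.0, 3.0, 4.0],
    "col_2": [10.0, 20.0, 30.0, 40.0],
})

# Mode 1: the general aggregation API.
agg_means = df.agg("mean")            # one mean per column
agg_multi = df.agg(["sum", "mean"])   # several aggregations at once


def value_range(column):
    """Hypothetical custom aggregation: max minus min of a column."""
    return column.max() - column.min()


ranges = df.agg(value_range)          # custom operation, one value per column

# Mode 2: convenience methods.
col_means = df.mean()                 # acts on each column by default
row_means = df.mean(axis=1)           # acts on each row instead
single_mean = df["col_1"].mean()      # a single column (Series)

print(agg_multi)
print(ranges)
print(row_means)
```

Both modes give the same column means here; the aggregation API earns its extra verbosity only when several operations are combined or a custom function is needed.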

In this section, we will stick to the convenience methods. We cover the aggregation API in Aggregating.
