In this section we look at how to summarize each column individually; e.g. taking the mean of a numerical column.
This section is structured into three parts as follows:
In Multivariate Summary we will look at summarizing multiple columns together, e.g. taking the mean of a numeric column for each group of rows, where the groups are defined by the value of a factor (categorical) column.
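To make the distinction concrete, here is a minimal sketch that contrasts the two kinds of summary. The data frame and its column names (group, value) are made up for illustration and do not come from the text.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 20.0],
})

# Univariate summary: a single number for the whole column.
overall_mean = df["value"].mean()

# Multivariate summary: one number per level of the categorical column.
mean_by_group = df.groupby("group")["value"].mean()

print(overall_mean)
print(mean_by_group)
```

The first result is a scalar, while the second is a Series indexed by the levels of the grouping column.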
Generally speaking, in Pandas there are two modes of computing aggregate (or summary) operations:
- The aggregation API, i.e. passing the operation to agg() (or apply()). Its advantages are (a) general applicability and hence (b) consistency. The drawback is that it can get a bit too verbose in a data inspection context.
- Convenience methods, applied to a whole data frame, e.g. df.mean(), or to an individual column (Series), e.g. df['col_1'].mean(). These methods act on each column by default but can be made to act on rows by setting axis=1. The advantage is brevity, which makes this a great option for data inspection.
While making for succinct, convenient code, this mode's applicability is limited to the set of methods that the Pandas authors implemented; if we need an operation that is not covered, we have to fall back on the base aggregation API. That said, the most common aggregation operations, such as count, sum, mean, and median, are covered.

In this section, we will stick to the convenience methods. We cover the aggregation API in Aggregating.
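The two modes can be compared side by side in a short sketch. The data frame and the helper value_range are hypothetical, chosen only to show an operation that has no dedicated convenience method.

```python
import pandas as pd

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "col_1": [1.0, 2.0, 3.0, 4.0],
    "col_2": [10.0, 20.0, 30.0, 100.0],
})

# Convenience methods: act on each column by default.
col_means = df.mean()            # Series indexed by column name
col_1_mean = df["col_1"].mean()  # scalar for a single column
row_means = df.mean(axis=1)      # one value per row

# Aggregation API: same column means, slightly more verbose.
col_means_agg = df.agg("mean")


def value_range(s):
    """Hypothetical custom summary: spread between largest and smallest value."""
    return s.max() - s.min()


# An operation without a convenience method goes through agg().
ranges = df.agg(value_range)

print(col_means)
print(row_means)
print(ranges)
```

For the built-in operations both routes should give the same numbers, so the choice is mostly about brevity; the custom function is where the aggregation API becomes necessary.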