Multivariate Data Summary

A common data inspection activity is to examine how different columns vary together. This is usually referred to as multivariate data summary or, when the variables are categorical, cross tabulation.

In this chapter we will cover the following common multivariate data summary scenarios:

  1. Data Frame Summary: Group the data frame by one or more columns, then obtain a consolidated summary of each column for each group.
  2. Numeric by Factor: Obtain a summary of a given numerical variable for each group, where groups are defined by a given categorical variable.
  3. Factor by Factor: Obtain a summary of a given categorical variable against another categorical variable, i.e. how often each combination of values of the two categorical variables occurs together.
  4. Numeric by Two Factors: Obtain a summary of a given numerical variable for each group, where groups are defined by two categorical variables.
  5. Multiple Facets: Generalize the scenarios above to more than two categorical columns.
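
To make the five scenarios concrete, here is a minimal sketch of what each one might look like in pandas. The toy data frame and its column names (region, segment, priority, sales, units) are hypothetical and chosen only for illustration, and the mean is just one possible choice of summary.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west", "east"],
    "segment": ["retail", "online", "retail", "online", "retail", "retail"],
    "priority": ["high", "low", "low", "high", "high", "low"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
    "units": [1, 2, 3, 4, 5, 6],
})

# 1. Data frame summary: one summary per numeric column per group.
frame_summary = df.groupby("region")[["sales", "units"]].mean()

# 2. Numeric by factor: one numeric column summarized per group.
numeric_by_factor = df.groupby("region")["sales"].mean()

# 3. Factor by factor: share of rows for each combination of two factors.
factor_by_factor = pd.crosstab(df["region"], df["segment"], normalize=True)

# 4. Numeric by two factors: a numeric summary per combination of two factors.
numeric_by_two = pd.crosstab(
    df["region"], df["segment"], values=df["sales"], aggfunc="mean"
)

# 5. Multiple facets: more than two factors, here two on the rows.
multi_facet = pd.crosstab([df["region"], df["segment"]], df["priority"])

print(frame_summary)
print(numeric_by_factor)
print(factor_by_factor)
print(numeric_by_two)
print(multi_facet)
```

Scenarios 3 to 5 produce wide tables with one factor spread across the columns, which is usually the easiest layout to read when inspecting data.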

Note: We will use crosstab() in this section for convenience and speed, which is appropriate in a data inspection context. In a data processing context, though, we typically opt to explicitly group, aggregate, then pivot (if needed) to a wide format. We cover these topics in more detail in Aggregation and Reshaping.
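
The two approaches mentioned in the note can be compared side by side. The sketch below assumes the pandas crosstab() function and uses a hypothetical toy data frame; the column names and the total_sales label are illustrative only.

```python
import pandas as pd

# Hypothetical toy data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "region": ["east", "east", "west", "west", "west", "east"],
    "segment": ["retail", "online", "retail", "online", "retail", "retail"],
    "sales": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# Inspection style: a single crosstab() call produces the wide table directly.
quick = pd.crosstab(
    df["region"], df["segment"], values=df["sales"], aggfunc="sum"
)

# Processing style: group, aggregate, then pivot to a wide format.
explicit = (
    df.groupby(["region", "segment"], as_index=False)
    .agg(total_sales=("sales", "sum"))
    .pivot(index="region", columns="segment", values="total_sales")
)

print(quick)
print(explicit)
```

Both tables hold the same totals. The explicit version is longer, but each step is named and the intermediate long-format result can be reused or checked before pivoting.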
