Data Inspection

A key early step of any data project is to get acquainted with the dataset. We do so through a set of exploratory investigations that are often collectively referred to as data inspection. Beyond helping us get to grips with a new dataset, data inspection is something we do continuously throughout a project: each time we manipulate the data, we inspect it again to verify the results of our actions.

In the following pages, we cover the key activities that together make up a solid data inspection routine:

  1. In Structure, we cover inspecting the dimensions, the column names, and column data types of a data frame. We also cover inspecting the size of a data frame in memory. This is a good starting point to quickly get a glimpse of the size and information contained in a dataset.
  2. In Segments, we cover viewing different sets of rows (which we refer to as segments) of the actual data. There is no substitute for looking thoughtfully at the data itself.
  3. In Univariate Summary, we cover summarizing individual columns, e.g., the mean of a numerical column or the number of unique values in a string column.
  4. In Multivariate Summary, we cover summarizing columns against each other, e.g., the mean value of a numerical column for each possible value of a categorical column.
  5. In Quality, we cover inspecting missing data and duplication in the data. Missing data and duplicate data are common data quality issues that may adversely affect the quality of data analysis if not understood and accounted for.
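
The five activities above can be sketched in a few lines. The example below is only an illustration, not the routine developed in the following pages: it assumes base R alone (no extra packages) and uses the built-in airquality dataset as a stand-in for a real dataset. All variable names are hypothetical.

```r
# Illustrative sketch using base R only; `airquality` ships with R.
df <- airquality

# 1. Structure: dimensions, column names, column types, size in memory.
dims <- dim(df)
col_names <- names(df)
col_types <- sapply(df, class)
mem_size <- object.size(df)

# 2. Segments: the first rows, the last rows, and a random sample of rows.
first_rows <- head(df, 5)
last_rows <- tail(df, 5)
set.seed(1)  # fixed seed so the sampled rows are reproducible
sample_rows <- df[sample(nrow(df), 5), ]

# 3. Univariate summary: mean of one column, distinct values of another.
mean_temp <- mean(df$Temp)
n_months <- length(unique(df$Month))

# 4. Multivariate summary: mean temperature for each month.
mean_temp_by_month <- tapply(df$Temp, df$Month, mean)

# 5. Quality: missing values per column and count of duplicated rows.
na_per_col <- colSums(is.na(df))
n_dup_rows <- sum(duplicated(df))

print(dims)
print(col_types)
print(first_rows)
print(mean_temp_by_month)
print(na_per_col)
print(n_dup_rows)
```

Each numbered comment maps to the list item with the same number, so the sketch can serve as a checklist when meeting a new data frame.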

Notes on our approach for this section:

  1. Data inspection favors fast, interactive analysis, typically carried out in an interactive environment like the console or a notebook. R provides convenience functions for common operations that involve significantly fewer keystrokes than their equivalent general-form expressions. While our approach elsewhere favors general-form expressions because of their wide utility, for data inspection we prefer the shorter convenience alternatives wherever possible to enable rapid inspection.
  2. We adopt the glimpse() function from the dplyr package instead of base R’s str() function, the skim() function from the skimr package instead of base R’s summary(), and the tabyl() function from the janitor package instead of base R’s table(). The skimr and janitor packages are fairly popular and are tidyverse-adjacent, i.e., they follow the same design principles as the tidyverse.
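
To make the substitutions in the notes above concrete, here is a minimal side-by-side sketch. It assumes the dplyr, skimr, and janitor packages are installed, and it uses the built-in mtcars dataset purely as an illustrative stand-in; the variable name cyl_counts is hypothetical.

```r
library(dplyr)    # provides glimpse()
library(skimr)    # provides skim()
library(janitor)  # provides tabyl()

df <- mtcars  # built-in dataset, used here only for illustration

# Structure: base R's str() versus glimpse().
str(df)
glimpse(df)

# Column summaries: base R's summary() versus skim().
summary(df)
skim(df)

# Frequency counts of one column: base R's table() versus tabyl().
table(df$cyl)
cyl_counts <- tabyl(df, cyl)
print(cyl_counts)
```

Unlike table(), tabyl() returns a data frame (here with the columns cyl, n, and percent), so its result can be passed on to other data frame operations.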