Data Inspection

A key early step in any data project is getting acquainted with the dataset. We do so through a set of exploratory investigations collectively referred to as Data Inspection. Besides helping us get to grips with a new dataset, data inspection is something we keep doing throughout a project: each time we manipulate the data, we inspect it again to verify the result.

In this section, we cover the key activities that together make up a solid data inspection routine. The section is structured as follows:

  1. Structure, where we cover inspecting the dimensions, column names, and column data types of a table, as well as the size of the table. This is a good starting point for getting a quick glimpse of how large a dataset is and what information it contains.
  2. Segments, where we cover viewing different sets of rows (which we refer to as segments) of the actual data. There is no substitute for looking thoughtfully at the data itself.
  3. Univariate Summary, where we cover summarizing individual columns, e.g., the mean of a numerical column or the number of unique values in a string column.
  4. Multivariate Summary, where we cover summarizing columns against each other, e.g., the mean value of a numerical column for each possible value of a categorical column.
  5. Quality, where we cover inspecting missing data and duplication in the data. Missing and duplicate data are common data quality issues that can adversely affect an analysis if they are not understood and accounted for.
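
To make the five activities concrete, here is a minimal sketch that runs one small check from each of them. It is not the tooling used in the rest of this section: it relies only on the Python standard library, and the table `rows`, its column names, and every variable name are hypothetical, made up for illustration.

```python
from collections import Counter, defaultdict
from statistics import mean

# Hypothetical toy table: a list of rows, each a dict keyed by column name.
rows = [
    {"city": "Oslo", "product": "A", "price": 10.0},
    {"city": "Oslo", "product": "B", "price": 20.0},
    {"city": "Rome", "product": "A", "price": 30.0},
    {"city": "Rome", "product": "A", "price": 30.0},  # exact duplicate of the previous row
    {"city": "Lima", "product": "B", "price": None},  # missing price
]
columns = list(rows[0])

# 1. Structure: dimensions, column names, and the value types seen in each column.
shape = (len(rows), len(columns))
dtypes = {c: sorted({type(r[c]).__name__ for r in rows}) for c in columns}

# 2. Segments: the first and last few rows.
head, tail = rows[:2], rows[-2:]

# 3. Univariate summary: mean of a numerical column, unique count of a string column.
prices = [r["price"] for r in rows if r["price"] is not None]
mean_price = mean(prices)
unique_cities = len({r["city"] for r in rows})

# 4. Multivariate summary: mean price for each value of the product column.
by_product = defaultdict(list)
for r in rows:
    if r["price"] is not None:
        by_product[r["product"]].append(r["price"])
mean_price_by_product = {k: mean(v) for k, v in by_product.items()}

# 5. Quality: missing values per column and number of fully duplicated rows.
missing = {c: sum(r[c] is None for r in rows) for c in columns}
row_counts = Counter(tuple(r[c] for c in columns) for r in rows)
duplicate_rows = sum(n - 1 for n in row_counts.values() if n > 1)

print(shape, dtypes, mean_price, unique_cities)
print(mean_price_by_product, missing, duplicate_rows)
```

Note that the summaries in steps 3 and 4 skip the missing price rather than treating it as zero. That is a choice made for this sketch; how missing values should be handled depends on the dataset, which is one reason the quality checks in step 5 belong in the routine.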