This module focuses on understanding the underlying data before visualizing it. We will cover how to inspect data values for missing entries, outliers, and codings. We will also discuss different data structures, including cross-sectional, longitudinal, and roll-up data.
We begin by inspecting individual cells. There are several questions we need to answer about the data before trying to visualize it.
Missing values are entries where data is not available. These can appear as NULL, NaN, empty strings (""), or placeholder values such as "N/A" or -999. Missing values can distort averages, correlations, and trends.
You may need to remove or impute (fill in) these values before visualization.
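As a minimal sketch of the inspection described above, the following pandas snippet counts missing entries in a small, made-up `temperature` column. The column contents and the decision to treat -999 as a sentinel are assumptions for illustration; a real dataset's documentation should confirm which placeholders mean "missing".

```python
import pandas as pd

# Hypothetical column that mixes real readings with missing-value spellings.
raw = pd.Series([98.6, "N/A", -999, "", None, 99.1], name="temperature")

# Non-numeric entries ("N/A", "", None) become NaN when coerced to numbers.
values = pd.to_numeric(raw, errors="coerce")

# Treating -999 as a sentinel is an assumption; check the data dictionary.
naive_mean = values.mean()                 # still includes -999
values = values.mask(values == -999)

n_missing = int(values.isna().sum())       # how many cells are missing
clean_mean = values.mean()                 # NaN is skipped by default
imputed = values.fillna(clean_mean)        # simple mean imputation
```

Comparing `naive_mean` with `clean_mean` shows how a single unhandled placeholder can distort an average. Mean imputation is only one option; dropping the rows or using a median may suit other data better.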
Examples:
Outliers are data points that fall far outside the range of the majority of values. They can skew visualizations such as bar charts and histograms, making it hard to see patterns among the majority of the data.
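One common way to flag candidates for review is the 1.5 × IQR rule. The sketch below applies it to hypothetical purchase amounts. The rule and the 1.5 multiplier are a convention chosen for illustration, not something this module prescribes.

```python
import pandas as pd

# Hypothetical purchase amounts; one value is far from the rest.
sales = pd.Series([180, 210, 195, 205, 220, 190, 1_000_000], name="amount")

# Interquartile range and the conventional 1.5 * IQR fences.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

is_outlier = (sales < lower) | (sales > upper)
outliers = sales[is_outlier]     # values to inspect, not delete blindly
typical = sales[~is_outlier]     # values inside the fences
```

Flagged values deserve a closer look before any decision is made: some are data-entry errors, while others are genuine and may be the most interesting part of the data.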
Examples:
- 150 in a demographic dataset.
- $1,000,000 in a store where the average is $200.

Codings refer to how values are recorded. Inconsistent codings lead to improper grouping and inaccurate visual aggregation. Standardizing these formats is essential.
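To illustrate why inconsistent codings break grouping, the sketch below standardizes a hypothetical `gender` column with an explicit lookup table. The column name, the raw values, and the canonical labels are all assumptions. Numeric codes such as 0/1 are left out on purpose, since their meaning has to come from the dataset's codebook.

```python
import pandas as pd

# Hypothetical column mixing several codings of the same field.
df = pd.DataFrame({"gender": ["M", "Male", "f", "Female", "F", "male "]})

# Lookup from normalized raw codes to one canonical label (an assumption).
canonical = {"m": "Male", "male": "Male", "f": "Female", "female": "Female"}

# Trim whitespace and lowercase first so "male " and "Male" match.
normalized = df["gender"].str.strip().str.lower()
df["gender_std"] = normalized.map(canonical)

counts_raw = df["gender"].value_counts()      # six separate groups
counts_std = df["gender_std"].value_counts()  # two groups after cleanup
```

Any raw value missing from the lookup maps to NaN, which makes unexpected codings easy to spot instead of silently forming their own group.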
Examples:
- Gender recorded as M/F, Male/Female, or even 0/1.
- Dates written as 01/02/2023, 2023-02-01, or February 1, 2023.

Properly classifying the type of each field helps determine appropriate visual encodings (e.g., bar charts for categorical data, scatter plots for continuous numeric data).
This is the first level of data typing: whether a field is textual or numeric. String fields store non-numeric data like names, categories, or descriptions.
Examples:
- City, Product Name, Status = "Open/Closed"

Numeric fields contain numbers and can be further divided into integers and decimals. Numeric fields can be aggregated (mean, median, sum), while string fields are usually grouped or counted.
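The string/numeric split can be checked directly from a table's column types. The sketch below uses a small hypothetical table. Its column names and values are assumptions, loosely modeled on the examples in this section.

```python
import pandas as pd

# Hypothetical table with one string field and two numeric fields.
df = pd.DataFrame({
    "city": ["Austin", "Boston", "Austin", "Denver"],
    "items_sold": [5, 3, 8, 2],                    # integers
    "revenue": [1234.56, 310.00, 980.25, 150.75],  # decimals
})

# Split the columns by whether pandas considers them numeric.
numeric_cols = df.select_dtypes(include="number").columns.tolist()
string_cols = df.select_dtypes(exclude="number").columns.tolist()

# Numeric fields are aggregated; string fields are grouped and counted.
total_items = df["items_sold"].sum()
mean_revenue = df["revenue"].mean()
city_counts = df["city"].value_counts()
```

The inferred type is only a starting point: a column that should be numeric but contains stray text will show up among the string columns, which is itself a useful signal during inspection.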
Examples:
- Integers: Age = 21, Number of Items Sold = 5
- Decimals: Temperature = 98.6, Sales Revenue = 1234.56

Continuous fields can take on any value within a range and are often measured rather than counted.
Examples:
- Height, Weight, Temperature, Time

Discrete fields consist of distinct, separate values and are often counted. A numeric field is not automatically continuous. For instance, ZIP Code is numeric but discrete and categorical in nature.
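The ZIP Code case can be sketched as follows. The table is hypothetical, and the zero-padding to five characters assumes U.S. five-digit ZIP codes that lost their leading zeros when read in as integers.

```python
import pandas as pd

# Hypothetical table: ZIP codes arrived as integers, heights as decimals.
df = pd.DataFrame({
    "zip_code": [2139, 10001, 2139, 94105],
    "height_cm": [170.2, 165.0, 180.5, 175.3],
})

# Averaging ZIP codes runs without error, but the result means nothing.
meaningless_mean = df["zip_code"].mean()

# Treat ZIP codes as labels: convert to text, restore leading zeros, count.
df["zip_code"] = df["zip_code"].astype(str).str.zfill(5)
zip_counts = df["zip_code"].value_counts()

# A continuous field such as height is still safe to average.
mean_height = df["height_cm"].mean()
```

Storing such identifiers as strings prevents a plotting library from placing them on a continuous numeric axis.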
Examples:
- Number of children, Days of the week, Star rating = 1 to 5

It is essential to understand how data is structured and what kind of measurements are being recorded. Different data shapes and measurement contexts call for different visualizations and preprocessing steps. Below are common structural and temporal classifications of data.
Cross-sectional data refers to observations collected at a single point in time, typically across different subjects, such as individuals, organizations, or regions. Cross-sectional data is static—ideal for comparisons between entities using bar charts, box plots, or maps.
Examples:
Example dataset structure, showing average freshman, sophomore, junior, and senior GPAs for different majors:
| Major | FreshmanGPA | SophomoreGPA | JuniorGPA | SeniorGPA |
|---|---|---|---|---|
| Biology | 3.2 | 3.3 | 3.4 | 3.5 |
| ComputerSci | 3.5 | 3.6 | 3.7 | 3.8 |
| History | 3.1 | 3.2 | 3.3 | 3.4 |
Longitudinal data, closely related to time series data, involves repeated observations of the same variables (and usually the same subjects) over time. It reveals trends, cycles, and patterns, so visualizations should emphasize temporal dynamics.
Line charts are the usual choice for longitudinal data. Avoid bar charts here, since they tend to obscure trends and continuity.
Examples:
Example dataset structure for Longitudinal Data:
| Student ID | Semester | GPA |
|---|---|---|
| 001 | Fall 2020 | 3.5 |
| 001 | Spring 2021 | 3.6 |
| 002 | Fall 2020 | 3.8 |
| 002 | Spring 2021 | 3.7 |
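Before plotting a table like the one above, it helps to make sure the time column sorts chronologically and not alphabetically. The sketch below rebuilds the table in pandas and computes each student's GPA change between semesters. The snake_case column names and the explicit semester order are assumptions made for this toy data.

```python
import pandas as pd

long_df = pd.DataFrame({
    "student_id": ["001", "001", "002", "002"],
    "semester": ["Fall 2020", "Spring 2021", "Fall 2020", "Spring 2021"],
    "gpa": [3.5, 3.6, 3.8, 3.7],
})

# Semester labels are text, so give them an explicit chronological order.
# With more terms, alphabetical order would put "Fall 2021" before "Spring 2021".
order = ["Fall 2020", "Spring 2021"]
long_df["semester"] = pd.Categorical(
    long_df["semester"], categories=order, ordered=True
)

# Sort within each student by time, then take the change between semesters.
long_df = long_df.sort_values(["student_id", "semester"])
long_df["gpa_change"] = long_df.groupby("student_id")["gpa"].diff()
```

The first semester of each student has no earlier value, so its `gpa_change` is NaN. The sorted table is then ready for a line chart with one line per student.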
Roll-up data includes a mix of detailed and aggregated (summary) records in the same table. This is common in government or institutional data where subtotal or total rows are embedded within the raw dataset.
If not handled properly, roll-up rows can result in double-counting during analysis or incorrect visual representations. For example, including both “New York City” and “All U.S. Cities” in the same bar chart can be misleading.
Examples:
How to Handle:
Example Dataset with Roll-Up Rows:
| Location | Month | Sales |
|---|---|---|
| New York City | January | 1000 |
| New York City | February | 1200 |
| New York City | Total | 2200 |
| Los Angeles | January | 800 |
| Los Angeles | February | 900 |
| Los Angeles | Total | 1700 |
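One possible way to handle the table above is to separate the roll-up rows from the detail rows before aggregating. This sketch assumes roll-up rows are marked by `Total` in the month column, as they are here. Real datasets may mark them differently, for example with a blank cell or an "All" label.

```python
import pandas as pd

sales = pd.DataFrame({
    "location": ["New York City"] * 3 + ["Los Angeles"] * 3,
    "month": ["January", "February", "Total"] * 2,
    "sales": [1000, 1200, 2200, 800, 900, 1700],
})

# Split the table into detail rows and embedded summary rows.
is_rollup = sales["month"] == "Total"
detail = sales[~is_rollup]
rollup = sales[is_rollup].set_index("location")["sales"]

naive_total = sales["sales"].sum()       # counts every sale twice
correct_total = detail["sales"].sum()    # detail rows only

# Sanity check: detail rows should add up to the embedded totals.
recomputed = detail.groupby("location")["sales"].sum()
totals_match = recomputed.to_dict() == rollup.to_dict()
```

Keeping the roll-up rows in a separate object, instead of discarding them, makes it possible to validate the detail data against the published totals.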
In wide data, each variable gets its own column. This format is often used for human readability and spreadsheet reports.
This structure is good for direct comparison between variables. However, many visualization libraries (e.g., ggplot2, Seaborn) work best with data in tall format.
Example:
| Student | Math | English | History |
|---|---|---|---|
| Alice | 95 | 88 | 92 |
| Bob | 87 | 91 | 85 |
Tall (or long) data is more normalized: it has one row per observation, with columns for the entity, the variable name, and the value.
Tall data is better for:
Example (same data as above, in tall format):
| Student | Subject | Score |
|---|---|---|
| Alice | Math | 95 |
| Alice | English | 88 |
| Alice | History | 92 |
| Bob | Math | 87 |
| Bob | English | 91 |
| Bob | History | 85 |
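The conversion between the two shapes can be sketched with pandas, using `melt` to go from wide to tall and `pivot` to go back. The data is the same student score table shown in this section; the variable names are illustrative.

```python
import pandas as pd

wide = pd.DataFrame({
    "Student": ["Alice", "Bob"],
    "Math": [95, 87],
    "English": [88, 91],
    "History": [92, 85],
})

# Wide -> tall: each subject column becomes rows of (Subject, Score).
tall = wide.melt(id_vars="Student", var_name="Subject", value_name="Score")

# A stable sort groups each student's rows while keeping the subject order.
tall = tall.sort_values("Student", kind="stable").reset_index(drop=True)

# Tall -> wide: one row per student, one column per subject.
back_to_wide = tall.pivot(index="Student", columns="Subject", values="Score")
```

After the reshape, `tall` has six rows and three columns, and `Subject` can be mapped directly to a color, facet, or axis in a plotting library.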