course_dv

Understand your data structure and values

This module focuses on understanding the underlying data before visualizing it. We will cover how to inspect data values for missing entries, outliers, and codings. We will also discuss different data structures, including cross-sectional, longitudinal, and roll-up data.

Outcomes:

Links:

Good examples:

Data Values

We begin by inspecting individual cells. There are several questions we need to answer about the data before trying to visualize it.

Missing Values

Missing values are entries where data is not available. These can appear as NULL, NaN, empty strings (""), or placeholder values such as the text "N/A" or the sentinel number -999. If they are not handled, missing values can distort averages, correlations, and trends.

You may need to remove or impute (fill in) these values before visualization.
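As a rough illustration of why this matters, the sketch below parses a small column, treats a set of markers as missing, and compares the mean with and without a -999 sentinel. The column values, the `MISSING_MARKERS` set, and the helper `parse_value` are all made up for this example; real datasets document their own conventions.

```python
from statistics import mean

# Assumption for this sketch: these strings (case-insensitive) mean "missing".
MISSING_MARKERS = {"", "null", "nan", "n/a", "na", "-999"}

def parse_value(raw):
    """Return a float, or None when the cell holds a missing-value marker."""
    text = str(raw).strip()
    if text.lower() in MISSING_MARKERS:
        return None
    return float(text)

# Hypothetical column of raw cell values, as they might arrive from a CSV file.
raw_temperatures = ["21.5", "N/A", "19.0", "-999", "", "22.5", "NaN"]

parsed = [parse_value(cell) for cell in raw_temperatures]
observed = [v for v in parsed if v is not None]
missing_count = len(parsed) - len(observed)

# Option 1: remove missing entries and summarize what was actually observed.
clean_mean = mean(observed)

# What happens if the -999 sentinel is mistaken for a real measurement.
distorted_mean = mean(observed + [-999.0])

# Option 2: impute, here with the simplest strategy (fill with the observed mean).
imputed = [v if v is not None else clean_mean for v in parsed]

print(f"missing cells: {missing_count}")
print(f"mean of observed values: {clean_mean}")
print(f"mean with sentinel left in: {distorted_mean}")
```

Mean imputation is only one of several strategies, and it is used here because it is the shortest to show, not because it is the best choice for every dataset.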

Examples:

Outliers

Outliers are data points that fall far outside the range of the majority of values. They can skew visualizations such as bar charts and histograms because the axis must stretch to include them, making it hard to see patterns among the majority of the data.
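One common way to flag candidate outliers is the interquartile range (IQR) rule, which this module does not prescribe but which makes the idea concrete. The sketch below is a minimal version under that assumption; the income figures and the function name `iqr_outliers` are hypothetical.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Split values into (inliers, outliers) using fences at k * IQR beyond the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    inliers = [v for v in values if low <= v <= high]
    outliers = [v for v in values if v < low or v > high]
    return inliers, outliers

# Hypothetical household incomes, in thousands of dollars.
incomes = [42, 45, 47, 50, 52, 55, 58, 60, 950]

inliers, outliers = iqr_outliers(incomes)
print("flagged as outliers:", outliers)
print("range without them:", min(inliers), "to", max(inliers))
```

A flagged value is not necessarily an error. It may be a genuine extreme that deserves its own annotation rather than removal.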

Examples:

Codings and Inconsistent Formats

Codings refer to how values are recorded. Inconsistent codings lead to improper grouping and inaccurate visual aggregation. Standardizing these formats is essential.
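To illustrate how inconsistent codings split one category into several, the sketch below counts a small column before and after standardizing it. The state names, the `CANONICAL` lookup table, and the `standardize` helper are invented for this example.

```python
from collections import Counter

# Hypothetical mapping from observed variants (lowercased) to one canonical label.
CANONICAL = {
    "ny": "New York",
    "n.y.": "New York",
    "new york": "New York",
    "ca": "California",
    "calif.": "California",
    "california": "California",
}

def standardize(raw):
    """Trim whitespace, ignore case, and map known variants to one label."""
    key = raw.strip().lower()
    # Unknown values are kept (trimmed) so they can be reviewed, not silently dropped.
    return CANONICAL.get(key, raw.strip())

raw_states = ["NY", "New York", " ny ", "CA", "California", "Calif.", "N.Y."]

before = Counter(raw_states)
after = Counter(standardize(s) for s in raw_states)

print("groups before:", len(before))
print("groups after:", dict(after))
```

Before standardizing, a bar chart of this column would show seven bars of height one. After standardizing, it shows the two categories that are actually present.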

Examples:

Identifying the Type of Each Field

Properly classifying the type of each field helps determine appropriate visual encodings (e.g., bar charts for categorical data, scatter plots for continuous numeric data).

String or Number

This is the first level of data typing: whether a field is textual or numeric. String fields store non-numeric data like names, categories, or descriptions.

Examples:

Numeric Fields:

Contain numbers and can be further divided into integers and decimals. Numeric fields can be aggregated (mean, median, sum), while string fields are usually grouped or counted.
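A first pass at typing a field can be automated by checking whether every non-blank cell parses as a number. The sketch below is one simple way to do that; the column names and the function `infer_field_type` are hypothetical, and real tools use more elaborate rules.

```python
def infer_field_type(values):
    """Classify a column as 'integer', 'decimal', or 'string' from its raw text cells."""
    def parses_as(cast, text):
        try:
            cast(text)
            return True
        except ValueError:
            return False

    # Blank cells are ignored so that missing entries do not force a 'string' result.
    cells = [v.strip() for v in values if v.strip() != ""]
    if cells and all(parses_as(int, c) for c in cells):
        return "integer"
    if cells and all(parses_as(float, c) for c in cells):
        return "decimal"
    return "string"

# Hypothetical columns as they might be read from a text file.
columns = {
    "name": ["Alice", "Bob", "Chandra"],
    "age": ["19", "21", "20"],
    "gpa": ["3.5", "3.8", ""],
}

field_types = {name: infer_field_type(cells) for name, cells in columns.items()}
print(field_types)
```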

Continuous vs. Discrete Numeric Fields

Continuous Fields:

Can take on any value within a range, often measured rather than counted.

Discrete Fields:

Consist of distinct and separate values, often counted. A numeric field doesn’t automatically mean it’s continuous. For instance, ZIP Code is numeric but discrete and categorical in nature.
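The ZIP Code point can be made concrete with a small sketch: the ZIP field is counted like a category, while a measured field is averaged. The records and the commute-time field are made up for illustration.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (zip_code, commute_minutes).
# ZIP codes are stored as strings so that leading zeros survive.
records = [
    ("02139", 24.5),
    ("02139", 31.0),
    ("10001", 42.5),
    ("94103", 18.0),
    ("10001", 38.0),
]

# Categorical treatment: count how many records fall in each ZIP code.
zip_counts = Counter(zip_code for zip_code, _ in records)

# Continuous treatment: averaging a measured quantity is meaningful.
average_commute = mean(minutes for _, minutes in records)

# Averaging ZIP codes runs without error, but the result identifies no real place.
meaningless = mean(int(zip_code) for zip_code, _ in records)

print("records per ZIP:", dict(zip_counts))
print("average commute (minutes):", average_commute)
```

The fact that the last calculation succeeds is the trap: software will happily aggregate any numeric-looking field, so the decision about what a field means has to be made by the analyst.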

Data Structure

It’s essential to understand how data is structured and what kind of measurements are being recorded. Different data shapes and measurement contexts dictate different types of visualizations and preprocessing steps. Below are common structural and temporal classifications of data.

Cross-Sectional Data

Cross-sectional data refers to observations collected at a single point in time, typically across different subjects, such as individuals, organizations, or regions. Cross-sectional data is static—ideal for comparisons between entities using bar charts, box plots, or maps.

Examples:

Example dataset structure, showing average freshman, sophomore, junior, and senior GPAs for different majors:

Major        FreshmanGPA  SophomoreGPA  JuniorGPA  SeniorGPA
Biology      3.2          3.3           3.4        3.5
ComputerSci  3.5          3.6           3.7        3.8
History      3.1          3.2           3.3        3.4

Longitudinal Data

Longitudinal data, closely related to time series data, involves repeated observations of the same subjects and variables over time. It reveals trends, cycles, and patterns, so visualizations should emphasize temporal dynamics.

Line charts are the most common choice for visualizing longitudinal data. Avoid bar charts here, as they tend to obscure trends and continuity.

Examples:

Example dataset structure for Longitudinal Data:

Student ID  Semester     GPA
001         Fall 2020    3.5
001         Spring 2021  3.6
002         Fall 2020    3.8
002         Spring 2021  3.7
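Before drawing a line chart from a table like this one, the rows usually need to be grouped by subject and put in chronological order. The sketch below shows one way to do that with the example rows; the `TERM_ORDER` mapping and the `semester_key` helper are assumptions of this example, since the table itself does not say how terms are ordered.

```python
from collections import defaultdict

# Rows from the example table, deliberately shuffled: (student_id, semester, gpa).
rows = [
    ("001", "Spring 2021", 3.6),
    ("002", "Fall 2020", 3.8),
    ("001", "Fall 2020", 3.5),
    ("002", "Spring 2021", 3.7),
]

# Assumed ordering of terms within a calendar year.
TERM_ORDER = {"Spring": 0, "Summer": 1, "Fall": 2}

def semester_key(semester):
    """Turn a label like 'Fall 2020' into a sortable (year, term_position) tuple."""
    term, year = semester.split()
    return int(year), TERM_ORDER[term]

# One time-ordered series per student: the shape a line chart needs.
series = defaultdict(list)
for student_id, semester, gpa in sorted(rows, key=lambda r: semester_key(r[1])):
    series[student_id].append((semester, gpa))

# Change in GPA between each student's first and last observation.
change = {sid: round(points[-1][1] - points[0][1], 2) for sid, points in series.items()}

for sid in sorted(series):
    print(sid, series[sid], "change:", change[sid])
```

Sorting the semester labels as plain text would place "Fall 2020" before "Spring 2021" only by coincidence of the alphabet, which is why the sketch builds an explicit sort key.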

Roll-Up Data

Roll-up data includes a mix of detailed and aggregated (summary) records in the same table. This is common in government or institutional data where subtotal or total rows are embedded within the raw dataset.

If not handled properly, roll-up rows can result in double-counting during analysis or incorrect visual representations. For example, including both “New York City” and “All U.S. Cities” in the same bar chart can be misleading.

Examples:

How to Handle:

Example Dataset with Roll-Up Rows:

Location       Month     Sales
New York City  January   1000
New York City  February  1200
New York City  Total     2200
Los Angeles    January   800
Los Angeles    February  900
Los Angeles    Total     1700
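The double-counting risk can be seen directly by summing this table with and without its roll-up rows. The sketch below assumes that roll-up rows are marked by the literal label "Total" in the Month column, as in the example; other datasets mark them differently.

```python
# Rows from the example table: (location, month, sales).
rows = [
    ("New York City", "January", 1000),
    ("New York City", "February", 1200),
    ("New York City", "Total", 2200),
    ("Los Angeles", "January", 800),
    ("Los Angeles", "February", 900),
    ("Los Angeles", "Total", 1700),
]

# Assumption: a roll-up row is any row whose month is the label "Total".
detail_rows = [r for r in rows if r[1] != "Total"]
rollup_rows = [r for r in rows if r[1] == "Total"]

naive_sum = sum(sales for _, _, sales in rows)           # every sale counted twice
correct_sum = sum(sales for _, _, sales in detail_rows)  # detail rows only

# Sanity check: each reported total should equal the sum of its detail rows.
recomputed = {}
for location, _, sales in detail_rows:
    recomputed[location] = recomputed.get(location, 0) + sales
mismatches = [loc for loc, _, total in rollup_rows if recomputed.get(loc) != total]

print("sum including roll-up rows:", naive_sum)
print("sum of detail rows only:", correct_sum)
print("totals that disagree with their details:", mismatches)
```

Keeping the roll-up rows in a separate list, rather than discarding them, makes it possible to use them as a check on the detail data.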

Wide vs. Tall Data

Wide Data

In wide data, each variable gets its own column. This format is often used for human readability and spreadsheet reports.

This structure is good for direct comparison between variables. However, many visualization libraries (e.g., ggplot2, Seaborn) work most naturally with data in tall format.

Example:

Student  Math  English  History
Alice    95    88       92
Bob      87    91       85

Tall Data

Tall (or long) data is more normalized: it has one row per observation, with columns for the entity, the variable name, and the value.

Tall data is better for:

Example (same data as above, in tall format):

Student  Subject  Score
Alice    Math     95
Alice    English  88
Alice    History  92
Bob      Math     87
Bob      English  91
Bob      History  85
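Converting between the two shapes is mechanical. The sketch below reshapes the example table from wide to tall and back using plain Python dictionaries; the function names `wide_to_tall` and `tall_to_wide` are made up for this example.

```python
# Wide rows from the example table, one dict per student.
wide = [
    {"Student": "Alice", "Math": 95, "English": 88, "History": 92},
    {"Student": "Bob", "Math": 87, "English": 91, "History": 85},
]

def wide_to_tall(rows, id_column, variable_name, value_name):
    """Emit one row per (entity, variable) pair."""
    tall_rows = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tall row instead
            tall_rows.append(
                {id_column: row[id_column], variable_name: column, value_name: value}
            )
    return tall_rows

def tall_to_wide(rows, id_column, variable_name, value_name):
    """Collect each entity's (variable, value) pairs back into a single row."""
    by_id = {}
    for row in rows:
        entry = by_id.setdefault(row[id_column], {id_column: row[id_column]})
        entry[row[variable_name]] = row[value_name]
    return list(by_id.values())

tall = wide_to_tall(wide, "Student", "Subject", "Score")
round_trip = tall_to_wide(tall, "Student", "Subject", "Score")

for row in tall:
    print(row)
```

In practice, data libraries provide this reshaping directly. In pandas, for example, `DataFrame.melt` goes from wide to tall and `DataFrame.pivot` goes back.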