Understand your data structure and values

This module focuses on understanding the underlying data before visualizing it. We will cover how to inspect data values for missing entries, outliers, and codings. We will also discuss different data structures, including cross-sectional, longitudinal, and roll-up data.

Outcomes:

Understand data values
- Look for missing values, outliers, codings, etc…
- Identify type:
  - String or number (integer v. decimal)
  - Continuous v. discrete

Understand data structure
- Cross-sectional data
- Longitudinal data
- Roll-up data, where rows are both details and summary roll-ups (often found in government data)
- Wide v. tall data

Links:

Good examples:

Cohort: The generational collapse of American religion

Data Values

We begin by inspecting individual cells. There are several questions we need to understand about the data before trying to visualize it.

Missing Values

Missing values are entries where data is not available. These can appear as NULL, NaN, empty strings (""), or placeholder values such as "N/A" or -999. Left unhandled, missing values can distort averages, correlations, and trends.

You may need to remove or impute (fill in) these values before visualization.

Examples:

In a dataset of students’ test scores, a missing value might indicate a student who didn’t take the test.
A survey might have skipped questions, resulting in empty fields.
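
As a minimal sketch of why placeholders matter, the Python snippet below converts raw cells to numbers and compares the mean with and without a -999 placeholder. The marker list and the sample scores are assumptions made up for illustration; a real dataset's documentation should say which markers it uses.

```python
import math
from statistics import mean

# Hypothetical placeholder markers; check your dataset's documentation.
MISSING_MARKERS = {"", "N/A", "NULL", "NaN", "-999"}

def to_number_or_none(raw):
    """Return a float, or None when the cell holds a missing-value marker."""
    if raw is None:
        return None
    text = str(raw).strip()
    if text in MISSING_MARKERS:
        return None
    value = float(text)
    return None if math.isnan(value) else value

raw_scores = ["88", "92", "", "N/A", "-999", "75"]  # made-up test scores
scores = [to_number_or_none(cell) for cell in raw_scores]

present = [s for s in scores if s is not None]
missing_count = len(scores) - len(present)

# Treating -999 as a real score drags the mean far below any plausible value.
naive_mean = mean(float(c) for c in raw_scores if c not in ("", "N/A"))
clean_mean = mean(present)

print(f"missing: {missing_count}, naive mean: {naive_mean:.1f}, clean mean: {clean_mean:.1f}")
# prints: missing: 3, naive mean: -186.0, clean mean: 85.0
```

Here the missing scores are dropped before averaging. Imputing a value instead is a separate decision that depends on why the data is missing.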

Outliers

Outliers are data points that fall far outside the range of the majority of values. They can skew visualizations such as bar charts and histograms by stretching the axis scale, making it hard to see patterns among the majority of the data.

Examples:

A person reporting an age of 150 in a demographic dataset.
A customer spending $1,000,000 in a store where the average is $200.
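
One common convention for flagging candidates, which this module does not prescribe, is the interquartile range (IQR) rule: anything more than 1.5 × IQR below the first quartile or above the third is flagged. The sketch below applies it to made-up ages; the function name and the multiplier `k` are illustrative choices.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

ages = [23, 31, 35, 38, 42, 47, 51, 58, 150]  # made-up ages
print(iqr_outliers(ages))
# prints: [150]
```

A flagged value is only a candidate. Whether to correct it, drop it, or keep it (perhaps on a log scale) depends on whether it is an entry error or a genuine extreme.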

Codings and Inconsistent Formats

Codings refer to how values are recorded. Inconsistent codings lead to improper grouping and inaccurate visual aggregation. Standardizing these formats is essential.

Examples:

Gender might be recorded as M/F, Male/Female, or even 0/1.
Dates might appear as 01/02/2023, 2023-02-01, or February 1, 2023.
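
The sketch below shows one way to standardize the two examples above. The gender mapping (in particular what 0 and 1 stand for) and the reading of 01/02/2023 as day/month/year are assumptions for illustration; a real dataset's codebook decides both.

```python
from datetime import datetime

# Hypothetical mapping; confirm what 0/1 mean in your dataset's codebook.
GENDER_CODES = {
    "m": "Male", "male": "Male", "1": "Male",
    "f": "Female", "female": "Female", "0": "Female",
}

# Assumed date layouts; "01/02/2023" is read as day/month/year here.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%B %d, %Y")

def standardize_gender(raw):
    """Map a raw code to one label, or None if the code is unknown."""
    return GENDER_CODES.get(str(raw).strip().lower())

def standardize_date(raw):
    """Try each assumed layout and return an ISO 8601 date string, else None."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None

print([standardize_gender(g) for g in ["M", "Female", 0, "f"]])
# prints: ['Male', 'Female', 'Female', 'Female']
print([standardize_date(d) for d in ["01/02/2023", "2023-02-01", "February 1, 2023"]])
# prints: ['2023-02-01', '2023-02-01', '2023-02-01']
```

Returning None for unrecognized codes, rather than guessing, keeps unexpected values visible so they can be reviewed. If the source uses month/day/year, the second layout would need to be "%m/%d/%Y" instead.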

Identifying the Type of Each Field

Properly classifying the type of each field helps determine appropriate visual encodings (e.g., bar charts for categorical data, scatter plots for continuous numeric data).

String or Number

This is the first level of data typing: whether a field is textual or numeric.

String Fields:

Store non-numeric data like names, categories, or descriptions.

Examples: City, Product Name, Status = “Open/Closed”

Numeric Fields:

Contain numbers and can be further divided into integers and decimals. Numeric fields can be aggregated (mean, median, sum), while string fields are usually grouped or counted.

Integer examples: Age = 21, Number of Items Sold = 5
Decimal examples: Temperature = 98.6, Sales Revenue = 1234.56
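
When data arrives as text (for example from a CSV file), a first pass at typing can be automated. The sketch below is a simple heuristic, not a standard algorithm, and the function names are hypothetical: it labels a column as integer, decimal, or string based on whether every cell parses.

```python
def classify_value(raw):
    """Label a raw text cell as 'integer', 'decimal', or 'string'."""
    text = raw.strip()
    try:
        int(text)
        return "integer"
    except ValueError:
        pass
    try:
        float(text)
        return "decimal"
    except ValueError:
        return "string"

def classify_column(cells):
    """Pick the narrowest type that fits every cell in the column."""
    kinds = {classify_value(cell) for cell in cells}
    if kinds == {"integer"}:
        return "integer"
    if kinds <= {"integer", "decimal"}:
        return "decimal"
    return "string"

# Made-up sample columns using the field names from the examples above.
columns = {
    "Age": ["21", "34"],
    "Temperature": ["98.6", "99"],
    "City": ["Austin", "Boston"],
}
print({name: classify_column(cells) for name, cells in columns.items()})
# prints: {'Age': 'integer', 'Temperature': 'decimal', 'City': 'string'}
```

A heuristic like this still needs a human check: an identifier such as 02139 parses as an integer even though it behaves like a label, not a quantity.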

Continuous vs. Discrete Numeric Fields

Continuous Fields:

Can take on any value within a range, often measured rather than counted.

Examples: Height, Weight, Temperature, Time
Suitable visualizations: line charts, histograms, density plots, scatter plots

Discrete Fields:

Consist of distinct and separate values, often counted. A field being numeric does not automatically make it continuous. For instance, ZIP Code is numeric but discrete and categorical in nature.

Examples: Number of children, Day of the week (when coded 1 to 7), Star rating = 1 to 5
Suitable visualizations: bar charts, dot plots, pie charts

Data Structure

It’s essential to understand how data is structured and what kind of measurements are being recorded. Different data shapes and measurement contexts dictate different types of visualizations and preprocessing steps. Below are common structural and temporal classifications of data.

Cross-Sectional Data

Cross-sectional data refers to observations collected at a single point in time, typically across different subjects, such as individuals, organizations, or regions. Cross-sectional data is static—ideal for comparisons between entities using bar charts, box plots, or maps.

Examples:

A census dataset showing the population of each U.S. state in the year 2020.
A marketing dataset showing ad spend across different campaigns for one specific quarter.
Student test scores across multiple schools for a single semester.

Example dataset structure (average freshman, sophomore, junior, and senior GPAs for different majors):

Major	FreshmanGPA	SophomoreGPA	JuniorGPA	SeniorGPA
Biology	3.2	3.3	3.4	3.5
ComputerSci	3.5	3.6	3.7	3.8
History	3.1	3.2	3.3	3.4

Longitudinal Data

Longitudinal data, closely related to time series and panel data, involves repeated observations of the same subjects or variables over time. It reveals trends, cycles, and patterns, so visualizations should emphasize temporal dynamics.

Line charts are the usual choice for longitudinal data because connecting the points emphasizes trends and continuity. Avoid bar charts for longitudinal data, as they obscure both.

Examples:

Daily stock prices for Apple over the past 5 years.
Monthly unemployment rates by country.
A student’s GPA recorded each semester.

Example dataset structure for Longitudinal Data:

Student ID	Semester	GPA
001	Fall 2020	3.5
001	Spring 2021	3.6
002	Fall 2020	3.8
002	Spring 2021	3.7
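
Before plotting data shaped like the table above, each subject's observations need to be grouped and put in time order. The sketch below does this for the same four rows. The term ordering and the helper name `semester_key` are assumptions for illustration, and student IDs are kept as strings so the leading zeros survive.

```python
from collections import defaultdict

rows = [
    ("001", "Fall 2020", 3.5),
    ("001", "Spring 2021", 3.6),
    ("002", "Fall 2020", 3.8),
    ("002", "Spring 2021", 3.7),
]

# Assumed ordering of terms within a calendar year.
TERM_ORDER = {"Spring": 0, "Summer": 1, "Fall": 2}

def semester_key(label):
    """Turn 'Fall 2020' into a sortable (year, term position) pair."""
    term, year = label.split()
    return int(year), TERM_ORDER[term]

# Group observations by student so each student becomes one series.
series = defaultdict(list)
for student_id, semester, gpa in rows:
    series[student_id].append((semester, gpa))

for student_id, points in series.items():
    points.sort(key=lambda p: semester_key(p[0]))
    change = round(points[-1][1] - points[0][1], 2)
    print(student_id, [s for s, _ in points], f"change: {change:+.2f}")
# prints:
# 001 ['Fall 2020', 'Spring 2021'] change: +0.10
# 002 ['Fall 2020', 'Spring 2021'] change: -0.10
```

Sorting on a derived key matters because semester labels sorted alphabetically would not follow calendar order. Each sorted series then corresponds to one line in a line chart.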

Roll-Up Data

Roll-up data includes a mix of detailed and aggregated (summary) records in the same table. This is common in government or institutional data where subtotal or total rows are embedded within the raw dataset.

If not handled properly, roll-up rows can result in double-counting during analysis or incorrect visual representations. For example, including both “New York City” and “All U.S. Cities” in the same bar chart can be misleading.

Examples:

A spreadsheet listing monthly spending by department, followed by yearly totals.
Education test data showing school-level results and then state-wide averages as separate rows.
Crime statistics by city with a final row for national totals.

How to Handle:

Flag or remove summary rows before aggregation.
Separate detail and summary records for different visualizations.

Example Dataset with Roll-Up Rows:

Location	Month	Sales
New York City	January	1000
New York City	February	1200
New York City	Total	2200
Los Angeles	January	800
Los Angeles	February	900
Los Angeles	Total	1700
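
Using the table above, the sketch below separates detail rows from summary rows and shows the double-counting that occurs when they are summed together. It assumes summary rows are marked by "Total" in the Month column, which holds for this example but varies between datasets.

```python
rows = [
    {"Location": "New York City", "Month": "January", "Sales": 1000},
    {"Location": "New York City", "Month": "February", "Sales": 1200},
    {"Location": "New York City", "Month": "Total", "Sales": 2200},
    {"Location": "Los Angeles", "Month": "January", "Sales": 800},
    {"Location": "Los Angeles", "Month": "February", "Sales": 900},
    {"Location": "Los Angeles", "Month": "Total", "Sales": 1700},
]

# Assumption: summary rows are marked by "Total" in the Month column.
detail = [r for r in rows if r["Month"] != "Total"]
summary = [r for r in rows if r["Month"] == "Total"]

naive_total = sum(r["Sales"] for r in rows)     # double-counts the roll-ups
detail_total = sum(r["Sales"] for r in detail)  # counts each sale once

# Cross-check: each embedded total should equal the sum of its detail rows.
mismatches = [
    s["Location"]
    for s in summary
    if s["Sales"] != sum(r["Sales"] for r in detail if r["Location"] == s["Location"])
]

print(naive_total, detail_total, mismatches)
# prints: 7800 3900 []
```

Keeping the summary rows in their own list, rather than discarding them, makes the cross-check possible and leaves them available for a separate totals chart.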

Wide vs. Tall Data

Wide Data

In wide data, each variable gets its own column. This format is often used for human readability and spreadsheet reports.

This structure is good for direct comparison between variables. However, most visualization libraries (e.g., ggplot2, Seaborn) expect data in tall format.

Example:

Student	Math	English	History
Alice	95	88	92
Bob	87	91	85

Tall Data

Tall (or long) data is more normalized: there is one row per observation, with columns for the entity, the variable name, and the value.

Tall data is better for:

Faceted plots (small multiples)
Grouped summaries
Most tidy-data-based plotting systems

Example (same data as above, in tall format):

Student	Subject	Score
Alice	Math	95
Alice	English	88
Alice	History	92
Bob	Math	87
Bob	English	91
Bob	History	85
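
The conversion between the two tables above can be sketched in a few lines of plain Python. The function name `wide_to_tall` and its parameters are hypothetical; data libraries provide their own reshaping functions for this (pandas has `melt`, for example).

```python
wide = [
    {"Student": "Alice", "Math": 95, "English": 88, "History": 92},
    {"Student": "Bob", "Math": 87, "English": 91, "History": 85},
]

def wide_to_tall(rows, id_column, variable_name, value_name):
    """Emit one record per (entity, variable) pair from wide rows."""
    tall = []
    for row in rows:
        for column, value in row.items():
            if column == id_column:
                continue  # the identifier is repeated on every tall row
            tall.append({
                id_column: row[id_column],
                variable_name: column,
                value_name: value,
            })
    return tall

tall = wide_to_tall(wide, "Student", "Subject", "Score")
for record in tall:
    print(record)
# first line printed: {'Student': 'Alice', 'Subject': 'Math', 'Score': 95}
```

Two wide rows with three subject columns become six tall rows, matching the tall table above. In that form, Subject can be mapped directly to color or facets in a plot.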