course_model

Modeling to predict the future

Our prior course focused on visualization as a means of understanding data. This course builds on that foundation but focuses mainly on creating models. A model is a simplified version of the real world, expressed in mathematical terms (e.g., y = x + 1).

Outcomes

Define key terms
Compare statistical modeling and machine learning
Explain p-values and confidence intervals (traditional statistical approach)
Simulate values and measure the results (modern machine-learning approach)

Key Terms

Independent variables: the variables we use to predict the target (also called features or predictors)
Dependent variable: the variable we want to predict (also called target or outcome)
Model: a mathematical representation of the relationship between independent variables and the dependent variable. In its simplest form, y = x.
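
To make these terms concrete, here is a minimal Python sketch. The function name, the variable names (hours studied, exam score), and the coefficients are all invented for illustration; they do not come from course data.

```python
# Hypothetical example: predict an exam score from hours studied.
# The intercept (50) and slope (5) are made-up numbers, not fitted values.

def predict_score(hours_studied):
    """A model: maps the independent variable to a predicted dependent variable."""
    return 50 + 5 * hours_studied

hours = [0, 1, 2, 3]                            # independent variable (feature)
predicted = [predict_score(h) for h in hours]   # predicted dependent variable (target)
print(predicted)
```

Later in the course the intercept and slope are estimated from data rather than chosen by hand.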

Approaches to Model Building

We have two different approaches to building a model:

Statistical modeling: build interpretable models of the relationships between variables in order to understand the data
Machine learning: build models, often complex ones, that learn patterns from large datasets rather than from relationships we specify by hand

Statistical modeling aims to understand relationships and test hypotheses, while machine learning aims to make accurate predictions, even at the cost of interpretability.

Data Size: statistical modeling often works with smaller datasets, while machine learning typically requires larger datasets.
Model Complexity: statistical models are often simpler and more interpretable, while machine learning models can be more complex and less interpretable.

After we create a model, we want to evaluate its performance:

Accuracy: how well does the model predict the target variable
Generalizability: how well does the model perform on new data, not just the data it was trained on. A model that generalizes poorly has usually overfit (also called overlearning): it memorized the training data instead of learning the underlying patterns.
- We guard against overfitting by using train/test splits in ML, or with p-values in traditional statistical modeling.
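
The difference between memorizing data and learning a pattern shows up when a model is scored on data it has not seen. The sketch below is an illustration under stated assumptions: the data are simulated from an invented rule (y = 2x + 1 plus noise), and the "memorizer" is a deliberately bad model written for this example, not a method from the course.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

# Simulated data (an assumption for illustration): y = 2x + 1 plus random noise.
xs = [random.uniform(0, 10) for _ in range(1000)]
data = [(x, 2 * x + 1 + random.gauss(0, 2)) for x in xs]

# The rows are already in random order, so holding out the last 25%
# gives a test set the models never see while being fitted.
train, test = data[:750], data[750:]

def fit_line(rows):
    """Least-squares line for one feature; returns a prediction function."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in rows)
             / sum((x - mean_x) ** 2 for x, _ in rows))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(rows):
    """Deliberately overfit: answer with the y of the closest training x."""
    return lambda x: min(rows, key=lambda row: abs(row[0] - x))[1]

def mean_abs_error(model, rows):
    """Average absolute gap between predictions and actual values."""
    return sum(abs(model(x) - y) for x, y in rows) / len(rows)

for name, fit in [("line", fit_line), ("memorizer", fit_memorizer)]:
    model = fit(train)
    print(name,
          "train error:", round(mean_abs_error(model, train), 2),
          "test error:", round(mean_abs_error(model, test), 2))
```

The memorizer reproduces its training rows exactly, so its training error is zero, yet it should do worse than the simple line on the held-out rows because it also memorized the noise. Comparing the two error columns is what the train/test split is for.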

EDA Process

The EDA process (exploratory data analysis) is similar for both statistical modeling and machine learning. It is covered in our prior class; see the EDA chapter. If you did not take that class with me, please go back and review the data structure chapter.

We follow these steps:

Load dataset
One field at a time, work through these steps:
- Understand the data structure (wide, long, roll-up, aggregated, longitudinal, cross-sectional)
- Clean the field
  - Check data type
  - Find missing values, duplicates
- Visualize distribution (find outliers)
- Visualize relationships between variables
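
The cleaning checks in the list above can be sketched in plain Python. This is only an illustration: the records and field names are hypothetical, the 1.5 × IQR rule is one common convention for flagging outliers, and a dataframe library would offer more convenient equivalents of each check.

```python
import statistics

# Hypothetical records standing in for a loaded dataset (not course data).
rows = [
    {"id": 1, "age": 34, "income": 52000},
    {"id": 2, "age": 29, "income": 48000},
    {"id": 3, "age": None, "income": 51000},   # missing age
    {"id": 2, "age": 29, "income": 48000},     # exact duplicate of id 2
    {"id": 4, "age": 41, "income": 50000},
    {"id": 5, "age": 38, "income": 990000},    # suspiciously large income
    {"id": 6, "age": 45, "income": 55000},
]

# Check data type: which Python types appear in each field?
types_by_field = {
    field: {type(row[field]).__name__ for row in rows} for field in rows[0]
}

# Find missing values: count None per field.
missing_by_field = {
    field: sum(row[field] is None for row in rows) for field in rows[0]
}

# Find duplicates: rows whose full contents were already seen.
seen, duplicates = set(), []
for row in rows:
    key = tuple(row.items())
    if key in seen:
        duplicates.append(row)
    seen.add(key)

# Find outliers in one field: values beyond 1.5 * IQR from the quartiles.
incomes = [row["income"] for row in rows if row["income"] is not None]
q1, _, q3 = statistics.quantiles(incomes, n=4)
low, high = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
outliers = [value for value in incomes if value < low or value > high]

print("types:", types_by_field)
print("missing:", missing_by_field)
print("duplicates:", duplicates)
print("outliers:", outliers)
```

A flagged value is a prompt to investigate, not an automatic deletion; a plot of the distribution helps decide whether it is an error or a genuine extreme.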

Once the data is clean, we can build a model:

Select features (independent variables) and target (dependent variable)
If using ML, split data into training and testing sets
Create a model
Evaluate the model
If needed, repeat the process.
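
The modeling steps above can be sketched end to end. Everything specific here is an assumption for illustration: the dataset is simulated, the field names (sqft, rooms, price) are hypothetical, the model is a one-feature least-squares line, and the 80/20 split and RMSE metric are common choices rather than course requirements.

```python
import random

random.seed(1)  # fixed seed so the split is repeatable

# Hypothetical dataset: field names and the generating rule are made up.
rows = [{"sqft": s, "rooms": s // 400, "price": 100 * s + random.gauss(0, 20000)}
        for s in range(800, 2800, 20)]

# 1. Select the feature (independent variable) and target (dependent variable).
feature, target = "sqft", "price"

# 2. Split into training and testing sets (80/20 after shuffling).
random.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]

# 3. Create a model: a least-squares line fitted on the training rows only.
xs = [r[feature] for r in train]
ys = [r[target] for r in train]
mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

def predict(x):
    return intercept + slope * x

# 4. Evaluate on the held-out test rows, compared with a baseline that
#    always predicts the mean of the training target.
def rmse(predictions, actuals):
    """Root mean squared error: typical size of a prediction miss."""
    return (sum((p - a) ** 2 for p, a in zip(predictions, actuals)) / len(actuals)) ** 0.5

actual = [r[target] for r in test]
model_rmse = rmse([predict(r[feature]) for r in test], actual)
baseline_rmse = rmse([mean_y] * len(test), actual)
print("model RMSE:", round(model_rmse), "baseline RMSE:", round(baseline_rmse))
```

If the model does not clearly beat the baseline, that is the signal to repeat the process, for example by choosing a different feature or going back to the cleaning steps.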

Finally, we communicate our results:

Presentation
Written memo
Poster