course_model

Understand your data and its distribution

Understand the type of data you are using! This is necessary to choose the right model.

Outcomes:

Links:

Good examples:

Data Types

There are several types of data. The main types are continuous, discrete, and categorical. These types describe what the data means, not how it is stored, so they do not map one-to-one onto Python data types such as int, float, and str.
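As a minimal sketch of that mismatch, the snippet below uses made-up values (all variable names are hypothetical) to show that the Python type alone does not tell you the data type. A postal code is stored as an int but behaves like a category.

```python
# Hypothetical example values; names and numbers are illustrative only.
height_cm = 172.4      # float, continuous measurement
pet_count = 2          # int, discrete count
postal_code = 90210    # int in Python, but categorical in meaning
blood_type = "AB"      # str, categorical

# Python will happily average categorical codes, but the result means nothing.
postal_codes = [90210, 10001, 60614]
meaningless_mean = sum(postal_codes) / len(postal_codes)
print(type(postal_code).__name__, meaningless_mean)
```

The calculation runs without error, which is why the meaning of a column has to be checked by a person rather than inferred from its storage type.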

Continuous data

Continuous data: Numerical data that can take any value within a range, so its precision is limited by the measuring tool rather than by the underlying value. Good examples are height and weight.

Visualize with a histogram, line chart, or scatterplot. Continuous data does not suit a bar chart with one bar per value: almost every value is unique, so the chart fills with many very small bars that are hard to read.
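A histogram works by grouping values into bins before counting them. The sketch below illustrates that idea with a text histogram built from the standard library. The heights and the bin width of 5 are assumptions chosen for illustration, and in practice a plotting library would draw the chart.

```python
from collections import Counter

# Hypothetical heights in centimetres; every value is unique.
heights_cm = [158.2, 161.7, 163.4, 165.0, 166.9, 168.3, 169.1,
              170.5, 171.8, 172.4, 174.0, 176.6, 179.3, 183.1]

bin_width = 5  # assumed bin width, chosen for illustration


def bin_start(value, width):
    """Return the lower edge of the bin that contains value."""
    return int(value // width) * width


# Count how many values fall into each bin.
bin_counts = Counter(bin_start(h, bin_width) for h in heights_cm)

# Print one row per bin, lowest bin first.
for start in sorted(bin_counts):
    print(f"{start}-{start + bin_width}: {'#' * bin_counts[start]}")
```

Counting the raw values instead would give 14 bars of height 1, which is the unreadable bar chart described above.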

Discrete data

Discrete data: Data with a limited number of options (text or numbers).

Visualize by counting how often each option occurs and plotting the counts as a bar chart.
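A minimal sketch of the counting step, using `collections.Counter` from the standard library. The survey answers are made up, and the text bars stand in for a real bar chart.

```python
from collections import Counter

# Hypothetical survey answers: number of pets per household.
pets_per_household = [0, 1, 1, 2, 0, 3, 1, 0, 2, 1]

# Count how often each option occurs.
pet_counts = Counter(pets_per_household)

# One bar per option, in ascending order of the option.
for option in sorted(pet_counts):
    print(f"{option}: {'#' * pet_counts[option]}")
```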

Categorical data

Categorical data: Values that describe a group or category, with a limited number of options that have no meaningful numeric value.

Visualize with a bar chart or pie chart.
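A pie chart shows each category's share of the total, so the underlying calculation is a count divided by the number of observations. The sketch below assumes a made-up list of blood types.

```python
from collections import Counter

# Hypothetical categorical observations.
blood_types = ["A", "O", "O", "B", "A", "O", "AB", "O"]

category_counts = Counter(blood_types)
total = sum(category_counts.values())

# Share of each category: this is what each pie slice represents.
shares = {category: count / total for category, count in category_counts.items()}

for category, share in sorted(shares.items()):
    print(f"{category}: {share:.1%}")
```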

Descriptive Measures

After you understand your data type, you should calculate descriptive statistics. These summarize your data, and can be very useful for both understanding your data and communicating results.
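As a minimal sketch, common descriptive statistics for a continuous variable can be computed with Python's built-in `statistics` module. The heights are made up, and the set of measures shown is one reasonable choice rather than a fixed list.

```python
import statistics

# Hypothetical heights in centimetres.
sample_heights_cm = [158.2, 165.0, 168.3, 170.5, 172.4, 176.6, 183.1]

summary = {
    "count": len(sample_heights_cm),
    "mean": statistics.mean(sample_heights_cm),
    "median": statistics.median(sample_heights_cm),
    "stdev": statistics.stdev(sample_heights_cm),  # sample standard deviation
    "min": min(sample_heights_cm),
    "max": max(sample_heights_cm),
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

For categorical data, the mean and standard deviation are not meaningful. Counts and the most frequent value (`statistics.mode`) are the usual substitutes.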

Mean versus average

Data Distributions

After understanding your data and calculating descriptive statistics, look at each variable's distribution. This helps you choose the right model, because many models assume a particular distribution.

This normally requires some form of visualization, such as a histogram or density plot.
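Alongside the plot, a quick numeric check can hint at the shape of a distribution. The sketch below compares the mean and median of a made-up, right-skewed income sample. A mean well above the median often signals a long right tail, though it does not replace looking at the histogram.

```python
import statistics

# Hypothetical yearly incomes in thousands; one large value creates a right tail.
incomes_k = [21, 24, 25, 27, 28, 30, 31, 33, 36, 40, 55, 120]

income_mean = statistics.mean(incomes_k)
income_median = statistics.median(incomes_k)

# The few large values pull the mean upward while the median barely moves.
print(f"mean={income_mean:.1f}, median={income_median:.1f}")
if income_mean > income_median:
    print("Mean above median: the data may be right-skewed.")
```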

3-minute data science Normal distribution

At times, you may need to transform your data to better fit a model. Common transformations include: