Linear Regression is our most important prediction tool. It uses one or more numerical independent variables to predict a single numerical output (the dependent variable). We generally fit it with the OLS (ordinary least squares) algorithm.
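As a quick sketch of the underlying model (standard notation, not specific to any dataset): OLS fits an intercept plus one coefficient per input, choosing the values that minimize the sum of squared errors.

y = b0 + b1*x1 + b2*x2 + ... + e

Here b0 is the intercept, each b_i is the coefficient on an independent variable, and e is the error term.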
Outcomes:
Links:
Good Resources:
When to use Linear Regression:

Interpretation:
Weaknesses / Assumptions:
First, we want to make sure that we understand our data. Begin by visually scanning the table. Then, use a few pandas functions to summarize the types and values present in the data.
# Check field types and values
import pandas as pd
import numpy as np

# Sales table
df_raw = pd.DataFrame({
    'sales_id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    'sales_as_text': ['1,000', '2,000', '2,500', '10,000', '1,900', np.nan, '3,000', '4,000', '5,000', '6,000'],
    'profit': [0, 400, 1600, 6250, 240, np.nan, 900, 1600, 500, 600],
    'office_size': [16, 1, 2, 3, 1, 3, 2, 4, 5, 2],
    'closed': [True, False, True, False, False, False, True, True, False, True],
    'state': ['ca', 'ca', np.nan, 'NY', 'ca', np.nan, 'NY', 'NY', 'ca', 'ca'],
})

print("DataFrame dtypes:")
print(df_raw.dtypes)
df_raw.describe()
DataFrame dtypes:
sales_id           int64
sales_as_text     object
profit           float64
office_size        int64
closed              bool
state             object
dtype: object
| | sales_id | profit | office_size |
|---|---|---|---|
| count | 10.00000 | 9.000000 | 10.000000 |
| mean | 5.50000 | 1343.333333 | 3.900000 |
| std | 3.02765 | 1922.862450 | 4.433459 |
| min | 1.00000 | 0.000000 | 1.000000 |
| 25% | 3.25000 | 400.000000 | 2.000000 |
| 50% | 5.50000 | 600.000000 | 2.500000 |
| 75% | 7.75000 | 1600.000000 | 3.750000 |
| max | 10.00000 | 6250.000000 | 16.000000 |
You want to look for the distribution of values in each column, and the relationships between columns.
import matplotlib.pyplot as plt
import seaborn as sns

number_columns = df_raw.select_dtypes(include=['number']).columns.tolist()
text_columns = df_raw.select_dtypes(include=['object', 'category']).columns.tolist()

# Plot a histogram and boxplot for each numeric column
for col in number_columns:
    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    sns.histplot(df_raw[col])
    plt.title(f'Histogram of {col}')
    plt.subplot(1, 2, 2)
    sns.boxplot(x=df_raw[col].dropna())
    plt.title(f'Boxplot of {col}')
    plt.tight_layout()
    plt.show()

# Plot a chart showing the distribution of values for each text column
for col in text_columns:
    plt.figure(figsize=(6, 4))
    sns.countplot(y=df_raw[col], order=df_raw[col].value_counts().index)
    plt.title(f'Value counts of {col}')
    plt.show()

# Plot a pairplot of the numeric variables.
# Note that we lower the points' alpha (here 0.4) to avoid overplotting. This also works well for large datasets.
sns.pairplot(df_raw, kind='reg', height=1.5, plot_kws={'scatter_kws': {'alpha': 0.4}})
plt.show()

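The model below uses two columns that do not exist in the raw table: sales_as_number and is_ny. Here is a minimal cleaning sketch that derives them; the column names come from the model itself, but the exact cleaning steps are assumptions inferred from the raw data.

df = df_raw.copy()

# Strip the thousands separators and convert the text sales figures to numbers
df['sales_as_number'] = pd.to_numeric(df['sales_as_text'].str.replace(',', ''), errors='coerce')

# Encode state as a binary indicator (1 = NY, 0 = otherwise), normalizing case
df['is_ny'] = (df['state'].str.upper() == 'NY').astype(int)

# Drop rows with missing values so OLS gets a complete design matrix
df = df.dropna(subset=['sales_as_number', 'profit'])

Dropping the one row with missing sales and profit leaves 9 observations, which matches the model summary below.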
With the cleaned data in hand, fit the model:
import statsmodels.api as sm

X = df[['office_size', 'sales_as_number', 'is_ny']]
y = df['profit']

# Fit the model
X = sm.add_constant(X)  # Adds a constant (intercept) term to the predictors
model = sm.OLS(y, X).fit()
predictions = model.predict(X)
print(model.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                 profit   R-squared:                       0.753
Model:                            OLS   Adj. R-squared:                  0.604
Method:                 Least Squares   F-statistic:                     5.071
Date:                Tue, 16 Dec 2025   Prob (F-statistic):             0.0562
Time:                        16:02:09   Log-Likelihood:                -74.009
No. Observations:                   9   AIC:                             156.0
Df Residuals:                       5   BIC:                             156.8
Df Model:                           3
Covariance Type:            nonrobust
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
const            -928.0026    949.517     -0.977      0.373   -3368.814    1512.809
office_size         2.5116    153.226      0.016      0.988    -391.367     396.391
sales_as_number     0.4813      0.176      2.734      0.041       0.029       0.934
is_ny            1109.9673    968.286      1.146      0.304   -1379.092    3599.027
==============================================================================
Omnibus:                        0.989   Durbin-Watson:                   0.450
Prob(Omnibus):                  0.610   Jarque-Bera (JB):                0.607
Skew:                           0.051   Prob(JB):                        0.738
Kurtosis:                       1.731   Cond. No.                     1.14e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.14e+04. This might indicate that there are
strong multicollinearity or other numerical problems.
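Note [2] flags a large condition number, hinting at multicollinearity. One common follow-up, not part of the original output, is to compute variance inflation factors (VIFs) for the predictors; values well above roughly 5-10 suggest a predictor is highly collinear with the others. A minimal sketch using statsmodels:

from statsmodels.stats.outliers_influence import variance_inflation_factor

# VIF for each predictor (skip the constant, whose VIF is not meaningful)
for i, name in enumerate(X.columns):
    if name != 'const':
        print(name, variance_inflation_factor(X.values, i))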
Important parts of the summary output:
What does the coefficient table tell us about each variable in our model?

| Column | Meaning |
|---|---|
| **P>\|t\|** | The p-value: the probability of observing a t-statistic at least this extreme if the true coefficient were zero. A low p-value (typically < 0.05) suggests a real relationship with the dependent variable. |
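The per-coefficient p-values can also be pulled directly off the fitted model:

# p-values for each coefficient, matching the P>|t| column in the summary
print(model.pvalues)
# Only sales_as_number (p ≈ 0.041) falls below the usual 0.05 threshold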
RMSE (root mean squared error) is the square root of the mean of the squared errors ([a good reference on interpreting RMSE](https://www.statology.org/how-to-interpret-rmse/)). For example, with observed values (10, 12, 8) and a prediction of 10 for each:
1. Find the difference between each prediction and actual value: (10 - 10), (12 - 10), (8 - 10) -> (0, 2, -2)
2. Square and sum the differences: 0^2 + 2^2 + (-2)^2 = 8
3. Divide by the number of observations and take the square root: (8 / 3)^0.5 ≈ 1.63

So, the RMSE is the square root of the mean squared error. It is, roughly, the average distance between observed data values and predictions.
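A quick numeric check of the walkthrough above (toy numbers, not the sales data):

import numpy as np

observed = np.array([10, 12, 8])
predicted = np.array([10, 10, 10])
resid = observed - predicted             # (0, 2, -2)
rmse = np.sqrt(np.mean(resid ** 2))      # sqrt(8 / 3)
print(rmse)                              # ≈ 1.63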
The coefficient of determination (R²) tells us the proportion of variance in our dependent variable that can be explained by our independent variables.
It ranges from 0 to 1. Generally, the higher the number, the better the prediction.
We will always use the adjusted R² when using multiple predictors. This applies a small penalty for each additional column.
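The adjustment is a simple formula; a small sketch using the figures from the summary above (n = observations, k = predictors):

# Adjusted R² = 1 - (1 - R²) * (n - 1) / (n - k - 1)
r2, n, k = 0.753, 9, 3
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(adj_r2)  # ≈ 0.605; matches the summary's Adj. R-squared of 0.604 up to rounding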
# Calculate error metrics

# The residuals are the differences between the observed and predicted values
resid = model.resid
# The fitted values are the predicted values from the model
fitted = model.fittedvalues

# Calculate RMSE
rmse_value = np.sqrt(np.mean(resid ** 2))
print(f"RMSE: {rmse_value}")

# Calculate R² (the adjusted version is available as model.rsquared_adj)
r2_value = model.rsquared
print(f"R²: {r2_value}")
# Plot residuals
sns.residplot(x=fitted, y=resid, lowess=True, scatter_kws={'alpha': 0.7})
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs Fitted')
plt.axhline(0, color='red', linestyle='--')
plt.show()
RMSE: 901.6657510627008
R²: 0.7526298838455678
