Introduction
Simulating a linear regression model is a powerful way to understand its mechanics and evaluate statistical methods. In this post, you’ll learn how to:
- Generate synthetic data for a simple linear regression model.
- Visualize the simulated data.
- Fit a model to the data and compare the results with the true parameters.
Step 1: The Structure of a Linear Regression Model
A simple linear regression model has the form:
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Where:
- \( y \): Dependent variable (response)
- \( \beta_0 \): Intercept
- \( \beta_1 \): Slope (effect of \( x \) on \( y \))
- \( x \): Independent variable (predictor)
- \( \epsilon \): Random error, often assumed to follow a normal distribution with mean 0 and standard deviation \( \sigma \).
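If the error is normal and independent of the predictor (an assumption, as noted in the list above), the same model can be read as a statement about the conditional distribution of the response. A short sketch of that equivalent form:

\[ y \mid x \sim \mathcal{N}\left(\beta_0 + \beta_1 x,\ \sigma^2\right), \qquad \mathbb{E}[y \mid x] = \beta_0 + \beta_1 x, \qquad \operatorname{Var}(y \mid x) = \sigma^2 \]

For example, with an intercept of 5 and a slope of 2, an observation at \( x = 3 \) has expected response \( 5 + 2 \cdot 3 = 11 \), and individual draws scatter around 11 with standard deviation \( \sigma \).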
Step 2: Simulating Data
To simulate data, you’ll need to define the parameters \( \beta_0 \), \( \beta_1 \), and \( \sigma \), and then generate \( x \) and \( \epsilon \).
Example: Simulating a Simple Linear Regression Model
# Parameters
n <- 100 # Number of observations
beta_0 <- 5 # Intercept
beta_1 <- 2 # Slope
sigma <- 1 # Standard deviation of error
# Simulate data
set.seed(123) # For reproducibility
x <- runif(n, 0, 10) # Random predictor variable
epsilon <- rnorm(n, mean = 0, sd = sigma) # Random error
y <- beta_0 + beta_1 * x + epsilon # Response variable
# Combine into a data frame
data <- data.frame(x = x, y = y)
# View the first few rows
head(data)
Step 3: Visualizing the Simulated Data
Visualizing the data can help ensure it aligns with your expectations.
library(ggplot2)
# Scatter plot of x and y
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Simulated Linear Regression Data",
       x = "Predictor (x)",
       y = "Response (y)") +
  theme_minimal()
Step 4: Fitting a Linear Model
Once the data is simulated, you can fit a linear regression model using lm() and compare the estimated coefficients to the true parameters.
# Fit a linear model
model <- lm(y ~ x, data = data)
# Summary of the model
summary(model)
# Compare true and estimated parameters
true_params <- c(beta_0 = beta_0, beta_1 = beta_1)
estimated_params <- coef(model)
# Display parameters
true_params
estimated_params
Step 5: Exploring the Impact of Error
You can experiment with different values of \( \sigma \) (the standard deviation of the error term) to see how noise affects the fit of the model.
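One way to run such an experiment is to refit the model over a small grid of error standard deviations and collect a few summary numbers for each fit. The following is only a sketch: the grid values and the helper name fit_for_sigma are illustrative assumptions, and the parameters are re-declared so the block runs on its own.

```r
# Sketch: refit the model over a grid of error standard deviations.
# The grid values and the helper fit_for_sigma() are illustrative assumptions.
set.seed(123)
n <- 100        # Number of observations
beta_0 <- 5     # True intercept
beta_1 <- 2     # True slope
x <- runif(n, 0, 10)  # Same predictor values reused for every sigma

fit_for_sigma <- function(sigma) {
  # Simulate a response with the requested noise level and fit the model
  y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)
  fit <- summary(lm(y ~ x))
  data.frame(
    sigma = sigma,
    slope = fit$coefficients["x", "Estimate"],
    slope_se = fit$coefficients["x", "Std. Error"],
    residual_se = fit$sigma,   # Estimated error standard deviation
    r_squared = fit$r.squared
  )
}

sigma_grid <- c(0.5, 1, 2, 5, 10)
results <- do.call(rbind, lapply(sigma_grid, fit_for_sigma))
print(results)
```

If the simulation behaves as expected, residual_se should track sigma closely, slope_se should grow in proportion to it (the predictor values are held fixed), and r_squared should generally fall as the noise increases.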
Example: Increasing \( \sigma \)
# Increase sigma
sigma_high <- 5
# Simulate new data
epsilon_high <- rnorm(n, mean = 0, sd = sigma_high)
y_high <- beta_0 + beta_1 * x + epsilon_high
# Fit a new model
model_high <- lm(y_high ~ x)
# Compare summaries
summary(model_high)
Conclusion
Simulating a linear regression model is a versatile approach to understanding its properties and exploring how different factors, like noise, impact model performance. Because you choose the true parameters yourself, you can check directly how well a fitted model recovers them and how that recovery degrades when the model's assumptions are strained.
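To go one step further, you can repeat the whole simulation many times and look at how the slope estimate varies from run to run. This is a minimal sketch under assumed settings: 1000 repetitions, the same parameters as in Step 2, and the illustrative names n_sims and simulate_slope.

```r
# Sketch: repeat the simulation to approximate the sampling distribution
# of the slope estimate. n_sims and simulate_slope() are illustrative choices.
set.seed(123)
n <- 100        # Observations per simulated data set
beta_0 <- 5     # True intercept
beta_1 <- 2     # True slope
sigma <- 1      # Standard deviation of error
n_sims <- 1000  # Number of simulated data sets (assumed value)

simulate_slope <- function() {
  # Draw a fresh data set and return the fitted slope
  x <- runif(n, 0, 10)
  y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)
  unname(coef(lm(y ~ x))["x"])
}

slopes <- replicate(n_sims, simulate_slope())

mean(slopes)  # Average estimate; compare with the true slope beta_1
sd(slopes)    # Empirical standard error of the slope estimate
```

The average of the simulated slopes should land close to the true value of 2, and their standard deviation gives an empirical counterpart to the standard error that summary() reports for a single fit.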
In the next post, we’ll explore simulating more advanced statistical models. Stay tuned!

Part 5: Advanced Statistical Simulations in R
Further Reading
Happy simulating!
Citation
@online{jarvis2022,
author = {Jarvis, Christopher},
title = {Simulation in {R} - {Part} 4},
date = {2022-12-04},
url = {https://christopher.jarvis.io/posts/2024-12-04-rsim4/},
langid = {en}
}