Introduction
Simulating a linear regression model is a powerful way to understand its mechanics and evaluate statistical methods. In this post, you’ll learn how to:
- Generate synthetic data for a simple linear regression model.
- Visualize the simulated data.
- Fit a model to the data and compare the results with the true parameters.
Step 1: The Structure of a Linear Regression Model
A simple linear regression model has the form:
\[ y = \beta_0 + \beta_1 x + \epsilon \]
Where:
- \( y \): Dependent variable (response)
- \( \beta_0 \): Intercept
- \( \beta_1 \): Slope (effect of \( x \) on \( y \))
- \( x \): Independent variable (predictor)
- \( \epsilon \): Random error, often assumed to follow a normal distribution with mean 0 and standard deviation \( \sigma \).
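Under the normality assumption above, and additionally assuming the error is independent of \( x \) (an assumption this post does not state explicitly), the same model can be written as a conditional distribution for the response. This is only a restatement of the definitions, not a new model:

```latex
\[
\epsilon \sim \mathcal{N}(0, \sigma^2)
\quad \Longrightarrow \quad
y \mid x \sim \mathcal{N}\left(\beta_0 + \beta_1 x,\ \sigma^2\right)
\]
```

Read this way, \( \beta_0 + \beta_1 x \) is the mean of \( y \) at a given \( x \), and \( \sigma \) controls how far individual observations scatter around that mean.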
Step 2: Simulating Data
To simulate data, you’ll need to define the parameters \( \beta_0 \), \( \beta_1 \), and \( \sigma \), and then generate \( x \) and \( \epsilon \).
Example: Simulating a Simple Linear Regression Model
# Parameters
n <- 100      # Number of observations
beta_0 <- 5   # Intercept
beta_1 <- 2   # Slope
sigma <- 1    # Standard deviation of error

# Simulate data
set.seed(123)                                # For reproducibility
x <- runif(n, 0, 10)                         # Random predictor variable
epsilon <- rnorm(n, mean = 0, sd = sigma)    # Random error
y <- beta_0 + beta_1 * x + epsilon           # Response variable

# Combine into a data frame
data <- data.frame(x = x, y = y)

# View the first few rows
head(data)
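If you plan to repeat the simulation with different settings, it may help to wrap the steps in a function. The sketch below is one possible way to do that. The name `simulate_lm` and its arguments `x_min` and `x_max` are hypothetical and do not come from any package. It ends with a quick sanity check that the simulated errors have roughly the mean and standard deviation that were requested.

```r
# Hypothetical helper (not from any package): draw n observations from
# y = beta_0 + beta_1 * x + epsilon, with x uniform on [x_min, x_max].
simulate_lm <- function(n, beta_0, beta_1, sigma, x_min = 0, x_max = 10) {
  x <- runif(n, x_min, x_max)                 # predictor
  epsilon <- rnorm(n, mean = 0, sd = sigma)   # normal error term
  y <- beta_0 + beta_1 * x + epsilon          # response
  data.frame(x = x, y = y)
}

set.seed(123)
sim <- simulate_lm(n = 100, beta_0 = 5, beta_1 = 2, sigma = 1)

# Sanity check: subtracting the true line recovers the simulated errors,
# whose mean should be near 0 and whose standard deviation near sigma.
recovered_error <- sim$y - (5 + 2 * sim$x)
mean(recovered_error)
sd(recovered_error)
```

With only 100 observations the two summaries will not match 0 and 1 exactly. They should move closer as `n` grows.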
Step 3: Visualizing the Simulated Data
Visualizing the data can help ensure it aligns with your expectations.
library(ggplot2)
# Scatter plot of x and y
ggplot(data, aes(x = x, y = y)) +
  geom_point(color = "blue", alpha = 0.7) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Simulated Linear Regression Data",
       x = "Predictor (x)",
       y = "Response (y)") +
  theme_minimal()
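Because the data are simulated, the true line is known, so one useful check is to draw it next to the fitted line. The sketch below uses base R graphics rather than ggplot2, purely so that it runs without extra packages. It re-creates the data with the same parameter values as the example in Step 2 and is meant as an illustration, not as the only way to make this comparison.

```r
set.seed(123)
n <- 100
beta_0 <- 5   # true intercept
beta_1 <- 2   # true slope
sigma <- 1    # error standard deviation

x <- runif(n, 0, 10)
y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)
fit <- lm(y ~ x)

plot(x, y, pch = 16, col = "blue",
     main = "True vs. fitted regression line",
     xlab = "Predictor (x)", ylab = "Response (y)")
abline(a = beta_0, b = beta_1, lty = 2, lwd = 2)   # true line (dashed, black)
abline(fit, col = "red", lwd = 2)                  # fitted line (solid, red)
legend("topleft", legend = c("True line", "Fitted line"),
       col = c("black", "red"), lty = c(2, 1), lwd = 2)
```

When the noise is small relative to the spread of the predictor, the two lines should be hard to tell apart.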
Step 4: Fitting a Linear Model
Once the data is simulated, you can fit a linear regression model using lm() and compare the estimated coefficients to the true parameters.
# Fit a linear model
model <- lm(y ~ x, data = data)

# Summary of the model
summary(model)

# Compare true and estimated parameters
true_params <- c(beta_0 = beta_0, beta_1 = beta_1)
estimated_params <- coef(model)

# Display parameters
true_params
estimated_params
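Point estimates alone do not show how close is close enough. One way to make the comparison more concrete is to check whether each true parameter falls inside the model's 95% confidence interval, using `confint()` from base R. The sketch below re-creates the data so that it runs on its own. The data frame `comparison` and its column names are illustrative choices.

```r
set.seed(123)
n <- 100
beta_0 <- 5
beta_1 <- 2
sigma <- 1

x <- runif(n, 0, 10)
y <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = sigma)
model <- lm(y ~ x)

# 95% confidence intervals: one row per coefficient, columns are lower/upper
ci <- confint(model, level = 0.95)

comparison <- data.frame(
  true     = c(beta_0, beta_1),
  estimate = unname(coef(model)),
  lower    = ci[, 1],
  upper    = ci[, 2]
)

# TRUE when the interval contains the true parameter value
comparison$covered <- comparison$true >= comparison$lower &
  comparison$true <= comparison$upper
comparison
```

A 95% interval is expected to miss the true value in about 1 out of 20 simulated data sets, so an occasional `FALSE` in the `covered` column is not by itself a sign of a problem.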
Step 5: Exploring the Impact of Error
You can experiment with different values of \( \sigma \) (the standard deviation of the error term) to see how noise affects the fit of the model.
Example: Increasing \( \sigma \)
# Increase sigma
sigma_high <- 5

# Simulate new data
epsilon_high <- rnorm(n, mean = 0, sd = sigma_high)
y_high <- beta_0 + beta_1 * x + epsilon_high

# Fit a new model
model_high <- lm(y_high ~ x)

# Compare summaries
summary(model_high)
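To see the pattern across more than two noise levels, you could loop over several values of the error standard deviation and collect a few summary statistics from each fit. The sketch below is one way to do this. The values in `sigmas` are arbitrary illustrative choices, and the predictor is held fixed so that only the noise changes between fits.

```r
set.seed(123)
n <- 100
beta_0 <- 5
beta_1 <- 2
x <- runif(n, 0, 10)               # same predictor for every noise level

sigmas <- c(0.5, 1, 2, 5, 10)      # illustrative noise levels (assumption)

results <- do.call(rbind, lapply(sigmas, function(s) {
  y_s <- beta_0 + beta_1 * x + rnorm(n, mean = 0, sd = s)
  fit <- summary(lm(y_s ~ x))
  data.frame(
    sigma       = s,
    slope       = fit$coefficients["x", "Estimate"],
    slope_se    = fit$coefficients["x", "Std. Error"],
    r_squared   = fit$r.squared,
    residual_sd = fit$sigma        # estimate of the error standard deviation
  )
}))
results
```

As the noise grows, the standard error of the slope should increase and R-squared should fall, while `residual_sd` should roughly track the value of `sigma` used to generate the data.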
Conclusion
Simulating a linear regression model is a versatile way to understand its properties and to explore how factors such as noise affect model performance. Because you choose the true parameters yourself, you can check directly how closely a fitted model recovers them and how that changes as the error standard deviation grows.
In the next post, we’ll explore simulating more advanced statistical models. Stay tuned!
Part 5: Advanced Statistical Simulations in R
Happy simulating!
Citation
@online{jarvis2022,
author = {Jarvis, Christopher},
title = {Simulation in {R} - {Part} 4},
date = {2022-12-04},
url = {https://christopher.jarvis.io/posts/2024-12-04-rsim4/},
langid = {en}
}