Example Linear Regression
Linear regression with R software
1 Introduction
Linear regression is computed in R using:
- the R function lm (for linear model),
- the math formula explicitly implemented in R software.
We show that we get the same results.
We simulate below data \(\lbrace x_i, y_i\rbrace_{i=1,\cdots,n}\) such that \(\forall i\) \[ y_i=a+b\ x_i+ \epsilon_i \] where \(\epsilon_i\sim \mathcal{N}(0,3^2)\) (Normal error) and \(x_i\sim\mathcal{U}([0,100])\) (uniformly distributed in the range 0 to 100). The ground truth values for the parameters are chosen as \(a=2\) and \(b=1\).
n=100
error=rnorm(n,mean=0,sd=3) # gaussian noise
x=runif(n, min = 0, max = 100) # explanatory variable
a=2 # intercept (true value)
b=1 # slope (true value)
y=a+b*x+error
plot(x,y)
2 Linear Regression using the R function lm
The code below computes estimates for \(a\) and \(b\) using the function lm in R.
Linearfit<-lm(y~x)
summary(Linearfit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.612 -2.142 0.080 1.913 6.512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.814569 0.556440 3.261 0.00153 **
## x 1.010187 0.009805 103.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.895 on 98 degrees of freedom
## Multiple R-squared: 0.9909, Adjusted R-squared: 0.9908
## F-statistic: 1.061e+04 on 1 and 98 DF, p-value: < 2.2e-16
plot(x,y)
lines(x,Linearfit$fitted.values,col = "red") # fitted regression line in red
Exercise: What is the effect of increasing the value of \(n\) on the estimates of the parameters \(a\) and \(b\) and on their reported standard errors?
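One possible way to explore this is sketched below (the sample size 10000 and the names n2, x2, y2 are arbitrary choices, used so that the simulated data above are not overwritten):
n2=10000
error2=rnorm(n2,mean=0,sd=3) # gaussian noise
x2=runif(n2, min = 0, max = 100) # explanatory variable
y2=2+1*x2+error2 # same true values a=2 and b=1
summary(lm(y2~x2))$coefficients # compare estimates and standard errors with n=100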
3 Linear Regression using Algebra in R
Least squares estimates of the parameters \(a\) and \(b\) are computed by creating the matrix \(\mathrm{X}\) with its first column filled with 1s and its second column filled with the values of the explanatory variable \(x\): \[ \hat{\beta}=(\mathrm{X}^T\mathrm{X})^{-1} \mathrm{X}^T \mathbf{y} \] where \(\mathbf{y}\) is the vector collecting the values of the response variable \(y\). The estimated vector of parameters is \(\hat{\beta}=(\hat{a}\ \hat{b})^T\).
# design matrix: column of 1s and the explanatory variable x
X=matrix(1,n,2)
X[,1]=rep(1, n)
X[,2]=x
# least squares estimate
BetaHat=solve(t(X)%*%X)%*%t(X)%*%y
BetaHat
## [,1]
## [1,] 1.814569
## [2,] 1.010187
Note that these estimates are the same as the ones found by the R function lm for the parameters \(a\) and \(b\).
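A quick check of this claim (a minimal sketch using the standard coef accessor of R, which extracts the fitted coefficients from the lm object):
coef(Linearfit) # should match BetaHat computed above
all.equal(as.numeric(coef(Linearfit)), as.numeric(BetaHat))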
The standard deviation of the errors \(\epsilon\) can also be computed:
fitted=X %*% BetaHat
residuals=y-fitted
SSE=t(residuals)%*%residuals # sum of squared errors
sigmahatsq=SSE/(n-length(BetaHat))
# Residual standard error
sqrt(sigmahatsq)
## [,1]
## [1,] 2.895173
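This value should agree with the residual standard error reported by lm; a small check (summary.lm stores it in its sigma component):
summary(Linearfit)$sigma # should match sqrt(sigmahatsq)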
The standard errors reported by lm for \(\hat{\beta}=(\hat{a},\hat{b})^T\) can also be computed mathematically from the covariance matrix \[ \mathbb{C}\lbrack \hat{\beta}\rbrack =\sigma^2 \ (\mathrm{X}^{T}\mathrm{X})^{-1} \] (see my notes online for the derivation of this formula); the standard errors are the square roots of the diagonal entries of this matrix.
# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))
Std.Error.BetaHat
## [1] 0.55644024 0.00980488
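These values should again match the lm output; a small check using the coefficients matrix returned by summary.lm (its "Std. Error" column holds the standard errors):
summary(Linearfit)$coefficients[, "Std. Error"] # should match Std.Error.BetaHat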
Other statistics computed by lm:
#Multiple R-squared: (coefficient of determination)
S0=sum((y-mean(y))*(y-mean(y))) # total sum of squares
Rsq=(S0-SSE)/S0
Rsq
## [,1]
## [1,] 0.9908522
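The remaining statistics of the lm summary can be reproduced in the same way; a minimal sketch using the standard formulas for the adjusted R-squared and the F-statistic (with \(p\) the number of estimated parameters, here \(p=2\)):
p=length(BetaHat)
AdjRsq=1-(1-Rsq)*(n-1)/(n-p) # compare with summary(Linearfit)$adj.r.squared
Fstat=((S0-SSE)/(p-1))/(SSE/(n-p)) # compare with summary(Linearfit)$fstatistic
AdjRsq
Fstat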