Example Linear Regression
Linear regression with R software
1 Introduction
Linear regression is computed in R using:
- the R function lm (for linear model),
- the math formula explicitly implemented in R software.
We show that we get the same results.
We simulate below data \(\lbrace x_i, y_i\rbrace_{i=1,\cdots,n}\) such that \(\forall i\) \[ y_i=a+b\ x_i+ \epsilon_i \] where \(\epsilon_i\sim \mathcal{N}(0,3^2)\) (Normal error) and \(x_i\sim\mathcal{U}([0,100])\) (uniformly distributed in the range 0 to 100). The ground truth values for the parameters are chosen as \(a=2\) and \(b=1\).
n=100
error=rnorm(n,mean=0,sd=3) # gaussian noise
x=runif(n, min = 0, max = 100) # explanatory variable
a=2 # intercept (true value)
b=1 # slope (true value)
y=a+b*x+error
plot(x,y)
2 Linear Regression using the R function lm
The code below computes estimates for \(a\) and \(b\) using the function lm in R.
Linearfit<-lm(y~x)
summary(Linearfit)
##
## Call:
## lm(formula = y ~ x)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.612 -2.142 0.080 1.913 6.512
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.814569 0.556440 3.261 0.00153 **
## x 1.010187 0.009805 103.029 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.895 on 98 degrees of freedom
## Multiple R-squared: 0.9909, Adjusted R-squared: 0.9908
## F-statistic: 1.061e+04 on 1 and 98 DF, p-value: < 2.2e-16
plot(x,y)
lines(x,Linearfit$fitted.values,col = "red") # fitted regression line in red
Exercise: What is the effect of increasing the value of \(n\) on the estimates of the parameters \(a\) and \(b\) and on their reported standard errors?
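One possible way to explore this is sketched below (the sample size 10000 and the names n2, x2, y2 are arbitrary choices, used so that the simulated data above are not overwritten):
n2=10000
error2=rnorm(n2,mean=0,sd=3) # gaussian noise
x2=runif(n2, min = 0, max = 100) # explanatory variable
y2=2+1*x2+error2 # same true values a=2 and b=1
summary(lm(y2~x2))$coefficients # compare estimates and standard errors with n=100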
3 Linear Regression using Algebra in R
Least squares estimates of the parameters \(a\) and \(b\) are computed by creating the matrix \(\mathrm{X}\) with its first column filled with 1s and its second column filled with the values of the explanatory variable \(x\): \[ \hat{\beta}=(\mathrm{X}^T\mathrm{X})^{-1} \mathrm{X}^T \mathbf{y} \] where \(\mathbf{y}\) is the vector collecting the values of the response variable \(y\). The estimated vector of parameters is \(\hat{\beta}=(\hat{a}\ \hat{b})^T\).
# design matrix: column of 1s and the explanatory variable x
X=matrix(1,n,2)
X[,1]=rep(1, n)
X[,2]=x
# least squares estimate
BetaHat=solve(t(X)%*%X)%*%t(X)%*%y
BetaHat
## [,1]
## [1,] 1.814569
## [2,] 1.010187
Note that these estimates are the same as the ones found by the R function lm for the parameters \(a\) and \(b\).
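A quick check of this claim (a minimal sketch using the standard coef accessor of R, which extracts the fitted coefficients from the lm object):
coef(Linearfit) # should match BetaHat computed above
all.equal(as.numeric(coef(Linearfit)), as.numeric(BetaHat))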
The standard deviation of the errors \(\epsilon\) can also be computed:
fitted=X %*% BetaHat
residuals=y-fitted
SSE=t(residuals)%*%residuals # sum of squared errors
sigmahatsq=SSE/(n-length(BetaHat))
# Residual standard error
sqrt(sigmahatsq)
## [,1]
## [1,] 2.895173
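This value should agree with the residual standard error reported by lm; a small check (summary.lm stores it in its sigma component):
summary(Linearfit)$sigma # should match sqrt(sigmahatsq)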
The standard errors reported by lm for \(\hat{\beta}=(\hat{a},\hat{b})^T\) can also be computed mathematically from the covariance matrix \[ \mathbb{C}\lbrack \hat{\beta}\rbrack =\sigma^2 \ (\mathrm{X}^{T}\mathrm{X})^{-1} \] (see my notes online for the derivation of this formula); the standard errors are the square roots of the diagonal entries of this matrix.
# uncertainty of BetaHat (standard error)
CovBetaHat=solve(t(X)%*%X)*as.numeric(sigmahatsq)
Std.Error.BetaHat=sqrt(diag(CovBetaHat))
Std.Error.BetaHat
## [1] 0.55644024 0.00980488
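These values should again match the lm output; a small check using the coefficients matrix returned by summary.lm (its "Std. Error" column holds the standard errors):
summary(Linearfit)$coefficients[, "Std. Error"] # should match Std.Error.BetaHat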
Other statistics computed by lm:
#Multiple R-squared: (coefficient of determination)
S0=sum((y-mean(y))*(y-mean(y))) # total sum of squares
Rsq=(S0-SSE)/S0
Rsq
## [,1]
## [1,] 0.9908522
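The remaining statistics of the lm summary can be reproduced in the same way; a minimal sketch using the standard formulas for the adjusted R-squared and the F-statistic (with \(p\) the number of estimated parameters, here \(p=2\)):
p=length(BetaHat)
AdjRsq=1-(1-Rsq)*(n-1)/(n-p) # compare with summary(Linearfit)$adj.r.squared
Fstat=((S0-SSE)/(p-1))/(SSE/(n-p)) # compare with summary(Linearfit)$fstatistic
AdjRsq
Fstat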