An introduction using lm()

R
model fitting
Author

Pedro J. Aphalo

Published

2023-11-27

Modified

2023-11-30

Abstract
In this page I give a brief presentation on the mechanics of fitting statistical models to observed data using R. I use a lineal model (LM), polynomial regression, as example. I describe the most important values contained in the model fit object returned by function lm() and how to extract them. I rely heavily on diagrams and data plots.
Keywords

predicted values, residual values, parameter estimates, model formulas

Introduction

Fitting a model to data consists in finding the parameter values that best explain the data or observations. These parameter values are estimates of those in a larger population of possible observations from which we have drawn a sample or subset. For deciding what is best, we need a criterion, and the most frequently used one is minimizing the “residual variance” by the method of ordinary least squares (OLS).

Model selection involves comparing models that differ in their structure, i.e., in the formula that relates parameters and data. Here we also need a criterion to “measure” how well a model describes the data, for example, AIC or BIC, which are information criteria. Alternatively, we can just choose “the simplest model that does a good enough job”.

Different approaches to model fitting make different assumptions about the observations. What assumptions are reasonable for our data determines what methods are applicable. Out of these, we can choose the one that best suits the nature of our scientific or practical question or hypothesis.

Modern statistical algorithms, and their implementation in R, strive to be widely applicable. I will emphasize this as well as what is common to different methods.

In general, there are many different aspects of a model fit that we may be interested in. The most computationally intensive step is the fitting itself. Thus R’s approach is to save the results from fitting, or fitted model, and separately query it for different derived numerical and graphical output as needed.

Tip

In this page some code chunks are “folded” so as to decrease the clutter. Above the R output, text or plots, you will find a small triangle followed by “Code”. Clicking on the triangle “unfolds” the code chunk making visible the R code used to produce the output shown. The code in the chunks can be copied by clicking on the top right corner, where an icon appears when the mouse cursor hovers over the code listing.

Attach packages used for plotting.

Code
library(ggpmisc)
library(ggbeeswarm)
library(broom)
theme_set(theme_bw(16))
Anscombe’s linear regression examples

This classical example from Anscombe (1973) demonstrates four very different data sets that yield exactly the same results when a linear regression model is fit to them, including \(R^2\) and \(P\) values.

Code
# we rearrange the data
my.mat <- matrix(as.matrix(anscombe), ncol=2)
my.anscombe <- 
  data.frame(x = my.mat[ , 1],
             y = my.mat[ , 2],
             case=factor(rep(1:4, rep(11,4))))
my.formula = y ~ x
ggplot(my.anscombe, aes(x,y)) +
  geom_point(shape=21, fill="orange", size=3) +
  geom_smooth(method="lm", formula = my.formula) +
  stat_poly_eq(formula = my.formula, 
               parse = TRUE, 
               label.y = "top", 
               label.x = "left", 
               use_label(c("eq", "R2", "P"))) +
  facet_wrap(~case, ncol=2) +
  theme_bw(16)

  1. If these four sets of data were real observations from an experiment, would you conclude the same in all cases with respect to the effect of \(x\) on the observed \(y\)?
  2. In which of the four cases the estimated line best described the overall trend of the response? 1- What is the difficulty in each of the other three cases?

These concocted data examples demonstrates that when fitting any model to data, extreme caution is always needed. We need to check graphically the observations and how well model estimates fit them. \(R^2\), \(P\) values, and other parameter’s estimates must always be considered in the light of the data, rather than in isolation.

Model-fitting core ideas

flowchart TB
s[Total variation] --> f(Model fitting)-->|origin explained| m[Model parameters]
f -->|origin not explained| e[Experimental error]

What is left as experimental error is the “residual” variation, what the model we fitted could not explain (= describe). The error estimate is the result of the interaction between data and model.

The next idea is that the variance that is explained by a model can be “partitioned” so that different “sources” of variation can be compared and their importance assessed.

flowchart TB
s[Total variation] --> f(Model fitting)-->|origin explained| m[Model parameters]
f --->|origin not explained| e[Experimental error]
m --> s1[y variation due to x1]
m --> s2[y variation due to x2]
m --> s3[y variation due to x3]

Differences among fitted models
  • In many cases we can choose among different models to fit.

  • Sometimes, models differ only in how we partition the same explained variation, facilitating different interpretations of the data.

  • In many other cases different models differ in how much of the total variation they can explain and how much remains unexplained.

Model formulas

In statistics models are described by equations, and in R using model formulas, which are similar to these equations.

A model formula defines the nature/\(\approx\)shape of the relationship between dependent (\(y\)) and independent or explanatory variables (\(x_i\)).

When a model is fitted numeric values to insert into the model formula are searched for. The selected values are those that explain as much of the variation among observations as possible, within the constraint of the given model formula.

How the estimates are computed depends on the type of model being fitted to data and the algorithm used. However, the objective of the computations is always the same, minimizing the residual variation. (It is not necessary to always to measure the variability as variance or to use ordinary least squares, OLS, methods. OLS is, however, most effective as long as errors are normally distributed.)

Linear models (LM)

I will start with some “unusual examples” (crazy approaches?) to try to convince you that there is much in common between model fitting and the simple statistics you already know. Grasping this is important as general concepts allows us to understand the bigger picture, and thus we can learn new methods as variations of, or extensions to, what we already know.

The mean (= average) as a fitted linear model

We compute the mean of sample as \({\bar x} = \frac{\sum_{i=1}^n x_i}{n}\) as an estimator of \(\mu\), the mean of the population from which the sample was drawn.

An example with R follows.

We create a vector of 20 random numbers from the Normal distribution to play with.

my.vector <- rnorm(20)
mean(my.vector)
[1] -0.249985
var(my.vector)
[1] 1.081016

We may ask: Can we also obtain these estimators by fitting a lineal model?

The answer is, “Yes! we can.”

We assume that the observations \(y_{ij}\) follow the linear model \[ y_{ij} = \mu + \epsilon_{ij}, \] and fit model to the data to obtain an estimate of \(\mu\).

This model in R, given that we have given the name my.vector to the observation, is represented by the formula my.vector ~ 1. This is a linear model, with only an intercept as only parameter.

fm1 <- lm(my.vector ~ 1)

Calling summary() on the result from fitting the model we get an estimate for the intercept, which is the same \(\bar x\) estimate we computed above.

Code
summary(fm1)

Call:
lm(formula = my.vector ~ 1)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.4304 -0.1753  0.1979  0.5706  1.8551 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -0.2500     0.2325  -1.075    0.296

Residual standard error: 1.04 on 19 degrees of freedom

The ANOVA (analysis of variance) table, unsurprisingly, shows variance under the name of mean square (MS). The residual mean square, or error variance, or residual variance gives the same value we calculated above using function var().

Code
anova(fm1)
Analysis of Variance Table

Response: my.vector
          Df Sum Sq Mean Sq F value Pr(>F)
Residuals 19 20.539   1.081               

To fit the mean as a linear model, the total “Sum Sq”, or SS, was minimized.

So fitting a model is in fact similar to computing the mean of a single variable. However, while the mean is a single parameter, we can fit models with several parameters which are adjusted simultaneously to fit a point, curve, surface or hyper-surface, depending on the case.

When fitting a model we normally also estimate the standard deviation for each fitted parameter, allowing tests of significance.

The t-test as a fitted model linear model

Two vectors of numbers to play with.

vec1 <- 0 + rnorm(20)
vec2 <- 1 + rnorm(20)

We can assemble a data.frame with them, adding a factor with two levels representing groups.

df1 <- data.frame(obs = c(vec1, vec2), 
                  group = factor(rep(c("g1", "g2"), each = 20)))
head(df1)
          obs group
1  0.78496979    g1
2 -0.74799852    g1
3 -1.27472963    g1
4  0.87748322    g1
5  0.19784606    g1
6 -0.02507454    g1
summary(df1)
      obs          group  
 Min.   :-1.9245   g1:20  
 1st Qu.:-0.4916   g2:20  
 Median : 0.3935          
 Mean   : 0.2312          
 3rd Qu.: 1.0137          
 Max.   : 2.4852          

We can plot the data adding means and standard errors.

Code
ggplot(df1, aes(group, obs)) +
  stat_summary(fun.data = mean_se, color = "red") +
  geom_beeswarm()

mean(vec1)
[1] -0.4443724
mean(vec2)
[1] 0.9068544
var(vec1)
[1] 0.8727847
var(vec2)
[1] 0.6916313

One way of calling t.test() is by passing the data separately for each group.

t.test(vec1, vec2, var.equal = TRUE)

    Two Sample t-test

data:  vec1 and vec2
t = -4.8313, df = 38, p-value = 2.245e-05
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.9174089 -0.7850446
sample estimates:
 mean of x  mean of y 
-0.4443724  0.9068544 

We can do the same test calling t.test() using model formula syntax. The model formula obs ~ group, used here, corresponds to the model formulated as \(y_{ij} = \mu + \tau_i + \epsilon_{ij}\).

t.test(obs ~ group, data = df1, var.equal = TRUE)

    Two Sample t-test

data:  obs by group
t = -4.8313, df = 38, p-value = 2.245e-05
alternative hypothesis: true difference in means between group g1 and group g2 is not equal to 0
95 percent confidence interval:
 -1.9174089 -0.7850446
sample estimates:
mean in group g1 mean in group g2 
      -0.4443724        0.9068544 

We may ask: Can we test significance like with the t-test by fitting a lineal model?

The answer is, “Yes! we can.”

Intercept in R model formulas

Model formulas obs ~ 1 + group and obs ~ group are equivalent as the intercept is included by default. The second form is the most commonly used. Here, to make the comparison to the model equations easier, I include the 1 + term explicitly.

The main difference in the call to lm() is the missing var.equal = TRUE, as equal variances are always assumed by lm().

fm2 <- lm(obs ~ 1 + group, data = df1)
summary(fm2)

Call:
lm(formula = obs ~ 1 + group, data = df1)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.9756 -0.5714  0.1241  0.6310  1.5784 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -0.4444     0.1978  -2.247   0.0305 *  
groupg2       1.3512     0.2797   4.831 2.25e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.8844 on 38 degrees of freedom
Multiple R-squared:  0.3805,    Adjusted R-squared:  0.3642 
F-statistic: 23.34 on 1 and 38 DF,  p-value: 2.245e-05
anova(fm2)
Analysis of Variance Table

Response: obs
          Df Sum Sq Mean Sq F value    Pr(>F)    
group      1 18.258 18.2581  23.342 2.245e-05 ***
Residuals 38 29.724  0.7822                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With the three approaches, the P-value obtained is the same. The estimate of the difference the means of g1 and g2 is given in the summary for the model fit with lm() while t.test shows the two means. The square of the estimated t-value matches the F-value in the ANOVA table.

So, these examples show that model fitting provides estimates of population parameters, identically to computing a mean, as well as serving as the basis for tests of significance identical to the t-test. In addition, as we will see below, linear models include additional models and tests of significance to those possible with t-test.

Note

The implementation in R of t-tests by default does not assume equal variances in the two groups that are compared. Linear models, as implemented in lm() do.

Other R model-fitting functions cater for various additional cases.

ANOVA, regression and linear models

Linear models include what is called Analysis of Variance (ANOVA), linear regression, multiple regression, Analysis of Covariance, and a few other variations all wrapped together. So, one procedure caters for a wide range of situations and designs of experiments. While the t-test can be used to test differences between means of two groups, or between a mean and a fixed value, ANOVA can be used with two or more groups.

Linear models can be computed very efficiently but make several assumptions about the data. Other types of models, avoid some of the assumptions of LMs, but tend to be computationally more demanding.

ANOVA vs. Regression

In one-way ANOVA for a completely randomized design (with no blocks) we have the following model: every observation receives a contribution from the population mean contains \(\mu\), observations for the \(i\)th treatment receive a contribution from \(\tau_i\) but no other \(\tau\). Each observation in addition is affected by its own bit of “random” variation.

Thus, the observations \(y_{ij}\) follow the linear model \[ y_{ij} = \mu + \tau_i + \epsilon_{ij}, \] where \(\mu\) is the population mean, \(\tau_i\) the effect of treatment \(i\), and \(\epsilon_{ij}\) the random variation associated with observation \(y_{ij}\).

We assume that the residuals \(\epsilon_{ij}\) are independent and equally distributed — \(\epsilon_{ij} \sim N(0,\sigma^2)\), where variance \(\sigma^2\) is unknown and independent of group \(i\).


In linear regression the \(y\) values are obtained from several populations, each population being determined by the corresponding \(x\) value. The \(y\) variable is called dependent variable and the \(x\) variable is called independent variable.

The observations \(y_{ij}\) follow the linear model \[ y_{ij} = \beta_0 + \beta_1 x_i + \epsilon_{ij}, \]

where \(\beta_0\) is the population mean when \(x=0\), \(x_i\) are the observed values for the independent variable, \(\beta_1\) is the coefficient describing the slope, and \(\epsilon_{ij}\) the random variation associated with observation \(y_{ij}\).

We also assume that the residuals \(\epsilon_{ij}\) are independent and equally distributed — \(\epsilon_{ij} \sim N(0,\sigma^2)\), where variance \(\sigma^2\) is unknown and independent of \(x\).

In regression estimates of all \(\beta_i\) are usually of interest. In contrast, in ANOVA, the estimate of \(\mu\) across all treatments is considered not interesting, and the focus of the analysis is on the estimate of \(\tau\).

Assumption of Normality

What are the reasons why we so frequently assume that error variation follows a Normal distribution?

  1. Experience The variation in the real world rather frequently follows distributions that are rather similar to the Normal. Many exceptions exist, too.

  2. Convenience Assuming a Normal distribution makes computations a lot easier than assuming other distributions. Crucial in the absence of computers.

  3. Theory The distribution of parameter estimates such as the mean tend to be distributed “more normally” than the observations from which they are computed. May explain, at least in part, the first point above.

Why “linear” in the name

Linear models receive the name “linear” because of how the parameters that are estimated enter the model formula. They include models that when plotted can be represented by straight lines, but also by curves and even curved surfaces.

Here \(\beta\) enters the model equation as a multiplier of the $x_{i}.

\[ y_{ij} = \beta_0 + \beta_1\,x_i + \epsilon_{ij}, \] An example of a model non-linear in its parameters is

\[ y_{ij} = \beta_0 + \mathrm{e}^{\beta_1\,x_i} + \epsilon_{ij}, \] because \(\beta\) enters the model equation as a part of an exponent.

Whether a model to be fitted is linear in its parameters or non-linear is crucial from the perspective of the computations needed to estimate the values of the parameters. From the interpretation perspective, and use, differences are rather small.

Note

In linear models dummy variables, consisting of zeros and ones, can be used to code treatments levels. Thus, even though we rely on factors to describe treatments, these factors are converted into an encoding consistent with that used for regression, and the same computational procedure is used internally for regression and ANOVA.

Linear models in R

Both ANOVA and regression are described by linear models and have much more in common than what it looks at first sight.

In R little distinction is made between ANOVA and regression.

The functionlm() can be used for both, whether the variables in the model are factors or numeric vectors determines the coding (of the matrix used for describing the tests).

The function aov() uses lm() internally but differs in that summary gives an ANOVA table, and in that it can also deal directly with hierarchical models (e.g. split-plot designs).

Regression examples

I show here plots, but not not how one would do the data analysis in R.

Interpretation of the intercept

We generate some artificial data (it will be different each time you run the code below.)

Why will it be different? Try it?

Code
x <- -20:80
y <- x + 50 + rnorm(101, sd = 20)
my.data <- data.frame(x, y)
summary(my.data)
       x             y          
 Min.   :-20   Min.   :  3.111  
 1st Qu.:  5   1st Qu.: 59.510  
 Median : 30   Median : 79.192  
 Mean   : 30   Mean   : 82.706  
 3rd Qu.: 55   3rd Qu.:109.205  
 Max.   : 80   Max.   :152.656  

We fit a linear regression (R’s model formula y ~ x or its verbose equivalent y ~ 1 + x).

\[ y_{ij} = \beta_0 + \beta_1 x_i + \epsilon_{ij}, \]

Code
ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_poly_line(se = FALSE) +
  stat_poly_eq(use_label(c("eq")))

Residuals or deviations

All deviations from the fitted curve are assigned to variation in \(y\), as an assumption of LM fitting.

Code
ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_poly_line(se = FALSE) +
  stat_fit_deviations(colour = "red")

Code
ggplot(my.data, aes(x, y)) +
  geom_hline(yintercept = 0) +
  stat_fit_residuals(colour = "red")

Assumption about \(x\)

It is assumed that the \(x_i\) are measured without error.

Regression provides a means of predicting \(y\) from \(x\), but not the reverse.

For example, this does not mean that there is no biological variation, but instead that the values of the independent variable are known. As with other assumptions, small deviations are rarely important.

This stems from the idea of dependent vs. independent variables, and cause and effect. We assume that all the variation that is observed in the \(y_{ij}\) is caused by variation in the response to the known \(x_{ij}\) or to the known \(\tau_i\).

This is a very frequent assumption, but methods do exist that do not involve this assumption.

Variability of the estimates

A confidence band for the fitted model encloses the predicted relationship, in this case a straight line. It informs about the uncertainty about our estimate of the relationship for the whole population.

While we use \(\beta_i\) for the values of the parameters in the population, we use \(b_i\) for the estimates obtained by fitting the model to a sample.

Code
ggplot(my.data, aes(x, y)) +
  geom_point() +
  stat_poly_line() +
  stat_fit_tb(label.x = "left",
              tb.params = c("b[0]" = 1, "b[1]" = 2),
              tb.vars = c("Param." = 1, "Estimate" = 2,
                          "italic(s)[b[i]]" = 3, "italic(t)" = 4,
                          "italic(P)" = 5),
              parse = TRUE)

Some code examples

The code chunks below show the steps for analysis of variance (ANOVA) applied to weights of three groups of plants. The example data are included in R. You can run the code directly in this web page (in the browser, no R installation needed). The same code, of course, runs in RStudio and R itself.


    

    

A quick plot of the data.


    

A bit more sophisticated plot.

Code
ggplot(PlantGrowth, aes(group, weight)) +
  geom_beeswarm() +
  geom_boxplot(colour = "red", width = 0.33, fill = NA)


    

    

    

By editing the code in the chunks above, analyse the data set chickwts, also included in R.