---
title: "Linear Models"
subtitle: "An introduction using `lm()`"
author: "Pedro J. Aphalo"
date: 2023-11-27
date-modified: 2023-11-30
categories: [R, model fitting]
keywords: [predicted values, residual values, parameter estimates, model formulas]
format:
html:
mermaid:
theme: neutral
code-fold: true
code-tools: true
callout-icon: false
engine: knitr
filters:
- webr
abstract:
  In this page I give a brief presentation on the mechanics of fitting statistical models to observed data using R. I use a linear model (LM), polynomial regression, as an example. I describe the most important values contained in the model fit object returned by function `lm()` and how to extract them. I rely heavily on diagrams and data plots.
---
# Introduction
_Fitting a model_ to data consists in finding the parameter values that best
explain the data or observations. These parameter values are estimates of
those in a larger population of possible observations from which we have
drawn a sample or subset. For deciding what is best, we need a criterion, and
the most frequently used one is minimizing the "residual variance" by the
method of ordinary least squares (OLS).
_Model selection_ involves comparing models that differ in their structure, i.e.,
in the formula that relates parameters and data. Here we also need a criterion
to "measure" how well a model describes the data, for example, AIC or BIC,
which are information criteria. Alternatively, we can just choose "the simplest
model that does a good enough job".
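As a minimal sketch of such a comparison (using the built-in `cars` data set and object names chosen only for this example), we can compare an intercept-only model with one that includes a predictor; smaller AIC or BIC values indicate a better compromise between goodness of fit and model complexity.

```{r}
#| code-fold: false
# sketch: compare two candidate models with information criteria
fm.null <- lm(dist ~ 1, data = cars)      # intercept-only model
fm.line <- lm(dist ~ speed, data = cars)  # straight-line model
AIC(fm.null, fm.line)
BIC(fm.null, fm.line)
```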
Different approaches to model fitting make different assumptions about the
observations. What assumptions are reasonable for our data determines what
methods are applicable. Out of these, we can choose the one that best suits
the nature of our scientific or practical question or hypothesis.
Modern statistical algorithms, and their implementation in R, strive to be
widely applicable. I will emphasize this as well as what is common to different
methods.
In general, there are many different aspects of a model fit that we may be
interested in. The most computationally intensive step is the fitting itself.
Thus R's approach is to save the results from fitting, or _fitted model_, and
separately query it for different derived numerical and graphical output as
needed.
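As a minimal sketch of this workflow (again using the built-in `cars` data and a name chosen only for this example), we fit once, keep the returned object, and then query it with extractor functions.

```{r}
#| code-fold: false
# sketch: fit once, save the fitted-model object, query it repeatedly
fm.demo <- lm(dist ~ speed, data = cars)
coef(fm.demo)             # parameter estimates
head(fitted(fm.demo))     # predicted values for the observations
head(residuals(fm.demo))  # residual values
```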
::: callout-tip
In this page some code chunks are "folded" so as to decrease the clutter. Above the R
output, text or plots, you will find a small triangle followed by "Code".
Clicking on the triangle "unfolds" the code chunk making visible the R code used
to produce the output shown.
The code in the chunks can be copied by clicking on the top right corner, where
an icon appears when the mouse cursor hovers over the code listing.
:::
Attach packages used for plotting.
```{r, message = FALSE}
library(ggpmisc)
library(ggbeeswarm)
library(broom)
theme_set(theme_bw(16))
```
::: callout-caution
# Anscombe's linear regression examples
This classical example from Anscombe (1973) demonstrates four very different data sets that yield exactly the same results when a linear regression model is fit to them, including $R^2$ and $P$ values.
```{r}
# we rearrange the data
my.mat <- matrix(as.matrix(anscombe), ncol=2)
my.anscombe <-
data.frame(x = my.mat[ , 1],
y = my.mat[ , 2],
case=factor(rep(1:4, rep(11,4))))
my.formula <- y ~ x
ggplot(my.anscombe, aes(x,y)) +
geom_point(shape=21, fill="orange", size=3) +
geom_smooth(method="lm", formula = my.formula) +
stat_poly_eq(formula = my.formula,
parse = TRUE,
label.y = "top",
label.x = "left",
use_label(c("eq", "R2", "P"))) +
facet_wrap(~case, ncol=2) +
theme_bw(16)
```
1. If these four sets of data were real observations from an experiment, would you conclude the same in all cases with respect to the effect of $x$ on the observed $y$?
1. In which of the four cases does the estimated line best describe the _overall trend_ of the response?
1. What is the difficulty in each of the other three cases?
These concocted data examples demonstrate that when fitting any model to data, extreme caution is always needed. **We need to check graphically the observations and how well the model estimates fit them.** $R^2$, $P$ values, and other parameter estimates _must always be considered in the light of the data, rather than in isolation_. (A numerical check follows this box.)
:::
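As a quick numerical check of this point (a small sketch added here, not part of the plot above), we can fit the four regressions separately and compare their coefficients and $R^2$.

```{r}
# sketch: the four Anscombe cases give nearly identical estimates and R^2
fits <- lapply(split(my.anscombe, my.anscombe$case),
               function(d) lm(y ~ x, data = d))
sapply(fits, coef)                                # intercept and slope per case
sapply(fits, function(fm) summary(fm)$r.squared)  # R^2 per case
```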
# Model-fitting core ideas
```{mermaid}
flowchart TB
s[Total variation] --> f(Model fitting)-->|origin explained| m[Model parameters]
f -->|origin not explained| e[Experimental error]
```
What is left as experimental error is the "residual" variation, what the model we fitted could not explain (= describe). _The error estimate is the result of the interaction between data and model._
The next idea is that the variance that is explained by a model can be "partitioned" so that different "sources" of variation can be compared and their importance assessed.
```{mermaid}
flowchart TB
s[Total variation] --> f(Model fitting)-->|origin explained| m[Model parameters]
f --->|origin not explained| e[Experimental error]
m --> s1[y variation due to x1]
m --> s2[y variation due to x2]
m --> s3[y variation due to x3]
```
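As a minimal sketch of such a partition (with made-up data and variable names used only here), `anova()` applied to a fitted model reports one row per "source" of explained variation plus a row for the residual, unexplained variation.

```{r}
# sketch with made-up data: the explained variation is split into one row per
# model term ("source"), plus the unexplained residual row
x1 <- rep(1:5, each = 4)
x2 <- rnorm(20)
yy <- 2 + 0.8 * x1 + 1.5 * x2 + rnorm(20)
anova(lm(yy ~ x1 + x2))
```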
::: callout-tip
# Differences among fitted models
- In many cases we can choose among different models to fit.
- Sometimes, models differ only in how we partition the same explained variation, facilitating different interpretations of the data.
- In many other cases different models differ in how much of the total variation they can explain and how much remains unexplained.
:::
::: callout-tip
# Model formulas
In statistics, models are described by equations; in R, they are described using _model formulas_, which are similar to these equations.
_A model formula defines the nature ($\approx$ shape) of the relationship between the dependent variable ($y$) and the independent or explanatory variables ($x_i$)._
When a model is fitted, numeric values to insert into the model formula are searched for. The selected values are those that explain as much of the variation among observations as possible, within the constraints of the given model formula.
How the estimates are computed depends on the type of model being fitted to the data and on the algorithm used. However, the objective of the computations is always the same: minimizing the residual variation. (It is not always necessary to measure the variability as variance or to use ordinary least squares (OLS) methods. OLS is, however, most effective _as long as errors are normally distributed_.)
:::
# Linear models (LM)
I will start with some "unusual examples" (crazy approaches?) to try to convince you that there is much in common between model fitting and the simple statistics you already know. Grasping this is important, as general concepts allow us to understand the bigger picture, and thus we can learn new methods as variations of, or extensions to, what we already know.
## The mean (= average) as a fitted linear model
We compute the mean of a sample as ${\bar x} = \frac{\sum_{i=1}^n x_i}{n}$ as an estimator of $\mu$, the mean of the population from which the sample was drawn.
An example with R follows.
We create a vector of 20 random numbers from the Normal distribution to play with.
```{r}
#| code-fold: false
my.vector <- rnorm(20)
```
```{r}
#| code-fold: false
mean(my.vector)
var(my.vector)
```
**We may ask:** _Can we also obtain these estimates by fitting a linear model?_
The answer is, "Yes! we can."
We assume that the observations $y_{ij}$ follow the linear model
$$
y_{ij} = \mu + \epsilon_{ij},
$$
and fit this model to the data to obtain an estimate of $\mu$.
This model in R, given that we have given the name `my.vector` to the observations, is represented by the formula `my.vector ~ 1`. This is a linear model with the intercept as its only parameter.
```{r}
#| code-fold: false
fm1 <- lm(my.vector ~ 1)
```
Calling `summary()` on the result from fitting the model we get an estimate for the intercept, which is the same $\bar x$ estimate we computed above.
```{r}
summary(fm1)
```
The ANOVA (analysis of variance) table, unsurprisingly, shows variance under the name of _mean square_ (MS). The residual mean square, or error variance, or residual variance gives the same value we calculated above using function `var()`.
```{r}
anova(fm1)
```
To fit the mean as a linear model, the residual "Sum Sq", or SS, was minimized; with only an intercept in the model, this residual SS equals the total SS around the mean.
So fitting a model is in fact similar to computing the mean of a single variable. However, while the mean is a single parameter, we can fit models with several parameters which are adjusted simultaneously to fit a point, curve, surface or hyper-surface, depending on the case.
When fitting a model we normally also obtain an estimate of the standard error of each fitted parameter, allowing tests of significance.
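As a numerical check (a small sketch added here), we can confirm that the fitted intercept equals the sample mean, that the mean minimizes the residual sum of squares, and read the standard error of the estimate from the coefficients table.

```{r}
#| code-fold: false
# sketch: numerical checks on the intercept-only fit
coef(fm1)[["(Intercept)"]] - mean(my.vector)        # essentially zero
rss <- function(m) sum((my.vector - m)^2)           # residual SS for a candidate value m
optimize(rss, interval = range(my.vector))$minimum  # minimum is close to mean(my.vector)
summary(fm1)$coefficients                           # estimate, std. error, t and P values
```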
### The _t_-test as a fitted linear model
Two vectors of numbers to play with.
```{r}
#| code-fold: false
vec1 <- 0 + rnorm(20)
vec2 <- 1 + rnorm(20)
```
We can assemble a `data.frame` with them, adding a factor with two levels representing groups.
```{r}
#| code-fold: false
df1 <- data.frame(obs = c(vec1, vec2),
group = factor(rep(c("g1", "g2"), each = 20)))
head(df1)
summary(df1)
```
We can plot the data adding means and standard errors.
```{r}
ggplot(df1, aes(group, obs)) +
stat_summary(fun.data = mean_se, color = "red") +
geom_beeswarm()
```
```{r}
#| code-fold: false
mean(vec1)
mean(vec2)
var(vec1)
var(vec2)
```
One way of calling `t.test()` is by passing the data separately for each group.
```{r}
#| code-fold: false
t.test(vec1, vec2, var.equal = TRUE)
```
We can do the same test calling `t.test()` using _model formula_ syntax. The model formula `obs ~ group`, used here, corresponds to the model formulated as $y_{ij} = \mu + \tau_i + \epsilon_{ij}$.
```{r}
#| code-fold: false
t.test(obs ~ group, data = df1, var.equal = TRUE)
```
**We may ask:** _Can we test significance, as with the t-test, by fitting a linear model?_
The answer is, "Yes! we can."
::: callout-caution
# Intercept in R model formulas
Model formulas `obs ~ 1 + group` and `obs ~ group` are equivalent as the intercept is included by default. The second form is the most commonly used. Here, to make the comparison to the model equations easier, I include the `1 +` term explicitly.
:::
The main difference in the call to `lm()` is the missing `var.equal = TRUE`, as equal variances are always assumed by `lm()`.
```{r}
#| code-fold: false
fm2 <- lm(obs ~ 1 + group, data = df1)
summary(fm2)
anova(fm2)
```
With the three approaches, the _P_-value obtained is the same. The estimate of the difference between the means of `g1` and `g2` is given in the summary for the model fit with `lm()`, while `t.test()` shows the two means. The square of the estimated _t_-value matches the _F_-value in the ANOVA table.
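As a small sketch (names chosen only for this check), we can extract the two statistics and compare them.

```{r}
#| code-fold: false
# sketch: the squared t value for the group effect equals the F value in the ANOVA table
t.value <- summary(fm2)$coefficients["groupg2", "t value"]
c(t.squared = t.value^2, F.value = anova(fm2)[["F value"]][1])
```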
So, these examples show that model fitting provides estimates of population parameters, identically to computing a mean, as well as serving as the basis for tests of significance identical to the _t_-test. In addition, as we will see below, linear models support additional models and tests of significance beyond those possible with the _t_-test.
::: callout-note
The _implementation_ in R of _t_-tests by default does not assume equal variances in the two groups that are compared. Linear models, as implemented in `lm()`, do assume equal variances. (The sketch after this note shows both `t.test()` variants.)
Other R model-fitting functions cater for various additional cases.
:::
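As a minimal sketch of this difference, the same comparison can be run with and without the equal-variance assumption; the default call uses Welch's approximation.

```{r}
# sketch: default t.test() (Welch, unequal variances) vs. the equal-variance
# test that corresponds to lm()
t.test(obs ~ group, data = df1)                    # Welch two-sample t-test
t.test(obs ~ group, data = df1, var.equal = TRUE)  # classical two-sample t-test
```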
### ANOVA, regression and linear models
Linear models include what is called Analysis of Variance (ANOVA), linear regression, multiple regression, Analysis of Covariance, and a few other variations all wrapped together. So, one procedure caters for a wide range of situations and designs of experiments. While the _t_-test can be used to test differences between means of two groups, or between a mean and a fixed value, ANOVA can be used with two or more groups.
Linear models can be computed very efficiently but make several assumptions about the data. Other types of models avoid some of the assumptions of LMs, but tend to be computationally more demanding.
## ANOVA vs. Regression
In _one-way ANOVA_ for a completely randomized design (with no blocks) we have the
following model: every observation receives a contribution from the population
mean $\mu$, observations for the $i$th treatment receive a contribution
from $\tau_i$ but from no other $\tau$, and each observation is in addition
affected by its own bit of "random" variation.
Thus, the observations $y_{ij}$ follow the linear model
$$
y_{ij} = \mu + \tau_i + \epsilon_{ij},
$$
where $\mu$ is the population mean, $\tau_i$ the effect of
treatment $i$, and $\epsilon_{ij}$ the random variation associated
with observation $y_{ij}$.
We assume that the residuals $\epsilon_{ij}$
are independent and identically distributed --- $\epsilon_{ij} \sim
N(0,\sigma^2)$, where the variance $\sigma^2$ is unknown and
independent of group $i$.
In linear regression the $y$ values are obtained from several
populations, each population being determined by the corresponding
$x$ value. The $y$ variable is called the _dependent variable_
and the $x$ variable the _independent variable_.
The observations $y_{ij}$ follow the linear model
$$
y_{ij} = \beta_0 + \beta_1 x_i + \epsilon_{ij},
$$
where $\beta_0$ is the population mean when $x=0$, $x_i$ are the
observed values for the independent variable, $\beta_1$ is the
coefficient describing the slope, and $\epsilon_{ij}$ the random
variation associated with observation $y_{ij}$.
We also assume that the residuals
$\epsilon_{ij}$ are independent and identically distributed ---
$\epsilon_{ij} \sim N(0,\sigma^2)$, where variance $\sigma^2$ is
unknown and independent of $x$.
In regression, estimates of all the $\beta_i$ are usually of interest. In contrast, in ANOVA the estimate of $\mu$ across all treatments is usually of little interest, and the focus of the analysis is on the estimates of the $\tau_i$.
::: callout-tip
# Assumption of Normality
_What are the reasons why we so frequently assume that error variation follows a Normal distribution?_
1. **Experience** The variation in the real world _rather frequently_ follows distributions that are rather similar to the Normal. _Many exceptions exist, too._
1. **Convenience** Assuming a Normal distribution makes computations a lot easier than assuming other distributions. _Crucial in the absence of computers._
1. **Theory** The distributions of parameter estimates such as the mean tend to be "more normal" than the distribution of the observations from which they are computed (the central limit theorem). This may explain, at least in part, the first point above.
:::
::: callout-tip
# Why "linear" in the name
Linear models receive the name "linear" because of how the parameters that are estimated enter the _model formula_. They include models that when plotted can be represented by straight lines, but also by curves and even curved surfaces.
Here $\beta_1$ enters the model equation as a multiplier of $x_{i}$:
$$
y_{ij} = \beta_0 + \beta_1\,x_i + \epsilon_{ij},
$$
An example of a model non-linear in its parameters is
$$
y_{ij} = \beta_0 + \mathrm{e}^{\beta_1\,x_i} + \epsilon_{ij},
$$
because $\beta_1$ enters the model equation as part of an exponent.
Whether a model to be fitted is linear or non-linear in its parameters is crucial from the perspective of the computations needed to estimate the values of the parameters. From the perspective of interpretation and use, the differences are rather small. (The sketch after this box shows a curve that is still described by a linear model.)
:::
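As a small sketch with made-up data (variable names used only here), a quadratic response plots as a curve, yet the model `y ~ x + I(x^2)` is linear in its parameters and can be fitted with `lm()`.

```{r}
# sketch with made-up data: a curved relationship described by a model that is
# linear in its parameters, and therefore fitted with lm()
x.tmp <- 0:10
y.tmp <- 1 + 0.5 * x.tmp + 0.2 * x.tmp^2 + rnorm(11, sd = 0.5)
coef(lm(y.tmp ~ x.tmp + I(x.tmp^2)))
```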
::: callout-note
In linear models _dummy_ variables, consisting of zeros and ones, can be used to code
treatment levels. Thus, even though we rely on factors to describe treatments, these factors are converted into an encoding consistent with that used for regression, and the same computational procedure is used internally for regression and ANOVA. (The sketch after this note shows such an encoding.)
:::
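As a minimal sketch using the data frame `df1` assembled above, `model.matrix()` shows the zero/one encoding that `lm()` builds internally from the factor.

```{r}
#| code-fold: false
# sketch: the dummy (0/1) coding used internally for the factor "group"
head(model.matrix(~ 1 + group, data = df1))
```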
## Linear models in R
Both **ANOVA** and **regression** are described by **linear models** and have
much more in common than is apparent at first sight.
In R little distinction is made between ANOVA and regression.
The function `lm()` can be used for both; whether the variables
in the model are `factor`s or `numeric` vectors
determines the coding of the model matrix used in the computations.
The function `aov()` uses `lm()` internally but differs in
that its `summary()` gives an ANOVA table, and in that it can also deal
directly with hierarchical models (e.g., split-plot designs).
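As a minimal sketch (with a name chosen only for this example), the model fitted earlier with `lm()` can be refitted with `aov()`, whose `summary()` prints an ANOVA table directly.

```{r}
# sketch: same model as fm2, fitted with aov(); summary() gives an ANOVA table
fm2.aov <- aov(obs ~ 1 + group, data = df1)
summary(fm2.aov)
```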
## Regression examples
I show here plots, but not how one would do the data analysis in R.
### Interpretation of the intercept
We generate some artificial data (it will be different each time you run the code below).
Why will it be different? Try it!
```{r}
x <- -20:80
y <- x + 50 + rnorm(101, sd = 20)
my.data <- data.frame(x, y)
summary(my.data)
```
We fit a linear regression (R's model formula `y ~ x` or its verbose equivalent `y ~ 1 + x`).
$$
y_{ij} = \beta_0 + \beta_1 x_i + \epsilon_{ij},
$$
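Although this section focuses on plots, a minimal sketch of the corresponding fit is shown below (the name `fm3` is used only here); given how the data were simulated, the intercept estimate should be close to 50 and the slope close to 1.

```{r}
# sketch: fit the regression explicitly and inspect the estimates b0 and b1
fm3 <- lm(y ~ 1 + x, data = my.data)
coef(fm3)
```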
```{r}
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(se = FALSE) +
stat_poly_eq(use_label(c("eq")))
```
### Residuals or deviations
All deviations from the fitted curve are assigned to variation in $y$, as an assumption of LM fitting.
```{r}
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line(se = FALSE) +
stat_fit_deviations(colour = "red")
```
```{r}
ggplot(my.data, aes(x, y)) +
geom_hline(yintercept = 0) +
stat_fit_residuals(colour = "red")
```
::: callout-caution
# Assumption about $x$
_It is assumed that the $x_i$ are measured without error._
Regression provides a means of predicting $y$ from $x$, but not the reverse.
This does not mean that there is no biological variation, but rather that the values of the independent variable are known. As with other assumptions, small deviations are rarely important.
This stems from the idea of dependent vs. independent variables, and cause and effect. We assume that all the variation that is observed in the $y_{ij}$ is caused by variation in the response to the known $x_i$ or to the known $\tau_i$.
_This is a very frequent assumption, but methods do exist that do not involve this assumption._
:::
### Variability of the estimates
A confidence band for the fitted model encloses the estimated relationship, in this case a straight line. It informs us about the uncertainty in our estimate of the relationship for the whole population.
While we use $\beta_i$ for the values of the parameters in the population, we use $b_i$ for the estimates obtained by fitting the model to a sample.
```{r}
ggplot(my.data, aes(x, y)) +
geom_point() +
stat_poly_line() +
stat_fit_tb(label.x = "left",
tb.params = c("b[0]" = 1, "b[1]" = 2),
tb.vars = c("Param." = 1, "Estimate" = 2,
"italic(s)[b[i]]" = 3, "italic(t)" = 4,
"italic(P)" = 5),
parse = TRUE)
```
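The values enclosed by the band can also be computed directly; as a small sketch, `predict()` with `interval = "confidence"` returns the fitted values together with the lower and upper limits.

```{r}
# sketch: fitted values and confidence limits underlying the band in the plot
head(predict(lm(y ~ 1 + x, data = my.data), interval = "confidence"))
```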
## Some code examples
The code chunks below show the steps for analysis of variance (ANOVA) applied to weights of three groups of plants. The example data are included in R. You can run the code directly in this web page (in the browser, no R installation needed). The same code, of course, runs in RStudio and R itself.
```{webr-r}
help(PlantGrowth)
```
```{webr-r}
summary(PlantGrowth)
```
A quick plot of the data.
```{webr-r}
boxplot(weight ~ group, data = PlantGrowth)
```
A bit more sophisticated plot.
```{r}
ggplot(PlantGrowth, aes(group, weight)) +
geom_beeswarm() +
geom_boxplot(colour = "red", width = 0.33, fill = NA)
```
```{webr-r}
fmPG <- lm(weight ~ group, data = PlantGrowth)
fmPG
```
```{webr-r}
plot(fmPG, which = 1)
plot(fmPG, which = 2)
# plot(fmPG, which = 3)
# plot(fmPG, which = 4)
plot(fmPG, which = 5)
# plot(fmPG, which = 6)
```
```{webr-r}
anova(fmPG)
```
**By editing the code in the chunks above, analyse the data set `chickwts`, also included in R.**