Fitted-model labels with ‘ggpmisc’

Polynomials, major axis, quantile and non-linear regression lines, bands and equations

Plotting examples
Author

Pedro J. Aphalo

Published

2023-02-27

Modified

2023-07-30

Abstract

Example R code for plots with predictions from fitted models labelled with fitted model equations, parameter estimates, R2, F-value, P-value, etc. Polynomial regression, quantile regression, major-axis regression and robust linear model fits, as well as non-linear regression, are exemplified. Labels make use of layer functions from package ‘ggpmisc’.

Keywords

ggplot2 pkg, ggpp pkg, ggpmisc pkg, data labels, plot annotations, model equations

Tip

On this page code chunks are “folded” to decrease the clutter when searching for examples. Above each plot you will find one or more “folded” code chunks signalled by a small triangle followed by “Code”. Clicking on the triangle “unfolds” the code chunk, making visible the R code used to produce the plot.

The code in the chunks can be copied by clicking on the top right corner, where an icon appears when the mouse cursor hovers over the code listing.

The </> Code drop-down menu to the right of the page title makes it possible to unfold all code chunks and to view the Quarto source of the whole web page.

Names of functions and other R objects are linked to the corresponding on-line help pages. The names of R extension packages are linked to their documentation web sites when available.

Introduction

Here you will find examples of ggplots with labels showing fitted model equations and various parameter estimates. The labels have been added to the plots with geometries defined in package ggpp and statistics from package ggpmisc, both available at CRAN.

Important

One should always check that annotations do not occlude anything significant, such as observations in the base plot. This needs special care when using annotations together with batch plotting. Either ensure that the scale limits of the base plot are expanded so as to avoid overlap, or make the layer with the equations the lowest one, i.e., add it to the plot first.
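Both options are illustrated in the minimal sketch below. It is not one of the original examples: the mtcars data set and the expanded limit are used only for illustration, and ‘ggpmisc’ is assumed to be attached as shown in the Preliminaries.

Code
library(ggpmisc)

ggplot(mtcars, aes(disp, mpg)) +
  # equation layer added first, so it is drawn below all other layers
  stat_poly_eq(mapping = use_label("eq")) +
  stat_poly_line() +
  geom_point() +
  # expand the y scale to leave head room for the equation
  expand_limits(y = 40)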

Data labels and plot annotations

Data labels add textual information directly related to individual data points (shown as glyphs). Text position in this case depends on the scales used to represent the data points. The text is usually displaced so that it does not occlude the glyph representing the data point, and when the link to the data point is unclear, this link is signalled with a line segment or arrow. Data labels are distinct from annotations in that they contribute directly to the representation of data on a plot or map.

Annotations differ from data labels in that their position is decoupled from their meaning. Insets can be thought of as larger, but still self-contained, annotations. In most cases the reading of inset tables and plots depends only weakly on the plot or map in which they are included.

In the case of annotations and insets the designer of a data visualization has the freedom to locate them anywhere, as long as they do not occlude features used to describe data. I will use the term annotation irrespective of whether the “labels” are textual or graphical.

The equations showing parameter estimates from models fit to data are normally displayed as annotations. However, it is also possible to consider them data labels “connected” to individual curves representing the corresponding model fits.

Annotations showing parameters estimates

Fitted model equations and other related estimates can be useful in plots, as they ensure that the graphical representation as a curve and the numerical values of parameters, which cannot be read from the curve itself, are displayed together as text in the same plot.

Note

When adding annotations one should be aware that they add clutter to a plot, and clutter can make it difficult to see the patterns of interest in the data represented as points or curves.

So, as usual, less is more: include those annotations that are relevant to the message conveyed by the plot, and nothing more.

Fully automating the construction of fitted model equations is far from trivial in the general case, so I have implemented automation for specific types of models: polynomials fitted by OLS as linear models, polynomials fitted by quantile regression, and major-axis regression fits. These are currently available as pairs of statistics with consistent interfaces, each pair consisting of a curve-plotting statistic and an annotation statistic.

As shown below, other model formulas can be rather easily assembled and added as annotations if one is familiar with R’s expressions as used for plotmath.

Preliminaries

The code used is shown on demand above each plot and can be copied. We first load the packages we will use.

When package ggpmisc is loaded and attached, packages ggpp and ggplot2 are also attached. The only function from ggplot2 that is redefined by ggpp is annotate(), which remains backwards compatible with ggplot2.
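The chunk attaching the packages is not echoed above; a minimal equivalent, assuming the packages are already installed from CRAN, is the following.

Code
# attaching 'ggpmisc' also attaches 'ggpp' and 'ggplot2'
library(ggpmisc)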

Linear correlation

Code
# generate artificial data for a linear relationship
# two groups with different standard deviation around a slope = 1
set.seed(94321)
x <- (1:100) / 10
yA <- x + rnorm(length(x), sd = 2)
yB <- x + rnorm(length(x), sd = 8)
df1 <- data.frame(x = rep(x, 2),
                  y = c(yA, yB),
                  group = rep(c("A", "B"), rep(length(x), 2L)))

The examples in this section make use of stat_correlation() from package ggpmisc.

Pearson correlation.

Code
ggplot(subset(df1, group == "A"), 
       aes(x, y)) +
  geom_point() +
  stat_correlation()

Code
ggplot(subset(df1, group == "A"), 
       aes(x, y)) +
  geom_point() +
  stat_correlation(small.r = TRUE)

Spearman correlation. (The default is to not compute the confidence interval.)

Code
ggplot(subset(df1, group == "A"), 
       aes(x, y)) +
  geom_point() +
  stat_correlation(method = "spearman")
Skipping bootstrap estimation as 'conf.level' <= 0

Kendall correlation. (The default is to not compute the confidence interval.)

Code
ggplot(subset(df1, group == "A"), 
       aes(x, y)) +
  geom_point() +
  stat_correlation(method = "kendall")
Skipping bootstrap estimation as 'conf.level' <= 0

Select which labels to show in plot.

Code
ggplot(subset(df1, group == "A"), aes(x, y)) +
  geom_point() +
  stat_correlation(mapping = use_label(c("R", "P", "n")),
                   method = "kendall")
Skipping bootstrap estimation as 'conf.level' <= 0

Select which labels to show in plot.

Code
ggplot(subset(df1, group == "A"), aes(x, y)) +
  geom_point() +
  stat_correlation(mapping = use_label(c("R", "t", "P", "n")))

Confidence interval for R (Pearson’s correlation).

Code
ggplot(subset(df1, group == "A"), aes(x, y)) +
  geom_point() +
  stat_correlation(mapping = use_label(c("R", "R.confint")))

Confidence interval for Spearman’s correlation.

Code
ggplot(subset(df1, group == "A"), 
       aes(x, y)) +
  geom_point() +
  stat_correlation(mapping = use_label(c("R", "R.confint")),
                   r.conf.level = 0.95,
                   method = "spearman")

Grouping supported.

Code
ggplot(df1, aes(x, y, color = group)) +
  geom_point() +
  stat_correlation()

Facets supported.

Code
ggplot(df1, aes(x, y)) +
  geom_point() +
  stat_correlation(label.x = "right", label.y = "bottom") +
  facet_wrap(~group)

Highlighting based on estimates: here the estimated R is used, but other estimates, such as the P-value, can be used in the same way (a sketch follows the next chunk).

Code
ggplot(df1, aes(x, y)) +
  geom_point() +
  stat_correlation(mapping = 
                     aes(color = ifelse(after_stat(cor) > 0.5,
                                        "red", "black"))) +
  scale_color_identity() +
  facet_wrap(~group)
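The same approach keyed to the P-value might look as follows. This is a sketch rather than code from the original page, and it assumes that stat_correlation() exposes a computed variable named p.value to after_stat(); the 0.001 threshold is arbitrary.

Code
ggplot(df1, aes(x, y)) +
  geom_point() +
  stat_correlation(mapping =
                     aes(color = ifelse(after_stat(p.value) < 0.001,
                                        "red", "black"))) +
  scale_color_identity() +
  facet_wrap(~group)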

Alternatives

Tip

Package ‘ggpubr’ provides alternative approaches to correlation testing and reporting.

Polynomials

Polynomials can be fitted as linear models, although many other linear models exist. The most common case is linear regression. Polynomials are very frequently used and, in the case of linear regression, the role of the parameters is easily recognized.

The examples in this section make use of stat_poly_line() and stat_poly_eq() from package ggpmisc.

The model corresponding to linear regression is a first degree polynomial.

Code
ggplot(df1, aes(x, y)) +
  geom_point() +
  stat_poly_line() +
  stat_poly_eq(use_label("eq"), 
               label.x = "right", label.y = "bottom") +
  facet_wrap(~group)

Most likely the variables mapped onto the x and y aesthetics have different names. If we use symbols to represent these variables, we may want to use the same symbols in the equation.

Code
ggplot(df1, aes(x, y)) +
  geom_point() +
  stat_poly_line() +
  stat_poly_eq(use_label("eq"), 
               eq.with.lhs = "italic(t)~`=`~", 
               eq.x.rhs = "~italic(E)",
               label.x = "right", label.y = "bottom") +
  labs(x = expression("Irradiance, "*italic(E)~~(W~m^{-2})),
       y = expression("Temperature, "*italic(t)~~(degree*"C"))) +
  facet_wrap(~group)

Code
set.seed(4321)
# generate artificial data for a 3rd degree polynomial
x <- 1:100
y <- (x + x^2 + x^3) + rnorm(length(x), mean = 0, sd = mean(x^3) / 4)
y <- y / max(y)
df2 <- data.frame(x, 
                  y, 
                  group = c("A", "B"), 
                  y2 = y * c(1, 2) + c(0, 0.2),
                  block = c("a", "a", "b", "b"),
                  wt = sqrt(x))

\(R^2\) is the default label, as its estimate is always available.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(formula = formula)

Fitted model equation, available for polynomials with no missing terms.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label("eq"), formula = formula)

\(R_\mathrm{adj}^2\) and P-value added separately.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label("adj.R2"), formula = formula) +
  stat_poly_eq(mapping = use_label("P"), label.x = "right", label.y = "bottom", size = 3,
               formula = formula)

Fitted model equation and \(R_\mathrm{adj}^2\).

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label(c("eq", "adj.R2"), sep = "~~italic(\"with\")~~"),
               formula = formula)

Equation with no left-hand-side (lhs).

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label("eq"),
               eq.with.lhs = FALSE,
               formula = formula)

Variable names in the equation set to be meaningful.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label(c("eq", "R2")),
               eq.with.lhs = "italic(h)~`=`~",
               eq.x.rhs = "~italic(z)",
               formula = formula) +
  labs(x = expression(italic(z)), y = expression(italic(h)))

Even Greek characters and maths can be incorporated when needed.

Code
formula <- y ~ poly(x, 2, raw = TRUE)
ggplot(df2, aes(x, log10(y + 1e6))) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label("eq"),
               eq.with.lhs = "plain(log)[10](italic(delta)+10^6)~`=`~",
               eq.x.rhs = "~Omega",
               formula = formula) +
  labs(y = expression(plain(log)[10](italic(delta)+10^6)), 
       x = expression(Omega))

Caution

stat_poly_eq() by default assumes that the fitted polynomial has the variable mapped to the x or y aesthetic as its explanatory variable, depending on the model formula and on the argument passed to parameter orientation. It does not decode any transformations applied on the fly to x or y. In addition, it does not function correctly with linear models that are not polynomials.

If a transformation is applied, the user is responsible for replacing the \(x\) (and/or \(y\)) symbol in the equation, as shown below, by setting an alternative value that matches the transformation applied in the model formula.

Code
formula <- y ~ I(x^2) # we apply a transformation to x within the formula
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(mapping = use_label("eq"), 
               eq.x.rhs = "~x^2",
               formula = formula)

In some cases, such as when applying a power transformation while fitting a higher degree polynomial, one may need to build the R expression representing the equation manually in order to simplify it; a sketch of this manual approach follows.
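The chunk below is not from the original page: the model, the label position, and the number formatting are only illustrative. The model is fitted outside the plot and the equation is assembled as a plotmath string passed to annotate() with parse = TRUE.

Code
formula <- y ~ I(x^2)
fm <- lm(formula, data = df2)
# assemble the equation by hand as a plotmath string
eq.label <- sprintf("italic(y)~`=`~%.3g + %.3g~italic(x)^2",
                    coef(fm)[1], coef(fm)[2])
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  annotate(geom = "text", x = 5, y = 1, hjust = 0,
           label = eq.label, parse = TRUE)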

The examples shown above also work with groups.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y2, colour = group)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(aes(label = after_stat(eq.label)), 
               formula = formula, 
               vstep = 0.08) +
  theme_bw()

Using colours to indicate the groups to which the equations correspond is not always best. Here we instead add a group label to the left of each equation.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y2, linetype = group, grp.label = group)) +
  geom_point() +
  stat_poly_line(formula = formula, color = "black") +
  stat_poly_eq(aes(label = after_stat(paste("bold(", grp.label, "*':')~~~", 
                                      eq.label, sep = ""))),
               formula = formula)

Facets are also supported.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y2)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(aes(label = after_stat(eq.label)),
               formula = formula) +
  facet_wrap(facets = vars(group), scales = "free_y", ncol = 1)

In the statistics defined in ggpmisc the fit method function can modify the model passed as an argument, because the model used to build the equation is retrieved from the fitted model object whenever possible. In this example, a linear regression is fitted if the slope is significant; if it is not, the mean is reported instead, both graphically and numerically.

Code
# user defined fit method removes the slope if 
# the slope is not significant
poly_or_mean <- function(formula, data, ...) {
   fm <- lm(formula = formula, data = data, ...)
   if (anova(fm)[["Pr(>F)"]][1] > 0.1) {
      lm(formula = y ~ 1, data = data, ...)
   } else {
      fm
   }
}

# we create a plot as usual, but with our method
ggplot(mpg, aes(displ, hwy)) +
   geom_point() +
   stat_poly_line(method = "poly_or_mean") +
   stat_poly_eq(method = poly_or_mean,
                aes(label = after_stat(eq.label)),
                label.x = "right") +
   theme(legend.position = "bottom") +
   facet_wrap(~class, ncol = 2)

Faceting works also with free x and/or y scales in panels.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y2, fill = block)) +
  geom_point(shape = 21, size = 3) +
  stat_poly_line(formula = formula) +
  stat_poly_eq(aes(label = after_stat(rr.label)), size = 3,
               geom = "label_npc", alpha = 0.33,
               formula = formula) +
  facet_wrap(~group, scales = "free_y") +
  theme(legend.position = "top")

Faceting works also with free x and/or y scales in panels when the position of labels is set by a string.

Code
formula <- y ~ poly(x, 3, raw = TRUE)
ggplot(df2, aes(x, y2, fill = block)) +
  geom_point(shape = 21, size = 3) +
  stat_poly_line(formula = formula) +
  stat_poly_eq(use_label("R2"), size = 3,
               geom = "label_npc", alpha = 0.33,
               formula = formula,
               label.x = "right", label.y = "bottom",
               rr.digits = 3) +
  stat_poly_eq(use_label(c("F", "P")),
               size = 3,
               geom = "label_npc", alpha = 0.33,
               formula = formula) +
  facet_wrap(~group, scales = "free_y") +
  theme(legend.position = "top")

Some artificial data with rather high random variation on both x and y.

Code
set.seed(94321)
a <- (1:100) / 10
x <- a + rnorm(length(a), sd = 1)
y <- a + rnorm(length(a), sd = 2)
df3 <- data.frame(x = x, y = y)

Linear regression through the origin.

Code
# model with intercept = 0
formula <- y ~ x + 0
ggplot(df3, aes(x, y)) +
  geom_point() +
  stat_poly_line(formula = formula) +
  stat_poly_eq(formula = formula,
               mapping = use_label("eq"))

Regressions of y on x and x on y in the same plot. The lines are drawn for the range of the explanatory variable, which differs in this case between the two fits.

Code
# the default for formula is y ~ x
ggplot(df3, aes(x, y)) +
  geom_point() +
  stat_poly_line(color = "blue") +
  stat_poly_eq(mapping = use_label(c("R2", "eq")), 
               color = "blue") +
  stat_poly_line(color = "red", 
                 orientation = "y") +
  stat_poly_eq(mapping = use_label(c("R2", "eq")), 
               color = "red", 
               orientation = "y",
               label.y = 0.9)

Alternatives

Tip

Package ‘ggpubr’ provides a renamed copy of stat_poly_eq() taken from an old version of ggpmisc. The version in ‘ggpubr’ is much more limited in its functionality and even contains bugs. The line-drawing code in stat_smooth() from ggplot2 was used as a basis for stat_poly_line(); the main difference is that stat_poly_line() by default uses lm() as method irrespective of the data size, while stat_smooth() switches to using splines for large numbers of observations.

Major axis regression

If x and y are both subject to random errors, and neither is clearly the cause of the other, we should use major axis regression instead of linear regression.

The examples in this section make use of stat_ma_line() and stat_ma_eq() from package ggpmisc.

Code
ggplot(df3, aes(x, y)) +
  geom_point() +
  stat_ma_line() +
  stat_ma_eq(mapping = use_label(c("R2", "eq")))

Tip

No alternatives I know of.

Quantile regression

Quantile regression is frequently used in some disciplines, like economics, but less frequently in other fields. It is robust and can be rather easily interpreted based on its similarity to the well-known box plots.

The examples in this section make use of stat_quant_line(), stat_quant_band() and stat_quant_eq() from package ggpmisc.

Caution

Quantile regression fits are done by numerical approximation, and frequently generate warnings about non-unique solutions or similar problems. Their importance needs to be assessed by users. In the examples below they seem mostly unimportant, given the large number of observations, and have been silenced in the output.

A median regression with a band limited by the upper and lower quartile regressions.

Code
# the default for formula is y ~ x
ggplot(df3, aes(x, y)) +
  geom_point() +
  stat_quant_band() +
  stat_quant_eq(quantiles = 0.5)

Median regression through the origin.

Code
# model with intercept = 0
formula <- y ~ x + 0
ggplot(df3, aes(x, y)) +
  geom_point() +
  stat_quant_band(formula = formula) +
  stat_quant_eq(formula = formula, quantiles = 0.5)

Quantile regressions of y on x and x on y.

Code
# the default for formula is y ~ x
ggplot(df3, aes(x, y)) +
  geom_point() +
  stat_quant_band(color = "blue") +
  stat_quant_eq(quantiles = 0.5, color = "blue") +
  stat_quant_band(color = "red", 
                 orientation = "y") +
  stat_quant_eq(quantiles = 0.5, color = "red", 
               orientation = "y", label.y = 0.9)

Fitting a polynomial by quantile regression.

Code
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_quant_band(formula = y ~ poly(x, 2)) +
  stat_quant_eq(formula = y ~ poly(x, 2), quantiles = 0.5)

Two quantiles are by default plotted as lines.

Code
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_quant_line(formula = y ~ poly(x, 2), quantiles = c(0.05, 0.95)) +
  stat_quant_eq(formula = y ~ poly(x, 2), quantiles = c(0.05, 0.95))

A single quantile, here the median, is plotted as a line plus a confidence band.

Code
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_quant_line(formula = y ~ poly(x, 2), quantiles = 0.5) +
  stat_quant_eq(formula = y ~ poly(x, 2), quantiles = 0.5)

Equations labelled by quantile.

Code
ggplot(df2, aes(x, y)) +
  geom_point() +
  stat_quant_band(formula = y ~ poly(x, 2), 
                  color = "black", fill = "grey60") +
  stat_quant_eq(aes(label = paste(after_stat(grp.label), "*\": \"*",
                                  after_stat(eq.label), sep = "")),
                formula = y ~ poly(x, 2)) +
  theme_classic()

Equations labelled by group and quantile.

Code
ggplot(df2, aes(x, y, group = group, linetype = group, 
                    shape = group, grp.label = group)) +
  geom_point() +
  stat_quant_line(formula = y ~ poly(x, 2), 
                  quantiles = c(0.1, 0.9), 
                  color = "black") +
  stat_quant_eq(aes(label = paste(after_stat(grp.label), "*\": \"*",
                                  after_stat(eq.label), sep = "")),
                formula = y ~ poly(x, 2), 
                quantiles = c(0.1, 0.9)) +
  theme_classic()

Alternatives

Tip

Package ggplot2 defines stat_quantile(), which can be used to plot some types of quantile regression fits as lines, but it is less flexible than stat_quant_line() and stat_quant_band().

Non-linear models

The examples in the previous sections were based on ready-assembled labels. Here we show examples of how to generate labels for equations and other parameter estimates with an approach applicable to many different types of models.

The examples in this section make use of stat_fit_tidy() from package ggpmisc.

We use as example a fit of the Michaelis-Menten equation of reaction kinetics, a function that is non-linear in its parameters.

Code
# We use methods from package 'broom'
library(broom)

Code
micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tidy(method = "nls", 
                method.args = list(formula = micmen.formula),
                label.x = "right",
                label.y = "bottom",
                aes(label = paste("V[m]~`=`~", signif(after_stat(Vm_estimate), digits = 3),
                                  "%+-%", signif(after_stat(Vm_se), digits = 2),
                                  "~~~~K~`=`~", signif(after_stat(K_estimate), digits = 3),
                                  "%+-%", signif(after_stat(K_se), digits = 2),
                                  sep = "")),
                parse = TRUE) +
  theme_bw()

Code
micmen.formula <- y ~ SSmicmen(x, Vm, K) 
ggplot(Puromycin, aes(conc, rate, colour = state)) +
  geom_point() +
  geom_smooth(method = "nls", 
              formula = micmen.formula,
              se = FALSE) +
  stat_fit_tidy(method = "nls", 
                method.args = list(formula = micmen.formula),
                label.x = "center",
                label.y = "bottom",
                vstep = 0, hstep = -0.3,
                aes(label = paste("V~`=`~frac(", signif(after_stat(Vm_estimate), digits = 2), "~C,",
                                  signif(after_stat(K_estimate), digits = 2), "+C)",
                                  sep = "")),
                parse = TRUE) +
  labs(x = "C", y = "V") +
  theme_bw()

Alternatives

Tip

In all cases it is possible to do the model fitting before creating the plot, assemble the labels in one’s own R code, and pass them to ggplot() as data. The possibilities are nearly unlimited, but implementing them, depending on one’s familiarity with R and ‘ggplot2’, can become very time consuming. I wrote ggpp and ggpmisc in part to avoid having, as a user, to remind myself of how to write such code time and again. My hope is that others will also save time and effort by using the packages I have developed.
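One possible sketch of this approach is shown below; it is not code from the original page, and the label position and formatting are arbitrary. A model is fitted to df3 outside of ggplot(), a plotmath label is assembled, and both the fitted line and the label are passed as data.

Code
fm <- lm(y ~ x, data = df3)
# assemble a plotmath label from the fitted model
labels.df <- data.frame(
  x = 0, y = max(df3$y),
  label = sprintf("italic(y)~`=`~%.3g + %.3g~italic(x)~~italic(R)^2~`=`~%.2f",
                  coef(fm)[1], coef(fm)[2], summary(fm)$r.squared)
)
ggplot(df3, aes(x, y)) +
  geom_point() +
  geom_line(data = data.frame(x = df3$x, y = fitted(fm)), colour = "blue") +
  geom_text(data = labels.df, aes(label = label), parse = TRUE, hjust = 0)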