‘ggplot2’ Basics

The Layered Grammar of Graphics

R
plotting
Author

Pedro J. Aphalo

Published

2023-05-03

Modified

2023-07-27

Abstract

Plot construction with the Layered Grammar of Graphics implemented in R package ‘ggplot2’ and extensions can feel like a puzzle unless the mechanics are understood. I explain how ‘ggplot2’ plots are assembled and later rendered. I focus on the grammar and its features.

Keywords

ggplot2 pkg, data visualization, dataviz

Note

To see the source of this document click on “</> CODE” to the right of the page title. The page is written using Quarto which is an enhanced version of R Markdown. The diagrams are created with Mermaid, a language inspired by the simplicity of Markdown.

Warning

Package ‘ggplot2’ has gained new features over its long life, and although few changes have been ‘code breaking’, you should be aware that the examples in this page have been tested with version 3.4.2.

1 Grammar of Graphics

The Grammar of Graphics (GoG) was devised by Lee Wilkinson (1999) as a mathematical theory applicable to the construction of any data visualization. It is most relevant to the design of software for data visualization, and does not concern visual and perceptual aspects of plotting data. It can be depicted as a sequence of steps connected by operations (Figure 1).

flowchart TB
  A(Data variables) --> B(algebra) --> C(Scales) --> D(Statistics) --> E(Geometry) --> F(Coordinates) --> G(Aesthetics) --> H(Rendered\ndata visualization)
Figure 1: The Grammar of Graphics depicted as a simplified conceptual diagram.

Basing plot construction on a grammar gives users the freedom to create new types of data visualizations including those not imagined by the designers of software. This is similar to how a natural language like English allows writers and poets to express original ideas within a context that makes them understandable to others and amenable to be processed by existing software and hardware.

A recent article by J. Friedly (2022) in The Nightingale summarises a debate between L. Wilkinson and M. Friendly about what the GoG actually is.

The GoG is conceptual and identifies the key classes of objects starting from data and the classes of methods used to transform one to the next, giving a final graphic object that still needs to be rendered (J. Friedly 2022). To be used to actually construct data visualizations the GoG needs to be implemented as computer software. Use of the GoG as a paradigm affects how users of software will interact with a computer to construct plots, but designers of software user interfaces based on the GoG remain responsible for many decisions affecting how plots are in practice constructed.

One very successful software implementation of the GoG is R package ‘ggplot2’ and its extensions. I have myself published four R packages extending the types of plots that can be constructed using the GoG as implemented in ‘ggplot2’. These packages, although respecting Wilkinson’s original GoG, extend its implementation in R.

2 Package ‘ggplot2’

It is best to think of ‘ggplot2’ and its extensions as a language used to specify how to build a plot. This language gives access to the GoG abstraction about the structure of data plots and how to build this structure step by step.

Understanding the GoG abstraction is the most important aspect of learning to create ggplots. Once one grasps the general picture, what remains is just the choice among different “building blocks”. Building blocks of the same “kind” can in most cases replace each other, differing drastically in the graphical output, but minimally in how they are combined into a plot-construction description.

Note

I consider packages like ‘ggrepel’, ‘ggpmisc’, and ‘ggbeeswarm’, which provide extensions to the grammar, as extensions to package ‘ggplot2’. Some packages like ‘ggpubr’ mostly define functions that build whole plots using ‘ggplot2’ and return "gg" objects, but are designed to be used on their own. Such functions do not extend the grammar of graphics; they define their own user interface. Package ‘ggspectra’ has a dual personality as it extends the grammar but also defines special autoplot() methods for spectra. It remains consistent with ‘ggplot2’ because the generic method autoplot() is defined in package ‘ggplot2’.

3 The main steps

Unlike in many other data plotting approaches, ggplots are constructed as R objects. So plotting data consists in constructing an object of class "gg" followed by its rendering. Like any R object, "gg" objects can be stored in variables and can be printed to display them. When printed, ggplots are rendered by default as a graphical representation; however, they can also be displayed as text to reveal their structure. That both representations can be obtained at each step of their construction and that "gg" objects can be built by the successive addition of components has two main advantages: 1) we can save parts of a "gg" object and reuse them, and 2) we can build a plot bit by bit and check the effect of each addition to the "gg" object.

It is possible, but infrequently needed, to edit an existing ggplot object: one can add, remove and replace components and also change the order of the layers. In most cases it is easier to edit the code used to create the plot, but if one does not have access to the original code or data, editing a ggplot can save the day. Editing is also an effective way of learning the internals of ggplots if one is interested in them. My package ‘gginnards’ makes editing easier. I developed this package to help me learn how ‘ggplot2’ works and to debug the code in my other packages with extensions to ‘ggplot2’.

The steps needed to create a plot using the grammar of graphics are:

  1. Build an R object with data and instructions for making the plot.

  2. (Possibly add to or even “edit” the R object).

  3. “Render” the plot (convert it into a graphical object).

  4. Display the graphical object or save it in a file.

4 The layers

We can use different abstractions to describe a plot, both static and dynamic. Structurally, a plot can be thought of as a stack of graphic layers, each drawn on a transparent imaginary substrate. Thus, as when drafting a plot with ink on paper, what we draw first can be occluded by something we draw on top of it. When we build complex plots we construct a "gg" object layer by layer; these layers, even if not drawn at the time we add them, will be rendered into graphical objects in the order we have added them to the "gg" object.

The abstraction based on layers is the key to the flexibility of ggplots: we can build an almost infinite variety of plots by combining different layers, each of them quite simple and with an easy to understand role. In reality, we can also adjust things to an extent within each layer. Importantly, not only layer functions defined in ‘ggplot2’ but also layer functions defined in other R packages or in a user script can be used to add layers to ggplots, further expanding the range of available types of layers.
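As a small sketch of how the order of addition determines what occludes what, the two plots below contain the same two layers added in opposite order. The object names and the sizes and colours are arbitrary choices for this illustration.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = disp, y = mpg))

# Points are added first, so the line is drawn on top of them.
line_on_top <- base +
  geom_point(size = 4, colour = "grey60") +
  geom_line(colour = "red")

# Same layers in reverse order: now the points occlude the line.
points_on_top <- base +
  geom_line(colour = "red") +
  geom_point(size = 4, colour = "grey60")

# The order in which layers are stored is the order in which they are drawn.
sapply(line_on_top$layers, function(l) class(l$geom)[1])
sapply(points_on_top$layers, function(l) class(l$geom)[1])
```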

Note

Defining new layer functions is fairly simple as layer functions defined in extension packages or scripts can rely on ‘ggplot2’ to do most of the work. This suggests that the overwhelming success of ‘ggplot2’ is, similarly to the success of R itself, supported by the ease with which new types of plots can be implemented as extensions.
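The simplest way of defining a layer function in a script is to wrap an existing one, as sketched below. The function stat_group_mean() is hypothetical, defined here only for illustration. It is not part of ‘ggplot2’ or of any extension package, and it delegates all computations to stat_summary(). Extension packages usually go further and define new statistics and geometries.

```r
library(ggplot2)

# Hypothetical layer function: plots the mean of y for each value of x.
# All the work is done by stat_summary() from 'ggplot2'.
stat_group_mean <- function(mapping = NULL, data = NULL, ...,
                            colour = "red", size = 3) {
  stat_summary(mapping = mapping, data = data,
               fun = mean, geom = "point",
               colour = colour, size = size, ...)
}

# Used like any other layer function.
p <- ggplot(mtcars, aes(x = factor(cyl), y = mpg)) +
  geom_point(alpha = 0.4) +
  stat_group_mean()
p
```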

5 The data flow

I now consider a dynamic abstraction: the data flow describing what transformations are applied to the data in different components of a ggplot (Figure 2). These transformations take place when the plot is rendered, not when it is built, and take place separately in each layer. A ggplot object contains data but also functions that describe the operations to be carried out on the data during rendering.

flowchart LR
  A(Layer data) --> B[statistic] --> C[geometry] --> D[layer 'grobs']
  A -.-> b[identity\nstatistic] -.-> C
Figure 2: Data flow in a single plot layer. ggplot objects can contain zero, one, or more layers.

The data plotted in a ggplot can be shared among all (Figure 3) or some layers or be different for each layer (Figure 4). During rendering each layer generates graphical objects (grobs for short) and other code in ‘ggplot2’ creates the ancillary grobs such as those for the axes, grid, background and legends or keys. When a plot is rendered, all these grobs are collected into an R object, i.e., what code within ‘ggplot2’ creates are instructions to draw all graphical features in the final plot. This is not yet the plot itself, which is rendered in a final step by R’s graphic devices into any of the formats supported by R.

flowchart LR
  A(Plot data) --> B1[statistic 1] --> C1[geometry 1] --> D[layer 1 'grobs'\nlayer 2 'grobs'\nlayer 3 'grobs']
  A--> B2[statistic 2] --> C2[geometry 2] --> D 
  A--> B3[statistic 3] --> C3[geometry 3] --> D 
  D --> E1(computer screen)
  D -.-> E2(PDF file)
  D -.-> E3(EPS file)
  D -.-> E4(JPEG file)
  D -.-> E5(SVG file)
  D -.-> E6(PNG file)
  D -.-> E7(TIFF file)
Figure 3: Data flow in a plot containing three layers, sharing the same data. The geoms during plot rendering output graphical objects, or “plot drawing instructions”. The final rendering into a specific file or screen format is done by R’s graphic devices, not by code in the ‘ggplot2’ package.
flowchart LR
  A1(Layer 1 data) --> B1[statistic 1] --> C1[geometry 1] --> D[layer 1 'grobs'\nlayer 2 'grobs']
  A2(Layer 2 data) --> B2[statistic 2] --> C2[geometry 2] --> D 
  D --> E1(computer screen)
  D -.-> E2(PDF file)
  D -.-> E3(EPS file)
  D -.-> E4(JPEG file)
  D -.-> E5(SVG file)
  D -.-> E6(PNG file)
  D -.-> E7(TIFF file)
Figure 4: Data flow in a plot containing two layers, each one with different data. See legend to Figure 3 for details.
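The data flow in the diagrams can be inspected from R. The sketch below assumes a plot with three layers, two of them sharing the default data and one with its own data. It uses layer_data(), ggplot_build() and ggplot_gtable() from ‘ggplot2’. The object names and the temporary PDF file are arbitrary.

```r
library(ggplot2)

p <- ggplot(mtcars, aes(x = disp, y = mpg)) +       # default (shared) data
  geom_point() +                                    # layer 1: identity statistic
  stat_smooth(method = "lm", formula = y ~ x) +     # layer 2: statistic alters data
  geom_point(data = subset(mtcars, cyl == 4),       # layer 3: its own data
             colour = "red")

# Data as it reaches each geometry, i.e., after the statistic was applied.
nrow(layer_data(p, 1))  # one row per observation
nrow(layer_data(p, 2))  # rows are points along the fitted line
nrow(layer_data(p, 3))  # only the observations in this layer's data

# The two stages of rendering done by 'ggplot2'...
built <- ggplot_build(p)        # statistics and scales are applied
grobs <- ggplot_gtable(built)   # geoms create grobs ("drawing instructions")

# ...followed by drawing by one of R's graphic devices, here a PDF file.
pdf(tempfile(fileext = ".pdf"))
grid::grid.draw(grobs)
invisible(dev.off())
```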

As with all abstractions, the simple diagrams and explanations above ignore the real complexity. One of the ignored steps is crucial: how information in the data is encoded as graphical elements drawn in the plot, and how we can control this step or mapping. In the next section we will build a plot one step at a time.

6 Building a plot step by step

Building a plot one step at a time, and printing it at each step, demonstrates how the grammar of graphics and ‘ggplot2’ work. The first step is to attach the packages we will use.

library(scales)
library(ggplot2)
library(gginnards)

An empty "gg" object can be rendered as a plot.

ggplot()

We pass a data frame containing the data to be plotted. As we pass it as an argument to ggplot() it becomes the default data for the individual layers we will later add. With no mapping present yet, the rendered plot remains empty: only once a mapping is present does the range of values mapped to each aesthetic become known, and x and y axes are added.

ggplot(data = mtcars)

The mapping of variables in the data to plot aesthetics is done with function aes(). As we pass the value returned by function aes() as an argument to ggplot() this mapping becomes the default mapping for the individual layers we will later add.

ggplot(data = mtcars,
       aes(x = disp, y = mpg))

Geometries are layer functions. The one used here, geom_point(), creates a graphical representation of the data as mapped to the x and y aesthetics, drawing symbols or points on the drawing area of the plot.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point()

Above, the shape and colour of the points are the default ones. We can, instead of mapping variables to aesthetics, assign constant values to aesthetics. This is best done directly as arguments to layer functions as shown here rather than using aes().

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point(color = "red", shape = "square open", size = 3)
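To see why constants are best passed directly to the layer function, the sketch below compares the two approaches. When a constant is passed inside aes() it is treated as data, mapped through the default colour scale, and given a legend. The object names are arbitrary.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = disp, y = mpg))

# Constant passed as an argument: used as is, no legend is added.
p_set <- base + geom_point(colour = "red")

# Constant inside aes(): treated as a variable with the single value "red",
# so the colour actually drawn is chosen by the default colour scale.
p_mapped <- base + geom_point(aes(colour = "red"))

# Colours as they reach the geometry in each plot.
unique(layer_data(p_set)$colour)
unique(layer_data(p_mapped)$colour)
```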

The default statistic of geom_point() is stat_identity(), which does not alter the data. So by default geom_point() behaves as if no statistic was present, thus above the observations were plotted as is. All other statistics modify the data before it reaches the geometry. We add as an example stat_smooth(), which fits a smoother to the data. We override the default geometry of stat_smooth(), setting it to geom_line() with geom = "line".

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)

The correspondence between values in the data and values of an aesthetic is controlled by the corresponding scale. Here we replace the scale used by default (scale_y_continuous()) with scale_y_log10() so that the y axis uses a logarithmic scale. The transformation is applied to the data before the statistics are computed, but the legends and tick labels still show the values before the transformation, which in most cases makes the plot easy to read.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  scale_y_log10()

Coordinates are applied after the statistics “see” the data, so changing the limits with them is similar to zooming into a finalized plot based on all the data. This is very important to remember when statistics are used. In a plot like this one, using scale limits to zoom in would result in the regression being fitted only to the data actually visible within the plotting area, while using coordinate limits ensures that the regression is fitted to the whole data set.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  coord_cartesian(ylim = c(15, 25))
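The difference between the two kinds of limits can be checked by comparing the data returned by the statistic. This is a sketch in which the scale-based variant, scale_y_continuous(limits = c(15, 25)), is my own addition for comparison. The object names are arbitrary.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)

# Scale limits: out-of-range observations are discarded before the fit.
p_scale <- base + scale_y_continuous(limits = c(15, 25))
# Coordinate limits: the fit uses all observations; only the view is cropped.
p_coord <- base + coord_cartesian(ylim = c(15, 25))

# Warnings about removed rows are expected for p_scale.
fit_scale <- suppressWarnings(layer_data(p_scale, 2))
fit_coord <- layer_data(p_coord, 2)

range(fit_scale$x)  # the line spans only the retained observations
range(fit_coord$x)  # the line spans the whole data set
```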

A transformation applied through a coordinate affects the values after the statistic has computed them, thus in this plot the linear regression is represented by a curve. This is in contrast to the example above with scale_y_log10(), where the linear regression was fitted to the log10() transformed data and thus graphically represented by a straight line in spite of the transformed y scale.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x) +
  coord_trans(y = "log10")
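The contrast between the two transformations can be verified by inspecting what the statistic returns in each case. In this sketch the comparison against lm() fitted outside the plot is my own check, and the object names are arbitrary.

```r
library(ggplot2)

base <- ggplot(mtcars, aes(x = disp, y = mpg)) +
  geom_point() +
  stat_smooth(geom = "line", method = "lm", formula = y ~ x)

p_scale <- base + scale_y_log10()           # transformation before the statistic
p_coord <- base + coord_trans(y = "log10")  # transformation after the statistic

# Fitted values computed by the statistic in each plot.
fit_scale <- layer_data(p_scale, 2)
fit_coord <- layer_data(p_coord, 2)

# With the scale, the statistic works with log10(mpg)...
range(fit_scale$y)
# ...while with the coordinate it works with mpg in its original units.
range(fit_coord$y)

# The second fit agrees with lm() applied to the untransformed data.
fm <- lm(mpg ~ disp, data = mtcars)
all.equal(fit_coord$y,
          unname(predict(fm, newdata = data.frame(disp = fit_coord$x))))
```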

Themes are similar to style sheets, and they control the appearance and position of only those graphical elements that are not created by layer functions.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic()

Many elements in themes are defined hierarchically, and for example text sizes are by default set relative to a base size. Here we increase the size of text elements and change the base font family. However, the size of the points is not affected.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  theme_classic(base_size = 20, base_family = "serif")

We can modify individual theme settings, instead of or in addition to replacing the theme as a whole.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  theme(axis.title = element_text(face = "bold"),
        axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))
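As plot components are ordinary R objects, theme settings like those above can be stored in a variable and reused in several plots. The sketch below shows one possible way. The names my_theme and my_parts are arbitrary.

```r
library(ggplot2)

# A complete theme plus individual settings, stored for reuse.
my_theme <- theme_classic(base_size = 14) +
  theme(axis.title = element_text(face = "bold"))

# Several components can also be kept together in a list.
my_parts <- list(geom_point(), my_theme)

# The same stored components are added to two different plots.
p1 <- ggplot(mtcars, aes(x = disp, y = mpg)) + my_parts
p2 <- ggplot(mtcars, aes(x = hp, y = mpg)) + my_parts
```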

In a plot for publication, axis and legend labels usually need to be clearer and more elegant than simple variable names. labs() is a convenience function that makes setting these texts straightforward. Embedding of new line characters (\n) within the character strings is supported, and in some cases very useful.

ggplot(data = mtcars,
       aes(x = disp, y = mpg)) +
  geom_point() +
  labs(x = "Engine displacement (cubic inches)",
       y = "Fuel use efficiency\n(miles per gallon)",
       title = "Motor Trend Car Road Tests",
       subtitle = "Source: 1974 Motor Trend US magazine")

Finally, as shown here, aesthetics can be mapped to R expressions, not just to variables in data. For example, here we plot mpg (miles per gallon) vs. disp / cyl (the displacement of individual engine cylinders).

ggplot(data = mtcars, 
       aes(x = disp / cyl, y = mpg, colour = factor(cyl))) +
  geom_point()
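A quick check, sketched below, shows that the expressions themselves are stored in the plot object and are evaluated using the data only when the plot is built.

```r
library(ggplot2)

p <- ggplot(data = mtcars,
            aes(x = disp / cyl, y = mpg, colour = factor(cyl))) +
  geom_point()

# The mapping stores the expressions, not the values computed from them.
p$mapping

# The values are computed from the data when the plot is built.
all.equal(sort(layer_data(p)$x), sort(mtcars$disp / mtcars$cyl))
```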

7 The internals

For most users of ‘ggplot2’ and its extensions it is crucial to understand the grammar of graphics. The internals of "gg" objects can be ignored by most users, although a rough idea of how ‘ggplot2’ works can be useful when facing error messages and “code that does not work”. It can also be useful in cases when modifying an existing "gg" object is the only available or easiest approach.

Above we have implicitly printed the plots into their graphical representation. Here we save the "gg" object into variable p and then explore the structure of the object, which reveals how the different components of the “plot drawing recipe” are stored. For example, one can see that “actions” are stored in the object as function definitions.

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()
summary(p)
data: manufacturer, model, displ, year, cyl, trans, drv, cty, hwy, fl,
  class [234x11]
mapping:  x = ~displ, y = ~hwy
faceting: <ggproto object: Class FacetNull, Facet, gg>
    compute_layout: function
    draw_back: function
    draw_front: function
    draw_labels: function
    draw_panels: function
    finish_data: function
    init_scales: function
    map_data: function
    params: list
    setup_data: function
    setup_params: function
    shrink: TRUE
    train_scales: function
    vars: function
    super:  <ggproto object: Class FacetNull, Facet, gg>
-----------------------------------
geom_point: na.rm = FALSE
stat_identity: na.rm = FALSE
position_identity 
str(p, max.level = 1, list.len = 4)
Object size: 29.3 kB
List of 9
 $ data       : tibble [234 × 11] (S3: tbl_df/tbl/data.frame)
 $ layers     :List of 1
 $ scales     :Classes 'ScalesList', 'ggproto', 'gg' <ggproto object: Class ScalesList, gg>
    add: function
    clone: function
    find: function
    get_scales: function
    has_scale: function
    input: function
    n: function
    non_position_scales: function
    scales: list
    super:  <ggproto object: Class ScalesList, gg> 
 $ mapping    :List of 2
  [list output truncated]
str(p$layers, max.level = 1)
List of 1
 $ :Classes 'LayerInstance', 'Layer', 'ggproto', 'gg' <ggproto object: Class LayerInstance, Layer, gg>
    aes_params: list
    compute_aesthetics: function
    compute_geom_1: function
    compute_geom_2: function
    compute_position: function
    compute_statistic: function
    computed_geom_params: NULL
    computed_mapping: NULL
    computed_stat_params: NULL
    constructor: call
    data: waiver
    draw_geom: function
    finish_statistics: function
    geom: <ggproto object: Class GeomPoint, Geom, gg>
        aesthetics: function
        default_aes: uneval
        draw_group: function
        draw_key: function
        draw_layer: function
        draw_panel: function
        extra_params: na.rm
        handle_na: function
        non_missing_aes: size shape colour
        optional_aes: 
        parameters: function
        rename_size: FALSE
        required_aes: x y
        setup_data: function
        setup_params: function
        use_defaults: function
        super:  <ggproto object: Class Geom, gg>
    geom_params: list
    inherit.aes: TRUE
    layer_data: function
    map_statistic: function
    mapping: NULL
    position: <ggproto object: Class PositionIdentity, Position, gg>
        compute_layer: function
        compute_panel: function
        required_aes: 
        setup_data: function
        setup_params: function
        super:  <ggproto object: Class Position, gg>
    print: function
    setup_layer: function
    show.legend: NA
    stat: <ggproto object: Class StatIdentity, Stat, gg>
        aesthetics: function
        compute_group: function
        compute_layer: function
        compute_panel: function
        default_aes: uneval
        dropped_aes: 
        extra_params: na.rm
        finish_layer: function
        non_missing_aes: 
        optional_aes: 
        parameters: function
        required_aes: 
        retransform: TRUE
        setup_data: function
        setup_params: function
        super:  <ggproto object: Class Stat, gg>
    stat_params: list
    super:  <ggproto object: Class Layer, gg> 
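Individual members of the structure printed above can also be extracted with the usual R operators. The sketch below retrieves the single layer of p and confirms that its “actions” are functions. It relies on the internal structure shown above, which may differ in other versions of ‘ggplot2’.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

layer <- p$layers[[1]]   # the only layer in this plot

class(layer$geom)[1]     # the geometry is a ggproto object
class(layer$stat)[1]     # and so is the statistic

# "Actions" are functions stored as members of these objects.
is.function(layer$geom$draw_panel)
is.function(layer$stat$compute_group)
```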

Package ‘gginnards’ makes it rather easy to modify "gg" objects. I occasionally find it useful to alter the order of layers and to insert layers out-of-order in an existing "gg" object. Deleting those variables that have not been mapped to aesthetics from the data stored within an existing "gg" object can sometimes be simpler than constructing the plot object again with a subset of the data.
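A brief sketch of such edits follows, assuming that ‘gginnards’ is installed and that the functions behave as described in its documentation. The plot and the object names are arbitrary examples.

```r
library(ggplot2)
library(gginnards)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  stat_smooth(method = "lm", formula = y ~ x)

num_layers(p)                  # number of layers in the plot
which_layers(p, "GeomPoint")   # position of the layer of points

# Move the layer of points to the top, so that it is drawn last.
p_moved <- move_layers(p, "GeomPoint", position = "top")

# Remove the layer with the fitted line.
p_deleted <- delete_layers(p, "GeomSmooth")

# Drop from the embedded data the variables not mapped to aesthetics.
p_small <- drop_vars(p)
names(p_small$data)
```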

Tip

This brief introduction only touched on the basic aspects of the grammar of graphics as implemented in R package ‘ggplot2’. This is enough to get many different types of plots done successfully. In most cases, doing different types of plots requires one to find a suitable layer function. While in many cases default arguments to these functions will yield a usable plot, normally, studying the help page of a layer function will make clear its features and how to use them effectively. Thus, it is not necessary, and a waste of time, to try to become an expert across all possible types of plots. It is enough to understand how plots are assembled, and learn when the need arises, how to use individual layer functions.

A good place to start looking for layer functions available in extension packages is the site ggplot2 extensions and its gallery. The site is especially effective for those packages that are specialized and export a single or a few layer functions, because the gallery displays a single example plot per package, and the list of extensions only very few examples per package.

Several of the pages here listed under galleries contain many examples of the use of the extensions to ‘ggplot2’ that I have published in packages ‘ggpp’, ‘ggpmisc’, ‘gginnards’ and ‘ggspectra’.