library(tidyverse) # for dplyr, ggplot2 packages
library(ggh4x) # for minor axis functionality
Introduction
Rook et al. (2022) analyzed historical data to answer the question of how many ciscoes are needed for stocking in the Laurentian Great Lakes. Their Figure 4 shows lake-specific estimates of fry, fall fingerling, and age-1 stocking densities used to determine the number of Cisco (Coregonus artedi) needed for stocking to mimic historical and contemporary age-1 and adult recruitment rates in Wisconsin waters of Lake Superior. I attempt to recreate their figure here.
The following packages are loaded for use below. Also one function from each of scales
and ggtext
is used with ::
such that the entire packages are not attached here.
The figure recreated here is often called a “dynamite plot.” I understand that there is controversy around these types of plots (see here]. However, they appear often in the fisheries literature, for better or worse. With this in mind, I thought it worthwhile to some to show how one might be constructed with ggplot2
.
Get Data
Rook et al. (2022) provided summarized data presented in Figure 4 in their Data Supplement S1. Unfortunately the data were provided as a table in a MSWord document. I tried to extract the data from the Word document using the officer
package but was unsuccessful. Fortunately, the tables looked like they were originally created in MSExcel, so I copied the “Smax” (i.e., “adult equivalents”) and “Rmax” (age-1 equivalents) portions of the table and pasted them into separate sheets of a single Excel file.
Each sheet is read into a temporary data frame below. Note that the authors used “N/A” to represent missing values so this was accounted for with na=
. Further note that a new variable called Type
was added to each data frame to identify whether the data represented the “Smax” or “Rmax” results.
<- readxl::read_excel("Rooketal2021_S1.xlsx",sheet="Smax",na=c("","N/A"))
tmp1 <- readxl::read_excel("Rooketal2021_S1.xlsx",sheet="Rmax",na=c("","N/A")) tmp2
The two data frames were then row-bound (i.e., stacked) together by putting the data frames into a name list that is given to bind_rows()
. The .id=
argument will create a new variable called Type
with the names from the list. Additionally, the Lake
variable was renamed to Summary
(it contains the summary metrics) and the original “long” names were abbreviated to match those used in the legend of Figure 4 in Rook et al. (2022).1
1 as.data.frame()
is used to make this a regular data frame rather than a “tibble.”
<- bind_rows(list(Smax=tmp1,Rmax=tmp2),.id="Type") |>
tmp rename(Summary=Lake,
`Hist. Fry`=`Historical Fry`,`Hist. Fingerling`=`Historical Fingerling`,
`Hist. Age-1`=`Historical Age-1`,`Cont. Fry`=`Contemporary Fry`,
`Cont. Fingerling`=`Contemporary Fingerling`,`Cont. Age-1`=`Contemporary Age-1`) |>
as.data.frame()
head(tmp)
#R| Type Summary Hist. Fry Hist. Fingerling Hist. Age-1 Cont. Fry
#R| 1 Smax Erie 1553 959 593 156
#R| 2 Smax U95 1727 1056 643 174
#R| 3 Smax L95 1391 870 545 140
#R| 4 Smax Saint Clair 1490 927 578 151
#R| 5 Smax U95 1671 1031 634 169
#R| 6 Smax L95 1314 825 527 135
#R| Cont. Fingerling Cont. Age-1
#R| 1 111 78
#R| 2 122 86
#R| 3 99 72
#R| 4 107 76
#R| 5 118 83
#R| 6 97 70
These data are in an awkward format with the now-named Summary
variable containing the lake name followed by U95
and L95
for the upper and lower 95% confidence interval endpoints. I would prefer to have a variable called Lake
that would, for example, say Erie
for the first three rows above and Saint Clair
for the next three rows (and so on for the other lakes). Then, the lake names under Summary
would be replaced with Median
as the values to the right of the lake names are median stocking densities.
I begin by creating the Lake
variable by rep()
eating the names in the first, fourth, seventh, etc. positions in Summary
three times. The first, fourth, seventh, etc. positions are identified as a sequence from 1 to 34 in intervals of 3.2
2 Lake
was moved to the first variable with relocate()
to better see the effect of this code.
<- tmp |>
tmp mutate(Lake=rep(Summary[seq(1,34,by=3)],each=3)) |>
relocate(Lake)
head(tmp)
#R| Lake Type Summary Hist. Fry Hist. Fingerling Hist. Age-1 Cont. Fry
#R| 1 Erie Smax Erie 1553 959 593 156
#R| 2 Erie Smax U95 1727 1056 643 174
#R| 3 Erie Smax L95 1391 870 545 140
#R| 4 Saint Clair Smax Saint Clair 1490 927 578 151
#R| 5 Saint Clair Smax U95 1671 1031 634 169
#R| 6 Saint Clair Smax L95 1314 825 527 135
#R| Cont. Fingerling Cont. Age-1
#R| 1 111 78
#R| 2 122 86
#R| 3 99 72
#R| 4 107 76
#R| 5 118 83
#R| 6 97 70
Now, the lake names in those same positions in Summary
are replaced with “MEDIAN”.
$Summary[seq(1,34,by=3)] <- "MEDIAN"
tmphead(tmp)
#R| Lake Type Summary Hist. Fry Hist. Fingerling Hist. Age-1 Cont. Fry
#R| 1 Erie Smax MEDIAN 1553 959 593 156
#R| 2 Erie Smax U95 1727 1056 643 174
#R| 3 Erie Smax L95 1391 870 545 140
#R| 4 Saint Clair Smax MEDIAN 1490 927 578 151
#R| 5 Saint Clair Smax U95 1671 1031 634 169
#R| 6 Saint Clair Smax L95 1314 825 527 135
#R| Cont. Fingerling Cont. Age-1
#R| 1 111 78
#R| 2 122 86
#R| 3 99 72
#R| 4 107 76
#R| 5 118 83
#R| 6 97 70
These data are still in a format that is not conducive to plotting. What is ultimately needed is for the “MEDIAN”, “U95”, and “L95” to be variables with their respective values, and a variable should indicate the timing (“historical” or “contemporary”) and stage (“fry”, “fingerling”, or “age-1”) for those values.
I begin this process by creating a “longer” data frame that will stack the current “historical” and “contemporary” columns on top of each other. To do this use pivot_longer()
where the columns to pivot in cols=
are from Hist. Fry
to Cont. Age-1
3, the values under those columns will now be in Number
, and the names of those columns will be moved to Stage
.
3 The :
identifies all contiguous columns between the two columns identified.
<- tmp |>
tmp pivot_longer(cols=`Hist. Fry`:`Cont. Age-1`,values_to="Number",names_to="Stage")
head(tmp,n=12)
#R| # A tibble: 12 × 5
#R| Lake Type Summary Stage Number
#R| <chr> <chr> <chr> <chr> <dbl>
#R| 1 Erie Smax MEDIAN Hist. Fry 1553
#R| 2 Erie Smax MEDIAN Hist. Fingerling 959
#R| 3 Erie Smax MEDIAN Hist. Age-1 593
#R| 4 Erie Smax MEDIAN Cont. Fry 156
#R| 5 Erie Smax MEDIAN Cont. Fingerling 111
#R| 6 Erie Smax MEDIAN Cont. Age-1 78
#R| 7 Erie Smax U95 Hist. Fry 1727
#R| 8 Erie Smax U95 Hist. Fingerling 1056
#R| 9 Erie Smax U95 Hist. Age-1 643
#R| 10 Erie Smax U95 Cont. Fry 174
#R| 11 Erie Smax U95 Cont. Fingerling 122
#R| 12 Erie Smax U95 Cont. Age-1 86
The Summary
column then needs to be “expanded” to form three new columns with the median and confidence interval endpoints. This is accomplished with pivot_wider()
with the names of the new columns coming from Summary
and the values in the new columns coming from Number
.
<- tmp |>
tmp pivot_wider(names_from=Summary,values_from=Number)
head(tmp,n=12)
#R| # A tibble: 12 × 6
#R| Lake Type Stage MEDIAN U95 L95
#R| <chr> <chr> <chr> <dbl> <dbl> <dbl>
#R| 1 Erie Smax Hist. Fry 1553 1727 1391
#R| 2 Erie Smax Hist. Fingerling 959 1056 870
#R| 3 Erie Smax Hist. Age-1 593 643 545
#R| 4 Erie Smax Cont. Fry 156 174 140
#R| 5 Erie Smax Cont. Fingerling 111 122 99
#R| 6 Erie Smax Cont. Age-1 78 86 72
#R| 7 Saint Clair Smax Hist. Fry 1490 1671 1314
#R| 8 Saint Clair Smax Hist. Fingerling 927 1031 825
#R| 9 Saint Clair Smax Hist. Age-1 578 634 527
#R| 10 Saint Clair Smax Cont. Fry 151 169 135
#R| 11 Saint Clair Smax Cont. Fingerling 107 118 97
#R| 12 Saint Clair Smax Cont. Age-1 76 83 70
The Type
, Lake
, and Stage
variables were then converted to factors with the levels controlled to match that of Rook et al. (2022). I also ordered the data by Stage
within Lake
within Type
for personal preference.
<- tmp |>
dat mutate(Type=factor(Type,levels=c("Smax","Rmax")),
Lake=factor(Lake,levels=c("Erie","Saint Clair","Ontario","Michigan","Huron","Superior")),
Stage=factor(Stage,levels=c("Hist. Fry","Hist. Fingerling","Hist. Age-1",
"Cont. Fry","Cont. Fingerling","Cont. Age-1"))) |>
arrange(Type,Lake,Stage)
This data frame, now called dat
, is now ready for plotting.
I suspect there is an easier way to wrangle these data into this format, but I don’t know what it is. It likely would have been easier to create this format from the raw data, rather than the tabularized results provided.
Recreating Figure 2
Preliminaries
Rook et al. (2022) used various shades of gray to identify the time-stage (i.e., Stage
) data in their Figure 4. The object below is a named vector with approximations to their color choices.
<- c("Hist. Fry"="gray20","Hist. Fingerling"="gray50","Hist. Age-1"="gray80",
sclrs "Cont. Fry"="gray35","Cont. Fingerling"="gray65","Cont. Age-1"="gray95")
Figure 4 is a side-by-side bar chart with confidence intervals on each bar. Sided-by-side bars require “dodging” in ggplot2
(see below). The confidence intervals also need to be “dodged” to match the bars, but they will only match if the “dodging” is pre-defined. Dodging is defined with position_dodge()
using width=
to determine how much the side-by-side objects will overlap. A width=
of 0 will result in complete overlap, whereas a width=
of 1 results in no overlap with a slight separation between bars. A value of 0.9 resulted in no overlap but touching bars as used in Rook et al. (2022).
<- position_dodge(width=0.9) pd
Finally, Rook et al. (2022) labeled each panel of their Figure 4 within the plot region, rather than as a “title” above the plot region. Thus, as in this previous post about Rook et al. (2022), the default facet labels cannot be used. An added challenge here though is that the labels they used contain plain text, italicized text, and a subscript. Thus, the “trick” used in that post cannot be easily used here.
Here I create a very simple data frame with two variables. The first variable is Type
which contains the two types of data found in dat
. It is important that this variable is exactly as it is in dat
(i.e., factored, levels the same) as this is the variable that is going to define the facets, or sub-panels, of Figure 4. The second variable, Label
, contains the “long” labels for each panel. Here I use markdown code to create the italics (i.e., *
) and the subscript (i.e., ~
).4
4 These will be used in geom_richtext()
from ggtext
below, rather than plotmath
which I find exceedingly difficult to get correct.
<- data.frame(Type=factor(c("Smax","Rmax"),levels=c("Smax","Rmax")),
datlbls Label=c("Adult (*S~max~*) Equivalents",
"Age-1 (*R~max~*) Equivalents"))
Putting It Together
The basic bar plot is constructed from the summarized data in dat
using geom_col()
with Lake
mapped to the x-axis, MEDIAN
mapped to the y-axis, and Stage
mapped to the fill=
color. The bars will not be sided-by-side by default; they need to be “dodged” by setting postion=
to the dodge position value set above. color=
is set to black to outline each bar as in Rook et al. (2022). facet_wrap()
is used to separate the plots by Type
, with only one column used. The x-axis was “freed” so that the axis and labels would be shown in both facets as in Rook et al. (2022).
ggplot() +
geom_col(data=dat,mapping=aes(x=Lake,y=MEDIAN,fill=Stage),
position=pd,color="black") +
facet_wrap(vars(Type),ncol=1,scales="free_x")
Confidence intervals are added with geom_errorbar()
with Lake
again mapped to the x-axis, L95
and U95
mapped to the minimum and maximum y interval values, and Stage
set as a group.5 To match the bars the confidence intervals must be dodged in the same way with postion=
and the “cap” on the interval was made smaller with width=
.
5 The data needs a group
here to match the group created by fill=
in geom_col()
.
ggplot() +
geom_col(data=dat,mapping=aes(x=Lake,y=MEDIAN,fill=Stage),
position=pd,color="black") +
geom_errorbar(data=dat,mapping=aes(x=Lake,ymin=L95,ymax=U95,group=Stage),
position=pd,width=0.5) +
facet_wrap(vars(Type),ncol=1,scales="free_x")
Colors were then set with scale_fill_manual()
using the sclrs
vector defined above. In addition, the y-axis was given a better title, the limits were set, axis expansion was removed, major breaks (i.e., labelled) were set at intervals of 200, and minor breaks were set at intervals of 100.6
6 These axis modifications were described in more detail in this post.
ggplot() +
geom_col(data=dat,mapping=aes(x=Lake,y=MEDIAN,fill=Stage),
position=pd,color="black") +
geom_errorbar(data=dat,mapping=aes(x=Lake,ymin=L95,ymax=U95,group=Stage),
position=pd,width=0.5) +
facet_wrap(vars(Type),ncol=1,scales="free_x") +
scale_fill_manual(values=sclrs) +
scale_y_continuous(name="Stocking Density (fish/ha)",
limits=c(0,1800),expand=expansion(mult=0),
breaks=scales::breaks_width(200),
minor_breaks=scales::breaks_width(100),
guide="axis_minor")
Labels are now added to the facets with geom_richtext()
from the ggtext
package. Here data=
is set to datlbls
, which is why data=dat
was used in geom_col()
and geom_errobar()
rather than being set in ggplot()
.7 The Label
in datlbls
is mapped to the labels, but x=
and y=
are defined outside of the data frame. Specifically, x=
is set to 3.5 because each category is listed behind the scenes with sequential numbers beginning at 1. With six categories on the x-axis, the center will be at 3.5. y=
is set to Inf
to generically be set at the largest y value plotted. Thus, the labels will be centered from left-to-right (x-axis orientation) and at the top (y-axis orientation). vjust=
is used to move the label down slightly (see this post), label.color=NA
removes the default box around the label, and size=
was used to reduce the default size slightly.
7 If multiple data frames are used when constructing a figure, then those data frames are usually declared in the geoms rather than in ggplot()
.
ggplot() +
geom_col(data=dat,mapping=aes(x=Lake,y=MEDIAN,fill=Stage),
position=pd,color="black") +
geom_errorbar(data=dat,mapping=aes(x=Lake,ymin=L95,ymax=U95,group=Stage),
position=pd,width=0.5) +
::geom_richtext(dat=datlbls,mapping=aes(label=Label),x=3.5,y=Inf,
ggtextvjust=0.9,label.color=NA,size=3.5) +
facet_wrap(vars(Type),ncol=1,scales="free_x") +
scale_fill_manual(values=sclrs) +
scale_y_continuous(name="Stocking Density (fish/ha)",
limits=c(0,1800),expand=expansion(mult=0),
breaks=scales::breaks_width(200),
minor_breaks=scales::breaks_width(100),
guide="axis_minor")
Finally, the theme was modified to more closely match Figure 4 in Rook et al. (2022). Specifically, theme_class()
was used as the base and facet labels were removed, the x-axis title was removed, major tick mark size was increased, minor tick mark size was set to 50% of the major tick mark size, the legend was moved to the upper right of the plot, the legend title was removed, the legend text was made smaller, and the legend box size was made smaller.
ggplot() +
geom_col(data=dat,mapping=aes(x=Lake,y=MEDIAN,fill=Stage),
position=pd,color="black") +
geom_errorbar(data=dat,mapping=aes(x=Lake,ymin=L95,ymax=U95,group=Stage),
position=pd,width=0.5) +
::geom_richtext(dat=datlbls,mapping=aes(label=Label),x=3.5,y=Inf,
ggtextvjust=0.9,label.color=NA,size=3.5) +
facet_wrap(vars(Type),ncol=1,scales="free_x") +
scale_fill_manual(values=sclrs) +
scale_y_continuous(name="Stocking Density (fish/ha)",
limits=c(0,1800),expand=expansion(mult=0),
breaks=scales::breaks_width(200),
minor_breaks=scales::breaks_width(100),
guide="axis_minor") +
theme_classic() +
theme(strip.text=element_blank(),
axis.title.x=element_blank(),
axis.ticks.length=unit(5,units="pt"),
ggh4x.axis.ticks.length.minor=rel(0.5),
legend.position=c(1,1),
legend.justification=c(1.05,0.95),
legend.title=element_blank(),
legend.text=element_text(size=7),
legend.key.size = unit(0.75,units="line")
)
This largely reproduces Figure 4 in Rook et al. (2022) with the exceptions that (a) the tick marks don’t cross the x-axes and (b) the x-axis tick marks are centered on the group of bars rather than between the group of bars.
References
Reuse
Citation
@online{h. ogle2023,
author = {H. Ogle, Derek},
title = {Rook Et Al. (2022) {Estimated} {Cisco} {Stocking}
{Densities}},
date = {2023-03-20},
url = {https://fishr-core-team.github.io/fishR//blog/posts/2023-3-20-Rooketal2022_Fig4},
langid = {en}
}