The following packages are loaded for use below. The plyr package is also used, but it is not loaded because I am only going to use one specific function from plyr (i.e., mapvalues()).
library(dplyr)   # for mutate(), case_when()
library(forcats) # for fct_recode(), fct_collapse()
Introduction
I have received a few queries recently that can be categorized as “How do I collapse a list of categories or values into a shorter list of categories or values?” For example, one user wanted to collapse species of fish into their respective families. Another user wanted to collapse years into decades. Data wrangling such as this is common in fisheries and is briefly described below.
Sample Data
The following creates a very simple sample of 250 individuals on which the species (as a short abbreviation) and year of capture were recorded. Because I am creating random example data below, I set the random number seed to make the results reproducible.
set.seed(678394) # for reproducibility of random data
n <- 250 # to allow easily changing sample size
dat <- data.frame(species=sample(c("BLG","LMB","PKS","WAE","YEP","CRP"),
                                 n,replace=TRUE),
                  year=sample(1980:2017,n,replace=TRUE))
head(dat)
#R| species year
#R| 1 YEP 1996
#R| 2 PKS 2005
#R| 3 PKS 2013
#R| 4 YEP 2014
#R| 5 CRP 2016
#R| 6 PKS 2006
Recode Categories
This example demonstrates how to change the codes in one variable (e.g., species abbreviations) to new codes in another variable (e.g., long species names).
Before recoding, I find it easier to create a vector that contains the original codes to convert from. For example, unique() extracts the vector of species abbreviations found in the species variable of the example data, which I then sort alphabetically and save in short to make the next steps easier.
short <- unique(dat$species) |>
  sort()
short
#R| [1] "BLG" "CRP" "LMB" "PKS" "WAE" "YEP"
I also create a vector of the new codes that the original codes will be converted to. For example, the long vector below contains the long-form names for each species (in the same order as the abbreviations in short).
long <- c("Bluegill","Carp","Largemouth Bass",
          "Pumpkinseed","Walleye","Yellow Perch")
You should “column-bind” these two vectors together to ensure that the codes align.
cbind(short,long)
#R| short long
#R| [1,] "BLG" "Bluegill"
#R| [2,] "CRP" "Carp"
#R| [3,] "LMB" "Largemouth Bass"
#R| [4,] "PKS" "Pumpkinseed"
#R| [5,] "WAE" "Walleye"
#R| [6,] "YEP" "Yellow Perch"
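The same alignment check can also be made in code rather than by eye. The following is a minimal base R sketch, not part of the original workflow; the object name lookup is hypothetical and the two vectors are simply retyped from above.

```r
short <- c("BLG","CRP","LMB","PKS","WAE","YEP")
long <- c("Bluegill","Carp","Largemouth Bass",
          "Pumpkinseed","Walleye","Yellow Perch")

# the vectors can only pair one-to-one if they are the same length
stopifnot(length(short) == length(long))

# hypothetical helper: long names indexed by their abbreviation
lookup <- setNames(long, short)
lookup[c("YEP","PKS")]   # pull the long names for two abbreviations
```

A length check will not catch two names entered in the wrong order, so the visual check with cbind() is still worthwhile.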
The mapvalues() function (from plyr) may be used to efficiently recode character (or factor) values.1 Because mapvalues() operates on a vector, it should be used within mutate() (from dplyr) to add a new variable with the recoded values to a data frame. Within mutate(), the first argument to mapvalues() is the variable that contains the original data to be recoded. A vector of categories to code from is given in from= and a vector of new categories to code to is given in to=. For example, the combined use of mutate() and mapvalues() below creates a new variable in the data frame with the long species names.
1 The use of plyr:: in front of mapvalues() ensures that mapvalues() from plyr, and not a function of the same name from another package, is used. It also avoids loading the entire plyr package.
dat <- dat |>
  mutate(speciesL=plyr::mapvalues(species,from=short,to=long))
head(dat)
#R| species year speciesL
#R| 1 YEP 1996 Yellow Perch
#R| 2 PKS 2005 Pumpkinseed
#R| 3 PKS 2013 Pumpkinseed
#R| 4 YEP 2014 Yellow Perch
#R| 5 CRP 2016 Carp
#R| 6 PKS 2006 Pumpkinseed
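It may help to see what mapvalues() does with a value that does not appear in from=. The sketch below uses a small made-up vector (the "XXX" code is hypothetical) and assumes plyr is installed; values without a match are returned unchanged rather than being set to NA.

```r
# made-up codes; "XXX" is deliberately absent from `from`
x <- c("BLG","LMB","XXX")
res <- plyr::mapvalues(x,from=c("BLG","LMB"),
                       to=c("Bluegill","Largemouth Bass"))
res   # "XXX" is carried through as is
```

Because of this behavior, a misspelled code in the data will pass through silently, so it is worth checking unique() of the new variable after recoding.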
This use of mapvalues() and mutate() is described in Section 2.2.7 of my book Introductory Fisheries Analyses with R.
The fct_recode() function (from forcats) can also be used to recode categories. Within mutate(), the first argument to fct_recode() is the original factor (or character) variable. Subsequent arguments take the form of a new level name set equal to an old level name.2 For example, the same recoding to long species names is shown below.
2 Any levels not listed in fct_recode() will be retained with their original names.
dat <- dat |>
  mutate(speciesL2=fct_recode(species,
                              "Bluegill" = "BLG",
                              "Carp" = "CRP",
                              "Largemouth Bass" = "LMB",
                              "Pumpkinseed" = "PKS",
                              "Walleye" = "WAE",
                              "Yellow Perch" = "YEP"))
head(dat)
#R| species year speciesL speciesL2
#R| 1 YEP 1996 Yellow Perch Yellow Perch
#R| 2 PKS 2005 Pumpkinseed Pumpkinseed
#R| 3 PKS 2013 Pumpkinseed Pumpkinseed
#R| 4 YEP 2014 Yellow Perch Yellow Perch
#R| 5 CRP 2016 Carp Carp
#R| 6 PKS 2006 Pumpkinseed Pumpkinseed
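The point that unlisted levels are retained can be illustrated with a partial recode. This is a small sketch on made-up values (the sp and sp2 names are hypothetical) and assumes forcats is installed.

```r
library(forcats)

sp <- c("BLG","LMB","YEP","BLG")   # made-up abbreviations
# recode only two of the three levels; "LMB" is not listed
sp2 <- fct_recode(sp,
                  "Bluegill" = "BLG",
                  "Yellow Perch" = "YEP")
levels(sp2)   # "LMB" keeps its original name
```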
Collapse Categories
In some instances, one may want to collapse some categories into a single category (e.g., species into a family). This is easily accomplished with mapvalues() or fct_recode() by simply repeating some of the “to” categories. For example, fam contains family names that correspond to the species names in the data frame. Note how multiple species have the same family name category.
fam <- c("Centrarchidae","Cyprinidae","Centrarchidae",
         "Centrarchidae","Percidae","Percidae")
cbind(short,long,fam)
#R| short long fam
#R| [1,] "BLG" "Bluegill" "Centrarchidae"
#R| [2,] "CRP" "Carp" "Cyprinidae"
#R| [3,] "LMB" "Largemouth Bass" "Centrarchidae"
#R| [4,] "PKS" "Pumpkinseed" "Centrarchidae"
#R| [5,] "WAE" "Walleye" "Percidae"
#R| [6,] "YEP" "Yellow Perch" "Percidae"
The example below shows how to convert the species name abbreviations to family names. In addition, the second use of mapvalues() shows how to convert the long-form names to family names. This second conversion is, of course, redundant, but it is used here to demonstrate that a variable created earlier (here, speciesL) can itself be used as the source for recoding.
dat <- dat |>
  mutate(family=plyr::mapvalues(species,from=short,to=fam),
         family2=plyr::mapvalues(speciesL,from=long,to=fam))
head(dat)
#R| species year speciesL speciesL2 family family2
#R| 1 YEP 1996 Yellow Perch Yellow Perch Percidae Percidae
#R| 2 PKS 2005 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 3 PKS 2013 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 4 YEP 2014 Yellow Perch Yellow Perch Percidae Percidae
#R| 5 CRP 2016 Carp Carp Cyprinidae Cyprinidae
#R| 6 PKS 2006 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
The “collapsing” of multiple levels into one level can also be accomplished with fct_collapse() (from forcats). The first argument to this function is again the variable containing the “old” levels. Subsequent arguments are formed by setting a new level name equal to a vector containing the old level names to collapse.
dat <- dat |>
  mutate(family3=fct_collapse(species,
                              "Centrarchidae" = c("BLG","PKS","LMB"),
                              "Percidae" = c("WAE","YEP"),
                              "Cyprinidae" = c("CRP")))
head(dat)
#R| species year speciesL speciesL2 family family2
#R| 1 YEP 1996 Yellow Perch Yellow Perch Percidae Percidae
#R| 2 PKS 2005 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 3 PKS 2013 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 4 YEP 2014 Yellow Perch Yellow Perch Percidae Percidae
#R| 5 CRP 2016 Carp Carp Cyprinidae Cyprinidae
#R| 6 PKS 2006 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| family3
#R| 1 Percidae
#R| 2 Centrarchidae
#R| 3 Centrarchidae
#R| 4 Percidae
#R| 5 Cyprinidae
#R| 6 Centrarchidae
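After collapsing levels, a cross-tabulation of the old codes against the new ones is a quick way to confirm that every species landed in exactly one family. The sketch below uses a short made-up vector (the sp, fam3, and xt names are hypothetical) and assumes forcats is installed.

```r
library(forcats)

sp <- c("BLG","PKS","LMB","WAE","YEP","CRP","BLG")   # made-up abbreviations
fam3 <- fct_collapse(sp,
                     "Centrarchidae" = c("BLG","PKS","LMB"),
                     "Percidae" = c("WAE","YEP"),
                     "Cyprinidae" = c("CRP"))

# rows are old codes, columns are new levels;
# each row should have counts in only one column
xt <- table(species=sp,family=fam3)
xt
```

A species with counts in more than one column, or a column with an unexpected name, would point to a misspelled or misassigned level.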
Collapse Values into Categories
It is also common to categorize a numeric variable. For example, a “decade” variable is derived from the year variable in this example.
The case_when() function (from dplyr) may be used to efficiently collapse discrete values into categories. This function also operates on vectors and, thus, must be used with mutate() to add a variable to a data frame. The arguments to case_when() are a series of two-sided formulae where the left side is a condition based on the original data and the right side is the value that should appear in the new variable when that condition is TRUE. For example, the first line in case_when() below says “if the year variable is one of the values from 1980 to 1989, then the new category should be ‘1980s’.”3 The code below creates a new variable called decade that identifies the decade that corresponds to the year-of-capture variable.
3 The colon operator creates a sequence of all integers from the first number to the second, inclusive. The %in% operator is used in conditional statements to determine if a value is contained within a vector, returning TRUE if it is and FALSE if it is not.
dat <- dat |>
  mutate(decade=case_when(
    year %in% 1980:1989 ~ "1980s",
    year %in% 1990:1999 ~ "1990s",
    year %in% 2000:2009 ~ "2000s",
    year %in% 2010:2019 ~ "2010s"
  ))
head(dat)
#R| species year speciesL speciesL2 family family2
#R| 1 YEP 1996 Yellow Perch Yellow Perch Percidae Percidae
#R| 2 PKS 2005 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 3 PKS 2013 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 4 YEP 2014 Yellow Perch Yellow Perch Percidae Percidae
#R| 5 CRP 2016 Carp Carp Cyprinidae Cyprinidae
#R| 6 PKS 2006 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| family3 decade
#R| 1 Percidae 1990s
#R| 2 Centrarchidae 2000s
#R| 3 Centrarchidae 2010s
#R| 4 Percidae 2010s
#R| 5 Cyprinidae 2010s
#R| 6 Centrarchidae 2000s
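One behavior worth knowing is what case_when() returns for a value that matches none of the conditions. The sketch below uses made-up years (the yrs, res_na, and res_other names are hypothetical) and assumes dplyr is installed. Unmatched values become NA unless a final catch-all condition is supplied.

```r
library(dplyr)

yrs <- c(1975,1985,2012,2021)   # made-up; 1975 and 2021 match no condition

# no catch-all: unmatched years become NA
res_na <- case_when(
  yrs %in% 1980:1989 ~ "1980s",
  yrs %in% 2010:2019 ~ "2010s"
)

# a final TRUE condition catches everything not matched above
res_other <- case_when(
  yrs %in% 1980:1989 ~ "1980s",
  yrs %in% 2010:2019 ~ "2010s",
  TRUE ~ "other"
)
res_na
res_other
```

Leaving out the catch-all can be useful, because NA values in the new variable flag years that the conditions did not anticipate.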
The lines in case_when() are evaluated sequentially (like a series of “if ... else if” statements), so each value receives the result of the first condition that is TRUE. Thus, the above operation can be coded more succinctly as below. Also note that the resulting variable here is numeric rather than categorical (simply for illustration).
dat <- dat |>
  mutate(decade2=case_when(
    year <= 1989 ~ 1980,
    year <= 1999 ~ 1990,
    year <= 2009 ~ 2000,
    year <= 2019 ~ 2010
  ))
head(dat)
#R| species year speciesL speciesL2 family family2
#R| 1 YEP 1996 Yellow Perch Yellow Perch Percidae Percidae
#R| 2 PKS 2005 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 3 PKS 2013 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| 4 YEP 2014 Yellow Perch Yellow Perch Percidae Percidae
#R| 5 CRP 2016 Carp Carp Cyprinidae Cyprinidae
#R| 6 PKS 2006 Pumpkinseed Pumpkinseed Centrarchidae Centrarchidae
#R| family3 decade decade2
#R| 1 Percidae 1990s 1990
#R| 2 Centrarchidae 2000s 2000
#R| 3 Centrarchidae 2010s 2010
#R| 4 Percidae 2010s 2010
#R| 5 Cyprinidae 2010s 2010
#R| 6 Centrarchidae 2000s 2000
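Because the conditions are checked in order, the order in which they are written matters when they overlap. The sketch below, on made-up years (the yrs, good, and bad names are hypothetical) and assuming dplyr is installed, contrasts the ascending order used above with a reversed order.

```r
library(dplyr)

yrs <- c(1985,1996,2005,2013)   # made-up years

# ascending cutoffs: each year stops at the first cutoff it satisfies
good <- case_when(
  yrs <= 1989 ~ 1980,
  yrs <= 1999 ~ 1990,
  yrs <= 2009 ~ 2000,
  yrs <= 2019 ~ 2010
)

# widest cutoff first: every year satisfies it, so later lines are never reached
bad <- case_when(
  yrs <= 2019 ~ 2010,
  yrs <= 2009 ~ 2000,
  yrs <= 1999 ~ 1990,
  yrs <= 1989 ~ 1980
)
good
bad
```

With non-overlapping conditions, such as the %in% version, the order of the lines does not affect the result.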
str(dat)
#R| 'data.frame': 250 obs. of 9 variables:
#R| $ species : chr "YEP" "PKS" "PKS" "YEP" ...
#R| $ year : int 1996 2005 2013 2014 2016 2006 2002 2012 2013 2014 ...
#R| $ speciesL : chr "Yellow Perch" "Pumpkinseed" "Pumpkinseed" "Yellow Perch" ...
#R| $ speciesL2: Factor w/ 6 levels "Bluegill","Carp",..: 6 4 4 6 2 4 5 1 3 6 ...
#R| $ family : chr "Percidae" "Centrarchidae" "Centrarchidae" "Percidae" ...
#R| $ family2 : chr "Percidae" "Centrarchidae" "Centrarchidae" "Percidae" ...
#R| $ family3 : Factor w/ 3 levels "Centrarchidae",..: 3 1 1 3 2 1 3 1 1 3 ...
#R| $ decade : chr "1990s" "2000s" "2010s" "2010s" ...
#R| $ decade2 : num 1990 2000 2010 2010 2010 2000 2000 2010 2010 2010 ...
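For the specific case of decades, the same numeric result can be reached with integer arithmetic instead of a list of conditions. This is an alternative that is not used in the post, shown only as a sketch in base R on made-up years (the yrs, dec_num, and dec_chr names are hypothetical). It relies on every category being exactly 10 years wide.

```r
yrs <- c(1996,2005,2013,1980)   # made-up years

# integer-divide by 10 to drop the last digit, then multiply back
dec_num <- (yrs %/% 10) * 10
# paste an "s" on the end for a categorical label
dec_chr <- paste0(dec_num,"s")
dec_num
dec_chr
```

Unlike case_when(), this approach does not need new lines when the data extend into another decade, but it cannot handle categories of unequal width.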
You may be motivated by this example to use case_when() to develop a length category variable from measured lengths. While this is possible, it is not efficient, as you would need several conditions within case_when() and would have to make sure that those conditions covered the full range of measured lengths. I urge you to examine lencat() in FSA for the purpose of creating length categories (see examples here).
Citation
@online{h. ogle2018,
author = {H. Ogle, Derek},
title = {Collapsing {Categories} or {Values}},
date = {2018-03-30},
url = {https://fishr-core-team.github.io/fishR//blog/posts/2018-3-30_Collapsing_Categories_or_Values},
langid = {en}
}