Collapsing Categories or Values

How to collapse categories or values into other categories
Data Wrangling
Author

Derek H. Ogle

Published

Mar 30, 2018

Modified

Dec 19, 2022

Note

The following packages are loaded for use below. The plyr package is also used but it is not loaded because I am only going to use one specific function from plyr (i.e., mapvalues()).]

library(dplyr)   # for mutate(), case_when()
library(forcats) # for fct_recode(), fct_collapse()

 

Introduction

I have received a few queries recently that can be categorized as “How do I collapse a list of categories or values into a shorter list of categories or values?” For example, one user wanted to collapse species of fish into their respective families. Another user wanted to collapse years into decades. Data wrangling such as this is common in fisheries and is briefly described below.

 

Sample Data

The following creates a very simple sample of 250 individuals on which the species (as a short abbreviation) and year of capture were recorded. Because I am creating random example data below, I set the random number seed to make the results reproducible.

set.seed(678394)  # for reproducibility of random data
n <- 250          # to allow easily changing sample size
dat <- data.frame(species=sample(c("BLG","LMB","PKS","WAE","YEP","CRP"),
                                 n,replace=TRUE),
                  year=sample(1980:2017,n,replace=TRUE))
head(dat)
#R|    species year
#R|  1     YEP 1996
#R|  2     PKS 2005
#R|  3     PKS 2013
#R|  4     YEP 2014
#R|  5     CRP 2016
#R|  6     PKS 2006

 

Recode Categories

This example demonstrates how to change the codes in one variable (e.g., species abbreviations) to new codes in another variable (e.g., long species names).

Before recoding I find it easier to create a vector that contains the original codes to convert from. For example, unique() extracts the vector of species abbreviations found in the species variable of the example data, which I then saved in short and alphabetized to make the next steps easier.

short <- unique(dat$species) |>
  sort()
short
#R|  [1] "BLG" "CRP" "LMB" "PKS" "WAE" "YEP"

In addition, I also create a vector of codes that these codes will be converted to. For example, the long vector below contains the long-form names for each species (in the same order as the abbreviations in short)

long <- c("Bluegill","Carp","Largemouth Bass",
          "Pumpkinseed","Walleye","Yellow Perch")

You should “column-bind” these two vectors together to ensure that the codes align.

cbind(short,long)
#R|       short long             
#R|  [1,] "BLG" "Bluegill"       
#R|  [2,] "CRP" "Carp"           
#R|  [3,] "LMB" "Largemouth Bass"
#R|  [4,] "PKS" "Pumpkinseed"    
#R|  [5,] "WAE" "Walleye"        
#R|  [6,] "YEP" "Yellow Perch"

The mapvalues() function (from plyr) may be used to efficiently recode character (or factor) values.1 Because mapvalues() operates on a vector, it should be used within mutate() (from dplyr) to add a new variable with the recoded values to a data frame. Within mutate() the first argument to mapvalues() is the variable that contains the original data to be recoded. A vector of categories to code from is given in from= and a vector of new categories to code to is given in to=. For example, the combined use of mutate() and mapvalues() below demonstrates creating a new variable in the data frame with the long species names.

1 The use of plyr:: in front of mapvalues() ensures that mapvalues() from plyr and not another package will be used and allows for not loading the entire plyr package.

dat <- dat |>
  mutate(speciesL=plyr::mapvalues(species,from=short,to=long))
head(dat)
#R|    species year     speciesL
#R|  1     YEP 1996 Yellow Perch
#R|  2     PKS 2005  Pumpkinseed
#R|  3     PKS 2013  Pumpkinseed
#R|  4     YEP 2014 Yellow Perch
#R|  5     CRP 2016         Carp
#R|  6     PKS 2006  Pumpkinseed

 

This use of mapvalues() and mutate() is described in Section 2.2.7 of my book Introductory Fisheries Analyses with R.

 

The fct_recode() function (from forcats) can also be used to recode categories. Within mutate() the first argument to fct_recode() is the original factor variable. Subsequent arguments are of the form new level name equal to old level name.2 For example, the same recoding to long species name is shown below.

2 Any levels not listed in fct_recode() will be retained with their original names.

dat <- dat |>
  mutate(speciesL2=fct_recode(species,
                              "Bluegill" = "BLG",
                              "Carp" = "CRP",
                              "Largemouth Bass" = "LMB",
                              "Pumpkinseed" = "PKS",
                              "Walleye" = "WAE",
                              "Yellow Perch" = "YEP"))
head(dat)
#R|    species year     speciesL    speciesL2
#R|  1     YEP 1996 Yellow Perch Yellow Perch
#R|  2     PKS 2005  Pumpkinseed  Pumpkinseed
#R|  3     PKS 2013  Pumpkinseed  Pumpkinseed
#R|  4     YEP 2014 Yellow Perch Yellow Perch
#R|  5     CRP 2016         Carp         Carp
#R|  6     PKS 2006  Pumpkinseed  Pumpkinseed

 

Collapse Categories

In some instances, one may want to collapse some categories into a single category (e.g., species into a family). This is easily accomplished with mapvalues() or fct_recode() by simply repeating some of the “to” categories. For example, family contains family names that correspond to the species names in the data frame. Note how multiple species have the same family name category.

fam <- c("Centrarchidae","Cyprinidae","Centrarchidae",
         "Centrarchidae","Percidae","Percidae")
cbind(short,long,fam)
#R|       short long              fam            
#R|  [1,] "BLG" "Bluegill"        "Centrarchidae"
#R|  [2,] "CRP" "Carp"            "Cyprinidae"   
#R|  [3,] "LMB" "Largemouth Bass" "Centrarchidae"
#R|  [4,] "PKS" "Pumpkinseed"     "Centrarchidae"
#R|  [5,] "WAE" "Walleye"         "Percidae"     
#R|  [6,] "YEP" "Yellow Perch"    "Percidae"

The example below shows how to convert the species name abbreviations to family names. In addition, the last use of mapvalues() shows how to change the long-form names to family names. This last example is, of course, repetitive, but it is used here to demonstrate how mutate() allows a variable that was “just created” to be immediately used.

dat <- dat |>
  mutate(family=plyr::mapvalues(species,from=short,to=fam),
         family2=plyr::mapvalues(speciesL,from=long,to=fam))
head(dat)
#R|    species year     speciesL    speciesL2        family       family2
#R|  1     YEP 1996 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  2     PKS 2005  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  3     PKS 2013  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  4     YEP 2014 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  5     CRP 2016         Carp         Carp    Cyprinidae    Cyprinidae
#R|  6     PKS 2006  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae

The “collapsing” of multiple levels into one level can also be accomplished with fct_collapse() (from forcats). The first argument to this function is again the variable containing the “old” levels. Subsequent arguments are formed by setting a new level name equal to a vector containing old level names to collapse.

dat <- dat |>
  mutate(family3=fct_collapse(species,
                              "Centarchidae" = c("BLG","PKS","LMB"),
                              "Percidae" = c("WAE","YEP"),
                              "Cyprinidae" = c("CRP")))
head(dat)
#R|    species year     speciesL    speciesL2        family       family2
#R|  1     YEP 1996 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  2     PKS 2005  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  3     PKS 2013  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  4     YEP 2014 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  5     CRP 2016         Carp         Carp    Cyprinidae    Cyprinidae
#R|  6     PKS 2006  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|         family3
#R|  1     Percidae
#R|  2 Centarchidae
#R|  3 Centarchidae
#R|  4     Percidae
#R|  5   Cyprinidae
#R|  6 Centarchidae

 

Collapse Values into Categories

It is also common to categorize a numeric variable. For example, a “decade” variable is derived from the year variable in this example.

The case_when() function (from dplyr) may be used to efficiently collapse discrete values into categories. This function also operates on vectors and, thus, must be used with mutate() to add a variable to a data frame. The arguments to case_when() are a series of two-sided formulae where the left-side is a conditioning statement based on the original data and the right-side is the value that should appear in the new variable when that condition is TRUE. For example, the first line in case_when() below asks “if the year variable is in the values from 1980 to 1989 then the new category should be ‘1980s’.”3 For example, the code below creates a new variable called decade that identifies the decade that corresponds to the year-of-capture variable.

3 The colon operator creates a sequence of all integers between the two numbers separated by the colon. The %in% is used on conditional statements to determine if a value is contained within a vector, returning TRUE if it is and FALSE if it is not.

dat <- dat |>
  mutate(decade=case_when(
    year %in% 1980:1989 ~ "1980s",
    year %in% 1990:1999 ~ "1990s",
    year %in% 2000:2009 ~ "2000s",
    year %in% 2010:2019 ~ "2010s"
  ))
head(dat)
#R|    species year     speciesL    speciesL2        family       family2
#R|  1     YEP 1996 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  2     PKS 2005  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  3     PKS 2013  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  4     YEP 2014 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  5     CRP 2016         Carp         Carp    Cyprinidae    Cyprinidae
#R|  6     PKS 2006  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|         family3 decade
#R|  1     Percidae  1990s
#R|  2 Centarchidae  2000s
#R|  3 Centarchidae  2010s
#R|  4     Percidae  2010s
#R|  5   Cyprinidae  2010s
#R|  6 Centarchidae  2000s

The lines in case_when() operate sequentially (like a series of “if” statements) such that the above operation can be more succinctly coded as below. Also note in this example that the resulting variable is numeric rather than categorical (simply as an example).

dat <- dat |>
  mutate(decade2=case_when(
    year <= 1989 ~ 1980,
    year <= 1999 ~ 1990,
    year <= 2009 ~ 2000,
    year <= 2019 ~ 2010,
  ))
head(dat)
#R|    species year     speciesL    speciesL2        family       family2
#R|  1     YEP 1996 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  2     PKS 2005  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  3     PKS 2013  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|  4     YEP 2014 Yellow Perch Yellow Perch      Percidae      Percidae
#R|  5     CRP 2016         Carp         Carp    Cyprinidae    Cyprinidae
#R|  6     PKS 2006  Pumpkinseed  Pumpkinseed Centrarchidae Centrarchidae
#R|         family3 decade decade2
#R|  1     Percidae  1990s    1990
#R|  2 Centarchidae  2000s    2000
#R|  3 Centarchidae  2010s    2010
#R|  4     Percidae  2010s    2010
#R|  5   Cyprinidae  2010s    2010
#R|  6 Centarchidae  2000s    2000
str(dat)
#R|  'data.frame':  250 obs. of  9 variables:
#R|   $ species  : chr  "YEP" "PKS" "PKS" "YEP" ...
#R|   $ year     : int  1996 2005 2013 2014 2016 2006 2002 2012 2013 2014 ...
#R|   $ speciesL : chr  "Yellow Perch" "Pumpkinseed" "Pumpkinseed" "Yellow Perch" ...
#R|   $ speciesL2: Factor w/ 6 levels "Bluegill","Carp",..: 6 4 4 6 2 4 5 1 3 6 ...
#R|   $ family   : chr  "Percidae" "Centrarchidae" "Centrarchidae" "Percidae" ...
#R|   $ family2  : chr  "Percidae" "Centrarchidae" "Centrarchidae" "Percidae" ...
#R|   $ family3  : Factor w/ 3 levels "Centarchidae",..: 3 1 1 3 2 1 3 1 1 3 ...
#R|   $ decade   : chr  "1990s" "2000s" "2010s" "2010s" ...
#R|   $ decade2  : num  1990 2000 2010 2010 2010 2000 2000 2010 2010 2010 ...

 

Warning

You may be motivated from this example to use case_when() to develop a length category variable from measure lengths. While this is possible it is not efficient as you would have several conditions within case_when() (to span all measured lengths) and you would need to make sure that your conditions covered the range of measured lengths. I urge you to examine lencat() in FSA for the purpose of creating length categories (see examples here).

 

Reuse

Citation

BibTeX citation:
@online{h. ogle2018,
  author = {H. Ogle, Derek},
  title = {Collapsing {Categories} or {Values}},
  date = {2018-03-30},
  url = {https://fishr-core-team.github.io/fishR//blog/posts/2018-3-30_Collapsing_Categories_or_Values},
  langid = {en}
}
For attribution, please cite this work as:
H. Ogle, D. 2018, March 30. Collapsing Categories or Values. https://fishr-core-team.github.io/fishR//blog/posts/2018-3-30_Collapsing_Categories_or_Values.