Adding Zero Catches

Add zeroes for species not caught so that CPE calculations are correct
Explanation
Data Wrangling
CPE
Author

Derek H. Ogle

Published

Apr 19, 2018

Modified

Dec 20, 2022

Note

The following packages are loaded for use below. I also set the random number seed for reproducibility of the randomly generated sample data below.

library(dplyr)    # for group_by(), summarize(), mutate(), right_join()
library(tidyr)    # for complete(), nesting()
set.seed(678394)  # for reproducibility of randomly created data

 

Introduction

Much of my work is with undergraduates who are first learning to analyze fisheries data. A common “learning opportunity” occurs when students are asked to compute the mean catch (or CPE), along with a standard deviation (SD), across multiple gear sets for each species. The learning opportunity occurs because some species will invariably not be caught in some gear sets. When the students summarize the number of fish caught for each species in each gear set those species not caught in a particular gear set will not “appear” in their data. Thus, when calculating the mean, the student will get the correct numerator (sum of catch across all gear sets) but not denominator (they use number of catches summed rather than total number of gear sets), which inflates (over-estimates) the mean catch and (usually) deflates (under-estimates) the SD of catches. Once confronted with this issue, they easily realize how to correct the mean calculation, but calculating the standard deviation is still an issue. These problems are exacerbated when using software to compute these summary statistics across many individual gear sets.

In software, the “trick” is to add a zero for each species not caught in a specific gear set that was caught in at least one gear set. For example, if Bluegill were caught in at least one gear set but not in the third gear set, then a zero must be added as the catch of Bluegill in the third gear set. The addZeroCatch() function in the FSA package was an attempt to efficiently add these zeroes. This function has proven useful over the years, but I have become dissatisfied with its clunkiness. Additionally, I recently became aware of the complete() function in the tidyr package which holds promise for handling the same task. In this post, I explore the use of complete() for handling this issue.

 

Simple Data

In this first example, the data consists of species and length recorded for each captured fish organized by the gear set identification number (ID) and held in the fishdat data.frame.

#R|    ID species  tl
#R|  1  1     BLG 148
#R|  2  1     BLG 153
#R|  3  1     BLG 147
#R|  4  1     BLG 149
#R|  5  1     BLG 144
#R|  6  1     BLG 145

The catch of each species in each gear set may be found using group_by() and summarize() with n().1

  • 1  I find the tibble structure returned by group_by() to be annoying with simple data frames like this. Thus, I usually use as.data.frame() to remove it.

  • catch <- fishdat |>
      group_by(ID,species) |>
      summarize(num=n()) |>
      as.data.frame()
    catch
    #R|     ID species num
    #R|  1   1     BLG  10
    #R|  2   1     LMB   5
    #R|  3   1     YEP   5
    #R|  4   2     LMB   9
    #R|  5   2     YEP   7
    #R|  6   3     BLG  12
    #R|  7   3     YEP   7
    #R|  8   4     BLG   1
    #R|  9   4     LMB  11
    #R|  10  4     YEP  11
    #R|  11  5     BLG   9

    From this it is seen that three species (“BLG”, “LMB”, and “YEP”) were captured across all nets, but that “BLG” were not captured in “ID=2”, “LMB” were not captured in “ID=3”, and “LMB” and “YEP” were not captured in “ID=5”. The sample size, mean, and SD of catches per species from these data may be found by again using group_by() and summarize(). However, these calculations are INCORRECT because they do not include the zero catches of “BLG” in “ID=2”, “LMB” in “ID=3”, and “LMB” and “YEP” in “ID=5”. The problem is most evident in the sample sizes, which should be five (gear sets) for each species.

    ## Example of INCORRECT summaries because not using zeroes
    catch |> 
      group_by(species) |>
      summarize(n=n(),mn=mean(num),sd=sd(num)) |>
      as.data.frame()
    #R|    species n       mn       sd
    #R|  1     BLG 4 8.000000 4.830459
    #R|  2     LMB 3 8.333333 3.055050
    #R|  3     YEP 4 7.500000 2.516611

    The complete() function can be used to add rows to a data frame for variables (or combinations of variables) that should be present in the data frame (relative to other values that are present) but are not. The complete() function takes a data frame as its first argument (but will be “piped” in below with |>) and the variable or variables that will be used to identify which items are missing. For example, with these data, a zero should be added to num for missing combinations defined by ID and species.

    ## Example of default complete ... see below to add zeroes, not NAs
    catch |>
      complete(ID,species) |>
      as.data.frame()
    #R|     ID species num
    #R|  1   1     BLG  10
    #R|  2   1     LMB   5
    #R|  3   1     YEP   5
    #R|  4   2     BLG  NA
    #R|  5   2     LMB   9
    #R|  6   2     YEP   7
    #R|  7   3     BLG  12
    #R|  8   3     LMB  NA
    #R|  9   3     YEP   7
    #R|  10  4     BLG   1
    #R|  11  4     LMB  11
    #R|  12  4     YEP  11
    #R|  13  5     BLG   9
    #R|  14  5     LMB  NA
    #R|  15  5     YEP  NA

    From this result, it is seen that complete() added a row for “BLG” in “ID=2”, “LMB” in “ID=3”, and “LMB” and “YEP” in “ID=5” as we had hoped. However, complete() adds NAs by default. The value to add can be changed with fill=, which takes a list that includes the name of the variable to which the NAs were added (num in this case) set equal to the value to be added (0 in this case).2

  • 2 Here the result is saved into the catch data frame, thus modifying the original data frame with the addition of the zeroes.

  • catch <- catch |>
      complete(ID,species,fill=list(num=0)) |>
      as.data.frame()
    catch
    #R|     ID species num
    #R|  1   1     BLG  10
    #R|  2   1     LMB   5
    #R|  3   1     YEP   5
    #R|  4   2     BLG   0
    #R|  5   2     LMB   9
    #R|  6   2     YEP   7
    #R|  7   3     BLG  12
    #R|  8   3     LMB   0
    #R|  9   3     YEP   7
    #R|  10  4     BLG   1
    #R|  11  4     LMB  11
    #R|  12  4     YEP  11
    #R|  13  5     BLG   9
    #R|  14  5     LMB   0
    #R|  15  5     YEP   0

    These correct catch data can then be summarized as above to show the correct sample size, mean, and SD of catches per species.

    catch |>
      group_by(species) |>
      summarize(n=n(),mn=mean(num),sd=sd(num)) |>
      as.data.frame()
    #R|    species n  mn       sd
    #R|  1     BLG 5 6.4 5.504544
    #R|  2     LMB 5 5.0 5.049752
    #R|  3     YEP 5 6.0 4.000000

     

    Multiple Values to Receive Zeroes

    Suppose that the fish data included a column that indicates whether the fish was marked and returned to the waterbody or not.

    #R|    ID species  tl marked
    #R|  1  1     BLG 148    YES
    #R|  2  1     BLG 153    YES
    #R|  3  1     BLG 147    YES
    #R|  4  1     BLG 149    YES
    #R|  5  1     BLG 144    YES
    #R|  6  1     BLG 145    YES

    The catch and number of fish marked and returned per gear set ID and species may again be computed with group_by() and summarize(). Note, however, the use of ifelse() to use a 1 if the fish was marked and a 0 if it was not. Summing these values returns the number of fish that were marked. Giving this data frame to complete() as before will add zeroes for both the num and nmarked variables as long as both are included in the list given to fill=.

    catch2 <- fishdat2 |>
      group_by(ID,species) |>
      summarize(num=n(),
                nmarked=sum(ifelse(marked=="YES",1,0)))
    catch2
    #R|  # A tibble: 11 × 4
    #R|  # Groups:   ID [5]
    #R|        ID species   num nmarked
    #R|     <dbl> <chr>   <int>   <dbl>
    #R|   1     1 BLG        10       8
    #R|   2     1 LMB         5       2
    #R|   3     1 YEP         5       2
    #R|   4     2 LMB         9       5
    #R|   5     2 YEP         7       2
    #R|   6     3 BLG        12       3
    #R|   7     3 YEP         7       4
    #R|   8     4 BLG         1       0
    #R|   9     4 LMB        11       4
    #R|  10     4 YEP        11       7
    #R|  11     5 BLG         9       6

    There are two things to note in this output. First, that there are no zeroes for num and nmarked for the same species and gear sets as before. Second, the summarization was across two groups but summarize() only removes one of the group_by() variables. Thus, this result is still grouped by ID as shown above, which will interfere with using complete() to add the zeroes. This grouping can be removed with ungroup() as shown below before using complete().

    catch2 <- catch2 |>
      ungroup() |>
      complete(ID,species,fill=list(num=0,nmarked=0)) |>
      as.data.frame()
    catch2
    #R|     ID species num nmarked
    #R|  1   1     BLG  10       8
    #R|  2   1     LMB   5       2
    #R|  3   1     YEP   5       2
    #R|  4   2     BLG   0       0
    #R|  5   2     LMB   9       5
    #R|  6   2     YEP   7       2
    #R|  7   3     BLG  12       3
    #R|  8   3     LMB   0       0
    #R|  9   3     YEP   7       4
    #R|  10  4     BLG   1       0
    #R|  11  4     LMB  11       4
    #R|  12  4     YEP  11       7
    #R|  13  5     BLG   9       6
    #R|  14  5     LMB   0       0
    #R|  15  5     YEP   0       0

     

    More Information that Does Not Get Zeroes

    Suppose there is a data frame called geardat that contains information specific to each gear set.

    geardat
    #R|    ID mon year  lake run effort
    #R|  1  1 May 2018 round   1   1.34
    #R|  2  2 May 2018 round   2   1.87
    #R|  3  3 May 2018 round   3   1.56
    #R|  4  4 May 2018  twin   1   0.92
    #R|  5  5 May 2018  twin   2   0.67

    And, for the purposes of this example, suppose that we have summarized catch data WITHOUT the zeroes having been added.

    catch3 <- fishdat2 |>
      group_by(ID,species) |>
      summarize(num=n(),
                nmarked=sum(ifelse(marked=="YES",1,0))) |>
      as.data.frame()
    catch3
    #R|     ID species num nmarked
    #R|  1   1     BLG  10       8
    #R|  2   1     LMB   5       2
    #R|  3   1     YEP   5       2
    #R|  4   2     LMB   9       5
    #R|  5   2     YEP   7       2
    #R|  6   3     BLG  12       3
    #R|  7   3     YEP   7       4
    #R|  8   4     BLG   1       0
    #R|  9   4     LMB  11       4
    #R|  10  4     YEP  11       7
    #R|  11  5     BLG   9       6

    Finally, suppose that these summarized catch data are joined with the gear data such that the gear set specific information is shown with each catch.

    catch3 <- right_join(geardat,catch3,by="ID")
    catch3
    #R|     ID mon year  lake run effort species num nmarked
    #R|  1   1 May 2018 round   1   1.34     BLG  10       8
    #R|  2   1 May 2018 round   1   1.34     LMB   5       2
    #R|  3   1 May 2018 round   1   1.34     YEP   5       2
    #R|  4   2 May 2018 round   2   1.87     LMB   9       5
    #R|  5   2 May 2018 round   2   1.87     YEP   7       2
    #R|  6   3 May 2018 round   3   1.56     BLG  12       3
    #R|  7   3 May 2018 round   3   1.56     YEP   7       4
    #R|  8   4 May 2018  twin   1   0.92     BLG   1       0
    #R|  9   4 May 2018  twin   1   0.92     LMB  11       4
    #R|  10  4 May 2018  twin   1   0.92     YEP  11       7
    #R|  11  5 May 2018  twin   2   0.67     BLG   9       6

    These data simulate what might be seen from a flat database.

    With these data, zeroes still need to be added as defined by missing combinations of ID and species. However, if only these two variables are included in complete() then zeroes will be added for mon, year, lake, run, and effort, which is not desired. These five variables are connected to or “nested” with the ID variable (i.e., if you know ID then you know the values of these other variables) and should be treated as a group. Nesting of variables can be handled in complete() by including the names of all the connected variables in nesting().

    catch3 |> complete(nesting(ID,mon,year,lake,run,effort),species,
                        fill=list(num=0,nmarked=0)) |>
      as.data.frame()
    #R|     ID mon year  lake run effort species num nmarked
    #R|  1   1 May 2018 round   1   1.34     BLG  10       8
    #R|  2   1 May 2018 round   1   1.34     LMB   5       2
    #R|  3   1 May 2018 round   1   1.34     YEP   5       2
    #R|  4   2 May 2018 round   2   1.87     BLG   0       0
    #R|  5   2 May 2018 round   2   1.87     LMB   9       5
    #R|  6   2 May 2018 round   2   1.87     YEP   7       2
    #R|  7   3 May 2018 round   3   1.56     BLG  12       3
    #R|  8   3 May 2018 round   3   1.56     LMB   0       0
    #R|  9   3 May 2018 round   3   1.56     YEP   7       4
    #R|  10  4 May 2018  twin   1   0.92     BLG   1       0
    #R|  11  4 May 2018  twin   1   0.92     LMB  11       4
    #R|  12  4 May 2018  twin   1   0.92     YEP  11       7
    #R|  13  5 May 2018  twin   2   0.67     BLG   9       6
    #R|  14  5 May 2018  twin   2   0.67     LMB   0       0
    #R|  15  5 May 2018  twin   2   0.67     YEP   0       0

    It is possible to have nesting with species as well. Suppose, for example, that the scientific name for the species was included in the original fishdata2 that was summarized (using a combination of the examples from above, but not shown here) to catch4.

    catch4
    #R|     ID species                spsci num nmarked mon year  lake run effort
    #R|  1   1     BLG  Lepomis macrochirus  10       8 May 2018 round   1   1.34
    #R|  2   1     LMB Micropterus dolomieu   5       2 May 2018 round   1   1.34
    #R|  3   1     YEP     Perca flavescens   5       2 May 2018 round   1   1.34
    #R|  4   2     LMB Micropterus dolomieu   9       5 May 2018 round   2   1.87
    #R|  5   2     YEP     Perca flavescens   7       2 May 2018 round   2   1.87
    #R|  6   3     BLG  Lepomis macrochirus  12       3 May 2018 round   3   1.56
    #R|  7   3     YEP     Perca flavescens   7       4 May 2018 round   3   1.56
    #R|  8   4     BLG  Lepomis macrochirus   1       0 May 2018  twin   1   0.92
    #R|  9   4     LMB Micropterus dolomieu  11       4 May 2018  twin   1   0.92
    #R|  10  4     YEP     Perca flavescens  11       7 May 2018  twin   1   0.92
    #R|  11  5     BLG  Lepomis macrochirus   9       6 May 2018  twin   2   0.67

    The zeroes are then added to this data.frame making sure to note the nesting of species and spsci.

    catch4 |> 
      complete(nesting(ID,mon,year,lake,run,effort),
               nesting(species,spsci),
               fill=list(num=0,nmarked=0)) |>
      as.data.frame()
    #R|     ID mon year  lake run effort species                spsci num nmarked
    #R|  1   1 May 2018 round   1   1.34     BLG  Lepomis macrochirus  10       8
    #R|  2   1 May 2018 round   1   1.34     LMB Micropterus dolomieu   5       2
    #R|  3   1 May 2018 round   1   1.34     YEP     Perca flavescens   5       2
    #R|  4   2 May 2018 round   2   1.87     BLG  Lepomis macrochirus   0       0
    #R|  5   2 May 2018 round   2   1.87     LMB Micropterus dolomieu   9       5
    #R|  6   2 May 2018 round   2   1.87     YEP     Perca flavescens   7       2
    #R|  7   3 May 2018 round   3   1.56     BLG  Lepomis macrochirus  12       3
    #R|  8   3 May 2018 round   3   1.56     LMB Micropterus dolomieu   0       0
    #R|  9   3 May 2018 round   3   1.56     YEP     Perca flavescens   7       4
    #R|  10  4 May 2018  twin   1   0.92     BLG  Lepomis macrochirus   1       0
    #R|  11  4 May 2018  twin   1   0.92     LMB Micropterus dolomieu  11       4
    #R|  12  4 May 2018  twin   1   0.92     YEP     Perca flavescens  11       7
    #R|  13  5 May 2018  twin   2   0.67     BLG  Lepomis macrochirus   9       6
    #R|  14  5 May 2018  twin   2   0.67     LMB Micropterus dolomieu   0       0
    #R|  15  5 May 2018  twin   2   0.67     YEP     Perca flavescens   0       0

     

    Final Thoughts

    This is my first exploration with complete() and it looks promising for this task of adding zeroes to data frames of catch by gear set for gear sets in which a species was not caught. I will be curious to hear what others think of this function and how it might fit in their workflow.