
Match names that start with or contain a specified text string

Usage

fuzzy_filter(
  name,
  by = c("scientificName", "vernacularName"),
  provider = getOption("taxadb_default_provider", "itis"),
  match = c("contains", "starts_with"),
  version = latest_version(),
  db = td_connect(),
  ignore_case = TRUE,
  collect = TRUE
)

Arguments

name

vector of names (scientific or common, see by) to be matched against.

by

a column name in the taxa_tbl (following Darwin Core Schema terms). The filtering join is executed with this column as the joining variable.

provider

from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using options(taxadb_default_provider = "..."). See [td_create] for a list of recognized providers.

match

should we match by names starting with the term or containing the term anywhere in the name?

version

Which version of the taxadb provider database should we use? Defaults to the latest version. See tl_import for details.

db

a connection to the taxadb database. See details.

ignore_case

should we ignore case (capitalization) in matching names? Can be significantly slower to run.

collect

logical, default TRUE. Should we return an in-memory data.frame (default, usually the most convenient), or a reference to a lazy-eval table on disk? The latter is useful for very large tables on which we may first perform subsequent filtering operations.

Details

Note that fuzzy_filter() will be fast with a single name or a small number of names, but will be slower if given a very large vector of names to match because, unlike other filter_ commands, fuzzy matching requires a separate SQL call for each name. Fuzzy matches should all be confirmed manually in any event; for example, not every common name containing "monkey" belongs to a primate species.
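
The cost of matching one name at a time can be pictured with a small in-memory sketch. This is not the package's implementation: the taxa data frame is a hypothetical three-row stand-in for a provider table, and match_each() is an illustrative helper written in base R rather than SQL. It makes one pass over the table per search term, and it also shows why results need manual review.

```r
# Hypothetical stand-in for a provider table (illustrative rows, not taxadb output)
taxa <- data.frame(
  scientificName = c("Picoides pubescens", "Alouatta palliata", "Mimulus guttatus"),
  vernacularName = c("Downy Woodpecker", "Mantled Howler Monkey", "Seep Monkeyflower"),
  stringsAsFactors = FALSE
)

# Illustrative helper: one pass over the table per term, so the work grows
# with the number of terms (analogous to one SQL call per name)
match_each <- function(terms, tbl, column) {
  hits <- lapply(terms, function(term) {
    # Case-insensitive literal substring match
    keep <- grepl(tolower(term), tolower(tbl[[column]]), fixed = TRUE)
    tbl[keep, , drop = FALSE]
  })
  # Combine the per-term results and drop rows matched by more than one term
  unique(do.call(rbind, hits))
}

result <- match_each(c("woodpecker", "monkey"), taxa, "vernacularName")
result$vernacularName
```

In this toy table the term "monkey" also picks up "Seep Monkeyflower", a plant, which is the kind of false positive that manual confirmation is meant to catch.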

This method utilizes the database operation %like% to filter tables without loading into memory. Note that this does not support the use of regular expressions at this time.
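
As a rough illustration of how the two match modes could map onto LIKE patterns, the sketch below builds the pattern string for each mode. The helper like_pattern() is hypothetical and not part of taxadb; it assumes the standard SQL convention that % matches any run of characters. How the package treats a % or _ that appears inside a search term is not covered here.

```r
# Hypothetical helper: build a SQL LIKE pattern for a search term
like_pattern <- function(term, match = c("contains", "starts_with")) {
  match <- match.arg(match)
  switch(match,
    contains    = paste0("%", term, "%"),  # term may appear anywhere in the name
    starts_with = paste0(term, "%")        # term must appear at the start
  )
}

like_pattern("monkey")
#> [1] "%monkey%"
like_pattern("Chera", "starts_with")
#> [1] "Chera%"

# Regular-expression syntax is not interpreted: a "." in the term is a literal
# dot, which base R can imitate with fixed = TRUE
grepl("monk.y", "monkey", fixed = TRUE)
#> [1] FALSE
```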

Examples

# \donttest{
  # \dontshow{
   ## All examples use a temporary directory
   Sys.setenv(TAXADB_HOME=file.path(tempdir(), "taxadb"))
   options("taxadb_default_provider"="itis_test")
  # }

## match any common name containing:
name <- c("woodpecker", "monkey")
fuzzy_filter(name, "vernacularName")
#> # A tibble: 254 × 15
#>    taxonID      scientificName     taxonRank acceptedNameUsageID taxonomicStatus
#>    <chr>        <chr>              <chr>     <chr>               <chr>          
#>  1 ITIS:1025108 Simia lugens       species   ITIS:1025104        synonym        
#>  2 ITIS:1063217 Cercopithecus pol… species   ITIS:1063216        synonym        
#>  3 ITIS:572943  Aotus azarai       species   ITIS:944172         synonym        
#>  4 ITIS:573056  Pygathrix brelichi species   ITIS:944260         synonym        
#>  5 ITIS:944191  Oreonax flavicauda species   ITIS:572961         synonym        
#>  6 ITIS:944303  Mycetes niger      species   ITIS:572939         synonym        
#>  7 ITIS:944308  Cheirogaleus comm… species   ITIS:572951         synonym        
#>  8 ITIS:944314  Nyctipithecus ruf… species   ITIS:572952         synonym        
#>  9 ITIS:944316  Nyctipithecus spi… species   ITIS:572952         synonym        
#> 10 ITIS:944322  Cebus brissonii    species   ITIS:572954         synonym        
#> # ℹ 244 more rows
#> # ℹ 10 more variables: update_date <lgl>, kingdom <chr>, phylum <chr>,
#> #   class <chr>, order <chr>, family <chr>, genus <chr>, specificEpithet <chr>,
#> #   vernacularName <chr>, infraspecificEpithet <lgl>

## match scientific name
fuzzy_filter("Chera", "scientificName",
             match = "starts_with")
#> # A tibble: 6 × 15
#>   taxonID      scientificName      taxonRank acceptedNameUsageID taxonomicStatus
#>   <chr>        <chr>               <chr>     <chr>               <chr>          
#> 1 ITIS:1025105 Cheracebus medemi   species   ITIS:1025105        accepted       
#> 2 ITIS:1025107 Cheracebus lucifer  species   ITIS:1025107        accepted       
#> 3 ITIS:1025110 Cheracebus purinus  species   ITIS:1025110        accepted       
#> 4 ITIS:1025104 Cheracebus lugens   species   ITIS:1025104        accepted       
#> 5 ITIS:1025106 Cheracebus torquat… species   ITIS:1025106        accepted       
#> 6 ITIS:1025111 Cheracebus regulus  species   ITIS:1025111        accepted       
#> # ℹ 10 more variables: update_date <lgl>, kingdom <chr>, phylum <chr>,
#> #   class <chr>, order <chr>, family <chr>, genus <chr>, specificEpithet <chr>,
#> #   vernacularName <chr>, infraspecificEpithet <lgl>
# }