Match names that start with or contain a specified text string
Usage
fuzzy_filter(
name,
by = c("scientificName", "vernacularName"),
provider = getOption("taxadb_default_provider", "itis"),
match = c("contains", "starts_with"),
version = latest_version(),
db = td_connect(),
ignore_case = TRUE,
collect = TRUE
)
Arguments
- name
vector of names (scientific or common, see by) to be matched against.
- by
a column name in the taxa_tbl (following Darwin Core Schema terms). The filtering join is executed with this column as the joining variable.
- provider
from which provider should the hierarchy be returned? Default is 'itis', which can also be configured using options(taxadb_default_provider = "..."). See [td_create] for a list of recognized providers.
- match
should we match by names starting with the term or containing the term anywhere in the name?
- version
Which version of the taxadb provider database should we use? Defaults to the latest version. See tl_import for details.
- db
a connection to the taxadb database. See details.
- ignore_case
should we ignore case (capitalization) in matching names? Can be significantly slower to run.
- collect
logical, default TRUE. Should we return an in-memory data.frame (default, usually the most convenient), or a reference to a lazily evaluated table on disk (useful for very large tables on which we may first perform subsequent filtering operations)?
Details
Note that fuzzy_filter will be fast with a single name or a small number
of names, but will be slower if given a very large vector of
names to match because, unlike other filter_ commands,
fuzzy matching requires a separate SQL call for each name.
In any event, fuzzy matches should all be confirmed manually: for example,
not every common name containing "monkey" belongs to a primate species.
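Because each name triggers its own query, it can help to tidy the search terms before passing them in. The following is a minimal base R sketch, not part of taxadb: prepare_terms is a hypothetical helper, and lowercasing the terms is only appropriate when ignore_case = TRUE (the default).

```r
# Hypothetical helper (not part of taxadb): trim, drop empty or missing
# terms, lowercase, and deduplicate, so that each distinct term is
# queried only once.
prepare_terms <- function(terms) {
  terms <- trimws(terms)
  # An empty term with match = "contains" would match every name, so drop it.
  terms <- terms[!is.na(terms) & nzchar(terms)]
  # Lowercasing assumes ignore_case = TRUE in the later fuzzy_filter() call.
  unique(tolower(terms))
}

raw_terms <- c("woodpecker", "Monkey", " monkey ", "", NA, "woodpecker")
terms <- prepare_terms(raw_terms)
terms

# The cleaned vector could then be passed on, for example:
# fuzzy_filter(terms, "vernacularName")
```

Here six raw entries reduce to two distinct terms, so only two queries would be needed.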
This method uses the database operation %like% to filter tables without
loading them into memory. Note that it does not support regular
expressions at this time.
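To make the two match modes concrete, the sketch below shows how a search term might be turned into a SQL LIKE pattern, and emulates the same literal matching in base R. Both like_pattern and literal_match are hypothetical helpers written for illustration. They are an assumption about the general technique, not taxadb's actual implementation, and the emulation ignores LIKE's own % and _ wildcards.

```r
# Hypothetical helper: build a SQL LIKE pattern for one or more terms.
like_pattern <- function(term, match = c("contains", "starts_with")) {
  match <- match.arg(match)
  switch(match,
         contains    = paste0("%", term, "%"),  # term anywhere in the name
         starts_with = paste0(term, "%"))       # term at the start only
}

# Hypothetical helper: emulate literal (non-regex) matching of a single
# term against a character vector of names.
literal_match <- function(x, term, match = c("contains", "starts_with"),
                          ignore_case = TRUE) {
  match <- match.arg(match)
  if (ignore_case) {
    x <- tolower(x)
    term <- tolower(term)
  }
  switch(match,
         contains    = grepl(term, x, fixed = TRUE),
         starts_with = startsWith(x, term))
}

like_pattern("monkey")
like_pattern("Chera", "starts_with")

sci_names <- c("Cheracebus lugens", "Simia lugens", "Aotus azarai")
literal_match(sci_names, "Chera", "starts_with")
literal_match(sci_names, "lugens")
# Regular-expression metacharacters are treated as ordinary characters:
literal_match(sci_names, "lug.ns")
```

The last call matches nothing, because "." is compared literally rather than as a regular-expression wildcard.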
Examples
# \donttest{
## match any common name containing:
name <- c("woodpecker", "monkey")
fuzzy_filter(name, "vernacularName")
#> # A tibble: 254 × 15
#> taxonID scientificName taxonRank acceptedNameUsageID taxonomicStatus
#> <chr> <chr> <chr> <chr> <chr>
#> 1 ITIS:1025108 Simia lugens species ITIS:1025104 synonym
#> 2 ITIS:1063217 Cercopithecus pol… species ITIS:1063216 synonym
#> 3 ITIS:572943 Aotus azarai species ITIS:944172 synonym
#> 4 ITIS:573056 Pygathrix brelichi species ITIS:944260 synonym
#> 5 ITIS:944191 Oreonax flavicauda species ITIS:572961 synonym
#> 6 ITIS:944303 Mycetes niger species ITIS:572939 synonym
#> 7 ITIS:944308 Cheirogaleus comm… species ITIS:572951 synonym
#> 8 ITIS:944314 Nyctipithecus ruf… species ITIS:572952 synonym
#> 9 ITIS:944316 Nyctipithecus spi… species ITIS:572952 synonym
#> 10 ITIS:944322 Cebus brissonii species ITIS:572954 synonym
#> # ℹ 244 more rows
#> # ℹ 10 more variables: update_date <lgl>, kingdom <chr>, phylum <chr>,
#> # class <chr>, order <chr>, family <chr>, genus <chr>, specificEpithet <chr>,
#> # vernacularName <chr>, infraspecificEpithet <lgl>
## match scientific name
fuzzy_filter("Chera", "scientificName",
match = "starts_with")
#> # A tibble: 6 × 15
#> taxonID scientificName taxonRank acceptedNameUsageID taxonomicStatus
#> <chr> <chr> <chr> <chr> <chr>
#> 1 ITIS:1025110 Cheracebus purinus species ITIS:1025110 accepted
#> 2 ITIS:1025105 Cheracebus medemi species ITIS:1025105 accepted
#> 3 ITIS:1025107 Cheracebus lucifer species ITIS:1025107 accepted
#> 4 ITIS:1025104 Cheracebus lugens species ITIS:1025104 accepted
#> 5 ITIS:1025106 Cheracebus torquat… species ITIS:1025106 accepted
#> 6 ITIS:1025111 Cheracebus regulus species ITIS:1025111 accepted
#> # ℹ 10 more variables: update_date <lgl>, kingdom <chr>, phylum <chr>,
#> # class <chr>, order <chr>, family <chr>, genus <chr>, specificEpithet <chr>,
#> # vernacularName <chr>, infraspecificEpithet <lgl>
# }