Remove duplicated media records

remove_duplicates removes duplicated media records.

Usage

remove_duplicates(
  metadata,
  same_repo = FALSE,
  cores = getOption("suwo_cores", 1),
  pb = getOption("suwo_pb", TRUE),
  repo_priority = c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library", "WikiAves"),
  verbose = getOption("suwo_verbose", TRUE)
)

Arguments

metadata: data frame obtained with the function find_duplicates(). The data frame must have the column 'duplicate_group' returned by find_duplicates().
same_repo: Logical argument indicating if observations labeled as duplicates that belong to the same repository should be removed. Default is FALSE. If TRUE, only one of the duplicated observations from the same repository will be retained in the output data frame. This is useful as it can be expected that observations from the same repository are not true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather have not been documented with enough precision to be told apart.
cores: Numeric vector of length 1. Controls whether parallel computing is applied by specifying the number of cores to be used. Default is 1 (i.e. no parallel computing). Can be set globally for the current R session via the "mc.cores" option (e.g. options(mc.cores = 2)). Note that some repositories might not support parallel queries from the same IP address as it might be identified as denial-of-service cyberattack.
pb: Logical argument to control if progress bar is shown. Default is TRUE. Can be set globally for the current R session via the "suwo_pb" option ( options(suwo_pb = TRUE)). Not shown if only a few observations are found.
repo_priority: Character vector indicating the priority of repositories when selecting which observation to retain when duplicates are found. Default is c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library", "Wikiaves"), which gives priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF).
verbose: Logical argument that determines if text is shown in console. Default is TRUE. Can be set globally for the current R session via the "suwo_verbose" option ( options(suwo_verbose = TRUE)).

Value

A single data frame with a subset of the 'metadata' with those observations that were determined not to be duplicates.

Details

When compiling data from multiple repositories, duplicated media records are a common issue, particularly for sound recordings. These duplicates occur both through data sharing between repositories like Xeno-Canto and GBIF, and when users upload the same file to multiple platforms. In such cases those multiple observations seem to refer to the same media file and therefore, only one copy is needed. This function removes duplicate observations identified with the function find_duplicates(). When duplicates are found, one observation from each group of duplicates is retained in the output data frame. However, if multiple observations from the same repository are labeled as duplicates, by default (same_repo = FALSE) all of them are retained in the output data frame. This is useful as it can be expected that observations from the same repository are not true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather have not been documented with enough precision to be told apart. This behavior can be modified. If same_repo = TRUE, only one of the duplicated observations from the same repository will be retained in the output data frame. The function will give priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF), but this can be modified with the argument 'repo_priority'.

Author

Marcelo Araya-Salas (marcelo.araya@ucr.ac.cr)

Examples

if(interactive()){
# get metadata from 2 repos
gb <- query_gbif(species = "Turdus rufiventris", format =  "sound")

key <- "YOUR XENO CANTO API KEY"
xc <- query_xenocanto(species = "Turdus rufiventris", api_key = key)

# combine metadata
merged_metadata <- merge_metadata(xc, gb)

# find duplicates
label_dup_metadata <- find_duplicates(metadata = merged_metadata)

# remove duplicates
dedup_metadata <- remove_duplicates(label_dup_metadata)
}

Usage

Arguments

Value

Details

See also

Author

Examples

About

Community

Resources