remove_duplicates removes duplicated media records.
Arguments
- metadata
Data frame of possible duplicates obtained with the function find_duplicates(). The data frame must contain the column 'duplicate_group' returned by find_duplicates().
- same_repo
Logical argument indicating whether observations labeled as duplicates that belong to the same repository should be removed. Default is FALSE, as observations from the same repository are likely not true duplicates (e.g. different recordings uploaded to Xeno-Canto with the same date, time and location by the same user), but rather have not been documented with enough precision to be told apart. If TRUE, only one of the duplicated observations from the same repository is retained in the output data frame.
- cores
Numeric vector of length 1. Controls whether parallel computing is applied by specifying the number of cores to be used. Default is 1 (i.e. no parallel computing). Can be set globally for the current R session via the "mc.cores" option (e.g. options(mc.cores = 2)). Note that some repositories might not support parallel queries from the same IP address, as they might be identified as a denial-of-service attack.
- pb
Logical argument controlling whether a progress bar is shown. Default is TRUE. Can be set globally for the current R session via the "suwo_pb" option (e.g. options(suwo_pb = TRUE)). The progress bar is not shown if only a few observations are found.
- repo_priority
Character vector indicating the priority of repositories when selecting which observation to retain when duplicates are found. Default is c("Xeno-Canto", "GBIF", "iNaturalist", "Macaulay Library", "Wikiaves", "Observation"), which gives priority to repositories in which media downloading is more straightforward (Xeno-Canto and GBIF).
- verbose
Logical argument that determines whether text is shown in the console. Default is TRUE. Can be set globally for the current R session via the "suwo_verbose" option (e.g. options(suwo_verbose = TRUE)).
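The session-wide options mentioned above can be set together before running the function. The following is a minimal sketch that uses only base R and the option names listed in the argument descriptions; the values chosen are arbitrary examples.

```r
# Set the three session-wide options described above and keep the
# previous values so they can be restored later
old <- options(mc.cores = 2, suwo_pb = FALSE, suwo_verbose = FALSE)

# Collect the values now in effect for this R session
current <- list(
  cores = getOption("mc.cores"),
  pb = getOption("suwo_pb"),
  verbose = getOption("suwo_verbose")
)
print(current)

# Restore the previous settings (options that were unset become unset again)
options(old)
```

Restoring the previous values at the end avoids leaving the session in a modified state, which matters mostly inside scripts or functions.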
Value
A single data frame containing the subset of observations in 'metadata' that were determined not to be duplicates, with one observation retained from each group of duplicates.
Details
When compiling data from multiple repositories, duplicated media
records are a common issue, particularly for sound recordings. These
duplicates occur both through data sharing between repositories like
Xeno-Canto and GBIF, and when users upload the same file to multiple
platforms. In such cases the multiple observations appear to refer to the
same media file, so only one copy is needed. This function
removes duplicate observations identified with the function
find_duplicates(). When duplicates are found, one observation
from each group of duplicates is retained in the output data frame.
However, if multiple observations from the same repository are
labeled as duplicates, by default (same_repo = FALSE) all of them
are retained in the output data frame. This is useful as it can be
expected that observations from the same repository are not true
duplicates (e.g. different recordings uploaded to Xeno-Canto with
the same date, time and location by the same user), but rather have not
been documented with enough precision to be told apart. This behavior can
be modified. If same_repo = TRUE, only one of the duplicated
observations from the same repository will be retained in the output data
frame. The function will give priority to repositories in which media
downloading is more straightforward (Xeno-Canto and GBIF), but this can be
modified with the argument 'repo_priority'.
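The selection rule described above can be illustrated on a toy data frame. This is only a rough base R sketch and not the package's actual implementation: the helper sketch_remove_duplicates(), the column names 'key' and 'repository', and the use of NA in 'duplicate_group' for observations without duplicates are all assumptions made for illustration.

```r
# Toy metadata. 'key' and 'repository' are assumed column names;
# NA in 'duplicate_group' is assumed to mean "no duplicate found"
toy <- data.frame(
  key = c("a", "b", "c", "d", "e", "f"),
  repository = c("GBIF", "Xeno-Canto", "Xeno-Canto",
                 "iNaturalist", "GBIF", "GBIF"),
  duplicate_group = c(1, 1, 1, NA, 2, 2),
  stringsAsFactors = FALSE
)

priority <- c("Xeno-Canto", "GBIF", "iNaturalist",
              "Macaulay Library", "Wikiaves", "Observation")

# Hypothetical helper mimicking the behavior described in Details
sketch_remove_duplicates <- function(metadata, same_repo = FALSE,
                                     repo_priority = priority) {
  # Rows without a group label are treated as non-duplicates and kept
  is_dup <- !is.na(metadata$duplicate_group)
  keep <- !is_dup

  for (g in unique(metadata$duplicate_group[is_dup])) {
    rows <- which(is_dup & metadata$duplicate_group == g)
    # Rank rows by the position of their repository in the priority vector;
    # repositories missing from the vector get the lowest priority
    ranks <- match(metadata$repository[rows], repo_priority)
    ranks[is.na(ranks)] <- length(repo_priority) + 1
    best <- rows[ranks == min(ranks)]
    # same_repo = TRUE keeps a single row; otherwise all rows from the
    # highest-priority repository in the group are kept
    if (same_repo) best <- best[1]
    keep[best] <- TRUE
  }
  metadata[keep, , drop = FALSE]
}

res_default <- sketch_remove_duplicates(toy)
res_strict <- sketch_remove_duplicates(toy, same_repo = TRUE)
```

In this toy example the GBIF copy in group 1 is dropped in favor of the Xeno-Canto records. Both Xeno-Canto rows survive with same_repo = FALSE, and only the first one with same_repo = TRUE.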
Author
Marcelo Araya-Salas (marcelo.araya@ucr.ac.cr)
Examples
# get metadata from 2 repos
gb <- query_gbif(species = "Turdus rufiventris", format = "sound")
#> ✔ Obtaining metadata (743 matching records found) 🎊
#>
#> ! 2 observations do not have a download link and were removed from the results (inlcuded as an attribute called 'excluded_results').
if (interactive()) {
  # replace with your own Xeno-Canto API key
  key <- "YOUR XENO CANTO API KEY"
  xc <- query_xenocanto(species = "Turdus rufiventris", api_key = key)

  # combine metadata
  merged_metadata <- merge_metadata(xc, gb)

  # find duplicates
  label_dup_metadata <- find_duplicates(metadata = merged_metadata)

  # remove duplicates
  dedup_metadata <- remove_duplicates(label_dup_metadata)
}
