Deduplicate records

dedup(x, how = "one", tolerance = 0.9)

Arguments

x

(data.frame) A data.frame, tibble, or data.table

how

(character) How to handle duplicates. The default, "one", keeps one record from each group of duplicates and drops the others, storing the dropped records in the dups attribute. "all" drops every record that is duplicated, which is useful when it is hard to tell which record to remove and you would rather not keep any of them.

tolerance

(numeric) Score threshold (0 to 1) used to decide whether records match. Default: 0.9. Inspect the output closely and adjust this value for your data, as results can vary.
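
To give a feel for what a 0-to-1 match score and a tolerance cutoff mean, here is a minimal sketch in base R. It is an illustration under stated assumptions, not the scoring dedup() uses: the similarity measure (one minus normalized edit distance via utils::adist) and the ">=" comparison are both assumed for the example, and the two record strings are made up.

```r
# Two hypothetical records, each collapsed to a single string.
rec_a <- "Ursus americanus -76.8 35.5"
rec_b <- "Ursus americanus -76.8 35.6"

# Assumed similarity measure: 1 - (edit distance / longer string length).
# This is a stand-in, not necessarily the score dedup() computes.
edit_dist <- utils::adist(rec_a, rec_b)[1, 1]
similarity <- 1 - edit_dist / max(nchar(rec_a), nchar(rec_b))

# Compare the score against two candidate tolerance values.
match_at_default <- similarity >= 0.9   # looser cutoff
match_at_strict  <- similarity >= 0.99  # stricter cutoff

similarity
match_at_default
match_at_strict
```

Whatever score is actually used, the same trade-off applies: a lower tolerance flags more records as duplicates, and a higher one flags fewer.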

Value

Returns a data.frame, optionally with attributes. With how = "one", the dropped duplicate records are stored in the dups attribute.
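
The shape of the return value under the two how settings can be sketched with exact duplicates in base R. This is only an analogue: dedup() scores matches against tolerance, which duplicated() does not do, and the toy data frame and variable names below are hypothetical.

```r
# Toy data (hypothetical): rows 2 and 4 are exact copies of each other.
df <- data.frame(
  name = c("a", "b", "c", "b"),
  x = c(1, 2, 3, 2),
  stringsAsFactors = FALSE
)

# Analogue of how = "one": keep the first record of each duplicate group
# and park the dropped copies in a "dups" attribute.
is_repeat <- duplicated(df)
kept_one <- df[!is_repeat, ]
attr(kept_one, "dups") <- df[is_repeat, ]

# Analogue of how = "all": drop every record that has a duplicate.
in_dup_group <- duplicated(df) | duplicated(df, fromLast = TRUE)
kept_all <- df[!in_dup_group, ]

NROW(kept_one)
NROW(attr(kept_one, "dups"))
NROW(kept_all)
```

In this sketch the "one" analogue keeps a single copy of the repeated record and sets the other aside, while the "all" analogue removes both copies.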

Examples

df <- sample_data_1
smalldf <- df[1:20, ]
smalldf <- rbind(smalldf, smalldf[10, ])
smalldf[21, "key"] <- 1088954555
NROW(smalldf)
#> [1] 21
dp <- dframe(smalldf) %>% dedup()
NROW(dp)
#> [1] 20
attr(dp, "dups")
#> # A tibble: 1 x 5
#>   name             longitude latitude date                       key
#>   <chr>                <dbl>    <dbl> <dttm>                   <dbl>
#> 1 Ursus americanus     -76.8     35.5 2015-04-05 23:00:00 1088954555
# Another example - more than one set of duplicates
df <- sample_data_1
twodups <- df[1:10, ]
twodups <- rbind(twodups, twodups[c(9, 10), ])
rownames(twodups) <- NULL
NROW(twodups)
#> [1] 12
dp <- dframe(twodups) %>% dedup()
NROW(dp)
#> [1] 10
attr(dp, "dups")
#> # A tibble: 2 x 5
#>   name             longitude latitude date                       key
#>   <chr>                <dbl>    <dbl> <dttm>                   <int>
#> 1 Ursus americanus     -78.3     36.9 2015-03-20 21:11:24 1088923534
#> 2 Ursus americanus     -76.8     35.5 2015-04-05 23:00:00 1088954559