This test flags datasets in which a significant fraction of records has been subject to a common error in converting degrees and minutes to decimal degrees, where the degree sign is read as the decimal delimiter.
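To make the error concrete, the following base R sketch converts a few degree-minute coordinates correctly and then in the faulty way this test targets. The helper names `dm_to_dd()` and `dm_misread()` are hypothetical, introduced only for illustration; they are not part of CoordinateCleaner.

```r
# Hypothetical helpers for illustration only (not part of CoordinateCleaner).

# Correct conversion: minutes are sixtieths of a degree.
dm_to_dd <- function(degrees, minutes) degrees + minutes / 60

# Faulty conversion: the degree sign is read as a decimal point,
# so 10°45' becomes 10.45 instead of 10.75.
dm_misread <- function(degrees, minutes) degrees + minutes / 100

degrees <- c(10, 52, 7)
minutes <- c(45, 30, 59)

correct <- dm_to_dd(degrees, minutes)
faulty <- dm_misread(degrees, minutes)

# Minutes never exceed 59, so the decimal part of a misread
# coordinate stays below 0.6.
all(faulty %% 1 < 0.6)
```

Because whole minutes range from 0 to 59, a dataset affected by this error has no coordinate decimals in the upper part of the unit interval, which is the signal the test looks for.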
Usage
cd_ddmm(
  x,
  lon = "decimalLongitude",
  lat = "decimalLatitude",
  ds = "dataset",
  pvalue = 0.025,
  diff = 1,
  mat_size = 1000,
  min_span = 2,
  value = "clean",
  verbose = TRUE,
  diagnostic = FALSE
)
Arguments
- x
data.frame. Containing geographical coordinates and species names.
- lon
character string. The column with the longitude coordinates. Default = “decimalLongitude”.
- lat
character string. The column with the latitude coordinates. Default = “decimalLatitude”.
- ds
character string. The column with the dataset of each record. If x should be treated as a single dataset, this column must be identical for all records. Default = “dataset”.
- pvalue
numeric. The p-value threshold for the one-sided binomial test (see Details) that decides whether a dataset passes. A dataset is flagged only if both the pvalue and the diff criteria are met. Default = 0.025.
- diff
numeric. The threshold difference for the ddmm test. Indicates by which fraction the records with decimals below 0.6 must outnumber the records with decimals above 0.6. Default = 1.
- mat_size
numeric. The size of the matrix for the binomial test. Must be changed in powers of ten (e.g. 100, 1000, 10000). Adapt to dataset size: 100 is generally better for datasets with < 10,000 records, 1000 is better for datasets with 10,000 to 1 M records. Higher values also work reasonably well for smaller datasets, hence the default of 1000. For large datasets, try 10000.
- min_span
numeric. The minimum geographic extent of datasets to be tested. Default = 2.
- value
character string. Defines the output value. See Value.
- verbose
logical. If TRUE reports the name of the test and the number of records flagged.
- diagnostic
logical. If TRUE plots the analysis matrix for each dataset.
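The effect of mat_size can be illustrated with a small base R sketch. How the package fills its analysis matrix is not described on this page, so the sketch assumes the decimal parts of longitude and latitude are binned into a size x size presence/absence grid; the helper `decimal_matrix()` is hypothetical and not part of CoordinateCleaner.

```r
# Hypothetical helper (an assumption, not the package's code): bin the
# decimal parts of the coordinates into a size x size presence matrix.
decimal_matrix <- function(lon, lat, size) {
  lon_dec <- abs(lon) %% 1
  lat_dec <- abs(lat) %% 1
  # Cell indices in 1..size; pmax() guards against decimals of exactly 0
  col <- pmax(ceiling(lon_dec * size), 1)
  row <- pmax(ceiling(lat_dec * size), 1)
  mat <- matrix(0L, nrow = size, ncol = size)
  mat[cbind(row, col)] <- 1L
  mat
}

set.seed(1)
lon <- runif(5000, -180, 180)
lat <- runif(5000, -90, 90)

# Number of occupied cells for increasingly fine grids
occupied <- sapply(c(10, 100, 1000), function(s) {
  sum(decimal_matrix(lon, lat, s))
})
occupied
```

Under this assumption, a coarse grid saturates quickly (many records share a cell), while a very fine grid keeps almost every record in its own cell. This is one plausible reading of why the matrix size should grow with the number of records.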
Value
Depending on the ‘value’ argument, either a data.frame with summary
statistics and flags for each dataset (“dataset”), a data.frame containing
the records considered correct by the test (“clean”), or a logical vector
(“flagged”), with TRUE = test passed and FALSE = test failed/potentially
problematic. Default = “clean”.
Details
If the degree sign is recognized as the decimal delimiter during coordinate
conversion, no coordinate decimals above 0.59 (59') are possible. The test
uses a binomial test to check whether a significant proportion of records in
a dataset has been subject to this problem. The test is best adjusted via
the diff argument: the lower diff, the stricter the test. The appropriate
value also scales with dataset size. Empirically, for datasets with < 5,000
unique coordinate records, diff = 0.1 has proven reasonable, flagging most
datasets with >25% problematic records and all datasets with >50%
problematic records. For datasets with between 5,000 and 100,000
geographically unique records, diff = 0.01 is recommended; for datasets with
between 100,000 and 1 M records, diff = 0.001; and so on.
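The logic described above can be sketched in base R. This is a simplified illustration, not the package's implementation: it skips the matrix binning and the removal of duplicate coordinates, assumes that decimals of unaffected records are uniformly distributed, and assumes that diff is compared against the relative excess over the expected proportion. The function name `ddmm_sketch()` is hypothetical.

```r
# Simplified sketch under the stated assumptions; not the package's code.
ddmm_sketch <- function(lon, lat, pvalue = 0.025, diff = 1) {
  lon_dec <- abs(lon) %% 1
  lat_dec <- abs(lat) %% 1

  # Records whose longitude AND latitude decimals are both below 0.6
  below <- sum(lon_dec < 0.6 & lat_dec < 0.6)
  n <- length(lon)

  # Assumption: for unaffected data the decimals are uniform on [0, 1),
  # so both fall below 0.6 with probability 0.6 * 0.6 = 0.36
  p_expected <- 0.6 * 0.6

  # One-sided test: are there more such records than expected?
  bt <- stats::binom.test(below, n, p = p_expected, alternative = "greater")

  # Assumption: the excess is measured relative to the expected proportion
  excess <- (unname(bt$estimate) - p_expected) / p_expected

  # Flag the dataset only if both criteria are met
  list(
    p.value = bt$p.value,
    excess = excess,
    passed = !(bt$p.value < pvalue && excess > diff)
  )
}

set.seed(42)
clean_res <- ddmm_sketch(runif(100, -180, 180), runif(100, -90, 90))
prob_res <- ddmm_sketch(
  sample(0:180, 100, replace = TRUE) + runif(100, 0, 0.59),
  sample(0:90, 100, replace = TRUE) + runif(100, 0, 0.59)
)
clean_res$passed
prob_res$passed
```

In this sketch, a dataset in which every decimal is below 0.6 has a relative excess of (1 - 0.36) / 0.36, roughly 1.78, which exceeds diff = 1. A lower diff lets datasets with a smaller share of affected records fail as well.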
Note
See https://ropensci.github.io/CoordinateCleaner/ for more details and tutorials.
See also
Other Datasets:
cd_round()
Examples
clean <- data.frame(species = letters[1:10],
                    decimalLongitude = runif(100, -180, 180),
                    decimalLatitude = runif(100, -90, 90),
                    dataset = "FR")
cd_ddmm(x = clean, value = "flagged")
#> Testing for dd.mm to dd.dd conversion errors
#> Flagged 0 records
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [61] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [76] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [91] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
# problematic dataset
lon <- sample(0:180, size = 100, replace = TRUE) + runif(100, 0, 0.59)
lat <- sample(0:90, size = 100, replace = TRUE) + runif(100, 0, 0.59)
prob <- data.frame(species = letters[1:10],
                   decimalLongitude = lon,
                   decimalLatitude = lat,
                   dataset = "FR")
cd_ddmm(x = prob, value = "flagged")
#> Testing for dd.mm to dd.dd conversion errors
#> Flagged 100 records
#> [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [13] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [25] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [37] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [49] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [61] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [73] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [85] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [97] FALSE FALSE FALSE FALSE
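Since the test works per dataset, a further example with two datasets in one data.frame may be useful. The sketch below combines one unaffected and one problematic dataset and requests the per-dataset summary via value = "dataset". The dataset labels "clean_ds" and "ddmm_ds" are made up for illustration, and no output is shown because the exact summary columns are not listed on this page.

```r
set.seed(1)
n <- 100

# Dataset with uniformly distributed coordinates
ok <- data.frame(
  species = "a",
  decimalLongitude = runif(n, -180, 180),
  decimalLatitude = runif(n, -90, 90),
  dataset = "clean_ds"
)

# Dataset in which all coordinate decimals fall below 0.6
bad <- data.frame(
  species = "a",
  decimalLongitude = sample(0:180, n, replace = TRUE) + runif(n, 0, 0.59),
  decimalLatitude = sample(0:90, n, replace = TRUE) + runif(n, 0, 0.59),
  dataset = "ddmm_ds"
)

both <- rbind(ok, bad)

# Run the test per dataset; skipped if the package is not installed
if (requireNamespace("CoordinateCleaner", quietly = TRUE)) {
  res <- CoordinateCleaner::cd_ddmm(
    x = both,
    lon = "decimalLongitude",
    lat = "decimalLatitude",
    ds = "dataset",
    value = "dataset"
  )
  print(res)
}
```

With value = "clean" instead, the same call would return only the records from datasets that pass the test.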
