Validate a data frame against a field_types() specification, and prepare it for aggregation.
Usage
prepare_data(
  df,
  field_types,
  override_column_names = FALSE,
  na = c("", "NA", "NULL"),
  dataset_description = NULL,
  show_progress = TRUE
)
Arguments
- df
A data frame
- field_types
field_types() object specifying names and types of fields (columns) in the supplied df. See also field_types_available.
- override_column_names
If FALSE, column names in the supplied df must match the names specified in field_types exactly. If TRUE, column names in the supplied df will be replaced with the names specified in field_types. The specification must therefore contain the columns in the correct order. Default = FALSE
- na
Vector containing strings that should be interpreted as missing values. Default = c("", "NA", "NULL"). Additional column-specific values can be specified in the field_types() object
- dataset_description
Short description of the dataset being checked. This will appear on the report. If blank, the name of the data frame object will be used
- show_progress
Print progress to console. Default = TRUE
Examples
# load example data into a data.frame
raw_data <- read_data(
  system.file("extdata", "example_prescriptions.csv", package = "daiquiri"),
  delim = ",",
  col_names = TRUE
)
# validate and prepare the data for aggregation
source_data <- prepare_data(
  raw_data,
  field_types = field_types(
    PrescriptionID = ft_uniqueidentifier(),
    PrescriptionDate = ft_timepoint(),
    AdmissionDate = ft_datetime(includes_time = FALSE),
    Drug = ft_freetext(),
    Dose = ft_numeric(),
    DoseUnit = ft_categorical(),
    PatientID = ft_ignore(),
    Location = ft_categorical(aggregate_by_each_category = TRUE)
  ),
  override_column_names = FALSE,
  na = c("", "NULL"),
  dataset_description = "Example data provided with package"
)
#> field_types supplied:
#> PrescriptionID <uniqueidentifier>
#> PrescriptionDate <timepoint> options: includes_time
#> AdmissionDate <datetime>
#> Drug <freetext>
#> Dose <numeric>
#> DoseUnit <categorical>
#> PatientID <ignore>
#> Location <categorical> options: aggregate_by_each_category
#>
#> Checking column names against field_types...
#> Importing source data [Example data provided with package]...
#> Removing column-specific na values...
#> Checking data against field_types...
#> Selecting relevant warnings...
#> Identifying nonconformant values...
#> Checking and removing missing timepoints...
#> Checking for duplicates...
#> Sorting data...
#> Loading into source_data structure...
#> PrescriptionID
#> PrescriptionDate
#> AdmissionDate
#> Drug
#> Dose
#> DoseUnit
#> PatientID
#> Location
#> Finished
source_data
#> Dataset: Example data provided with package
#>
#> Overall:
#> Columns in source: 8
#> Columns imported: 7
#> Rows in source: 8996
#> Duplicate rows removed: 1
#> Rows imported: 8993
#> Column used for timepoint: PrescriptionDate
#> Min timepoint value: 2021-01-01
#> Max timepoint value: 2021-12-31 23:00:00
#> Rows missing timepoint values removed: 2
#> Strings interpreted as missing values: "","NULL"
#> Total validation warnings: 8
#>
#> Datafields:
#>         field_name       field_type  datatype count    missing
#> 1   PrescriptionID uniqueidentifier character  8993     0 (0%)
#> 2 PrescriptionDate        timepoint    double  8993     0 (0%)
#> 3    AdmissionDate         datetime    double  4991 4002 (45%)
#> 4             Drug         freetext character  8993     0 (0%)
#> 5             Dose          numeric    double  8984   9 (0.1%)
#> 6         DoseUnit      categorical character  8964  29 (0.3%)
#> 7        PatientID           ignore        NA    NA         NA
#> 8         Location      categorical character  8993     0 (0%)
#>                     min                 max validation_warnings
#> 1                 10000                9999                   0
#> 2            2021-01-01 2021-12-31 23:00:00                   2
#> 3            1800-01-01          2021-12-31                   1
#> 4 Abacavir + lamiVUDine          vancomycin                   0
#> 5                   0.2               7e+05                   5
#> 6              MegaUnit             unit(s)                   0
#> 7                    NA                  NA                  NA
#> 8                 SITE1               SITE4                   0
#>
#> Validation warnings:
#>
#>          field_name                                              message
#>              <char>                                               <char>
#> 1: PrescriptionDate          Missing or invalid value in Timepoint field
#> 2:    AdmissionDate            expected valid date, but got '2021-06-31'
#> 3:             Dose      expected no trailing characters, but got '1.5g'
#> 4:             Dose expected no trailing characters, but got '4.5 grams'
#> 5:             Dose        expected a double, but got 'See Instructions'
#> 6:             Dose expected no trailing characters, but got '80/400 mg'
#>    instances
#>        <int>
#> 1:         2
#> 2:         1
#> 3:         1
#> 4:         1
#> 5:         2
#> 6:         1
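The example above keeps override_column_names = FALSE, so the column names in the data must already match the specification. The following is a minimal sketch of the other case, where the names in field_types() replace the names in the data by position. The toy data frame and its column names (col_a, col_b, col_c) and the field names (RecordID, RecordDate, Value) are hypothetical and not part of the package. The sketch assumes that prepare_data() accepts a small hand-built data frame of character columns in the same way as one returned by read_data().

```r
library(daiquiri)

# Hypothetical toy data: every column is character, as if read from a CSV.
# The column names deliberately do not match the specification below.
toy <- data.frame(
  col_a = c("1", "2", "3"),
  col_b = c("2021-01-01", "2021-01-02", "2021-01-03"),
  col_c = c("0.5", "NULL", "1.5"),
  stringsAsFactors = FALSE
)

toy_source <- prepare_data(
  toy,
  # The specification lists the fields in the same order as the columns
  # in toy, because the names are applied by position.
  field_types = field_types(
    RecordID = ft_uniqueidentifier(),
    RecordDate = ft_timepoint(),
    Value = ft_numeric()
  ),
  override_column_names = TRUE,
  # "NULL" in col_c is treated as a missing value, not as a nonconformant one
  na = c("", "NULL"),
  dataset_description = "Toy data (hypothetical)",
  show_progress = FALSE
)

toy_source
```

Because the names are matched by position, a specification listed in a different order from the columns would silently assign the wrong type to a column, so it is worth checking the printed summary before aggregating.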