This vignette describes the functionality of datefixR
in more detail than the README file. DatefixR
is a lightweight package consisting of two main user-accessible functions, fix_date_char()
and fix_date_df()
, which converts dates written in different formats into R’s built-in Date
class. The former is designed to modify character vectors whilst the latter is intended for modifying columns of a data frame (or tibble). fix_date_app()
is a third function which allows dates to be standardized via a Shiny app interface. You can learn more about the Shiny app in its dedicated vignette.
Practically, this package is most useful when handling date data which has been supplied via text boxes (instead of a date-specific input with a consistent date format). However, this package may also be useful to validate the format of date data (see date and month imputation).
Usage
Date standardization
Firstly, we will demonstrate date standardization without imputation. We consider a data frame with two columns of dates in differing formats with no missing data.
bad.dates <- data.frame(
id = seq(5),
some.dates = c(
"02/05/92",
"01-04-2020",
"1996/05/01",
"2020-05-01",
"02-04-96"
),
some.more.dates = c(
"01 03 2015",
"2nd January 2010",
"01/05/1990",
"03-Dec-2012",
"02 April 2020"
)
)
knitr::kable(bad.dates)
id | some.dates | some.more.dates |
---|---|---|
1 | 02/05/92 | 01 03 2015 |
2 | 01-04-2020 | 2nd January 2010 |
3 | 1996/05/01 | 01/05/1990 |
4 | 2020-05-01 | 03-Dec-2012 |
5 | 02-04-96 | 02 April 2020 |
fix_date_df()
requires two arguments, df
, a data frame (or tibble) object and col.names
, a character vector containing the names of columns to be standardized. By default, the first column of the data frame is assumed to contain row IDs. These IDs are used if a warning or error is raised to assist the user with locating the source of the error. The ID column can also be manually provided via the id
argument.
The output from this function is a data frame or tibble (dependent on the object type of the the first argument, df
) with the selected date columns now standardized and belonging to the Date
class.
fixed.dates <- fix_date_df(
bad.dates,
c("some.dates", "some.more.dates")
)
knitr::kable(fixed.dates)
id | some.dates | some.more.dates |
---|---|---|
1 | 1992-05-02 | 2015-03-01 |
2 | 2020-04-01 | 2010-01-02 |
3 | 1996-05-01 | 1990-05-01 |
4 | 2020-05-01 | 2012-12-03 |
5 | 1996-04-02 | 2020-04-02 |
datefixR
can handle many different formats including -, /, ., or white space separation, year-first or day-first, and month supplied as a number, an abbreviation or full length name.
fix_date_char()
is similar to fix_date_df()
but only applies to a single character object.
fix_date_char("01 02 2014")
#> [1] "2014-02-01"
Localization
datefixR
currently supports dates being provided in English, Français (French), Deutsch (German), español (Spanish), and Русский (Russian) by recognizing both names of months in these languages and formatting customs. Expected languages do not need to be specified and can be provided just like any other date to be standardized.
fix_date_char("7 de septiembre del 2014")
#> [1] "2014-09-07"
Functions in datefixR
assume day-first instead of month-first when day, month, and year are all given numerically (unless year is given first). However, this behavior can be modified by passing format = "mdy"
to function calls.
fix_date_char("01 02 2014", format = "mdy")
#> [1] "2014-01-02"
If the month is given by name, then datefixR
will automatically detect the correct format without the format
argument needing to be specified by the user.
fix_date_char("July 4th, 1776")
#> [1] "1776-07-04"
Date and month imputation
By default, datefixR
imputes missing months as July, and missing days of the month as the first day. As such, “1992” converts to
fix_date_char("1992")
#> [1] "1992-07-01"
The argument for defaulting to July is 1st-2nd July is halfway through the year (on a non leap year). Therefore, assuming the year supplied is indeed correct, the imputation has a maximum potential error of 6 months from the true date. However, this behavior can be changed by supplying the day.impute
and month.impute
arguments with an integer corresponding to the desired day and month. For example, day.impute = 1
and month.impute = 1
results in the first day of January being imputed instead which is often a more common imputation for missing date data.
fix_date_char("1992", day.impute = 1, month.impute = 1)
#> [1] "1992-01-01"
The imputation mechanism can also be modified to impute NA
if a month or day is missing by setting day.impute
or month.impute
to NA
. This will also result in a warning being raised.
fix_date_char("1992", month.impute = NA)
#> Warning: NA imputed (date: 1992)
#> [1] NA
Finally, imputation can be prevented by setting day.impute
or month.impute
to NULL
. This will result in an error being raised if the day or month are missing respectively.
fix_date_char("1992", month.impute = NULL)
# ERROR
day.impute
and month.impute
can also be passed to fix_date_df()
for similar functionality.
example.df <- data.frame(
id = seq(1, 3),
some.dates = c("2014", "April 1990", "Mar 19")
)
fix_date_df(example.df, "some.dates", day.impute = 1, month.impute = 1)
#> id some.dates
#> 1 1 2014-01-01
#> 2 2 1990-04-01
#> 3 3 2019-03-01
Converting numeric dates
By default, if a date is given numerically (I.E no separators such as “/”, “-”, or white space) and is more than four character long, then this date is assumed to have been converted from R
’s numeric date format. If a Date
object is converted to a numeric
object in R, the assigned value becomes the number of days from 1970-01-01
. Note that the date must still be converted to a character
object before being passed to a datefixR
function.
date <- as.numeric(as.Date("2023-01-17"))
print(date)
#> [1] 19374
fix_date_char(as.character(date))
#> [1] "2023-01-17"
However if a date is converted to a numeric date in Excel, the number of days is instead counted from 1900-01-01
. If a user expects that dates to have been converted to numerical dates in Excel, then excel = TRUE
can be passed to a datefixR
function to rectify this.
fix_date_char("44941", excel = TRUE)
#> [1] "2023-01-15"
Roman numeral months
Oracle Database can use Roman numerals to format months and this custom is also used in some biological contexts. If dates in need of parsing are in this format, roman.numeral = TRUE
can be passed to fix_date_char()
or fix_date_df()
. This implementation is currently experimental and is not guaranteed to work alongside other date formats.
fix_date_char("12/IV/2019", roman.numeral = TRUE)
#> [1] "2019-04-12"
Citation
If you use this package in your research, please consider citing datefixR
. An up-to-date citation can be obtained by running
citation("datefixR")
#> To cite package 'datefixR' in publications use:
#>
#> Constantine-Cooke N (2023). _datefixR: Standardize Dates in Different
#> Formats or with Missing Data_. doi:10.5281/zenodo.5655311
#> <https://doi.org/10.5281/zenodo.5655311>, R package version
#> 1.6.0.9000, <https://CRAN.R-project.org/package=datefixR>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {{datefixR}: Standardize Dates in Different Formats or with Missing Data},
#> author = {Nathan Constantine-Cooke},
#> year = {2023},
#> note = {R package version 1.6.0.9000},
#> doi = {10.5281/zenodo.5655311},
#> url = {https://CRAN.R-project.org/package=datefixR},
#> }