datefixR standardizes dates in different formats or with missing data: for example, dates entered via free-text web forms.
Dates are commonly represented in many different formats: the order of day, month, and year can differ; different separators (“-”, “/”, or whitespace) can be used; months can be given as numbers, names, or abbreviations; and years can be given as two digits or four.
datefixR takes dates in all these different formats and converts them to R’s built-in date class. If
datefixR cannot standardize a date, for example because it is too malformed, then the user is told which date could not be standardized and the ID of the corresponding row.
datefixR also allows the imputation of missing days and months with user-controlled behavior.
Not familiar with R or want to quickly try out
datefixR? Check out the shiny app here.
datefixR is now available on CRAN.
The most up-to-date (hopefully) stable version of
datefixR can be installed via r-universe:
# Enable universe(s) by ropensci
options(repos = c(
  ropensci = 'https://ropensci.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'
))
install.packages('datefixR')
If you wish to live on the cutting edge of
datefixR development, then the development version can be installed via
datefixR has a “Getting Started” vignette which describes how to use this package in more detail than this page. View the vignette by either calling
or visiting the vignette on the package website
datefixR is most commonly used to standardize columns of date data in a data frame or tibble. For this demonstration, we will use an example toy dataset provided alongside the package,
| 1 | 02 05 92 | 2015 |
We can standardize these date columns by using the
fix_date_df() function and passing the data frame/tibble object and a character vector of column names for the corresponding columns to fix.
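The call described above might look like the following minimal sketch. The data frame `toy` and its column names are hypothetical stand-ins made up for illustration; they are not the dataset shipped with the package.

```r
library(datefixR)

# Hypothetical toy data: the data frame and its column names are
# illustrative and are not the dataset shipped with the package.
toy <- data.frame(
  id = 1:3,
  start = c("02 05 92", "01-04-2020", "1996"),
  end = c("2015", "02/05/00", "05/1990")
)

# Pass the data frame and a character vector naming the columns to fix
fixed <- fix_date_df(toy, c("start", "end"))

# Inspect the classes of the standardized columns
sapply(fixed, class)
```

Columns that are not named in the character vector (here `id`) are not passed for fixing.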
datefixR imputes missing days of the month as 01, and missing months as 07 (July). However, this behavior can be modified via the
example.df <- data.frame(example = "1994")
fix_date_df(example.df, "example", month.impute = 1)
#>      example
#> 1 1994-01-01
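As a further sketch, the day can be imputed in the same way as the month. This assumes a `day.impute` argument that mirrors the `month.impute` argument used above; the data frame `partial` is hypothetical.

```r
library(datefixR)

# Hypothetical data: one year-only entry and one month/year entry
partial <- data.frame(id = 1:2, when = c("1994", "05/1990"))

# Defaults described above: missing day -> 01, missing month -> 07
fix_date_df(partial, "when")

# Assumption: day.impute is accepted alongside month.impute
custom <- fix_date_df(partial, "when", day.impute = 15, month.impute = 12)
custom
```

Only the missing components are imputed; a month that is present in the input is left as supplied.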
datefixR assumes day-first rather than month-first when day, month, and year are all given (unless the year is given first). However, this behavior can be modified by passing
format = "mdy" to function calls.
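To illustrate the difference, the sketch below parses the same ambiguous string under both conventions. The data frame `ambiguous` is hypothetical.

```r
library(datefixR)

# Hypothetical single-row data frame with an ambiguous date
ambiguous <- data.frame(id = 1, when = "01/02/2020")

# Default: day-first, so the string is read as 1 February 2020
dmy <- fix_date_df(ambiguous, "when")

# Month-first: the same string is read as 2 January 2020
mdy <- fix_date_df(ambiguous, "when", format = "mdy")

dmy$when
mdy$when
```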
Date and time data are often reported together in the same variable (known as “datetime”). However, datetime formats are not supported by
datefixR. The current rationale is that this package is mostly used to handle dates entered via free-text web forms, and it is much less common for both date and time to be reported together with this input method. However, if there is significant demand for datetime support in the future, this may be added.
The package is written solely in R and seems fast enough for my current use cases (a few hundred rows). However, I may convert the core for loop to C++ in the future if I (or others) need it to be faster.
- When a date fails to parse in lubridate, the user is simply told how many dates failed to parse. In
datefixR, the user is told the ID (assumed to be the first column by default, but this can be user-specified) corresponding to the date which failed to parse, along with the date in question, making it much easier to work out which of the supplied dates failed to parse and why.
- When imputing a missing day or month, there is no user-control over this behavior. For example, when imputing a missing month, the user may wish to impute July, the middle of the year, instead of January. However, January will always be imputed in lubridate. In
datefixR, this behavior can be controlled by the
- These functions require all possible date formats to be specified in the
orders argument, which may result in a date format not being considered if the user forgets to list one of the possible formats. By contrast,
datefixR only needs a format to be specified if month-first is to be preferred over day-first when guessing a date.
However, lubridate of course excels in general date manipulation and is an excellent tool to use alongside datefixR.
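The ID reporting described in the first point above can be explored with a sketch like the following. It assumes that `fix_date_df()` accepts an `id` argument naming the ID column; the data frame `visits` and its contents are hypothetical. The exact condition raised and its wording are not assumed, so the message is simply caught and printed.

```r
library(datefixR)

# Hypothetical data where the row ID is not in the first column
visits <- data.frame(
  visit.date = c("02/05/2001", "not a date"),
  patient = c("A01", "A02")
)

# Assumption: `id` is the argument that names the ID column.
# Any error or warning raised for the unparseable entry is caught
# so that its message can be inspected without halting the script.
outcome <- tryCatch(
  fix_date_df(visits, "visit.date", id = "patient"),
  error = function(e) conditionMessage(e),
  warning = function(w) conditionMessage(w)
)
print(outcome)
```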
An alternative function is
anytime::anydate(), which also attempts to convert dates to a consistent format (POSIXct). However, anytime assumes year, month, and day have all been provided and does not permit imputation. Moreover, if a date cannot be parsed, then the date is converted to an NA object and no warning is raised, which may lead to issues later in the analysis.
Both lubridate and anytime use compiled code and therefore have the potential to be orders of magnitude faster than
datefixR. However, in my own testing, I found anytime to actually be slower than
datefixR: consistently being over 3 times slower (testing up to 10,000 dates).
lubridate::parse_date_time() (which is written in R) is an order of magnitude faster than datefixR, and
lubridate::parse_date_time2(), which is written in C but only allows numeric dates, is even faster. If you don’t mind not having control over imputation, do not expect to deal with many dates which fail to parse, are confident you will specify all potential formats the supplied dates will be in, and have a very large number of dates to standardize (hundreds of thousands or more), lubridate’s functions may be a better option than datefixR.
linelist::guess_dates() appears to also have performed a somewhat similar role to the above functions. However, this function did not leave the experimental lifecycle phase and the package itself is no longer available on CRAN.
If you are interested in contributing to
datefixR, please read our contributing guide.
Please note that this package is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.
If you use this package in your research, please consider citing
datefixR! An up-to-date citation can be obtained by running