Creating realistic data
Roel M. Hogervorst
Introduction
Charlatan creates realistic-looking data. By now you have seen some examples of the high-level and low-level APIs. But maybe you would like to see examples of generating data for specific use cases? Then this vignette is for you. It shows how to create fake transactional data and fake health care (PII) data, and ends with an example of a plumber API that returns fake data.
But first a slight detour into different ways of faking data.
Other options for faking data
Simulating data (creating fake data that looks realistic) is not always the best option. There are other options, which we will discuss first.
- use real data
- use pseudonymous or anonymous data
- create a dataset from scratch
Using real data is great for live systems. You want actual data for actual predictions or actual inference. It is, however, smart to reduce your data to only the columns you need.
Using real data is a bad idea for demonstration use cases. Personally identifiable information (PII) is sensitive and needs to be handled with care. Even if you strip away the full names, it is relatively easy to pinpoint specific people on the basis of several data points.
Pseudonymous or anonymous data is a safer way to deal with personally identifiable data. By stripping away PII or replacing values with other random values, you make it harder to link the data back to actual people. The main advantage of this kind of data is that you keep the data structure intact. The relationships between variables remain, and you can do predictions or inference on them.
It is quite a lot of work to make your data truly anonymous. You need surprisingly little data to de-anonymize a dataset and link it back to actual people.
Creating a fake dataset from scratch is relatively easy when the data is small. Using built-in functions like sample, runif, and rnorm is very doable. It is slightly faster than using this package.
Completely random values do make it difficult to talk to stakeholders, because they would not understand them. It is also a lot of work to generate more complex data or many different columns.
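As a rough illustration of the from-scratch approach, here is a minimal sketch that uses only base R. The column names and distribution parameters are made up for this example. The values are valid, but they carry little meaning for a stakeholder.

```r
# Base R only: a small fake dataset built from scratch.
set.seed(42) # make the example reproducible
n <- 10

scratch <- data.frame(
  id = seq_len(n),
  # a categorical column drawn from a fixed set of labels
  group = sample(c("A", "B", "C"), size = n, replace = TRUE),
  # a roughly bell-shaped numeric column
  weight = round(rnorm(n, mean = 70, sd = 10), 1),
  # a uniformly distributed score between 0 and 100
  score = round(runif(n, min = 0, max = 100))
)

head(scratch, 3)
```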
What this package does, and does not do
This package creates realistic-looking values, but does not help you create relations in a dataset. For instance, you can create streets, cities and postal codes, but those components are unrelated: the postal code is not related to the street or to the city.
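If you do need related columns, one possible approach is to keep the related values together in a small lookup table and sample whole rows from it. The sketch below uses base R only. The places table, its place names and its postal prefixes are invented for illustration and are not real postal data. Unrelated fields, such as a street, could still come from {charlatan}.

```r
# Hypothetical lookup table: each row is one consistent
# city / state / postal-prefix combination (invented values).
set.seed(1)
places <- data.frame(
  city = c("Northville", "Eastport", "Westfield"),
  state = c("AA", "BB", "CC"),
  zip_prefix = c("100", "200", "300"),
  stringsAsFactors = FALSE
)

n <- 6
# Sample whole rows, so city, state and prefix always belong together.
picked <- places[sample(nrow(places), size = n, replace = TRUE), ]

addresses <- data.frame(
  city = picked$city,
  state = picked$state,
  # complete the postal code with two random digits
  postal_code = paste0(
    picked$zip_prefix,
    sprintf("%02d", sample(0:99, size = n, replace = TRUE))
  ),
  stringsAsFactors = FALSE
)
addresses
```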
This package creates realistic names, jobs, streets, business names, phone numbers and much more. It also can create those things for different countries and languages. For example:
- You could create a dataset with real Czech names like ‘František Krejčí’ or ‘Nikol Machová’.
- You can create addresses that look like they come from New Zealand: “68 Morris Concourse Tererongo 6128” or “485 Manawarongo Way, Pedersen 3196”.
- You can create Danish jobs like “Børsmægler” or “Ventilationsmontør”.
Examples
Next are a few examples of faked data created with the {charlatan} package. The examples use the en_US locale, but they work for many locales.
Fake business transactional data
This example shows transactional data, for example from an e-commerce website, and also shows how to add some logical structure to your data. It comes from an issue opened on our GitHub.
Let’s imagine we sell clothes on the internet. What would that data look like?
I would need:
- ids for items
- item information
- a price paid
- customer information (Who bought it, where do we ship it)
Steps for creating realistic looking business transactional data
- create products
- create orders
- combine the two
We first create a few categories and subcategories. This is too specific for the {charlatan} package (and different for every store). I imagine you have a better idea for this data than I have.
In this example I have the categories Shoes, Jeans and Dresses. All Shoes have a prefix of 1, all Jeans have a prefix of 2, and all Dresses have a prefix of 5. We combine the prefix with a random number to get consistent product ids (I have no idea if clothing stores actually do this, but it looked neat).
# create product data
library(charlatan)
# high-level client used throughout the examples
fraudster_cl <- fraudster("en_US")
products <- data.frame(
  prefix = c(rep(1, 5), rep(2, 2), rep(5, 2)),
  product_id = fraudster_cl$integer(n = 9, min = 1000, max = 9999),
  main_category = c(rep("Shoes", 5), rep("Jeans", 2), rep("Dresses", 2)),
  sub_category = c("Dress shoes", "Tennis shoes", "Boots", "Hiking boots", "Country & Western style boots", "Regular fit", "Straight fit", "Summer dress", "Evening gown")
)
## when you have {dplyr} installed there are way cleaner ways to do this
## paste the prefix in front of the random number, e.g. 1 and 2006 become 12006
products$product_id <- as.integer(sprintf("%s%s", as.character(products$prefix), products$product_id))
products
#> prefix product_id main_category sub_category
#> 1 1 12006 Shoes Dress shoes
#> 2 1 17515 Shoes Tennis shoes
#> 3 1 16826 Shoes Boots
#> 4 1 13916 Shoes Hiking boots
#> 5 1 13979 Shoes Country & Western style boots
#> 6 2 23811 Jeans Regular fit
#> 7 2 22250 Jeans Straight fit
#> 8 5 59459 Dresses Summer dress
#> 9 5 51427 Dresses Evening gown
Then we create the orders with a price, a product id, and location. Orders also have an email address and are shipped to an address.
# create orders
n <- 5 # number of orders to generate (the output below has 5 rows)
orders <- data.frame(
  order_id = fraudster_cl$integer(n = n, min = 10000, max = 90000),
  location_id = fraudster_cl$integer(n = n, min = 1, max = 5),
  # prices are drawn in cents and converted to a decimal amount
  price_paid = fraudster_cl$integer(n = n, min = 1, max = 9900) / 100,
  product_id = sample(products$product_id, size = n, replace = TRUE),
  order_email = fraudster_cl$email(n = n),
  customer_name = fraudster_cl$name(n = n),
  shipping_address = fraudster_cl$address(n = n)
)
Finally we combine everything:
# combine orders and products
example_transactions <- merge(orders, products, by = "product_id")
## reorder the columns so the result is easier to read
example_transactions[, c("order_id", "location_id", "product_id", "main_category", "sub_category", "price_paid", "customer_name", "order_email", "shipping_address")]
#> order_id location_id product_id main_category sub_category
#> 1 38900 5 13979 Shoes Country & Western style boots
#> 2 82163 5 13979 Shoes Country & Western style boots
#> 3 26387 2 16826 Shoes Boots
#> 4 49425 1 17515 Shoes Tennis shoes
#> 5 32062 4 17515 Shoes Tennis shoes
#> price_paid customer_name order_email
#> 1 68.89 Sal Klein meghan99@group.com
#> 2 21.13 Darnell Stiedemann dillon.bartoletti@gmail.com
#> 3 74.89 Santana Langosh wyman.brody@gmail.com
#> 4 74.75 Earley Hickle howell.samantha@hotmail.com
#> 5 31.19 Dr. Andra White MD helmer.hayes@corwin.net
#> shipping_address
#> 1 05992 Dona Squares\nMooreview, MH 32273
#> 2 2481 Plummer Drive\nMicaylatown, VA 45528
#> 3 86826 Zeke Lodge Apt. 861\nEast Saigeland, TN 70514-0551
#> 4 407 Herzog Pass\nEast Enochfurt, NJ 23062
#> 5 516 Marks Forest Apt. 007\nTowneton, MT 48453
Notice that customer_name and order_email are completely unrelated. You could create a customer ‘table’ like the product table above to add a bit more structure.
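A customer table along those lines might look like the following sketch. It uses base R only so that it runs on its own. The names are copied from the output above, where in practice they would come from fraudster_cl$name(). The make_email() helper, the example.com domain and the orders_small table are hypothetical and exist only for this illustration.

```r
set.seed(2)
# Names copied from the output above; in practice they would come
# from fraudster_cl$name(n = ...).
customer_names <- c("Sal Klein", "Darnell Stiedemann", "Santana Langosh")

# Hypothetical helper: derive an email address from a name,
# so that the name and the email belong together.
make_email <- function(name, domain = "example.com") {
  local_part <- tolower(gsub("[^A-Za-z]+", ".", name))
  local_part <- gsub("^\\.+|\\.+$", "", local_part) # trim stray dots
  paste0(local_part, "@", domain)
}

customers <- data.frame(
  customer_id = seq_along(customer_names),
  customer_name = customer_names,
  order_email = make_email(customer_names),
  stringsAsFactors = FALSE
)

# Orders reference a customer_id instead of carrying their own name and email.
orders_small <- data.frame(
  order_id = 1:5,
  customer_id = sample(customers$customer_id, size = 5, replace = TRUE)
)

orders_with_customers <- merge(orders_small, customers, by = "customer_id")
orders_with_customers
```

With this structure a returning customer always has the same name and email address across orders.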
Protected health information
Here is how you simulate protected health information with {charlatan}, this time using the low-level API.
This example also comes from an issue on our GitHub.
First a setup:
# setup the providers
ap <- AddressProvider_en_US$new()
pp <- PersonProvider_en_US$new()
ip <- InternetProvider_en_US$new()
lp <- LoremProvider_en_US$new()
SSNP <- SSNProvider_en_US$new()
dtp <- DateTimeProvider$new()
np <- NumericsProvider$new()
pnp <- PhoneNumberProvider_en_US$new()
set.seed(1235)
We don’t have a list of counties in the US (there are 3007 of them). So we will combine a random word from the LoremProvider with the word “county”. It is probably nicer if you wrap this into a function.
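Such a wrapper could look like the sketch below. The fake_county() name and its arguments are hypothetical. It only assumes that the provider has a word() method that returns a single string, as the LoremProvider instance lp does in the record further down. A stand-in provider is used here so that the sketch runs without {charlatan}.

```r
# Hypothetical helper: `provider` is any object with a word() method,
# for example the LoremProvider instance `lp` from the setup above.
fake_county <- function(provider, n = 1) {
  words <- vapply(seq_len(n), function(i) provider$word(), character(1))
  paste0(words, " county")
}

# Stand-in provider (an assumption for this sketch, not part of {charlatan}).
stub_provider <- list(
  word = function() sample(c("benefit", "river", "stone"), size = 1)
)

set.seed(3)
counties <- fake_county(stub_provider, n = 3)
counties
```

With the real provider the call would be fake_county(lp).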
Generate a single ‘record’ for a person:
prot_health <- list(
  first_name = pp$first_name(),
  last_name = pp$last_name(),
  phone_number = pnp$render(),
  fax_number = pnp$render(),
  street = ap$street_address(),
  zipcode = ap$postcode(),
  email = ip$email(),
  county = paste0(lp$word(), " county"),
  SSN = SSNP$render(),
  dob = as.Date(dtp$date_time_between("1930-01-01", "1990-12-31")),
  # I've decided record number is an integer between 10000 - 99999
  medical_record_number = np$integer(min = 10000, max = 99999),
  ip_address = ip$ipv4()
)
prot_health
#> $first_name
#> [1] "Efrem"
#>
#> $last_name
#> [1] "Schaden"
#>
#> $phone_number
#> [1] "243.246.1773"
#>
#> $fax_number
#> [1] "+68(4)7385942080"
#>
#> $street
#> [1] "2414 Howell Stravenue"
#>
#> $zipcode
#> [1] "54142"
#>
#> $email
#> [1] "alta70@gmail.com"
#>
#> $county
#> [1] "benefit county"
#>
#> $SSN
#> [1] "275-05-1676"
#>
#> $dob
#> [1] "1954-06-03"
#>
#> $medical_record_number
#> [1] 76973
#>
#> $ip_address
#> [1] "133.80.147.164"
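To get more than one record, you could wrap the generator in a function and stack the results into a data frame. The sketch below shows that pattern in base R. The gen_records() helper and the stand-in generator gen_one_stub() are hypothetical. In practice the generator would return a list like prot_health above.

```r
# Hypothetical helper: call a record generator n times and stack the results.
gen_records <- function(n, gen_one) {
  rows <- lapply(seq_len(n), function(i) {
    as.data.frame(gen_one(), stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

# Stand-in generator so the sketch runs without {charlatan};
# it mimics two of the fields from the record above.
gen_one_stub <- function() {
  all_days <- seq(as.Date("1930-01-01"), as.Date("1990-12-31"), by = "day")
  list(
    medical_record_number = sample(10000:99999, size = 1),
    dob = sample(all_days, size = 1)
  )
}

set.seed(4)
records <- gen_records(4, gen_one_stub)
records
```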
We can also create medical records in sequence with a custom function. In the following example I create a sequence of events based on a date.
#' Generate a bunch of dated events in sequence
gen_med_record <- function(date_value, events = 4, event_types = c("admission", "x-ray", "blood-test", "general exam")) {
  # random day offsets within a year, sorted so the events are in order
  days <- sort(np$integer(events, 1, 365))
  result <- data.frame(
    date = date_value + days
  )
  result$event <- sample(event_types, size = nrow(result), replace = TRUE)
  result
}
result <- gen_med_record(date_value = as.Date("2022-03-01"), events = 5)
result$medical_record_number <- prot_health$medical_record_number
result
#> date event medical_record_number
#> 1 2022-05-12 blood-test 76973
#> 2 2022-05-14 general exam 76973
#> 3 2022-10-03 x-ray 76973
#> 4 2022-11-21 x-ray 76973
#> 5 2023-02-09 general exam 76973
Plumber API
You can also create a plumber API. For this to work you need the {plumber} package installed.
# plumber.R
library(charlatan)
fraudster_cl <- fraudster("en_US")

#* Create random addresses
#* @param n how many do you want
#* @get /address
function(n = 1) {
  # query parameters arrive as strings, so convert before use
  list(address = fraudster_cl$address(n = as.integer(n)))
}

#* Create random companies
#* @param n how many do you want
#* @get /company
function(n = 1) {
  list(company = fraudster_cl$company(n = as.integer(n)))
}
Then run the file like this to start an API:
# nested call, so no pipe operator needs to be attached
plumber::pr_run(plumber::plumb("R/plumber.R"))