Creating realistic data

library(charlatan)

Introduction

Charlatan creates realistic-looking data. By now you have seen some examples of the high-level and low-level APIs. But maybe you would like to see examples of generating data for specific use cases? Then this vignette is for you. It shows how to create fake transactional data and fake health care (PII) data, and ends with an example of a plumber API that returns fake data.

But first a slight detour into different ways of faking data.

Other options for faking data

Simulating data (creating fake data that looks realistic) is not always the best option. There are other options, which we will discuss first:

use real data
use pseudonymous or anonymous data
create a dataset from scratch

Using real data is great for live systems. You want actual data for actual predictions or actual inference. It is, however, smart to reduce your data to only the columns you need.

Using real data is a bad idea for demonstration use cases. Personally identifiable information (PII) is sensitive and needs to be handled with care. Even if you strip away the full names, it is relatively easy to pinpoint specific people on the basis of several data points.

Pseudonymous or anonymous data is a safer way to deal with personally identifiable information. By stripping away PII or replacing values with other random values, you make it harder to link the data back to actual people. The main advantage of this kind of data is that you keep the data structure intact. The relationships between variables remain, and you can do predictions and inference on them.
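
As a minimal sketch of the "replace values with random values" idea, the base R snippet below swaps a name column for a random token. The data frame, its column names and the `P1234`-style token format are all made up for illustration.

```r
# Hypothetical example data; none of these people are real.
patients <- data.frame(
  name = c("Ann Example", "Bob Sample", "Ann Example", "Cy Test"),
  age = c(34, 51, 34, 28),
  n_visits = c(2, 5, 1, 3),
  stringsAsFactors = FALSE
)

set.seed(42)
# One random token per unique person, so repeated rows stay linked.
people <- unique(patients$name)
tokens <- sprintf("P%04d", sample(1000:9999, length(people)))
lookup <- setNames(tokens, people)

pseudonymous <- patients
pseudonymous$pseudo_id <- unname(lookup[as.character(pseudonymous$name)])
pseudonymous$name <- NULL  # drop the identifying column
pseudonymous
```

The other columns are untouched, so the relationships between them survive. That is also why the remaining data points could still be used to single out a person.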

It is quite a lot of work to make your data truly anonymous. You need surprisingly little data to de-anonymize a dataset and link it back to actual people.

Creating a fake dataset from scratch is relatively easy when the data is small. Using built-in functions like sample, runif, and rnorm is very doable. It is slightly faster than using this package.
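
For example, a small from-scratch dataset could look like the sketch below. The column names and distribution parameters are arbitrary choices for illustration.

```r
set.seed(1)
n <- 10

# Every column comes straight from a base R random generator.
scratch <- data.frame(
  id = seq_len(n),
  group = sample(c("a", "b", "c"), size = n, replace = TRUE),
  score = runif(n, min = 0, max = 100),
  height_cm = rnorm(n, mean = 170, sd = 10)
)
head(scratch, 3)
```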

Completely random values do make it difficult to talk to stakeholders, who will not recognize anything in them. It is also a lot of work to generate more complex data or many different columns.

What this package does, and does not do

This package creates realistic-looking values, but it does not help you create relations within a dataset. For instance, you can create streets, cities and postal codes, but those components are unrelated: the postal code is not related to the street or to the city.
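
If you do need related components, one option is to build a small lookup table of combinations that belong together and sample whole rows from it. The sketch below uses base R only; the `places` table and its values are illustrative assumptions, not something {charlatan} provides.

```r
# Hypothetical lookup table: each row is one combination that belongs together.
places <- data.frame(
  city = c("Springfield", "Riverton", "Lakeside"),
  state = c("IL", "WY", "CA"),
  zip_prefix = c("627", "825", "920"),
  stringsAsFactors = FALSE
)

set.seed(7)
n <- 5
# Sample complete rows, so city, state and prefix always match.
picked <- places[sample(nrow(places), size = n, replace = TRUE), ]

addresses <- data.frame(
  city = picked$city,
  state = picked$state,
  # Only the last two digits are random; the prefix follows the city.
  zipcode = paste0(picked$zip_prefix, sprintf("%02d", sample(0:99, n, replace = TRUE))),
  stringsAsFactors = FALSE
)
addresses
```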

This package creates realistic names, jobs, streets, business names, phone numbers and much more. It also can create those things for different countries and languages. For example:

You could create a dataset with real Czech names like ‘František Krejčí’ or ‘Nikol Machová’.
You can create addresses that look like they come from New Zealand: “68 Morris Concourse Tererongo 6128” or “485 Manawarongo Way, Pedersen 3196”.
You can create Danish jobs like “Børsmægler” or “Ventilationsmontør”.
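
A sketch of how the three examples above might be produced with the high-level interface. It assumes the locale codes `cs_CZ`, `en_NZ` and `da_DK` are available for these methods in your installed version of {charlatan}; the values you get are random.

```r
library(charlatan)

# Locale codes follow the language_COUNTRY pattern (assumed to be supported).
cz_name <- fraudster("cs_CZ")$name()       # a Czech-looking name
nz_address <- fraudster("en_NZ")$address() # a New Zealand-looking address
dk_job <- fraudster("da_DK")$job()         # a Danish job title

cz_name
nz_address
dk_job
```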

Examples

Next are a few examples of data faked with the help of the {charlatan} package. The examples use the en_US locale, but they work for many other locales too.

Fake business transactional data

This example shows transactional data, for example from an e-commerce website, and also shows how to add some logical structure to your data. It comes from an issue opened on our GitHub.

Let’s imagine we sell clothes on the internet. What would that data look like?

I would need:

ids for items
item information
a price paid
customer information (Who bought it, where do we ship it)

Steps for creating realistic looking business transactional data

create products
create orders
combine the two

# setup
fraudster_cl <- fraudster("en_US")
n <- 5
set.seed(1235)

We first create a few categories and subcategories. This is too specific for the {charlatan} package (and different for every store). I imagine you have a better idea of this data than I do.

In this example I have the categories Shoes, Jeans and Dresses. All Shoes have a prefix of 1, all Jeans a prefix of 2, and all Dresses a prefix of 5. We combine the prefix with a random number to get consistent product ids (I have no idea if clothing stores actually do this, but it looked neat).

# create product data
products <- data.frame(
  prefix = c(rep(1, 5), rep(2, 2), rep(5, 2)),
  product_id = fraudster_cl$integer(n = 9, min = 1000, max = 9999),
  main_category = c(rep("Shoes", 5), rep("Jeans", 2), rep("Dresses", 2)),
  sub_category = c("Dress shoes", "Tennis shoes", "Boots", "Hiking boots", "Country & Western style boots", "Regular fit", "Straight fit", "Summer dress", "Evening gown")
)
## when you have {dplyr} installed there are way cleaner ways to do this
products$product_id <- as.integer(sprintf("%s%s", as.character(products$prefix), products$product_id))
products
#>   prefix product_id main_category                  sub_category
#> 1      1      12006         Shoes                   Dress shoes
#> 2      1      17515         Shoes                  Tennis shoes
#> 3      1      16826         Shoes                         Boots
#> 4      1      13916         Shoes                  Hiking boots
#> 5      1      13979         Shoes Country & Western style boots
#> 6      2      23811         Jeans                   Regular fit
#> 7      2      22250         Jeans                  Straight fit
#> 8      5      59459       Dresses                  Summer dress
#> 9      5      51427       Dresses                  Evening gown

Then we create the orders with a price, a product id, and location. Orders also have an email address and are shipped to an address.

# create orders
orders <- data.frame(
  order_id = fraudster_cl$integer(n = n, min = 10000, max = 90000),
  location_id = fraudster_cl$integer(n = n, min = 1, max = 5),
  price_paid = fraudster_cl$integer(n = n, min = 1, max = 9900) / 100,
  product_id = sample(products$product_id, size = n, replace = TRUE),
  order_email = fraudster_cl$email(n = n),
  customer_name = fraudster_cl$name(n = n),
  shipping_address = fraudster_cl$address(n = n)
)

Finally we combine everything:

# combine orders and products (merged on the shared product_id column)
example_transactions <- merge(orders, products)
## reorder the columns so they make more sense.
example_transactions[, c("order_id", "location_id", "product_id", "main_category", "sub_category", "price_paid", "customer_name", "order_email", "shipping_address")]
#>   order_id location_id product_id main_category                  sub_category
#> 1    38900           5      13979         Shoes Country & Western style boots
#> 2    82163           5      13979         Shoes Country & Western style boots
#> 3    26387           2      16826         Shoes                         Boots
#> 4    49425           1      17515         Shoes                  Tennis shoes
#> 5    32062           4      17515         Shoes                  Tennis shoes
#>   price_paid      customer_name                 order_email
#> 1      68.89          Sal Klein          meghan99@group.com
#> 2      21.13 Darnell Stiedemann dillon.bartoletti@gmail.com
#> 3      74.89    Santana Langosh       wyman.brody@gmail.com
#> 4      74.75      Earley Hickle howell.samantha@hotmail.com
#> 5      31.19 Dr. Andra White MD     helmer.hayes@corwin.net
#>                                           shipping_address
#> 1                  05992 Dona Squares\nMooreview, MH 32273
#> 2                2481 Plummer Drive\nMicaylatown, VA 45528
#> 3 86826 Zeke Lodge Apt. 861\nEast Saigeland, TN 70514-0551
#> 4                407 Herzog Pass\nEast Enochfurt, NJ 23062
#> 5            516 Marks Forest Apt. 007\nTowneton, MT 48453

Notice that customer_name and order_email are completely unrelated. You could create a customer ‘table’, like the product table above, to add a bit more structure.
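
A minimal sketch of such a customer table follows. It derives the email address from the name so the two agree, and lets orders refer to a customer by id. The `customer_id` column, the email format and the `example.com` domain are assumptions made for illustration.

```r
library(charlatan)

fraudster_cl <- fraudster("en_US")
set.seed(1235)

# A small customer table: one row per customer.
n_customers <- 3
customers <- data.frame(
  customer_id = seq_len(n_customers),
  customer_name = fraudster_cl$name(n = n_customers),
  stringsAsFactors = FALSE
)

# Build the email from the name: lower case, non-letters become dots.
local_part <- gsub("[^a-z]+", ".", tolower(customers$customer_name))
local_part <- gsub("^\\.+|\\.+$", "", local_part)
customers$order_email <- paste0(local_part, "@example.com")

# Orders only store a customer_id, just as they store a product_id.
n_orders <- 5
orders_with_customer <- data.frame(
  order_id = fraudster_cl$integer(n = n_orders, min = 10000, max = 90000),
  customer_id = sample(customers$customer_id, size = n_orders, replace = TRUE)
)
merge(orders_with_customer, customers)
```

A customer who orders twice now shows up with the same name and email both times.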

Protected health information

Here is how you simulate protected health information with {charlatan}, this time using the low-level API.

This example also comes from an issue on our GitHub.

First a setup:

# setup the providers
ap <- AddressProvider_en_US$new()
pp <- PersonProvider_en_US$new()
ip <- InternetProvider_en_US$new()
lp <- LoremProvider_en_US$new()
SSNP <- SSNProvider_en_US$new()
dtp <- DateTimeProvider$new()
np <- NumericsProvider$new()
pnp <- PhoneNumberProvider_en_US$new()

set.seed(1235)

We don’t have a list of counties in the US (there are 3007 of them), so we combine a random word from the LoremProvider with the word ‘county’. It is probably nicer to wrap this in a function.
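
Such a wrapper might look like the sketch below. The name `fake_county` and the capitalisation step are my own choices, not part of {charlatan}.

```r
library(charlatan)

lp <- LoremProvider_en_US$new()

# Hypothetical helper: n made-up county names built from random lorem words.
fake_county <- function(n = 1) {
  words <- vapply(seq_len(n), function(i) lp$word(), character(1))
  # Capitalise the first letter so it reads like a place name.
  words <- paste0(toupper(substring(words, 1, 1)), substring(words, 2))
  paste(words, "County")
}

fake_county(3)
```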

Generate a single ‘record’ for a person:

prot_health <- list(
  first_name = pp$first_name(),
  last_name = pp$last_name(),
  phone_number = pnp$render(),
  fax_number = pnp$render(),
  street = ap$street_address(),
  zipcode = ap$postcode(),
  email = ip$email(),
  county = paste0(lp$word(), " county"),
  SSN = SSNP$render(),
  dob = as.Date(dtp$date_time_between("1930-01-01", "1990-12-31")),
  # I've decided record number is an integer between 10000 - 99999
  medical_record_number = np$integer(min = 10000, max = 99999),
  ip_address = ip$ipv4()
)
prot_health
#> $first_name
#> [1] "Efrem"
#> 
#> $last_name
#> [1] "Schaden"
#> 
#> $phone_number
#> [1] "243.246.1773"
#> 
#> $fax_number
#> [1] "+68(4)7385942080"
#> 
#> $street
#> [1] "2414 Howell Stravenue"
#> 
#> $zipcode
#> [1] "54142"
#> 
#> $email
#> [1] "alta70@gmail.com"
#> 
#> $county
#> [1] "benefit county"
#> 
#> $SSN
#> [1] "275-05-1676"
#> 
#> $dob
#> [1] "1954-06-03"
#> 
#> $medical_record_number
#> [1] 76973
#> 
#> $ip_address
#> [1] "133.80.147.164"
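
To get more than one person, you can wrap the record in a function and bind the results together. The sketch below keeps only three of the fields for brevity; `gen_person` is a hypothetical helper name.

```r
library(charlatan)

pp <- PersonProvider_en_US$new()
np <- NumericsProvider$new()

# Hypothetical helper: one person as a one-row data frame.
gen_person <- function() {
  data.frame(
    first_name = pp$first_name(),
    last_name = pp$last_name(),
    medical_record_number = np$integer(min = 10000, max = 99999),
    stringsAsFactors = FALSE
  )
}

set.seed(1235)
# Call the helper three times and stack the rows.
people <- do.call(rbind, lapply(seq_len(3), function(i) gen_person()))
people
```

Record numbers drawn this way can collide, so check them for duplicates if they need to be unique.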

We can also create medical records in sequence with a custom function. In the following example I create a sequence of events based on a date.

#' Generate a sequence of dated medical events, starting from date_value
gen_med_record <- function(date_value, events = 4, event_types = c("admission", "x-ray", "blood-test", "general exam")) {
  days <- sort(np$integer(events, 1, 365))
  result <- data.frame(
    date = date_value + days
  )
  result$event <- sample(event_types, size = nrow(result), replace = TRUE)
  result
}

result <- gen_med_record(date_value = as.Date("2022-03-01"), events = 5)
result$medical_record_number <- prot_health$medical_record_number
result
#>         date        event medical_record_number
#> 1 2022-05-12   blood-test                 76973
#> 2 2022-05-14 general exam                 76973
#> 3 2022-10-03        x-ray                 76973
#> 4 2022-11-21        x-ray                 76973
#> 5 2023-02-09 general exam                 76973

Plumber API

You can also create a plumber API. For this to work you need the {plumber} package installed.

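# plumber.R
library(charlatan)

fraudster_cl <- fraudster("en_US")

#* Create random addresses
#* @param n how many do you want
#* @get /address
function(n = 1) {
  # query parameters arrive as strings, so convert n first
  list(address = fraudster_cl$address(n = as.integer(n)))
}

#* Create random companies
#* @param n how many do you want
#* @get /company
function(n = 1) {
  n <- as.integer(n)
  # call company() once per requested value
  list(company = vapply(seq_len(n), function(i) fraudster_cl$company(), character(1)))
}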

Then run the file like this to start an API:

# nested call, so no pipe operator has to be attached
plumber::pr_run(plumber::plumb("R/plumber.R"))

Conclusion

As you can see, this package helps you generate plausible data, but to add structure (relationships between variables) to your data you need to do some work yourself.

Roel M. Hogervorst