In a common use case for gender prediction, you have a data frame with a column for first names and a column for birth years (or two columns specifying the minimum and maximum potential birth years). This function wraps the gender function so that it can be applied efficiently to such a data frame. The result is a data frame with one gender prediction for each unique combination of first name and birth year, which can then be merged back into your original data frame.

gender_df(
  data,
  name_col = "name",
  year_col = "year",
  method = c("ssa", "ipums", "napp", "demo")
)

Arguments

data

A data frame containing first names and either a birth year or a range of potential birth years for each name.

name_col

A string specifying the name of the column containing the first names.

year_col

Either a single string specifying the name of the column containing the birth year associated with each first name, or a character vector with two elements: the names of the columns containing the minimum and maximum years of the range of potential birth years.

method

One of the historical methods provided by this package: "ssa", "ipums", "napp", or "demo". See gender for details.

Value

A data frame with columns from the output of the gender function, and one row for each unique combination of first names and birth years.

Examples

library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#>
#>     intersect, setdiff, setequal, union
demo_df <- tibble(names      = c("Hillary", "Hillary", "Hillary", "Madison", "Madison"),
                  birth_year = c(1930, 2000, 1930, 1930, 2000),
                  min_year   = birth_year - 1,
                  max_year   = birth_year + 1)

# Using the birth year for the predictions.
# Notice that the duplicate value for Hillary in 1930 is removed
gender_df(demo_df, method = "demo",
          name_col = "names", year_col = "birth_year")
#> # A tibble: 4 x 6
#>   name    proportion_male proportion_female gender year_min year_max
#>   <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
#> 1 Hillary          1                 0      male       1930     1930
#> 2 Madison          1                 0      male       1930     1930
#> 3 Hillary          0                 1      female     2000     2000
#> 4 Madison          0.0069            0.993  female     2000     2000
# Using a range of years
gender_df(demo_df, method = "demo",
          name_col = "names", year_col = c("min_year", "max_year"))
#> # A tibble: 4 x 6
#>   name    proportion_male proportion_female gender year_min year_max
#>   <chr>             <dbl>             <dbl> <chr>     <dbl>    <dbl>
#> 1 Hillary          1                 0      male       1929     1931
#> 2 Madison          1                 0      male       1929     1931
#> 3 Hillary          0.0065            0.994  female     1999     2001
#> 4 Madison          0.0072            0.993  female     1999     2001
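
The description mentions merging the predictions back into the original data frame. One way to do that is sketched below using dplyr's left_join(). This is an illustration rather than part of the package: the `predictions` tibble is a hand-built stand-in whose values are copied from the single-year example output above, so the sketch runs without calling gender_df(), and the names `predictions` and `merged` are assumptions chosen for this example.

```r
library(dplyr)

# Original data: one row per person, so name/year pairs may repeat
demo_df <- tibble(
  names      = c("Hillary", "Hillary", "Hillary", "Madison", "Madison"),
  birth_year = c(1930, 2000, 1930, 1930, 2000)
)

# Stand-in for the result of
#   gender_df(demo_df, method = "demo",
#             name_col = "names", year_col = "birth_year")
# Values are copied from the example output above (assumption: in real
# use you would assign the gender_df() result here instead).
predictions <- tibble(
  name              = c("Hillary", "Madison", "Hillary", "Madison"),
  proportion_male   = c(1, 1, 0, 0.0069),
  proportion_female = c(0, 0, 1, 0.993),
  gender            = c("male", "male", "female", "female"),
  year_min          = c(1930, 1930, 2000, 2000),
  year_max          = c(1930, 1930, 2000, 2000)
)

# Match each original row to its prediction by name and year.
# When a single year column was used, year_min equals the birth year.
merged <- left_join(
  demo_df, predictions,
  by = c("names" = "name", "birth_year" = "year_min")
)

merged
```

A left join keeps every row of the original data, including the duplicated Hillary/1930 row, and attaches the matching prediction to each. If a range of years was used instead, the join keys would be the two range columns matched against year_min and year_max.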