The goal of gendercoder is to allow simple recoding of free-text gender responses.
Installation
This package is not on CRAN. To use this package please run the following code:
devtools::install_github("ropensci/gendercoder")
library(gendercoder)
Why would we do this?
Researchers who collect self-reported demographic data from respondents occasionally collect gender using a free-text response option. This respects the gender diversity of respondents and avoids prompting them with preset options, which could elicit misleading responses. However, it presents a challenge: inconsistencies in typography and spelling create a much larger set of responses than is needed to capture the demographic characteristics of the sample.
For example, male participants may provide free-text responses as “male”, “man”, “mail”, “mael”. Non-binary participants may provide responses as “nonbinary”, “enby”, “non-binary”, “non binary”.
Manually coding such free-text responses is often not feasible with larger datasets. recode_gender() uses dictionaries of common misspellings to recode free-text responses into a consistent set of responses. The small number of responses not automatically recoded by recode_gender() can then feasibly be recoded manually.
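The dictionary idea can be sketched in base R with a named character vector. This is only an illustration of the lookup technique, not the package's implementation: toy_dictionary and toy_recode() are hypothetical names, and the lower-casing and trimming steps are assumptions.

```r
# Hypothetical toy dictionary: names are raw responses, values are the
# categories they should be recoded to.
toy_dictionary <- c(
  male = "man", man = "man", mail = "man", mael = "man",
  female = "woman", woman = "woman", femail = "woman",
  nonbinary = "non-binary", enby = "non-binary", "non-binary" = "non-binary"
)

# Hypothetical helper: normalise case and surrounding whitespace, then look
# each response up by name.
toy_recode <- function(responses, dictionary) {
  cleaned <- tolower(trimws(responses))
  # Indexing a named vector by a name it does not contain yields NA.
  unname(dictionary[cleaned])
}

toy_recode(c("Male", " mael", "ENBY", "40"), toy_dictionary)
```

Responses that are not in the dictionary (here "40") come back as NA, which is what leaves a small remainder for manual review.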
Motivating example
gendercoder includes a sample dataset (sample) with actual free-text responses to the question “What is your gender?” from a number of studies of English-speaking participants. The sample dataset includes responses from 7756 participants. Naive coding identifies 103 unique responses to this item.
library(gendercoder)
library(dplyr)
sample %>%
group_by(Gender) %>%
summarise(count = n()) %>%
arrange(-count) %>%
knitr::kable(caption = "Summary of gender categories before coding")
Gender | count |
---|---|
female | 1836 |
Male | 1755 |
Female | 1722 |
male | 1705 |
FEMALE | 113 |
Female | 113 |
MALE | 107 |
F | 88 |
f | 58 |
M | 35 |
m | 33 |
female | 24 |
Male | 21 |
woman | 19 |
male | 7 |
masculino | 7 |
Man | 6 |
man | 5 |
Woman | 4 |
female | 3 |
Femail | 3 |
femal | 3 |
Femalw | 2 |
Frmale | 2 |
Gender | 2 |
MALE | 2 |
 | 2 |
Masculino | 2 |
Nonbinary | 2 |
femail | 2 |
Male | 1 |
male | 1 |
% | 1 |
40 | 1 |
54 | 1 |
Agender | 1 |
Androgynous | 1 |
Apache Helicopter… Just kidding. There are only two. I am a Male. | 1 |
Asian | 1 |
Demale | 1 |
FEMAIIL | 1 |
FEMAL | 1 |
FEMALE | 1 |
FEMale | 1 |
Femaile | 1 |
Femal | 1 |
Female (cisgender) | 1 |
Female to non-binary | 1 |
Femalee | 1 |
Femalep | 1 |
Feminine | 1 |
G | 1 |
Gender is a social construct - I’m sexually female | 1 |
Girl | 1 |
MALR | 1 |
MAle | 1 |
Malae | 1 |
Male | 1 |
Male(Sex, Gender is a silly construct) | 1 |
Male. | 1 |
Man/Male | 1 |
Mslr | 1 |
NB | 1 |
Non Binary | 1 |
Non binary | 1 |
Non-binary | 1 |
Trans man | 1 |
Transgender man | 1 |
Transsexual male (FTM) | 1 |
agender (woman) | 1 |
cis female | 1 |
demigirl | 1 |
emale | 1 |
famela | 1 |
feamle | 1 |
fem | 1 |
femae | 1 |
femake | 1 |
femal3 | 1 |
femald | 1 |
female (Cisgender) | 1 |
femals | 1 |
femenina | 1 |
fenale | 1 |
fmale | 1 |
ftm | 1 |
g | 1 |
girl | 1 |
mALE | 1 |
mae | 1 |
mael | 1 |
maill | 1 |
make | 1 |
males | 1 |
maoe | 1 |
masculino | 1 |
nb | 1 |
non-binary | 1 |
transgender | 1 |
transgender female | 1 |
transmale | 1 |
transmasculine | 1 |
woman | 1 |
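Several rows in the table look identical (for example, “Female” appears more than once). A plausible explanation, which is an assumption here because the rendered table cannot show it, is that those responses differ only in leading or trailing whitespace. The base R sketch below uses made-up responses to show how case and whitespace alone inflate the number of unique values.

```r
# Made-up responses that differ only in case and surrounding whitespace.
responses <- c("Female", "Female ", " female", "female", "FEMALE")

# Every raw string is distinct, so naive counting sees five categories.
n_raw <- length(unique(responses))

# After trimming whitespace and lower-casing, they collapse to one.
n_clean <- length(unique(tolower(trimws(responses))))

c(raw = n_raw, cleaned = n_clean)
```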
Recoding using the recode_gender() function classifies all but 27 responses into pre-defined response categories.
sample %>%
head(10) %>%
mutate(recoded_gender = recode_gender(gender = Gender, dictionary = manylevels_en)) %>%
knitr::kable(caption = "The manylevels_en dictionary applied to `head(sample)`")
Gender | recoded_gender |
---|---|
FEMALE | woman |
FEMAL | woman |
male | man |
FEMALE | woman |
female | woman |
male | man |
feamle | woman |
male | man |
male | man |
male | man |
sample %>%
mutate(recoded_gender = recode_gender(gender = Gender, dictionary = manylevels_en)) %>%
filter(!is.na(recoded_gender)) %>%
group_by(recoded_gender) %>%
summarise(count = n()) %>%
arrange(-count) %>%
knitr::kable(caption = "Summary of gender categories after use of the *manylevels_en* dictionary")
recoded_gender | count |
---|---|
woman | 4015 |
man | 3692 |
non-binary | 8 |
transgender man | 4 |
cis woman | 3 |
girl | 2 |
agender | 1 |
androgynous | 1 |
female | 1 |
transgender | 1 |
transgender woman | 1 |
In this dataset, unclassified responses are a mix of unusual responses and apparent response errors (e.g. numbers and symbols). While some of these are genuinely missing (e.g. Gender = 40), others could be manually recoded or added to a custom dictionary.
sample %>%
mutate(recoded_gender = recode_gender(gender = Gender, dictionary = manylevels_en)) %>%
filter(is.na(recoded_gender)) %>%
knitr::kable(caption = "All responses not classified by the built-in dictionary")
Gender | recoded_gender |
---|---|
40 | NA |
Gender | NA |
agender (woman) | NA |
masculino | NA |
Male. | NA |
54 | NA |
% | NA |
Female to non-binary | NA |
Asian | NA |
masculino | NA |
masculino | NA |
masculino | NA |
masculino | NA |
Man/Male | NA |
demigirl | NA |
femenina | NA |
masculino | NA |
Masculino | NA |
transmasculine | NA |
Gender | NA |
Gender is a social construct - I’m sexually female | NA |
Apache Helicopter… Just kidding. There are only two. I am a Male. | NA |
masculino | NA |
masculino | NA |
Male(Sex, Gender is a silly construct) | NA |
Transsexual male (FTM) | NA |
Masculino | NA |
Options within the function
dictionary
The package provides two built-in dictionaries, selected with the dictionary argument. The first, dictionary = manylevels_en, corrects spelling and standardises terms while maintaining the diversity of responses. This is the default dictionary for recode_gender() as it preserves as much gender diversity as possible.
However, in some cases you may wish to collapse gender into a smaller set of categories by using the fewlevels_en dictionary (dictionary = fewlevels_en). This dictionary contains fewer gender categories: “man”, “woman”, “boy”, “girl”, and “sex and gender diverse”.
The “man” category includes all participants who indicate that they are
- male
- trans male (including female to male transgender respondents)
- cis male
The “woman” category includes all participants who indicate that they are
- female
- trans female (including male to female transgender respondents)
- cis female
The “sex and gender diverse” category includes all participants who indicate that they are
- agender
- androgynous
- intersex
- non-binary
- gender-queer
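The collapsing described in the lists above can be pictured as a second lookup from detailed labels to broad ones. The base R sketch below is illustrative only: collapse_map and its exact labels (such as "cis man") are hypothetical and are not taken from the fewlevels_en dictionary itself.

```r
# Hypothetical mapping from detailed labels to the broader categories.
collapse_map <- c(
  "man" = "man", "transgender man" = "man", "cis man" = "man",
  "woman" = "woman", "transgender woman" = "woman", "cis woman" = "woman",
  "agender" = "sex and gender diverse",
  "androgynous" = "sex and gender diverse",
  "intersex" = "sex and gender diverse",
  "non-binary" = "sex and gender diverse",
  "genderqueer" = "sex and gender diverse"
)

# Made-up detailed labels, collapsed by name lookup.
detailed <- c("cis woman", "transgender man", "non-binary", "agender", "man")
broad <- unname(collapse_map[detailed])

# Tabulate the collapsed categories.
table(broad)
```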
sample %>%
head(10) %>%
mutate(recoded_gender = recode_gender(gender = Gender, dictionary = fewlevels_en)) %>%
knitr::kable(caption = "The fewlevels_en dictionary applied to `head(sample)`")
Gender | recoded_gender |
---|---|
FEMALE | woman |
FEMAL | woman |
male | man |
FEMALE | woman |
female | woman |
male | man |
feamle | woman |
male | man |
male | man |
male | man |
sample %>%
mutate(recoded_gender = recode_gender(gender = Gender, dictionary = fewlevels_en)) %>%
group_by(recoded_gender) %>%
summarise(count = n()) %>%
arrange(-count) %>%
knitr::kable(caption = "Summary of gender categories after use of the *fewlevels_en* dictionary")
recoded_gender | count |
---|---|
woman | 4020 |
man | 3696 |
NA | 27 |
sex and gender diverse | 11 |
girl | 2 |
You can also specify a custom dictionary to replace or supplement the built-in dictionary. The custom dictionary should be a named character vector in the following format.
# name of the vector element is the user input value and the vector element is the
# replacement value corresponding to that name as a lower case string.
custom_dictionary <- c(
masculino = "man",
hombre = "man",
mujer = "woman",
femenina = "woman"
)
str(custom_dictionary)
## Named chr [1:4] "man" "man" "woman" "woman"
## - attr(*, "names")= chr [1:4] "masculino" "hombre" "mujer" "femenina"
Custom dictionaries can be used in place of a built-in dictionary, or can supplement it by combining the vectors with c() and passing the result to the dictionary argument. Where the combined vectors contain duplicated names, the last entry for the duplicated name is used for recoding. This allows you to use the built-in dictionary but change the coding of one or more responses from that dictionary. Here the addition of Spanish terms allows for recoding of 11 previously uncoded responses.
sample %>%
mutate(recoded_gender = recode_gender(gender = Gender,
dictionary = c(fewlevels_en,
custom_dictionary))) %>%
group_by(recoded_gender) %>%
summarise(count = n()) %>%
arrange(-count) %>%
knitr::kable(caption = "Summary of gender categories after use of the combined dictionaries")
recoded_gender | count |
---|---|
woman | 4021 |
man | 3706 |
NA | 16 |
sex and gender diverse | 11 |
girl | 2 |
sample %>%
mutate(recoded_gender = recode_gender(gender = Gender,
dictionary = c(fewlevels_en,
custom_dictionary))) %>%
filter(is.na(recoded_gender)) %>%
knitr::kable(caption = "All responses not classified by the combined dictionaries")
Gender | recoded_gender |
---|---|
40 | NA |
Gender | NA |
agender (woman) | NA |
Male. | NA |
54 | NA |
% | NA |
Female to non-binary | NA |
Asian | NA |
Man/Male | NA |
demigirl | NA |
transmasculine | NA |
Gender | NA |
Gender is a social construct - I’m sexually female | NA |
Apache Helicopter… Just kidding. There are only two. I am a Male. | NA |
Male(Sex, Gender is a silly construct) | NA |
Transsexual male (FTM) | NA |
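The “last entry wins” rule for duplicated names can be illustrated in base R. This is a sketch of the behaviour described above, not the package's internal code; builtin_toy and custom_toy are hypothetical stand-ins for a built-in and a custom dictionary.

```r
# Hypothetical stand-ins for a built-in and a custom dictionary.
builtin_toy <- c(male = "man", female = "woman", girl = "girl")
custom_toy  <- c(girl = "woman", mujer = "woman")

# Concatenating keeps both "girl" entries, built-in first and custom second.
combined <- c(builtin_toy, custom_toy)

# Keep only the last occurrence of each name, so the custom entry overrides
# the built-in one.
resolved <- combined[!duplicated(names(combined), fromLast = TRUE)]

resolved[["girl"]]
```

Plain name indexing on the unresolved vector (combined[["girl"]]) returns the first match, so a sketch like this has to drop earlier duplicates explicitly to get the override.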
retain_unmatched
The retain_unmatched argument determines how values not contained in the dictionary are handled. By default, unmatched values are coded as NA. Setting retain_unmatched = TRUE fills unmatched responses with the participant-provided response.
sample %>%
mutate(recoded_gender = recode_gender(gender = Gender,
dictionary = c(fewlevels_en,
custom_dictionary),
retain_unmatched = TRUE)) %>%
group_by(recoded_gender) %>%
summarise(count = n()) %>%
arrange(-count) %>%
knitr::kable(caption = "Summary of gender categories after use of the combined dictionary and `retain_unmatched = TRUE`")
## Some results not matched from the dictionary have been filled with the original values.
recoded_gender | count |
---|---|
woman | 4021 |
man | 3706 |
sex and gender diverse | 11 |
Gender | 2 |
girl | 2 |
% | 1 |
40 | 1 |
54 | 1 |
Apache Helicopter… Just kidding. There are only two. I am a Male. | 1 |
Asian | 1 |
Female to non-binary | 1 |
Gender is a social construct - I’m sexually female | 1 |
Male(Sex, Gender is a silly construct) | 1 |
Male. | 1 |
Man/Male | 1 |
Transsexual male (FTM) | 1 |
agender (woman) | 1 |
demigirl | 1 |
transmasculine | 1 |
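Conceptually, retaining unmatched values amounts to falling back to the original response wherever the lookup produced NA. The base R sketch below shows that idea with made-up vectors; it is an assumption about the behaviour, not the package's source.

```r
# Made-up original responses and the result of a dictionary lookup on them.
original <- c("male", "demigirl", "Femail", "40")
recoded  <- c("man", NA, "woman", NA)

# Where the lookup failed (NA), fall back to the participant's own response.
retained <- ifelse(is.na(recoded), original, recoded)

retained
```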
A disclaimer on handling gender responses
This package attempts to remove typographical errors from free-text gender data. The defaults that we used are specific to our context and the time at which the package was developed, and your data or context may be different.
We offer two built-in dictionaries, manylevels_en and fewlevels_en. Both are necessarily opinionated about how gender descriptors collapse into categories.
However, as these are culturally specific, they may not be suitable for your data. In particular the fewlevels_en option makes opinionated choices about some responses that we want to acknowledge are potentially problematic. Specifically,
- In ‘fewlevels_en’ coding, intersex responses are recoded as ‘sex and gender diverse’.
- In ‘fewlevels_en’ coding, responses where people indicate they are trans and indicate their identified gender are recoded as the identified gender (e.g. ‘Male to Female’ is recoded as ‘woman’). We wish to acknowledge that this may not reflect how some individuals would classify themselves when given these categories and in some contexts may make systematic errors. The manylevels_en coding dictionary attempts to avoid these issues as much as possible; however, users can provide a custom dictionary to add to or overwrite our coding decisions if they feel this is more appropriate. We welcome people to update the built-in dictionary where desired responses are missing.
- In both dictionaries, we assume that typographical features such as spacing are not relevant to recoding the gender response (e.g. we assume that “genderqueer” and “gender queer” are equivalent). This is unlikely to be true for all contexts.
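The spacing assumption in the last point can be illustrated with a small normalisation step in base R. How the package itself treats spacing is not stated here, so squash_spacing() is a hypothetical helper that only demonstrates the equivalence being assumed.

```r
# Hypothetical helper: lower-case the response and remove all whitespace, so
# that variants differing only in spacing compare as equal.
squash_spacing <- function(x) {
  gsub("[[:space:]]+", "", tolower(x))
}

variants <- c("gender queer", "Genderqueer", "gender  queer")
squashed <- squash_spacing(variants)

# Under this assumption all three variants are the same response.
unique(squashed)
```

If spacing carries meaning in your context, a step like this would merge responses that should stay distinct.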
The ‘manylevels_en’ coding places those who identify as trans female/male or cis female/male into separate categories. It should not be assumed that all people who describe themselves as male/female are cis. If you are assessing trans status, we recommend a two-part question; see:
Bauer, Greta & Braimoh, Jessica & Scheim, Ayden & Dharma, Christoffer. (2017). Transgender-inclusive measures of sex/gender for population surveys: Mixed-methods evaluation and recommendations. PLoS ONE. 12.
Contributing to this package
This package is a reflection of the cultural context of the package contributors. We acknowledge that understandings of gender are bound by both culture and time and are continually changing. As such, we welcome issues and pull requests to make the package more inclusive, more reflective of current understandings of gender-inclusive language, and/or suitable for a broader range of cultural contexts. We particularly welcome the addition of non-English dictionaries or of other gender-diverse responses to the manylevels_en and fewlevels_en dictionaries.
The “Adding to the dictionary” vignette includes information about how to make changes to the dictionary, either for your own use or when contributing to the gendercoder package.