Large social science literatures examine how an individual’s gender, ethnicity, or nationality shapes a host of behaviors and circumstances. Researchers therefore often need to know these characteristics of individuals. Not all pre-existing datasets contain this information, though, and it can be difficult for scholars to locate, particularly if they work with exotic samples.
Even if researchers do not have data on these theoretically important covariates, there are many cases in which they know individuals’ names. Thanks to recent developments in machine learning, these names can be used to probabilistically identify the gender, ethnicity, leaf nationality, or origin of their bearers. These exciting advancements can potentially catalyze existing research programs on gender, race, ethnicity, coethnicity, and national origins.
Unfortunately, most of the available name classifiers are very expensive to use. Thankfully, there are two free or cheap-to-use tools that this package leverages:
NamePrism - A non-commercial program for academic research

- Cost: Free with API token (60 requests/minute rate limit)
- Get token: https://www.name-prism.com/api
- Used for: Ethnicity and nationality classification
- Reference: Ye et al. 2017
NamSor - Commercial API with free tier

- Cost: 5,000 units/month free (gender = 1 unit/name)
- Get API key: https://namsor.app/
- Used for: Gender classification
- Info: https://github.com/namsor/namsor-api
The nomine package provides simple R functions to query
these APIs without needing to write custom code.
`get_ethnicities(names, t, warnings = FALSE)`: Classify names by 6 U.S. ethnicities using NamePrism.

- Input: Vector of full names (“First Last”)
- Returns: Probabilities for: 2PRACE, Hispanic, API, Black, AIAN, White
- Cost: Free (rate-limited)
`get_nationalities(names, t, warnings = FALSE)`: Classify names by 39 leaf nationalities using NamePrism.

- Input: Vector of full names (“First Last”)
- Returns: Probabilities for 39 cultural/national origin categories
- Cost: Free (rate-limited)
- Categories: See https://name-prism.com/about
`get_gender(given, family, api_key)`: Classify names by gender using NamSor v2.

- Input: Vectors of first and last names
- Returns: Gender classification (“male”/“female”) and scale (-1 to +1)
- Cost: 1 unit per name (5,000 free/month)
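Once results come back, a common next step is to reduce the probabilities to a single most likely category and to flag uncertain gender calls. The sketch below uses made-up data frames rather than live API calls. It assumes the ethnicity result has an `input` column plus one numeric column per category, and that the magnitude of `scale` can be read as confidence; the 0.5 cutoff and the `top_category`, `top_prob`, and `uncertain` column names are illustrative choices, not part of the package.

```r
# Mock stand-in for a get_ethnicities() result (values are made up).
mock_eth <- data.frame(
  input    = c("Name One", "Name Two"),
  White    = c(0.70, 0.10),
  Hispanic = c(0.20, 0.80),
  Black    = c(0.10, 0.10),
  stringsAsFactors = FALSE
)

# Treat every column except `input` as a category probability.
prob_cols <- setdiff(names(mock_eth), "input")
probs <- as.matrix(mock_eth[, prob_cols])

# Index of the largest probability in each row (first one wins on ties).
top <- max.col(probs, ties.method = "first")
mock_eth$top_category <- prob_cols[top]
mock_eth$top_prob <- probs[cbind(seq_len(nrow(probs)), top)]

# Mock stand-in for a get_gender() result (values are made up).
mock_gender <- data.frame(
  gender = c("female", "male", "female"),
  scale  = c(0.95, -0.99, 0.10),
  stringsAsFactors = FALSE
)

# Assumption: values of `scale` near zero indicate a weak classification.
mock_gender$uncertain <- abs(mock_gender$scale) < 0.5
```

Keeping the full probability columns alongside `top_category` lets later analyses use the uncertainty instead of discarding it.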
The latest development version (1.0.2) is on GitHub and can be installed using devtools.
```r
if (!require("devtools")) {
  install.packages("devtools")
}
devtools::install_github("lobsterbush/nomine")
```

```r
# Get your NamePrism token: https://www.name-prism.com/api
# Get your NamSor API key: https://namsor.app/
library(nomine)

# Example names
names <- c("Charles Crabtree", "Volha Chykina", "Maria Garcia")

# Get ethnicity probabilities
results <- get_ethnicities(names, t = "YOUR_NAMEPRISM_TOKEN")

# View results
print(results[, c("input", "White", "Hispanic", "Black")])
#              input White Hispanic Black
# 1 Charles Crabtree  0.85     0.03  0.05
# 2    Volha Chykina  0.72     0.02  0.01
# 3     Maria Garcia  0.15     0.78  0.02
```

```r
# Get nationality probabilities
results <- get_nationalities(names, t = "YOUR_NAMEPRISM_TOKEN")

# View selected nationality probabilities for each name
print(results[, c("input", "CelticEnglish", "European-Russian", "Hispanic-Spanish")])
#              input CelticEnglish European-Russian Hispanic-Spanish
# 1 Charles Crabtree          0.82             0.03             0.02
# 2    Volha Chykina          0.05             0.68             0.01
# 3     Maria Garcia          0.03             0.01             0.75
```

```r
# Example names (first and last separate)
first_names <- c("Volha", "Charles", "Maria")
last_names <- c("Chykina", "Crabtree", "Garcia")

# Get gender classifications
results <- get_gender(first_names, last_names, api_key = "YOUR_NAMSOR_KEY")

# View results
print(results[, c("first_name", "last_name", "gender", "scale")])
#   first_name last_name gender scale
# 1      Volha   Chykina female  0.95
# 2    Charles  Crabtree   male -0.99
# 3      Maria    Garcia female  0.89
```

For 1,000 names:
| Function | API | Cost | Notes |
|---|---|---|---|
| `get_ethnicities()` | NamePrism | Free | Rate-limited to 60/min (~17 min total) |
| `get_nationalities()` | NamePrism | Free | Rate-limited to 60/min (~17 min total) |
| `get_gender()` | NamSor v2 | Free | Uses 1,000 of 5,000 free units/month |
For 10,000 names:

- Ethnicities/Nationalities: Still free with NamePrism (takes ~3 hours)
- Gender: 10,000 units = $10 with NamSor (5,000 free + 5,000 paid)
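The arithmetic behind these estimates can be sketched for other sample sizes. The helper below, `estimate_query_budget()`, is hypothetical and not part of nomine. It assumes one NamePrism request per name at 60 requests per minute, one NamSor unit per name for gender, and a paid-unit price of $10 per 5,000 units, which is inferred from the 10,000-name figure above and may not match current NamSor pricing.

```r
# Hypothetical helper: rough time and cost for classifying n_names names.
estimate_query_budget <- function(n_names,
                                  requests_per_min = 60,
                                  free_units = 5000,
                                  usd_per_paid_unit = 10 / 5000) {
  stopifnot(is.numeric(n_names), length(n_names) == 1, n_names >= 0)

  # NamSor units beyond the monthly free allowance are billed.
  paid_units <- max(0, n_names - free_units)

  list(
    nameprism_minutes = n_names / requests_per_min,
    namsor_paid_units = paid_units,
    namsor_usd        = paid_units * usd_per_paid_unit
  )
}

small <- estimate_query_budget(1000)   # about 16.7 minutes, no paid units
large <- estimate_query_budget(10000)  # about 167 minutes, 5,000 paid units
```

The estimate ignores network latency and retries, so actual run times will be somewhat longer.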
The package uses NamePrism for ethnicity/nationality because it’s free and designed for academic research, while using NamSor v2 for gender because:

- Gender classification is computationally simpler (1 unit vs 10 units)
- 5,000 free gender classifications per month covers most research needs
- NamSor’s gender classifier is highly accurate and well-maintained
The `get_gender()` function now requires only a single `api_key` parameter instead of separate `secret` and `user` parameters.

Please use the issue tracker for problems, questions, or feature requests. If you would rather email with questions or comments, you can contact Charles Crabtree or Christian Chacua and they will try to address the issue.
If you would like to contribute to the package, that is great! We welcome pull requests and new developers.
Users and potential contributors can test the software with the example code provided in the documentation for each function.
Thanks to Karl Broman and Hadley Wickham for providing excellent free guides to building R packages.