Selected fields of metadata about each of the Project Gutenberg works. These
were collected using the gitenberg Python package, particularly the
pg_rdf_to_json function.
Format
A tbl_df (see tibble or dplyr) with one row for each work in Project Gutenberg and the following columns:
- gutenberg_id
Numeric ID, used to retrieve works from Project Gutenberg
- title
Title
- author
Author, if a single one given. Given as last name first (e.g. "Doyle, Arthur Conan")
- gutenberg_author_id
Project Gutenberg author ID
- language
Language ISO 639 code, separated by / if multiple. Two letter code if one exists, otherwise three letter. See https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes
- gutenberg_bookshelf
Which collection or collections this is found in, separated by / if multiple
- rights
Generally one of three options: "Public domain in the USA." (the most common by far), "Copyrighted. Read the copyright notice inside this book for details.", or "None"
- has_text
Whether there is a file containing digits followed by .txt in Project Gutenberg for this record (as opposed to, for example, audiobooks). If not, the work cannot be retrieved with gutenberg_download.
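Because language (and gutenberg_bookshelf) can hold several values separated by /, an exact comparison such as language == "en" misses multilingual works. The following is a minimal sketch of splitting such a field in base R. The toy data frame and the n_languages and has_english columns are hypothetical and only mimic the column layout described above; they are not part of the dataset.

```r
# Hypothetical toy rows that mimic the documented column layout
toy <- data.frame(
  gutenberg_id = c(1L, 2L, 3L),
  language = c("en", "en/fr", "enm"),
  has_text = c(TRUE, TRUE, FALSE),
  stringsAsFactors = FALSE
)

# Split the "/"-separated codes into one character vector per row
langs <- strsplit(toy$language, "/", fixed = TRUE)

# Number of languages listed for each work
toy$n_languages <- lengths(langs)

# TRUE when "en" is one of the listed codes, even for multilingual works
toy$has_english <- vapply(langs, function(x) "en" %in% x, logical(1))

# Works that list English and have a plain-text file
toy[toy$has_english & toy$has_text, ]
```

The same approach should carry over to gutenberg_bookshelf, since it uses the same separator.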
Details
To find the date on which this metadata was last updated, run attr(gutenberg_metadata, "date_updated").
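As a rough sketch of how the attribute might be used, the snippet below attaches a date_updated attribute to a hypothetical stand-in data frame and converts it to a Date to estimate the age of the snapshot. The toy_metadata object and the date value are assumptions for illustration; as.Date() is applied defensively because the attribute may be stored either as a Date or as a character string.

```r
# Hypothetical stand-in for the dataset, carrying the same attribute name
toy_metadata <- data.frame(gutenberg_id = 1:3)
attr(toy_metadata, "date_updated") <- "2022-11-04"

# as.Date() accepts both a Date and an ISO-formatted character string
updated <- as.Date(attr(toy_metadata, "date_updated"))

# Age of the snapshot in days, relative to today
days_old <- as.numeric(Sys.Date() - updated)
```

A large days_old value suggests that recently added works may be missing from the metadata.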
Examples
library(gutenbergr)
library(dplyr)
library(stringr)
gutenberg_metadata
#> # A tibble: 69,199 × 8
#> gutenberg_id title author guten…¹ langu…² guten…³ rights has_t…⁴
#> <int> <chr> <chr> <int> <chr> <chr> <chr> <lgl>
#> 1 1 "The Declaration … Jeffe… 1638 en Politi… Publi… TRUE
#> 2 2 "The United State… Unite… 1 en Politi… Publi… TRUE
#> 3 3 "John F. Kennedy'… Kenne… 1666 en NA Publi… TRUE
#> 4 4 "Lincoln's Gettys… Linco… 3 en US Civ… Publi… TRUE
#> 5 5 "The United State… Unite… 1 en United… Publi… TRUE
#> 6 6 "Give Me Liberty … Henry… 4 en Americ… Publi… TRUE
#> 7 7 "The Mayflower Co… NA NA en NA Publi… TRUE
#> 8 8 "Abraham Lincoln'… Linco… 3 en US Civ… Publi… TRUE
#> 9 9 "Abraham Lincoln'… Linco… 3 en US Civ… Publi… TRUE
#> 10 10 "The King James V… NA NA en Banned… Publi… TRUE
#> # … with 69,189 more rows, and abbreviated variable names ¹gutenberg_author_id,
#> # ²language, ³gutenberg_bookshelf, ⁴has_text
gutenberg_metadata %>%
  count(author, sort = TRUE)
#> # A tibble: 21,227 × 2
#> author n
#> <chr> <int>
#> 1 NA 4892
#> 2 Various 3798
#> 3 Anonymous 867
#> 4 Shakespeare, William 326
#> 5 Twain, Mark 235
#> 6 Lytton, Edward Bulwer Lytton, Baron 223
#> 7 Ebers, Georg 175
#> 8 Dickens, Charles 172
#> 9 Verne, Jules 169
#> 10 Balzac, Honoré de 151
#> # … with 21,217 more rows
# look for Shakespeare, excluding collections (containing "Works") and
# translations
shakespeare_metadata <- gutenberg_metadata %>%
  filter(
    author == "Shakespeare, William",
    language == "en",
    !str_detect(title, "Works"),
    has_text,
    !str_detect(rights, "Copyright")
  ) %>%
  # keep all columns so gutenberg_id is still available for downloading
  distinct(title, .keep_all = TRUE)
if (FALSE) {
  shakespeare_works <- gutenberg_download(shakespeare_metadata$gutenberg_id)
}
# note that the gutenberg_works() function filters for English
# non-copyrighted works and does de-duplication by default:
shakespeare_metadata2 <- gutenberg_works(
  author == "Shakespeare, William",
  !str_detect(title, "Works")
)
# date last updated
attr(gutenberg_metadata, "date_updated")
#> [1] "2022-11-04"