The gutenbergr package helps you download and process public domain works from Project Gutenberg. This vignette introduces the package’s metadata datasets and core downloading functionality.
Exploring the Metadata
gutenberg_metadata
The gutenberg_metadata dataset contains information
about each work in the Project Gutenberg collection:
library(gutenbergr)
library(dplyr)
library(stringr)

gutenberg_metadata
#> # A tibble: 81,068 × 8
#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
#> <int> <chr> <chr> <int> <fct> <chr>
#> 1 1 "The De… Jeffe… 1638 en Politics/American …
#> 2 2 "The Un… Unite… 1 en Politics/American …
#> 3 3 "John F… Kenne… 1666 en Category: Essays, …
#> 4 4 "Lincol… Linco… 3 en US Civil War/Categ…
#> 5 5 "The Un… Unite… 1 en United States/Poli…
#> 6 6 "Give M… Henry… 4 en American Revolutio…
#> 7 7 "The Ma… NA NA en Category: History …
#> 8 8 "Abraha… Linco… 3 en US Civil War/Categ…
#> 9 9 "Abraha… Linco… 3 en US Civil War/Categ…
#> 10 10 "The Ki… NA NA en Banned Books List …
#> # ℹ 81,058 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
You can filter this to find specific works:
gutenberg_metadata |>
  filter(title == "Persuasion")
#> # A tibble: 3 × 8
#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
#> <int> <chr> <chr> <int> <fct> <chr>
#> 1 105 Persuasi… Auste… 68 en "Category: Novels/…
#> 2 22963 Persuasi… Auste… 68 en ""
#> 3 36777 Persuasi… Auste… 68 fr "FR Littérature/Ca…
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
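An exact match on title misses editions whose titles differ slightly. The sketch below shows one way to loosen the match with a case-insensitive pattern while keeping only English works that have text. The name persuasion_like is illustrative, and the pattern is just an example.

```r
library(dplyr)
library(stringr)
library(gutenbergr)

# Case-insensitive partial match on the title, restricted to English
# works that have downloadable text. Rows with a missing title are
# dropped because str_detect() returns NA for them.
persuasion_like <- gutenberg_metadata |>
  filter(
    str_detect(title, regex("persuasion", ignore_case = TRUE)),
    language == "en",
    has_text
  ) |>
  select(gutenberg_id, title, author)

persuasion_like
```

A partial match can also pick up unrelated works, so it is worth checking the titles before downloading anything.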
The metadata currently in the package was last updated on 11 January 2026.
gutenberg_works()
In most analyses, you’ll want to filter for English works, avoid
duplicates, and include only books with downloadable text. The
gutenberg_works() function does this automatically:
gutenberg_works()
#> # A tibble: 62,685 × 8
#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
#> <int> <chr> <chr> <int> <fct> <chr>
#> 1 1 "The De… Jeffe… 1638 en Politics/American …
#> 2 2 "The Un… Unite… 1 en Politics/American …
#> 3 3 "John F… Kenne… 1666 en Category: Essays, …
#> 4 4 "Lincol… Linco… 3 en US Civil War/Categ…
#> 5 5 "The Un… Unite… 1 en United States/Poli…
#> 6 6 "Give M… Henry… 4 en American Revolutio…
#> 7 7 "The Ma… NA NA en Category: History …
#> 8 8 "Abraha… Linco… 3 en US Civil War/Categ…
#> 9 9 "Abraha… Linco… 3 en US Civil War/Categ…
#> 10 10 "The Ki… NA NA en Banned Books List …
#> # ℹ 62,675 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
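The English-only default can be changed. The sketch below assumes the languages argument described in the package reference for gutenberg_works(); check ?gutenberg_works for the version you have installed. The name french_works is illustrative.

```r
library(gutenbergr)

# Request French-language works instead of the English default.
# The other defaults (text available, duplicates removed) still apply.
french_works <- gutenberg_works(languages = "fr")

# Confirm which language codes came back
table(droplevels(french_works$language))
```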
You can also filter directly within the function:
gutenberg_works(author == "Austen, Jane")
#> # A tibble: 14 × 8
#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
#> <int> <chr> <chr> <int> <fct> <chr>
#> 1 105 "Persua… Auste… 68 en "Category: Novels/…
#> 2 121 "Northa… Auste… 68 en "Gothic Fiction/Ca…
#> 3 141 "Mansfi… Auste… 68 en "Category: Novels/…
#> 4 158 "Emma" Auste… 68 en "Category: Novels/…
#> 5 161 "Sense … Auste… 68 en "Category: Romance…
#> 6 946 "Lady S… Auste… 68 en "Category: Novels/…
#> 7 1212 "Love a… Auste… 68 en "Category: Romance…
#> 8 1342 "Pride … Auste… 68 en "Best Books Ever L…
#> 9 31100 "The Co… Auste… 68 en "Category: Romance…
#> 10 37431 "Pride … Auste… 68 en "Category: Plays/F…
#> 11 42078 "The Le… Auste… 68 en "Category: Biograp…
#> 12 63569 "The Wa… Auste… 68 en "Category: Novels/…
#> 13 74233 "Fragme… Auste… 68 en "Category: Novels/…
#> 14 77117 "The Wa… Auste… 68 en ""
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
# Using regular expressions
gutenberg_works(str_detect(author, "Austen"))
#> # A tibble: 24 × 8
#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
#> <int> <chr> <chr> <int> <fct> <chr>
#> 1 105 Persuas… Auste… 68 en Category: Novels/C…
#> 2 121 Northan… Auste… 68 en Gothic Fiction/Cat…
#> 3 141 Mansfie… Auste… 68 en Category: Novels/C…
#> 4 158 Emma Auste… 68 en Category: Novels/C…
#> 5 161 Sense a… Auste… 68 en Category: Romance/…
#> 6 946 Lady Su… Auste… 68 en Category: Novels/C…
#> 7 1212 Love an… Auste… 68 en Category: Romance/…
#> 8 1342 Pride a… Auste… 68 en Best Books Ever Li…
#> 9 17797 Memoir … Auste… 7603 en Category: Biograph…
#> 10 22536 Jane Au… Auste… 25392 en Category: Biograph…
#> # ℹ 14 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
# Multiple conditions
gutenberg_works(author == "Dickens, Charles", has_text == TRUE)
#> # A tibble: 93 × 8
#> gutenberg_id title author gutenberg_author_id language gutenberg_bookshelf
#> <int> <chr> <chr> <int> <fct> <chr>
#> 1 46 "A Chri… Dicke… 37 en Children's Literat…
#> 2 98 "A Tale… Dicke… 37 en Historical Fiction…
#> 3 564 "The My… Dicke… 37 en Mystery Fiction/Ca…
#> 4 580 "The Pi… Dicke… 37 en Best Books Ever Li…
#> 5 588 "Master… Dicke… 37 en Category: Novels/C…
#> 6 644 "The Ha… Dicke… 37 en Christmas/Category…
#> 7 650 "Pictur… Dicke… 37 en Category: Travel W…
#> 8 653 "The Ch… Dicke… 37 en Category: Novels/C…
#> 9 675 "Americ… Dicke… 37 en Category: Travel W…
#> 10 676 "The Ba… Dicke… 37 en Christmas/Category…
#> # ℹ 83 more rows
#> # ℹ 2 more variables: rights <fct>, has_text <lgl>
gutenberg_subjects
The gutenberg_subjects dataset pairs works with Library
of Congress classifications and subject headings:
gutenberg_subjects
#> # A tibble: 260,915 × 3
#> gutenberg_id subject_type subject
#> <int> <fct> <chr>
#> 1 1 lcsh United States -- History -- Revolution, 1775-1783 …
#> 2 1 lcsh United States. Declaration of Independence
#> 3 1 lcc E201
#> 4 1 lcc JK
#> 5 2 lcsh Civil rights -- United States -- Sources
#> 6 2 lcsh United States. Constitution. 1st-10th Amendments
#> 7 2 lcc JK
#> 8 2 lcc KF
#> 9 3 lcsh United States -- Foreign relations -- 1961-1963
#> 10 3 lcsh Presidents -- United States -- Inaugural addresses
#> # ℹ 260,905 more rows
This is useful for finding works by genre or topic:
# Find detective stories
gutenberg_subjects |>
  filter(subject == "Detective and mystery stories")
#> # A tibble: 974 × 3
#> gutenberg_id subject_type subject
#> <int> <fct> <chr>
#> 1 170 lcsh Detective and mystery stories
#> 2 173 lcsh Detective and mystery stories
#> 3 244 lcsh Detective and mystery stories
#> 4 305 lcsh Detective and mystery stories
#> 5 330 lcsh Detective and mystery stories
#> 6 481 lcsh Detective and mystery stories
#> 7 547 lcsh Detective and mystery stories
#> 8 863 lcsh Detective and mystery stories
#> 9 905 lcsh Detective and mystery stories
#> 10 1155 lcsh Detective and mystery stories
#> # ℹ 964 more rows
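# Find Sherlock Holmes stories with a partial match
gutenberg_subjects |>
  filter(grepl("Holmes, Sherlock", subject))
#> # A tibble: 59 × 3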
#> gutenberg_id subject_type subject
#> <int> <fct> <chr>
#> 1 108 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 2 221 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 3 244 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 4 834 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 5 1661 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 6 2097 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 7 2343 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 8 2344 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 9 2345 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> 10 2346 lcsh Holmes, Sherlock (Fictitious character) -- Fiction
#> # ℹ 49 more rows
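If you do not yet know which subject heading to filter on, counting the headings is a reasonable first step. The sketch below ranks Library of Congress subject headings by how many works carry them. The name top_subjects is illustrative.

```r
library(dplyr)
library(gutenbergr)

# Count works per subject heading (lcsh rows only), most common first
top_subjects <- gutenberg_subjects |>
  filter(subject_type == "lcsh") |>
  count(subject, sort = TRUE)

head(top_subjects, 10)
```

The exact strings in the subject column can then be copied into a filter() call like the ones above.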
You can join this with gutenberg_works() to download
books by subject:
# Get IDs of detective stories
detective_ids <- gutenberg_subjects |>
  filter(subject == "Detective and mystery stories") |>
  inner_join(gutenberg_works(), by = "gutenberg_id") |>
  pull(gutenberg_id)
# Download a sample
gutenberg_download(detective_ids[1:5], meta_fields = c("title", "author"))
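Before downloading, it can help to see which works a subject filter actually selects. This sketch uses semi_join() so that the result keeps only the gutenberg_works() columns and one row per work. The name detective_works is illustrative.

```r
library(dplyr)
library(gutenbergr)

# Subject rows for the heading of interest
detective_subjects <- gutenberg_subjects |>
  filter(subject == "Detective and mystery stories")

# Keep the works that have a matching subject row, without adding
# any columns from gutenberg_subjects
detective_works <- gutenberg_works() |>
  semi_join(detective_subjects, by = "gutenberg_id") |>
  select(gutenberg_id, title, author)

detective_works
```

No network access is needed for this step, since it only uses the metadata bundled with the package.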
gutenberg_authors
The gutenberg_authors dataset contains author
information including aliases and birth/death years:
gutenberg_authors
#> # A tibble: 26,550 × 7
#> gutenberg_author_id author alias birthdate deathdate wikipedia aliases
#> <int> <chr> <chr> <int> <int> <chr> <chr>
#> 1 1 United States U.S.… NA NA https://… U.S.A.
#> 2 3 Lincoln, Abr… NA 1809 1865 https://… United…
#> 3 4 Henry, Patri… NA 1736 1799 https://… NA
#> 4 5 Adam, Paul NA 1849 1931 https://… NA
#> 5 7 Carroll, Lew… Dodg… 1832 1898 https://… Dodgso…
#> 6 8 United State… NA NA NA https://… Agency…
#> 7 9 Melville, He… Melv… 1819 1891 https://… Melvil…
#> 8 10 Barrie, J. M… NA 1860 1937 https://… Barrie…
#> 9 11 Church of Je… NA NA NA https://… NA
#> 10 12 Smith, Josep… Smit… 1805 1844 https://… Smith,…
#> # ℹ 26,540 more rows
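Because birthdate and deathdate are stored as integer years, simple arithmetic works on them directly. The sketch below derives an approximate lifespan for authors with both years recorded. The name author_lifespans and the lifespan column are illustrative.

```r
library(dplyr)
library(gutenbergr)

# Approximate lifespan in years, for authors with both dates known.
# Only years are stored, so the result can be off by one.
author_lifespans <- gutenberg_authors |>
  filter(!is.na(birthdate), !is.na(deathdate)) |>
  mutate(lifespan = deathdate - birthdate) |>
  select(gutenberg_author_id, author, birthdate, deathdate, lifespan)

author_lifespans
```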
This can be useful for filtering by author characteristics:
# Find works by authors born in the 19th century
nineteenth_century_gutenberg_authors <- gutenberg_authors |>
  filter(birthdate >= 1800, birthdate < 1900) |>
  inner_join(gutenberg_works(), by = "gutenberg_author_id")
Downloading Books
Single Book
Download a book using its Gutenberg ID with
gutenberg_download():
persuasion <- gutenberg_download(105, meta_fields = c("title", "author"))
persuasion
#> # A tibble: 8,357 × 4
#> gutenberg_id text title author
#> <int> <chr> <chr> <chr>
#> 1 105 "Persuasion" Persuasion Austen, Jane
#> 2 105 "" Persuasion Austen, Jane
#> 3 105 "" Persuasion Austen, Jane
#> 4 105 "by Jane Austen" Persuasion Austen, Jane
#> 5 105 "" Persuasion Austen, Jane
#> 6 105 "(1818)" Persuasion Austen, Jane
#> 7 105 "" Persuasion Austen, Jane
#> 8 105 "" Persuasion Austen, Jane
#> 9 105 "" Persuasion Austen, Jane
#> 10 105 "" Persuasion Austen, Jane
#> # ℹ 8,347 more rows
The result is a tibble with:
- gutenberg_id: the book’s ID
- text: one row per line of text
- title and author: the metadata columns requested through meta_fields
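Since each row is one line of text, blank lines appear as empty strings. A common first step is to record the original line position and then drop the blanks. The sketch below uses a small hand-built tibble as a stand-in for a download, so it runs offline. The rows and the names book and book_lines are hypothetical.

```r
library(dplyr)

# Hand-built stand-in for a gutenberg_download() result (hypothetical rows)
book <- tibble(
  gutenberg_id = 105L,
  text = c("Persuasion", "", "by Jane Austen", "", "(1818)")
)

book_lines <- book |>
  mutate(line = row_number()) |>  # remember each line's original position
  filter(text != "")              # then drop the blank lines

book_lines
```

Numbering before filtering keeps line positions comparable with the original text.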
Multiple Books
Download multiple books by providing a vector of IDs:
books <- gutenberg_download(c(105, 109))
books
#> # A tibble: 9,579 × 4
#> gutenberg_id text title author
#> <int> <chr> <chr> <chr>
#> 1 109 "Renascence and Other Poems" Renascence, and Other Poems Millay…
#> 2 109 "" Renascence, and Other Poems Millay…
#> 3 109 "" Renascence, and Other Poems Millay…
#> 4 109 "by" Renascence, and Other Poems Millay…
#> 5 109 "" Renascence, and Other Poems Millay…
#> 6 109 "Edna St. Vincent Millay" Renascence, and Other Poems Millay…
#> 7 109 "" Renascence, and Other Poems Millay…
#> 8 109 "" Renascence, and Other Poems Millay…
#> 9 109 "" Renascence, and Other Poems Millay…
#> 10 109 "" Renascence, and Other Poems Millay…
#> # ℹ 9,569 more rows
Adding Metadata
Use the meta_fields argument to include additional
information:
books <- gutenberg_download(c(105, 109), meta_fields = c("title", "author"))
books |>
  count(title)
#> # A tibble: 2 × 2
#> title n
#> <chr> <int>
#> 1 Persuasion 8357
#> 2 Renascence, and Other Poems 1222
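If text has already been downloaded without metadata, the same columns can be attached afterwards by joining on gutenberg_id. This is a sketch of the idea, not a description of how meta_fields is implemented. The tibble books_plain is a hypothetical stand-in for a download, so no network access is needed.

```r
library(dplyr)
library(gutenbergr)

# Hypothetical stand-in for text downloaded without meta_fields
books_plain <- tibble(
  gutenberg_id = c(105L, 105L, 109L),
  text = c("Persuasion", "by Jane Austen", "Renascence and Other Poems")
)

# Attach selected metadata columns by joining on the shared ID
books_with_meta <- books_plain |>
  left_join(
    select(gutenberg_metadata, gutenberg_id, title, language),
    by = "gutenberg_id"
  )

books_with_meta
```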
Downloading from gutenberg_works()
You can pipe the output of gutenberg_works() directly
into gutenberg_download():
# Download all of Aristotle's works with titles
aristotle_books <- gutenberg_works(author == "Aristotle") |>
  gutenberg_download(meta_fields = "title")
What’s Next?
Now that you have book texts as tibbles, you can:
- Perform text analysis with the tidytext package
- See the Text Mining Example vignette for a complete analysis workflow
- Explore the Natural Language Processing CRAN View for more text analysis packages
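As a first taste of the tidytext route, the sketch below splits lines into one word per row and counts them. It assumes the tidytext package is installed and uses a small hand-built tibble in place of a real download. The names book and word_counts are hypothetical.

```r
library(dplyr)
library(tidytext)

# Hand-built stand-in for a downloaded book (hypothetical rows)
book <- tibble(
  gutenberg_id = 105L,
  text = c("Persuasion", "", "by Jane Austen")
)

# One row per word (lowercased by default), then count occurrences
word_counts <- book |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)

word_counts
```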
Additional Resources
- Match Wikipedia data with WikipediR or wikipediatrend
- Parse author names with humaniformat
- Predict gender from names with gender
