
Download one or more works using a Project Gutenberg ID
Source: R/gutenberg_download.R
Download one or more works by their Project Gutenberg IDs into
a data frame with one row per line per work. This can be used to download
a single work of interest or multiple at a time. You can look up the
Gutenberg ID of a work using the gutenberg_works() function or the
gutenberg_metadata dataset.
Usage
gutenberg_download(
gutenberg_id,
mirror = NULL,
strip = TRUE,
meta_fields = NULL,
verbose = TRUE,
files = NULL,
...
)
Arguments
- gutenberg_id
A vector of Project Gutenberg IDs, or a data frame containing a gutenberg_id column, such as the result of a gutenberg_works() call
- mirror
Optionally, a mirror URL to retrieve the books from. By default, uses the mirror from gutenberg_get_mirror
- strip
Whether to strip suspected headers and footers using the gutenberg_strip function
- meta_fields
Additional fields, such as title and author, to add from gutenberg_metadata describing each book. This is useful when returning multiple works
- verbose
Whether to show messages about the Project Gutenberg mirror that was chosen
- files
A vector of .zip file paths. If given, this reads from the files rather than from the site. This is mostly used for testing when the Project Gutenberg website may not be available.
- ...
Extra arguments passed to gutenberg_strip, currently unused
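The arguments above can be combined in a single call. The following is a minimal sketch, not a canonical recipe: the IDs and the mirror URL are the ones that appear in the examples on this page, and the download is guarded so that it only runs in an interactive session with gutenbergr installed, since it needs network access.

```r
# A data frame with a gutenberg_id column can stand in for a plain vector of IDs.
ids <- data.frame(gutenberg_id = c(768L, 1260L))

# Guarded: the download only runs in an interactive session where
# gutenbergr is installed, because it needs network access.
if (interactive() && requireNamespace("gutenbergr", quietly = TRUE)) {
  books <- gutenbergr::gutenberg_download(
    ids,
    mirror = "http://aleph.gutenberg.org",  # mirror shown in the example output
    strip = FALSE,                           # keep suspected headers and footers
    meta_fields = c("title", "author"),      # extra columns from gutenberg_metadata
    verbose = FALSE                          # no messages about the chosen mirror
  )
}
```

Setting strip = FALSE here is only for illustration; the default of TRUE is usually what an analysis wants.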
Value
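A two-column tbl_df (a type of data frame; see the tibble or dplyr packages) with one row for each line of the text or texts. Any fields requested through meta_fields are added as extra columns. The two standard columns are:
- gutenberg_id
Integer column with the Project Gutenberg ID of each text
- text
Character column holding one line of text per row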
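Because the result has one row per line, a per-work line number can be derived directly from row order. The sketch below assumes dplyr is installed and uses a small hand-built table in place of a real download; its rows are abridged from the example output on this page, and the column name line is an illustrative choice, not something the function returns.

```r
library(dplyr)

# Hand-built stand-in with the same two columns as the return value.
downloaded <- tibble(
  gutenberg_id = c(105L, 105L, 105L, 1184L, 1184L),
  text = c("Persuasion", "", "by Jane Austen",
           "THE COUNT OF MONTE CRISTO", "")
)

# Number the lines within each work; row order is preserved inside each group.
numbered <- downloaded %>%
  group_by(gutenberg_id) %>%
  mutate(line = row_number()) %>%
  ungroup()
```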
Details
Note that if strip = TRUE, this tries to remove the
Gutenberg header and footer using the gutenberg_strip
function. This is not an exact process since headers and footers differ
between books. Before doing an in-depth analysis you may want to check
the start and end of each downloaded book.
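One way to check the start and end of each work is to pull the first and last few rows per gutenberg_id. This is a sketch under stated assumptions: it needs dplyr 1.0.0 or later for slice_head() and slice_tail(), and the table below is invented toy data (IDs 1 and 2, placeholder text) standing in for a real download.

```r
library(dplyr)

# Invented toy data in the shape returned by gutenberg_download().
books <- tibble(
  gutenberg_id = rep(c(1L, 2L), each = 4),
  text = c("A: line 1", "A: line 2", "A: line 3", "A: line 4",
           "B: line 1", "B: line 2", "B: line 3", "B: line 4")
)

# First two lines of each work: leftover header text would show up here.
first_lines <- books %>%
  group_by(gutenberg_id) %>%
  slice_head(n = 2) %>%
  ungroup()

# Last two lines of each work: leftover footer text would show up here.
last_lines <- books %>%
  group_by(gutenberg_id) %>%
  slice_tail(n = 2) %>%
  ungroup()
```

On real downloads a larger n (a few dozen lines) is likely more useful, since headers and footers can span many lines.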
Examples
library(dplyr)
#>
#> Attaching package: ‘dplyr’
#> The following objects are masked from ‘package:stats’:
#>
#>     filter, lag
#> The following objects are masked from ‘package:base’:
#>
#>     intersect, setdiff, setequal, union
# download The Count of Monte Cristo
gutenberg_download(1184)
#> Determining mirror for Project Gutenberg from https://www.gutenberg.org/robot/harvest
#> Using mirror http://aleph.gutenberg.org
#> # A tibble: 61,308 × 2
#>    gutenberg_id text
#>           <int> <chr>
#>  1         1184 "THE COUNT OF MONTE CRISTO"
#>  2         1184 ""
#>  3         1184 ""
#>  4         1184 ""
#>  5         1184 "by Alexandre Dumas [père]"
#>  6         1184 ""
#>  7         1184 ""
#>  8         1184 ""
#>  9         1184 ""
#> 10         1184 ""
#> # … with 61,298 more rows
# download two books: Wuthering Heights and Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#>    gutenberg_id text                title
#>           <int> <chr>               <chr>
#>  1          768 "Wuthering Heights" Wuthering Heights
#>  2          768 ""                  Wuthering Heights
#>  3          768 "by Emily Brontë"   Wuthering Heights
#>  4          768 ""                  Wuthering Heights
#>  5          768 ""                  Wuthering Heights
#>  6          768 ""                  Wuthering Heights
#>  7          768 ""                  Wuthering Heights
#>  8          768 "CHAPTER I"         Wuthering Heights
#>  9          768 ""                  Wuthering Heights
#> 10          768 ""                  Wuthering Heights
#> # … with 33,333 more rows
books %>% count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights           12342
# download all books from Jane Austen
austen <- gutenberg_works(author == "Austen, Jane") %>%
  gutenberg_download(meta_fields = "title")
austen
#> # A tibble: 170,600 × 3
#>    gutenberg_id text             title
#>           <int> <chr>            <chr>
#>  1          105 "Persuasion"     Persuasion
#>  2          105 ""               Persuasion
#>  3          105 "by Jane Austen" Persuasion
#>  4          105 ""               Persuasion
#>  5          105 "(1818)"         Persuasion
#>  6          105 ""               Persuasion
#>  7          105 ""               Persuasion
#>  8          105 "Contents"       Persuasion
#>  9          105 ""               Persuasion
#> 10          105 " CHAPTER I."    Persuasion
#> # … with 170,590 more rows
austen %>%
  count(title)
#> # A tibble: 10 × 2
#>    title                                                                       n
#>    <chr>                                                                   <int>
#>  1 "Emma"                                                                  16488
#>  2 "Lady Susan"                                                             2540
#>  3 "Love and Freindship [sic]"                                              3714
#>  4 "Mansfield Park"                                                        15670
#>  5 "Northanger Abbey"                                                       7991
#>  6 "Persuasion"                                                             8353
#>  7 "Pride and Prejudice"                                                   14529
#>  8 "Sense and Sensibility"                                                 12673
#>  9 "The Complete Project Gutenberg Works of Jane Austen\nA Linked Index o… 80073
#> 10 "The Letters of Jane Austen\r\nSelected from the compilation of her gr…  8569