
Strip header and footer content from a Project Gutenberg book
Source: R/gutenberg_download.R
gutenberg_strip.Rd
Strip header and footer content from a Project Gutenberg book. The header and footer are located using formatting heuristics, so the result may not be perfect. Tables of contents, prologues, and other text that appears at the start of a book are not stripped.
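The description does not spell out which formatting cues are used, so the following is only a minimal sketch of one plausible heuristic, not the package's actual implementation. It assumes the `*** START OF ...` and `*** END OF ...` marker lines that commonly appear in Project Gutenberg plain-text files. The helper name `strip_between_markers` and the sample text are made up for illustration.

```r
# Hypothetical helper, not part of gutenbergr: keep only the lines that sit
# between the first "*** START OF" marker and the first "*** END OF" marker.
strip_between_markers <- function(text) {
  start <- grep("^\\*\\*\\* ?START OF", text)
  end <- grep("^\\*\\*\\* ?END OF", text)

  # If a marker is missing or the markers are out of order, change nothing.
  if (length(start) == 0 || length(end) == 0 || start[1] >= end[1]) {
    return(text)
  }
  # Adjacent markers mean there is no body text between them.
  if (end[1] - start[1] < 2) {
    return(character(0))
  }
  text[(start[1] + 1):(end[1] - 1)]
}

# Made-up sample input that imitates the layout of a downloaded book.
sample_text <- c(
  "The Project Gutenberg eBook of An Example",
  "",
  "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "Chapter 1",
  "Body text of the book.",
  "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "Most people start at our website"
)
stripped <- strip_between_markers(sample_text)
print(stripped)
```

A marker-based approach like this fails quietly on files that lack the markers, which is one reason a heuristic stripper cannot be guaranteed to be perfect.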
Examples
# \donttest{
library(dplyr)
library(gutenbergr)
book <- gutenberg_works(title == "Pride and Prejudice") %>%
  gutenberg_download(strip = FALSE)
head(book$text, 10)
#> [1] "The Project Gutenberg eBook of Pride and prejudice, by Jane Austen"
#> [2] ""
#> [3] "This eBook is for the use of anyone anywhere in the United States and"
#> [4] "most other parts of the world at no cost and with almost no restrictions"
#> [5] "whatsoever. You may copy it, give it away or re-use it under the terms"
#> [6] "of the Project Gutenberg License included with this eBook or online at"
#> [7] "www.gutenberg.org. If you are not located in the United States, you"
#> [8] "will have to check the laws of the country where you are located before"
#> [9] "using this eBook."
#> [10] ""
tail(book$text, 10)
#> [1] "necessarily keep eBooks in compliance with any particular paper"
#> [2] "edition."
#> [3] ""
#> [4] "Most people start at our website which has the main PG search"
#> [5] "facility: www.gutenberg.org"
#> [6] ""
#> [7] "This website includes information about Project Gutenberg-tm,"
#> [8] "including how to make donations to the Project Gutenberg Literary"
#> [9] "Archive Foundation, how to help produce our new eBooks, and how to"
#> [10] "subscribe to our email newsletter to hear about new eBooks."
text_stripped <- gutenberg_strip(book$text)
head(text_stripped, 10)
#> [1] " [Illustration:"
#> [2] ""
#> [3] " GEORGE ALLEN"
#> [4] " PUBLISHER"
#> [5] ""
#> [6] " 156 CHARING CROSS ROAD"
#> [7] " LONDON"
#> [8] ""
#> [9] " RUSKIN HOUSE"
#> [10] " ]"
tail(text_stripped, 10)
#> [1] ""
#> [2] " THE"
#> [3] " END"
#> [4] " ]"
#> [5] ""
#> [6] ""
#> [7] ""
#> [8] ""
#> [9] " CHISWICK PRESS:--CHARLES WHITTINGHAM AND CO."
#> [10] " TOOKS COURT, CHANCERY LANE, LONDON."
# }
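As the output above shows, the stripped text can still begin with publisher and illustration lines, because front matter belonging to the book is left in place. The following is a rough sketch of how a caller might trim such material afterwards, assuming the body starts at the first line that looks like a chapter heading. The helper `drop_front_matter`, its default pattern, and the sample lines after the publisher block are illustrative assumptions, not part of gutenbergr.

```r
# Hypothetical helper, not part of gutenbergr: drop every line before the
# first one that matches a heading pattern (by default "Chapter <number>").
drop_front_matter <- function(text,
                              pattern = "^[[:space:]]*chapter[[:space:]]+[ivxlc0-9]+") {
  first <- grep(pattern, text, ignore.case = TRUE)
  # No heading found: return the text unchanged rather than guessing.
  if (length(first) == 0) {
    return(text)
  }
  text[first[1]:length(text)]
}

# Made-up sample that imitates the start of a stripped book.
stripped_sample <- c(
  " [Illustration:",
  "",
  " GEORGE ALLEN",
  " PUBLISHER",
  "",
  "CHAPTER I.",
  "First line of the chapter."
)
body <- drop_front_matter(stripped_sample)
print(body)
```

If the book has a table of contents that lists chapter titles, the first match may fall inside it, so the pattern usually needs adjusting for each book.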