CRAN release: 2022-12-22
- Remove the `tokenize_tweets()` function, which is no longer supported.
CRAN release: 2018-03-21
- Add the `tokenize_ptb()` function for Penn Treebank tokenizations (@jrnold) (#12).
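A minimal sketch of what a Penn Treebank tokenization call looks like (the exact token splits are illustrative, not asserted):

```r
library(tokenizers)

# PTB-style tokenization differs from plain word tokenization:
# it splits contractions (e.g. "can't") into separate tokens and
# keeps punctuation as tokens, following Penn Treebank conventions.
tokenize_ptb("I can't do it.")
```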
- Add a function `chunk_text()` to split long documents into pieces (#30).
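A short sketch of splitting a long document into chunks; the `chunk_size` argument (assumed here to be measured in words, with a default of 100) controls the size of each piece:

```r
library(tokenizers)

# A synthetic "long" document of 250 repeated words.
long_doc <- paste(rep("word", 250), collapse = " ")

# Split into pieces of roughly 100 words each (chunk_size assumed).
chunks <- chunk_text(long_doc, chunk_size = 100)
length(chunks)  # one element per chunk
```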
- New functions to count words, characters, and sentences without tokenization (#36).
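Assuming these counting functions are named `count_words()`, `count_characters()`, and `count_sentences()`, usage is a single vectorized call per text, with no tokenized output produced:

```r
library(tokenizers)

x <- "The quick brown fox. It jumps over the lazy dog."
count_words(x)       # number of word tokens in x
count_characters(x)  # number of characters in x
count_sentences(x)   # number of sentences in x
```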
- New function `tokenize_tweets()` preserves usernames, hashtags, and URLs (@kbenoit) (#44).
- The `stopwords()` function has been removed in favor of using the stopwords package (#46).
- The package now complies with the basic recommendations of the Text Interchange Format. All tokenization functions are now methods. This enables them to take corpus inputs as either TIF-compliant named character vectors, named lists, or data frames. All outputs are still named lists of tokens, but these can be easily coerced to data frames of tokens using the `tif` package.
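A sketch of the TIF-compliant input behavior described above: passing a named character vector keeps the document ids on the output list (the corpus values here are made up for illustration):

```r
library(tokenizers)

# A TIF-compliant corpus: a named character vector of documents.
corpus <- c(doc1 = "First document.",
            doc2 = "Second document here.")

# Tokenization functions accept the corpus directly; the output is
# still a named list of tokens, one element per input document.
toks <- tokenize_words(corpus)
names(toks)  # document ids carry through to the output list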
- Add a new vignette “The Text Interchange Formats and the tokenizers Package” (#49).
- `tokenize_skip_ngrams()` has been improved to generate unigrams and bigrams, according to the skip definition (#24).
- C++98 has replaced the C++11 code used for n-gram generation, widening the range of compilers `tokenizers` supports (@ironholds) (#26).
- `tokenize_skip_ngrams()` now supports stopwords (#31).
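A sketch of a skip n-gram call with stopword removal, assuming the arguments are `n` (maximum n-gram size), `k` (skip distance), and `stopwords`:

```r
library(tokenizers)

# Generate skip n-grams up to trigrams with a skip distance of 1,
# dropping the stopword "two" before n-grams are formed.
tokenize_skip_ngrams("one two three four five",
                     n = 3, k = 1,
                     stopwords = "two")
```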
- If tokenizers fail to generate tokens for a particular entry, they return `NA` consistently.
- Keyboard interrupt checks have been added to Rcpp-backed functions to enable users to terminate them before completion (#37).
- `tokenize_words()` gains arguments to preserve or strip punctuation and numbers (#48).
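A sketch of the punctuation and number options, assuming the argument names are `strip_punct` and `strip_numeric`:

```r
library(tokenizers)

x <- "Well, well: 42 reasons!"
tokenize_words(x)                        # default: punctuation stripped
tokenize_words(x, strip_punct = FALSE)   # keep punctuation as tokens
tokenize_words(x, strip_numeric = TRUE)  # also drop numeric tokens
```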
- Fix `tokenize_ngrams()` to return properly marked UTF-8 strings on Windows (@patperry) (#58).
- `tokenize_tweets()` now removes stopwords prior to stripping punctuation, making its behavior more consistent with `tokenize_words()`.
CRAN release: 2016-08-29
- Add the
- Improvements to documentation.
CRAN release: 2016-04-14