This package describes and validates formats for storing common objects arising in text analysis as native R objects. Representations of a text corpus, a document-term matrix, and tokenized text are included. The corpus and tokens objects each have multiple valid formats. Packages compliant with the tif proposal should accept all valid formats as input and, where applicable, should either return their outputs directly in at least one of the formats or provide functions for converting their outputs into one. The tokenized text format is extensible to include other annotations, such as part-of-speech tags and named entities.
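As a rough illustration of what "multiple valid formats" can look like, the sketch below builds a small corpus and its tokens in two shapes each, using base R only. The column names (`doc_id`, `text`, `token`), the named-vector and named-list layouts, the `pos` annotation column, and the helper `is_corpus_df_sketch()` are assumptions made for this example rather than definitions quoted from the text above; the package's own validators should be treated as authoritative. The document-term matrix is omitted to keep the example free of extra dependencies.

```r
# Corpus as a data frame: one row per document.
# Assumption: the first two columns are character columns named
# doc_id and text, and document ids are unique.
corpus_df <- data.frame(
  doc_id = c("doc1", "doc2"),
  text = c("Dogs bark.", "Cats sleep."),
  stringsAsFactors = FALSE
)

# The same corpus as a named character vector (names are document ids).
corpus_chr <- setNames(corpus_df$text, corpus_df$doc_id)

# Tokens as a named list of character vectors.
# A naive whitespace split stands in for a real tokenizer, so
# punctuation stays attached to the preceding word.
tokens_list <- strsplit(corpus_chr, "\\s+")

# Tokens as a data frame: one row per token.
tokens_df <- data.frame(
  doc_id = rep(names(tokens_list), lengths(tokens_list)),
  token = unlist(tokens_list, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Extra annotations can be added as further columns.
# The tag values here are illustrative, not the output of a real tagger.
tokens_df$pos <- c("NOUN", "VERB", "NOUN", "VERB")

# Hypothetical validity check for the data frame corpus format,
# written for this example only.
is_corpus_df_sketch <- function(x) {
  is.data.frame(x) &&
    ncol(x) >= 2L &&
    identical(names(x)[1:2], c("doc_id", "text")) &&
    is.character(x$doc_id) &&
    is.character(x$text) &&
    !anyDuplicated(x$doc_id)
}
```

Under these assumptions, `is_corpus_df_sketch(corpus_df)` returns `TRUE`, and the list and data frame token formats carry the same information, so converting between them is a matter of reshaping rather than recomputation.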
Author
Maintainer: Taylor Arnold taylor.arnold@acm.org
Authors:
Ken Benoit k.r.benoit@lse.ac.uk
Lincoln Mullen lmullen@gmu.edu
Adam Obeng contact@adamobeng.com
rOpenSci Text Workshop Participants (2017)