Tokenizers

A collection of functions with a consistent interface to convert natural language text into tokens.

Details

The tokenizers in this package share a consistent interface. Each takes either a character vector of any length or a list in which every element is a character vector of length one; each element is treated as one text. Each function returns a list of the same length as the input, where each element holds the tokens generated from the corresponding text. If the input character vector or list is named, the names are preserved.
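The interface described above can be illustrated with a short sketch. It assumes the tokenizers package is installed and uses tokenize_words() as a representative tokenizer; the sample texts and the names "first" and "second" are made up for illustration.

```r
library(tokenizers)

# A named character vector: each element is one text (sample data, not from the package)
texts <- c(first = "The quick brown fox.", second = "It jumps over the lazy dog.")

# Returns a list with one element per input text
tokens_vec <- tokenize_words(texts)

# The same texts supplied as a list of length-one character vectors
tokens_list <- tokenize_words(as.list(texts))

length(tokens_vec)   # same length as the input
names(tokens_vec)    # names carried over from the input
str(tokens_list)     # each element is a character vector of tokens
```

Because the result is always a list aligned with the input, it can be passed directly to functions such as lapply() or lengths() to work with each text's tokens in turn.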

Author

Maintainer: Thomas Charlon charlon@protonmail.com (ORCID)

Authors:

Lincoln Mullen lincoln@lincolnmullen.com (ORCID)

Other contributors:

Os Keyes ironholds@gmail.com (ORCID) [contributor]
Dmitriy Selivanov selivanov.dmitriy@gmail.com [contributor]
Jeffrey Arnold jeffrey.arnold@gmail.com (ORCID) [contributor]
Kenneth Benoit kbenoit@lse.ac.uk (ORCID) [contributor]
