The character shingle tokenizer functions like an n-gram tokenizer, except the units that are shingled are characters instead of words. Options to the function let you determine whether non-alphanumeric characters like punctuation should be retained or discarded.
Usage
tokenize_character_shingles(
x,
n = 3L,
n_min = n,
lowercase = TRUE,
strip_non_alphanum = TRUE,
simplify = FALSE
)
Arguments
- x
A character vector or a list of character vectors to be tokenized into character shingles. If x is a character vector, it can be of any length, and each element will be tokenized separately. If x is a list of character vectors, each element of the list should have a length of 1.
- n
The number of characters in each shingle. This must be an integer greater than or equal to 1.
- n_min
This must be an integer greater than or equal to 1, and less than or equal to n.
- lowercase
Should the characters be made lower case?
- strip_non_alphanum
Should punctuation and white space be stripped?
- simplify
FALSE by default so that a consistent value is returned regardless of the length of the input. If TRUE, then an input with a single element will return a character vector of tokens instead of a list.
Value
A list of character vectors containing the tokens, with one element in the list for each element that was passed as input. If simplify = TRUE and only a single element was passed as input, then the output is a character vector of tokens.
Examples
x <- c("Now is the hour of our discontent")
tokenize_character_shingles(x)
#> [[1]]
#> [1] "now" "owi" "wis" "ist" "sth" "the" "heh" "eho" "hou" "our" "uro" "rof"
#> [13] "ofo" "fou" "our" "urd" "rdi" "dis" "isc" "sco" "con" "ont" "nte" "ten"
#> [25] "ent"
#>
tokenize_character_shingles(x, n = 5)
#> [[1]]
#> [1] "nowis" "owist" "wisth" "isthe" "stheh" "theho" "hehou" "ehour" "houro"
#> [10] "ourof" "urofo" "rofou" "ofour" "fourd" "ourdi" "urdis" "rdisc" "disco"
#> [19] "iscon" "scont" "conte" "onten" "ntent"
#>
tokenize_character_shingles(x, n = 5, strip_non_alphanum = FALSE)
#> [[1]]
#> [1] "now i" "ow is" "w is " " is t" "is th" "s the" " the " "the h" "he ho"
#> [10] "e hou" " hour" "hour " "our o" "ur of" "r of " " of o" "of ou" "f our"
#> [19] " our " "our d" "ur di" "r dis" " disc" "disco" "iscon" "scont" "conte"
#> [28] "onten" "ntent"
#>
tokenize_character_shingles(x, n = 5, n_min = 3, strip_non_alphanum = FALSE)
#> [[1]]
#> [1] "now" "now " "now i" "ow " "ow i" "ow is" "w i" "w is" "w is "
#> [10] " is" " is " " is t" "is " "is t" "is th" "s t" "s th" "s the"
#> [19] " th" " the" " the " "the" "the " "the h" "he " "he h" "he ho"
#> [28] "e h" "e ho" "e hou" " ho" " hou" " hour" "hou" "hour" "hour "
#> [37] "our" "our " "our o" "ur " "ur o" "ur of" "r o" "r of" "r of "
#> [46] " of" " of " " of o" "of " "of o" "of ou" "f o" "f ou" "f our"
#> [55] " ou" " our" " our " "our" "our " "our d" "ur " "ur d" "ur di"
#> [64] "r d" "r di" "r dis" " di" " dis" " disc" "dis" "disc" "disco"
#> [73] "isc" "isco" "iscon" "sco" "scon" "scont" "con" "cont" "conte"
#> [82] "ont" "onte" "onten" "nte" "nten" "ntent" "ten" "tent" "ent"
#>
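The simplify argument described above is not demonstrated in the examples. A minimal sketch, assuming the behavior documented in the Arguments and Value sections (spaces are stripped by default before shingling, and a single-element input with simplify = TRUE yields a bare character vector rather than a list):

```r
library(tokenizers)

# With a single-element input and simplify = TRUE, the tokens come back
# as a character vector instead of a one-element list. By default
# strip_non_alphanum = TRUE, so "ab cd" is shingled as "abcd".
tokenize_character_shingles("ab cd", n = 2, simplify = TRUE)
#> [1] "ab" "bc" "cd"
```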