For all sequences in one or more clusters, calculate the frequency of individual words in either the sequence definitions or the reported feature names.

calc_wrdfrq(phylota, cid, min_frq = 0.1, min_nchar = 1,
  type = c("dfln", "nm"), ignr_pttrn = "[^a-z0-9]")

Arguments

phylota

Phylota object

cid

Cluster ID(s)

min_frq

Minimum frequency for a word to be returned

min_nchar

Minimum number of characters for a word

type

Text to search: sequence definitions ('dfln') or feature names ('nm')

ignr_pttrn

Regular expression matching the text to ignore

Value

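A list with one element per cluster ID; each element is a named numeric vector of word frequencies (empty if no word reaches min_frq).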

Details

By default, anything that is not alphanumeric is ignored. 'dfln' and 'nm' match the slot names in a SeqRec, see list_seqrec_slots().
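The effect of the default ignore pattern can be illustrated with a minimal base-R sketch. This is not phylotaR's implementation: the helper `word_freqs()` and the two definition lines are hypothetical, and the sketch assumes that text is lower-cased before matching and that a word's frequency is its share of all words kept.

```r
# Hypothetical helper, NOT part of phylotaR: a base-R approximation of
# counting word frequencies after dropping non-alphanumeric text.
word_freqs <- function(txts, min_frq = 0.1, min_nchar = 1,
                       ignr_pttrn = "[^a-z0-9]") {
  # lower-case, then replace every ignored character with a space
  cleaned <- gsub(pattern = ignr_pttrn, replacement = " ", x = tolower(txts))
  # split on whitespace; drop empty strings and words that are too short
  wrds <- unlist(strsplit(cleaned, split = "\\s+"))
  wrds <- wrds[nzchar(wrds) & nchar(wrds) >= min_nchar]
  # frequency of each word as a proportion of all words kept (assumption)
  tab <- table(wrds)
  frqs <- setNames(as.numeric(tab), names(tab)) / length(wrds)
  # keep only words at or above the minimum frequency, most frequent first
  frqs <- sort(frqs, decreasing = TRUE)
  frqs[frqs >= min_frq]
}

# made-up definition lines, for illustration only
dflns <- c(
  "Anax imperator cytochrome oxidase subunit I (COI) gene",
  "Aeshna cyanea cytochrome oxidase subunit I (COI) gene, partial cds"
)
res <- word_freqs(dflns, min_nchar = 3)
print(res)
```

With `min_nchar = 3` the single-letter word "i" is dropped, and only the words shared by both definition lines pass the default `min_frq` of 0.1.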

Examples

data('dragonflies')
# work out what gene region the cluster is likely representing with word freqs.
random_cids <- sample(dragonflies@cids, 10)
# most frequent words in definition line
(calc_wrdfrq(phylota = dragonflies, cid = random_cids, type = 'dfln'))
#> $`766`
#> named numeric(0)
#>
#> $`535`
#> named numeric(0)
#>
#> $`76`
#>  sequence
#> 0.1245614
#>
#> $`445`
#> named numeric(0)
#>
#> $`566`
#> named numeric(0)
#>
#> $`426`
#> named numeric(0)
#>
#> $`61`
#> named numeric(0)
#>
#> $`554`
#> named numeric(0)
#>
#> $`772`
#> wrds
#>      rrna       and
#> 0.1578947 0.1052632
#>
#> $`468`
#> named numeric(0)
#>
# most frequent words in feature name
(calc_wrdfrq(phylota = dragonflies, cid = random_cids, type = 'nm'))
#> $`766`
#> barcode
#>       1
#>
#> $`535`
#> wrds
#> rhswc1 rhswc2 rhswc3 rhswc4 rhswc5
#>    0.2    0.2    0.2    0.2    0.2
#>
#> $`76`
#> wrds
#>    internal      spacer transcribed
#>   0.3314286   0.3314286   0.3314286
#>
#> $`445`
#> numeric(0)
#>
#> $`566`
#> coii
#>    1
#>
#> $`426`
#> wrds
#>    rhlwc1    rhlwd1    rhlwe1    rhlwf1    rhlwf2    rhlwf3    rhlwf4
#> 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571
#>
#> $`61`
#> co1
#>   1
#>
#> $`554`
#> numeric(0)
#>
#> $`772`
#> wrds
#>    internal      spacer transcribed
#>   0.3333333   0.3333333   0.3333333
#>
#> $`468`
#> co1
#>   1
#>
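One way to summarise such a result is to take the single most frequent word per cluster. The sketch below uses only base R on a hand-built list that copies three elements from the feature-name output above; the object names `wrd_frqs` and `top_wrds` are hypothetical, and clusters with no reported words are given `NA`.

```r
# Hand-built stand-in for a calc_wrdfrq() result, using values copied from
# the feature-name output above (clusters 766, 445 and 772 only).
wrd_frqs <- list(
  `766` = c(barcode = 1),
  `445` = numeric(0),
  `772` = c(internal = 0.3333333, spacer = 0.3333333, transcribed = 0.3333333)
)

# For each cluster, return the name of the most frequent word, or NA when
# no word was reported. which.max() picks the first word in case of ties.
top_wrds <- vapply(wrd_frqs, function(frqs) {
  if (length(frqs) == 0) NA_character_ else names(frqs)[which.max(frqs)]
}, character(1))
print(top_wrds)
```

Because ties are broken by position, a cluster such as 772, where three words share the same frequency, is better interpreted from the full vector than from the top word alone.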