Calculate word frequencies

For all sequences in one or more clusters, calculate the frequency of individual words in either the sequence definitions or the reported feature names.

Usage

calc_wrdfrq(
  phylota,
  cid,
  min_frq = 0.1,
  min_nchar = 1,
  type = c("dfln", "nm"),
  ignr_pttrn = "[^a-z0-9]"
)

Arguments

phylota: Phylota object
cid: Cluster ID(s)
min_frq: Minimum frequency a word must have to be reported
min_nchar: Minimum number of characters for a word
type: Text to search, either sequence definitions ("dfln") or feature names ("nm")
ignr_pttrn: Ignore pattern, a regular expression matching text to ignore

Value

A list with one element per cluster ID, each holding the word frequencies for that cluster.

Details

By default, any character that is not alphanumeric is ignored. The type values 'dfln' and 'nm' match the slot names in a SeqRec; see list_seqrec_slots().
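
To illustrate how the arguments interact, here is a minimal base-R sketch of a word-frequency calculation over a few definition lines. It is not the package's implementation: `sketch_wrdfrq` and the example definition lines are hypothetical, and it assumes that text is lower-cased before the ignore pattern is applied and that a word's frequency is its share of all retained words.

```r
# Hypothetical helper, not part of phylotaR: a base-R sketch of the idea.
sketch_wrdfrq <- function(txt, min_frq = 0.1, min_nchar = 1,
                          ignr_pttrn = "[^a-z0-9]") {
  # Split each line into lower-case words on whitespace
  wrds <- unlist(strsplit(tolower(txt), "\\s+"))
  # Remove every character that matches the ignore pattern
  wrds <- gsub(ignr_pttrn, "", wrds)
  # Drop words that are too short (always drops empty strings)
  wrds <- wrds[nchar(wrds) >= max(min_nchar, 1)]
  if (length(wrds) == 0) {
    return(numeric(0))
  }
  # Assumption: frequency = count of a word / total number of retained words
  tab <- table(wrds)
  frqs <- setNames(as.numeric(tab), names(tab)) / length(wrds)
  # Keep only words that reach the minimum frequency, most frequent first
  frqs <- frqs[frqs >= min_frq]
  sort(frqs, decreasing = TRUE)
}

# Made-up definition lines, for illustration only
dflns <- c("Ophiogomphus sp. histone H3 gene, partial cds",
           "Ophiogomphus sp. voucher X1 histone H3 gene")
res <- sketch_wrdfrq(dflns, min_frq = 0.1)
print(res)
```

In this sketch, punctuation such as the comma in "gene," is stripped by the default pattern, and words that appear only once across the two lines fall below `min_frq` and are dropped.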

Examples

data('dragonflies')
# use word frequencies to work out which gene region each cluster likely represents
random_cids <- sample(dragonflies@cids, 10)
# most frequent words in definition line
(calc_wrdfrq(phylota = dragonflies, cid = random_cids, type = 'dfln'))
#> $`65`
#> named numeric(0)
#> 
#> $`33`
#>      gene 
#> 0.1363636 
#> 
#> $`283`
#> named numeric(0)
#> 
#> $`301`
#> wrds
#>          cds         gene      histone ophiogomphus      partial           h3 
#>    0.1138211    0.1138211    0.1138211    0.1138211    0.1138211    0.1056911 
#>      voucher 
#>    0.1056911 
#> 
#> $`8`
#>      gene 
#> 0.1068649 
#> 
#> $`12`
#> named numeric(0)
#> 
#> $`134`
#>  h3 
#> 0.2 
#> 
#> $`699`
#> named numeric(0)
#> 
#> $`140`
#>      gene 
#> 0.1363636 
#> 
#> $`256`
#> named numeric(0)
#> 
# most frequent words in feature name
(calc_wrdfrq(phylota = dragonflies, cid = random_cids, type = 'nm'))
#> $`65`
#> barcode 
#>       1 
#> 
#> $`33`
#> numeric(0)
#> 
#> $`283`
#> numeric(0)
#> 
#> $`301`
#> numeric(0)
#> 
#> $`8`
#> numeric(0)
#> 
#> $`12`
#> barcode 
#>       1 
#> 
#> $`134`
#> numeric(0)
#> 
#> $`699`
#> barcode 
#>       1 
#> 
#> $`140`
#> numeric(0)
#> 
#> $`256`
#> numeric(0)
#>
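
Because the result is a list keyed by cluster ID, it can be summarised with ordinary list tools. The sketch below picks the most frequent word per cluster. To stay self-contained it uses a small mock list, `wrd_frqs`, that mirrors the shape of the output above (values copied from it) instead of calling calc_wrdfrq(); the names `wrd_frqs` and `top_wrd` are illustrative.

```r
# Mock result with the same shape as the calc_wrdfrq() output shown above
wrd_frqs <- list(
  `33`  = c(gene = 0.1363636),
  `65`  = numeric(0),
  `134` = c(h3 = 0.2)
)

# Most frequent word per cluster; NA where no word passed the threshold
top_wrd <- vapply(
  wrd_frqs,
  function(x) if (length(x) > 0) names(x)[which.max(x)] else NA_character_,
  character(1)
)
print(top_wrd)
```

Clusters whose element is empty, such as `65` here, had no word reaching `min_frq`; lowering `min_frq` in the original call is one way to inspect them further.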
