For all sequences in one or more clusters, calculate the frequency of individual words in either the sequence definition lines or the reported feature names.

Usage

calc_wrdfrq(
  phylota,
  cid,
  min_frq = 0.1,
  min_nchar = 1,
  type = c("dfln", "nm"),
  ignr_pttrn = "[^a-z0-9]"
)

Arguments

phylota

Phylota object

cid

Cluster ID(s)

min_frq

Minimum frequency for a word to be reported

min_nchar

Minimum number of characters for a word

type

Definitions (dfln) or features (nm)

ignr_pttrn

Ignore pattern: a regular expression (REGEX) matching text to ignore.

Value

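A list with one element per cluster ID. Each element is a named numeric vector of word frequencies, which may be empty (numeric(0)).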

Details

By default, anything that is not alphanumeric is ignored. 'dfln' and 'nm' match the slot names in a SeqRec; see list_seqrec_slots().
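
To make the roles of min_frq, min_nchar and ignr_pttrn more concrete, here is a minimal base-R sketch of a word-frequency calculation on a plain character vector. It is not the phylotaR implementation: sketch_wrdfrq() is a hypothetical helper, and the choice to replace ignored characters with spaces before splitting is an assumption made for illustration.

```r
# Hypothetical helper, NOT part of phylotaR: illustrates the general idea
# of a word-frequency calculation on a character vector of texts.
sketch_wrdfrq <- function(txts, min_frq = 0.1, min_nchar = 1,
                          ignr_pttrn = "[^a-z0-9]") {
  # lower-case, then replace every ignored character with a space
  # (assumption: the real function may treat ignored text differently)
  cleaned <- gsub(ignr_pttrn, " ", tolower(txts))
  # split on whitespace and pool the words from all texts
  wrds <- unlist(strsplit(cleaned, "\\s+"))
  # drop empty strings and words shorter than min_nchar
  wrds <- wrds[nchar(wrds) >= max(min_nchar, 1)]
  if (length(wrds) == 0) {
    return(numeric(0))
  }
  # relative frequency of each word as a named numeric vector
  tab <- table(wrds)
  frqs <- setNames(as.numeric(tab), names(tab)) / length(wrds)
  # most frequent first, keeping only words at or above min_frq
  frqs <- sort(frqs, decreasing = TRUE)
  frqs[frqs >= min_frq]
}

# made-up definition lines, for illustration only
txts <- c("Libellula 18S rRNA gene, partial sequence", "Anax 18S rRNA gene")
res <- sketch_wrdfrq(txts, min_frq = 0.15)
```

Raising min_frq trims rare words such as species names, while raising min_nchar removes short tokens regardless of how often they occur.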

Examples

data('dragonflies')
# work out which gene region each cluster likely represents using word frequencies
random_cids <- sample(dragonflies@cids, 10)
# most frequent words in definition line
(calc_wrdfrq(phylota = dragonflies, cid = random_cids, type = 'dfln'))
#> $`129`
#> named numeric(0)
#> 
#> $`551`
#> wrds
#>      rrna       and 
#> 0.1578947 0.1052632 
#> 
#> $`485`
#> named numeric(0)
#> 
#> $`148`
#> sequence 
#>     0.15 
#> 
#> $`426`
#> named numeric(0)
#> 
#> $`404`
#> wrds
#>      cds     gene  histone  isolate  partial petalura 
#>    0.125    0.125    0.125    0.125    0.125    0.125 
#> 
#> $`576`
#> wrds
#>      rrna       and 
#> 0.1666667 0.1111111 
#> 
#> $`615`
#> named numeric(0)
#> 
#> $`689`
#> wrds
#>      alpha        cds elongation     factor       gene   macromia    partial 
#>  0.1071429  0.1071429  0.1071429  0.1071429  0.1071429  0.1071429  0.1071429 
#> 
#> $`735`
#> named numeric(0)
#> 
# most frequent words in feature name
(calc_wrdfrq(phylota = dragonflies, cid = random_cids, type = 'nm'))
#> $`129`
#> numeric(0)
#> 
#> $`551`
#> wrds
#>    internal      spacer transcribed 
#>   0.3333333   0.3333333   0.3333333 
#> 
#> $`485`
#> numeric(0)
#> 
#> $`148`
#> wrds
#>    internal      spacer transcribed 
#>   0.3333333   0.3333333   0.3333333 
#> 
#> $`426`
#> wrds
#>    rhlwc1    rhlwd1    rhlwe1    rhlwf1    rhlwf2    rhlwf3    rhlwf4 
#> 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 0.1428571 
#> 
#> $`404`
#> numeric(0)
#> 
#> $`576`
#> wrds
#>    internal      spacer transcribed 
#>   0.3333333   0.3333333   0.3333333 
#> 
#> $`615`
#> numeric(0)
#> 
#> $`689`
#> numeric(0)
#> 
#> $`735`
#> numeric(0)
#>
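
Because the result is an ordinary list of named numeric vectors, it can be summarised with base R. The sketch below picks the most frequent word for each cluster. It uses a small mock list whose values are copied from the output above, so it runs without the package; wrd_frqs and top_wrds are illustrative names only.

```r
# Mock of the list structure shown in the output above
# (values copied from the 'dfln' results for three of the clusters)
wrd_frqs <- list(
  `129` = numeric(0),
  `551` = c(rrna = 0.1578947, and = 0.1052632),
  `148` = c(sequence = 0.15)
)

# name of the most frequent word per cluster; NA when the vector is empty
top_wrds <- vapply(wrd_frqs, function(x) {
  if (length(x) == 0) NA_character_ else names(x)[which.max(x)]
}, character(1))

print(top_wrds)
```

Clusters with an empty result are kept as NA rather than dropped, so the output stays aligned with the input cluster IDs.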