
For a list of records, construct a data.frame for insertion into an SQL database.

Usage

gb_df_generate(
  records,
  min_length = 0,
  max_length = NULL,
  acc_filter = NULL,
  invert = FALSE
)

Arguments

records

character, vector of GenBank records in text format

min_length

Minimum sequence length, default 0.

max_length

Maximum sequence length, default NULL.

acc_filter

Character vector; accessions to include or exclude from the database as specified by invert.

invert

Logical vector of length 1; if TRUE, accessions in acc_filter will be excluded from the database; if FALSE, only accessions in acc_filter will be included in the database. Default FALSE.

Value

data.frame, or NULL if no records pass filters

Details

The resulting data.frame has five columns: accession, organism, raw_definition, raw_sequence, raw_record. The prefix 'raw_' indicates that the data has been converted to the raw format (see ?charToRaw) in order to save on RAM. The raw_record column contains the entire GenBank text record.
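
As a minimal sketch of what the 'raw_' conversion means, the base R functions charToRaw() and rawToChar() turn a string into a raw vector and back. The sequence below is a made-up toy value, not taken from a real record, and how the function lays out raw vectors inside the data.frame is not shown here.

```r
# Toy nucleotide string (illustrative only, not from a real GenBank record)
sequence <- "ATGCGTACGTTAGC"

# Convert the text to a raw vector: one byte per ASCII character
raw_sequence <- charToRaw(sequence)

# Convert back to text when the sequence is needed again
decoded <- rawToChar(raw_sequence)

# The round trip is lossless
identical(decoded, sequence)
```

The same round trip would apply to values taken from the raw_definition and raw_record columns.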

Use acc_filter, min_length and max_length to minimize the size of the database. Every sequence must be at least min_length long and no longer than max_length, unless max_length is NULL, in which case there is no maximum length. The final selection of sequences is the result of applying all filters (acc_filter, min_length, max_length) in combination.
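
The combined filtering rule described above can be sketched as follows. This is an illustration only, not the function's actual implementation: passes_filters() is a hypothetical helper, the accessions and lengths are invented, and treating acc_filter = NULL as "no accession filtering" is an assumption.

```r
# Hypothetical helper: returns TRUE for each record that passes all filters
passes_filters <- function(accession, seq_length, min_length = 0,
                           max_length = NULL, acc_filter = NULL,
                           invert = FALSE) {
  # Lower bound always applies
  keep <- seq_length >= min_length
  # Upper bound applies only when max_length is not NULL
  if (!is.null(max_length)) {
    keep <- keep & seq_length <= max_length
  }
  # Accession filter (assumed to be skipped when acc_filter is NULL)
  if (!is.null(acc_filter)) {
    in_filter <- accession %in% acc_filter
    # invert = TRUE excludes listed accessions; FALSE keeps only those listed
    keep <- keep & (if (invert) !in_filter else in_filter)
  }
  keep
}

# Invented example values
accession <- c("AB000001", "AB000002", "AB000003")
seq_length <- c(150, 900, 4000)

by_length <- passes_filters(accession, seq_length,
                            min_length = 200, max_length = 2000)
by_accession <- passes_filters(accession, seq_length,
                               acc_filter = "AB000002", invert = TRUE)
```

Here by_length keeps only the second record, since it is the only one between 200 and 2000 in length, while by_accession keeps the first and third because the second accession is excluded.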