dwctaxon is all about handling taxonomic data in the Darwin Core (DwC) taxon format.
But what is DwC?
DwC is a standard for biodiversity data
According to the official documentation,
Darwin Core is a standard maintained by the Darwin Core Maintenance Interest Group. It includes a glossary of terms (in other contexts these might be called properties, elements, fields, columns, attributes, or concepts) intended to facilitate the sharing of information about biological diversity by providing identifiers, labels, and definitions. Darwin Core is primarily based on taxa, their occurrence in nature as documented by observations, specimens, samples, and related information.
(emphasis added)
The “terms” referred to are typically encountered as columns in rectangular data (spreadsheets), such as scientificName (scientific name of a taxon), lifeStage (life stage of an organism when it was observed), etc. By providing a controlled vocabulary and clear definitions of terms, DwC greatly facilitates collection and sharing of biological data. For example, the Global Biodiversity Information Facility (GBIF), which synthesizes biodiversity data on a global scale, uses DwC.
In practice, a given set of DwC data is contained in an archive (zip file) that includes multiple spreadsheets (CSV files) and XML files with additional metadata. The spreadsheets typically include datasets like occurrences, taxonomy, and collection events.

DwC archive components, from https://github.com/gbif/ipt/ under the Apache license
While other parts of DwC such as organism and occurrence data are clearly important, they are out of the scope of dwctaxon, which only focuses on taxonomic data.
Features of the DwC taxon format
Most of the terms used in the DwC format for taxonomic data (“DwC taxon”) should be familiar to biologists. Here is a simple example mapping taxonomic data on the left to their DwC terms on the right for the genus Sarda:

Example terms in DwC taxon
However, the format has some peculiarities that are worth knowing about, described below.
Vertically oriented
The Linnaean system of taxonomy organizes taxa into a hierarchy, so we may be used to working with taxonomic data in “wide” format where each row is a species, like this:
species | genus | family | order |
---|---|---|---|
Crepidomanes minutum | Crepidomanes | Hymenophyllaceae | Hymenophyllales |
Indeed, in DwC taxon, taxonomic levels above species like genus, family, and order are valid terms and may be used.
However, species is not a valid DwC term. That is because each row of a DwC taxonomic database is a single scientific name of any rank, not just species. So it is typical for data to be oriented vertically (“long” format):
taxonRank | scientificName |
---|---|
species | Crepidomanes minutum |
genus | Crepidomanes |
family | Hymenophyllaceae |
order | Hymenophyllales |
And since genus, family, etc. are valid DwC terms, these can also be included (when applicable):
taxonRank | scientificName | genus | family | order |
---|---|---|---|---|
species | Crepidomanes minutum | Crepidomanes | Hymenophyllaceae | Hymenophyllales |
genus | Crepidomanes | Crepidomanes | Hymenophyllaceae | Hymenophyllales |
family | Hymenophyllaceae | NA | Hymenophyllaceae | Hymenophyllales |
order | Hymenophyllales | NA | NA | Hymenophyllales |
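The reshaping from the wide table to the two-column long table above can be sketched in base R. This is only an illustration: the object names (wide, ranks, long) are made up for this example, and it does not use any dwctaxon functions.

```r
# Wide format: one row per species, one column per rank
wide <- data.frame(
  species = "Crepidomanes minutum",
  genus = "Crepidomanes",
  family = "Hymenophyllaceae",
  order = "Hymenophyllales",
  stringsAsFactors = FALSE
)

# Ranks to stack, from lowest to highest
ranks <- c("species", "genus", "family", "order")

# Build one block of rows per rank, bind them together,
# and drop names that appear more than once
# (e.g., a genus shared by several species)
long <- unique(do.call(rbind, lapply(ranks, function(rank) {
  data.frame(
    taxonRank = rank,
    scientificName = wide[[rank]],
    stringsAsFactors = FALSE
  )
})))
rownames(long) <- NULL

print(long)
```

Each name ends up on its own row with its rank stored in taxonRank, which is why a column named species is never needed.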
Machine and human friendly
If you browse through the DwC taxon terms, you will notice many pairs of similar terms such as acceptedNameUsage and acceptedNameUsageID, parentNameUsage and parentNameUsageID, etc. The two terms in each pair serve a similar purpose, but one holds a value that is easy for humans to understand while the other is useful for machines (computer programs).
For example, acceptedNameUsage is the accepted name of a synonym (e.g., Picea abies (L.) H. Karst. as the accepted name of Pinus abies L.), and acceptedNameUsageID is the unique ID of that accepted name (typically its taxonID, which is often a short sequence of letters and numbers, though this depends on the dataset).
This makes the data format somewhat redundant, but it is easier for a human to parse if they can see the actual accepted name of a synonym immediately, instead of having to look it up by taxonID. On the other hand, scientificName can include duplicates (in rare cases, for example if the same name was published twice), so referring to an accepted name by its unique ID is safer and poses no problem for a computer.
We can see how this works in the example dataset that comes with dwctaxon, dct_filmies:
library(dwctaxon)
head(dct_filmies)
#> # A tibble: 6 × 5
#>   taxonID  acceptedNameUsageID taxonomicStatus taxonRank scientificName
#>   <chr>    <chr>               <chr>           <chr>     <chr>
#> 1 54115096 NA                  accepted        species   Cephalomanes atrovirens Presl
#> 2 54133783 54115097            synonym         species   Trichomanes crassum Copel.
#> 3 54115097 NA                  accepted        species   Cephalomanes crassum (Copel.) M. G. Price
#> 4 54133784 54115098            synonym         species   Trichomanes densinervium Copel.
#> 5 54115098 NA                  accepted        species   Cephalomanes densinervium (Copel.) Copel.
#> 6 54133786 54115100            synonym         species   Cephalomanes curvatum (J. Sm.) V. D. Bosch
Here, Trichomanes crassum Copel. is a synonym of Cephalomanes crassum (Copel.) M. G. Price (notice how the acceptedNameUsageID of Trichomanes crassum Copel. matches the taxonID of Cephalomanes crassum (Copel.) M. G. Price).
In this dataset, only acceptedNameUsageID is used, but it would be valid to add a column with acceptedNameUsage. To learn more about how to do so, please see vignette("editing").
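To make the relationship between the two terms concrete, here is a rough base R sketch of what an acceptedNameUsage column would contain. It is not the dwctaxon editing workflow described in the vignette. The small data frame (called taxa here, a name chosen only for this example) copies the first three rows of the output above so that the code runs without the package.

```r
# First three rows of the example output above, typed in by hand
taxa <- data.frame(
  taxonID = c("54115096", "54133783", "54115097"),
  acceptedNameUsageID = c(NA, "54115097", NA),
  taxonomicStatus = c("accepted", "synonym", "accepted"),
  scientificName = c(
    "Cephalomanes atrovirens Presl",
    "Trichomanes crassum Copel.",
    "Cephalomanes crassum (Copel.) M. G. Price"
  ),
  stringsAsFactors = FALSE
)

# For each row, find the row whose taxonID equals this row's
# acceptedNameUsageID (NA when there is none), then take that
# row's scientificName as the human-readable accepted name
accepted_row <- match(taxa$acceptedNameUsageID, taxa$taxonID)
taxa$acceptedNameUsage <- taxa$scientificName[accepted_row]

print(taxa[, c("scientificName", "acceptedNameUsage")])
```

Accepted names have no acceptedNameUsageID in this dataset, so their acceptedNameUsage stays NA; only the synonym is given a value.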
Extensible
There are many terms listed in the DwC taxon documentation (37 by my count!). However, it is unlikely that a given taxonomic database uses all of them; in fact, most that I’ve encountered use only a subset of the terms, and none are strictly required. In practice, though, you typically want at least scientificName (scientific name of the taxon, including author if known) and taxonID (a unique identifier for each row in the dataset).
Furthermore, some of the terms are likely to have restricted vocabularies. For example, a given dataset may only use a limited set of words to describe taxonomicStatus, like “accepted”, “synonym”, and “doubtful”. This is in contrast to a term that could be (nearly) anything, like scientificName. DwC does not provide any official set of vocabularies; it is left to the database manager to determine that. One feature of dwctaxon is to verify that only the values you want to allow are used for a given term. To learn more about that, please see vignette("validation").
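The kinds of checks this implies can be sketched in a few lines of base R. This is only an illustration of the idea and does not use the dwctaxon validation functions covered in the vignette. The data frame taxa, the vector allowed_status, and the other object names are assumptions made for this example, and the allowed values are the ones mentioned above.

```r
# Small example dataset, copied by hand from the dct_filmies output
taxa <- data.frame(
  taxonID = c("54115096", "54133783", "54115097"),
  acceptedNameUsageID = c(NA, "54115097", NA),
  taxonomicStatus = c("accepted", "synonym", "accepted"),
  scientificName = c(
    "Cephalomanes atrovirens Presl",
    "Trichomanes crassum Copel.",
    "Cephalomanes crassum (Copel.) M. G. Price"
  ),
  stringsAsFactors = FALSE
)

# Vocabulary chosen by the database manager (an assumption here;
# DwC itself does not prescribe these values)
allowed_status <- c("accepted", "synonym", "doubtful")

# Check 1: every taxonID is unique
ids_unique <- anyDuplicated(taxa$taxonID) == 0

# Check 2: every non-missing acceptedNameUsageID points to an existing taxonID
links_resolve <- all(
  is.na(taxa$acceptedNameUsageID) |
    taxa$acceptedNameUsageID %in% taxa$taxonID
)

# Check 3: rows whose taxonomicStatus is outside the allowed vocabulary
bad_status <- taxa[!taxa$taxonomicStatus %in% allowed_status, ]

print(c(ids_unique = ids_unique, links_resolve = links_resolve))
print(nrow(bad_status))
```

Changing allowed_status is all it takes to adapt the third check to a dataset with a different vocabulary.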
These qualities make the DwC taxon format flexible, so it can meet the needs of the dataset at hand. The dwctaxon functions try to provide sensible defaults, but they may need to be adjusted appropriately.