Retrieve compound IDs (CIDs) from PubChem.
Arguments
- query
character; search term, one or more compounds.
- from
character; type of input. See details for more information.
- domain
character; query domain, one of "compound", "substance" or "assay".
- match
character; how multiple hits should be handled. "all": all matches are returned; "first": only the first match is returned; "ask": enters an interactive mode and asks the user for input; "na": returns NA if multiple hits are found.
- verbose
logical; should a verbose output be printed on the console?
- arg
character; optional arguments such as "name_type=word" to match individual words.
- first
deprecated. Use `match` instead.
- ...
currently unused.
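The non-interactive match options can be illustrated with a small stand-alone sketch. The helper resolve_hits() is hypothetical and not part of webchem. It only mimics the documented behaviour on a plain vector of hits, and "ask" is left out because it needs user input. The CIDs are illustrative values.

```r
# Hypothetical helper, not part of webchem: mimics the documented
# semantics of the non-interactive `match` options.
resolve_hits <- function(hits, match = c("all", "first", "na")) {
  match <- match.arg(match)
  if (length(hits) <= 1) {
    return(hits)  # zero or one hit: nothing to resolve
  }
  switch(match,
    all   = hits,     # keep every match
    first = hits[1],  # keep only the first match
    na    = NA        # multiple hits: return NA
  )
}

hits <- c(5564L, 131203L)  # illustrative values only
resolve_hits(hits, "all")
resolve_hits(hits, "first")
resolve_hits(hits, "na")
```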
Details
Valid values for the from argument depend on the domain:
- compound: "name", "smiles", "inchi", "inchikey", "formula", "sdf", "cas" (an alias for "xref/RN"), <xref>, <structure search>, <fast search>.
- substance: "name", "sid", <xref>, "sourceid/<source id>" or "sourceall".
- assay: "aid", <assay target>.
<structure search> is assembled as "(substructure |
superstructure | similarity | identity) / (smiles
| inchi | sdf | cid)", e.g.
from = "substructure/smiles".
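As a minimal sketch, the two components of a structure search can be pasted together and validated before the call. The helper structure_from() is hypothetical and not part of webchem. It only builds the string that would be passed as from.

```r
# Hypothetical helper, not part of webchem: assembles a structure-search
# value for `from` out of the two documented components.
structure_from <- function(
    type  = c("substructure", "superstructure", "similarity", "identity"),
    input = c("smiles", "inchi", "sdf", "cid")) {
  type  <- match.arg(type)   # errors on anything outside the listed types
  input <- match.arg(input)  # errors on anything outside the listed inputs
  paste(type, input, sep = "/")
}

structure_from("similarity", "cid")
```

The result could then be supplied as the from argument, for instance get_cid(5564, from = structure_from("similarity", "cid")).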
<xref> is assembled as "xref/(RegistryID |
RN | PubMedID | MMDBID | ProteinGI |
NucleotideGI | TaxonomyID | MIMID | GeneID |
ProbeID | PatentID)", e.g. from = "xref/RN" will query
by CAS RN.
<fast search> is either fastformula or it is assembled as
"(fastidentity | fastsimilarity_2d | fastsimilarity_3d |
fastsubstructure | fastsuperstructure)/(smiles |
smarts | inchi | sdf | cid)", e.g.
from = "fastidentity/smiles".
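The fast search grammar above, including the fastformula special case, can be checked with a regular expression. The helper is_fast_search() is hypothetical and not part of webchem. It only tests the shape of the string and does not contact PubChem.

```r
# Hypothetical helper, not part of webchem: checks whether a string has
# the documented shape of a <fast search> value.
is_fast_search <- function(from) {
  pattern <- paste0(
    "^(fastformula|",
    "(fastidentity|fastsimilarity_2d|fastsimilarity_3d|",
    "fastsubstructure|fastsuperstructure)/",
    "(smiles|smarts|inchi|sdf|cid))$"
  )
  grepl(pattern, from)
}

# "fastformula" stands alone, so "fastformula/cid" is rejected
is_fast_search(c("fastidentity/smiles", "fastformula", "fastformula/cid"))
```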
<source id> is any valid PubChem Data Source ID. When
from = "sourceid/<source id>", the query is the ID of the substance in
the depositor's database.
If from = "sourceall", the query is one or more valid PubChem
depositor names. Depositor names are not case sensitive.
Depositor names and Data Source IDs can be found at https://pubchem.ncbi.nlm.nih.gov/sources/.
<assay target> is assembled as "target/(gi |
proteinname | geneid | genesymbol | accession)",
e.g. from = "target/geneid" will query by GeneID.
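To see how domain, from and query relate to each other, it can help to picture them as segments of a PUG REST style path. The sketch below rests on an assumption about the request shape and does not describe how get_cid() builds its requests. The helper pug_cid_url() is hypothetical, and inputs that contain slashes, such as InChI strings, may need to be sent differently.

```r
# Hypothetical helper, not part of webchem: shows how the three pieces
# could map onto a path of the form
# <base>/<domain>/<from>/<query>/cids/JSON (an assumed request shape).
pug_cid_url <- function(query, from = "name", domain = "compound") {
  base <- "https://pubchem.ncbi.nlm.nih.gov/rest/pug"
  # Percent-encode each query so spaces and reserved characters are escaped
  query <- vapply(as.character(query), utils::URLencode, character(1),
                  reserved = TRUE, USE.NAMES = FALSE)
  paste(base, domain, from, query, "cids", "JSON", sep = "/")
}

pug_cid_url("Triclosan")
pug_cid_url(25086, from = "target/geneid", domain = "assay")
```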
Note
Please respect the Terms and Conditions of the National Library of Medicine, https://www.nlm.nih.gov/databases/download.html, the data usage policies of the National Center for Biotechnology Information, https://www.ncbi.nlm.nih.gov/home/about/policies/ and https://pubchem.ncbi.nlm.nih.gov/docs/programmatic-access, and the data usage policies of the individual data sources, https://pubchem.ncbi.nlm.nih.gov/sources/.
References
Wang, Y., J. Xiao, T. O. Suzek, et al. 2009 PubChem: A Public Information System for Analyzing Bioactivities of Small Molecules. Nucleic Acids Research 37: 623–633.
Kim, Sunghwan, Paul A. Thiessen, Evan E. Bolton, et al. 2016 PubChem Substance and Compound Databases. Nucleic Acids Research 44(D1): D1202–D1213.
Kim, S., Thiessen, P. A., Bolton, E. E., & Bryant, S. H. (2015). PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic acids research, gkv396.
Eduard Szöcs, Tamás Stirling, Eric R. Scott, Andreas Scharmüller, Ralf B. Schäfer (2020). webchem: An R Package to Retrieve Chemical Information from the Web. Journal of Statistical Software, 93(13). doi:10.18637/jss.v093.i13.
Examples
if (FALSE) { # \dontrun{
# might fail if API is not available
get_cid("Triclosan")
get_cid("Triclosan", arg = "name_type=word")
# from SMILES
get_cid("CCCC", from = "smiles")
# from InChI
get_cid("InChI=1S/CH5N/c1-2/h2H2,1H3", from = "inchi")
# from InChIKey
get_cid("BPGDAMSIGCZZLK-UHFFFAOYSA-N", from = "inchikey")
# from formula
get_cid("C26H52NO6P", from = "formula")
# from CAS RN
get_cid("56-40-6", from = "xref/rn")
# similarity
get_cid(5564, from = "similarity/cid")
get_cid("CCO", from = "similarity/smiles")
# from SID
get_cid("126534046", from = "sid", domain = "substance")
# sourceid
get_cid("VCC957895", from = "sourceid/23706", domain = "substance")
# sourceall
get_cid("Optopharma Ltd", from = "sourceall", domain = "substance")
# from AID (CIDs of substances tested in the assay)
get_cid(170004, from = "aid", domain = "assay")
# from GeneID (CIDs of substances tested on the gene)
get_cid(25086, from = "target/geneid", domain = "assay")
# multiple inputs
get_cid(c("Triclosan", "Aspirin"))
} # }
