ChemSpider and Traveling the Internet via Chemical Structures

Antony Williams
Drexel University, November 2012
Compounds and Identifiers
Chemistry on the Internet
   Where do you source chemistry information?
   What can you trust online?
   How can you recognize potential issues?
   Cross-referencing and curating data
Molfiles (http://en.wikipedia.org/wiki/Chemical_table_file)
Molfiles
   10 9 0 0 1 0 0 0     0 0 1 V2000
     31.2937 -9.0366    0.0000 C 0 0   0   0   0   0   0   0   0   0   0   0
     26.6526 -9.0366    0.0000 C 0 0   0   0   0   0   0   0   0   0   0   0
     31.2937 -7.7066    0.0000 O 0 0   0   0   0   0   0   0   0   0   0   0
     30.1161 -9.6877    0.0000 C 0 0   0   0   0   0   0   0   0   0   0   0
     25.5096 -9.6877    0.0000 O 0 0   0   0   0   0   0   0   0   0   0   0
     28.9731 -9.0366    0.0000 C 0 0   0   0   0   0   0   0   0   0   0   0
     27.8163 -9.7016    0.0000 C 0 0   0   0   0   0   0   0   0   0   0   0
     26.6664 -7.7066    0.0000 N 0 0   0   0   0   0   0   0   0   0   0   0
     32.4367 -9.6877    0.0000 O 0 0   0   0   0   0   0   0   0   0   0   0
     30.1161 -11.0177   0.0000 N 0 0   0   0   0   0   0   0   0   0   0   0
    3 1 2 0 0 0 0
    4 1 1 0 0 0 0
    9 1 1 0 0 0 0
    7 2 1 0 0 0 0
    5 2 2 0 0 0 0
    8 2 1 0 0 0 0
    6 4 1 0 0 0 0
    4 10 1 6 0 0 0
    7 6 1 0 0 0 0
   M END
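The connection table above can be read with a few lines of code. This is a minimal sketch, not a full Molfile reader: it assumes a V2000 table with the three header lines already removed (as on the slide), and it splits on whitespace rather than on the fixed-width columns the format defines, which is a simplification. The function name parse_ctab is made up for illustration.

```python
from collections import Counter

# Connection table from the slide (header lines omitted, hydrogens implicit).
MOLFILE_CTAB = """\
 10  9  0  0  1  0  0  0  0  0  1 V2000
   31.2937   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6526   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   31.2937   -7.7066    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161   -9.6877    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   25.5096   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   28.9731   -9.0366    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   27.8163   -9.7016    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
   26.6664   -7.7066    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
   32.4367   -9.6877    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
   30.1161  -11.0177    0.0000 N   0  0  0  0  0  0  0  0  0  0  0  0
  3  1  2  0  0  0  0
  4  1  1  0  0  0  0
  9  1  1  0  0  0  0
  7  2  1  0  0  0  0
  5  2  2  0  0  0  0
  8  2  1  0  0  0  0
  6  4  1  0  0  0  0
  4 10  1  6  0  0  0
  7  6  1  0  0  0  0
M  END
"""


def parse_ctab(text):
    """Return (atoms, bonds) from a V2000 connection table without header lines."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    counts = lines[0].split()
    n_atoms, n_bonds = int(counts[0]), int(counts[1])

    atoms = []
    for ln in lines[1:1 + n_atoms]:
        x, y, z, symbol = ln.split()[:4]
        atoms.append((symbol, float(x), float(y), float(z)))

    bonds = []
    for ln in lines[1 + n_atoms:1 + n_atoms + n_bonds]:
        first, second, order, stereo = (int(v) for v in ln.split()[:4])
        bonds.append((first, second, order, stereo))
    return atoms, bonds


atoms, bonds = parse_ctab(MOLFILE_CTAB)
heavy_atoms = Counter(symbol for symbol, *_ in atoms)
# A non-zero fourth bond field marks a wedge or hash bond in the drawing.
stereo_bonds = [b for b in bonds if b[3] != 0]

print(dict(heavy_atoms))
print(stereo_bonds)
```

The heavy-atom count and connectivity are consistent with glutamine, and the single stereo bond sits on the carbon bearing the amine. Real Molfiles should be read by column position, since adjacent fields can run together.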
Molfiles
• Molfiles are the primary exchange format between
  structure drawing packages
• Can differ between drawing packages
• Most commonly carry X,Y coordinates for layout
• Can support polymers, organometallics, etc.
• Can carry 3D coordinates
SMILES (http://en.wikipedia.org/wiki/SMILES)
• SMILES is a common format
• Can support polymers,
  organometallics, etc.
• Does NOT carry X, Y or Z
  coordinates, so depiction
  requires layout algorithms,
  which can be problematic!
• Generally differs between
  drawing packages
Stereo
Tautomers
SMILES
• ACD/Labs
  CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O

• OpenEye
  CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C

• ChEMBL
  CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C
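The three strings above differ as text even though they come from the same kind of structure. The sketch below is a crude, toolkit-free comparison: it counts heavy atoms and stereo marks with regular expressions. It is not a SMILES parser and cannot prove two strings describe the same molecule. That needs canonicalization in a cheminformatics toolkit. The helper names are hypothetical.

```python
import re
from collections import Counter

# The three strings from the slide, with the line breaks removed.
SMILES = {
    "ACD/Labs": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(C)=CCC2=C(C)C(=O)c1ccccc1C2=O",
    "OpenEye": "CC1=C(C(=O)c2ccccc2C1=O)C/C=C(C)/CCC[C@H](C)CCC[C@H](C)CCCC(C)C",
    "ChEMBL": "CC(C)CCC[C@@H](C)CCC[C@@H](C)CCCC(=CCC1=C(C)C(=O)c2ccccc2C1=O)C",
}

# Bracket atoms such as [C@@H], or bare organic-subset atoms such as C, c, Cl.
ATOM = re.compile(r"\[\d*([A-Za-z][a-z]?)[^\]]*\]|(Cl|Br|[BCNOPSFI]|[bcnops])")


def heavy_atom_counts(smiles):
    """Count element symbols; aromatic lowercase atoms are folded to uppercase."""
    counts = Counter()
    for match in ATOM.finditer(smiles):
        symbol = match.group(1) or match.group(2)
        counts[symbol.capitalize()] += 1
    return counts


def stereo_marks(smiles):
    """Count tetrahedral (@ or @@) and double-bond (/ or \\) stereo marks."""
    return {
        "tetrahedral": len(re.findall(r"@+", smiles)),
        "double_bond": len(re.findall(r"[/\\]", smiles)),
    }


for name, smiles in SMILES.items():
    print(name, dict(heavy_atom_counts(smiles)), stereo_marks(smiles))
```

All three strings give the same heavy-atom counts and two tetrahedral marks, but only the OpenEye string carries double-bond direction marks. String comparison alone would therefore treat one structure as three, and would also hide a real difference in the stereo information recorded.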
The InChI Identifier
InChI
• SINGLE code base managed by IUPAC and
  integrated into drawing packages, so no
  variability of the kind seen with SMILES
• InChI strings can be converted back to structures,
  with the same problem as SMILES: no layout
• Well adopted by the community (databases,
  publishers, blogs, Wikipedia), so good for searching
  the internet
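An InChI string is built from layers separated by "/", which is part of what makes it searchable as plain text. A minimal sketch of splitting one into its layers, using the standard InChI for ethanol as the example. The function name split_inchi is made up. The sketch assumes a simple single-section string with a formula layer, and does not handle fixed-H or isotopic sections where layer letters can repeat.

```python
def split_inchi(inchi):
    """Split a simple InChI string into its version, formula and lettered layers."""
    prefix = "InChI="
    if not inchi.startswith(prefix):
        raise ValueError("not an InChI string")
    version, formula, *rest = inchi[len(prefix):].split("/")
    layers = {"version": version, "formula": formula}
    for layer in rest:
        # Each remaining layer starts with a one-letter tag: c, h, b, t, m, s, ...
        layers[layer[0]] = layer[1:]
    return layers


ethanol = split_inchi("InChI=1S/C2H6O/c1-2-3/h3H,2H2,1H3")
print(ethanol)
# A version ending in "S" marks a standard InChI.
print("standard InChI:", ethanol["version"].endswith("S"))
```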
The InChI Standard
Tautomers – “Mobile H Perception”
Double Bond Orientation
Stereo
Checking for Stereochemistry
Use your drawing package!
InChIKeys
Search the Web by Structure
InChIs
Databases and Standardization
InChI
• No support for polymers or organometallics

• Many option settings can lead to variability and
  make integration across databases difficult;
  the FixedH option is especially problematic

• “Slight” chance of InChIKey collisions

• VERY USEFUL FOR INTEGRATING THE WEB
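How “slight” the collision chance is can be estimated with the birthday bound. The sketch below assumes the 14-letter first block of an InChIKey carries 65 bits of hash (my reading of the InChI technical documentation, not a figure from the slides) and that the hash behaves like a uniform random function, so the result is an idealized estimate only.

```python
import math

FIRST_BLOCK_BITS = 65  # assumption: hash bits carried by the 14-letter first block


def collision_probability(n_structures, bits=FIRST_BLOCK_BITS):
    """Birthday-bound estimate of at least one collision among n random hashes."""
    space = 2 ** bits
    exponent = -n_structures * (n_structures - 1) / (2 * space)
    return -math.expm1(exponent)  # 1 - exp(exponent), accurate for tiny values


for n in (10**6, 10**8, 10**9):
    print(f"{n:>13,d} structures: {collision_probability(n):.2e}")
```

Under these assumptions the chance of any first-block collision stays negligible for a million structures and reaches the percent range only around a billion.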
Vancomycin
Search Molecular SKELETON
Search Full Molecule
Full Skeleton Search: 104 Hits
Full Molecule Search: 4 Hits
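The gap between the two hit counts follows from how much of the InChIKey is matched. A minimal sketch, assuming a skeleton search matches only the first 14-letter block (connectivity) and a full-molecule search matches the whole key. The function names are hypothetical, and the example keys are those of L-alanine and of alanine drawn without stereo.

```python
import re

# 14-letter skeleton block, 8-letter detail block, S/N standard flag,
# version letter, and a final protonation letter.
INCHIKEY = re.compile(r"^([A-Z]{14})-([A-Z]{8})([SN])([A-Z])-([A-Z])$")


def split_inchikey(key):
    """Split an InChIKey into its blocks, raising ValueError if malformed."""
    match = INCHIKEY.match(key)
    if match is None:
        raise ValueError(f"not a well-formed InChIKey: {key!r}")
    skeleton, detail, flag, version, protonation = match.groups()
    return {
        "skeleton": skeleton,
        "detail": detail,
        "standard": flag == "S",
        "version": version,
        "protonation": protonation,
    }


def same_skeleton(key_a, key_b):
    """True when two keys share connectivity, whatever their stereo."""
    return split_inchikey(key_a)["skeleton"] == split_inchikey(key_b)["skeleton"]


L_ALANINE = "QNAYBMKLOCPYGJ-REOHZGMNSA-N"
ALANINE_NO_STEREO = "QNAYBMKLOCPYGJ-UHFFFAOYSA-N"

print(same_skeleton(L_ALANINE, ALANINE_NO_STEREO))  # skeleton search: match
print(L_ALANINE == ALANINE_NO_STEREO)               # full-molecule search: no match
```

Pasting the first block into a web search engine finds every page that mentions any stereo variant of the skeleton. Pasting the full key finds only pages about that exact form.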
Where is chemistry online?
   Encyclopedic articles (Wikipedia)
   Chemical vendor databases
   Metabolic pathway databases
   Property databases
   Patents with chemical structures
   Drug Discovery data
   Scientific publications
   Compound aggregators
   Blogs/Wikis and Open Notebook Science
www.chemspider.com
How do we build it?
• We deal in Molfiles or SDF files, with coordinates

• Valence checking, charge imbalance

• We have our own “business logic” to standardize structures

• We use InChI to “aggregate tautomers” into one record

• We link out to external sites using their IDs
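The aggregation and link-out steps above can be pictured as grouping deposited structures under one identifier. This is an illustrative sketch only, not ChemSpider's actual pipeline. The source names and external IDs are made up, and it assumes each deposit already has a standard InChIKey computed from its Molfile.

```python
from collections import defaultdict

# Hypothetical deposits: source names and external IDs are invented for illustration.
deposits = [
    {"source": "SourceA", "external_id": "A-001", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"source": "SourceB", "external_id": "B-17", "inchikey": "LFQSCWFLJHTTHZ-UHFFFAOYSA-N"},
    {"source": "SourceA", "external_id": "A-002", "inchikey": "QNAYBMKLOCPYGJ-REOHZGMNSA-N"},
]


def aggregate(deposits):
    """Group deposits into one record per InChIKey, keeping each source's own ID."""
    records = defaultdict(list)
    for deposit in deposits:
        records[deposit["inchikey"]].append((deposit["source"], deposit["external_id"]))
    return dict(records)


records = aggregate(deposits)
for inchikey, links in records.items():
    print(inchikey, links)
```

Because standard InChI normalizes many tautomeric forms to the same string, keying on it merges those forms into one record while the per-source IDs keep the link-outs intact.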
Searches: The INTERNET
All ChemSpider and Internet searches are “simply algorithms”
but synonym searching is based on an assertion
Validated Names for Searching…
Validating structures
• Check for “full stereo”, and use stereo descriptors
  in particular when checking!

• Check for quality of associated data sources

• Check against reference literature when available,
  but remember it can be wrong

• Question EVERYTHING!
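One quick “full stereo” check is to look at the stereo layers of a record's InChI. A minimal sketch, assuming a simple single-component standard InChI in which the /t layer lists tetrahedral centers marked "+", "-" or "?" (undefined) and the /b layer holds double-bond stereo. The function name stereo_summary is made up. A structure drawn with no stereo at all has no such layers, so a missing layer needs a human to decide whether stereo was expected.

```python
import re


def stereo_summary(inchi):
    """Summarize the stereo layers of a simple single-component InChI string."""
    # Skip the "InChI=1S" prefix and the formula; index the rest by layer letter.
    layers = {part[0]: part[1:] for part in inchi.split("/")[2:]}
    centers = re.findall(r"\d+([+\-?])", layers.get("t", ""))
    return {
        "has_tetrahedral_layer": "t" in layers,
        "has_double_bond_layer": "b" in layers,
        "defined_centers": sum(1 for mark in centers if mark in "+-"),
        "undefined_centers": centers.count("?"),
    }


L_ALANINE = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)/t2-/m0/s1"
ALANINE_NO_STEREO = "InChI=1S/C3H7NO2/c1-2(4)3(5)6/h2H,4H2,1H3,(H,5,6)"

print(stereo_summary(L_ALANINE))
print(stereo_summary(ALANINE_NO_STEREO))
```

The first string reports one defined center. The second reports none, which is the signal to go back to the drawing and the source data.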
Contributing to The Quality of Data
What is the Structure of Vitamin K?

A lipid cofactor that is required for normal blood
clotting. Several forms of vitamin K have been
identified: VITAMIN K1 (phytomenadione)
derived from plants, VITAMIN K2
(menaquinone) from bacteria, and synthetic
naphthoquinone provitamins, VITAMIN K3
(menadione).
What is the Structure of Vitamin K1?
CAS’s Common Chemistry
Wikipedia
Wolfram Alpha
DailyMed
ALL Different, ALL “Domoic Acids”
Thank you

Email: williamsa@rsc.org
Twitter: ChemConnector
Blog: www.chemspider.com/blog
Personal Blog: www.chemconnector.com
SLIDES: www.slideshare.net/AntonyWilliams
