Dr. Scott Kahn, CIO of Illumina, presents the challenges of, and progress on, big data solutions and their impact on scientific research at the 2013 Genome Informatics Alliance meeting.
2. 2
COMPANY CONFIDENTIAL – INTERNAL USE ONLY
From Whence We Came…
[Figure: sequence reads (ATGCCGTTT…, CCGGTTAAT…, GAATTGCAG…) and variant calls (6:A2567C, 12:C123T, 20:T4678A); data sizes shown: 30-40TB, ~5TB, 600GB, ~20GB]
3. 3
Genomic Big Data
Large amounts of data generated in genomics; multiple samples, size of data, etc.
Integration of digital data to enrich context of samples; DNA, RNA, methylation, time courses, spatial distributions with samples, …
Fusion of digital data and categorical data; combination rules (categories), extraction from unstructured inputs, …
Tools and techniques appropriate for resultant data sets; visualization, model building, exploration, …
Advances require data mining rather than the one-at-a-time hypothesis testing approaches of today
4. 4
Genomic Big Data and Personal Genome Information
PERSONAL SEQUENCE
(owned by individual/doctor)
Issued: 01 MAR 07 Recommended next check: 28 FEB 10
PGI id: 5910322 – 61215923014
RISK VARIANTS
(approved for clinical use)
Human Genome; Clinical studies; Populations; Sequencing; Functional annotation
[Figure: genome browser view of chromosome 3, 12,300 to 12,400 (kb), showing PPARg]
GENOMIC ANNOTATION
(in public domain)
Variant: C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu
Medical consequence: Associated with severe insulin resistance, diabetes mellitus, hypertension
Pharmacological consequence: Resistant to thiazolidinediones
CLINICAL DECISION
Consultation
Consent
Clinical assessment
Selected risk information
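The variant record on this slide packs several fields into one colon-separated line. The sketch below shows one way such a record could be split into named fields. The field meanings (chromosome, position, allele frequencies, gene, amino acid change) are an interpretation of the slide, not a documented format, and the names `VariantRecord` and `parse_variant` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VariantRecord:
    chromosome: str
    position: int
    allele_frequencies: dict
    gene: str
    amino_acid_change: str


def parse_variant(record: str) -> VariantRecord:
    """Split a colon-separated record into named fields.

    Assumes five fields in the order shown on the slide; this layout is
    an interpretation, not a published specification.
    """
    fields = [f.strip() for f in record.split(":") if f.strip()]
    chrom, pos, alleles, gene, change = fields
    freqs = {}
    for item in alleles.split("/"):
        # "T0.7" -> allele "T" with frequency 0.7
        freqs[item[0]] = float(item[1:])
    return VariantRecord(
        chromosome=chrom.lstrip("C"),       # "C3" -> "3"
        position=int(pos.replace(",", "")),  # "12,450,610" -> 12450610
        allele_frequencies=freqs,
        gene=gene,
        amino_acid_change=change,
    )


record = parse_variant("C3 : 12,450,610 : T0.7/C0.3 : PPARG : Pro/Leu")
print(record)
```

Keeping the parsed record as a small typed structure makes it straightforward to attach the medical and pharmacological consequences as further fields.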
5. 5
Platinum Genomes as a Truth Reference
Creating a catalogue of highly accurate SNPs, indels & SVs
Sequencing a 17-member, three-generation pedigree.
– Ultra-deep sequencing improves sensitivity
– Leveraging inheritance information improves accuracy
– Data and results made publicly available
Identifying ultra-accurate genomic variants is enabling rapid improvements in technology and software
This data will allow us to assess accuracy for many FDA submissions
We are collaborating with NIST & CDC to develop a public resource for quantifying sequencing accuracy
6. 6
Reduction from 40 Q-scores to 8 Q-scores is becoming accepted
Sequencing output is still increasing exponentially, so further compression is likely to be required
Platinum Genomes work suggests ~95% of the genome is consistently called (this 95% is known as the platinum regions)
Regions which are reliably called may not need 8 Q-score resolution
– we can reduce “well sequenced” regions to 2 Q-scores
Start with 8 Q-score BAM file:
– Reduce the platinum regions to 2 Q-scores (keep non-platinum at 8 Q-scores)
– Reduce the platinum regions to 1 Q-score
– Whole genome 2 Q-score
– Reduce platinum region to 2 Q-scores but also keep original Q-scores of mismatches (MM) and anomalous reads
– ~40Gb (20Gb CRAM)
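The region-dependent reduction described above can be sketched as follows. This is a minimal illustration, not the method used to produce the slide's results. The two-level bin table (`TWO_BINS`, with a threshold at Q20 and representative values 6 and 37), the half-open interval convention, and the function names are all assumptions made for the example.

```python
# Hypothetical two-level bin table: (lower_bound, representative_value),
# sorted by lower bound. The threshold and values are illustrative only.
TWO_BINS = [(0, 6), (20, 37)]


def bin_quality(q, bins):
    """Return the representative value of the highest bin whose lower bound is <= q."""
    rep = bins[0][1]
    for lower, representative in bins:
        if q >= lower:
            rep = representative
    return rep


def in_intervals(pos, intervals):
    """True if pos falls inside any half-open [start, end) interval."""
    return any(start <= pos < end for start, end in intervals)


def reduce_qualities(qualities, positions, platinum, mismatches=frozenset()):
    """Collapse Q-scores to two levels inside platinum regions.

    Bases outside platinum regions, and bases at mismatch positions,
    keep their original Q-scores.
    """
    reduced = []
    for q, pos in zip(qualities, positions):
        if in_intervals(pos, platinum) and pos not in mismatches:
            reduced.append(bin_quality(q, TWO_BINS))
        else:
            reduced.append(q)
    return reduced


qualities = [37, 33, 15, 40, 22, 6]
positions = [100, 101, 102, 103, 104, 105]
platinum = [(100, 104)]  # hypothetical platinum interval

plain = reduce_qualities(qualities, positions, platinum)
keep_mm = reduce_qualities(qualities, positions, platinum, mismatches={102})
print(plain)    # [37, 37, 6, 37, 22, 6]
print(keep_mm)  # [37, 37, 15, 37, 22, 6]
```

Fewer distinct values in the quality string is what lets a downstream compressor shrink it. Preserving original scores at mismatches keeps the detail where a variant caller is most likely to need it.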
Data Reduction Via Vertical Compression (NA12882)
Values in parentheses: >Q20
Build | Total SNPs (>Q20) | SNPs diff genotype (>Q20) | Not called in Q-score compressed build (>Q20) | Not called in 8 Q-score build (>Q20)
8 Q-score | 3,735,575 (3,627,165) | - | - | -
8 Q-score technical replicate | 3,734,849 (3,626,485) | 45,584 (22,400) | 80,131 (29,211) | 79,405 (28,845)
Platinum Genome 2 Q-score | 3,732,568 (3,620,612) | 3,255 (161) | 3,417 (63) | 410 (127)
Platinum Genome 1 Q-score | 3,764,928 (3,626,468) | 4,002 (584) | 2,605 (75) | 31,958 (2,964)
Whole Genome 2 Q-score | 3,712,636 (3,598,400) | 25,175 (1,912) | 24,237 (166) | 1,298 (112)
Platinum 2 Q-score, keep MM and anom. reads | 3,735,684 (3,627,226) | 197 (123) | 142 (35) | 251 (102)
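The counts in this table can be cross-checked with simple arithmetic. The sketch below assumes one reading of the columns: each build's total SNP count equals the 8 Q-score baseline total, minus the SNPs not called in that build, plus the SNPs not called in the 8 Q-score build. It uses only the unparenthesised totals from the slide; the variable and function names are illustrative.

```python
BASELINE_TOTAL = 3_735_575  # total SNPs in the 8 Q-score build

# build name: (total SNPs,
#              not called in this (compressed) build,
#              not called in the 8 Q-score build)
builds = {
    "8 Q-score technical replicate": (3_734_849, 80_131, 79_405),
    "Platinum Genome 2 Q-score": (3_732_568, 3_417, 410),
    "Platinum Genome 1 Q-score": (3_764_928, 2_605, 31_958),
    "Whole Genome 2 Q-score": (3_712_636, 24_237, 1_298),
    "Platinum 2 Q-score keep MM and anom. reads": (3_735_684, 142, 251),
}


def expected_total(baseline, lost, gained):
    """Baseline calls, minus calls lost in the build, plus calls unique to it."""
    return baseline - lost + gained


results = {
    name: expected_total(BASELINE_TOTAL, lost, gained) == total
    for name, (total, lost, gained) in builds.items()
}
for name, consistent in results.items():
    print(f"{name}: {'consistent' if consistent else 'MISMATCH'}")
```

Under that reading every row reconciles with the baseline. The most conservative scheme (platinum 2 Q-score, keeping mismatch and anomalous-read scores) also differs from the baseline by far fewer calls than the technical replicate does.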
7. 7
Faster Data – DNA to Result in <2 Days
Fast turnaround is required for clinical applications
12 core server, 64 GB RAM
[Figure: workflow Sample → Sequence → Analyze → Annotate; PCR Free library 4.5 hr, HiSeq2500 27 hr, Isaac analysis overnight 8 hr; 40 hr]
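As a quick check of the "<2 days" headline, the stage times on this slide can be summed. The pairing of durations with stages (4.5 hr for the PCR Free library, 27 hr for the HiSeq2500 run, 8 hr for the Isaac analysis) is inferred from the slide layout and should be treated as an assumption.

```python
# Assumed pairing of stage labels and durations in hours; the slide
# layout suggests it but does not state it explicitly.
stage_hours = {
    "PCR Free library": 4.5,
    "HiSeq2500 sequencing": 27.0,
    "Isaac analysis": 8.0,
}

total_hours = sum(stage_hours.values())
under_two_days = total_hours < 48

print(f"total: {total_hours} hr, under 2 days: {under_two_days}")
```

Under this pairing the stages sum to 39.5 hr, which matches the roughly 40 hr figure shown on the slide.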
8. 8
Importance of Non-coding Mutations – Bigger Data!
WGS reveals somatic mutations in the TERT gene promoter of melanoma patients
The mutations form a novel transcription factor binding motif
Recurrence in melanoma is as high as that of any known coding mutation
[Figure: TERT gene promoter region, positions -200 to +200]
Gene (mutation) | Incidence in melanoma
TERT (promoter) | 52%
BRAF (V600E) | 53%
CDKN2A | 50%
NRAS (Q61R) | 28%
TERT (coding) | 1%
Horn et al. & Huang et al., Science 2013
10. 10
Surveillance of Leukaemia (CLL) – More Data Complexity!
[Figure: patient event timeline (axis ticks 0, 62, 63, 64, 65, 66) marking Birth, Diagnosis, three Treatments and Death, with sequencing time points a to e]
[Figure: abundance (0 to 250) of changing subclonal populations (NORMAL, CLASS 1, CLASS 2, CLASS 3, CLASS 4) at time points a to e]
[Figure inset: time point c on an expanded scale (0 to 5); “Remission” has disease]
Schuh et al., Oxford
12. 12
Utility Requires Complex Composite Information
Annotate, Interpret, Disseminate
Annotation sources:
– Allele frequency in populations: www.1000genomes.org
– Medical/risk data (with expert review): HGMD, PharmGKB
– Genetic variants: dbSNP
– Functional effects: ensembl.org, genome.ucsc.edu, encode.org
– Disease association: genome.gov
ANNOTATED GENOME (gVCF), <1Gbyte
Interpretation: Ancestry, Tissue type, Risk, Carrier status, Diagnosis, Drug response
Dissemination: iPad, Plug and Play, Cloud
13. 13
Genomic Big Data Ecosystems
[Figure: ecosystem components: Apps, Public Genomic Databases, Users, EMR, Support & Engineering, Instruments]
14. 14
Genomic Big Data Status
Researcher
Treatment choice
Clinician
Patient
Knowledge
Information
15. 15
Challenges for this Meeting to Address
What data frameworks and models are required?
How will genomes (DNA, RNA, methylation states, etc.) be aggregated and compared?
How will collaboration and data sharing evolve?
Where will the technology go, and how must the community respond to leverage the benefits?
Brainstorming of ideas
Sessions from groups that have experience in many fields
Next steps!!
Actively participate and enjoy the entire experience!