http://marianattestad.com
How big data is transforming biology and how we are using Python to make sense of it all. Genome sequencing, big data in genomics, long-read sequencing, SplitThreader string graph library in Python, cancer research, and personal genomics!
I am Maria Nattestad, a PhD student in computational biology at Cold Spring Harbor Laboratory.
1. How big data is transforming biology and how we are using Python to make sense of it all
Maria Nattestad
Computational biology PhD student
PyData NYC 2015
5. Mutations in the genome can lead to cancer and other diseases
Over 20,000 genes are scattered all over the genome.
The genome is the instruction manual for creating a living thing.
Some changes in the genome do nothing or encode normal variation like hair color, while others can cause disease.
8. Sequencing by the numbers
• The human genome is 6 billion letters long [A, T, C, G]
• No technology exists that can read an entire chromosome from end to end
• Illumina sequencing produces reads of about 100 letters of sequence each
• If the genome were random, this would be enough to place each read uniquely
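A rough back-of-the-envelope check of the last bullet, as a sketch only. It assumes an idealized genome in which every letter is drawn independently and uniformly from A, T, C, G, and it reuses the 6 billion and 100 figures from the slide. The function names are made up for this illustration.

```python
GENOME_LENGTH = 6_000_000_000  # letters, from the slide
READ_LENGTH = 100              # letters per Illumina read, from the slide


def expected_chance_matches(genome_length, read_length):
    """Expected number of places a given read matches purely by chance,
    assuming independent, uniformly random letters (an idealization)."""
    return genome_length / 4 ** read_length


def shortest_unique_length(genome_length):
    """Smallest read length k for which there are more possible
    sequences (4**k) than positions in the genome."""
    k = 1
    while 4 ** k <= genome_length:
        k += 1
    return k


chance = expected_chance_matches(GENOME_LENGTH, READ_LENGTH)
min_length = shortest_unique_length(GENOME_LENGTH)
print(f"Expected chance matches for a 100-letter read: {chance:.1e}")
print(f"Shortest length with more sequences than positions: {min_length}")
```

Under this random model, reads far shorter than 100 letters would already be expected to be unique, which is why 100 letters would be plenty if the genome had no structure.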
9. The genome is not random
ATCGATCAT?ATCGATCATA
repeats
Because of this, the human genome STILL has gaps
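A toy illustration of why repeats cause trouble, as a sketch only. The genome string and the reads below are invented for this example and are not from the talk: a read that falls entirely inside a repeated stretch matches in more than one place, while a read that reaches into unique sequence next to the repeat matches only once.

```python
def find_occurrences(genome, read):
    """Return every start position where the read matches the genome exactly."""
    positions = []
    start = genome.find(read)
    while start != -1:
        positions.append(start)
        start = genome.find(read, start + 1)
    return positions


# Invented toy genome: the same 9-letter repeat appears twice.
REPEAT = "ATCGATCAT"
genome = "GGTACC" + REPEAT + "TTGCA" + REPEAT + "CCGGA"

read_inside_repeat = "CGATCA"    # lies entirely within the repeat
read_spanning_border = "ACCATC"  # overlaps unique sequence and the repeat

print(find_occurrences(genome, read_inside_repeat))    # two candidate positions
print(find_occurrences(genome, read_spanning_border))  # one position
```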
10. Repeats make it harder to assemble the genome puzzle
[Figure: puzzle pieces A, B, C and D joined through copies of a repeat R]
If a repeat is longer than the reads
11. Long-read DNA Sequencing
Pacific Biosciences
Oxford Nanopore MinION
>10X as expensive as next-generation (Illumina) sequencing
>100X read length
14. How the human genome changes during cancer
Normal human genome
15. How the human genome changes during cancer
(Davidson et al., 2000)
80 chromosomes instead of 46
Cancer genome
This cell line comes from a woman with metastatic breast cancer in 1971; the tumor cells have been grown and studied in the lab ever since.
19. SplitThreader:
A new Python graph library for representing rearranged genomes
[Figure: CHR 1 (ATCGCCTA) and CHR 2 (GTCCATAG) shown before and after a rearrangement as a graph; segment labels ATCG, CCGA, ATAG, GTCC; edge labels 8, 10, 2]
20. Class structure of SplitThreader
[Diagram: a Graph containing four Nodes, each drawn with two Ports, joined by Edges]
Once you enter a node, you must exit out the other side like a tunnel.
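One way to picture the Graph, Node, Port and Edge classes and the tunnel rule is the minimal sketch below. It is not SplitThreader's actual code or API: the attribute and method names, the node names, and the edge weights are all assumptions made up for this illustration.

```python
class Port:
    """One end of a node. Edges attach to ports, not to nodes directly."""

    def __init__(self, node, side):
        self.node = node
        self.side = side  # "start" or "end"
        self.edges = []


class Node:
    """A stretch of sequence with exactly two ports."""

    def __init__(self, name):
        self.name = name
        self.ports = {"start": Port(self, "start"), "end": Port(self, "end")}

    def exit_port(self, entry_port):
        """Tunnel rule: entering through one port means leaving by the other."""
        other = "end" if entry_port.side == "start" else "start"
        return self.ports[other]


class Edge:
    """A connection between two ports, with a hypothetical weight."""

    def __init__(self, port_a, port_b, weight=0):
        self.ports = (port_a, port_b)
        self.weight = weight
        port_a.edges.append(self)
        port_b.edges.append(self)

    def other_port(self, port):
        return self.ports[1] if port is self.ports[0] else self.ports[0]


class Graph:
    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, name):
        self.nodes[name] = Node(name)
        return self.nodes[name]

    def add_edge(self, name_a, side_a, name_b, side_b, weight=0):
        edge = Edge(self.nodes[name_a].ports[side_a],
                    self.nodes[name_b].ports[side_b], weight)
        self.edges.append(edge)
        return edge


# Two chromosomes, each cut into two pieces, plus one rearrangement edge.
graph = Graph()
for name in ("chr1_left", "chr1_right", "chr2_left", "chr2_right"):
    graph.add_node(name)
graph.add_edge("chr1_left", "end", "chr1_right", "start", weight=8)
graph.add_edge("chr2_left", "end", "chr2_right", "start", weight=10)
rearrangement = graph.add_edge("chr1_left", "end", "chr2_right", "start", weight=2)

# Leave chr1_left through its end, cross the rearrangement, tunnel through chr2_right.
leaving = graph.nodes["chr1_left"].ports["end"]
entering = rearrangement.other_port(leaving)
exiting = entering.node.exit_port(entering)
print(entering.node.name, entering.side, "->", exiting.side)
```

Attaching edges to ports rather than to whole nodes is what lets the graph record which end of a segment a rearrangement joins, so a walk cannot turn around inside a segment.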
21. Biological insights from SplitThreader
Depth-first search or breadth-first search
Gene fusion finding
History of mutations
22. Using SplitThreader to find a gene fusion
Goal is to find a path like this:
[Figure: a path through the graph linking CYTH1 to EIF3H]
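A search for such a path could look roughly like the breadth-first search sketched below. This is an illustration under stated assumptions, not SplitThreader's implementation: the graph is reduced to a plain dictionary of ports, each written as a (node, side) pair, and the helper names, the "bridge" node and the toy connections are invented here. The search respects the tunnel rule by always leaving a node through the side opposite the one it entered.

```python
from collections import deque


def other_side(side):
    return "end" if side == "start" else "start"


def connect(edges, port_a, port_b):
    """Record an undirected edge between two (node, side) ports."""
    edges.setdefault(port_a, []).append(port_b)
    edges.setdefault(port_b, []).append(port_a)


def find_path(edges, source, target):
    """Breadth-first search from source to target over ports.

    Returns the list of node names on a shortest path, or None.
    """
    if source == target:
        return [source]
    # The walk may leave the source node through either of its ports.
    queue = deque([((source, "start"), [source]), ((source, "end"), [source])])
    seen = {(source, "start"), (source, "end")}
    while queue:
        exit_port, path = queue.popleft()
        for node, entry_side in edges.get(exit_port, []):
            new_path = path + [node]
            if node == target:
                return new_path
            # Tunnel rule: leave through the side opposite the entry side.
            next_exit = (node, other_side(entry_side))
            if next_exit not in seen:
                seen.add(next_exit)
                queue.append((next_exit, new_path))
    return None


# Toy graph: the two genes are joined through one hypothetical segment.
edges = {}
connect(edges, ("CYTH1", "end"), ("bridge", "start"))
connect(edges, ("bridge", "end"), ("EIF3H", "start"))
fusion_path = find_path(edges, "CYTH1", "EIF3H")
print(fusion_path)

# Counter-example: X and Z both attach to the same side of Y, so a walk
# from X would have to turn around inside Y, which the tunnel rule forbids.
blocked = {}
connect(blocked, ("X", "end"), ("Y", "start"))
connect(blocked, ("Z", "start"), ("Y", "start"))
print(find_path(blocked, "X", "Z"))
```

Breadth-first search is used here because it returns a path with the fewest segments; a depth-first search would also find a path if one exists, just not necessarily the shortest.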
23. Too many copies of Her2 contribute to making cancer worse
[Figure labels: Sequencing; Actual genome; Her2]
Too much Her2
Too much signal to divide
Too many cell divisions
Cancer grows
24. About 40 copies of the Her2 gene are scattered around the genome
Her2 gene
28. Her2
Chr 17
Chr 8
1. Healthy chromosome 17
2. Sequence copied into chromosome 8
3. Subsequence copied within chromosome 8
4. Complex variant and inverted duplication within chromosome 8
5. Subsequence copied within chromosome 8
29. SplitThreader is open source on GitHub
[Figure: the rearranged CHR 1 / CHR 2 graph from slide 19]
https://github.com/marianattestad/splitthreader
Visualization with D3.js is underway!
Contributions are very welcome
31. Personal genomics
SNP chip
• 23andMe
• About $100
• Captures tiny mutations scientists already know to look for
Sequencing
• Illumina, SureGenomics
• About $1,000
• Captures large and small mutations even if completely novel and unexpected
32. Personal genomics debates
• Should the government allow these companies to give people their genomic data?
– How about interpreting the health risks?
• Is sharing your own genome breaking your family’s privacy?
SKBR3 is from a pleural effusion (lung metastasis) of breast cancer isolated from a 43-year-old woman in 1971.
Spectral karyotyping (SKY)
The cells were isolated in 1971 from a metastasis in the lungs of a woman with breast cancer. Some cancer cells can grow in the lab, so scientists use them to learn about human biology without experimenting on patients.