Crowdsourcing Biology: The Gene Wiki, BioGPS, and Citizen Science
1. Crowdsourcing Biology: The Gene Wiki, BioGPS and GeneGames.org
Andrew Su, Ph.D.
@andrewsu
asu@scripps.edu
http://sulab.org
May 14, 2014
CBIIT
Slides: slideshare.net/andrewsu
Citizen Science!
2. Few genes are well annotated…
Data: NCBI, February 2013
[Figure: GO annotation counts per gene for 20,473 protein-coding genes, sorted by decreasing counts. Labeled genes: CTNNB1, VEGFA, SIRT1, FGFR2, TGFB1, TP53, MEF2C, BMP4, LEF1, WNT5A, TNF. Marked percentages: 41%, 65%.]
3. … because the literature is sparsely curated?
[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
4. … because the literature is sparsely curated?
[Figure: Average capacity of human scientist, 1983 to 2013; y-axis from 0 to 40.]
6. Sooner or later, the research community will need to be involved in the annotation effort to scale up to the rate of data generation.
7. The Long Tail is a prolific source of content
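[Figure: Content produced vs. contributors (sorted), split into a Short Head and a Long Tail.]
News: Newspapers (Short Head), Blogs (Long Tail)
Video: TV/Hollywood (Short Head), YouTube (Long Tail)
Product reviews: Consumer reports (Short Head), Amazon reviews (Long Tail)
Food reviews: Food critics (Short Head), Yelp (Long Tail)
Talent judging: Olympics (Short Head), American Idol (Long Tail)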
9. Wikipedia has breadth and depth
http://en.wikipedia.org/wiki/Wikipedia:Size_comparisons, July 2008
[Figure: Articles and words (millions), Wikipedia vs. Britannica Online.]
10. We can harness the Long Tail of scientists to directly participate in the gene annotation process.
14. Wiki success depends on a positive feedback loop
[Diagram: cycle linking Gene Wiki page utility, number of users, and number of contributors.]
15. 10,000 gene “stubs” within Wikipedia
[Figure: Gene Wiki stub page with the following callouts.]
• Protein structure
• Symbols and identifiers
• Tissue expression pattern
• Gene Ontology annotations
• Links to structured databases
• Gene summary
• Protein interactions
• Linked references
Huss, PLoS Biol, 2008
16. Gene Wiki has a critical mass of readers
Total: 4.0 million views / month
Huss, PLoS Biol, 2008; Good, NAR, 2011
17. Gene Wiki has a critical mass of editors
Increase of ~10,000 words / month from >1,000 edits
Currently 1.42 million words
Approximately equal to 230 full-length articles
[Figure: Editor count and edit count.]
Good, NAR, 2011
18. A review article for every gene is powerful
References to the literature
Hyperlinks to related concepts
Reelin: 98 editors, 703 edits since July 2002
Heparin: 358 editors, 654 edits since June 2003
AMPK: 109 editors, 203 edits since March 2004
RNAi: 394 editors, 994 edits since October 2002
19. Making the Gene Wiki more computable
[Diagram: Structured annotations, Free text]
20. Filling the gaps in gene annotation
[Diagram: a wikilink on a Gene Wiki page becomes a candidate assertion. Gene Wiki mapping: NCBI Entrez Gene: 334. GO exact match: GO:0006897.]
6319 novel GO annotations
2147 novel DO annotations
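The candidate-assertion step on this slide can be sketched roughly as follows. This is a minimal illustration, not the pipeline used for the reported numbers: the helper `candidate_assertions`, the toy label table, the label "endocytosis" for GO:0006897, and the example wikilinks are all assumptions added here.

```python
from typing import NamedTuple


class Assertion(NamedTuple):
    """A candidate gene-to-term annotation awaiting review."""
    entrez_gene_id: int
    go_id: str


# Toy label-to-ID table (assumption); a real table would cover the full ontology.
GO_LABELS = {"endocytosis": "GO:0006897"}


def candidate_assertions(entrez_gene_id, wikilinks, go_labels=GO_LABELS):
    """Return one candidate assertion per wikilink whose title exactly
    matches a GO term label (compared case-insensitively)."""
    found = []
    for title in wikilinks:
        go_id = go_labels.get(title.strip().lower())
        if go_id is not None:
            found.append(Assertion(entrez_gene_id, go_id))
    return found


# Hypothetical wikilinks on the Gene Wiki page mapped to Entrez Gene 334.
links = ["Endocytosis", "Protein", "Cell membrane"]
print(candidate_assertions(334, links))
```

Exact label matching keeps precision high at the cost of recall; synonyms and spelling variants would need an extra lookup step.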
21. Gene Wiki content improves enrichment analysis
[Figure: Enrichment p-value (PubMed only) vs. p-value (PubMed + GW), with regions labeled "More significant PubMed + GW" and "More significant PubMed only". Highlighted term: Muscle contraction.]
Good BM et al., BMC Genomics, 2011
22. Making the Gene Wiki more computable
[Diagram: Structured annotations, Free text, Analyses]
36. Utility: A simple and universal plugin interface
Total of > 540 gene-centric online databases registered as BioGPS plugins
37. Users: BioGPS has critical mass
• > 6400 registered users
• 14,000 unique visitors per month
• 155,000 page views per month
Top 10 organizations:
1. Harvard
2. NIH
3. UCSD
4. Scripps
5. MIT
6. Cambridge
7. U Penn
8. Stanford
9. Wash U
10. UNC
[Figure: Daily pageviews.]
38. Contributors: Explicit and implicit knowledge
540 plugins registered (>300 publicly shared) by over 120 users spanning 280+ domains
39. Gene Annotation Query as a Service
http://mygene.info
• High performance
• 3M hits/month
• Highly scalable
• 13k species
• 16M genes
• Weekly data updates
• JSON output
• REST interface
• Python/R/JS libraries
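A REST query against the service might look like the sketch below. The `/v3/query` path and the `q`, `species`, and `fields` parameter names are assumptions that should be checked against the service's own documentation; the helper names are hypothetical. Only the URL-building step runs here, so no network access is needed.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Host taken from the slide; the "/v3" path segment is an assumption.
BASE_URL = "http://mygene.info/v3"


def build_query_url(term, species="human", fields="symbol,name,entrezgene"):
    """Build a gene query URL for the REST interface."""
    params = {"q": term, "species": species, "fields": fields}
    return f"{BASE_URL}/query?{urlencode(params)}"


def query_genes(term, **kwargs):
    """Fetch the query result and decode the JSON body (needs network access)."""
    with urlopen(build_query_url(term, **kwargs)) as response:
        return json.load(response)


# Print the request URL only; call query_genes("symbol:cdk2") to run it.
print(build_query_url("symbol:cdk2"))
```

Keeping URL construction separate from the network call makes the request easy to inspect and test offline.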
42. The biomedical literature is growing fast
[Figure: Number of new PubMed-indexed articles per year, 1983 to 2013; y-axis from 0 to 1,200,000.]
43. Information Extraction
1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
44. Disease mentions in PubMed abstracts
NCBI Disease corpus
• 793 PubMed abstracts (100 development, 593 training, 100 test)
• 12 expert annotators (2 annotate each abstract)
6,900 “disease” mentions
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in
PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural
Language Processing. Association for Computational Linguistics.
45. Four types of disease mentions
Specific Disease:
• “Diastrophic dysplasia”
Disease Class:
• “Cancers”
Composite Mention:
• “prostatic , skin , and lung cancer”
Modifier:
• ..the “familial breast cancer” gene , BRCA2..
Doğan, Rezarta, and Zhiyong Lu. "An improved corpus of disease mentions in
PubMed citations." Proceedings of the 2012 Workshop on Biomedical Natural
Language Processing. Association for Computational Linguistics.
46. Question: Can a group of non-scientists collectively perform concept recognition in biomedical texts?
49. Amazon Mechanical Turk (AMT)
Requester
For each task, specify:
• a qualification test
• how many workers per task
• how much we will pay per task
Amazon
Manages:
• parallel execution of jobs
• worker access to tasks via qualification tests
• payments
• task advertising
Workers
Steps:
1. Create tasks
2. Execute
3. Aggregate
50. Instructions to workers
• Highlight all diseases and disease abbreviations
– “...are associated with Huntington disease ( HD )... HD patients received...”
– “The Wiskott-Aldrich syndrome ( WAS ) , an X-linked immunodeficiency…”
• Highlight the longest span of text specific to a disease
– “... contains the insulin-dependent diabetes mellitus locus …”
• Highlight disease conjunctions as single, long spans
– “... a significant fraction of familial breast and ovarian cancer , but undergoes…”
• Highlight symptoms (physical results of having a disease)
– “XFE progeroid syndrome can cause dwarfism, cachexia, and microcephaly. Patients often display learning disabilities, hearing loss, and visual impairment.”
51. Qualification test
Test #1: “Myotonic dystrophy ( DM ) is associated with a ( CTG ) n trinucleotide repeat expansion in the 3-untranslated region of a protein kinase-encoding gene , DMPK , which maps to chromosome 19q13 . 3 . ”
Test #2: “Germline mutations in BRCA1 are responsible for most cases of inherited breast and ovarian cancer . However , the function of the BRCA1 protein has remained elusive . As a regulated secretory protein , BRCA1 appears to function by a mechanism not previously described for tumour suppressor gene products.”
Test #3: “We report about Dr . Kniest , who first described the condition in 1952 , and his patient , who , at the age of 50 years is severely handicapped with short stature , restricted joint mobility , and blindness but is mentally alert and leads an active life . This is in accordance with molecular findings in other patients with Kniest dysplasia and…”
26 yes / no questions
54. Experimental design
• Task: Identify the disease mentions in the 593 abstracts from the NCBI disease corpus
– $0.06 per Human Intelligence Task (HIT)
– HIT = annotate one abstract from PubMed
– 5 workers annotate each abstract
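The design parameters above imply a base reward budget that can be worked out directly. The sketch below covers worker rewards only; platform fees and any other charges are not modeled, so the figure need not equal the total spend reported for the experiment. The function name is hypothetical.

```python
ABSTRACTS = 593            # abstracts to annotate
WORKERS_PER_ABSTRACT = 5   # independent annotations per abstract
REWARD_CENTS = 6           # $0.06 per HIT, kept in integer cents


def reward_cost_cents(abstracts, workers_per_abstract, reward_cents):
    """Total worker rewards in cents, excluding any platform fees."""
    return abstracts * workers_per_abstract * reward_cents


assignments = ABSTRACTS * WORKERS_PER_ABSTRACT
total_cents = reward_cost_cents(ABSTRACTS, WORKERS_PER_ABSTRACT, REWARD_CENTS)
print(assignments, f"${total_cents / 100:.2f}")
```

Working in integer cents avoids floating-point rounding when multiplying by $0.06.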
55. Aggregation function based on simple voting
[Figure: one example passage shown four times, with highlights from 5 workers aggregated at increasing vote thresholds: 1 or more votes (K=1), K=2, K=3, K=4.]
Example passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
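Vote-threshold aggregation of this kind can be sketched as below. It is a minimal illustration that assumes votes are counted on exact character spans; the slides do not say whether votes were tallied per span or per token, and the function name and example highlights are hypothetical.

```python
from collections import Counter


def aggregate_by_votes(worker_annotations, k):
    """Keep every span highlighted by at least k workers.

    worker_annotations: one collection of (start, end) spans per worker.
    """
    votes = Counter()
    for spans in worker_annotations:
        votes.update(set(spans))  # each worker votes at most once per span
    return sorted(span for span, count in votes.items() if count >= k)


# Hypothetical highlights from N = 5 workers on one abstract.
workers = [
    {(10, 18), (40, 62)},
    {(10, 18), (40, 62)},
    {(10, 18)},
    {(10, 18), (70, 75)},
    {(40, 62)},
]
for k in range(1, 5):
    print(k, aggregate_by_votes(workers, k))
```

Raising K trades recall for precision: K=1 keeps anything any worker marked, while higher K keeps only spans with broad agreement.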
56. Comparison to gold standard
F = 0.81, k = 2, N = 5
• 593 documents
• 7 days
• 17 workers
• $192.90
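The F score reported here is, per the editor's notes, the harmonic mean of precision and recall. A minimal sketch of that computation follows, assuming a predicted mention counts as correct only when it exactly matches a gold-standard span; the matching rule, function name, and example spans are assumptions for illustration.

```python
def f_measure(predicted, gold):
    """Harmonic mean of precision and recall over exact-match spans."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    if true_positives == 0:
        return 0.0  # also covers empty inputs, avoiding division by zero
    precision = true_positives / len(predicted)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


# Hypothetical spans: 3 of 4 predictions are correct, 3 of 5 gold spans found.
predicted_spans = {(0, 5), (10, 18), (40, 62), (70, 75)}
gold_spans = {(0, 5), (10, 18), (40, 62), (80, 90), (95, 99)}
print(round(f_measure(predicted_spans, gold_spans), 3))
```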
62. Comparisons to human annotators
Average level of agreement between expert annotators (stage 1): F = 0.76
63. Comparisons to human annotators
Average level of agreement between expert annotators (stage 1): F = 0.76
Average level of agreement between expert annotators (stage 2): F = 0.87
64. In aggregate, our worker ensemble is faster, cheaper and as accurate as a single expert annotator for disease concept recognition.
65. Information Extraction
1. Find mentions of high-level concepts in text
2. Map mentions to specific terms in ontologies
3. Identify relationships between concepts
66. Annotating the relationships
Example passage: “This molecule inhibits the growth of a broad panel of cancer cell lines, and is particularly efficacious in leukemia cells, including orthotopic leukemia preclinical models as well as in ex vivo acute myeloid leukemia (AML) and chronic lymphocytic leukemia (CLL) patient tumor samples. Thus, inhibition of CDK9 may represent an interesting approach as a cancer therapeutic target especially in hematologic malignancies.”
[Diagram: subject (GENE), predicate (therapeutic target), object (DISEASE).]
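A subject-predicate-object annotation like the one on this slide could be represented as in the sketch below. The class names are hypothetical, and pairing CDK9 with "hematologic malignancies" is one reading of the example passage rather than something the slide states.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """A recognized mention with its concept type."""
    text: str
    kind: str  # e.g. "GENE" or "DISEASE"


@dataclass(frozen=True)
class Relation:
    """One subject-predicate-object annotation."""
    subject: Concept
    predicate: str
    object: Concept

    def as_triple(self):
        """Return the relation as a plain (subject, predicate, object) tuple."""
        return (self.subject.text, self.predicate, self.object.text)


# Illustrative reading of the example passage (assumption).
relation = Relation(
    subject=Concept("CDK9", "GENE"),
    predicate="therapeutic target",
    object=Concept("hematologic malignancies", "DISEASE"),
)
print(relation.as_triple())
```

Frozen dataclasses are hashable, so identical relations from different workers can be counted or deduplicated directly.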
69. Collaborators
Doug Howe, ZFIN
John Hogenesch, U Penn
Jon Huss, GNF
Luca de Alfaro, UCSC
Angel Pizzaro, U Penn
Faramarz Valafar, SDSU
Pierre Lindenbaum, Fondation Jean Dausset
Michael Martone, Rush
Konrad Koehler, Karo Bio
Warren Kibbe, Simon Lim, Northwestern
Lynn Schriml, U Maryland
Paul Pavlidis, U British Columbia
Peipei Ping, UCLA
Many Wikipedia editors
WP:MCB Project
Group members
Katie Fisch
Karthik Gangavarapu
Louis Gioia
Ben Good
Salvatore Loguercio
Adam Mark
Max Nanis
Ginger Tseung
Chunlei Wu
Key group alumni
Adriel Carolino
Erik Clarke
Jon Huss
Marc Leglise
Maximilian Ludvigsson
Ian MacLeod
Camilo Orozco
Contact
http://sulab.org
asu@scripps.edu
@andrewsu
+Andrew Su
Citizen Science logo based on http://thenounproject.com/term/teamwork/39543/
Funding and Support
(BioGPS: GM83924, Gene Wiki: GM089820, DA036134)
70. Related AMT work
• [1] Zhai et al. (2013) used a similar protocol to tag medication names in clinical trial descriptions. F = 0.88 compared to gold standard.
• [2] Burger et al. (2014) used microtask workers to identify relationships between genes and mutations.
• [3] Aroyo & Welty (2013) used workers to identify relations between concepts in medical text.
[1] Zhai H. et al. (2013) "Web 2.0-Based Crowdsourcing for High-Quality Gold Standard Development in Clinical Natural Language Processing." J Med Internet Res.
[2] Burger, John, et al. (2014) "Hybrid curation of gene-mutation relations combining automated extraction and crowdsourcing." Mitre technical report.
[3] Aroyo, Lora, and Chris Welty (2013) "Harnessing disagreement in crowdsourcing a relation extraction gold standard." Tech. Rep. RC25371 (WAT1304-058), IBM Research.
Editor's notes
We are very early in our efforts to comprehensively annotate human gene function. Why is this important? Genome-scale surveys aren’t biased toward well-studied genes, so there is a huge opportunity for biomedical discovery. IEA annotations are not counted.
If you believe that greater than 1.5% of articles have relevance to gene function, then it says there is a bottleneck in our curation efforts. Numbers updated 7/15/2011.
For each resource: briefly describe the unstructured resource, then describe the structuring approach.
Relying on the entire community of scientists to digest the biomedical literature: identification, filtering, extraction, summarization.
Also want to convince you that the Long Tail of bioinformatics developers is valuable too, but first have to convince you that there is a bottleneck in tool development.
Developer resources do not scale with usage. Practical effects: core developers’ time is always the rate-limiting step; addition of new features and data always feels slow; eventually, new databases are created to fill the gap. 80% duplication for 20% innovation.
MODs and portals
Genetics resources
Literature resources
Protein resources
Pathway and expression databases
… but the amount of knowledge that is amenable to query and computation is tiny. We would like to have more efficient methods for information extraction.
F is the harmonic mean of precision and recall. Results are on the 593-abstract training corpus.
On the 100-abstract development data set.
Phase 1: pairs of annotators work independently on computationally pre-annotated documents. Phase 2: annotators see each other’s annotations and then make changes. Phase 3: all remaining inconsistencies are resolved collaboratively.