Protein structure prediction with a focus on Rosetta
1. PROTEIN STRUCTURE PREDICTION WITH A FOCUS ON ROSETTA
This presentation was prepared by: Xavier Ambroggio, ambroggiox@niaid.nih.gov
OFFICE OF CYBER INFRASTRUCTURE AND COMPUTATIONAL BIOLOGY
NATIONAL INSTITUTE OF ALLERGY AND INFECTIOUS DISEASES
2. Fall 2011 Computational Structural Biology Seminar Series
9 – 11 AM, T/Th in 12A/B51 http://training.cit.nih.gov
Week | Day | Date | Course | Instructor | CIT Course #
1 | Tues | Aug. 23 | Fundamentals, Data Sources, and Visualization of Macromolecular Structure | Darrell Hurt | SS260-11001
1 | Thurs | Aug. 25 | Generating Protein Structures from Homology | Darrell Hurt | SS270-11001
2 | Tues | Aug. 30 | Predicting Protein Structures from Amino Acid Sequences | Xavier Ambroggio | SS660-11001
2 | Thurs | Sept. 1 | Predicting Macromolecular Complexes from Uncomplexed Structures | Xavier Ambroggio | SS670-11001
3 | Tues | Sept. 6 | Design and Analysis of Macromolecular Interfaces | Xavier Ambroggio | SS770-11001
3 | Thurs | Sept. 8 | Analysis and Advanced Visualization of Macromolecular Structure | Darrell Hurt | SS330-11001
4 | Tues | Sept. 13 | Computational Drug Design | Mike Dolan | SS340-11001
4 | Thurs | Sept. 15 | Introduction to Molecular Dynamics | Mike Dolan | TBA
5 | Thurs | Sept. 22 | Advanced Molecular Dynamics | Mike Dolan | TBA
4. Ab Initio Structure Prediction:
Given an amino acid sequence, find the tertiary structure
“Protein folding problem”
5. CASP: Critical Assessment of protein Structure Prediction
http://predictioncenter.org
• Double-blind experiment (…competition)
• World-wide scientific community
• Unbiased assessment of techniques in structure prediction
• Biennial (every even year)
• “Pulse” of the prediction community
• What can be predicted?
• Which servers/algorithms perform best?
7. CASP Top Free-Modeling Servers
Why Rosetta focus?
• Standalone
• Versatile: RNA, design, dock, …
• Open Source
• Substantial Literature
• Shared methodology
Use any and all available servers!!!
9. Brief Description of Select Rosetta Functions
ab initio: predict the structure from sequence
relax: refine the structure using Rosetta energy functions
idealize: replace bond geometries with ideal values
loop modeling: build and refine local structurally variable regions in context of a structural template
design: optimize sequence given a structure with a fixed backbone
docking: structure prediction for a protein-protein complex given subunits
ligand: ligand docking
ddG prediction: ddG calculations for mutations (protein-protein interface and protein stability)
scoring: score input conformations with Rosetta energy functions
RNA: predict RNA structures from sequences and design sequences from fixed structures
clustering: group input structures by RMSD to each other for structure prediction analysis
backrub: generate alternate backbone conformations based on sets of rotations
membrane ab initio: predict the structures of helical membrane proteins
enzyme design: redesign a protein around a ligand
domain assembly: fixed domains connected by variable regions
antibody: automated antibody homology modeling
XML parsing: parse XML scripts into protocols
10. What types of protein domains can Rosetta fold?
A. Small, globular, soluble protein domains… (T4-lysozyme C-terminal domain)
B. Small, simple membrane protein domains… (V-type Na+ ATP synthase subunit)
C. …but not complex domains or multi-domain proteins. (rhodopsin)
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
11. What are the success rates?
High resolution predictions are achievable
• targets ≤100 residues
• success rate ~30%
• success rate with accurate secondary structure ~50%
• a hallmark of accuracy: convergence
Slide content courtesy Rhiju Das, Baker Lab
12. What types of protein domains can no one fold?
CASP9: domains with no good FM predictions
Slide content adapted from talk given by Lisa Kinch of the Grishin lab at CASP9 meeting: http://predictioncenter.org/casp9/
• Non-globular
• Trimeric
• Fe stabilized
• High contact order: many residues close in 3D, far in 1D
• + elongated sheet?
T0591d1 (3MWT), T0550d2 (3NQK), T0629d2 (2XGF)
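The "high contact order" idea above (residues close in 3D but far apart in 1D) can be made concrete with a relative contact order calculation. The sketch below is a simplified Cα-only version; the 8 Å cutoff and the function name are illustrative assumptions, not values taken from the slides.

```python
import math

def relative_contact_order(ca_coords, cutoff=8.0):
    """Relative contact order from C-alpha coordinates (simplified sketch).

    ca_coords: list of (x, y, z) tuples, one per residue, in sequence order.
    cutoff: C-alpha distance in Angstroms below which two residues count as
            being in contact (an assumed, illustrative threshold).
    """
    n_res = len(ca_coords)
    n_contacts = 0
    total_separation = 0
    for i in range(n_res):
        for j in range(i + 1, n_res):
            if math.dist(ca_coords[i], ca_coords[j]) < cutoff:
                n_contacts += 1
                total_separation += j - i  # distance along the sequence (1D)
    if n_contacts == 0:
        return 0.0
    # Average sequence separation of contacts, normalized by chain length.
    return total_separation / (n_contacts * n_res)
```

A chain whose contacts are mostly between sequence neighbors gives a low value; a topology that brings distant parts of the chain together gives a higher one.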
13. Basic Ab Initio Rosetta protocol
1. Select fragments consistent with local sequence preferences
2. Assemble fragments into models with native-like global properties
3. Identify the best model from the population of decoys
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Charlie Strauss;
Protein structure prediction using ROSETTA, Rohl et al (2004) Methods in Enzymology, 383:66
14. Fragment-Based Structure Prediction
Rosetta, Quark, …
[Figure: Fragments → Assembly → Decoys]
Homology modeling:
[Figure: Template(s) → Alignment → Model]
15. First atomic-resolution model
Target 0281 CASP6
• Topology sampled by ab initio trajectory of homolog sequence (rmsd=2.2Å)
• Full atom refinement reduces rmsd to 1.5Å
• Side chain packing accurately recovered
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Bradley P, Malmström L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D. Free modeling with Rosetta in CASP6. Proteins.
16. Folding Theory: Sequence-Structure Relationships
• Secondary structure formation is the earliest part of the folding process
• Local sequence codes for local structures… i.e. fragments
helical sequences in a folded protein tend to be helical in isolation
• Secondary structure prediction algorithms have ~70-80% accuracy
Partial failure due to tertiary interactions stabilizing secondary structure elements
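The ~70-80% figure above is usually reported as a per-residue three-state accuracy (commonly called Q3). A minimal sketch of that measure, assuming predicted and observed states are given as equal-length strings over H (helix), E (strand), and C (coil); the function name is illustrative:

```python
def q3_accuracy(predicted, observed):
    """Fraction of residues whose predicted 3-state label matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return matches / len(observed)

# Example: 8 of 10 residues agree.
accuracy = q3_accuracy("HHHHCCEEEE", "HHHCCCEEEC")
```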
17. Rosetta fragments
• 3 and 9 residue fragments matched to query sequence
• database created from crystal structures
  < 2.5Å resolution
  < 50% sequence identity
• low resolution modeling
  centroid representation of side chains
• ranked by:
  alignment
  secondary structure predictions (PSIPRED, SAM-T02, JUFO, PHD)
19. Example 3-mer fragment library
#   Rank  G K L M Q E R A
13  1000  G K L
25   821  G R L
46  1000    K L M
21   635    R L M
43   923    K V M
26   523    R V M
15   970        M Q E
26   934            E R A
Separate 3-mer and 9-mer libraries generated
Slide content courtesy David Hoover, CIT, NIH
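The layout of the example library above (candidate fragments listed under each overlapping window of the query) can be sketched as follows. This is a toy illustration only: it ranks candidates by plain sequence identity against a hypothetical in-memory database, whereas the slides describe ranking by alignment and secondary structure predictions, and real fragments carry backbone torsions rather than just letters.

```python
def pick_fragments(query, database, size=3, top_n=2):
    """For each window of `size` residues in `query`, keep the `top_n`
    database fragments with the highest sequence identity to that window.

    Returns a dict mapping 1-based window start position -> list of fragments.
    """
    library = {}
    for start in range(len(query) - size + 1):
        window = query[start:start + size]
        scored = []
        for frag in database:
            identity = sum(a == b for a, b in zip(window, frag))
            scored.append((identity, frag))
        # Best identity first; ties broken alphabetically for reproducibility.
        scored.sort(key=lambda item: (-item[0], item[1]))
        library[start + 1] = [frag for _, frag in scored[:top_n]]
    return library

# Hypothetical toy database built from the fragments shown in the example.
toy_database = ["GKL", "GRL", "KLM", "RLM", "KVM", "RVM", "MQE", "ERA"]
library = pick_fragments("GKLMQERA", toy_database)
```

Running the same function with `size=9` on a database of 9-residue fragments would give the separate 9-mer library.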
20. Making Fragment Libraries with Robetta
http://robetta.bakerlab.org/
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
21. Making Fragment Libraries on Biowulf
Slide content by David Hoover from: http://biowulf.nih.gov/apps/Rosetta23.html#RosettaFragments
22. Folding Theory: The Folding Landscape
• Levinthal paradox:
Given either alpha, beta, or loop conformation per residue, a protein of nres residues has 3^nres possible conformations.
If nres = 100, sampling one conformation every 10^-13 seconds would take ~10^27 years to fold.
The universe is ~10^10 years old.
Folding is non-random and cooperative.
• Many different combinations of secondary structure elements have similar stabilities
Tertiary (side-chain level) interactions drive folding towards the native topology
Phase transition results in a substantial energy gap between native and non-native structures
Implications and requirements for folding algorithm:
• Fast conformational sampling algorithm
• Accurate scoring function
• Full-atom modeling
References:
• Cyrus Levinthal, J. Chim. Phys. 65, 44; 1968
• Hue Sun Chan and Ken A. Dill, Protein Folding in the Landscape Perspective: Chevron Plots and Non-Arrhenius Kinetics, Proteins: Structure, Function, and Genetics, Volume 30, No. 1, January 1998, pp 2-33.
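The Levinthal estimate on this slide is simple arithmetic and can be checked directly. The sketch below uses the slide's numbers (3 states per residue, 100 residues, one conformation per 10^-13 seconds); the year length of 365.25 days is an assumption made for the conversion.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length for the conversion

def levinthal_years(nres, states_per_residue=3, seconds_per_sample=1e-13):
    """Years needed to enumerate every conformation one at a time."""
    conformations = states_per_residue ** nres
    return conformations * seconds_per_sample / SECONDS_PER_YEAR

years = levinthal_years(100)
print(f"{years:.1e} years")  # on the order of 10^27 years
```

Compared with a universe roughly 10^10 years old, exhaustive search is off by many orders of magnitude, which is why folding (and any folding algorithm) cannot be a random search.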
23. [Figure panels]
early centroid models: Assembly
centroid models: Coarse funnel to native-like decoys
final full-atom models: Fine-grained funnel to near-native decoys
24. Major Classes of Energy Functions in Rosetta
Low resolution: reduced atom representation (centroid)
simplified energy function
used for aggressive search of state space
High resolution: full-atom representation
detailed energy function
local search of state space
refinement and minimization
General
linear weighted sum of terms: Energy = w1*term1 + w2*term2 + …
pairwise decomposable (speed)
weighted for task, e.g. ligand docking
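The general form Energy = w1*term1 + w2*term2 + … can be sketched as below. The term names and weight values are hypothetical placeholders chosen for illustration; they are not Rosetta's actual score terms or weight sets.

```python
def total_energy(terms, weights):
    """Weighted linear sum of energy terms.

    terms:   dict of term name -> unweighted value for a conformation
    weights: dict of term name -> weight (one weight set per task)
    """
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical unweighted term values for one conformation.
terms = {"vdw": 2.0, "hbond": -3.0, "solvation": 1.0}

# Hypothetical weight sets: the same terms, reweighted for a different task.
general_weights = {"vdw": 1.0, "hbond": 0.5, "solvation": 0.5}
docking_weights = {"vdw": 0.8, "hbond": 1.0, "solvation": 0.25}

general_energy = total_energy(terms, general_weights)
docking_energy = total_energy(terms, docking_weights)
```

Keeping the terms and the weights separate is what makes it cheap to reuse one set of term evaluations under several task-specific weightings.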
25. Low resolution (centroid) folding
Fragment insertion
conformation modification occurs in torsion space
initial insertions result in large changes in dihedrals
9-mers inserted first, followed by 3-mers later in the process
later insertions purposefully result in small changes in dihedrals
[Figure: random insertion]
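A minimal sketch of fragment insertion in torsion space, as described above: a fragment's (phi, psi) values overwrite a randomly chosen window of the chain, 9-mers first and 3-mers afterwards. The Metropolis acceptance rule is the standard Monte Carlo criterion; the toy score, the fragment values, and all names here are hypothetical stand-ins for illustration and not the Rosetta implementation.

```python
import math
import random

def insert_fragment(torsions, fragment, start):
    """Return a copy of `torsions` with `fragment` written in at `start`."""
    new_torsions = list(torsions)
    new_torsions[start:start + len(fragment)] = fragment
    return new_torsions

def helix_score(torsions):
    """Toy score (hypothetical): squared deviation from helical (-60, -45)."""
    return sum((phi + 60.0) ** 2 + (psi + 45.0) ** 2 for phi, psi in torsions) / 1000.0

def fold(torsions, fragments, score, steps=100, temperature=1.0, seed=0):
    """Monte Carlo fragment insertion: 9-mers first, then 3-mers."""
    rng = random.Random(seed)
    current = list(torsions)
    current_score = score(current)
    for size in (9, 3):
        for _ in range(steps):
            fragment = rng.choice(fragments[size])
            start = rng.randrange(len(current) - size + 1)
            trial = insert_fragment(current, fragment, start)
            trial_score = score(trial)
            delta = trial_score - current_score
            # Metropolis criterion: always accept improvements,
            # sometimes accept a worse score.
            if delta <= 0 or rng.random() < math.exp(-delta / temperature):
                current, current_score = trial, trial_score
    return current, current_score

# Hypothetical starting chain (extended) and toy fragment lists.
extended = [(-120.0, 120.0)] * 12
fragments = {
    9: [[(-60.0, -45.0)] * 9, [(-120.0, 120.0)] * 9],
    3: [[(-60.0, -45.0)] * 3, [(-90.0, 0.0)] * 3],
}
final, final_score = fold(extended, fragments, helix_score)
```

Because a 9-mer rewrites nine residues at once, early moves change the conformation drastically, while the later 3-mer moves make smaller, more local adjustments.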
26. Driving assembly towards native-like decoys
• Sss + SHS: sheet and helix-sheet geometries
• Scβ: density/compactness of structure
• Svdw: no clashes
• SRgyr: radius of gyration (Rgyr), globular structure
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
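The radius of gyration term rewards compact, globular models. A minimal sketch of the quantity itself, assuming one equally weighted point per residue (for example a centroid or Cα position); how Rosetta turns Rgyr into a score is not shown here.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of points from their geometric center.

    coords: non-empty list of (x, y, z) tuples, all treated as equal mass.
    """
    n = len(coords)
    center = tuple(sum(point[k] for point in coords) / n for k in range(3))
    mean_square = sum(math.dist(point, center) ** 2 for point in coords) / n
    return math.sqrt(mean_square)

# Four points at unit distance from the origin: Rgyr is 1.
rgyr = radius_of_gyration([(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0)])
```

For the same number of residues, an extended chain has a much larger Rgyr than a folded globule, so penalizing large values pushes the search towards compact decoys.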
27. Low-resolution homolog folding improves prediction
• Collect homologs
• Create low-resolution models
cluster
• Thread query sequence onto models
• Proceed to full-atom refinement
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
30. High resolution (full-atom) refinement
Chen Y et al. Nucl. Acids Res. 2004;32:5147-5162
evaluating/optimizing specific atom-atom interactions
e.g. hydrogen bonding:
32. Examples from the Rosetta@home archive of top predictions
Note: massively parallel computation
[Figure legend: Rosetta prediction; crystal structure]
33. Detailed ab initio Rosetta Workflow
INPUT
• amino acid sequence
• secondary structure prediction(s)
• fragment library
• constraints from experimental data
• NMR
• biochemical/biophysical studies
• ...
LOW RESOLUTION FOLDING
• fragment insertions
• scoring
• filters
CLUSTERING
• groups of decoys with low RMSD to each other
• lowest energy decoy of clusters selected for further refinement or prediction
HIGH RESOLUTION REFINEMENT
• backbone minimization
• rotamer optimization
ADDITIONAL MODELING
• identifying variable regions
• rebuilding
>10^3-10^6 trajectories
automated
manual
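The clustering step in this workflow (group decoys with low RMSD to each other, then take the lowest energy decoy of each cluster) can be sketched as below. The greedy scheme, the precomputed pairwise RMSD matrix, and the function names are assumptions made for illustration; this is not claimed to be the algorithm of the Rosetta clustering application.

```python
def cluster_decoys(rmsd, radius):
    """Greedy clustering on a symmetric pairwise RMSD matrix.

    Repeatedly picks the unassigned decoy with the most unassigned
    neighbors within `radius` as a cluster center and removes that cluster.
    Returns a list of clusters, each a sorted list of decoy indices.
    """
    unassigned = set(range(len(rmsd)))
    clusters = []
    while unassigned:
        def neighbors(i):
            return [j for j in unassigned if rmsd[i][j] <= radius]
        # sorted() makes ties resolve to the lowest index.
        center = max(sorted(unassigned), key=lambda i: len(neighbors(i)))
        members = sorted(neighbors(center))
        clusters.append(members)
        unassigned -= set(members)
    return clusters

def lowest_energy_representatives(clusters, energies):
    """Pick the lowest energy decoy index from each cluster."""
    return [min(members, key=lambda i: energies[i]) for members in clusters]

# Hypothetical data: decoys 0-2 are mutually similar, decoy 3 is an outlier.
rmsd = [
    [0.0, 1.0, 1.0, 5.0],
    [1.0, 0.0, 1.0, 5.0],
    [1.0, 1.0, 0.0, 5.0],
    [5.0, 5.0, 5.0, 0.0],
]
energies = [-10.0, -12.0, -11.0, -20.0]
clusters = cluster_decoys(rmsd, radius=2.0)
representatives = lowest_energy_representatives(clusters, energies)
```

Large clusters indicate convergence of independent trajectories, which the earlier slide lists as a hallmark of accuracy.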
34. Computational Considerations
Protocol | Utility | Caveats
Centroid | fast; widely sample conformational space | possibility of no near-native models after low resolution folding; no discrimination by energy
Full-atom refinement | near-native decoys separated by energy | more computationally demanding; must have near-native in starting decoy pool
Combined | streamlined; for powerful and massively parallel computing | most computationally demanding; improvement only with sufficient sampling
36. Architect of Rosetta@home: David Kim
A ~1000-fold increase in computational power
“brute force” approach
[Figure: Native (CheY); lowest energy Rosetta structure]
37. Computational power vs. accuracy in ab initio structure prediction
Cα RMSD of lowest energy model to the native structure vs. sample size
[Plot axes: Sample Size (x), RMSD to native (y)]
Category 1: Successful high-resolution predictions
Category 2: Successful high-resolution predictions with additional sampling
Category 3: Unsuccessful predictions (with any amount of sampling)
Kim DE, Blum B, Bradley P, Baker D. Sampling bottlenecks in de novo protein structure prediction. J Mol Biol. 2009 Oct 16;393(1):249-60.
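The quantity plotted on this slide (RMSD to native of the lowest energy model, as a function of how many decoys have been sampled) can be computed from a list of decoys as sketched below. The data layout and the example numbers are hypothetical.

```python
def rmsd_of_lowest_energy(decoys, sample_sizes):
    """For each sample size n, report the RMSD to native of the lowest
    energy decoy among the first n decoys generated.

    decoys: list of (energy, rmsd_to_native) tuples in generation order.
    """
    result = {}
    for n in sample_sizes:
        best = min(decoys[:n], key=lambda decoy: decoy[0])
        result[n] = best[1]
    return result

# Hypothetical run: a near-native, low energy decoy only appears late.
decoys = [(-5.0, 8.0), (-7.0, 6.0), (-6.0, 7.0), (-12.0, 1.5)]
curve = rmsd_of_lowest_energy(decoys, sample_sizes=[1, 2, 4])
```

In this toy run the curve only drops once enough decoys have been sampled, which is the behavior the slide labels Category 2; a curve that never drops corresponds to Category 3.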
38. “De novo” phasing: large-scale tests
Tests on 30 data sets (covering 16 proteins)
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
TF Z-score | Have I solved it?
< 5 | no
5 - 6 | unlikely
6 - 7 | possibly
7 - 8 | probably
> 8 | definitely
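The TF Z-score table above can be written as a small lookup. The table's ranges share their endpoints (for example 6 appears in both "5 - 6" and "6 - 7"), so the boundary handling below is an assumption: a shared endpoint is assigned to the higher category, except 8, which stays "probably" because the last row reads "> 8".

```python
def interpret_tfz(tfz):
    """Map a translation function Z-score to the table's verdict."""
    if tfz < 5:
        return "no"
    if tfz < 6:
        return "unlikely"
    if tfz < 7:
        return "possibly"
    if tfz <= 8:
        return "probably"
    return "definitely"

verdicts = [interpret_tfz(score) for score in (4.2, 5.5, 6.5, 7.5, 9.1)]
```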
39. “De novo” phasing: large-scale tests
Tests on 30 data sets (covering 16 proteins)
[Figure: data set 1hz5-sf.cif; axis unit Å; series: Rosetta-refined native (positive controls), Rosetta-refined de novo models]
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
40. “De novo” phasing: large-scale tests
Tests on 30 data sets (covering 16 proteins)
Success in 14/30 data sets
[Figure: data set 1hz5-sf.cif; axis unit Å; series: Rosetta-refined native (positive controls), Rosetta-refined de novo models]
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
41. “De novo” phasing: large-scale tests
Tests on 30 data sets (covering 16 proteins)
[Figure: data set 1hz5-sf.cif; axis unit Å; series: Rosetta-refined native (positive controls), Rosetta-refined de novo models, Rosetta-refined de novo models built from fragments with correct native 2° structure]
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
42. Preparation for folding simulations
• proper secondary structure assignment
• constraints
• limit search space
• increase sampling efficiency
• decrease CPU time
43. Constraints
• There are constraint types and function types
Constraint types: AtomPair, Angle, Dihedral, etc.
Function types: Bounded, Spline, Harmonic, Gaussian, etc.
• Each constraint is scored individually and the total constraint score is the sum of all individual scores
• Each constraint can have its own constraint type and function type.
In some cases, such as when using the Spline function, each constraint can have its own weight
• How a constraint is defined and how it is scored depend on the constraint type; the same holds for the function type.
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
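The scoring rule above (each constraint scored individually by its own function, total equal to the sum) can be sketched as below. The harmonic form ((x - x0)/sd)^2 and the flat-bottom penalty are simplified illustrations; the flat-bottom function in particular is a stand-in and not Rosetta's exact Bounded function. Constraint names, targets, and weights are hypothetical.

```python
def harmonic(x, x0, sd):
    """Harmonic penalty: zero at x0, growing quadratically with deviation."""
    return ((x - x0) / sd) ** 2

def flat_bottom(x, lower, upper, sd):
    """Simplified bounded penalty: zero inside [lower, upper], quadratic outside.
    An illustrative stand-in, not Rosetta's exact Bounded function."""
    if x < lower:
        return ((lower - x) / sd) ** 2
    if x > upper:
        return ((x - upper) / sd) ** 2
    return 0.0

def total_constraint_score(constraints, measurements):
    """Sum of individually scored constraints.

    constraints:  list of (name, function, weight) tuples
    measurements: dict of name -> value measured in the current model
    """
    return sum(weight * func(measurements[name]) for name, func, weight in constraints)

# Hypothetical atom-pair distance constraints (distances in Angstroms).
constraints = [
    ("CA10-CA45 distance", lambda d: harmonic(d, x0=8.0, sd=1.0), 1.0),
    ("CA3-CA60 distance", lambda d: flat_bottom(d, lower=4.0, upper=10.0, sd=0.5), 2.0),
]
measurements = {"CA10-CA45 distance": 10.0, "CA3-CA60 distance": 11.0}
score = total_constraint_score(constraints, measurements)
```

Pairing each constraint with its own function and weight mirrors the slide's point that constraint type and function type are chosen independently per constraint.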
45. Membrane protein ab initio
• RosettaMembrane divides the protein into:
hydrophobic
hydrophilic
soluble layers
• Specific scoring function for each layer
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
Figure from Yarov-Yarovoy, Schonbrun, and Baker 2006.
46. Input Files
Spanfile - *.span
--transmembrane topology prediction file generated using octopus2span.pl script
--Input OCTOPUS topology file is generated at http://octopus.cbr.su.se using protein sequence as input.
Lipophilicity prediction file - *.lips4
--Generate using run_lips.pl script
--Need input FASTA file, spanfile, blastpgp and nr (NCBI) database to run
Fragment generation
--Advised to use SAM but not JUFO or PSIPRED, which predict TMH regions poorly
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop
47. Folding and studying folding with molecular dynamics
Specialized hardware (ANTON) is capable of continuous ms-length trajectories
Standard simulations:
1 - 3 µs simulations ~ months of HPC
Approximate Rates of Folding:
1 µs helix
10 µs sheet
100 µs fast folding protein
1+ ms typical protein
48. Folding proteins at x-ray resolution
simulation of villin at 300 K
2-8 µs folder
simulation of FiP35 at 337 K
20-80 µs folder
Blue: x-ray structures
Red: last frame of MD simulation
D E Shaw et al. Science 2010;330:341-346
49. Reversible folding simulation of FiP35.
tip of hairpin 1 (12-18, blue)
hairpin 1 (8-22, green)
hairpin 2 (19-30, orange)
full protein (2-33, red)
D E Shaw et al. Science 2010;330:341-346. Published by AAAS
50. Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455