Protein structure prediction with a focus on Rosetta

This presentation was prepared by: Xavier Ambroggio, ambroggiox@niaid.nih.gov
OFFICE OF CYBER INFRASTRUCTURE AND COMPUTATIONAL BIOLOGY
NATIONAL INSTITUTE OF ALLERGY AND INFECTIOUS DISEASES

Fall 2011 Computational Structural Biology Seminar Series
9 – 11 AM, T/Th in 12A/B51 http://training.cit.nih.gov
Week | Day | Date | Course | Instructor | CIT Course #
1 | Tues | Aug. 23 | Fundamentals, Data Sources, and Visualization of Macromolecular Structure | Darrell Hurt | SS260-11001
1 | Thurs | Aug. 25 | Generating Protein Structures from Homology | Darrell Hurt | SS270-11001
2 | Tues | Aug. 30 | Predicting Protein Structures from Amino Acid Sequences | Xavier Ambroggio | SS660-11001
2 | Thurs | Sept. 1 | Predicting Macromolecular Complexes from Uncomplexed Structures | Xavier Ambroggio | SS670-11001
3 | Tues | Sept. 6 | Design and Analysis of Macromolecular Interfaces | Xavier Ambroggio | SS770-11001
3 | Thurs | Sept. 8 | Analysis and Advanced Visualization of Macromolecular Structure | Darrell Hurt | SS330-11001
4 | Tues | Sept. 13 | Computational Drug Design | Mike Dolan | SS340-11001
4 | Thurs | Sept. 15 | Introduction to Molecular Dynamics | Mike Dolan | TBA
5 | Thurs | Sept. 22 | Advanced Molecular Dynamics | Mike Dolan | TBA

Bioinformatics and Computational Biosciences Branch
Scientific Collaboration
Scientific Training
Custom Scientific Software & Infrastructure
•  Structural Biology
•  Phylogenetics
•  Statistics
•  Sequence Analysis
•  Microarray Analysis
•  NGS Analysis
•  Bioinformatics
•  Biological Networks
•  Function Prediction
•  …

Ab Initio Structure Prediction:
Given an amino acid sequence, find the tertiary structure
“Protein folding problem”

CASP: Critical Assessment of protein Structure Prediction
http://predictioncenter.org
•  Double-blind experiment (…competition)
•  World-wide scientific community
•  Unbiased assessment of techniques in structure prediction
•  Biennial (every even year)
•  “Pulse” of the prediction community
•  What can be predicted?
•  Which servers/algorithms perform best?

CASP Overview
Blutsbrüder Design

CASP Top Free-Modeling Servers
Why Rosetta focus?
•  Standalone
•  Versatile
  - RNA
  - design
  - dock
  - …
•  Open Source
•  Substantial Literature
•  Shared methodology
Use any and all available servers!!!

Rosetta: multipurpose macromolecular modeling suite
[Figure: Rosetta applications spanning prediction and design; Das & Baker, Annu. Rev. Biochem. 2008]
Related CIT courses: SS660-11001, SS670-11001, SS770-11001

Brief Description of Select Rosetta Functions
ab initio: predict the structure from sequence
relax: refine the structure using Rosetta energy functions
idealize: replace bond geometries with ideal values
loop modeling: build and refine local structurally variable regions in context of a structural template
design: optimize sequence given a structure with a fixed backbone
docking: structure prediction for a protein-protein complex given subunits
ligand: ligand docking
ddG prediction: ddG calculations for mutations (protein-protein interface and protein stability)
scoring: score input conformations with Rosetta energy functions
RNA: predict RNA structures from sequences and design sequences from fixed structures
clustering: grouping input structures by RMSD to each other for structure prediction analysis
backrub: generate alternate backbone conformations based on sets of rotations
membrane ab initio: predict the structures of helical membrane proteins
enzyme design: redesign a protein around a ligand
domain assembly: fixed domains connected by variable regions
antibody: automated antibody homology modeling
XML parsing: parse XML scripts into protocols

What types of protein domains can Rosetta fold?
Small, globular, soluble protein domains…
Small, simple membrane protein domains…
…but not complex domains or multi-domain proteins.
Figure panels A, B, C: T4-lysozyme C-terminal domain; V-type Na+ ATP synthase subunit; rhodopsin
Slide content adapted from Stephanie Hirst at the 2011 Vanderbilt Rosetta Workshop

What are the success rates?
High resolution predictions are achievable
•  targets ≤100 residues
•  success rate ~30%
•  success rate with accurate secondary structure ~50%
•  a hallmark of accuracy: convergence
Slide content courtesy Rhiju Das, Baker Lab

What types of protein domains can no one fold?
CASP9: domains with no good FM predictions
Slide content adapted from talk given by Lisa Kinch of the Grishin lab at CASP9 meeting: http://predictioncenter.org/casp9/
•  Non-globular
•  Trimeric
•  Fe stabilized
•  High contact order
  - Many residues close in 3D, far in 1D
•  + elongated sheet?
T0591d1, 3MWT
T0550d2, 3NQK
T0629d2, 2XGF
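"Close in 3D, far in 1D" can be made concrete with relative contact order: the average sequence separation of residue pairs in contact, divided by chain length. The sketch below is illustrative only. It assumes one coordinate per residue and an 8 Å cutoff (both assumptions, not taken from the slides), and the function name and toy coordinates are hypothetical.

```python
import math

def relative_contact_order(coords, cutoff=8.0):
    """Average sequence separation of contacting residue pairs, divided by chain length.

    coords: one (x, y, z) point per residue (an assumption; all-atom contacts are also used in practice).
    cutoff: contact distance in Angstroms (assumed value).
    """
    n = len(coords)
    total_separation = 0
    contacts = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) <= cutoff:
                total_separation += j - i
                contacts += 1
    return total_separation / (contacts * n) if contacts else 0.0

SPACING = 3.8  # illustrative residue-to-residue spacing in Angstroms

# Toy chain 1: fully extended, so only sequence neighbours are in contact.
extended = [(SPACING * i, 0.0, 0.0) for i in range(10)]

# Toy chain 2: a hairpin, so residues far apart in sequence become close in space.
hairpin = [(SPACING * i, 0.0, 0.0) for i in range(5)] + \
          [(SPACING * (9 - i), 4.8, 0.0) for i in range(5, 10)]

co_extended = relative_contact_order(extended)
co_hairpin = relative_contact_order(hairpin)
print(co_extended, co_hairpin)
```

The hairpin scores higher than the extended chain because its contacts span larger sequence separations, which is the property the slide associates with hard targets.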

Basic Ab Initio Rosetta protocol
1.  Select fragments consistent with local sequence preferences
2.  Assemble fragments into models with native-like global properties
3.  Identify the best model from the population of decoys
Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”
Figures adapted from Charlie Strauss;
Protein structure prediction using ROSETTA, Rohl et al (2004) Methods in Enzymology, 383:66

Fragment-Based Structure Prediction
Rosetta, Quark, …
[Figure: Fragments, Assembly, Decoys]
Homology modeling:
[Figure: Template(s), Alignment, Model]

First atomic-resolution model
Target 0281 CASP6
•  Topology sampled by ab initio trajectory of homolog sequence (rmsd=2.2Å)
•  Full atom refinement reduces rmsd to 1.5Å
•  Side chain packing accurately recovered
Figures adapted from Bradley P, Malmström L, Qian B, Schonbrun J, Chivian D, Kim DE, Meiler J, Misura KM, Baker D. Free modeling with Rosetta in CASP6. Proteins.

Folding Theory: Sequence-Structure Relationships
•  Secondary structure formation is the earliest part of the folding process
•  Local sequence codes for local structures… i.e. fragments
  - helical sequences in a folded protein tend to be helical in isolation
•  Secondary structure prediction algorithms have ~70-80% accuracy
  - Partial failure due to tertiary interactions stabilizing secondary structure elements

Rosetta fragments
•  3 and 9 residue fragments matched to query sequence
•  database created from crystal structures
  - < 2.5Å resolution
  - < 50% sequence identity
•  low resolution modeling
  - centroid representation of side chains
•  ranked by:
  - alignment
  - secondary structure predictions: PSIPRED, SAM-T02, JUFO, PHD

Sliding fragment windows
query:   KVFGRCELAAAMKRHGLDNYRGYSLGNWVC...
sec str: EEEE TT S EEEEEEE TT HH...
3-mer windows: KVF, VFG, FGR, GRC, ...
9-mer windows: KVFGRCELA, VFGRCELAA, FGRCELAAA, GRCELAAAM, ...
Slide content courtesy David Hoover, CIT, NIH
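The sliding windows above can be sketched in a few lines. This only enumerates the query windows that fragments are matched against; it does not perform the database search or ranking. The function name is hypothetical, and the query is the portion of the sequence shown on the slide.

```python
def sliding_windows(sequence, size):
    """Return every contiguous window of `size` residues, shifted by one residue each time."""
    return [sequence[i:i + size] for i in range(len(sequence) - size + 1)]

# Portion of the query sequence shown on the slide (the slide truncates it with "...").
query = "KVFGRCELAAAMKRHGLDNYRGYSLGNWVC"

three_mers = sliding_windows(query, 3)
nine_mers = sliding_windows(query, 9)

print(three_mers[:4])
print(nine_mers[:4])
```

Each window position then gets its own ranked list of candidate fragments, which is why separate 3-mer and 9-mer libraries are produced.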

Example 3-mer fragment library
 #   Rank   G K L M Q E R A
13   1000   G K L
25    821   G R L
46   1000     K L M
21    635     R L M
43    923     K V M
26    523     R V M
15    970         M Q E
26    934             E R A
Separate 3-mer and 9-mer libraries generated
Slide content courtesy David Hoover, CIT, NIH

Making Fragment Libraries with Robetta
http://robetta.bakerlab.org/

Making Fragment Libraries on Biowulf
Slide content by David Hoover from: http://biowulf.nih.gov/apps/Rosetta23.html#RosettaFragments

Folding Theory: The Folding Landscape
•  Levinthal paradox:
  - Given either alpha, beta, or loop conformation for each residue, a protein of nres residues has 3^nres possible conformations.
  - If nres = 100, sampling a conformation every 10^-13 seconds = 10^27 years to fold
  - Universe is 10^10 years old.
  - Folding is non-random and cooperative.
•  Many different combinations of secondary structure elements have similar stabilities
  - Tertiary (side-chain level) interactions drive folding towards the native topology
  - Phase transition results in a substantial energy gap between native and non-native structures
•  Cyrus Levinthal, J. Chim. Phys. 65, 44; 1968
•  Hue Sun Chan and Ken A. Dill, Protein Folding in the Landscape Perspective: Chevron Plots and Non-Arrhenius Kinetics, Proteins: Structure, Function, and Genetics, Volume 30, No. 1, January 1998, pp 2-33.
Implications and requirements for folding algorithm:
•  Fast conformational sampling algorithm
•  Accurate scoring function
•  Full-atom modeling
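The Levinthal estimate can be checked with a short calculation. This sketch uses the numbers from the slide (three states per residue, 100 residues, one conformation sampled every 10^-13 seconds). The function name and the 365.25-day year are assumptions made for illustration.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length

def levinthal_years(nres, states_per_residue=3, seconds_per_sample=1e-13):
    """Years needed to enumerate every conformation by exhaustive sampling."""
    conformations = states_per_residue ** nres
    return conformations * seconds_per_sample / SECONDS_PER_YEAR

years = levinthal_years(100)
universe_age_years = 1e10  # value quoted on the slide

print(f"{years:.1e} years, {years / universe_age_years:.1e} times the age of the universe")
```

The result is on the order of 10^27 years, which is why a folding algorithm needs fast, guided sampling rather than exhaustive enumeration.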

[Figure: early centroid models (assembly) → centroid models (coarse funnel to native-like decoys) → final full-atom models (fine-grained funnel to near-native decoys)]

Major Classes of Energy Functions in Rosetta
Low resolution: reduced atom representation (centroid)
  - simplified energy function
  - used for aggressive search of state space
High resolution: full-atom representation
  - detailed energy function
  - local search of state space
  - refinement and minimization
General
  - weighted sum of linear terms: Energy = w1*term1 + w2*term2 + …
  - pairwise decomposable (speed)
  - weighted for task, e.g. ligand docking
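The weighted-sum form, Energy = w1*term1 + w2*term2 + …, can be sketched as below. The term names, values, and weight sets are hypothetical and chosen only to show how the same raw terms give different totals under task-specific weights. They are not Rosetta's actual score terms or weights.

```python
def weighted_energy(terms, weights):
    """Energy = w1*term1 + w2*term2 + ... over the named score terms.

    Terms with no entry in `weights` contribute nothing (weight 0).
    """
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())

# Hypothetical raw term values for one decoy (illustrative numbers only).
terms = {"vdw": 2.0, "rgyr": 10.0, "hbond": -4.0}

# Hypothetical task-specific weight sets.
folding_weights = {"vdw": 1.0, "rgyr": 0.5, "hbond": 2.0}
docking_weights = {"vdw": 1.0, "rgyr": 0.0, "hbond": 3.0}

folding_energy = weighted_energy(terms, folding_weights)
docking_energy = weighted_energy(terms, docking_weights)
print(folding_energy, docking_energy)
```

Keeping the terms separate from the weights is what lets one set of scoring code be re-weighted for a different task.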

Low resolution (centroid) folding
  - Fragment insertion
  - conformation modification occurs in torsion space
  - initial insertions result in large changes in dihedrals
  - 9-mers are inserted first, followed by 3-mers later in the process
  - later insertions purposefully result in small changes in dihedrals
[Figure: random insertion]
Driving assembly towards native-like decoys
•  Sss + SHS: sheet and helix-sheet geometries
•  Scβ: density/compactness of structure
•  Svdw: no clashes
•  SRgyr: radius of gyration (Rgyr), globular structure
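As an illustration of the Rgyr term, the radius of gyration of a set of coordinates can be computed as the root-mean-square distance from their centroid. This sketch assumes equal weight for every point and uses toy coordinates. How Rosetta turns Rgyr into a score is not shown here.

```python
import math

def radius_of_gyration(coords):
    """Root-mean-square distance of the points from their centroid (equal weights assumed)."""
    n = len(coords)
    center = [sum(p[k] for p in coords) / n for k in range(3)]
    mean_sq = sum(math.dist(p, center) ** 2 for p in coords) / n
    return math.sqrt(mean_sq)

SPACING = 3.8  # illustrative spacing between points, in Angstroms

# Toy extended chain: eight points on a line.
extended = [(SPACING * i, 0.0, 0.0) for i in range(8)]

# Toy compact arrangement: the same eight points on the corners of a cube.
compact = [(SPACING * x, SPACING * y, SPACING * z)
           for x in (0, 1) for y in (0, 1) for z in (0, 1)]

rg_extended = radius_of_gyration(extended)
rg_compact = radius_of_gyration(compact)
print(rg_extended, rg_compact)
```

The compact arrangement has the smaller radius of gyration, so a term that penalizes large Rgyr favors globular models.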

Low-resolution homolog folding improves prediction
•  Collect homologs
•  Create low-resolution models
  - cluster
•  Thread query sequence onto models
•  Proceed to full-atom refinement

Slide content adapted from Ora Schueler-Furman’s “Workshop in Structural Computational Biology”

Low resolution (centroid) folding example

Clustering: graphical representation

High resolution (full-atom) refinement
evaluating/optimizing specific atom-atom interactions, e.g. hydrogen bonding
[Figure from Chen Y et al. Nucl. Acids Res. 2004;32:5147-5162]

Comparison of low resolution, relax, and abrelax folding example

Examples from the Rosetta@home archive of top predictions
Note: massively parallel computation
[Figure legend: Rosetta prediction, crystal structure]

Detailed ab initio Rosetta Workflow
INPUT
•  amino acid sequence
•  secondary structure prediction(s)
•  fragment library
•  constraints from experimental data
  - NMR
  - biochemical/biophysical studies
  - ...
LOW RESOLUTION FOLDING
•  fragment insertions
•  scoring
•  filters
CLUSTERING
•  groups of decoys with low RMSD to each other
•  lowest energy decoy of clusters selected for further refinement or prediction
HIGH RESOLUTION REFINEMENT
•  backbone minimization
•  rotamer optimization
ADDITIONAL MODELING
•  identifying variable regions
•  rebuilding
[Figure labels: >10^3-10^6 trajectories; automated; manual]
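The clustering step (group decoys with low RMSD to each other, then take the lowest energy decoy of each cluster) can be sketched as follows. This is a simple greedy scheme chosen for illustration, not the algorithm used by Rosetta's clustering application. It assumes the decoys are already superimposed, and the decoy names, coordinates, energies, and radius are made up.

```python
import math

def rmsd(a, b):
    """RMSD between two equal-length coordinate lists (assumed already superimposed)."""
    return math.sqrt(sum(math.dist(p, q) ** 2 for p, q in zip(a, b)) / len(a))

def cluster_decoys(decoys, radius):
    """Greedy clustering: visit decoys from lowest to highest energy.

    A decoy joins the first cluster whose representative is within `radius`;
    otherwise it starts a new cluster. Because decoys are visited in energy
    order, the first member of each cluster is its lowest energy decoy.
    """
    clusters = []
    for decoy in sorted(decoys, key=lambda d: d["energy"]):
        for members in clusters:
            if rmsd(decoy["coords"], members[0]["coords"]) <= radius:
                members.append(decoy)
                break
        else:
            clusters.append([decoy])
    return clusters

# Hypothetical two-point "decoys" with made-up energies.
decoys = [
    {"name": "A", "energy": -10.0, "coords": [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]},
    {"name": "B", "energy": -8.0, "coords": [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)]},
    {"name": "C", "energy": -9.0, "coords": [(5.0, 5.0, 5.0), (6.0, 5.0, 5.0)]},
]

clusters = cluster_decoys(decoys, radius=1.0)
representatives = [members[0]["name"] for members in clusters]
print(representatives, [len(members) for members in clusters])
```

The representatives are the decoys that would be passed on to high resolution refinement.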

Computational Considerations
Protocol | Utility | Caveats
Centroid | fast; widely sample conformational space | possibility of no near-native models after low resolution folding; no discrimination by energy
Full-atom refinement | near-native decoys separated by energy | more computationally demanding; must have near-native in starting decoy pool
Combined | streamlined; for powerful and massively parallel computing | most computationally demanding; improvement only with sufficient sampling

A ~1000-fold increase in computational power
“brute force” approach
Architect of Rosetta@home: David Kim
[Figure: native (CheY) vs. lowest energy Rosetta structure]
Slide content courtesy Rhiju Das, Baker Lab

Computational power vs. accuracy in ab initio structure prediction
[Figure: Cα RMSD of lowest energy model to the native structure vs. sample size; x-axis: sample size; y-axis: RMSD to native]
Category 1: Successful high-resolution predictions
Category 2: Successful high-resolution predictions with additional sampling
Category 3: Unsuccessful predictions (with any amount of sampling)
Kim DE, Blum B, Bradley P, Baker D. Sampling bottlenecks in de novo protein structure prediction. J Mol Biol. 2009 Oct 16;393(1):249-60.

“De novo” phasing: large-scale tests
Tests on 30 data sets (covering 16 proteins)
Slide content courtesy Rhiju Das, Baker Lab; Qian et al., Nature 2007.
TF Z-score | Have I solved it?
< 5 | no
5 - 6 | unlikely
6 - 7 | possibly
7 - 8 | probably
> 8 | definitely
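The TF Z-score table can be written as a small lookup. The table's ranges share their end points, so how a value exactly on a boundary is classified is an assumption made here (lower bound inclusive, except that exactly 8 counts as "probably"). The function name is hypothetical.

```python
def interpret_tfz(tfz):
    """Map a TF Z-score to the table's answer to "Have I solved it?".

    Boundary handling is an assumption; the slide's ranges share end points.
    """
    if tfz < 5:
        return "no"
    if tfz < 6:
        return "unlikely"
    if tfz < 7:
        return "possibly"
    if tfz <= 8:
        return "probably"
    return "definitely"

for value in (4.2, 5.5, 6.5, 7.5, 9.1):
    print(value, interpret_tfz(value))
```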

[Figures: results for 1hz5-sf.cif, axis in Å]
•  Rosetta-refined native (positive controls)
•  Rosetta-refined de novo models
•  Rosetta-refined de novo models, fragments with correct native 2° structure
Success in 14/30 data sets

Preparation for folding simulations
•  proper secondary structure assignment
•  constraints
•  limit search space
•  increase sampling efficiency
•  decrease CPU time

Constraints
•  There are constraint types and function types
  - Constraint types: AtomPair, Angle, Dihedral, etc.
  - Function types: Bounded, Spline, Harmonic, Gaussian, etc.
•  Each constraint is scored individually, and the total constraint score is the sum of all individual scores
•  Each constraint can have its own constraint type and function type.
  - In some cases, such as with the Spline function, each constraint can have its own weight
•  How a constraint is defined and scored depends on its constraint type; the same is true of its function type.

Constraint file example: EPR data
<cst type> <atom1> <res1> <atom2> <res2> <cst_func> <RosettaEPR> <Dcb> <weight> <bin>
AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5
Constraint info: <cst type> through <res2>. Constraint function info: <cst_func> through <bin>.
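To make the field layout concrete, the sketch below splits the example line into the ten fields named in the header. It is a reading aid for this one layout, not Rosetta's constraint parser. The dictionary keys are adapted from the header labels, and the numeric types chosen for each field are assumptions.

```python
# Field names adapted from the header line above; the types are assumptions.
FIELDS = ("cst_type", "atom1", "res1", "atom2", "res2",
          "cst_func", "rosetta_epr", "dcb", "weight", "bin")
CONVERTERS = {"res1": int, "res2": int, "dcb": float, "weight": float, "bin": float}

def parse_epr_constraint(line):
    """Split one whitespace-separated constraint line into named fields."""
    tokens = line.split()
    if len(tokens) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(tokens)}")
    return {name: CONVERTERS.get(name, str)(token)
            for name, token in zip(FIELDS, tokens)}

example = "AtomPair CB 32 CB 36 SPLINE EPR_DISTANCE 16.0 1.0 0.5"
constraint = parse_epr_constraint(example)

# First five fields describe the constraint; the rest describe its function.
constraint_info = {name: constraint[name] for name in FIELDS[:5]}
function_info = {name: constraint[name] for name in FIELDS[5:]}
print(constraint_info)
print(function_info)
```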

Membrane protein ab initio
•  RosettaMembrane divides the protein into:
  - hydrophobic
  - hydrophilic
  - soluble layers
•  Specific scoring function for each layer
Figure from Yarov-Yarovoy, Schonbrun, and Baker 2006.

Input Files
Spanfile - *.span
  -- transmembrane topology prediction file generated using octopus2span.pl script
  -- Input OCTOPUS topology file is generated at http://octopus.cbr.su.se using protein sequence as input.
Lipophilicity prediction file - *.lips4
  -- Generate using run_lips.pl script
  -- Need input FASTA file, spanfile, blastpgp and nr (NCBI) database to run
Fragment generation
  -- Advised to use SAM but not JUFO or PSIPRED, which predict TMH regions poorly

Folding and studying folding with molecular dynamics
Specialized hardware: ANTON, capable of continuous millisecond-length trajectories
Standard simulations:
1 - 3 µs simulations ~ months of HPC
Approximate Rates of Folding:
1 µs helix
10 µs sheet
100 µs fast folding protein
1+ ms typical protein

Folding proteins at x-ray resolution
•  simulation of villin at 300 K: 2-8 µs folder
•  simulation of FiP35 at 337 K: 20-80 µs folder
Blue: x-ray structures
Red: last frame of MD simulation
D E Shaw et al. Science 2010;330:341-346

Reversible folding simulation of FiP35.
[Figure legend: tip of hairpin 1 (12-18, blue); hairpin 1 (8-22, green); hairpin 2 (19-30, orange); full protein (2-33, red)]
D E Shaw et al. Science 2010;330:341-346

Thank You
For questions or comments please contact:
ScienceApps@niaid.nih.gov
301.496.4455
