Parse is a BaaS for mobile developers that is built entirely on AWS. With over 150,000 mobile apps hosted on Parse, the stability of the platform is our primary concern, but it coexists with rapid growth and a demanding release schedule. This session is a technical discussion of the current architecture and the design decisions that went into scaling the platform rapidly and robustly over the past year and a half. We talk about some of the lessons learned managing and scaling MongoDB, Cassandra, Redis, and MySQL in the cloud. We also discuss how Parse went from launching individual instances using chef to managing clusters of hosts with Auto Scaling groups, with instance discovery and registry handled by ZooKeeper, enabling us to manage vastly larger sets of services with fewer human resources. This session is useful to anyone who is trying to scale up from startup to established platform without sacrificing agility.
2. What is Parse?
• Platform for mobile developers
• iOS, Android, WinRT
• API and native SDKs
• Scales automatically to handle traffic
• Analytics, cloud code, file storage, push notifications, hosting
Friday, November 15, 13
4. Parse is built on AWS
• Parse has never touched bare metal
• Recently acquired by Facebook
• Current plan is to stay on AWS
• We love AWS!
5. Parse is growing fast
• Developers
• Apps
• API usage
• Nodes and compute resources
• Connected devices
11. Parse ops philosophy
• Work smarter, not harder
• Small team, full stack generalists
• Automate, automate, automate
• Our goal:
• 80% time working on things we want to do
• 20% time working on things we have to do
12. Past & Present
October 2012
• 60% time spent on must-do’s
• 40% time spent on want-to-do’s
• ~400 event alerts
• Very sleepy opsen
October 2013
• 20% time spent on must-do’s
• 80% time spent on want-to-do’s
• ~100 event alerts (mostly daytime)
• Infra complexity has 5x’d but time to manage it has dropped
• We have shifted a lot of work from ourselves to AWS
13. Takeaways
• ASGs are your best friend
• Automation should be reusable
• Choose your source of truth carefully
16. Infrastructure design choices
• Chef
• Amazon Route 53
• Use real hostnames
• Distribute evenly across 3 AZs
• Fail over automatically
• Single source of truth
17. Amazon EC2 design choices
• Standardize on a few instance types
• Makes reserved instances more efficient
• We use m1.large, m1.xlarge, m2.4xlarge (multi-core is a must). Prefer many small disposable instances for stateless services.
• Security groups
• One group per role
• Verify working set with expected set using git/nagios
• All inbound requests come through Elastic Load Balancing
• Nothing talks directly to Amazon EC2 instances
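The "verify working set with expected set" bullet above can be sketched as a diff between the security group rules checked into git and the rules actually live in EC2, surfaced through a nagios-style status. A minimal sketch in Python; the rule tuples and function names are illustrative assumptions, not Parse's actual tooling:

```python
def diff_rules(expected, actual):
    """Return (missing, unexpected) rule sets for one security group."""
    expected, actual = set(expected), set(actual)
    return expected - actual, actual - expected

def check_groups(expected_by_group, actual_by_group):
    """Nagios-style check: OK if every group matches, CRITICAL otherwise."""
    problems = []
    for group, expected in expected_by_group.items():
        missing, unexpected = diff_rules(expected, actual_by_group.get(group, []))
        if missing or unexpected:
            problems.append((group, sorted(missing), sorted(unexpected)))
    return ("OK", []) if not problems else ("CRITICAL", problems)

# Example: the "api" role should only allow 443 from the ELB's group,
# but someone hand-added a world-open SSH rule.
expected = {"api": [("tcp", 443, "sg-elb")]}
actual = {"api": [("tcp", 443, "sg-elb"), ("tcp", 22, "0.0.0.0/0")]}
status, problems = check_groups(expected, actual)
# status == "CRITICAL"; problems names the stray SSH rule
```

Keeping the expected set in git means a rule change is a reviewed commit, and drift shows up as an alert instead of a surprise.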
19. API path
• Elastic Load Balancing
• nginx
• haproxy
• Ruby app servers (unicorns)
• Go API servers (a ground-up rewrite in Go)
• Go logging servers to FB endpoint
21. Hosting
• Elastic Load Balancing
• Elastic IPs for apex domain redirect service
• Go service that wraps cloud code and Amazon S3
23. Cloud code
• Server-side JavaScript in a v8 virtual machine
• Third-party modules for partners (Stripe, Twilio, etc.)
• Restrictive security groups
• Scrub IPs with squid
24. Push
• Resque on redis
• Billions of pushes per month
• 700/sec steady state
• Spikes to 10k/sec (15x burst)
• PPNS holds sockets open to all Android devices
• PDNS to serve Android phone-home IPs
26. MongoDB
• 12 replica sets, ~50 nodes, 2-4 TB per rs
• Over 1M collections
• Over 170k schemas
• Autoindexing of keys based on entropy
• Compute compound indexes from real traffic analysis
• Implemented our own app-level sharding
• PIOPS (striped RAID, 2000-8000 PIOPS/vol)
• Totally saved our bacon. Amazon EBS was a killer.
• Fully instrumented provisioning with chef
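The "autoindexing of keys based on entropy" bullet hides a neat heuristic: a key whose sampled values are nearly all identical makes a poor index, while a high-entropy key is selective. A minimal sketch of that heuristic, assuming a sampled batch of documents per collection; the threshold and sampling scheme here are made up for illustration:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (bits) of a sample of field values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def index_candidates(sampled_docs, threshold=1.0):
    """Keys whose sampled values are high-entropy (selective) enough to index."""
    by_key = {}
    for doc in sampled_docs:
        for key, value in doc.items():
            by_key.setdefault(key, []).append(value)
    return sorted(k for k, vals in by_key.items() if entropy(vals) >= threshold)

docs = [
    {"user_id": "u1", "active": True},
    {"user_id": "u2", "active": True},
    {"user_id": "u3", "active": True},
    {"user_id": "u4", "active": False},
]
# user_id is unique in the sample (entropy 2.0) -> good index key;
# active is almost constant (entropy ~0.81) -> poor index key.
# index_candidates(docs) == ["user_id"]
```

The same sampled traffic can rank key pairs for the compound indexes mentioned in the next bullet.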
27. Memcache
• Pool of memcaches with consistent hash
• I would use ElastiCache instead next time
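A pool of memcaches behind a consistent hash can be sketched as a hash ring: each host owns many points on the ring, and a key maps to the first host clockwise from its hash, so adding or removing a host only remaps that host's share of keys. A minimal ketama-style sketch, with the replica count and hash choice as illustrative assumptions:

```python
import bisect
import hashlib

class HashRing:
    """Minimal consistent-hash ring over a pool of memcached hosts.
    Illustrative only; real clients differ in hashing details."""

    def __init__(self, hosts, replicas=100):
        # Each host gets many points on the ring for an even key spread.
        self._ring = sorted(
            (self._hash(f"{host}-{i}"), host)
            for host in hosts
            for i in range(replicas)
        )
        self._points = [p for p, _ in self._ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def host_for(self, key):
        """First ring point clockwise from the key's hash, wrapping around."""
        i = bisect.bisect(self._points, self._hash(key)) % len(self._ring)
        return self._ring[i][1]

ring = HashRing(["cache1:11211", "cache2:11211", "cache3:11211"])
# Removing one host remaps only the keys that lived on it; the rest stay put.
```

That stability under membership change is what makes the scheme cache-friendly; it is also roughly what ElastiCache's client-side sharding gives you out of the box.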
28. Redis
• Queueing using resque
• Android outboxes
• Single-threaded
• Just started playing with ElastiCache redis
29. MySQL
• Trivially tiny and we would love to get rid of it
• ... but rails
• Considered Amazon RDS
• No chained replication
• Visibility is challenging
• Even tiny periodic blips impact the API
• ... but AZ failover would be sooo nice
30. Cassandra
• Powers the front-end Parse Analytics
• Super fast writes and increments
• 12 node cluster of m2.4xlarge
• Ephemeral storage
• Cheap & won our benchmarks
31. Cassandra + Priam
• Initial token assignments
• Incremental backups to Amazon S3
• Uses Auto Scaling groups
• Amazon SimpleDB for tracking tokens, instance identities
• Non-trivial to set up but WORTH IT
34. First-generation infrastructure
Characteristics
• Ruby on Rails everywhere
• Chef to build AMIs
• Chef role per service
• Capistrano to deploy code
• Source of truth: git
Effects
• Sooo much hand-editing
• Make the same change in many places
• Full deploy and restart any time a single host is added or removed
• Fine for small static host sets
35. How to deploy 20 new servers:
• Run 20 knife-ec2 commands to launch 20 hosts
• Edit the cap deploy file
• Edit the yml files, push to git
• Do a cap cold deploy to new hosts
• Do a full deploy/restart to all the services that need to talk to the new hosts
Total time elapsed: 1.5–2.5 hours
36. How to deploy 20 new servers (same steps as above):
Total time elapsed: 1.5–2.5 hours... not OK. OMG.
37. PROBLEMS
• Babysitting
• Maintaining machine lists by hand
• No consistent human readable host naming
• Requires full code deploy to add single node
• Humans have to know things and make decisions
39. Second-generation infrastructure
Characteristics
• Ruby on Rails everywhere
• Chef to configure systems
• Chef to generate host lists
• Capistrano to deploy code
• Source of truth: chef
Effects
• YML files, haproxy configs, etc. generated every chef run
• No longer need to do full deploys to affected services, just restart
• Only one set of files to maintain by hand (capistrano)
40. How to deploy 20 new servers:
• Run 20 knife-ec2 commands to launch 20 hosts
• Edit the cap deploy file
• Do a cap cold deploy to new hosts
• Let chef-client run to generate YML files
• Restart services that need to talk to the new hosts
Total time elapsed: 30-60 minutes
41. How to deploy 20 new servers (same steps as above):
Total time elapsed: 30-60 minutes... STILL not OK!
42. What are our primary goals?
• Scale up any class of service in < 5 minutes
• Automatically detect new nodes
• Automatically remove downed nodes from service
• No hand maintained lists ANYWHERE (ugh)
• Deploy fast—no time to build AMIs
• Option of deploying from master
• Design a new deploy process for go binaries
43. Putting together a solution
Auto Scaling Groups
• Each service lives in an ASG
• Same AMI used for most services
• Base AMI generated by chef
• System state managed by chef
• ASG named after chef role
Jenkins + Amazon S3
• Runs unit tests
• Generates a tarball artifact for each successful build
• Uploads to Amazon S3, tags with the build # and role
44. Autoification
auto-bootstrap
• Runs on first boot
• Infers chef role from ASG name
• Generates a client.rb and initial runlist
• Registers DNS with Amazon Route 53
• Grabs a lock from zookeeper, so DNS is atomic
• Bootstraps chef
• Auto-deploys
auto-deploy
• Infers the chef role from ASG name
• Pulls build artifact from Amazon S3
• Unpacks tarball, restarts
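The inference step in auto-bootstrap can be sketched as a couple of pure functions, assuming the convention that an ASG is named after its chef role. The naming scheme, file contents, and helper names below are assumptions for illustration; the real bootstrap would read the ASG name from instance metadata:

```python
def role_from_asg_name(asg_name):
    """Assume ASGs are named '<role>-<anything>', e.g. 'api-prod' -> 'api'."""
    return asg_name.split("-")[0]

def render_client_rb(node_name, chef_server_url):
    """Minimal client.rb so chef-client can register itself on first boot."""
    return (
        f'node_name "{node_name}"\n'
        f'chef_server_url "{chef_server_url}"\n'
    )

def initial_runlist(role):
    """First-boot runlist: just the role; chef converges the rest."""
    return {"run_list": [f"role[{role}]"]}

role = role_from_asg_name("api-prod")
# role == "api"; the first chef-client run then converges the node to role[api]
```

Because the ASG name carries the role, a brand-new instance needs zero hand-maintained configuration to know what it is.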
45. A better source of truth: zookeeper
• We LOVE zookeeper!!
• Service registration, service discovery
• Distributed locking
• Coordinated actions, unique ids
46. A better source of truth: zookeeper (how it works)
• zkwatcher detects the service is up, establishes an ephemeral node to zk
• Or the service registers itself
• Ephemeral node goes away, service gets deregistered
• Capistrano asks zookeeper for the list of alive servers to deploy to
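The ephemeral-node mechanics above can be sketched with the kazoo client: registration creates an ephemeral znode that vanishes when the session dies, and a deploy tool just lists the children. The znode paths and hostnames are illustrative assumptions, and this is a sketch rather than zkwatcher itself:

```python
def znode_path(role, host):
    """One ephemeral node per live host, grouped by service role."""
    return f"/services/{role}/{host}"

def register(zk_hosts, role, host):
    """Register a live service; the node disappears if this process dies."""
    from kazoo.client import KazooClient  # third-party: pip install kazoo
    zk = KazooClient(hosts=zk_hosts)
    zk.start()
    zk.create(znode_path(role, host), ephemeral=True, makepath=True)
    return zk

def alive_servers(zk, role):
    """What capistrano would ask for: the live hosts for a role."""
    return sorted(zk.get_children(f"/services/{role}"))
```

The key property is that deregistration is automatic: no cleanup job, no stale host list, because the znode's lifetime is tied to the session.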
48. Third-generation infrastructure
Characteristics
• Some go, some ruby
• Chef to maintain state
• ASG per chef role
• Capistrano + zk + jenkins
• Source of truth: zookeeper + Amazon S3
Effects
• No lists of hosts
• No manual labor
• Happy opsen
49. Deploy 20 new servers:
• Adjust the size of the ASG
• Have a cocktail
Total time elapsed: 5-10 minutes
50. Deploy 20 new servers (same as above):
Total time elapsed: 5-10 minutes... YAY!
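"Adjust the size of the ASG" is one API call. A hedged sketch with boto3; the group name is hypothetical, error handling is omitted, and this is illustrative rather than Parse's tooling:

```python
def grown_capacity(current, to_add, max_size):
    """Desired capacity after adding instances, clamped to the group max."""
    return min(current + to_add, max_size)

def add_servers(group_name, to_add):
    """Grow an ASG; auto-bootstrap takes it from there on each new host."""
    import boto3  # third-party; assumes AWS credentials are configured
    asg = boto3.client("autoscaling")
    group = asg.describe_auto_scaling_groups(
        AutoScalingGroupNames=[group_name]
    )["AutoScalingGroups"][0]
    desired = grown_capacity(group["DesiredCapacity"], to_add, group["MaxSize"])
    asg.set_desired_capacity(
        AutoScalingGroupName=group_name, DesiredCapacity=desired
    )
    return desired

# add_servers("api", 20)  # ...then have a cocktail
```

Everything after this call is automation: the new instances boot, infer their role, register in zookeeper, and pull the latest build.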
51. ASG caveats
• Amazon CloudWatch triggers are minimally useful for us
• Our bursts are usually too short and sharp
• No periodicity to our traffic patterns
• ... but we are lazy so we would like to add them anyway
• Need more tooling around downsizing ASGs gracefully
• Initial chef run may take 5-7 minutes
• Could someday optimize this
• Or eat the overhead of building AMIs with each successful jenkins build
52. Remaining issues
• When we get rid of ruby, get rid of cap
• Just use auto-deploy for everything
• Trigger a deploy by updating build version # in zookeeper
• Automatic failover for mysql and redis
• Move everything into VPC
• ASGs will really help with this!
• Then we can use internal load balancers instead of haproxy. Want badly.
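The planned "trigger a deploy by updating the build version # in zookeeper" could look like a data watch per role: every host watches one znode and redeploys when the build number changes. The znode path, artifact naming, and helper names are assumptions for illustration:

```python
def artifact_key(role, build):
    """S3 key for the tarball jenkins uploaded, tagged with build # and role."""
    return f"builds/{role}/{role}-{build}.tar.gz"

def watch_builds(zk, role, deploy):
    """Redeploy whenever the build number stored in zookeeper changes."""
    from kazoo.recipe.watchers import DataWatch  # pip install kazoo

    @DataWatch(zk, f"/builds/{role}")
    def _on_change(data, stat):
        if data:
            # deploy() would pull the tarball from S3, unpack, and restart
            deploy(artifact_key(role, data.decode().strip()))
```

With this in place a deploy is a single znode write, and capistrano drops out entirely.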
53. Takeaways
• Single source of truth, or multiple sources of lies
• The more real-time your source of truth, the faster your
response time can be
• ASGs are amazing <3 <3
55. Please give us your feedback on this
presentation
MBL307
As a thank you, we will select prize
winners daily for completed surveys!
Thank You