4. #Cassandra13
The Spotify backend
• Around 4000 servers in 4 datacenters
• Volumes
- We have ~ 12 soccer fields of music
- Streaming ~ 4 Wikipedias/second
- ~ 24 000 000 active users
5. The Spotify backend
• Specialized software powering Spotify
- ~ 70 services
- Mostly Python, some Java
- Small, simple services, each responsible for a single task
6. Storage needs
• We used to be a pure PostgreSQL shop
• Postgres is awesome, but...
- Poor cross-site replication support
- Write master failure requires manual intervention
- Sharding throws most relational advantages out the window
7. Cassandra @ Spotify
• We started using Cassandra 2+ years ago
- ~ 24 services use it by now
- ~ 300 Cassandra nodes
- ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
8. Cassandra @ Spotify
• So we screwed up
• A lot
10. Read repair
• Repairs data from outages as part of regular read operations
• With RR, reads request hash digests from all replica nodes
• The result is still returned as soon as enough nodes have replied
• If there is a digest mismatch, perform a repair
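The digest-and-repair cycle above can be sketched as a toy Python model (a sketch, not Cassandra's actual read path: replicas are plain dicts of column → (timestamp, value), and the repair runs synchronously):

```python
import hashlib

def digest(row):
    # Replicas return an MD5 digest of the row instead of the full data.
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def read_with_repair(replicas):
    """Toy coordinator: fetch full data from one replica and digests from
    the rest; on a digest mismatch, merge by timestamp and write the merged
    row back to every replica (the repair)."""
    data = dict(replicas[0])                      # the one full data read
    if len({digest(r) for r in replicas}) > 1:    # digest mismatch
        merged = {}
        for r in replicas:
            for col, (ts, val) in r.items():
                if col not in merged or ts > merged[col][0]:
                    merged[col] = (ts, val)       # newest timestamp wins
        for r in replicas:
            r.clear()
            r.update(merged)                      # repair pushed to replicas
        data = merged
    return data
```

Note the cost the next slide complains about: even when the caller only needs one replica's answer, every replica did work to produce a digest.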
11. Read repair
• Useful factoid: Read repair is performed across all data centers
• So in a multi-DC setup, all reads result in requests being sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair_chance
12. Row cache
• Cassandra can be configured to cache entire data rows in RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
13. Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate the entire row
• Don't use it unless you understand all your use cases
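The three gotchas above can be made concrete with a toy cache model (an illustrative sketch of the old row cache's semantics, not Cassandra code; `TinyRowCache` and its counters are invented names):

```python
class TinyRowCache:
    """Toy model of Cassandra's old row cache semantics: only whole rows
    are cached, and any write evicts the whole cached row."""
    def __init__(self, store):
        self.store = store          # backing "SSTables": {row_key: {col: val}}
        self.cache = {}
        self.full_row_reads = 0     # how often we paid for a full row slice

    def read(self, row_key, col):
        if row_key not in self.cache:
            # Cache miss: silently promoted to a FULL row read,
            # even though the caller asked for a single column.
            self.cache[row_key] = dict(self.store[row_key])
            self.full_row_reads += 1
        return self.cache[row_key].get(col)

    def write(self, row_key, col, val):
        self.store.setdefault(row_key, {})[col] = val
        self.cache.pop(row_key, None)   # the entire cached row is invalidated
```

With a wide row and a mixed read/write workload, every write forces the next read to re-slice the whole row, which is exactly the "worst that could happen".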
14. Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
15. Compression
• NO!
• Compression disables a bunch of fast paths, slowing down fast reads
17. Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But if you keep writing to the same rows for a long time, each row spreads over more and more SSTables
• And read performance falls off a cliff
• We've seen clusters where reads touch a dozen SSTables on average
• nodetool cfhistograms is your friend
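The spreading effect is easy to see in a toy flush model (a sketch under simplified assumptions: one memtable, no compaction, uniformly random writes to a hot row set):

```python
import random

def avg_sstables_per_read(writes, rows, flush_every):
    """Toy model: every memtable flush produces an immutable SSTable, so
    writing to the same rows for a long time spreads each row over many
    SSTables -- and a read must touch every SSTable holding a fragment."""
    sstables, memtable = [], set()
    for i in range(writes):
        memtable.add(random.randrange(rows))       # overwrite a random row
        if (i + 1) % flush_every == 0:
            sstables.append(frozenset(memtable))   # flush creates an SSTable
            memtable = set()
    # average number of SSTables a read of each row has to touch
    return sum(sum(1 for t in sstables if r in t) for r in range(rows)) / rows
```

With 10 000 writes over 100 hot rows and a flush every 1 000 writes, the average read ends up touching close to all 10 SSTables; `nodetool cfhistograms` reports exactly this SSTables-per-read distribution for a real cluster.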
18. Performance worse over time
• CASSANDRA-5514
• Every SSTable stores the min/max column names it contains
• Time series-like data is effectively partitioned, so slice reads can skip irrelevant SSTables
19. Few cross continent clusters
• Few Cassandra users run cross-continent clusters
• So we are kind of on our own with some problems
• CASSANDRA-5148
• Disable TCP nodelay on inter-DC links
• Reduced packet count by 20%
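For context on the knob itself: `TCP_NODELAY=1` disables Nagle's algorithm (one packet per small write), while `TCP_NODELAY=0` lets the kernel coalesce small writes into fewer packets, which is the effect behind the packet-count reduction. A minimal standard-library illustration (this only shows the socket option; it is not Cassandra's networking code):

```python
import socket

def packet_coalescing_enabled(sock):
    # Nagle coalescing is active whenever TCP_NODELAY is off (0).
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # nodelay on
assert not packet_coalescing_enabled(sock)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)   # nodelay off
assert packet_coalescing_enabled(sock)
sock.close()
```

The trade-off: coalescing adds a little latency per message, which is cheap on a link that already has 100+ ms of RTT but matters on a LAN.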
21. How not to upgrade Cassandra
• Very few total cluster outages
- Clusters have been up and running since the early 0.7 days, through rolling upgrades, expansions, full hardware replacements, etc.
• Never lost any data!
- No matter how spectacularly Cassandra fails, it has never written bad data
- Immutable SSTables FTW
22. Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed a rolling upgrade would work
- It did not
• One would expect 0.8.6 to have this fixed
• We patched Cassandra and rolled it out a day later
• Takeaways:
- ALWAYS try rolling upgrades in a testing environment
- Don't believe what people on the Internet tell you
23. Upgrade from 0.8 to 1.0
• We tried upgrading in the test env; it worked fine
• It worked fine in production...
• Except on the last cluster
• All data gone
24. Upgrade from 0.8 to 1.0
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrubbing the data fixed it
• Takeaway: ALWAYS test upgrades using production data
25. Upgrade from 1.0 to 1.1
• After the previous upgrades, we did all the tests with production data and everything worked fine...
• Until we redid it in production, and we had reports of missing rows
• Scrub ⇾ restart made them reappear
• This was in December; we have not been able to reproduce it
• PEBKAC?
• Takeaway: ?
29. What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Network hiccup
• Major GC
• Reality
30. What happens if one node is slow?
• The coordinator has a request queue
• If a node goes down completely, gossip will notice quickly and drop the node
• But what happens if a node is just super slow?
31. What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the slow node fills up on every coordinator in the cluster
• And the entire cluster stops accepting requests
32. What happens if one node is slow?
• No single point of failure?
33. What happens if one node is slow?
• Solution: partitioner awareness in the client
• Then a slow node affects at most the 3 replicas that own its data, not the whole cluster
• Available in Astyanax
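Astyanax is a Java client; the routing idea behind it can be sketched in Python (a toy consistent-hash ring with one token per node and a hypothetical `TokenAwareClient` class, not Astyanax's actual implementation, which also uses the real partitioner and vnodes):

```python
from bisect import bisect_right
from hashlib import md5

class TokenAwareClient:
    """Toy token-aware routing: hash the row key onto the ring and send the
    request straight to the owning replicas, so a slow node only affects
    keys whose replica set contains it."""
    def __init__(self, nodes, replication_factor=3):
        # One token per node for simplicity (real rings use many vnodes).
        self.ring = sorted((self.token(n), n) for n in nodes)
        self.rf = replication_factor

    @staticmethod
    def token(key):
        return int(md5(key.encode()).hexdigest(), 16)

    def replicas(self, row_key):
        # First node clockwise from the key's token, then RF-1 successors.
        i = bisect_right(self.ring, (self.token(row_key), chr(0x10FFFF)))
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.rf)]
```

Because every request goes directly to the 3 owning replicas, requests for other key ranges never sit in a queue behind the slow node.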
35. How not to delete data
How is data deleted?
• SSTables are immutable, so we can't remove the data in place
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other write
36. How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
• Once a tombstone is the only value for a specific column, the tombstone can go away
• But we still need a grace time (gc_grace_seconds) to handle node downtime
37. How not to delete data
• Tombstones can only be deleted once all non-tombstone values for the column have been compacted away
• Tombstones can only be deleted if all SSTables holding values for the row take part in the same compaction
• If you're using size-tiered compaction, 'old' rows will rarely get that treatment
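Those two purge conditions can be sketched as a toy compaction of one row (a simplified model: a cell is `(timestamp, value)` with `None` as the tombstone marker; `GC_GRACE_SECONDS` stands in for Cassandra's per-CF gc_grace_seconds):

```python
GC_GRACE_SECONDS = 10 * 24 * 3600   # assumed grace period for the sketch

def compact(sstables, row_fragments_elsewhere, now):
    """Toy minor compaction of one row's cells. A tombstone is dropped only
    when gc_grace has passed AND no SSTable outside this compaction still
    holds fragments of the row (otherwise deleted data could resurrect)."""
    merged = {}
    for table in sstables:
        for col, cell in table.items():        # cell = (ts, value_or_None)
            if col not in merged or cell[0] > merged[col][0]:
                merged[col] = cell             # newest timestamp wins
    out = {}
    for col, (ts, val) in merged.items():
        if val is None:                        # this cell is a tombstone
            grace_elapsed = now - ts > GC_GRACE_SECONDS
            if grace_elapsed and not row_fragments_elsewhere:
                continue                       # safe to purge the tombstone
        out[col] = (ts, val)
    return out
```

Under size-tiered compaction a hot row usually does have fragments in SSTables outside the compaction set, so the `row_fragments_elsewhere` branch keeps firing and the tombstones survive.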
38. How not to delete data
• Tombstones are a problem even with leveled compaction
• In theory, 90% of all reads should be served from a single SSTable
• In production, we've found that only 50-80% of reads hit just one SSTable
• Frequently updated columns end up in most levels, causing tombstones to stick around
39. How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will rarely get deleted
• The problem is much worse for «popular» rows
• Avoid schemas that delete data!
40. TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
41. TTL:ed data
• Noooooo...
• (Overwritten data could theoretically bounce back)
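The resurrection risk in the last bullet can be demonstrated with a toy read path (a sketch: cells are `(timestamp, value, expires_at)`, and the flag contrasts silently dropping expired cells against treating them as tombstones, which is what a correct store has to do):

```python
def read_column(sstables, col, now, drop_expired_silently):
    """Toy timestamp-merge read. An expired TTL cell must still act as a
    tombstone: if it is silently dropped, an older overwritten value in
    another SSTable wins the merge and 'bounces back'."""
    best = None
    for table in sstables:
        cell = table.get(col)
        if cell is None:
            continue
        ts, value, expires_at = cell
        if expires_at is not None and expires_at <= now:
            if drop_expired_silently:
                continue               # pretend the cell never existed (bug)
            value = None               # correct: expired cell == tombstone
        if best is None or ts > best[0]:
            best = (ts, value)
    return best[1] if best else None
```

So an expired TTL cell cannot simply vanish; it has to shadow older versions until compaction has merged them away, which is why TTL:ed data is not free either.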
44. The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtime and huge scalability problems
45. The Playlist service
• Perfect test case for Cassandra!
46. Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed version control system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature
47. Playlist data model
• That's actually a really useful feature said no one ever
48. Playlist data model
• A playlist is a sequence of changes
• The changes are the authoritative data; everything else is an optimization
• Cassandra is pretty neat for storing this kind of stuff
• We can use consistency level ONE safely
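A toy version of why CL.ONE is safe here (a sketch of the idea, not Spotify's actual playlist code: each change is written once under a unique sequence id and never mutated, so replicas can only differ by *missing* changes, never by conflicting ones, and reconciliation is a simple set union):

```python
def apply_changes(initial_tracks, changes):
    """Rebuild a playlist by replaying its change log in sequence order."""
    tracks = list(initial_tracks)
    for _seq, op, track in sorted(changes):    # change = (seq, op, track)
        if op == "add":
            tracks.append(track)
        elif op == "remove" and track in tracks:
            tracks.remove(track)
    return tracks

# Two replicas that each missed a write: merging is just set union,
# because the changes themselves are immutable.
replica_a = {(1, "add", "song1"), (2, "add", "song2")}
replica_b = {(1, "add", "song1"), (3, "remove", "song1")}
merged = replica_a | replica_b
```

A read at consistency level ONE may see a stale prefix of the log, but never a wrong or conflicting one, which is much weaker than what a mutable-row schema would require.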
50. Tombstone hell
• The HEAD column family stores the sequence ID of the latest revision of each playlist
• 90% of all reads go to HEAD
• Its SSTables are mlocked into RAM
51. Tombstone hell
• We noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• We copied the SSTables to a development machine for investigation
52. Tombstone hell
• The Cassandra tool sstable2json showed that the row contained 600 000 tombstones!
54. Tombstone hell
• We expected tombstones to be deleted after 30 days
• Nope, all the tombstones from the past 1.5 years were still there
• Revelation: rows that exist in 4+ SSTables never have tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
• Major compaction (cut the CF size in half)
55. Zombie tombstones
• We ran major compactions manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds again‽‽‽
56. Repair vs major compactions
A repair between the major compactions "resurrected" the tombstones :(
New solution:
• Repairs Monday-Friday
• Major compactions Saturday-Sunday
A (by now) well-known Cassandra anti-pattern:
Don't use Cassandra to store queues
57. Cassandra counters
• There are lots of places in the Spotify UI where we count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that sounds suitable
• Is this awesome?
60. How not to fail
• Treat Cassandra as a utility belt
Lots of one-off solutions:
• Weekly major compactions
• Delete all SSTables and recreate them from scratch every day
• Memlock frequently used SSTables in RAM
61. Lessons
• Cassandra read performance is heavily dependent on the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns make read performance slowly decrease
• This makes benchmarks close to useless
62. Lessons
• Avoid repeatedly writing data to the same row over very long spans of time
• Avoid deleting data
• If you're working at scale, you'll need to know how Cassandra works under the hood
• nodetool cfhistograms is your friend
63. Lessons
• There are still various esoteric problems with large-scale Cassandra installations
• Debugging them is really interesting
• If you agree with the above statements, you should totally come work with us