4. #Cassandra13
The Spotify backend
• Around 4000 servers in 4 datacenters
• Volumes
- We have ~ 12 soccer fields of music
- Streaming ~ 4 Wikipedias/second
- ~ 24 000 000 active users
5. The Spotify backend
• Specialized software powering Spotify
- ~ 70 services
- Mostly Python, some Java
- Small, simple services, each responsible for a single task
6. Storage needs
• We used to be a pure PostgreSQL shop
• Postgres is awesome, but...
- Poor cross-site replication support
- Write master failure requires manual intervention
- Sharding throws most relational advantages out the window
7. Cassandra @ Spotify
• We started using Cassandra 2+ years ago
- ~ 24 services use it by now
- ~ 300 Cassandra nodes
- ~ 50 TB of data
• Back then, there was little information about how to design efficient, scalable storage schemas for Cassandra
8. Cassandra @ Spotify
• So we screwed up
• A lot
10. Read repair
• Repairs data from outages as part of regular read operations
• With RR, reads request hash digests from all replica nodes
• The result is still returned as soon as enough nodes have replied
• If there is a digest mismatch, perform a repair
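The digest-and-repair cycle above can be sketched as a toy Python model (a sketch, not Cassandra's actual read path: replicas are plain dicts of column → (timestamp, value), and the repair runs synchronously):

```python
import hashlib

def digest(row):
    # Replicas return an MD5 digest of the row instead of the full data.
    return hashlib.md5(repr(sorted(row.items())).encode()).hexdigest()

def read_with_repair(replicas):
    """Toy coordinator: fetch full data from one replica and digests from
    the rest; on a digest mismatch, merge by timestamp and write the merged
    row back to every replica (the repair)."""
    data = dict(replicas[0])                      # the one full data read
    if len({digest(r) for r in replicas}) > 1:    # digest mismatch
        merged = {}
        for r in replicas:
            for col, (ts, val) in r.items():
                if col not in merged or ts > merged[col][0]:
                    merged[col] = (ts, val)       # newest timestamp wins
        for r in replicas:
            r.clear()
            r.update(merged)                      # repair pushed to replicas
        data = merged
    return data
```

Note the cost the next slide complains about: even when the caller only needs one replica's answer, every replica did work to produce a digest.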
11. Read repair
• Useful factoid: Read repair is performed across all data centers
• So in a multi-DC setup, all reads result in requests being sent to every data center
• We've made this mistake a bunch of times
• New in 1.1: dclocal_read_repair_chance
12. Row cache
• Cassandra can be configured to cache entire data rows in RAM
• Intended as a memcache alternative
• Let's enable it. What's the worst that could happen, right?
13. Row cache
NO!
• Only stores full rows
• All cache misses are silently promoted to full row slices
• All writes invalidate the entire row
• Don't use it unless you understand all your use cases
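The three gotchas above can be made concrete with a toy cache model (an illustrative sketch of the old row cache's semantics, not Cassandra code; `TinyRowCache` and its counters are invented names):

```python
class TinyRowCache:
    """Toy model of Cassandra's old row cache semantics: only whole rows
    are cached, and any write evicts the whole cached row."""
    def __init__(self, store):
        self.store = store          # backing "SSTables": {row_key: {col: val}}
        self.cache = {}
        self.full_row_reads = 0     # how often we paid for a full row slice

    def read(self, row_key, col):
        if row_key not in self.cache:
            # Cache miss: silently promoted to a FULL row read,
            # even though the caller asked for a single column.
            self.cache[row_key] = dict(self.store[row_key])
            self.full_row_reads += 1
        return self.cache[row_key].get(col)

    def write(self, row_key, col, val):
        self.store.setdefault(row_key, {})[col] = val
        self.cache.pop(row_key, None)   # the entire cached row is invalidated
```

With a wide row and a mixed read/write workload, every write forces the next read to re-slice the whole row, which is exactly the "worst that could happen".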
14. Compression
• Cassandra supports transparent compression of all data
• The compression algorithm (Snappy) is super fast
• So you can just enable it and everything will be better, right?
15. Compression
• NO!
• Compression disables a bunch of fast paths, slowing down fast reads
17. Performance worse over time
• A freshly loaded Cassandra cluster is usually snappy
• But if you keep writing to the same rows for a long time, each row spreads over more and more SSTables
• And read performance falls off a cliff
• We've seen clusters where reads touch a dozen SSTables on average
• nodetool cfhistograms is your friend
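The spreading effect is easy to see in a toy flush model (a sketch under simplified assumptions: one memtable, no compaction, uniformly random writes to a hot row set):

```python
import random

def avg_sstables_per_read(writes, rows, flush_every):
    """Toy model: every memtable flush produces an immutable SSTable, so
    writing to the same rows for a long time spreads each row over many
    SSTables -- and a read must touch every SSTable holding a fragment."""
    sstables, memtable = [], set()
    for i in range(writes):
        memtable.add(random.randrange(rows))       # overwrite a random row
        if (i + 1) % flush_every == 0:
            sstables.append(frozenset(memtable))   # flush creates an SSTable
            memtable = set()
    # average number of SSTables a read of each row has to touch
    return sum(sum(1 for t in sstables if r in t) for r in range(rows)) / rows
```

With 10 000 writes over 100 hot rows and a flush every 1 000 writes, the average read ends up touching close to all 10 SSTables; `nodetool cfhistograms` reports exactly this SSTables-per-read distribution for a real cluster.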
18. Performance worse over time
• CASSANDRA-5514
• Every SSTable stores the min/max column names it contains
• Time series-like data is effectively partitioned, so slice reads can skip irrelevant SSTables
19. Few cross continent clusters
• Few Cassandra users run cross-continent clusters
• So we are kind of on our own with some problems
• CASSANDRA-5148
• Disable TCP nodelay on inter-DC links
• Reduced packet count by 20%
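For context on the knob itself: `TCP_NODELAY=1` disables Nagle's algorithm (one packet per small write), while `TCP_NODELAY=0` lets the kernel coalesce small writes into fewer packets, which is the effect behind the packet-count reduction. A minimal standard-library illustration (this only shows the socket option; it is not Cassandra's networking code):

```python
import socket

def packet_coalescing_enabled(sock):
    # Nagle coalescing is active whenever TCP_NODELAY is off (0).
    return sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY) == 0

sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)   # nodelay on
assert not packet_coalescing_enabled(sock)
sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 0)   # nodelay off
assert packet_coalescing_enabled(sock)
sock.close()
```

The trade-off: coalescing adds a little latency per message, which is cheap on a link that already has 100+ ms of RTT but matters on a LAN.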
21. How not to upgrade Cassandra
• Very few total cluster outages
- Clusters have been up and running since the early 0.7 days, through rolling upgrades, expansions, full hardware replacements, etc.
• Never lost any data!
- No matter how spectacularly Cassandra fails, it has never written bad data
- Immutable SSTables FTW
22. Upgrade from 0.7 to 0.8
• This was the first big upgrade we did, 0.7.4 ⇾ 0.8.6
• Everyone claimed a rolling upgrade would work
- It did not
• One would expect 0.8.6 to have this fixed
• We patched Cassandra and rolled it out a day later
• Takeaways:
- ALWAYS try rolling upgrades in a testing environment
- Don't believe what people on the Internet tell you
23. Upgrade from 0.8 to 1.0
• We tried upgrading in the test env; it worked fine
• It worked fine in production...
• Except on the last cluster
• All data gone
24. Upgrade from 0.8 to 1.0
• Many keys per SSTable ⇾ corrupt bloom filters
• Made Cassandra think it didn't have any keys
• Scrubbing the data fixed it
• Takeaway: ALWAYS test upgrades using production data
25. Upgrade from 1.0 to 1.1
• After the previous upgrades, we did all the tests with production data and everything worked fine...
• Until we redid it in production, and we had reports of missing rows
• Scrub ⇾ restart made them reappear
• This was in December; we have not been able to reproduce it
• PEBKAC?
• Takeaway: ?
29. What happens if one node is slow?
Many reasons for temporary slowness:
• Bad RAID battery
• Sudden bursts of compaction/repair
• Bursty load
• Network hiccup
• Major GC
• Reality
30. What happens if one node is slow?
• The coordinator has a request queue
• If a node goes down completely, gossip will notice quickly and drop the node
• But what happens if a node is just super slow?
31. What happens if one node is slow?
• Gossip doesn't react quickly to slow nodes
• The request queue for the slow node fills up on every coordinator in the cluster
• And the entire cluster stops accepting requests
32. What happens if one node is slow?
• No single point of failure?
33. What happens if one node is slow?
• Solution: partitioner awareness in the client
• Then a slow node affects at most the 3 replicas that own its data, not the whole cluster
• Available in Astyanax
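Astyanax is a Java client; the routing idea behind it can be sketched in Python (a toy consistent-hash ring with one token per node and a hypothetical `TokenAwareClient` class, not Astyanax's actual implementation, which also uses the real partitioner and vnodes):

```python
from bisect import bisect_right
from hashlib import md5

class TokenAwareClient:
    """Toy token-aware routing: hash the row key onto the ring and send the
    request straight to the owning replicas, so a slow node only affects
    keys whose replica set contains it."""
    def __init__(self, nodes, replication_factor=3):
        # One token per node for simplicity (real rings use many vnodes).
        self.ring = sorted((self.token(n), n) for n in nodes)
        self.rf = replication_factor

    @staticmethod
    def token(key):
        return int(md5(key.encode()).hexdigest(), 16)

    def replicas(self, row_key):
        # First node clockwise from the key's token, then RF-1 successors.
        i = bisect_right(self.ring, (self.token(row_key), chr(0x10FFFF)))
        return [self.ring[(i + k) % len(self.ring)][1] for k in range(self.rf)]
```

Because every request goes directly to the 3 owning replicas, requests for other key ranges never sit in a queue behind the slow node.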
35. How not to delete data
How is data deleted?
• SSTables are immutable, so we can't remove the data in place
• Cassandra creates tombstones for deleted data
• Tombstones are versioned the same way as any other write
36. How not to delete data
Do tombstones ever go away?
• During compactions, tombstones can get merged into SSTables that hold the original data, making the tombstones redundant
• Once a tombstone is the only value for a specific column, the tombstone can go away
• But we still need a grace time (gc_grace_seconds) to handle node downtime
37. How not to delete data
• Tombstones can only be deleted once all non-tombstone values for the column have been compacted away
• Tombstones can only be deleted if all SSTables holding values for the row take part in the same compaction
• If you're using size-tiered compaction, 'old' rows will rarely get that treatment
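Those two purge conditions can be sketched as a toy compaction of one row (a simplified model: a cell is `(timestamp, value)` with `None` as the tombstone marker; `GC_GRACE_SECONDS` stands in for Cassandra's per-CF gc_grace_seconds):

```python
GC_GRACE_SECONDS = 10 * 24 * 3600   # assumed grace period for the sketch

def compact(sstables, row_fragments_elsewhere, now):
    """Toy minor compaction of one row's cells. A tombstone is dropped only
    when gc_grace has passed AND no SSTable outside this compaction still
    holds fragments of the row (otherwise deleted data could resurrect)."""
    merged = {}
    for table in sstables:
        for col, cell in table.items():        # cell = (ts, value_or_None)
            if col not in merged or cell[0] > merged[col][0]:
                merged[col] = cell             # newest timestamp wins
    out = {}
    for col, (ts, val) in merged.items():
        if val is None:                        # this cell is a tombstone
            grace_elapsed = now - ts > GC_GRACE_SECONDS
            if grace_elapsed and not row_fragments_elsewhere:
                continue                       # safe to purge the tombstone
        out[col] = (ts, val)
    return out
```

Under size-tiered compaction a hot row usually does have fragments in SSTables outside the compaction set, so the `row_fragments_elsewhere` branch keeps firing and the tombstones survive.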
38. How not to delete data
• Tombstones are a problem even with leveled compaction
• In theory, 90% of all reads should be served from a single SSTable
• In production, we've found that only 50-80% of reads hit just one SSTable
• Frequently updated columns end up in most levels, causing tombstones to stick around
39. How not to delete data
• Deletions are messy
• Unless you perform major compactions, tombstones will rarely get deleted
• The problem is much worse for «popular» rows
• Avoid schemas that delete data!
40. TTL:ed data
• Cassandra supports TTL:ed data
• Once TTL:ed data expires, it should just be compacted away, right?
• We know we don't need the data anymore, no need for a tombstone, so it should be fast, right?
41. TTL:ed data
• Noooooo...
• (Overwritten data could theoretically bounce back)
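The resurrection risk in the last bullet can be demonstrated with a toy read path (a sketch: cells are `(timestamp, value, expires_at)`, and the flag contrasts silently dropping expired cells against treating them as tombstones, which is what a correct store has to do):

```python
def read_column(sstables, col, now, drop_expired_silently):
    """Toy timestamp-merge read. An expired TTL cell must still act as a
    tombstone: if it is silently dropped, an older overwritten value in
    another SSTable wins the merge and 'bounces back'."""
    best = None
    for table in sstables:
        cell = table.get(col)
        if cell is None:
            continue
        ts, value, expires_at = cell
        if expires_at is not None and expires_at <= now:
            if drop_expired_silently:
                continue               # pretend the cell never existed (bug)
            value = None               # correct: expired cell == tombstone
        if best is None or ts > best[0]:
            best = (ts, value)
    return best[1] if best else None
```

So an expired TTL cell cannot simply vanish; it has to shadow older versions until compaction has merged them away, which is why TTL:ed data is not free either.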
44. The Playlist service
Our old playlist system had many problems:
• Stored data across hundreds of millions of files, making the backup process really slow
• Home-brewed replication model that didn't work very well
• Frequent downtime and huge scalability problems
45. The Playlist service
• Perfect test case for Cassandra!
46. Playlist data model
• Every playlist is a revisioned object
• Think of it like a distributed version control system
• Allows concurrent modification on multiple offline clients
• We even have an automatic merge conflict resolver that works really well!
• That's actually a really useful feature
47. Playlist data model
• That's actually a really useful feature said no one ever
48. Playlist data model
• A playlist is a sequence of changes
• The changes are the authoritative data; everything else is an optimization
• Cassandra is pretty neat for storing this kind of stuff
• We can use consistency level ONE safely
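A toy version of why CL.ONE is safe here (a sketch of the idea, not Spotify's actual playlist code: each change is written once under a unique sequence id and never mutated, so replicas can only differ by *missing* changes, never by conflicting ones, and reconciliation is a simple set union):

```python
def apply_changes(initial_tracks, changes):
    """Rebuild a playlist by replaying its change log in sequence order."""
    tracks = list(initial_tracks)
    for _seq, op, track in sorted(changes):    # change = (seq, op, track)
        if op == "add":
            tracks.append(track)
        elif op == "remove" and track in tracks:
            tracks.remove(track)
    return tracks

# Two replicas that each missed a write: merging is just set union,
# because the changes themselves are immutable.
replica_a = {(1, "add", "song1"), (2, "add", "song2")}
replica_b = {(1, "add", "song1"), (3, "remove", "song1")}
merged = replica_a | replica_b
```

A read at consistency level ONE may see a stale prefix of the log, but never a wrong or conflicting one, which is much weaker than what a mutable-row schema would require.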
50. Tombstone hell
• The HEAD column family stores the sequence ID of the latest revision of each playlist
• 90% of all reads go to HEAD
• Its SSTables are mlocked into RAM
51. Tombstone hell
• We noticed that HEAD requests took several seconds for some lists
• Easy to reproduce in cassandra-cli:
• get playlist_head[utf8('spotify:user...')];
• 1-15 seconds latency; should be < 0.1 s
• We copied the SSTables to a development machine for investigation
52. Tombstone hell
• The Cassandra tool sstable2json showed that the row contained 600 000 tombstones!
54. Tombstone hell
• We expected tombstones to be deleted after 30 days
• Nope, all the tombstones from the past 1.5 years were still there
• Revelation: rows that exist in 4+ SSTables never have tombstones deleted during minor compactions
• Frequently updated lists exist in nearly all SSTables
Solution:
• Major compaction (cut the CF size in half)
55. Zombie tombstones
• We ran major compactions manually on all nodes over a few days
• All seemed well...
• But a week later, the same lists took several seconds again‽‽‽
56. Repair vs major compactions
A repair between the major compactions "resurrected" the tombstones :(
New solution:
• Repairs Monday-Friday
• Major compactions Saturday-Sunday
A (by now) well-known Cassandra anti-pattern:
Don't use Cassandra to store queues
57. Cassandra counters
• There are lots of places in the Spotify UI where we count things
• # of followers of a playlist
• # of followers of an artist
• # of times a song has been played
• Cassandra has a feature called distributed counters that sounds suitable
• Is this awesome?
60. How not to fail
• Treat Cassandra as a utility belt
Lots of one-off solutions:
• Weekly major compactions
• Delete all SSTables and recreate them from scratch every day
• Memlock frequently used SSTables in RAM
61. Lessons
• Cassandra read performance is heavily dependent on the temporal patterns of your writes
• Cassandra is initially snappy, but various write patterns make read performance slowly decrease
• This makes benchmarks close to useless
62. Lessons
• Avoid repeatedly writing data to the same row over very long spans of time
• Avoid deleting data
• If you're working at scale, you'll need to know how Cassandra works under the hood
• nodetool cfhistograms is your friend
63. Lessons
• There are still various esoteric problems with large-scale Cassandra installations
• Debugging them is really interesting
• If you agree with the above statements, you should totally come work with us