Liveblogging: Migration from 100s to 100s of Millions

Migration From 100s to 100s of Millions – Messaging to Mobile Devices with Cassandra at Urban Airship

Urban Airship – provides hosting for mobile services that devs should not build themselves (e.g. push notifications, content delivery services, etc). They provide a unified API for all these services across all platforms (Android, iOS, RIM, etc). They are starting to have SLAs for throughput and latency.

Transactional intake system at Urban Airship:
API – Apache/Python/django+piston+pycassa
Device Negotiation layer – Java NIO+Hector
Message Delivery layer – Python, Java NIO + Hector
Device Data checkins – Java HTTPS endpoint
Persistence – sharded PostgreSQL, Cassandra 0.7, “diminishing footprint” of MongoDB 1.7

What do they use?
Cassandra
HBase
redis (analytics internal measurements)
MongoDB (phasing out)

They don’t use Riak.

Converging on Cassandra + PostgreSQL for transactions, HBase for data warehousing.

They started with PostgreSQL on EC2, but had so many writes that after 6 months they couldn’t scale, so they went to MongoDB.

MongoDB had heavy disk I/O problems, and its unsophisticated locking caused lock contention, deadlocks and replication slave lag that was just not working out for them.

So they moved to Cassandra. Why?
– Well-suited to their data model – simple DAGs
– lots of uuids and hashes which partition well
– retrievals don’t need ordering beyond row keys or time-series data (e.g. doesn’t matter what order 10 million devices are retrieved, just need to retrieve them!)
– Rolling minor version upgrades are easy in Cassandra, no downtime.
– Column TTLs were huge for them (and resulting expiration)
– Particularly well-suited to working around EC2 availability problems
– They needed to partition across multiple availability zones, they didn’t trust fault containment within one zone.
– Read repair and handoff generally did the right thing when a node would flap (Ubuntu #708920)
– No single point-of-failure
– Ability to alter consistency levels (CL) on a per-operation basis – some things aren’t important to be consistent right away, others are very important.
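
The per-operation consistency trade-off can be made concrete with a little arithmetic. This is a minimal sketch, assuming the usual rule that a read is guaranteed to see the latest acknowledged write when the read and write replica counts overlap (R + W > N). The function names are hypothetical helpers for illustration, not part of pycassa or Hector.

```python
def quorum(replication_factor):
    """Number of replicas that make up a QUORUM for this replication factor."""
    return replication_factor // 2 + 1


def replicas_required(level, replication_factor):
    """Map a consistency level name to how many replicas must respond."""
    counts = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return counts[level]


def read_sees_latest_write(write_level, read_level, replication_factor):
    """True when the read and write replica sets must overlap (R + W > N)."""
    w = replicas_required(write_level, replication_factor)
    r = replicas_required(read_level, replication_factor)
    return r + w > replication_factor
```

With a replication factor of 3, writing and reading at ONE gives no overlap guarantee (fine for data that can be briefly stale), while QUORUM on both sides does overlap, at the cost of waiting on two replicas per operation.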

Cassandra tips:
– Know your data model – creating indexes after the fact is a PITA
– design around wide rows (but be careful of I/O, Thrift, Count problems)
– Favor JSON over packed binaries if possible (unless you’re Twitter)
– Be careful using Thrift in the stack – having other services that use Thrift that have to talk to Cassandra has some painful versioning limitations.
– Don’t fear the StorageProxy.
– Looking at the Cassandra source code and getting your hands dirty with the Java code is a MUST.
– Assume the client will fail (difference between read timeout and connection refused)
– When maintaining your own indexes, try and clean up after failure. (i.e. have a good rollback strategy)
– Be ready to clean up inconsistencies anyway
– Verify client library assumptions and exception handling, make sure that you know what’s going on when the client communicates that it couldn’t do a write. Understand what the client is doing so you can figure out whether to retry now or later or what.
– Embedding Cassandra for testing really helped
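
The difference between a timeout and a refused connection can be sketched as follows. This is an illustration only: it assumes the failures surface as Python's built-in `socket.timeout` and `ConnectionRefusedError`, whereas a real client library may wrap them in its own exception types (which is exactly what the tip says to verify). `write_with_retry` and its parameters are hypothetical.

```python
import socket
import time


def write_with_retry(do_write, attempts=3, backoff_seconds=0.1, sleep=time.sleep):
    """Call do_write() and report how the write ended.

    Returns a (status, tries) tuple, where status is one of:
      "ok"        the write was acknowledged
      "unknown"   the last attempt timed out; the write may or may not have landed
      "not_sent"  the last attempt was refused; that attempt wrote nothing
    """
    status = "not_sent"
    for attempt in range(1, attempts + 1):
        try:
            do_write()
            return "ok", attempt
        except ConnectionRefusedError:
            # Nothing reached the node, so retrying is safe.
            # A real client would usually move on to another host here.
            status = "not_sent"
        except socket.timeout:
            # The request may have been applied before the timeout fired.
            # Retrying is only safe if the write is idempotent.
            status = "unknown"
        if attempt < attempts:
            sleep(backoff_seconds * attempt)  # simple linear backoff
    return status, attempts
```

An "unknown" result is the case that leaves inconsistencies behind, for example a hand-maintained index row whose data row never arrived, so it is the one to log and reconcile later rather than silently drop.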

Cassandra in EC2:
– Ensure Dynamic Snitch is enabled (also make sure you check your config files during upgrades…they had Dynamic Snitch off in 0.6 due to bugs, when they upgraded to 0.7 they didn’t turn it on)
– Disk I/O – avoid EBS except for snapshot backups … or use S3. Stripe ephemerals, not EBS volumes: Cassandra is already network I/O heavy, and EBS is a networked disk.
– Avoid smaller instances altogether, i.e. avoid virtualization if you can
– Don’t assume that crossing into a nearby availability zone is more expensive than staying within the same one: sometimes it is, often it isn’t. (No comment on different regions; they haven’t tested that yet.)
– Balance RAM costs vs. the costs of additional hosts. Spend time with the GC logs.
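
The "check your config files during upgrades" advice can be partly automated by comparing the deployed config against the defaults shipped with the new version. The sketch below is deliberately naive and makes two assumptions: that both files are YAML, and that the setting of interest is a simple top-level key (the example uses `dynamic_snitch`, which is how the option appears in 0.7-era `cassandra.yaml` as far as I recall). Both function names are hypothetical.

```python
def read_top_level_settings(yaml_text):
    """Collect simple top-level `key: value` pairs from YAML-like text.

    Not a YAML parser: comments, nested blocks and list items are ignored.
    """
    settings = {}
    for line in yaml_text.splitlines():
        if not line or line[0] in " \t#-":
            continue  # blank, indented (nested), comment, or list item
        key, sep, value = line.partition(":")
        if sep:
            # Drop any trailing inline comment from the value.
            settings[key.strip()] = value.split("#", 1)[0].strip()
    return settings


def changed_settings(shipped_text, deployed_text, watched):
    """Return {key: (shipped, deployed)} for watched keys whose values differ."""
    shipped = read_top_level_settings(shipped_text)
    deployed = read_top_level_settings(deployed_text)
    return {
        key: (shipped.get(key), deployed.get(key))
        for key in watched
        if shipped.get(key) != deployed.get(key)
    }
```

Running `changed_settings` over a short list of watched keys as part of the upgrade checklist would flag a setting that was switched off for an old bug and never switched back on.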

Java best practices:
– ALL Java services, including Cassandra, are managed via the same set of scripts. In most cases, operators don’t treat Cassandra differently from HBase: there is one mechanism to take a thread or heap dump, logging is consistent (GC, application, stdx) for both HBase and Cassandra, and even the init scripts use the same scripts that the operators do.
– Bare metal will rock your world
– Configuring -XX:+UseLargePages will be good too (on bare metal)
– Get familiar with GC logs (-XX:+PrintGCDetails), understand what degenerate CMS collection looks like, and what promotion failures look like. Urban Airship settled at -XX:CMSInitiatingOccupancyFraction=60, lowered from the default of 75, to do CMS collection before there’s a problem, to avoid promotion failures.
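
As a starting point for "spending time with the GC logs", the two trouble signs named above can be counted mechanically. A minimal sketch, assuming the log was produced by HotSpot with CMS and -XX:+PrintGCDetails, in which case the collector marks these events with the phrases "promotion failed" and "concurrent mode failure". The function name is hypothetical, and substring matching is a shortcut, not a real log parser.

```python
from collections import Counter

# Phrases the CMS collector writes into the GC log when it gets into trouble.
TROUBLE_MARKERS = ("promotion failed", "concurrent mode failure")


def count_gc_trouble(lines):
    """Count how many log lines mention each trouble marker."""
    counts = Counter({marker: 0 for marker in TROUBLE_MARKERS})
    for line in lines:
        for marker in TROUBLE_MARKERS:
            if marker in line:
                counts[marker] += 1
    return counts
```

Any iterable of lines works, so an open log file can be passed in directly. Non-zero counts are a hint to look at the surrounding entries and at the occupancy threshold, not a diagnosis on their own.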

Operations:
– Understand when to compact
– Understand upgrade implications for data files
– Watch hinted handoff closely
– Monitor JMX religiously

Looking forward:
– Cassandra is a great hammer, but not everything is a nail
– Co-processors would be awesome (hint hint!)
– They still spend too much time worrying about GC
– Glad to see the ecosystem around the product evolving, CQL, Pig, Brisk