Cloudant Labs: On Google Spanner

By Mike Miller

This past weekend Google released the latest in their seminal research publications -- Spanner. First introduced quietly nearly three years ago, Spanner is now fully described, from motivation and design through execution and operational performance data. The Spanner paper is technically revolutionary for a few reasons (described below), but most importantly it is continued proof that Google is a clear leader in distributed big data systems. In contrast to the original canon (GFS, MapReduce, Bigtable) and more recent replacements (Colossus, Percolator, Pregel, Dremel), I don't predict a Spanner clone will land in the Apache Incubator anytime soon. The primary reason is that Spanner's largest design innovation leverages special hardware. Specifically, one must install a coordinated network of GPS receivers and atomic clocks in each participating datacenter.

At Cloudant we've built a globally distributed data layer that we've been operating at scale for years, so we're watching Spanner closely. Below I provide a non-Googler's perspective on the key elements of the Spanner publication and a personal opinion of its impact on distributed systems and databases going forward.

Spanner in a nutshell

Spanner is a globally distributed, temporally versioned database. It distributes and replicates data globally, both for high availability and to minimize the latency of reads and writes. Spanner accomplishes this using time-based serialization of events, partial locking, and synchronous replication. These ingredients enable Spanner's implementation of externally consistent transactions at global scale.

Spanner is a clear step in the RDBMS direction. While it emerged from work on Bigtable, it meets many of the expectations of a typical RDBMS, but in a manner that can be globally distributed at massive scale. Indeed, the original design goals presented by Jeff Dean in 2009 were ambitious: millions of machines in thousands of data centers. To my knowledge, Spanner is the first system to support externally consistent transactions at global scale and is therefore transformative work.

Spanner's Key Features

  • Externally consistent global write-transactions with synchronous replication.
  • Non-blocking reads in the past.
  • Lock-free read-only transactions.
  • Schematized, semi-relational (tabular) data model.
  • SQL-like query interface.
  • Efficient execution of atomic schema changes without requiring locking of the entire database.
  • Auto-sharding, auto-rebalancing, automatic failure response.
  • Exposes control of data replication and placement to user/application.

Guiding Design Principles

Spanner prioritizes consistency in the face of concurrency above all else. Google's philosophy is quite clear with the opening statement:

"We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."

That means that bottlenecks can and do occur, as witnessed by the fat tail on the read transactions in the F1 data in Table 6.

Availability for transactional writes and reads will suffer with node failures. Google adds, "Running two-phase commit over Paxos mitigates the availability problems." In practice, a single node failure naturally results in a roughly 10-second window of partial unavailability (Figure 5, blue circles). I don't fully understand how to analyze Spanner in the face of Brewer's CAP theorem (consistency, availability, partition tolerance -- choose two at any moment in time). I believe Spanner is consistent and partition-tolerant. In a globally distributed transactional system with synchronous replication, a network partition may lead to write unavailability, depending on replica locations. Reads from local replicas certainly remain available. Interestingly, a new failure mode is introduced -- the clock. If the time uncertainty grows (see below), Spanner slows down accordingly, which will ultimately lead to writes becoming unavailable.

Key Innovations

Spanner's key innovation is around time. It includes a novel system that uses GPS and atomic clocks to distribute a globally synchronized "proper time." The previous dogma in distributed systems was that synchronizing time within and between datacenters is insurmountably hard and uncertain; ergo, serialization of requests is impossible at global scale. Google's key innovation is to accept the uncertainty, keep it small (via atomic clocks and GPS), quantify it, and operate around it. In retrospect this seems obvious, but that doesn't make it any less brilliant.

Once the notion of proper time is introduced, many previously challenging things become very efficient. There are existing algorithms for linearizing events in distributed systems, but Spanner's TrueTime API greatly reduces the complexity. For example, monotonicity across Paxos leaders follows naturally (trivially?) from the introduction of global proper time with quantified uncertainty.
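To make the mechanics concrete, here is a toy Python sketch of the TrueTime interval and the commit-wait rule described in the paper. The function names, the EPSILON constant, and the polling loop are illustrative assumptions on my part, not Google's implementation; the real system derives its uncertainty bound from the GPS/atomic-clock infrastructure.

```python
import time
from dataclasses import dataclass

# Toy sketch of the TrueTime idea, not Google's implementation.
# EPSILON stands in for the instantaneous clock uncertainty that the
# GPS/atomic-clock infrastructure keeps small (a few ms in the paper).
EPSILON = 0.005  # seconds; assumed uncertainty bound


@dataclass
class TTInterval:
    earliest: float
    latest: float


def tt_now() -> TTInterval:
    """Return an interval guaranteed to contain true absolute time."""
    t = time.time()
    return TTInterval(t - EPSILON, t + EPSILON)


def tt_after(t: float) -> bool:
    """True only if t has definitely passed."""
    return tt_now().earliest > t


def commit_wait(commit_ts: float) -> None:
    """Hold the commit until commit_ts is definitely in the past, so that
    timestamp order matches real-time (externally consistent) order."""
    while not tt_after(commit_ts):
        time.sleep(0.001)


# A leader picks a commit timestamp no earlier than tt_now().latest,
# then commit-waits on it (roughly 2 * EPSILON in this sketch).
commit_ts = tt_now().latest
commit_wait(commit_ts)
```

Because every commit must wait out roughly twice the uncertainty before it becomes visible, keeping that uncertainty small is what makes the whole scheme practical.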

Interestingly, the transaction rate of Spanner must ultimately be bounded by the uncertainty exposed by TrueTime (max-rate ~ 1 / t_{uncertainty}). Therefore, Spanner must provide fine-grained locking (row-based?) to prevent the system from grinding to a halt as transaction rates increase. Their operational data from F1 quotes 32.1E6 multi-site commits per day (at a mean latency of 103 ms). 32.1E6 writes/day implies 372 writes/sec, which is higher than the bound of 1/0.005 sec = 200 writes/sec set by the typical clock uncertainty. Therefore I conclude that the maximum transaction rate is likely a function of both the time uncertainty and the granularity of the transactional locks required.
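The back-of-the-envelope numbers above are easy to reproduce; here is a minimal check using the paper's quoted F1 figure and an assumed 5 ms uncertainty:

```python
# Reproduce the rate estimates quoted above from the F1 operational data.
multi_site_commits_per_day = 32.1e6      # quoted in the paper
seconds_per_day = 24 * 60 * 60

observed_rate = multi_site_commits_per_day / seconds_per_day
print(f"observed multi-site commit rate: {observed_rate:.0f} writes/sec")    # ~372

epsilon = 0.005                          # assumed typical clock uncertainty, seconds
serial_bound = 1 / epsilon
print(f"fully serialized bound (1/epsilon): {serial_bound:.0f} writes/sec")  # 200
```

Since the observed rate already exceeds the fully serialized bound, commits on disjoint data must be proceeding concurrently, which is the basis for the conclusion about lock granularity above.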

Use Cases

With its transactional nature, Spanner's sweet spot seems to be workloads that are relatively write-light and read-heavy. It appears to be targeted squarely at classic 3-tier applications at scale, specifically those that can accept mean latencies in the 1-100 ms range with large tails. For example, the F1 ad backend requires thick clients, global distribution, scale, and application-level buffering/caching. Last but not least, Spanner requires special hardware (atomic clocks and GPS), so it likely won't be immediately relevant for the Apache community.

Execution

  • Schema leveraged to preserve data locality.
  • Important changes w.r.t. Bigtable to deal with data locality and access patterns.
  • Range partitioning of rows into tablets (see the sketch after this list).
  • Paxos is everywhere.
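Here is a minimal sketch of the range-partitioning idea referenced above, with invented split points. Spanner assigns contiguous key ranges to tablets and moves them around for load balancing, but the lookup itself is just a range search:

```python
import bisect

# Minimal sketch of range partitioning: rows are mapped to tablets by
# lexicographic key ranges, so nearby keys (and, via the schema, related
# rows) land on the same tablet. The split points are invented for
# illustration only.
split_points = ["g", "p"]  # tablet 0: key < "g"; tablet 1: "g" <= key < "p"; tablet 2: key >= "p"


def tablet_for_key(key: str) -> int:
    """Return the index of the tablet whose key range contains `key`."""
    return bisect.bisect_right(split_points, key)


for key in ["apple", "grape", "zebra"]:
    print(f"{key!r} -> tablet {tablet_for_key(key)}")
```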

Compromises

Spanner prioritizes consistency above all else. It is a fully connected system that requires "smart" clients capable of managing locality of requests, local buffering, and transaction reasoning. Therefore availability and throughput are compromised when the time uncertainty increases. Notably, Spanner does not yet support automatic handling of secondary indices. Further, it does not support "offline" access with later reconciliation (à la CouchDB). This latter point is one that is very important to us at Cloudant. We've adopted the CouchDB API for our globally distributed data layer specifically because it enables offline access, handles merging, conflict tracking, and conflict resolution efficiently, and does all of that with JSON over a REST API (no special client necessary). It's clear that offline storage and later reconciliation is on Google's roadmap (see page 21 from Andrew Fikes' 2010 Faculty Summit talk). Spanner helps address locality, but doesn't naturally enable offline access.
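For readers unfamiliar with that model, here is a rough illustration of how conflicts surface in the CouchDB/Cloudant API after two replicas edit the same document offline and then replicate. It assumes a local CouchDB endpoint and the Python requests library; the database name and document id are placeholders.

```python
import requests

# Placeholder endpoint, database, and document id.
DB = "http://localhost:5984/example_db"

# After replication, the API keeps a deterministic "winning" revision and
# exposes the losing ones when asked for conflicts.
doc = requests.get(f"{DB}/some_doc", params={"conflicts": "true"}).json()
print("winning revision:     ", doc["_rev"])
print("conflicting revisions:", doc.get("_conflicts", []))

# Resolution is left to the application: pick or merge a winner, then
# delete the losing revisions so the conflict disappears on all replicas.
for rev in doc.get("_conflicts", []):
    requests.delete(f"{DB}/some_doc", params={"rev": rev})
```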

Comparisons

Spanner's notion of global temporal serialization of events, thick clients, and special hardware stands in stark contrast to Cloudant and CouchDB (MVCC, master-master replication, JSON over HTTP, no special clients). Spanner's largest feature overlap (distribution, synchronous replication, concurrency, and externally consistent transactions) appears to be with Jim Starkey's NuoDB. However, there are notable differences in the implementation approaches of the two systems. Most importantly, Starkey claims to have solved concurrency without requiring serialization, but I have not analyzed his recent patent.

Benchmarks and Operational Data

The description of the benchmarks was the only opaque portion of the article. For example, it's not clear how many zones were used for the data in Table 3, nor how to map Paxos groups onto node counts. I therefore assume these benchmarks were meant to show scaling behavior, not absolute performance. The operational data from the F1 deployment, however, is a wonderful indicator of Spanner's performance in the wild. Typical reads are near 10 ms, and single- and multi-site writes average ~100 ms. There are notable tails on these distributions; e.g., the standard deviation for all reads is roughly 40x the mean, weighing in at 376 ms, although the authors note that an asymmetric mix of spinning media and solid state drives likely contributes significantly to the tails.

Summary

Spanner is an excellent piece of work and is well presented. It represents a pragmatic acceptance of developers' reluctance to reason in the absence of immediately consistent transactions and therefore strikes a bittersweet chord. Philosophically this feels like a big step forward for distributed systems. Time isn't easily synchronized? No big deal: measure the uncertainty, minimize it, and program around it! Philosophically this also feels like a step towards traditional database management systems. Instead of focusing on parallel execution and multi-master scenarios that enable rich, offline application state synchronization, Spanner represents a cleverly executed attack on serialization and more efficient cursors.

Update: Alex Lloyd also gave a great presentation at Berlin Buzzwords 2012 on "Building Spanner".
