At Turn, we build highly complex distributed systems. Because of their critical nature, we need to be able to look deep into the internals of the application logic at runtime to understand why the system is behaving the way it is. To do this, we built Scrub: a tool to interactively analyze the state of distributed systems with minimal disruption.
Given the mixture of legacy code, critical financial transactions, and the ever-growing complexity of our partner relationships, the range of insights needed on a daily basis is very broad. In an ideal world, we would be able to store, index, and search every request that arrives at our servers, the result of every computation triggered by these requests, and every RPC call invoked to carry out these computations. This would give us perfect visibility into the workings of the system.
Unfortunately, this approach does not scale. For a company that handles more than three million QPS, we can easily collect multiple petabytes of raw data every day. Besides the prohibitively expensive storage cost, gaining insights from this data is non-trivial: a thousand-node Hadoop cluster would need at least a few minutes for each query, and when debugging issues, we need insights quickly.
A Practical Middle-Ground: Scrub
Scrub models a distributed platform like a streaming database. There are two main parts to Scrub. The first is a client library, which provides APIs and annotations (for Java clients) to instrument different parts of the application code. This instrumentation tells the Scrub client which objects are to be made visible to Scrub users, and at what point in the code they can safely be taken out of application threads and into Scrub's thread pool for processing and transmission to the second part: the aggregation servers.
These objects are emitted from the various services in your application cluster to a Kafka topic, where they are read by an array of consumers, aggregated, and stored in an HBase table. We have tried to ensure that developers don't have to make many code changes to integrate Scrub. The Scrub instrumentation API resembles that of any commonly used logging framework (such as Slf4j), except that instead of taking simple strings, Scrub’s log() method accepts any Java object. Each object type can be thought of as a “database table” whose “columns” are the fields of the class. Each log() invocation creates a separate stream of objects of the same type, which can then be queried with Scrub’s query language.
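As a minimal sketch of what this kind of instrumentation might look like (the `Scrub` class and `log()` signature here are assumptions based on the description above, not Scrub's actual API):

```java
import java.util.ArrayList;
import java.util.List;

// A plain Java object logged to Scrub; its fields become the "columns"
// of the resulting stream's schema.
class AdRequest {
    final String exchange;
    final long timestampMs;

    AdRequest(String exchange, long timestampMs) {
        this.exchange = exchange;
        this.timestampMs = timestampMs;
    }
}

// Hypothetical stand-in for the Scrub client library. In the real
// system, log() would hand the object off to Scrub's thread pool for
// serialization and emission to Kafka; here we just record it.
class Scrub {
    static final List<Object> emitted = new ArrayList<>();

    static void log(Object o) {
        emitted.add(o);
    }
}

public class Example {
    public static void main(String[] args) {
        // The call site resembles a logging call, but takes an object
        // rather than a string:
        Scrub.log(new AdRequest("exchange-a", System.currentTimeMillis()));
        System.out.println(Scrub.emitted.size()); // prints 1
    }
}
```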
In the figure above, Scrub is used to look at raw requests from an ad exchange, the RPC calls made to an ad server, the responses from those ad servers, and the various computations occurring within the ad server.
Optimizations: Queries, TTLs and Sampling (Or How Scrub Avoids Doing Any Work)
Scrub does not emit any data from a service until a user asks for it. Until a query is registered for a particular stream, all "log-to-scrub" calls in application code are effectively NOPs. This allows us to instrument all corners of our codebase without worrying about the overhead imposed on the host service. Users can pose queries against this "collective schema" as if it were a regular database. For example, a query can contain standard SQL operations such as selection, projection, and aggregation, along with filters specified in where clauses. Scrub tries to execute as many operations as possible in the host service itself. Specifically, it evaluates all filters and projections at the host. This reduces the network load of emitting all tuples (which can be extremely high if they are not filtered).
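The filter-pushdown idea can be sketched as follows (the names and structure here are hypothetical, not Scrub's internals): an active query's where clause is compiled into a predicate that the client evaluates before serializing or sending anything.

```java
import java.util.Objects;
import java.util.function.Predicate;

// Hypothetical sketch of client-side filter pushdown: objects failing
// the query's filter are dropped at the host and never reach the network.
class FilteredEmitter<T> {
    private final Predicate<T> filter;
    private int emittedCount = 0;

    FilteredEmitter(Predicate<T> filter) {
        this.filter = Objects.requireNonNull(filter);
    }

    // Returns true if the object passed the filter and was emitted.
    boolean offer(T obj) {
        if (!filter.test(obj)) {
            return false; // filtered at the host: zero network cost
        }
        emittedCount++;   // stand-in for serializing + sending to Kafka
        return true;
    }

    int emittedCount() {
        return emittedCount;
    }
}

public class PushdownExample {
    public static void main(String[] args) {
        // Query filter: WHERE latencyMs > 100
        FilteredEmitter<Integer> emitter = new FilteredEmitter<>(latency -> latency > 100);
        emitter.offer(50);   // dropped at the host
        emitter.offer(250);  // emitted
        System.out.println(emitter.emittedCount()); // prints 1
    }
}
```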
Scrub achieves its insights at the expense of historical data: it logs data only while a query is active. Every query has a TTL, after which it stops emitting tuples from the service. This TTL can range from two minutes to 24 hours, and having an explicit value guarantees that a host service will never "leak" queries. CPU and IO resources need to be dedicated to Scrub threads only for a limited period of time.
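TTL enforcement on the client can be as simple as a deadline check before each emission. The following is a sketch under that assumption (the class and method names are illustrative, not Scrub's):

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch of query TTL enforcement: once a query's deadline
// passes, the client stops emitting for that stream, so a host service
// can never "leak" a query indefinitely.
class TtlQuery {
    private final Instant expiresAt;

    TtlQuery(Instant start, Duration ttl) {
        this.expiresAt = start.plus(ttl);
    }

    // Emission is allowed only while the query is within its TTL.
    boolean isActive(Instant now) {
        return now.isBefore(expiresAt);
    }
}

public class TtlExample {
    public static void main(String[] args) {
        Instant start = Instant.now();
        TtlQuery query = new TtlQuery(start, Duration.ofMinutes(2));
        System.out.println(query.isActive(start.plusSeconds(60)));  // prints true
        System.out.println(query.isActive(start.plusSeconds(200))); // prints false
    }
}
```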
As a user, you have the flexibility to run a query on a sample of the services in your topology. You can ask Scrub to do the load balancing for you, or cherry-pick specific hosts to run queries on (which is useful if you are A/B testing a new feature).
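Host sampling might look something like the sketch below. The every-n-th-host strategy shown here is an assumption for illustration, not Scrub's actual load-balancing logic:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of selecting a sample of hosts to run a query on.
class HostSampler {
    // Pick every n-th host from the topology.
    static List<String> sample(List<String> hosts, int n) {
        List<String> chosen = new ArrayList<>();
        for (int i = 0; i < hosts.size(); i++) {
            if (i % n == 0) {
                chosen.add(hosts.get(i));
            }
        }
        return chosen;
    }

    public static void main(String[] args) {
        List<String> hosts = Arrays.asList("host-1", "host-2", "host-3", "host-4");
        // Run the query on half of the hosts; cherry-picking specific
        // hosts would simply pass an explicit list instead.
        System.out.println(sample(hosts, 2)); // prints [host-1, host-3]
    }
}
```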
Scrub @ Turn
We have been using Scrub in production at Turn for more than a year. It has helped debug various production and campaign performance issues, and developers have used it to closely monitor new features. Because of its schema-based query language, it has allowed people with no knowledge of the internals of the system to query for data. This frees up cycles for developers to do more development, and allows support staff to independently gather data (and even fix it, where possible) for certain issues.
As Eric Raymond once said, “Given enough eyeballs, all bugs are shallow.” Scrub allows anyone at Turn to easily dig into the fine details of our system, letting us iterate faster on issues both internal and external to it. Taking less time to get to a better product has made Turn faster than we would be without Scrub, and faster than other companies working on similarly challenging problems.
We hope to release Scrub to the public sometime early next year, along with details of how it works and how to integrate it with your code. There are some advanced features in Scrub, which we didn’t get a chance to talk about here. Keep an eye out for future posts on Scrub.