
Counting Towards Infinity, Part II: The Floating Buoy Experiment

by Amobee, August 08, 2016

Many moons ago, I came across an algorithm for estimating the median of a data stream:

int median = 0;
while (dataStream.hasNext()) {
    int value = dataStream.next();
    if (value > median) {
        median++;
    } else if (value < median) {
        median--;
    }
}

The algorithm is quite clever. Its simplicity is both appealing and a deterrent for software engineers. Regardless, the ability to obtain a fairly accurate estimate from a single counter (without hashing) is impressive. As for accuracy and convergence speed? That all depends on the throughput of the data stream, so the algorithm can be quite useful for certain applications.

Recently, I began thinking about how to properly obtain percentiles for large data streams. While solutions exist (e.g. Count-Min estimators with dyadic intervals), I couldn’t help but revisit the simple approach. What if we attach a probability to the estimator? How can we extend it to account for percentiles?

Before diving into the details, let me provide the following disclosure:

  1. What follows below is purely experimental. To the mathematically talented: I invite you to expose problems, propose changes, and derive error bounds. The ideas presented here are intuitive, but mathematical rigor is always desired. I look forward to your suggestions and improvements to supplement or refine this work.
  2. What follows below isn’t technically complex and I would not be surprised if similar work has been done in the past. Again, I welcome all feedback that can further my knowledge on this subject.

With that aside, let’s begin!

Adapting the Algorithm

Since we are dealing with very large streaming data, let's consider sampling to estimate percentiles. We modify the simple median algorithm as follows:

int estimate = 0;
while (dataStream.hasNext()) {
    int value = dataStream.next();
    if (value > estimate && random.nextDouble() < percentile) {
        estimate++;
    } else if (value < estimate && random.nextDouble() > percentile) {
        estimate--;
    }
}

The idea is straightforward: given an input data stream, probabilistically filter the incoming values so that the accepted updates fall above and below the target percentile in the correct proportion. Only this sampled population affects the estimator, which then drifts towards the desired percentile in the same manner as before.

To test this approach, let us first experiment on a controlled data set. Given randomly shuffled integer arrays containing all values between [0, N-1], how long does it take for the estimator to converge on the correct percentile? We run an experiment targeting the 75th percentile for N = 1,000,000, feeding newly shuffled input arrays for each iteration:
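As a rough illustration, the harness below shows how such a run could be driven in Java. The class name, number of passes, and error reporting are illustrative choices, not the code used to produce the plots:

import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class ShuffledArrayExperiment {
    public static void main(String[] args) {
        final int n = 1_000_000;
        final double percentile = 0.75;
        final int truePercentile = (int) (percentile * (n - 1)); // 75th percentile of [0, N-1]

        Random random = new Random();
        List<Integer> values = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            values.add(i);
        }

        int estimate = 0;
        for (int pass = 1; pass <= 10; pass++) {
            Collections.shuffle(values, random); // a freshly shuffled input array for each pass
            for (int value : values) {
                if (value > estimate && random.nextDouble() < percentile) {
                    estimate++;
                } else if (value < estimate && random.nextDouble() > percentile) {
                    estimate--;
                }
            }
            double error = Math.abs(estimate - truePercentile) / (double) truePercentile;
            System.out.printf("pass %d: estimate=%d, relative error=%.4f%n", pass, estimate, error);
        }
    }
}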


In the graph above, the orange line represents the estimate, blue represents error, and black represents the actual percentile. Across numerous runs, the estimate error is consistently reduced to less than 1% after processing the 5th input array (5,000,000 elements). So far, the approach looks quite promising!

Simulating Real Data

Admittedly, real data is less structured than shuffled integer arrays: real input does not cycle through the full value range, and there is no guarantee that every value from the population distribution will be observed in the data stream. We attempt to simulate a real-world problem by estimating population percentiles from observed data samples. Starting with simple distributions, we begin by applying our estimator to samples from the uniform distribution [0, 999,999]:
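Again as a hedged sketch (not the original test harness), the only change from the shuffled-array experiment above is the input source:

import java.util.Random;

public class UniformSamplingExperiment {
    public static void main(String[] args) {
        final double percentile = 0.75;
        Random random = new Random();
        int estimate = 0;

        // Feed i.i.d. samples drawn uniformly from [0, 999,999] into the same estimator.
        for (long i = 0; i < 10_000_000L; i++) {
            int value = random.nextInt(1_000_000);
            if (value > estimate && random.nextDouble() < percentile) {
                estimate++;
            } else if (value < estimate && random.nextDouble() > percentile) {
                estimate--;
            }
        }
        System.out.println("75th percentile estimate: " + estimate); // true value is roughly 750,000
    }
}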

The results look familiar. As before, the orange line represents the estimate, blue represents error, and black represents the actual 75th percentile. As the number of input values increases, sampling from a uniform distribution converges at nearly the same rate as the shuffled arrays. On average, estimator error decreases to around 1% after 4,500,000 input values. But this can change if the estimator is initialized further away. What if we need faster convergence? How can a poorly initialized estimator quickly approach the correct value, without processing vast amounts of data? For that, we explored two solutions.

Accelerated Growth and Decay

One solution is to accelerate the rate of change:

int estimate = 0;
boolean flag = true;   // true if the previous update moved the estimate up
int rate = 1;          // current step size; doubles while drifting in the same direction
while (dataStream.hasNext()) {
    int value = dataStream.next();
    if (value > estimate && random.nextDouble() < percentile) {
        if (flag) {
            // Still moving up: take an accelerated step and double the rate.
            estimate += rate;
            rate *= 2;
        } else {
            // Direction changed: reset to a single step.
            estimate++;
            flag = true;
            rate = 1;
        }
    } else if (value < estimate && random.nextDouble() > percentile) {
        if (!flag) {
            // Still moving down: take an accelerated step and double the rate.
            estimate -= rate;
            rate *= 2;
        } else {
            // Direction changed: reset to a single step.
            estimate--;
            flag = false;
            rate = 1;
        }
    }
}

Here, the rate doubles as long as the estimate keeps moving in the same direction as before. If the direction changes, the rate is reset to 1. This allows a faster approach to the true percentile:


But once near the true percentile, the estimator has difficulty converging. A few unlucky accelerations in the wrong direction can cause severe errors in the estimator:


To prevent this, we need to decrease the rate of acceleration as we approach the true percentile. However, for real data, we do not know what this value is. We try a few feedback mechanisms, but ultimately abandon the idea for what we feel is a cleaner solution.

Converging Tracers

Estimators have two nice properties:

  1. Low execution overhead. Estimators have remarkably low storage costs (a single counter) and high execution efficiency (relies on a single random number generation).
  2. Increased accuracy with input size. Unlike most sketches that have error proportional to input, estimates converge to the true value as input increases.

Since the estimators converge and are fairly cheap, why not spread a few of them and trace their paths to convergence? Below shows the performance of 11 different tracers spaced evenly across the data range:


Observing the traces, we realize we can gain a lot of information from the direction of the drift. Since drift is partially random, we need multiple iterations to gain confidence. Once we process sufficient samples, we can repartition the tracers to obtain faster convergence. Below shows the performance from a single run, repartitioned on every 100,000 input values:


Lookin' good! Of course, we could be accused of cheating, since we initialized the tracers to evenly partition the population distribution range. Yet, had we initialized across all positive 32-bit integers, the result would have been very similar:


Since tracers are updated only in response to observed sample values, they will never drift away into an area with zero samples. Therefore, convergence is guaranteed regardless of the tracers' initialization range. Note that in the graph above, the vertical axis is plotted on a logarithmic scale (during each repartition, the tracers are evenly divided across the range straddling the estimate).
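As a rough illustration of the tracer and repartitioning idea, here is one way it could look in Java. The bracketing rule (re-space the tracers between the adjacent pair that drifted toward each other), the window size, and the stand-in data stream are assumptions inferred from the description above, not the code behind the figures:

import java.util.Random;

public class TracerSketch {
    public static void main(String[] args) {
        final int numTracers = 11;
        final double percentile = 0.75;
        final long window = 100_000; // inputs processed between repartitions
        Random random = new Random();

        long[] tracers = new long[numTracers];
        long lo = 0, hi = Integer.MAX_VALUE; // initial cast across all positive 32-bit integers
        spaceEvenly(tracers, lo, hi);

        for (int round = 0; round < 50; round++) {
            long[] start = tracers.clone();
            for (long i = 0; i < window; i++) {
                int value = random.nextInt(1_000_000); // stand-in data stream
                double u = random.nextDouble();        // one draw shared by all tracers
                for (int t = 0; t < numTracers; t++) {
                    if (value > tracers[t] && u < percentile) tracers[t]++;
                    else if (value < tracers[t] && u > percentile) tracers[t]--;
                }
            }
            // Repartition: find adjacent tracers drifting toward each other; the true
            // percentile lies between them, so re-space all tracers inside that bracket.
            for (int t = 0; t + 1 < numTracers; t++) {
                if (tracers[t] > start[t] && tracers[t + 1] < start[t + 1]) {
                    lo = tracers[t];
                    hi = tracers[t + 1];
                    break;
                }
            }
            spaceEvenly(tracers, lo, hi);
        }
        System.out.println("converged estimate: " + tracers[numTracers / 2]);
    }

    private static void spaceEvenly(long[] tracers, long lo, long hi) {
        for (int t = 0; t < tracers.length; t++) {
            tracers[t] = lo + (hi - lo) * t / (tracers.length - 1);
        }
    }
}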

The Floating Buoy Algorithm

So far, we have successfully estimated a single percentile (75th). What if we want to obtain estimates for multiple percentiles, or perhaps generate an estimated cumulative distribution function (CDF) of the sample stream?

Introducing the Floating Buoy algorithm. We describe this algorithm in its three phases: cast, link, and drift:

Phase I: The Cast

To estimate the CDF, we first pick a set of initial buoy locations (e.g. 20th, 40th, 60th, 80th percentiles) to cast. Each tracer group is assigned to a single percentile and is responsible only for that percentile. For efficiency and to preserve monotonicity amongst the tracers, we generate a single random double on each input. Together, the tracers converge as follows:


Convergence of tracer groups. Sample run with each tracer group undergoing three prunes, processing 10,000 input values before each prune.

Phase II: Link

After we obtain the initial estimates, we link them to estimate the remaining buoy locations. The correct choice of linking function depends heavily on the application. For simplicity, we choose linear interpolation (for most applications this is sufficient, as the buoys will drift anyway).


Linearly interpolated mid-buoys. Total estimate error = 0.93%. 
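A minimal sketch of the link phase, assuming one buoy per integer percentile. The link helper below, and the choice to extend flat beyond the outer anchors, are illustrative assumptions rather than the post's exact implementation:

public class LinkPhase {
    /**
     * @param anchorPercents  sorted percentile indices (0-100) that already have converged estimates
     * @param anchorEstimates the converged estimates for those percentiles
     * @return 101 buoys, one per integer percentile, filled in by linear interpolation
     */
    static double[] link(int[] anchorPercents, double[] anchorEstimates) {
        double[] buoys = new double[101];
        // Extend flat below the first anchor and above the last one.
        for (int p = 0; p <= anchorPercents[0]; p++) {
            buoys[p] = anchorEstimates[0];
        }
        for (int p = anchorPercents[anchorPercents.length - 1]; p <= 100; p++) {
            buoys[p] = anchorEstimates[anchorEstimates.length - 1];
        }
        // Linearly interpolate between consecutive anchors.
        for (int a = 0; a + 1 < anchorPercents.length; a++) {
            int p0 = anchorPercents[a], p1 = anchorPercents[a + 1];
            for (int p = p0; p <= p1; p++) {
                double w = (p - p0) / (double) (p1 - p0);
                buoys[p] = (1 - w) * anchorEstimates[a] + w * anchorEstimates[a + 1];
            }
        }
        return buoys;
    }

    public static void main(String[] args) {
        double[] buoys = link(new int[]{20, 40, 60, 80},
                              new double[]{200_000, 400_000, 600_000, 800_000});
        System.out.println("interpolated 50th percentile: " + buoys[50]); // prints 500000.0
    }
}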

Phase III: Drift

For the remainder of the data stream, the buoys drift according to the input values. For each input, we generate a single random threshold to update all buoys. Only buoys representing percentiles below this threshold have a chance to be decremented, while those above the threshold have a chance to be incremented. Using a single update threshold preserves the monotonicity of the percentile estimates (and allows for simple, monotonic merges as well).
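A hedged sketch of this drift update, assuming one buoy per integer percentile held in a sorted array (the DriftPhase class and its layout are illustrative):

import java.util.Random;

public class DriftPhase {
    private final double[] buoys; // buoys[p] estimates the p-th percentile, p = 0..100
    private final Random random = new Random();

    DriftPhase(double[] initialBuoys) {
        this.buoys = initialBuoys;
    }

    void update(int value) {
        double threshold = random.nextDouble(); // a single draw shared by all buoys
        for (int p = 0; p <= 100; p++) {
            double percentile = p / 100.0;
            if (value > buoys[p] && threshold < percentile) {
                buoys[p]++; // only buoys whose percentile is above the threshold may increment
            } else if (value < buoys[p] && threshold > percentile) {
                buoys[p]--; // only buoys whose percentile is below the threshold may decrement
            }
        }
    }

    public static void main(String[] args) {
        double[] linked = new double[101];
        for (int p = 0; p <= 100; p++) {
            linked[p] = p * 10_000; // e.g. buoys already cast and linked over data on [0, 999,999]
        }
        DriftPhase sketch = new DriftPhase(linked);
        Random data = new Random();
        for (int i = 0; i < 5_000_000; i++) {
            sketch.update(data.nextInt(1_000_000));
        }
        System.out.println("median estimate: " + sketch.buoys[50]); // expect roughly 500,000
    }
}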

More Complex Distributions

How does the Floating Buoy algorithm perform on more complex distributions? For a more realistic data distribution, we tried the truncated Gaussian distribution (μ = 500,000, σ² = μ/3, truncated at 3σ²):


a beta distribution (α = 1, β = 5):


and a power-law distribution (sampled with n = -2 across region [0.00001, 10], scaled to [1, 1,000,000]):
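The post does not state how the power-law samples were generated; one standard option, assuming a density proportional to x^(-2) on [0.00001, 10], is inverse-transform sampling followed by a linear rescale onto [1, 1,000,000] (a sketch, not the original generator):

import java.util.Random;

public class PowerLawSampler {
    private static final double A = 0.00001, B = 10.0;
    private final Random random = new Random();

    int nextSample() {
        double u = random.nextDouble();
        // Inverse CDF of a density proportional to x^(-2) on [A, B]:
        double x = 1.0 / (1.0 / A - u * (1.0 / A - 1.0 / B));
        // Rescale [A, B] linearly onto [1, 1,000,000].
        double scaled = 1.0 + (x - A) * (1_000_000 - 1) / (B - A);
        return (int) Math.round(scaled);
    }

    public static void main(String[] args) {
        PowerLawSampler sampler = new PowerLawSampler();
        for (int i = 0; i < 5; i++) {
            System.out.println(sampler.nextSample());
        }
    }
}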


We see that the estimator performs poorly in regions where input is sparse (note that the power-law distribution is plotted in logarithmic scale and the step-like appearance is an artifact of single-value increments). Let’s try to increase the number of tracer groups:


Better, but still not quite there. Recall that once initialized, each tracer is really just a single constant. Given the price, there is no reason why we can’t add a few more. Let’s initialize one for each percentile:


So this is quite neat. Any remaining discrepancy can easily be resolved by drifting. For those who are doubtful of the power-law performance, these are the same results plotted on a non-logarithmic scale:


Assessing the Results

Let's assess the Floating Buoy algorithm using the benchmarks provided in the first post on real-time data sketching:

  1. Robustness to data growth. Unlike most sketching algorithms, the accuracy of Floating Buoy estimators increases with additional data input, converging to their true percentile values. The monotonicity of percentile estimates allows for simple and direct merging across sketches (a sketch of one possible merge follows this list).
  2. Memory efficiency. The Floating Buoy algorithm is extremely memory efficient, requiring only a single constant to estimate a percentile. During the cast phase, additional tracers are necessary for pruning and faster convergence, but this overhead is also trivial.
  3. Ease of development. The algorithm is simple. The code used to run the experiments, along with test scripts, is provided below.
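The post does not spell out how merging works; one plausible scheme, assuming sketches built over streams of known sizes with one buoy per percentile, is a count-weighted element-wise average, which keeps the merged estimates monotone (a sketch, not the post's implementation):

public class SketchMerge {
    static double[] merge(double[] buoysA, long countA, double[] buoysB, long countB) {
        double weightA = countA / (double) (countA + countB);
        double[] merged = new double[buoysA.length];
        for (int p = 0; p < merged.length; p++) {
            // A convex combination of two sorted arrays is itself sorted, so monotonicity survives.
            merged[p] = weightA * buoysA[p] + (1 - weightA) * buoysB[p];
        }
        return merged;
    }
}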

Source Code

You may find all source code for the experiments covered in this blog post here.

 

