Blog  Engineering  

Counting Towards Infinity: Next Generation Data Warehousing (Part I)

by Amobee, June 01, 2016

Think back to the days when you first learned arithmetic. If you are like most, you learned to count using your fingers. You would have noticed that while they are precise and readily available, they are also restrictive and provide limited storage capacity. Ten would have been your first infrastructural barrier, and while you may have discovered clever algorithmic workarounds, the solutions were likely short-lived and you were forced to outgrow your architecture. 

The world of big data is quite analogous. Often, the difficulty in a problem arises not from the complexity of its algorithms, but the sustainability of the solution. As data sizes continue to grow, we see a steady trend toward streaming algorithms and approximation-based sketch summaries. For certain applications, this approach provides superior real-time performance over slower, precision-based aggregation techniques.

Recently, Turn hit the three million global QPS milestone [7]. The impressions we serve are targeted across 25 billion potential user devices everyday and the data stored in HDFS is steadily approaching 30 petabytes. Nevertheless, complex design coupled with slow-moving queries is enough the ruin the day for any data enthusiast. While physical limitations such as the number of machines, network speed, and projected Capex are relatively fixed, how we design and query our system should remain flexible. How do we organize data such that insight flows efficiently to our front line campaigns? How do queries, with date ranges spanning weeks, months, or years, run in seconds to minutes, not hours to days?

In this multipart series, “Counting Toward Infinity,” we’ll explore various approximation-based sketches to handle problems encountered in our most expensive queries. Before we introduce the sketch for this article, let us first illustrate the desirable characteristics for all scalable algorithms: 

1. Robustness to data growth. By definition, scalable algorithms must be robust to data growth. In today’s world, this means the algorithm must be massively parallelizable with an efficient (and preferably lossless) aggregation or merge step. 

2. Memory efficiency. As stated above, physical constraints of a system tend to be more difficult to change. While exceptions exist, data size, service requirements, and general engineer’s ambition tend to evolve faster than the underlying hardware that supports it. Scalable algorithms are not temporary solutions, and should maintain performance standards as input and system requirements change.

3. Ease of development. An algorithm is not useful if it can only be understood by a single engineer. Often, the best algorithm is the simplest one that still gets the job done. Such systems free up development resources to allow for more comprehensive documentation, monitoring, and testing—ensuring not only a working system today, but a maintainable system for the future.

With these motivations in mind, lets take a deep dive into the technical details of our first sketch: HyperLogLog [4], a linear time, constant space algorithm for estimating multiset cardinality. At Turn, we use HyperLogLog to efficiently estimate the number of unique records that satisfy a given input query.

The HyperLogLog Algorithm 

HyperLogLog applies randomized hashing techniques and stochastic averaging to estimate the cardinality of a multiset S. The algorithm takes as input each element i S and applies a hash function, h(i), to generate a pseudorandom bitstring si. For every si, the position of the leftmost “1”-bit, ρ(si), is observed (e.g. ρ(000101···) = 4, ρ(011001···) = 2, ρ(000010···) = 5, etc.) and the maximum across all bitstrings, maxiS ρ(si ), is stored.

The intuition behind this approach is the bit-pattern observables from random uniform hashing preserve statistics of the input data. In particular, multisets containing a greater number of unique elements are more likely to produce a “longer” bitstring containing more leading zeros. In fact, a prefix of 0μ−11 usually requires at least 2μ input elements. This process, known as probabilistic counting, was applied in earlier work on cardinality estimation [3], [5]. 

It is important to note that using a single maxiS ρ(si ) estimator tends to produce unreliable results, especially for small input cardinalities. Since linear increases in ρ(si) have an exponential impact on the estimate:

results from a single “unlucky” hash can result in severe overestimates. To alleviate this problem, multiple estimators are used and a harmonic mean is applied to dampen any outliers. 

Directly applying multiple estimators suffer from a linearly scaling time penalty due to each estimator requiring its own hash function. This overhead is unacceptable for most large scale real world systems. To preserve the same time complexity compared to a single estimator, each S is first divided into r disjoint submultisets. In practice, this can be done by indexing from the suffix of si (e.g. γ(si) for r = 64 ⇒ 6 bits: γ(···000101) = 5, γ(···100111) = 39, γ(···111111) = 63, γ(···000000) = 0, etc.)

Once the estimates are generated, a harmonic mean is applied to reduce bias and outlier effects. Equation (2) describes the final components of the HyperLogLog cardinality estimate:

To derive ζ, the algorithm first gathers the maximum ρ values for each estimator and stores them in a register array R. It then computes the indicator function across each index, R[ω], to obtain ζ:

The alpha constant is used to correct a systematic multiplicative bias resulting from r2ζ:

Using Equation (2), HyperLogLog provides the following theoretical guarantees:

1. Unbiased estimates. As cardinality of the input increases, |S| , the estimate becomes unbiased:

where |δ1 (|S|)| < 5·10−5 for r ≥ 16. 

2. Bounded error. As |S| , the standard error satisfies:

where |δ2 (|S|)| < 5 · 10−4 for r ≥ 16. Here, the beta constant is bounded, with β16 = 1.106, β32 = 1.070, β64 = 1.054, and β = 1.039.

Flajolet et al. [4] provide a detailed proof for the claims above. In addition, some recommendations are made to improve estimator performance in real-world systems:

1. Initialization of registers. Register array R is used to store the maximum ρ values for each estimator.

When |S| is small, it is possible for R[ω] = ∅. This characteristic known as the Coupon Collector’s Problem occurs frequently when |S| ≪ r log r. To handle the case where one estimator receives no input, each register is initialized to zero as opposed to −∞. 

2. Small range corrections. Simulations suggest HyperLogLog performs well under most big data applications where sketching becomes necessary. However, distortions begin to appear for very small input cardinalities. The most extreme example of this is when S = ∅. If the input is the empty set, Equation (3) reduces to ζ = 1/r, resulting in an estimate of |S|e = αrr2ζ ≈ 0.7r. In cases where a small estimated result occurs (|S|< 2.5r), a linear time probabilistic counting algorithm by Whang et al. [8] is used as a more accurate alternative.

3. Large range corrections. As input cardinality approaches 2b, where b is the number of bits used in the hash function, h(i), the number of collisions drastically increases. This generalized Birthday Problem scenario makes it impossible for HyperLogLog to estimate cardinality as the algorithm has no way of differentiating collisions from value duplication. In cases where the input cardinality is very large, we introduce a different measure for analysisan estimate of the number of unique hashed results (hunique):

Note that when collisions exist, |S|e underestimates |S| and becomes an estimator for hunique instead:

Therefore, the likelihood of collisions must be factored into the estimate. When the estimated cardinality is very large, |S|e > (1/b)2b, the collision-corrected estimate:

derived from applying Equations (8) and (9) should be used.

Applications, Extensions & Source Code

HyperLogLog provides the following benefits over precisely computing multiset cardinality:

1. Constant memory usage. The standard error for HyperLogLog may be approximated as:

for reasonable sizes of r. Note that this error is independent of the input size |S|. This is the main benefit of HyperLogLog, allowing accurate cardinality estimates to be performed in memory regardless of how large the input grows in the warehouse.

2. Parallelizable performance. Since the max operation for computing ρ values in Equation (7) is commutative, input can first be grouped into separate partitions to allow for massive parallelization. This is particularly useful for large datasets that are generally partitioned or indexed for storage anyway. Only one additional max operation is necessary to combine the results of m separate computations:

These properties prove very useful for the industries dealing with big data [2], [9]. HyperLogLog’s popularity also created many initiatives for its improvement. Heule et al. [6] from Google Research provides one of the most widely adopted and comprehensive algorithms, appropriately named HyperLogLog++.

HyperLogLog++ presents a set of improvements to reduce memory requirements while increasing accuracy when applied to large, real-world datasets. The algorithm proposes the following improvements over the original implementation:

1. 64-bit hash function. Recall that the accuracy of HyperLogLog depends on the uniqueness of the hashed elements in S. As mentioned in the analysis on large range corrections in the previous section, hash collisions become a problem as input cardinality approaches 2b, with the collision-correction becoming necessary once |S| > (1/b)2b. Since HyperLogLog’s memory usage is independent of input size, a straightforward way to avoid this is to increase the size of the hash function. HyperLogLog++ uses a 64-bit hash function as opposed to the original 32-bit function used in [4]. By using 64-bits, accuracy increases for large data sets and large range correction becomes no longer necessary for estimates less than (1/64)264 or 288 petaelements. 

2. Small range bias correction. HyperLogLog addresses small range error correction by applying an ad hoc linear counting algorithm [8] for estimates less than 2.5r. However, empirical results show that as input cardinality decreased, most of the estimate error could be attributed to a monotonically increasing bias. Running an experiment on 5,000 randomly generated data sets across 200 input cardinalities, Heule et al. created a lookup table linking estimated cardinality and bias. Using k-nearest neighbor interpolation (where k = 6), HyperLogLog++ achieves the following performance improvements over the original algorithm: 

i. HyperLogLog++ performs comparable to linear counting [8] for very small cardinalities, but does not suffer the same accuracy degradation as input cardinality approaches 2.5r

ii. The original HyperLogLog algorithm forgoes small range error correction for estimates greater than 2.5r. However, this empirically determined threshold is derived by minimizing the error between linear counting and HyperLogLog, not reducing bias. Heule et al. find that there is still a significant upward bias in estimates at and above the threshold. HyperLogLog++ minimizes this bias. Since error is high for linear counting and bias is high for HyperLogLog (at this range), the lookup table performs much better on input cardinalities near the threshold.

3. Sparse representation. To squeeze a bit of extra performance and enhance estimates where linear counting performs better than HyperLogLog, i.e. |S| < r, HyperLogLog++ initializes to a sparse storage representation comprising of index, maximum value pairs, (ω, maxρ). This enables HyperLogLog++ to store index values ωr as long as the predesignated memory constraint has not been reached. By extending indices beyond [0, r − 1], HyperLogLog++ treats small multisets similarly to a complete hashmap, allowing near perfect performance on very small cardinalities. Once the predesignated memory constraint is reached, HyperLogLog++ dynamically switches to its original storage format. 

HyperLogLog receive an active following from within the sketching community. Streamlib [1] is an opensource Java library that provides optimized implementations to many data summarization algorithms, including HyperLogLog and HyperLogLog++. Neustar [2] provides a very informative article on HyperLogLog along with an interactive JavaScript sandbox.

HyperLogLog in Action

Enough with the theory. How well does it perform in practice? First, let us look at how HyperLogLog achieves our goals of scalable design:

1. Robustness to data growth. HyperLogLog has constant time, lossless unions, allowing efficient merges and massive parallelization. As a result, input size for the algorithm is only restricted to the length of the hash result, which doubles in capacity with each additional bit of storage. This makes the algorithm immune to growing input size. 

Fun fact: HyperLogLog can estimate the number of atoms in the universe, with a maximum error of 1%, in less than 1% of the space it takes to store this article. You just have to (somehow) feed the features of every atom into the system.

2. Memory efficiency. HyperLogLog provides very accurate estimates using a negligible amount of memory. Estimates within 5% can be achieved in under a kilobyte. For small input sizes, HyperLogLog can further enhance performance through various forms of small range error correction.

3. Ease of development. HyperLogLog is a popular algorithm that is simple to implement, quick to verify, and easy to maintain. It’s wide use in industry and many open source options make it a natural choice in a distributed production environment. 

So let’s take the algorithm for a spin. Currently, we have 25 billion user profiles. Let’s ask the question: How many distinct users have we served this past week?

SELECT ESTIMATE(DISTINCT user_id)
FROM impressions
DATES last_7_days

Answer: 1,232,365,353 users

Running this job took 17.67 seconds using 2,500 mappers on the production cluster. Not bad considering we processed over 420 GB of data!

If you made it this far, then the world of sketching may be quite exciting for you. Check out our career page. We’re always looking for new data enthusiasts to join the fray. 

References

[1] AddThis. Streamlib. https://github.com/addthis/stream-lib/, 2013–2015.

[2] M. Curcio. Sketch of the Day: HyperLogLog – Cornerstone of a Big Data Infrastructure. Technical report, Neustar Research, October 2012.

[3] M. Durand and P. Flajolet. LogLog Counting of Large Cardinalities. In Lecture Notes in Computer Science, pages 605–617. Springer, September 2003.

[4] P. Flajolet, E. Fusy, O. Gandouet, and F. Meunier. HyperLogLog: the Analysis of a Near-optimal Cardinality Estimation Algorithm. In 2007 Conference on Analysis of Algorithms, pages 127–146. Discrete Mathematics and Theoretical Computer Science, June 2007.

[5] P. Flajolet and G. N. Martin. Probabilistic Counting Algorithms for Data Base Applications. Journal of Computer and System Sciences, 31(2):182–209, October 1985.

[6] S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in Practice: Algorithmic Engineering of a State of the Art Cardinality Estimation Algorithm. In EDBT ’13: Proceedings of the 16th International Conference on Extending Database Technology, pages 683–692. ACM, March 2013.

[7] L. Lo. We Have Reached an Unprecedented Programmatic Milestone. Technical report, Turn Inc., May 2016.

[8] K. Whang, B. Vander-zanden, and H. Taylor. A Linear-Time Probabilistic Counting Algorithm for Database Applications. ACM Transactions on Database Systems, 15(2):208–229, June 1990.

[9] F. Yang. Fast, Cheap, and 98% Right: Cardinality Estimation for Big Data. Technical report, Druid, May 2012.

 

About Amobee

Founded in 2005, Amobee is an advertising platform that understands how people consume content. Our goal is to optimize outcomes for advertisers and media companies, while providing a better consumer experience. Through our platform, we help customers further their audience development, optimize their cross channel performance across all TV, connected TV, and digital media, and drive new customer growth through detailed analytics and reporting. Amobee is a wholly owned subsidiary of Singtel, one of the largest communications technology companies in the world.

If you’re curious to learn more, watch the on-demand demo or take a deep dive into our Research & Insights section where you can find recent webinars on-demand, media plan insights & activation templates, and more data-driven content. If you’re ready to take the next step into a sustainable, consumer-first advertising future, contact us today.

Read Next

All Blog Posts
Perspectives

Three Keys to Success in Data-Driven Marketing

Globally, 3.2 billion people use the Internet. How can marketers cut through the noise in a fragmented media landscape? A smart data strategy matters.

August 17, 2016

Perspectives

The Keys to Bridging the Programmatic Data Divide

It should be an unbeatable combination: provide creatives with the data they need, and they’ll come up with mind-blowingly effective ads targeted with pinpoint precision. Unfortunately for brands, Turn’s "Mind the Gap! 2016 UK Agency Survey Report" released today suggests it’s not that simple.

July 25, 2016

Perspectives

Why You Need a Data Strategy

While brands have now heard loud and clear that they need to start using data in their marketing, many organizations often struggle to get off the ground because they don’t have a real data strategy.

November 9, 2016