Probabilistic Data Structures and Algorithms
for Big Data Applications
Probabilistic Data Structures and Algorithms for Big Data Applications is a computer science book about various special data structures that are extremely useful in data stream processing and massive data analysis, extensively developed topics nowadays.
This is my first real book and it is going to be self-published. The current estimation for the first full draft copy is Jan 1, 2018 and the expected release date is in Fall 2018.
What are probabilistic data structures?
Probabilistic data structures is a common name of data structures based mostly on different hashing techniques. Unlike regular (or deterministic) data structures, they always provide approximated answers, but with reliable ways to estimate possible errors. Fortunately, the potential losses or errors are fully compensated by extremely low memory requirements, constant query time and scaling that become important for big data applications.
With such data structures and algorithms you can easy solve problems of "remembering" huge amount of data, estimating number of unique elements, finding similar documents, etc.
You might already have heard about Bloom filter, HyperLogLog or MinHash and many others that are used in modern popular applications like Elasticsearch, Cassandra, etc. or just behind Google News ... These are probabilistic data structures.
About this book
The purpose of this book is to introduce probabilistic data structures and algorithms to technology practitioners, including software architects and developers, as well as technology decision makers. Reading this book, you will get a theoretical and practical understanding of probabilistic data structures and learn about their common use cases.
This is not a book for scientists, but you will gain most from it if you have an understanding of the general theory of data structures and algorithms.
Organization of the book
Chapter 1 provides a brief overview of popular hash functions that are widely used in probabilistic data structures. Chapter 2 is devoted to the most known use case of such structures - approximate membership queries (AMQ). In Chapter 3 data structures that help to estimate the number of unique elements are discussed. Chapters 4 and 5 are dedicated to important frequency and rank-related metrics calculation in streaming applications. Chapter 6 consists of data structures and algorithms to solve near-duplicate detection, nearest neighbor search, and clustering problems.
If you are interested in a Python implementation of many data structures and algorithms from this book, please check out our open-source Python library PDSA: https://github.com/gakhov/pdsa