Books for Big Data and AI. How to start and where to go for software engineers.

After I published a book that explains the algorithms behind Big Data applications, I started getting questions and requests for advice from people who want to extend their expertise to Big Data and AI problems. Where to start? Which books to read? Since many software engineers are thinking of shifting to these hot domains, I found this topic very interesting and will try to explain my thoughts in an article so that others can learn from it as well. First of all, I would like to separate AI from Big Data. I have noticed that many people equate these two concepts. Let's figure out what each of them is responsible for.

The term Big Data usually characterizes data that is enormous in many dimensions, such as Volume, Velocity, Variety, and so on. Thus, Big Data processing focuses on how to treat such data: how to store it when it's huge, how to handle it when it arrives at high speed, and how to efficiently merge different formats and perform queries. I consider Big Data positions to be more engineering roles.

A joke: Big data is the data that crashes your MS Excel.

On the other hand, AI (artificial intelligence) focuses on how to learn from the data we already have, possibly by simulating the processes of human intelligence. Its applications include machine learning, deep learning, and others. AI is impossible without data, and the more data we have, the better the chance that we will learn something, but these are two different concepts. I consider AI positions to be more scientific roles (although there are purely technical positions that develop the systems supporting this activity) that require additional knowledge from mathematics, physics, and other applied domains.

Of course, it's also possible to mix processing the data and learning from it, but I don't consider that a good learning path at this point.

Big Data path

First, everybody has to start by learning what Big Data is and why handling it is such a hard task. For this step, you can check Big Data: A Revolution That Will Transform How We Live, Work, and Think by Viktor Mayer-Schönberger and Kenneth Cukier. However, you can easily skip it, since this information is repeated as the motivational introduction in every book related to Big Data.

The next step is to learn how people deal with all these problems in application design. For anyone curious about the architecture of Big Data applications, take a look at Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems by Martin Kleppmann.

One of the most important concepts in practical Big Data applications is the MapReduce programming model, which makes it possible to distribute the processing of large data sets across clusters of computers. This is a must-know, and I recommend reading MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems by Donald Miner and Adam Shook, which is a Big Data engineer's handbook. MapReduce is implemented in many distributed frameworks, and the best known is Apache Hadoop. I consider the Hadoop ecosystem very important but not mandatory to know, because it focuses on very specific problems that you might not want to focus on right now. If you decide to do so, take a look at the book Hadoop: The Definitive Guide: Storage and Analysis at Internet Scale by Tom White.
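To make the MapReduce model more concrete, here is a minimal single-process sketch of the classic word count example in plain Python. It is not Hadoop code: the function names (map_phase, shuffle_phase, reduce_phase, word_count) are made up for illustration, and a real framework would run the map and reduce steps in parallel on many machines and handle the shuffle over the network.

```python
from collections import defaultdict
from itertools import chain


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group emitted values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Combine all values that share one key into a single result."""
    return key, sum(values)


def word_count(documents):
    # Each document could be mapped on a different machine; here we do it in a loop.
    mapped = chain.from_iterable(map_phase(doc) for doc in documents)
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


counts = word_count(["big data is big", "data is everywhere"])
print(counts)
```

The point of the pattern is that map_phase and reduce_phase only ever see one document or one key at a time, which is what allows the work to be spread across a cluster.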

Stream handling is another aspect of data processing, and nowadays I consider it a must for everyone, especially with the rise of IoT devices. MapReduce is also important here, and there are well-thought-out and well-developed frameworks, such as Apache Storm, Apache Spark, Apache Flink, and others. However, the architecture of streaming applications is different from that of regular ones, so I highly recommend reading the book Big Data: Principles and best practices of scalable realtime data systems by Nathan Marz and James Warren. Marz was the author of Twitter Storm, which later became Apache Storm.
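As a rough illustration of how stream processing differs from batch processing, the sketch below counts events per fixed-size (tumbling) time window while reading the stream one event at a time. It is plain Python, not the API of Storm, Spark, or Flink. The function name and the sample events are hypothetical, and it assumes that events arrive ordered by timestamp, which real streams often do not guarantee.

```python
from collections import Counter


def tumbling_window_counts(events, window_seconds):
    """Yield (window_start, Counter) per window; assumes timestamp-ordered events."""
    current_start = None
    counts = Counter()
    for timestamp, key in events:
        window_start = timestamp - (timestamp % window_seconds)
        if current_start is None:
            current_start = window_start
        if window_start != current_start:
            # The previous window is complete: emit it and start a fresh one.
            yield current_start, counts
            current_start, counts = window_start, Counter()
        counts[key] += 1
    if current_start is not None:
        # Emit the last, possibly incomplete, window when the stream ends.
        yield current_start, counts


# Hypothetical (timestamp_in_seconds, device_id) events.
events = [(0, "sensor-a"), (3, "sensor-b"), (7, "sensor-a"), (12, "sensor-a"), (14, "sensor-b")]
windows = list(tumbling_window_counts(events, window_seconds=10))
print(windows)
```

Only the state of the current window is kept in memory, so the same code works whether the stream has five events or never ends.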

The distributed processing of Big Data mentioned above (in both the non-streaming and streaming cases) is mostly based on classical algorithms, but uses special patterns that help to overcome Big Data problems by scaling (adding more computers or getting bigger servers). However, there are many applications where this approach is no longer feasible because of the cost of such scaling and the complexity of the classical algorithms, which require at least linear time and, in many cases, quadratic memory (which you can no longer tolerate with such an amount of data). In this case, you need to use specially designed space-efficient data structures and fast probabilistic algorithms. You can find them in my book Probabilistic Data Structures and Algorithms for Big Data Applications, including examples of cases when they are useful and why.
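One well-known structure of this kind is the Bloom filter, which answers "have I seen this item?" using a fixed number of bits, at the price of occasional false positives (it never gives false negatives). The sketch below is a minimal version for illustration only: the class name, the default sizes, and the choice of deriving hash positions from seeded SHA-256 digests are assumptions, not a tuned implementation.

```python
import hashlib


class BloomFilter:
    """A minimal Bloom filter: fixed memory, no false negatives, some false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size_bits = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)  # packed bit array

    def _positions(self, item):
        # Derive num_hashes bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        # True means "possibly present"; False means "definitely absent".
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


seen = BloomFilter()
seen.add("apple")
seen.add("banana")
print(seen.might_contain("apple"))
```

The memory used stays at size_bits no matter how many items are added; what grows instead is the false positive rate, which is the trade-off that makes such structures attractive for very large data sets.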

ML path

Many times, when mentioning AI, engineers mean machine learning and/or deep learning. Approaching this, assuming you want to build models that learn from data yourself, requires more fundamental steps, but any average software engineer can manage it in the end, and this is where you will touch the wonderful world of science.

Linear algebra and various optimization techniques play the most important role, so you need to refresh your knowledge of mathematics. For software engineers, I would recommend Coding the Matrix: Linear Algebra through Computer Science Applications by Philip N. Klein.
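To give a feeling for why optimization matters here, the sketch below fits a straight line y = w * x + b to a few points by gradient descent on the mean squared error. It is a toy example in plain Python. The function name, the learning rate, the number of steps, and the data are all chosen for illustration, and real projects would normally rely on a numerical library instead.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Prediction errors for the current parameters.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error with respect to w and b.
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step against the gradient.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Toy data generated from y = 2x + 1, so the fit should recover w near 2 and b near 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)
```

The same loop, written with vectors and matrices instead of single numbers, is the core of how many machine learning models are trained, which is why linear algebra and optimization come first on this path.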

For engineers interested in machine learning, I recommend one of the best-written books in this area: Understanding Machine Learning: From Theory to Algorithms by Shai Shalev-Shwartz and Shai Ben-David.

Deep learning is a very interesting topic in this area that covers neural networks of various types, such as multilayer perceptrons, recurrent neural networks, convolutional neural networks, and autoencoders. As the starting point, I would recommend reading Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville. This book also provides all the necessary math, but I still advise refreshing your knowledge beforehand.

As soon as you feel ready, I highly recommend The Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani, and Jerome Friedman, which is also the main handbook for any machine learning engineer. However, it is quite hard to work through without prior preparation and a refresh of your math knowledge.

After these steps, you will feel very confident in these topics and can apply your knowledge to whatever domain problems you like.

Hope this helps.