Book review: Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis by Mohammed Guller

I have been reading and reviewing a number of excellent books for the Data Science for IoT course and also my Oxford University course.

Big Data Analytics with Spark by Mohammed Guller is for data scientists, business analysts, data architects, and data analysts looking for a better and faster tool for large-scale data analysis. It is also for software engineers and developers building Big Data products. The book covers a subject that I have been focusing on through my teaching and research. It provides a step-by-step guide for learning how to use Spark for different types of big-data analytics projects, including batch, interactive, graph, and stream data analysis as well as machine learning. The book covers Spark core and its add-on libraries, including Spark SQL, Spark Streaming, GraphX, MLlib, and Spark ML.

My analysis: The book covers MLlib, Scala, Spark and analytics in detail, yet it remains readable. It also includes code for all of these sections. The only recommendations I would make are a better index and releasing the code on GitHub. However, the PDF of the book can be bought for an extra $5 (so you can copy and paste the code if you need it).

I see the book as comprising three sections:

a) The main theme of the book, i.e. Big Data Analytics with Spark
b) The first five chapters leading up to the theme
c) The last three chapters on Spark deployment

The main theme of the book, i.e. Big Data Analytics with Spark

  • Chapter 6: Spark Streaming (23 pages): Introduces Spark Streaming and shows an example app. Covers an introduction to Spark Streaming, how Spark Streaming works, and a Spark Streaming example app.
  • Chapter 7: Spark SQL and DataFrames (50 pages): Introduces Spark SQL along with a few examples.
  • Chapter 8: MLlib and Spark ML (50 pages): Introduces machine learning and MLlib along with a few examples. Covers an introduction to machine learning, linear regression, logistic regression, classification, clustering, recommender systems, building a machine learning application with MLlib, and MLBase.
  • Chapter 9: GraphX (23 pages): Introduces graph analysis and GraphX along with a few examples.

The first five chapters leading up to the theme

  • Chapter 1: Big Data Technology Landscape: Cluster computing (Hadoop MapReduce, HDFS, Hive), data serialization (Avro, Protocol Buffers), columnar storage (Parquet), messaging systems (Kafka, ZeroMQ), NoSQL databases (HBase, Cassandra), and distributed SQL query engines (Apache Drill, Impala, PrestoDB).
  • Chapter 2: Functional Programming in Scala (30 pages): Introduces Scala so that readers can understand and write Spark applications in Scala, which is the primary language supported by Spark. Covers key functional programming concepts, basic Scala constructs, the Scala shell, etc.
  • Chapter 3: Spark’s Essentials (35 pages): Introduces Spark fundamentals and key concepts: what Spark is, why Spark is hot, why Spark is faster than Hadoop MapReduce, and Resilient Distributed Datasets (RDD).
  • Chapter 4: Spark Shell (10 pages): Introduces the Spark shell and shows how it can be used for interactive data analysis. Covers an introduction to the Spark shell and interactive data analysis in spark-shell.
  • Chapter 5: A Stand-alone Spark Application (10 pages): Provides step-by-step directions for writing and running a Spark application. Covers the basic structure of a stand-alone Spark application and compiling a Spark application.

The last three chapters (Deployment Chapters)

  • Chapter 10: Deploying Spark: A walkthrough of Spark deployment with different cluster management technologies such as YARN and Mesos, and with services like AWS (EC2).
  • Chapter 11: Monitoring a Spark Cluster (20 pages)

Overall, I very much recommend Big Data Analytics with Spark: A Practitioner’s Guide to Using Spark for Large Scale Data Analysis by Mohammed Guller. I also plan to use this book in the Data Science for IoT course and in my Oxford University course, which I will teach later in the year.