Apache Spark for the Impatient: A Pre-Announcement

One of the books that helped me hit the ground running at my new job with BluePrint Consulting Services was Cay Horstmann’s Scala for the Impatient. For me, Horstmann’s book was an easy, blissfully short read that got me to where I could write simple Scala code very quickly. As a JVM-based language, Scala is familiar territory to most Java programmers, but that familiarity can be deceptive: It is also different enough that you can feel like you’re drowning if you dive in without a lifejacket. Scala for the Impatient was just what I needed right when I needed it.

In the same vein, I’m starting my own parallel effort, Apache Spark for the Impatient. Yes, it is a totally derivative idea, hence my credits to Mr. Horstmann. Maybe he’ll forgive me if I refer readers to his book and drive some sales his way. Nevertheless, a quick-start book of this kind is sorely needed for Spark. Spark is exploding into the mainstream with such force that many developers are finding themselves needing to maintain Spark clusters and extend existing code ASAP, if not sooner.

That definitely describes my situation, so I can speak from recent experience when it comes to what I needed to know to scrape by.

So let’s begin with the plan. I consider Apache Spark for the Impatient to be an 80/20 solution: For 20% of the time invested, you’ll get 80% of the rewards. This book will not make you an expert–but it can buy you time until you are.

This is going to start as a series of blog posts, and I invite feedback. If you take the time to read this and make a constructive comment, I’ll add you to the credits. Bribery works, right? Seriously, I covet your help here; it will make the book better.

Here are a few things I am NOT planning to put in the book:

  1. Why to use Spark… nobody needs another white-paper-like summary of use cases, nor do we need another explanation of why Spark can be up to 100x faster than Hadoop MapReduce.
  2. Install instructions… there are many, and Hortonworks even has a virtual appliance with their Spark Sandbox available.
  3. Scala or Python instruction… Again, Scala for the Impatient is the best quick start I’ve found for Scala, and Python Programming: Your Step By Step Guide To Easily Learn Python in 7 Days has good reviews on Amazon (neither of those is an affiliate link).

Here is the still-rough outline of what I’d like to put into the book:

  • Introduction: The Zen of Spark: What is it, exactly?
  • Writing Spark Jobs
    1. Batch Mode
    2. Streaming
    3. Data Files
  • Running Spark Standalone vs. Clustered
    1. What is Where and Why
    2. The Resource Managers Compared
  • Spark Metrics and Logging
  • Troubleshooting Spark Jobs
  • Spark and State
  • Spark ML Briefly — Just enough to know it is there and what it can do
  • Spark GraphX Briefly — Just enough to know it is there and what it can do

On to the Introduction!