Today’s installment is about programming with Spark’s most fundamental construct, the Resilient Distributed Dataset (RDD). But before we get into that I’m going to do something I said at the outset I wouldn’t, namely write about installing Spark. I decided to discuss it briefly because so many folks get confused by the simplicity.
Yeah, I know: Go figure.
I want to get to writing some Spark code ASAP, and having a working local install is a prerequisite, so let’s keep this simple and quick. Here are the steps… both of them:
- Download the latest package here. The defaults are all fine 99% of the time for a local install.
- Unpack it into a folder.
That’s it. You just run Spark itself from that folder and you’re good to go. The cool thing about this is you can move it wherever you want later, replace it, or just download new versions and mess with both new and old as you desire. Here’s the key to that bit of wizardry: always make the folder where you unpacked a version the current folder on the command line. Then just start Spark with this:
> ./sbin/start-master.sh &
It really is that simple.
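Put together, a full local setup session might look like the sketch below. The version number in the file and folder names is purely illustrative; substitute whatever release you actually downloaded.

```shell
# Example session: unpack a downloaded Spark release and run a local master.
# The version number is illustrative -- use the one you downloaded.
tar -xzf spark-3.5.1-bin-hadoop3.tgz
cd spark-3.5.1-bin-hadoop3     # make the unpacked folder the current directory

# Launch the standalone master in the background.
./sbin/start-master.sh &

# When you're done, shut it down the same way.
./sbin/stop-master.sh
```

While the master is running, you can also point a browser at its web UI (http://localhost:8080 by default) to confirm it came up.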
Here’s a Pro Tip for my fellow Mac users: install Go2Shell. It will add a button to the Finder UI that you can push to open a terminal window with the current directory already set to the folder you’re looking at:
Push the Go2Shell button (circled in red above) and you get this:
Note that the current directory is the same folder we were looking at in the Finder. Handy, eh?
Where is R2D2, Obi-Wan?
I know that today’s headline doesn’t even rate as a bad pun… The Resilient Distributed Dataset is really more like RD^2 (R * D * D… think about it), but that doesn’t get me to a Star Wars reference. Every good programming article has a Star Wars reference, right?
With that obligatory reference out of the way, let’s move on to unpacking what an RDD is and how to use one.