Introduction
Apache Spark is designed to run on Linux in production environments. However, to learn Spark programming we can use a Windows machine. In this article I'll explain how to set up Spark in a few simple steps, and we'll also run our Hello World Spark program.
Background
Apache Spark is a fast and general-purpose cluster computing platform. Spark extends the popular MapReduce model to efficiently support more types of computations, including interactive queries and stream processing. You can find more information at http://spark.apache.org/ and https://en.wikipedia.org/wiki/Apache_Spark
Software required
Apache Spark is built using Scala and runs on the JVM. The latest Spark release, 2.0.2, runs on Java 1.7.
Step-1
So first we need to set up Java 1.7 if it isn't installed already. You can download it from http://www.oracle.com/technetwork/java/javase/downloads/java-archive-downloads-javase7-521261.html#jre-7u76-oth-JPR
You can use either the installer or the binaries. Once the Java setup is done, open your command prompt and check the Java version using the command "java -version". It'll display something like the output below.
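For example, with the 7u76 JRE the output looks roughly like this (the exact version and build numbers will vary with the update you installed):

java version "1.7.0_76"
Java(TM) SE Runtime Environment (build 1.7.0_76-b13)
Java HotSpot(TM) 64-Bit Server VM (build 24.76-b04, mixed mode)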
Step-2
Spark depends on winutils.exe, which is usually installed along with Hadoop. As we are not going to deploy Hadoop, we need to download this program separately and set up an environment variable for it.
Download winutils.exe from http://public-repo-1.hortonworks.com/hdp-win-alpha/winutils.exe
Create a folder called hadoop\bin wherever you want and place winutils.exe inside it. I chose c:\backup\hadoop\bin
Create an environment variable called HADOOP_HOME with the path c:\backup\hadoop
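If you prefer the command prompt over the System Properties dialog, the setx command persists a user environment variable (shown here with my folder; adjust the path to wherever you created hadoop\bin):

setx HADOOP_HOME "c:\backup\hadoop"

Note that setx only affects new command prompt windows, so open a fresh one afterwards.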
Step-3
Now download Apache Spark from http://spark.apache.org/downloads.html
Unzip it to your preferred location; the extracted folder looks like this
Update the "Path" environment variable with the Spark bin location - in my case it's C:\backup\spark-2.0.2-bin-hadoop2.7\bin
Test Spark
Spark comes with interactive shells to execute Spark APIs. The available shells are:
Spark-Shell --> Works with Scala APIs
PySpark --> Works with Python APIs
Open your command prompt, type spark-shell, and press Enter. You should see the Spark shell if all the configurations are set correctly.
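Among its startup messages the shell creates a SparkContext for you and exposes it as sc (Spark 2.x also exposes a SparkSession as spark); that sc is what the word count below uses. A quick way to confirm everything is wired up:

scala> sc.version --> Press Enter; it should print the version you downloaded, e.g. 2.0.2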
Congrats! You have successfully set up Spark on Windows. Now let's try the Hadoop hello world program, which is a simple word count program :). If you know how to write it using Java MapReduce, Hive SQL, or a Pig script, then you'll really appreciate Spark, where we can achieve the same thing using a few simple APIs.
A. Make sure that you have a sample text file whose words you want to count. Assume it's at c:\temp\test.txt
B. Let's write the Spark program for hello world
scala> val file = sc.textFile("c:\\temp\\test.txt") --> Press Enter
scala> val words = file.flatMap(line=>line.split(" ")).map(word=>(word,1)).reduceByKey(_+_) --> Press Enter
scala> words.collect --> Press Enter
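The same word count also works as a standalone Scala application instead of shell input. Below is a minimal sketch, assuming the same c:\temp\test.txt input and a build that includes the spark-core 2.0.2 dependency; outside the shell we have to create the SparkContext ourselves rather than getting sc for free:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    // local[*] runs Spark inside this JVM using all available cores
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)

    val file = sc.textFile("c:\\temp\\test.txt")       // read the input as an RDD of lines
    val words = file.flatMap(line => line.split(" "))  // split each line into words
      .map(word => (word, 1))                          // pair each word with a count of 1
      .reduceByKey(_ + _)                              // add up the counts per word

    words.collect().foreach(println)                   // print the (word, count) pairs
    sc.stop()
  }
}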
History