Introduction
Big Data is not a fad. We are living through a revolution that touches every industry, business and life on this planet. With millions of tweets, iMessages, live streams, Facebook and Instagram posts… terabytes and petabytes of data are generated every second. Extracting meaningful insight from this data is a real challenge, because traditional databases and data warehouses cannot handle the processing demands of Big Data sets that must be updated frequently, or even in real time, as in the case of stocks, application performance monitoring or users' online activities. In response to the growing demand for Big Data analytics tools and technologies, many organizations have turned to NoSQL databases and Hadoop, along with companion analytics tools including, but not limited to, YARN, MapReduce, Spark, Hive and Kafka.
All these tools and frameworks make up a huge Big Data ecosystem and cannot be covered in a single article. My focus here is to give you a gentle introduction to Apache Spark and, above all, to the .NET library for Apache Spark, which brings Spark tooling into the .NET ecosystem.
We will be covering the following topics:
- What is Apache Spark?
- Apache Spark for .NET
- Architecture
- Configuring and testing Apache Spark on Windows
- Writing and executing your first Apache Spark program
What is Apache Spark?
Apache Spark is a general-purpose, fast, scalable analytical engine that processes large-scale data in a distributed way. It exposes a common interface for multiple languages, such as Python, Java, Scala, SQL, R and now .NET, which means the execution engine does not care which language you write your code in.
Why Apache Spark?
Beyond its ease of use, the following advantages make Spark stand out among other analytical tools.
In-Memory Processing
Apache Spark uses in-memory processing, which means no time is spent moving data or intermediate results to and from disk, and that makes it fast.
Efficient
Apache Spark is efficient because it caches most of the input data in memory using the Resilient Distributed Dataset (RDD). The RDD is Spark's fundamental data structure and manages both the transformation and the distributed processing of data. Each dataset in an RDD is logically partitioned, and each partition can then be computed on a different cluster node.
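As a rough illustration of these ideas with the DataFrame API (which is built on top of RDDs), and assuming a DataFrame named df such as the one we build later in this article, caching and repartitioning look like this:
DataFrame cached = df.Cache();            // keep the data in memory across subsequent actions
DataFrame spread = cached.Repartition(8); // redistribute the data into 8 partitions that can be processed in parallel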
Real-Time Processing
Apache Spark supports not only batch processing but also stream processing, which means data can be read and written in real time.
In addition, the Apache Spark APIs are readable and easy to understand. Spark also uses lazy evaluation, which contributes to its efficiency: transformations are only computed when an action actually needs their result. Moreover, there is a rich and ever-growing developer community that is constantly contributing to and evaluating the technology.
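Here is a minimal sketch of lazy evaluation, again assuming a DataFrame named df: the transformation below is only recorded, and nothing is computed until an action such as Count() asks for a result.
DataFrame nearBay = df.Where("ocean_proximity = 'NEAR BAY'"); // transformation: recorded, not executed
long count = nearBay.Count();                                 // action: triggers the actual computation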
Apache Spark for .NET
Until the beginning of this year, .NET developers were locked out of big data processing due to the lack of .NET support. On April 24th, 2019, Microsoft unveiled the project called .NET for Apache Spark.
.NET for Apache Spark makes Apache Spark accessible to .NET developers. It provides high-performance .NET APIs with which you can access all aspects of Apache Spark and bring Spark functionality into your apps, without having to translate your business logic from .NET to Python/Scala/Java just for the sake of data analysis.
Apache Spark Ecosystem
Spark consists of various libraries, APIs and databases and provides a whole ecosystem that can handle all sorts of data processing and analysis needs of a team or a company. Following are a few things you can do with Apache Spark.
All these modules and libraries stand on top of the Apache Spark Core API. Spark Core is the building block of Spark and is responsible for memory management, job scheduling, and building and manipulating data in RDDs, among other things.
Now that we have some understanding of what Apache Spark is and what it can do for us, let's take a look at its architecture.
Architecture
Apache Spark follows a driver-executor model. The following figure makes the idea clearer.
Each Spark application consists of a driver and a set of workers, or executors, managed by a cluster manager. The driver hosts the user's program and the Spark session. Essentially, the Spark session takes the user's program and divides it into smaller tasks, which are distributed among the workers or executors. Each executor takes one of those smaller tasks and executes it. The cluster manager oversees the overall execution of the program, helping to divide up the tasks and to allocate resources between the driver and the executors.
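For example, when an application is submitted with the spark-submit tool (which we'll use later in this article), the --master option tells Spark which cluster manager, if any, to hand execution to. These are just illustrative forms; the exact master URL depends on your environment:
spark-submit --master local[4] ...
spark-submit --master spark://<host>:7077 ...
The first form runs the driver and four executor threads on a single machine, while the second delegates resource allocation to a standalone cluster manager.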
Without going any further into the theoretical details of how Spark works, let's get our hands dirty and configure and test Spark on our local machine to see how things work.
Setting Up the Environment
The .NET implementation of Apache Spark still uses the Java VM; there isn't a separate, standalone .NET Spark engine. Instead, it sits on top of the Java runtime. Here's what you're going to need to run .NET for Apache Spark on your Windows machine.
- Java Runtime Environment
(It is recommended that you download and install the 64-bit JRE, since the 32-bit version is too limited for Spark.)
- Apache Spark
(The .NET implementation supports both Spark 2.3 and Spark 2.4. I'll be proceeding with Spark 2.4. Once you've chosen the Spark version from the given link, select the package pre-built for Apache Hadoop 2.7 or later and download the tgz. Once it is downloaded, extract it to a known location.)
- Hadoop winutils.exe
Once the download is complete, put the winutils.exe file in a folder called bin, itself placed inside a folder at a known location (for example, c:\hadoop\bin).
Configuring Environment Variables
Before testing Spark, we need to create a few environment variables: SPARK_HOME, HADOOP_HOME and JAVA_HOME. You can either add these environment variables to your system manually, or you can run the following script to set them.
$ SET SPARK_HOME=c:\spark-2.4.1-bin-hadoop2.7
$ SET HADOOP_HOME=c:\hadoop
$ SET JAVA_HOME=C:\Program Files\Java\jre1.8.0_231
$ SET PATH=%SPARK_HOME%\bin;%HADOOP_HOME%\bin;%JAVA_HOME%\bin;%PATH%
Note that you're supposed to provide the locations of your extracted Spark directory, your winutils.exe folder and your JRE installation. The above script sets the environment variables for you and also adds the bin folder of each to the PATH environment variable.
To confirm that everything is set up successfully, check whether the JRE and the Spark shell are available by running the following commands.
$ java -version
$ spark-shell
If you've set up all the environment variables correctly, you should get output similar to the following.
The Spark shell allows you to run Scala commands against Spark and to experiment with data by reading and processing files.
Note: You can exit Spark-shell by typing :q.
We’ve successfully configured our environment for .NET for Apache Spark. Now we’re ready to create our .NET application for Apache Spark.
Let’s Get Started…
For the sake of this post, I'll be creating a .NET Core console application using Visual Studio 2019. Please note that you can also create a .NET Framework application.
Once Visual Studio has finished creating the template, we need to add the Microsoft.Spark NuGet package.
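If you prefer the command line over Visual Studio, the same project can be created and the package added with the .NET CLI (the project folder name below is just an example; the package name is Microsoft.Spark):
dotnet new console -o CaliforniaHousing
cd CaliforniaHousing
dotnet add package Microsoft.Spark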
After the NuGet package is added to the project, you'll see two jar files added to the solution. Now we're in a position to initialize a Spark session in the program.
using Microsoft.Spark.Sql;

namespace California_Housing
{
    class Program
    {
        static void Main(string[] args)
        {
            // Create a Spark session, or reuse one if it already exists
            SparkSession Spark = SparkSession
                .Builder()
                .GetOrCreate();

            // Read the CSV file and let Spark infer the column types
            DataFrame df = Spark
                .Read()
                .Option("inferSchema", true)
                .Csv("housing.csv");

            // Assign meaningful column names
            df = df.ToDF("longitude", "latitude", "housing_median_age",
                "total_rooms", "total_bedrooms", "population", "households",
                "median_income", "median_house_value", "ocean_proximity");

            df.PrintSchema(); // print the inferred schema
            df.Show();        // display the first 20 rows
        }
    }
}
The above code creates a new SparkSession, or gets an existing one if it has already been created. The retrieved instance provides a single entry point with all the necessary APIs to interact with the underlying Spark functionality, and it enables the communication between your .NET code and the Spark runtime.
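If you want more control over the session, the builder also lets you set an application name and Spark configuration values before GetOrCreate() is called. A minimal sketch (the application name and the configuration key shown here are just examples):
SparkSession Spark = SparkSession
    .Builder()
    .AppName("CaliforniaHousing")                // example application name
    .Config("spark.sql.shuffle.partitions", "8") // example Spark configuration setting
    .GetOrCreate();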
The next step is to load the data that will be used by the application. Here, I'm using the California Housing data, housing.csv. Spark.Read() lets the Spark session read from the CSV file. The data is loaded into a DataFrame, and the column types are inferred automatically. Once the file is read, the schema is printed and the first 20 records are shown.
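As a side note, if your copy of housing.csv has a header row, you can let Spark pick up the column names directly instead of renaming them with ToDF(). A minimal sketch:
DataFrame dfWithHeader = Spark
    .Read()
    .Option("header", true)      // treat the first row as column names
    .Option("inferSchema", true) // infer column types from the data
    .Csv("housing.csv");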
The program is pretty simple. Build the solution, try to run it, and see what happens.
You'll notice that you cannot simply run this program from inside Visual Studio. Instead, we first need to launch Spark so that it can load the .NET driver and execute the program. Apache Spark provides the spark-submit tool to submit and execute the .NET Core code. Take a look at the following command:
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner
--master local microsoft-spark-2.4.x-0.2.0.jar dotnet <compiled_dll_filename>
Note that we need to provide the compiled DLL file name as a parameter to execute our program. Navigate to the project solution directory, e.g., C:\Users\Mehreen\Desktop\California Housing, and run the following command to execute your program.
spark-submit --class org.apache.spark.deploy.dotnet.DotnetRunner
--master local "bin\Debug\netcoreapp3.0\microsoft-spark-2.4.x-0.2.0.jar"
dotnet "bin\Debug\netcoreapp3.0\California Housing.dll"
You'll see a lot of Java IO exceptions, which can be safely ignored at this stage, or you can silence them. The Spark folder contains a conf directory; append the following lines to the end of the log4j.properties file under that directory to suppress these exceptions.
log4j.logger.org.apache.spark.util.ShutdownHookManager=OFF
log4j.logger.org.apache.spark.SparkEnv=ERROR
Let's move to the interesting part and take a look at the output of PrintSchema(), which shows the columns of our CSV file along with their data types, and at the rows displayed by the Show() method.
What’s Going on Under the Hood?
When creating the application, I mentioned that both .NET Core and the .NET Framework can be used to create a Spark program. Why is that? And what exactly is happening with our .NET Spark code? To answer those questions, take a look at the following image.
Do you remember the jar files that were added to the solution when we added the Microsoft.Spark NuGet package? The NuGet package adds the .NET driver to the .NET program and ships the .NET library as well as the two jar files you saw. The .NET driver is compiled against .NET Standard, so it doesn't matter much whether you're using .NET Core or the .NET Framework, while the two jar files are used to communicate with the underlying native Scala APIs of Apache Spark.
What Else Can Be Done…
Now that we know how things work under the hood, let’s make some tweaks to our code and see what else we can do.
Dropping Unnecessary Columns
Since we're dealing with a huge amount of data, there may be columns we don't need. We can simply drop those columns using the Drop() function.
var CleanedDF = df.Drop("longitude", "latitude");
CleanedDF.Show();
Data Transformation
Apache Spark allows you to filter data on column values. For instance, we might only be interested in the properties near the Bay Area. We can use the following code to filter the data down to properties in that region.
var FilteredDF = CleanedDF.Where("ocean_proximity = 'NEAR BAY'");
FilteredDF.Show();
Console.WriteLine($"There are {FilteredDF.Count()} properties near Bay Area");
We can also aggregate a column using the Select() method, for example, to get the total population of the area.
var sumDF = CleanedDF.Select(Functions.Sum(CleanedDF.Col("population")));
var sum = sumDF.Collect().FirstOrDefault().GetAs<int>(0);
Console.WriteLine($"Total population is: {sum}");
The above code sums the values in the population column and returns the total.
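The same pattern works with other aggregate functions. For example, a minimal sketch (using the same CleanedDF) that computes the average house value:
var avgDF = CleanedDF.Select(Functions.Avg(CleanedDF.Col("median_house_value")));
var avg = avgDF.Collect().FirstOrDefault().GetAs<double>(0);
Console.WriteLine($"Average house value is: {avg}");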
Let's look at another example using the Select() and Filter() methods, where we are interested in values that fall within a specific range.
var SelectedDF = CleanedDF.Select("median_income", "median_house_value")
    .Filter(CleanedDF["median_income"].Between(6.5000, 6.6000) &
            CleanedDF["median_house_value"].Between(250000, 300000));
SelectedDF.Show();
The above code outputs only the entries with a median_income between 6.5 and 6.6 and a median_house_value between 250,000 and 300,000.
What’s Next?
This article was meant to give you a quick introduction and getting-started guide for .NET for Apache Spark. It's easy to see that the .NET implementation brings the full power of Apache Spark to .NET developers. Moreover, you can write cross-platform programs using .NET for Apache Spark. Microsoft is investing heavily in the .NET ecosystem, and the .NET implementation of Apache Spark can also be used with ML.NET to perform a range of complex machine learning tasks. Feel free to experiment on your own.
History
- 1st December, 2019: Initial version