Introduction
Before I describe my problem, I have put down certain terms that are relevant to it. All the definitions are essentially excerpts from Wikipedia.
What is BigData?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers.
What is Hadoop?
Hadoop is an open-source framework from the Apache Software Foundation. It emerged as a solution for storing as well as processing BigData. Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS).
What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of:
- a Map() procedure that performs filtering and sorting.
- a Reduce() procedure that performs a summary operation.
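To make these two steps concrete, here is a minimal, purely illustrative word-count sketch in C# using LINQ. It only mimics the map and reduce phases in memory on a single machine; it is not Hadoop code, and the sample documents are made up.

using System;
using System.Linq;

class MapReduceSketch
{
    static void Main()
    {
        string[] documents = { "big data is big", "hadoop stores big data" };

        // Map: emit a (word, 1) pair for every word in every document.
        var mapped = documents.SelectMany(doc => doc.Split(' '))
                              .Select(word => new { Key = word, Value = 1 });

        // Reduce: group the pairs by key and sum the values per word.
        var reduced = mapped.GroupBy(pair => pair.Key)
                            .Select(group => new { Word = group.Key, Count = group.Sum(p => p.Value) });

        foreach (var result in reduced)
            Console.WriteLine($"{result.Word}: {result.Count}");
    }
}

In a real MapReduce job, the mapped pairs would be shuffled across the cluster between the two phases; the sketch above only shows the shape of the computation.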
What is Hive?
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
What is HiveQL?
HiveQL is based on SQL, but does not strictly follow the full SQL-92 standard. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
What is my problem?
I was looking for a code snippet that could connect to Hadoop via HIVE using C#. The following discussion will help you connect to HIVE and work with the tables and data underneath. It will also give you a starting point for exploring Hadoop/HIVE via C#/.NET.
Background
I Googled everywhere in this regard but could gather only a few vague references from Stack Overflow and other sites. An added constraint was that I could not use Azure HDInsight.
Using the Code
To begin, you need to download the Microsoft® Hive ODBC Driver. The different parameters and the values they can take are explained in detail in Appendix C (Driver Configuration Options) of the driver documentation.
The following are the important parameters for building the ConnectionString; the rest can be set as required by one's application. A small sketch of assembling the connection string in code follows the parameter descriptions below.
- DRIVER={Microsoft Hive ODBC Driver}
- Host=server_name
- Port=10000
- Schema=default
- DefaultTable=table_name
DRIVER={Microsoft Hive ODBC Driver} is the name of the actual driver.
Host=server_name is the name of the server where Hadoop is running.
Port=10000 is the default port, but you can assign your own.
Schema=default is the default database. You can create your own.
DefaultTable=table_name is the name of a table in the HIVE system.
The function GetDataFromHive() connects to Hadoop/HIVE using the Microsoft® Hive ODBC Driver. The query SELECT * FROM table_name LIMIT 10 tells the database to return the first 10 records, the equivalent of TOP(10) in SQL Server.
private void GetDataFromHive()
{
    // Build the ODBC connection to Hadoop/HIVE using the Microsoft Hive ODBC Driver.
    var conn = new OdbcConnection
    {
        ConnectionString = @"DRIVER={Microsoft Hive ODBC Driver};
            Host=server_name;
            Port=10000;
            Schema=default;
            DefaultTable=table_name;
            HiveServerType=1;
            ApplySSPWithQueries=1;
            AsyncExecPollInterval=100;
            AuthMech=0;
            CAIssuedCertNamesMismatch=0;
            TrustedCerts=C:\Program Files\Microsoft Hive ODBC Driver\lib\cacerts.pem;"
    };

    try
    {
        conn.Open();

        // Pull the first 10 rows of table_name into a DataSet.
        var adp = new OdbcDataAdapter("SELECT * FROM table_name LIMIT 10", conn);
        var ds = new DataSet();
        adp.Fill(ds);

        // Walk the returned tables and rows.
        foreach (DataTable dataTable in ds.Tables)
        {
            foreach (DataRow dataRow in dataTable.Rows)
            {
                // Work with each dataRow here, e.g. read dataRow[0].
            }
        }
    }
    catch (Exception ex)
    {
        // Handle or log the exception as appropriate for your application.
        Console.WriteLine(ex.Message);
    }
    finally
    {
        conn.Close();
    }
}
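If you only need to stream the rows rather than load them into a DataSet, a lighter variant of the same query can go through OdbcDataReader. This is a sketch under the same assumptions: connectionString is the string shown above, and table_name is still a placeholder.

// Streaming alternative to the DataSet approach in GetDataFromHive().
using (var conn = new OdbcConnection(connectionString))
using (var cmd = new OdbcCommand("SELECT * FROM table_name LIMIT 10", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Print the first column of each row; adjust indexes or column names to your table.
            Console.WriteLine(reader[0]);
        }
    }
}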
Points of Interest
BigData is coming in a big way, as traditional relational databases such as SQL Server, Oracle, Sybase and others find it increasingly difficult to handle large volumes of data in varied (structured/document-style/unstructured, etc.) formats. In this regard, Hadoop is fast emerging as one of the solutions that big banks and other data-mining industries are embracing. This piece of code will help you talk to Hadoop and will accelerate your effort to solve the problem at hand.