Introduction
Before I describe my problem, I have put down certain terms that are relevant to it. All the definitions are essentially excerpts from Wikipedia.
What is BigData?
Big data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing applications. The challenges include capture, curation, storage, search, sharing, transfer, analysis and visualization. Big data is difficult to work with using most relational database management systems and desktop statistics and visualization packages, requiring instead massively parallel software running on tens, hundreds, or even thousands of servers.
What is Hadoop?
Hadoop is an open-source framework from the Apache Software Foundation. It emerged as a solution for storing as well as processing BigData. Hadoop consists of the Hadoop Common package, which provides filesystem and OS level abstractions, a MapReduce engine and the Hadoop Distributed File System (HDFS).
What is MapReduce?
MapReduce is a programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of:
- a Map() procedure that performs filtering and sorting.
- a Reduce() procedure that performs a summary operation.
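To make these two steps concrete, here is a minimal, purely illustrative word-count sketch in C# using LINQ. It only mimics the map and reduce phases in memory on a single machine; it is not Hadoop code, and the sample documents are made up.

using System;
using System.Linq;

class MapReduceSketch
{
    static void Main()
    {
        string[] documents = { "big data is big", "hadoop stores big data" };

        // Map: emit a (word, 1) pair for every word in every document.
        var mapped = documents.SelectMany(doc => doc.Split(' '))
                              .Select(word => new { Key = word, Value = 1 });

        // Reduce: group the pairs by key and sum the values per word.
        var reduced = mapped.GroupBy(pair => pair.Key)
                            .Select(group => new { Word = group.Key, Count = group.Sum(p => p.Value) });

        foreach (var result in reduced)
            Console.WriteLine($"{result.Word}: {result.Count}");
    }
}

In a real MapReduce job, the mapped pairs would be shuffled across the cluster between the two phases; the sketch above only shows the shape of the computation.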
What is Hive?
Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis.
What is HiveQL?
HiveQL is based on SQL, but does not strictly follow the full SQL-92 standard. Internally, a compiler translates HiveQL statements into a directed acyclic graph of MapReduce jobs, which are submitted to Hadoop for execution.
What is my problem?
I was looking for a code snippet that could connect to Hadoop via HIVE using C#. The following discussion will help you connect to HIVE and work with the tables and data underneath. It will also give you a starting point for exploring Hadoop/HIVE via C#/.NET.
Background
I Googled everywhere in this regard but could gather only a few vague references from Stack Overflow and other sites. An added constraint was that I could not use Azure HDInsight.
Using the Code
To begin, you need to download the Microsoft® Hive ODBC Driver. The different parameters and the values they can take are explained in detail in Appendix C (Driver Configuration Options) of the driver documentation.
The following are the important parameters for building the ConnectionString; the rest can be set as required by one's application. A small sketch of assembling the connection string in code follows the parameter descriptions below.
- DRIVER={Microsoft Hive ODBC Driver}
- Host=server_name
- Port=10000
- Schema=default
- DefaultTable=table_name
DRIVER={Microsoft Hive ODBC Driver} is the name of the actual driver.
Host=server_name is the name of the server where Hadoop is running.
Port=10000 is the default port, but you can assign your own.
Schema=default is the default database. You can create your own.
DefaultTable=table_name is the name of a table in the HIVE system.
The function GetDataFromHive() connects to Hadoop/HIVE using the Microsoft® Hive ODBC Driver. The query SELECT * FROM table_name LIMIT 10 tells the database to return the first 10 records, the equivalent of TOP(10) in SQL Server.
private void GetDataFromHive()
{
    // Build the ODBC connection to Hadoop/HIVE using the Microsoft Hive ODBC Driver.
    var conn = new OdbcConnection
    {
        ConnectionString = @"DRIVER={Microsoft Hive ODBC Driver};
            Host=server_name;
            Port=10000;
            Schema=default;
            DefaultTable=table_name;
            HiveServerType=1;
            ApplySSPWithQueries=1;
            AsyncExecPollInterval=100;
            AuthMech=0;
            CAIssuedCertNamesMismatch=0;
            TrustedCerts=C:\Program Files\Microsoft Hive ODBC Driver\lib\cacerts.pem;"
    };

    try
    {
        conn.Open();

        // Pull the first 10 rows of table_name into a DataSet.
        var adp = new OdbcDataAdapter("SELECT * FROM table_name LIMIT 10", conn);
        var ds = new DataSet();
        adp.Fill(ds);

        // Walk the returned tables and rows.
        foreach (DataTable dataTable in ds.Tables)
        {
            foreach (DataRow dataRow in dataTable.Rows)
            {
                // Work with each dataRow here, e.g. read dataRow[0].
            }
        }
    }
    catch (Exception ex)
    {
        // Handle or log the exception as appropriate for your application.
        Console.WriteLine(ex.Message);
    }
    finally
    {
        conn.Close();
    }
}
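If you only need to stream the rows rather than load them into a DataSet, a lighter variant of the same query can go through OdbcDataReader. This is a sketch under the same assumptions: connectionString is the string shown above, and table_name is still a placeholder.

// Streaming alternative to the DataSet approach in GetDataFromHive().
using (var conn = new OdbcConnection(connectionString))
using (var cmd = new OdbcCommand("SELECT * FROM table_name LIMIT 10", conn))
{
    conn.Open();
    using (var reader = cmd.ExecuteReader())
    {
        while (reader.Read())
        {
            // Print the first column of each row; adjust indexes or column names to your table.
            Console.WriteLine(reader[0]);
        }
    }
}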
Points of Interest
BigData is coming in a big way, as traditional relational databases such as SQL Server, Oracle, Sybase and others find it increasingly difficult to handle large volumes of data in varied (structured/document-style/unstructured, etc.) formats. In this regard, Hadoop is fast emerging as one of the solutions that big banks and other data-mining industries are embracing. This piece of code will help you talk to Hadoop and will accelerate your effort to solve the problem at hand.