While learning about the new Tez engine and query vectorization concepts in Hadoop 2.0, I came to know that query vectorization is claimed to be about 3x faster and to consume less CPU time on an actual Hadoop cluster. The Hortonworks tutorial uses sample sensor data in a CSV file that is imported into Hive, and a sample query is then used to demonstrate the performance.
The intention of this post is to explain neither the Tez engine and query vectorization nor Hive queries. Before getting to the purpose of this post, let me describe the problem I worked on.
One sample CSV file called ‘HVAC.csv’ contains 8000 records with temperature readings for different buildings on different days. Part of the file content:
Date,Time,TargetTemp,ActualTemp,System,SystemAge,BuildingID
6/1/13,0:00:01,66,58,13,20,4
6/2/13,1:00:01,69,68,3,20,17
6/3/13,2:00:01,70,73,17,20,18
6/4/13,3:00:01,67,63,2,23,15
6/5/13,4:00:01,68,74,16,9,3
…
In Hive, the execution engine can be switched between MapReduce (mr) and Tez, and query vectorization can be checked, with the following settings:
hive> set hive.execution.engine=mr;
hive> set hive.execution.engine=tez;
hive> set hive.vectorized.execution.enabled;
hive.vectorized.execution.enabled=true
I executed the following query in my sandbox; surprisingly, it took 48 seconds for a ‘group by’ and ‘count’ over 8000 records, as shown below:
select date, count(buildingid) from hvac_orc group by date;
This query groups the sensor data by date and counts the building readings for each date. It produces 30 rows, as shown below:
Status: Finished successfully
OK
6/1/13 267
6/10/13 267
6/11/13 267
...
Time taken: 48.261 seconds, Fetched: 30 row(s)
Then I planned to write a simple program without the whole MapReduce machinery, since it is just 8000 records. I created an F# script that reads the CSV (note that I did not use any CSV type provider) and uses the Deedle exploratory data library (plain LINQ-style sequence functions could also do the job; a small sketch without Deedle appears at the end of this post). It achieved the same result, as shown below.
module ft

#I @"..\packages\Deedle.1.0.0"
#load "Deedle.fsx"

open System
open System.IO
open System.Globalization
open System.Diagnostics
open Deedle

// One record per CSV row; only the two columns needed for the group/count.
type hvac = { Date : DateTime; BuildingID : int }

let execute =
    let stopwatch = Stopwatch.StartNew()
    let enus = CultureInfo("en-US")
    use fs = new StreamReader(@"..\ml\SensorFiles\HVAC.csv")
    // Split on CR/LF and drop empty entries so blank lines do not break the parsing below.
    let lines = fs.ReadToEnd().Split("\r\n".ToCharArray(), StringSplitOptions.RemoveEmptyEntries)
    let ohvac =
        lines.[1..]                                   // skip the header row
        |> Array.map (fun s -> s.Split(','))
        |> Array.map (fun s -> { Date = DateTime.Parse(s.[0], enus); BuildingID = int s.[6] })
        |> Frame.ofRecords
    // Group the frame rows by Date and count the BuildingID values in each group.
    let result =
        ohvac.GroupRowsBy<DateTime>("Date")
        |> Frame.getNumericCols
        |> Series.mapValues (Stats.levelCount fst)
        |> Frame.ofColumns
    stopwatch.Stop()
    (stopwatch.ElapsedMilliseconds, result)
In the FSI,
> #load "finalTouch.fsx";;
> open ft;;
> ft.execute;;
val it : int64 * Deedle.Frame =
(83L,
BuildingID
01-06-2013 12:00:00 AM -> 267
02-06-2013 12:00:00 AM -> 267
03-06-2013 12:00:00 AM -> 267
04-06-2013 12:00:00 AM -> 267
...
The whole run completed within 83 milliseconds. You may argue that I am comparing apples with oranges. No! My intention is to understand when MapReduce is the savior. The moral of the above exercise is to be cautious and analyze well before moving your data processing into MapReduce clusters.
Elephants are very effective in labor requiring hard slogging and heavy lifting. Not for your gardens!!
Note that the sample CSV file from Hortonworks is clearly for training purposes. This blog post just takes it as an example of the maximum amount of data a small or medium-sized application might generate over a period. The above script may not scale and will not perform well with much larger volumes. Hence, this is not an anti-MapReduce proposal.
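As an aside, since I mentioned that LINQ-style sequence functions could also help: here is a minimal sketch of the same group-and-count done with plain .NET array functions, without Deedle at all. This is only an illustration under the same HVAC.csv layout as above; the function name countByDate and the relative path are hypothetical and not part of the original script.
module ftSeq

open System
open System.Globalization
open System.IO

// Minimal sketch: count HVAC.csv rows per date with Array.countBy instead of Deedle.
// The path and function name are illustrative only.
let countByDate (path : string) =
    let enus = CultureInfo("en-US")
    File.ReadAllLines(path).[1..]                     // skip the header row
    |> Array.map (fun line -> line.Split(','))
    |> Array.countBy (fun cols -> DateTime.Parse(cols.[0], enus))
    |> Array.sortBy fst

// Usage from FSI (hypothetical path):
// countByDate @"..\ml\SensorFiles\HVAC.csv"
// |> Array.iter (fun (d, n) -> printfn "%s %d" (d.ToShortDateString()) n)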