The era of "big data" is upon us, and
the need for fast analysis has never been more pressing. Companies that can
mine data quickly and effectively can serve their customers better, hone their
internal strategies, and gain a competitive advantage. For example, fast
analysis enables social media Web sites to deliver timely, fine-grained results
on customer behavior, and it enables financial analysts to optimize trading
strategies to respond quickly to market changes. Countless industry sectors,
from banking to climate simulation, are looking for technologies that can
analyze data quickly and easily.
One of the leading analysis techniques,
called "map/reduce" and popularized by Google, has quickly gained momentum,
driven by the widespread adoption of the open source Hadoop programming
platform. Simply stated, this technique harnesses many computers to
simultaneously analyze different portions of a large dataset and then combine
the results. Map/reduce dramatically shortens analysis time in comparison to
sequentially stepping through the dataset to look for patterns of interest.
Representing an evolution of the data-parallel computing technology that emerged
in the 1980s, it has again taken center stage as cloud computing has made
parallel computing broadly accessible.
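To make the idea concrete, here is a minimal sketch of a Hadoop map/reduce job in Java. It uses the classic word-count example purely for illustration: each map task emits a key/value pair per word in its portion of the data, and the reduce step combines the counts for each word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map step: each task independently scans its own portion ("split") of the
    // dataset and emits intermediate key/value pairs.
    public class WordCountMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce step: intermediate values for each key are gathered from all map
    // tasks and combined into a final result.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable count : values) {
                sum += count.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }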
File I/O Limits Performance
Hadoop and various commercial
implementations of map/reduce typically host data either in a distributed file
system, such as the Hadoop Distributed File System, or in a database and then
stream it into memory for analysis. Likewise, Hadoop streams both intermediate
and final results back into the file system for retrieval and reporting. The
analyst decides how to assign various sections (or "splits") of a dataset to
the many concurrent tasks that will analyze them, and then Hadoop stages these
splits in memory as it starts up analysis tasks on the computers.
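Continuing the word-count sketch above, a minimal Hadoop job driver shows how an application points Hadoop at the input splits to stage from the file system and where to write the results (the HDFS paths below are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            // Splits are read from HDFS and streamed into memory for each map task;
            // final results are streamed back into HDFS.
            FileInputFormat.addInputPath(job, new Path("/data/input"));     // placeholder path
            FileOutputFormat.setOutputPath(job, new Path("/data/output"));  // placeholder path

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }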
Performance studies have shown that file
I/O can significantly impact the performance of map/reduce analysis. The time
required to move data into and out of memory for each split lengthens the
execution time for its associated map and reduce tasks, and this in turn delays
completion of the overall analysis. For very large datasets that do not fit
into memory, this overhead, and the performance penalty it brings, cannot be
avoided. For the many datasets that can fit in memory, however, finding ways to
circumvent these delays offers a clear path to boosting map/reduce performance.
In-Memory Data Grids Tackle the I/O Bottleneck
A recent Forrester survey of big data
initiatives reported that about 63% of industry databases were smaller than
10TB. Since cloud computing has made it practical to harness 200 or more
servers for map/reduce analysis, these datasets can now be held entirely
in memory. This creates an exciting opportunity to avoid file I/O and
dramatically reduce analysis time for many useful datasets.
Of course, techniques have to be devised
to host a dataset in memory spread across a large array of servers and then
to access the data efficiently on demand. A technology called the distributed
"in-memory data grid" (IMDG) has evolved over the last several years precisely
for this purpose. Originally employed primarily as a scalable, distributed
cache for database systems, commercial IMDGs have become full-fledged storage
systems with integrated load-balancing, high availability, parallel query,
event processing, and many other capabilities.
Typical IMDGs store data as key/value
pairs. This is a perfect match for map/reduce analysis, which also uses
key/value pairs to identify data. IMDGs provide programming interfaces ("APIs")
that applications use to store and retrieve data. These APIs access key/value
pairs somewhat like files in a file system, using straightforward
create/read/update/delete semantics. They integrate nicely with modern,
object-oriented programming languages, such as Java and C#, hosting key/value
pairs in namespaces that correspond to language-defined object collections.
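The sketch below illustrates this style of access in Java. The GridCache interface and its method names are hypothetical stand-ins, not ScaleOut StateServer's actual API; they simply show the create/read/update/delete pattern, with a stock-history object of the kind discussed later serving as the stored value.

    import java.io.Serializable;

    // Hypothetical IMDG client interface; real products expose their own APIs,
    // but the create/read/update/delete pattern is the same.
    interface GridCache<K, V extends Serializable> {
        void create(K key, V value);   // add a new key/value pair to the grid
        V    read(K key);              // retrieve the value stored for a key
        void update(K key, V value);   // replace the stored value
        void delete(K key);            // remove the pair from the grid
    }

    // Example domain object stored as the "value" of a key/value pair.
    class StockHistory implements Serializable {
        final String symbol;
        final double[] closingPrices;

        StockHistory(String symbol, double[] closingPrices) {
            this.symbol = symbol;
            this.closingPrices = closingPrices;
        }
    }

    class GridUsageExample {
        static void demo(GridCache<String, StockHistory> histories) {
            // The application never specifies which grid server holds the pair;
            // the IMDG maps the key to a server automatically.
            StockHistory ibm = new StockHistory("IBM", new double[] { 187.6, 188.2 });
            histories.create("IBM", ibm);
            StockHistory fetched = histories.read("IBM");
            histories.update("IBM", fetched);
            histories.delete("IBM");
        }
    }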
Commercial in-memory data grids
automatically handle the problem of uniformly distributing stored data across
the memory of the computers that host the grid. Applications never have to
know which grid servers hold the key/value pairs they need; the IMDG takes care
of mapping access requests to servers. IMDGs also provide very fast access to
stored data and automatically scale their throughput to handle simultaneous
accesses by many client computers. This precisely meets the demand that
map/reduce analysis places on its storage system, namely that a large set of
map and reduce tasks be able to simultaneously access key/value pairs for
reading or writing.
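Under the hood, this transparency typically rests on deterministic partitioning of the key space, so every client computes the same key-to-server mapping without consulting a central directory. The minimal sketch below shows the basic idea with simple hash-based partitioning; real IMDGs layer replication, dynamic load balancing, and recovery from server failures on top of it.

    import java.util.List;

    // Minimal sketch of hash-based key partitioning across grid servers.
    class KeyPartitioner {
        private final List<String> gridServers;   // e.g., host names of the grid servers

        KeyPartitioner(List<String> gridServers) {
            this.gridServers = gridServers;
        }

        // Every client maps a given key to the same server, so no lookup service is needed.
        String serverForKey(String key) {
            int hash = key.hashCode() & 0x7fffffff;            // force a non-negative hash code
            return gridServers.get(hash % gridServers.size());
        }
    }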
For datasets that can be held in memory,
IMDGs provide an extremely fast storage layer that can significantly boost the
performance of map/reduce analysis. For example, consider a well-known
application in financial services that back-tests a mix of stock trading
strategies against historical market data. This application lets analysts
test new trading strategies on historical stock data as a predictor of
their value in live market transactions.
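In map/reduce terms, each map task can evaluate the strategy mix against a single stock's price history, and the reduce step can combine the per-stock results. The hypothetical sketch below (reusing the StockHistory class from the earlier sketch, with a trivial placeholder strategy) shows the shape of that computation.

    // Hypothetical per-stock analysis used as the "map" step of back-testing.
    // The real application evaluates a mix of trading strategies; this sketch
    // computes only a single illustrative figure per stock.
    class BackTest {
        static double evaluateStrategy(StockHistory history) {
            double gain = 0.0;
            double[] prices = history.closingPrices;
            for (int day = 1; day < prices.length; day++) {
                gain += prices[day] - prices[day - 1];   // placeholder "strategy": buy and hold
            }
            return gain;
        }

        // "Reduce" step: combine per-stock results into an aggregate figure.
        static double combine(double left, double right) {
            return left + right;
        }
    }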
To see the impact of storing datasets in
memory, the performance of a Hadoop map/reduce analysis on data hosted in a
commercial IMDG called ScaleOut StateServer (SOSS) was compared to another run
in which the data was hosted in the Hadoop Distributed File System (HDFS). The
stock history data were stored in both the in-memory data grid and HDFS as
key/value pairs, one for each stock symbol. In this analysis, servers were added to
scale processing power as the population of stock histories was proportionally
increased to create additional workload. In all scenarios, storing the market
data in the IMDG instead of in HDFS significantly reduced analysis time,
yielding approximately a 6X increase in analysis throughput.
[Graph: Throughput Comparison]
(See the green and blue lines in the
graph.) This is a remarkable boost in performance, but can we do even better?
The Next Step: Integrate Map/Reduce into the IMDG
Whether data is stored in an in-memory
data grid or in a distributed file system, getting the data to and from the
computer performing its associated map/reduce task creates data transfer
overhead that impacts overall performance. The Hadoop map/reduce platform also
introduces additional overhead, such as networking delays, in its task
scheduling and in its handling of intermediate results that move between the
map and reduce stages of an overall analysis.
Tightly integrating a map/reduce
execution engine into an IMDG can eliminate unnecessary data motion and many
other map/reduce overheads to deliver breathtaking performance gains. For
example, ScaleOut StateServer’s Grid Computing Edition offers a map/reduce
execution model called "parallel method invocation" (PMI) which lets applications
invoke a map/reduce analysis as a simple extension to the standard APIs used
for grid access. The ability to invoke map/reduce inline with program execution
eliminates much of the overhead associated with Hadoop’s batch scheduler.
Moreover, SOSS’s map/reduce engine ensures that map and reduce tasks always run
on the computers that host their associated key/value pairs. This eliminates
almost all of the network overhead involved in staging data and significantly
reduces delays.
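In code, the difference is that the analysis becomes an in-process call on the grid rather than a separately scheduled batch job. The sketch below illustrates the inline style; the InvocationGrid interface and its invoke method are hypothetical stand-ins for illustration, not ScaleOut StateServer's actual PMI API, and the BackTest and StockHistory classes come from the earlier sketches.

    import java.io.Serializable;
    import java.util.function.BinaryOperator;
    import java.util.function.Function;

    // Hypothetical inline map/reduce invocation against an IMDG. The eval step runs
    // on the grid servers that host each key/value pair, avoiding data movement;
    // the merge step combines partial results pairwise into a single value.
    interface InvocationGrid<K, V extends Serializable> {
        <R> R invoke(Function<V, R> eval,          // runs where the data lives ("map")
                     BinaryOperator<R> merge);     // combines partial results ("reduce")
    }

    class BackTestInline {
        static double run(InvocationGrid<String, StockHistory> grid) {
            // One call performs the whole analysis inline with program execution:
            // eval maps over every stored StockHistory, merge reduces the results.
            Double total = grid.invoke(BackTest::evaluateStrategy, BackTest::combine);
            return total;
        }
    }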
To evaluate the performance benefits of
integrating map/reduce into an in-memory data grid, the financial application
for back-testing stock trading strategies was implemented for SOSS’s PMI model
of map/reduce. The test runs delivered a remarkable 16X increase in throughput
over hosting the application in Hadoop using HDFS (as shown by the red and blue
lines in the graph above). In the world of financial analysis especially,
faster results offer an important competitive advantage.
As cloud computing becomes more pervasive,
large pools of servers will become increasingly cost-effective to use for
map/reduce analysis. This will enable IMDGs to host ever larger datasets in
memory, sharply reducing analysis time and letting analysts gain insights from
their data more quickly and easily.