
Apache Hadoop for Windows Platform

16 Jul 2014 · CPOL · 9 min read
Apache Hadoop 2.3 for Big Data Analytics


  1. Introduction
  2. Hadoop 2.3 for Windows 7/8/8.1 - Specifically Built for Windows x64
    1. Hadoop 2.3 for Windows (112.5 MB)

      GitHub Link:

       https://github.com/prabaprakash/Hadoop-2.3

      Box Link:

      https://app.box.com/s/11fwozokqmc1ohttt117

      Google Drive Link:

      https://drive.google.com/file/d/0Bz7A6rJcTjx_Q0RDT0FrU3dUTDQ/edit?usp=sharing

      Dropbox Link:

      https://www.dropbox.com/s/p8xsfmx9g76pn0t/hadoop-2.3.0.tar.gz

    2. Pre-configured config files - https://github.com/prabaprakash/Hadoop-2.3-Config/archive/master.zip
    3. Java SDK / Runtime 1.6 Mandatory

      Download Links: http://download.oracle.com/otn/java/jdk/6u31-b05/jdk-6u31-windows-x64.exe

      Reference Link: http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archive-downloads-javase6-419409.html

  3. Map Reduce Jobs in Java
  4. Redgate HDFS Explorer - http://bigdatainstallers.azurewebsites.net/files/HDFS%20Explorer/beta/1/HDFS%20Explorer%20-%20beta.application
  5. Eclipse Plugin for Hadoop MapReduce Jobs with Simple HDFS Explorer and Code Completion Configuration like Visual Studio
    1. Eclipse IDE - http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-jee-kepler-SR2-win32-x86_64.zip
    2. Hadoop MapReduce Plugin for Eclipse - https://github.com/winghc/hadoop2x-eclipse-plugin/archive/master.zip
  6. Datasets
  7. Recipe Samples

    Source Code: https://github.com/prabaprakash/Hadoop-Map-Reduce-Code

    Documentation:

    https://github.com/prabaprakash/Hadoop-Map-Reduce-Code/blob/master/Recipe%20Sample/Recipe%20Documentation.docx

  8. If You Need a Native Hadoop 2.5.1 Build for Ubuntu 14.10

    Setup: https://github.com/prabaprakash/Hadoop-2.5.1-Binary

    Config: https://github.com/prabaprakash/Hadoop-2.5.1-Config-Files

1. Introduction

  • Hadoop is a free, Java-based programming framework that supports the processing of large data sets in a distributed computing environment.
  • It is part of the Apache project sponsored by the Apache Software Foundation.
  • Hadoop makes it possible to run applications on systems with thousands of nodes involving thousands of terabytes.
  • Its distributed file system facilitates rapid data transfer rates among nodes and allows the system to continue operating uninterrupted in case of a node failure.
  • This approach lowers the risk of catastrophic system failure, even if a significant number of nodes become inoperative.
  • Hadoop was inspired by Google's MapReduce, a software framework in which an application is broken down into numerous small parts.
  • Any of these parts (also called fragments or blocks) can be run on any node in the cluster. Doug Cutting, Hadoop's creator, named the framework after his child's stuffed toy elephant.
  • The current Apache Hadoop ecosystem consists of the Hadoop kernel, MapReduce, the Hadoop distributed file system (HDFS) and a number of related projects such as Apache Hive, HBase and Zookeeper.
  • The Hadoop framework is used by major players including Google, Yahoo and IBM, largely for applications involving search engines and advertising. The preferred operating systems are Windows and Linux, but Hadoop can also work with BSD and OS X.

We can't really understand the Apache Hadoop framework without interactive sessions, so I will list some YouTube playlists that explain Apache Hadoop interactively:

Playlist 1 - By Lynn Langit

http://www.youtube.com/playlist?list=PL8C3359ECF726D473

Playlist 2 - By handsonerp

http://www.youtube.com/user/handsonerp/search?query=hadoop

Playlist 3 - By Edureka!

http://www.youtube.com/playlist?list=PL9ooVrP1hQOHpJj0DW8GoQqnkbptAsqjZ

Some Ways to Install Hadoop in Windows

  1. Cygwin
    1. http://sundersinghc.wordpress.com/2013/04/08/running-hadoop-on-cygwin-in-windows-single-node-cluster/
    2. http://bigdata.globant.com/?p=7
    3. http://alans.se/blog/2010/hadoop-hbase-cygwin-windows-7-x64/#.U0bamFerMiw
  2. Azure HD Insight Emulator
    1. http://azure.microsoft.com/en-us/documentation/articles/hdinsight-get-started-emulator/
  3. Build Hadoop for Windows
    1. By Apache Doc - https://svn.apache.org/viewvc/hadoop/common/branches/branch-2/BUILDING.txt?view=markup
    2. Perfect Guide by Abhijit Ghosh - http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os
  4. Hortonworks for Windows (Hadoop 2.0) and also Sandbox images of Hadoop 2.0 for Hyper-V / VMware / VirtualBox
    1. HortonWorks for Windows - http://hortonworks.com/partner/microsoft/
    2. Sandbox 2.0 - http://hortonworks.com/products/hortonworks-sandbox/
  5. Cloudera VM
    1. http://www.cloudera.com/content/support/en/downloads.html

Other Cloud Services

  1. Azure HD Insight
  2. Amazon Elastic Map Reduce
  3. IBM Blue Mix - Hadoop Service

2. Hadoop 2.3 for Windows 7/8/8.1 - Specifically Built for Windows x64

I built Hadoop 2.3 for Windows x64 following the steps provided by Abhijit Ghosh at http://www.srccodes.com/p/article/38/build-install-configure-run-apache-hadoop-2.2.0-microsoft-windows-os. The Maven build completed successfully:

[INFO] Executed tasks
[INFO]
[INFO] --- maven-javadoc-plugin:2.8.1:jar (module-javadocs) @ hadoop-dist ---
[INFO] Building jar: C:\hdp\hadoop-dist\target\hadoop-dist-2.3.0-javadoc.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Apache Hadoop Main ................................ SUCCESS [1.847s]
[INFO] Apache Hadoop Project POM ......................... SUCCESS [3.218s]
[INFO] Apache Hadoop Annotations ......................... SUCCESS [3.812s]
[INFO] Apache Hadoop Assemblies .......................... SUCCESS [0.522s]
[INFO] Apache Hadoop Project Dist POM .................... SUCCESS [3.717s]
[INFO] Apache Hadoop Maven Plugins ....................... SUCCESS [6.613s]
[INFO] Apache Hadoop MiniKDC ............................. SUCCESS [7.117s]
[INFO] Apache Hadoop Auth ................................ SUCCESS [5.104s]
[INFO] Apache Hadoop Auth Examples ....................... SUCCESS [4.230s]
[INFO] Apache Hadoop Common .............................. SUCCESS [3:18.829s]
[INFO] Apache Hadoop NFS ................................. SUCCESS [13.442s]
[INFO] Apache Hadoop Common Project ...................... SUCCESS [0.066s]
[INFO] Apache Hadoop HDFS ................................ SUCCESS [2:45.070s]
[INFO] Apache Hadoop HttpFS .............................. SUCCESS [40.280s]
[INFO] Apache Hadoop HDFS BookKeeper Journal ............. SUCCESS [10.956s]
[INFO] Apache Hadoop HDFS-NFS ............................ SUCCESS [5.037s]
[INFO] Apache Hadoop HDFS Project ........................ SUCCESS [0.075s]
[INFO] hadoop-yarn ....................................... SUCCESS [0.070s]
[INFO] hadoop-yarn-api ................................... SUCCESS [1:12.357s]
[INFO] hadoop-yarn-common ................................ SUCCESS [46.634s]
[INFO] hadoop-yarn-server ................................ SUCCESS [0.071s]
[INFO] hadoop-yarn-server-common ......................... SUCCESS [10.907s]
[INFO] hadoop-yarn-server-nodemanager .................... SUCCESS [25.635s]
[INFO] hadoop-yarn-server-web-proxy ...................... SUCCESS [4.293s]
[INFO] hadoop-yarn-server-resourcemanager ................ SUCCESS [30.427s]
[INFO] hadoop-yarn-server-tests .......................... SUCCESS [3.817s]
[INFO] hadoop-yarn-client ................................ SUCCESS [7.340s]
[INFO] hadoop-yarn-applications .......................... SUCCESS [0.068s]
[INFO] hadoop-yarn-applications-distributedshell ......... SUCCESS [3.047s]
[INFO] hadoop-yarn-applications-unmanaged-am-launcher .... SUCCESS [2.346s]
[INFO] hadoop-yarn-site .................................. SUCCESS [0.101s]
[INFO] hadoop-yarn-project ............................... SUCCESS [4.986s]
[INFO] hadoop-mapreduce-client ........................... SUCCESS [0.137s]
[INFO] hadoop-mapreduce-client-core ...................... SUCCESS [51.554s]
[INFO] hadoop-mapreduce-client-common .................... SUCCESS [28.285s]
[INFO] hadoop-mapreduce-client-shuffle ................... SUCCESS [3.548s]
[INFO] hadoop-mapreduce-client-app ....................... SUCCESS [22.627s]
[INFO] hadoop-mapreduce-client-hs ........................ SUCCESS [12.972s]
[INFO] hadoop-mapreduce-client-jobclient ................. SUCCESS [51.921s]
[INFO] hadoop-mapreduce-client-hs-plugins ................ SUCCESS [2.340s]
[INFO] Apache Hadoop MapReduce Examples .................. SUCCESS [9.765s]
[INFO] hadoop-mapreduce .................................. SUCCESS [3.397s]
[INFO] Apache Hadoop MapReduce Streaming ................. SUCCESS [16.817s]
[INFO] Apache Hadoop Distributed Copy .................... SUCCESS [37.303s]
[INFO] Apache Hadoop Archives ............................ SUCCESS [2.773s]
[INFO] Apache Hadoop Rumen ............................... SUCCESS [11.225s]
[INFO] Apache Hadoop Gridmix ............................. SUCCESS [7.554s]
[INFO] Apache Hadoop Data Join ........................... SUCCESS [3.982s]
[INFO] Apache Hadoop Extras .............................. SUCCESS [4.627s]
[INFO] Apache Hadoop Pipes ............................... SUCCESS [0.080s]
[INFO] Apache Hadoop OpenStack support ................... SUCCESS [8.620s]
[INFO] Apache Hadoop Client .............................. SUCCESS [8.964s]
[INFO] Apache Hadoop Mini-Cluster ........................ SUCCESS [0.186s]
[INFO] Apache Hadoop Scheduler Load Simulator ............ SUCCESS [16.472s]
[INFO] Apache Hadoop Tools Dist .......................... SUCCESS [7.326s]
[INFO] Apache Hadoop Tools ............................... SUCCESS [0.066s]
[INFO] Apache Hadoop Distribution ........................ SUCCESS [1:09.690s]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 17:47.469s
[INFO] Finished at: Sun Mar 23 18:01:41 IST 2014
[INFO] Final Memory: 131M/349M
[INFO] ------------------------------------------------------------------------

Installation Steps

  1. Download Hadoop 2.3 for Windows (112.5 MB) from my Box account - https://app.box.com/s/11fwozokqmc1ohttt117
  2. Also download the configuration files from my GitHub repo - https://github.com/prabaprakash/Hadoop-2.3-Config/archive/master.zip

    You should now have these files with you:

    Image 1

    Fine!

  3. Open hadoop-2.3.0.tar.gz with WinRAR and extract it to the local disk, giving c:\hadoop-2.3.0.

    Image 2

  4. Open config.rar with WinRAR.

    Image 3

    Open the bin directory in WinRAR and extract the yarn.cmd file into the c:\hadoop-2.3.0\bin folder.

    Image 4

    Open config\etc\hadoop and extract

    1. yarn-site.xml
    2. mapred-site.xml
    3. httpfs-site.xml
    4. hdfs-site.xml
    5. hadoop-policy.xml
    6. core-site.xml
    7. capacity-scheduler.xml

    into c:\hadoop-2.3.0\etc\hadoop, replacing the existing files.

    Image 5

  5. This step is mandatory: the Apache developers built this Hadoop release against Java 1.6, so we need both the Java 1.6 SDK and the Java 1.6 runtime.
    1. Download Java SDK 1.6.0_31

      http://download.oracle.com/otn/java/jdk/6u31-b05/jdk-6u31-windows-x64.exe

      Then install it.

  6. Set the Environment Variables

    Control Panel\System and Security\System

    Open Advanced System Settings

    Image 6

    Then add a new variable "HADOOP_HOME" with the value "c:\hadoop-2.3.0".

    Also add a new variable "JAVA_HOME" with the value set to your Java installation path.

    Image 7

    System Variables -> Path -> Edit

    Add the Hadoop bin path and the Java 6 bin path, then click OK. (A quick Java sanity check for these variables follows this step.)

    Image 8
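
    Before moving on, it helps to confirm that a newly started JVM actually sees both variables. Here is a minimal, optional Java sketch (EnvCheck is just an illustrative name, not part of Hadoop):

    Java
    public class EnvCheck {
        public static void main(String[] args) {
            // Both values must be non-null for the Hadoop commands to work.
            System.out.println("HADOOP_HOME = " + System.getenv("HADOOP_HOME"));
            System.out.println("JAVA_HOME   = " + System.getenv("JAVA_HOME"));
        }
    }

    Compile and run it from a freshly opened cmd window; environment variable changes only apply to consoles opened after the change.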

  7. Then open hadoop-env.cmd in WordPad; it is located at C:\hadoop-2.3.0\etc\hadoop\hadoop-env.cmd.

    Set the JAVA_HOME path on line 25, e.g. set JAVA_HOME=c:\java\jdk1.6.0_31 (your JDK installation directory; remember, not the JDK bin path).

    Image 9

  8. Let's play with Apache Hadoop 2.3
    1. Open cmd as administrator:
    C:\Windows\system32>cd c:\hadoop-2.3.0\bin
    
    c:\hadoop-2.3.0\bin>hadoop
    Usage: hadoop [--config confdir] COMMAND
    where COMMAND is one of:
      fs                   run a generic filesystem user client
      version              print the version
      jar <jar>            run a jar file
      checknative [-a|-h]  check native hadoop and compression libraries availability
      distcp <srcurl> <desturl> copy file or directories recursively
      archive -archiveName NAME -p <parent> <src>* <dest> create a hadoop archive
      classpath            prints the class path needed to get the
                           Hadoop jar and the required libraries
      daemonlog            get/set the log level for each daemon
     or
      CLASSNAME            run the class named CLASSNAME
    
    Most commands print help when invoked w/o parameters.
    
    
    c:\hadoop-2.3.0\bin>hadoop namenode -format

    It will create an HDFS instance on your local disk and format it.

    c:\hadoop-2.3.0\bin>cd..
    
    c:\hadoop-2.3.0>cd sbin
    
    c:\hadoop-2.3.0\sbin>start-dfs.cmd
    c:\hadoop-2.3.0\sbin>start-yarn.cmd
    starting yarn daemons

    Image 10

    Now check whether the Hadoop NameNode & DataNode and the YARN NodeManager & ResourceManager are all running concurrently; running jps from the JDK's bin folder lists these Java daemons.

    OK, let's go on to MapReduce.

3. Some Map Reduce Jobs

  • Everywhere I've seen, programmers begin their first MapReduce program with the simple WordCount example.
  • That bores me, so let's begin with recipes.
  1. Download the recipeitems-latest.json file (26 MB):

    http://openrecipes.s3.amazonaws.com/recipeitems-latest.json.gz

  2. Create a folder in c:\ named hwork.

    Extract recipeitems-latest.json.gz into the c:\hwork folder; extracted, the file is about 150 MB.

    It contains about 1.5 lakh (150,000) recipe items, one JSON record per line, for example:

    { "_id" : { "oid" : "5160756b96cc62079cc2db15" }, "name" : "Drop Biscuits and Sausage Gravy", "ingredients" : "Biscuits\n3 cups All-purpose Flour\n2 Tablespoons Baking Powder\n1/2 teaspoon Salt\n1-1/2 stick (3/4 Cup) Cold Butter, Cut Into Pieces\n1-1/4 cup Butermilk\n SAUSAGE GRAVY\n1 pound Breakfast Sausage, Hot Or Mild\n1/3 cup All-purpose Flour\n4 cups Whole Milk\n1/2 teaspoon Seasoned Salt\n2 teaspoons Black Pepper, More To Taste", "url" : "http://thepioneerwoman.com/cooking/2013/03/drop-biscuits-and-sausage-gravy/", "image" : "http://static.thepioneerwoman.com/cooking/files/2013/03/bisgrav.jpg", "ts" : { "date" : 1365276011104 }, "cookTime" : "PT30M", "source" : "thepioneerwoman", "recipeYield" : "12", "datePublished" : "2013-03-11", "prepTime" : "PT10M", "description" : "Late Saturday afternoon, after Marlboro Man had returned home with the soccer-playing girls, and I had returned home with the..." }

  3. Download the Gson library for Java to deserialize the JSON:

    https://code.google.com/p/google-gson/downloads/detail?name=google-gson-2.2.4-release.zip&can=2&q=

    Extract the zip file, then copy all the jar files into the C:\hadoop-2.3.0\share\hadoop\common\lib folder.

    Approximately 1.5 lakh recipe items are in the JSON file. My intention is to count the number of items per "cookTime"; the values are ISO 8601 durations (PT0H30M means 30 minutes). The counts will look like the sample below (a standalone Gson check follows the sample):

    PT0H20M 25
    PT0H25M 24
    PT0H2M 3
    PT0H30M 74
    PT0H34M 1
    PT0H35M 31
    PT0H3M 1
    PT0H40M 67
    PT0H45M 74
    PT0H50M 52
    PT0H55M 10
    PT0H5M 118
    PT0H6M 1
    PT0H7M 1
    PT0H8M 6
    PT0M 80
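
    Before wiring Gson into the MapReduce job, you can sanity-check the deserialization on a single record. Below is a minimal standalone sketch (GsonCheck and the trimmed-down Roo are illustrative; the real Roo comes with the full job in the next step). It works because Gson simply ignores JSON fields that the class does not declare:

    Java
    import com.google.gson.Gson;

    public class GsonCheck {
        // Trimmed-down record class: only the field we care about.
        static class Roo {
            String cookTime;
        }

        public static void main(String[] args) {
            String line = "{ \"name\" : \"Drop Biscuits\", \"cookTime\" : \"PT30M\" }";
            Roo roo = new Gson().fromJson(line, Roo.class);
            System.out.println(roo.cookTime); // prints PT30M
        }
    }

    Compile and run it with gson-2.2.4.jar on the classpath.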

  4. Map Reduce Code

    Recipe.java

    Java
    import java.io.IOException;
    
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.GenericOptionsParser;
    
    import com.google.gson.Gson;
    public class Recipe {
    
        public static class TokenizerMapper
                extends Mapper<Object, Text, Text, IntWritable>{
    
            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();
            Gson gson = new Gson();
            // Hadoop feeds the mapper one line of input at a time;
            // each line of the dataset is one complete JSON recipe record.
            public void map(Object key, Text value, Context context
            ) throws IOException, InterruptedException {
                // Deserialize the JSON line into a Roo object.
                Roo roo = gson.fromJson(value.toString(), Roo.class);
                if (roo.cookTime != null) {
                    word.set(roo.cookTime);
                } else {
                    word.set("none");
                }
                // Emit (cookTime, 1) for every recipe.
                context.write(word, one);
            }
        }
    
        public static class IntSumReducer
                extends Reducer<Text,IntWritable,Text,IntWritable> {
            private IntWritable result = new IntWritable();
    
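            // Sums the counts emitted for each cookTime key. This class is also
            // used as the combiner, which is safe because addition is associative.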
            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context
            ) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }
    
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Strip generic Hadoop options (-D, -fs, ...) and keep our own arguments.
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
                System.err.println("Usage: recipe <in> <out>");
                System.exit(2);
            }
            @SuppressWarnings("deprecation")
            Job job = new Job(conf, "Recipe");

            job.setJarByClass(Recipe.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            // Input and output paths come from the command line,
            // e.g. /in and /out on hdfs://127.0.0.1:9000.
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }
    
    class Id {
        public String oid;
    }

    class Ts {
        public long date;
    }

    class Roo {
        public Id _id;
        public String name;
        public String ingredients;
        public String url;
        public String image;
        public Ts ts;
        public String cookTime;
        public String source;
        public String recipeYield;
        public String datePublished;
        public String prepTime;
        public String description;
    }

    By default, Hadoop itself reads the input file line by line and sends each line to

    Java
    class TokenizerMapper 

    In the TokenizerMapper class, we deserialize the JSON string into a Roo object. From that object we take cookTime (or "none" when the field is missing) and write it to the mapper Context.

    In the IntSumReducer class, we count the number of items per key and write the total to the reducer Context. For example, if three records share cookTime "PT30M", the mapper emits (PT30M, 1) three times and the reducer outputs (PT30M, 3).

  5. We need to compile it.

    Create Recipe.java in the c:\Hwork folder, then run the following command:

    c:\Hwork>javac -classpath C:\hadoop-2.3.0\share\hadoop\common\hadoop-common-2.3.0.jar;C:\hadoop-2.3.0\share\hadoop\mapreduce\hadoop-mapreduce-client-core-2.3.0.jar;C:\hadoop-2.3.0\share\hadoop\common\lib\gson-2.2.4.jar;C:\hadoop-2.3.0\share\hadoop\common\lib\commons-cli-1.2.jar Recipe.java

    Now our MapReduce program has compiled successfully. Next we need to create a jar file, because Hadoop runs jobs from a jar.

    To make the jar, follow the command below:

    C:\Hwork>jar -cvf Recipe.jar *.class
    added manifest
    adding: Id.class(in = 217) (out= 179)(deflated 17%)
    adding: Recipe$IntSumReducer.class(in = 1726) (out= 736)(deflated 57%)
    adding: Recipe$TokenizerMapper.class(in = 1887) (out= 820)(deflated 56%)
    adding: Recipe.class(in = 1861) (out= 1006)(deflated 45%)
    adding: Roo.class(in = 435) (out= 293)(deflated 32%)
    adding: Ts.class(in = 201) (out= 168)(deflated 16%)

    We are ready to run the MapReduce program, but first we need to copy the c:\Hwork\recipeitems-latest.json file to the Hadoop distributed file system. Follow the steps given below:

    c:\hadoop-2.3.0\sbin>hadoop fs -mkdir /in
    
    
    c:\hadoop-2.3.0\sbin>hadoop fs -copyFromLocal c:\Hwork\recipeitems-latest.json /in
    
     So we copied the file from the local disk to the Hadoop Distributed File System.
    Now run the MapReduce job:
    
    c:\hadoop-2.3.0\sbin>hadoop jar c:\Hwork\Recipe.jar Recipe /in /out
    14/04/12 00:52:02 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    14/04/12 00:52:03 INFO input.FileInputFormat: Total input paths to process : 1
    14/04/12 00:52:03 INFO mapreduce.JobSubmitter: number of splits:1
    14/04/12 00:52:04 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1397243723769_0001
    14/04/12 00:52:04 INFO impl.YarnClientImpl: Submitted application application_1397243723769_0001
    14/04/12 00:52:04 INFO mapreduce.Job: The url to track the job: http://OmSkathi:8088/proxy/application_1397243723769_0001/
    14/04/12 00:52:04 INFO mapreduce.Job: Running job: job_1397243723769_0001
    14/04/12 00:52:16 INFO mapreduce.Job: Job job_1397243723769_0001 running in uber mode : false
    14/04/12 00:52:16 INFO mapreduce.Job:  map 0% reduce 0%
    14/04/12 00:52:26 INFO mapreduce.Job:  map 100% reduce 0%
    14/04/12 00:52:33 INFO mapreduce.Job:  map 100% reduce 100%
    14/04/12 00:52:34 INFO mapreduce.Job: Job job_1397243723769_0001 completed successfully
    14/04/12 00:52:34 INFO mapreduce.Job: Counters: 49
        File System Counters
            FILE: Number of bytes read=3872
            FILE: Number of bytes written=180889
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=119406749
            HDFS: Number of bytes written=2871
            HDFS: Number of read operations=6
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
        Job Counters 
            Launched map tasks=1
            Launched reduce tasks=1
            Data-local map tasks=1
            Total time spent by all maps in occupied slots (ms)=7383
            Total time spent by all reduces in occupied slots (ms)=5121
            Total time spent by all map tasks (ms)=7383
            Total time spent by all reduce tasks (ms)=5121
            Total vcore-seconds taken by all map tasks=7383
            Total vcore-seconds taken by all reduce tasks=5121
            Total megabyte-seconds taken by all map tasks=7560192
            Total megabyte-seconds taken by all reduce tasks=5243904
        Map-Reduce Framework
            Map input records=146949
            Map output records=146949
            Map output bytes=1387492
            Map output materialized bytes=3872
            Input split bytes=113
            Combine input records=146949
            Combine output records=293
            Reduce input groups=293
            Reduce shuffle bytes=3872
            Reduce input records=293
            Reduce output records=293
            Spilled Records=586
            Shuffled Maps =1
            Failed Shuffles=0
            Merged Map outputs=1
            GC time elapsed (ms)=70
            CPU time spent (ms)=5108
            Physical memory (bytes) snapshot=370135040
            Virtual memory (bytes) snapshot=428552192
            Total committed heap usage (bytes)=270860288
        Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
        File Input Format Counters 
            Bytes Read=119406636
        File Output Format Counters 
            Bytes Written=2871
    
    Map Reduce job completed successfully. Now check the output folder "/out":
    
    c:\hadoop-2.3.0\sbin>hadoop fs -ls /out
    Windows_NT-amd64-64
    Found 2 items
    -rw-r--r--   1 PrabaKarthi supergroup          0 2014-04-12 00:52 /out/_SUCCESS
    -rw-r--r--   1 PrabaKarthi supergroup       2871 2014-04-12 00:52 /out/part-r-00000
    
    Open the output file and have a good look; you will enjoy the analytics work Hadoop has done. (A small local post-processing sketch follows the listing below.)
    
    c:\hadoop-2.3.0\sbin>hadoop fs -cat /out/part-r-00000
    P0D    121
    P1D    2
    P1DT6H    1
    P4DT8H    1
    PT    8491
    PT0H10M    56
    PT0H12M    1
    PT0H14M    1
    PT0H15M    55
    PT0H1M    1
    PT0H20M    25
    PT0H25M    24
    PT0H2M    3
    PT0H30M    74
    PT0H34M    1
    PT0H35M    31
    PT0H3M    1
    PT0H40M    67
    PT0H45M    74
    PT0H50M    52
    PT0H55M    10
    PT0H5M    118
    PT0H6M    1
    PT0H7M    1
    PT0H8M    6
    PT0M    80
    PT1008H    1
    PT100M    1
    PT10H    102
    PT10H0M    2
    PT10H10M    5
    PT10H15M    4
    PT10H20M    1
    PT10H25M    1
    PT10H30M    5
    PT10H35M    1
    PT10H40M    1
    PT10H45M    1
    PT10M    9982
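
    Once you copy part-r-00000 to the local disk (hadoop fs -copyToLocal /out/part-r-00000 c:\Hwork), a few lines of plain Java are enough to post-process the result, for example finding the most common cookTime. A minimal sketch (TopCookTime is an illustrative name; it assumes the tab-separated format shown above):

    Java
    import java.io.BufferedReader;
    import java.io.FileReader;

    public class TopCookTime {
        public static void main(String[] args) throws Exception {
            String bestKey = null;
            int bestCount = -1;
            // Each reducer output line is "<cookTime><TAB><count>".
            BufferedReader in = new BufferedReader(new FileReader("c:\\Hwork\\part-r-00000"));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] parts = line.split("\t");
                    int count = Integer.parseInt(parts[1].trim());
                    if (count > bestCount) {
                        bestCount = count;
                        bestKey = parts[0];
                    }
                }
            } finally {
                in.close();
            }
            System.out.println("Most common cookTime: " + bestKey + " (" + bestCount + ")");
        }
    }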

    So, we have completed a MapReduce job. Everyone knows this part; next, I am going to list some tools that make the work easier than before.

4. Redgate HDFS Explorer

I got bored copying local files to the Hadoop file system by command, and retrieving data from it by command as well. Then I found this open-source software, which is a lot of fun. First, download it (2.5 MB):

http://bigdatainstallers.azurewebsites.net/files/HDFS%20Explorer/beta/1/HDFS%20Explorer%20-%20beta.application

  1. Install it. We already copied the configuration files for Hadoop 2.3, so our Hadoop file system will be accessible remotely, including from web clients written in Java, C#, Python, etc. (see the Java sketch after this step).

    Image 11
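
    As a sketch of that "web client" idea, the snippet below lists an HDFS directory over the WebHDFS REST API using nothing but the JDK. It assumes WebHDFS is enabled in the configuration we extracted and that the NameNode web interface is on its Hadoop 2.x default port 50070; adjust host, port, and path for your setup:

    Java
    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class WebHdfsList {
        public static void main(String[] args) throws Exception {
            // LISTSTATUS returns a JSON description of the directory contents.
            URL url = new URL("http://localhost:50070/webhdfs/v1/in?op=LISTSTATUS");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            try {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            } finally {
                in.close();
                conn.disconnect();
            }
        }
    }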

  2. Open HDFS Explorer

    File->Add Connection

    Image 12

    Browse our Hadoop file system in a graphical file explorer. Copy input files from the local disk and paste them into HDFS, and copy output from HDFS back to your local disk; you can do every operation a traditional file explorer can. Enjoy HDFS Explorer!

    Image 13

    HDFS Explorer is good, but I was bored writing MapReduce code in Notepad++ without proper IntelliSense and indentation. I found an Eclipse plugin for Hadoop MapReduce. Let's go to the next topic.

5. Eclipse Plugin for Hadoop MapReduce Jobs with Simple HDFS Explorer and Auto Code Completion Configuration like Visual Studio

Eclipse IDE - http://www.eclipse.org/downloads/download.php?file=/technology/epp/downloads/release/kepler/SR2/eclipse-jee-kepler-SR2-win32-x86_64.zip
Hadoop MapReduce Plugin for Eclipse - https://github.com/winghc/hadoop2x-eclipse-plugin/archive/master.zip

Let's begin. Download the above "Eclipse Kepler IDE" (250 MB), and also download the Hadoop MapReduce plugin for Eclipse (23 MB).

  1. Extract the Eclipse IDE

    For example: extract the Eclipse IDE to D:\eclipse

  2. Open hadoop2x-eclipse-plugin-master.zip

    goto "release" directory , extract " hadoop-eclipse-kepler-plugin-2.2.0.jar " file into eclispe\plugin folder

    Image 14

    Let Rock and Role.

  3. Open Eclipse IDE (Run As Administrator)

    Choose your own workspace location -> Click OK

    Menu->Window->Open Perspective->Other->Map/Reduce

    Image 15

  4. I love Visual Studio, so I want IntelliSense and code formatting like Visual Studio (somewhat) in Eclipse. Some configuration makes the work easier:

    Menu -> Window -> Preferences -> Java -> Editor -> Content Assist -> "Auto Activation"

    • check Enable auto activation
    • Auto activation delay (ms): 0
    • Auto activation triggers for Java: .(abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ
    • Auto activation triggers for Javadoc: @#

    Apply->Ok

    Image 16

  5. Configure the HDFS and Map/Reduce connection

    Map/Reduce Locations -> New Hadoop location

    Image 17

    Location name: any name you like. The others are the same as given below in the image; don't modify them, because we already configured the MapReduce address and DFS address in c:\hadoop-2.3.0\etc\hadoop.

    Image 18

    Simple HDFS Explorer

    Image 19

  6. File -> New Project -> Map/Reduce Project -> Next

    It shows an error because the Hadoop installation folder is not configured yet.

    Image 20

    Now browse to the installation directory and click Apply.

    Image 21

    Click next

    Image 22

    Important: this is where people usually make a mistake, because Hadoop 2.3 needs JDK 6 for runtime/compilation.

    Image 23

    Click Add Library ->JRE System Library

    Image 24

    Click Installed JREs

    Image 25

    Add -> Standard VM

    Image 26

    Browse to the JDK 1.6 location and click Finish.

    OK -> OK -> Finish -> Finish

    Image 27

    So, the Hadoop 2.3 libraries are added - good. But we get a JDK 1.7 error again; we need JDK 1.6.

    Image 28

    Image 29

    Change to JDK 1.6 -> click OK

    Image 30

    Finally, the Hadoop 2.3 libraries are in place along with JDK 1.6.

    Image 31

Your Eclipse is now configured perfectly for Hadoop MapReduce coding and execution, with IntelliSense. Let's code.

  1. Add New -> Recipe.java in the src folder, then copy and paste the code above.

    Image 32

  2. Right click Recipe.java -> Run As -> Run on Hadoop

    Image 33

  3. Map Reduce Job is Running

    Image 34

  4. Job Completed

    Image 35

Examples

  1. Hadoop : WordCount with Custom Record Reader of TextInputFormat

6. Datasets

  1. Large Public Datasets
  2. Free Large datasets to experiment with Hadoop
  3. Explain patent data set in Hadoop example
  4. 60,000+ Documented UFO Sightings With Text Descriptions And Metadata
  5. Recipe-Items List 

Reference Books

  1. Hadoop MapReduce Cookbook - Srinath Perera
  2. Big Data Analytics: From Strategic Planning to Enterprise Integration with Tools, Techniques, NoSQL, and Graph
  3. Hadoop: The Definitive Guide MapReduce for the Cloud 

Reference Links

  1. Searchcloudcomputing.techtarget.com
  2. Hadoop: What it is, how it works, and what it can do
  3. IBM: What is Hadoop?
  4. Hadoop at Yahoo

Conclusion

I am sure this article will help beginner and intermediate programmers bootstrap Apache Hadoop (a big data analytics framework) in a Windows environment.

Yours Friendly

Prabakaran.A

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)