A preview version of Hadoop on Windows Azure is available. The details of that availability are at:

availability-of-community-technology-preview-ctp-of-hadoop-based-service-on-windows-azure.aspx

A good introduction to what Hadoop and Map/Reduce are is available at:

http://developer.yahoo.com/hadoop/tutorial/module4.html

As a developer using Hadoop, you write a mapper function and a reducer function, and Hadoop does the rest:
– distribute the code to the nodes where the data resides
– execute the code on those nodes
– provide each reducer with all the values that the mappers emitted for a same key
One of the most commonly used examples is WordCount.
In this WordCount example,
– the mapper function emits each word found as a key, and 1 as the value.
– the reducer function adds the values for the same key.
– Thus, you get each word and its number of occurrences as the result of the map/reduce.
This sample can be found in different places, including:

http://wiki.apache.org/hadoop/WordCount

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html

Let’s try this on a Hadoop on Azure cluster, after having changed the code to keep only words made of the letters a to z and at least 4 letters long.
Here is the code:
package com.benjguin.hadoopSamples;

import java.io.IOException; 
import java.util.*; 
  
import org.apache.hadoop.fs.Path; 
import org.apache.hadoop.io.*; 
import org.apache.hadoop.mapred.*; 

public class WordCount {
  // Mapper: tokenizes each input line and emits (word, 1) for every word that passes the filter
  public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> { 
        private final static IntWritable one = new IntWritable(1); 
        private Text word = new Text(); 
      
        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 
          String line = value.toString(); 
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) { 
            String[] wordsToCount = Utils.wordsToCount(tokenizer.nextToken());
            for (int i=0; i<wordsToCount.length; i++) {
                if (Utils.countThisWord(wordsToCount[i])) {
                    word.set(wordsToCount[i]);
                    output.collect(word, one);
                }
            }
          } 
        } 
      } 

  // Reducer (also used as the combiner): sums all the 1s emitted for a given word
  public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> { 
    public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException { 
      int sum = 0; 
      while (values.hasNext()) { 
        sum += values.next().get(); 
      } 
      output.collect(key, new IntWritable(sum)); 
    } 
  } 
          
  // Job setup: wires the mapper, combiner and reducer, and takes the input and output paths from the command line
  public static void main(String[] args) throws Exception { 
    JobConf conf = new JobConf(WordCount.class); 
    conf.setJobName("wordcount"); 
  
    conf.setOutputKeyClass(Text.class); 
    conf.setOutputValueClass(IntWritable.class); 
  
    conf.setMapperClass(Map.class); 
    conf.setCombinerClass(Reduce.class); 
    conf.setReducerClass(Reduce.class); 
  
    conf.setInputFormat(TextInputFormat.class); 
    conf.setOutputFormat(TextOutputFormat.class); 
  
    FileInputFormat.setInputPaths(conf, new Path(args[0])); 
    FileOutputFormat.setOutputPath(conf, new Path(args[1])); 
  
    JobClient.runJob(conf); 
  } 
}

and

package com.benjguin.hadoopSamples;

public class Utils {
    // Lowercases the token and splits it on every non-letter character
    public static String[] wordsToCount(String word) {
        return word.toLowerCase().split("[^a-zA-Z]");
    }
    
    // Only words with at least 4 letters are counted
    public static boolean countThisWord(String word) {
        return word.length() > 3;
    }
}
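To make the filtering concrete, here is a small, hypothetical harness (not part of the original post) showing what these two helpers do with a sample token:

package com.benjguin.hadoopSamples;

public class UtilsDemo {
    public static void main(String[] args) {
        // the string is lowercased, then split on every non-letter character
        for (String w : Utils.wordsToCount("Leonardo's notebooks, 1519!")) {
            // only tokens with more than 3 letters are kept
            if (Utils.countThisWord(w)) {
                System.out.println(w); // prints "leonardo" then "notebooks"
            }
        }
    }
}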

The first step is to compile the code and generate a JAR file. This can be done with Eclipse, for instance.
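Outside Eclipse, the JAR can also be built from the command line; a minimal sketch, assuming the sources are under src/ and the cluster’s Hadoop core JAR (0.20.x here) is in the current directory — the file names are examples:

mkdir classes
javac -classpath hadoop-core-0.20.2.jar -d classes src/com/benjguin/hadoopSamples/*.java
jar cvf wordcount.jar -C classes .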

We also need some data. For that, it is possible to download a few books from Project Gutenberg.
Then, a Hadoop on Azure cluster is requested, as explained at:

http://social.technet.microsoft.com/wiki/contents/articles/6225.aspx

Let’s upload the files to HDFS (Hadoop’s distributed file system) by using the interactive JavaScript console.
NB: for large volumes of data, FTPS would be a better option; please refer to How To FTP Data To Hadoop on Windows Azure.
Let’s create an HDFS folder and upload the 3 books into it, as sketched below.
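In the interactive JavaScript console of that preview, this was done with an HDFS shell command and an upload helper; the folder name is just an example, and the exact helper names should be treated as assumptions:

js> #mkdir books
js> fs.put()  // opens a dialog to choose the local file and its HDFS destination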

Then it is possible to create the job.
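On the portal’s job creation page, this boils down to running the equivalent of the following command (the JAR name and HDFS paths are examples; the main class comes from the code above), which produces the trace below:

hadoop jar wordcount.jar com.benjguin.hadoopSamples.WordCount books booksOutput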

11/12/19 17:51:27 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/19 17:51:27 INFO mapred.JobClient: Running job: job_201112190923_0004
11/12/19 17:51:28 INFO mapred.JobClient: map 0% reduce 0%
11/12/19 17:51:53 INFO mapred.JobClient: map 25% reduce 0%
11/12/19 17:51:54 INFO mapred.JobClient: map 75% reduce 0%
11/12/19 17:51:55 INFO mapred.JobClient: map 100% reduce 0%
11/12/19 17:52:14 INFO mapred.JobClient: map 100% reduce 100%
11/12/19 17:52:25 INFO mapred.JobClient: Job complete: job_201112190923_0004
11/12/19 17:52:25 INFO mapred.JobClient: Counters: 26
11/12/19 17:52:25 INFO mapred.JobClient: Job Counters 
11/12/19 17:52:25 INFO mapred.JobClient: Launched reduce tasks=1
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=57703
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Launched map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: Data-local map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18672
11/12/19 17:52:25 INFO mapred.JobClient: File Input Format Counters 
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Read=1554158
11/12/19 17:52:25 INFO mapred.JobClient: File Output Format Counters 
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Written=186556
11/12/19 17:52:25 INFO mapred.JobClient: FileSystemCounters
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_READ=427145
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_READ=1554642
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=964132
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=186556
11/12/19 17:52:25 INFO mapred.JobClient: Map-Reduce Framework
11/12/19 17:52:25 INFO mapred.JobClient: Map output materialized bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Map input records=19114
11/12/19 17:52:25 INFO mapred.JobClient: Reduce shuffle bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Spilled Records=60442
11/12/19 17:52:25 INFO mapred.JobClient: Map output bytes=1482365
11/12/19 17:52:25 INFO mapred.JobClient: Map input bytes=1535450
11/12/19 17:52:25 INFO mapred.JobClient: Combine input records=135431
11/12/19 17:52:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=484
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input groups=17618
11/12/19 17:52:25 INFO mapred.JobClient: Combine output records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce output records=17618
11/12/19 17:52:25 INFO mapred.JobClient: Map output records=135431


Go back to the interactive JavaScript console.

This generates another Map/Reduce job that will sort the results.
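The console exposed a Pig-style fluent API for this kind of post-processing; the sketch below is only an illustration, and its method names, schema syntax, and paths are assumptions rather than the documented API:

js> pig.from("booksOutput", "word, count:long").orderBy("count DESC").take(10).to("top10words")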


Then, it is possible to retrieve the data and show it in a chart.
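Reading the sorted result back and charting it looked roughly like this in the console (helper names such as fs.read, parse, and graph.bar are recalled from that preview and should be treated as assumptions):

js> file = fs.read("top10words")
js> data = parse(file.data, "word, count:long")
js> graph.bar(data)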


It is also possible to get a more complete console by connecting to the cluster through Remote Desktop (RDP).

Benjamin
