A preview version of Hadoop on Windows Azure is available. The details of that availability are at
availability-of-community-technology-preview-ctp-of-hadoop-based-service-on-windows-azure.aspx
A good introduction to Hadoop and MapReduce is available at
http://developer.yahoo.com/hadoop/tutorial/module4.html
As a developer using Hadoop, you write a mapper function and a reducer function, and Hadoop does the rest:
– distribute the code to the nodes where the data resides
– execute the code on those nodes
– provide each reducer with all the values the mappers emitted for a given key
One of the most frequently used examples is WordCount.
In this WordCount example:
– the mapper function emits each word found as a key, and 1 as the value
– the reducer function adds the values for the same key
– thus, you get each word and its number of occurrences as the result of the map/reduce
This sample can be found in different places, including:
http://wiki.apache.org/hadoop/WordCount
http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html
Let's try this on a Hadoop on Azure cluster, after having changed the code to keep only words made of the letters a to z and containing at least 4 letters.
Here is the code:
package com.benjguin.hadoopSamples;

import java.io.IOException;
import java.util.*;

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

public class WordCount {
    public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                String[] wordsToCount = Utils.wordsToCount(tokenizer.nextToken());
                for (int i = 0; i < wordsToCount.length; i++) {
                    if (Utils.countThisWord(wordsToCount[i])) {
                        word.set(wordsToCount[i]);
                        output.collect(word, one);
                    }
                }
            }
        }
    }

    public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCount.class);
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(Map.class);
        conf.setCombinerClass(Reduce.class);
        conf.setReducerClass(Reduce.class);

        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        JobClient.runJob(conf);
    }
}
and
package com.benjguin.hadoopSamples;

public class Utils {
    public static String[] wordsToCount(String word) {
        return word.toLowerCase().split("[^a-zA-Z]");
    }

    public static boolean countThisWord(String word) {
        return word.length() > 3;
    }
}
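As an aside, here is a minimal, hypothetical snippet (not part of the original sample) that shows what these two helpers do with a sample token: wordsToCount lowercases the token and splits it on every non-letter character, and countThisWord keeps only fragments of at least 4 letters.

package com.benjguin.hadoopSamples;

// Hypothetical demo class, only to illustrate the two helpers above.
public class UtilsDemo {
    public static void main(String[] args) {
        // "Azure--cloud!" is lowercased and split on every non-letter character
        for (String w : Utils.wordsToCount("Azure--cloud!")) {
            // prints: 'azure' -> true, '' (empty fragment) -> false, 'cloud' -> true
            System.out.println("'" + w + "' -> " + Utils.countThisWord(w));
        }
    }
}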
The first step is to compile the code and generate a JAR file. This can be done with Eclipse, for instance:
We also need some data. For that, it is possible to download a few books from Project Gutenberg.
Then, a Hadoop on Azure cluster is requested, as explained here:
http://social.technet.microsoft.com/wiki/contents/articles/6225.aspx
Let's upload the files to HDFS (Hadoop's distributed file system) by using the interactive JavaScript console:
NB: for large volumes of data, FTPS would be a better option. Please refer to How To FTP Data To Hadoop on Windows Azure.
Let's create a folder and upload the 3 books into that HDFS folder.
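For completeness, the same upload could also be scripted against the Hadoop Java file system API rather than done through the console. The sketch below is only an illustration and is not part of the original walkthrough; the HDFS folder name "books" and the use of local file paths as program arguments are assumptions.

package com.benjguin.hadoopSamples;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Hypothetical helper: creates an HDFS folder and copies the downloaded books into it.
public class UploadBooks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();    // picks up the cluster configuration (core-site.xml)
        FileSystem fs = FileSystem.get(conf);
        Path folder = new Path("books");             // hypothetical HDFS destination folder
        fs.mkdirs(folder);
        for (String localBook : args) {              // local paths of the books, passed on the command line
            fs.copyFromLocalFile(new Path(localBook), folder);
        }
        fs.close();
    }
}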
Then it is possible to create the job.
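The job definition points to the uploaded JAR file and passes the input and output HDFS folders as parameters. On a plain Hadoop command line, the equivalent invocation would look roughly like this (the JAR name and the folder names are assumptions, not taken from the original post):

    hadoop jar wordcount.jar com.benjguin.hadoopSamples.WordCount books wordcount_output

Running the job produces output similar to the following log.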
11/12/19 17:51:27 INFO mapred.FileInputFormat: Total input paths to process : 3
11/12/19 17:51:27 INFO mapred.JobClient: Running job: job_201112190923_0004
11/12/19 17:51:28 INFO mapred.JobClient: map 0% reduce 0%
11/12/19 17:51:53 INFO mapred.JobClient: map 25% reduce 0%
11/12/19 17:51:54 INFO mapred.JobClient: map 75% reduce 0%
11/12/19 17:51:55 INFO mapred.JobClient: map 100% reduce 0%
11/12/19 17:52:14 INFO mapred.JobClient: map 100% reduce 100%
11/12/19 17:52:25 INFO mapred.JobClient: Job complete: job_201112190923_0004
11/12/19 17:52:25 INFO mapred.JobClient: Counters: 26
11/12/19 17:52:25 INFO mapred.JobClient: Job Counters
11/12/19 17:52:25 INFO mapred.JobClient: Launched reduce tasks=1
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=57703
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
11/12/19 17:52:25 INFO mapred.JobClient: Launched map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: Data-local map tasks=4
11/12/19 17:52:25 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=18672
11/12/19 17:52:25 INFO mapred.JobClient: File Input Format Counters
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Read=1554158
11/12/19 17:52:25 INFO mapred.JobClient: File Output Format Counters
11/12/19 17:52:25 INFO mapred.JobClient: Bytes Written=186556
11/12/19 17:52:25 INFO mapred.JobClient: FileSystemCounters
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_READ=427145
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_READ=1554642
11/12/19 17:52:25 INFO mapred.JobClient: FILE_BYTES_WRITTEN=964132
11/12/19 17:52:25 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=186556
11/12/19 17:52:25 INFO mapred.JobClient: Map-Reduce Framework
11/12/19 17:52:25 INFO mapred.JobClient: Map output materialized bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Map input records=19114
11/12/19 17:52:25 INFO mapred.JobClient: Reduce shuffle bytes=426253
11/12/19 17:52:25 INFO mapred.JobClient: Spilled Records=60442
11/12/19 17:52:25 INFO mapred.JobClient: Map output bytes=1482365
11/12/19 17:52:25 INFO mapred.JobClient: Map input bytes=1535450
11/12/19 17:52:25 INFO mapred.JobClient: Combine input records=135431
11/12/19 17:52:25 INFO mapred.JobClient: SPLIT_RAW_BYTES=484
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce input groups=17618
11/12/19 17:52:25 INFO mapred.JobClient: Combine output records=30221
11/12/19 17:52:25 INFO mapred.JobClient: Reduce output records=17618
11/12/19 17:52:25 INFO mapred.JobClient: Map output records=135431
Go back to the interactive JavaScript console.
This generates another Map/Reduce job that will sort the result.
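The console generates and submits that sort job itself. Purely as an illustration of what such a job can look like (this is not the code the console produces), here is a hedged Java sketch that reads the word/count pairs produced by WordCount and lets the framework sort them by count; the class name and the input/output folders are hypothetical.

package com.benjguin.hadoopSamples;

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;

// Hypothetical sketch of a second job that sorts WordCount's output by number of occurrences.
public class SortByCount {
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, LongWritable, Text> {
        public void map(LongWritable key, Text value,
                OutputCollector<LongWritable, Text> output, Reporter reporter) throws IOException {
            // each line produced by WordCount looks like "word<TAB>count"
            String[] parts = value.toString().split("\t");
            if (parts.length == 2) {
                output.collect(new LongWritable(Long.parseLong(parts[1])), new Text(parts[0]));
            }
        }
    }

    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SortByCount.class);
        conf.setJobName("sortbycount");
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        conf.setMapperClass(Map.class);
        // no reducer is set: the default identity reducer writes the keys out
        // in the order the framework sorted them; one reducer gives a single sorted file
        conf.setNumReduceTasks(1);
        // sort the counts in decreasing order so the most frequent words come first
        conf.setOutputKeyComparatorClass(LongWritable.DecreasingComparator.class);
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // e.g. the WordCount output folder
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));  // e.g. a new folder for the sorted result
        JobClient.runJob(conf);
    }
}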
Then, it is possible to get the data and show it in a chart.
It is also possible to have a more complete console by using Remote Desktop (RDP).
Blog Post by: Benjamin GUINEBERTIERE