Start a Pig + Jython job in HDInsight through WebHCat

You can also use HDInsight with Hive + Python.

The drawback of the latter is that it relies on streaming between Hive and Python. In Hadoop, streaming is simply inter-process communication over stdin/stdout, so even simple operations, such as concatenating two fields in Python, may be slow. The good thing is that Hive has built-in functions, as well as user-defined ones, that handle the simple things (like string concatenation).

There’s a way to use Python without streaming: run it in the JVM (remember, Hadoop is written in Java). Python in the JVM is Jython, and Pig (an alternative to Hive that has its own scripting language instead of SQL) can call Jython scripts.

With HDInsight 3.0, which recently became generally available, you can use this feature. To launch the job, I use a Python script, run from a Linux machine, that calls the WebHCat (Templeton) REST API.

The Python script that launches the job is the following:

import requests #http://pypi.python.org/pypi/requests

clusterName='monclusterhadoop'
clusterAdmin='cornac'
clusterPassword='ChangeWithY0urs!'

#get WebHCat status
webHCatUrl='https://' + clusterName + '.azurehdinsight.net/templeton/v1/status'

r = requests.get(webHCatUrl, auth=(clusterAdmin, clusterPassword))

print(r.status_code)
print(r.json())

#submit a pig job:
# http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/pig.html

webHCatUrl='https://' + clusterName + '.azurehdinsight.net/templeton/v1/pig'

# form parameters for the Pig job: user, script location and status directory
pig_params={'user.name':clusterAdmin,
            'file':'wasb://demo@monstockageazure.blob.core.windows.net/scripts/pig_python/pig_python.pig',
            'statusdir': '/wasbwork/pig_from_python'}

r = requests.post(webHCatUrl, auth=(clusterAdmin, clusterPassword), data=pig_params)
print(r.status_code)
print(r.json())
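
The script above only submits the job. As a minimal sketch of how one might wait for it to finish, the following polls WebHCat for the job status. It rests on assumptions that are not in the original script: that the JSON returned by the submission contains the job id under 'id', that the cluster exposes the WebHCat jobs/:jobid resource, and that the status document carries a 'completed' field once the job is done. The helper names job_status_url and wait_for_job are hypothetical.

```python
import time


def job_status_url(cluster_name, job_id):
    """Build the status URL for one job (assumes the WebHCat jobs/:jobid resource)."""
    return ('https://' + cluster_name +
            '.azurehdinsight.net/templeton/v1/jobs/' + job_id)


def wait_for_job(fetch_json, url, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Poll the status URL until the job is reported as completed.

    fetch_json is any callable that takes a URL and returns the decoded JSON
    body as a dict. Returns the last status document, or None if the job
    was still running after max_polls attempts.
    """
    for _ in range(max_polls):
        body = fetch_json(url)
        # Assumption: 'completed' is empty while the job runs and set when it ends.
        if body.get('completed'):
            return body
        sleep(poll_seconds)
    return None
```

With requests, fetch_json could be something like lambda u: requests.get(u, auth=(clusterAdmin, clusterPassword), params={'user.name': clusterAdmin}).json(). Passing the fetch function in keeps the polling logic independent of the HTTP library, which also makes it easy to try out without a cluster.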

The Pig job looks like this:

Register 'wasb://demo@monstockageazure.blob.core.windows.net/scripts/pig_python/pig_python.py' using jython as myfuncs;

a = load 'wasb://demo@monstockageazure.blob.core.windows.net/data/ref_villes' using PigStorage(' ') as (ville:chararray);

b = foreach a generate ville, myfuncs.helloworld(), myfuncs.square(3);

store b into '/wasbwork/pigresult';

The Python script that Pig calls, which defines a few basic sample functions, is the following:

#!/usr/bin/python

@outputSchema("word:chararray")
def helloworld():
    return ('Hello, World')
 
@outputSchema("t:(word:chararray,num:long)")
def complex(word):
    return (str(word),long(word)*long(word))

@outputSchemaFunction("squareSchema")
def square(num):   
    return ((num)*(num))   

@schemaFunction("squareSchema") 
def squareSchema(input):   
    return input   

# No decorator - bytearray 
def concat(str):   
    return str+str
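
The decorators used in this script (outputSchema, outputSchemaFunction, schemaFunction) are supplied by Pig when it runs the file under Jython, so the file does not run as-is in a plain Python interpreter. As a rough sketch for trying the functions locally before uploading them, one can define stand-in decorators first. The stand-ins below are my own simplification, not Pig's implementation: they only record the schema string on the function.

```python
def outputSchema(schema):
    """Stand-in for the decorator Pig injects: remember the schema, keep the function."""
    def decorate(func):
        func.outputSchema = schema
        return func
    return decorate


# For a local try-out, the same stand-in is good enough for the other two decorators.
outputSchemaFunction = outputSchema
schemaFunction = outputSchema


@outputSchema("word:chararray")
def helloworld():
    return 'Hello, World'


@outputSchemaFunction("squareSchema")
def square(num):
    return num * num


@schemaFunction("squareSchema")
def squareSchema(input_schema):
    return input_schema


print(helloworld())
print(square(3))
```

This only checks the Python logic; how Pig maps the returned values to its own types is not exercised here.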

The source data (ref_villes) looks like this (first lines):

paris
marseille
lyon
toulouse
nice
nantes
strasbourg
montpellier
bordeaux
lille
rennes
reims
le havre
saint-etienne
toulon
grenoble

The output (part-m-00000) looks like this:

paris    Hello, World    9
marseille    Hello, World    9
lyon    Hello, World    9
toulouse    Hello, World    9
nice    Hello, World    9
nantes    Hello, World    9
strasbourg    Hello, World    9
montpellier    Hello, World    9
bordeaux    Hello, World    9
lille    Hello, World    9
rennes    Hello, World    9
reims    Hello, World    9
le    Hello, World    9
saint-etienne    Hello, World    9
toulon    Hello, World    9
grenoble    Hello, World    9
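
Note that "le havre" comes out as "le". My reading is that this follows from the load statement: the line is split on the space delimiter and the schema declares a single field, so only the first field is kept. The small sketch below imitates that behaviour in plain Python; first_field is a hypothetical helper, not part of Pig.

```python
def first_field(line, delimiter=' '):
    """Split a line on the delimiter and keep only the first field."""
    return line.split(delimiter)[0]


for city in ['paris', 'le havre', 'saint-etienne']:
    print(first_field(city))
```

A delimiter that never appears in the city names would keep each name whole.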

The execution report (stderr) looks like this:

2014-03-21 11:50:59,951 [main] INFO  org.apache.pig.Main - Apache Pig version 0.12.0.2.0.7.0-1551 (r: unknown) compiled Feb 19 2014, 11:47:04
2014-03-21 11:50:59,951 [main] INFO  org.apache.pig.Main - Logging error messages to: C:\apps\dist\hadoop-2.2.0.2.0.7.0-1551\logs\pig_1395402659935.log
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/C:/apps/dist/hadoop-2.2.0.2.0.7.0-1551/share/hadoop/common/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/C:/apps/dist/pig-0.12.0.2.0.7.0-1551/pig-0.12.0.2.0.7.0-1551.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
2014-03-21 11:51:00,810 [main] INFO  org.apache.pig.impl.util.Utils - Default bootup file D:\Users\hdp/.pigbootup not found
2014-03-21 11:51:00,997 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker is deprecated. Instead, use mapreduce.jobtracker.address
2014-03-21 11:51:00,997 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-03-21 11:51:00,997 [main] INFO  org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: wasb://monclusterhadoop@monstockageazure.blob.core.windows.net
2014-03-21 11:51:01,451 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-03-21 11:51:01,841 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - created tmp python.cachedir=D:\Users\hdp\AppData\Local\Temp\pig_jython_5196260548692206718
2014-03-21 11:51:03,951 [main] WARN  org.apache.pig.scripting.jython.JythonScriptEngine - pig.cmd.args.remainders is empty. This is not expected unless on testing.
2014-03-21 11:51:04,560 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: myfuncs.complex
2014-03-21 11:51:04,560 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: myfuncs.square
2014-03-21 11:51:04,576 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: myfuncs.helloworld
2014-03-21 11:51:04,576 [main] INFO  org.apache.pig.scripting.jython.JythonScriptEngine - Register scripting UDF: myfuncs.concat
2014-03-21 11:51:04,701 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-03-21 11:51:04,951 [main] INFO  org.apache.pig.scripting.jython.JythonFunction - Schema 'word:chararray' defined for func helloworld
2014-03-21 11:51:05,232 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: UNKNOWN
2014-03-21 11:51:05,326 [main] INFO  org.apache.pig.newplan.logical.optimizer.LogicalPlanOptimizer - {RULES_ENABLED=[AddForEach, ColumnMapKeyPrune, DuplicateForEachColumnRewrite, GroupByConstParallelSetter, ImplicitSplitInserter, LimitOptimizer, LoadTypeCastInserter, MergeFilter, MergeForEach, NewPartitionFilterOptimizer, PartitionFilterOptimizer, PushDownForEachFlatten, PushUpFilter, SplitFilter, StreamTypeCastInserter], RULES_DISABLED=[FilterLogicExpressionSimplifier]}
2014-03-21 11:51:05,482 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.textoutputformat.separator is deprecated. Instead, use mapreduce.output.textoutputformat.separator
2014-03-21 11:51:05,763 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler - File concatenation threshold: 100 optimistic? false
2014-03-21 11:51:05,810 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size before optimization: 1
2014-03-21 11:51:05,810 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MultiQueryOptimizer - MR plan size after optimization: 1
2014-03-21 11:51:06,091 [main] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at headnode0/100.86.204.54:9010
2014-03-21 11:51:06,263 [main] INFO  org.apache.pig.tools.pigstats.ScriptState - Pig script settings are added to the job
2014-03-21 11:51:06,263 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.reduce.markreset.buffer.percent is deprecated. Instead, use mapreduce.reduce.markreset.buffer.percent
2014-03-21 11:51:06,263 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - mapred.job.reduce.markreset.buffer.percent is not set, set to default 0.3
2014-03-21 11:51:06,263 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.output.compress is deprecated. Instead, use mapreduce.output.fileoutputformat.compress
2014-03-21 11:51:06,279 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - creating jar file Job608857099241848139.jar
2014-03-21 11:51:14,294 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - jar file Job608857099241848139.jar created
2014-03-21 11:51:14,294 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.jar is deprecated. Instead, use mapreduce.job.jar
2014-03-21 11:51:14,341 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.JobControlCompiler - Setting up single store job
2014-03-21 11:51:14,341 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Key [pig.schematuple] is false, will not generate code.
2014-03-21 11:51:14,341 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Starting process to move generated code to distributed cache
2014-03-21 11:51:14,341 [main] INFO  org.apache.pig.data.SchemaTupleFrontend - Setting key [pig.schematuple.classes] with classes to deserialize []
2014-03-21 11:51:14,404 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 1 map-reduce job(s) waiting for submission.
2014-03-21 11:51:14,404 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.job.tracker.http.address is deprecated. Instead, use mapreduce.jobtracker.http.address
2014-03-21 11:51:14,404 [JobControl] INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at headnode0/100.86.204.54:9010
2014-03-21 11:51:14,513 [JobControl] INFO  org.apache.hadoop.conf.Configuration.deprecation - fs.default.name is deprecated. Instead, use fs.defaultFS
2014-03-21 11:51:16,154 [JobControl] INFO  org.apache.hadoop.mapreduce.lib.input.FileInputFormat - Total input paths to process : 2
2014-03-21 11:51:16,154 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths to process : 2
2014-03-21 11:51:16,185 [JobControl] INFO  org.apache.pig.backend.hadoop.executionengine.util.MapRedUtil - Total input paths (combined) to process : 1
2014-03-21 11:51:16,435 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - number of splits:1
2014-03-21 11:51:16,732 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Submitting tokens for job: job_1395391185318_0006
2014-03-21 11:51:16,732 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Kind: mapreduce.job, Service: job_1395391185318_0005, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@45d45314)
2014-03-21 11:51:16,763 [JobControl] INFO  org.apache.hadoop.mapreduce.JobSubmitter - Kind: RM_DELEGATION_TOKEN, Service: 100.86.204.54:9010, Ident: (owner=cornac, renewer=mr token, realUser=hdp, issueDate=1395402643673, maxDate=1396007443673, sequenceNumber=5, masterKeyId=2)
2014-03-21 11:51:17,154 [JobControl] INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1395391185318_0006 to ResourceManager at headnode0/100.86.204.54:9010
2014-03-21 11:51:17,232 [JobControl] INFO  org.apache.hadoop.mapreduce.Job - The url to track the job: http://headnode0:9014/proxy/application_1395391185318_0006/
2014-03-21 11:51:17,232 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - HadoopJobId: job_1395391185318_0006
2014-03-21 11:51:17,232 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Processing aliases a,b
2014-03-21 11:51:17,232 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - detailed locations: M: a[3,4],b[-1,-1] C:  R: 
2014-03-21 11:51:17,279 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 0% complete
2014-03-21 11:51:34,575 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 50% complete
2014-03-21 11:51:37,981 [main] INFO  org.apache.hadoop.conf.Configuration.deprecation - mapred.reduce.tasks is deprecated. Instead, use mapreduce.job.reduces
2014-03-21 11:51:38,028 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - 100% complete
2014-03-21 11:51:38,028 [main] INFO  org.apache.pig.tools.pigstats.SimplePigStats - Script Statistics: 

HadoopVersion    PigVersion    UserId    StartedAt    FinishedAt    Features
2.2.0.2.0.7.0-1551    0.12.0.2.0.7.0-1551    hdp    2014-03-21 11:51:06    2014-03-21 11:51:38    UNKNOWN

Success!

Job Stats (time in seconds):
JobId    Maps    Reduces    MaxMapTime    MinMapTIme    AvgMapTime    MedianMapTime    MaxReduceTime    MinReduceTime    AvgReduceTime    MedianReducetime    Alias    Feature    Outputs
job_1395391185318_0006    1    0    6    6    6    6    n/a    n/a    n/a    n/a    a,b    MAP_ONLY    /wasbwork/pigresult,

Input(s):
Successfully read 260 records from: "wasb://demo@monstockageazure.blob.core.windows.net/data/ref_villes"

Output(s):
Successfully stored 260 records in: "/wasbwork/pigresult"

Counters:
Total records written : 260
Total bytes written : 0
Spillable Memory Manager spill count : 0
Total bags proactively spilled: 0
Total records proactively spilled: 0

Job DAG:
job_1395391185318_0006


2014-03-21 11:51:38,278 [main] INFO  org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MapReduceLauncher - Success!
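
A report like this is long, and usually only a few lines matter. As a small sketch, assuming the stderr text has already been downloaded into a string, the following pulls out the success marker and the record counts. The function name summarize_pig_report is hypothetical, and the patterns are taken from the wording of the report above.

```python
import re


def summarize_pig_report(stderr_text):
    """Extract the success marker and the record counts from a Pig stderr report."""
    read = re.search(r'Successfully read (\d+) records', stderr_text)
    stored = re.search(r'Successfully stored (\d+) records', stderr_text)
    return {
        'success': 'Success!' in stderr_text,
        'records_read': int(read.group(1)) if read else None,
        'records_stored': int(stored.group(1)) if stored else None,
    }
```

Comparing records_read with records_stored is a quick sanity check for a map-only job like this one, where every input line should produce one output line.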

Benjamin (@benjguin)

Blog Post by: Benjamin GUINEBERTIERE

Web.Config Transforms Per Build

Since the introduction of .NET 4.0, Visual Studio 2012 has supported Web.Config transforms. That is, you can publish a Web.Config file per solution configuration in Visual Studio. This is great if you are attempting to deploy to different environments that would require different settings such as Connection Strings or AppSettings. However, it is limited to […]
Blog Post by: Rob Rastelli

Using the BRE Pipeline Framework to assess and update XML message content using XML vocabularies in Pipelines/Messaging only scenarios

One of the new features that was made available with v1.4.0 of the BRE Pipeline Framework (see this post for a summary of new features) was the ability to make use of XML based vocabulary definitions in the BRE execution policy called by the BRE Pipeline Framework pipeline component. The framework is also able to […]
Blog Post by: Johann

HDInsight + Power BI: a simple example

Last October, I had the opportunity to show how to analyze data from Web logs and Twitter with Pig and Hive in Hadoop, then combine the results in Excel, which makes it possible to carry the result over to Power BI.

Here are the slides and the videos (the videos are the backup recordings I had, not the live presentation that was given, but they are of course very close).

This gives a quick first look at what you can do with an HDInsight cluster. It is a very affordable way (both financially and technically) to get started with Hadoop.

The complete slides are available on OneDrive.

The problem:

If you want to try it yourself, you can go to http://aka.ms/tester-mon-azure, where you will get 150 worth of Windows Azure resources to test for 1 month.

Here are the videos:

Presentation of the data

Creating the cluster
Pig and Hive jobs
Excel and the rest of the job execution
Deleting the cluster

Benjamin (@benjguin)

Blog Post by: Benjamin GUINEBERTIERE

Why can’t I remove my storage account?

You may want to remove a storage account you’ve created and get a message like this one:

Storage account <mystorage> has container(s) which have an active image and/or disk artifacts. Ensure those artifacts are removed from the image repository before deleting this storage account.

Here is what you may want to check. In the management portal (http://manage.windowsazure.com), go to Storage, <mystorage>, Containers, and check the content of your containers, especially the “vhds” one, which contains virtual hard disks by default. Here is an example of the portal with stockageazure3 instead of <mystorage>:

By clicking on the arrow near vhds (or any other container), you’ll find a list of the blobs inside the container. VHDs are good candidates for locks, and we’ll see why in a minute.

Select a .vhd blob and click EDIT at the bottom of the screen.

This will show you the lock:

So where do I unlock?

This is related to the way virtual machines work in Windows Azure. The OS disk and data disks live in Windows Azure blob storage. Here is an image of that (here with Windows VMs, but this is very similar with Linux VMs):

So VHD blobs are virtual hard disks that may be used by virtual machines; Windows Azure doesn’t want you to remove a virtual machine disk without knowing about it! The locks are held by images and disks, which you can find here:

For an image or a disk, you can see the referenced blob in the LOCATION column. Here is an example:

In this example, myCentoOSImage references the bueearwy.zbn201304031550350920.vhd blob in the vhds container of the northeurope2affstorage storage account, hence the URI http://northeurope2affstorage.blob.core.windows.net/vhds/bueearwy.zbn201304031550350920.vhd.

At the bottom of the screen, you can remove the image and optionally also remove the associated VHD:

The same applies to DISKS.

A disk or an image may itself be locked by a virtual machine instance. In such a case, you may have to stop and remove the virtual machine first.

For example, in the following screenshot, the benjguinu1 virtual machine holds the benjguinmisc-benjguinu1-0-201310042100440782 disk, which itself locks the http://stockageazure2.blob.core.windows.net/vhds/benjguinmisc-benjguinu1-2013-10-04.vhd blob.

So we have

and you cannot remove the referenced blobs without removing the disks, images and VM involved in that chain.
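
The chain described above can be written down as a small model, which may help when working out what to remove first. This is only an illustration in plain Python, not a call to any Windows Azure API: the locked_by dictionary and the removal_order function are hypothetical, and the names are those of the example above.

```python
def removal_order(locked_by, target):
    """List what to remove, outermost lock holder first, ending with the target."""
    order = [target]
    current = target
    # Walk up the chain: blob -> disk or image -> virtual machine.
    while current in locked_by and locked_by[current] not in order:
        current = locked_by[current]
        order.append(current)
    return list(reversed(order))


blob = 'http://stockageazure2.blob.core.windows.net/vhds/benjguinmisc-benjguinu1-2013-10-04.vhd'
disk = 'benjguinmisc-benjguinu1-0-201310042100440782'
vm = 'benjguinu1'

# Each resource maps to whatever currently holds a lock on it.
locked_by = {blob: disk, disk: vm}

for resource in removal_order(locked_by, blob):
    print(resource)
```

Here the virtual machine comes first, then the disk, then the blob, which matches the order in which the portal lets you remove them.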

Benjamin (@benjguin)

Blog Post by: Benjamin GUINEBERTIERE

Viasfora v1.6 Released

Today I published a new update to my Viasfora extension for Visual Studio 2010-2013. One of the new features in this build is a text editor margin that could be useful to other fellow developers working on extending the Visual Studio Text Editor. One of the reasons why I implemented this was that I when […]
Blog Post by: Tomas Restrepo

BizTalk360 installation failed: “error status: 1603”

Recently we got a client who was unable to complete the installation of BizTalk360. Although all checks indicated the prerequisites were met, there was an issue on completing the installation wizard for some reason. In the first step they filled all info for creating the IIS virtual directory and application pool. But for some reason, […]
Blog Post by: mitchke