How to use HDInsight from Linux

HDInsight is very easy to use from PowerShell, but how would you create and delete a cluster from Linux? How would you submit a job and get the result?

Here is a simple sample, with pointers to further documentation.

1. Create a cluster

You can create a cluster with the Windows Azure Command Line Interface (CLI).

To install the CLI, go to the downloads page of http://windowsazure.com. At the bottom of the page there are two links: one to the CLI itself and one to its documentation.

Once you have installed it, you get an azure command line with many options.

The following bash script will create a cluster:

#!/bin/bash
# create an HDInsight cluster

# more information at http://www.windowsazure.com/en-us/documentation/articles/hdinsight-administer-use-command-line/

defaultStorageAccount='monstockageazure'
storageAccount2='wasbshared'
clusterName='monclusterhadoop'
clusterContainerName='monclusterhadoop2'
clusterVersion='2.1'
clusterAdmin='cornac'
clusterConfigFile='./hdinsightCluster.config'

subscription='demos874F33876Y'

clusterPassword='YHqj6sq#ap9'
defaultStorageAccountKey='9O5uEqY1MsT6LIKifmXL0bQgrQElbslvu4N6mX58mSpPa4sPtYPTL5YjvLvcQAItuw87BdLulZWnGJWZ/VCd6Q=='
storageAccount2Key='7on846mc+5u9AItkVIEYz1OXwJZ86gN7o7ExURXO3qWJy+jNO56EtfUmRur+/qKkFGc4drA4GvBmhYGiBMlj3g=='

azure account set $subscription

azure hdinsight cluster config create $clusterConfigFile
azure hdinsight cluster config set $clusterConfigFile --clusterName $clusterName --nodes 3 --location "North Europe" --storageAccountName "$defaultStorageAccount.blob.core.windows.net" --storageAccountKey "$defaultStorageAccountKey" --storageContainer "$clusterName" --username "$clusterAdmin" --clusterPassword "$clusterPassword"
azure hdinsight cluster config storage add $clusterConfigFile --storageAccountName "$storageAccount2.blob.core.windows.net" --storageAccountKey "$storageAccount2Key"

azure hdinsight cluster create --config $clusterConfigFile

2. Submit a job

HDInsight exposes an Apache REST API called WebHCat (formerly named Templeton), which lets you submit jobs. It is documented at https://cwiki.apache.org/confluence/display/Hive/WebHCat.

There are many ways to call a REST API from Linux. The one I chose for this post is Python. For this sample, install the “requests” module:

pip install requests

Then you can run the following script (02_submit_hive_job.py):

import requests #http://pypi.python.org/pypi/requests

clusterName='monclusterhadoop'
clusterAdmin='cornac'
clusterPassword='YHqj6sq#ap9'

#get WebHCat status
webHCatUrl='https://' + clusterName + '.azurehdinsight.net/templeton/v1/status'

r = requests.get(webHCatUrl, auth=(clusterAdmin, clusterPassword))

print(r.status_code)
print(r.json())

#submit a hive job:
# SELECT * FROM hivesampletable limit 10
# http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/hive.html

webHCatUrl='https://' + clusterName + '.azurehdinsight.net/templeton/v1/hive'

hive_params={'user.name':clusterAdmin,
             'execute':'SELECT * FROM hivesampletable limit 10',
             'statusdir': '/wasbwork/hive_from_python'}

r = requests.post(webHCatUrl, auth=(clusterAdmin, clusterPassword), data=hive_params)
print(r.status_code)
print(r.json())

with the following command line:

python 02_submit_hive_job.py

In my case, I got the following result:

benjguin@benjguinu2:~/dev/hdinsight_from_linux$ python 02_submit_hive_job.py
200
{u'status': u'ok', u'version': u'v1'}
200
{u'id': u'job_201402171346_0002'}

You can also get the status of a job, submit Pig jobs, or submit Hive jobs from scripts that you uploaded to Windows Azure Storage Blob. Here is a link to the documentation by Hortonworks:

http://docs.hortonworks.com/HDPDocuments/HDP1/HDP-Win-1.3.0/ds_HCatalog/hive.html

The page has a table of contents on the left.
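The job id returned by the submission can be used to poll WebHCat until the job finishes. The following is a minimal sketch and not part of the original scripts. It assumes that the status document contains a status object with a boolean jobComplete field, and that the status resource is templeton/v1/queue/<jobid> (newer WebHCat versions expose jobs/<jobid> instead). The names is_job_complete, wait_for_job and make_fetcher are hypothetical; check the WebHCat reference for the version running on your cluster.

```python
import time


def is_job_complete(job_info):
    """Return True when a WebHCat job-status document reports completion.

    Assumption: the document has a 'status' object with a boolean
    'jobComplete' field.
    """
    status = job_info.get('status') or {}
    return bool(status.get('jobComplete'))


def wait_for_job(fetch_job_info, poll_seconds=10, max_polls=60, sleep=time.sleep):
    """Call fetch_job_info() until the job completes or max_polls is reached."""
    for _ in range(max_polls):
        info = fetch_job_info()
        if is_job_complete(info):
            return info
        sleep(poll_seconds)
    raise RuntimeError('job did not complete after %d polls' % max_polls)


def make_fetcher(cluster_name, admin, password, job_id):
    """Build a fetch function that reads the job status over HTTPS."""
    import requests  # imported here so the polling logic has no dependency

    # Assumption: 'queue/<jobid>' is the status resource on this cluster.
    url = 'https://%s.azurehdinsight.net/templeton/v1/queue/%s' % (cluster_name, job_id)

    def fetch():
        r = requests.get(url, auth=(admin, password), params={'user.name': admin})
        r.raise_for_status()
        return r.json()

    return fetch
```

With the values used in this post, a call could look like wait_for_job(make_fetcher(clusterName, clusterAdmin, clusterPassword, 'job_201402171346_0002')). Passing the fetch function as a parameter keeps the polling loop independent of the HTTP library.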

3. Get the result

In the Python script, we asked for the result to be written to /wasbwork/hive_from_python, so it is stored in Windows Azure Storage Blob, or wasb (in HDInsight, wasb is the default file system; HDFS is also available at hdfs://namenodehost:9000/). Once the job is finished, which a script can detect through the same REST API, the output files are available in that folder, including the stdout file used below.

 

You can download the result with the azure CLI and display it with this bash script:

#!/bin/bash

defaultStorageAccount='monstockageazure'
clusterName='monclusterhadoop'
defaultStorageAccountKey='9O5uEqY1MsT6LIKifmXL0bQgrQElbslvu4N6mX58mSpPa4sPtYPTL5YjvLvcQAItuw87BdLulZWnGJWZ/VCd6Q=='

export AZURE_STORAGE_ACCOUNT="$defaultStorageAccount"
export AZURE_STORAGE_ACCESS_KEY="$defaultStorageAccountKey"

azure storage blob download $clusterName wasbwork/hive_from_python/stdout
cat wasbwork/hive_from_python/stdout

In my case, this gave the following result:

benjguin@benjguinu2:~/dev/hdinsight_from_linux$ ./03_get_result.sh
info:    Executing command storage blob download
+ Download blob wasbwork/hive_from_python/stdout in container monclusterhadoop to wasbwork/hive_from_python/stdout
Percentage: 100.0% (809.00B/809.00B) Average Speed: 809.00B/S Elapsed Time: 00:00:00
+ Getting Storage blob information
info:    File saved as wasbwork/hive_from_python/stdout
info:    storage blob download command OK
8       18:54:20        en-US   Android Samsung SCH-i500        California      United States   13.9204007      0       0
23      19:19:44        en-US   Android HTC     Incredible      Pennsylvania    United States   NULL    0       0
23      19:19:46        en-US   Android HTC     Incredible      Pennsylvania    United States   1.4757422       0       1
23      19:19:47        en-US   Android HTC     Incredible      Pennsylvania    United States   0.245968        0       2
28      01:37:50        en-US   Android Motorola        Droid X Colorado        United States   20.3095339      1       1
28      00:53:31        en-US   Android Motorola        Droid X Colorado        United States   16.2981668      0       0
28      00:53:50        en-US   Android Motorola        Droid X Colorado        United States   1.7715228       0       1
28      16:44:21        en-US   Android Motorola        Droid X Utah    United States   11.6755987      2       1
28      16:43:41        en-US   Android Motorola        Droid X Utah    United States   36.9446892      2       0
28      01:37:19        en-US   Android Motorola        Droid X Colorado        United States   28.9811416      1       0
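If you want to process the downloaded file in a script rather than just display it, the rows can be split into fields. This is a small sketch which assumes that the Hive output is tab-separated and that NULL marks a missing value (the listing above shows the tabs as spaces). parse_hive_stdout is a hypothetical helper name.

```python
def parse_hive_stdout(text, null_token='NULL'):
    """Split tab-separated Hive output into rows of fields.

    Fields equal to null_token become None; blank lines are skipped.
    All other fields are kept as strings.
    """
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split('\t')
        rows.append([None if field == null_token else field for field in fields])
    return rows
```

For example, parse_hive_stdout(open('wasbwork/hive_from_python/stdout').read()) would return one list per result row of the file downloaded by the bash script.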

4. Remove the cluster

In order to remove the cluster, the azure CLI will also help:

#!/bin/bash

clusterName='monclusterhadoop'

azure hdinsight cluster delete $clusterName

This produces the following sample result:

benjguin@benjguinu2:~/dev/hdinsight_from_linux$ ./04_removeCluster.sh
info:    Executing command hdinsight cluster delete
+ Removing HDInsight Cluster
info:    hdinsight cluster delete command OK
benjguin@benjguinu2:~/dev/hdinsight_from_linux$

Conclusion

This post only shows a few simple examples. The goal is to show the principles that can be used. The azure CLI is used to manage the cluster itself, and may also be used to interact with Windows Azure Storage blobs. Submitting jobs can be done with WebHCat REST calls.

Benjamin (@benjguin)

Blog Post by: Benjamin GUINEBERTIERE

BizTalk Server Tip #15: Split big messages in the receive pipeline

If you have a big message, consider using envelope schemas and the default pipelines to split the message at its entry point into BizTalk, for best performance and efficient resource utilization. This method also avoids the need to create custom pipelines or to do very expensive splitting in orchestrations.

The post BizTalk Server Tip #15: Split big messages in the receive pipeline appeared first on BizTalk360 Blog.

Blog Post by: Ricardo Torre

BizTalk Summit 2014 – London | March 3rd & 4th | London, England | BizTalk Mapping Patterns and an Introduction to WABS maps

For the third consecutive year, BizTalk Innovation Day events will be held in several European cities. As last year, the tour starts with our major event, which is back in London, even bigger and better! 12 Integration MVPs from across the world (USA, Canada, India, Netherlands, Norway, Portugal, Italy, Belgium and of course […]
Blog Post by: Sandro Pereira

BizTalk Server Tip #14: Use the Business Rule Engine to implement Business Logic

Use the Business Rule Engine (BRE) to implement business logic that is modular, reusable and simple. It lets you operate on information contained in .NET objects, database tables and XML documents. BRE also enables developers to create and maintain applications with minimal effort. The Business Rule Engine can be a good way to modularize […]

The post BizTalk Server Tip #14: Use the Business Rule Engine to implement Business Logic appeared first on BizTalk360 Blog.

Blog Post by: Ricardo Torre

Using Variable Mapping in a WCF-WebHttp Send Port without using promoted properties

When you call a REST Service in BizTalk 2013, there can be scenarios in which an ID, or any other query variable, must be determined at runtime. To enable such scenarios, you specify variables for the HTTP Method URL Mapping. A variable normally maps to a promoted field in a message, but there are some scenarios where you cannot use promoted properties.

In this sample I’m going to call a REST Service and update an Event for a specific Customer with BizTalk. To start the process I receive a CustomerEvent message with a Customer Id. Because the REST service expects a data message I have to transform it to the data message and post the data message to the REST service.

 

 

 

 

REST Service called by SoapUI

The Customer Id also needs to be set dynamically in the URI at runtime. I cannot use property promotion for this, because the data message has no customer_id field.
To solve this, I put the customer_id field from the CustomerEvent message in a custom context property, which travels with the data message.

Steps

The following steps show how you can use variable mapping and context properties without using promoted properties.

Generate schemas for the CustomerEvent message and the data message.
 
Create a Property schema.
 
In the Property Schema Base property you can specify if the property is a Message Data property or a Message Context property.
 
Create a BizTalk Map to transform the CustomerEvent message to the data message.
 
Create an Orchestration to execute the Map and set the custom context property in a Message Assignment shape. (The customer_id is a distinguished field and is not promoted.)
 

Create a Send Port in the BizTalk Administration Console. Specify WCF-WebHttp for the Type option in the Transport section of the General tab.
Also specify variables for the HTTP Method URL Mapping, providing the variable component of the URL within curly brackets { }.

 
In the Variable Mapping section, click the Edit button to specify where the value for the variable ID must be picked from at runtime. Under the Variable column, the dialog box lists the variables that you defined for the URL Mapping. In the Property Name field, you must specify the name of the property that provides the value for the variable.
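To make the idea concrete, here is a small Python sketch of what the variable mapping achieves: each {variable} in the URL template is replaced by the value of the context property it is mapped to. It only illustrates the concept and is not how the WCF-WebHttp adapter is implemented. The function name expand_url and the URL in the example are hypothetical.

```python
import re


def expand_url(url_template, variable_mapping, context_properties):
    """Replace each {variable} in url_template with a context property value.

    variable_mapping:   {variable name: property name}
    context_properties: {property name: value} carried with the message
    """
    def substitute(match):
        variable = match.group(1)
        property_name = variable_mapping[variable]  # KeyError if unmapped
        return str(context_properties[property_name])

    return re.sub(r'\{(\w+)\}', substitute, url_template)
```

For example, expand_url('/customers/{id}/events', {'id': 'customer_id'}, {'customer_id': 42}) returns '/customers/42/events'. The value comes from the message context, so the message body does not need to contain a customer_id field.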
  

 

Conclusion

If you have to call a REST Service with BizTalk and the URI has to be set dynamically at runtime, you can normally use property promotion. However, there are scenarios where you cannot use promoted properties, and in such cases custom context properties are a really good alternative!

You can download the BizTalk sample with the source code here:
http://code.msdn.microsoft.com/Using-Variable-Mapping-in-2d52d9ef

European Tour 2014

As I look at the calendar and see some important dates are quickly approaching, I thought I better put together a quick blog post to highlight some of the events that I will be speaking at in early March.

I will be using the same content at all events but am happy to talk offline about anything that you have seen in this blog or my presentation from Norway this past September.

The title of my session this time around is: Exposing Operational data to Mobile devices using Windows Azure and here is the session’s abstract:

In this session Kent will take a real world business scenario from the Power Generation industry. The scenario involves real time data collection, power generation commitments made to market stakeholders and current energy prices. A Power Generation company needs to monitor all of these data points to ensure it is maintaining its commitments to the marketplace. When things do not go as planned, there are often significant penalties at stake. Having real time visibility into these business measures and being notified when the business becomes non-compliant becomes extremely important.
Learn how Windows Azure and many of its building blocks (Azure Service Bus, Azure Mobile Services) and BizTalk Server 2013 can address these requirements and provide Operations people with real time visibility into the state of their business processes.

London – March 3rd and March 4th

The first stop on the tour is London where I will be speaking at BizTalk360’s BizTalk Summit 2014.  This is a 2 day paid conference event which has allowed BizTalk360 to bring in experts from all over the world to speak at this event.  This includes speakers from Canada (me), my neighbor, the United States, Italy, Norway, Portugal, Belgium, the Netherlands and India.  These experts include many Integration MVPs and the product group from Microsoft.

There are still a few tickets available for this event so I would encourage you to act quickly to avoid being disappointed.  This will easily be the biggest Microsoft Integration event in Europe this year with a lot of new content.

Stockholm – March 5th

After the London event, Steef-Jan Wiggers and I will be jumping on a plane and heading to Stockholm to visit our good friend Johan Hedberg and the Swedish BizTalk Usergroup.  This will be my third time speaking in Stockholm and 4th time speaking in Scandinavia.  I really enjoy speaking in Stockholm and am very much looking forward to returning to Sweden.  I just really hope that they don’t win the Gold Medal in Men’s Hockey at the Olympics, otherwise I won’t hear the end of it.

I am also not aware of any Triathlons going on in Sweden at this time so I should be safe from participating in any adventure sports.

At this point an EventBrite is not available but watch the BizTalk Usergroup Sweden site or my twitter handle (@wearsy) for more details. 

Netherlands – March 6th

The 3rd and last stop on the tour is the Netherlands where I will be speaking at the Dutch BizTalk User Group.  Steef-Jan Wiggers will also be speaking as will René Brauwers.  This will be my second trip to the Netherlands but my first time speaking here. I am very much looking forward to coming back to the region to talk about integration with the community and sample Dutch Pancakes, Stroopwafels and perhaps a Heineken (or two).

The eventbrite is available here and there is no cost for this event.

See you in Europe!

The Forrester Wave™: Hybrid Integration, Q1 2014 – Microsoft represented BizTalk360 for BizTalk Server runtime management

Disclaimer: This summary blog post is written by extracting content from "The Forrester Wave: Hybrid Integration, Q1 2014" report. Forrester conducted integration portfolio evaluations with more than 40 integration products in October 2013 and interviewed 14 vendors. Participation was tough: all participating vendors had to offer at least four of the following seven integration capabilities […]

The post The Forrester Wave™: Hybrid Integration, Q1 2014 – Microsoft represented BizTalk360 for BizTalk Server runtime management appeared first on BizTalk360 Blog.

Blog Post by: Saravana Kumar

BizTalk Server Tip #13: Cluster host instances with adapter like FTP or POP3

Use Microsoft Clustering for adapters that require a single host instance running at a time, like FTP or POP3, to avoid duplicate messages while providing high availability. Failure to implement this will result in either having to deal with duplicate messages or implementing manual processes for performing failover. Due to the nature of some protocols […]

The post BizTalk Server Tip #13: Cluster host instances with adapter like FTP or POP3 appeared first on BizTalk360 Blog.

Blog Post by: Ricardo Torre