by community-syndication | Mar 7, 2014 | BizTalk Community Blogs via Syndication
Tallan will be presenting at the Global Windows Azure Bootcamp – Tampa on March 29th.
Sign up Today it is FREE!
About the event
Global Windows Azure Bootcamp
In April of 2013 we held the first Global Windows Azure Bootcamp at more than 90 locations around the globe! This year we want to again offer up a one-day deep dive class […]
Blog Post by: Dan Fluet
by community-syndication | Mar 6, 2014 | BizTalk Community Blogs via Syndication
We are very honoured and humbled to see such a positive response from the entire BizTalk community. We are blown away by the feedback. I’m in the process of writing a detailed story explaining how we organised this great event with limited resources. Since a lot of things are floating in the social media space about […]
The post What did the attendees say about BizTalk Summit 2014, London? appeared first on BizTalk360 Blog.
Blog Post by: Saravana Kumar
by community-syndication | Mar 5, 2014 | BizTalk Community Blogs via Syndication
The first day of BizTalk Summit 2014, London was a great success; everyone seemed to enjoy it very much and found it really useful. One of the best quotes I’ve received personally by email is from Jon Fancey: I think it’s fair to say this was the best BizTalk focussed event […]
The post BizTalk Summit 2014, London– Day 2 summary from tweets appeared first on BizTalk360 Blog.
Blog Post by: Saravana Kumar
by community-syndication | Mar 4, 2014 | BizTalk Community Blogs via Syndication
Use Event Tracing for Windows (ETW) for high-performance tracking and debugging of your BizTalk applications. Since this method uses high-performance structures within Windows to host tracking, the impact is near zero if you are not consuming the output, and it is highly valuable when you are troubleshooting a problem. You can access […]
The post BizTalk Server Tip #30: Use ETW for high performance tracking appeared first on BizTalk360 Blog.
Blog Post by: Ricardo Torre
by community-syndication | Mar 4, 2014 | BizTalk Community Blogs via Syndication
Microsoft has delivered an ebook on cloud design patterns: http://msdn.microsoft.com/en-us/library/dn568099.aspx You can download it as PDF, EPUB or MOBI. The book contains twenty-four design patterns and ten related guidance topics; it articulates the benefits of applying patterns by showing how each piece can fit into the big picture of cloud application architectures. It also discusses the benefits and […]
Blog Post by: Jeremy Ronk
by community-syndication | Mar 4, 2014 | BizTalk Community Blogs via Syndication
One of our objectives for the event was to promote the #msbts Twitter hashtag and to bring more people from the BizTalk integration community to come and share valuable information in public. We are very pleased to see the adoption: about 420 tweets (which is a great number for a BizTalk community). […]
The post BizTalk Summit 2014, London– Day 1 summary from tweets appeared first on BizTalk360 Blog.
Blog Post by: Saravana Kumar
by community-syndication | Mar 3, 2014 | BizTalk Community Blogs via Syndication
Introduction
In a previous post, I explained how to run Hive + Python in HDInsight (Hadoop as a service in Windows Azure).
The sample showed a Python script using standard modules such as hashlib. In real life, modules need to be installed on the machine before they can be used. Recently, I had to use the shapely, shapefile and rtree modules.
Here is how I did it, and why.
A quick recap on how HDInsight works
HDInsight is a cluster created on top of Windows Azure worker roles. This means that each VM in the cluster can be reimaged, i.e. replaced by another VM with the same original bits on it. In terms of configuration, the “original bits” contain what was declared when the cluster was created (Windows Azure storage accounts, etc.). So anything installed after the cluster was created only lasts until the node is reimaged.
NB: In practice, a node would typically be reimaged when the underlying VM host is rebooted to install security patches. This does not happen every day, but good practices in cloud development should take those constraints into account.
Install on the fly
So the idea is to install the module on the fly, while executing the script.
Python is flexible enough to let you catch exceptions while importing modules.
So the top of the Python script looks like this:
import sys
import os
import shutil
import uuid
from zipfile import *

py_id = str(uuid.uuid4())

# sys.stderr will end up in Hadoop execution logs
sys.stderr.write(py_id + '\n')
sys.stderr.write('My script title.\n')
sys.stderr.flush()

# try to import the shapely module. If it has already been installed, this will succeed;
# otherwise, we'll install it on the fly
try:
    from shapely.geometry import Point
    has_shapely = True
except ImportError:
    has_shapely = False

if not has_shapely:
    # the shapely module was not installed on this machine.
    # let's install all the required modules, which are shipped next to the .py script as zip files
    sys.stderr.write(py_id + '\n')
    sys.stderr.write('will unzip the shapely module\n')
    sys.stderr.flush()
    # unzip the shapely module files (a Python module is a folder) into the Python folder
    with ZipFile('shapely.zip', 'r') as moduleZip:
        moduleZip.extractall(r'd:\python27')
    # unzip the rtree module files into the Python Lib\site-packages folder
    with ZipFile('rtree.zip', 'r') as moduleZip:
        moduleZip.extractall(r'd:\python27\Lib\site-packages')
    # add a required dependency (geos_c.dll) and install the shapefile module
    # (just one .py file, no zip required)
    try:
        sys.stderr.write(py_id + '\n')
        sys.stderr.write('trying to copy geos_c.dll to python27\n')
        sys.stderr.flush()
        shutil.copyfile(r'.\geos_c.dll', r'd:\python27\geos_c.dll')
        shutil.copyfile(r'.\shapefile.py', r'd:\python27\shapefile.py')
    except IOError:
        sys.stderr.write(py_id + '\n')
        sys.stderr.write('could not copy geos_c.dll to python27. Ignoring\n')
        sys.stderr.flush()

# now that the modules are installed, (re-)import shapely ...
from shapely.geometry import Point
from shapely.geometry import Polygon
# ... and also import the other modules that were installed on the fly
import string
import time
import shapefile
from rtree import index
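The pattern above can be boiled down to a small, reusable helper. This is a generalized sketch, not part of the original script: the function name, zip path and target directory are illustrative placeholders.

```python
import sys
from zipfile import ZipFile

def ensure_module(name, zip_path, target_dir):
    """Import a module; if it is missing, unzip it into target_dir and retry."""
    try:
        return __import__(name)
    except ImportError:
        # module not installed on this node: extract the packaged copy
        with ZipFile(zip_path, 'r') as z:
            z.extractall(target_dir)
        # make sure the target directory is importable
        if target_dir not in sys.path:
            sys.path.append(target_dir)
        return __import__(name)
```

The second import attempt is outside the except handler's zip extraction only in the sense that a failure at that point is a real error: if the module still cannot be imported after extraction, the job should fail loudly rather than silently.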
How to package the modules
So the next questions are:
- where do the .zip files come from?
- how to send them with the .py script so that it can find them when unzipping?
Since the approach is to unzip the modules rather than install them, the idea is to install each module manually on a similar machine, then package it from that machine.
An HDInsight machine is set up as shown below:
Well, this was the head node; the worker nodes don’t look much different:
and this is Python 32-bit (on a 64-bit Windows Server 2008 R2 OS):
If you don’t have that kind of environment, you can just create a virtual machine on Windows Azure. The OS is available in the gallery:
Once you have created that VM, installed Python 2.7 32-bit on it and installed the required modules manually, you can zip them.
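The zips can be built with any tool, including Python itself. The sketch below is an assumption, not from the post: `zip_module` and the example paths are placeholders. The key point is that entries are stored relative to the Python root so that `extractall()` on the cluster node restores the module folder in the right place.

```python
import os
import zipfile

def zip_module(module_dir, zip_path, base_dir):
    """Zip an installed module folder, storing entries relative to base_dir."""
    with zipfile.ZipFile(zip_path, 'w', zipfile.ZIP_DEFLATED) as zf:
        # walk the module folder and keep paths relative to the Python root,
        # e.g. d:\python27, so extractall(base_dir) recreates the layout
        for root, _dirs, files in os.walk(module_dir):
            for name in files:
                full = os.path.join(root, name)
                zf.write(full, os.path.relpath(full, base_dir))

# hypothetical usage on the packaging VM:
# zip_module(r'd:\python27\Lib\site-packages\rtree', 'rtree.zip',
#            r'd:\python27\Lib\site-packages')
```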
Then you just have to send them along with the .py Python script. Here is what I have in my example (in PowerShell):
$hiveJobVT = New-AzureHDInsightHiveJobDefinition -JobName "my_hive_and_python_job" `
-File "$mycontainer/with_python/my_hive_job.hql"
$hiveJobVT.Files.Add("$wasbvtraffic/with_python/geos_c.dll")
$hiveJobVT.Files.Add("$wasbvtraffic/with_python/shapely.zip")
$hiveJobVT.Files.Add("$wasbvtraffic/with_python/my_python_script.py")
$hiveJobVT.Files.Add("$wasbvtraffic/with_python/rtree.zip")
$hiveJobVT.Files.Add("$wasbvtraffic/with_python/shapefile.py")
In the Hive job, the files must also be added:
(...)
add file point_in_polygon.py;
add file shapely.zip;
add file geos_c.dll;
add file rtree.zip;
add file shapefile.py;
(...)
INSERT OVERWRITE TABLE my_result
partition (dt)
SELECT transform(x, y, z, dt)
USING 'D:\Python27\python.exe my_python_script.py' as
(r1 string, r2 string, r3 string, dt string)
FROM mytable
WHERE dt >= ${hiveconf:dt_min} AND dt <= ${hiveconf:dt_max};
(...)
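The TRANSFORM clause streams each row to the script as a tab-separated line on stdin and reads tab-separated lines back from stdout. The script's main loop can therefore be sketched as follows; the column handling here is a placeholder, not the actual point-in-polygon logic from the post.

```python
import sys

def transform_row(line):
    # Hive sends one row per line, columns separated by tabs
    x, y, z, dt = line.rstrip('\n').split('\t')
    # placeholder computation standing in for the real shapely/rtree work
    r1, r2, r3 = x, y, z
    # emit the output columns (r1, r2, r3, dt) tab-separated
    return '\t'.join([r1, r2, r3, dt])

if __name__ == '__main__':
    for line in sys.stdin:
        sys.stdout.write(transform_row(line) + '\n')
```

Note that any diagnostic output must go to stderr (as the script above does with `sys.stderr.write`), since everything written to stdout is interpreted by Hive as result rows.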
Benjamin (@benjguin)
Blog Post by: Benjamin GUINEBERTIERE