Can you move a Virtual Machine from VirtualPC to Hyper-V?

I was recently asked whether I could help move a virtual machine that had been set up and was running in VirtualPC so that it would run under Hyper-V.

The answer is yes, it can be done. Moving from Hyper-V to VirtualPC is not as easy (and often not possible), but there are a number of blog posts on that topic already, so I won’t cover it here. One of the reasons that it is easier to move from VirtualPC to Hyper-V is that VirtualPC is 32-bit only.

There are a number of steps that must occur for a successful move.

First, start by uninstalling the integration components while the virtual machine is running in VirtualPC.  You can do this through the Add/Remove programs feature in Windows in the Virtual Machine.  The Hyper-V drivers and additions will not install over the VirtualPC additions and that is why you must remove them first.

Next, move the .vhd file to a location where it can be accessed by your Hyper-V instance. Walk through the wizard to create a new virtual machine, but when prompted to create a new drive or select an existing drive, choose to select an existing drive and point it to your .vhd file.

Finally, once you have the virtual machine configured in your Hyper-V instance, start the machine. Go through the Settings Menu and install the Hyper-V additions. When you do this, the Hyper-V additions install a new HAL as well as new drivers for network, video and sound devices. Installing the new HAL is one of the reasons that a Hyper-V image is no longer portable back to VirtualPC.

At this point you might think that everything is done and you are ready to use the virtual machine. Most of the time this is correct; however, some situations require additional steps. You will know that you need them if your integration components aren’t working. You can tell really quickly: your mouse won’t move outside of the virtual machine.

You are more likely to run into this if your virtual machine is running a version of Windows prior to Vista, or Windows Server 2008, as these do not dynamically detect the HAL at boot time.

To fix this, run MSConfig.exe: click the Start menu, select Run and type msconfig. Once the utility launches, click on the Boot tab and click the Advanced Options button. When the BOOT Advanced Options dialog appears, check the Detect HAL checkbox and click OK. Restart the virtual machine and you should be good to go!

Blog Post by: Stephen Kaufman

How to: Microsoft CRM 2011 Integration example (Part 2 – Get data out of CRM 2011)

First things first. At this point I assume that you

  • have read the previous post
  • have downloaded and installed the CRM2011 SDK
  • have a working CRM2011 environment at your disposal
  • have an account for CRM2011 with sufficient rights (I’d recommend System Administrator)
  • have Visual Studio 2010 installed
  • have downloaded and extracted my Visual Studio example solution

So you’ve met all the requirements mentioned above? Good; let’s get started.

Note: all code should be used for demo/test purposes only! I did not intend it to be production grade. So, if you decide to use it, don’t use it for production purposes!

Building your Custom Workflow Activity for CRM2011

Once you’ve downloaded and extracted my Visual Studio example solution, it is time to open it and fix some issues.


Ensure your references are correct

Go to the Crm2011Entities project, expand the References folder and remove the two references that we will re-add in the next steps (microsoft.xrm.sdk and microsoft.xrm.sdk.workflow).


Once done, we are going to re-add these references; so right click on the References folder of the Crm2011Entities Project and click ‘Add Reference’


Now click on the ‘Browse’ button in the Add Reference dialog window


Now browse to your CRM 2011 SDK bin folder (in my case: B:\Install\Microsoft CRM\SDK CRM 2011\bin) and select the following two assemblies:

  • microsoft.xrm.sdk
  • microsoft.xrm.sdk.workflow


Now repeat the above mentioned steps for the other project


Generate a strongly typed class of all your existing CRM entities.

Open up the “Crm2011Entities Project”, and notice that it does not contain any files except a readme.txt file.

Q&A session

Me: Well let’s add a file to this project, shall we?

You: Hmmm, what file you ask?

Me: Well this project will hold a class file which contains all the definitions of your CRM 2011 Entities.

You: O no, do I need to create this myself?

Me: Well lucky you, there is no need for this.

You: So how do I create this file then?

Me: Well just follow the steps mentioned below

So let’s fix this, and open up a command prompt with administrator privileges.


Now navigate to your CRM 2011 SDK folder (in my case this would be: B:\Install\Microsoft CRM\SDK CRM 2011\bin)


Note: Before you proceed, make sure that you know the URL of the CRM2011 OrganizationService. Test it by simply browsing to this address; if everything goes right you should see the following page:


Now type in the following command (and replace the values between <….> with your values (see readme.txt)):


Once completed, you should be presented with the following output:


The actual file should be written to the location you set; in my case this is: c:\Temp


Once the actual class has been generated, open Visual Studio and right click on the CRM2011Entities project and select  ‘Add Existing Item’


Browse to the directory in which the generated class was saved, and select the generated class.


At this point you should be able to compile the complete solution, so go ahead and do so.

Note: The source code includes comments, which should be self-explanatory.

Making the custom workflow activity available in CRM2011.

So you’ve successfully compiled the solution, so what’s next? Well now it’s time to import this custom created activity in CRM2011.

In order to do this we will use a nifty application which comes with the CRM2011 SDK. This application is called ‘pluginregistration’ and can be found in the subdirectory tools/pluginregistration of the CRM2011 SDK (in my case the location is B:\Install\Microsoft CRM\SDK CRM 2011\tools\pluginregistration).

Note: As you will notice, only the source code of pluginregistration is available, so you need to compile it in order to use it.

In the pluginregistration folder, browse to the bin folder and either open the debug or release folder and double click the application PluginRegistration.exe


You will be presented with the following GUI:


Now click on “Create New Connection”


Fill out the connection information, consisting of:

  • Label:   Friendly name of connection
    • In my case I named it: CRM2011
  • Discovery Url:   Base url of CRM
  • User Name: Domain Account with sufficient rights in CRM 2011
    • In my case I used: LAB\Administrator


Once everything is filled in, press Connect and wait until the discovery is finished. Once finished double click on the organization name (in my case: Motion10 Lab Environent ) and wait for the objects to be loaded.


Once the objects have been loaded; you should see a screen similar to the one depicted here below:


Now let’s add our ‘Custom Activity or plugin’. Do this by selecting the ‘Register’ tab and clicking on ‘Register new Assembly’


The ‘Register New Plugin’ screen will pop up; click on the ‘Browse (…)’ button.


Now browse to the bin folder of the example project “SendCrmEntityToEndPoint“ (the one you compiled earlier) and select the SendCrmEntityToEndPoint.dll file and click on ‘Open’


Once done, select the option “None“ at step 3 and select the option “Database“ at step 4 and press the button ‘Register Selected Plugins’


Once done you should receive feedback that the plugin was successfully registered.


Creating a workflow in CRM2011 which uses the custom activity.

Now that we have registered our ‘plugin’, it is time to put it to action. In order to do so; we will logon to CRM2011 and create a custom workflow.

Once you’ve logged on to CRM2011, click on ‘Settings’


Now find the ‘Process Center’ section and click on ‘Processes’


In the main window, click on ‘New’


A dialog window will pop up; fill in the following details and once done press OK:

  • Process Name: Logical name for this workflow
    • I named it: OnAccountProspectStatusExport
  • Entity:  Entity which could trigger this workflow
    • I used the Account Entity
  • Category: Select WorkFlow


A new window will pop up; which is used to define the actual workflow. Use the following settings:

  • Activate as: Process
  • Scope: Organization
  • Start When:
    • check Record is created
    • check Record fields change and select the field RelationShipType


  • Now add the following step: Check Condition


  • Set the condition to be
    • Select “Account”
    • Select Field “RelationshipType”
    • Select “Equals”
    • Select “Prospect”


  • Now add our custom activity as the following step: SendCrmEntityToEndPoint


  • Configure this activity like this:
    • Export to disk:  True
    • EndPoint location: <Path where entity needs to be written>
      • In my case I used: c:\temp (note that this will be written to the C drive of the CRM server!)


  • Now once again add our custom activity as the following step: SendCrmEntityToEndPoint


  • Configure this activity like this:
    • Export to disk: False
    • EndPoint location: URL of your BizTalk web service
      • In my case I used the endpoint which points to my generated BizTalk web service (which we will cover in our next blog post)
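
The activity itself is a C# workflow activity in the example solution, and its source is not reproduced in this post. Purely to illustrate the logic that the two configured steps describe, here is a minimal Python sketch. Every name in it (`account_to_xml`, `should_export`, `send_entity_to_endpoint`) is hypothetical, the XML layout is an assumption, and it does not use the CRM SDK at all:

```python
import os
import tempfile
import urllib.request
import xml.etree.ElementTree as ET


def account_to_xml(account: dict) -> bytes:
    """Serialize a flat dict of account fields to a small XML document (assumed layout)."""
    root = ET.Element("Account")
    for name, value in account.items():
        ET.SubElement(root, name).text = str(value)
    return ET.tostring(root, encoding="utf-8")


def should_export(account: dict) -> bool:
    """Mirror of the Check Condition step: RelationshipType equals Prospect."""
    return account.get("RelationshipType") == "Prospect"


def send_entity_to_endpoint(account: dict, export_to_disk: bool, endpoint: str) -> str:
    """Write the entity to a folder, or POST it to a URL, depending on the flag."""
    payload = account_to_xml(account)
    if export_to_disk:
        # 'endpoint' is a directory on the machine running this code
        path = os.path.join(endpoint, "Account.xml")
        with open(path, "wb") as handle:
            handle.write(payload)
        return path
    # otherwise 'endpoint' is treated as a web service URL
    request = urllib.request.Request(
        endpoint, data=payload, headers={"Content-Type": "text/xml"}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return str(response.status)


demo = {"Name": "Contoso", "RelationshipType": "Prospect"}
if should_export(demo):
    print(send_entity_to_endpoint(demo, True, tempfile.mkdtemp()))
```

The same function is called twice in the workflow, once per step, which is why a single flag is enough to switch between the disk export and the web service call.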


Well at this point your workflow should look similar to this:


Now click on the ‘Activate’ button


Confirm the ‘Activation’


Save and close the new workflow


Test if everything works

So now it is time to see if everything works; in order to do so we will create a new Account and if everything went ok; we should see

  • An Account.xml file somewhere on disk
  • A Routing Error in BizTalk (as we send a document which is not recognized by BizTalk)

In CRM2011 click on the ‘Work Place’ button


Subsequently click on ‘Accounts’


And finally add a new ‘Account’, by clicking on ‘NEW’


A new window will pop-up; fill in some basic details


and don’t forget to set the Relationship type to ‘Prospect’


Once done click on the ‘Save & Close’ button


After a few minutes we can check both our output directory and the BizTalk Administrator, and we should notice that in the output directory a file has been written


and we should have a ‘Routing Failure’ error in BizTalk.


Closing Note

So this sums up this part, in which we built our own workflow activity, imported it into CRM2011, constructed a workflow and, last but not least, saw that it worked.

Hope you enjoyed the read



How to: Microsoft CRM 2011 Integration example (Part 1–Introduction)

Well, it has been a while since my last post; however, as I stated in my first post, “I’ll only try to blog whenever I have something which in my opinion adds value”, and the topic I want to discuss today might just add that value.

Please note: This post will supply you with background information, the actual implementation of the solution will be covered in the next blog posts. However the sample files which are mentioned in this post can already be downloaded.

Scenario sketch

Let’s say one of your customers is considering replacing their current CRM with Microsoft CRM2011.

Now one of the company’s business processes dictates that whenever a new customer or contact has been added to their CRM system, this data has to be sent to their ERP system in near real time. This customer or contact is then added to the ERP system and is assigned a unique account number. This account number then needs to be sent back to the CRM system. As an end result, the corresponding customer in CRM2011 is updated with the account number from the ERP system.

Their current CRM solution already takes care of this functionality; however, this has been implemented as a point-to-point solution, and therefore replacing their current CRM with Microsoft CRM2011 would break this ‘integration point’. The customer is aware that in the long term it would be best to move away from these kinds of point-to-point solutions and towards a Service Oriented Architecture.

At the end of the day it is up to you to convince your customer that setting up a solution with Microsoft CRM2011 that includes an integration with their ERP system is no problem at all. And since you know that the customer wants to move to a Service Oriented Architecture, you see a fitting opportunity to introduce the company to BizTalk Server 2010 as well.

So eventually you propose the following Proof of Concept scenario to your customer: ‘You will show the customer that it is possible, with almost no effort, to build a solution which connects Microsoft CRM 2011 to their ERP system, whilst adhering to the generally known Service Oriented Architecture principles’. Once you tell your customer that this POC does not involve any costs for them except time and cooperation, they are more than happy to agree to it.

Preparing your dish

In order to complete the solution discussed in this blog post you will need the following ingredients:

A test environment consisting of:

  • 1 Windows Server 2008R2 which acts as Domain Server (Active Directory)
  • 1 Windows Server 2008R2 on which Microsoft CRM2011 is installed and configured
  • 1 Windows Server 2008R2 on which Microsoft BizTalk Server 2010 is installed and configured.
  • One Development Machine with Visual Studio 2010 installed

Step 1: How do I get data out of Microsoft CRM2011?

Well in order to get data (let me rephrase; an entity) out of CRM for our Integration scenario we will need to build a custom activity which can be added as a workflow step within CRM2011.

So which ingredients are required to do this?

  • We need to download the CRM2011 SDK; so go and fetch it here

So what are we going to build?

  • We will build a custom activity and deploy it to CRM2011 such that it can be used in a workflow, or download my example and install it

Step 2: How do I get data in my custom ERP?

Well, for this step I’ve built a custom application which will act as a stub for our custom ERP. This custom ERP system will be exposed by means of a WCF service.

So which ingredients are required to do this?

  • A (sample) ERP system.

So what are we going to build?

  • Well you could build your own application, or download my example and install it.

Step 3: How do I get data into CRM2011?

Well in order to get data into CRM; we will use the out of the box web services which are exposed by CRM2011.

So which ingredients are required to do this?

  • Well if you have not yet downloaded the CRM2011 SDK; go and fetch it here

So what are we going to build?

  • Well, in order to make our life easier we will build a proxy web service which talks directly to CRM2011; this way our integration efforts will go more smoothly.

Step 4: How do I hook it all together?

Well, for this part we will use BizTalk. BizTalk will receive the ‘Create Customer’ event from CRM, and subsequently logic will be applied such that this data is sent to the custom ERP application. Once the insert has succeeded, the ERP system sends back a customer account number, and we then update the corresponding entity in CRM2011 with the account number obtained from the ERP system.
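
To make the round trip concrete before any BizTalk artefacts exist, here is a minimal Python sketch of the message flow only. `ErpStub`, `CrmStub` and `handle_create_customer` are hypothetical stand-ins invented for this illustration; the real solution uses a WCF-exposed ERP stub, the CRM2011 web services and a BizTalk solution, none of which is modelled here:

```python
import itertools


class ErpStub:
    """Stand-in for the custom ERP: assigns a unique account number per customer."""

    def __init__(self, first_number: int = 1000):
        self._numbers = itertools.count(first_number)
        self.customers = {}

    def insert_customer(self, name: str) -> int:
        number = next(self._numbers)
        self.customers[number] = name
        return number


class CrmStub:
    """Stand-in for CRM: accounts carry an (initially empty) Account Number field."""

    def __init__(self):
        self.accounts = {}

    def create_account(self, account_id: str, name: str) -> None:
        self.accounts[account_id] = {"Name": name, "AccountNumber": None}

    def update_account_number(self, account_id: str, number: int) -> None:
        self.accounts[account_id]["AccountNumber"] = number


def handle_create_customer(crm: CrmStub, erp: ErpStub, account_id: str) -> int:
    """The role the integration layer plays: forward to the ERP, write the number back."""
    name = crm.accounts[account_id]["Name"]
    number = erp.insert_customer(name)
    crm.update_account_number(account_id, number)
    return number


crm, erp = CrmStub(), ErpStub()
crm.create_account("acc-1", "Contoso")
handle_create_customer(crm, erp, "acc-1")
print(crm.accounts["acc-1"])
```

The point of the sketch is the direction of the two calls: CRM never talks to the ERP directly, so either side can be replaced without touching the other.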

So which ingredients are required to do this?

  • Well if you have not yet downloaded the CRM2011 SDK; go and fetch it here 🙂

So what are we going to build?

  • Well we need to make a customization to our Account Entity in CRM2011, to be more specific; we will add a custom field to the Account entity and call it Account Number.
  • We will build a BizTalk solution which will hook all the bits together.


Closing Note

So this sums up the introduction part. Be sure to check back soon for the follow up part in which I’ll discuss how to build our CRM Trigger

BizTalk: Ordered Delivery

This is one more description of Ordered Delivery (OD) in BizTalk.

The main article about it is on MSDN.

Here I discuss the “implementation details” of BizTalk Ordered Delivery.

OD Considerations

  • Ordered Delivery (sequential) mode is the opposite of “Parallel Delivery” mode. Parallel Delivery gives the highest throughput; Ordered Delivery gives lower throughput.
  • Transports such as MSMQ, and protocols that support WS-ReliableMessaging, support OD. Other protocols such as FTP, File, SQL and SMTP have no notion of “order”.
  • A BizTalk application is usually only one part of the whole message workflow.
  • There are two approaches to implementing OD:
    • All steps of the message workflow independently support OD.
    • A destination system implements re-sequencing and handles lost and duplicate messages.
  • Order is relative. Sequences can be ordered with regard to one or several parameters, for example OD per Company, or per Company + Department.
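
The second approach, re-sequencing at the destination, can be sketched in a few lines of Python. This is a generic resequencer, not BizTalk code; it assumes every message carries a gap-free integer sequence number, and the class and method names are made up for the illustration:

```python
class Resequencer:
    """Release messages in sequence-number order, dropping duplicates."""

    def __init__(self, first_seq: int = 1):
        self.next_seq = first_seq   # next sequence number we may release
        self.pending = {}           # out-of-order messages waiting for a gap to close

    def accept(self, seq: int, body: str) -> list:
        """Take one inbound message; return the messages that can now be delivered."""
        if seq < self.next_seq or seq in self.pending:
            return []               # duplicate: already delivered or already buffered
        self.pending[seq] = body
        released = []
        while self.next_seq in self.pending:
            released.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return released

    def missing(self) -> list:
        """Sequence numbers that block delivery (candidates for 'lost message' handling)."""
        if not self.pending:
            return []
        return [s for s in range(self.next_seq, max(self.pending)) if s not in self.pending]


r = Resequencer()
for seq, body in [(2, "b"), (1, "a"), (2, "b"), (4, "d"), (3, "c")]:
    print(seq, r.accept(seq, body))
```

What to do about a sequence number that stays in `missing()` for too long (wait, request a resend, or skip it) is a business decision that this sketch deliberately leaves open.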

OD and the BizTalk Architecture

  • MessageBox is an implementation of the Message Queue. OD is an intrinsic feature of the MessageBox.
  • The BizTalk Server works in the Parallel Delivery mode by default.
  • There are three parts in the BizTalk message exchange outside of the MessageBox in relation to OD: Receive Locations; Orchestrations; Send ports.
    • Receive Locations support OD on the Transport level (say, MSMQ and some WCF adapters).
    • OD in Orchestrations is implemented by the sequential convoy pattern.
    • Send ports support OD for all static adapters.
  • The BizTalk Pipelines (part of Receive and Send Ports) always process messages in order using streams.

OD and Ports

To force Send Ports to work in order, we set the “Ordered Delivery” flag on the Send Port; the BizTalk MessageBox then takes care of implementing OD.

To force Receive Locations to work in order, we set the “Ordered Delivery” option in the Receive Location transport, wherever it is available; the BizTalk transport adapter then takes care of implementing OD.

An Ordered Delivery Send Port instance works as a singleton service. Once started, it stays in the Running state. It is not recycled when we restart its Host Instance. We can terminate it manually if we want.

OD and Orchestrations

The MessageBox implements the convoy message exchange pattern [see Using Correlations in Orchestrations]. See the detailed convoy description in the BizTalk Server 2004 Convoy Deep Dive article.
A convoy is not just a design pattern that developers should follow: there are special mechanisms inside the MessageBox responsible for implementing OD.

OD and Orchestration: Sample

Imagine four Orchestrations which implement four approaches to the sequencing.

The first is the ProcessNoOrder Orchestration. It processes all messages without any order. One ProcessNoOrder Orchestration instance will be created for each inbound message.

The ProcessInOrder Orchestration processes all messages in one sequence. Only one ProcessInOrder Orchestration instance will be created.

The ProcessInOrderByCompany Orchestration processes messages in sequences correlated by the Company value (A, B, C, D, etc.). A separate queue is created for each new value of the Company. Messages inside a queue are processed in order. Queues for different Companies are independent. A separate ProcessInOrderByCompany Orchestration instance is created for each new Company value.

The ProcessInOrderByProduct Orchestration works exactly like the ProcessInOrderByCompany Orchestration, but is correlated by the Product value (xx, yy, etc.).
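
The four approaches differ only in the key used to group messages. A small Python sketch, outside BizTalk and with made-up message data, shows the queues each approach would produce; `partition_in_order` is a hypothetical helper, and using the message id as the key is just a way to model “no ordering”:

```python
from collections import OrderedDict


def partition_in_order(messages, key=None):
    """Group messages into per-key queues, keeping arrival order inside each queue."""
    queues = OrderedDict()
    for message in messages:
        k = key(message) if key else None   # no key: everything lands in one queue
        queues.setdefault(k, []).append(message["id"])
    return queues


inbound = [
    {"id": 1, "Company": "A", "Product": "xx"},
    {"id": 2, "Company": "B", "Product": "yy"},
    {"id": 3, "Company": "A", "Product": "yy"},
    {"id": 4, "Company": "B", "Product": "xx"},
]

no_order = partition_in_order(inbound, key=lambda m: m["id"])         # ProcessNoOrder
in_order = partition_in_order(inbound)                                # ProcessInOrder
by_company = partition_in_order(inbound, key=lambda m: m["Company"])  # ProcessInOrderByCompany
by_product = partition_in_order(inbound, key=lambda m: m["Product"])  # ProcessInOrderByProduct

for name, queues in [("no order", no_order), ("in order", in_order),
                     ("by company", by_company), ("by product", by_product)]:
    print(name, dict(queues))
```

Each queue corresponds to one orchestration instance: four instances without ordering, one for full ordering, and one per distinct Company or Product value in between.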



By default, all Orchestration and Messaging Service instances work in Parallel Delivery mode, with maximum performance.

If we check the Ordered Delivery option on a Send Port, BizTalk will start the Send Port instance as a singleton service. It is always a single instance. We don’t have the flexibility of the Orchestration, where we can tune “the proportion of the sequencing” and control the recycling of the service instance.

Send Port OD can be in two states, on and off:

  • OD is off: a service instance per message, one message per queue, maximum performance.
  • OD is on: one infinite service instance, all messages in one queue, minimum performance.

Orchestration OD can also be in two states, on and off:

  • OD is off: a service instance per activating message, one activating message per queue, maximum performance.
  • OD is on: one or several service instances, one per new correlation set value; all correlated messages in one queue per value; lower performance.

By carefully designing the correlation sets we can tune the performance of the Orchestration. For example, if we only care about document order within each separate Company, we include the Company in the correlation set. If we have a thousand documents related to hundreds of companies, performance will be near the maximum. If there are only two companies, performance will be near the minimum, and we should consider refining the correlation with one more parameter.
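
A rough way to reason about this trade-off is to count, for a given correlation key, how many independent queues exist and how long the longest one is. The Python sketch below does only that; the document counts and the even distribution of documents over companies and departments are invented for the example, and it says nothing about real BizTalk throughput:

```python
from collections import Counter


def convoy_profile(keys):
    """Return (number of independent queues, length of the longest queue)."""
    counts = Counter(keys)
    return len(counts), max(counts.values())


docs = range(1000)

# 1000 documents spread evenly over 200 companies: many short queues
print(convoy_profile([i % 200 for i in docs]))

# 1000 documents for only two companies: two long queues
print(convoy_profile([i % 2 for i in docs]))

# same two companies, correlated by Company + Department (25 departments assumed)
print(convoy_profile([(i % 2, i % 25) for i in docs]))
```

More queues mean more instances that can run side by side, and a shorter longest queue means less work forced into strict sequence, which is the effect of adding a parameter to the correlation set.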

Orchestrations and Zombies

Zombies are a big problem of Convoy Orchestrations. See the article BizTalk: Instance Subscription and Convoys: Details for a description of this problem. The problem can be mitigated but cannot be completely removed. We are waiting for a new version of BizTalk Server in which this issue is addressed.

BizTalk Server version Next, Ordered Delivery and Zombies

It is possible that the next version of BizTalk Server will implement automatic Ordered Delivery for Orchestrations, with a pattern similar to Ordered Delivery in Send Ports.

Three new Orchestration parameters appear there: Ordered Delivery, Stop on Exception, and Recycle Interval (in seconds).

The Ordered Delivery parameter works in the same way as the Ordered Delivery parameter of the Send Port. We no longer have to build the Convoy Orchestration manually: no more Loop inside the Orchestration.

If the Ordered Delivery parameter is set to True, the Orchestration works as a singleton. The first Receive shape receives all correlated messages in sequence. The correlation set is created implicitly from the Activation Subscription expression.

There are several limitations for this kind of Orchestration. The most important is that only one Receive shape is permitted.

There are two big advantages of this new feature:

  • Simplified Orchestration design for the Ordered Delivery.
  • No more Zombies. The Orchestration instance is recycled in a controlled way, when no messages matching the Orchestration subscription remain in the MessageBox.


We discussed the Ordered Delivery implemented in the BizTalk Server and ways to improve it.

On demand map reduce cluster and persistent storage | un cluster map reduce à la demande et des données persistantes

Here is one of the use cases of Hadoop on Azure: you have a few applications accumulating data over time, and you need to run batches against this data a few times a month. You need many machines in a Hadoop cluster, but most of the time you don’t need the cluster, just the data.
One possible way to organize this is shown in the following diagram, which we will explain in this post.


Hadoop
Hadoop is a framework that implements the map/reduce algorithm to execute code against large amounts of data (terabytes).
On a Hadoop cluster, data is typically spread across the different data nodes of the Hadoop Distributed File System (HDFS). Even one big file can be spread across the cluster in blocks of 64 MB (by default).
So data nodes play two roles at the same time: they provide processing power and they also host the data itself. This means that removing processing power removes HDFS storage at the same time.
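
To get a feel for what “blocks of 64 MB” means, the following Python sketch splits a file size into block sizes. It only does the arithmetic; it does not talk to HDFS, and `split_into_blocks` is a hypothetical helper written for this illustration:

```python
def split_into_blocks(file_size_mb: int, block_size_mb: int = 64) -> list:
    """Return the sizes (in MB) of the blocks a file of the given size is cut into."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


print(split_into_blocks(200))               # a 200 MB file: three full blocks and a smaller one
print(len(split_into_blocks(1024 * 1024)))  # number of blocks in a 1 TB file
```

Because those blocks live on the data nodes themselves, shutting the nodes down to save compute cost also takes the blocks away, which is the problem the next section addresses.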


Persistent storage
To make the data survive cluster removals, it is possible to keep it in persistent storage. In Windows Azure, the natural candidate is Windows Azure Blobs, because blobs are what corresponds most closely to files, which is what HDFS stores.
NB: other Windows Azure persistent storage options include Windows Azure Tables (non-relational) and SQL Azure (relational, with sharding capabilities called federations).


Pricing on Windows Azure
Official pricing is described here, and you should refer to that URL for up-to-date prices.
At the time of writing, the prices are as follows (given in US dollars):
– Using Windows Azure blobs costs
* $0.14 per GB stored per month, based on the daily average. There are discounts for high volumes: between 1 and 50 TB, it is $0.125 / GB / month.
* $1.00 per 1,000,000 storage transactions
– A Hadoop cluster uses an 8-CPU head node (Extra Large) and n 2-CPU data nodes (Medium).
* Nodes are charged $0.12 per hour and per CPU. A cluster of 8 data nodes + 1 head node costs (8x2CPU+1x8CPU)x$0.12x750h=$2160/month.
– There are also data transfers into and out of the Windows Azure datacenter.
* Inbound data transfers are free of charge
* Outbound data transfers: North America and Europe regions: $0.12/GB; Asia Pacific region: $0.19/GB
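
The cluster price formula above is easy to parameterize. The Python sketch below uses only the figures quoted in this post (an 8-CPU head node, 2-CPU data nodes, $0.12 per CPU-hour, 750 hours in a month); the function name is hypothetical, and these historical prices should not be read as current Azure pricing:

```python
CPU_HOUR_USD = 0.12      # price per CPU and per hour, as quoted above
HEAD_NODE_CPUS = 8       # one Extra Large head node
DATA_NODE_CPUS = 2       # Medium data nodes


def cluster_cost(data_nodes: int, hours: float) -> float:
    """Cost in USD of running one head node plus 'data_nodes' data nodes for 'hours'."""
    cpus = data_nodes * DATA_NODE_CPUS + HEAD_NODE_CPUS
    return round(cpus * CPU_HOUR_USD * hours, 2)


# 8 data nodes for a full month of 750 hours, as in the example above
print(cluster_cost(8, 750))
```

Keeping the number of data nodes and the number of hours as parameters makes it easy to compare an always-on small cluster with a larger cluster that only runs for a few days.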
Disclaimer: in this post, I only consider the prices of Azure resources and don’t take into account any additional cost that may come for Hadoop as a service. The current version is a CTP (Community Technology Preview) and no price has been announced. I personally have no idea how this could be charged, or even whether it would be charged. I just suppose that the relative comparison between costs would stay roughly the same.

In the current Hadoop on Azure CTP, the following clusters are available (they are offered at no charge to a limited number of testers).


To store 1 TB of data, one needs a cluster with at least 3 TB of capacity, because HDFS replicates data 3 times (by default). So a Medium cluster is OK. Note that for computation, moving to a Large cluster may be needed, as additional data will be generated by the computation.
To store 1 TB of data in Windows Azure blobs, one needs 1 TB of Windows Azure blob storage (replication on 3 different physical nodes is included in the price).
So storing 1 TB of data in a Hadoop cluster with HDFS costs $2160/month, while storing 1 TB of data in Windows Azure blobs costs 1024x$0.125=$128/month.
Copying 1 TB of data to or from Windows Azure blobs inside the datacenter incurs storage transactions. As an approximation, let’s count one storage transaction per 1 MB (per the MSDN documentation, a PUT storage transaction on a block blob may contain up to 4 MB of data). So copying 1 TB of data would cost roughly $1.
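
The “roughly $1” estimate can be checked with a short calculation. This Python sketch uses the figures from this post ($1.00 per million transactions, one transaction per MB as a conservative assumption, up to 4 MB per PUT on a block blob); the function name is made up for the example:

```python
MB_PER_TB = 1024 * 1024
USD_PER_MILLION_TRANSACTIONS = 1.00


def copy_transaction_cost(size_tb: float, mb_per_transaction: float = 1.0) -> float:
    """Estimated storage-transaction cost in USD of copying 'size_tb' terabytes."""
    transactions = size_tb * MB_PER_TB / mb_per_transaction
    return transactions / 1_000_000 * USD_PER_MILLION_TRANSACTIONS


print(round(copy_transaction_cost(1), 2))       # one transaction per MB
print(round(copy_transaction_cost(1, 4.0), 2))  # if every transaction carried 4 MB
```

Either way the transaction cost stays small next to the storage and compute figures, so rounding it to $1 per TB copied does not change the comparison.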
Let’s now suppose we need the Hadoop cluster for 72 hours (3 x 24 h) a month for computation. We would use an extra large cluster (32 nodes) to get the result faster and to get extra storage capacity for intermediate data. That cluster costs (32 x 2 cores + 1 x 8 cores) x $0.12 x 72 h = $622.08.
So using an extra large cluster 3 times 24 h a month would cost the following per month:
– permanently store 1 TB of data in Windows Azure Storage: $128.00.
– copy 1 TB of data to and from Windows Azure storage 3 times = 3x2x$1=$6
– Hadoop Extra Large cluster: $622.08
==> $756.08


So it is ~2.9 times cheaper ($2160 / $756.08) to store 1 TB of data in Windows Azure Storage and run a 32 node cluster for 24 hours 3 times a month than to permanently run an 8 node Hadoop cluster that stores the 1 TB of data.
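The whole comparison can be reproduced with a short Python sketch. All prices are the ones quoted in this article; the function and constant names are hypothetical and only serve the illustration.

```python
# Figures quoted in this article (assumptions for the sketch).
BLOB_PRICE_PER_GB_MONTH = 0.125      # $ per GB per month in Windows Azure blobs
CORE_PRICE_PER_HOUR = 0.12           # $ per core per hour
PERMANENT_CLUSTER_MONTHLY = 2160.0   # $ per month for the 8 node cluster kept all month
COPY_COST_PER_TB = 1.0               # $ of storage transactions per 1 TB copy (approx.)

def elastic_monthly_cost(data_gb=1024, runs=3, hours_per_run=24,
                         worker_nodes=32, cores_per_worker=2, head_cores=8):
    """Return (storage, copies, compute) costs in $ for one month."""
    storage = data_gb * BLOB_PRICE_PER_GB_MONTH
    # Each run copies the data into HDFS and back out again: 2 copies per run.
    copies = runs * 2 * COPY_COST_PER_TB * (data_gb / 1024)
    cores = worker_nodes * cores_per_worker + head_cores
    compute = cores * CORE_PRICE_PER_HOUR * runs * hours_per_run
    return storage, copies, compute

storage, copies, compute = elastic_monthly_cost()
total = storage + copies + compute
ratio = PERMANENT_CLUSTER_MONTHLY / total

print(f"storage ${storage:.2f}, copies ${copies:.2f}, compute ${compute:.2f}")
print(f"total ${total:.2f}, permanent cluster is {ratio:.1f}x more expensive")
```

Changing `runs` or `hours_per_run` shows how quickly the advantage shrinks as the cluster is needed more often.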


Interactions between storage and applications
An additional consideration is the way applications interact with the storage.
HDFS is mainly accessed through a Java API or a Thrift API. It is also possible to interact with HDFS data through other stacks, like HIVE with an ODBC driver.
Windows Azure blobs may be accessed in a number of ways, like .NET, REST, Java, and PHP APIs. Windows Azure storage also offers security and permission features that are better suited to remote access, like shared access signatures.
Depending on the scenario, it may be easier to access Windows Azure Storage than HDFS.


How to copy data between Windows Azure Blobs and HDFS
Let’s now see how to copy data between Windows Azure Storage and HDFS.


asv://
First of all, you need to give your Windows Azure storage credentials to the Hadoop cluster. From the portal, this can be done in the following way:






Then, the asv:// scheme can be used instead of hdfs://. Here is an example:



This can also be used from the JavaScript interactive console:



Copying as a distributed job
To copy data from Windows Azure Storage to HDFS, it is better to have the whole cluster take part in the copy instead of just one thread on one server. While the hadoop fs -cp command does a single-threaded copy, the hadoop distcp command generates a map job that copies the data.
Here is an example:
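As a rough sketch, a distributed copy takes a source URI and a destination URI. The Python helper below only assembles and prints the command line; it does not run it. The container name books, the folder fr-fr, and the HDFS destination path are hypothetical placeholders, and running the printed command for real assumes a cluster where the Windows Azure storage credentials are already configured.

```python
import shlex

def build_distcp_command(source_uri, destination_uri):
    """Return the argument list for: hadoop distcp <source> <destination>."""
    return ["hadoop", "distcp", source_uri, destination_uri]

# Hypothetical locations: a second-level folder in a blob container,
# and a folder in the cluster's HDFS.
command = build_distcp_command("asv://books/fr-fr", "hdfs:///books/fr-fr")

# Print the command so it can be pasted into the cluster's console.
print(shlex.join(command))
```

Swapping the two arguments gives the copy in the other direction, from HDFS back to Windows Azure Storage.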




Here are a few tips and tricks:
Hadoop on Azure won’t list the content of a Windows Azure Blob container (the first level folder, just after /). You need at least a second level folder so that you can work on folders (in other words, for Azure blob purists, the blob names need to contain at least one /). Trying to list the content of a container results in the following error:

ls: Path must be absolute: asv://mycontainer
Usage: java FsShell [-ls <path>]


Here is an example



That’s why I have a fr-fr folder under my books container in the following example:



A distributed copy (distcp) may generate a few more storage transactions on Windows Azure storage than strictly needed, because Hadoop’s default strategy (speculative execution) uses idle nodes to run the same task several times. This mainly happens at the end of the copy. Remember we calculated that copying 1 TB of data would cost ~$1 in storage transactions. That may be closer to ~$1.20 because of speculative execution.


Why not bypass HDFS, after all?
It is possible to use asv: instead of hdfs:, including when defining the source or the destination of a map reduce job. So why use HDFS?
Here are a few drawbacks of this non HDFS approach:
– you won’t have processing close to the data, which generates network traffic that is slower than interprocess communication inside a machine.
– you will generate many storage transactions against Windows Azure storage (remember: 1 million of them costs $1 of real money). In particular, Hadoop may run a single task several times on multiple nodes simply because it has nodes available and one of those runs might fail.
– HDFS by default splits files into blocks of 64 MB, and this automatically spreads map tasks across those blocks of data. Running directly against Windows Azure Storage may need additional tuning (like explicitly defining the number of tasks).
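To give an idea of the numbers behind the last point, here is a small Python sketch. It assumes one map task per 64 MB block, which is the usual default for splittable input in HDFS; the function name is hypothetical, and the count you would set explicitly when running against Windows Azure Storage depends on your job.

```python
import math

HDFS_BLOCK_MB = 64  # default HDFS block size quoted above

def default_map_tasks(file_size_mb, block_mb=HDFS_BLOCK_MB):
    """Estimate the number of map tasks, assuming one task per block."""
    return math.ceil(file_size_mb / block_mb)

# A single 1 TB file, expressed in MB.
print(default_map_tasks(1024 * 1024))
```

This is the kind of figure to keep in mind when defining a number of tasks by hand for a job that reads directly from asv:.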


Conclusion
In a case where you need to work three days a month on 1 TB of data, it is roughly three times cheaper to have a 32 node cluster that loads its data from Azure Blob Storage and saves it back each time it is created and destroyed than to have an 8 node cluster that keeps the 1 TB of data full time. Copying data between Windows Azure storage and HDFS should be done through distcp, which generates a map job to copy in a distributed way.
This leverages Hadoop as well as Windows Azure elasticity.




Blog Post by: Benjamin GUINEBERTIERE