The following post is intended to offer you a set of best practices centered around the development of reliable SQL Azure client applications. The primary focus of this paper is positioned towards handling transient conditions, namely, those intermittent faults, errors and exceptions that need to be accounted for when developing reliable applications for high-density multi-tenant environments such as SQL Azure.

Background

The developers who have already had the opportunity to start working with the Microsoft’s cloud-based relational database service, widely known as SQL Azure, may know that SQL Azure has introduced some specific techniques and approaches to implementing data access service layer in the applications leveraging the SQL Azure infrastructure.

One of the important considerations is the way how client connections are to be handled. SQL Azure comes with throttling behavior the attributes of which can manifest themselves when a client is establishing connections to a SQL Azure database or running queries against it. The database connections can be throttled internally by the SQL Azure fabric for several reasons, such as excessive resource usage, long-running transactions, and possible failover and load balancing actions, leading to termination of a client session or temporary inability to establish new connections while a transient condition persists. The database connections may also be dropped due to the variety of reasons related to network connectivity between the client and distant Microsoft data centers: quality of network, intermittent network faults in the client’s LAN or WAN infrastructure and other transient technical reasons.

The behavior in question was discussed in the article posted on the SQL Azure team blog back in May 2010. The article articulates the need for implementing retry logic in the client code in order to provide reliable connectivity to the SQL Azure databases. In one of our recent Azure customer projects, we have faced with multiple challenges related to this behavior. This experience has led to creating a generic, reusable framework for handling transient conditions using an extensible retry policy model. We hope that our learnings can be of use for many .NET developers working with SQL Azure.

Disclaimer
The information provided in this article reflects the real-world experience with the SQL Azure to date. It is likely that some of the transient conditions discussed below may never surface in a given client application. It is in the nature of a transient condition to be dependent on and be driven by variable technical, environmental, behavioral and other unique characteristics of a particular application or its surrounding infrastructure.

Transient Conditions in SQL Azure

When handling exceptions in the client applications accessing the SQL Azure databases, it is important to differentiate between general errors and faults that require special treatment. Not every exception would be considered as a transient error. The client applications need to ensure that the application code will enter into retry state only when it’s strictly necessary.

Below are some examples of the transient conditions that may occur in the SQL Azure infrastructure:

In order to determine whether or not a specific exception should be treated as “transient” when working with SQL Azure, the following guidelines must be adhered to:

Check the exception type first. The two specific types which the application code would need to filter accordingly are SqlException and TimeoutException;
Filter out those SQL exceptions which do not indicate a transient error. The SqlException.Number property helps assert whether or not an exception should be considered as transient. Do not attempt to parse the exception text as it may vary between different releases of the .NET Framework Data Provider for SQL Server;
Verify if the error number belongs to the family of transient errors by checking it against a set of well-known error codes. The main error codes that need to be accounted for are listed above. In addition, check the up-to-date list of error codes indicating a loss of connection.

This guidance could easily be packaged into a fully reusable framework for handling connection loss and failed SQL commands due to transient conditions.

Transient Conditions Handling Framework

The framework that we have developed in our project takes into account the end requirements for handling the possible transient conditions. Internally, the framework relies on the implementation of a "retry policy" which makes sure that only valid transient errors will be handled. The policy verifies whether or not an exception belongs to the legitimate category of transient faults before the client application enters into retry state.

At a glance, our implementation of the transient condition handling framework:

Provides the foundation for building highly extensible retry logic for handling a variety of transient conditions, not limited to SQL Azure;
Supports a range of pre-defined retry policies (fixed retry interval, progressive retry interval, random exponential backoff);
Supports separate retry policies for SQL connections and SQL commands for additional flexibility;
Supports retry callbacks to notify the user code whenever a retry condition is encountered;
Supports the fast retry mode whereby the very first retry attempt will be made immediately thus not imposing delays when recovering from short-lived transient faults;
Enables to define the retry policies in the application configuration files;
Provides extension methods to support retry capabilities directly in SqlConnection and SqlCommand objects.

The next sections drill down into specific implementation details and are intended to help the developers understand when and how they should make use of the transient condition handling framework referenced above. To follow along, download the full sample code from the MSDN Code Gallery.

Technical Implementation

The following class diagram depicts the underlying technical implementation highlighting all core components and their dependencies:

The key components in the framework are the RetryPolicy<T> and ReliableSqlConnection classes and the ITransientErrorDetectionStrategy interface.

The RetryPolicy<T> class along with its abstract RetryPolicy counterpart encapsulate all the essential logic responsible for iterative execution of developer-defined actions which may result in a transient exception.

The ReliableSqlConnection class is implemented as a look-a-like of SqlConnection and provides a set of value-add methods to ensure that connections could be reliably established and commands could reliably executed against a SQL Azure database.

The ITransientErrorDetectionStrategy interface provides the base mechanism upon which different types of transient conditions can be described and packaged into a reusable policy object that performs validation on a given .NET exception against a well-known set of transient faults. Along with a transient error detection policy for SQL Azure, the framework also includes the transient condition detection strategies for AppFabric Service Bus, AppFabric Message Buffers and Windows Azure storage.

In addition, the class library provides a set of C# extension methods enabling the .NET developers to open SQL Azure database connections and invoke the SQL commands from within a retry policy-aware scope. The extension methods can be useful in the event when the developers are unable to adopt their code to take advantage of the ReliableSqlConnection class. For instance, a developer might be using an existing data access framework (e.g. Enterprise Library) which returns the pre-initialized instances of SqlConnection class. In this case, the extension methods could help add the retry capabilities into the existing code without major re-work.

Usage Patterns

The following sections illustrate some common usage patterns that apply when building reliable SQL Azure client applications using the transient condition handling framework discussed above.

Configuring Retry Policies

There are two primary ways of setting up a retry policy in the transient condition handling framework:

Create an instance of the RetryPolicy<T> class with required transient error detection strategy and appropriate configuration parameters specified at construction time.
Describe the retry policy definitions in the application configuration file and use the provided configuration APIs to instantiate and return an instance of the appropriate retry policy.

The RetryPolicy<T> class allows creating different policies depending on particular needs. The class constructors accept variable input and return an instance of the respective retry policy configured as per specified initialization parameters:

public class RetryPolicy<T> : RetryPolicy where T : ITransientErrorDetectionStrategy, new()
{
    /// <summary>
    /// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and default
    /// fixed time interval between retries.
    /// </summary>
    /// <param name="retryCount">The number of retry attempts.</param>
    public RetryPolicy(int retryCount) : this(retryCount, DefaultRetryInterval) {  /* ... */ }

    /// <summary>
    /// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and time
    /// interval between retries.
    /// </summary>
    /// <param name="retryCount">The number of retry attempts.</param>
    /// <param name="intervalBetweenRetries">The interval between retries.</param>
    public RetryPolicy(int retryCount, TimeSpan intervalBetweenRetries) {  /* ... */ }

    /// <summary>
    /// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and backoff
    /// parameters for calculating the exponential delay between retries.
    /// </summary>
    /// <param name="retryCount">The number of retry attempts.</param>
    /// <param name="minBackoff">The minimum backoff time.</param>
    /// <param name="maxBackoff">The maximum backoff time.</param>
    /// <param name="deltaBackoff">The delta value in the exponential delay between retries.</param>
    public RetryPolicy(int retryCount, TimeSpan minBackoff, TimeSpan maxBackoff, TimeSpan deltaBackoff) {  /* ... */ }

    /// <summary>
    /// Initializes a new instance of the RetryPolicy class with the specified number of retry attempts and 
    /// parameters defining the progressive delay between retries.
    /// </summary>
    /// <param name="retryCount">The number of retry attempts.</param>
    /// <param name="initialInterval">The initial interval which will apply for the first retry.</param>
    /// <param name="increment">The incremental time value for calculating progressive delay between retries.</param>
    public RetryPolicy(int retryCount, TimeSpan initialInterval, TimeSpan increment) {  /* ... */ }
}

The retry policies can also be defined in the application configuration. Each retry policy definition is accompanied with a friendly name and a set of parameters such as retry count and interval:

<?xml version="1.0" encoding="utf-8" ?>
<configuration>
  <configSections>
    <section name="RetryPolicyConfiguration" type="Microsoft.AppFabricCAT.Samples.Azure.TransientFaultHandling.Configuration.RetryPolicyConfigurationSettings, Microsoft.AppFabricCAT.Samples.Azure.TransientFaultHandling" />
  </configSections>

  <RetryPolicyConfiguration defaultPolicy="FixedIntervalDefault" defaultSqlConnectionPolicy="FixedIntervalDefault" defaultSqlCommandPolicy="FixedIntervalDefault" defaultStoragePolicy="IncrementalIntervalDefault" defaultCommunicationPolicy="IncrementalIntervalDefault">
    <add name="FixedIntervalDefault" maxRetryCount="10" retryInterval="100" />
    <add name="IncrementalIntervalDefault" maxRetryCount="10" retryInterval="100" retryIncrement="50" />
    <add name="ExponentialIntervalDefault" maxRetryCount="10" minBackoff="100" maxBackoff="1000" deltaBackoff="100" />
  </RetryPolicyConfiguration>
</configuration>

Once the retry policy configuration is defined, a policy can be instantiated using the following 3 simple lines of code:

// Retrieve the retry policy settings from the application configuration file.
RetryPolicyConfigurationSettings retryPolicySettings = ApplicationConfiguration.Current.GetConfigurationSection<RetryPolicyConfigurationSettings>(RetryPolicyConfigurationSettings.SectionName);

// Retrieve the required retry policy definition by its friendly name.
RetryPolicyInfo retryPolicyInfo = retryPolicySettings.Policies.Get("FixedIntervalDefault");

// Create an instance of the respective retry policy using the transient error detection strategy for SQL Azure.
RetryPolicy sqlAzureRetryPolicy = retryPolicyInfo.CreatePolicy<SqlAzureTransientErrorDetectionStrategy>();

The RetryPolicy instances carry all the necessary “intellect” capable of recognizing the legitimate transient conditions when executing the user code as shown in the next two sections.

Reliably Opening SQL Azure Database Connections

In order to ensure that a connection to a SQL Azure database can be reliably established, one of the following approaches can be adopted:

Use the Open method from the ReliableSqlConnection class. Should the connection fail to be established from the first attempt, the associated retry policy will take effect and a request will be retried as per the specified retry policy;
Use the OpenWithRetry extension method against an instance of the SqlConnection class. Its behavior is similar to the above. Behind the scene, the specified retry policy will kick in and will re-try the request should a transient error be encountered.

Below are some examples of using the above approaches:

using (ReliableSqlConnection conn = new ReliableSqlConnection(connString))
{    
     // Attempt to open a connection using the specified retry policy.    
     conn.Open(sqlAzureRetryPolicy);    
     // ... execute SQL queries against this connection ...
}

using (ReliableSqlConnection conn = new ReliableSqlConnection(connString, sqlAzureRetryPolicy))
{
     // Attempt to open a connection using the retry policy specified at construction time.    
     conn.Open();
    // ... execute SQL queries against this connection ...
}

using (SqlConnection conn = new SqlConnection(connString))
{
    // Attempt to open a connection using the specified retry policy. 
    // The extension method is used in this context since we are dealing with a SqlConnection instance.
    conn.OpenWithRetry(sqlAzureRetryPolicy);    
    // ... execute SQL queries against this connection ...
}

Note that both approaches deliver the same end result. In comparison to the standard SqlConnection class, ReliableSqlConnection provides a few value-add capabilities such as retrieving the current session’s CONTEXT_INFO value for tracing purposes and executing SQL commands using general purpose ExecuteCommant<T> method.

Reliably Executing Queries Against SQL Azure Databases

When executing queries against a SQL Azure database, it is also important to handle situations when a connection may be terminated due to transient reasons previously discussed, .e.g query throttling. Should this occur, an attempt to retry the query may become a necessity. Note that not all queries could be safely retried, most importantly those which do not leave the data in a consistent state, for instance, when making updates to multiple tables without an outer transaction ensuring atomicity of the overall operation.

For those queries that could be safely retried, one of the following approaches can be adopted:

Use the ExecuteCommand or ExecuteCommand<T> method in the ReliableSqlConnection class. A failed command will be automatically retried as per the specified policy. The retry operation will also ensure that a SQL connection will be re-opened (if required) before attempting to re-run the failed SQL command.
Use the appropriate extension method available for the SqlCommand class such as ExecuteNonQueryWithRetry, ExecuteReaderWithRetry, etc.

Below are some examples of using the above approaches:

using (ReliableSqlConnection conn = new ReliableSqlConnection(connString, sqlAzureRetryPolicy))
{
    conn.Open();
 
    SqlCommand selectCommand = new SqlCommand("select name, object_id from sys.objects where name = 'Application'", conn.Current);
 
    // Execute the above query using a retry-aware ExecuteCommand method which will
    // automatically retry if the query has failed (or connection was dropped)
    using (IDataReader dataReader = conn.ExecuteCommand<IDataReader>(selectCommand))
    {
        if (dataReader.Read())
        {
            string objectName = dataReader.GetString(dataReader.GetOrdinal("name"));
            tableObjectID = dataReader.GetInt32(dataReader.GetOrdinal("object_id"));
        }
    }
}

using (ReliableSqlConnection conn = new ReliableSqlConnection(connString))
{
    conn.Open(sqlAzureRetryPolicy);
 
    IDbCommand selectCommand = conn.CreateCommand();
    selectCommand.CommandText = "UPDATE Application SET [DateUpdated] = getdate()";
 
    // Execute the above query using a retry-aware ExecuteCommand method which will
    // automatically retry if the query has failed (or connection was dropped)
    int recordsAffected = conn.ExecuteCommand(selectCommand, sqlAzureRetryPolicy);
}

using (SqlConnection conn = new SqlConnection(connString))
{
    conn.Open();
 
    SqlCommand selectCommand = conn.CreateCommand();
    selectCommand.CommandText = "select * from sys.objects where name = 'Application'";
 
    int tableObjectID = Int32.MinValue;
 
    // Execute the above query using a retry-aware ExecuteReaderWithRetry method which will
    // automatically retry if the query has failed (or connection was dropped)
    using (IDataReader dataReader = selectCommand.ExecuteReaderWithRetry(sqlAzureRetryPolicy))
    {
        if (dataReader.Read())
        {
            string objectName = dataReader.GetString(dataReader.GetOrdinal("name"));
            tableObjectID = dataReader.GetInt32(dataReader.GetOrdinal("object_id"));
        }
    }
 
    selectCommand = conn.CreateCommand();
    selectCommand.CommandText = "select object_id from sys.objects where name = 'Application'";
 
    // Execute the above query using a retry-aware ExecuteScalarWithRetry method which will
    // automatically retry if the query has failed (or connection was dropped)
    object objectID = selectCommand.ExecuteScalarWithRetry(sqlAzureRetryPolicy);
 
    selectCommand = conn.CreateCommand();
    selectCommand.CommandText = "UPDATE Application SET [DateUpdated] = getdate()";
 
    // Execute the above query using a retry-aware ExecuteNonQueryWithRetry method which
    // will automatically retry if the query has failed (or connection was dropped)
    int recordsAffected = selectCommand.ExecuteNonQueryWithRetry(sqlAzureRetryPolicy);
}

The usage patterns have so far been focused on fairly primitive ADO.NET examples. The following section offers slightly more advanced scenarios where the transient error handling framework could increase the reliability of SQL Azure client applications regardless how these applications access their data.

Advanced Usage Patterns

It is fair to expect that modern data-oriented software would not always be going down the path of using the plain ADO.NET APIs when accessing the application data. Many alternative technologies have been developed over the past few years to support advanced data access scenarios: Entity Framework, WCF Data Services, LINQ to SQL, ASP.NET Dynamic Data, just to name a few. All these technologies are intended to significantly reduce the complexity of data management and greatly simplify the way how rich data is modeled, queried and projected to the application domain-specific space.

Whenever SQL Azure is chosen as relational data platform interoperable with any of the above technologies, handling transient conditions will immediately become a requirement. Given that the data access is heavily abstracted with the use of these technologies, the approach to adding resistance against transient faults differs from what has been discussed up to this point.

Fortunately, the implementation of the retry policy model in the transient condition handling framework makes it easier to wrap any user code into a retryable scope. Should a transient fault be encountered, the entire scope will be re-run. This capability is delivered by the ExecuteAction and ExecuteAction<T> methods:

sqlAzureRetryPolicy.ExecuteAction(() =>
{
    // Invoke a LINQ2SQL query.
});

return sqlAzureRetryPolicy.ExecuteAction<IEnumerable<string>>(() =>
{
    // Invoke a LINQ query against the Entity Framework model.
    return result;
});

Note that retryable scope should present itself as an atomic unit of work. The scope could be invoked multiple times and it is therefore important to ensure that it leaves the underlying data in a transactionally consistent state. In addition, the scope should not swallow exceptions, these are required for detecting transient conditions.

The following sample is borrowed from the MSDN Library and enriched with retry-aware logic where appropriate. This will increase the overall reliability of the client code making it robust and more resistant to potential connection or query throttling should the application database be hosted in SQL Azure.

// Define the order ID for the order we want.
int orderId = 43680;

// Create an EntityConnection.
EntityConnection conn = new EntityConnection("name=AdventureWorksEntities");

// Create a long-running context with the connection.
AdventureWorksEntities context = new AdventureWorksEntities(conn);

try
{
    // Explicitly open the connection inside a retry-aware scope.
    sqlAzureRetryPolicy.ExecuteAction(() =>
    {
        if (conn.State != ConnectionState.Open)
        {
            conn.Open();
        }
    });

    // Execute a query to return an order. Use a retry-aware scope for reliability.
    SalesOrderHeader order = sqlAzureRetryPolicy.ExecuteAction<SalesOrderHeader>(() =>
    {
        return context.SalesOrderHeaders.Where("it.SalesOrderID = @orderId", 
                new ObjectParameter("orderId", orderId)).Execute(MergeOption.AppendOnly).First();
    });

    // Change the status of the order.
    order.Status = 1;

    // Delete the first item in the order.
    context.DeleteObject(order.SalesOrderDetails.First());

    // Save changes inside a retry-aware scope.
    sqlAzureRetryPolicy.ExecuteAction(() => { context.SaveChanges(); });

    SalesOrderDetail detail = new SalesOrderDetail
    {
        SalesOrderID = 1,
        SalesOrderDetailID = 0,
        OrderQty = 2,
        ProductID = 750,
        SpecialOfferID = 1,
        UnitPrice = (decimal)2171.2942,
        UnitPriceDiscount = 0,
        LineTotal = 0,
        rowguid = Guid.NewGuid(),
        ModifiedDate = DateTime.Now
    };

    order.SalesOrderDetails.Add(detail);

    // Save changes again inside a retry-aware scope.
    sqlAzureRetryPolicy.ExecuteAction(() => { context.SaveChanges(); });
}
finally
{
    // Explicitly dispose of the context and the connection. 
    context.Dispose();
    conn.Dispose();
}

In summary, the versatility of the transient condition handling framework comes with the ability to perform retry-aware operations in a variety of contexts, whether it’s a single SQL statement or a large unit of work. In all cases, the way how the transient faults are detected will be consistently similar.

Conclusion

The underlying fabric managing the SQL Azure nodes comes with specific elements of behavior which need to be fully understood by the client applications accessing the SQL Azure databases. The throttling behavior of SQL Azure forces to come up with a better way of handling connections and executing queries. This includes the need for handling transient exceptions to ensure that the client code is able to behave reliably in the event of SQL Azure database connections being throttled by the Resource Manager. There are also other intermittent conditions which need to be accounted for. Consequently, having a robust retry mechanism in the SQL Azure client applications becomes imperative.

We aimed to provide the community with validated best practices to help the .NET developers build a reliable data access layer taking into account these specific behavioral attributes of our cloud-based database infrastructure. Our best practices were presented in a form a reusable framework which developers could easily plug in and adopt in their solutions.

The accompanying sample code is available for download from the MSDN Code Gallery. Note that all source code files are governed by the Microsoft Public License (Ms-PL) as explained in the corresponding legal notices.

Additional Resources/References

For more information on the topic discussed in this paper, please check out the following resources:

"General Guidelines and Limitations (SQL Azure Database)" article in the MSDN library;
"Development Considerations in SQL Azure" article in the MSDN library;
"Connections and SQL Azure" post on the SQL Azure team blog;
"Development Best Practices and Patterns for Using Microsoft SQL Azure Databases" presentation at PDC09.

Authored by: Valery Mizonov

Reviewed by: James Podgorski, Michael Thomassy

Best Practices for Handling Transient Conditions in SQL Azure Client Applications