Background: The large(ish)
BizTalk system I’ve been working on for the last few months consumes a few external
Web Services. Whilst doing some early performance tuning, I noticed that I was getting
pretty serious request-response latency from the primary BizTalk Web Service… Like
over 45 secs, instead of the expected 0.5 – 2 secs…
Fortunately, I routinely
put tracing/audit calls in my BizTalk code which include timing info – and I always log
the times around orchestration entry/exit… So I was quite surprised to see that
a request to one of the external Web Services was taking 15 secs to respond… especially
as the web service was hosted (temporarily) on the same server…
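(For context, the tracing I'm talking about is nothing more sophisticated than a small helper called from Expression shapes at the start and end of each orchestration. The sketch below is a hypothetical stand-in for that kind of helper, not the actual audit component we use:)

    using System;
    using System.Diagnostics;

    // Hypothetical tracing helper – called from an Expression shape at
    // orchestration entry and exit so each boundary gets a timestamp.
    public class OrchestrationTrace
    {
        public static void Write(string orchestrationName, string checkpoint)
        {
            Trace.WriteLine(String.Format(
                "{0:yyyy-MM-dd HH:mm:ss.fff} {1} - {2}",
                DateTime.Now, orchestrationName, checkpoint));
        }
    }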
Although I didn’t
have the source code, with the help of Lutz Roeder’s trusty .NET
Reflector I was poking around the innards within minutes and couldn’t find anything
that would cause a delay of 15 secs.
What’s more, after
further testing I found that the problem only occurred the first time after IIS was
restarted. After the first time, it would return within 2 secs (still not very good,
but much better). One thing I noticed about the Web Service was that it made use of
some code in an unmanaged assembly (old school C++ DLL)…
On a hunch, I wrote
a simple test Web Service which just referenced some functions in the DLL (using
the DllImport attribute)…
and bang, a 15 sec delay before the Web Service constructor was called.
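The test service amounted to little more than an .asmx code-behind with a P/Invoke declaration – something along these lines (the DLL name and exported function here are made-up stand-ins for the vendor's actual exports):

    using System.Runtime.InteropServices;
    using System.Web.Services;

    public class PInvokeTestService : WebService
    {
        // Hypothetical import – the real vendor DLL and entry point differed.
        [DllImport("VendorLibrary.dll")]
        private static extern int GetLibraryVersion();

        [WebMethod]
        public int CallUnmanagedCode()
        {
            // Simply referencing the signed native DLL like this was enough
            // to reproduce the 15 sec delay when hosted under IIS.
            return GetLibraryVersion();
        }
    }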
I also tried the
same trick in a console app… and there was no delay.
By this time I was
fairly stumped. This issue only happened on our Load Test server: it didn’t happen
on the Dev or Test servers. And it seemed to only happen to code hosted under IIS.
I figured it was
time to dig a bit deeper and see what IIS was doing: starting up Process
Monitor (to monitor file and registry access) and TCP
View (to monitor network activity), I called my test Web Service.
Filtering the Process
Monitor logs to the w3wp.exe process (the IIS Worker Process for Application
Pools) I noticed that w3wp was reading a lot of security and cryptography keys from
the registry:
Hmm. Interesting.
Then I looked at
the TCP View log. And I noticed something very interesting coming from the w3wp process:
a connection was being attempted to http://crl.verisign.com and
was being blocked.
How do I know it
was being blocked? Well, the TCP View log showed a SYN_SENT, which appeared and was
visible for 15 secs before disappearing:
Now if you’re not
familiar with how TCP
works, then you might
miss the significance of this: When a TCP client attempts to contact a TCP server,
the first thing performed is a handshake. This involves the client sending a SYN,
the server responding with a SYN-ACK, and the client replying with a final ACK.
In the TCP View tool,
the rightmost column indicates the state of the local TCP connection end-point. You’ll
see that in this case, the state is SYN_SENT: this state indicates that the local
TCP client is waiting for a SYN-ACK response from the SYN it sent.
Why would this be?
Well, for any number of reasons, but mostly because the address can’t be reached – e.g. a proxy
server is required, or a firewall is silently dropping the outgoing packets (if the firewall
actively rejected the connection, you’d get an immediate error rather than a SYN_SENT hanging around).
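You can see the same behaviour from a few lines of code – a minimal sketch that attempts a connection to an address where the SYN simply gets dropped (192.0.2.1 is a reserved test address; substitute anything your firewall black-holes):

    using System;
    using System.Diagnostics;
    using System.Net.Sockets;

    class SynTimeoutDemo
    {
        static void Main()
        {
            Stopwatch timer = Stopwatch.StartNew();
            try
            {
                using (TcpClient client = new TcpClient())
                {
                    // Connect() blocks while the stack retransmits the unanswered
                    // SYN; during this time netstat/TCP View shows SYN_SENT.
                    client.Connect("192.0.2.1", 80);
                }
            }
            catch (SocketException ex)
            {
                Console.WriteLine("Connect failed after {0:N1} secs: {1}",
                    timer.Elapsed.TotalSeconds, ex.SocketErrorCode);
            }
        }
    }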
And what’s the relevance
of crl.verisign.com? A CRL (Certificate Revocation List) is used to indicate which
Certificates (issued by a Certificate Authority) are no longer valid.
And this check is only
performed when you are using an assembly which has been digitally signed.
(Note:
Microsoft became really, really serious about CRLs back in 2001, when VeriSign
mistakenly issued two Microsoft code-signing certificates to an impostor – which meant
that the impostor could sign code that appeared to have come from Microsoft).
So what does this
have to do with
my Web Service and a 15 second
delay?
Looking through the
assemblies in the Web Service, I noticed that the unmanaged assembly was digitally
signed. Aha! So when the unmanaged assembly was referenced in a Web Service, IIS would
hand off control to the Windows Cryptography API (CAPI) and ask it to verify the certificate.
And this involves retrieving a CRL for the certificate.
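As an aside, if you want to check whether a given DLL carries an Authenticode signature at all, the sketch below reads the embedded certificate (note: this only reads the certificate – it doesn’t perform the full WinVerifyTrust verification, which is the step that triggers the CRL download):

    using System;
    using System.Security.Cryptography;
    using System.Security.Cryptography.X509Certificates;

    class SignatureCheck
    {
        static void Main(string[] args)
        {
            string path = args[0];  // path to the DLL to inspect
            try
            {
                // Throws a CryptographicException if the file has no
                // Authenticode signature.
                X509Certificate cert = X509Certificate.CreateFromSignedFile(path);
                Console.WriteLine("Signed by: " + cert.Subject);
            }
            catch (CryptographicException)
            {
                Console.WriteLine("No Authenticode signature found.");
            }
        }
    }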
After doing some
more poking
around, I found this blog
post (from one of the developers at Xceed
Software, a company specializing in .NET components).
The thing that drew
my attention was this comment:
Some people behind
a very strict firewall, ignoring outgoing connections instead of rejecting them, had
a timeout while loading our assemblies. That timeout (at least 15 seconds) is hardcoded
in WinVerifyTrust!
Eureka! That’s exactly my problem!
And it explained
why it was only happening on our Load Test server: our Dev and Test servers have access
to the internet, whilst the Load Test server does not – it has no proxy server configured.
To test this theory,
I configured the Load Test server to use a proxy server. If you haven’t done this
before, you can’t do it via Internet Explorer settings – that only configures a proxy
for applications which go via wininet.dll (i.e.
the Internet Explorer API), which uses the proxy server settings in the Internet
Explorer Internet Options/Connections/LAN Settings panel.
CAPI, however, uses winhttp.dll (Microsoft
Windows HTTP Services). And to set the proxy server for that, you need to use the proxycfg command-line
tool, which manages the proxy settings for winhttp.
If you run proxycfg
-u, this will
import your
Internet Explorer proxy server settings:
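For reference, the commands look like this (the proxy name in the second form is a made-up example):

    rem Import the current user's Internet Explorer proxy settings into WinHTTP
    proxycfg -u

    rem Or specify a proxy server and bypass list explicitly
    proxycfg -p myproxy:8080 "<local>"

    rem Run with no arguments to display the current WinHTTP proxy settings
    proxycfg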
So I did this, restarted
IIS and presto! The problem
went away.
However, giving the
Load Test server access to the Internet is a fairly large security hole (at least
it is in our network, where the Load Test server is behind both the firewall
and DMZ).
So, we contacted
the vendor who supplied the unmanaged DLL, and managed to get them to release a new
version which wasn’t digitally signed.
Which solved the problem
of the latency in the Web Service (and accounted for 15 secs of the 45 secs latency).
Solving the remaining
30 secs latency is the subject of next week’s post.