This post contains instructions about modifying the registry. This is inherently risky and this post is provided AS-IS without warranty of any kind.

I recently resolved this issue for a client but I suspect that many people have it due to some incorrect documentation regarding the installation of a clustered Master Secret Server.

The Problem

  1. The SSO Master Secret server fails to start after failing over to the second cluster node.

Event Logs

———————————————————————————
Source: ENTSSO
EventID: 10565
Description: The secret could not be loaded from the registry. The service account for the SSO service may have been changed or the secret may be corrupted. Restore the secret from a backup file.
———————————————————————————
Source: ENTSSO
EventID: 10521
Description: Could not load secrets from the registry of the master secret server.
———————————————————————————

Theory 1

  1. Perhaps the master secret was never installed on the second node. So I dig out the Master Secret from the secure file share and run the following command at a command prompt, from the Enterprise Single Sign-On directory:

    ssoconfig -restoresecret <keyname>

  2. The command completes. I go to cluster administrator, take the ENTSSO service offline, bring it back online and voila! It starts fine. End of problem, right? WRONG!
  3. Just to be thorough, I fail it back over to node1 of the cluster (WHERE IT FAILS!). Now node1 exhibits the same behaviour as node2 did originally.
  4. Then I restore the key to node1, and node2 has the same problem it had in the first place.
  5. I scratch my head.
  6. I call MS PSS (always a good idea when you're not sure)

Pause for Thoughts

  1. Couldn't be the key, because whichever node has just had it restored starts working, and both nodes have had it restored
  2. Couldn't be the database because one node always works
  3. Couldn't be the service account because one node always works
  4. Each node can operate correctly so there is no defect specific to the node
  5. Why does restoring the key on one node break the other? After all, it's just a REGISTRY KEY (AHA!)

Theory 2

  1. The registry replication configured on the ENTSSO service in the cluster definition has a domino effect and somehow messes up the registry on whichever node did not just have the key restored. If I disable that and clean up the registries, maybe SSO will fail over properly.

Solution

  1. Back up the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ENTSSO\SSOSS on both nodes (just in case)
  2. In cluster administrator, remove registry replication for that same key on the ENTSSO "Generic Service"
  3. Delete the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ENTSSO\SSOSS from both nodes (carefully)
  4. Restart both nodes (perhaps not necessary)
  5. Restore the master secret to the active node, then fail over to the other node
  6. Restore master secret to this node as well.
  7. Problem solved (test with multiple failovers to be sure)
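
If you would rather have a checklist in front of you before touching anything, the steps above can be sketched as a small dry-run script. This is only a sketch under assumptions: the node names and the secret file path are hypothetical placeholders, and `reg export` / `reg delete` are my substitution for doing the backup and delete by hand in the registry editor. The script only prints what to do on each node; it executes nothing. The cluster steps are left as manual reminders.

```python
# Dry-run sketch: prints the steps in order, never executes anything.
# Key path is the one from this post; everything passed in is a placeholder.
SSO_KEY = r"HKLM\SOFTWARE\Microsoft\ENTSSO\SSOSS"


def build_plan(nodes, secret_file):
    """Return (where, action) pairs mirroring the numbered steps above."""
    active, passive = nodes  # assumes a two-node cluster, as in this post
    plan = []
    # Step 1: back up the key on every node (assumption: using reg.exe).
    for node in nodes:
        plan.append((node, f'reg export "{SSO_KEY}" "{node}-SSOSS.reg"'))
    # Step 2: done by hand in cluster administrator.
    plan.append(("cluster", "MANUAL: remove registry replication for the key "
                            "on the ENTSSO Generic Service"))
    # Step 3: delete the key on every node (assumption: using reg.exe).
    for node in nodes:
        plan.append((node, f'reg delete "{SSO_KEY}" /f'))
    # Step 4: restart both nodes (perhaps not necessary).
    for node in nodes:
        plan.append((node, "MANUAL: restart the node"))
    # Steps 5 and 6: restore on the active node, fail over, restore again.
    # Run ssoconfig from the Enterprise Single Sign-On directory.
    plan.append((active, f'ssoconfig -restoresecret "{secret_file}"'))
    plan.append(("cluster", f"MANUAL: fail over to {passive}"))
    plan.append((passive, f'ssoconfig -restoresecret "{secret_file}"'))
    return plan


if __name__ == "__main__":
    # NODE1, NODE2 and the file path are hypothetical examples.
    for where, action in build_plan(["NODE1", "NODE2"], r"C:\secure\master-secret.bak"):
        print(f"[{where}] {action}")
```

Printing the plan first and then running each line by hand on the right node keeps the destructive delete in step 3 from happening before the backups in step 1 exist.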

Special thanks to Somu J. from MS PSS for the assistance on this problem.