This post contains instructions about modifying the registry. This is inherently risky and this post is provided AS-IS without warranty of any kind.
I recently resolved this issue for a client, and I suspect many people run into it because of incorrect documentation on installing a clustered Master Secret Server.
The Problem
- The SSO Master Secret Server fails to start after failing over to the second cluster node.
Event Logs
———————————————————————————
Source: ENTSSO
EventID: 10565
Description: The secret could not be loaded from the registry. The service account for the SSO service may have been changed or the secret may be corrupted. Restore the secret from a backup file.
———————————————————————————
Source: ENTSSO
EventID: 10521
Description: Could not load secrets from the registry of the master secret server.
———————————————————————————
Theory 1
- Perhaps the master secret was never restored on the second node. So I dig out the master secret backup file from the secure file share and run the following command at a command prompt from the Enterprise Single Sign-On directory:
ssoconfig -restoresecret <backup file>
- The command completes, I go to Cluster Administrator, take the ENTSSO service offline, bring it back online and voila! It starts fine. End of problem, right? WRONG!
- Just to be thorough, I fail it back over to node1 of the cluster (WHERE IT FAILS!). Now node1 exhibits the same behaviour as node2 did originally.
- Then I restore the key to node1, and node2 has the same problem it had in the first place.
- I scratch my head
- I call MS PSS (always a good idea when you're not sure)
Pause for Thoughts
- Couldn't be the key, because whichever node has just had it restored starts working, and both nodes have had it restored
- Couldn't be the database because one node always works
- Couldn't be the service account because one node always works
- Each node can operate correctly so there is no defect specific to the node
- Why does restoring the key on one node break the other? After all, it's just a REGISTRY KEY (AHA!)
Theory 2
- The registry replication configured on the ENTSSO service in the cluster definition has a domino effect and somehow messes up the registry on whichever node did not just have the key restored. If I disable that and clean up the registries, maybe SSO will fail over properly.
Solution
- Back up the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ENTSSO\SSOSS on both nodes (just in case)
- In Cluster Administrator, remove registry replication for that same key on the ENTSSO "Generic Service" resource
- Delete the key HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ENTSSO\SSOSS from both nodes (carefully)
- Restart both nodes (perhaps not necessary)
- Restore the master secret to the active node, then fail over to the other node
- Restore the master secret to that node as well
- Problem solved (test with multiple failovers to be sure)
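If you would rather script the registry backup and delete steps than click through regedit, here is a minimal sketch, not something I ran at the client. It assumes Python is available on each node and simply wraps the standard reg.exe export and delete commands. The backup file path and the backup/delete/--execute arguments are my own hypothetical choices. It prints each command by default and only runs it when you pass --execute. Remove the registry replication in Cluster Administrator before running the delete step.

```python
import subprocess
import sys

# Registry key named in the Solution steps above.
SSO_KEY = r"HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\ENTSSO\SSOSS"


def backup_command(backup_file):
    """Build the reg.exe command that exports the SSOSS key to a .reg file.

    The target file should not already exist, otherwise reg.exe asks
    whether to overwrite it.
    """
    return ["reg", "export", SSO_KEY, backup_file]


def delete_command():
    """Build the reg.exe command that deletes the SSOSS key without prompting."""
    return ["reg", "delete", SSO_KEY, "/f"]


def run(command, dry_run=True):
    """Print the command line; execute it only when dry_run is False."""
    print(subprocess.list2cmdline(command))
    if not dry_run:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    args = sys.argv[1:]
    dry_run = "--execute" not in args
    if "backup" in args:
        # Hypothetical backup location; pick a path that suits each node.
        run(backup_command(r"C:\Temp\ssoss-backup.reg"), dry_run)
    elif "delete" in args:
        run(delete_command(), dry_run)
```

Run it once with backup on each node, remove the replication, then run it with delete on each node. Leaving off --execute lets you read exactly what would be run before touching the registry.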
Special thanks to Somu J. from MS PSS for the assistance on this problem.