SQL Server AlwaysOn is a popular feature used by many organizations to ensure high availability and disaster recovery for their databases. However, sometimes users may encounter an issue where the availability group goes into a “resolving” state, and they are unsure why this happens.
So, what exactly is the “resolving” state in SQL Server AlwaysOn? When an availability group is set up, the replica can be in either a primary state or a secondary state, depending on its role in the failover cluster manager. The “resolving” state is an intermediate state that occurs during the transition from primary to secondary, or vice versa. If for some reason the transition is not successful, the replica enters the “resolving” state, and during this time, the database is not accessible.
If you encounter this issue, the first step is to find the cause. One way to do this is by checking the ERRORLOG. The ERRORLOG contains a series of events that can provide insights into what might have caused the availability group to enter the “resolving” state.
Now, you might be wondering, where exactly is the ERRORLOG located? There are various ways to find its location, depending on the version of SQL Server you are using. For example, in SQL Server 2016, you can find the ERRORLOG by navigating to the “Logs” folder within the SQL Server installation directory.
Here is an example of what you might find in the ERRORLOG:
2016-02-22 17:36:14.75 Server ***Unable to get thread context for spid 0 2016-02-22 17:36:14.75 Server * ******************************************************************************* 2016-02-22 17:36:14.75 Server * 2016-02-22 17:36:14.75 Server * BEGIN STACK DUMP: 2016-02-22 17:36:14.75 Server * 04/22/15 17:36:14 spid 7416 2016-02-22 17:36:14.75 Server * 2016-02-22 17:36:14.75 Server * Non-yielding Scheduler 2016-02-22 17:36:14.75 Server * 2016-02-22 17:36:14.75 Server * ******************************************************************************* 2016-02-22 17:36:14.76 Server Stack Signature for the dump is 0x000000000000025C …… 2016-02-22 17:36:26.19 Server Error: 19407, Severity: 16, State: 1. 2016-02-22 17:36:26.19 Server The lease between availability group 'ProdAG' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster. 2016-02-22 17:36:26.20 Server AlwaysOn: The local replica of availability group 'ProdAG' is going offline because either the lease expired or lease renewal failed. This is an informational message only. No user action is required.
In this example, it was discovered that a non-yielding scheduler occurred due to a restore command. The restore command caused an increase in the IO load on the server, which in turn prevented the SQL Server from renewing the lease with the cluster. As a result, the lease timeout for the availability group occurred, causing it to enter the “resolving” state.
Along with the ERRORLOG, it is also recommended to check the Windows event logs (Application and System) and the Cluster log from all participating nodes. These logs can provide additional information that can help in troubleshooting the issue.
Have you ever encountered a situation where the availability group was in a “resolving” state? If so, feel free to share your experience in the comments below.
If you found this blog post helpful, you may also be interested in reading our previous blog post on how to add files to the database in the AlwaysOn Availability Group.