Today, we will discuss an interesting scenario related to SQL Server Cluster Resources. A client recently reached out to me for assistance with a two-node Windows cluster running three SQL Server instances. They were facing an issue where one of the instances would start from the services.msc but not from the Failover Cluster Manager when attempting to bring the service online.
Upon further investigation, it was discovered that the SQL Server service was able to start, but the cluster was not able to connect to the SQL Server. This connectivity issue was confirmed by the errors found in the event viewer:
[sqsrvres] ODBC sqldriverconnect failed [sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = ffffffff; message = [Microsoft][SQL Server Native Client 10.0]SQL Server Network Interfaces: Error Locating Server/Instance Specified [xFFFFFFFF]. [sqsrvres] ODBC sqldriverconnect failed [sqsrvres] checkODBCConnectError: sqlstate = HYT00; native error = 0; message = [Microsoft][SQL Server Native Client 10.0]Login timeout expired [sqsrvres] ODBC sqldriverconnect failed [sqsrvres] checkODBCConnectError: sqlstate = 08001; native error = ffffffff; message = [Microsoft][SQL Server Native Client 10.0]A network-related or instance-specific error has occurred while establishing a connection to SQL Server. Server is not found or not accessible. Check if instance name is correct and if SQL Server is configured to allow remote connections. For more information see SQL Server Books Online.
It was clear that there was a connectivity issue from the local machine to the SQL instance. The Cluster Service attempts to connect to the SQL service every few minutes for the IsAlive check. If this fails, the SQL resource is restarted, even if the instance was online.
To resolve this issue, I followed the following steps:
- Start the SQL Service via the command:
net start MSSQL$SQL2014(ReplaceSQL2014with the appropriate instance name) - Make a connection to SQL Server using SQLCMD:
SQLCMD -S <ServerName\InstanceName> - Check the Errorlog to find the exact server name
- Perform various troubleshooting steps such as ensuring the SQL Browser Service is running, creating a TCP alias to the SQL Instance, and allowing connections through the firewall
Once the connectivity issue is resolved, stop the SQL Service using the command: net stop MSSQL$SQL2014 (Replace SQL2014 with the appropriate instance name)
After following these steps, the client informed me that they had a TCP alias already created, but the port number of the SQL Server had changed, causing the connections to fail. After creating the correct alias, the SQL Server was able to come online in the Failover Cluster Manager.
Have you ever encountered a similar situation? Feel free to share your experiences and insights in the comments section below.
Thank you for reading!