Published on

October 17, 2020

Improving SQL Server Job Failure Monitoring

SQL Server Agent provides a built-in alerting process for job failures, but the information it provides is often not very useful. It only gives you basic details such as the job name, the time it failed, who ran it, and which step failed. If you want to see why the job failed, you have to manually review the job history. This can be time-consuming, especially on a busy system with many frequently run jobs.

Fortunately, there are ways to overcome these limitations and improve the monitoring of SQL Server job failures. One approach is to create a SQL Server Agent job that executes a SQL query to retrieve detailed information about job failures since the last time the job ran. This query can then send an email with the failure details using SQL Server Database Mail.

Here is an example of the SQL code that can be added to the job:

-- Get the job ID for the job this is running in.
-- Note, will only run inside the job, not in an SSMS query
DECLARE @JobID UNIQUEIDENTIFIER;
SET @JobID = (SELECT CONVERT(uniqueidentifier, $(ESCAPE_NONE(JOBID))));

--Get the last time this job ran
DECLARE @LastRunTime DATETIME;
SET @LastRunTime = (SELECT MAX([msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]))
                    FROM [msdb].[dbo].[sysjobhistory] jh
                    WHERE jh.[job_id] = @JobID);

--Get all the failed jobs into a temp table, and give each individual job an ID
SELECT
   RANK() OVER(ORDER BY j.[name] ASC) AS FailedJobsID,
   j.[name] AS JobName,
   jh.[step_name] AS StepName,
   [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) AS RunDateTime,
   SUBSTRING(jh2.[message], PATINDEX('%The Job was invoked by User%', jh2.[message]) + 28, PATINDEX('%The last step to run was%', jh2.[message]) -PATINDEX('%The Job was invoked by User%', jh2.[message])-28) AS ExecutedBy,
   REPLACE(SUBSTRING(jh.[message], 1, PATINDEX('%. %', jh.[message])) , 'Executed as user: ','') AS ExecutionContext,
   REPLACE(SUBSTRING(jh.[message], PATINDEX('%. %', jh.[message]) + 2, LEN(jh.[message]) - PATINDEX('%. %', jh.[message])-1), ' The step failed.','') AS FailureMessage,
   0 AS Emailed
 INTO #FailedJobs
 FROM [msdb].[dbo].[sysjobs] j
   INNER JOIN [msdb].[dbo].[sysjobhistory] jh ON jh.[job_id] = j.[job_id]
   INNER JOIN [msdb].[dbo].[sysjobsteps] js ON js.[job_id] = j.[job_id] AND js.[step_id] = jh.[step_id]
   INNER JOIN [msdb].[dbo].[sysjobhistory] jh2 ON jh2.[job_id] = jh.[job_id]
 --Job isn't currently running
 WHERE jh.[run_status] = 0
  --Only get jobs that ran since we last checked for failed jobs
  AND [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) > DATEADD(SECOND,-1,@LastRunTime)
  --Join back to sysjobhistory again to get step_id 0 for the failed job, to find who executed it
  AND jh.[sql_severity] > 0
  AND jh2.[step_id] = 0
  AND [msdb].[dbo].agent_datetime(jh2.[run_date], jh2.[run_time]) <= [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time])
  AND NOT EXISTS (SELECT 1 FROM [msdb].[dbo].[sysjobhistory] jh3
                  WHERE [msdb].[dbo].agent_datetime(jh3.[run_date], jh3.[run_time]) > [msdb].[dbo].agent_datetime(jh2.[run_date], jh2.[run_time])
  AND jh3.[job_id] = jh2.job_id)
  --Add any exclusions here, for example:
  --Any SSIS steps, as the job history doesn't show SSIS catalogue error messages.
  --Checks for running SQL on either node of an Always On Availability Group
  AND js.[subsystem] <> 'SSIS'
  AND jh.[message] NOT LIKE ('%Unable to execute job on secondary node%')
  AND jh.[message] NOT LIKE ('%Request to run job%refused because the job is already running from a request by User%');

--Variable to store the current job being dealt with
DECLARE @CurrentFailedJobID INT;

WHILE EXISTS (SELECT 1 FROM #FailedJobs)
--Loop through all the failed jobs
 BEGIN
   SET @CurrentFailedJobID = (SELECT TOP 1 fj.[FailedJobsID] FROM #FailedJobs fj);

   --Set the email subject
   DECLARE @MailSubject VARCHAR(255);
   SET @MailSubject = (SELECT @@SERVERNAME + ': ' + fj.[JobName] + ' steps have failed'
     FROM #FailedJobs fj
     WHERE fj.[FailedJobsID] = @CurrentFailedJobID
     GROUP BY fj.[JobName]);

   --Set the output as an HTML table to make it clear to read
   DECLARE @tableHTML NVARCHAR(MAX) ;
   SET @tableHTML = N'' +
                    N''+
                    N'' +
                    N'' +
                    CAST ( (
                            SELECT td = fj.[JobName], '',
                                        td = fj.[StepName], '',
                                   td = fj.[RunDateTime], '',
                                   td = fj.[ExecutedBy], '',
                                   td = fj.[ExecutionContext], '',
                                   td = fj.[FailureMessage], ''
                              FROM #FailedJobs fj
                              --Groups all the jobs with the same job name together into one email
                              WHERE fj.[FailedJobsID] = @CurrentFailedJobID
                              ORDER BY fj.[RunDateTime] DESC
                              FOR XML PATH('tr'), TYPE
                                ) AS NVARCHAR(MAX) ) +
                           N'
Job NameJob StepRun TimeRun ByExecution ContextError Message
' ; --Send the email EXEC msdb.dbo.sp_send_dbmail @recipients = 'support@mycompany.net', @subject = @MailSubject, @body = @tableHTML, @body_format = 'HTML' ; --Delete the currently emailed job from the failed jobs list DELETE fj FROM #FailedJobs fj WHERE fj.[FailedJobsID] = @CurrentFailedJobID; END

By scheduling this job to run at regular intervals, you can ensure that you always receive an email with the details of any job failures since the last time the job ran. This allows you to quickly identify and address any issues without the need for manual investigation.

It is recommended to increase the maximum job history log size and the maximum job history rows per job to ensure that you have enough historical data to analyze. This can be done by modifying the job history thresholds in SQL Server Agent settings.

With this improved job failure monitoring in place, you can have a better understanding of the reasons behind job failures and take appropriate actions to resolve them. This can help ensure the smooth operation of your SQL Server environment and minimize any potential disruptions.

Originally posted by Sven Lowry at https://www.xten.uk/technical-blogs/sql-job-failure-monitoring-improvements

Click to rate this post!
[Total: 0 Average: 0]

Let's work together

Send us a message or book free introductory meeting with us using button below.