Improving SQL Server Job Failure Monitoring

SQL Server Agent provides a built-in alerting process for job failures, but the information it provides is often not very useful. It only gives you basic details such as the job name, the time it failed, who ran it, and which step failed. If you want to see why the job failed, you have to manually review the job history. This can be time-consuming, especially on a busy system with many frequently run jobs.

Fortunately, there are ways to overcome these limitations and improve the monitoring of SQL Server job failures. One approach is to create a SQL Server Agent job that executes a SQL query to retrieve detailed information about job failures since the last time the job ran. This query can then send an email with the failure details using SQL Server Database Mail.

Here is an example of the SQL code that can be added to the job:

-- Get the job ID for the job this is running in.
-- Note, will only run inside the job, not in an SSMS query
DECLARE @JobID UNIQUEIDENTIFIER;
SET @JobID = (SELECT CONVERT(uniqueidentifier, $(ESCAPE_NONE(JOBID))));

--Get the last time this job ran
DECLARE @LastRunTime DATETIME;
SET @LastRunTime = (SELECT MAX([msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]))
                    FROM [msdb].[dbo].[sysjobhistory] jh
                    WHERE jh.[job_id] = @JobID);

--Get all the failed jobs into a temp table, and give each individual job an ID
SELECT
   RANK() OVER(ORDER BY j.[name] ASC) AS FailedJobsID,
   j.[name] AS JobName,
   jh.[step_name] AS StepName,
   [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) AS RunDateTime,
   SUBSTRING(jh2.[message], PATINDEX('%The Job was invoked by User%', jh2.[message]) + 28, PATINDEX('%The last step to run was%', jh2.[message]) -PATINDEX('%The Job was invoked by User%', jh2.[message])-28) AS ExecutedBy,
   REPLACE(SUBSTRING(jh.[message], 1, PATINDEX('%. %', jh.[message])) , 'Executed as user: ','') AS ExecutionContext,
   REPLACE(SUBSTRING(jh.[message], PATINDEX('%. %', jh.[message]) + 2, LEN(jh.[message]) - PATINDEX('%. %', jh.[message])-1), ' The step failed.','') AS FailureMessage,
   0 AS Emailed
 INTO #FailedJobs
 FROM [msdb].[dbo].[sysjobs] j
   INNER JOIN [msdb].[dbo].[sysjobhistory] jh ON jh.[job_id] = j.[job_id]
   INNER JOIN [msdb].[dbo].[sysjobsteps] js ON js.[job_id] = j.[job_id] AND js.[step_id] = jh.[step_id]
   INNER JOIN [msdb].[dbo].[sysjobhistory] jh2 ON jh2.[job_id] = jh.[job_id]
 --Job isn't currently running
 WHERE jh.[run_status] = 0
  --Only get jobs that ran since we last checked for failed jobs
  AND [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) > DATEADD(SECOND,-1,@LastRunTime)
  --Join back to sysjobhistory again to get step_id 0 for the failed job, to find who executed it
  AND jh.[sql_severity] > 0
  AND jh2.[step_id] = 0
  AND [msdb].[dbo].agent_datetime(jh2.[run_date], jh2.[run_time]) <= [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time])
  AND NOT EXISTS (SELECT 1 FROM [msdb].[dbo].[sysjobhistory] jh3
                  WHERE [msdb].[dbo].agent_datetime(jh3.[run_date], jh3.[run_time]) > [msdb].[dbo].agent_datetime(jh2.[run_date], jh2.[run_time])
  AND jh3.[job_id] = jh2.job_id)
  --Add any exclusions here, for example:
  --Any SSIS steps, as the job history doesn't show SSIS catalogue error messages.
  --Checks for running SQL on either node of an Always On Availability Group
  AND js.[subsystem] <> 'SSIS'
  AND jh.[message] NOT LIKE ('%Unable to execute job on secondary node%')
  AND jh.[message] NOT LIKE ('%Request to run job%refused because the job is already running from a request by User%');

--Variable to store the current job being dealt with
DECLARE @CurrentFailedJobID INT;

WHILE EXISTS (SELECT 1 FROM #FailedJobs)
--Loop through all the failed jobs
 BEGIN
   SET @CurrentFailedJobID = (SELECT TOP 1 fj.[FailedJobsID] FROM #FailedJobs fj);

   --Set the email subject
   DECLARE @MailSubject VARCHAR(255);
   SET @MailSubject = (SELECT @@SERVERNAME + ': ' + fj.[JobName] + ' steps have failed'
     FROM #FailedJobs fj
     WHERE fj.[FailedJobsID] = @CurrentFailedJobID
     GROUP BY fj.[JobName]);

   --Set the output as an HTML table to make it clear to read
   DECLARE @tableHTML NVARCHAR(MAX) ;
   SET @tableHTML = N'' +
                    N''+
                    N'' +
                    N'' +
                    CAST ( (
                            SELECT td = fj.[JobName], '',
                                        td = fj.[StepName], '',
                                   td = fj.[RunDateTime], '',
                                   td = fj.[ExecutedBy], '',
                                   td = fj.[ExecutionContext], '',
                                   td = fj.[FailureMessage], ''
                              FROM #FailedJobs fj
                              --Groups all the jobs with the same job name together into one email
                              WHERE fj.[FailedJobsID] = @CurrentFailedJobID
                              ORDER BY fj.[RunDateTime] DESC
                              FOR XML PATH('tr'), TYPE
                                ) AS NVARCHAR(MAX) ) +
                           N'Job Name Job Step Run Time Run By Execution Context Error Message' ;

   --Send the email
   EXEC msdb.dbo.sp_send_dbmail
            @recipients = 'support@mycompany.net',
            @subject = @MailSubject,
            @body = @tableHTML,
            @body_format = 'HTML' ;

  --Delete the currently emailed job from the failed jobs list
  DELETE fj
    FROM #FailedJobs fj
    WHERE fj.[FailedJobsID] = @CurrentFailedJobID;
 END

By scheduling this job to run at regular intervals, you can ensure that you always receive an email with the details of any job failures since the last time the job ran. This allows you to quickly identify and address any issues without the need for manual investigation.

It is recommended to increase the maximum job history log size and the maximum job history rows per job to ensure that you have enough historical data to analyze. This can be done by modifying the job history thresholds in SQL Server Agent settings.

With this improved job failure monitoring in place, you can have a better understanding of the reasons behind job failures and take appropriate actions to resolve them. This can help ensure the smooth operation of your SQL Server environment and minimize any potential disruptions.

Originally posted by Sven Lowry at https://www.xten.uk/technical-blogs/sql-job-failure-monitoring-improvements

Click to rate this post!

[Total: 0 Average: 0]

Comprehensive 360 Degree Assessment

Data Replication

Performance Optimization

Data Security

Database Migration

Expert Consultation

Cloud Migration Made Easy

Considering a move to the cloud? Axial SQL brings you proven migration strategies to streamline your transition. Our expert team ensures a smooth, efficient shift, keeping your data safe and accessible. Start your journey to the cloud with confidence!

SQL Performance Optimization

Is your SQL running slower than expected? Don't let sluggish performance hinder your business. Our optimization experts at Axial SQL specialize in tuning your databases for peak performance. Speed up your SQL and supercharge your data processing today!

Database Stability Solutions

Tired of frequent database outages? Discover stability with Axial SQL! Our comprehensive analysis identifies and resolves your database vulnerabilities. Enhance reliability, reduce downtime, and keep your operations running smoothly with our expert guidance.

Expert Database Team Evaluation

Questioning your database team's efficiency? Let Axial SQL provide an expert, unbiased analysis. We assess your team's strategies and workflows, offering insights and improvements to boost productivity. Elevate your database management to new heights!

Data Security Assurance

Concerned about your database security? Axial SQL is here to fortify your data defenses. Our specialized security assessments identify potential risks and implement robust protections. Keep your sensitive data secure and your peace of mind intact with our expert services.

Published on

Improving SQL Server Job Failure Monitoring

Let's work together