SQL Server Agent provides a built-in alerting process for job failures, but the information it provides is often not very useful. It only gives you basic details such as the job name, the time it failed, who ran it, and which step failed. If you want to see why the job failed, you have to manually review the job history. This can be time-consuming, especially on a busy system with many frequently run jobs.
Fortunately, there are ways to overcome these limitations and improve the monitoring of SQL Server job failures. One approach is to create a SQL Server Agent job that executes a SQL query to retrieve detailed information about job failures since the last time the job ran. This query can then send an email with the failure details using SQL Server Database Mail.
Here is an example of the SQL code that can be added to the job:
-- Get the job ID for the job this is running in. -- Note, will only run inside the job, not in an SSMS query DECLARE @JobID UNIQUEIDENTIFIER; SET @JobID = (SELECT CONVERT(uniqueidentifier, $(ESCAPE_NONE(JOBID)))); --Get the last time this job ran DECLARE @LastRunTime DATETIME; SET @LastRunTime = (SELECT MAX([msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time])) FROM [msdb].[dbo].[sysjobhistory] jh WHERE jh.[job_id] = @JobID); --Get all the failed jobs into a temp table, and give each individual job an ID SELECT RANK() OVER(ORDER BY j.[name] ASC) AS FailedJobsID, j.[name] AS JobName, jh.[step_name] AS StepName, [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) AS RunDateTime, SUBSTRING(jh2.[message], PATINDEX('%The Job was invoked by User%', jh2.[message]) + 28, PATINDEX('%The last step to run was%', jh2.[message]) -PATINDEX('%The Job was invoked by User%', jh2.[message])-28) AS ExecutedBy, REPLACE(SUBSTRING(jh.[message], 1, PATINDEX('%. %', jh.[message])) , 'Executed as user: ','') AS ExecutionContext, REPLACE(SUBSTRING(jh.[message], PATINDEX('%. %', jh.[message]) + 2, LEN(jh.[message]) - PATINDEX('%. %', jh.[message])-1), ' The step failed.','') AS FailureMessage, 0 AS Emailed INTO #FailedJobs FROM [msdb].[dbo].[sysjobs] j INNER JOIN [msdb].[dbo].[sysjobhistory] jh ON jh.[job_id] = j.[job_id] INNER JOIN [msdb].[dbo].[sysjobsteps] js ON js.[job_id] = j.[job_id] AND js.[step_id] = jh.[step_id] INNER JOIN [msdb].[dbo].[sysjobhistory] jh2 ON jh2.[job_id] = jh.[job_id] --Job isn't currently running WHERE jh.[run_status] = 0 --Only get jobs that ran since we last checked for failed jobs AND [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) > DATEADD(SECOND,-1,@LastRunTime) --Join back to sysjobhistory again to get step_id 0 for the failed job, to find who executed it AND jh.[sql_severity] > 0 AND jh2.[step_id] = 0 AND [msdb].[dbo].agent_datetime(jh2.[run_date], jh2.[run_time]) <= [msdb].[dbo].agent_datetime(jh.[run_date], jh.[run_time]) AND NOT EXISTS (SELECT 1 FROM [msdb].[dbo].[sysjobhistory] jh3 WHERE [msdb].[dbo].agent_datetime(jh3.[run_date], jh3.[run_time]) > [msdb].[dbo].agent_datetime(jh2.[run_date], jh2.[run_time]) AND jh3.[job_id] = jh2.job_id) --Add any exclusions here, for example: --Any SSIS steps, as the job history doesn't show SSIS catalogue error messages. --Checks for running SQL on either node of an Always On Availability Group AND js.[subsystem] <> 'SSIS' AND jh.[message] NOT LIKE ('%Unable to execute job on secondary node%') AND jh.[message] NOT LIKE ('%Request to run job%refused because the job is already running from a request by User%'); --Variable to store the current job being dealt with DECLARE @CurrentFailedJobID INT; WHILE EXISTS (SELECT 1 FROM #FailedJobs) --Loop through all the failed jobs BEGIN SET @CurrentFailedJobID = (SELECT TOP 1 fj.[FailedJobsID] FROM #FailedJobs fj); --Set the email subject DECLARE @MailSubject VARCHAR(255); SET @MailSubject = (SELECT @@SERVERNAME + ': ' + fj.[JobName] + ' steps have failed' FROM #FailedJobs fj WHERE fj.[FailedJobsID] = @CurrentFailedJobID GROUP BY fj.[JobName]); --Set the output as an HTML table to make it clear to read DECLARE @tableHTML NVARCHAR(MAX) ; SET @tableHTML = N'
Job Name | Job Step | Run Time | Run By | Execution Context | Error Message | ' + N'
---|
By scheduling this job to run at regular intervals, you can ensure that you always receive an email with the details of any job failures since the last time the job ran. This allows you to quickly identify and address any issues without the need for manual investigation.
It is recommended to increase the maximum job history log size and the maximum job history rows per job to ensure that you have enough historical data to analyze. This can be done by modifying the job history thresholds in SQL Server Agent settings.
With this improved job failure monitoring in place, you can have a better understanding of the reasons behind job failures and take appropriate actions to resolve them. This can help ensure the smooth operation of your SQL Server environment and minimize any potential disruptions.
Originally posted by Sven Lowry at https://www.xten.uk/technical-blogs/sql-job-failure-monitoring-improvements