Click here to Skip to main content
65,938 articles
CodeProject is changing. Read more.
Articles / Languages / T-SQL

Monitoring SQL Agent with Alerts for Long Running Jobs

5.00/5 (2 votes)
2 Sep 2021CPOL5 min read 14.1K  
Providing admin with an alert mechanism via an SP
An alerting via email technique is provided in this article for being notified on long running SQL Agent jobs.

Problem

When you have to manage hundreds of scheduled jobs on busy production servers, inevitably situations arise when jobs take too long to complete, causing substantial waits or affecting the performance of other processes. We want to know when a job started taking too long before we even dive deeper into investigating the causes of the degradation. Some jobs are not just taking a long time to complete, but they may be taking a longer time than they normally used take to complete.

Which jobs are regressing in performance? Is the slow job now consistently becoming a problem or is it a one-time occurrence? When there is a need to investigate a job, often within time constraints, the runtime duration is the number one metric to consider, before anything else in our analysis.

In order to compare the current run time with the duration of the previous executions for the same job, we would usually look at the job’s history for the previous runtime and see that the current runtime is longer than several previous durations. Yet if you manage 500 jobs as a DBA, Developer, or DevOps Engineer, the task of investigating one job at a time may take you half a day and still not bring you closer to a conclusion.

There may be 20 jobs you need to check. You don’t want to babysit each job individually all the time. Automated alerts delivered to your email box can come in handy in such a situation. The solution in this article is to create a stored procedure that will accept a parameter. This parameter is the number of days used for calculating average duration of the job in the past. The procedure produces a report showing jobs that are currently running are degrading (regressing) in performance along with an optional email alert.

Certain steps of understanding such approach as well as handy stored procedure that we already are using in Production is shown below (scheduled as a separate job on the same server where the monitored jobs are run from (see 'Appendix' for the script of such job) to help you get an email alert for one or many jobs that are regressing at the time of the polling. We execute the Stored Proc, asp_long_running_Regressing_Jobs_Alerts, in our environment every minute, as the durations of our jobs vary from seconds to hours. Before giving SP DDL, certain components of the SP are broken down as pre-requisites (small size code snippets) that help to understand the alerting SP in its entirety.

List of tools needed: SQL Server (below code tested on versions 2012 and up)

Goals here are to compare the duration of each currently running job to an average duration for all run times for the same job over a given period of time.

Note: Before proceeding to compiling the Stored Procedure given below, make sure you have high access privileges in the environment where you want to analyze scheduled jobs. Ideally, admin. To check your access level (if you are not an admin on this server, run this or similar T-SQL statement. If you have no result of this query, you have no Select permissions on the MSDB tables.

SQL
USE MSDB
GO
SELECT HAS_PERMS_BY_NAME(QUOTENAME(SCHEMA_NAME(schema_id)) + '.' + 
       QUOTENAME(name), 'OBJECT', 'SELECT') AS have_select, *
FROM   sys.tables
GO

Image 1

Figure 1: The result set shows tables that you have SELECT permissions on.

Solution

The T-SQL DDL code for the Stored Procedure below can be compiled in any database of your choice. We use dedicated DBA_db for such admin SPs.

SQL
USE [DBA_db]
GO
 
-- <begin stored procedure DDL/>
 
CREATE PROCEDURE asp_long_running_Regressing_Jobs_Alerts
         @history_days int = 7,
         @avg_duration_multiplier float = 1.5,
         @bEmail bit = 0,
         @bSaveToTable bit = 0,
         @RecipientsList Varchar(1000) = 'myName@myCoDomain.com',
         @ignore_zero_durations bit = 0
 
AS
/* example of usage:
   exec DBA_db..asp_long_running_Regressing_Jobs_Alerts
               @history_days = 45,
               @avg_duration_multiplier = 2,
               @bEmail = 0,
               @bSaveToTable = 0,
               @RecipientsList  = 'myName@myCoDomain.com;'   ,
               @ignore_zero_durations = 1
 
AUTHOR(s):
Vladimir Isaev;
-- + V.B., S.L;
-- contact@sqlexperts.org
*/      
 
/*PARAMETERS:
@history_days int          (how many days back we use for AVF run duration)
@avg_duration_multiplier   (how many times longer than AVG will qualify job 
                            for producing an alert)
@bEmail                    (send out Alert Email or just print the msg about Regressing jobs)
                           -- 'REGRESSION' is defined here by Duration only
*/
SET NOCOUNT ON
BEGIN
 
        select sj.name,
               sja.start_execution_date,
               sja.stop_execution_date,
               ajt.min_run_duration,
               ajt.max_run_duration,
               ajt.avg_run_duration,
               datediff(ss, start_execution_date, getdate()) as cur_run_duration
        into #Regressing_Jobs
 
        from msdb..sysjobactivity sja
               left join
               (select job_id,
                       avg(dbo.udf_convert_int_time2ss(run_duration)) as avg_run_duration,
                       min(dbo.udf_convert_int_time2ss(run_duration)) as min_run_duration,
                       max(dbo.udf_convert_int_time2ss(run_duration)) as max_run_duration
                       from msdb..sysjobhistory
                       where step_id=0
                       and run_date >CONVERT(varchar(8),GETDATE() - @history_days,112)
                       and ((run_duration <> 0 or @ignore_zero_durations = 0))
                       and run_duration < 240000
                       group by job_id
               )ajt on sja.job_id=ajt.job_id
        join msdb..sysjobs sj on sj.job_id=sja.job_id
        where
               sja.session_id = (SELECT TOP 1 session_id
                                   FROM msdb.dbo.syssessions
                                  ORDER BY agent_start_date DESC)
               AND start_execution_date is not null
               and stop_execution_date is null
               and datediff(ss, start_execution_date, getdate()) >
                   ajt.avg_run_duration * @avg_duration_multiplier
 
        select name as JobName,
               start_execution_date,
               stop_execution_date,
               dateadd(second, min_run_duration, 0) as min_run_duration,
               dateadd(second, max_run_duration, 0) as max_run_duration,
               dateadd(second, avg_run_duration, 0) as avg_run_duration,
               dateadd(second, cur_run_duration, 0) as cur_run_duration
        into #Regressing_Jobs_DurAsDate
        from #Regressing_Jobs
                                            --  waitfor delay '00:00:10'
        declare @sHtml varchar(max) = ''
 
        declare @tableHTML  nvarchar(max) =
               N'<H1>Job(s) taking longer than recent baseline duration
                (in descending avg duration order):</H1>' + Char(13)
               + N'    <table border="1">'           + Char(13)
               + N'    <tr bgcolor="#ddd">'          + Char(13)
               + N'           <th>Start Time</th>'   + Char(13)
               + N'           <th>Job Name</th>'     + Char(13)
               + N'           <th>Host Name</th>'    + Char(13)
               + N'           <th>History Days</th>' + Char(13)
               + N'           <th>Avg Dur Mul</th>'  + Char(13)
               + N'           <th>Min Dur</th>'      + Char(13)
               + N'           <th>Max Dur</th>'      + Char(13)
               + N'           <th>Avg Dur</th>'      + Char(13)
               + N'           <th>Cur Dur</th>'      + Char(13)
               + N'    </tr>'                        + Char(13)
 
        select @tableHTML =  @tableHTML
               + FORMATMESSAGE(
                       '<tr><td>%s</td>'      _
                          + Char(13) --start_execution_date
                       + '<td>%s</td>'        + Char(13) --name
                       + '<td>%s</td>'        + Char(13) --@@SERVERNAME
                       + '<td style="text-align:center">%i</td>' _
                          + Char(13) --@history_days
                       + '<td style="text-align:center">%s</td>' 
                          + Char(13) --@avg_duration_multiplier
                       + '<td>%s</td>'        + Char(13) --Min Dur
                       + '<td>%s</td>'        + Char(13) --Max Dur
                       + '<td>%s</td>'        + Char(13) --Avg Dur
                       + '<td>%s</td>'        + Char(13),--Cur Dur
                               convert(varchar, start_execution_date, 120),
                               JobName,
                               @@SERVERNAME,
                               @history_days,
                               convert(varchar, @avg_duration_multiplier),
                               format(min_run_duration, N'HH\hmm\mss\s'),
                               format(max_run_duration, N'HH\hmm\mss\s'),
                               format(avg_run_duration, N'HH\hmm\mss\s'),
                               format(cur_run_duration, N'HH\hmm\mss\s')
                       )
          from #Regressing_Jobs_DurAsDate
          order by avg_run_duration desc, JobName
 
               select @tableHTML = @tableHTML + '</tr></table>' + Char(13)
 
        select @sHtml = @tableHTML
        --select @sHtml
 
        declare @DateStr varchar(30) = convert(varchar,getdate(),121)
        IF @bEmail = 1 and (select count(*) from #Regressing_Jobs) > 0
         begin
              
               declare @sSubject varchar(250)
                   = @@SERVERNAME + ' Job(s) taking longer than recent baseline duration: ' _
                     + @DateStr 
 
               EXEC msdb.dbo.sp_send_dbmail  @profile_name='SQL Server Monitoring Account',
                                              @recipients= @RecipientsList,
                                              @subject=@sSubject, 
                                               @body=@sHtml,
                                              @body_format = 'HTML'
 
               print 'email sent: ' + CHAR(13) + @sHtml
        end
 
        IF @bSaveToTable = 1
          begin
               insert into RegressingJobs
         (
 CaptureDateTime,
 JobName,
 start_execution_date,
 HostName,
 history_days,
 avg_duration_multiplier,
                           min_run_duration,
 max_run_duration,
 avg_run_duration,
 cur_run_duration
 )
select         @DateStr,
JobName,
start_execution_date,
@@SERVERNAME,
@history_days,   
@avg_duration_multiplier,
min_run_duration,
max_run_duration,
avg_run_duration,
cur_run_duration
               from #Regressing_Jobs_DurAsDate
          end
 
        begin
 
         SELECT 'JOBS THAT ARE TAKING LONGER THAN USUAL:  '
         select  @DateStr as CaptureDateTime, JobName, _
            start_execution_date, @@SERVERNAME as 'Server',
                 @history_days as '@history_days', _
                    @avg_duration_multiplier as '@avg_duration_multiplier',
                 min_run_duration, max_run_duration, _
                    avg_run_duration, cur_run_duration
         from    #Regressing_Jobs_DurAsDate
 
        end
 
--all currently running jobs:
       begin
               SELECT ' ALL JOBS THAT ARE CURRENTLY RUNNING:  '
               SELECT
                 -- '',  -- CAST (ja.job_id AS VARCHAR(max)),
                       j.name AS job_name,
                  cast ( ja.start_execution_date as varchar) start_execution_time,  
                  cast ( ja.stop_execution_date  as varchar) stop_execution_time,  
                  -- ISNULL(last_executed_step_id,0)+1 AS current_executed_step_id,
                       Js.step_name step_name
               FROM msdb.dbo.sysjobactivity ja
               LEFT JOIN msdb.dbo.sysjobhistory jh
                       ON ja.job_history_id = jh.instance_id
               JOIN msdb.dbo.sysjobs j
                 ON ja.job_id = j.job_id
               JOIN msdb.dbo.sysjobsteps js
                 ON ja.job_id = js.job_id
                AND ISNULL(ja.last_executed_step_id,0)+1 = js.step_id
               WHERE ja.session_id =
                    (SELECT TOP 1 session_id
               FROM msdb.dbo.syssessions
           ORDER BY agent_start_date DESC)
               AND start_execution_date is not null
               AND stop_execution_date  is null;
 
        end
END
 
GO
 
-- <end of stored procedure DDL/>

2 UDFs used by the asp_long_running_Regressing_Jobs_Alerts

SQL
-- dependencies of asp_long_running_Regressing_Jobs_Alerts:
 
-- udf_convert_int_time
CREATE   FUNCTION [dbo].[udf_convert_int_time] (@time_in INT)
RETURNS TIME
AS
 BEGIN
        DECLARE @time_out TIME
        DECLARE @time_in_str varchar(6)
        SELECT  @time_in_str = RIGHT('000000' + CAST(@time_in AS VARCHAR(6)), 6)
        SELECT  @time_out    = CAST(STUFF(STUFF(@time_in_str,3,0,':'),6,0,':') AS TIME)
  RETURN @time_out
 END
GO
 
-- udf_convert_int_time2ss
CREATE   FUNCTION [dbo].[udf_convert_int_time2ss] (@time_in INT)
RETURNS int
AS
 BEGIN
        DECLARE @time_out int
        select @time_out = datediff(ss, 0,  dbo.udf_convert_int_time(@time_in))
  RETURN @time_out
 END
GO

In addition to the examples of calls listed in the header of the SP, here is an example of a typical call:

SQL
exec dba_DB.dbo.asp_long_running_Regressing_Jobs_Alerts
       @history_days = 45,
       @avg_duration_multiplier = 2,
       @bEmail = 1,
       @bSaveToTable = 0,
       @RecipientsList  = 'myName@myCoDomain.com;
                           AssociateName@myCoDomain.com’
       @ignore_zero_durations = 1

This call of the SP means the following:

Give me a report (or alert) on all jobs that took twice as long to complete as the average duration of runtime for the same jobs over 45 days. Email such report to me (myName) and to my associate (myAssociateName). Do not save this data to a baseline table, and do not include jobs with zero durations.

https://www.sqlservercentral.com/wp-content/uploads/2021/08/img_6118df89ac710.png

Figure 2: Sample output in SSMS of another similar call to the SP: in this case, showing jobs that right now are taking 10% more time that they took on AVG in past 2 days.

The image below shows an example of what an email alert, generated by the SP, looks like in HTML.

This Email Notification is generated by a call of this SP only if parameter @bEmail = 1.

https://www.sqlservercentral.com/wp-content/uploads/2021/08/img_6118e0f061fcd.png

Figure 3

Email subject line reads:
<ServerName> Job(s) taking longer than recent baseline duration;
yyyy-mm-dd mm:ss

Saving Results to Table for Future Historical Analysis

If you decide to save the report to table, do this: in addition to be alerted via email or direct run of the SP in SSMS, a table will be required. Call the SP with @bSaveToTable = 1.
(It is one of the dependencies of the SP so it is a good idea to create it even if you do not with results to table at this time). Here is the table DDL:

SQL
CREATE TABLE [RegressingJobs](
                            [CaptureDateTime]        [datetime]      NULL,
                            [JobName]                [sysname]       NOT NULL,
                            [start_execution_date]   [datetime]      NULL,
                            [HostName]               [sysname]       NOT NULL,
                            [history_days]           [int]           NULL,
                            [avg_duration_multiplier]    [float]     NULL,
                            [min_run_duration]           [time](7)   NULL,
                            [max_run_duration]           [time](7)   NULL,
                            [avg_run_duration]           [time](7)   NULL,
                            [cur_run_duration]           [time](7)   NULL
 ) ON [PRIMARY]
   GO

This example of calling the SP with full functionality, including saving to table and email alert (bSaveToTable= 1, bEmail=1):

SQL
EXEC DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts
                             @history_days = 30,
                             @avg_duration_multiplier = 2,
                             @bEmail = 1,
                             @bSaveToTable = 1,
                             @RecipientsList  = 'myName@myCoDomain.com;',
                             @ignore_zero_durations = 1

Other Examples of how this SP can be Called for Various Purposes

Compare currently running job(s) to its history for past 30 days, and report on each job whose current duration exceeds the 30-day average by 1.5 times. DO not send alert email and do not save this info to baseline table:

SQL
EXECUTE DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts 30, 1.5, 0, 0

Compare currently running job(s) to its history for past 60 days, and report on each job whose current duration exceeds the 60-day average by 2 times. Send email alert to default recipients (list), and do not save this info to baseline table:

SQL
EXECUTE DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts 60, 1.5, 1, 0

Note: It is highly recommended to use assignment of values to parameters naming each parameter explicitly. The above examples are given for purposes of being succinct.

Conclusion

This article has described a problem that admins face while managing and analyzing performance issues for multiple jobs in a high automated workload environment. The stored procedure that I shared here allows admin to be alerted in the event certain job(s) are taking longer than their past average durations for the given period of time.

Appendix

Below is the DDL to create the schedule job named [MyMonitoredServerName_Maintenance - Regressing Jobs] to execute the asp_long_running_Regressing_Jobs_Alerts every minute and send alert to BigShotAdmin at BigShotAdmin@MyCoDomain.com.

SQL
USE [msdb]
GO
 
/****** Object:  Job [MyMonitoredServerName_Maintenance - Regressing_Jobs] ******/
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
IF NOT EXISTS (SELECT name
                 FROM msdb.dbo.syscategories
   WHERE name=N'[BigLoad]'
     AND category_class=1)
BEGIN
 EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[BigLoad]'
 IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
 
END
 
DECLARE @jobId BINARY(16)
 EXEC   @ReturnCode =  msdb.dbo.sp_add_job
        @job_name=N'MyMonitoredServerName_Maintenance - Regressing_Jobs',
                                @enabled=1,
                                @notify_level_eventlog=0,
                                @notify_level_email=0,
                                @notify_level_netsend=0,
                                @notify_level_page=0,
                                @delete_level=0,
                                @description=N'Send email notifications to _
                                @RrecipientsList (last parameter in SP) on CURRENTLY RUNNING _
                                Agent Jobs that regress in performance by duration compared to _
                                baseline (baseline collected during the number of days before _
                                getdate() specified by the first parameter @history_days.',
                                @category_name=N'[BigLoad]',
                                @owner_login_name=N'sa', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
/****** Object:  Step [asp_long_running_Regressing_Jobs_Alerts] ******/
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, _
   @step_name=N'asp_long_running_Regressing_Jobs_Alerts',
                                @step_id=1,
                                @cmdexec_success_code=0,
                                @on_success_action=1,
                                @on_success_step_id=0,
                                @on_fail_action=2,
                                @on_fail_step_id=0,
                                @retry_attempts=0,
                                @retry_interval=0,
                                @os_run_priority=0, @subsystem=N'TSQL',
                                @command=N'exec _
                                           dbmaint..asp_long_running_Regressing_Jobs_Alerts
                                        @history_days = 45,
                                        @avg_duration_multiplier = 2,
                                        @bEmail = 1,
                                        @bSaveToTable = 1,
                                        @RecipientsList  = ''BigShotAdmin@MyCoDomain.com;'',
                                        @ignore_zero_durations = 1',
                                        @database_name=N'master',
                                        @flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'every 1 min',
                                @enabled=1,
                                @freq_type=4,
                                @freq_interval=1,
                                @freq_subday_type=4,
                                @freq_subday_interval=1,
                                @freq_relative_interval=0,
                                @freq_recurrence_factor=0,
                                @active_start_date=20201222,
                                @active_end_date=99991231,
                                @active_start_time=60000,
                                @active_end_time=235959,
                                @schedule_uid=N'999ac144-4e13-4965-82f2-55555cc37a09'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
    IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO

Enjoy being alerted about Regressing/Long Running SQL Agent jobs!

History

  • 1st September, 2021: Initial version

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)