An alerting via email technique is provided in this article for being notified on long running SQL Agent jobs.
Problem
When you have to manage hundreds of scheduled jobs on busy production servers, inevitably situations arise when jobs take too long to complete, causing substantial waits or affecting the performance of other processes. We want to know when a job started taking too long before we even dive deeper into investigating the causes of the degradation. Some jobs are not just taking a long time to complete, but they may be taking a longer time than they normally used take to complete.
Which jobs are regressing in performance? Is the slow job now consistently becoming a problem or is it a one-time occurrence? When there is a need to investigate a job, often within time constraints, the runtime duration is the number one metric to consider, before anything else in our analysis.
In order to compare the current run time with the duration of the previous executions for the same job, we would usually look at the job’s history for the previous runtime and see that the current runtime is longer than several previous durations. Yet if you manage 500 jobs as a DBA, Developer, or DevOps Engineer, the task of investigating one job at a time may take you half a day and still not bring you closer to a conclusion.
There may be 20 jobs you need to check. You don’t want to babysit each job individually all the time. Automated alerts delivered to your email box can come in handy in such a situation. The solution in this article is to create a stored procedure that will accept a parameter. This parameter is the number of days used for calculating average duration of the job in the past. The procedure produces a report showing jobs that are currently running are degrading (regressing) in performance along with an optional email alert.
Certain steps of understanding such approach as well as handy stored procedure that we already are using in Production is shown below (scheduled as a separate job on the same server where the monitored jobs are run from (see 'Appendix' for the script of such job) to help you get an email alert for one or many jobs that are regressing at the time of the polling. We execute the Stored Proc, asp_long_running_Regressing_Jobs_Alerts
, in our environment every minute, as the durations of our jobs vary from seconds to hours. Before giving SP DDL, certain components of the SP are broken down as pre-requisites (small size code snippets) that help to understand the alerting SP in its entirety.
List of tools needed: SQL Server (below code tested on versions 2012 and up)
Goals here are to compare the duration of each currently running job to an average duration for all run times for the same job over a given period of time.
Note: Before proceeding to compiling the Stored Procedure given below, make sure you have high access privileges in the environment where you want to analyze scheduled jobs. Ideally, admin. To check your access level (if you are not an admin on this server, run this or similar T-SQL statement. If you have no result of this query, you have no Select permissions on the MSDB tables.
USE MSDB
GO
SELECT HAS_PERMS_BY_NAME(QUOTENAME(SCHEMA_NAME(schema_id)) + '.' +
QUOTENAME(name), 'OBJECT', 'SELECT') AS have_select, *
FROM sys.tables
GO
Figure 1: The result set shows tables that you have SELECT permissions on.
Solution
The T-SQL DDL code for the Stored Procedure below can be compiled in any database of your choice. We use dedicated DBA_db
for such admin SPs.
USE [DBA_db]
GO
CREATE PROCEDURE asp_long_running_Regressing_Jobs_Alerts
@history_days int = 7,
@avg_duration_multiplier float = 1.5,
@bEmail bit = 0,
@bSaveToTable bit = 0,
@RecipientsList Varchar(1000) = 'myName@myCoDomain.com',
@ignore_zero_durations bit = 0
AS
SET NOCOUNT ON
BEGIN
select sj.name,
sja.start_execution_date,
sja.stop_execution_date,
ajt.min_run_duration,
ajt.max_run_duration,
ajt.avg_run_duration,
datediff(ss, start_execution_date, getdate()) as cur_run_duration
into #Regressing_Jobs
from msdb..sysjobactivity sja
left join
(select job_id,
avg(dbo.udf_convert_int_time2ss(run_duration)) as avg_run_duration,
min(dbo.udf_convert_int_time2ss(run_duration)) as min_run_duration,
max(dbo.udf_convert_int_time2ss(run_duration)) as max_run_duration
from msdb..sysjobhistory
where step_id=0
and run_date >CONVERT(varchar(8),GETDATE() - @history_days,112)
and ((run_duration <> 0 or @ignore_zero_durations = 0))
and run_duration < 240000
group by job_id
)ajt on sja.job_id=ajt.job_id
join msdb..sysjobs sj on sj.job_id=sja.job_id
where
sja.session_id = (SELECT TOP 1 session_id
FROM msdb.dbo.syssessions
ORDER BY agent_start_date DESC)
AND start_execution_date is not null
and stop_execution_date is null
and datediff(ss, start_execution_date, getdate()) >
ajt.avg_run_duration * @avg_duration_multiplier
select name as JobName,
start_execution_date,
stop_execution_date,
dateadd(second, min_run_duration, 0) as min_run_duration,
dateadd(second, max_run_duration, 0) as max_run_duration,
dateadd(second, avg_run_duration, 0) as avg_run_duration,
dateadd(second, cur_run_duration, 0) as cur_run_duration
into #Regressing_Jobs_DurAsDate
from #Regressing_Jobs
declare @sHtml varchar(max) = ''
declare @tableHTML nvarchar(max) =
N'<H1>Job(s) taking longer than recent baseline duration
(in descending avg duration order):</H1>' + Char(13)
+ N' <table border="1">' + Char(13)
+ N' <tr bgcolor="#ddd">' + Char(13)
+ N' <th>Start Time</th>' + Char(13)
+ N' <th>Job Name</th>' + Char(13)
+ N' <th>Host Name</th>' + Char(13)
+ N' <th>History Days</th>' + Char(13)
+ N' <th>Avg Dur Mul</th>' + Char(13)
+ N' <th>Min Dur</th>' + Char(13)
+ N' <th>Max Dur</th>' + Char(13)
+ N' <th>Avg Dur</th>' + Char(13)
+ N' <th>Cur Dur</th>' + Char(13)
+ N' </tr>' + Char(13)
select @tableHTML = @tableHTML
+ FORMATMESSAGE(
'<tr><td>%s</td>' _
+ Char(13)
+ '<td>%s</td>' + Char(13)
+ '<td>%s</td>' + Char(13)
+ '<td style="text-align:center">%i</td>' _
+ Char(13)
+ '<td style="text-align:center">%s</td>'
+ Char(13)
+ '<td>%s</td>' + Char(13)
+ '<td>%s</td>' + Char(13)
+ '<td>%s</td>' + Char(13)
+ '<td>%s</td>' + Char(13),
convert(varchar, start_execution_date, 120),
JobName,
@@SERVERNAME,
@history_days,
convert(varchar, @avg_duration_multiplier),
format(min_run_duration, N'HH\hmm\mss\s'),
format(max_run_duration, N'HH\hmm\mss\s'),
format(avg_run_duration, N'HH\hmm\mss\s'),
format(cur_run_duration, N'HH\hmm\mss\s')
)
from #Regressing_Jobs_DurAsDate
order by avg_run_duration desc, JobName
select @tableHTML = @tableHTML + '</tr></table>' + Char(13)
select @sHtml = @tableHTML
declare @DateStr varchar(30) = convert(varchar,getdate(),121)
IF @bEmail = 1 and (select count(*) from #Regressing_Jobs) > 0
begin
declare @sSubject varchar(250)
= @@SERVERNAME + ' Job(s) taking longer than recent baseline duration: ' _
+ @DateStr
EXEC msdb.dbo.sp_send_dbmail @profile_name='SQL Server Monitoring Account',
@recipients= @RecipientsList,
@subject=@sSubject,
@body=@sHtml,
@body_format = 'HTML'
print 'email sent: ' + CHAR(13) + @sHtml
end
IF @bSaveToTable = 1
begin
insert into RegressingJobs
(
CaptureDateTime,
JobName,
start_execution_date,
HostName,
history_days,
avg_duration_multiplier,
min_run_duration,
max_run_duration,
avg_run_duration,
cur_run_duration
)
select @DateStr,
JobName,
start_execution_date,
@@SERVERNAME,
@history_days,
@avg_duration_multiplier,
min_run_duration,
max_run_duration,
avg_run_duration,
cur_run_duration
from #Regressing_Jobs_DurAsDate
end
begin
SELECT 'JOBS THAT ARE TAKING LONGER THAN USUAL: '
select @DateStr as CaptureDateTime, JobName, _
start_execution_date, @@SERVERNAME as 'Server',
@history_days as '@history_days', _
@avg_duration_multiplier as '@avg_duration_multiplier',
min_run_duration, max_run_duration, _
avg_run_duration, cur_run_duration
from #Regressing_Jobs_DurAsDate
end
begin
SELECT ' ALL JOBS THAT ARE CURRENTLY RUNNING: '
SELECT
j.name AS job_name,
cast ( ja.start_execution_date as varchar) start_execution_time,
cast ( ja.stop_execution_date as varchar) stop_execution_time,
Js.step_name step_name
FROM msdb.dbo.sysjobactivity ja
LEFT JOIN msdb.dbo.sysjobhistory jh
ON ja.job_history_id = jh.instance_id
JOIN msdb.dbo.sysjobs j
ON ja.job_id = j.job_id
JOIN msdb.dbo.sysjobsteps js
ON ja.job_id = js.job_id
AND ISNULL(ja.last_executed_step_id,0)+1 = js.step_id
WHERE ja.session_id =
(SELECT TOP 1 session_id
FROM msdb.dbo.syssessions
ORDER BY agent_start_date DESC)
AND start_execution_date is not null
AND stop_execution_date is null;
end
END
GO
2 UDFs used by the asp_long_running_Regressing_Jobs_Alerts
CREATE FUNCTION [dbo].[udf_convert_int_time] (@time_in INT)
RETURNS TIME
AS
BEGIN
DECLARE @time_out TIME
DECLARE @time_in_str varchar(6)
SELECT @time_in_str = RIGHT('000000' + CAST(@time_in AS VARCHAR(6)), 6)
SELECT @time_out = CAST(STUFF(STUFF(@time_in_str,3,0,':'),6,0,':') AS TIME)
RETURN @time_out
END
GO
CREATE FUNCTION [dbo].[udf_convert_int_time2ss] (@time_in INT)
RETURNS int
AS
BEGIN
DECLARE @time_out int
select @time_out = datediff(ss, 0, dbo.udf_convert_int_time(@time_in))
RETURN @time_out
END
GO
In addition to the examples of calls listed in the header of the SP, here is an example of a typical call:
exec dba_DB.dbo.asp_long_running_Regressing_Jobs_Alerts
@history_days = 45,
@avg_duration_multiplier = 2,
@bEmail = 1,
@bSaveToTable = 0,
@RecipientsList = 'myName@myCoDomain.com;
AssociateName@myCoDomain.com’
@ignore_zero_durations = 1
This call of the SP means the following:
Give me a report (or alert) on all jobs that took twice as long to complete as the average duration of runtime for the same jobs over 45 days. Email such report to me (myName
) and to my associate (myAssociateName
). Do not save this data to a baseline table, and do not include jobs with zero durations.
Figure 2: Sample output in SSMS of another similar call to the SP: in this case, showing jobs that right now are taking 10% more time that they took on AVG in past 2 days.
The image below shows an example of what an email alert, generated by the SP, looks like in HTML.
This Email Notification is generated by a call of this SP only if parameter @bEmail = 1
.
Figure 3
Email subject line reads:
<ServerName> Job(s) taking longer than recent baseline duration;
yyyy-mm-dd mm:ss
Saving Results to Table for Future Historical Analysis
If you decide to save the report to table, do this: in addition to be alerted via email or direct run of the SP in SSMS, a table will be required. Call the SP with @bSaveToTable = 1
.
(It is one of the dependencies of the SP so it is a good idea to create it even if you do not with results to table at this time). Here is the table DDL:
CREATE TABLE [RegressingJobs](
[CaptureDateTime] [datetime] NULL,
[JobName] [sysname] NOT NULL,
[start_execution_date] [datetime] NULL,
[HostName] [sysname] NOT NULL,
[history_days] [int] NULL,
[avg_duration_multiplier] [float] NULL,
[min_run_duration] [time](7) NULL,
[max_run_duration] [time](7) NULL,
[avg_run_duration] [time](7) NULL,
[cur_run_duration] [time](7) NULL
) ON [PRIMARY]
GO
This example of calling the SP with full functionality, including saving to table and email alert (bSaveToTable= 1, bEmail=1
):
EXEC DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts
@history_days = 30,
@avg_duration_multiplier = 2,
@bEmail = 1,
@bSaveToTable = 1,
@RecipientsList = 'myName@myCoDomain.com;',
@ignore_zero_durations = 1
Other Examples of how this SP can be Called for Various Purposes
Compare currently running job(s) to its history for past 30 days, and report on each job whose current duration exceeds the 30-day average by 1.5 times. DO not send alert email and do not save this info to baseline table:
EXECUTE DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts 30, 1.5, 0, 0
Compare currently running job(s) to its history for past 60 days, and report on each job whose current duration exceeds the 60-day average by 2 times. Send email alert to default recipients (list), and do not save this info to baseline table:
EXECUTE DBA_db.dbo.asp_long_running_Regressing_Jobs_Alerts 60, 1.5, 1, 0
Note: It is highly recommended to use assignment of values to parameters naming each parameter explicitly. The above examples are given for purposes of being succinct.
Conclusion
This article has described a problem that admins face while managing and analyzing performance issues for multiple jobs in a high automated workload environment. The stored procedure that I shared here allows admin to be alerted in the event certain job(s) are taking longer than their past average durations for the given period of time.
Appendix
Below is the DDL to create the schedule job named [MyMonitoredServerName_Maintenance
- Regressing Jobs] to execute the asp_long_running_Regressing_Jobs_Alerts
every minute and send alert to BigShotAdmin
at BigShotAdmin@MyCoDomain.com.
USE [msdb]
GO
BEGIN TRANSACTION
DECLARE @ReturnCode INT
SELECT @ReturnCode = 0
IF NOT EXISTS (SELECT name
FROM msdb.dbo.syscategories
WHERE name=N'[BigLoad]'
AND category_class=1)
BEGIN
EXEC @ReturnCode = msdb.dbo.sp_add_category @class=N'JOB', @type=N'LOCAL', @name=N'[BigLoad]'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
END
DECLARE @jobId BINARY(16)
EXEC @ReturnCode = msdb.dbo.sp_add_job
@job_name=N'MyMonitoredServerName_Maintenance - Regressing_Jobs',
@enabled=1,
@notify_level_eventlog=0,
@notify_level_email=0,
@notify_level_netsend=0,
@notify_level_page=0,
@delete_level=0,
@description=N'Send email notifications to _
@RrecipientsList (last parameter in SP) on CURRENTLY RUNNING _
Agent Jobs that regress in performance by duration compared to _
baseline (baseline collected during the number of days before _
getdate() specified by the first parameter @history_days.',
@category_name=N'[BigLoad]',
@owner_login_name=N'sa', @job_id = @jobId OUTPUT
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobstep @job_id=@jobId, _
@step_name=N'asp_long_running_Regressing_Jobs_Alerts',
@step_id=1,
@cmdexec_success_code=0,
@on_success_action=1,
@on_success_step_id=0,
@on_fail_action=2,
@on_fail_step_id=0,
@retry_attempts=0,
@retry_interval=0,
@os_run_priority=0, @subsystem=N'TSQL',
@command=N'exec _
dbmaint..asp_long_running_Regressing_Jobs_Alerts
@history_days = 45,
@avg_duration_multiplier = 2,
@bEmail = 1,
@bSaveToTable = 1,
@RecipientsList = ''BigShotAdmin@MyCoDomain.com;'',
@ignore_zero_durations = 1',
@database_name=N'master',
@flags=0
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_update_job @job_id = @jobId, @start_step_id = 1
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobschedule @job_id=@jobId, @name=N'every 1 min',
@enabled=1,
@freq_type=4,
@freq_interval=1,
@freq_subday_type=4,
@freq_subday_interval=1,
@freq_relative_interval=0,
@freq_recurrence_factor=0,
@active_start_date=20201222,
@active_end_date=99991231,
@active_start_time=60000,
@active_end_time=235959,
@schedule_uid=N'999ac144-4e13-4965-82f2-55555cc37a09'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
EXEC @ReturnCode = msdb.dbo.sp_add_jobserver @job_id = @jobId, @server_name = N'(local)'
IF (@@ERROR <> 0 OR @ReturnCode <> 0) GOTO QuitWithRollback
COMMIT TRANSACTION
GOTO EndSave
QuitWithRollback:
IF (@@TRANCOUNT > 0) ROLLBACK TRANSACTION
EndSave:
GO
Enjoy being alerted about Regressing/Long Running SQL Agent jobs!
History
- 1st September, 2021: Initial version