Oracle Performance Dashboard (OPD) is a small ASP.NET website that shows you performance & problems of one or more Oracle instances in near real-time. It uses the Dynamic Performance Views (DPV) and runs some popular DBA scripts in order to get meaningful, easy to understand information out of the server. You can use it to quickly spot blocking queries, who is blocking who, expensive query that are consuming high CPU or disk, see if there's unusual locks, very high disk activity and so on.
Demo
You can see a live demo of this from here:
http://odp.omaralzabir.com
Get the code
The binaries are here, which you can just extract into an IIS folder, put the connection strings in the web.config file, and you are ready to roll. No need to install any Oracle client software on the server.
GitHub Project Binaries
You can get the source code from the GitHub project site:
https://github.com/oazabir/OraclePerformanceDashboard
Why not use OEM
OEM is pretty good, no doubt. But what I need is something that runs complex DBA scripts, processes the output and tells me what is wrong exactly. On standard monitoring tools like OEM, you get to see the key performance indicators and sometimes some generic alerts, but a much more powerful way to investigate problems in the server is to run those DBA scripts that you find in Oracle Performance Tuning books and blogs, that really get into the heart of the problem and shows you where things are going wrong. There are many powerful scripts available on the books, forums, blogs that are quite handy to detect issues on the server, gather relevant useful information about your database health. This tool tries to give those powerful scripts a nice front-end and easy to discover clues for those, who aren't professional Oracle DBAs like me.
Feature walkthrough
OPD comes with the following features in V1:
- Summary of all your instances in one screen, showing important indicators on each instance. Quick way to check if all your databases are doing fine or not.
- Instance Dashboard showing details of an instance:
- CPU usage on the OS.
- CPU consumed by each session
- Important System Statistics like Buffer Gets, Parse to execute ratio which would indicate some common problems on the server.
- Sessions and what the sessions are doing, how much resource they are consuming, whether they are hogging the disk or not.
- Waits, Blocks, Locks, deadlocks that make database suffer.
- Historical analysis on the databse showing you some very useful stuff:
- Most expensive queries in terms of CPU and Buffer Get, which are immediate convern for your devs to sort out.
- IO usage on data files. You can see if some data file is unusually loaded and getting hammered by physical IO.
- Tablespace usage. Alerts you if some tablespace is over 85% full.
- Stale stats on tables. You should always keep this clean.
- Killer Indexes that will actually blow up your system and confuse Oracle query optimizer. You need to get rid of those indexes and rewrite queries that you thought will hit those indexes for better performance. They won't. They will kill your database.
Summary Screen
When you launch OPD, you see a quick summary of all your instances and indicators showing you whether there's something off in any of the instances. You can then click on the instance name and zoom into more details on each instance.
The summary is generated by running this long query which collects some important indicators from the server, does some calculation to decide whether something is off the chart or not.
SELECT 'Block' as Name, (select
nvl(sum(seconds_in_wait),0)
from
v$session
where
blocking_session is not NULL) as Value from Dual
union all
select 'Locks' as Name, (select count(1)
from v$session sn,
v$lock m
where
((sn.SID = m.SID and m.REQUEST != 0)
or (sn.SID = m.SID and m.REQUEST = 0 and LMODE != 4 and (ID1, ID2) in
(select s.ID1, s.ID2
from v$lock S
where REQUEST != 0
and s.ID1 = m.ID1
and s.ID2 = m.ID2)))) as Value FROM Dual
union all
select 'Waits' as Name, (select count(1) FROM v$session_wait w, v$session s, dba_objects o WHERE s.sid = w.sid AND w.p2 = o.object_id) as Value from Dual
union all
select 'Long op (sec rem)' as Name, (SELECT
nvl(sum(time_remaining),0)
FROM v$session_longops sl
INNER JOIN v$session s ON sl.SID = s.SID AND sl.SERIAL# = s.SERIAL#
WHERE time_remaining > 0) as Value FROM Dual
union all
select 'CPU' as Name, (
with AASSTAT as (
select
decode(n.wait_class,'User I/O','User I/O',
'Commit','Commit',
'Wait') CLASS,
sum(round(m.time_waited/m.INTSIZE_CSEC,3)) AAS,
BEGIN_TIME ,
END_TIME
from v$waitclassmetric m,
v$system_wait_class n
where m.wait_class_id=n.wait_class_id
and n.wait_class != 'Idle'
group by decode(n.wait_class,'User I/O','User I/O', 'Commit','Commit', 'Wait'), BEGIN_TIME, END_TIME
union
select 'CPU_ORA_CONSUMED' CLASS,
round(value/100,3) AAS,
BEGIN_TIME ,
END_TIME
from v$sysmetric
where metric_name='CPU Usage Per Sec'
and group_id=2
union
select 'CPU_OS' CLASS ,
round((prcnt.busy*parameter.cpu_count)/100,3) AAS,
BEGIN_TIME ,
END_TIME
from
( select value busy, BEGIN_TIME,END_TIME from v$sysmetric where metric_name='Host CPU Utilization (%)' and group_id=2 ) prcnt,
( select value cpu_count from v$parameter where name='cpu_count' ) parameter
union
select
'CPU_ORA_DEMAND' CLASS,
nvl(round( sum(decode(session_state,'ON CPU',1,0))/60,2),0) AAS,
cast(min(SAMPLE_TIME) as date) BEGIN_TIME ,
cast(max(SAMPLE_TIME) as date) END_TIME
from v$active_session_history ash
where SAMPLE_TIME >= (select BEGIN_TIME from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 )
and SAMPLE_TIME < (select END_TIME from v$sysmetric where metric_name='CPU Usage Per Sec' and group_id=2 )
)
select
round(
decode(sign(CPU_OS-CPU_ORA_CONSUMED), -1, 0, (CPU_OS - CPU_ORA_CONSUMED )) +
CPU_ORA_CONSUMED +
decode(sign(CPU_ORA_DEMAND-CPU_ORA_CONSUMED), -1, 0, (CPU_ORA_DEMAND - CPU_ORA_CONSUMED )) +
COMMIT +
READIO +
WAIT
, 2)
CPU_TOTAL
from (
select
min(BEGIN_TIME) BEGIN_TIME,
max(END_TIME) END_TIME,
sum(decode(CLASS,'CPU_ORA_CONSUMED',AAS,0)) CPU_ORA_CONSUMED,
sum(decode(CLASS,'CPU_ORA_DEMAND' ,AAS,0)) CPU_ORA_DEMAND,
sum(decode(CLASS,'CPU_OS' ,AAS,0)) CPU_OS,
sum(decode(CLASS,'Commit' ,AAS,0)) COMMIT,
sum(decode(CLASS,'User I/O' ,AAS,0)) READIO,
sum(decode(CLASS,'Wait' ,AAS,0)) WAIT
from AASSTAT)
)
as Value FROM Dual
UNION ALL
select 'Space' as Name, (
select count(1) FROM (
SELECT df.tablespace_name,
df.file_name,
df.size_mb,
f.free_mb,
df.max_size_mb,
f.free_mb + (df.max_size_mb - df.size_mb) AS max_free_mb,
ROUND((df.max_size_mb-(f.free_mb + (df.max_size_mb - df.size_mb)))/max_size_mb,0) AS used_pct
FROM (SELECT file_id,
file_name,
tablespace_name,
TRUNC(bytes/1024/1024) AS size_mb,
TRUNC(GREATEST(bytes,maxbytes)/1024/1024) AS max_size_mb
FROM dba_data_files) df,
(SELECT TRUNC(SUM(bytes)/1024/1024) AS free_mb,
file_id
FROM dba_free_space
GROUP BY file_id) f
WHERE df.file_id = f.file_id (+)
ORDER BY df.tablespace_name,
df.file_name
) where USED_PCT > 85) as Value from Dual
UNION ALL
select 'Invalid Objects' as Name, (SELECT count(1)
FROM dba_objects
WHERE status = 'INVALID') as Value FROM Dual
Instance Dashboard
When you zoom into an instance, you get the Dashboard for that instance.
The dashboard shows you the resource usage on the instance and some key system statistics. Here's the query used to collect various CPU counts from the server, which is run twice with one second interval and then the delta is calculated so derive the values for the chart:
SELECT systimestamp as DATETIME,
(SELECT value
FROM v$osstat
WHERE stat_name = 'NUM_CPUS') CPU,
sum(decode(stat_name,'IDLE_TIME', value, NULL)) as idle_time,
sum(decode(stat_name,'USER_TIME', value, NULL)) as user_time,
sum(decode(stat_name,'SYS_TIME', value, NULL)) as sys_time,
sum(decode(stat_name,'IOWAIT_TIME', value, NULL)) as iowait_time,
sum(decode(stat_name,'NICE_TIME', value, NULL)) as nice_time
FROM v$osstat
WHERE stat_name in ('IDLE_TIME','USER_TIME','SYS_TIME','IOWAIT_TIME','NICE_TIME')
GROUP BY systimestamp
Resource graph
On the left, it shows the CPU usage on the OS and shows you how much CPU is used by user, sys, IO. Usually when there's a IO bottleneck, you will see the IO line goes up.
On the right, it shows the CPU consumed by each session. If some session is off the chart, you know that session is the culprit.
At the bottom, there are 4 key indicators. They will be red if they aren't optimal or at least what the best practice guideline says they should be. For example, Buffer Cache hit Ratio should be over 95%. You can see on the graph that sometimes it becomes green, because sometimes it is over 95%.
Session details
This shows details about the ACTIVE sessions. You can see if there's any session consuming high CPU, high physical read, hard parse and so on. The yellow warning indicator will come up if there's something off.
Here's the query that collects data for this:
SELECT
SID,
v$session.STATUS,
USERNAME || ' (' || OSUSER || ')' "USER",
MACHINE,
Logon_time,
SYS.AUDIT_ACTIONS.NAME "Command",
(case
when v$session.STATUS = 'ACTIVE' and v$session.ROW_WAIT_OBJ# > 0 then (select OBJECT_NAME FROM dba_objects WHERE v$session.ROW_WAIT_OBJ# = dba_objects.object_ID)
else ''
end) "Wait Object",
sql_text,
(SELECT ROUND (VALUE / 1024 / 1024, 2)
FROM v$sesstat
WHERE STATISTIC# = 25 AND v$sesstat.SID = v$session.sid)
AS "PGA Memory, in MB",
(SELECT VALUE
FROM v$sesstat
WHERE STATISTIC# = 12 AND v$sesstat.SID = v$session.sid)
AS "CPU, used by session",
ROUND ( (SELECT VALUE
FROM v$sesstat
WHERE STATISTIC# = 339 AND v$sesstat.SID = v$session.sid)
/ (SELECT DECODE (VALUE, 0, 1, VALUE)
FROM v$sesstat
WHERE STATISTIC# = 338 AND v$sesstat.SID = v$session.sid),
2)
AS "Hard Parse, %",
(SELECT ROUND (VALUE / 1024 / 1024, 2)
FROM v$sesstat
WHERE STATISTIC# = 58 AND v$sesstat.SID = v$session.sid)
AS "Physical read bytes, in MB",
(SELECT ROUND (VALUE / 1024 / 1024, 2)
FROM v$sesstat
WHERE STATISTIC# = 66 AND v$sesstat.SID = v$session.sid)
AS "Physical write bytes, in MB",
(SELECT ROUND (VALUE / 1024 / 1024, 2)
FROM v$sesstat
WHERE STATISTIC# = 139 AND v$sesstat.SID = v$session.sid)
AS "Redo size, in MB",
(SELECT ROUND (VALUE / 1024 / 1024, 2)
FROM v$sesstat
WHERE STATISTIC# = 344 AND v$sesstat.SID = v$session.sid)
AS "Received from client, in MB",
(SELECT ROUND (VALUE / 1024 / 1024, 2)
FROM v$sesstat
WHERE STATISTIC# = 343 AND v$sesstat.SID = v$session.sid)
AS "Sent to client, in MB",
SERIAL#,
PROGRAM,
OWNERID,
PROCESS
FROM v$session
LEFT OUTER JOIN
v$sqlarea
ON
v$session.sql_hash_value = v$sqlarea.hash_value
AND v$session.sql_address = v$sqlarea.address
LEFT OUTER JOIN SYS.AUDIT_ACTIONS ON v$session.command = SYS.AUDIT_ACTIONS.action
WHERE v$session.LOGON_TIME BETWEEN TRUNC (SYSDATE) AND SYSDATE
AND
v$session.sid != userenv('SESSIONID')
AND
rownum < 10
ORDER BY 2, 10 DESC
The output of this query is bound to a GridView, which then looks at each row and decides whether to show any warning or not:
<asp:GridView CssClass="table table-striped" ID="GridView1" runat="server" DataSourceID="sqlDataSource" EnableModelValidation="True">
<Columns>
<asp:TemplateField>
<ItemTemplate>
<%#HandleDbNull(Eval("Hard Parse, %"))>0 ? "<span class='label label-warning'>Hard Parse</span>" : ""%>
<%#HandleDbNull(Eval("PGA Memory, in MB"))>10 ? "<span class='label label-warning'>PGA</span>" : ""%>
<%#HandleDbNull(Eval("CPU, used by session"))>3000 ? "<span class='label label-warning'>CPU</span>" : ""%>
<%#HandleDbNull(Eval("Physical read bytes, in MB"))>100 ? "<span class='label label-warning'>Physical Read</span>" : ""%>
<%#HandleDbNull(Eval("Physical write bytes, in MB"))>100 ? "<span class='label label-warning'>Physical Write</span>" : ""%>
<%#HandleDbNull(Eval("Redo size, in MB"))>10 ? "<span class='label label-warning'>Redo</span>" : ""%>
</ItemTemplate>
</asp:TemplateField>
</Columns>
</asp:GridView>
Bad Indexes
This is very handy for developers because you can quickly see which indexes are bad by design, because you have columns in the index which have very low selectivity. For example, on the screenshot, you can see there are some killer index. Trying to use those indexes will actually kill your server. You can see the first example that there are just 7 unique values on column PRODUCT_ID, but there are over 5 million rows. So, each PRODUCT_ID has probably got over 1 million rows against it.
Some developer thought that there's a query that has WHERE PRODUCT_ID=X and thus created the index on PRODUCT_ID. But it is not going to work in this case.
Sometimes during development and testing, we do not have representative data from production database and thus we do not catch such issues. Only when we go-live and go down the next hour, we realize what we have done wrong. This view of Bad Indexes will help identify such wrong index designs.
Here's the query that finds these bad indexes:
select i.table_name,i.index_name,
(select rtrim (xmlagg (xmlelement (e, column_name || ', ')).extract ('//text()'), ',') column_names from USER_IND_COLUMNS where index_name=i.index_name) column_names,
TO_CHAR(ROUND((i.distinct_keys/(i.num_rows+1))*100, 3), '999.000') as Uniqueness,
case
when i.distinct_keys/(i.num_rows+1) = 0 then 'IGNORE'
when i.distinct_keys/(i.num_rows+1) < 0.1 AND num_rows > 1000 then 'KILLER!'
when i.distinct_keys/(i.num_rows+1) < 0.1 AND num_rows < 1000 then 'IGNORE'
when i.distinct_keys/(i.num_rows+1) < 0.2 AND num_rows > 1000 then 'SERIOUSLY POOR'
when i.distinct_keys/(i.num_rows+1) < 0.5 AND num_rows > 1000 then 'POOR'
when i.distinct_keys/(i.num_rows+1) < 0.8 then 'OK'
when i.distinct_keys/(i.num_rows+1) < 0.9 then 'GOOD'
else 'VERY GOOD'
end Quality,
i.distinct_keys,i.num_rows,i.blevel,i.leaf_blocks,i.avg_leaf_blocks_per_key,i.avg_data_blocks_per_key from user_indexes i
where i.table_name in (select table_name from user_tables)
and i.num_rows > 0
group by table_name,i.index_name,i.distinct_keys,i.num_rows,i.blevel,i.leaf_blocks,i.avg_leaf_blocks_per_key,i.avg_data_blocks_per_key
order by Uniqueness, num_rows desc
Locks, Waits
Locks and waits are common culprits causing performance degradation on the servers. You can quickly see from this tool what locks are being held, and what are the expensive waits.
It will show you the exact query holding lock and causing contention or even deadlocks. Here you see that session ID1 and ID2 grabbing exclusive lock and getting into a deadlock.
How it works
Look ma, no AJAX!
You will notice that the panels are refreshing periodically. You might think I am using AJAX to call some serverside web service in order to get JSON/XML response, and then use some jQuery template to render the html output. Nope. I am using what our ancestors have happily used for generations. The natural, organic IFRAME solution, with no side effect. The html output for each panel comes from individual ASP.NET pages, via IFRAMEs and then they get injected into a DIV on the main Dashboard page.
There are several benefits to this approach:
- The widgets are individual pages, which user can browse directly in full browser view.
- Each widget is a plain ASP.NET page. No need to build webservices to return data in JSON/XML format. No need for any entity classes either that you usually use to serialize into JSON/XML.
- The HTML content is generated server side, using regular ASP.NET. Thus there's no need to use any Javascript based HTML templating library.
- As there's no need for AJAX or html templating, there's no need to worry about jQuery or its plugin breaking compatibility in new versions, and updating javascript libraries regularly.
Let's see how to do this. First the HTML markup to draw the panels:
<div class="row">
<div class="panel panel-success">
<div class="panel-heading"><a href="WhoIsActive.aspx?c=<%= ConnectionString %>">What's going on</a></div>
<div class="panel-body panel-body-height" id="WhoIsActive">
<div class="progress">
<div class="progress-bar progress-bar-striped" style="width: 60%"><span class="sr-only">100% Complete</span></div>
</div>
</div>
<iframe class="content_loader" onload="setContent(this, 'WhoIsActive')" src="WhoIsActive.aspx?c=<%= ConnectionString %>" style="width: 100%; height: 100%; border: none; display: none" frameborder="0"></iframe>
</div>
</div>
This is the markup taken from the Twitter Bootstrap theme.
You will notice there's an invisible IFRAME there. When the IFRAME loads, it calls the setContent
function. That function takes the whole content of the IFRAME and injects inside the panel-body div.
function setContent(iframe, id) {
...
$('#' + id)
.html($(iframe).contents().find("form").html())
.dblclick(function () {
iframe.contentWindow.location.reload();
})
...
}
There you go, clean AJAX-like solution without any AJAX: no XMLHTTP, no JSON plumbing, no HTML templating, no server-side webservice.
Now this would not work for any event handler that is hooked inside the IFRAME. So, how does the click on a query show the popup window with the full query? Also if it was an IFRAME, shouldn't the popup actually come inside the IFRAME?
The click functionality is done on the main Dashboard page. After injecting the content into the DIV, it hooks the click handlers that shows the popup on the main page:
function setContent(iframe, id) {
$('#' + id)
.find('td.large-cell').off('click');
if ($('#' + id).scrollLeft() == 0) {
$('#' + id)
.html($(iframe).contents().find("form").html())
.dblclick(function () {
iframe.contentWindow.location.reload();
})
.find('td.large-cell').find('div').click(function () {
$('#content_text').text($(this).html());
$('#basic-modal-content').modal();
});
}
Here it looks for any <TD>
having the class large-cell. It then hooks the click even on it and shows the modal dialog box. The modal dialog box is from Eric Martin's SimpleModal plugin.
Plotting the charts
The chart uses the jQuery plugin Flot to render some of the performance counters as running charts.
There's a SysStats.aspx which is responsible for rendering the table showing the performance counters. It picks some important counters, and marks them to appear on the chart. First it runs through the table, looking for the counters, and marks the label of the counter as x-axis and value of the counter as y-axis:
var plot = ["Buffer cache hit ratio (95+)", "Parse CPU to total CPU ratio (<30)", "Shared pool free (~0)", "Parse to execute ratio (<0.3)"];
$('td').each(function (i, e) {
td = $(e);
if (td.text().trim().length > 0) {
for (var i = 0; i < plot.length; i ++) {
if (plot[i] == td.text().trim()) {
td.addClass("x-axis");
td.next().addClass("y-axis");
}
}
}
})
Now this page is hosted on the Dashboard page inside an IFRAME. So, the Dashboard page scans the IFRAME content, looks for these labels, picks their values and passes to the Flot chart plugin:
$(iframe).contents().find("form").find(".x-axis").each(function (i, e) {
var x = $(e);
var y = x.next('.y-axis');
var xname = x.text();
var yvalue = parseInt(y.text());
if (datasets[xname]) {
var data = datasets[xname].data;
data.pop();
data.splice(0, 0, yvalue);
}
});
Rest of the job of updating the Flot chart is done by the usual Flot code:
function updatePlot() {
var index = 0;
$.each(datasets, function (key, val) {
var items = [];
for (var i = 0; i < val.data.length; i++)
items.push([i, val.data[i]]);
var data = { color: val.color, data: items };
if (plots[index] != null) {
plot = plots[index];
plot.setData([data]);
plot.draw();
}
else {
plot = $.plot("#placeholder" + (index + 1), [data], {
series: {
},
lines: { show: true, fill: true },
grid: {
hoverable: true,
clickable: true
},
yaxis: {
min: 0,
max: val.ymax
},
xaxis: {
show: false
}
});
That's it! Again, no AJAX, no webservice, no html templating, no JSON plubing. Pure organic IFRAME and html.
Conclusion
OPD tries to make it easy for us to monitor Oracle instance health. It gives meaningful information out of the verbose data produced by Oracle's built-in views. It shows you exactly where the problem is. You can customize the code anyway you like, adding your own warnings, own thresholds, and adjust it to suit your specific need.