Battle of The Nodes RAC Performance Myths
By Riyaj Shamsudeen
Who am I?
- 16 years using Oracle products; over 15 years as an Oracle DBA
- Certified DBA: versions 7.0, 7.3, 8, 8i & 9i
- Specializes in performance tuning, internals and E-Business Suite
- Principal at OraInternals, a performance consulting company: http://www.orainternals.com
- OakTable member
- Email: rshamsud@orainternals.com
- Blog: http://orainternals.wordpress.com
Disclaimer
These slides and materials represent the work and opinions of the author and do not constitute official positions of my current or past employers or any other organization. This material has been peer reviewed, but the author assumes no responsibility whatsoever for the test cases. If you corrupt your databases by running my scripts, you are solely responsible for that. This material should not be reproduced or used without the author's written permission.
Agenda - Myths
- High CPU usage on one node does not affect other nodes' performance.
- All global cache performance issues are due to interconnect performance.
- Inter-instance parallelism is excellent, since CPUs from all nodes can be used effectively.
- Set sequences to NOCACHE in RAC environments to avoid gaps in sequence values.
- Small tables should not be indexed in RAC.
- Bitmap index performance is worse than in a single instance.
[Diagram: Instance #1, Instance #2 and Instance #3 share one database; one node is designated as the reporting node.]
The idea here is to put online, money-paying users on all nodes and throw costly reports, ad hoc SQL and batch jobs into one node. Only a small part of the online users are on the batch node. High CPU usage in the batch node shouldn't cause any issues for online users, right? If a SQL statement is bad, don't worry about tuning it; let it run in the report node, it won't affect online users' performance much, right? If a batch process is costly, no need to tune it, just run it in the batch node. Not exactly!
OraInternals Riyaj Shamsudeen 6
GC waits
GC CR waits: the latency of 'gc cr grant 2-way' (10g) and 'global cache cr request' (9i) waits increases with global cache layer latencies.
                                  %Time  Total Wait   Avg wait   Waits
Event                    Waits    -outs    Time (s)       (ms)    /txn
--------------------- --------- ------- ----------- ---------- -------
gc cr grant 2-way        11,518     3.0          23          2    14.7
Much of these GC waits are blamed on the interconnect interface and hardware. In many cases, the interconnect is performing fine; it is the GCS server (LMS) processes that are introducing latency.
A typical response from DBAs trying to improve global cache performance is to increase the number of LMS processes by adjusting _lm_lms (9i) or gcs_server_processes (10g). This can have a detrimental effect on performance: more LMS processes increase latency due to TLB thrashing. From mpstat/trapstat outputs, an increased amount of xcalls/migrations/TLB misses is visible. A few busy LMS processes are better than many quasi-busy LMS processes.
In 9i, increasing the priority of LMS processes to RT helps (covered in more detail later). From Oracle release 10.2.0.3 onwards, LMS processes run at real-time priority, which alleviates much of the performance issues around LMS. Two parameters control this behaviour: _high_priority_processes, the high-priority process name mask with a default value of LMS*, and _os_sched_high_priority, the OS high-priority level with a default value of 1.
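These underscore parameters are not visible in v$parameter. A common sketch for inspecting them reads the x$ fixed tables directly (must be run as SYS; x$ table layouts are version dependent, column names here are from 10g):

```sql
-- Hedged sketch: show the current values of the two hidden parameters
-- that control LMS scheduling priority. Run as SYS.
SELECT n.ksppinm  AS parameter_name,
       v.ksppstvl AS current_value
FROM   x$ksppi  n,
       x$ksppcv v
WHERE  n.indx = v.indx
AND    n.ksppinm IN ('_high_priority_processes', '_os_sched_high_priority');
```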
Agenda - Myths
- High CPU usage on one node does not affect other nodes' performance.
- All global cache performance issues are due to interconnect performance.
- Inter-instance parallelism is excellent, since CPUs from all nodes can be used effectively.
- Set sequences to NOCACHE in RAC environments to avoid gaps in sequence values.
- Small tables should not be indexed in RAC.
- Bitmap index performance is worse than in a single instance.
Node1 GC workload

[AWR excerpt: "Global Cache and Enqueue Services - Workload Characteristics" for node 1. The value-to-metric pairing was garbled in extraction; the excerpt covers avg global enqueue get time (ms), avg global cache cr block build time (ms), global cache log flushes for cr and current blocks served %, and avg global cache current block flush time (ms), with values in the 0.1 to 63.3 range.]
Interconnect performance
Before LMS sends a block to a remote cache, it waits for a log flush to complete. Even CR block transfers suffer from this wait; current (CUR) blocks, of course, need the log flush completed. So:

Global cache latency ~= interconnect latency for messages to and from LMS + LMS processing latency + LGWR log flush latency
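This decomposition can be eyeballed from the wait interface. A hedged sketch comparing average global cache block transfer latency with log-write latency on each instance (standard gv$ views; event names as in 10g):

```sql
-- If 'log file sync' / 'log file parallel write' averages are high on a node,
-- expect gc current/cr block service times from that node to rise with them.
SELECT inst_id,
       event,
       ROUND(time_waited_micro / NULLIF(total_waits, 0) / 1000, 2) AS avg_ms
FROM   gv$system_event
WHERE  event IN ('gc cr block 2-way', 'gc current block 2-way',
                 'log file parallel write', 'log file sync')
ORDER  BY inst_id, event;
```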
Node2 GC workload
In this specific case, the log flush was very slow due to a hardware issue.
[AWR excerpt, "Global Cache and Enqueue Services - Workload Characteristics" for node 2: avg global enqueue get time 0.3 ms. The remaining value-to-metric pairing was garbled in extraction, but an avg global cache current block flush time on the order of 4,380 ms stands out, confirming the slow log flush.]
LGWR priority
LGWR processes should also run with higher priority, in addition to LMS processes.
Better write throughput on the redo log files is essential for overall RAC performance. Heavy interconnect block transfer inevitably results in a hyperactive LGWR. Increase the priority of LGWR and LMS (Solaris example):

priocntl -e -c <class> -m <userlimit> -p <priority> <pid>
priocntl -e -c RT -p 59 `pgrep -f ora_lgwr_${ORACLE_SID}`
priocntl -e -c FX -m 60 -p 60 `pgrep -f ora_lms[0-9]*_${ORACLE_SID}`
Binding..
Another option is to bind LGWR/LMS to specific processors or processor sets. Still, interrupts can preempt LMS and LGWR, so binding LMS to a processor set with interrupts disabled helps (see psradm in Solaris).

But, of course, processor binding is applicable only to servers with a high CPU count, such as E25K platforms.
Summary
In summary:
- Use an optimal number of LMS processes.
- Run LMS and LGWR processes at RT or FX high priority.
- Configure decent hardware for the online redo log files, and tune LGWR writes.
- Avoid double buffering and double copying by choosing optimal file systems.
- Of course, tune the SQL.
Agenda - Myths
- High CPU usage on one node does not affect other nodes' performance.
- All global cache performance issues are due to interconnect performance.
- Inter-instance parallelism is excellent, since CPUs from all nodes can be used effectively.
- Set sequences to NOCACHE in RAC environments to avoid gaps in sequence values.
- Small tables should not be indexed in RAC.
- Bitmap index performance is worse than in a single instance.
Parallelism
A few parameters control this behaviour: parallel_min_servers and parallel_max_servers. Two more parameters are RAC specific: instance_groups and parallel_instance_group. In a multi-instance RAC cluster, these let us confine parallelism to specific instances.
Parallelism
Let's say that there are three instances: inst1, inst2, inst3. To span slaves across all instances:

inst1.instance_groups='inst1','all'
inst2.instance_groups='inst2','all'
inst3.instance_groups='inst3','all'
inst1.parallel_instance_group='all'
[Diagram: the query coordinator (QC) runs on Inst 1; parallel slaves (P001, P002, ...) are spawned on all three instances.]
Parallelism
To span slaves across instances inst1 and inst2 alone, the parameters will be:

inst1.instance_groups='inst1','all','inst12'
inst2.instance_groups='inst2','all','inst12'
inst3.instance_groups='inst3','all'
inst1.parallel_instance_group='inst12'
[Diagram: slaves P001/P002 run only on Inst 1 and Inst 2; no slaves on Inst 3.]
Parallel Select
alter session set parallel_instance_group='inst12';

select /*+ full(tl) parallel (tl,4) */
       avg(n1), max(n1), avg(n2), max(n2), max(v1)
from   t_large tl;

[Query output (garbled): slave processes p000/p001 for user CBQT, with INST_ID values of 1 and 2 only; the slaves were confined to instances 1 and 2.]
Parallel Select
alter session set parallel_instance_group='ALL';

select /*+ full(tl) parallel (tl,4) */
       avg(n1), max(n1), avg(n2), max(n2), max(v1)
from   t_large tl;

[tkprof excerpt (summary rows garbled in extraction): CPU/elapsed figures of 69.90 and 189.92 seconds appear in the totals for this run with slaves spanning all instances.]
PQ-Summary
Inter-instance parallelism needs to be carefully considered and measured. For partition-based processing, performance may be better when the work for a set of partitions is contained within one node. Excessive inter-instance parallelism increases interconnect traffic, leading to performance issues.
Agenda - Myths
- High CPU usage on one node does not affect other nodes' performance.
- All global cache performance issues are due to interconnect performance.
- Inter-instance parallelism is excellent, since CPUs from all nodes can be used effectively.
- Set sequences to NOCACHE in RAC environments to avoid gaps in sequence values.
- Small tables should not be indexed in RAC.
- Bitmap index performance is worse than in a single instance.
[Diagram callouts, cached vs. nocache sequences across two instances:]

Cached sequence (first callout lost in extraction):
2. These changes might not result in physical reads/writes.
3. Gaps in sequence values.
4. Still, a log flush is needed for the cache transfer.

Nocache sequence:
1. Three accesses to the sequence result in three block changes.
2. No gaps in sequence values.
3. But SEQ$ table blocks are transferred back and forth.
4. SEQ$ is updated with last_value as 11.
Test case (run concurrently from Inst 1 and Inst 2):

begin
  for loop_cnt in 1 .. 10000 loop
    insert into t1
    select t1_seq.nextval, lpad(loop_cnt, 500, 'x') from dual;
    -- Also making undo blocks to be pinged..
    commit;
  end loop;
end;
/
INSERT INTO T1 SELECT T1_SEQ.NEXTVAL, LPAD(:B1, 500, 'x') FROM DUAL

With a nocache sequence, every NEXTVAL also drives a recursive update of the data dictionary:

update seq$ set increment$=:2, minvalue=:3, maxvalue=:4, cycle#=:5, order$=:6,
       cache=:7, highwater=:8, audit$=:9, flags=:10
where  obj#=:1

[tkprof excerpt (totals garbled in extraction): both the INSERT and the seq$ update executed 10,000 times each; figures of roughly 3 seconds appear among the seq$ update's CPU/elapsed totals.]
[tkprof excerpt, wait events for the nocache run:]

Event waited on                Times Waited
----------------------------   ------------
row cache lock                         5413
gc current block 2-way                   46
gc cr block 2-way                        63
[tkprof excerpt (garbled): a second run showing about 11.5 seconds of elapsed time, with waits on gc current block 2-way (5,166 times), gc current grant busy and log file switch completion.]
Sequence- summary
Nocache sequences increase 'row cache lock' waits, increase interconnect traffic and increase elapsed time. If gap-free values are required, control sequence access from just one node or use non-sequence-based techniques.
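Where gaps are acceptable, the conventional alternative is a generously cached NOORDER sequence. A minimal sketch (the sequence name and cache size are illustrative, not from the test case):

```sql
-- Each instance caches its own range of 1000 values: no SEQ$ ping-pong
-- per NEXTVAL, at the cost of gaps and non-monotonic ordering across nodes.
CREATE SEQUENCE order_seq CACHE 1000 NOORDER;
```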
Agenda - Myths
- High CPU usage on one node does not affect other nodes' performance.
- Inter-instance parallelism is excellent, since CPUs from all nodes can be used effectively.
- Set sequences to NOCACHE in RAC environments to avoid gaps in sequence values.
- Small tables should not be indexed in RAC.
- Bitmap index performance is worse than in a single instance.
- All global cache performance issues are due to interconnect performance.
Small tables
Even small tables must be indexed. Excessive full table scans of smaller tables will increase CPU usage, and this guideline applies to RAC environments too. I think this myth arises from a misunderstanding of the problem.
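For the test case that follows, the indexed run relies on a plain b-tree index on the lookup column. The name below matches the T_SMALL2_N1 index visible in the trace output; the DDL itself is inferred, not shown in the deck:

```sql
-- Ordinary b-tree index on the small table's lookup column.
CREATE INDEX t_small2_n1 ON t_small2 (n1);
```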
Small tables
set timing on drop table t_small2; ; create table t_small2 (n1 number, v1 varchar2(10) ) tablespace users insert into t_small2 select n1, lpad(n1,10,'x') from commit;
select segment_name, sum(bytes)/1024 from dba_segments where segment_name='T_SMALL2' and owner='CBQT' group by segment_name SQL> / SEGMENT_NAME T_SMALL2
---------------------------------------- --------------256
SUM(BYTES)/1024
42
Test case
alter session set events '10046 trace name context forever, level 8';
set serveroutput on size 100000

declare
  v_n1 number;
  b_n1 number;
begin
  ...  -- loop body lost in extraction
  dbms_output.put_line (b_n1);
end;
/
[tkprof excerpt, full-table-scan run (garbled): 100,000 executions; about 3,100,000 consistent gets; CPU/elapsed figures of 62.72 and 63.71 seconds appear in the totals.]
[tkprof excerpt, indexed run: roughly 3.4 seconds CPU and elapsed for 100,000 executions, 300,209 consistent gets, 23 disk reads.]

Rows     Row Source Operation
-------  ---------------------------------------------------
100000   TABLE ACCESS BY INDEX ROWID T_SMALL2 (cr=300209 pr=23 pw=0 time=1896719 us)
100000    INDEX RANGE SCAN T_SMALL2_N1 (cr=200209 pr=23 pw=0 time=1109464 us)(object id 53783)
Agenda - Myths
- High CPU usage on one node does not affect other nodes' performance.
- Inter-instance parallelism is excellent, since CPUs from all nodes can be used effectively.
- All global cache performance issues are due to interconnect performance.
- Small tables should not be indexed in RAC.
- Triggers perform worse in RAC compared to single instance.
- Bitmap index performance is worse than in a single instance.
Bitmap index
Bitmap indexes are optimal for low-cardinality columns, and they are not suitable for tables with massive DML change rates. For select queries, bitmap index performance does not worsen because of RAC. Of course, having bitmap indexes on columns with enormous DML change rates is not optimal even in single-instance databases.
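A minimal illustration of where a bitmap index is appropriate (the table and column names are hypothetical, not from the deck's test case):

```sql
-- Bitmap index on a low-cardinality status column of a mostly-read table.
CREATE BITMAP INDEX orders_status_bx ON orders (status);
```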
[Test case: the same PL/SQL loop as in the b-tree test, run against the bitmap-indexed table; the body was lost in extraction.]
[tkprof excerpt, bitmap-index run (garbled): 100,000 executions; figures of 200,746 consistent gets and 78 disk reads appear in the totals.]
[tkprof excerpt (garbled): about 4.7 seconds CPU and 4.9 seconds elapsed for 100,000 executions, with 200,753 consistent gets; on par with the b-tree run.]
References
- Oracle support site: metalink.oracle.com (various documents)
- Internals guru Steve Adams' website: www.ixora.com.au
- Jonathan Lewis' website: www.jlcomp.daemon.co.uk
- Julian Dyke's website: www.julian-dyke.com
- Oracle8i Internal Services for Waits, Latches, Locks, and Memory, by Steve Adams
- Tom Kyte's website: asktom.oracle.com