StorNext 5
6-68046-01, Rev. E
StorNext 5 File System Tuning Guide, 6-68046-01, Rev. E, January 2015, Product of USA.
Quantum Corporation provides this publication as is without warranty of any kind, either express or implied, including but not limited to
the implied warranties of merchantability or fitness for a particular purpose. Quantum Corporation may revise this publication from time to
time without notice.
COPYRIGHT STATEMENT
2015 Quantum Corporation. All rights reserved.
Your right to copy this manual is limited by copyright law. Making copies or adaptations without prior written authorization of Quantum
Corporation is prohibited by law and constitutes a punishable violation of the law.
TRADEMARK STATEMENT
Quantum, the Quantum logo, DLT, DLTtape, the DLTtape logo, Scalar, StorNext, the DLT logo, DXi, GoVault, SDLT, StorageCare, Super
DLTtape, and SuperLoader are registered trademarks of Quantum Corporation in the U.S. and other countries. Protected by Pending and
Issued U.S. and Foreign Patents, including U.S. Patent No. 5,990,810. LTO and Ultrium are trademarks of HP, IBM, and Quantum in the U.S.
and other countries. All other trademarks are the property of their respective companies. Specifications are subject to change without
notice.
StorNext utilizes the following components which are copyrighted by their respective entities:
ACSAPI, copyright Storage Technology Corporation; Java, copyright Oracle Corporation; LibICE, LibSM, LibXau, LibXdmcp, LibXext,
LibXi copyright The Open Group; LibX11, copyright The Open Group, MIT, Silicon Graphics, and the Regents of the University of California,
and copyright (C) 1994-2002 The XFree86 Project, Inc. All Rights Reserved. And copyright (c) 1996 NVIDIA, Corp. NVIDIA design patents
pending in the U.S. and foreign countries.; Libxml2 and LibXdmcp, copyright MIT; MySQL, copyright Oracle Corporation; Ncurses,
copyright 1997-2009,2010 by Thomas E. Dickey <dickey@invisible-island.net>. All Rights Reserved.; TCL/TK, copyright Sun
Microsystems and the Regents of the University of California; vixie-cron: copyright Internet Systems Consortium (ISC); Wxp-tdi.h, copyright
Microsoft Corporation; Zlib, copyright 1995-2010 Jean-loup Gailly and Mark Adler.
Chapter 1
RAID Cache Configuration
Most RAID systems provide excellent performance with properly tuned caching. So, for the best general purpose
performance characteristics, it is crucial to utilize the RAID system
caching as fully as possible.
For example, write-back caching is absolutely essential for metadata
stripe groups to achieve high metadata operations throughput.
However, there are a few drawbacks to consider as well. For example,
read-ahead caching improves sequential read performance but might
reduce random performance. Write-back caching is critical for small
write performance but may limit peak large I/O throughput.
Caution: Some RAID systems cannot safely support write-back
caching without risk of data loss, which is not suitable for
critical data such as file system metadata.
Consequently, this is an area that requires an understanding of
application I/O requirements. As a general rule, RAID system caching is
critically important for most applications, so it is the first place to focus
tuning attention.
RAID Write-Back Caching
RAID Level
Configuration settings such as RAID level, segment size, and stripe size
are very important and cannot be changed after the system is put into
production, so it is critical to determine appropriate settings during
initial configuration.
Quantum recommends that Metadata and Journal stripe groups use RAID 1
because it is optimal for very small I/O sizes. Quantum also
recommends using Fibre Channel or SAS disks (as opposed to SATA) for
metadata and journal due to their higher IOPS performance and
reliability. It is also very important to allocate entire physical disks for
the Metadata and Journal LUNs in order to avoid bandwidth contention
with other I/O traffic. Metadata and Journal storage requires very high
IOPS rates (low latency) for optimal performance, so contention can
severely impact IOPS (and latency) and thus overall performance. If
Journal I/O exceeds 1ms average latency, you will observe significant
performance degradation.
The stripe size is the sum of the segment sizes of the data disks in the
RAID group. For example, a 4+1 RAID 5 group (four data disks plus one
parity) with 64kB segment sizes creates a stripe group with a 256kB
stripe size. The stripe size is a critical factor for write performance.
Writes smaller than the stripe size incur the read/modify/write penalty,
described more fully below. Quantum recommends a stripe size of
512kB or smaller.
The RAID stripe size configuration should typically match the SNFS
StripeBreadth configuration setting when multiple LUNs are utilized
in a stripe group. However, in some cases it might be optimal to
configure the SNFS StripeBreadth as a multiple of the RAID stripe
size, such as when the RAID stripe size is small but the user's I/O sizes are
very large. However, this will be suboptimal for small I/O performance,
so may not be suitable for general purpose usage.
To help the reader visualize the read/modify/write penalty, it may be
helpful to understand that the RAID can only write data onto the disks
in full stripe-sized packets of data. A write operation to the RAID that is
not an exact fit of one or more stripe-sized segments requires that the
last, or only, stripe segment be read first from the disks. The last, or
only, portion of the write data is then overlaid onto the read stripe
segment. Finally, the data is written back out onto the RAID disks in a
single full stripe segment. When RAID caching has been disabled (no
write-back caching), these read/modify/write operations require a read
of the stripe data segment into host memory before the data can be
properly merged and written back out. This is the worst case scenario
from a performance standpoint. The read/modify/write penalty is most
noticeable in the absence of write-back caching being performed by the
RAID controller.
It can be useful to use a tool such as lmdd to help determine the
storage system performance characteristics and choose optimal
settings. For example, varying the stripe size and running lmdd with a
range of I/O sizes might be useful to determine an optimal stripe size
multiple to configure the SNFS StripeBreadth.
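For example, a quick way to compare throughput at different transfer sizes is to write and read a test file with lmdd while varying the bs value (the path and sizes below are illustrative only):
lmdd of=/stornext/snfs1/testfile bs=256k move=4g
lmdd of=/stornext/snfs1/testfile bs=2m move=4g
lmdd if=/stornext/snfs1/testfile bs=2m move=4g
Comparing the reported MB/sec at each size helps identify a stripe size and StripeBreadth multiple that the storage handles well.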
The deviceparams file is used to control the I/O scheduler and the
scheduler's queue depth.
For more information about this file, see the deviceparams man page,
or the StorNext Man Pages Reference Guide posted here (click the
Select a StorNext Version menu to view the desired documents):
http://www.quantum.com/sn5docs
The I/O throughput of Linux Kernel 2.6.10 (SLES10 and later and RHEL5
and later) can be increased by adjusting the default I/O settings.
Note: SLES 10 is not supported in StorNext 5.
Beginning with the 2.6 kernel, the Linux I/O scheduler can be changed
to control how the kernel does reads and writes. There are four types of
I/O scheduler available in most versions of Linux kernel 2.6.10 and
higher:
The completely fair queuing scheduler (CFQ)
The no operation scheduler (NOOP)
The deadline scheduler (DEADLINE)
The anticipatory scheduler (ANTICIPATORY)
Note: ANTICIPATORY is not present in SLES 11 SP2.
For example, the scheduler's queue depth can be increased with a setting such as:
nr_requests=4096
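As an illustration only (this uses the generic Linux sysfs interface rather than the deviceparams file, and the device name sdb is hypothetical), the scheduler and queue depth can be inspected and changed at run time as follows:
cat /sys/block/sdb/queue/scheduler
echo deadline > /sys/block/sdb/queue/scheduler
echo 4096 > /sys/block/sdb/queue/nr_requests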
In addition, there are three Linux kernel parameters that can be tuned
for increased performance:
1 The minimal preemption granularity variable for CPU-bound tasks.
kernel.sched_min_granularity_ns = 10ms
echo 10000000 > /proc/sys/kernel/sched_min_granularity_ns
2 The wake-up preemption granularity variable. Increasing this
variable reduces wake-up preemption, reducing disturbance of
compute-bound tasks. Lowering it improves wake-up latency and
throughput for latency-critical tasks.
kernel.sched_wakeup_granularity_ns = 15ms
echo 15000000 > /proc/sys/kernel/sched_wakeup_granularity_ns
3 The vm.dirty_background_ratio variable. Its default value of 10 is a
percentage of total system memory and sets the number of dirty pages
at which the pdflush background writeback daemon starts writing out
dirty data. However, for a fast RAID-based disk system, this may cause
large flushes of dirty memory pages. Increasing this value will result
in less frequent flushes.
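For example, the value can be raised at run time as follows (the value 20 is illustrative and should be validated against the workload); adding vm.dirty_background_ratio = 20 to /etc/sysctl.conf makes the change persistent:
echo 20 > /proc/sys/vm/dirty_background_ratio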
Buffer Cache
Reads and writes that aren't well-formed utilize the SNFS buffer cache.
This also includes NFS or CIFS-based traffic because the NFS and CIFS
daemons defeat well-formed I/Os issued by the application.
There are several configuration parameters that affect buffer cache
performance. The most critical is the RAID cache configuration because
buffered I/O is usually smaller than the RAID stripe size, and therefore
incurs a read/modify/write penalty. It might also be possible to match
the RAID stripe size to the buffer cache I/O size. However, it is typically
most important to optimize the RAID cache configuration settings
described earlier in this document.
It is usually best to configure the RAID stripe size no greater than 256K
for optimal small file buffer cache performance.
For more buffer cache configuration settings, see Mount Command
Options on page 40.
To disable leases on the NFS server, set:
fs.leases-enable=0
Note: If the Samba daemon (smbd) is also running on the server,
disabling leases will prevent the use of the Samba kernel
oplocks feature. In addition, any other application that makes
use of leases through the F_SETLEASE fcntl will be affected.
Leases do not need to be disabled in this manner for NFS servers
running SLES11 or RHEL6.
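A minimal way to apply the leases setting persistently, assuming a standard sysctl setup, is to add the line fs.leases-enable = 0 to /etc/sysctl.conf and then load it:
sysctl -p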
NFS / CIFS
It is best to isolate NFS and/or CIFS traffic off of the metadata network
to eliminate contention that will impact performance. On NFS clients,
use the vers=3, rsize=1048576 and wsize=1048576 mount options.
When possible, it is also best to utilize TCP Offload capabilities as well as
jumbo frames.
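For example, an NFS mount using these options might look like the following (the server name, export path, and mount point are hypothetical):
mount -t nfs -o vers=3,tcp,rsize=1048576,wsize=1048576 nfsserver:/export/snfs1 /mnt/snfs1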
Note: Jumbo frames should only be configured when all of the
relevant networking components in the environment support
them.
Note: When Jumbo frames are used, the MTU on the Ethernet
interface should be configured to an appropriate size. Typically,
the correct value is 9000, but may vary depending on your
networking equipment. Refer to the documentation for your
network adapter.
It is best practice to have clients directly attached to the same network
switch as the NFS or CIFS server. Any routing required for NFS or CIFS
traffic incurs additional latency that impacts performance.
It is critical to make sure the speed/duplex settings are correct, because
this severely impacts performance. Most of the time, auto-negotiation
is the correct setting for the Ethernet interface used for the NFS or CIFS
traffic.
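On Linux, the negotiated speed and duplex of an interface can be verified with ethtool (the interface name is illustrative):
ethtool eth0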
It is best practice to have all SNFS clients directly attached to the same
network switch as the MDC systems. Any routing required for metadata
traffic will incur additional latency that impacts performance.
It can be useful to use a tool like netperf to help verify the Metadata
Network performance characteristics. For example, if netperf -t TCP_RR
-H <host> reports less than 4,000 transactions per second capacity, a
performance penalty may be incurred. You can also use the netstat tool
to identify tcp retransmissions impacting performance. The cvadmin
latency-test tool is also useful for measuring network latency.
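For example, the following commands give a quick view of round-trip transaction capacity and TCP retransmissions (the exact netstat output fields vary by platform):
netperf -t TCP_RR -H <MDC host>
netstat -s | grep -i retrans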
Note the following configuration requirements for the metadata
network:
In cases where gigabit networking hardware is used and maximum
StorNext performance is required, a separate, dedicated switched
Ethernet LAN is recommended for the StorNext metadata network.
If maximum StorNext performance is not required, shared gigabit
networking is acceptable.
Stripe Groups
Splitting apart data, metadata, and journal into separate stripe groups
is usually the most important performance tactic. The create, remove,
and allocate (e.g., write) operations are very sensitive to I/O latency of
the journal stripe group. However, if create, remove, and allocate
performance aren't critical, it is okay to share a stripe group for both
metadata and journal, but be sure to set the exclusive property on the
stripe group so it doesn't get allocated for data as well.
Note: It is recommended that you have only a single metadata stripe
group. For increased performance, use multiple LUNs (2 or 4)
for the stripe group.
RAID 1 mirroring is optimal for metadata and journal storage. Utilizing
the write-back caching feature of the RAID system (as described
previously) is critical to optimizing performance of the journal and
metadata stripe groups.
[StripeGroup MetaFiles]
Status Up
StripeBreadth 256K
Metadata Yes
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 200
Rtios 200
RtmbReserve 1
RtiosReserve 1
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk0 0
[StripeGroup JournFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal Yes
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk1 0
[StripeGroup RegularFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk14 0
Node CvfsDisk15 1
Node CvfsDisk16 2
Node CvfsDisk17 3
Affinities
Affinities are another stripe group feature that can be very beneficial.
Affinities can direct file allocation to appropriate stripe groups
according to performance requirements. For example, stripe groups can
be set up with unique hardware characteristics such as fast disk versus
slow disk, or wide stripe versus narrow stripe. Affinities can then be
employed to steer files to the appropriate stripe group.
For optimal performance, files that are accessed using large DMA-based
I/O could be steered to wide-stripe stripe groups. Less performance-critical
files could be steered to slow disk stripe groups. Small files could
be steered clear of large files, or to narrow-stripe stripe groups.
Example (Windows)
[StripeGroup AudioFiles]
Status Up
StripeBreadth 1M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk10 0
Node CvfsDisk11 1
Node CvfsDisk12 2
Node CvfsDisk13 3
Affinity Audio
Note: Affinity names cannot be longer than eight characters.
StripeBreadth
This setting should match the RAID stripe size or be a multiple of the
RAID stripe size. Matching the RAID stripe size is usually the most
optimal setting. However, depending on the RAID performance
characteristics and application I/O size, it might be beneficial to use a
multiple or integer fraction of the RAID stripe size. For example, if the
RAID stripe size is 256K, the stripe group contains 4 LUNs, and the
application to be optimized uses DMA I/O with 8MB block size, a
StripeBreadth setting of 2MB might be optimal. In this example the
8MB application I/O is issued as 4 concurrent 2MB I/Os to the RAID. This
concurrency can provide up to a 4X performance increase. This
StripeBreadth typically requires some experimentation to determine the
RAID characteristics. The lmdd utility can be very helpful. Note that this
setting is not adjustable after initial file system creation.
Optimal range for the StripeBreadth setting is 128K to multiple
megabytes, but this varies widely.
Note: This setting cannot be changed after being put into
production, so it's important to choose the setting carefully
during initial configuration.
Example (Windows)
[StripeGroup VideoFiles]
Status Up
StripeBreadth 4M
Metadata No
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk2 0
Node CvfsDisk3 1
Node CvfsDisk4 2
Node CvfsDisk5 3
Node CvfsDisk6 4
Node CvfsDisk7 5
Node CvfsDisk8 6
Node CvfsDisk9 7
Affinity Video
BufferCacheSize
Increasing this value can reduce latency of any metadata operation by
performing a hot cache access to directory blocks, inode information,
and other metadata info. This is about 10 to 1000 times faster than I/O.
It is especially important to increase this setting if metadata I/O latency
is high, (for example, more than 2ms average latency). Quantum
recommends sizing this according to how much memory is available;
more is better. Optimal settings for BufferCacheSize range from
32MB to 8GB for a new file system and can be increased up to 500GB as
a file system grows. A higher setting is more effective if the CPU is not
heavily loaded.
Example (Linux)
<bufferCacheSize>268435456</bufferCacheSize>
Example (Windows)
BufferCacheSize 256MB
In StorNext 5, the default value for the BufferCacheSize parameter in
the file system configuration file changed from 32 MB to 256 MB.
While uncommon, if a file system configuration file is missing this
parameter, the new value will be in effect. This may improve
performance; however, the FSM process will use more memory than it
did with previous releases (up to 400 MB).
To avoid the increased memory utilization, the BufferCacheSize
parameter may be added to the file system configuration file with the
old default values.
Example (Linux)
<bufferCacheSize>33554432</bufferCacheSize>
Example (Windows)
BufferCacheSize 32M
InodeCacheSize
This setting consumes about 1400 bytes of memory times the number
specified. Increasing this value can reduce latency of any metadata
operation by performing a hot cache access to inode information
instead of an I/O to get inode info from disk, about 100 to 1000 times
faster. It is especially important to increase this setting if metadata I/O
latency is high, (for example, more than 2ms average latency). You
should try to size this according to the sum number of working set files
for all clients. Optimal settings for InodeCacheSize range from 16K to
128K for a new file system and can be increased to 256K or 512K as a
file system grows. A higher setting is more effective if the CPU is not
heavily loaded. For best performance, the InodeCacheSize should be
at least 1024 times the number of megabytes allocated to the journal.
For example, for a 64MB journal, the InodeCacheSize should be at
least 64K.
Example (Linux)
<inodeCacheSize>131072</inodeCacheSize>
Example (Windows)
InodeCacheSize 128K
In StorNext 5, the default value for the InodeCacheSize parameter in
the file system configuration file changed from 32768 to 131072.
While uncommon, if a file system configuration file is missing this
parameter, the new value will be in effect. This may improve
performance; however, the FSM process will use more memory than it
did with previous releases (up to 400 MB).
To avoid the increased memory utilization, the InodeCacheSize
parameter may be added to the file system configuration file with the
old default values.
Example (Linux)
<inodeCacheSize>32768</inodeCacheSize>
Example (Windows)
InodeCacheSize 32K
FsBlockSize
Beginning with StorNext 5, all SNFS file systems use a File System Block
Size of 4KB. This is the optimal value and is no longer tunable. Any file
systems created with pre-5 versions of StorNext having larger File
System Block Sizes will be automatically converted to use 4KB the first
time the file system is started with StorNext 5.
JournalSize
Beginning with StorNext 5, the recommended setting for JournalSize
is 64Mbytes.
Increasing the JournalSize beyond 64Mbytes may be beneficial for
workloads where many large size directories are being created, or
removed at the same time. For example, workloads dealing with 100
thousand files in a directory and several directories at once will see
improved throughput with a larger journal.
The downside of a larger journal size is potentially longer FSM startup
and failover times.
Using a value less than 64Mbytes may improve failover time but reduce
file system performance. Values less than 16Mbytes are not
recommended.
Note: Journal replay has been optimized with StorNext 5 so a
64Mbytes journal will often replay significantly faster with
StorNext 5 than a 16Mbytes journal did with prior releases.
A file system created with a pre-5 version of StorNext may have been
configured with a small JournalSize. This is true for file systems
created on Windows MDCs where the old default size of the journal was
4Mbytes. Journals of this size will continue to function with StorNext 5,
but will experience a performance benefit if the size is increased to
64Mbytes. This can be adjusted using the cvupdatefs utility. For more
information, see the command cvupdatefs in the StorNext MAN Pages
Reference Guide.
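As a sketch of that procedure (the file system name snfs1 is hypothetical): edit the JournalSize (or journalSize) value in the file system configuration file, stop the file system, and then apply the change with:
cvupdatefs snfs1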
If a file system previously had been configured with a JournalSize
larger than 64Mbytes, there is no reason to reduce it to 64Mbytes when
upgrading to StorNext 5.
Example (Windows)
JournalSize 64M
SNFS Tools
The snfsdefrag tool is very useful to identify and correct file extent
fragmentation. Reducing extent fragmentation can be very beneficial
for performance. You can use this utility to determine whether files are
fragmented, and if so, fix them.
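For example, the extent layout of a file can be listed with the -e option, and the file can then be defragmented by running snfsdefrag against it (the path is illustrative):
snfsdefrag -e /stornext/snfs1/largefile
snfsdefrag /stornext/snfs1/largefile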
Qustats
The qustats measure overall metadata statistics, physical I/O, VOP
statistics, and client-specific VOP statistics.
The overall metadata statistics include journal and cache information.
All of these can be affected by changing the configuration parameters
for a file system. Examples are increasing the journal size, and increasing
cache sizes.
The physical I/O statistics show number and speed of disk I/Os. Poor
numbers can indicate hardware problems or over-subscribed disks.
The VOP statistics show what requests SNFS clients are making to the
MDCs, which can show where workflow changes may improve
performance.
The client-specific VOP statistics can show which clients are generating
the VOP requests.
Examples of qustat operations:
Print the current stats to stdout:
# qustat -g <file_system_name>
Print a description of a particular stat:
There are a large number of qustat counters available in the output file,
most are debugging counters and are not useful in measuring or tuning
the file system. The items in the table below have been identified as the
most interesting counters. All other counters can be ignored.
Journal Waits (journal statistics)
Cache Stats, Read/Write (metadata cache statistics)
PhysIO Stats: max, sysmax, and average (physical metadata I/O statistics)
VOP Stats: counts for file and directory operations, including Create and Remove, Mkdir and Rmdir, Rename, Open and Close, ReadDir, and DirAttr
Client VOP Stats: the same operation counts broken out by client
The qustat command also supports the client module. The client is the
StorNext file system driver that runs in the Linux kernel. Unlike the cvdb
PERF traces, which track individual I/O operations, the qustat statistics
group like operations into buckets and track minimum, maximum, and
average duration times. In the following output, one can see that the
statistics show global counters for all file systems as well as counters for
individual file systems. In this example, only a single file system is shown.
# qustat -m client
# Table 1: Global.VFSOPS
# Table 2: Global.VNOPS
These tables list the global VFS and vnode operations (Mount, Lookup, Lookup Misses, Create, Link, Open, Close, Flush, Delete, Truncate, Read Calls, Read Bytes, Write Calls, and Write Bytes) with columns for NAME, TYP, COUNT, MIN, MAX, TOT/LVL, and AVG; TIM counters report durations and SUM counters report byte totals.
# Table 3: fs.slfs.vnops
This table reports the same vnode operation counters for the individual file system (slfs); in this sample its values track the global table closely because only a single file system is in use.
The remaining tables show read and write performance. Table 4, ending
in .san, shows reads and writes to attached storage.
# Table 4: fs.slfs.sg.VideoFiles.io.san
# Table 5: fs.slfs.sg.VideoFiles.io.gw
For the VideoFiles stripe group, these tables report read and write device times (Rd Time Dev, Wrt Time Dev) and byte counts (Rd Bytes Dev, Wrt Bytes Dev), each with COUNT, MIN, MAX, TOT/LVL, and AVG columns, along with write error (Err Wrt) and retry (Retry Wrt) counts.
Tables 6 through 9 show the same statistics, san and gateway, but they
are broken up by stripe groups. The stripe group names are AudioFiles
and RegularFiles.
You can use these statistics to determine the relative performance of
different stripe groups. You can use affinities to direct I/O to particular
stripe groups.
# Table 6: fs.slfs.sg.AudioFiles.io.san
# Table 7: fs.slfs.sg.AudioFiles.io.gw
# Table 8: fs.slfs.sg.RegularFiles.io.san
# Table 9: fs.slfs.sg.RegularFiles.io.gw
These tables report the same read and write time and byte counters for the AudioFiles and RegularFiles stripe groups.
# qustat -m client
# Table 1: Global.VFSOPS
# Table 2: Global.VNOPS
# Table 3: fs.slfs.vnops
# Table 4: fs.slfs.sg.VideoFiles.io.san
# Table 5: fs.slfs.sg.VideoFiles.io.lan
# Table 6: fs.slfs.sg.AudioFiles.io.san
# Table 7: fs.slfs.sg.AudioFiles.io.lan
# Table 8: fs.slfs.sg.RegularFiles.io.san
# Table 9: fs.slfs.sg.RegularFiles.io.lan
This second sample reports the same counters. Here the .san tables contain no I/O, and the per-stripe-group reads and writes appear in the .lan tables (along with read error and retry counts), showing that this client performed its I/O over the LAN rather than directly to attached storage.
The lmdd utility is very useful to measure raw LUN performance as well
as varied I/O transfer sizes. It is part of the lmbench package and is
available from http://sourceforge.net.
The cvdbset utility has a special Perf trace flag that is very useful to
analyze I/O performance. For example: cvdbset perf
Then, you can use cvdb -g to collect trace information.
Mount Command
Options
data from the file in the background for improved performance. The
default setting is optimal in most scenarios.
The auto_dma_read_length and auto_dma_write_length settings
determine the minimum transfer size where direct DMA I/O is
performed instead of using the buffer cache for well-formed I/O. These
settings can be useful when performance degradation is observed for
small DMA I/O sizes compared to buffer cache.
For example, if buffer cache I/O throughput is 200 MB/sec but 512K
DMA I/O size observes only 100MB/sec, it would be useful to determine
which DMA I/O size matches the buffer cache performance and adjust
auto_dma_read_length and auto_dma_write_length accordingly. The
lmdd utility is handy here.
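As an illustration (the file system name, mount point, and threshold values are hypothetical), these settings can be supplied as mount options:
mount -t cvfs -o auto_dma_read_length=1048576,auto_dma_write_length=1048576 snfs1 /stornext/snfs1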
The dircachesize option sets the size of the directory information cache
on the client. This cache can dramatically improve the speed of readdir
operations by reducing metadata network message traffic between the
SNFS client and FSM. Increasing this value improves performance in
scenarios where very large directories are not observing the benefit of
the client directory cache.
Optimistic Allocation
Note: It is no longer recommended that the InodeExpand
parameters (InodeExpandMin, InodeExpandMax and
InodeExpandInc) be changed from their default values. These
settings are provided for compatibility when upgrading file
systems.
The InodeExpand values are still honored if they are in the .cfgx file,
but the StorNext GUI does not allow these values to be set. Also, when
converting from .cfg to .cfgx files, if the InodeExpand values in the .cfg
file are found to be the default example values, these values are not set
in the new .cfgx. Instead, the new formula is used.
How Optimistic
Allocation Works
mark will extend the file by 10MB (2MB + 4MB + 4MB). This pattern
repeats until the file's allocation value is equal to or larger than
InodeExpandMax, at which point it's capped at InodeExpandMax.
This formula generally works well when it's tuned for the specific I/O
pattern. If it's not tuned, with certain I/O patterns it can cause
suboptimal allocations resulting in excess fragmentation or wasted
space from files being over allocated.
This is especially true if there are small files created with O_DIRECT, or
small files that are simultaneously opened by multiple clients which
cause them to use an InodeExpandMin that's too large for them.
Another possible problem is an InodeExpandMax that's too small,
causing the file to be composed of fragments smaller than it otherwise
could have been created with.
With very large files, if InodeExpandMax is not increased, the result can
be fragmented files due to the relatively small size of the allocations and
the large number that are needed to create a large file.
Another possible problem is an InodeExpandInc that's not aggressive
enough, again causing a file to be created with more fragments than it
could be created with, or to never reach InodeExpandMax because
writes stop before it can be incremented to that value.
Note: Although the preceding example uses DMA I/O, the
InodeExpand parameters apply to both DMA and non-DMA
allocations.
Optimistic Allocation
Formula
The following table shows the new formula (beginning with StorNext
4.x):
File Size (in bytes)          Optimistic Allocation
<= 16MB                       1MB
(The remaining rows continue the progression, with larger optimistic allocations of 4MB, 16MB, 64MB, and beyond for file-size tiers that extend through 256MB, 1GB, 4GB, 16GB, 64GB, and 256GB.)
Hardware
Configuration
Follow the steps below to modify the grub.conf file so that the Intel
sleep state is disabled. Making this change could result in increased
power consumption, but it helps prevent problems which result in
system hangs due to processor power transition.
1 For the above systems, prior to installation, add the following text to
the "kernel" line in /boot/grub/grub.conf:
idle=poll intel_idle.max_cstate=0 processor.max_cstate=1
2 Reboot the system for the change to take effect.
The table in this section summarizes where to set each parameter and how it helps: on Gateway systems, settings such as transfer_buffer_size_kb and transfer_buffer_count are adjusted by editing the distributed LAN server configuration with sndpscfg -E <fsname>; on DLC clients, the cache buffer size can be set to 512K in the file system mount options.
On older versions of Windows, set the registry DWORD key
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Tcp1323Opts
to the value of 3.
On Linux: Run sysctl. For example:
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
The exact syntax may vary by Linux version.
For details, refer to the documentation for
your version of Linux.
On Solaris: Run ndd. For example, ndd -set
/dev/tcp tcp_max_buf 4194304
The exact syntax may vary by Solaris version.
For details, refer to the documentation for
your version of Solaris.
On Windows: Systems running Vista or
newer do not require adjustment. For older
versions of Windows, add or set the DWORD
keys:
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\TcpWindowSize
HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\GlobalMaxTcpWindowSize
These should both be set to a value of 4MB
(0x400000 or 4194304) or greater.
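A minimal sketch of setting these keys from an elevated command prompt (illustrative only; verify against your Windows version before applying):
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v TcpWindowSize /t REG_DWORD /d 4194304 /f
reg add HKLM\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters /v GlobalMaxTcpWindowSize /t REG_DWORD /d 4194304 /f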
Network Configuration
and Topology
For maximum throughput, a StorNext LAN Client can use multiple NICs
on StorNext Gateway Servers. In order to take advantage of this feature,
each of the NICs on a given gateway must be on a different IP
subnetwork (this is a requirement of TCP/IP routing, not of SNFS - TCP/IP
can't utilize multiple NICs on the same subnetwork). An example of this
is shown in the following illustration.
Note: The diagram shows separate physical switches used for the two
subnetworks. They can, in fact, be the same switch, provided it
has sufficient internal bandwidth to handle the aggregate
traffic.
Scheduling requests across multiple subnetworks and multiple servers
via multiple network ports can be challenging. In particular, multiple
streams of large disk read requests, because of the additional latency
from disk, can lead to an imbalance of traffic across a client's network
ports. In some cases, it may be possible to tune this scheduling for a
particular application mix using the proxypath mount options. In other
cases, changing the network configuration might help. Matching the
number of server ports to the number of client ports, thereby reducing
the number of path choices, has been shown to improve the
performance of multiple streams of large reads.
For a detailed description of the proxypath mount options, see the
mount_cvfs man page.
Performance
The StorNext LAN Clients outperform NFS and CIFS for single-stream I/O
and provide higher aggregate bandwidth. For inferior NFS client
implementations, the difference can be more than a factor of two.
The StorNext LAN Client also makes extremely efficient use of multiple
NICs (even for single streams), whereas legacy NAS protocols allow only
a single NIC to be used. In addition, StorNext LAN Clients communicate
directly with StorNext metadata controllers instead of going through an
intermediate server, thereby lowering IOP latency.
Fault Tolerance
Load Balancing
Consistent Security
Model
StorNext LAN Clients have the same security model as StorNext SAN
Clients. When CIFS and NFS are used, some security models aren't
supported. (For example, Windows ACLs are not accessible when
running UNIX Samba servers.)
To reset the counters, disable the counters, and then re-enable the counters with
"cvdb -P".
To view the performance monitor counters:
Note: The following instructions apply to versions of Windows
supported by StorNext 5 (for example, Windows Vista and
newer). Refer to earlier versions of the StorNext Tuning Guide
when enabling Windows Performance Monitor Counters for
older versions of Windows.
1 Start the performance monitor.
a Click the Start icon.
b In the Search programs and files dialog, enter:
perfmon
2 Click the "add counter" icon.
3 Select either "StorNext Client" or "StorNext Disk Agent".
Note: The "StorNext Disk Agent" counters are internal debugging/
diagnostic counters used by Quantum personnel and are not
helpful in performance tuning of your system.
4 Select an individual counter.
5 To display additional information about the counter:
Check the "Show description" checkbox.
1 Add the following text to the kernel line in the GRUB configuration file:
intel_idle.max_cstate=0 processor.max_cstate=1
2 Reboot the system for the change to take effect.
On distributions that manage GRUB with update-grub, also run "update-grub"
before rebooting so the change is applied.
Note: Disabling CPU power saving states in the system BIOS has no
effect on Linux.
In some cases, performance can also be improved by adjusting the idle
kernel parameter. However, care should be taken when using certain
values. For example, idle=poll maximizes performance but is
incompatible with hyperthreading (HT) and will lead to very high power
consumption. For additional information, refer to the documentation
for your version of Linux.
On Windows, disable CPU power saving states by adjusting BIOS
settings. Refer to system vendor documentation for additional
information.
Linux Example
Configuration File
Below are the contents of the StorNext example configuration file for
Linux (example.cfgx):
<?xml version="1.0"?>
<configDoc xmlns="http://www.quantum.com/snfs"
version="1.0">
<config configVersion="0" name="example"
fsBlockSize="4096" journalSize="67108864">
<globals>
<affinityPreference>false</affinityPreference>
<allocationStrategy>round</allocationStrategy>
<haFsType>HaUnmonitored</haFsType>
<bufferCacheSize>268435456</bufferCacheSize>
<cvRootDir>/</cvRootDir>
<storageManager>false</storageManager>
<debug>00000000</debug>
<dirWarp>true</dirWarp>
<extentCountThreshold>49152</extentCountThreshold>
<enableSpotlight>false</enableSpotlight>
<enforceAcls>false</enforceAcls>
<fileLocks>false</fileLocks>
<fileLockResyncTimeOut>20</fileLockResyncTimeOut>
<forcePerfectFit>false</forcePerfectFit>
<fsCapacityThreshold>0</fsCapacityThreshold>
<globalSuperUser>true</globalSuperUser>
<inodeCacheSize>131072</inodeCacheSize>
<inodeExpandMin>0</inodeExpandMin>
<inodeExpandInc>0</inodeExpandInc>
<inodeExpandMax>0</inodeExpandMax>
<inodeDeleteMax>0</inodeDeleteMax>
<inodeStripeWidth>0</inodeStripeWidth>
<maintenanceMode>false</maintenanceMode>
<maxLogs>4</maxLogs>
<namedStreams>false</namedStreams>
<remoteNotification>false</remoteNotification>
<renameTracking>false</renameTracking>
<reservedSpace>true</reservedSpace>
<fsmRealTime>false</fsmRealTime>
<fsmMemLocked>false</fsmMemLocked>
<opHangLimitSecs>180</opHangLimitSecs>
<perfectFitSize>131072</perfectFitSize>
<quotas>false</quotas>
<quotaHistoryDays>7</quotaHistoryDays>
<restoreJournal>false</restoreJournal>
<restoreJournalDir></restoreJournalDir>
<restoreJournalMaxHours>0</restoreJournalMaxHours>
<restoreJournalMaxMb>0</restoreJournalMaxMb>
<stripeAlignSize>-1</stripeAlignSize>
<trimOnClose>0</trimOnClose>
<useL2BufferCache>true</useL2BufferCache>
<unixDirectoryCreationModeOnWindows>755</unixDirectoryCreationModeOnWindows>
<unixIdFabricationOnWindows>false</unixIdFabricationOnWindows>
<unixFileCreationModeOnWindows>644</unixFileCreationModeOnWindows>
<unixNobodyUidOnWindows>60001</unixNobodyUidOnWindows>
<unixNobodyGidOnWindows>60001</unixNobodyGidOnWindows>
<windowsSecurity>true</windowsSecurity>
<globalShareMode>false</globalShareMode>
<useActiveDirectorySFU>true</useActiveDirectorySFU>
<eventFiles>true</eventFiles>
<eventFileDir></eventFileDir>
<allocSessionReservationSize>0</allocSessionReservationSize>
</globals>
<diskTypes>
<diskType typeName="MetaDrive" sectors="99999999"
sectorSize="512"/>
<diskType typeName="JournalDrive"
sectors="99999999" sectorSize="512"/>
<diskType typeName="VideoDrive" sectors="99999999"
sectorSize="512"/>
<diskType typeName="AudioDrive" sectors="99999999"
sectorSize="512"/>
<diskType typeName="DataDrive" sectors="99999999"
sectorSize="512"/>
</diskTypes>
<autoAffinities/>
<stripeGroups>
<stripeGroup index="0" name="MetaFiles" status="up"
stripeBreadth="262144" read="true" write="true"
metadata="true" journal="false" userdata="false"
realTimeIOs="0" realTimeIOsReserve="0" realTimeMB="0"
realTimeMBReserve="0" realTimeTokenTimeout="0"
multipathMethod="rotate">
<disk index="0" diskLabel="CvfsDisk0"
diskType="MetaDrive" ordinal="0"/>
</stripeGroup>
<stripeGroup index="1" name="JournFiles"
status="up" stripeBreadth="262144" read="true"
write="true" metadata="false" journal="true"
</configDoc>
Windows Example
Configuration File
Below are the contents of the StorNext example configuration file for
Windows (example.cfg):
# Globals
AffinityPreference no
AllocationStrategy Round
HaFsType HaUnmonitored
FileLocks No
BrlResyncTimeout 20
BufferCacheSize 256M
CvRootDir /
DataMigration No
Debug 0x0
DirWarp Yes
ExtentCountThreshold 48K
EnableSpotlight No
ForcePerfectFit No
FsBlockSize 4K
GlobalSuperUser Yes
InodeCacheSize 128K
InodeExpandMin 0
InodeExpandInc 0
InodeExpandMax 0
InodeDeleteMax 0
InodeStripeWidth 0
JournalSize 64M
MaintenanceMode No
MaxLogs 4
NamedStreams No
PerfectFitSize 128K
RemoteNotification No
RenameTracking No
ReservedSpace Yes
FSMRealtime No
FSMMemlock No
OpHangLimitSecs 180
Quotas No
QuotaHistoryDays 7
RestoreJournal No
RestoreJournalMaxHours 0
RestoreJournalMaxMB 0
StripeAlignSize -1
TrimOnClose 0
UseL2BufferCache Yes
UnixDirectoryCreationModeOnWindows 0755
UnixIdFabricationOnWindows No
UnixFileCreationModeOnWindows 0644
UnixNobodyUidOnWindows 60001
UnixNobodyGidOnWindows 60001
WindowsSecurity Yes
GlobalShareMode No
UseActiveDirectorySFU Yes
EventFiles Yes
AllocSessionReservationSize 0m
# Disk Types
[DiskType MetaDrive]
Sectors 99999999
SectorSize 512
[DiskType JournalDrive]
Sectors 99999999
SectorSize 512
[DiskType VideoDrive]
Sectors 99999999
SectorSize 512
[DiskType AudioDrive]
Sectors 99999999
SectorSize 512
[DiskType DataDrive]
Sectors 99999999
SectorSize 512
# Disks
[Disk CvfsDisk0]
Type MetaDrive
Status UP
[Disk CvfsDisk1]
Type JournalDrive
Status UP
[Disk CvfsDisk2]
Type VideoDrive
Status UP
[Disk CvfsDisk3]
Type VideoDrive
Status UP
[Disk CvfsDisk4]
Type VideoDrive
Status UP
[Disk CvfsDisk5]
Type VideoDrive
Status UP
[Disk CvfsDisk6]
Type VideoDrive
Status UP
[Disk CvfsDisk7]
Type VideoDrive
Status UP
[Disk CvfsDisk8]
Type VideoDrive
Status UP
[Disk CvfsDisk9]
Type VideoDrive
Status UP
[Disk CvfsDisk10]
Type AudioDrive
Status UP
[Disk CvfsDisk11]
Type AudioDrive
Status UP
[Disk CvfsDisk12]
Type AudioDrive
Status UP
[Disk CvfsDisk13]
Type AudioDrive
Status UP
[Disk CvfsDisk14]
Type DataDrive
Status UP
[Disk CvfsDisk15]
Type DataDrive
Status UP
[Disk CvfsDisk16]
Type DataDrive
Status UP
[Disk CvfsDisk17]
Type DataDrive
Status UP
# Stripe Groups
[StripeGroup MetaFiles]
Status Up
StripeBreadth 256K
Metadata Yes
Journal No
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk0 0
[StripeGroup JournFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal Yes
Exclusive Yes
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk1 0
[StripeGroup VideoFiles]
Status Up
StripeBreadth 4M
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk2 0
Node CvfsDisk3 1
Node CvfsDisk4 2
Node CvfsDisk5 3
Node CvfsDisk6 4
Node CvfsDisk7 5
Node CvfsDisk8 6
Node CvfsDisk9 7
Affinity Video
[StripeGroup AudioFiles]
Status Up
StripeBreadth 1M
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk10 0
Node CvfsDisk11 1
Node CvfsDisk12 2
Node CvfsDisk13 3
Affinity Audio
[StripeGroup RegularFiles]
Status Up
StripeBreadth 256K
Metadata No
Journal No
Exclusive No
Read Enabled
Write Enabled
Rtmb 0
Rtios 0
RtmbReserve 0
RtiosReserve 0
RtTokenTimeout 0
MultiPathMethod Rotate
Node CvfsDisk14 0
Node CvfsDisk15 1
Node CvfsDisk16 2
Node CvfsDisk17 3
MySQL innodb_buffer_pool_size
The InnoDB buffer pool is used to cache the data and indexes of MySQL
InnoDB database tables.
Increasing the size of the InnoDB buffer pool will allow MySQL to keep
more of the working database set memory resident, thereby reducing
the amount of disk access required to read datasets from the file
system. The InnoDB buffer pool size is determined by the parameter
innodb_buffer_pool_size in the /usr/adic/mysql/my.cnf file.
Increasing this value can improve the performance of Storage Manager
operations that require large queries. However, setting this value too
high can inefficiently remove memory from the free pool that could
otherwise be used by the StorNext file system or other applications, and
could lead to memory starvation issues on the system.
To change this value, modify the /usr/adic/mysql/my.cnf file and
change the innodb_buffer_pool_size setting in the [mysqld] group.
Both Storage Manager and MySQL will need to be restarted for the
change to /usr/adic/mysql/my.cnf to take effect. Do so by executing
the following commands:
# adic_control stop
# adic_control start
[mysqld]
innodb_buffer_pool_size = 8G
Chapter 2
Allocation Session
Reservation (ASR)
Allocation Sessions
For example, for 128 MB the small chunk size is 4 MB, and for 1 GB the
small chunk size is 32 MB. Small sessions do not round the chunk size. A
file can get an allocation from a small session only if the allocation
request (offset + size) is less than 1MB. When users do small I/O sizes
into a file, the client buffer cache coalesces these and minimizes
allocation requests. If a file is larger than 1MB and is being written
through the buffer cache, it will most likely have allocation on the order
of 16MB or so requests (depending on the size of the buffer cache on
the client and the number of concurrent users of that buffer cache).
With NFS I/O into a StorNext client, the StorNext buffer cache is used.
NFS on some operating systems breaks I/O into multiple streams per file.
These will arrive on the StorNext client as disjointed random writes.
These are typically allocated from the same session with ASR and are not
impacted if multiple streams (other files) allocate from the same stripe
group. ASR can help reduce fragmentation due to these separate NFS
generated streams.
Files can start using one session type and then move to another session
type. A file can start with a very small allocation (small session), become
larger (medium session), and end up reserving the session for the file. If
a file has more than 10% of a medium sized chunk, it reserves the
remainder of the session chunk it was using for itself. After a session is
reserved for a file, a new session segment will be allocated for any other
medium files in that directory.
Small chunks are never reserved.
When allocating subsequent pieces for a session, they are rotated
around to other stripe groups that can hold user data unless
InodeStripeWidth (ISW) is set to 0.
Note: In StorNext 5, rotation is not done if InodeStripeWidth is set
to 0.
When InodeStripeWidth is set, chunks are rotated in a similar fashion
to InodeStripeWidth. The direction of rotation is determined by a
combination of the session key and the index of the client in the client
table. The session key is based on the inode number so odd inodes will
rotate in a different direction from even inodes. Directory session keys
are based on the inode number of the parent directory. For additional
information about InodeStripeWidth, refer to the snfs_config(5)
man page.
Video applications typically write one frame per file and place them in
their own unique directory, and then write them from the same
StorNext client. The file sizes are all greater than 1MB and smaller than
50 MB each and written/allocated in one I/O operation. Each file and
write land in medium/directory sessions.
For this kind of workflow, ASR is the ideal method to keep streams (a
related collection of frames in one directory) together on disk, thereby
preventing checker boarding between multiple concurrent streams. In
addition, when a stream is removed, the space can be returned to the
free space pool in big ASR pieces, reducing free space fragmentation
when compared to the default allocator.
Suppose a file system has four data stripe groups and an ASR size of 1
GB. If four concurrent applications writing medium-sized files in four
separate directories are started, they will each start with their own 1 GB
piece and most likely be on different stripe groups.
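For reference, an ASR size of 1 GB corresponds to a setting like the following in the file system configuration file, assuming the XML value is expressed in bytes as the other size parameters are (a sketch, not a recommendation):
<allocSessionReservationSize>1073741824</allocSessionReservationSize>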
Without ASR
Without ASR, the files from the four separate applications are
intermingled on disk with the files from the other applications. The
default allocator does not consider the directory or application in any
way when carving out space. All allocation requests are treated equally.
With ASR turned off and all the applications running together, any
hotspot is very short lived: the size of one allocation/file. (See the
following section for more information about hotspots.)
With ASR
Now consider the four 1 GB chunks for the four separate directories. As the
chunks are used up, ASR allocates chunks on a new stripe group using rotation.
Given this rotation and the timings of each application, there are times
when multiple writers/segments will be on a particular stripe group
together. This is considered a hotspot, and if the application expects
more throughput than the stripe group can provide, performance will
be sub par.
At read time, the checker boarding on disk from the writes (when ASR is
off) can cause disk head movement, and then later the removal of one
application run can also cause free space fragmentation. Since ASR
collects the files together for one application, the read performance of
one application's data can be significantly better since there will be little
to no disk head movement.
Small files (those less than 1 MB) are placed together in small file chunks
and grouped by StorNext client ID. This was done to help use the
leftover pieces from the ASR size chunks and to keep the small files
away from medium files. This reduces free space fragmentation over
time that would be caused by the leftover pieces. Leftover pieces occur
in some rare cases, such as when there are many concurrent sessions
exceeding 500 sessions.
# group  frbase       fsbase     fsend      kbytes   depth
1  4  0x40000000   0xdd488a   0xde4889   1048576  1
2  1  0x80000000   0x10f4422  0x1104421  1048576  1
3  2  0xc0000000   0x20000    0x2ffff    1048576  1
4  3  0x100000000  0xd34028   0xd44027   1048576  1
5  4  0x140000000  0xd9488a   0xda4889   1048576  1
6  1  0x180000000  0x10c4422  0x10d4421  1048576  1
7  2  0x1c0000000  0x30000    0x3ffff    1048576  1
8  3  0x200000000  0x102c028  0x103c027  1048576  1
9  4  0x240000000  0xd6c88a   0xd7c889   1048576  1
Here are the extent layouts of two processes writing concurrently but in
their own directory:
root@per2:() -> lmdd of=1d/10g bs=2m move=10g & lmdd of=2d/10g bs=2m move=10g &
[1] 27866
[2] 27867
root@per2:() -> wait
10240.00 MB in 31.30 secs, 327.14 MB/sec
[1]- Done
lmdd of=1d/10g bs=2m move=10g
10240.00 MB in 31.34 secs, 326.74 MB/sec
[2]+ Done
lmdd of=2d/10g bs=2m move=10g
root@per2:() ->
root@per2:() -> snfsdefrag -e 1d/* 2d/*
1d/10g:
# group  frbase       fsbase     fsend      kbytes   depth
0  1  0x0          0xf3c422   0xf4c421   1048576  1
1  4  0x40000000   0xd2c88a   0xd3c889   1048576  1
2  3  0x80000000   0xfcc028   0xfdc027   1048576  1
3  2  0xc0000000   0x50000    0x5ffff    1048576  1
4  1  0x100000000  0x7a0472   0x7b0471   1048576  1
5  4  0x140000000  0xc6488a   0xc74889   1048576  1
6  3  0x180000000  0xcd4028   0xce4027   1048576  1
7  2  0x1c0000000  0x70000    0x7ffff    1048576  1
8  1  0x200000000  0x75ef02   0x76ef01   1048576  1
9  4  0x240000000  0xb9488a   0xba4889   1048576  1
2d/10g:
# group  frbase       fsbase     fsend      kbytes   depth
0  2  0x0          0x40000    0x4ffff    1048576  1
1  3  0x40000000   0xffc028   0x100c027  1048576  1
2  4  0x80000000   0xca488a   0xcb4889   1048576  1
3  1  0xc0000000   0xedc422   0xeec421   1048576  1
4  2  0x100000000  0x60000    0x6ffff    1048576  1
5  3  0x140000000  0xea4028   0xeb4027   1048576  1
6  4  0x180000000  0xc2c88a   0xc3c889   1048576  1
7  1  0x1c0000000  0x77f9ba   0x78f9b9   1048576  1
8  2  0x200000000  0x80000    0x8ffff    1048576  1
9  3  0x240000000  0xbe4028   0xbf4027   1048576  1
# group  frbase       fsbase     fsend      kbytes   depth
3   1  0x600000     0x18d440   0x18d4bf   2048     1
4   1  0x800000     0x18d540   0x18d5bf   2048     1
5   1  0xa00000     0x18d740   0x18d7bf   2048     1
6   1  0xc00000     0x18d840   0x18d8bf   2048     1
7   1  0xe00000     0x18d9c0   0x18dbbf   8192     1
8   1  0x1600000    0x18dcc0   0x18dfbf   12288    1
9   1  0x2200000    0x18e4c0   0x18e8bf   16384    1
10  1  0x3200000    0x18e9c0   0x18eabf   4096     1
11  1  0x3600000    0x18ebc0   0x18ecbf   4096     1
12  1  0x3a00000    0x18f3c0   0x18f9bf   24576    1
13  1  0x5200000    0x18fdc0   0x1901bf   16384    1
14  4  0x6200000    0x1530772  0x1540771  1048576  1
15  3  0x46200000   0x1354028  0x1364027  1048576  1
16  1  0x86200000   0x12e726   0x13e725   1048576  1
17  4  0xc6200000   0x14ed9b2  0x14fd9b1  1048576  1
18  3  0x106200000  0x1304028  0x13127a7  948224   1
Without ASR and with concurrent writers of big files, each file typically
starts on its own stripe group. The checkerboarding doesn't occur until
there are more writers than the number of data stripe groups. However,
once the checkerboarding starts, it will exist all the way through the
file. For example, if we have two data stripe groups and four writers, all
four files would checkerboard until the number of writers is reduced
back to two or fewer.
Appendix A: Stripe Group Affinity
This appendix describes the behavior of the stripe group affinity feature
in the StorNext file system, and it discusses some common use cases.
Note: This section does not discuss file systems managed by StorNext
Storage Manager. There are additional restrictions on using
affinities for these managed file systems.
Definitions
Following are definitions for terms used in this appendix:
Stripe Group
A stripe group is a collection of LUNs (typically disks or arrays) across
which file data is striped. Each stripe group has a number of associated
attributes, including affinity and exclusivity.
Affinity
An affinity is used to steer the allocation of a file's data onto a set of
stripe groups. If a file has an affinity, its data is allocated only on stripe
groups that have been assigned that affinity (unless the affinity
preference behavior described later in this appendix is enabled).
Exclusivity
A stripe group which has both an affinity and the exclusive attribute can
have its space allocated only by files with that affinity. Files without a
matching affinity cannot allocate space from an exclusive stripe group.
Setting Affinities
Affinities for stripe groups are defined in the file system configuration
file. They can be created through the StorNext GUI or by adding one or
more Affinity lines to a StripeGroup section in the configuration
file. A stripe group may have multiple affinities, and an affinity may be
assigned to multiple stripe groups.
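For example, a StripeGroup section in the older configuration file
format might carry an affinity like this (a minimal sketch; the stripe
group name, disk label, and stripe breadth shown are hypothetical):
[StripeGroup VideoFiles]
Status Up
StripeBreadth 64
Affinity Video
Exclusive Yes
Read Enabled
Write Enabled
Node CvfsDisk2 0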
Affinities for files are defined in the following ways:
Using the cvmkfile command with the -k option
Using the snfsdefrag command with the -k option
Using the cvaffinity command with the -s option
Through inheritance from the directory in which they are created
Through the CvApi_SetAffinity() function, which sets affinities
programmatically
Using the cvmkdir command with the -k option to create a
directory with an affinity
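As an illustration of the command-line methods above, the following
sketch assumes an affinity named VIDEO exists in the file system
configuration; the paths and file names are hypothetical:
# Create a directory whose new files inherit the VIDEO affinity
cvmkdir -k VIDEO /stornext/snfs1/video
# Pre-allocate a 10 GB file with the VIDEO affinity
cvmkfile -k VIDEO 10g /stornext/snfs1/video/capture.dpx
# Set the VIDEO affinity on an existing file
cvaffinity -s VIDEO /stornext/snfs1/clip.mov
# Defragment an existing file, allocating its new extents with the VIDEO affinity
snfsdefrag -k VIDEO /stornext/snfs1/old_clip.mov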
Auto Affinities
Auto Affinities map filename extensions to affinities, so that a file is
automatically assigned an affinity at creation time based on its
extension. See the StorNext Online Help for more details on how to
configure Auto Affinities using the GUI.
The following is an example AutoAffinities section from an XML
configuration file:
<autoAffinities>
<autoAffinity affinity="Video">
<extension>dpx</extension>
<extension>mov</extension>
</autoAffinity>
<autoAffinity affinity="Audio">
<extension>mp3</extension>
<extension>wav</extension>
</autoAffinity>
<autoAffinity affinity="Image">
<extension>jpg</extension>
<extension>jpeg</extension>
</autoAffinity>
<autoAffinity affinity="Other">
<extension>html</extension>
<extension>txt</extension>
</autoAffinity>
<noAffinity>
<extension>c</extension>
<extension>sh</extension>
<extension>o</extension>
</noAffinity>
</autoAffinities>
The affinities used must also exist in the StripeGroup sections of the
same configuration file. For example, the above configuration uses the
affinity Image. This affinity must be present in at least one StripeGroup.
If a filename does not match any lines in the AutoAffinities section, its
affinity is 0 or the inherited affinity from its parent. An entry with an
empty extension element can be used as a catch-all, so that files which
do not match any other extension are still mapped. For example:
</autoAffinity>
<autoAffinity affinity="Other">
<extension>html</extension>
<extension>txt</extension>
<extension></extension>
</autoAffinity>
Or:
<noAffinity>
<extension>c</extension>
<extension>sh</extension>
<extension>o</extension>
<extension></extension>
</noAffinity>
The last case is useful for overriding inheritable affinities on files that do
not match any extension, so that they get 0 as their affinity. On a
system with old directory trees that have inheritable affinities, this
technique can be used to map all files and thereby override the
inherited affinities.
Affinity Preference
Previously, if the stripe groups matching a file's affinity were full or
overly fragmented, a file allocation might fail with ENOSPC. This can be
thought of as affinity enforcement. The new behavior is that an
allocation with an affinity would behave the same as a file without any
affinity when there is no matching StripeGroup. It can allocate space on
any StripeGroup with exclusive=false, but only after an attempt to
allocate space with the affinity fails.
The new parameter is:
<affinityPreference>true</affinityPreference>
The default behavior is false (enforcement instead). If set to true, the
new behavior is to allow allocation on StripeGroups that do not match
the affinity but only when there is no space available on the "preferred"
StripeGroups.
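In the XML configuration file this parameter sits with the other global
settings; a minimal sketch (other globals omitted, and the placement
inside the globals element is an assumption to verify against
snfs_config(5)):
<globals>
    <affinityPreference>true</affinityPreference>
</globals>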
Allocation Session Reservation
<affinityPreference>true</affinityPreference>
All the files under the directory prefer StripeGroup number 1. This
includes *.dpx and *.WAV files and any others. Suppose, for example,
that you want to separate the *.dpx files and *.WAV files. Create the
mapping in the configuration file shown above:
StripeGroup number 1: contains the Video affinity.
StripeGroup number 2: modified to contain the Audio affinity, with
exclusive=false.
Consider if you create a directory with no affinity and then write *.dpx
and audio files, *.WAV, in that directory.
The *.dpx files correctly land on StripeGroup number 1 and the *.WAV
files go to StripeGroup number 2. Any other files also go to either
StripeGroup number 2 or StripeGroup number 1. Now, consider that
you also want the audio on StripeGroup number 1, but in a separate
location from the *.dpx files. Simply add Audio to the affinities on StripeGroup
number 1 and remove it from StripeGroup number 2. With this
addition, *.WAV files will land on StripeGroup number 1, but they will
be in separate sessions from the *.dpx files. They will all prefer
StripeGroup number 2 until it is full.
With Allocation Session Reservation enabled and the above automatic
affinities, *.WAV files are grouped separately from *.dpx files since they
have different affinities. If you desire just file grouping and no affinity
steering to different stripe groups, simply add all the affinities to each
StripeGroup that allows data and set exclusive=false on each data
StripeGroup. The non-exclusivity is needed so that files without affinities
can also use the data stripe groups.
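A minimal sketch of such a grouping-only layout in the older
configuration file format (the stripe group name and disk label are
hypothetical; both affinities are carried and the group is not exclusive):
[StripeGroup Data1]
Status Up
Affinity Video
Affinity Audio
Exclusive No
Read Enabled
Write Enabled
Node CvfsDisk2 0
(Repeat the Affinity lines and Exclusive No for each additional data
stripe group.)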
Finally, Quantum recommends setting AffinityPreference=true.
In the older configuration file format, the equivalent settings are as
follows. First, the affinity preference global:
AffinityPreference yes
Next are the Auto Affinity and No Affinity mappings shown above in
XML, expressed in the older format:
# Auto Affinities
[AutoAffinity Video]
Extension mov
Extension dpx
[AutoAffinity Audio]
Extension wav
Extension mp3
[AutoAffinity Image]
Extension jpeg
Extension jpg
[AutoAffinity Other]
Extension txt
Extension html
# No Affinities
[NoAffinity]
Extension o
Extension sh
Extension c
Empty Extension example:
[AutoAffinity Other]
Extension txt
Extension html
Extension
Or:
[NoAffinity]
Extension o
Extension sh
Extension c
Extension
Allocation Strategy
StorNext has multiple allocation strategies which can be set at the file
system level. These strategies control where a new file's first blocks will
be allocated. Affinities modify this behavior in two ways:
A file with an affinity is usually allocated on a stripe group with a
matching affinity, unless the affinity is a preferred affinity.
A stripe group with an affinity and the exclusive attribute is used
only for allocations by files with a matching affinity.
Once a file has been created, StorNext attempts to keep all of its data on
the same stripe group. If there is no more space on that stripe group,
data may be allocated from another stripe group. The exception to this
is when InodeStripeWidth is set to a non-zero value. For additional
information about InodeStripeWidth, refer to the snfs_config(5)
man page.
If the file has an affinity, only stripe groups with that affinity are
considered. If all stripe groups with that affinity are full, new space may
not be allocated for the file, even if other stripe groups are available.
The AffinityPreference parameter can be used to allow file
allocations for an affinity that would otherwise result in ENOSPC to be
satisfied from other stripe groups (using an affinity of 0). See the
snfs_config(5) man page for details.
When a file system with two affinities is to be managed by the Storage
Manager, the GUI forces those affinities to be named tier1 and tier2.
This will cause an issue if a site has an existing unmanaged file system
with two affinities with different names and wants to change that file
system to be managed. There is a process for converting a file system so
it can be managed but it is non-trivial and time consuming. Please
contact Quantum Support if this is desired.
Note: The restriction is in the StorNext GUI because of a current
system limitation where affinity names must match between
one managed file system and another. If a site was upgraded
from a pre-4.0 version to post-4.0, the affinity names get
passed along during the upgrade. For example, if prior to
StorNext 4.0 the affinity names were aff1 and aff2, the GUI
would restrict any new file systems to have those affinity
names as opposed to tier1 and tier2.
Note: StorNext File Systems prior to version 4.3, which are configured
to utilize affinities on the HaShared file system, will need to
reapply affinities to directories in the HaShared file system after
the upgrade to version 4.3 completes.
In some instances customers have seen improved performance of the
HaShared file system by separating I/O to the database and metadump
directories through the use of multiple stripe groups and SNFS stripe
group affinities.
Key Metadata File Locations
/usr/adic/HAM/shared/database/metadumps
/usr/adic/HAM/shared/TSM/internal/mapping_dir
Key Database File Locations
/usr/adic/HAM/shared/mysql/db
/usr/adic/HAM/shared/mysql/journal
/usr/adic/HAM/shared/mysql/tmp
For configurations utilizing two data stripe groups in the HaShared file
system, key database files should be assigned to one affinity and key
metadata files should be assigned the other affinity. If more than two
stripe groups are configured in the HaShared file system, the individual
MySQL directories can be broken out into their own stripe groups with
the appropriate affinity set.
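As an illustrative sketch only: assuming two affinities named META and
DB exist in the HaShared file system configuration, and assuming
cvaffinity -s may be applied to these existing directories to set an
inheritable affinity, the assignment might look like this:
# Key metadata file locations (assumed affinity name META)
cvaffinity -s META /usr/adic/HAM/shared/database/metadumps
cvaffinity -s META /usr/adic/HAM/shared/TSM/internal/mapping_dir
# Key database file locations (assumed affinity name DB)
cvaffinity -s DB /usr/adic/HAM/shared/mysql/db
cvaffinity -s DB /usr/adic/HAM/shared/mysql/journal
cvaffinity -s DB /usr/adic/HAM/shared/mysql/tmp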
WARNING: Ensure that each stripe group is provisioned appropriately
to handle the desired file type. See the snPreInstall
script for sizing calculations. Failure to provision stripe
groups appropriately could result in unexpected no-space
errors.
1 Configure the HaShared file system to use multiple data stripe groups.
Segregating Audio and Video Files Onto Their Own Stripe Groups
One common use case is to segregate audio and video files onto their
own stripe groups. Here are the steps involved in this scenario:
Create one or more stripe groups with an AUDIO affinity and the
exclusive attribute.
Create one or more stripe groups with a VIDEO affinity and the
exclusive attribute.
Create one or more stripe groups with no affinity (for non-audio,
non-video files).
Create a directory for audio using cvmkdir -k AUDIO audio.
Create a directory for video using cvmkdir -k VIDEO video.
Files created within the audio directory will reside only on the AUDIO
stripe group. (If this stripe group fills, no more audio files can be
created.)
Files created within the video directory will reside only on the VIDEO
stripe group. (If this stripe group fills, no more video files can be
created.)
Reserving High-Speed Disk For Critical Files
In this use case, high-speed disk usage is reserved for and limited to only
critical files. Here are the steps for this scenario:
Create a stripe group with a FAST affinity and the exclusive
attribute.
Label the critical files or directories with the FAST affinity.
The disadvantage here is that the critical files are restricted to using only
the fast disk. If the fast disk fills up, the files will not have space
allocated on slow disks.
To work around this limitation, you could reserve high-speed disk for
critical files but also allow them to grow onto slow disks. Here are the
steps for this scenario:
Create a stripe group with a FAST affinity and the exclusive
attribute.
Create all of the critical files, preallocating at least one block of
space, with the FAST affinity. (Or move them using snfsdefrag
after ensuring the files are not empty.)
Remove the FAST affinity from the critical files.
Alternatively, configure the AffinityPreference parameter. For
additional information see Affinity Preference.
Because files allocate from their existing stripe group even if they no
longer have a matching affinity, the critical files will continue to grow
on the FAST stripe group. Once this stripe group is full, they can allocate
space from other stripe groups since they do not have an affinity.
This scenario will not work if new critical files can be created later,
unless there is a process to move them to the FAST stripe group, or an
affinity is set on the critical files by inheritance but removed after their
first allocation (to allow them to grow onto non-FAST groups).
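An illustrative sketch of this workaround (the path is hypothetical, and
the use of cvaffinity -d to remove an affinity is an assumption to verify
against the cvaffinity man page):
# Pre-allocate the critical file with at least one block on the FAST stripe group
cvmkfile -k FAST 1m /stornext/snfs1/critical/journal.dat
# Remove the affinity so future growth can spill onto non-FAST stripe groups
cvaffinity -d /stornext/snfs1/critical/journal.dat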
Appendix B: Best Practice Recommendations
Replication Copies
The replication target can keep one or more copies of data. Each copy is
presented as a complete directory tree for the policy. The number of
copies and the placement of these directories are ultimately controlled
by the replication target. However, if the target does not enforce its
own policy, the source system may request how many copies are kept
and how the directories are named.
When multiple copies are kept, the older copies and current copy share
files where there are no changes. This shows up as extra hard links to
the files. If a file is changed on the target, it affects all copies sharing the
file. If a file is changed on the replication source, older copies on the
target are not affected.
The best way to list the replication copies that exist on a file system is
to run the snpolicy -listrepcopies command. The rmrepcopy,
mvrepcopy, and exportrepcopy options should be used to manage
the copies.
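For example, assuming the target file system is mounted at
/stornext/target and that -listrepcopies takes the mount point in the
usual option=value form used by other snpolicy options:
snpolicy -listrepcopies=/stornext/target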
Replication and Deduplication
StorNext Gateway Server Performance
Replication with Multiple Physical Network Interfaces
Deduplication will not be beneficial on small files, nor will it provide any
benefit on files whose content data is already compressed (such as
MPEG-format video). In general, deduplication benefits are maximized
for files that are 64 MB and larger; deduplication performed on files
below 64 MB may yield sub-optimal results.
You can exclude specific files from deduplication by using the
dedup_skip policy parameter. This parameter accepts filename
patterns using the same wildcard expansion as a UNIX shell.
You can also skip files according to size by using the dedup_min_size
parameter.
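For example, a policy might carry parameter settings along these lines
(the values are illustrative, and the assumption that dedup_min_size is
expressed in bytes should be verified against the snpolicyd man page):
dedup_skip=*.mpg
dedup_min_size=67108864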
Deduplication and Backups
Deduplication and System Resources
Deduplication Parallel Streams
Deduplication and Truncation
config/snpolicyd.conf.
In the case where both deduplication and tape copies of data are being
made, TSM is the service which performs truncation.