DataStage Best Practices
Introduction..................................................................................................................... 6
Objective.............................................................................................................. 6
Background.......................................................................................................... 6
Best practices................................................................................................................. 9
Who has locked the job?......................................................................................9
How to use DS.TOOLS to release the lock..................................................10
Type conversion................................................................................................. 10
Decimal output................................................................................................... 11
How to copy or move data set file......................................................................11
NLS related options.....................................................................................12
COMMAND: copy | cp source-descriptor-file target-descriptor-file..............12
COMMAND: delete | del | rm [ -options... ] descriptor-files........................13
COMMAND: truncate [ -options... ] descriptor-files....................................13
COMMAND: dump [ -options... ] descriptor-files........................................14
COMMAND: describe | lp | lf | ls | ll [ -options... ] descriptor-files...............15
COMMAND: diskinfo [ -a | -np nodepool | -n node... ] diskpool...................16
COMMAND: check......................................................................................16
Job design..................................................................................................................... 17
Modular development........................................................................................17
Default and explicit type conversions.................................................................18
Source type to target type conversions.......................................................18
Sequential file stages (Import and Export).........................................................18
Improving sequential file performance.........................................................19
Partitioning sequential file reads..................................................................19
Sequential file (Export) buffering.................................................................19
Reading from and writing to fixed-length files..............................................19
Reading bounded-length VARCHAR columns.............................................20
Transformer usage guidelines............................................................................20
Choosing appropriate stages.......................................................................20
Transformer NULL handling and reject link..................................................21
Transformer derivation evaluation...............................................................22
Conditionally aborting jobs...........................................................................22
Transformer decimal arithmetic...................................................................22
Optimizing Transformer expressions and stage variables...........................23
Modify stage....................................................................................................... 26
Join stage........................................................................................................... 26
June 2009 2
Reduce Re-partition.....................................................................................27
Sorting......................................................................................................... 27
Joining tables...............................................................................................27
Lookup stage vs. Join stage...............................................................................27
Capturing unmatched records from a Join.........................................................27
The Funnel stage............................................................................................... 28
The Aggregator stage........................................................................................28
Database Stages........................................................................................................... 30
Native Parallel vs. Plug-In Database stage usage.............................................30
Native Parallel Database stages..................................................................30
Appropriate use of SQL and DataStage stages.................................................31
Optimizing Select lists........................................................................................ 31
Designing for restart........................................................................................... 32
Database OPEN and CLOSE commands..........................................................32
Database Sparse Lookup vs. Join.....................................................................32
Performance management........................................................................................... 36
Bottleneck resolution................................................................................................... 50
Combinable operators........................................................................................ 50
Disk I/O.............................................................................................................. 50
Buffering............................................................................................................. 51
Sequential file stages.........................................................................................52
Hashed File stages............................................................................................ 53
Creating hashed files...................................................................................54
Using hashed file caching............................................................................56
Preloading hashed files to memory.............................................................57
Very large hashed files................................................................................57
DB/2 table stages............................................................................................... 57
Reading from DB/2......................................................................................57
Writing to DB/2............................................................................................58
Reference lookups to DB/2..........................................................................59
Transformer stages............................................................................................ 59
Routines and transforms....................................................................................61
Shared containers..............................................................................................61
Sequences......................................................................................................... 62
Introduction
Objective
The PROJECT warehouse has nearly 400 jobs, developed by various groups of
developers from various parts of the world. Even though the majority of the development
is done by XXXXi, other regions like XXXX also contribute towards enhancing the
PROJECT warehouse product. The collaborative approach of the PROJECT
engagement model encourages the involvement of experts across the globe, creating
the need for a reference handbook that can be used by these various groups of
developers.
Background
During development we observe that the developers are driven by certain factors:
Tight schedules and a demanding environment
ETLs limited to ensuring that output data matches the design specifications
Unexpected data volumes affecting performance, where performance was not a
design factor
Usage of appropriate stages
Usage of SQLs
Considering that ETL and data warehouse environments are bound to grow and scale
in the long term, we need to make the DataStage jobs run as efficiently as possible.
Understanding a job's UNIX environment
DataStage EE provides a number of environment variables to control how jobs
operate on a UNIX system. In addition to providing required information, environment
variables can be used to enable or disable various DataStage features, and to fine-tune
performance settings. Although UNIX environment variables can be set in multiple
places, there is a defined order of precedence that is evaluated when a job's actual
environment is established at runtime, as given below:
1. The daemon for managing client connections to the DataStage server engine is
called dsrpcd. By default (in a root installation), dsrpcd is started when the server is
installed, and should start whenever you restart your machine.
You can also start and stop dsrpcd manually, using the command
$DSHOME/bin/uv -admin.
By default, DataStage jobs inherit the dsrpcd UNIX environment, which is set in
the /etc/profile and $DSHOME/dsenv scripts.
Note
Client connections DO NOT pick up per-user environment
settings from their $HOME/.profile script.
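As a sketch, in a default root installation the engine daemon can be checked, stopped, and restarted manually with the uv -admin command (the exact path may vary by installation):
EXAMPLE:
$DSHOME/bin/uv -admin -info
$DSHOME/bin/uv -admin -stop
$DSHOME/bin/uv -admin -start
Note that stopping dsrpcd disconnects all DataStage clients, so this should only be done when no jobs are running.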
2. Environment variable settings for particular projects can be set in the DataStage
Administrator. Any project-level settings for a specific environment variable will
override any settings inherited from dsrpcd.
Caution
When you are migrating projects between machines or
environments, it is important to note that project-level
environment variable settings are not exported when a project
is exported. Any project-level environment variables must be
set for new projects.
3. Environment variables and their values can also be set for a particular job, in the
Job Properties dialog of the DataStage Designer. Any job-level settings for a specific
environment variable will override project-level settings.
4. $ENV causes the value of the named environment variable to be retrieved from the
operating system of the job environment. Typically this is used to pickup values set
in the operating system outside of DataStage.
Note
$ENV should not be used for specifying the default
$APT_CONFIG_FILE value because, during job development,
the DataStage Designer parses the corresponding parallel
configuration file to obtain a list of node maps and constraints
(advanced stage properties).
5. $PROJDEF causes the project default value for the environment variable (as shown
on the DataStage Administrator client) to be picked up and used to set the
environment variable and job parameter for the job.
Best practices
How to use DS.TOOLS to release the lock
SAMPLE:
Type conversion
All columns in an external file have the VARCHAR data type, and you need to convert
them to the appropriate types for your requirements: for example, VARCHAR to Decimal
for input, or Decimal to VARCHAR for output.
DataStage EE provides a lot of in-line type conversion functions such as the following:
DecimalToString
StringToDate
StringToDecimal
You can use the IsValid function to check whether a value is valid for a given data type.
For example:
IsValid("date", "2008-08-08") can be used to check if the string "2008-08-08" is a valid date.
To convert a Decimal to a Date, you need to use DecimalToString to convert the Decimal
to a String first, then use the IsValid function to check that it is a valid date, and finally
use StringToDate to convert it to a Date.
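As a sketch, the Decimal-to-Date chain above could be written as a single Transformer derivation (the column name dec_col is hypothetical; this assumes the string form of the decimal matches the expected date format, otherwise supply a format string to StringToDate):
EXAMPLE:
If IsValid("date", DecimalToString(DSLink1.dec_col))
Then StringToDate(DecimalToString(DSLink1.dec_col))
Else SetNull()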
Decimal output
When you use Aggregator stage, output type of calculation column is normally double
(unless Default to Decimal Output has been set); setting this property overrides the default
setting and makes this a decimal. The value of this property denotes the precision and
scale of that decimal and should be specified as shown below:
Decimal Output
How to copy or move data set file
NAME
orchadmin - delete, copy, describe and dump ORCHESTRATE files
SYNOPSIS
orchadmin command [ -options... ] descriptor-files...
The following table provides options that can be used with the orchadmin command.
Options Description
orchadmin [-help] Prints help message for all commands
DESCRIPTION
Options Description
Command delete, copy, describe, dump or check
A command file may contain multiple commands separated by semicolons. A command
may be spread over multiple lines. C and C++ style comments and csh style quotation
marks are allowed.
The following table provides the NLS related options along with their description.
If the target file already exists, it is truncated first.
If the preserve-partitioning flag of the source file is set and the target file already
exists, it must have the same number of partitions as the source file.
The copy command has no options. A warning message is issued if the target does not
already exist. This is a bug, not a feature.
The following table provides options that can be used with the delete command.
Options Description
-f Force. Proceed even if some partitions of the dataset are on nodes
that are inaccessible from the current configuration file. This
leaves orphan data files on those nodes. They must be deleted by
some other means.
-x Use the system configuration file rather than the one stored in the
dataset.
EXAMPLE:
To delete all datasets in the current directory that end in .ds, you can use the following
command:
orchadmin rm *.ds
The following table provides options that can be used with the truncate command.
Options Description
-f Force. Proceed even if some partitions of the dataset are on nodes
that are inaccessible from the current configuration file. This
leaves orphan data files on those nodes. They must be truncated
by some other means.
-x Use the system configuration file rather than the one stored in the
dataset.
EXAMPLE:
To remove all data from small.ds, you can use the following command:
orchadmin truncate small.ds
The following table provides options that can be used with the dump command.
Options Description
-field name Dump the specified top-level field. The default is to dump all
fields. This option can occur multiple times. Each occurrence
adds to the list of fields.
-p period Dump every N'th record in a partition, starting with the first
record not skipped (see -skip). The period must be greater than 0.
The default is 1.
-x Use the system configuration file rather than the one stored in the
dataset.
If an option occurs multiple times, the last one takes effect. The -field option is an
exception, where each occurrence adds to the list of fields to be dumped.
EXAMPLE:
To dump all records of all partitions of a parallel file named small.ds, preceding each
value by its field name and a colon, you can use the following command:
orchadmin dump -name small.ds
To dump the value of the customer field of the first 99 records of partition 0 of
big.ds, you can use the following command:
orchadmin dump -part 0 -n 99 -field customer big.ds
The following table provides options that can be used with the describe command.
Options Description
-p List partitioning information (except for datafile information)
-x Use the system configuration file rather than the one stored in the
dataset
-l Means -p -f -s -e -v -c
EXAMPLE:
To list the partitioning information, data files and schema of file1 and file2, you can use
the following command:
orchadmin ll file1 file2
COMMAND: diskinfo [ -a | -np nodepool | -n node... ] diskpool
This command prints a report about the specified disk pool.
OPTIONS:
The following table provides options that can be used with the diskinfo command.
Options Description
-a Print information for all nodes
To describe disk pool pool1 in node pool bignodes, you can use the following command:
orchadmin diskinfo -np bignodes pool1
COMMAND: check
This command checks the configuration file for any problems. The check command has
no options.
Job design
The ability to process large volumes of data in a short period of time depends on all
aspects of the flow and environment being optimized for maximum throughput and
performance. Performance tuning and optimization is an iterative process that begins at
job design and unit tests, proceeds through integration and volume testing, and continues
throughout an application's production lifecycle.
You need to note the following points regarding job design:
When writing intermediate results that will only be shared between DataStage EE
jobs, always write to persistent DataSets.
DataSets achieve end-to-end parallelism across jobs by writing data in
partitioned form, retaining sort order, in EE-native format without the overhead
of format conversion or serial I/O.
DataSets should be used to create restart points in the event that a job (or sequence)
needs to be re-run.
Caution
Because datasets are platform and configuration-specific, you
should not use them for long-term backup and recovery of
source data.
Modular development
You should use the modular development techniques to maximize re-use of DataStage
jobs and components. Two of such techniques use the following:
Job Parameterization
Job parameterization allows a single job design to process similar logic instead
of creating multiple copies of the same job. In v7.x, the Multiple-Instance job
property allows multiple invocations of the same job to run simultaneously.
Parallel Shared Containers
Starting with DataStage V7, Parallel Shared Containers allow common logic to
be shared across multiple jobs.
For maximum component re-use, enable runtime column propagation at the
project level and for every stage within the parallel shared container. This allows
the container input and output links to contain only the columns relevant to the
container processing.
Using runtime column propagation, any additional columns can be passed
through the container at runtime without the need to separate and remerge.
Note
Parallel Shared Containers are inserted when a job is
compiled. If the shared container is changed, the Usage
Analysis and Multi-Job Compile tools can be used to
recompile jobs that use a shared container.
Note
PadString does not work with fixed-length (CHAR) string
types. You must first convert a Char string type to a
VARCHAR type before using PadString.
Some stages (for example, SequentialFile and DB2/UDB Enterprise targets) allow the
pad character to be specified in their stage or column definition properties.
Improving sequential file performance
If the source file is fixed width, you can use the Readers per Node option to read a single
input file in parallel at evenly-spaced offsets. Note that in this manner, input row order is
not maintained.
If the input sequential file cannot be read in parallel, performance can still be improved
by separating the file I/O from the column parsing operation. To accomplish this, define
a single large string column for the non-parallel Sequential File read, and then pass this
to a Column Import stage to parse the file in parallel. The formatting and column
properties of the Column Import stage must match those of the Sequential File stage.
On heavily-loaded file servers or some RAID/SAN array configurations, you can use the
environment variables $APT_IMPORT_BUFFER_SIZE and $APT_EXPORT_BUFFER_SIZE
to improve I/O performance. These settings specify the size of the read (import) and
write (export) buffer size in Kbytes, with a default of 128 (128K). Increasing this may
improve performance.
Finally, in some disk array configurations, setting the environment variable
$APT_CONSISTENT_BUFFERIO_SIZE to a value equal to the read/write size in bytes can
significantly improve performance of Sequential File operations.
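As an illustration, these variables might be set at the project level in the Administrator, or exported in dsenv (the values below are illustrative starting points, not recommendations):
EXAMPLE:
export APT_IMPORT_BUFFER_SIZE=256
export APT_EXPORT_BUFFER_SIZE=256
export APT_CONSISTENT_BUFFERIO_SIZE=1048576
The first two values are in Kbytes; APT_CONSISTENT_BUFFERIO_SIZE is in bytes and should match the array's read/write size.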
fixed-width of the input column. Double-click on the column number in the grid
dialog to set this column property.
If a field is nullable, you must define the NULL field value and length in the Nullable
section of the column property. Double-click on the column number in the grid
dialog to set these properties.
When writing fixed-length files from variable-length fields (for example, Integer,
Decimal, VARCHAR), you must set the field width and pad string column properties
to match the fixed-width of the output column. Double-click on the column number
in the grid dialog to set this column property.
To display each field value, use the print_field import property.
Note
You can also perform the rename, drop (if runtime column
propagation is disabled), and default type conversions using the
Output Mapping tab of any stage.
NEVER use the BASIC Transformer stage in large-volume job flows. Instead,
user-defined functions and routines can expand parallel Transformer capabilities.
Consider, if possible, implementing complex derivation expressions that follow regular
patterns using Lookup tables instead of using a Transformer with nested derivations.
For example, the derivation expression:
If A = 0, 1, 2, or 3 Then B = "X"
If A = 4, 5, 6, or 7 Then B = "C"
could be implemented with a lookup table containing values for column A and
corresponding values of column B.
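For illustration, the lookup table for the derivation above might contain one row per value of A:
A B
0 X
1 X
2 X
3 X
4 C
5 C
6 C
7 C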
Optimize the overall job flow design to combine derivations from multiple
Transformers into a single Transformer stage when possible.
In DataStage V7 and later, you can use the Filter and/or Switch stages to separate
rows into multiple output links based on SQL-like link constraint expressions.
In DataStage V7 and later, you can use the Modify stage for non-default type
conversions, NULL handling, and character string trimming.
Buildops should be used instead of Transformers in the handful of scenarios where
complex reusable logic is required, or where existing Transformer-based job flows
do not meet performance requirements.
Note
If an incoming column is only used in a pass-through
derivation, the Transformer allows this row to be output.
DataStage V7 enhances this behaviour by placing warnings
in the log file when the discards occur.
Transformer derivation evaluation
Output derivations are evaluated BEFORE any type conversions on the assignment.
For example,
The PadString function uses the length of the source type, and not the target.
Therefore, it is important to make sure the type conversion is done before a row reaches
the Transformer.
For example,
TrimLeadingTrailing(string) works only if string is a VARCHAR field. Thus, the
incoming column must be type VARCHAR before it is evaluated in the Transformer.
Keywords Description
ceil Rounds towards positive infinity.
Examples: 1.4 -> 2, -1.6 -> -1
floor Rounds towards negative infinity.
Examples: 1.6 -> 1, -1.4 -> -2
IF (DSLink1.col[1,3] = "001") THEN ...
In fact, this example could be improved further by also moving the string
comparison into the stage variable. The stage variable would be:
IF (DSLink1.col[1,3] = "001") THEN 1 ELSE 0
This reduces both the number of substring functions evaluated and the number of
string comparisons made in the Transformer.
Where an expression includes calculated constant values
For example,
a column definition may include a function call that returns a constant value,
such as:
Str(" ", 20)
This returns a string of 20 spaces. In this case, the function gets evaluated every
time the column derivation is evaluated. It is more efficient to calculate the
constant value just once for the whole Transformer.
This can be achieved using stage variables. This function could be moved into a
stage variable derivation; but in this case, the function still gets evaluated once
for every input row. The solution here is to move the function evaluation into the
initial value of a stage variable.
A stage variable can be assigned an initial value from the Stage Properties
dialog/Variables tab in the Transformer stage editor. In this case, the variable
would have its initial value set to:
Str(" ", 20)
You then leave the derivation of the stage variable on the main Transformer page
empty. Any expression that previously used this function would be changed to
use the stage variable instead.
The initial value of the stage variable is evaluated just once, before any input
rows are processed. Because the derivation expression of the stage
variable is empty, it is not re-evaluated for each input row, so its value
for the whole Transformer processing is unchanged from the initial value.
In addition to a function value returning a constant value, another example
would be part of an expression such as:
"abc" : "def"
As with the function-call example, this concatenation is evaluated every time the
column derivation is evaluated. Since the subpart of the expression is actually
constant, this constant part of the expression could again be moved into a stage
variable, using the initial value setting to perform the concatenation just once.
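Putting both examples together, a sketch of the resulting stage variable setup (the variable names are hypothetical):
Stage variable: svSpaces  Initial value: Str(" ", 20)   Derivation: (left empty)
Stage variable: svPrefix  Initial value: "abc" : "def"  Derivation: (left empty)
Column derivations then reference svSpaces and svPrefix, so each expression is evaluated only once per job run rather than once per row.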
Where an expression requiring a type conversion is used as a constant, or it is
used in multiple places
For example,
an expression may include something like this:
DSLink1.col1+"1"
In this case, the "1" is a string constant, and so, in order to be able to add it to
DSLink1.col1, it must be converted from a string to an integer each time the
expression is evaluated. The solution in this case is just to change the constant
from a string to an integer:
DSLink1.col1+1
Note
When using stage variables to evaluate parts of expressions,
the data type of the stage variable should be set correctly for
that context. Otherwise, needless conversions are required
wherever that variable is used.
Use stage variables for cases where a derivation value is used
multiple times.
If the derivation is evaluated once and applies to only one target column, there is
no need to derive it in a stage variable and then assign it; instead, apply the
derivation logic to the target column itself.
Modify stage
The Modify stage is the most efficient stage available. Any transformation that can be
implemented in Modify will be more efficient than the same transformation in a Transformer.
Transformations that touch a single field, such as keep/drop, type conversions, some
string manipulations, and NULL handling, are the primary operations that should be
implemented using Modify instead of Transformer derivations.
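As a sketch, typical single-field operations expressed as Modify stage specifications might look like the following (the column names are hypothetical; the exact syntax follows the underlying modify operator):
EXAMPLE:
int_col:int32 = int32_from_string(string_col)
keep cust_id, cust_name
drop temp_col
amount = handle_null(amount, 0)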
Releases after 7.0 may be able to perform this optimization automatically. In addition,
the Output Mapping tab of any stage generates an underlying modify.
Starting with v7.01, the function string_trim has been added to the Modify stage, with the
following syntax:
stringField=string_trim[character, direction, justify] (string)
You can use this function to remove the characters used to pad variable-length strings
when they are converted to fixed-length strings of greater length.
By default, these characters are retained when the fixed-length string is then converted
back to a variable-length string.
The character argument is the character to remove. By default, this is NULL. The value
of the direction and justify arguments can be either begin or end; direction defaults to
end, and justify defaults to begin. Justify has no effect when the target string has variable
length.
The following example removes all leading ASCII NULL characters from the beginning
of name and places the remaining characters in an output variable-length string with the
same name:
name:string = string_trim[NULL, begin](name)
The following example removes all trailing 'Z' characters from color, and left-justifies
the resulting hue fixed-length string:
hue:string[10] = string_trim['Z', end, begin](color)
The Modify stage uses the syntax of the underlying modify operator.
Join stage
This section provides information on the Join stage.
Reduce Re-partition
An understanding of the data is critical for selecting the right partition key to allow high
scalability and parallelism in the job design.
You should define partitioning as early as possible to minimize re-partitioning. To allow
parallel processing in a Join, KEY-based partitioning is required. Therefore, it proves
effective to set the right partition key from the point of extraction, whether reading from
files or an RDBMS.
Sorting
The Sort stage provides better flexibility since you can do the following:
Skip keys that have previously been sorted.
Set the buffer memory required.
Joining tables
To ensure parallel processing and good performance, the Join stage does the following by
default:
Performs KEY-based partitioning.
Inserts a Sort operator to sort the Join keys. This can be disabled with
$APT_SORT_INSERTION_CHECK_ONLY (a project/job level setting), assuming
proper sorting has been done in a previous stage.
All the settings above require a proper understanding of the data and of the functions of
the stages, and should be decided on a case-by-case (or job-by-job) basis.
Typically, familiarity with the jobs and the data being worked with is required for fine
and effective tuning of jobs.
In an Outer join scenario, all rows on an outer link (for example, Left Outer, Right
Outer, or both links in the case of Full Outer) are output regardless of match on key
values.
During an Outer Join, when a match does not occur, the Join stage inserts NULL values
into the unmatched columns. Care must be taken to change the column properties
to allow NULL values before the Join. This is most easily done by inserting a Copy
stage and mapping a column from NON-NULLABLE to NULLABLE.
You can use a Filter stage to test for NULL values in unmatched columns.
In some cases, it is simpler to use a Column Generator to add an indicator column,
with a constant value, to each of the outer links and test that column for the constant
after you have performed the join. This is also handy with Lookups that have multiple
reference links.
Version Control of ETL Jobs
Change Log
It is always recommended to include a change log or update history in the ETL job
annotation. Details should include:
Updated by
Update date
Brief details explaining what has been changed.
Additionally, a separate document (for example, an Excel spreadsheet) should be
maintained, listing job names and change details.
Database Stages
Due to the exceptions to this rule (especially with Teradata), specific guidelines of when
to use various stage types are provided in the database-specific topics in this section.
For Auto-Generated SQL, the DataStage Designer will automatically populate the
select list based on the stage's output column definitions.
The only exception to this rule is when building dynamic database jobs that use
runtime column propagation to process all rows in a source table.
When directly connected as the reference link to a Lookup stage, both DB2/UDB
Enterprise and Oracle Enterprise stages allow the lookup type to be changed to Sparse,
sending individual SQL statements to the reference database for each incoming Lookup
row. Sparse Lookup is only available when the database stage is directly connected to
the reference link, with no intermediate stages.
Caution
The individual SQL statements required by a Sparse Lookup
are an expensive operation from a performance perspective.
In most cases, it is faster to use a DataStage Join stage
between the input and DB2 reference data than it is to
perform a Sparse Lookup.
For scenarios where the number of input rows is significantly smaller (for example,
1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse
Lookup may be appropriate.
Performance tips for job design
You can follow the tips mentioned below to improve the performance of the job design:
Remove unnecessary columns as early as possible within the job flow. Every
additional unused column requires additional buffer memory, which can impact
performance (it also makes each transfer of a record from one stage to the next more
expensive).
When reading from database sources, use a select list to read needed columns
instead of the entire table (if possible).
To ensure that columns are actually removed using a stage's Output Mapping,
disable runtime column propagation for that column.
Always specify a maximum length for VARCHAR columns.
Unbounded strings (VARCHARs without a maximum length) can have a
significant negative performance impact on a job flow. There are limited
scenarios when the memory overhead of handling large VARCHAR columns
would dictate the use of unbounded strings.
For example:
VARCHAR columns with a large maximum length (such as 32K) that are rarely populated.
VARCHAR columns of a large maximum length with highly varying data sizes.
Placing unbounded columns at the end of the schema definition may improve
performance.
In DataStage v7.0 and earlier, limit the use of variable-length records within a flow.
Depending on the number of variable-length columns, it may be beneficial to
convert incoming records to fixed-length types at the start of a job flow, and trim to
variable-length at the end of a flow before writing to a target database or flat file
(using fixed-length records can dramatically improve performance). DataStage
v7.01 and later implement internal performance optimizations for variable-length
columns that specify a maximum length.
Avoid type conversions if possible.
Be careful to use the proper datatypes from the source (especially Oracle) in EE job design.
Enable $OSH_PRINT_SCHEMAS to verify that the runtime schema matches the job design column definitions.
Verify that the data type of defined Transformer stage variables matches the
expected result type.
Minimize the number of Transformers. Where appropriate, use other stages (for
example, Copy, Filter, Switch, Modify) instead of the Transformer.
NEVER use the BASIC Transformer in large-volume data flows. Instead, user-
defined functions and routines can expand the capabilities of the parallel
Transformer.
You should use buildops instead of Transformers in the handful of scenarios where
complex reusable logic is required, or where existing Transformer-based job flows
do not meet performance requirements.
Minimize and combine use of Sorts where possible.
It is sometimes possible to re-arrange the order of business logic within a job
flow to leverage the same sort order, partitioning, and groupings.
If data has already been partitioned and sorted on a set of key columns, specifying the "don't sort, previously sorted" option for those key columns in the Sort stage reduces the cost of sorting and takes greater advantage of pipeline parallelism.
When writing to parallel datasets, sort order and partitioning are preserved.
When reading from these datasets, try to maintain this sorting if possible by
using SAME partitioning.
The stable sort option is much more expensive than a non-stable sort, and should only be used when the incoming row order must be preserved beyond what is needed to perform the sort.
Performance of individual sorts can be improved by increasing the memory
usage per partition using the Restrict Memory Usage (MB) option of the
standalone Sort stage.
The default setting is 20MB per partition. Note that sort memory usage can only be specified for standalone Sort stages; it cannot be changed for inline (on a link) sorts.
Performance management
The primary factors that developers can change in their design process that affect
performance are:
Number of sequential I/Os
Number, type and timing of database (DB/2) I/Os
Number and type of hashed file I/Os
Amount of CPU used
Amount and distribution of memory usage
Minimum performance speeds that each job must meet are being introduced. No job is accepted into production unless it meets the minimum performance rules; exceptions are few and granted only under special circumstances, with approval.
Note
A standard job design approach yields acceptable
performance for most jobs. Only a small percentage of jobs
need to be analyzed and tuned. If a job runs only once a month, processes few rows, or completes in less than 5 minutes, then it is usually not worth spending additional resources trying to increase its performance.
Jobs that run several times a day or have execution times over 5 minutes are worth
examining for potential inadvertent bottlenecks or easy means of increasing
performance.
Any job that runs more than 15 minutes per day needs to be looked at carefully and in
detail to see if performance enhancement measures can increase the throughput without
impacting other jobs.
It is critical for the developer to begin the job design phase knowing:
Expected volumes and sizes for all sources, targets, reference files and tables.
Whether there are parent or child dependencies on the job
Whether the output data is important or time-critical outside of job dependencies
How often and when the job is scheduled to be executed
Performance monitoring and tuning
This section provides information on monitoring and tuning job performance.
iostat
The UNIX utility iostat is useful for examining the throughput of various disk resources.
If one or more disks have high throughput, understanding where that throughput is
coming from is vital. If there are spare CPU cycles, I/O is often the culprit. iostat can
also help a user determine if there is excessive I/O for a specific job.
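As a quick illustration, a sampling invocation might look like the following; the exact flags vary by platform (this form is common on Linux, while AIX and HP-UX use different options):

```shell
# Report extended per-device statistics every 5 seconds, 3 reports;
# the first report covers the interval since boot.
iostat -x 5 3
```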
vmstat
The UNIX vmstat utility is useful for examining system paging. Ideally, once an EE
flow begins running, it should never be paging to disk (si and so should be zero). Paging
suggests EE is consuming too much total memory.
$vmstat 1
procs memory swap io system cpu
r b w swpd free buff cache si so bi bo in cs us sy id
0 0 0 10692 24648 51872 2288360 0 0 0 1 2 2 1 1 0
Platform-specific tuning
Tru64-specific tuning
Some environments have experienced better memory management when the
vm_swap_eager kernel parameter is set to true. This swaps out idle processes more
quickly, allowing more physical memory for EE. A higher degree of parallelism may
be available as a result of this setting, but system interactivity may suffer as a result.
Some environments have experienced improved performance when the virtual
memory Eager setting, vm_aggressive_swap, is enabled. This aggressively swaps
processes out of memory to free up physical memory for the running processes.
HP-UX-specific tuning
HP-UX has a limitation when running in 32-bit mode, which limits memory mapped I/O
to 2GB per machine. This can be an issue when dealing with large lookups. The Memory Windows option can provide a workaround for this memory limitation. Ascential
Product Support can provide this document on request.
In an EE flow, certain operators may complete before the entire flow has finished, but
the job is not deemed successful until the slowest operator has finished all its processing.
While editing a flow for testing, it is important to keep in mind that removing one operator might have unexpected effects on the flow. Comparing the score dump between runs is useful before concluding what has made the performance differ.
When modifying the flow, be aware of introducing any new performance problems.
For example,
Adding a persistent dataset to a flow introduces disk contention with any other datasets
being read. This is rarely a problem, but it might be significant in some cases.
Reading and writing data are two obvious places to be aware of potential performance
bottlenecks. Changing a job to write into a Copy stage with no outputs discards the data.
Keep the degree of parallelism the same, with a nodemap, if necessary. Similarly,
landing any read data to a dataset can be helpful if the point of origin of the data is a flat
file or RDBMS.
You should follow this pattern, removing any potentially suspicious stages while trying
to keep the rest of the flow intact. Removing any customer-created operators or sequence
operators should be at the top of the list. Much work has gone into the latest 7.0 release
to improve Transformer performance.
Eliminating repartitions
It is strongly recommended that you eliminate superfluous re-partitioning. Due to
operator or license limitations (import, export, RDBMS operators, SAS operators, and so
on) some operators run with a degree of parallelism that is different than the default
degree of parallelism. Some of this cannot be eliminated, but understanding where, when, and why these repartitions occur is important for understanding the flow.
Repartitions are especially expensive when the data is being repartitioned on an MPP,
where significant network traffic is generated.
Sometimes a repartition can be moved further upstream to eliminate a previous, implicit repartition. Imagine an Oracle read that does some processing and is then hashed and joined with another dataset. There might be a repartition after the Oracle read stage and then another at the hash, when only one repartitioning is ever necessary.
Similarly, a nodemap on a stage may prove useful for eliminating repartitions. In this
case, a transform between a DB2 read and a DB2 write might need to have a nodemap
placed on it to force it to run with the same degree of parallelism as the two DB2 stages
in order to avoid two repartitions.
data volumes between keys often skew this data slightly, but any significant (over 5 or
10%) differences in volume should be a warning sign that alternate keys or an alternate
partitioning strategy might be required.
Job performance analysis
You can refer to the job runtime information in the snapshots below. After analyzing the job monitor and the job log, you will find that for almost all jobs the longest-running part is the upsert to DB2. The following screenshots also show that most of the upsert is really an update process in the database:
DataStage Director Monitor
DataStage Designer
Although the Upsert function of the DB2 EE stage is easy to use in job design, it performs poorly with a big CDC data volume.
For example:
Job: djpHewr2LOD_CUST_IP_INTERFACE_HEW2_LEO
CDC data volume: 50M
Job run time: startup time 00:04:59; production run time 3:26:25.
The job runs slowly and its Upsert speed is below 100 rows/s.
Note the following points for the workaround:
We use a sequential file to store the CDC data, so the job output needs to be modified.
Create a temp table with the same structure as the final table that will be updated.
Use the load command to load the data from the sequential file into the temp table.
Use a merge SQL statement to update the final table from the temp table.
For the updated job, refer to the job
djpHewr2LOD_CUST_IP_INTERFACE_HEW2_LEO3
( S.IP_ID ,S.IP_TYPE_CDE
,S.SRCE_SYS_CDE ,S.SUSPT_MNTR_IND ,S.UPDT_DT_TM ) ;
Use a script to load the sequential file into the Upsert table, then insert into CDC_HIS and do the merge.
Script runtime:
real 6m0.14s
user 0m0.09s
sys 0m0.06s
Identifying performance bottlenecks
All jobs, no matter how efficiently they are written, have a bottleneck. Each job has
some aspect that, when changed, makes the job run faster. This bottleneck is often
different on development machines than it is on production ones.
A bottleneck is not something negative; it is an inherent attribute of a job. It is important
to determine which portion of a job is causing the bottleneck. Once armed with that
knowledge, the developer can then decide if the performance value limited by the
bottleneck is acceptable. There are no absolute values for this but common sense and an
understanding of the priorities makes the decision easier. Later on in this document some
metrics are presented, which also help build a decision.
A sample DataStage job can be broken into broad categories that are used to locate bottlenecks; for example, the Source category covers data reads.
Processing Speeds
A job performs at a maximum speed that is limited by its slowest component. Even with
buffering, once the buffer fills up the job reverts back to the speed of the slowest stage.
Usually a job functions so that one input row is processed and, after that row is written out to one or more destinations, the next row is read. Jobs that perform aggregations or sorting, or that have interim file stages, work a bit differently, but their overall speed is limited in the same way.
Measuring component performance in the context of the job is not complicated. It is a
matter of compartmentalizing into groups and measuring those speeds separately. Some
jobs lend themselves to quite simplistic measurements while others need more work, but
with just a couple of minutes of design modification effort, the slowest components can
be easily identified.
Usually a combination of modifying constraints and adding temporary sequential files is necessary. If the constraint of a Transformer stage is set to always evaluate to false (1=2 or @FALSE), then the performance of all stages leading up to that Transformer is measured. If the speed comparison of this job with the original shows that overall performance has not changed appreciably, the bottleneck is before that Transformer; otherwise the bottleneck lies after it.
The best method is to use temporary sequential files between stages or sections of the
job as shown here:
Sequential file I/O is very fast and is only rarely a bottleneck. The addition of Sequential File stages as depicted makes sure that each group of steps between sequential stages is executed completely: the last row of input is written to the sequential file before the first row is read by the subsequent stage. Thus each section has a rows-per-second speed that is independent of the others and can be compared against the others for performance measurement.
Bottleneck resolution
This section provides information and tips to overcome the bottleneck using DataStage
best practices.
Combinable operators
Combined operators generally improve performance at least slightly, and in some cases the performance improvement may be dramatic. However, there may be situations where combining operators actually hurts performance. Identifying such operators can be difficult without trial and error.
The most common situation arises when multiple operators, such as Sequential File
(import and export) and Sort, are combined and are performing disk I/O. In I/O-bound
situations, turning off combination for these specific operators may result in a
performance increase.
This is a new option in the advanced stage properties of DataStage Designer version 7.x.
Combinable operators often provide a dramatic performance increase when a large
number of variable length fields are used in a flow.
To experiment with this, try disabling the combination of any stages that perform I/O
and any sort stages.
$APT_DISABLE_COMBINATION=1 globally disables operator combining.
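For example, this could be set before running (shown here as a shell export; the variable can also be set through the DataStage Administrator environment):

```shell
# Globally disable operator combination for subsequent job runs.
# Individual stages can instead be un-combined from the Advanced
# stage properties in Designer (version 7.x).
export APT_DISABLE_COMBINATION=1
```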
Disk I/O
Total disk throughput is often a fixed quantity that EE has no control over. There are,
however, some settings and rules of thumb that are often beneficial:
If data is going to be read back in, in parallel, it should never be written as a
sequential file. A dataset or file set is a much more appropriate format.
When importing fixed-length data, the Number of Readers Per Node option on the
Sequential File stage can often provide a noticeable performance boost as compared
with a single process reading the data. However, if there is a need to assign a number
in source file row order, you cannot use the -readers option because it opens multiple
streams at evenly-spaced offsets in the source file. Also, you can use this option only
for fixed-length sequential files.
Some disk arrays have read-ahead caches that are only effective when data is read
repeatedly in like-sized chunks.
$APT_CONSISTENT_BUFFERIO_SIZE=n forces import to read data in chunks
which are size n or a multiple of n.
Memory-mapped I/O is, in many cases, a big performance win; however, in certain situations, such as a remote disk mounted via NFS, it may cause significant performance problems. APT_IO_NOMAP=1 and APT_BUFFERIO_NOMAP=1 turn off this feature, which sometimes improves performance.
AIX and HP-UX default to NOMAP.
You can use APT_IO_MAP=1 and APT_BUFFERIO_MAP=1 to turn on the memory
mapped I/O for these platforms.
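A sketch of both directions, set through the job or project environment as appropriate:

```shell
# Turn memory-mapped I/O off, e.g. when datasets live on an NFS mount:
export APT_IO_NOMAP=1
export APT_BUFFERIO_NOMAP=1

# Or force it on for platforms that default to NOMAP (AIX, HP-UX):
# export APT_IO_MAP=1
# export APT_BUFFERIO_MAP=1
```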
Buffering
Buffer operators are intended to slow down their input to match the consumption rate of
the output. When the target stage reads very slowly, or not at all, for a length of time,
upstream stages begin to slow down. This can cause a noticeable performance loss if the
optimal behaviour of the buffer operator is something other than rate matching.
By default, the buffer operator has a 3MB in-memory buffer. Once that buffer reaches
two-thirds full, the stage begins to push back on the rate of the upstream stage. Once the
3MB buffer is filled, data is written to disk in 1MB chunks.
In the following discussions, settings in all capital letters are environment variables and
affect all buffer operators. Settings in all lowercase are buffer-operator options and can
be set per buffer operator.
In most cases, the easiest way to tune the buffer operator is to eliminate the push
back and allow it to buffer the data to disk as necessary.
$APT_BUFFER_FREE_RUN=n or bufferfreerun does this. The buffer operator reads N * max_memory (3MB by default) bytes before beginning to push back on the upstream stage. If there is enough disk space to buffer large amounts of data, this usually fixes any egregious slow-down issues caused by the buffer operator.
If there is a significant amount of memory available on the machine, increasing the
maximum in-memory buffer size is likely to be very useful if the buffer operator is
causing any disk IO. $APT_BUFFER_MAXIMUM_MEMORY or
maximummemorybuffersize is used to do this. It defaults to roughly 3000000 (3MB).
For systems where small to medium bursts of I/O are not desirable, the 1MB disk-write chunk size may be too small. $APT_BUFFER_DISK_WRITE_INCREMENT or diskwriteincrement controls this and defaults to roughly 1000000 (1MB). This setting may not exceed max_memory * 2/3.
Finally, in a situation where a large, fixed buffer is needed within the flow,
queueupperbound (no environment variable exists) can be set equal to max_memory
to force a buffer of exactly max_memory bytes.
Such a buffer blocks an upstream stage (until data is read by the downstream
stage) once its buffer has been filled, so this setting should be used with extreme
caution. This setting is rarely necessary to achieve good performance, but is
useful when there is a large variability in the response time of the data source or
data target. No environment variable is available for this flag; it can only be set
at the osh level.
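The buffer-related environment variables above might be combined as follows; the numbers are examples only and must be tuned to the machine's memory and disk budget, keeping the disk-write increment at or below two-thirds of the maximum memory:

```shell
# Let buffers spill freely to disk before pushing back upstream.
export APT_BUFFER_FREE_RUN=1000

# Raise the per-operator in-memory buffer from ~3MB to ~9MB.
export APT_BUFFER_MAXIMUM_MEMORY=9000000

# Write to disk in ~3MB chunks (must not exceed max_memory * 2/3).
export APT_BUFFER_DISK_WRITE_INCREMENT=3000000
```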
For releases 7.0.1 and beyond, per-link buffer settings are available in EE. They appear on
the Advanced tab of the Input & Output tabs. The settings saved on an Output tab are shared
with the Input tab of the next stage and vice versa, like Columns.
Sequential file stages
This is the fastest stage in DataStage for both reading and writing. There is little that can be done to speed up I/O with sequential files. Since this stage is seldom the bottleneck, any increases in efficiency are often not visible in single-job performance. Nevertheless, there are some simple approaches that optimize Sequential stage performance and reduce the overall system load; these small changes add up across all jobs and speed up processing.
Often, as part of processing, temporary or staging sequential files are created in one job
and read in another. When this is the case, it is worthwhile looking at the data and
deciding whether or not quotes are necessary for string fields and if the data can be
written as fixed length without separators.
For example,
If a staging file has three columns, ARRG_ID with 18 digits numeric and fields
CURRENCY and STATUS with Char(3) and Char(1) respectively, two possible formats
for the resulting sequential file could be:
4104385636294522114,GBP,A 4104385636294522114GBPA
4104385636294522115,USD,A 4104385636294522115USDA
4104385636294522116,JPY,A 4104385636294522116JPYA
The first method uses 140 bytes per row while the second uses only 105, a 25% reduction in data stored. Since this data gets written and read at least once, a small modification such as this can almost halve the I/O in this example. Even though the difference in large sequential files is not always directly noticeable, it lets the job have a smaller impact on the system; the cumulative effect over time and across many jobs is significant. The more columns in a file, the greater the data size savings.
Fixed length sequential files have the secondary effect of being more efficient for the
read process to parse. As the start and end positions are the same for each line DataStage
needs only to access a substring on each row. With delimited files the process is more
involved, as the string needs to be searched character by character until a start or end
delimiter is encountered. This processing overhead is very noticeable with files
containing a hundred or more columns and no inter-process buffering.
Yet another advantage of using fixed-length sequential files is that they can be read significantly faster by Parallel Extender (PX) jobs, should those be used in the future for part of the processing. PX jobs start a number of concurrent processes; if a delimited sequential file is read, it must be parsed line by line by just one process to determine how the rows are organized. A fixed-length file can easily be read by several processes in parallel, as the row start positions are easily determined.
Putting the data into a fixed-width format is not always a solution, however. It can have a very detrimental effect, especially when dealing with the fixed-width CHAR type columns often found when using DB/2. These columns often contain many extraneous spaces, and DataStage trims leading and trailing blanks; so writing a CHAR(100) column containing "hello" puts only 5 bytes into a variable-length sequential column, but if written to a fixed-width flat file it would, of necessity, take up 100 bytes of space.
Not using quote characters always saves a lot of space, but care needs to be taken when removing them: if any of these columns could contain the column delimiter character, the resulting text file becomes unusable and causes jobs to abort. The default column delimiter is the comma (,) character. One common method is to choose an arbitrary non-displayable character as the delimiter; for example, 0x04 (the EOT code) works as a separator if it is certain that this byte does not occur in any character field.
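As a small illustration of the 0x04 delimiter (the file name and values here are made up for the example), POSIX printf can emit the EOT byte directly via its octal escape:

```shell
# Write one staging row with 0x04 (EOT) between the three fields.
# \004 is the octal escape for the EOT byte.
printf '4104385636294522114\004GBP\004A\n' > /tmp/staged_row.txt

# od shows the \004 delimiters between the fields.
od -c /tmp/staged_row.txt
```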
the bucket number MOD({KeyValue}, 4) + 1. So if all keys between 1 and 100 are used, an even distribution of 25 records per group is achieved. But if the keys were 2 through 200 with only even-numbered keys being used, then the group counts end up being 50, 0, 50, and 0 respectively, resulting in a very uneven distribution; thus the type 2 hashing algorithm is inappropriate for this file.
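This distribution is easy to verify outside DataStage; the following shell pipeline is a stand-in for the hashing arithmetic, not DataStage code:

```shell
# Even keys 2..200 hashed with MOD(key, 4) + 1: only groups 1 and 3
# receive records (50 each); groups 2 and 4 stay empty.
seq 2 2 200 | awk '{ print ($1 % 4) + 1 }' | sort -n | uniq -c

# Keys 1..100 instead spread evenly: 25 records in each of the 4 groups.
seq 1 100 | awk '{ print ($1 % 4) + 1 }' | sort -n | uniq -c
```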
Since the modulo has such a profound impact on performance, it is important to make sure that the value is sufficiently large for the file's data. DataStage has introduced a maintenance-free type of hashed file called Dynamic (or type 30) and has made it the default file type for new files. This hashed file type automatically increases or decreases the actual modulo used depending on the file size.
By default a dynamic file starts off with one group, and the split threshold is set at 80%: when the file is 80% full, it splits the groups and effectively increases the modulo dynamically. The reverse happens when the load drops below 20%, which triggers a group merge.
This is a very efficient way of maintaining high system performance, with no maintenance, for typical database tables, which tend to grow or shrink slowly. In DataStage jobs, however, hashed files are often cleared and then refilled (sometimes several times daily), so the overhead associated with this splitting and merging can be very high.
Fortunately, dynamic hashed files allow the developer to specify and override default settings when the files are created; by correctly setting initial values, performance is increased.
When hashed files are written to in DataStage jobs, the Hashed File stage exposes options that open the following window:
Hashed File Stage
The only attribute that you need to change from its default value of 1 is the Minimum modulus. Only 2 hash algorithms are possible for dynamic files; if the key of the hashed file changes most in the rightmost bytes (regardless of whether it is numeric or not), then using SEQ.NUM as the algorithm might increase speed by up to 5%.
In order to find the optimum starting modulus it is necessary to fill the hashed file with
data. Once the file contains data, use either the DataStage Administrator or the TCL shell
client to execute the appropriate commands.
There are 2 methods of creating and accessing hashed files in DataStage: putting them into the local account, or specifying a path. If the file has been created using a path, then a file pointer to it needs to be created using the command SETFILE {full UNIX path to file} {FileName}; if a local file was specified, this entry has already been made. The command ANALYZE.FILE {FileName} returns the following window:
Command Output
You should use the value shown for the No. of Groups (modulus) as the minimum modulo (or as the basis for multiplication if the whole hashed file wasn't loaded).
Preloading hashed files to memory
Any hashed file that is used for reference lookups and is less than approximately 128MB
in size should be pre-loaded into memory. There are only a few exceptions to this rule.
They are as follows:
When the hashed file is modified within the job itself or by another job at the same
time and the newly modified entries need to be visible in the lookup.
Only a small percentage of the records in the hashed file are going to be looked up, or the source data is so small that loading the hashed file into memory takes longer than simply reading it.
Writing to DB/2
Overall, this stage tends to be the speed-limiting factor in most jobs. As with reading DB/2 data, it is necessary to reduce the amount of data written to just the necessary columns and the updated/inserted rows. The processing time used to decide whether or not to write a row of data is always less than the time spent updating a row that hasn't changed.
The array size attribute is computed the same way as for reads: set it to a number that fills the 4KB buffer with data without overflowing into a second buffer. There is also a transaction size setting when writing, and you should set it to a multiple of the array size. This parameter is equivalent to the database commit frequency and should be set to as high a value as practical; the more commits a job does, the more overhead is incurred in the job.
From a performance point of view, the best commit frequency is 0, denoting that the whole job is one transaction and a commit is done only at the end of processing. The drawback is that uncommitted writes need to be stored on the DB/2 machine, and if too much data is placed in the buffers they can overflow. At present the recommendation from the DB/2 DBAs is to use a commit frequency no larger than 10,000.
Two of the write options to DB/2 are:
1. Update existing or insert new rows
2. Insert new or update existing rows
Both options do the same thing, but because of their ordering they can have a big impact on performance. If a job writes to a table where 90% of the rows are updates and only 10% are inserts, using the "Insert new or update existing rows" option results in over 40% more database operations.
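As a rough sketch of the comparison for a 100-row batch, assuming each attempted statement (successful or not) costs one database operation:

```shell
# Hypothetical 100-row batch: 90 rows already exist, 10 are new.
updates=90; inserts=10; total=$((updates + inserts))

# "Update existing or insert new rows": one UPDATE attempt per row,
# plus an INSERT for each of the 10 rows the UPDATE missed.
update_first=$((total + inserts))

# "Insert new or update existing rows": one INSERT attempt per row,
# plus an UPDATE for each of the 90 rows the INSERT rejected.
insert_first=$((total + updates))

echo "update-first: $update_first operations"   # 110
echo "insert-first: $insert_first operations"   # 190
```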
If the job design allows it, split the output to DB/2 stages into two distinct data streams,
one doing only INSERT operations and the other doing only UPDATE. This has several
advantages:
The INSERT stream is doing pure inserts with no wasted DB checking if the record
already exists.
The UPDATE stream is also only doing updates with no wasted cycles.
If the job has "enable row buffering" and "interprocess" enabled, then each of the DB/2 stages executes as a separate process, creating two DB/2 connections that run in parallel.
Should bulk load functionality be incorporated in the future, you don't need to modify the job, apart from changing the stage type, to get maximum benefit from it.
Transformer stages
As the real workhorses of DataStage, Transformer stages usually consume most, if not all, of a job's CPU time. Except for Transformer stages that have Lookup stages attached, these are seldom the bottleneck stages in processes. Nevertheless, it is important to avoid wasting CPU time on unnecessary computations, and you should always look at Transformers with performance considerations in mind.
Note
Remember that each computation done in a transform is
repeated for every row.
By enabling inter-process row buffering in the job settings, each Transformer stage becomes its own UNIX process. If a Transformer stage performs many complex or time-consuming computations, it might cause a bottleneck because its processing is limited to just one processor. By splitting such a Transformer stage into two or more distinct stages, the processing load can be distributed and the bottleneck removed. The CPU time used by a Transformer (or other active) stage is displayed as part of the job log information.
The most common cause of CPU wastage in Transformer stages is when identical
computations or derivations are unnecessarily repeated. Transformer stage local
variables have been included so that a computation need only be done once per row and
then used as many times as possible. The following transform is typical of one that
performs extraneous operations.
Stage Variables
Adding a stage variable whose value is re-used makes this use far less CPU and perform much more efficiently.
Stage variables are recomputed for each row of data. The exception to this rule is the initial value, which is computed only once at the beginning of the job. If the variable value is never changed, such as when the derivation line is left empty, then this value can be used for each row processed in the job.
Stage Variables
Transformer Stage Properties
Shared containers
Shared containers are much like Transform calls in that their design information and programming code are compiled into the job's object. Any changes to a common shared container are therefore not reflected until the job is re-compiled.
The container should be a part of the normal data flow in a job; it should never be possible to run the container as a separate job. When that is the requirement, it must be written as a standalone job and connected to the original via a sequencer. Containers should also be written so as not to be misleading. The following job design looks like it doesn't do much:
Job design
But if the container's contents were as shown below, each and every row would need to be processed through the ToCont link and into the container before any data streams from the FromCont link to the link collector. The container could also be updating table entries that again get updated by the parent job.
Job design
Sequences
Job Sequences are, fortunately, rather straightforward. They almost never use any CPU or I/O and therefore leave little to be tuned. The only performance consideration with sequences is to allow as many jobs as the design permits to run in parallel within them.
Production support process
This section provides information on the procedure followed under production support.
SWAT
You can refer to the following procedure that is followed under SWAT:
XXXX sends issue details to SWH or XXXX directly. The issue is analyzed by XXXX if it is related to DS jobs; if it is related to design or business logic, it is analyzed by SWH.
XXXX sends their analysis to SWH/XXXX; XXXX also sends a suggested solution based on experience.
SWH instructs XXXX to amend the DS jobs and design specifications. XXXX amends the DS jobs, reviews and tests them in the XXXX UAT environment, and sends them back to SWH for review and confirmation.
Note
Unit testing in UAT is a must. DO NOT deliver code without testing in the UAT environment.
Capture the unit testing result (in QC) and send it along with the amended job to SWH.
Once the job change is approved by SWH, it is delivered to XXXX along with the unit test result.
Follow the version control process (applicable only if changes need to be incorporated in the v1.6/1.7 project).
XXXX further checks the jobs from their end before moving them to UAT (in case the issue was raised by them and not by the end user).
Follow up and get confirmation from XXXX that they have tested the code from their side in UAT and that the code is fine.
Production support
XXXX sends production issue details to SWH or XXXX directly.
The issue is analyzed by XXXX. If a logic change is required, confirmation is obtained from SWH.
XXXX amends the DS jobs after getting confirmation, tests them in the XXXX UAT environment, and delivers them to XXXX.
Note
Unit testing in UAT is a must. DO NOT deliver code without testing in the UAT environment.
Capture the unit testing result (in QC) and send it along with the amended job to XXXX.
XXXX further checks the jobs from their end before moving them to Production.
Follow up and get confirmation from XXXX that they have tested the code from their side in UAT and that the code is fine.
Follow the version control process.
XXXX moves the code to Production.
Version control
Once the fix is applied and released, update the following spreadsheets with the change details and check them in to MKS:
\\mks_hew_r2\Production Fix Change-Requests\Change Request Tracking Sheet.xls
\\mks_hew_r2\ETL_Source\ETL Version Control\Amendments during Production Implementation.xls
Copy the dsx of the updated jobs and the CR document (if applicable) to the project shared folder:
\Share\XXXX\vol_ss\HEW 4.0\3 Development\GWM\Production Fixes\