What - S New in DataStage 8 - FINAL
What - S New in DataStage 8 - FINAL
What - S New in DataStage 8 - FINAL
Summary
DataStage has seen major transformations in the past years from an extract-transform-load
(ETL) tool running in what was called the UniVerse engine, to what is now a DataStage
engine. With the need to adapt to demands of volume processing, Ascential acquired and
integrated the parallel processing engine into DataStage. To provide data quality and data
integration solutions, Ascential added more products that included MetaStage, QualityStage
and ProfileStage.
Enter IBM Information Server, the latest release from IBM after acquiring Ascential. IBM
Information Server delivers improved functionalities that make job administration,
development and maintenance easier.
IBM Information Server is the integrated set of components that include WebSphere
DataStage and Quality Stage, WebSphere Information Analyzer (hybrid of ProfileStage and
AuditStage), Federation Server, and Business Glossary that share a common metadata
repository, common administration, common logging and common reporting. Each
component is separately orderable to fit the client’s needs.
IBM Information Server is built on a service-oriented architecture. Jobs for data analysis,
data cleansing and data transformation can run as a service; making them available to any
application at any time. This approach enhances job reuse and supports standardized
processes to be implemented across projects and applications that call these services.
Organizations looking to implement or upgrade from previous versions will find a host of
new features that simplify job design and help cut implementation effort. Websphere
DataStaqe alone is packed with new tools to better manage the job performance and
improve operational efficiency.
This article will examine the new features in DataStage version 8. Other Information Server
components will not be discussed.
There are many new features and added functionalities in WebSphere DataStage that help
to cut development time, simplify job design and improve job performance. Among these
features are Data Connection Object, ODBC Connector, SQL Builder, Slowly Changing
Dimension stage, Range Look-up, Advanced and Quick Find, Parameter Set, Common
Logging, Resource Estimation tool, Performance Analysis tool and Job Compare. The
following are some additional details on each of these new features:
dragged and dropped during job design, a very handy feature when creating jobs that
read from or write to the same database.
ODBC Connector
ODBC Connectors are for accessing external data sources and can be used to read,
write, look up and filter data or simply to test the database connectivity during job
design. The Test Button on connectors allows developers to test database connections
without having to view the data or to run the job.
ODBC Connector is the best stage to use for your database because it gives the
maximum parallel performance and offers more features compared to database
enterprise stages and plug-in stages.
ODBC Connecter has its own reject-handling function which eliminates the need to add a
Modify or a Transformer stage for capturing SQL errors or for aborting jobs. A choice
between number of rows or percentage or rows rejected can be specified for terminating
the job run.
ODBC Connector can transfer large objects (LOB) using inline or reference methods.
However, a connector is the only stage that does reference methods so another
connector is needed to transfer the LOB inline later in the job.
SQL Builder
The enterprise database stages in WebSphere DataStage as well as the ODBC Connector
stage provide a graphical interface for building SQL that makes it easier for developers
to view, create and edit SQL. While previous releases support auto-generated queries
that synchronize the SQL select list with the DataStage column list, the SQL that can be
created is very limited – the table alias has to be manually appended to the columns and
the “where clause” has to be manually keyed in. The SQL Builder provides a drag-and-
drop approach for creating the column list and the “where clause” and supports most
database functions. It is important to note that the SQL Builder is limited to one or two
source tables and a simple SQL statement.
Range Look-up
Range Look-up is equivalent to the operator between. Lookup against a range of values
was difficult to implement in previous DataStage versions. By having this functionality in
the lookup stage, comparing a source column to a range of two lookup columns or a
lookup column to a range of two source columns can be easily implemented.
Advanced Find
The Quick Find and the Advanced Find are very useful tools for developers as it does a
quick search within the project or within the repository for object usage or
dependencies. The results are presented in a detailed view or graphical view. This facility
can be used for impact analysis or when doing project-wide job changes.
Parameter Set
Parameter Set is a new repository object containing parameter names and values that
can be shared across jobs. Parameter Sets provide an easier and faster method when
adding parameters to a job, eliminating the need to add parameters invidiually to each
job. Parameter values can be stored in a file, adding flexibility when changing parameter
values at runtime or when changing environment.
Resource Estimation
Resource Estimation tool is designed for planning how much resources will have to be
allocated to the DataStage jobs as volume of data grows. The Resource Estimation utility
can project how much CPU, scratch and disk requirements per operator per partition will
be utilized by the job given a sample size of data. The output is presented in a graphical
form that can also be generated in a HTML report.
Performance Analysis
The Performance Analysis tool is useful for identifying bottlenecks in a job. The graphical
report visualizes the start and end time for each stage in the job, the physical amount of
time needed for each stage to complete in each partition, the total CPU and system time
for each stage, the memory utilization for each stage over the course of the job, and
much more.
Job Compare
While it is a good to keep copies of different job implementations, we frequently lose
track of the job changes. In our attempt to improve job design and performance, our
experiments often lead us back to the older copy of the job. This is where the Job
Compare feature of WebSphere DataStage comes in handy. Job Compare also works on
other objects like routines, shared containers, and table definitions; whether they are in
the same or in different repositories.
Job Locking
In WebSphere DataStage, jobs can be opened in read-only mode. Developers need not
guess who has the job locked as this information is provided when opening jobs that are
in use.
Sessions can also be disconnected from the web console if there is a need such as
restarting the DataStage server while there are existing connections. Unlocking a job
from a disconnected session is very easily managed from the web console without
having to follow multiple steps in the DataStage Administrator client, which is a major
improvement over previous versions.
Job Monitoring
Time-based and record-based monitoring of jobs is available, both of which can adjust
to system loads, thus eliminating an unnecessary refresh when the DataStage server is
at its peak performance.
Information Server web console has common logging and common reporting features
that work across components. Both features provide easy searches using log or
report date constraints, wildcard filters, saved queries and reports.
The ODBC enterprise stage for WebSphere DataStage can now connect to more
external data sources including IBM DB2, Informix, Oracle, Sybase, SQL Server,
Teradata, MySQL, MS Access, and others.
Conclusion
The look and feel of DataStage and QualityStage canvas remains the same but the new
functionalities are major enhancements over the previous version. Data Connection Object,
Parameter Set, Range Look-up and Slowly Changing Dimension are all designed to simplify
design, help cut implementation effort and reduce cost. Advance Find provides a good way
to do impact analysis, an importanat step in project management. Resource Estimation is as
important for project planning. Meanwhile, Perfornamce Analysis tool is another useful
feature that can be used throughout the lifecycle of a job. By knowing what causes a
performance bottleneck, production support groups can better cope with the ever-shrinking
batch windows.
While Advance Find will not perform a Replace function and SQL Builder will not let us build
complex SQL, all the changes in version 8 have positive impact on job development,
production support and project management. Combined with the features offered in
Information Server, existing customers who are looking to upgrade or new DataStage
clients will benefit from the new enhancements.
References
1
IBM Information Server, Version 8, http://publib.boulder.ibm.com/infocenter
Author
Luisa Garcia – Senior Information Management Consultant, Alpine Consulting, Inc.
For more information on DataStage, Information Server, or related products, please contact
Alpine Consulting at 847.605.0788 or info@alpineinc.com.