Nothing Special   »   [go: up one dir, main page]

What - S New in DataStage 8 - FINAL

Download as pdf or txt
Download as pdf or txt
You are on page 1of 5

What’s New in DataStage version 8.

Summary
DataStage has seen major transformations in the past years from an extract-transform-load
(ETL) tool running in what was called the UniVerse engine, to what is now a DataStage
engine. With the need to adapt to demands of volume processing, Ascential acquired and
integrated the parallel processing engine into DataStage. To provide data quality and data
integration solutions, Ascential added more products that included MetaStage, QualityStage
and ProfileStage.

Enter IBM Information Server, the latest release from IBM after acquiring Ascential. IBM
Information Server delivers improved functionalities that make job administration,
development and maintenance easier.

What is Information Server?

IBM Information Server is the integrated set of components that include WebSphere
DataStage and Quality Stage, WebSphere Information Analyzer (hybrid of ProfileStage and
AuditStage), Federation Server, and Business Glossary that share a common metadata
repository, common administration, common logging and common reporting. Each
component is separately orderable to fit the client’s needs.

IBM Information Server is built on a service-oriented architecture. Jobs for data analysis,
data cleansing and data transformation can run as a service; making them available to any
application at any time. This approach enhances job reuse and supports standardized
processes to be implemented across projects and applications that call these services.
Organizations looking to implement or upgrade from previous versions will find a host of
new features that simplify job design and help cut implementation effort. Websphere
DataStaqe alone is packed with new tools to better manage the job performance and
improve operational efficiency.

This article will examine the new features in DataStage version 8. Other Information Server
components will not be discussed.

What’s new in WebSphere DataStage

There are many new features and added functionalities in WebSphere DataStage that help
to cut development time, simplify job design and improve job performance. Among these
features are Data Connection Object, ODBC Connector, SQL Builder, Slowly Changing
Dimension stage, Range Look-up, Advanced and Quick Find, Parameter Set, Common
Logging, Resource Estimation tool, Performance Analysis tool and Job Compare. The
following are some additional details on each of these new features:

Data Connection Object


Data connection objects are reusable components that store database connection
information. One data connection object can be created for each particular database and
has to be associated with a particular stage type. Once configured, they can be simply

Alpine Consulting, Inc. Page 1 of 5 Confidential Material


What’s New in DataStage version 8.x

dragged and dropped during job design, a very handy feature when creating jobs that
read from or write to the same database.

ODBC Connector
ODBC Connectors are for accessing external data sources and can be used to read,
write, look up and filter data or simply to test the database connectivity during job
design. The Test Button on connectors allows developers to test database connections
without having to view the data or to run the job.

ODBC Connector is the best stage to use for your database because it gives the
maximum parallel performance and offers more features compared to database
enterprise stages and plug-in stages.

ODBC Connector has a schema reconciliation function that automatically compares


DataStage schemas to external-resource schemas such as a database. Schemas include
data types, attributes and field lengths. Based on the reconciliation rules that you
specify, runtime errors or extra transformation on mismatched schemas can be avoided.

ODBC Connecter has its own reject-handling function which eliminates the need to add a
Modify or a Transformer stage for capturing SQL errors or for aborting jobs. A choice
between number of rows or percentage or rows rejected can be specified for terminating
the job run.

ODBC Connector can transfer large objects (LOB) using inline or reference methods.
However, a connector is the only stage that does reference methods so another
connector is needed to transfer the LOB inline later in the job.

SQL Builder
The enterprise database stages in WebSphere DataStage as well as the ODBC Connector
stage provide a graphical interface for building SQL that makes it easier for developers
to view, create and edit SQL. While previous releases support auto-generated queries
that synchronize the SQL select list with the DataStage column list, the SQL that can be
created is very limited – the table alias has to be manually appended to the columns and
the “where clause” has to be manually keyed in. The SQL Builder provides a drag-and-
drop approach for creating the column list and the “where clause” and supports most
database functions. It is important to note that the SQL Builder is limited to one or two
source tables and a simple SQL statement.

Slowly Changing Dimension Stage


Slowly Changing Dimension Stage (SCD) is a new stage that supports both Type 1 and
Type 2 processing. The SCD stage allows for in-memory lookup updates, surrogate key
generation and updates to dimension tables: processes that were previously
implemented in a series of stages or in multiple jobs. SCD can handle a series of
updates to the same dimension row because the changes are applied to the lookup table
in memory and are mirrored to the dimension update link.

Range Look-up
Range Look-up is equivalent to the operator between. Lookup against a range of values
was difficult to implement in previous DataStage versions. By having this functionality in

Alpine Consulting, Inc. Page 2 of 5 Confidential Material


What’s New in DataStage version 8.x

the lookup stage, comparing a source column to a range of two lookup columns or a
lookup column to a range of two source columns can be easily implemented.

Advanced Find
The Quick Find and the Advanced Find are very useful tools for developers as it does a
quick search within the project or within the repository for object usage or
dependencies. The results are presented in a detailed view or graphical view. This facility
can be used for impact analysis or when doing project-wide job changes.

Parameter Set
Parameter Set is a new repository object containing parameter names and values that
can be shared across jobs. Parameter Sets provide an easier and faster method when
adding parameters to a job, eliminating the need to add parameters invidiually to each
job. Parameter values can be stored in a file, adding flexibility when changing parameter
values at runtime or when changing environment.

Resource Estimation
Resource Estimation tool is designed for planning how much resources will have to be
allocated to the DataStage jobs as volume of data grows. The Resource Estimation utility
can project how much CPU, scratch and disk requirements per operator per partition will
be utilized by the job given a sample size of data. The output is presented in a graphical
form that can also be generated in a HTML report.

Performance Analysis
The Performance Analysis tool is useful for identifying bottlenecks in a job. The graphical
report visualizes the start and end time for each stage in the job, the physical amount of
time needed for each stage to complete in each partition, the total CPU and system time
for each stage, the memory utilization for each stage over the course of the job, and
much more.

Job Compare
While it is a good to keep copies of different job implementations, we frequently lose
track of the job changes. In our attempt to improve job design and performance, our
experiments often lead us back to the older copy of the job. This is where the Job
Compare feature of WebSphere DataStage comes in handy. Job Compare also works on
other objects like routines, shared containers, and table definitions; whether they are in
the same or in different repositories.

Job Locking
In WebSphere DataStage, jobs can be opened in read-only mode. Developers need not
guess who has the job locked as this information is provided when opening jobs that are
in use.

Sessions can also be disconnected from the web console if there is a need such as
restarting the DataStage server while there are existing connections. Unlocking a job

Alpine Consulting, Inc. Page 3 of 5 Confidential Material


What’s New in DataStage version 8.x

from a disconnected session is very easily managed from the web console without
having to follow multiple steps in the DataStage Administrator client, which is a major
improvement over previous versions.

Job Monitoring
Time-based and record-based monitoring of jobs is available, both of which can adjust
to system loads, thus eliminating an unnecessary refresh when the DataStage server is
at its peak performance.

Other Key Improvements


ƒ One drawback of previous releases of DataStage EE was job startup. In version 8,
this drawback is markedly reduced. Small parallel jobs will now run faster. Buffering
and process combination is also improved to optimize job performance.

ƒ Information Server web console has common logging and common reporting features
that work across components. Both features provide easy searches using log or
report date constraints, wildcard filters, saved queries and reports.

ƒ The ODBC enterprise stage for WebSphere DataStage can now connect to more
external data sources including IBM DB2, Informix, Oracle, Sybase, SQL Server,
Teradata, MySQL, MS Access, and others.

ƒ In Information Server, WebSphere QualityStage is tightly integrated with WebSphere


DataStage and shares the same design canvas, which enables users to design jobs
with data transformation stages and data quality stages in the same session. The
new designer client is now called DataStage and QualityStage Designer, as are the
Administrator and Director clients. In Information Server, the Manager client is gone
and the functionalities are incorporated into DataStage and QualityStage Designer
thereby saving developer time when importing and exporting jobs. Metadata analysis
and report is also available inside DataStage and QualityStage Designer.

Conclusion
The look and feel of DataStage and QualityStage canvas remains the same but the new
functionalities are major enhancements over the previous version. Data Connection Object,
Parameter Set, Range Look-up and Slowly Changing Dimension are all designed to simplify
design, help cut implementation effort and reduce cost. Advance Find provides a good way
to do impact analysis, an importanat step in project management. Resource Estimation is as
important for project planning. Meanwhile, Perfornamce Analysis tool is another useful
feature that can be used throughout the lifecycle of a job. By knowing what causes a
performance bottleneck, production support groups can better cope with the ever-shrinking
batch windows.

While Advance Find will not perform a Replace function and SQL Builder will not let us build
complex SQL, all the changes in version 8 have positive impact on job development,
production support and project management. Combined with the features offered in
Information Server, existing customers who are looking to upgrade or new DataStage
clients will benefit from the new enhancements.

Alpine Consulting, Inc. Page 4 of 5 Confidential Material


What’s New in DataStage version 8.x

References
1
IBM Information Server, Version 8, http://publib.boulder.ibm.com/infocenter

Author
Luisa Garcia – Senior Information Management Consultant, Alpine Consulting, Inc.

For more information on DataStage, Information Server, or related products, please contact
Alpine Consulting at 847.605.0788 or info@alpineinc.com.

Alpine Consulting, Inc. Page 5 of 5 Confidential Material

You might also like