Datastage Scenarios
The Join stage does not provide reject handling for unmatched records (such as in an Inner Join
scenario). If unmatched rows must be captured or logged, an OUTER join operation must be
performed. In an OUTER join scenario, all rows on an outer link (e.g. Left Outer, Right Outer, or
both links in the case of Full Outer) are output regardless of a match on the key values.
During an Outer Join, when a match does not occur, the Join stage inserts NULL values into the
unmatched columns. Care must be taken to change the column properties to allow NULL values
before the Join. This is most easily done by inserting a Copy stage and mapping a column from
NON-NULLABLE to NULLABLE.
A Filter stage can be used to test for NULL values in unmatched columns.
In some cases, it is simpler to use a Column Generator to add an indicator column, with a
constant value, to each of the outer links and test that column for the constant after you have
performed the join. This is also handy with Lookups that have multiple reference links.
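For comparison, the same null-test idea can be sketched in SQL (the table and column names here are only illustrative, not from the original):
-- Left-side rows with no match on the right: the right-hand key comes back NULL
SELECT l.key_col,
       l.payload_col
FROM   left_table l
LEFT OUTER JOIN right_table r
       ON l.key_col = r.key_col
WHERE  r.key_col IS NULL;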
Answer #2: Take a Sequential File stage and give its output link to a Copy stage. From the Copy
stage give one output link to a Head stage and one output link to a Tail stage. In both the Head
and Tail stages set the number of records per partition to 1, then give their outputs to a Funnel
stage with Funnel Type = Sequence and set the link ordering to control which records are output
first. Give the output link to a Data Set and you will get the output you want.
How can I get the 2nd highest salary in DataStage? Can you send me the approach, thanks.
2) If my source has 2 columns, where the 1st column contains 1,2,3 and the 2nd column contains
10,10,10, I have to get the 2nd column in the target as 20,30,40. How can I do that?
Answer: In the Transformer, derive the second column as
Link.Col1 * Link.Col2 + Link.Col2 ------> Col2
1*10+10 -------------> 20
2*10+10 -------------> 30
3*10+10 -------------> 40
Hope this is fine.
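The same derivation written as a plain SQL sketch (a source table named SRC with columns Col1 and Col2 is assumed here):
-- 1*10+10 = 20, 2*10+10 = 30, 3*10+10 = 40
SELECT Col1,
       Col1 * Col2 + Col2 AS Col2
FROM   SRC;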
Question: my source has a name column with values like ram, sam and raj; I want the names that
occur only once to go to trgt1 and the repeated names to go to trgt2. How can I do this in
DataStage?
Answer #1: src .....> agg ........> t/r ..........> 2 trgs
Take the source as a Sequential File. The Aggregator stage calculates the count of records based
on the name column. The Transformer then applies constraints such as cnt = 1 to send a record to
trg1 (the remaining records go to the second target).
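The same count-based split can be sketched in SQL (a source table named SRC with a name column is assumed here):
-- Candidate rows for trg1: names that occur exactly once
SELECT name
FROM   SRC
GROUP BY name
HAVING COUNT(*) = 1;
-- Candidate rows for trg2: names that occur more than once
SELECT name
FROM   SRC
GROUP BY name
HAVING COUNT(*) > 1;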
Answer #1:
In DataStage:
SRC--->CPY---->JOIN----TFM---TGT
        |        |
        +--AGG---+
In AGG, GROUP BY EmpId and calculate MIN and MAX for each EmpId.
JOIN the two inputs: one copy straight from CPY and the second, aggregated copy from AGG.
In TFM, put the constraint IF MIN = MAX, then populate to TGT, and you will get the required output.
Answer #2:
1. SQL:
SELECT *
FROM  ( SELECT EmpId, COUNT(*) AS CNT1 FROM EMP GROUP BY EmpId ) E1,
      ( SELECT EmpId, COUNT(*) AS CNT2 FROM EMP GROUP BY EmpId, EmpInd ) E2
WHERE E1.EmpId = E2.EmpId
  AND E1.CNT1 = E2.CNT2;
2. DataStage:
SRC--->CPY---->JOIN----TFM---TGT
        |        |
        +--AGG---+
In AGG, GROUP BY EmpId and calculate CNT and SUM.
JOIN the two inputs: one copy straight from CPY and the second, aggregated copy from AGG.
In TFM, put the constraint IF CNT = SUM, then populate to TGT, and you will get the required output.
hi, my source is:
empno,deptno,salary
1,10,3.5
2,20,8
2,10,4.5
1,30,5
3,10,6
3,20,4
1,20,9
then the target should be in the below form:
empno,max(salary),min(salary),deptno
1,9,3.5,20
2,8,4.5,20
3,6,4,10
Can anyone give the data flow in DataStage for the above scenario?
Thanks in advance.
Answer #1:
source -> copy -> 2 aggregators -> join -> target
Aggregator 1: eno, max(sal), min(sal)
Aggregator 2: eno, dno, max(sal)
By using max(sal) as the join key, we can join both aggregator outputs and get the required output.
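A SQL sketch of the same logic, joining the per-empno MIN/MAX back to the detail row that holds the maximum salary in order to pick up its deptno (the table name EMP_SRC is an assumption):
SELECT a.empno,
       a.max_sal,
       a.min_sal,
       e.deptno
FROM  ( SELECT empno,
               MAX(salary) AS max_sal,
               MIN(salary) AS min_sal
        FROM   EMP_SRC
        GROUP BY empno ) a
JOIN   EMP_SRC e
       ON  e.empno  = a.empno
       AND e.salary = a.max_sal;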
Answer #1:
1)
JOB1: SRC---->COPY---->TGT
Sequence: START LOOP Activity ----> JOB1 -----> END LOOP Activity.
In the TGT stage use 'Append' mode.
By looping 100 times, we can get 100 records in the target.
2)
SRC---->Transformer---->TGT
By using a Looping Variable in the Transformer, we can achieve this.
Loop While condition: "@ITERATION <= 100"
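For comparison, the same 100-fold repetition can be expressed in Oracle SQL with a row generator (a source table named SRC is assumed here):
-- Repeat every source row 100 times
SELECT s.*
FROM   SRC s
CROSS JOIN ( SELECT LEVEL AS n
             FROM   dual
             CONNECT BY LEVEL <= 100 );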
Without using the Funnel stage, how can we populate data from different sources into a single
target?
Answer #1:
Hi Kiran,
We can populate data from multiple sources into the target without using a Funnel stage by using
the Sequential File stage.
Let me explain: the Sequential File stage has a property called "File", so first give one file name
and load the data. Then, in the same Sequential File stage, add the "File" property again from the
available properties on the right; it will ask for another file name, so give the other file name (do
not load the data separately). In the same way you can add as many files as you have. Finally,
when you run the job, the data is appended automatically.
Thanks.
I/P
---
ID Value
1  AB
2  ABC
3  ADE
4  A
O/p
---
ID Value
1  A
1  B
2  A
2  B
2  C
3  A
3  D
3  E
4  A
I want a target table with only the columns whose names start with 'A' and 'B', and the remaining
columns as reject output. How can I achieve this in DataStage? Please help me.
Answer: Use the query
select * from tab1, tab2;
Sample input (ID, NAME, SALARY):
1, subhash, 10000
1, subhash, 10000
1, subhash, 10000
2, raju, 20000
2, raju, 20000
3, chandra, 30000
3, chandra, 30000
3, chandra, 30000
Expected output: the same records with a count of occurrences per key (3 for subhash, 2 for raju,
3 for chandra).
Answer #1:
File1, File2 ====> Funnel ----> Copy; from Copy, 1st link ----> AGG and 2nd link ----> JOIN ----> Filter ----> OutputFile
1. Pass the 2 files to a Funnel stage and then to a Copy stage.
2. From the Copy stage, send the 1st link to the AGG stage and the 2nd link to the JOIN stage.
3. In the AGG stage, group by the key columns (say ID, NAME) and take the count, then JOIN back on the key columns.
4. Filter on COUNT > 1 and send the output to OutputFile.
We get the desired output.
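In SQL terms the same duplicate check looks roughly like this (FILE1 and FILE2 stand in for the two input files, with ID and NAME as the key columns as in the answer):
SELECT u.ID, u.NAME, c.CNT
FROM  ( SELECT ID, NAME FROM FILE1
        UNION ALL
        SELECT ID, NAME FROM FILE2 ) u                -- Funnel
JOIN  ( SELECT ID, NAME, COUNT(*) AS CNT
        FROM ( SELECT ID, NAME FROM FILE1
               UNION ALL
               SELECT ID, NAME FROM FILE2 )
        GROUP BY ID, NAME ) c                         -- Aggregator
       ON u.ID = c.ID AND u.NAME = c.NAME             -- Join
WHERE  c.CNT > 1;                                     -- Filter COUNT > 1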
I have a file that contains 2 records with empname,company: Ram,TCS and Ram,IBM. But I want
empname,company1,company2 as Ram,TCS,IBM in the target. How?
Source:
EMPNO EMPNAME
4567  shree
6999  Ram
3265  Venkat
2655  Abhi
3665  Vamsi
5852  Amit
3256  Sagar
3265  Vishnu
Target:
EMPNO EMPNAME
6999  Ram
3265  Venkat
2655  Abhi
3665  Vamsi
5852  Amit
3256  Sagar
I don't want the Shree and Vishnu records. We can fetch this in other ways also, but how can I
write the function in the Transformer stage?
So, the first time your job runs, the first 50 records are loaded into the target and, at the same
time, the input file is overwritten with the remaining records, i.e. 51 to 200.
The second time the job runs, the next 50 records (i.e. 51-100) are loaded into the target and the
input file is overwritten with the remaining records, i.e. 101 to 200.
And so on: 50 records are loaded into the target in each run.
My input has a unique column id with the values 10, 20, 30, ... How can I get the first record in
one output file, the last record in another output file, and the rest of the records in a third output
file?
I have defined a Join stage using a full outer join, and I have explicitly copied the key from both
inputs into the output dataset.
Imagine that I pass the output to a Transformer stage and set some criteria to capture the
unmatched records from either side.
When I start looking for unmatched records in the output, I just can't get them. I have tried the
following methods:
1) IsNull(Link1.Key)
2) Len(Link1.Key) = 0
3) RawLength(Link1.Key) = 0
I have checked the output with Data Set Management, and the key has nothing within it, yet it is
not null.
Would anyone please help by suggesting
1) the best way to look for unmatched records of two inputs, and
2) how to set the criteria so that I can capture those records.
Thanks in advance.
Hi,
First of all, when you use the JOIN stage, you cannot capture any unmatched data in the output
link of the JOIN stage, so no condition in the Transformer will help.
There are 3 ways of horizontally combining data, i.e. JOIN, LOOKUP and MERGE.
Possibilities of capturing unmatched data:
1) JOIN: You cannot capture unmatched data for any kind of join using the JOIN stage, i.e.
neither from the left nor the right input.
2) LOOKUP: The Lookup stage has only 1 primary source and can have N lookup / secondary /
reference inputs. You can capture unmatched primary data in the reject set (1 only) if you specify
the "Reject" option in the Lookup stage settings. You cannot capture unmatched secondary /
lookup data.
3) MERGE: This stage has 1 master source and can have N update / secondary sources. You can
KEEP or DROP the master source data if it does not match, and you can capture all the N
unmatched update / secondary sources in respective N reject files.
Hope this helps to solve your issue.
Regards,
Santhosh S
Orchadmin is a command-line utility provided by DataStage to examine and manage data sets.
1. Before using orchadmin, you should make sure that either the working directory or
$APT_ORCHHOME/etc contains the file config.apt, OR the environment variable
$APT_CONFIG_FILE is defined for your session.
Orchadmin commands
check: Validates the configuration file contents, such as the accessibility of all nodes defined in
the configuration file, the scratch disk definitions, and so on. It throws an error when the config
file is not found or not defined properly.
copy: Makes a complete copy of the source data set under a new destination descriptor file name.
Please note that:
a. You cannot use the UNIX cp command, as it just copies the descriptor file to a new name; the
data is not copied.
b. The new data set will be arranged according to the config file currently in use, not according to
the old config file that was in use with the source.
delete (rm): The UNIX rm utility cannot be used to delete data sets. The orchadmin delete or rm
command should be used to delete one or more persistent data sets.
-f makes a force delete. If some nodes are not accessible, -f deletes the data set partitions from the
accessible nodes and leaves the partitions on the inaccessible nodes as orphans.
-x forces the current config file to be used while deleting, rather than the one stored in the data set.
dump: The dump command is used to dump (extract) the records from the data set. Without any
options, the dump command lists all the records, starting from the first record of the first partition
through the last record of the last partition.
-delim <string>: Uses the given string as the delimiter for fields instead of space.
-field <name>: Lists only the given field instead of all fields.
-name: Lists all the values preceded by the field name and a colon.
-n numrecs: Lists only the given number of records per partition.
-p N: Lists every Nth record from each partition, starting from the first record.
-skip N: Skips the first N records from each partition.
-x: Uses the current system configuration file rather than the one stored in the data set.
truncate: Without options, deletes all the data (i.e. segments) from the data set.
-f: Forces the truncate; it truncates accessible segments and leaves the inaccessible ones.
-x: Uses the current system config file rather than the default one stored in the data set.
-n N: Leaves the first N segments in each partition and truncates the remaining.
scenario
i/p file
col1
a,b,c
o/p
col1
a
b
c
**************
we can do this with field function in transformer stage
************
I have worked on a similar scenario for one of my friends, so I want to explain it for you.
Input:100|aa,cc,bb,dd
200|aa
330|mm,nn
440|aa,cc,dd,ee,ff,gg
Output:440,cc
440,ee
440,gg
100,cc
100,dd
330,nn
200,aa
440,aa
440,dd
440,ff
100,aa
100,bb
330,mm
********************************************
I have developed one shell script, so before running the job we have to run the script.
The script is:
#!/bin/sh
# Empty the two work files
: > C:/temp/temp.txt
: > C:/temp/temp1.txt
# Append the number of extra comma-separated values to each input line
for line in `cat C:/temp/pivot.txt`
do
VAR=`echo $line|awk -F"," '{printf"%s|%s\n",$0,NF-1}'`
echo $VAR >> C:/temp/temp.txt
done
# Highest count across all lines
MAX=`cat C:/temp/temp.txt|awk -F"|" '{print $3}'|sort -r|head -1`
# Pad every line with trailing commas so all lines have the same number of fields
for line in `cat C:/temp/temp.txt`
do
VER=`echo $line|awk -F"|" '{print $3}'`
MAX1=`expr $MAX - $VER`
line1=$line
while [ $MAX1 -gt 0 ]
do
line1=`echo $line1|awk -F"|" '{printf"%s|%s\n",$1,$2}'|sed 's/$/,/g'`
MAX1=`expr $MAX1 - 1`
done
if [ $MAX1 -eq 0 ]
then
line1=`echo $line1|awk -F"|" '{printf"%s|%s\n",$1,$2}'`
fi
echo $line1 >> C:/temp/temp1.txt
done
********************************************************************
Now read temp1.txt as the source in DataStage.
Job design: Seq File --------> Transformer -----> Pivot stage ----> Filter -----> Target Seq File.
Read temp1.txt in the Sequential File stage.
In the Transformer, parse the columns using the Field function.
Use the Pivot stage to pivot the columns into rows.
Filter out the null records.
Pass the output to the target Sequential File.
This is my approach; if anyone has a better approach, please share your idea.
*************
seq -> transformer -> pivot -> target
In the Transformer create three columns col1, col2, col3 using the substring operator:
col1 = colname[1,1]
col2 = colname[3,1]
col3 = colname[5,1]
Then in the Pivot stage output these three columns as rows.
Input data
col|col1
1|a
1|d
1|r
2|g
3|h
3|g
4|e
Output data
col|col1|count
1|a|1
1|d|2
1|r|3
2|g|1
3|h|1
3|g|2
4|e|1
CREATE TABLE SALES
(
SALE_ID    INTEGER,
PRODUCT_ID INTEGER,
YEAR       INTEGER,
QUANTITY   INTEGER,
PRICE      INTEGER
);
INSERT INTO SALES VALUES ( 1, 100, 2008, 10, 5000);
INSERT INTO SALES VALUES ( 2, 100, 2009, 12, 5000);
INSERT INTO SALES VALUES ( 3, 100, 2010, 25, 5000);
INSERT INTO SALES VALUES ( 4, 100, 2011, 16, 5000);
INSERT INTO SALES VALUES ( 5, 100, 2012,  8, 5000);
INSERT INTO SALES VALUES ( 6, 200, 2010, 10, 9000);
INSERT INTO SALES VALUES ( 7, 200, 2011, 15, 9000);
INSERT INTO SALES VALUES ( 8, 200, 2012, 20, 9000);
INSERT INTO SALES VALUES ( 9, 200, 2008, 13, 9000);
INSERT INTO SALES VALUES (10, 200, 2009, 14, 9000);
INSERT INTO SALES VALUES (11, 300, 2010, 20, 7000);
INSERT INTO SALES VALUES (12, 300, 2011, 18, 7000);
INSERT INTO SALES VALUES (13, 300, 2012, 20, 7000);
INSERT INTO SALES VALUES (14, 300, 2008, 17, 7000);
INSERT INTO SALES VALUES (15, 300, 2009, 19, 7000);
COMMIT;
SELECT SALE_ID,
       PRODUCT_ID,
       YEAR,
       QUANTITY,
       PRICE,
       COUNT(1) OVER (PARTITION BY YEAR) CNT
FROM SALES;

SALE_ID PRODUCT_ID YEAR QUANTITY PRICE CNT
------------------------------------------
      9        200 2008       13  9000   3
      1        100 2008       10  5000   3
     14        300 2008       17  7000   3
     15        300 2009       19  7000   3
      2        100 2009       12  5000   3
     10        200 2009       14  9000   3
     11        300 2010       20  7000   3
      6        200 2010       10  9000   3
      3        100 2010       25  5000   3
     12        300 2011       18  7000   3
      4        100 2011       16  5000   3
      7        200 2011       15  9000   3
     13        300 2012       20  7000   3
      5        100 2012        8  5000   3
      8        200 2012       20  9000   3
From the outputs, you can observe that the aggregate functions return only one row per group,
whereas analytic functions keep all the rows in the group. With aggregate functions, the select
clause can contain only the columns specified in the group by clause and the aggregate functions,
whereas with analytic functions you can specify all the columns in the table.
The PARTITION BY clause is similar to the GROUP BY clause; it specifies the window of rows
that the analytic function should operate on.
I hope you got some basic idea about aggregate and analytic functions. Now let's start with
solving the interview questions on Oracle analytic functions.
1. Write a SQL query using the analytic function to find the total sales(QUANTITY) of
each product?
Solution:
SUM analytic function can be used to find the total sales. The SQL query is
SELECT PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER( PARTITION BY PRODUCT_ID ) TOT_SALES
FROM SALES;
PRODUCT_ID QUANTITY TOT_SALES
-----------------------------
       100       12        71
       100       10        71
       100       25        71
       100       16        71
       100        8        71
       200       15        72
       200       10        72
       200       20        72
       200       14        72
       200       13        72
       300       20        94
       300       18        94
       300       17        94
       300       20        94
       300       19        94
The ORDER BY clause is used to sort the data. Here the ROWS UNBOUNDED PRECEDING
option specifies that the SUM analytic function should operate on the current row and the
previously processed rows.
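The query for question 2 is not shown above; a cumulative-sum query matching this description would look roughly like the following (the ORDER BY column is an assumption):
SELECT PRODUCT_ID,
       QUANTITY,
       SUM(QUANTITY) OVER(
            PARTITION BY PRODUCT_ID
            ORDER BY QUANTITY ASC
            ROWS UNBOUNDED PRECEDING) CUM_SALES
FROM SALES;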
3. Write a SQL query to find the sum of sales of current row and previous 2 rows in a
product group? Sort the data on sales and then find the sum.
Solution:
The SQL query for the required output is
SELECT PRODUCT_ID,
QUANTITY,
SUM(QUANTITY) OVER(
PARTITION BY PRODUCT_ID
ORDER BY QUANTITY DESC
ROWS BETWEEN 2 PRECEDING AND CURRENT ROW) CALC_SALES
FROM SALES;
PRODUCT_ID QUANTITY CALC_SALES
------------------------------
       100       25         25
       100       16         41
       100       12         53
       100       10         38
       100        8         30
       200       20         20
       200       15         35
       200       14         49
       200       13         42
       200       10         37
       300       20         20
       300       20         40
       300       19         59
       300       18         57
       300       17         54
The ROWS BETWEEN clause specifies the range of rows to consider for calculating
the SUM.
4. Write a SQL query to find the Median of sales of a product?
Solution:
The SQL query for calculating the median is
SELECT PRODUCT_ID,
QUANTITY,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY QUANTITY ASC)
OVER (PARTITION BY PRODUCT_ID) MEDIAN_SALES
FROM SALES;
5. Write a SQL query to find the minimum sales of a product without using the
group by clause.
Solution:
The SQL query is
SELECT
PRODUCT_ID,
YEAR,
QUANTITY
FROM
(
SELECT PRODUCT_ID,
YEAR,
QUANTITY,
ROW_NUMBER() OVER(PARTITION BY PRODUCT_ID
ORDER BY QUANTITY ASC) MIN_SALE_RANK
FROM
SALES
) WHERE MIN_SALE_RANK = 1;
PRODUCT_ID YEAR QUANTITY
------------------------
       100 2012        8
       200 2010       10
       300 2008       17
Oracle analytic functions compute an aggregate value based on a group of rows. They open up a
whole new way of looking at the data. This article explains how we can unleash their full
potential.
Analytic functions differ from aggregate functions in the sense that they return multiple rows for
each group. The group of rows is called a window and is defined by the analytic clause. For each
row, a sliding window of rows is defined. The window determines the range of rows used to
perform the calculations for the current row.
Oracle provides many Analytic Functions such as
AVG, CORR, COVAR_POP, COVAR_SAMP, COUNT, CUME_DIST, DENSE_RANK, FIRST,
FIRST_VALUE, LAG, LAST, LAST_VALUE, LEAD, MAX, MIN, NTILE,
PERCENT_RANK, PERCENTILE_CONT, PERCENTILE_DISC, RANK,
RATIO_TO_REPORT, STDDEV, STDDEV_POP, STDDEV_SAMP, SUM, VAR_POP,
VAR_SAMP, VARIANCE.
The Syntax of analytic functions:
Analytic-Function(Column1,Column2,...)
OVER (
[Query-Partition-Clause]
[Order-By-Clause]
[Windowing-Clause]
)
The partition clause makes the SUM(sal) be computed within each department, independent of
the other groups; the SUM(sal) is 'reset' as the department changes. The ORDER BY ENAME
clause sorts the data within each department by ENAME.
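The query this description refers to is not shown above; it was presumably a running departmental total of this form (using the standard EMP table):
SELECT deptno, ename, sal,
       SUM(sal) OVER ( PARTITION BY deptno
                       ORDER BY ename ) AS running_total
FROM   emp
ORDER BY deptno, ename;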
1. Query-Partition-Clause
The PARTITION BY clause logically breaks a single result set into N groups, according
to the criteria set by the partition expressions. The analytic functions are applied to each
group independently, they are reset for each group.
2. Order-By-Clause
The ORDER BY clause specifies how the data is sorted within each group (partition).
This will definitely affect the output of the analytic function.
3. Windowing-Clause
The windowing clause gives us a way to define a sliding or anchored window of data, on
which the analytic function will operate, within a group. This clause can be used to have
the analytic function compute its value based on any arbitrary sliding or anchored
window within a group. The default window is an anchored window that simply starts at
the first row of a group and continues to the current row.
Let's look at an example with a sliding window within a group and compute the sum of the current
row's salary column plus the previous 2 rows in that group, i.e. a ROWS window clause:
SELECT deptno, ename, sal,
SUM(sal)
OVER ( PARTITION BY deptno
ORDER BY ename
ROWS 2 PRECEDING ) AS Sliding_Total
FROM emp
ORDER BY deptno, ename;
Now if we look at the Sliding Total value of SMITH it is simply SMITH's salary plus the salary
of two preceding rows in the window. [800+3000+2975 = 6775]
We can set up windows based on two criteria: RANGES of data values or ROWS offset from
the current row. It can be said that the existence of an ORDER BY in an analytic function will
add a default window clause of RANGE UNBOUNDED PRECEDING, which says to get all rows
in our partition that came before us as specified by the ORDER BY clause.
Ranking the rows with ROW_NUMBER() OVER (PARTITION BY deptno ORDER BY sal DESC)
gives us the employee name and salary with ranks based on the descending order of salary within
each department (the partition/group). Now, to get the top 3 highest paid employees for each dept:
SELECT * FROM (
SELECT deptno, ename, sal, ROW_NUMBER()
OVER (
PARTITION BY deptno ORDER BY sal DESC
) Rnk FROM emp
) WHERE Rnk <= 3;
The WHERE clause on the outer query is used to get just the first three rows (Rnk <= 3) in each partition.