Ab Initio - Unix - DB - Concepts & Questions
Using which component can we specify the rate of data movement from input to output?
A. Throttle
What do you call the file that lets several serial files having the
same record format be treated as a single graph component?
Ad hoc multifile
Rollup has more control over data than Aggregate because data is summarized based on
a key (like GROUP BY).
Dynamic DML -
Normally, DML needs to be configured before the graph starts.
Suppose data arriving at different times for processing has different DMLs;
then we can put a flag in the data, read that flag first from the input file
received, and choose the DML accordingly.
Metaprogramming utilities can be used to generate DML, e.g.
dml_field_info_vec.
To parse the DML: read_type, record_info, length_of, etc.
This can also be handled using conditional DML.
The following components break pipeline parallelism: Fuse,
Scan, Interleave, Sort, Sort Within Groups, Rollup, Join (any in-memory
components).
Also remember that whenever there is a phase change in the graph,
pipeline parallelism is broken.
The short answer is that Replicate copies a flow while Broadcast
multiplies it: Broadcast is a partitioner, whereas Replicate is a simple
flow-copy mechanism.
Departition components
Concatenate - maintains sequence.
Interleave - combines blocks of data records from multiple flow partitions
in round-robin fashion.
Merge - combines data records from multiple flow partitions that have
been sorted on the same key, and maintains the sort order.
Gather - combines data records from multiple flow partitions
(MFS) or multiple flows arbitrarily, making the flow serial.
Different levels of parameters (Project, Sandbox, and Graph) and their evaluation
order:
1. The host setup script is run.
2. Common (that is, included) sandbox parameters are evaluated.
3. Sandbox parameters are evaluated.
4. The project-start.ksh script is run.
5. Formal/input parameters are evaluated.
6. Graph parameters are evaluated.
7. The graph Start Script is run.
8. DML evaluation occurs.
You can see the stdenv variables in the stdenv sandbox's .sandbox.pset and
.project.pset. It usually holds all the environment variables, such as the serial
directory, MFS directory, MFS depth, etc.
2. Check in the graphs and other necessary objects (DMLs, XFRs, ...). This puts
the updated objects in the EME.
3. Create a tag (say xyz) for the graphs / DMLs / XFRs that you want to move to the other
EME. Use the command
air tag create xyz /Projects/DEV/ENV/mp/load_dw.mp
5. Load the .save file into the PROD / QA EME using the following command:
air load <name of .save file created in last step> -relocate <Dev Project Path>
<QA Project Path>
6. Now check out the graphs and other files (DMLs, XFRs, ...) from the PROD
environment to your sandbox (if required).
"Continuous flow"
components in Ab Initio use "message queuing" technology.
PUBLISH, MULTIPUBLISH, and SUBSCRIBE are some of the components that come
under this category. With the help of these components, the extract process
is made to cycle over the data, with CDC technology in place, at definite
time intervals. The data is pushed to a queue through the PUBLISH or
MULTIPUBLISH components; one file per run of the extract
accumulates in the queue. The load process, which uses SUBSCRIBE, connects
to the queue to process all the files that exist in the queue. You may
cycle the load process with the same time interval as the
extract. This technique reflects the business in NEAR real time, not in
real time; it is left to you to choose. One last point: continuous
components give good results in the case of near-real-time integration.
Types of queue:
1. MQ
2. Ab Initio queues
3. JMS queues
FTP - Download files in continuous mode: build a dedicated continuous graph that
simply reads from the FTP server using UNIVERSAL SUBSCRIBE and then publishes
into an Ab Initio queue. (Easy to test, maintain, etc.)
Scripting Related
Scenario Based questions can be expected.
How to sort the files in a directory based on name, date, owner, etc.
Based on name: ls sorts by name by default
ls -1 | sort
To sort them in reverse order: ls -1 | sort -r
Based on date: ls -lt (newest first) or ls -lrt (oldest first)
Based on owner: ls -l | sort -k3,3
(sort -k starts a key at POS1, origin 1, and ends it at POS2, default end of line)
Related options:
ls -s, --size: print the allocated size of each file, in blocks
sort -n, --numeric-sort: compare according to string numerical value
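The listing commands above can be checked with a throwaway directory; this is a sketch, and the directory path and file names are made up:

```shell
# Create a small sample directory to demonstrate the sorting variants.
mkdir -p /tmp/sortdemo
touch /tmp/sortdemo/b.txt /tmp/sortdemo/a.txt /tmp/sortdemo/c.txt
ls -1 /tmp/sortdemo | sort        # by name, ascending
ls -1 /tmp/sortdemo | sort -r     # by name, descending
ls -lt /tmp/sortdemo              # by modification time, newest first
ls -lrt /tmp/sortdemo             # by modification time, oldest first
ls -l /tmp/sortdemo | sort -k3,3  # by owner (3rd field of the long listing)
```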
ls -ltr | grep ^d
The caret ^ and the dollar sign $ are metacharacters that respectively match the
empty string at the beginning and end of a line. The first character of each
long-listing line is d for a directory or - for a regular file, so this lists only directories.
cut with a delimiter (-d) displays the whole line even when it doesn't find the
delimiter (e.g. |) in that line; with the -s option it doesn't display any
output for lines in which the specified delimiter is not found.
Select all fields except the specified fields with the --complement option; for
example, display all the fields from the /etc/passwd file except field 7.
To change the output delimiter, use the --output-delimiter option. For example,
the input delimiter can be : (colon) while the output delimiter is # (hash).
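The original commands for these notes were omitted; a likely reconstruction (GNU cut assumed; the sample data is made up rather than a real /etc/passwd):

```shell
# Sample pipe-delimited file: second line has no delimiter at all.
printf 'one|two|three\nno delimiter here\n' > /tmp/cut_demo.txt
cut -d'|' -f1 /tmp/cut_demo.txt       # prints the whole line when no '|' found
cut -d'|' -s -f1 /tmp/cut_demo.txt    # -s suppresses lines without the delimiter
# A single passwd-style record to show --complement and --output-delimiter.
printf 'root:x:0:0:root:/root:/bin/bash\n' > /tmp/passwd_demo.txt
cut -d: --complement -f7 /tmp/passwd_demo.txt                 # all fields except 7
cut -d: -f1,6,7 --output-delimiter='#' /tmp/passwd_demo.txt   # '#' between fields
```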
2>/dev/null
/dev/null is the null device: it takes any input you give it and throws it away.
Redirecting file descriptor 2 (stderr) to it can be used to suppress error output.
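A minimal sketch of the redirection (the directory name is made up so the command fails):

```shell
# Error messages go to stderr (fd 2); redirecting fd 2 to /dev/null hides them.
ls /no/such/dir_xyz 2>/dev/null || true        # error message is discarded
ls /no/such/dir_xyz 2>/tmp/err_demo.log || true  # or capture it in a file instead
cat /tmp/err_demo.log                          # the error text went here
```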
Here the actions in the begin block are performed before processing the file
and the actions in the end block are performed after processing the file. The
rest of the actions are performed while processing the file.
Examples:
Create a file input_file with the following data. This file can be easily created using
the output of ls -l.
From the data, you can observe that this file has rows and columns. The rows
are separated by a newline character and the columns are separated by
space characters. We will use this file as the input for the examples discussed
here.
This prints the sum of the values in the 5th column. In the BEGIN block the
variable sum is initialized to 0. In the main block the value of the 5th
column is added to sum; this addition repeats for every row
processed. When all the rows have been processed, sum holds the
total of the values in the 5th column, and this value is printed in the END block.
#!/usr/bin/awk -f
BEGIN {sum=0}
{sum=sum+$5}
END {print sum}
4. awk '{ if ($9 == "t4") print $0; }' input_file
This awk command checks for the string "t4" in the 9th column and, if it finds a
match, prints the entire line.
This prints the squares of the numbers from 1 to 5. The output of the command
is
square of 1 is 1
square of 2 is 4
square of 3 is 9
square of 4 is 16
square of 5 is 25
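The command that produces the output above was omitted from the notes; a likely form is the classic awk BEGIN-block loop:

```shell
# Loop from 1 to 5 in a BEGIN block; no input file is read at all.
awk 'BEGIN { for (i = 1; i <= 5; i++) print "square of", i, "is", i*i }'
```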
If the fields in the file are separated by any other character, we can use the FS
variable to specify the delimiter.
By default, whenever we print fields using the print statement, the fields are
displayed with the space character as the delimiter. For example:
center 0
center 17
center 26
center 25
center 43
center 48
Setting the OFS (output field separator) variable to ":" makes the same print
statement produce:
center:0
center:17
center:26
center:25
center:43
center:48
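A sketch of the commands behind the two output blocks above (the sample lines are taken from the output; OFS is the standard awk output field separator):

```shell
# Default OFS is a space, so print $1, $2 joins the fields with a space.
printf 'center 0\ncenter 17\n' | awk '{ print $1, $2 }'
# Setting OFS in BEGIN changes the join character to ':'.
printf 'center 0\ncenter 17\n' | awk 'BEGIN { OFS = ":" } { print $1, $2 }'
```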
8. awk '{print NF}' input_file
This will display the number of columns in each row.
>cat file.txt
unix is great os. unix is opensource. unix is free os.
learn operating system.
unixlinux which one you choose.
A simple sed substitution replaces the word "unix" with "linux" in the file.
Here the "s" specifies the substitution operation, the "/" characters are delimiters,
"unix" is the search pattern, and "linux" is the replacement string.
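The substitution described above can be sketched like this, using the sample file contents shown earlier:

```shell
# Recreate the sample file, then substitute unix -> linux.
printf 'unix is great os. unix is opensource. unix is free os.\n' > /tmp/sed_demo.txt
sed 's/unix/linux/' /tmp/sed_demo.txt
```

Note that without a flag, only the first occurrence on each line is replaced.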
vi editor
0   Move to the beginning of the line
$   Move to the end of the line
1G  Move to the first line of the file
G   Move to the last line of the file
nG  Move to the nth line of the file
A numeric flag after the substitution selects which occurrence to replace; a flag of
2 replaces only the second occurrence of the word "unix" with "linux" in a line.
The substitute flag /g (global replacement) tells sed to replace all
occurrences of the string in the line.
If the text being matched (e.g. a URL) contains the delimiter character itself,
you have to escape the slash with a backslash, otherwise the
substitution won't work.
There might be cases where you want to search for a pattern and replace it
by adding some extra characters to it. In such cases & comes in handy: the &
represents the matched string.
You can restrict the sed command to replace the string on a specific line number
only.
You can delete lines from a file by specifying a line number or a range of numbers.
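The techniques above can be sketched as one-liners (sample inputs are made up):

```shell
echo 'unix unix unix' | sed 's/unix/linux/2'   # 2nd occurrence only
echo 'unix unix unix' | sed 's/unix/linux/g'   # all occurrences
echo 'see /usr/bin here' | sed 's/\/usr\/bin/\/opt\/bin/'  # escaped '/' in pattern
echo 'unix' | sed 's/unix/{&}/'                # & inserts the matched string
seq 1 5 | sed '3 s/3/three/'                   # substitute only on line 3
seq 1 5 | sed '2,4d'                           # delete lines 2 through 4
```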
XARGS
The xargs command (by default) reads its input from stdin and executes the
/bin/echo command over that input.
When you type xargs without any argument, it will prompt you to enter
input through stdin:
$ xargs
Hi,
Welcome to TGS.
After you type something, press Ctrl+D; the input is echoed back to you on
stdout as a single line:
Hi, Welcome to TGS.
One of the most important uses of xargs is finding files of a
certain type and performing actions on them (the most popular being
deletion). xargs is very effective when combined with other commands.
$ ls
one.c one.h two.c two.h
$ ls
one.h two.h
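The before/after listings above were likely produced by a find | xargs pipeline; a sketch, with the file names taken from the listing and a made-up demo directory:

```shell
# Recreate the four files, then delete the *.c ones via xargs.
mkdir -p /tmp/xargs_demo
touch /tmp/xargs_demo/one.c /tmp/xargs_demo/one.h \
      /tmp/xargs_demo/two.c /tmp/xargs_demo/two.h
find /tmp/xargs_demo -name '*.c' | xargs rm -f
ls /tmp/xargs_demo    # only one.h and two.h remain
```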
air -version <version number> project export <EME Path Name> -basedir
/tmp/<existing or new sandbox> -files mp/<graph name> -cofiles -find-required-files -gde
air project export <eme path> -basedir <existing or new sandbox> -common <EME
path of common sandbox> <path to where it needs to be checked out>
Database Related
How to delete the least-value key after finding the duplicates? Create a temp table
with the duplicate rows and then keep the minimum-id row.
Duplicated rows have a count greater than one. If you only want to see rows that
are duplicated, you need to use a HAVING clause (not a WHERE clause), like this:
select day, count(*) from test group by day HAVING count(*) > 1;
+------------+----------+
| day        | count(*) |
+------------+----------+
| 2006-10-08 |        2 |
+------------+----------+
create temporary table to_delete (day date not null, min_id int not null);
What are the different partition and de-partition components available? Explain
about them?
If file containing a data how do you find out the data types of each column to create
the DML?
Any knowledge on validate data (or validate records) component?
Have you done performance tuning? And how?
Lookup concept? How did you use in your graph? What precautions have to be
taken when lookup file is multi-file? Usage of lookup and lookup_local functions.
How do you check-in from command prompt? What is the command to check-in the
different sandboxes under one project at a time?
What is the command to see the disk usage of a particular partition in a multi-file?
du - disk usage
du -h -> human-readable format
du -s -> summary (grand total disk usage of a directory)
df - disk free (reports filesystem disk space usage)
df -a -> include all filesystems
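A quick sketch of the du/df answers above (the demo directory is made up):

```shell
# Make a small directory so du has something concrete to measure.
mkdir -p /tmp/du_demo && echo hello > /tmp/du_demo/f.txt
du -sh /tmp/du_demo    # summarized, human-readable usage of the directory
du -h /tmp/du_demo     # per-directory usage
df -h /tmp/du_demo     # free space on the filesystem holding it
```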
For serial and MFS files there are many ways the components can be used.
We can also use multiple LEADING RECORDS components to meet the
requirement.
What is the difference between the rollup and scan components?
What is the m_dump command how do you use it?
If input data is containing the duplicate data which partition component is better to
use? - PBK
Do you know Autosys scheduler and did you work through the GDE or from
command prompt?
What is the difference between “ON-ICE” and “ON-HOLD” status of jobs in Autosys?
What is the command to print duplicate records in a file? sort file | uniq -d
(uniq needs sorted input; -d prints only the duplicated lines)
What is the command to print the 3rd column in a file? How do you replace the 3rd
column value with some other column's value? awk '{print $3}'
How do you perform find & replace in a file? sed
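The three answers above can be sketched as one-liners (sample inputs are made up):

```shell
echo 'a b c d' | awk '{ print $3 }'        # print the 3rd column
echo 'a b c d' | awk '{ $3 = $4; print }'  # replace column 3 with column 4's value
printf 'x\ny\nx\n' | sort | uniq -d        # duplicates only (input must be sorted)
echo 'a b c d' | sed 's/c/X/'              # find & replace with sed
```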
Finding the 2nd highest salary in SQL:
SELECT MAX(Salary) FROM Employee
WHERE Salary NOT IN (SELECT MAX(Salary) FROM Employee);
For the Nth highest salary, the MySQL form (explained below) is:
SELECT Salary FROM Employee
ORDER BY Salary DESC
LIMIT N-1, 1;
(On SQL Server the same idea is written with TOP and a derived table aliased
AS Emp, ordered by Salary.)
select * FROM (
select EmployeeID, Salary
,rank() over (order by Salary DESC) ranking
from Employee
)
WHERE ranking = N;
Note that the DESC used in the query above simply arranges the salaries in
descending order – so from highest salary to lowest. Then, the key part of the query
to pay attention to is the "LIMIT N-1, 1". The LIMIT clause takes two arguments in
that query – the first argument specifies the offset of the first row to return, and the
second specifies the maximum number of rows to return. So, it’s saying that the
offset of the first row to return should be N-1, and the max number of rows to return
is 1. What exactly is the offset? Well, the offset is just a numerical value that
represents the number of rows from the very first row, and since the rows are
arranged in descending order we know that the row at an offset of N-1 will contain
the (N-1)th highest salary.
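The offset idea can be illustrated outside SQL with a sort | sed pipeline (a shell analogue, not the actual query; the salary values are made up):

```shell
# Sort numerically descending, then print the Nth line (here N=2):
# the same "skip N-1 rows, take 1" logic as LIMIT N-1, 1.
printf '300\n100\n500\n200\n' | sort -nr | sed -n '2p'
```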
Indexing:
Why is it needed?
When data is stored on disk based storage devices, it is stored as blocks of data. These blocks are
accessed in their entirety, making them the atomic disk access operation. Disk blocks are structured
in much the same way as linked lists; both contain a section for data, a pointer to the location of the
next node (or block), and both need not be stored contiguously.
Because a set of records can only be sorted on one field, searching
on a field that isn't sorted requires a linear search, which takes N/2 block accesses on average,
where N is the number of blocks that the table spans. If that field is a non-key field (i.e. doesn't
contain unique entries) then the entire table space must be searched, at N block accesses.
With a sorted field, a binary search may be used instead, which takes log2 N block accesses. Also, since
the data is sorted, given a non-key field the rest of the table doesn't need to be searched for duplicate
values once a higher value is found. The performance increase is therefore substantial.
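A worked example of those access counts, for an assumed table of one million blocks:

```shell
# Linear search averages N/2 reads; binary search needs about log2(N).
awk 'BEGIN { n = 1000000; printf "linear=%d binary=%d\n", n/2, int(log(n)/log(2)) + 1 }'
```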
What is indexing?
Indexing is a way of sorting a number of records on multiple fields. Creating an index on a field in a
table creates another data structure which holds the field value, and pointer to the record it relates
to. This index structure is then sorted, allowing Binary Searches to be performed on it.
The downside to indexing is that these indexes require additional space on the disk, since the indexes
are stored together in a table using the MyISAM engine, this file can quickly reach the size limits of
the underlying file system if many fields within the same table are indexed.
Data Definition Language (DDL) - Data definition language (DDL) commands enable you to
perform the following tasks:
Create, alter, and drop schema objects
Data Manipulation Language (DML) - These SQL commands are used for storing, retrieving,
modifying, and deleting data. These commands are SELECT, INSERT, UPDATE, and DELETE
Transaction Control Language (TCL) - Transaction control commands manage changes made by
DML commands. These SQL commands are used for managing changes affecting the data. These
commands are COMMIT, ROLLBACK, and SAVEPOINT.
Data Control Language (DCL) - It is used to create roles, permissions, and referential integrity as
well it is used to control access to database by securing it. These SQL commands are used for
providing security to database objects. These commands are GRANT and REVOKE.
Reason: when you issue DELETE, all the data is first copied into the rollback
tablespace, and then the delete operation is performed. That is why, when you type
ROLLBACK after deleting from a table, you can get the data back (the system restores
it from the rollback tablespace); all of this takes time. TRUNCATE, however,
removes the data directly without copying it into the rollback tablespace,
which is why TRUNCATE is faster. Once you truncate, you cannot get the data back.
Scripts:
$ vi ginfo
#
#
# Script to print information about currently logged-in users, current date & time
#
clear
echo "Hello $USER"
echo "Today is \c ";date
echo "Number of user login : \c" ; who | wc -l
echo "Calendar"
cal
exit 0
You can see system variables by giving command like $ set, some of the important
System variables are:
echo [options] [string, variables...]
Displays text or variables value on screen.
Options
-n Do not output the trailing new line.
-e Enable interpretation of the following backslash escaped characters in the strings:
\a alert (bell)
\b backspace
\c suppress trailing new line
\n new line
\r carriage return
\t horizontal tab
\\ backslash
`...` (back quotes) - execute a command and substitute its output
$ vi sayH
#
#Script to read your name from key-board
#
echo "Your first name please:"
read fname
echo "Hello $fname, Lets be friend!"
Now if you want to print the 1st line to the next 5 lines (i.e. lines 1 to 5), give the command
:1,5 p
:8 s/lerarns/learn/
Breakdown of this substitute command:
:8          Go to line 8 (the address of the line)
s           Substitute
/lerarns/   Target pattern
learn/      If the target pattern is found, substitute the replacement expression (i.e. learn)
:1,$ s/Linux/Unix/
:1,$        Apply the substitution to all lines
g           All occurrences in the line
[^ ]        ^ inside brackets means "not"
/^$/        An empty line: a combination of ^ (start) and $ (end)
To view the entire file without blank lines, print only the non-empty lines, for example:
:v/^$/p
a=5.66
b=8.67
c=`echo $a + $b | bc`
echo "$a + $b = $c"
sed
Syntax: $ sed -n -e Xp -e Yp FILENAME
sed : sed command, which will print all the lines by default.
-n : Suppresses output.
-e CMD : Command to be executed
Xp: Print line number X
Yp: Print line number Y
FILENAME : name of the file to be processed.
The example below displays line numbers 101 - 110 of the /var/log/anaconda.log
file, using an equivalent tail/head pipeline:
$ cat /var/log/anaconda.log | tail -n +101 | head -n 10
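The range can also be expressed directly in sed; a sketch against a generated file (since /var/log/anaconda.log may not exist):

```shell
# Generate 200 numbered lines, then print lines 101-110 two ways.
seq 1 200 > /tmp/lines_demo.txt
sed -n '101,110p' /tmp/lines_demo.txt           # range address form
tail -n +101 /tmp/lines_demo.txt | head -n 10   # equivalent pipeline
```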
http://www.freeos.com/guides/lsst/ch08.html
Q.11. Write a script to determine whether a given file exists or not; the file name
is supplied as a command line argument. Also check for a sufficient number of
command line arguments.
if [ $# -ne 1 ]
then
  echo "Usage - $0 file-name"
  exit 1
fi
if [ -f "$1" ]
then
  echo "$1 file exists"
else
  echo "Sorry, $1 file does not exist"
fi
Q.1. How to write a shell script that adds two numbers supplied as command
line arguments; if the two numbers are not given, show an error and its usage.
Answer: See Q1 shell Script.
Q.2. Write a script to find the biggest of three given numbers. The numbers are supplied
as command line arguments. Print an error if sufficient arguments are not supplied.
Answer: See Q2 shell Script.
Q.4. Write Script, using case statement to perform basic math operation as
follows
+ addition
- subtraction
x multiplication
/ division
The name of script must be 'q4' which works as follows
$ ./q4 20 / 3, Also check for sufficient command line arguments
Answer: See Q4 shell Script.
Q.5.Write Script to see current date, time, username, and current directory
Answer: See Q5 shell Script.
Q.6. Write a script to print a given number in reverse order; e.g. if the number is 123 it must
print 321.
Answer: See Q6 shell Script.
Q.7. Write a script to print the sum of all digits of a given number; e.g. if the number is 123 the sum
of its digits is 1+2+3 = 6.
Answer: See Q7 shell Script.
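The referenced answer scripts are not included here; hedged sketches for Q.6 and Q.7 using only POSIX shell arithmetic, with 123 as the worked example:

```shell
# Q.6: reverse the digits of a number by repeatedly peeling off n % 10.
n=123; rev=0
while [ "$n" -gt 0 ]; do
  rev=$(( rev * 10 + n % 10 ))
  n=$(( n / 10 ))
done
echo "reverse: $rev"      # reverse: 321

# Q.7: sum the digits the same way, accumulating instead of shifting.
m=123; sum=0
while [ "$m" -gt 0 ]; do
  sum=$(( sum + m % 10 ))
  m=$(( m / 10 ))
done
echo "digit sum: $sum"    # digit sum: 6
```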
Q.8.How to perform real number (number with decimal point) calculation in Linux
Answer: Use Linux's bc command
Q.10. How to perform real number calculation in a shell script and store the result in a
third variable, let's say a=5.66, b=8.67, c=a+b?
Answer: See Q10 shell Script.
Q.11.Write script to determine whether given file exist or not, file name is supplied
as command line argument, also check for sufficient number of command line
argument
Answer: See Q11 shell Script.
Q.12. Write a script to determine whether the given command line argument ($1) contains
the "*" symbol or not; if $1 does not contain "*", add it to $1, otherwise show the
message "Symbol is not required". For e.g. if we call this script Q12 then after
giving,
$ Q12 /bin
Here $1 is /bin, it should check whether "*" symbol is present or not if not it should
print Required i.e. /bin/*, and if symbol present then Symbol is not required must
be printed. Test your script as
$ Q12 /bin
$ Q12 /bin/*
Answer: See Q12 shell Script
Q.13. Write a script to print the contents of a file from a given line number through the next
given number of lines. For e.g. if we call this script Q13 and run
$ Q13 5 5 myf
it prints the contents of the 'myf' file from line number 5 through the next 5 lines.
Answer: See Q13 shell Script
Q.14. Write script to implement getopts statement, your script should understand
following command line argument called this script Q14,
Q14 -c -d -m -e
Where options work as
-c clear the screen
-d show list of files in current working directory
-m start mc (midnight commander shell) , if installed
-e { editor } start this { editor } if installed
Answer: See Q14 shell Script
Q.15. Write a script called sayHello and put it into your startup file called
.bash_profile. The script should run as soon as you log on to the system, and it
should print any one of the following messages in an infobox using the dialog utility, if
installed on your system. If the dialog utility is not installed, use an echo statement to print the message:
Good Morning
Good Afternoon
Good Evening , according to system time.
Answer: See Q15 shell Script
Q.16. How to write a script that prints the message "Hello World" in bold and blink
effect, and in different colors like red, brown, etc., using the echo command.
Answer: See Q16 shell Script
Q.17. Write script to implement background process that will continually print
current time in upper right corner of the screen , while user can do his/her normal
job at $ prompt.
Answer: See Q17 shell Script.
Q.18. Write shell script to implement menus using dialog utility. Menu-items and
action according to select menu-item is as follows:
Note: Create function for all action for e.g. To show date/time on screen create
function show_datetime().
Answer: See Q18 shell Script.
Q.20.Write shell script using for loop to print the following patterns on screen
Q.21.Write shell script to convert file names from UPPERCASE to lowercase file
names or vice versa.
Answer: See the rename.awk - awk script and up2sh shell script.
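The referenced rename.awk/up2sh scripts are not included here; the core of Q.21 can be sketched with tr (the file name is a made-up example):

```shell
# Map a file name to lowercase; a rename loop would mv "$f" "$lower".
f="README.TXT"
lower=$(echo "$f" | tr '[:upper:]' '[:lower:]')
echo "$lower"    # readme.txt
```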
TERADATA
Normalisation of DB:
It is a technique of organizing data in a database. It ensures:
- Eliminating redundant (useless) data
- Ensuring data dependencies make sense, i.e. data is logically stored
Without this it will be tough to handle and update the database without facing data loss.
1NF – A database is in first normal form if it satisfies the following conditions:
Student | Age
Adam    | 22
Eve     | 21

Student | Subject
Adam    | Physics
Adam    | Bio
Eve     | Chemistry
In the table above, [Book ID] determines [Genre ID], and [Genre ID] determines
[Genre Type]. Therefore [Book ID] determines [Genre Type] via [Genre ID],
so we have a transitive functional dependency, and this structure does not
satisfy third normal form.
To bring this table to third normal form, we split the table into two as
follows:
- 3NF does not deal satisfactorily with the case of a relation with
overlapping candidate keys
BCNF
BCNF is based on the concept of a determinant.
A determinant is any attribute (simple or composite) on which some other
attribute is fully functionally dependent.
A relation is in BCNF if, and only if, every determinant is a candidate key.
Facts, dimensions
- These tables contain the basic data used to conduct detailed analyses and
derive business value.
- Fact tables contain measurable attributes (how much, how many); the information
within a fact table is typically numeric data.
e.g.: Eid | Salary
Eg: sales happening in retail:
ProductID | CustomerID | Units Sold
cubes,
An OLAP cube is a method of storing data in a multidimensional form, generally for
reporting purposes.
- The cubes divide the data into subsets that are defined by dimensions.
- A cube provides an easy-to-use mechanism for querying data with quick
and uniform response times.
- Cubes are the main objects in online analytic processing (OLAP), a
technology that provides fast access to data in a data warehouse. A cube
is a set of data that is usually constructed from a subset of a data
warehouse and is organized and summarized into a multidimensional
structure defined by a set of dimensions and measures.
- A cube can be stored on a single analysis server and then defined as a linked
cube on other Analysis servers. End users connected to any of these analysis
servers can then access the cube. This arrangement avoids the more costly
alternative of storing and maintaining copies of a cube on multiple analysis
servers. linked cubes can be connected using TCP/IP or HTTP.
Virtual-cubes
These are combinations of one or more real cubes and require no disk space to store them.
They store only the definitions and not the data of the referenced source cubes. They are
similar to views in relational databases.
materialised views
- a materialized view is a database object that contains the results of a
query.
- A materialized view is a pre-computed table comprising aggregated
and/or joined data from fact and possibly dimension tables.
- Builders of data warehouses will know a materialized view as a summary
or aggregation.
- Unlike an ordinary view which is only a stored select statement that runs
if we use the view, a materialized view stores the result set of the select
statement as a container table.
First of all, some definitions are in order. In a star schema, dimensions that reflect a
hierarchy are flattened into a single table. For example, a star schema Geography
dimension would have columns like country, state/province, city, and postal
code. In the source system, this hierarchy would probably be normalized across
multiple tables with one-to-many relationships.
A snowflake schema does not flatten a hierarchy dimension into a single table. It
would, instead, have two or more tables with a one-to-many relationship. This is a
more normalized structure. For example, one table may have state/province and
country columns and a second table would have city and postal code. The table with
city and postal code would have a many-to-one relationship to the table with the
state/province columns.
There are some good reasons for snowflake dimension tables. One example is a
company that has many types of products: some products have a few attributes,
others have many, and the products are very different from each other. The
thing to do here is to create a core Product dimension that has common attributes
for all the products such as product type, manufacturer, brand, product group, etc.
Create a separate sub-dimension table for each distinct group of products where
each group shares common attributes. The sub-product tables must contain a
foreign key of the core Product dimension table.
One of the criticisms of using snowflake dimensions is that it is difficult for some of
the multidimensional front-end presentation tools to generate a query on a
snowflake dimension. However, you can create a view for each combination of the
core product/sub-product dimension tables and give each view a suitably descriptive
name (Frozen Food Product, Hardware Product, etc.), and then these tools will have
no problem.
Top-down: If you use a top-down approach, you will have to analyze global
business needs, plan how to develop a data warehouse, design it, and implement
it as a whole.
In the top-down approach the complete data warehouse is created first and then the datamarts are
derived from the data warehouse.
Disadvantages: high cost estimates;
analyzing and bringing together all relevant sources is a very difficult task;
and since no prototype is delivered in the short term, users cannot check that
the project is useful.
Bottom-up: In the bottom-up approach the datamarts are created first. The datamarts are
then integrated and a comprehensive data warehouse is created. As the datamarts
are created first, business questions can be answered quickly.
SCD type 2
lookup_count() and lookup_next() are the functions used to set the index to the first record in a
group of records (if your lookup has duplicates for a key) and then walk through those
records, using lookup_next() to pick the right one.
Use lookup_count for finding the duplicates and lookup_next for retrieving them:
if (lookup_count(string file_label, [ expression [, expression ...] ]) > 0)
    lookup_next(lookup_identifier_type lookup_id, string lookup_template)
Related functions: lookup_count, lookup_count_local, lookup_local, lookup_next.
You can use lookup_match to quickly determine whether or not a particular key value is
contained in a lookup file. It is faster than lookup and lookup_count, and you should use
it when:
if (is_defined(lookup(FILE, key)))
statement1
else
statement2
if (lookup_match(FILE, key))
statement1
else
statement2
Each record in the Lookup File represents an interval, that is, a range of values.
The lower and upper bounds of the range are usually given in two separate fields
of the same type.
A key field marked interval_bottom holds the lower endpoint of the interval.
A key field marked interval_top holds the upper endpoint.
If a field in the Lookup File's key specifier is marked as interval, it must be the
only key field for that Lookup File. You cannot specify a multipart key as an
interval lookup.
The file must contain well-formed intervals:
o For each record, the value of the lower endpoint of the interval must be
less than or equal to the value of the upper endpoint.
o The intervals must not overlap.
o The intervals must be sorted into ascending order.
To use a lookup file for interval lookups, the modifier for the key field in the
lookup file must be interval, interval_bottom, or interval_top.
For example, in a Lookup File that is an interval lookup, the following DML
function returns the record, if any, for which arg is between the lower and upper
endpoints.
lookup("Lookup_File_name",arg)
By default, the interval endpoints are inclusive, but you can add the modifier
exclusive to specify otherwise. For example, suppose Lookup File
insurance_coverage has the following key specifier:
"{coverage_start interval_bottom exclusive; coverage_end interval_top}"
This identifies the fields coverage_start and coverage_end as the endpoints of
the interval.
The following DML function returns record R if R.coverage_start is less than
arg and less than or equal to R.coverage_end.
lookup("insurance_coverage",arg)
The most reliable way to FTP a file without advance knowledge of its DML is
to give it a DML of:
void(1)
It then becomes a byte stream that Ab Initio does not try to interpret.
XML Split has more control over the DML being generated.
Go for XML SPLIT if you have XSDs of type parent - child - sub-child, etc.;
the component will do everything for you, and all you need to do
is specify the base elements.
To find the size of an MFS file: m_ls will display the number of partitions in a multifile
system. If you want to know how much disk space is used by a multifile, you can use the
following two commands:
m_du: gives you the disk usage of a multifile, clearly showing how
much space is used for each partition on disk.
m_df: reports the free space in each partition of the multifile system.
The Compute Checksum component will give you the total size and the number
of records, but it is a time-consuming process and you need to build a graph for
it.
The file with extension .mfctl contains the URLs of all the data partitions; the file with the
extension .mdir contains the URL of the control file used by MFS.