0930 SqoopPerformanceTuningGuidelines en H2L
0930 SqoopPerformanceTuningGuidelines en H2L
0930 SqoopPerformanceTuningGuidelines en H2L
Guidelines
© Copyright Informatica LLC 2016, 2021. Informatica and the Informatica logo are trademarks or registered
trademarks of Informatica LLC in the United States and many jurisdictions throughout the world. A current list of
Informatica trademarks is available on the web at https://www.informatica.com/trademarks.html
Abstract
When you use Sqoop with Informatica Developer to transfer data between relational databases and Hadoop File
System (HDFS), multiple factors impact the performance. You can optimize the performance by tuning Sqoop
command line arguments, hardware parameters, database parameters, and Informatica mapping parameters. This
article provides guidelines to help you to tune the performance of Sqoop when you transfer data between relational
databases and HDFS.
Supported Versions
• Big Data Management 10.1
Table of Contents
Overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
Performance Tuning Areas. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
Tune the Sqoop Command Line Arguments. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
batch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
boundary-query. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
compress or z. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
direct. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Dsqoop.export.records.per.statement. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
Enable Primary Key . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
fetch-size. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
num-mappers. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
split-by. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Tune the Hardware. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
Tune the Database. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Tune the Mapping. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
Performance Improvement with the Third-Party Teradata TDCH Driver. . . . . . . . . . . . . . . . . . . . . . . . 9
Overview
Sqoop is a Hadoop command line program to process data between relational databases and HDFS through
MapReduce programs. You can use Sqoop to import and export data.
This document describes the key Sqoop command line arguments, hardware, database, and Informatica mapping
parameters that you can tune to optimize the performance of Sqoop. It also includes case studies that illustrate the
impact that the tuning has on the Sqoop performance.
Note: The performance testing results listed in this article are based on observations in an internal Informatica
environment using data from real-world scenarios. The Sqoop performance might vary based on individual
environments and other parameters even when you use the same data.
2
Performance Tuning Areas
You can optimize the performance of Sqoop mappings by tuning the following areas:
• batch
• boundary-query
• compress or z
• direct
• Dsqoop.export.records.per.statement
• Enable primary key
• fetch-size
• num-mapper
• split-by
batch
Specifies that you can group the related SQL statements into a batch when you export data.
boundary-query
Specifies the range of values that you can import. You can use boundary-query if you do not get the desired results by
using the split-by argument alone.
When you configure the boundary-query argument, you must specify the min(id) and max(id) along with the table name.
If you do not configure the argument, Sqoop runs the following query:
select min (<split-by>), max(<split-by>) from <table name>
compress or z
When you configure the compress or z argument, you can compress the data approximately by 60% and reduce the
amount of disk space required in the target. You can configure compression when the target storage is limited.
3
-z or --compress
However, when you use the compress or z argument, the overall execution time increases by 35%. To reduce the
execution time, do not configure the compress or z argument.
Case Study
The following image shows the performance impact of disabling the compress argument:
direct
Specifies the direct import fast path when you import data from Oracle.
4
Case Study
The following image shows the performance impact of tuning the direct argument:
Dsqoop.export.records.per.statement
When you export data, you can configure the batch argument along with the Dsqoop.export.records.per.statement
argument to insert multiple rows with a single statement.
Sqoop runs the following query when you configure the Dsqoop.export.records.per.statement argument.
INSERT INTO table VALUES (...), (...), (...),...;
5
Case Study
The following image shows the performance impact of tuning the -Dsqoop.export.records.per.statement argument.
6
Case Study
The following image shows the performance impact of enabling the primary key constraint:
fetch-size
Specifies the number of entries that Sqoop can import at a time.
Where <n> represents the number of entries that Sqoop must fetch at a time. Default is 1000.
You can increase the value of the fetch-size argument based on the volume of data that you want to read. Set the value
based on the available memory and bandwidth.
7
Case Study
The following image summarizes the impact of tuning the Sqoop fetch-size argument to 10000:
num-mappers
Specifies the number of map tasks that can run in parallel.
Default is 4.
To optimize performance, set the number of map tasks to a value lower than the maximum number of connections that
the database supports.
split-by
Specifies the column name based on which Sqoop must split the work units.
• CPU frequency
• NIC card ring buffer size
8
Tune the Database
To optimize the performance of relational databases, perform the following tasks:
9
The following image shows the performance improvement when you use the Teradata TDCH driver when compared to
the generic Teradata JDBC driver:
Authors
Krishna Prabhakar Devarakonda
Siddiq Hussain
Anu Chandrasekharan
Pranav Sharma
10