Hive Performance Tuning
http://docs.hortonworks.com
Data Access | Contents | ii
Contents
LLAP ports
Setting up LLAP
    Enable YARN preemption
    Enable interactive query
    Set up multiple HiveServer Interactives for high availability
    Configure an llap queue
    Add a Hive proxy
    Configure other LLAP properties
    Configure the HiveServer heap size
    Save LLAP settings and restart services
    Run an interactive query
LLAP ports
You use port 10500 to make the JDBC connection through Beeline to query Hive through the HiveServer Interactive
host. The LLAP daemon uses several other ports.
Best practices
• Set up your cluster to use Apache Tez or the Hive on Tez execution engine.
In HDP 3.x, the MapReduce execution engine is replaced by Tez.
• Disable user impersonation by setting Run as end user to false in Ambari, which is equivalent to setting hive.server2.enable.doAs to false in hive-site.xml.
LLAP caches data for multiple queries, and this capability does not support user impersonation.
• Add the Ranger security service to your cluster and dependent services.
• Set up LLAP to run interactive queries.
• Store data using the ORC File format.
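The impersonation setting in the list above can also be made directly in hive-site.xml; a minimal fragment using the standard Hadoop property syntax:

<property>
  <name>hive.server2.enable.doAs</name>
  <value>false</value>
</property>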
Setting up LLAP
Using Ambari, you can set up basic, low-latency analytical processing (LLAP) by accepting the default llap queue,
changing some YARN queue properties, and adding a HiveServer property to hive-site.xml.
Enable YARN preemption
Procedure
1. In Ambari, select Services > YARN > Configs.
2. In YARN Features, set Pre-emption to Enabled (the default).
Enable interactive query
Procedure
1. In Ambari, select Services > Hive > Configs > Settings.
2. In Interactive Query, set Enable Interactive Query to Yes.
3. In Select HiveServer Interactive Host, accept the default server to host HiveServer Interactive, or select a different host from the drop-down.
If you do not want to set up multiple HiveServer Interactives for high availability, skip the next set of steps, and
proceed to configuring the llap queue.
Set up multiple HiveServer Interactives for high availability
Procedure
1. In Select HiveServer Interactive Host, after selecting one HiveServer2 Interactive host, click + to add another.
2. Accept the default server to host the additional HiveServer Interactive, or from the drop-down, select a different
host.
3. Optionally, repeat these steps to add additional HiveServer Interactives.
Configure an llap queue
Procedure
1. In Ambari, select Hive > Configs.
2. In Interactive Query Queue, choose the llap queue if it appears as a selection, and save the Hive configuration changes.
Depending on your YARN Capacity Scheduler settings, a queue named llap might or might not appear. This
setting dedicates all LLAP daemons and all YARN Application Masters (AMs) of the system to the single,
specified queue.
3. In Ambari, select Services > YARN > Configs.
4. From the Views menu (hamburger icon), select YARN Queue Manager.
5. If an llap queue does not exist, add a queue named llap. Otherwise, proceed to the next step.
6. If the llap queue is stopped, change the state to running.
7. Check the llap queue capacity, and accept or change the value as follows:
• If the llap queue capacity is zero, you might have too few nodes for Ambari to configure llap queue capacity.
Go to the next step to configure llap queue capacity and max capacity.
• If Ambari set the llap capacity to greater than zero, no change is necessary. Skip the next step. For example, in a 7-node cluster, Ambari allocates llap queue capacity automatically.
8. If llap queue capacity is zero, increase the capacity allocated to your llap queue, and also change the max capacity of the default queue to 100 percent minus the capacity you allocated to the llap queue.
For example, if you allocate 50 percent to the llap queue, set the default queue max capacity to 100 percent minus 50 percent = 50 percent.
Add a Hive proxy
Procedure
1. In Ambari, select Services > HDFS > Configs > Advanced.
2. In Custom core-site, add the FQDNs of the HiveServer Interactive host or hosts to the value of
hadoop.proxyuser.hive.hosts.
3. Save the changes.
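If you edit core-site.xml directly instead of using Ambari, the proxy setting takes this form (the host names here are hypothetical placeholders):

<property>
  <name>hadoop.proxyuser.hive.hosts</name>
  <value>hsi-host1.example.com,hsi-host2.example.com</value>
</property>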
Configure other LLAP properties
The LLAP properties you can configure include the following:
Memory per Daemon: YARN container size for each daemon (MB)
In-Memory Cache per Daemon: Size of the cache in each container (MB)
Number of executors per LLAP Daemon: The number of fragments that can execute in parallel on a daemon
Use the slider controls to change or restore settings:
To set a value outside the slider range, move your pointer over the field to enable the hover actions, and select Override.
Procedure
1. Accept or change the Number of Nodes Used By Hive LLAP (num_llap_nodes property). For example, accept using 2 nodes for LLAP.
2. Accept or change the Memory per Daemon setting (hive.llap.daemon.yarn.container.mb).
3. Accept or change the In-Memory Cache per Daemon setting (hive.llap.io.memory.size).
This memory (hive.llap.daemon.yarn.container.mb) plus the cache (hive.llap.io.memory.size) must fit within the container size specified for the YARN container. The YARN container configuration setting appears in Services > YARN > Configs > Settings.
4. Accept the Number of Executors per LLAP Daemon (hive.llap.daemon.num.executors), or change this setting only if you understand the implications, and check that hive.llap.io.threadpool.size is set to the same value.
5. Save any Hive configuration changes, and in Services > YARN > Settings > Memory - Node, check that the
Minimum Container Size (Memory) for YARN is low.
The value should rarely exceed 1024 MB.
6. Set the Maximum Container Size (Memory) to the same value as the Memory Allocated for All YARN Containers
on a Node.
Configure the HiveServer heap size
Procedure
1. In Ambari, go to Services > Hive > Configs.
2. In Optimization, adjust the slider to set the heap size.
For 1 to 20 concurrently executing queries, set the heap size to 6 GB; for 21 to 40 concurrently executing queries, set it to 12 GB.
Save LLAP settings and restart services
Procedure
1. Click Save at the bottom of the wizard.
2. If the Dependent Configurations window appears, review the recommendations and accept or reject them.
3. Navigate to each service, starting with the first one listed under Ambari Services, and restart any services as required.
4. Select Services > Hive > Summary and verify that the single or multiple HiveServer Interactive instances you set
up started.
For example, the following screenshot shows a single HiveServer Interactive.
5. If HiveServer Interactive is stopped, click the link to the stopped HiveServer Interactive instance. In Components,
click Action for the stopped HiveServer2 Interactive, and click Start.
If you set up multiple HiveServer Interactives, after instances start, one is designated Active HiveServer2
Interactive. Others operate in passive mode.
6. In Ambari, select Services > Hive > Summary and in Quick Links, click HiveServer Interactive UI to check that
the LLAP application is running.
7. If LLAP is not running, in Summary, click the HiveServer Interactive link (the active HiveServer Interactive link
in the case of multiple instances), and then, choose Restart LLAP from the Action menu.
Run an interactive query
Procedure
1. On the command-line of a node in the cluster, connect to HiveServer Interactive on port 10500 through Beeline.
For example, enter the following beeline command, but replace my_hiveserver_interactive.com with the FQDN of
your HiveServer Interactive node:
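The command itself did not survive conversion; a typical Beeline invocation for this step looks like the following (the -n user name is an assumption, and the FQDN is a placeholder):

beeline -n hive -u jdbc:hive2://my_hiveserver_interactive.com:10500/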
INSERT INTO TABLE students VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
You probably notice that Hive inserted the data much faster using the LLAP interactive query than using a
conventional Hive query.
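The INSERT statement above assumes a students table exists; a minimal definition, with column types inferred from the inserted values, is:

CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3,2));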
Use HiveServer Interactive UI
Procedure
1. In Ambari, select Services > Hive > Summary, and in Quick Links, click HiveServer Interactive UI.
Connect a JDBC client to LLAP
Procedure
1. In Ambari, select Services > Hive.
2. In Summary, copy the JDBC URL for HiveServer Interactive: Click the clipboard icon.
Configuring YARN queues for Hive
Procedure
1. In Ambari, access the capacity scheduler:
• Select YARN > Configs > Advanced, and in Filter enter yarn.scheduler.capacity.root.
• On the command line of the node where YARN is installed, go to the YARN conf directory, and open the capacity-scheduler.xml file.
2. Define the hive1 and hive2 queues, and allocate 50 percent of cluster capacity to each queue as a hard limit.
For example:
yarn.scheduler.capacity.root.queues=hive1,hive2
yarn.scheduler.capacity.root.hive1.capacity=50
yarn.scheduler.capacity.root.hive2.capacity=50
If the maximum-capacity is set to more than 50 percent, the queue can use more than its capacity when there are
other idle resources in the cluster.
3. Configure usage limits for these queues and their users.
For example:
yarn.scheduler.capacity.root.hive1.maximum-capacity=50
yarn.scheduler.capacity.root.hive2.maximum-capacity=50
yarn.scheduler.capacity.root.hive1.user-limit-factor=1
yarn.scheduler.capacity.root.hive2.user-limit-factor=1
The default value of 1 for user-limit-factor means that any single user in the queue can occupy at most 1X the queue's configured capacity. These settings prevent users in one queue from monopolizing resources across all queues in a cluster.
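If you edit capacity-scheduler.xml directly rather than using Ambari, each setting above is expressed as a property element, for example:

<property>
  <name>yarn.scheduler.capacity.root.hive1.capacity</name>
  <value>50</value>
</property>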
4. From the Ambari dashboard, select ADMIN > Manage Ambari > Views.
5. Click the URL for the view named AUTO_CS_INSTANCE, which is the capacity scheduler view.
7. Enter the queue path, which is the name of the first queue hive1, and then add the hive2 queue.
8. To create the schedule, select the root queue and add hive1 and hive2 at that level.
Procedure
1. In Ambari, access the capacity scheduler:
• Select YARN > Configs > Advanced, and in Filter enter yarn.scheduler.capacity.root.
• On the command line of the node where YARN is installed, go to the YARN conf directory, and open the capacity-scheduler.xml file.
2. Define the custom LLAP queue, specifying a name and other properties, such as capacity and usage limits.
For example:
yarn.scheduler.capacity.root.queues=myllap
yarn.scheduler.capacity.root.myllap.capacity=50
yarn.scheduler.capacity.root.myllap.maximum-capacity=50
yarn.scheduler.capacity.root.myllap.user-limit-factor=1
3. From the Ambari dashboard, select ADMIN > Manage Ambari > Views.
4. Click the URL for the view named AUTO_CS_INSTANCE, which is the capacity scheduler view.
6. Enter the queue path, which is the name of the custom queue, for example, myllap.
7. In Ambari, select Services > Hive > Configs > Settings.
8. In Interactive Query, set Enable Interactive Query to Yes.
9. In Select HiveServer Interactive Host, select the server to host HiveServer Interactive.
For example, select the default highlighted server, which is typically a good choice.
10. In Interactive Query Queue, select the custom YARN queue to replace the default llap queue.
This action dedicates all the LLAP daemons and YARN ApplicationMasters of the system to the single, specified
queue.
Set up multiple HiveServer instances
Procedure
1. In Ambari, select Hosts.
2. Click the name of the host node where you want to create the HiveServer instance.
3. In Summary, in Components, click Add, and select HiveServer.
Do not add HiveServer Interactives for high availability using this procedure.
Related Information
Set up multiple HiveServer Interactives for high availability
Key components of Hive warehouse processing
HiveServer
HiveServer provides Hive service to multiple clients that simultaneously execute queries against Hive using an open
Apache Hive API driver, such as JDBC. For optimal performance, you should use HiveServer to connect your client
application and the Hive enterprise data warehouse (EDW). Using Ambari, you can install, configure, and monitor the
Hive service and HiveServer.
An embedded metastore, which is different from MetastoreDB, runs in HiveServer and performs the following tasks:
• Gets statistics and schema from MetastoreDB
• Compiles queries
• Generates query execution plans
• Submits query execution plans
• Returns query results to the client
Admission control
HiveServer coordinates admission control. Admission control is critical to Hive performance tuning. Admission
control manages queries in a manner similar to how connection pooling manages network connections in RDBMS
databases. When using the Hive LLAP on the Tez engine, you can configure admission control.
Admission control performs the following functions:
• Enables optimal Hive performance when multiple user sessions generate asynchronous threads simultaneously
• Scales concurrent queries to suit system resources and demand
• Postpones processing or cancels other queries if necessary
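One admission-control knob you can tune in hive-site.xml is the number of concurrent Tez sessions per queue, which bounds how many queries run at once (the value shown is illustrative; verify the property against your Hive version):

hive.server2.tez.sessions.per.default.queue=2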
In-memory cache: A cache inside the LLAP daemon that is shared across all users.
YARN
LLAP depends on YARN queues. HiveServer and YARN work together to intelligently queue incoming queries of
a Hive data set. The queuing process tends to minimize the latency of returned results. The LLAP daemons manage
resources across YARN nodes. When you set up LLAP, you enable interactive queries, enable YARN pre-emption,
and choose a YARN queue for LLAP processing.
Tuning Tez
In Ambari, select Services > Tez > Configs to change general or advanced properties using the guidelines in the following table. You can search for a property by entering it in the Filter field.
tez.shuffle-vertex-manager.desired-task-input-size: Increase for large ETL jobs that run too long. No default value is set.
tez.shuffle-vertex-manager.min-task-parallelism: Set a value if reducer counts are too low, even if the tez.shuffle-vertex-manager.min-src-fraction property is already adjusted. No default value is set.
hive.mapjoin.hybridgrace.hashtable: Set to true for slower but safer processing, or false for faster processing. The default is false.
Monitoring Apache Hive performance
On your cluster, an extra HiveServer instance is installed that is dedicated to interactive queries. You can see this
HiveServer instance listed in the Hive Summary page of Ambari.
In the YARN ResourceManager UI, you can see the queue of Hive LLAP daemons or running queries.
The number of Tez ApplicationMasters matches the selected concurrency: if you selected a total concurrency of 5, you see 5 Tez ApplicationMasters. The example shown selects a concurrency of 2.
Maximizing storage resources using ORC
ORC provides the best Hive performance overall. In addition to specifying the storage format, you can also specify a compression algorithm for the table, as shown in the following example:
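The example was lost in conversion; a representative statement, with a hypothetical table, is:

CREATE TABLE orders (id INT, amount DECIMAL(10,2))
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");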
Setting the compression algorithm is usually not required because your Hive settings include a default algorithm.
Using ORC advanced properties, you can create bloom filters for columns frequently used in point lookups.
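For example, a bloom filter on a frequently looked-up column can be declared through an ORC table property (table and column names hypothetical):

CREATE TABLE orders (id INT, customer STRING)
STORED AS ORC
TBLPROPERTIES ("orc.bloom.filter.columns"="customer");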
You can read a table and create a copy in ORC using the following command:
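The command referenced above was lost in conversion; it is typically a CREATE TABLE ... AS SELECT, for example (table names hypothetical):

CREATE TABLE orders_orc STORED AS ORC AS SELECT * FROM orders_text;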
A common practice is to store data in HDFS as text, create a Hive external table over it, and then store the data as
ORC inside Hive where it becomes a Hive-managed table.
Hive supports Parquet and other formats. You can also write your own SerDe (Serializer/Deserializer) to support custom file formats.
Storage layer
While a Hive enterprise data warehouse (EDW) can run on one of a variety of storage layers, HDFS and Amazon S3 are among the most frequently used for data analytics in the Hadoop stack. Amazon S3 is commonly used in public cloud infrastructures.
A Hive EDW can store data on other file systems, including WASB and ADLS.
Depending on your environment, you can tune the file system to optimize Hive performance by configuring
compression format, stripe size, partitions, and buckets.
Improving performance using partitions
To insert data into this table, you specify the partition key for fast loading:
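The table definition and INSERT example were lost in conversion; a representative pattern, with hypothetical names, is:

CREATE TABLE sales (id INT, amount DECIMAL(10,2))
PARTITIONED BY (state STRING) STORED AS ORC;

INSERT INTO TABLE sales PARTITION (state = 'CA')
VALUES (1, 9.99);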
You do not need to specify dynamic partition columns. Hive generates a partition specification if you enable dynamic
partitions.
For bulk-loading data into partitioned ORC tables, you use the following property, which optimizes the performance
of data loading into 10 or more partitions.
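The property referenced above, which also appears later in this document for bucketed tables, is enabled per session as follows:

SET hive.optimize.sort.dynamic.partition=true;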
Follow these best practices when you partition tables and query partitioned tables:
• Never partition on a unique ID.
• Size partitions so that on average they are greater than or equal to 1 GB.
• Formulate a query so that it does not process more than 1000 partitions.
Handling bucketed tables
Performance of ACID V2 tables is on a par with non-ACID tables using buckets. ACID V2 tables are compatible
with native cloud storage.
A common challenge related to using buckets is maintaining query performance while the workload or data scales up or down. For example, you could have an environment that operates smoothly using 16 buckets to support 1000 users, but a spike in the number of users to 100,000 for a day or two creates problems if you do not promptly tune the buckets and partitions. Tuning the buckets is complicated by the fact that after you have constructed a table with buckets, the entire table containing the bucketed data must be reloaded to reduce, add, or eliminate buckets.
With Tez, you only need to deal with the buckets of the biggest table. If workload demands change rapidly, the
buckets of the smaller tables dynamically change to complete table JOINs.
You perform the following tasks related to buckets:
• Enabling bucket pruning in hive-site.xml:
SET hive.tez.bucket.pruning=true
• Bulk-loading tables that are both partitioned and bucketed:
When you load data into tables that are both partitioned and bucketed, set the following property to optimize the
process:
SET hive.optimize.sort.dynamic.partition=true
If you have 20 buckets on user_id data, the following query returns only the data associated with user_id = 1:
SELECT * FROM tab WHERE user_id = 1;
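For this pruning to apply, the tab table would be bucketed on user_id; a sketch of such a definition (column list hypothetical):

CREATE TABLE tab (user_id INT, name STRING)
CLUSTERED BY (user_id) INTO 20 BUCKETS
STORED AS ORC;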
To best leverage the dynamic capability of table buckets on Tez, adopt the following practices:
• Use a single key for the buckets of the largest table.
• Usually, you need to bucket the main table by the biggest dimension table. For example, the sales table might be
bucketed by customer and not by merchandise item or store. However, in this scenario, the sales table is sorted by
item and store.
• Normally, do not bucket and sort on the same column.
A table that has more bucket files than the number of rows is an indication that you should reconsider how the table is
bucketed.
Explain plans
You can generate explain plans by running the EXPLAIN query command. An explain plan shows you the execution
plan of a query by revealing the operations that occur when you run the query. Having a better understanding of the
plan, you might rewrite the query or change Tez configuration parameters.
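For example, to inspect the plan of the bucket-pruning query shown earlier:

EXPLAIN SELECT * FROM tab WHERE user_id = 1;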
Improving performance using the cost-based optimizer
Procedure
1. In Ambari, select Services > Hive > Configs.
2. Enable cost-based optimization if you changed the default: In Filter, enter hive.cbo.enable, and check the
checkbox.
3. Configure automatic gathering of table-level statistics for newly created tables and table partitions if you changed
the default: In Filter, enter hive.stats.autogather, and check the checkbox.
4. Configure Hive to use statistics when generating query plans: In Filter, enter hive.compute.query.using.stats, and
check the checkbox.
5. Restart Hive and any other affected services.
Procedure
1. Launch a hive shell and log in.
2. Gather statistics for the non-partitioned table mytable:
ANALYZE TABLE mytable COMPUTE STATISTICS;
3. Confirm that the hive.stats.autogather property is enabled.
a) In Ambari, select Services > Hive > Configs.
b) In Filter, enter hive.stats.autogather.
4. View table statistics you generated:
DESCRIBE EXTENDED mytable;
5. Gather column statistics for the table:
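The command for this step follows the column-statistics syntax shown in the reference table at the end of this section:

ANALYZE TABLE mytable COMPUTE STATISTICS FOR COLUMNS;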
6. View column statistics for the name column in my_table in the my_db database:
DESCRIBE FORMATTED my_db.my_table name;
Related Information
Apache Hive Wiki language reference
Apache Hive Wiki - Statistics in Hive
ANALYZE TABLE [table_name] COMPUTE STATISTICS;
    Gathers table statistics for non-partitioned tables.
ANALYZE TABLE [table_name] COMPUTE STATISTICS FOR COLUMNS [comma_separated_column_list];
    Gathers column statistics for the entire table.
ANALYZE TABLE [table_name] PARTITION (col1="x") COMPUTE STATISTICS FOR COLUMNS;
    Gathers column statistics for the partition with key x on a table partitioned on col1.
DESCRIBE [EXTENDED] table_name;
    Views table statistics. The EXTENDED keyword can be used only if the hive.stats.autogather property is enabled in the hive-site.xml configuration file.
hive.auto.convert.join.noconditionaltask.size: Small VMs: 1 to 2 GB
hive.tez.java.opts: The -Xmx value must be 80% to 90% of the container size. Auto-tuned: depends on environment.