
Analyze All Your Data with Oracle Big Data SQL

Overview

Purpose

This tutorial illustrates how you can securely analyze data across the big data platform - whether that data resides in Oracle Database 12c, Hadoop,
Oracle NoSQL Database or a combination of these sources. Importantly, you will be able to leverage your existing Oracle skill sets and applications
to gain these insights. Oracle Big Data SQL allows you to apply Oracle's rich SQL dialect and security policies across the data platform - greatly
simplifying the ability to gain insights from all your data.

Note, there are two parts to Big Data SQL: 1) enhanced Oracle Database 12c external tables and 2) Oracle Big Data SQL Servers on the Oracle Big
Data Appliance or DIY Hadoop Clusters (see Big Data SQL Datasheet for supported deployments). On the Hadoop cluster, Big Data SQL Cell
Servers apply SmartScan over data stored in HDFS in order to achieve fast performance (see this blog post for details). The Oracle Big Data Lite
Virtual Machine used for this lab does not have Big Data SQL Cell Server installed.

Time to Complete

Approximately 90 mins

Prerequisites

This tutorial requires the Oracle Big Data Lite Virtual Machine (VM). You can download the VM from the Big Data Lite page on OTN.

Before starting this lesson, perform the following:

1. After starting Big Data Lite, ensure the following services are started by using the Start/Stop Services application found on the Linux
desktop:
ORCL
Zookeeper
HDFS
Hive
NoSQL
YARN

Note the started services have an * next to their name:

2. Using the right-mouse menu, save the following two files to a directory on the machine where SQL Developer is installed:
bigdatasql_hol_otn_setup.sql and bigdatasql_hol.sql. Remember this location - you will open these files in SQL Developer in just a minute!
NOTE: Use the 'Save Link As' option in the menu.
3. Launch SQL Developer from the Desktop Toolbar menu, as shown here:

4. In SQL Developer, open both files.


5. Select the bigdatasql_hol_otn_setup.sql script, and then click the Run Script tool in the SQL Developer, as shown below. When prompted
for a connection, select the moviedemo connection and click OK. This will complete the setup for this tutorial.

6. Close the bigdatasql_hol_otn_setup.sql script.


7. Next, in the bigdatasql_hol.sql script, multi-select the drop statements at the top of the script and click the Run Statement tool, as shown
here:

Note: Ignore any errors generated by these statements.

8. Leave the bigdatasql_hol.sql script open in SQL Developer, as it contains all of the code examples that are referenced in this tutorial.

Introduction

This tutorial is divided into the following sections:

a. Review the Scenario


b. Configuring Oracle Big Data SQL
c. Create Oracle Tables Over an HDFS Sourced Application Log
d. Leverage the Hive Metastore to Access Data in Hadoop and Oracle NoSQL Database
e. Applying Oracle Database Security Policies Across the Big Data Platform
f. Using Oracle Analytic SQL Across All Your Data
g. Using SQL Pattern Matching on your web log data

Scenario

Oracle MoviePlex is an online movie streaming company. Every user that accesses the site is presented with his/her own movie recommendations
based on past viewing activity. This list of recommended movies is updated frequently and is part of the user's profile. Oracle NoSQL Database
stores these profiles - delivering application query requests with very low latency for large user communities. Additionally, the web site collects
every customer interaction in massive JSON formatted log files. By unlocking the information contained in that activity data and combining it with
recommendation data and enterprise data in its data warehouse, the company will be able to enrich its understanding of customer behavior, the
effectiveness of product offers, the organization of web site content, and more.

The company is using Oracle's Big Data Management System to unify their data platform and facilitate these analyses.


Oracle Big Data Management System unifies the data platform by providing a common query language, management platform and security
framework across Hadoop, NoSQL and Oracle Database. Oracle Big Data SQL is a key component of the platform. It enables Oracle Database
12c to seamlessly query data in Hadoop and NoSQL using Oracle's rich SQL dialect. Data stored in Hadoop or Oracle NoSQL Database is queried
in exactly the same way as all other data in Oracle Database. This means that users can begin to gain insights from these new sources using their
existing skill sets and applications.

For Oracle MoviePlex, every click on its web site is streamed into HDFS. After the data lands in HDFS, it is immediately accessible to Oracle
Database users through Oracle Big Data SQL. In addition, the recommendation data in Oracle NoSQL Database is also accessible through Oracle Big
Data SQL. In this hands-on lab, you will learn how to combine the "click data" stored in HDFS with recommendation data in NoSQL Database and
revenue data in Oracle Database to better understand the shopping and purchasing patterns of customers visiting the site.

Let's begin the tutorial by reviewing how access to the BDA is configured in Oracle Database.

Part 1 - Configuring Oracle Big Data SQL

In this section, you learn how to configure Oracle Big Data SQL. This configuration process enables Oracle Database 12c to query data in Hadoop
or Oracle NoSQL Database.

Configuration Tasks
As mentioned in the overview, this VM uses enhanced external tables in Oracle Database 12c to access data in HDFS and Oracle NoSQL
Database. It does not have the Big Data SQL Server Cells installed. In a true Oracle Big Data environment, there are two installation tasks:

1. Install Oracle Big Data SQL on the Hadoop cluster. This step sets up Big Data SQL Server Cells on each node - enabling
SmartScan on local data.
2. For Oracle Database 12c, run the Big Data SQL installation script on each Oracle database node. This step sets up connectivity from
Oracle Database to the Big Data SQL Server Cells on the Hadoop Cluster. It also includes installing a Hadoop client, configuration
directories and files, Big Data SQL Agent, Oracle directory objects and more.

Let's review some of the important elements that are produced in the second configuration task.

Review the Common and Cluster Directories

Two file system directories -- the Common and Cluster directories -- are set up in the Oracle Database home. These directories store
configuration files that enable the Exadata Server to connect to the BDA. A short description of each follows.

Common Directory

The Common directory contains a few subdirectories and an important file, named bigdata.properties. This file stores
configuration information that is common to all BDA clusters. Specifically, it contains property-value pairs used to configure the
JVM and identify a default cluster.
The bigdata.properties file must be accessible to the operating system user under which the Oracle Database runs.

Cluster Directory

The Cluster directory contains configuration files required to connect to a specific BDA cluster.
In addition, the Cluster directory must be a subdirectory of the Common directory - and the name of the directory is important: It
is the name that you will use to identify the cluster. This will be described in more detail later.

First, let's review the Common Directory's bigdata.properties file:

1. Launch a Terminal window using the Desktop toolbar. (SQL Developer should also be open.)


2. In the Terminal window, change to the Common directory location, and then view the contents of the
bigdata.properties file. Enter the following commands at the prompt:

cd /u01/bigdatasql_config/
cat bigdata.properties

Result: The output of the commands will look similar to the following:

Notes:

The properties, which are not specific to a hadoop cluster, include items such as the location of the Java VM,
classpaths and the LD_LIBRARY_PATH.
In addition, the last line of the file specifies the default cluster property - in this case bigdatalite.
As you will see later, the default cluster simplifies the definition of Oracle tables that are accessing data in
Hadoop.
In our hands-on lab, there is a single cluster: bigdatalite. The bigdatalite subdirectory contains the
configuration files for the bigdatalite cluster.
The name of the cluster must match the name of the subdirectory (and it is case sensitive!).

Next, let's review the contents of the Cluster Directory.

3. Using the Terminal window, change to the Cluster directory and view its contents by executing the following
commands at the prompt:

cd /u01/bigdatasql_config/bigdatalite
ls

Result: The output of the commands above will look similar to the following:

Notes:

These are the files required to connect Oracle Database to HDFS and to Hive.
Each cluster will have its own directory - with configuration files specific to that cluster.

Oracle directory objects that correspond to these file system directories are created by the install process.

Review Oracle Directory Objects



As previously shown, the configuration files have been saved to the file system. The installation process creates corresponding
Oracle directory objects that point to these folders.
The Oracle directory objects have a specific naming convention:

ORACLE_BIGDATA_CONFIG : the Oracle directory object that references the Common Directory
ORACLE_BIGDATA_CL_bigdatalite : the Oracle directory object that references the Cluster Directory. The naming
convention for this directory is as follows:
Cluster Directory name begins with ORACLE_BIGDATA_CL_
Followed by the cluster name (i.e. "bigdatalite"). This name is case sensitive (so don't forget quotes for lowercase
names!) and is limited to 15 characters.
Must match the physical directory name in the file system (repeat: it's case sensitive!).

Review these Oracle directory objects:

1. In SQL Developer, using the bigdatasql_hol script file, execute the following statement:
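The statement itself appears only as a screenshot in the original page. A query along the following lines lists the Big Data SQL directory objects; it relies on the standard ALL_DIRECTORIES dictionary view rather than the exact statement from the script, so treat it as an illustrative sketch:

-- Illustrative only: list the directory objects created by the Big Data SQL install
SELECT directory_name, directory_path
FROM all_directories
WHERE directory_name LIKE 'ORA%BIGDATA%';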

Notes:

In SQL Developer, use the Run Statement tool (shown above) to run one or more selected
statements.
The directory object is case sensitive. In our example, the bigdatalite cluster is lower case and was
created by the install script using the following command:

create or replace directory "ORA_BIGDATA_CL_bigdatalite" as '';

Notice that there is no location specified for the Cluster Directory. It is expected that the directory will:

Be a subdirectory of ORACLE_BIGDATA_CONFIG
Use the cluster name as identified by the Oracle directory object.

Review Oracle Big Data SQL Agent

In addition to creating the Oracle directory objects, the install script also creates the Big Data SQL Agent:

This multi-threaded agent bridges the metadata between Oracle Database and Hadoop. It launches a
single JVM - instead of one for every process (which can be quite slow).
If the MTA were not already set up, you would use the following commands to create it:

create public database link BDSQL$_bigdatalite using 'extproc_connection_data';
create public database link BDSQL$_DEFAULT_CLUSTER using 'extproc_connection_data';
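To confirm that these database links already exist in your environment, you can query the standard ALL_DB_LINKS view. This is a generic check, not a step from the original lab:

-- Illustrative only: verify the Big Data SQL database links
SELECT db_link
FROM all_db_links
WHERE db_link LIKE 'BDSQL$%';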

Now that we have reviewed the configuration, let's create Oracle tables that access data in HDFS and Oracle NoSQL Database!

Part 2 - Create Oracle Table Over Application Log

In this section, you will create an Oracle table over data stored in HDFS and then query that data. This example will use the ORACLE_HDFS driver; it
will not leverage metadata stored in the Hive Catalog.

Review Application Log Stored in HDFS


The movie application streamed data into HDFS - specifically into the following directory: /user/oracle/moviework/applog_json. Let's
review that log data:

1. Open a terminal window.

2. Execute the following command to review the log file stored in HDFS:

hadoop fs -ls /user/oracle/moviework/applog_json



Result: You should see the following output:

3. Now view the contents of the file by executing the following command:

hadoop fs -tail /user/oracle/moviework/applog_json/movieapp_log_json.log

Result: You should see the following output:

Notice the file contains every click that has taken place on the web site. The JSON log captures the following information about each
interaction:
custid: the customer accessing the site
movieid: the movie that the user clicked on
genreid: the genre that the movie belongs to
time: when the activity occurred
recommended: did the customer click on a recommended movie?
activity: a code for the various activities that can take place, including log in/out, view a movie, purchase a movie, show
movie listings, etc.
price: the price of a purchased movie

Create Oracle Table Over Application Log


Now that you have reviewed the source data, create an Oracle table over the file. This table will be very simple: a single column where each
record contains a JSON document. You will then use Oracle SQL to easily parse the JSON fields found in each document:

1. Go to the SQL Worksheet in SQL Developer and execute the following SQL statement (Note: These statements are all in the
bigdatasql_hol.sql script):

CREATE TABLE movielog
  (click VARCHAR2(4000))
ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   LOCATION ('/user/oracle/moviework/applog_json/')
  )
REJECT LIMIT UNLIMITED;

Notice in the code above that Oracle external tables have been enhanced to natively understand data stored on the BDA.
Specifically, the following attributes are leveraged:

Access driver ORACLE_HDFS indicates that the data is stored in HDFS.
LOCATION identifies the HDFS directory (or files/directories) that contains the source data for the table.
The DEFAULT DIRECTORY contains log files that are generated by the external table (if logging is enabled).
The REJECT LIMIT applies to each parallel query slave that is executing the query.

2. Execute the following command to review the data in the table movielog

SELECT * FROM movielog WHERE rownum < 20;

Result: The output will look similar to our previous tail statement. A record is returned for each JSON document.


3. There are numerous options that can be applied to the external table that impact how the data is queried and processed. Let's take a
look at a couple of these options. Create the table movielog_plus by using the following DDL command:

CREATE TABLE movielog_plus
  (click VARCHAR2(40))
ORGANIZATION EXTERNAL
  (TYPE ORACLE_HDFS
   DEFAULT DIRECTORY DEFAULT_DIR
   ACCESS PARAMETERS (
     com.oracle.bigdata.cluster=bigdatalite
     com.oracle.bigdata.overflow={"action":"truncate"}
   )
   LOCATION ('/user/oracle/moviework/applog_json/')
  )
REJECT LIMIT UNLIMITED;

Notice in the code above:

First, the click column has been changed to a VARCHAR2(40). Clearly, this is going to be a problem; the length of a JSON
document exceeds that size. There are numerous ways to handle this situation, including:
Generate an error and then either reject the record, set its value to null or replace it with an alternate value.
Simply truncate the data. Here, we are truncating the data. And, we have applied this truncate action to all columns
in the table; you can also specify the individual column(s) to truncate.
Second, a cluster bigdatalite has been specified. This cluster will be used instead of the default (which in this case
happens to be the same). Currently a given session may only connect to a single cluster.

4. Execute the following command to review the data in the table movielog_plus:

SELECT * FROM movielog_plus WHERE rownum < 20;

Note: Each JSON document is truncated based on the size of the Oracle table column (40 characters). In practice, truncating a
JSON document is not very useful, but this example illustrates the point.

5. Oracle Database 12c (12.1.0.2) includes native JSON support. This allows queries to easily extract attribute data from JSON
documents. Run the following query in SQL Developer:

SELECT m.click.custid, m.click.movieid, m.click.genreid, m.click.time
FROM movielog m
WHERE rownum < 20;

Result: The query output looks like this:

Notes:

The column specification in the select list is a full path to the JSON attribute.
The specification starts with the table alias ("m" - note: this is required!), followed by the column name ("click"), and then a
case sensitive JSON path (e.g. "genreId").
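As a side note, the same attributes can also be extracted with the SQL/JSON functions that arrived with the native JSON support in 12.1.0.2. The sketch below is an alternative to the dot-notation query above; it is illustrative only and not part of the original lab script:

-- Illustrative alternative using JSON_VALUE (returns VARCHAR2 by default)
SELECT JSON_VALUE(click, '$.custid')  AS custid,
       JSON_VALUE(click, '$.movieid') AS movieid,
       JSON_VALUE(click, '$.genreid') AS genreid,
       JSON_VALUE(click, '$.time')    AS time
FROM movielog
WHERE rownum < 20;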

6. One of the key strengths of Oracle Big Data SQL is the ability to answer questions that combine data from Oracle Database and
Hadoop. Combine the "click" data with data sourced from the movie dimension table, by executing the following command:

SELECT f.click.custid, m.title, m.year, m.gross, f.click.rating
FROM movielog f, movie m
WHERE f.click.movieid = m.movie_id
AND f.click.rating > 4;

Result: The query results will look similar in structure to the following (your record output may be in a different order):

Note: The output above enables us to see how a given customer's ratings on the web site compared to the movies' gross revenues.

7. Execute the following command to create a view to simplify queries against the JSON data. This view will also be useful in
subsequent exercises when security policies are applied to the table:

CREATE OR REPLACE VIEW movielog_v AS
SELECT
  CAST(m.click.custid AS NUMBER) custid,
  CAST(m.click.movieid AS NUMBER) movieid,
  CAST(m.click.activity AS NUMBER) activity,
  CAST(m.click.genreid AS NUMBER) genreid,
  CAST(m.click.recommended AS VARCHAR2(1)) recommended,
  CAST(m.click.time AS VARCHAR2(20)) time,
  CAST(m.click.rating AS NUMBER) rating,
  CAST(m.click.price AS NUMBER) price
FROM movielog m;

8. Now, execute the following command to find how Oracle MoviePlex average ratings compare to top 10 grossing movies:

SELECT m.title, m.year, m.gross, round(avg(f.rating), 1)
FROM movielog_v f, movie m
WHERE f.movieid = m.movie_id
GROUP BY m.title, m.year, m.gross
ORDER BY m.gross desc
FETCH FIRST 10 ROWS ONLY;

Result: The output looks like this:

Note: The data indicates that MoviePlex users aren't necessarily enjoying blockbuster movies.

Summary:

In a matter of minutes, you were able to create and query Oracle Database tables over data sourced in HDFS - and then join that data
with other Oracle Database tables.

Next, we will leverage metadata already available in the Hive Metastore to make it even easier to query complex data in Hadoop.

Part 3 - Leverage the Hive Metastore to Access Data in Hadoop & Oracle NoSQL Database

Hive enables SQL access to data stored in Hadoop and NoSQL stores. There are two parts to Hive: the Hive execution engine and the Hive
Metastore.

The Hive execution engine launches MapReduce job(s) based on the SQL that has been issued. MapReduce is a batch processing framework and
is not intended for interactive query and analysis - but it is extremely useful for querying massive data sets using the well understood SQL
language. Importantly, no coding is required (Java, Pig, etc.). The SQL supported by Hive is still limited (SQL92), but improvements are being
made over time.

The Hive Metastore has become the standard metadata repository for data stored in Hadoop. It contains the definitions of tables (table name,
columns and data types), the location of data files (e.g. directory in HDFS), and the routines required to parse that data (e.g. StorageHandlers,
InputFormats and SerDes). The data accessed through Hive does not have to be stored in Hadoop. For example, Oracle NoSQL Database offers a
StorageHandler that makes its data accessible through Hive. This capability will be leveraged by Oracle Big Data SQL.

There are many query execution engines that use the Hive Metastore while bypassing the Hive execution engine. Oracle Big Data SQL is an
example of such an engine. This means that the same metadata can be shared across multiple products (e.g. Hive, Oracle Big Data SQL, Impala,
Pig, Spark SQL, etc.); you will see an example of this in action in the following exercises.

Let's begin by reviewing the tables that have been defined in Hive. After reviewing these hive definitions, we'll create tables in the Oracle Database
that will query the underlying Hive data stored in HDFS and Oracle NoSQL Database:

Review Tables Stored in Hive


Tables in Hive are organized into databases. In our example, several tables have been created in the default database. Connect to Hive and
investigate these tables.

1. Open a terminal window and execute the following command at the command prompt:

bee

Result: This command is a shortcut for running beeline - a Hive JDBC client (see /opt/bin/bee). Beeline is a very basic Hive
command line interface (CLI).

2. At the prompt, enter the following command to display the list of tables in the default database:

show tables;

Result: As shown in the output, several tables have been defined in the database. There are tables defined over Avro data, JSON
data and tab delimited text files.

Let's review two tables that have been defined over JSON data.

3. The first table is very simple and is equivalent to the external table that was defined in Oracle Database in the previous exercise.
Review the definition of the table by executing the following command at the prompt:

show create table movielog;

Result: The DDL for the table is displayed.


Notes:

There is a single string column called click - and the table is referring to data stored in the
/user/oracle/moviework/applog_json folder.
There is no special processing of the JSON data; i.e. no routine is transforming the attributes into columns. The table is
simply displaying the JSON as a line of text.

4. Next, query the data in the movielog table by executing the following command:

select * from movielog limit 10;

Result: The follow output is produced:

Notes:

Because there are no columns in the select list and no filters applied, the query simply scans the file and returns the results.
No MapReduce job is executed.

There are more useful ways to query the JSON data. The next steps will show how Hive can parse the JSON data using a
serializer/deserializer - or SerDe.

5. The second table queries that same file - however this time it is using a SerDe that will translate the attributes into columns. Review
the definition of the table by executing the following command:

show create table movieapp_log_json;

Result: The DDL for the second table is shown.


Notes:

There are columns defined for each field in the JSON document - making it much easier to understand and query the data.
A java class org.apache.hive.hcatalog.data.JsonSerDe is used to deserialize the JSON file.

This is also an illustration of Hadoop's schema on read paradigm; a file is stored in HDFS, but there is no schema associated with it
until that file is read. Our examples are using two different schemas to read that same data; these schemas are encapsulated by the
Hive tables movielog and movieapp_log_json.

6. Execute the following query against the movieapp_log_json table to find movies that were highly rated:

SELECT movieid, AVG(rating) AS avg_rating
FROM movieapp_log_json
WHERE rating IS NOT NULL
GROUP BY movieid
ORDER BY avg_rating DESC, movieid ASC
LIMIT 25;

Result: The following output is generated (the query may take a moment to return these results).


Note: This is a much better way to query and view the data than in our previous table.

The Hive query execution engine converted this query into MapReduce jobs.
The author of the query does not need to worry about the underlying implementation - Hive handles this automatically.

7. Review a third table called recommendation. This table is in the moviework database and is defined over an Oracle
NoSQL Database table that contains movie recommendations for each user:

show create table moviework.recommendation;

Result: The DDL for the third table is shown.


Notes:

The TBLPROPERTIES describe the connection details for the Oracle NoSQL Database instance
An Oracle NoSQL DB storage handler oracle.kv.hadoop.hive.table.TableStorageHandler provides access
to the underlying data store

8. Execute the following query against the recommendation table to view genres and movies recommended for users:

SELECT * FROM moviework.recommendation LIMIT 20;

9. At the prompt, execute the !exit; command to close beeline

Leverage Hive Metadata When Creating Oracle Tables


Oracle Big Data SQL is able to leverage the Hive metadata when creating and querying tables.

In this section, you will create Oracle tables over three Hive tables: movieapp_log_json, movieapp_log_avro and recommendation.
Oracle Big Data SQL will utilize the existing StorageHandlers and SerDes required to process this data.

1. Go to Oracle SQL Developer. Create a table over the Hive movieapp_log_json table using the following DDL:

CREATE TABLE movieapp_log_json (
  custid INTEGER,
  movieid INTEGER,
  genreid INTEGER,
  time VARCHAR2(20),
  recommended VARCHAR2(4),
  activity NUMBER,
  rating INTEGER,
  price NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
)
REJECT LIMIT UNLIMITED;

Notice the new ORACLE_HIVE access driver type. This access driver invokes Oracle Big Data SQL at query compilation time to
retrieve the metadata details from the Hive Metastore. By default, it will query the metastore for a table name that matches the name
of the external table: movieapp_log_json. As you will see later, this default can be overridden using ACCESS PARAMETERS.

2. Query the table using the following select statement:

SELECT * FROM movieapp_log_json WHERE rating > 4;

Result: The query output is shown here:

Notes:

As mentioned earlier, at query compilation time, Oracle Big Data SQL queries the Hive Metastore for all the information
required to select data. This metadata includes the location of the data and the classes required to process the data (e.g.
StorageHandlers, InputFormats and SerDes).
In this example, Oracle Big Data SQL scanned the files found in the /user/oracle/moviework/applog_json
directory and then used the Hive SerDe to parse each JSON document.
In a true Oracle Big Data Appliance environment, the input splits would be processed in parallel across the nodes of the
cluster by the Big Data SQL Server, the data would then be filtered locally using Smart Scan, and only the filtered results
(rows and columns) would be returned to Oracle Database.

3. Query the table using the following select statement:

SELECT movieid, AVG(rating)
FROM movieapp_log_json
WHERE rating IS NOT NULL
GROUP BY movieid
ORDER BY AVG(rating) DESC, movieid ASC
FETCH FIRST 25 ROWS ONLY;

Result: The query output is shown here:


Notes:

This query highlights that - although the Hive metadata is leveraged - the Hive execution engine is not used by Big Data SQL.
Previously, we ran a similar query from beeline - and MapReduce jobs were launched to execute the query. MapReduce was
not used here.

4. It is now easy to combine data available through Hive with data stored in Oracle Database tables. What are the highly rated movies that
customers are purchasing?

SELECT f.custid, m.title, m.year, m.gross, f.rating
FROM movieapp_log_json f, movie m
WHERE f.movieId = m.movie_id
AND f.rating > 4;
Result: The query output is shown here:

Notes:

The movie lookup table resides in Oracle Database - providing context to the click data.

5. There is a second Hive table over the same movie log content - except the data is in Avro format - not JSON text format. Create an
Oracle table over that Avro-based Hive table using the following command:

CREATE TABLE mylogdata (
  custid INTEGER,
  movieid INTEGER,
  genreid INTEGER,
  time VARCHAR2(20),
  recommended VARCHAR2(4),
  activity NUMBER,
  rating INTEGER,
  price NUMBER
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS ( com.oracle.bigdata.tablename=default.movieapp_log_avro )
)
REJECT LIMIT UNLIMITED;

Note: In this instance, the Oracle table name does not match the Hive table name. Therefore, an ACCESS PARAMETER was
specified that references the Hive table (default.movieapp_log_avro).

6. Query the mylogdata table using the following command:

SELECT custid, movieid, time FROM mylogdata;

Result: The query output will be similar to this:

Note: Oracle Big Data SQL utilized the Avro InputFormat to query the data.

Now, to illustrate how Oracle Big Data SQL uses the Hive Metastore at query compilation to determine query execution parameters,
you will change the definition of the Hive table movieapp_log_json. In Hive, alter the table's LOCATION field so that it points to a location
containing only two records.

7. Return to the terminal window, invoke Hive's beeline CLI, and then change the location field and query the table by executing the
following three commands:

bee

ALTER TABLE movieapp_log_json SET LOCATION
"hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/two_recs";

SELECT * FROM movieapp_log_json;

Result: The Hive table returns the file's only two records, which look something like this (your two rows may show different data):

8. Return to SQL Developer and - without making any changes to the Oracle table - query movieapp_log_json:

SELECT * FROM movieapp_log_json;

Result: Oracle Big Data SQL queried the Hive Metastore and picked up the change in LOCATION. The Oracle table returns the same
two rows (your two rows will be the same as returned in Hive).


9. Finally, reset the Hive table and then confirm that there are more than two rows. Execute the following commands at the beeline
prompt.
ALTER TABLE movieapp_log_json SET LOCATION
"hdfs://bigdatalite.localdomain:8020/user/oracle/moviework/applog_json";

select * from movieapp_log_json limit 10;

Note: The query should return 10 rows.

10. Accessing the recommendation data in Oracle NoSQL Database will utilize the same method. Return to SQL Developer and create
the recommendation table. Then, select the first 20 rows from the table:

CREATE TABLE RECOMMENDATION
(
CUSTID NUMBER
, SNO NUMBER
, GENREID NUMBER
, MOVIEID NUMBER
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename: moviework.recommendation
)
)
REJECT LIMIT UNLIMITED;

SELECT * FROM recommendation WHERE rownum <=20;

Result: Oracle Big Data SQL queried the Hive Metastore to determine how to access the Oracle NoSQL Database table. It then used
that information to retrieve the first 20 rows from the key-value store:

Part 4 - Big Data SQL Performance Features

Big Data SQL provides numerous features that enhance query performance. These include:

SmartScan: data local scans on the Hadoop cluster that will filter data based on SQL query predicates
Storage Indexes: automatically generated, in-memory indexes that enable SmartScan to skip reading blocks that do not contain data based
on the query predicate
Bloom Filters: pushes a predicate that was applied to a joined look-up table down to the data stored on the Hadoop cluster
Partition Pruning: avoids reading Hive partitions based on query predicates
Predicate Pushdown: intelligent data sources - like Oracle NoSQL Database, HBase, Parquet and ORC files - are able to process
predicates and leverage optimized storage performance capabilities.

A simple, single-node VM is not an environment for evaluating performance. However, Big Data Lite will allow you to utilize Partition Pruning
and Predicate Pushdown - enabling you to better understand how these performance features work. Because Big Data Lite does not include Big
Data SQL Cells, you will not be able to see the value from the SmartScan, Storage Index and Bloom Filter features.
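On a real cluster with Big Data SQL cells, one way to see whether offload is occurring is to look at the session statistics whose names start with "cell XT". The exact statistic names vary by release, so treat the query below as an illustrative sketch rather than a step from this lab:

-- Illustrative only: show Big Data SQL ("cell XT") offload statistics for the current session
SELECT n.name, s.value
FROM v$mystat s, v$statname n
WHERE s.statistic# = n.statistic#
  AND n.name LIKE 'cell XT%';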

Querying Partitioned Hive Tables


In this exercise, we will examine the performance impact of Hive partition pruning. The hive table movieapp_log_avro is a non-partitioned
table defined over Avro data; we queried this table in a previous exercise. A second table has been created - movieapp_log_month_avro - that
is partitioned by month.

Launch beeline and review the table definition and its partitions
In SQL Developer, create a Big Data SQL-enabled table over the hive partitioned table
Compare query performance between the non-partitioned and partitioned sources

1. In beeline, review the definition and partitions for table movieapp_log_month_avro:

bee

show create table movieapp_log_month_avro;

Result: DDL for the table

+-----------------------------------------------------------------------------------------------------+--+
| createtab_stmt |
+-----------------------------------------------------------------------------------------------------+--+
| CREATE TABLE `movieapp_log_month_avro`( |
| `custid` int COMMENT '', |
| `movieid` int COMMENT '', |
| `activity` int COMMENT '', |
| `genreid` int COMMENT '', |
| `recommended` string COMMENT '', |
| `time` string COMMENT '', |
| `rating` int COMMENT '', |
| `price` double COMMENT '', |
| `position` int COMMENT '') |
| PARTITIONED BY ( |
| `month` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' |
| WITH SERDEPROPERTIES ( |
| 'avro.schema.url'='hdfs://bigdatalite.localdomain/user/oracle/moviework/schemas/activity.avsc') |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' |
| LOCATION |
| 'hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/movieapp_log_month_avro' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1469574004') |
+-----------------------------------------------------------------------------------------------------+--+

show partitions movieapp_log_month_avro;

Result: You will see 4 partitions:

+----------------+--+
| partition |
+----------------+--+
| month=2012-07 |
| month=2012-08 |
| month=2012-09 |
| month=2012-10 |
+----------------+--+

Note, this is the same data found in the non-partitioned hive table. It is simply divided into 4 partitions.

2. In SQL Developer, create a Big Data SQL-enabled table over the partitioned hive table. Notice, you do not have to specify anything
about the partition definition. Oracle Database queries the hive metastore at query compilation time to determine the partitions:

CREATE TABLE MOVIEAPP_LOG_MONTH_AVRO
(
  CUSTID NUMBER
  , MOVIEID NUMBER
  , ACTIVITY NUMBER
  , GENREID NUMBER
  , RECOMMENDED VARCHAR2(4)
  , TIME VARCHAR2(20)
  , RATING NUMBER
  , PRICE NUMBER
  , POSITION NUMBER
  , MONTH VARCHAR2(8)
)
ORGANIZATION EXTERNAL
(
  TYPE ORACLE_HIVE
  DEFAULT DIRECTORY DEFAULT_DIR
  ACCESS PARAMETERS
  (
    com.oracle.bigdata.tablename: default.movieapp_log_month_avro
  )
)
REJECT LIMIT UNLIMITED;

3. Query the non-partitioned and partitioned sources and notice the performance difference:

-- non-partitioned
SELECT movieid,
COUNT(*)
FROM mylogdata
WHERE SUBSTR(TIME, 1, 7) = '2012-07'
AND movieid = 11547
GROUP BY movieid;

Result: The query output looks similar to the following:

MOVIEID COUNT(*)
---------- ----------
11547 1716

Elapsed: 00:00:11.561

-- partitioned

SELECT movieid,
COUNT(*)
FROM movieapp_log_month_avro
WHERE MONTH = '2012-07'
AND movieid = 11547
GROUP BY movieid;

MOVIEID COUNT(*)
---------- ----------
11547 1716

Elapsed: 00:00:03.611

Notes:

Due to partition pruning, the query over the partitioned source scans approximately one-fourth of the data. As a result, the
query performance is approximately four times faster.
When running a real Hadoop cluster with Big Data SQL Server cells, SmartScan and Storage Indexes would engage to
enhance performance.
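If you are curious how the optimizer handles these external tables, you can optionally generate an execution plan. This is a generic Oracle technique rather than a step from the original lab, so consider it a sketch:

-- Illustrative only: inspect the plan for the partitioned query
EXPLAIN PLAN FOR
SELECT movieid, COUNT(*)
FROM movieapp_log_month_avro
WHERE month = '2012-07'
AND movieid = 11547
GROUP BY movieid;

SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);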

Predicate Pushdown to Intelligent Sources


A delimited text file is not an intelligent data source. The data contained in the source may be interesting - but it doesn't provide
capabilities to optimize retrieval of data. Oracle NoSQL Database, Parquet and ORC files are examples of intelligent sources. They provide
numerous features that optimize data retrieval. You can review this blog post for details.

This exercise will examine the performance benefit of predicate pushdown into Parquet files - which provides a compressed, efficient columnar
store. This example uses the same data as found in the previous example; movieapp_log_month_parquet is a partitioned hive table where
data for each month is stored in Parquet format.

Launch beeline and review the table definition and its partitions
In SQL Developer, create a Big Data SQL-enabled table over the hive partitioned table
Compare query performance between the parquet source and the previous example

1. In beeline, review the definition and partitions for table movieapp_log_month_parquet:

bee

show create table movieapp_log_month_parquet;

Result: DDL for the table

+-----------------------------------------------------------------------------------------+--+
| createtab_stmt |
+-----------------------------------------------------------------------------------------+--+
| CREATE TABLE `movieapp_log_month_parquet`( |
| `custid` int, |
| `movieid` int, |
| `activity` int, |
| `genreid` int, |
| `recommended` string, |
| `time` string, |
| `rating` int, |
| `price` double, |
| `position` int) |
| PARTITIONED BY ( |
| `month` string) |
| ROW FORMAT SERDE |
| 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' |
| STORED AS INPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' |
| OUTPUTFORMAT |
| 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat' |
| LOCATION |
| 'hdfs://bigdatalite.localdomain:8020/user/hive/warehouse/movieapp_log_month_parquet' |
| TBLPROPERTIES ( |
| 'transient_lastDdlTime'='1469643702') |
+-----------------------------------------------------------------------------------------+--+

show partitions movieapp_log_month_parquet;

Result: You will see 4 partitions:

+----------------+--+
| partition |
+----------------+--+
| month=2012-07 |
| month=2012-08 |
| month=2012-09 |
| month=2012-10 |
+----------------+--+

Note, this data is the same as the Avro example above - but in Parquet format.

2. In SQL Developer, create a Big Data SQL-enabled table over the partitioned hive table. Notice, you do not have to specify anything
about the partition definition. Oracle Database queries the hive metastore at query compilation time to determine the partitions:

CREATE TABLE MOVIEAPP_LOG_MONTH_PARQUET
(
CUSTID NUMBER
, MOVIEID NUMBER
, ACTIVITY NUMBER
, GENREID NUMBER
, RECOMMENDED VARCHAR2(4)
, TIME VARCHAR2(20)
, RATING NUMBER
, PRICE NUMBER
, POSITION NUMBER
, MONTH VARCHAR2(8)
)
ORGANIZATION EXTERNAL
(
TYPE ORACLE_HIVE
DEFAULT DIRECTORY DEFAULT_DIR
ACCESS PARAMETERS
(
com.oracle.bigdata.tablename: default.movieapp_log_month_parquet
)
)
REJECT LIMIT UNLIMITED;

3. Query the partitioned Parquet source and compare the elapsed time to the Avro example above:

-- partitioned parquet table

SELECT movieid,
COUNT(*)
FROM movieapp_log_month_parquet
WHERE MONTH = '2012-07'
AND movieid = 11547
GROUP BY movieid;

MOVIEID COUNT(*)
---------- ----------
11547 1716

Elapsed: 00:00:00.532


Notes:

Query performance benefits are cumulative. In this example, the query benefits from both partition pruning and the Parquet
data source. The elapsed query time has been significantly reduced.

Part 5 - Applying Oracle Database Security Policies Across the Big Data Platform

In most deployments, the Oracle Database contains critical and sensitive data that must be protected. A rich set of Oracle Database security
features - including strong authentication, row level access, data redaction, data masking, auditing and more - has been utilized to ensure that data
remains safe. These same security policies can be leveraged when using Oracle Big Data SQL. This means that a single set of security policies
can be utilized to protect all of your data.

In our example, we need to protect personally identifiable information, including the customer last name and customer id. To accomplish this task,
an Oracle Data Redaction policy has already been set up on the customer table that obscures these two fields. This was accomplished by using the
DBMS_REDACT PL/SQL package, shown here:

DBMS_REDACT.ADD_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'CUSTOMER',
column_name => 'CUST_ID',
policy_name => 'customer_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => '9,1,7',
expression => '1=1'
);

DBMS_REDACT.ALTER_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'CUSTOMER',
action => DBMS_REDACT.ADD_COLUMN,
column_name => 'LAST_NAME',
policy_name => 'customer_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => 'VVVVVVVVVVVVVVVVVVVVVVVVV,VVVVVVVVVVVVVVVVVVVVVVVVV,*,3,25',
expression => '1=1'
);

The first PL/SQL call creates a policy called customer_redaction:

It is applied to the cust_id column of the moviedemo.customer table
It performs a partial redaction - i.e. it is not necessarily applied to all characters in the field
It replaces the first 7 characters with the number "9"
The redaction policy will always apply - since the expression describing when it will apply is specified as "1=1"

The second API call updates the customer_redaction policy, redacting a second column in that same table. It will replace characters 3 through 25
in the LAST_NAME column with an '*'. Note: the application of redaction policies does not change underlying data. Oracle Database performs the
redaction at execution time, just before the data is displayed to the application user.

Querying these two columns in the customer table produces the following result:

SELECT cust_id, last_name FROM customer;

Importantly, SQL executed against redacted data remains unchanged. For example, queries can use the cust_id and last_name columns in join
conditions, apply filters to them, etc. The fact that the data is redacted is transparent to application code.
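If you want to confirm which redaction policies are currently in place, Oracle Data Redaction exposes them through data dictionary views. The sketch below assumes you have access to the REDACTION_POLICIES and REDACTION_COLUMNS views; column lists may vary slightly by release:

-- Illustrative only: list existing redaction policies and the columns they protect
SELECT object_owner, object_name, policy_name, expression
FROM redaction_policies;

SELECT object_name, column_name, function_type
FROM redaction_columns;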

In the next section, you apply redaction policies to our tables that have data sourced in Hadoop.

Apply Redaction Policies to Data Stored in Hadoop and Oracle NoSQL Database
Here, you apply an equivalent redaction policy to three of our Oracle Big Data SQL objects, with the following effects:


The first procedure redacts data sourced from JSON in HDFS
The second procedure redacts Avro data sourced from Hive
The third procedure redacts data sourced from Oracle NoSQL Database

All three policies redact the custid attribute.

1. Go to the SQL Developer Worksheet and run the following three PL/SQL DBMS_REDACT.ADD_POLICY procedures:

BEGIN
-- JSON file in HDFS
DBMS_REDACT.ADD_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'MOVIELOG_V',
column_name => 'CUSTID',
policy_name => 'movielog_v_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => '9,1,7',
expression => '1=1'
);

-- Avro data from Hive
DBMS_REDACT.ADD_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'MYLOGDATA',
column_name => 'CUSTID',
policy_name => 'mylogdata_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => '9,1,7',
expression => '1=1'
);

-- Recommendations data from Oracle NoSQL Database
DBMS_REDACT.ADD_POLICY(
object_schema => 'MOVIEDEMO',
object_name => 'RECOMMENDATION',
column_name => 'CUSTID',
policy_name => 'recommendation_redaction',
function_type => DBMS_REDACT.PARTIAL,
function_parameters => '9,1,7',
expression => '1=1'
);

END;
/

Result: As stated previously, the custid column for each of the three objects is now redacted.

2. Review the redacted data from the Avro source:

SELECT * FROM mylogdata WHERE rownum < 20;

Result: The output should look like this:

Notice how the custid column displays a series of 9s instead of the original value.

3. Join the redacted HDFS data to the customer table by executing the following SELECT statement:

SELECT f.custid, c.last_name, f.movieid, f.time
FROM customer c, movielog_v f
WHERE c.cust_id = f.custid;

Results: The query output looks similar to the following:


Notes:

As highlighted in the example above, we used the Sort tool in the TIME column to sort the output in ascending order by
TIME.
As you can see, the redacted data sourced from Hadoop works seamlessly with the rest of the data in your Oracle Database.

4. Similarly, join the redacted NoSQL data to the customer and movie Oracle Database tables by executing the following SELECT
statement:

SELECT f.custid, c.last_name, c.income_level, f.genreid, m.title
FROM customer c, recommendation f, movie m
WHERE c.cust_id = f.custid
AND f.movieid = m.movie_id
AND c.income_level like 'F%'
ORDER BY f.custid, f.genreid;

Results: The query output displays recommendations for wealthier customers:

Notes:

You can now easily see movies that are recommended to customers - while protecting sensitive customer identity data.

Part 6 - Using Oracle Analytic SQL Across All Your Data

Oracle Big Data SQL allows you to utilize Oracle's rich SQL dialect to query all your data, regardless of where that data may reside. We will take a
look at a couple of analytic queries that deliver unique insights across our three data sources.

Gaining Insights From All Your Data


This next example will enrich Oracle MoviePlex's understanding of customers by utilizing an RFM analysis. This query will identify:

Recency : when was the last time the customer accessed the site?
Frequency : what is the level of activity for that customer on the site?
Monetary : how much money has the customer spent?

To answer these questions, SQL Analytic Functions will be applied to data residing in both the application logs on Hadoop and sales data in
Oracle Database tables. Customers will be categorized into 5 buckets measured in increasing importance. For example, an RFM combined
score of 551 indicates that the customer is in the highest tier of customers in terms of recent visits (R=5) and activity on the site (F=5), however
the customer is in the lowest tier in terms of spend (M=1). Perhaps this is a customer that performs research on the site, but then decides to buy
movies elsewhere!
We want to target customers who we may be losing to competition. Therefore, execute the following query -- which finds important customers
(high monetary score) that have not visited the site recently (low recency score):

1. Go to the SQL Developer Worksheet and run the following query:

WITH customer_sales AS (
-- Sales and customer attributes
SELECT m.cust_id,
c.last_name,
c.first_name,
c.country,
c.gender,
c.age,
c.income_level,
NTILE (5) over (order by sum(sales)) AS rfm_monetary
FROM movie_sales m, customer c
WHERE c.cust_id = m.cust_id
GROUP BY m.cust_id,
c.last_name,
c.first_name,
c.country,
c.gender,
c.age,
c.income_level
),
click_data AS (
-- clicks from application log
SELECT custid,
NTILE (5) over (order by max(time)) AS rfm_recency,
NTILE (5) over (order by count(1)) AS rfm_frequency
FROM movielog_v
GROUP BY custid
)
SELECT c.cust_id,
c.last_name,
c.first_name,
cd.rfm_recency,
cd.rfm_frequency,
c.rfm_monetary,
cd.rfm_recency*100 + cd.rfm_frequency*10 + c.rfm_monetary AS rfm_combined,
c.country,
c.gender,
c.age,
c.income_level
FROM customer_sales c, click_data cd
WHERE c.cust_id = cd.custid
AND c.rfm_monetary >= 4
AND cd.rfm_recency <= 2
ORDER BY c.rfm_monetary desc, cd.rfm_recency desc;

Notes:

The customer_sales subquery selects from the Oracle Database fact table movie_sales to categorize customers based on
sales.
The click_data subquery performs a similar task for web site activity stored in the application logs - categorizing customers
based on their activity and recent visits.
These two subqueries are then joined to produce the complete RFM score. The result only shows customers who have
significant spend (>= 4) but have not visited the site recently (<= 2).

Result: The query output looks similar to the following:

These are the at-risk customers for Oracle MoviePlex. They were at one time active, big spenders on the site. Let's see what we can
do to bring them back!

2. How is the recommendation engine performing? To answer this question, we will need to understand the following:
Rank how many times movies are recommended (from Oracle NoSQL Database)
Rank sales revenue for movies (from Oracle Database tables)
Rank interest level in a movie - i.e. how many times people have previewed, watched, displayed more info, etc. (from HDFS
click data)

WITH rank_recs AS (
-- recommendation rank from NoSQL Database
SELECT movieid,
RANK () OVER (ORDER BY COUNT(movieid) DESC) AS rec_rank
FROM recommendation
GROUP BY movieid),
rank_sales AS (
-- sales rank from Oracle Database
SELECT m.movie_id,
m.title,
RANK () OVER (ORDER BY SUM(ms.sales) DESC) as sales_rank
FROM movie m, movie_sales ms
WHERE ms.movie_id = m.movie_id
GROUP BY m.title, m.movie_id
),
rank_interest AS (
-- "interest" rank from hdfs logs
SELECT movieid,
RANK () OVER (ORDER BY COUNT(movieid) DESC) AS click_rank
FROM movielog_v
WHERE activity IN (1,4,5) -- rated, started or browsed the movie
GROUP BY movieid
)
-- combine the results
SELECT rs.title,
sales_rank,
rec_rank,
click_rank
FROM rank_recs rr, rank_sales rs, rank_interest ri
WHERE rr.movieid = rs.movie_id
AND ri.movieid = rs.movie_id
ORDER BY rec_rank asc;

Result: The query output looks like this:

Notes:

By combining results from all three data sources, we are able to get a complete view of the customer activity.

Part 7 - Introduction to SQL Pattern Matching

This section covers the new SQL pattern matching and analytical SQL functionality that is part of Oracle Database 12c. Row pattern matching in
native SQL improves application development, developer productivity and query efficiency for row-sequence analysis. This new feature is an
important addition to your SQL toolbox.

Introduction

Recognizing patterns in a sequence of rows has been a capability that was widely desired, but not really possible with SQL until now. There were
many workarounds, but these were difficult to write, hard to understand, and inefficient to execute. With Oracle Database 12c you can use the
MATCH_RECOGNIZE clause to perform pattern matching in SQL to do the following:

a. Logically group and order the data that is used in the MATCH_RECOGNIZE clause using the PARTITION BY and ORDER BY clauses.
b. Define business rules/patterns using the PATTERN clause. These patterns use regular expression syntax - a powerful and expressive
feature - applied to the pattern variables.
c. Specify the logical conditions required to map a row to a row pattern variable using the DEFINE clause.
d. Define output measures, which are expressions within the MEASURES clause.
e. Control the output (summary vs. detailed) from the pattern matching process

The moviedemo schema contains a view called MOVIEAPP_LOG_JSON_V which returns a formatted version of JSON click data stream from our web
application log file. The view returns the following columns:

CUST_ID
MOVIE_ID
GENRE_ID
TIME_ID
RECOMMENDED
RATED
COMPLETED
PAUSE
START
BROWSE
LIST
SEARCH
LOGIN
LOGOUT
INCOMPLETE
PURCHASE
PRICE
RATING

Using this click data we will create a sessionization data set which tracks each session, the duration of the session and the number of clicks/events.

Tasks and Keywords in Pattern Matching


Let us go over some of the tasks and keywords used in pattern matching. Building a pattern matching statement can be broken down into four
simple steps:

1. Organize the data
   PARTITION BY - logically divides/partitions the rows into groups
   ORDER BY - logically orders the rows within a partition

2. Define the business rules
   PATTERN - defines the pattern variables that must be matched, the sequence in which they must be matched, and the number of rows that must be matched
   DEFINE - specifies the conditions that define a pattern variable
   AFTER MATCH - determines where to restart the matching process after a match is found

3. Define the output measures
   MEASURES - defines row pattern measure columns
   MATCH_NUMBER - identifies which match a row belongs to; matches are numbered sequentially within each row pattern partition
   CLASSIFIER - identifies which component of the pattern (which pattern variable) applies to a specific row

4. Control the output
   ONE ROW PER MATCH - returns one summary row of output for each match
   ALL ROWS PER MATCH - returns one detail row of output for each row of each match
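
Putting these keywords together, the overall shape of a MATCH_RECOGNIZE query looks like the sketch below. This is only a generic outline - the table and column names (events_table, key_col, ts_col) are placeholders for illustration, not objects in the moviedemo schema:

SELECT *
FROM events_table                                  -- hypothetical source table
MATCH_RECOGNIZE (
    PARTITION BY key_col ORDER BY ts_col           -- 1. organize the data
    MEASURES MATCH_NUMBER() AS match_no,           -- 3. define the output measures
             CLASSIFIER() AS var_name
    ALL ROWS PER MATCH                             -- 4. control the output
    AFTER MATCH SKIP PAST LAST ROW
    PATTERN (strt evnt+)                           -- 2. define the business rules
    DEFINE evnt AS evnt.ts_col <= PREV(evnt.ts_col) + INTERVAL '2' HOUR
);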

Pattern Match Example: Web Log Sessionization Analysis


1. Defining the pattern/business rules

For this scenario we are going to assume that a series of events or clicks within our web log file belongs to the same session if the
time between consecutive events (clicks) is less than two hours. (When people are watching a movie they are not recording any click
activity, so if we set this threshold too low we run the risk of splitting single sessions into multiple sessions.) The exact definition of a
session needs to come from the business users and, using SQL pattern matching, it is relatively simple to change the session
threshold to, say, two minutes if that were the specific requirement from the business.

Using this information we can now build our PATTERN and DEFINE clauses.

PATTERN (bgn sess+)


DEFINE
sess as time_id <= PREV(sess.time_id) + interval '2' hour

The plus sign (+) indicates that we are looking for one or more instances of our pattern, i.e. each event must fall within a two-hour
boundary of the previous event, which is captured by the expression PREV(sess.time_id). To capture this information we are using
one of the many built-in functions that allow us to point to specific values within the data set as it is being processed (there is more
information about this later in this lab).

There are many other regular expression quantifiers and operators that we can use; these are all discussed in the Data Warehousing Guide.
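
As an example of how easily the business rules can be changed, suppose the business asked for a two-minute threshold and only wanted to keep sessions containing at least three clicks. Only the PATTERN and DEFINE clauses would need to change; the sketch below is illustrative and is not part of the lab steps:

PATTERN (bgn sess{2,})   -- bgn plus at least two further events = three or more clicks per session

DEFINE
    sess AS time_id <= PREV(sess.time_id) + INTERVAL '2' MINUTE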

2. Using built-in functions

The MATCH_RECOGNIZE feature comes with some very useful built-in functions that you can include in your code:

MATCH_NUMBER: You might have a large number of matches for your pattern inside a given row partition. How do you
tell all these matches apart? This is done with the MATCH_NUMBER function. Matches within a row pattern partition are
numbered sequentially starting with 1 in the order they are found. Note that match numbering starts over again at 1 in
each row pattern partition, because there is no inherent ordering between row pattern partitions.

CLASSIFIER: Along with knowing which MATCH_NUMBER you are seeing, you may want to know which component of
a pattern applies to a specific row. This is done using the CLASSIFIER function. The classifier of a row is the pattern
variable that the row is mapped to by a row pattern match. The function returns a character string whose value is the
classifier of a row. The classifier of a row that is not mapped by a row pattern match is null.

Once we have identified a group of records as belonging to a unique session we need a way to identify each unique session within
our result set. To do this we can use the built-in measure MATCH_NUMBER() to apply a sequential number to each of our unique
sessions. At the same time we will use the CLASSIFIER() function to show which pattern variable is matched for each row. Using
this information we can build our MEASURES clause:

MEASURES MATCH_NUMBER() AS session_id,
         CLASSIFIER() AS pattern_id

3. Detailed or summary report?

For this first step in creating our sessionization data set we will return a detailed report by using the ALL ROWS PER MATCH syntax.

4. Returning a simple sessionization result set

We can now bring all of the above information together and build our pattern matching statement. This simple SELECT statement
returns the session id from our MATCH_RECOGNIZE clause along with all the columns from our source view. As part of this example
we have used a final WHERE clause to restrict the output rows to one specific customer (1000693):

SELECT *
FROM movieapp_log_json_v
MATCH_RECOGNIZE
(PARTITION BY cust_id ORDER BY time_id
MEASURES MATCH_NUMBER() AS session_id
ALL ROWS PER MATCH
PATTERN (bgn sess+)
DEFINE
sess as time_id <= PREV(sess.time_id) + interval '2' hour
)
WHERE cust_id ='1000693';

5. The output from this MATCH_RECOGNIZE statement should look something like this:


Our automatic calculation to determine the session id is shown in column 3. Now we have successfully converted our original web
log file into a basic sessionization data set. Note that we could now share this data set with our business users, but before we do
that we might want to do some more work to make sure our pattern matching process is working correctly.

Part 8 - Checking the Pattern Matching Process

In this section, we will use the CLASSIFIER() measure to show which pattern variable is being assigned to each row. This will help us debug our
pattern matching process and ensure it is working correctly.

1. Adding CLASSIFIER measure

We need to expand the MEASURES clause to include the built-in function CLASSIFIER().

MEASURES MATCH_NUMBER() AS session_id,


CLASSIFIER() AS pattern_id

2. Selecting specific columns

We are going to amend the SELECT clause to only return the customer id, session id (from our MATCH_NUMBER() function), date, time
and the pattern id (from our CLASSIFIER() function). The new code should look like this:

SELECT
cust_id,
session_id,
time_id,
TO_CHAR(time_id, 'hh24:mi:ss') AS session_time,
pattern_id
FROM movieapp_log_json_v
MATCH_RECOGNIZE
(PARTITION BY cust_id ORDER BY time_id
MEASURES MATCH_NUMBER() AS session_id,
CLASSIFIER() AS pattern_id
ALL ROWS PER MATCH
PATTERN (bgn sess+)
DEFINE
sess as time_id <= PREV(sess.time_id) + interval '2' hour
)
WHERE cust_id ='1000693';

3. The output from this MATCH_RECOGNIZE statement should look something like this:

We can see that each new session starts with the BGN pattern and then all other clicks are within a 2 hour window of their previous
SESS.time event and are marked with the pattern SESS. Now we have a much better data set for our business users but we can still
make improvements to the data set.

Part 9 - Creating a More Useful Data Set

Using the CLASSIFIER() function we have established that our pattern matching process is working correctly. What we need to do now is
condense the data set so that we have one row for each session. We can do that by changing the way we output the data. It would also be useful to
include some additional business metrics as part of the data set. The following sections explain how to do this.

Creating a Summary Report


1. Using ONE ROW PER MATCH

We need to change the output clause from ALL ROWS PER MATCH to ONE ROW PER MATCH.

2. Updating the measure clause

As we are now creating a summary report we need to remove the CLASSIFIER() function from the MEASURES clause.

3. Selecting specific columns

We are going to amend the SELECT clause to return only the customer id and session id (from our MATCH_NUMBER() function) for
this summary report. The new code should look like this:

SELECT
cust_id,
session_id
FROM movieapp_log_json_v
MATCH_RECOGNIZE
(PARTITION BY cust_id ORDER BY time_id
MEASURES MATCH_NUMBER() AS session_id
ONE ROW PER MATCH
PATTERN (bgn sess+)
DEFINE
sess as time_id <= PREV(sess.time_id) + interval '2' hour
)
WHERE cust_id ='1000693';

4. The output from this MATCH_RECOGNIZE statement should look something like this:

The report should have 31 rows, one row per session. However, this information is not especially useful on its own for
our business users. We can add more useful information to this report by expanding the measure clause.
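
As a quick sanity check before adding more measures, you can confirm the session count for this customer directly. The following sketch simply counts the summary rows and should return 31:

SELECT COUNT(*) AS session_count
FROM movieapp_log_json_v
MATCH_RECOGNIZE
    (PARTITION BY cust_id ORDER BY time_id
     MEASURES MATCH_NUMBER() AS session_id
     ONE ROW PER MATCH
     PATTERN (bgn sess+)
     DEFINE
        sess as time_id <= PREV(sess.time_id) + interval '2' hour
    )
WHERE cust_id ='1000693';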

Adding Business Value


1. Calculating the number of clicks in a session

We can return a count of the number of clicks within a session by using the COUNT() function within the MEASURES clause.

MEASURES MATCH_NUMBER() AS session_id,


COUNT(*) AS no_of_events

2. Finding the start and end time of a session

We can find the start time and end time of a session by using one of the unique features of MATCH_RECOGNIZE - the ability to point to
specific values within a column by referencing the relevant pattern expressions. MATCH_RECOGNIZE includes some additional
functions that help us extract data points within a specific column. These new functions include:

FIRST
LAST
NEXT
PREV

3. Updated measure clause

Using these new functions we can now expand the measure clause to return the start time and end time of each session.

MEASURES MATCH_NUMBER() AS session_id,


COUNT(*) AS no_of_events,
TO_CHAR(FIRST(bgn.time_id),'hh24:mi:ss') AS start_time,
TO_CHAR(LAST(sess.time_id),'hh24:mi:ss') AS end_time

4. Calculating the duration of a session

We can calculate the duration of a session by subtracting the start time of each session from its end time. To do that we
can use the normal database date-time functionality.

MEASURES MATCH_NUMBER() AS session_id,


COUNT(*) AS no_of_events,
TO_CHAR(FIRST(bgn.time_id),'hh24:mi:ss') AS start_time,
TO_CHAR(LAST(sess.time_id),'hh24:mi:ss') AS end_time,
TO_CHAR(to_date('00:00:00','HH24:MI:SS') + (LAST(sess.time_id)-
FIRST(bgn.time_id)),'hh24:mi:ss') AS mins_duration
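
Because subtracting two DATE values returns the difference as a number of days, an alternative - shown here only as a sketch and not used in the rest of this lab - is to return the duration as a plain number of minutes, which can be easier to aggregate later:

MEASURES MATCH_NUMBER() AS session_id,
         COUNT(*) AS no_of_events,
         ROUND((LAST(sess.time_id) - FIRST(bgn.time_id)) * 24 * 60) AS duration_minutes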

5. SQL to generate summary report

Our revised SQL statement now looks like this:

SELECT
cust_id,
session_id,
no_of_events,
start_time,
end_time,
mins_duration
FROM movieapp_log_json_v
MATCH_RECOGNIZE
(PARTITION BY cust_id ORDER BY time_id
MEASURES MATCH_NUMBER() AS session_id,
COUNT(*) AS no_of_events,
TO_CHAR(FIRST(bgn.time_id),'hh24:mi:ss') AS start_time,
TO_CHAR(LAST(sess.time_id),'hh24:mi:ss') AS end_time,
TO_CHAR(to_date('00:00:00','HH24:MI:SS') + (LAST(sess.time_id)-
FIRST(bgn.time_id)),'hh24:mi:ss') AS mins_duration
ONE ROW PER MATCH
PATTERN (bgn sess+)
DEFINE
sess as time_id <= PREV(sess.time_id) + interval '2' hour
)
WHERE cust_id ='1000693';

6. The output from this MATCH_RECOGNIZE statement should look something like this:


Our new summary report now shows one row per session, and we have added a lot of useful information for our business users by
expanding the measure clause to show the start time, end time and duration of each session. This gives our business users a great
place to start their analysis.

Part 10 - Other Useful 12c Analytical SQL Features

To make the following code easier to read we have wrapped the MATCH_RECOGNIZE clause that we created above within a view called
movieapp_analytics_v. Of course you could incorporate the following additional features into the MATCH_RECOGNIZE statements above.
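
The exact definition of movieapp_analytics_v is not shown in this lab. As a rough sketch, it might look something like the statement below; the month-level time_id column (the sess_month measure) is an assumption based on the per-month report in the next step, not the actual view definition:

CREATE OR REPLACE VIEW movieapp_analytics_v AS
SELECT cust_id,
       session_id,
       no_of_events,
       start_time,
       end_time,
       mins_duration,
       sess_month AS time_id          -- assumed month bucket used for per-month reporting
FROM movieapp_log_json_v
MATCH_RECOGNIZE
    (PARTITION BY cust_id ORDER BY time_id
     MEASURES MATCH_NUMBER() AS session_id,
              COUNT(*) AS no_of_events,
              TO_CHAR(FIRST(bgn.time_id),'hh24:mi:ss') AS start_time,
              TO_CHAR(LAST(sess.time_id),'hh24:mi:ss') AS end_time,
              TO_CHAR(to_date('00:00:00','HH24:MI:SS') + (LAST(sess.time_id) - FIRST(bgn.time_id)),'hh24:mi:ss') AS mins_duration,
              TRUNC(FIRST(bgn.time_id),'MM') AS sess_month
     ONE ROW PER MATCH
     PATTERN (bgn sess+)
     DEFINE
        sess as time_id <= PREV(sess.time_id) + interval '2' hour
    );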

How Many Distinct Customers?


A useful metric within click data is to count the number of distinct customers using our site each month. Normally we would just use the
COUNT(DISTINCT expr) function. However, this can require a lot of resources to search through a large dataset and return the exact number
of distinct values.

With the release of Database 12c Oracle provides a much faster way to do this type of analysis. Using APPROX_COUNT_DISTINCT it is possible
to get a reasonably accurate estimate of the number of distinct values within a column. This new function can process large amounts of data
significantly faster than the COUNT(DISTINCT expr) function, with negligible deviation from the exact result. For more information about this new
feature, please refer to the SQL Reference Guide.

Let's calculate the number of unique sessions per month using both functions:

1. Using APPROX_COUNT_DISTINCT

Add new function to SELECT statement and for comparison purposes also include COUNT(DISTINCT...) function:

SELECT
time_id,
SUM(no_of_events) AS tot_events,
COUNT(DISTINCT cust_id) AS unique_customers,
APPROX_COUNT_DISTINCT(cust_id) AS est_unique_customers
FROM movieapp_analytics_v
GROUP BY time_id
ORDER BY 1;

2. The output from this statement should look like this:

The report shows the exact number of distinct customers per month alongside the approximate number of distinct customers per month. This
type of summary report is usually the basis for further analysis, i.e. if the counts are significantly higher or lower than expected then further
analysis might be required. Therefore, using the new APPROX_COUNT_DISTINCT function means our business users can get their
results much faster without having to sacrifice too much in terms of accuracy.
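
If you want to quantify how close the estimate is, a simple variation (again just a sketch, not one of the lab steps) is to compute the percentage deviation between the exact and approximate counts:

SELECT time_id,
       COUNT(DISTINCT cust_id) AS exact_customers,
       APPROX_COUNT_DISTINCT(cust_id) AS est_customers,
       ROUND(100 * ABS(APPROX_COUNT_DISTINCT(cust_id) - COUNT(DISTINCT cust_id))
             / NULLIF(COUNT(DISTINCT cust_id), 0), 2) AS pct_deviation
FROM movieapp_analytics_v
GROUP BY time_id
ORDER BY 1;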

Finding the Top 1% of Customers


1. Another useful metric within click data is finding out who our best and worst customers are. Using the new Top-N syntax we can very
quickly find the top 1% of our customers based on the total number of sessions and the number of clicks they recorded.

The new syntax for TOP-N support was introduced in 12c and is shown here:

OFFSET offset [ROW | ROWS]
FETCH [FIRST | NEXT] [row_count | percent PERCENT] [ROW | ROWS]
[ONLY | WITH TIES]

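Both OFFSET and WITH TIES can be combined with FETCH. For example, the following sketch (not one of the lab steps) skips the first 10 customers by total clicks and returns the next 5, keeping any customers that tie on the sort key:

SELECT cust_id,
       SUM(no_of_events) AS tot_clicks
FROM movieapp_analytics_v
GROUP BY cust_id
ORDER BY tot_clicks DESC
OFFSET 10 ROWS
FETCH NEXT 5 ROWS WITH TIES;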

2. Top N SQL code

Using our new FETCH syntax we can easily find our top 1% customers:

SELECT
cust_id,
MAX(session_id) AS no_of_sessions,
SUM(no_of_events) AS tot_clicks_session,
TRUNC(AVG(no_of_events)) AS avg_clicks_session,
MIN(no_of_events) AS min_clicks_session,
MAX(no_of_events) AS max_clicks_session
FROM movieapp_analytics_v
GROUP BY cust_id
ORDER BY 2 DESC, 3 DESC
FETCH FIRST 1 PERCENT ROWS ONLY;

For more information about this new feature, please refer to the SQL Reference Guide.

3. The output from this statement should look something like this:

The report shows our top 1% of customers based on the number of sessions and the number of clicks recorded during a session, and it
contains 23 rows.

Finding the Bottom 1% of Customers


1. Bottom N SQL code

To find the bottom 1% of customers we simply need to reverse the sort order!

SELECT
cust_id,
MAX(session_id) AS no_of_sessions,
SUM(no_of_events) AS tot_clicks_session,
TRUNC(AVG(no_of_events)) AS avg_clicks_session,
MIN(no_of_events) AS min_clicks_session,
MAX(no_of_events) AS max_clicks_session
FROM movieapp_analytics_v
GROUP BY cust_id
ORDER BY 2 ASC, 3 ASC
FETCH FIRST 1 PERCENT ROWS ONLY;

Summary

In this tutorial, you have learned how to:

Configure Oracle Big Data SQL on Oracle Exadata


Create Oracle external tables that access data in HDFS, Oracle NoSQL Database and Hive
Apply Oracle security policies across data in both Oracle Database and Hadoop
Use Oracle's rich SQL dialect to analyze data across the data platform
Use Oracle's new rich SQL pattern matching features to analyze web log data
Use the MATCH_RECOGNIZE clause to perform pattern matching in SQL
Use the main keywords that are used in pattern matching
Create sessionization analysis from web log file data
Quickly change the business rules for an existing query
Add calculated measures to increase business value
Speed up processing by using the new APPROX_COUNT_DISTINCT function to quickly calculate the number of distinct values
Quickly find the top/bottom values using the new FETCH feature

Resources

Go to the Oracle Technology Network for more information on Oracle Big Data SQL, Oracle Data Warehousing and Oracle Analytical SQL.

For detailed information on pattern matching, see the Pattern Matching chapter in the Oracle Database 12c Data Warehousing Guide. Among
other things, this chapter contains the following detailed examples:

Stock Market Examples: based on common tasks involving share prices and patterns.
Security Log Analysis Examples: deals with a computer system that issues error messages and authentication checks, and stores the
events in a system file.
Sessionization Examples: analysis of user activity, typically involving multiple events in a single session. Pattern matching makes it easy to
express queries for sessionization.

Credits

Authors: Marty Gubar and Keith Laker

