
Data Handling in I.O.T: R.K.Biradar


4.1 should be studied from chapter no. 5 of Rajkamal book
4.2 should be studied from chapter no. 6 of Rajkamal book

Learning Objectives
LO5.1 Apply the data-acquiring and data-storage functions for IoT/M2M
devices data and messages

LO5.2 Classify ways of organising data

LO5.3 Summarise the transactions on stored data, functions for business
processes and business intelligence, and the concepts of IoT applications
integration and services architecture

LO5.4 Identify the functions and usage of data analytics and data visualisations
for IoT applications and business processes

LO5.5 Explain knowledge discovery, knowledge management and knowledge-management
reference architecture
5.3 Organising the Data
Data can be organised in a number of ways, for example as objects, files, a data store, a database,
a relational database or an object-oriented database. The following subsections describe these ways
of organising data and the methods of querying it.

5.3.1 Databases
Required data values are organised as database(s) so that selected values can be retrieved
later.

A] Database
One popular method of organising data is a database, which is a collection of data. This collection is
organised into tables. A table provides a systematic way to access, manage and update the data. A single-table
file is called a flat-file database. Each record is listed in a separate row, and the records are unrelated to each other.

B] Relational Database
A relational database is a collection of data organised into multiple tables which relate to each other
through special fields, called keys (primary key, foreign key and unique key).
Relational databases provide flexibility. Examples of relational databases are MySQL, PostgreSQL, Oracle
Database (programmed using PL/SQL) and Microsoft SQL Server (programmed using T-SQL).
An Object Oriented Database (OODB) is a collection of objects, which are saved according to an object-oriented
design. Examples are ConceptBase and Caché.
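
As a rough illustration of tables related through keys, the sketch below uses Python's built-in sqlite3 module. The table and column names (machine, sale, location, flavour) and the sample rows are invented for this example and do not come from the slides.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,          -- primary key
    location   TEXT NOT NULL UNIQUE          -- unique key
);
CREATE TABLE sale (
    sale_id    INTEGER PRIMARY KEY,
    machine_id INTEGER NOT NULL REFERENCES machine(machine_id),  -- foreign key
    flavour    TEXT NOT NULL,
    quantity   INTEGER NOT NULL
);
""")
conn.execute("INSERT INTO machine VALUES (1, 'Station Road')")
conn.executemany(
    "INSERT INTO sale (machine_id, flavour, quantity) VALUES (?, ?, ?)",
    [(1, "milk", 3), (1, "dark", 2)],
)

# The foreign key lets the two tables be combined in one query.
rows = conn.execute("""
    SELECT m.location, s.flavour, s.quantity
    FROM sale AS s JOIN machine AS m ON m.machine_id = s.machine_id
    ORDER BY s.sale_id
""").fetchall()

# A sale that points at a machine which does not exist is rejected.
try:
    conn.execute("INSERT INTO sale (machine_id, flavour, quantity) VALUES (99, 'milk', 1)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

A flat-file layout would repeat the location text in every sale row. Here it is stored once and reached through the key.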
C] Database Management System
A Database Management System (DBMS) is a software system that contains a set of programs
specially designed for the creation and management of data stored in a database. Database transactions
can be performed on a database or a relational database.

D] Atomicity, Data Consistency, Data Isolation and Durability (ACID) Rules

The database transactions must maintain the atomicity, data consistency, data isolation and durability
during transactions. Let us explain these rules using Example 5.3 as follows:

⮚ Atomicity means a transaction must complete in full, treating it as indivisible. For example, when a
service request completes, the pending-request field must also be set to zero.

⮚ Consistency means that data after the transactions should remain consistent. For example, the number of
chocolates sent should equal the sum of the sold and unsold chocolates for each flavour after the
transactions on the database.

⮚ Isolation means transactions between tables 5.1 and 5.2, 5.2 and 5.3 and 5.3 and 5.1 are isolated
from each other. [Check next slide]

⮚ Durability means that once a transaction completes, it cannot be recalled and its result persists. Only
a new transaction can effect any change.
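
Atomicity and consistency can be sketched with Python's built-in sqlite3 module. This is a minimal illustration under assumptions: the stock table, its columns and the sell_one helper are hypothetical, and the crash is simulated by raising an exception between the two updates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stock (flavour TEXT PRIMARY KEY, sent INTEGER, sold INTEGER, unsold INTEGER)"
)
conn.execute("INSERT INTO stock VALUES ('milk', 10, 4, 6)")
conn.commit()

def sell_one(conn, flavour, fail_midway=False):
    """Move one chocolate from unsold to sold as a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("UPDATE stock SET sold = sold + 1 WHERE flavour = ?", (flavour,))
        if fail_midway:
            raise RuntimeError("simulated crash between the two updates")
        conn.execute("UPDATE stock SET unsold = unsold - 1 WHERE flavour = ?", (flavour,))

def is_consistent(conn):
    """Consistency rule from the text: sent = sold + unsold for every flavour."""
    query = "SELECT sent, sold, unsold FROM stock"
    return all(sent == sold + unsold for sent, sold, unsold in conn.execute(query))

sell_one(conn, "milk")                        # completes in full
try:
    sell_one(conn, "milk", fail_midway=True)  # half-done work is rolled back
except RuntimeError:
    pass

final_row = conn.execute(
    "SELECT sent, sold, unsold FROM stock WHERE flavour = 'milk'"
).fetchone()
```

The failed call leaves no trace: only the first, complete sale is visible, and the consistency rule still holds.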
E] Distributed Database

Distributed Database (DDB) is a collection of logically interrelated databases over a
computer network.

Distributed DBMS means a software system that manages a distributed database.

The features of a distributed database system are:

⮚ DDB is a collection of databases which are logically related to each other.

⮚ Cooperation exists between the databases in a transparent manner. Transparent means
that each user within the system may access all of the data within all of the databases as
if they were a single database.

⮚ DDB should be 'location independent', which means the user is unaware of where the
data is located, and it is possible to move the data from one physical location to another
without affecting the user.
F] Consistency, Availability and Partition-Tolerance (CAP) Theorem

The CAP theorem is a theorem for distributed computing systems.

The theorem states that it is impossible for a distributed computer system to simultaneously provide all three
of the Consistency, Availability and Partition tolerance (CAP) guarantees.

This is because a network failure can occur during communication among the distributed
computing nodes. Partitioning of the network therefore needs to be tolerated. Hence, while a partition lasts,
the system can provide either consistency or availability, but not both.

⮚ Consistency means 'Every read receives the most recent write or an error'. When a message or data is
sought, the network generally issues a notification of time-out or read error. During an interval of
network failure, the notification may not reach the requesting node(s).

⮚ Availability means 'Every request receives a response, without guarantee that it contains the most recent
version of the information'. During an interval of network failure, the most recent version
of the requested message or data may not be available.

⮚ Partition tolerance means 'The system continues to operate despite an arbitrary number of messages
being dropped by the network between the nodes'.
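
The trade-off can be made concrete with a toy model. The sketch below is an assumption-laden simplification, not a real distributed system: TwoNodeStore, its two replicas and the "CP"/"AP" mode switch are all hypothetical names. In "CP" mode the store answers with an error during a partition. In "AP" mode it keeps answering but may return a stale value.

```python
class TwoNodeStore:
    """Toy store with replicas A and B; writes arrive at A, reads are served by B."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.value_a = None
        self.value_b = None
        self.partitioned = False  # True while A and B cannot exchange messages

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistency first: refuse rather than let the replicas diverge.
                raise TimeoutError("replica B unreachable, write refused")
            self.value_a = value  # Availability first: accept locally, B goes stale
            return
        self.value_a = value
        self.value_b = value      # normal operation: replicate immediately

    def read_from_b(self):
        if self.partitioned and self.mode == "CP":
            raise TimeoutError("cannot confirm the latest write, read refused")
        return self.value_b

cp = TwoNodeStore("CP")
cp.write(1)
cp.partitioned = True
try:
    cp.write(2)
    cp_refused = False
except TimeoutError:
    cp_refused = True

ap = TwoNodeStore("AP")
ap.write(1)
ap.partitioned = True
ap.write(2)                       # accepted by A only
stale_read = ap.read_from_b()     # B still holds the older value
```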
5.3.2 Query Processing

Query means an application seeking a specific data set from a database.

⮚ For example, a query at a relational database at bank server may be for the ATM transactions
made in a month by a specific customer ID.

⮚ Other examples are: the most-liked chocolate flavour in the city among children in the age group 6 to 10
(Example 5.1);

⮚ the number of times a vehicle visited the ACPAMS centre (Example 5.2) and service was rendered
with a satisfaction level of 5 out of 5.

Query Processing

Query processing means using a process to obtain the results of a query made to a database.
The process should use a correct as well as efficient execution strategy.

Five steps in Query processing are:

1. Parsing and translation: This step translates the query into an internal form and then into a
relational-algebra expression. A parser checks the syntax and verifies the relations.

2. Decomposition: This step breaks the query down into micro-operations using analysis,
conjunctive and disjunctive normalisation, and semantic analysis.

3. Optimisation: This step optimises the cost of processing. The cost means the number of
micro-operations generated in processing, and is evaluated by calculating the costs of the
sets of equivalent expressions.

4. Evaluation plan: A query-execution engine (software) takes a query-evaluation plan
and executes that plan.

5. Returning the results of the query.

The process can also be based on a heuristic approach, by performing the selection and
projection steps as early as possible and eliminating duplicate operations.
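
The effect of performing selection early can be sketched in plain Python. The data, the city labels "A" and "B" and the nested_loop_join helper are invented for this illustration; a real optimiser compares many more equivalent plans. Both plans return the same rows, but the plan that selects first does far fewer comparisons.

```python
# Hypothetical sample data: 100 customers, 1000 transactions.
customers = [{"id": i, "city": "A" if i % 10 == 0 else "B"} for i in range(100)]
transactions = [{"customer_id": i % 100, "amount": 10 * (i % 7 + 1)} for i in range(1000)]

def nested_loop_join(left, right):
    """Join customers to transactions on customer id and count the comparisons made."""
    joined, comparisons = [], 0
    for c in left:
        for t in right:
            comparisons += 1
            if c["id"] == t["customer_id"]:
                joined.append((c["id"], c["city"], t["amount"]))
    return joined, comparisons

# Plan 1: join everything, then select the rows for city "A".
joined_all, cost_late = nested_loop_join(customers, transactions)
result_late = [row for row in joined_all if row[1] == "A"]

# Plan 2 (heuristic): select city "A" first, then join the smaller input.
selected = [c for c in customers if c["city"] == "A"]
result_early, cost_early = nested_loop_join(selected, transactions)
```

With this data, cost_late is 100 x 1000 comparisons and cost_early is 10 x 1000, for the same result.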
Distributed Query Processing

Distributed Query Processing means query processing operations in distributed
databases on the same system or networked systems. The distributed database system
has the ability to access remote sites and transmit the queries to other systems.

5.3.3 SQL
SQL stands for Structured Query Language. It is a language for viewing or changing databases, that is,
for querying, updating, inserting, appending and deleting data. It is also a language for data access
control, schema creation and modification, and for managing an RDBMS.

SQL was originally based upon the tuple relational calculus and relational algebra. SQL can be embedded
within other languages using SQL modules, libraries and pre-compilers.

SQL features are as follows:


⮚ Create Schema: A schema is a structure that contains descriptions of objects created by a user (base tables,
views, constraints). The user can describe and define the data for a database. Create Catalog: A catalog
consists of a set of schemas that constitute the description of the database.
⮚ Use Data Definition Language (DDL) for the commands that depict a database, including creating,
altering and dropping tables and establishing constraints. The user can create and drop databases
and tables, establish foreign keys, and create views, stored procedures and functions in a database.
⮚ Use Data Manipulation Language (DML) for commands that maintain and query a database. The user
can manipulate (INSERT, UPDATE or SELECT) the data and access data in relational database
management systems.
⮚ Use Data Control Language (DCL) for commands that control a database, including administering
privileges and committing data. The user can set (grant or add or revoke) permissions on tables,
procedures, and views.
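
The split between DDL and DML can be sketched with Python's built-in sqlite3 module. The reading table, the hot_reading view and the sample rows are hypothetical. DCL is left out because SQLite has no GRANT or REVOKE commands; server databases such as MySQL or PostgreSQL provide them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# DDL: create and alter a table, then create a view over it.
conn.execute("CREATE TABLE reading (sensor_id TEXT NOT NULL, celsius REAL NOT NULL)")
conn.execute("ALTER TABLE reading ADD COLUMN room TEXT")
conn.execute(
    "CREATE VIEW hot_reading AS SELECT sensor_id, celsius FROM reading WHERE celsius > 30"
)

# DML: insert, update, delete and select rows.
conn.executemany(
    "INSERT INTO reading (sensor_id, celsius, room) VALUES (?, ?, ?)",
    [("t1", 25.0, "lab"), ("t2", 35.5, "server"), ("t3", 31.0, "lab")],
)
conn.execute("UPDATE reading SET celsius = 29.0 WHERE sensor_id = 't3'")
conn.execute("DELETE FROM reading WHERE sensor_id = 't1'")
conn.commit()

hot = conn.execute("SELECT sensor_id, celsius FROM hot_reading ORDER BY sensor_id").fetchall()
remaining = conn.execute("SELECT COUNT(*) FROM reading").fetchone()[0]

# DDL again: drop the view and the table.
conn.execute("DROP VIEW hot_reading")
conn.execute("DROP TABLE reading")
```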
5.3.4 NOSQL
NOSQL stands for No-SQL or Not Only SQL. A NOSQL data store does not integrate with applications that
are based on SQL. NOSQL is used in cloud data stores.
NOSQL may consist of the following:
⮚ A class of non-relational data storage systems, flexible data models and multiple schemas
⮚ Class consisting of unordered keys and the JSON. Example: PNUTS
⮚ Class consisting of ordered keys and semi-structured data storage systems.
Examples are BigTable, HBase and Cassandra (used in Facebook and Apache)
⮚ Class consisting of JSON. For example, MongoDB, which is widely used for NOSQL
⮚ Class consisting of name and value in the text. For example, CouchDB
⮚ May not require a fixed table schema

NOSQL systems do not use the concept of joins (in distributed data storage systems). Data written at one
node replicates to multiple nodes. The nodes therefore hold identical copies, and the distributed system can be
fault-tolerant and can have partition tolerance.
The CAP theorem is applicable. The system offers relaxation in one or more of the ACID and CAP
properties.
Out of the three properties (consistency, availability and partition tolerance), at least two are present for an
application.
❖ Consistency means all copies have the same value, as in traditional DBs.
❖ Availability means at least one copy is available in case a partition becomes inactive or fails. For example,
in web applications, the other copy in the other partition is available.
❖ Partition means parts which are active but may not cooperate, as in distributed databases.
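
The idea of a store without a fixed table schema can be sketched with a dictionary of JSON documents. This is a toy stand-in, not the API of MongoDB, CouchDB or any other product; the keys, fields and the put/get/find helpers are invented for illustration.

```python
import json

store = {}  # key -> JSON text, a stand-in for a key-value or document store

def put(key, document):
    """Save a document under a key; documents need not share the same fields."""
    store[key] = json.dumps(document)

def get(key):
    return json.loads(store[key])

def find(predicate):
    """Return the keys of all documents that satisfy the predicate (no joins involved)."""
    return [key for key in sorted(store) if predicate(json.loads(store[key]))]

# Two documents with different structures live side by side.
put("acvm:1", {"location": "Station Road", "flavours": ["milk", "dark"]})
put("sensor:7", {"kind": "temperature", "celsius": 31.5, "room": "lab"})

hot = find(lambda doc: doc.get("celsius", 0) > 30)
flavours = get("acvm:1")["flavours"]
```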
5.3.5 Extract, Transform and Load

Extract, Transform and Load or ETL is a system which enables the usage of databases,
especially the ones stored at a data warehouse.
⮚ Extract means obtaining data from homogeneous or heterogeneous data sources.
⮚ Transform means transforming and storing the data in an appropriate structure or
format.
⮚ Load means loading the structured data into the final target database or data store or
data warehouse.

All three phases can execute in parallel. Data extraction takes a long time. Therefore, while
pulling data, the system executes transformation processes on the data already received
and prepares the already transformed data for loading.

As soon as data is ready for loading into the target, the data load starts. It means the next phase starts
without waiting for the completion of the previous phases.
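
The overlap of the three phases can be sketched with Python generators. This is a simplified, single-threaded illustration: the phases interleave record by record rather than running truly in parallel, and the record format, field names and the events log are assumptions made for the example.

```python
events = []  # records the order in which the phases touch each record

def extract(raw_lines):
    """Extract: pull raw records one at a time from a source."""
    for line in raw_lines:
        events.append("extract")
        yield line

def transform(lines):
    """Transform: parse each raw record into a structured form."""
    for line in lines:
        machine, flavour, quantity = line.split(",")
        events.append("transform")
        yield {
            "machine": machine.strip(),
            "flavour": flavour.strip().lower(),
            "quantity": int(quantity),
        }

def load(records, target):
    """Load: append each structured record to the target store."""
    for record in records:
        target.append(record)
        events.append("load")

raw = ["m1, Milk, 3", "m2, DARK, 2"]  # hypothetical source records
warehouse = []
load(transform(extract(raw)), warehouse)
```

The events list shows that the first record is loaded before the second is even extracted, so no phase waits for the previous one to finish the whole data set.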

ETL system usages are for integrating data from multiple applications (systems) hosted
5.3.6 Relational Time Series Service
⮚ Time series data means an array of numbers indexed with time (date-time or a range of
date-time).

⮚ Time series data can be considered as time-stamped data. It means the data carries along with it the
date and time information about the data values. For example, sales of chocolates in the Internet of
ACVMs (Example 5.1) are different on different dates and times.

⮚ The sales need indexing with a range between two dates or indexing with date-time. A time series of
sales is called the sales profile of the ACVMs. A time series of the log of chocolate sales is called a chocolate
sales trace.
⮚ A time series is any data set that is accessed in a sequence of time. Software programs and
analytics programs analyse the set as a time series, meaning in chronological order.

⮚ IoT devices, such as temperature sensors, wireless sensor network nodes, energy meters, RFID tags,
ATMs, ACVMs generate time-stamped or time series data.

⮚ Time Series Database (TSDB) is a software system which implements a database that optimally
handles mathematical operations (profiles, traces, curves), queries or database transactions on time
series.
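
Indexing by date-time and querying a range between two dates can be sketched with the standard bisect and datetime modules. The timestamps and quantities are made-up sample values, and sales_between is a hypothetical helper; a real TSDB adds storage, compression and aggregation features on top of this idea.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime

# Time-stamped sales kept in chronological order: (date-time, chocolates sold).
series = [
    (datetime(2024, 1, 1, 9, 0), 3),
    (datetime(2024, 1, 1, 12, 30), 5),
    (datetime(2024, 1, 2, 10, 15), 2),
    (datetime(2024, 1, 3, 18, 45), 7),
]
timestamps = [stamp for stamp, _ in series]  # the time index

def sales_between(start, end):
    """Return all entries with start <= date-time <= end using binary search."""
    low = bisect_left(timestamps, start)
    high = bisect_right(timestamps, end)
    return series[low:high]

window = sales_between(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 2, 23, 59))
total_sold = sum(quantity for _, quantity in window)
```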
Exercise

1. What does a relational database mean?

2. List the differences between flat-file and relational databases.

3. What are three essential features when using distributed databases?

4. What does TSDB mean?

5. What are the features of SQL?

6. How does SQL differ from NOSQL?

7. List the differences between time-series database system and RDBMS in construction
and usages.

5.3.7 Real-Time and Intelligence

⮚ Decisions on real-time data are fast when query processing on live (streaming) data
has low latency.

⮚ Decisions on historical data are fast when interactive query processing has low
latency.

⮚ Low latencies are obtained by various approaches: Massively Parallel
Processing (MPP), in-memory databases and columnar databases.

⮚ Teradata Aster and Pivotal Greenplum are examples of MPP. Both in-memory and
on-store transaction methods exist for the databases.

⮚ SAP HANA and QlikView are examples of in-memory databases.

⮚ SAP Sybase IQ and HP Vertica are examples of columnar databases for faster
analytics.
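
The difference between a row layout and a columnar layout can be sketched in a few lines of Python. The readings and field names are invented, and the sketch only shows the data layout, not the compression or vectorised execution that real columnar databases add.

```python
# Row layout: each record keeps all of its fields together.
rows = [
    {"sensor": "t1", "celsius": 25.0, "room": "lab"},
    {"sensor": "t2", "celsius": 35.0, "room": "server"},
    {"sensor": "t3", "celsius": 30.0, "room": "lab"},
]

# Columnar layout: each field is stored as its own contiguous list.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# An analytic query over one field touches every record in the row layout...
average_from_rows = sum(row["celsius"] for row in rows) / len(rows)

# ...but only a single list in the columnar layout.
average_from_columns = sum(columns["celsius"]) / len(columns["celsius"])
```

Both layouts give the same answer. The columnar one reads only the field the query needs, which is one reason it suits analytics.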