Data Handling in I.O.T: R.K.Biradar
4.1 should be studied from chapter no. 5 of Rajkamal book
4.2 should be studied from chapter no. 6 of Rajkamal book
Learning Objectives
LO5.1 Apply the data-acquiring and data-storage functions for IoT/M2M
devices data and messages
LO5.4 Identify the functions and usage of data analytics and data visualisations
for IoT applications and business processes
5.3.1 Databases
Required data values are organised as database(s) so that select values can be retrieved
later.
A] Database
One popular method of organising data is a database, which is a collection of data. This collection is
organised into tables. A table provides a systematic way to access, manage and update data. A single-table
file is called a flat file database. Each record is listed in a separate row, and the records are unrelated to each other.
B] Relational Database
A relational database is a collection of data into multiple tables which relate to each other
through special fields, called keys (Primary key, foreign key and unique key).
Relational databases provide flexibility. Examples of relational databases are MySQL, PostgreSQL, Oracle
Database (programmed using PL/SQL) and Microsoft SQL Server (programmed using T-SQL).
An Object Oriented Database (OODB) is a collection of objects, which are saved following an object-oriented
design. Examples are ConceptBase and Caché.
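The role of keys in relating tables can be sketched with Python's built-in sqlite3 module. This is a minimal illustration, not taken from the source: the table names (machine, sale), the column names and the sample rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE machine (
    machine_id INTEGER PRIMARY KEY,      -- primary key
    location   TEXT NOT NULL UNIQUE      -- unique key
);
CREATE TABLE sale (
    sale_id    INTEGER PRIMARY KEY,
    machine_id INTEGER NOT NULL REFERENCES machine(machine_id),  -- foreign key
    flavour    TEXT NOT NULL,
    quantity   INTEGER NOT NULL
);
""")

# Hypothetical sample data.
conn.execute("INSERT INTO machine (machine_id, location) VALUES (1, 'Mall')")
conn.executemany(
    "INSERT INTO sale (machine_id, flavour, quantity) VALUES (?, ?, ?)",
    [(1, "milk", 3), (1, "dark", 2)],
)

# The foreign key lets the two tables be related in a join.
rows = conn.execute("""
    SELECT m.location, s.flavour, s.quantity
    FROM sale AS s JOIN machine AS m ON m.machine_id = s.machine_id
    ORDER BY s.sale_id
""").fetchall()

# A sale that refers to a machine that does not exist is rejected.
try:
    conn.execute(
        "INSERT INTO sale (machine_id, flavour, quantity) VALUES (99, 'milk', 1)"
    )
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True
```

Here the primary key identifies each row, the unique key prevents two machines from sharing a location, and the foreign key ties each sale to an existing machine.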
C] Database Management System
Database Management System (DBMS) is a software system, which contains a set of programs
specially designed for creation and management of data stored in a database. Database transactions
can be performed on a database or relational database.
The database transactions must maintain the atomicity, data consistency, data isolation and durability
during transactions. Let us explain these rules using Example 5.3 as follows:
⮚ Atomicity means a transaction must complete in full, treating it as indivisible. When a service
request completes, then the pending request field should also be made zero.
⮚ Consistency means that data after the transactions should remain consistent. For example, sum of
chocolates sent should equal the sums of sold and unsold chocolates for each flavour after the
transactions on the database.
⮚ Isolation means that concurrent transactions do not interfere with each other. Here, the transactions between
Tables 5.1 and 5.2, 5.2 and 5.3, and 5.3 and 5.1 are isolated from each other. [Check next slide]
⮚ Durability means that after a transaction completes, its result is permanent and the completed transaction
cannot be recalled. Only a new transaction can effect any change.
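The atomicity rule above (completing a service request and updating the pending-request field together) can be sketched with sqlite3. This is a minimal sketch under assumptions: the tables vehicle and service_log, their columns and the helper complete_service are hypothetical names, not taken from Example 5.3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE vehicle (
    vehicle_id       INTEGER PRIMARY KEY,
    pending_requests INTEGER NOT NULL CHECK (pending_requests >= 0)
);
CREATE TABLE service_log (
    log_id     INTEGER PRIMARY KEY,
    vehicle_id INTEGER NOT NULL,
    note       TEXT NOT NULL
);
INSERT INTO vehicle (vehicle_id, pending_requests) VALUES (1, 1);
""")


def complete_service(conn, vehicle_id, note):
    """Log the service and decrement the pending count as one transaction."""
    try:
        # "with conn" commits if the block succeeds and rolls back on an exception.
        with conn:
            conn.execute(
                "INSERT INTO service_log (vehicle_id, note) VALUES (?, ?)",
                (vehicle_id, note),
            )
            conn.execute(
                "UPDATE vehicle SET pending_requests = pending_requests - 1 "
                "WHERE vehicle_id = ?",
                (vehicle_id,),
            )
        return True
    except sqlite3.IntegrityError:
        return False


first = complete_service(conn, 1, "oil change")   # one request was pending
second = complete_service(conn, 1, "duplicate")   # would make the count negative

log_count = conn.execute("SELECT COUNT(*) FROM service_log").fetchone()[0]
pending = conn.execute(
    "SELECT pending_requests FROM vehicle WHERE vehicle_id = 1"
).fetchone()[0]
```

The second call fails on the CHECK constraint, so its log entry is rolled back as well: either both statements take effect or neither does.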
E] Distributed Database
⮚ DDB should be 'location independent', which means the user is unaware of where the
data is located, and it is possible to move the data from one physical location to another
without affecting the user.
F] Consistency, Availability and Partition-Tolerance Theorem (CAP theorem) is a theorem
for distributed computing systems.
The theorem states that it is impossible for a distributed computer system to simultaneously provide all three
of the Consistency, Availability and Partition tolerance (CAP) guarantees.
This is because a network failure can occur during communication among the distributed
computing nodes. Partitioning of a network therefore needs to be tolerated. Hence, while a partition lasts, the
system can provide either consistency or availability, but not both.
⮚ Consistency means 'Every read receives the most recent write or an error'. When a message or data is
sought and cannot be served, the network generally issues a notification of time-out or read error. During an
interval of network failure, the notification may not reach the requesting node(s).
⮚ Availability means 'Every request receives a response, without guarantee that it contains the most recent
version of the information'. During an interval of network failure, the most recent version
of the requested message or data may not be available.
⮚ Partition tolerance means 'The system continues to operate despite an arbitrary number of messages
being dropped by the network between the nodes'
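The trade-off described above can be illustrated with a toy two-replica store. This is only a sketch under simplifying assumptions: the class names, the "CP"/"AP" mode flag and the single-writer layout are hypothetical, and real distributed databases are far more involved.

```python
class Replica:
    """One copy of a single data item."""

    def __init__(self):
        self.value = None


class TwoNodeStore:
    """Toy model: writes arrive at node a, reads are served by node b."""

    def __init__(self, mode):
        self.mode = mode            # "CP" favours consistency, "AP" availability
        self.a = Replica()
        self.b = Replica()
        self.partitioned = False    # True while a and b cannot communicate

    def write(self, value):
        if self.partitioned and self.mode == "CP":
            # Consistency first: refuse a write that cannot be replicated.
            raise RuntimeError("write rejected: replica unreachable")
        self.a.value = value
        if not self.partitioned:
            self.b.value = value    # replication succeeds only without a partition

    def read_from_b(self):
        if self.partitioned and self.mode == "CP":
            # Consistency first: an error instead of a possibly stale value.
            raise RuntimeError("read rejected: cannot confirm latest write")
        return self.b.value         # availability first: may be stale


# AP store: always answers, but the answer can be out of date.
ap = TwoNodeStore("AP")
ap.write(1)
ap.partitioned = True
ap.write(2)
ap_read = ap.read_from_b()

# CP store: answers with an error while the partition lasts.
cp = TwoNodeStore("CP")
cp.write(1)
cp.partitioned = True
try:
    cp.read_from_b()
    cp_read_failed = False
except RuntimeError:
    cp_read_failed = True
```

During the partition the AP store returns the older value 1 although 2 was written, while the CP store returns an error, which matches the two definitions above.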
5.3.2 Query Processing
⮚ For example, a query at a relational database at bank server may be for the ATM transactions
made in a month by a specific customer ID.
⮚ Other examples are: most-liked chocolate flavour in the city by children of age group 6 to 10
(Example 5.1);
⮚ number of times a vehicle visited at the ACPAMS center (Example 5.2) and
Query Processing
Query processing means using a process and getting the results of the query made from database.
The process should use a correct as well as efficient execution strategy.
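The bank example above (ATM transactions made in a month by a specific customer ID) can be sketched as a parameterised query. The table atm_txn, its columns, the customer IDs and the amounts are hypothetical, and sqlite3 stands in for the bank's relational database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE atm_txn (
        txn_id      INTEGER PRIMARY KEY,
        customer_id TEXT NOT NULL,
        txn_time    TEXT NOT NULL,   -- ISO 8601 text, so string order is time order
        amount      REAL NOT NULL
    )
""")
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO atm_txn (customer_id, txn_time, amount) VALUES (?, ?, ?)",
    [
        ("C001", "2024-03-02 10:15:00", 500.0),
        ("C001", "2024-03-20 18:40:00", 1200.0),
        ("C001", "2024-04-01 09:00:00", 300.0),
        ("C002", "2024-03-05 12:00:00", 700.0),
    ],
)


def monthly_transactions(conn, customer_id, month_start, next_month_start):
    """Return (time, amount) rows for one customer in [month_start, next_month_start)."""
    return conn.execute(
        "SELECT txn_time, amount FROM atm_txn "
        "WHERE customer_id = ? AND txn_time >= ? AND txn_time < ? "
        "ORDER BY txn_time",
        (customer_id, month_start, next_month_start),
    ).fetchall()


march = monthly_transactions(conn, "C001", "2024-03-01", "2024-04-01")
```

The database engine, not the caller, decides how to execute this query, which is where the execution strategy mentioned above comes in.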
Query processing has five steps, which include:
1. Parsing and translation: A parser checks the syntax of the query and verifies the
relations. The query is then translated into an internal form, a relational algebraic
expression.
2. Decomposition: The query is broken down into micro-operations using analysis,
conjunctive and disjunctive normalisation and semantic analysis.
3. Optimisation: This means optimising the cost of processing. The cost means the number of
micro-operations generated in processing, which is evaluated by calculating the costs of the
sets of equivalent expressions.
Optimisation can also be based on a heuristic approach, by performing the selection and
projection steps as early as possible and eliminating duplicate operations.
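The heuristic of performing selection early can be illustrated in plain Python. This is a sketch, not how a real optimiser is built: the helpers join, select and project, the sample relations and the city label "CityA" are hypothetical, and the size of the intermediate join result is used as a rough stand-in for cost.

```python
def join(left, right, key):
    """Nested-loop join of two lists of dict rows on a shared key."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def select(rows, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def project(rows, columns):
    """Relational projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in rows]


# Hypothetical relations: 100 customers (10 of them in CityA), 10 orders each.
customers = [
    {"cust_id": i, "city": "CityA" if i % 10 == 0 else "Other"} for i in range(100)
]
orders = [{"cust_id": i % 100, "amount": i} for i in range(1000)]


def in_city_a(row):
    return row["city"] == "CityA"


# Plan A: join everything first, then select.
joined_all = join(customers, orders, "cust_id")
plan_a = project(select(joined_all, in_city_a), ["cust_id", "amount"])
plan_a_intermediate = len(joined_all)

# Plan B: select first, then join only the matching customers.
city_a_customers = select(customers, in_city_a)
joined_few = join(city_a_customers, orders, "cust_id")
plan_b = project(joined_few, ["cust_id", "amount"])
plan_b_intermediate = len(joined_few)
```

Both plans are equivalent expressions that return the same rows, but plan B builds an intermediate result of 100 rows instead of 1000, which is the kind of saving the heuristic aims for.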
Distributed Query Processing
5.3.3 SQL
SQL stands for Structured Query Language. It is a language for viewing or changing (update, insert,
append or delete) databases. It is also a language for data access control, schema creation and
modification, and for managing the RDBMS.
SQL was originally based upon the tuple relational calculus and relational algebra. SQL can be embedded
within other languages using SQL modules, libraries and pre-compilers.
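As a small illustration of SQL used from within another language through a library, the sketch below issues schema creation, schema modification, insert, update, delete and select statements from Python using the sqlite3 module. The table reading and its contents are hypothetical. Access-control statements such as GRANT are not shown because SQLite does not support them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Schema creation and modification.
cur.execute("CREATE TABLE reading (sensor_id TEXT, value REAL)")
cur.execute("ALTER TABLE reading ADD COLUMN unit TEXT")

# Insert, update and delete (hypothetical sensor readings).
cur.executemany(
    "INSERT INTO reading (sensor_id, value, unit) VALUES (?, ?, ?)",
    [("t1", 21.5, "C"), ("t2", 19.0, "C")],
)
cur.execute("UPDATE reading SET value = ? WHERE sensor_id = ?", (22.0, "t1"))
cur.execute("DELETE FROM reading WHERE sensor_id = ?", ("t2",))
conn.commit()

# Query what is left.
remaining = cur.execute("SELECT sensor_id, value, unit FROM reading").fetchall()
```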
NOSQL systems do not use the concept of joins (in distributed data storage systems). Data written at one
node replicates to multiple nodes. The copies are therefore identical, and the distributed system can be
fault-tolerant and can have partition tolerance.
The CAP theorem is applicable. The system offers relaxation in one or more of the ACID and CAP
properties.
Out of the three properties (consistency, availability and partitions), at least two are present for an
application.
❖ Consistency means all copies have the same value, as in traditional DBs.
❖ Availability means at least one copy is available in case a partition becomes inactive or fails. For example,
in web applications, the other copy in the other partition is available.
❖ Partition means parts which are active but may not cooperate, as in distributed databases.
5.3.5 Extract, Transform and Load
Extract, Transform and Load (ETL) is a system which enables the usage of databases,
especially the ones stored at a data warehouse.
⮚ Extract means obtaining data from homogeneous or heterogeneous data sources.
⮚ Transform means transforming and storing the data in an appropriate structure or
format.
⮚ Load means loading the structured data into the final target database, data store or
data warehouse.
All three phases can execute in parallel. Data extraction takes a longer time. Therefore, the
system, while pulling data, executes the transformation processes on the already received data
and prepares the already transformed data for loading.
As soon as data is ready for loading into the target, the data load starts. This means the next phase starts
without waiting for the completion of the previous phases.
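The overlap between the phases can be sketched with Python generators. This is a simplified, single-threaded illustration: the stages are pipelined record by record rather than run on parallel workers, and the input format ("device, value" lines), the function names and the events log are hypothetical.

```python
events = []  # records the order in which the stages handle each item


def extract(raw_lines):
    """Extract: pull raw records one at a time from the source."""
    for line in raw_lines:
        events.append(("extract", line))
        yield line


def transform(lines):
    """Transform: parse each raw line into a structured record."""
    for line in lines:
        device, value = line.split(",")
        record = {"device": device.strip(), "celsius": float(value)}
        events.append(("transform", record["device"]))
        yield record


def load(records, target):
    """Load: append each structured record to the target store."""
    for record in records:
        target.append(record)
        events.append(("load", record["device"]))


raw = ["s1, 21.5", "s2, 19.0"]   # hypothetical sensor readings
warehouse = []                    # stands in for the target data store
load(transform(extract(raw)), warehouse)
```

The events list shows that the first record is loaded before the second one is even extracted, so loading does not wait for extraction to finish.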
ETL system usages are for integrating data from multiple applications (systems) hosted
5.3.6 Relational Time Series Service
⮚ Time series data means an array of numbers indexed with time (date-time or a range of
date-time).
⮚ Time series data can be considered as time stamped data. It means data carries along with it the
date and time information about the data values. For example, sales of chocolates in Internet of
ACVMs (Example 5.1) are different on different dates and times.
⮚ The sales need indexing with a range between two dates or with date-time. A time series of
sales is called the sales profile of the ACVMs. A time series of the log of chocolate sales is called the chocolate
sales trace.
⮚ A time series is any data set that is accessed in a sequence of time. Software programs such as
an analytics program analyse the set as a time series, meaning in chronological order.
⮚ IoT devices, such as temperature sensors, wireless sensor network nodes, energy meters, RFID tags,
ATMs, ACVMs generate time-stamped or time series data.
⮚ Time Series Database (TSDB) is a software system which implements a database that optimally
handles mathematical operations (profiles, traces, curves), queries or database transactions on time
series.
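Indexing sales by date-time and querying a range between two dates can be sketched in plain Python. This is a minimal in-memory illustration, not a TSDB: the class SalesTrace, its methods and the sample sales are hypothetical.

```python
from bisect import bisect_left
from datetime import datetime


class SalesTrace:
    """Time-stamped sales kept in chronological order."""

    def __init__(self):
        self._times = []
        self._units = []

    def append(self, when, units):
        """Add one time-stamped sample; samples must arrive in time order."""
        if self._times and when < self._times[-1]:
            raise ValueError("samples must arrive in time order")
        self._times.append(when)
        self._units.append(units)

    def between(self, start, end):
        """Return (time, units) pairs with start <= time < end."""
        lo = bisect_left(self._times, start)
        hi = bisect_left(self._times, end)
        return list(zip(self._times[lo:hi], self._units[lo:hi]))

    def total(self, start, end):
        """Sum the units sold in the half-open range [start, end)."""
        return sum(units for _, units in self.between(start, end))


# Hypothetical chocolate sales from one ACVM.
trace = SalesTrace()
trace.append(datetime(2024, 3, 1, 9, 0), 4)
trace.append(datetime(2024, 3, 1, 17, 30), 6)
trace.append(datetime(2024, 3, 2, 11, 0), 5)

march_first = trace.total(datetime(2024, 3, 1), datetime(2024, 3, 2))
```

Because the samples are stored in time order, a range query only needs two binary searches, which hints at why a store organised around time can answer such queries efficiently.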
Exercise
7. List the differences between time-series database system and RDBMS in construction
and usages.
5.3.7 Real-Time and Intelligence
⮚ Decision on real-time data is fast when query processing in live data (streaming)
has lower latency.
⮚ Decision on historical data is fast when interactive query processing has low
latency.
⮚ TeraData Aster and Pivotal Greenplum are examples of massively parallel processing (MPP) databases.
Both in-memory and on-store transaction methods exist for the databases.
⮚ SAP Sybase IQ and HP Vertica are examples of columnar databases for faster
analytics.