REFCARD 263

Messaging and Data Infrastructure for IoT
Getting Started With IoT Messaging at Scale

CONTENTS

•  IoT Challenges and Solutions
•  Unified Messaging and Data Infrastructure for IoT
   −  Messaging Infrastructure Needs
   −  Data Infrastructure Requirements
•  Getting Started: An IoT Architecture Example
•  Conclusion
   −  Additional Resources

TIMOTHY SPANN
DEVELOPER ADVOCATE, STREAMNATIVE

The Internet of Things (IoT) is a strange term for software that runs continuously on any number and type of devices. The scale of IoT data is sometimes mind-boggling, with thousands of devices communicating messages in streams to message brokers that need to instantly receive and often acknowledge those events. The stream of data from devices requires new, scalable infrastructure.

The most common protocol utilized for these messages is MQTT, which is lightweight and allows for easy publish-subscribe transfers of small data events. For super-fast messaging, clients are commonly implemented in C or Python, and I will show you a short example of this later.

In this Refcard, we will walk through modern messaging and unified data infrastructure to derive business insights from IoT devices. Medium and large enterprises can now build diverse and innovative solutions with open-source IoT platforms. They will be able to filter, route, transform, enrich, distribute, and replicate messages at any scale for real-time analytics.

Once such an infrastructure is in place, any and all data sources can be joined with IoT data, enabling applications never before dreamt of.

IOT CHALLENGES AND SOLUTIONS

Before we begin, we need to look at the challenges of implementing a complex solution for IoT. One of the first challenges is that existing IoT systems are cobbled together from a mix of legacy proprietary systems and poorly connected solutions. The solution is a unified open-source platform that allows for building a modern IoT system that can populate modern data lakes and data warehouses.

Many solutions are hindered by the inability to dynamically scale up and down based on demand. IoT is often very dynamic since the number of devices can drastically change at any time. The solution is a platform that can scale across Kubernetes and clouds quickly to match rapid iterations. Modern messaging systems can scale to handle the volume of messages occurring.

IoT is often hindered by fragile legacy protocols that don't work well for streaming real-time data. Modern systems support fast messaging protocols like MQTT and the Apache Pulsar binary protocol specification. These modern systems are asynchronous and support data processing frameworks like Apache Spark and Apache Flink. Pulsar is a unique messaging system that allows for transparent native support for multiple communication protocols, including the Apache Kafka protocol, MQTT, Advanced Message Queuing Protocol (AMQP), and the Apache RocketMQ protocol, all of which are supported in a pluggable, non-proxy manner identical to the native binary protocol used by Pulsar. This allows Pulsar to grow and adapt to new protocols without any performance degradation.

Though this is the case, to provide support for multiple MQTT versions and rich MQTT features, Pulsar is often used with MQTT brokers such as the open-source EMQX.

UNIFIED MESSAGING AND DATA INFRASTRUCTURE FOR IOT

For advanced IoT solutions, we need a unified messaging and data infrastructure that can handle all the needs of a modern enterprise that is running IoT applications. In Figure 1, we can see an example of such a system that is built on scalable open-source projects that can support messaging, streaming, and all the data needs of IoT systems.

Figure 1

MESSAGING INFRASTRUCTURE NEEDS

A unified platform that supports IoT messaging needs to support MQTT and multiple streaming protocols. It must be able to handle past, current, and future needs as more devices are brought into solutions. These devices will have updated operating systems, languages, networks, protocols, and data types that require dynamic IoT platforms. These unified IoT messaging platforms will need to allow for extension, frequent updates, and diverse libraries to support the constantly changing and expanding ecosystem.

There are a few key requirements for this infrastructure:

•  Low latency
•  Flexibility
•  Multiple protocol support
•  Resiliency
•  Geo-replication
•  Multi-tenancy
•  Data smoothing
•  Infinite scalability

LOW LATENCY

In IoT, there are numerous common patterns formed around different edge use cases. For sensors, we often have a continuous stream of small events with little variation. We need the ability to configure for low latency, with the option to drop messages for performance since they are mostly duplicates and we only care about the current state.

A common pattern is to take samples, averages, or aggregates over a short time window. This can often be achieved further down the data chain in data processing via SQL.
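As a sketch of this pattern, the short Python example below keeps only the latest reading per device and emits an average over a ten-second window instead of acting on every raw event. It assumes the paho-mqtt client library, a local MQTT broker, and hypothetical topic and field names:

# Sketch: window sampling on the consumer side. Topic and field names are
# placeholders; a real deployment would tune the window size and QoS.
import json
import time
from collections import defaultdict

import paho.mqtt.client as mqtt

WINDOW_SECONDS = 10
latest = {}                  # current state per device
window = defaultdict(list)   # readings collected during the window

def on_message(client, userdata, msg):
    event = json.loads(msg.payload)
    device = event["device_id"]
    latest[device] = event["temperature"]
    window[device].append(event["temperature"])

client = mqtt.Client()
client.on_message = on_message
client.connect("localhost", 1883, 60)
client.subscribe("sensors/#", qos=0)   # QoS 0: dropping a reading is acceptable
client.loop_start()

while True:
    time.sleep(WINDOW_SECONDS)
    for device, values in window.items():
        print(device, "average over window:", sum(values) / len(values))
    window.clear()

Once volumes grow, the same aggregation is usually pushed further down the pipeline into engines like Flink SQL or Spark.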
FLEXIBILITY

A very important feature that is a minimum requirement for this IoT messaging infrastructure is support for flexibility. This flexibility is multifaceted, as the system needs to allow configuration changes to important features such as persistence, message deduplication, protocol additions, storage types, replication types, and more. This may be the most important feature for IoT versus IT systems, with new vendors, devices, and requirements being added for almost every new application or use case.

New devices, sensors, networks, and data types appear from a heterogeneous group of IoT hardware vendors — we need our solution to support, adapt to, and work with all of these when they become available. This can be difficult and requires our messaging infrastructure to support multiple protocols, libraries, languages, operating systems, and data subscription types.

Our system will also have to be open source to allow anyone to easily extend any piece of the infrastructure when unique hardware requires something that doesn't fit into any of our existing pathways. This flexibility is something that can be found only in open-source platforms designed to grow and change. If you don't have the flexibility to meet new requirements based on these new devices, your system will not meet your needs. This could happen at any time. Change is the only constant in the world of IoT — be prepared.

MULTIPLE PROTOCOL SUPPORT

A corollary to flexibility is support for multiple messaging protocols, since no one protocol fits every device, problem, system, or language. MQTT is a common protocol, but many others can be used or requested. We also need to support protocols such as WebSockets, HTTP, AMQP, and Kafka's binary messaging protocol over TCP/IP.

RESILIENCY

Most messaging infrastructures are designed for some level of resiliency within some percentage of tolerance, and that works fine for most use cases. This will not be adequate for IoT, as workload changes, bad data, and sudden changes to networking, power, scale, demand, and data quality are to be expected. The modern IoT messaging system must bend but never break under these bursts of data and problems.

Along with resiliency, our system must scale to any size required by the applications in question. With IoT, this can start with dozens of devices but then scale to millions in very minimal time. We must not set a fixed upper limit on how large our clusters, applications, or data can be. We must scale to meet these challenges.

We also need resiliency to survive when IoT devices get hacked or have issues. We need to support as many security paradigms, protocols, and options as feasible. At a minimum, we must enable full end-to-end encryption over SSL- or HTTPS-encrypted channels. Even with encrypted payloads, the system must still handle millions of messages with low latency and fast throughput.
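As a minimal sketch of what an encrypted channel looks like from the client side, assuming TLS is already enabled on the brokers and that the hostnames and certificate paths below exist, both paho-mqtt and the Pulsar Python client can connect over TLS:

# Sketch: encrypted client connections. Hostnames, ports, and certificate
# paths are placeholders for your own TLS-enabled brokers.
import paho.mqtt.client as mqtt
import pulsar

# MQTT over TLS (8883 is the conventional MQTT/TLS port)
mqtt_client = mqtt.Client()
mqtt_client.tls_set(ca_certs="/etc/certs/ca.pem",
                    certfile="/etc/certs/client.pem",
                    keyfile="/etc/certs/client.key")
mqtt_client.connect("broker.example.com", 8883, 60)

# Pulsar over TLS (pulsar+ssl, conventionally port 6651)
pulsar_client = pulsar.Client(
    "pulsar+ssl://broker.example.com:6651",
    tls_trust_certs_file_path="/etc/certs/ca.pem")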


We also need to provide logging, metrics, APIs, feeds, and dashboards to help identify and prevent hacks or attacks. The system should allow machine learning models to be deployed against queues and topics dynamically to intercept hacked messages or intrusion attempts.

Security should never be an afterthought when real-time systems interacting with devices and hardware are involved.

GEO-REPLICATION

Distributed edge computing is a fancy way of saying that we will have our IoT devices deployed and running in multiple areas of the physical and networking world. This requirement to get data to where it needs to be analyzed, transformed, enriched, processed, and stored is a key factor in choosing the right messaging platform for your use case.

To achieve this, one important and required feature is support within the platform for geo-replication between clusters, networks, availability zones, and clouds. This will often require active-active, two-way replication of data, definitions, and code. Our solution has no choice but to support this data routing at scale.

As users of IoT expand, they quickly start deploying across new cloud regions, states, countries, continents, and even space. We need to be able to get data around the globe and often duplicate it to primary regions for aggregations. Our messaging infrastructure needs to be able to geo-replicate messaging data without requiring new installations, closed-source additions, or coding.

The system will need to replicate via configuration and allow any number of clusters to communicate in both active-active and active-passive modes. This is not without difficulty and will require full knowledge of the systems, clouds, and security needed for all regions. Again, this is something that can only be achieved with open source.

MULTI-TENANCY

If we only had one IoT application, use case, company, or group working with the system, then things would be easy, but this is not the case. We will have many groups, applications, and often companies sharing one system. We require secure multi-tenancy to allow independent IoT applications to share one conduit for affordability, ease of use, and scale.
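As a rough sketch of how both multi-tenancy and the geo-replication configuration described above can be driven purely by configuration, the Python calls below use the Pulsar admin REST API. The tenant, namespace, role, and cluster names are hypothetical, and the endpoint paths should be verified against the admin API reference for your Pulsar version:

# Sketch: per-tenant isolation plus namespace-level geo-replication via the
# Pulsar admin REST API. All names below are placeholders.
import requests

ADMIN = "http://localhost:8080/admin/v2"

# Create an isolated tenant for one IoT application or business unit
requests.put(f"{ADMIN}/tenants/iot-fleet",
             json={"adminRoles": ["iot-fleet-admin"],
                   "allowedClusters": ["us-west", "us-east"]})

# Create a namespace to hold that tenant's sensor topics
requests.put(f"{ADMIN}/namespaces/iot-fleet/sensors")

# Replicate the namespace between the two clusters (active-active)
requests.post(f"{ADMIN}/namespaces/iot-fleet/sensors/replication",
              json=["us-west", "us-east"])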

DATA SMOOTHING

In some edge cases, all the data sent is time series data that needs to be captured for each time point — this could arrive out of sequence, and the messaging system needs to be able to handle this.

In messaging systems like Kafka or Pulsar, you may want to handle this with exactly-once semantics so that nothing is duplicated or lost before it lands in durable permanent storage, such as an AWS S3-backed data lake. The advantage of modern data streaming systems is that missing data can be resent, waited on, or interpolated based on previous data streams. This is often done by utilizing machine learning algorithms such as Facebook's Prophet library.
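As a small illustration of the smoothing step, using invented readings and plain pandas interpolation rather than Prophet, out-of-order events can be re-sorted and gaps filled once they land downstream:

# Sketch: re-order a gappy time series and interpolate the missing minute.
# The readings are made up; Prophet or another model could replace the
# simple time-based interpolation.
import pandas as pd

events = pd.DataFrame(
    {"ts": pd.to_datetime(["10:03", "10:00", "10:01", "10:04"]),
     "temperature": [21.4, 20.9, 21.1, 21.6]})

series = (events.sort_values("ts")            # restore time order
                .set_index("ts")["temperature"]
                .resample("1min").mean()      # one slot per minute; 10:02 becomes NaN
                .interpolate(method="time"))  # fill the gap from its neighbors
print(series)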
INFINITE SCALABILITY

Often in IoT, we need to rapidly scale up for short periods of time and then scale down, such as for temporary use cases like deployments of many sensors at sporting events. In some instances, we may need this to happen with no notice and on demand.

The ability to do this scaling without tremendous expense, difficulty, or damaged performance is a required feature of the infrastructure. This can often be achieved by using a messaging platform that runs natively on Kubernetes and separates compute from messaging storage.

IoT workloads can happen in large bursts as thousands of devices in the field become active at once (say, energy meters), and we need to be able to survive these bursts. These bursts should drive intelligent scaling, and where delays occur in infrastructure availability, we must be able to provide caching, tiered storage, and backpressure so that we never break in the face of massive torrents of device data.

We also need to cleanly shut down and remove extra brokers when the storm has subsided, so we can reduce our infrastructure costs cleanly and intelligently.

We have gone through a very tall order on what this magical, scalable messaging system must do. Fortunately, there are open-source options that you can investigate today for your edge use cases.

DATA INFRASTRUCTURE REQUIREMENTS

A new but important part of building IoT solutions is a modern data infrastructure that supports all the analytical and data processing needs of enterprises today. In the past, there was a great disconnect between how IoT systems' data was handled and that of the rest of the IT data assets. This disconnect was driven by the differences in infrastructure used by IoT and other use cases. With a modern unified messaging infrastructure bridging the gap between systems, we no longer face those differences. Therefore, we must now update what systems our IoT data is fed into and how.

In the modern unified data infrastructure required for all use cases, including IoT, we must support some basic tenets.

DYNAMIC SCALABILITY

A key factor in handling the diverse workloads and bursty nature of IoT events is support for dynamic scalability. This usually requires that the messaging system runs on Kubernetes and allows for scaling up and down based on metrics and workloads.

CONTINUOUS IOT DATA ACCESS

As soon as each event from any device enters the messaging system, we need to be able to access it immediately. This allows us to run aggregates and check for alerting conditions. This is often achieved with a real-time stream processing engine, such as Apache Flink, connected to the event stream.
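For a quick sketch of acting on each event as it arrives, a consumer built with the Pulsar Python client (standing in for a full Flink job, with a hypothetical topic and field) can check an alerting condition inline:

# Sketch: immediate, per-event access with a simple alerting condition.
# Topic, subscription, and field names are placeholders.
import json
import pulsar

client = pulsar.Client("pulsar://localhost:6650")
consumer = client.subscribe("persistent://public/default/iot-events",
                            subscription_name="alerting")

while True:
    msg = consumer.receive()
    event = json.loads(msg.data())
    if event.get("temperature", 0) > 75:
        print("ALERT:", event)   # in practice, publish to an alerts topic
    consumer.acknowledge(msg)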


Continuous data access goes hand in hand with the standard requirement of robust library support for all the modern open-source data frameworks such as Apache Spark, Apache NiFi, Apache Flink, and others. We also want robust support from cloud data processing services like Decodable.

SQL ACCESS TO LIVE DATA

Along with continuous data access via a programmatic API, we need to allow analysts and developers to utilize SQL to access these live data streams as each event enters the system. This allows for aggregates, condition checks, and joining streams. This may be the most important feature of a modern messaging system supporting IoT. Without SQL, the learning curve may reduce adoption within the enterprise.

OBSERVABILITY

If events don't arrive, the system slows down, or things go offline, we must know not only instantly but also preemptively that things are starting to go astray. A smart observability system will trigger scaling, alerts, and other actions to notify administrators and possibly stop potential data losses. This is often added as an afterthought, but it can be critical. We also need to be able to replay data, fill in missing data, or rerun data.

SUPPORT FOR MODERN CLOUD DATA STORES AND TOOLS

Our IoT events must stream to any of the cloud data stores that we need them to. This may be as simple as a cloud object store like AWS S3 or as advanced as a real-time analytics database like Apache Pinot, and it is a minimum requirement that cannot be skipped. We need our IoT events to be in the same database as all our other main data warehouse datasets. We need to support open data formats like Apache Parquet, JSON, and Apache Avro.
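As a tiny sketch of landing events in one of these open formats, with invented records and the pyarrow library, a batch of JSON events can be written out as Parquet, ready for a data lake:

# Sketch: write a handful of sensor events to a Parquet file with pyarrow.
# Field names and the output path are placeholders.
import pyarrow as pa
import pyarrow.parquet as pq

events = [
    {"device_id": "rpi-001", "ts": "2022-07-01T10:00:00Z", "temperature": 21.4},
    {"device_id": "rpi-002", "ts": "2022-07-01T10:00:01Z", "temperature": 22.0},
]

table = pa.Table.from_pylist(events)         # schema inferred from the records
pq.write_table(table, "iot_events.parquet")  # columnar file for the data lake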
HYBRID CLOUD SUPPORT

Finally, we need to be able to run our messaging architecture across various infrastructure hosting locations, including on-premises and multiple public clouds. The ability to be installed anywhere, run anywhere, and propagate messages is key.

GETTING STARTED: AN IOT ARCHITECTURE EXAMPLE

Now that we have looked at what we need, let's build an example architecture to accomplish all these requirements for an actual IoT use case. I have included two ways to get started with the most common open-source options for messaging infrastructure for IoT at scale: Apache Pulsar for streaming and EMQX as an MQTT broker.

As a quick start, you can install a standalone broker or Docker instance of EMQX. You can download the latest stable release from the project's GitHub release directory here: https://github.com/emqx/emqx/releases

Recommended next steps:

•  Read the EMQX documentation before installation
•  Start on your laptop and test with one real device
•  Go to production in batches of devices and test the results

If you wish to install the EMQX system on a Linux system, you can follow the reference guide here. To run EMQX in Docker instead:

docker run -d --name emqx -p 1883:1883 -p 8083:8083 \
  -p 8084:8084 -p 8883:8883 -p 18083:18083 emqx/emqx:latest

An Apache project option is Apache Pulsar. The standalone version of Apache Pulsar will require a modern operating system such as Linux or macOS and a JDK version of 8 or higher. I recommend using JDK 17, as it has the best performance and features that enable a fast and stable system.

wget https://archive.apache.org/dist/pulsar/pulsar-2.10.0/apache-pulsar-2.10.0-bin.tar.gz
tar xvfz apache-pulsar-2.10.0-bin.tar.gz
cd apache-pulsar-2.10.0

To get started with Apache Pulsar in Docker, you will need enough RAM and CPU as well as Docker installed. I recommend at least eight gigabytes of RAM and four virtual CPUs. Once you meet these requirements, you can begin running your Apache Pulsar messaging broker on your machine:

docker run -it -p 6650:6650 -p 8080:8080 \
  --mount source=pulsardata,target=/pulsar/data \
  --mount source=pulsarconf,target=/pulsar/conf \
  apachepulsar/pulsar:2.10.0 bin/pulsar standalone

You can now build a sample IoT application and try out various examples via the standard documentation. I suggest sending and receiving a few messages first to make sure everything works. Also check the logs, metrics, dashboards, and other tools listed for the respective messaging tools. If messages are appearing as expected, you are ready to start using simulated devices or a Raspberry Pi.
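A minimal send-and-receive check against the standalone broker started above might look like the following, using the pulsar-client Python library and a hypothetical topic name:

# Sketch: smoke test for the local standalone Pulsar broker.
# Requires the pulsar-client package (pip install pulsar-client).
import pulsar

client = pulsar.Client("pulsar://localhost:6650")

# Subscribe first so the test message is not missed
consumer = client.subscribe("persistent://public/default/smoke-test",
                            subscription_name="smoke-sub")

producer = client.create_producer("persistent://public/default/smoke-test")
producer.send(b'{"device_id": "test", "temperature": 21.4}')

msg = consumer.receive(timeout_millis=5000)
print("received:", msg.data())
consumer.acknowledge(msg)

client.close()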

I have a simple Python example that you can use to start with. It generates standard JSON, which is a common and easy-to-work-with format for sensor applications. There are many affordable sensors available for the Raspberry Pi, and I have linked a few examples at the end of this Refcard.
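The sketch below is not the exact example from the linked repositories, but it shows the general shape: generate a JSON payload and publish it over MQTT with paho-mqtt. A real device would read values from an attached sensor rather than generating random numbers, and the topic and field names are placeholders:

# Sketch: a simulated sensor publishing JSON over MQTT once per second.
import json
import random
import time
import uuid

import paho.mqtt.client as mqtt

client = mqtt.Client()
client.connect("localhost", 1883, 60)   # assumes a local EMQX/MQTT broker
client.loop_start()

while True:
    payload = json.dumps({
        "uuid": str(uuid.uuid4()),
        "ts": int(time.time()),
        "temperature": round(random.uniform(18.0, 30.0), 2),
        "humidity": round(random.uniform(30.0, 70.0), 2),
    })
    client.publish("sensors/rpi-001", payload=payload, qos=0)
    time.sleep(1)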


Figure 2

A use case that comes up with enterprises quite often is that of an IoT implementation with thousands of devices streaming data over MQTT and Pulsar into multiple topics. From here, our system makes data asynchronously available for subscribed data consumers. These feeds are continuously consumed by Flink SQL and auto-synced to a data lake, data stores, and fast Spark ETL. There are often data consumers that continuously update real-time dashboards, as seen with InfluxDB.

A common way to process events, especially when we are receiving millions of them a second, is to store them in a data lake. We can build a simple and fast ETL for this utilizing Apache Spark's Structured Streaming framework with some easy-to-write and deployable code. A simple Scala snippet for this is shown below.

In this two-statement code snippet, we read from a Pulsar topic containing our IoT sensor data, then select all the fields and store the data in Apache Parquet files in an AWS S3 object store. This can be substituted for other common data lake formats and object stores depending on the cloud or Kubernetes platform that you are running on.

val dfPulsar = spark.readStream.format("pulsar")
  .option("service.url", "pulsar://server:6650")
  .option("admin.url", "http://server:8080")
  .option("topic", "persistent://public/default/topic")
  .load()

val pQuery = dfPulsar.selectExpr("*").writeStream
  .format("parquet").option("truncate", false)
  .option("checkpointLocation", "/tmp/checkpoint")
  .option("path", "/s3").start()

If you wanted to run continuous SQL statements against a large stream of IoT messages, you could do this with SQL:

select max(equivalentco2ppm) as MaxCO2,
       max(totalvocppb) as MaxVocPPB
from topicname;

Using Flink SQL, a common use case is to aggregate our data over time windows and capture the result to any number of cloud data stores or to additional topics for further analytics or machine learning.

On a typical device, we could easily send our IoT messages via MQTT in Python with two short lines of code. The first connects to an MQTT broker, and the second publishes a message to a queue. In this example, the message is defined by json_string:

client.connect("broker", 1883, 180)
client.publish("persistent://public/default/queue", payload=json_string, qos=0, retain=True)

We could also produce messages to MQTT brokers via Java:

MemoryPersistence persistence = new MemoryPersistence();
IMqttClient mqttClient = new MqttClient("tcp://host:1883", clientId, persistence);
mqttClient.connect(mqttConnectOptions());
MqttMessage mqttMessage = new MqttMessage();
mqttMessage.setPayload(payload);
mqttClient.publish("queuename", mqttMessage);

There are many other languages with MQTT library support, and you should check the documentation for one that matches the development languages available on your device.

An advanced feature of one implementation of a scalable broker is support for standard SQL via Presto. This allows you to use simple SQL statements as your data consumer. This can be done from any notebook, application, query tool, or system that supports a JDBC, ODBC, or SQLAlchemy driver.

select * from pulsar."public/default".iottopic;

For SQL access to EMQX, you can use the SQL Rule Engine:

SELECT * FROM "#" WHERE field = 'valuex'

CONCLUSION

We found that to scale an IoT solution, you need open-source messaging and data infrastructure. We've found there are only a few platforms, including Apache Pulsar and EMQX, that have all the features needed to solve these complex IoT problems. As we iterated through a long list of requirements, we determined how to implement them in the real world. You can now be assured that these modern data messaging systems will handle your IoT workloads.

Figure 3


For example, geo-replication, as mentioned, is very important for propagating messages between various infrastructure facilities, which is common in distributed edge technologies like IoT. We also realized that without support for open-source data processing frameworks, our data will sit idle in formats we can't access, locking our data away from our analysts. At every need and every level, we should keep things as open, flexible, and scalable as possible to ensure our applications function at the necessitated degree and provide the analytics required by modern enterprises.

For more information on building out your messaging infrastructure, please see the following references, which include deep dives, installation references, examples, and further reading.

ADDITIONAL RESOURCES

•  "Real-Time Pulsar and Python Apps on a Pi" – https://dzone.com/articles/five-sensors-real-time-with-pulsar-and-python-on-a
•  "Pulsar in Python on Pi for Sensors" – https://dzone.com/articles/pulsar-in-python-on-pi
•  MQTT on Pulsar (MoP) – https://github.com/streamnative/mop
•  "Apache Pulsar with MQTT for Edge Computing" – https://www.youtube.com/watch?v=sPGyl6XgGHw&t=11s
•  "FLiP-Pi-Weather" – https://github.com/tspannhw/FLiP-Pi-Weather
•  "FLiP-Py-Pi-GasThermal" – https://github.com/tspannhw/FLiP-Py-Pi-GasThermal
•  "FLiP-Py-Pi-EnviroPlus" – https://github.com/tspannhw/FLiP-Py-Pi-EnviroPlus
•  "FLiP-Pi-BreakoutGarden" – https://github.com/tspannhw/FLiP-Pi-BreakoutGarden
•  "The Dream Stream Team for Pulsar and Spring" – https://www.slideshare.net/bunkertor/the-dream-stream-team-for-pulsar-and-spring
•  "Integration of Apache NiFi and Apache Pulsar" – https://streamnative.io/blog/release/2022-03-09-cloudera-and-streamnative-announce-the-integration-of-apache-nifi-and-apache-pulsar/
•  "Data Management for Industrial IoT" Refcard – https://dzone.com/refcardz/data-management-for-industrial-iot

WRITTEN BY TIMOTHY SPANN, DEVELOPER ADVOCATE, STREAMNATIVE

Tim Spann is a developer advocate for StreamNative. He works with StreamNative Cloud, Apache Pulsar, Apache Flink, Apache NiFi, MQTT, AMQP, Apache Kafka, Edge AI, TensorFlow, Apache Spark, InfluxDB, Aerospike, Elasticsearch, lakehouses, and deep learning. Tim has over a decade of experience with IoT, big data, distributed computing, messaging, streaming technologies, and Java programming.
