
Creating a Serverless Solution for Streaming IoT Data in Microsoft Azure — Part I


Photo by Markus Spiske on Unsplash

There is no doubt about it: the use of IoT devices (and the data they produce) is growing wildly. According to one Gartner study, by 2020 there will be 5.8 billion IoT devices online, a 46% increase from the 3.9 billion online in 2018. By 2025, the total volume of data generated by IoT devices is forecast to reach a staggering 79.4 zettabytes, almost six times the volume in 2018.

The wonderful thing is that platforms like Microsoft Azure make receiving, storing, and analyzing this data quite easy. In this article, I’m going to walk you through how to harvest streamed telemetry from an IoT temperature sensing device, store it in a data lake, and run some rudimentary anomaly detection on it. All the code for this article can be found in this GitHub repository.

Components of the solution include:

  • Azure IoT Hub, for receiving data sent from IoT devices as well as capturing the raw data to a data lake.
  • Azure Stream Analytics, for processing data sent to the IoT Hub and detecting anomalies.
  • Azure Event Hub, for receiving messages from Stream Analytics when anomalies are detected, for downstream processing.
  • Azure Data Lake Storage, for storing raw data sent to the IoT Hub for possible later analysis.

Prerequisites

If you want to follow along, you must run the Terraform deployment located in the terraform folder of the linked GitHub repository. Instructions for doing this are in the README file of the repository. This is optional as I'll have plenty of screenshots throughout the article, but by all means dig in and build your own lab!

Components of the Solution in Detail

Azure IoT Hubs

Azure IoT Hubs provide an easy PaaS offering that allows IoT devices to send data to a messaging endpoint. They are easily scalable, and you can start for as little as zero cost per month (the Free tier provides up to 8,000 messages per day and is great for proofs of concept). When deploying for a live installation, you should always start with either a Basic or Standard tier rather than Free, because according to Microsoft, you cannot upgrade Free tier hubs to other tiers. Carefully review the provided feature matrix and message throughput numbers before deciding; you can always scale the solution up at a later time.
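If you were creating the hub by hand rather than with the Terraform template in the repository, the tier is chosen at creation time. A minimal sketch with the Azure CLI (the resource group and hub names are placeholders):

```bash
# Create an IoT Hub on the Basic B1 tier; unlike the Free (F1) tier,
# Basic and Standard hubs can be scaled up later
az iot hub create \
    --resource-group my-resource-group \
    --name my-iot-hub \
    --sku B1
```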

In order to allow devices to connect to the IoT Hub, you must provision device identities. This is easily done via the Azure CLI as follows:

Commands to create a new IoT Device in IoT Hubs
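A minimal sketch of those commands is shown here; the hub and device names are placeholders, and older versions of the Azure CLI expose the connection string through the show-connection-string command instead:

```bash
# Add the Azure IoT CLI extension if it isn't already installed
az extension add --name azure-iot

# Create a device identity on the hub (names are placeholders)
az iot hub device-identity create \
    --hub-name my-iot-hub \
    --device-id temperature-sensor-01

# Retrieve the connection string the device uses to authenticate
az iot hub device-identity connection-string show \
    --hub-name my-iot-hub \
    --device-id temperature-sensor-01
```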

This will display a connection string that is used by the various available client libraries (of which there are many).

You also must create Shared Access Policies to allow downstream applications, such as Azure Stream Analytics or Azure Functions, to consume device messages. In this example, the Terraform template already created one called “streamanalytics”, which was granted the Service Connect permission, allowing it to read messages from the hub. For a list of all the permissions that can be granted, see the Microsoft documentation.
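If you're not using the Terraform template, a policy like that one can be created from the Azure CLI; a sketch (the hub name is a placeholder):

```bash
# Create a dedicated policy for Stream Analytics with only the
# Service Connect permission, which allows reading device-to-cloud messages
az iot hub policy create \
    --hub-name my-iot-hub \
    --name streamanalytics \
    --permissions ServiceConnect
```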

TIP: Don’t ever use the built-in iothubowner policy, as that grants essentially unlimited permissions to the hub. Also, create one policy per consuming application so that you can easily rotate the access keys without having to replace them in many places.


When consuming messages, you want to configure a separate Consumer Group for each application that reads them. This allows each application to operate independently, for example by tracking its own position in the message stream. In our example, the deployment created a streamanalytics Consumer Group, since that's what will be reading the messages.
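Creating a consumer group outside of Terraform is a one-liner with the Azure CLI (the hub name is a placeholder):

```bash
# Create a consumer group on the hub's built-in events endpoint
# for the Stream Analytics job to read from
az iot hub consumer-group create \
    --hub-name my-iot-hub \
    --name streamanalytics
```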


Finally, there’s the concept of routing, which determines where messages get, well, routed! This allows you to send messages to multiple endpoints, such as Event Hub-compatible endpoints, Service Bus Queues, and Azure Storage. The latter is very useful for capturing and storing raw data in a data lake, which is why we’ve configured it here.


When configuring a Storage endpoint, you can control the folder structure in which files are created. You can also configure how often files are written and how large a batch must accumulate before it is written to the storage account.


TIP: To get the most out of the data later, make sure the storage account you configure has Azure Data Lake Storage Gen2 enabled. Also, make sure the path and file name are sortable; for example, start the path with the year, then month, then day, and so on.
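For reference, the default file name format for an IoT Hub storage endpoint already follows a time-sortable pattern along these lines; the tokens can generally be reordered, but keep the most significant time parts first:

```
{iothub}/{partition}/{YYYY}/{MM}/{DD}/{HH}/{mm}
```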

Azure Data Lake Storage

Azure Data Lake Storage provides a robust and cost-effective way to store massive amounts of data on the existing Azure Blob Storage platform. It is fully compatible with numerous analytic platforms, such as Hadoop, Azure Databricks, and Azure SQL Data Warehouse. And while not enabled in this proof of concept, you can easily secure the Storage Account by restricting network access to authorized public IP or Azure VNET ranges, as well as allowing only trusted Azure services (required here because of our use of IoT Hubs).

Azure Stream Analytics

Azure Stream Analytics allows for easy processing of streamed IoT data, using an easy-to-learn SQL syntax. In our case, we’re using the built-in anomaly detection functions to detect unexpected temperature changes.

Stream Analytics jobs are scaled based on Streaming Units (SUs), which define how much compute (CPU and memory) is allocated to a particular job.

There are three main parts of the Stream Analytics deployment: an input, an output, and the query that processes the data. Let’s go through each of them in a bit more detail.

Input

Inputs define where Stream Analytics gets its data to process. In our case, we’ve defined the input as the Azure IoT Hub that was created as part of our deployment.

Because this resource was created via an automated deployment, all of the information used to connect to the IoT Hub is already entered. As previously noted, you should create a shared access policy specifically for the Stream Analytics connection, rather than using the built-in high-privilege one (or any of the other pre-existing ones). You also want to ensure a dedicated consumer group is created. By clicking the “Test” button, we can confirm everything is working properly.


Azure Stream Analytics also supports accepting data from Azure Blob Storage and Azure Event Hubs. For more details on the different kinds of streaming inputs, see this Microsoft documentation.

It’s also worth noting that Stream Analytics accepts a different kind of input known as Reference Data. This is used to load static data sets which can be used to join and enrich the streamed data. For example, you might use a file that contains details about the different sensor devices as a means to include useful information in the output of the stream. For more details on using Reference Data streams, see this Microsoft documentation.

Output

Azure Stream Analytics supports a multitude of outputs for the data it produces, including blob storage, Azure SQL Database, Event Hubs, and many others. For a full list of the possible outputs, see this documentation.

In our case, we’ve configured a single output to Azure Event Hubs, which is used as a destination for detected anomalies. As with the input, it’s wise to use a specific authorization policy to allow the Stream Analytics job to connect to the Event Hub.


Note that there is one property we must set manually, the Partition key column, as the Terraform provider doesn’t currently (as of November 2019) support it. It’s important to set this properly so that groups of related events always end up in the same partition and can be processed in order. In our case, a logical choice is the device column, since that ensures all the events from a particular device land in the same partition.

Query

The query is where the meat of the work done by Stream Analytics is defined. Here is where we can use a simple SQL dialect to aggregate, transform, and enrich the data streamed through the Stream Analytics job. A full explanation of the language and all the possibilities would be an article in and of itself, so for now we’ll simply describe the query in the current job.

The query consists of three parts. The first uses the built-in AnomalyDetection_ChangePoint function to detect anomalies in the temperature data, partitioning by the device identifier and limiting the window to the last 30 minutes of data. We also set the timestamp column so Stream Analytics knows how to order the data. The second part retrieves two values from the anomaly detection function's output, which tell us whether a data point is considered anomalous and, if so, how unexpected the algorithm rates it. Finally, we select a number of fields for the rows flagged as anomalous and output the results into the Anomalies job output (previously defined as an Event Hub).
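A rough sketch of what such a query looks like is shown below. The input alias and the column names (device, temperature) are assumptions based on the description above, as are the confidence and history-size arguments; see the query in the repository for the exact definition.

```sql
WITH AnomalyDetectionStep AS (
    -- Run change-point anomaly detection per device over the last 30 minutes,
    -- using the event enqueued time as the timestamp
    SELECT
        device,
        CAST(temperature AS float) AS temperature,
        EventEnqueuedUtcTime AS eventTime,
        AnomalyDetection_ChangePoint(CAST(temperature AS float), 80, 120)
            OVER (PARTITION BY device LIMIT DURATION(minute, 30)) AS changePointScores
    FROM [iothub-input] TIMESTAMP BY EventEnqueuedUtcTime
),
AnomalyFlags AS (
    -- Pull the two values of interest out of the function's record output
    SELECT
        device,
        temperature,
        eventTime,
        CAST(GetRecordPropertyValue(changePointScores, 'IsAnomaly') AS bigint) AS isAnomaly,
        CAST(GetRecordPropertyValue(changePointScores, 'Score') AS float) AS anomalyScore
    FROM AnomalyDetectionStep
)
-- Send only the rows flagged as anomalous to the Event Hub output
SELECT device, temperature, eventTime, anomalyScore
INTO [Anomalies]
FROM AnomalyFlags
WHERE isAnomaly = 1
```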

With that out of the way, we can review the final component of our proof of concept, namely Azure Event Hubs.

Azure Event Hubs

Azure Event Hubs is a scalable PaaS offering that provides a pub/sub messaging platform for applications. In our case, we are using it as a destination to receive detected anomaly events for processing by one or more downstream applications.

Event Hubs consist of the following components.

Event Hubs

An Event Hub is a single logical bucket or, to use a phrase common in messaging architecture, topic, on which related events are published. Within an Event Hub, you define properties such as:

  • the number of partitions (hint: pick more than you think you’ll need, since they cannot be changed without recreating the Event Hub);
  • Consumer Groups (similar to IoT Hubs, these identify the applications that process the data, so each can track where in the stream it is);
  • Capture (making it easy to send events to a Storage Account); and
  • message retention (how long messages are retained in the hub; consider how far back you might need to reprocess, with a maximum of 7 days).

In our case, we define a single Event Hub, named anomalies, which will receive output from our Stream Analytics job.
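Outside of Terraform, creating the hub looks roughly like this with the Azure CLI (the resource group and namespace names are placeholders, and the retention parameter name varies slightly between CLI versions):

```bash
# Create the anomalies Event Hub with 4 partitions and 1 day of retention
az eventhubs eventhub create \
    --resource-group my-resource-group \
    --namespace-name my-eventhub-namespace \
    --name anomalies \
    --partition-count 4 \
    --message-retention 1
```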

Why use Event Hubs rather than a simpler queuing offering, such as Azure Queues? By using a pub/sub message bus rather than a queue, we allow for the possibility of multiple consumers of the messages; for example, we might generate Slack alerts with one consumer, but also publish the results to a data store of some kind for a dashboard. I wanted to have something ready to show in this regard; however, I was already beyond my self-imposed time limit for this piece, so I’ll defer that to a later article.

Event Hub Namespace

An Event Hub Namespace is a logical collection of related Event Hubs, and is where we define the assigned processing power for all the Event Hub resources within the Namespace. Event Hub Namespace names must be globally unique, as with Storage Accounts.
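The namespace has to exist before any Event Hubs can be created in it; a sketch with the Azure CLI (names are placeholders):

```bash
# Create a Standard-tier namespace with a single throughput unit;
# the namespace name must be globally unique
az eventhubs namespace create \
    --resource-group my-resource-group \
    --name my-eventhub-namespace \
    --sku Standard \
    --capacity 1
```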

With our tour of the various resources out of the way, let’s actually run some data through the system and see the results!

Live Data Flow

First, we must simulate an IoT device sending data to the Azure IoT Hub. I’ve built a Docker container for this purpose, which you can run using the command below. You’ll need the IoT Hub connection string that we retrieved in the previous section. Make sure you run this in a separate command prompt, as it will keep hold of the command line until you exit the container.
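The exact image name and environment variable are defined in the repository README; the command will look roughly like this (everything in angle brackets is a placeholder):

```bash
# Run the simulated temperature sensor, passing in the device
# connection string retrieved earlier (image and variable names
# are placeholders -- see the repository README for the real ones)
docker run -it \
    -e DEVICE_CONNECTION_STRING="<your device connection string>" \
    <simulator-image-name>
```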

Next, we need to start up the Stream Analytics job. This can be done through the portal.


You can select “Now” for the start time, so the job will consume data starting from the current date and time.

From the Query pane, you can click the Test Query Results button to see sample output of the query.


NOTE: Because only anomalies are shown, you may not actually see any results. You can always repeat the step after the next action, where we introduce anomalous data on purpose.

Now to show that anomalies reach the Event Hub, we’re going to introduce some strikingly different data into the mix, thus triggering the anomaly detection algorithm.

First, exit the container process, then re-run it specifying some additional options as shown below.
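The exact options are again in the repository README; conceptually, the re-run adds a setting that skews the generated temperatures, something like:

```bash
# Re-run the simulator with an option that produces out-of-range
# temperatures to trigger anomaly detection (the option name here
# is a placeholder -- check the repository for the actual flag)
docker run -it \
    -e DEVICE_CONNECTION_STRING="<your device connection string>" \
    -e GENERATE_ANOMALIES="true" \
    <simulator-image-name>
```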

To show that the output is reaching the Event Hub, we can browse to the Metrics panel of the Event Hub Namespace resource, and view the Incoming Messages metric. Make sure you select 30 minutes as your time window to get the best view, and you may need to hit the Refresh button several times.
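If you prefer the command line, the same metric can be pulled with the Azure CLI; a sketch, where the resource ID is the full ARM ID of the Event Hub Namespace:

```bash
# Query the IncomingMessages metric for the namespace at one-minute granularity
az monitor metrics list \
    --resource "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<namespace>" \
    --metric "IncomingMessages" \
    --interval PT1M
```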


Summary and To-Dos

In this article we’ve walked through a reference setup for receiving and consuming messages from IoT devices on the Azure platform, using IoT Hubs, Azure Stream Analytics, and Azure Event Hubs. In a subsequent article, I’ll walk through how to use Azure Functions to consume these events and generate Slack alerts from the same solution. Stay tuned!

