A methodology for solving problems with Data Science for Internet of Things



This (long!) blog is based on my forthcoming book:  Data Science for Internet of Things.

It is also the basis for the course I teach, the Data Science for Internet of Things Course. I welcome your comments. Please email me at ajit.jaokar at futuretext.com. Email me also for a pdf version if you are interested in joining the course.

Here, we start off with the question:  At which points could you apply analytics to the IoT ecosystem and what are the implications?  We then extend this to a broader question:  Could we formulate a methodology to solve Data Science for IoT problems?  I have illustrated my thinking through a number of companies/examples.  I personally work with an Open Source strategy (based on R, Spark and Python) but  the methodology applies to any implementation. We are currently working with a range of implementations including AWS, Azure, GE Predix, Nvidia etc.  Thus, the discussion is vendor agnostic.

I also mention some trends I am following, such as Apache NiFi.

The Internet of Things and the flow of Data

As we move towards a world of 50 billion connected devices,  Data Science for IoT (IoT  analytics) helps to create new services and business models.  IoT analytics is the application of data science models  to IoT datasets.  The flow of data starts with the deployment of sensors.  Sensors detect events or changes in quantities. They provide a corresponding output in the form of a signal. Historically, sensors have been used in domains such as manufacturing. Now their deployment is becoming pervasive through ordinary objects like wearables. Sensors are also being deployed through new devices like Robots and Self driving cars. This widespread deployment of sensors has led to the Internet of Things.

Features of a typical wireless sensor node are described in this paper (wireless embedded sensor architecture). Typically, data arising from sensors is in time series format and is often geotagged. This means there are two forms of analytics for IoT: time series and spatial analytics. Time series analytics typically leads to insights like anomaly detection, so classifiers are commonly used in IoT analytics to detect anomalies. But by looking at historical trends, streaming, and combining data from multiple events (sensor fusion), we can get new insights. More use cases for IoT keep emerging, such as Augmented Reality (think Pokemon Go + IoT).

Meanwhile, sensors themselves continue to evolve. Sensors have shrunk due to technologies like MEMS. Their communications protocols have also improved through new technologies like LoRa. These protocols lead to new forms of communication for IoT such as Device to Device, Device to Server, or Server to Server. Thus, whichever way we look at it, IoT devices create a large amount of Data. Typically, the goal of IoT analytics is to analyse the data as close to the event as possible. We see this requirement in many ‘Smart city’ type applications such as Transportation, Energy grids, Utilities like Water, Street lighting, Parking etc.

IoT data transformation techniques

Once data is captured through the sensor, there are a few analytics techniques that can be applied to the Data. Some of these are unique to IoT. For instance, not all data may be sent to the Cloud/Lake. Considering the volume of Data, some may be discarded at source or summarized at the Edge. We could perform temporal or spatial analysis. Data could also be aggregated, and aggregate analytics could be applied to the IoT data aggregates at the ‘Edge’. For example, if you want to detect failure of a component, you could find spikes in values for that component over a recent span (thereby potentially predicting failure). You could also correlate data in multiple IoT streams. Typically, in stream processing, we are trying to find out what is happening now (as opposed to what happened in the past). Hence, the response should be near real-time. Sensor data could also be ‘cleaned’ at the Edge: missing values in sensor data could be filled in (imputing values), sensor data could be combined to infer an event (Complex event processing), and data could be normalized across sensors, time, devices etc. We could also handle different data formats or multiple communication protocols, and manage thresholds.
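To make the Edge techniques above concrete, here is a minimal Python sketch of three of them: imputing missing values, summarizing a batch before it leaves the Edge, and finding spikes over a recent span. The function names, the forward-fill strategy, the window size and the spike factor are all my own assumptions for illustration; a real deployment would pick these per sensor.

```python
from statistics import mean, pstdev

def impute_missing(readings):
    """Forward fill: replace each None with the previous valid reading.
    (A leading None stays None, since there is nothing to copy from.)"""
    filled, last = [], None
    for value in readings:
        if value is None:
            value = last
        filled.append(value)
        last = value
    return filled

def summarize(readings):
    """Aggregate a batch so only the summary needs to leave the Edge."""
    valid = [r for r in readings if r is not None]
    return {"count": len(valid), "min": min(valid),
            "max": max(valid), "mean": mean(valid)}

def find_spikes(readings, window=5, factor=3.0):
    """Return indices whose value differs from the mean of the preceding
    `window` readings by more than `factor` standard deviations.
    Assumes the readings contain no None values (impute first)."""
    spikes = []
    for i in range(window, len(readings)):
        recent = readings[i - window:i]
        mu, sigma = mean(recent), pstdev(recent)
        if sigma > 0 and abs(readings[i] - mu) > factor * sigma:
            spikes.append(i)
    return spikes

# Hypothetical temperature readings with one gap and one spike
raw = [20.0, 20.5, None, 20.4, 20.6, 20.5, 35.0, 20.4]
clean = impute_missing(raw)
print(summarize(clean))
print(find_spikes(clean))
```

The spike test here is deliberately simple. It is a stand-in for whatever component-specific failure signature you are actually looking for.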



Applying IoT Analytics to the Flow of Data


Here, we address the possible locations and types of analytics that could be applied to IoT datasets.



Some initial thoughts:

  • IoT data arises from sensors and ultimately resides in the Cloud.
  • We use the concept of a ‘Data Lake’ to refer to a repository of Data.
  • We consider four possible avenues for IoT analytics: ‘Analytics at the Edge’, ‘Streaming Analytics’, NoSQL databases and ‘IoT analytics at the Data Lake’.
  • For Streaming analytics, we could build an offline model and apply it to a stream.
  • If we consider cameras as sensors, Deep learning techniques (for example CNNs) could be applied to image and video datasets.
  • Even when IoT data volumes are high, not all scenarios need Data to be distributed. It is very much possible to run analytics on a single node with a non-distributed architecture using Python or R.
  • Feedback mechanisms are a key part of IoT analytics. Feedback is part of multiple IoT analytics modalities, e.g. Edge, Streaming etc.
  • CEP (Complex event processing) can be applied at multiple points, as we see in the diagram.


We now describe various analytics techniques which could apply to IoT datasets

Complex event processing

Complex Event Processing (CEP) can be used at multiple points for IoT analytics (e.g. Edge, Stream, Cloud etc.).

In general, Event processing is a method of tracking and  analyzing  streams  of  data and deriving a conclusion from them. Complex event processing, or CEP, is event processing that combines data from multiple sources to infer events or patterns that suggest more complicated circumstances. The goal of complex event processing is to identify meaningful events (such as opportunities or threats) and respond to them as quickly as possible.

In CEP, the data is in motion. In contrast, a traditional query (e.g. against an RDBMS) acts on static data. Thus, CEP is mainly about stream processing, but the algorithms underlying CEP can also be applied to historical data.

CEP relies on a number of techniques for events, including pattern detection, abstraction, filtering, aggregation and transformation. CEP algorithms model event hierarchies and detect relationships (such as causality, membership or timing) between events. They create an abstraction of event-driven processes. Thus, typically, CEP engines act as event correlation engines: they analyze a mass of events, pinpoint the most significant ones, and trigger actions.

Most CEP solutions and concepts can be classified into two main categories: Aggregation-oriented CEP and Detection-oriented CEP. An aggregation-oriented CEP solution is focused on executing on-line algorithms as a response to event data entering the system, for example to continuously calculate an average based on data in the inbound events. Detection-oriented CEP is focused on detecting combinations of events called event patterns or situations, for example looking for a specific sequence of events. For IoT, CEP techniques are concerned with deriving a higher order value / abstraction from discrete sensor readings.
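The two categories can be illustrated with a small Python sketch. `RunningAverage` is an aggregation-oriented example (an on-line average updated per event), and `detect_sequence` is a detection-oriented example (an ordered pattern of event types within a time limit). Both names, the event format `(timestamp, event_type)` and the sample event types are hypothetical, and a real CEP engine offers far richer pattern languages than this.

```python
class RunningAverage:
    """Aggregation-oriented CEP: update an average as each event arrives,
    without storing the full event history."""
    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

def detect_sequence(events, pattern, within_seconds):
    """Detection-oriented CEP: find occurrences where the event types in
    `pattern` appear in order (other events may be interleaved), all within
    `within_seconds` of the first event. `events` is a time-sorted list of
    (timestamp, event_type) tuples. Returns (start, end) timestamps."""
    matches = []
    for i, (start_ts, kind) in enumerate(events):
        if kind != pattern[0]:
            continue
        matched, end_ts = 1, start_ts
        for ts, later_kind in events[i + 1:]:
            if matched == len(pattern) or ts - start_ts > within_seconds:
                break
            if later_kind == pattern[matched]:
                matched += 1
                end_ts = ts
        if matched == len(pattern):
            matches.append((start_ts, end_ts))
    return matches

avg = RunningAverage()
for reading in (10.0, 20.0, 30.0):
    print(avg.update(reading))

events = [(0, "temp_high"), (2, "vibration_high"), (5, "pressure_drop"),
          (30, "temp_high"), (50, "vibration_high"), (100, "pressure_drop")]
pattern = ["temp_high", "vibration_high", "pressure_drop"]
print(detect_sequence(events, pattern, within_seconds=10))
```

Only the first three events form a match here. The second `temp_high` is followed by the same event types, but too slowly to fall inside the 10 second limit.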

CEP uses techniques like Bayesian networks, neural networks, Dempster-Shafer methods, Kalman filters etc. Some more background at Developing a complex event processing architecture for IoT
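As one example of these techniques, a Kalman filter can smooth noisy sensor readings into a more stable estimate. The sketch below is a one-dimensional filter that assumes the underlying quantity is roughly constant between readings. The function name and the variance values are assumptions chosen for illustration, not tuned for any real sensor.

```python
def kalman_1d(measurements, process_var=1e-3, measurement_var=0.5,
              initial_estimate=0.0, initial_var=1.0):
    """Scalar Kalman filter with a constant-value process model.
    Returns the filtered estimate after each measurement."""
    estimate, var = initial_estimate, initial_var
    estimates = []
    for z in measurements:
        # Predict: the value is assumed unchanged, but uncertainty grows
        var += process_var
        # Update: blend prediction and measurement using the Kalman gain
        gain = var / (var + measurement_var)
        estimate += gain * (z - estimate)
        var *= (1 - gain)
        estimates.append(estimate)
    return estimates

# Hypothetical noisy readings around a true value of 5.0
noisy = [5.3, 4.6, 5.1, 4.9, 5.4, 4.7, 5.0, 5.2]
for z, est in zip(noisy, kalman_1d(noisy, initial_estimate=5.0)):
    print(f"measured={z:.1f} filtered={est:.3f}")
```

A larger `measurement_var` makes the filter trust each new reading less, which gives a smoother but slower estimate.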

Streaming analytics

Real-time systems differ in the way they perform analytics. Specifically, real-time systems perform analytics on short time windows for data streams. Hence, the scope of real-time analytics is a ‘window’, which typically comprises the last few time slots. Making predictions on real-time data streams involves building an offline model and applying it to a stream. Models incorporate one or more machine learning algorithms, which are trained using the training data. Models are first built offline based on historical data (for example spam or credit card fraud). Once built, the model can be validated against a real-time system to find deviations in the real-time stream data. Deviations beyond a certain threshold are tagged as anomalies.
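The offline-model-then-stream idea can be sketched in a few lines of Python. Here the "model" is just the mean and standard deviation of historical readings, which is a deliberate simplification (a real model would be a trained machine learning algorithm). The function names, window size and threshold are assumptions for illustration.

```python
from collections import deque
from statistics import mean, stdev

def train_offline(history):
    """Build the model offline from historical data.
    Assumes the history is not constant (standard deviation > 0)."""
    return {"mean": mean(history), "std": stdev(history)}

def score_stream(stream, model, window_size=3, threshold=3.0):
    """Apply the offline model to a stream. For each full window of the
    last `window_size` readings, yield (index, deviation) when the window
    mean deviates from the model mean by more than `threshold` standard
    deviations."""
    window = deque(maxlen=window_size)
    for t, value in enumerate(stream):
        window.append(value)
        if len(window) < window_size:
            continue
        deviation = abs(mean(window) - model["mean"]) / model["std"]
        if deviation > threshold:
            yield t, deviation

# Hypothetical historical and live readings
history = [10.0, 10.2, 9.8, 10.1, 9.9, 10.0, 10.3, 9.7]
model = train_offline(history)
live = [10.0, 10.1, 9.9, 10.0, 14.0, 14.2, 13.9, 10.0]
for t, deviation in score_stream(live, model):
    print(f"anomaly at t={t}, deviation={deviation:.1f} std")
```

Note that the window keeps flagging for a couple of slots after the readings return to normal, because the high values are still inside the window. That lag is the price of smoothing over a window rather than scoring single readings.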

IoT ecosystems can create many logs depending on the status of IoT devices. By collecting these logs for a period of time and analyzing the sequence of event patterns, a model to predict a fault can be built, including the probability of failure for the sequence. This model to predict failure is then applied to the stream (online). A technique like the Hidden Markov Model can be used for detecting failure patterns based on the observed sequence. Complex Event Processing can be used to combine events over a time frame (e.g. the last one minute) and correlate patterns to detect the failure pattern.
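A minimal sketch of the Hidden Markov Model idea follows. It uses the forward (filtering) recursion to track the probability that a device is in a hidden ‘degrading’ state given the log events seen so far. The two states, the three log event types and every probability in the tables are invented for illustration. In practice they would be estimated offline from the collected logs.

```python
STATES = ("healthy", "degrading")

# Hypothetical parameters, standing in for values learned from historical logs
START = {"healthy": 0.95, "degrading": 0.05}
TRANSITION = {
    "healthy":   {"healthy": 0.90, "degrading": 0.10},
    "degrading": {"healthy": 0.05, "degrading": 0.95},
}
EMISSION = {
    "healthy":   {"ok": 0.90, "warn": 0.09, "error": 0.01},
    "degrading": {"ok": 0.30, "warn": 0.40, "error": 0.30},
}

def filter_states(observations):
    """Forward algorithm with normalization: after each observed log event,
    return P(hidden state | all events so far) as a dict per time step."""
    belief, history = None, []
    for obs in observations:
        if belief is None:
            prior = START
        else:
            # Propagate the previous belief through the transition model
            prior = {s: sum(belief[p] * TRANSITION[p][s] for p in STATES)
                     for s in STATES}
        # Weight by how likely each state is to emit this event
        unnorm = {s: prior[s] * EMISSION[s][obs] for s in STATES}
        total = sum(unnorm.values())
        belief = {s: unnorm[s] / total for s in STATES}
        history.append(belief)
    return history

log_events = ["ok", "ok", "warn", "error", "error"]
for event, belief in zip(log_events, filter_states(log_events)):
    print(f"{event:>5}: P(degrading) = {belief['degrading']:.3f}")
```

Applied online, a rule such as "raise a maintenance ticket when P(degrading) exceeds some threshold" turns this into a simple failure predictor.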

Typically, streaming systems could be implemented in Kafka and Spark.


Some interesting links on streaming I am tracking:

Newer versions of Kafka designed for IoT use cases

Data Science Central: stream processing and streaming analytics how it works

IoT 101 everything you need to know to start your IoT project – Part One

IoT 101 everything you need to know to start your IoT project – Part Two


Edge Processing

Many vendors like Cisco and Intel are proponents of Edge Processing (also called Edge computing). The main idea behind Edge Computing is to push processing away from the core and towards the Edge of the network. For IoT, that means pushing processing towards the sensors or a gateway. This enables data to be initially processed at the Edge device, so that smaller datasets may be sent to the core. Devices at the Edge may not be continuously connected to the network. Hence, these devices may need a copy of the master data/reference data for processing in an offline format. Edge devices may also include other features like:

•    Apply rules and workflow against that data

•    Take action as needed

•    Filter and cleanse the data

•    Store local data for local use

•    Enhance security

•    Provide governance admin controls
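A few of the features above (filtering and cleansing, applying rules, storing data locally while offline) can be sketched as a small Python class. `EdgeGateway`, its parameters and the `send` callback are hypothetical names of my own, and the sketch ignores security and governance entirely.

```python
from collections import deque

class EdgeGateway:
    """Toy Edge device: cleanses readings, applies one alert rule, and
    buffers data locally until the uplink to the core is available."""

    def __init__(self, valid_range, alert_above, buffer_size=1000):
        self.valid_range = valid_range
        self.alert_above = alert_above
        # Bounded local store: the oldest readings drop off if it fills up
        self.buffer = deque(maxlen=buffer_size)
        self.alerts = []

    def ingest(self, reading):
        """Filter and cleanse, apply the rule, then store locally.
        Returns True if the reading was accepted."""
        low, high = self.valid_range
        if reading is None or not (low <= reading <= high):
            return False  # discard missing or physically implausible values
        if reading > self.alert_above:
            self.alerts.append(reading)  # take action locally
        self.buffer.append(reading)
        return True

    def flush(self, send):
        """When connected, forward buffered readings via `send` (any
        callable that accepts one reading). Returns how many were sent."""
        sent = 0
        while self.buffer:
            send(self.buffer.popleft())
            sent += 1
        return sent

gateway = EdgeGateway(valid_range=(-40, 125), alert_above=80)
for reading in [21.5, None, 999, 85.0, 22.0]:
    gateway.ingest(reading)

uplink = []  # stands in for the connection to the core
print(gateway.flush(uplink.append), uplink, gateway.alerts)
```

Of the five raw readings, only three reach the core, and the alert is raised at the Edge without waiting for a round trip.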

IoT analytics techniques applied at the Data Lake

Data Lakes

The concept of a Data Lake is similar to that of a Data warehouse or a Data Mart. In this context, we see a Data Lake as a repository for data from different IoT sources. A Data Lake is driven by the Hadoop platform. This means, Data in a Data lake is preserved in its raw format. Unlike a Data Warehouse, Data in a Data Lake is not pre-categorised. From an analytics perspective, Data Lakes are relevant in the following ways:

  • We could monitor the stream of data arriving in the lake for specific events or could correlate different streams. Both of these tasks use Complex event processing (CEP). CEP could also apply to Data when it is stored in the lake to extract broad, historical perspectives.
  • Similarly, Deep learning and other techniques could be applied to IoT datasets in the Data Lake when the Data  is ‘at rest’. We describe these below.
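Because data in the lake is kept in its raw format and is not pre-categorised, structure is typically applied only when the data is read for a particular analysis. Here is a minimal Python sketch of that idea. The device names, field names and the in-memory `RAW_LAKE` list are all hypothetical stand-ins for files in a real lake.

```python
import json

# Raw records exactly as the devices sent them: no shared schema
RAW_LAKE = [
    '{"device": "thermo-1", "ts": 1, "temp_c": 21.5}',
    '{"device": "thermo-2", "ts": 1, "temp_f": 70.7}',
    '{"device": "meter-9", "ts": 2, "kwh": 0.42}',
]

def read_temperatures(raw_records):
    """Apply a schema at read time: keep only temperature readings and
    normalize them all to Celsius. Other record types are skipped, but
    they remain untouched in the lake for other analyses."""
    for line in raw_records:
        record = json.loads(line)
        if "temp_c" in record:
            celsius = record["temp_c"]
        elif "temp_f" in record:
            celsius = (record["temp_f"] - 32) * 5 / 9
        else:
            continue
        yield {"device": record["device"], "ts": record["ts"],
               "temp_c": round(celsius, 2)}

for row in read_temperatures(RAW_LAKE):
    print(row)
```

The energy meter record is ignored by this reader but is still available, unchanged, to a different analysis that cares about `kwh`.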

ETL (Extract Transform and Load)

Companies like Pentaho are applying ETL techniques to IoT data

Deep learning

Some deep learning techniques could apply to IoT datasets. If you consider images and video as sensor data, then we could apply various convolutional neural network techniques to this data.

It gets more interesting when we consider RNNs (Recurrent Neural Networks) and Reinforcement learning. For example – Reinforcement learning and time series – Brandon Rohrer How to turn your house robot into a robot – Answering the challenge – a new reinforcement learning robot

Over time, we will see far more complex options, for example for Self driving cars and the use of Recurrent neural networks (Mobileye).

Some more interesting links for Deep Learning and IoT:


Systems level optimization and process level optimization for IoT is another complex area where we are doing work. Some links for this



Visualization is necessary for analytics in general and IoT analytics is no exception

Here are some links

NOSQL databases

NoSQL databases today offer a great way to implement IoT analytics. For instance,

Apache Cassandra for IoT

MongoDB and IoT tutorial


Other  IoT analytic techniques

In this section, I list some IoT  technologies where we could implement analytics


A Methodology to solve Data Science for IoT problems

We started off with the question: At which points could you apply analytics to the IoT ecosystem and what are the implications? But behind this work is a broader question: Could we formulate a methodology to solve Data Science for IoT problems? I am exploring this question as part of my teaching both online and at Oxford University along with Jean-Jacques Bernard.

Here is more on our thinking:

  • CRISP-DM is a Data mining process methodology used in analytics. More on CRISP-DM HERE and HERE (pdf documents).
  • From a business perspective (top down), we can extend CRISP-DM to incorporate the understanding of the IoT domain, i.e. add domain specific features. This includes understanding the business impact, handling high volumes of IoT data, understanding the nature of Data coming from various IoT devices etc.
  • From an implementation perspective (bottom up), once we have an understanding of the Data and the business processes, for each IoT vertical we first find the analytics (what is being measured, optimized etc). Then we find the data needed for those analytics. Then we provide examples of that implementation using code. Extending CRISP-DM to an implementation methodology, we could have Process (workflow), templates, code, use cases, Data etc.
  • For implementation in R, we are looking to initially use Open source R and Spark and the h2o.ai API.



We started off with the question: At which points could you apply analytics to the IoT ecosystem and what are the implications? We then extended this to a broader question: Could we formulate a methodology to solve Data Science for IoT problems? The above is comprehensive but not absolute. For example, you can implement deep learning algorithms on mobile devices (Qualcomm Snapdragon machine learning development kit for mobile devices). So, even as I write it, I can think of exceptions!


This article is part of my forthcoming book on Data Science for IoT and also the courses I teach

Welcome your comments.  Please email me at ajit.jaokar at futuretext.com  - Email me also for a pdf version if you are interested. If you want to be a part of my course please see the testimonials at Data Science for Internet of Things Course.  
