Creating an open methodology for Internet of Things (IoT) Analytics: Data science for Internet of Things



a) I am not referring to ‘standardization’ here, but rather to the need for a methodology, i.e. a structured way to solve problems (think of it as Kaggle meets #IoT analytics).

b)  Added reference to PFA(Portable format for Analytics) – thanks Gregory Piatetsky-Shapiro @kdnuggets for the feedback

We often encounter this problem when I teach Data Science for Internet of Things:

There is no specific methodology to solve Data Science for IoT  (IoT Analytics) problems.

This leads to some initial questions:

  • Should there be a distinct methodology to solve Data Science problems for IoT?
  • Are IoT problems for Data Science unique enough to warrant a specific approach?
  • What existing methodologies should we draw upon?

On the one hand, a Data Science for IoT problem is a typical Data Science problem. On the other hand, IoT brings some unique considerations, for example the use of hardware, high data volumes, the use of CEP (Complex Event Processing), the impact of verticals (like automotive) and the impact of streaming data.

Background and inspiration

Some initial background:

Data mining has well-known methodologies such as CRISP-DM. Hilary Mason and others have also proposed specific methodologies for Data Science. Kaggle problems have a specific approach to solving them. Techniques like PFA (Portable Format for Analytics) provide a way of formalizing and moving analytics models.

All these strategies also apply to IoT. IoT itself has methodologies like Ignite IoT – but these do not cover IoT analytics in detail.

A methodology for IoT analytics (Data Science for IoT) should cover the unique aspects of each step in Data Science. For example, it involves more than the choice of the model family (ANN, SVM, trees, etc.), which is only one of the many choices to make. Others include:

a) Choice of the model structure – optimisation methodology (CV, Bootstrap, etc)

b)  Choice of the model parameter optimisation algorithm (joint gradients vs. conjugate gradients )

c)  Preprocessing of the data (centring, reduction, functional reduction, log-transform, etc.)

d)  How to deal with missing data (case deletion, imputation, etc.)

e)  How to detect and deal with suspect data (distance-based outlier detection, density-based, etc.)

f)  How to choose relevant features (filters, wrappers, embedded methods)

g)  How to measure prediction performances (mean square error, mean absolute error, misclassification rate, lift, precision/recall, etc.)

Source: Methodology and standards for data analysis with machine learning tools, Damien François
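To make a few of these choices concrete, here is a minimal sketch in plain Python (standard library only). The readings, the function names and the threshold `k` are all hypothetical, chosen only for illustration; a real project would more likely use a library such as scikit-learn. It contrasts case deletion with mean imputation, flags suspect data with a simple distance-based rule (distance from the median, measured in units of the median absolute deviation), centres and scales the cleaned data, and defines two of the performance measures listed above.

```python
import statistics

# Hypothetical temperature readings from one sensor; None marks a missing value.
readings = [20.1, 19.8, None, 20.4, 35.0, 20.0, None, 19.9]


def case_deletion(values):
    """Missing data, option 1: drop the missing observations."""
    return [v for v in values if v is not None]


def mean_imputation(values):
    """Missing data, option 2: fill gaps with the mean of the observed values."""
    fill = statistics.fmean(case_deletion(values))
    return [fill if v is None else v for v in values]


def mad_outliers(values, k=5.0):
    """Suspect data: flag values further than k * MAD from the median."""
    med = statistics.median(values)
    mad = statistics.median([abs(v - med) for v in values])
    return [v for v in values if abs(v - med) > k * mad]


def centre_and_scale(values):
    """Preprocessing: centre to zero mean and scale to unit standard deviation."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    return [(v - mu) / sigma for v in values]


def mean_squared_error(actual, predicted):
    """Performance measure: average squared difference."""
    return statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)])


def mean_absolute_error(actual, predicted):
    """Performance measure: average absolute difference."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


observed = case_deletion(readings)
suspects = mad_outliers(observed)
clean = [v for v in observed if v not in suspects]
scaled = centre_and_scale(clean)
```

Each function is one possible answer to one of the questions above; the point of a methodology is to make these choices explicit rather than accidental.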

The methodology could also cover  -

Exploratory analysis of data

Hypothesis testing (“Given a sample and an apparent effect, what is the probability of seeing such an effect by chance?” )

and other ideas ..
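The hypothesis testing question quoted above can be answered by simulation rather than by formula. The following is a minimal sketch of a permutation test, assuming two hypothetical groups of vibration readings (before and after a maintenance event); the data, the function name and the number of resamples are illustrative assumptions, not part of any existing methodology.

```python
import random
import statistics


def permutation_p_value(group_a, group_b, n_resamples=2000, seed=0):
    """Estimate how often a difference in means at least as large as the
    observed one appears when the group labels are shuffled at random."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(statistics.fmean(group_a) - statistics.fmean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)
        diff = abs(statistics.fmean(pooled[:n_a]) - statistics.fmean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    return hits / n_resamples


# Hypothetical vibration amplitudes before and after maintenance.
before = [0.52, 0.49, 0.55, 0.51, 0.50, 0.53]
after = [0.61, 0.64, 0.59, 0.66, 0.62, 0.60]

p_value = permutation_p_value(before, after)
```

A small `p_value` suggests the apparent effect is unlikely to be due to chance alone. The approach needs no distributional assumptions, which can be convenient for sensor data.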

An Open methodology for IoT analytics problems

Building on the above, we need an Open, end-to-end,  step by step methodology to solve IoT Analytics/Data Science for IoT problems

In addition, the methodology would need to consider the unique aspects of IoT. For example:

a)      Complex event processing especially using Apache Spark for CEP

b)      Deep learning (because we consider Cameras as sensors)

c)      Anomaly detection: Consider anomaly detection (a typical IoT analytics scenario). There are many considerations: What is the triggering event? How much has the machine deviated from the plan? What is the root cause of the bottleneck? Are there any external factors affecting the system performance? How do I know that I should trust IoT data? Is there a recommended plan of action? How is the data visualized? Does the data have missing elements? How do we detect failure in other processes? (Anomaly detection adapted from Dr Vinay Mehendiratta)
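As an illustration of the question "how much has the machine deviated from the plan?", here is a minimal sketch of a streaming anomaly detector, assuming "the plan" can be approximated by the recent history of a single sensor. The class name, window size, warm-up length and threshold are hypothetical choices for this sketch; a production system would more likely sit on a streaming platform such as Spark.

```python
from collections import deque
import statistics


class RollingAnomalyDetector:
    """Flags a reading as anomalous when it lies more than `threshold`
    standard deviations from the mean of the recent normal readings."""

    def __init__(self, window=20, min_history=5, threshold=3.0):
        self.history = deque(maxlen=window)  # recent readings judged normal
        self.min_history = min_history       # warm-up length before testing
        self.threshold = threshold

    def update(self, value):
        """Process one reading and return True if it looks anomalous."""
        is_anomaly = False
        if len(self.history) >= self.min_history:
            mu = statistics.fmean(self.history)
            sigma = statistics.pstdev(self.history)
            if sigma > 0 and abs(value - mu) / sigma > self.threshold:
                is_anomaly = True
        if not is_anomaly:
            # Only normal readings update the baseline, so a single spike
            # does not distort the statistics used for later readings.
            self.history.append(value)
        return is_anomaly


# Hypothetical stream of readings with one obvious spike.
stream = [10.0, 10.2, 9.9, 10.1, 10.0, 9.8, 10.1, 25.0, 10.0, 9.9]
detector = RollingAnomalyDetector()
anomalies = [i for i, reading in enumerate(stream) if detector.update(reading)]
```

This only addresses detection. The other questions in the list (root cause, trust in the data, recommended action) need domain knowledge that a simple statistical rule cannot supply.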

In addition, IoT vertical domains have special considerations: Smart Grid, Smart cities, Smart energy, Automotive, Smart factory, Mobile, Wearables, Smart home etc.

For example:

Modelling energy prices,

Classifying step using machine learning,

Bus routing using mobile phone data,

Linear and non-linear regression models to predict global temperature and weather prediction


Creating an Open methodology

Currently, this is an evolving thought process being developed as a part of the Data Science for IoT course. We intend to create it as an open methodology – starting with the question: What is common across these IoT analytics problems and how can we adapt existing Data Science techniques  to solve IoT analytics problems?

Over the next few weeks, we are conducting a survey and developing the methodology

If you are interested in participating and knowing more, please sign up to our mailing list and download our papers or contact me at ajit.jaokar at 

My blog featured in 4 Top 50/100 lists for IoT / Big Data / Data Science last year

An interesting year in social media last year .. and A nice way to start the year

I always find this interesting since I write about a very niche space (Data Science for IoT), and it's more mathematical / technical than my previous work in Mobile.
These are great lists too, with some very clued-up people on them who are well worth following.


What is the best way for getting started in Statistics for Programmers/Data Science?

I am often asked this question: What’s the best way for getting started in Statistics for Programmers?

At the Data Science for IoT course – and also in my teaching at Oxford University – I have used the following approach.

Comments welcome:

Firstly, the interest in Statistics for Programmers is a fairly recent phenomenon.

This interest is based on the uptake of Data Science – a hot profession now.

Here’s how most people approach the problem

They pick up an old High School statistics text book – either their own from younger days– or a standard book.

These books are often decades old.

They start with page One .. and work linearly through a few pages ..

They quickly realize why they disliked stats earlier.

And that sentiment has not changed with the passage of time ..

But, here is a different approach

For Data Science, you do not need to master Statistics per se

You need to understand Statistical models.

A model is defined as a combination of  predictive algorithms (based on Statistics) and Data.

Data science is based on creating models that improve with experience / training.

In contrast, in the Data Science for IoT course – we start with problems (the Engineering approach).

I recommend three sources which I am using (if you have others, please let me know at ajit.jaokar at and I shall link them and refer back to you)

Start with Understanding the problem

See these two links by @Brandon Rohrer  (@Microsoft Data Science)  -

Which algorithm family can answer my question and

Which questions can Data Science answer.

See also this post by Dr Vincent Granville @DataScienceCtrl

on 24 uses of Statistical modelling Part 1 and  2

These posts give you an idea of the problems that can be solved using Data science and stats(without going into the math itself initially)

Then read Allen Downey’s books

Allen Downey writes excellent books and they are all free under Creative Commons. You can download them at Green Tea Press, which has an excellent ethos. I especially recommend Think Stats, Think Bayes and Think Complexity (in that order).

To support the author, I would also encourage you to buy these books, especially Think Stats.

You can follow him on Twitter @allendowney

Having got this far, start working with code and small datasets.

I prefer the UCI datasets and the Python scikit-learn library.

Sumit also works with the REPL approach and Paul uses Spark notebook in our course.

In any case, these are small sections of code run in a controlled environment that show you how the stats are implemented (libraries / APIs like scikit-learn are relatively easy to understand if you come from a programming background).
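As an example of the kind of small code section I mean, here is a sketch of the Pearson correlation coefficient written out step by step in plain Python. The function name and the toy numbers are illustrative only; in practice you would call a library routine, but writing it once by hand shows a programmer what the statistic actually computes.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: the covariance of xs and ys divided by the
    product of their standard deviations (the 1/n factors cancel out)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    return covariance / math.sqrt(spread_x * spread_y)


# Toy data: y rises exactly in step with x, then falls exactly in step.
rising = pearson_r([1, 2, 3, 4], [2, 4, 6, 8])
falling = pearson_r([1, 2, 3, 4], [8, 6, 4, 2])
```

Here `rising` comes out at (approximately) 1 and `falling` at (approximately) -1, the two extremes of a perfect linear relationship.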

That's the path we are using in the Data Science for IoT course.

Any comments/feedback welcome on your approach to teach statistics (ajit.jaokar at

Image source: Scatter plots – wikipedia

Data Science for Internet of Things – practitioner course – March 2016

Now running in its third batch ..

Welcome to the world’s first course that helps you to become a Data Scientist for the Internet Of Things ..

For the latest batch of this course see  Data Science for Internet of things #DataScience #IoT – Aug – Sep 2016 start now in its fourth batch




The course starts on March 22, 2016.

Please contact [email protected] 

This niche, personalized course is suited for:

  • Developers who want to transition to a new role as Data Scientists
  • Entrepreneurs who want to launch new products covering IoT and analytics
  • Anyone interested in developing their career in IoT Analytics

Duration: The course starts from March 2016 and extends to July  2016. We work with you for the next six months after that on a specific project and to help transition your career to Data Science through our network. The extra time also allows you to catch up on specific modules in the course

Scope: Created by Data Science and IoT professionals, the course covers infrastructure (Hadoop – Spark), Programming / Modelling (Python/R/Time series) and Deep Learning (Theano, Deeplearning4j) within the context of the Internet of Things.

Internet of Things: We cover unique aspects of Data Science for IoT including Deep Learning, Complex event processing/sensor fusion and Streaming/Real time analytics


Offline (London):  £1,200 GBP + VAT
Online:  Yes. Please contact us at [email protected]


Contact  us at [email protected] to signup




  • The course aims to equip you to be a Data Scientist for the Internet of Things domain
  • You can transition your career to Data Science for IoT. This could mean a new job, role, project or a start-up idea
  • You are not alone: Toolkits and community support to start working on real Data science problems for IoT
  • You master specific skills: Spark, R, Python, Scala, IoT platforms, Data analysis, Deep Learning and SQL among others
  • The course content can be personalized (see below)
  • The Data Science principles can apply to other domains i.e. beyond IoT



(Note the modules and the sequence are subject to change)


An overview of Data Science

An overview of Data Science,  What is Data Science? What problems can be solved using Data science – Extracting meaning from Data – Statistical processes behind Data – Techniques to acquire data (ex APIs) – Handling large scale data – Big Data fundamentals


Data Science and IoT

The IoT ecosystem, Unique considerations for the IoT ecosystem – Addressing IoT problems in Data science (time series data, enterprise IoT edge computing, real-time processing, cognitive computing, image processing, introduction to deep learning algorithms, geospatial analysis for IoT/managing massive geographic scale, strategies for integration with hardware, sensor fusion)


The Apache Spark ecosystem

Apache Spark in detail, including Scala, SQL, SparkR, MLlib and GraphX


The Data Science for IoT methodology

A specific approach to solve Data Science problems for IoT including strategy and development


Mathematical foundations of Machine learning

Here we formally cover the mathematics for Data science including Linear Algebra, Matrix algebra, Bayesian Statistics, Optimization techniques (Gradient descent) etc. We also cover Supervised algorithms, unsupervised algorithms (classification, regression, clustering, dimensionality reduction etc) as applicable to IoT datasets


Unique Elements for IoT

This module emphasises the following unique elements for IoT

  • Complex event processing (sensor fusion)
  • Deep Learning and
  • Real Time (Spark, Kafka etc)


FAQ: Summary of Benefits and Features


Impact on your work: Designed for developers/ICT contractors/entrepreneurs who want to transition their career towards Data Science roles with an emphasis on IoT
Typical profile: A developer who has skills in programming environments like Java, Ruby, Python, Oracle etc and wants to learn Data Science within the context of the Internet of Things with the goal of becoming a Data Scientist for IoT
Community support: Yes. Also includes the alumni network, i.e. beyond the duration of the course, at no extra cost.
Approach to Big Data: For Big Data, the course is focussed on Apache Spark – specifically Scala, SQL, MLlib, GraphX and others on HDFS
Approach to Programming: See scope below
Approach to Algorithms: See scope below
Is this a full data science course? Yes, we cover machine learning / Data Science techniques which are applicable to any domain. Our focus is the Internet of Things. The course is practitioner oriented, i.e. not academic, and is not affiliated to a university.
Investment: Offline (London): £1,200 GBP + VAT (if applicable). Online: Yes. Please contact us at [email protected]
Help with jobs/employment: Yes, we aim to transition your career. Hence, we are selective in the recruitment for the course. There are no guarantees, but a career transition is a key goal for us. We work with you over the duration of the course (including the project) to get a new role in Data Science/IoT
Created by professionals: See our profiles below
Personalization: The course can be personalized through a PLP (Personal Learning Plan), which allows you to customize for language, projects, domains, career goals, entrepreneurial goals etc. Examples include a focus on CEP/sensor fusion, RNNs and time series, edge processing, SQL etc. There is no extra cost for this, but we agree the scope through the PLP before we start. If you are interested in this option, please let us know at [email protected]. If you want to see examples of our work and content, please see Spark SQL real time analytics by Sumit Pal (published on kdnuggets) and The evolution of Deep learning models by Ajit Jaokar
Duration: The course starts from March 2016 and extends to July 2016. We work with you for the next six months after that on a specific project and to help transition your career to Data Science through our network. The extra time also allows you to catch up on specific modules in the course
Projects: A significant part of the course is project based. Projects are based on predictive analytics algorithms for IoT applications. Projects use our methodology, which is based on a formalized way of solving IoT analytics problems. Projects can be based on any of the programming languages and platforms we cover, i.e. R or Python, Spark (Scala) and SQL (distributed processing, i.e. Big Data), and Theano and Deeplearning4j for Deep Learning. If you want to work on a specific project (or explore some ideas more deeply), you should indicate this in advance
Access to knowledge: We do not restrict access to knowledge by specialization. For example, if you choose to focus on sensor fusion, you will still have access to all material for Deep Learning
Batch sizes: Limited to ensure personalized attention
Time per week: About 5 hours/week. No additional materials need to be bought
Certificate of completion: Yes, based on the quiz and projects.
Delivery of content: Via video. You do not have to be online at specific times


How is this approach different to the more traditional MOOCs?

Here’s how we differ from MOOCs

a)  We are not ‘Massive’ – this approach works for small groups with more focused and personalized attention. We will never have 1000s of participants

b)  We help in career leverage: We work actively with you for career leverage – ex you are a startup / you want to transition to a new job etc

c)  We are vendor agnostic

d)  We work actively with you to build your brand(Blogs/Open source/conferences etc)

e)  The course can be personalized to streams(ex with Deep learning, Complex event processing, Streaming etc)

f)  We teach the foundations of maths where applicable

g)  We work with a small number of platforms which provide current / in-demand skills – ex Apache Spark, R etc

h)  We are exclusively focused on IoT (although the concepts can apply to any other vertical)


Approach to Programming

The main programming focus is on Python, R and Spark (Scala, SQL and R). We also use Deeplearning4j and Theano (for Deep Learning). We will also use an IoT platform (like ThingWorx), but we will emphasize IoT analytics. Participants need to be able to code / come from a development background (the programming language itself does not matter).


What is your approach to working with Algorithms and Maths?

The course is based on modelling IoT based problems in the Python and R programming languages. We follow a context based learning approach, hence we relate the maths to specific R based IoT models. You will need an aptitude for maths. However, we cover the mathematical foundations necessary. These include: Linear Algebra including Matrix algebra, Bayesian Statistics, Optimization techniques (such as Gradient descent) etc.
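To give a flavour of how the maths relates to code, here is a minimal sketch of gradient descent fitting a straight line to a handful of points. It is written in Python rather than R for brevity, and the data, learning rate and number of epochs are hypothetical values chosen only so that the example converges; it is not taken from the course material.

```python
def fit_line_gradient_descent(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y = w * x + b by repeatedly stepping against the gradient
    of the mean squared error with respect to w and b."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        grad_w = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2 * sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical points lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_line_gradient_descent(xs, ys)
```

With these settings `w` and `b` approach 2 and 1. If the learning rate is too large the updates diverge, which is a useful thing to try when learning how the method behaves.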


What is the implication of an emphasis on IoT?

In 2015, IoT is emerging, but its impact will be felt over the next five years. Today, we see IoT driven by Bluetooth 4.0 including iBeacons. Over the next five years, we will see IoT connectivity driven by the wide area network (with the deployment of 5G in 2020 and beyond). We will also see entirely new forms of connectivity (ex LoRa, Sigfox etc). Enterprises (Renewables, Telematics, Transport, Manufacturing, Energy, Utilities etc) will be the key drivers for IoT. On the consumer side, Retail and wearables will play a part. This tsunami of data will lead to an exponential demand for analytics, since analytics is the key business model behind the data deluge. Most of this data will be time series data, but it will also include other types of data. For example, our emphasis on IoT also includes Deep Learning since we treat video and images as sensor data. IoT will lead to a re-imagining of everyday objects.


Why is this course unique?

The course emphasizes some aspects that are unique to IoT (in comparison to traditional data science). These include: a greater emphasis on time series data, edge computing, real-time processing, cognitive computing, in-memory processing, Deep Learning, geospatial analysis for IoT, managing massive geographic scale (ex for Smart cities), Telecoms datasets, strategies for integration with hardware, and sensor fusion (Complex event processing). Note that we include video and images as sensor data through cameras (hence the study of Deep Learning).



Who is creating/teaching this course?

The course is created by futuretext and conducted by Ajit Jaokar, Dr Paul Katsande and Sumit Pal

Ajit Jaokar  – Based in London, Ajit’s research and consulting is based on Data Science and the Internet of Things. His work is based on his teaching at Oxford University and UPM (Technical University of Madrid) and covers IoT, Data Science, Smart cities and Telecoms.



Sumit Pal is a big data, visualisation and data science consultant. He is also a software architect and big data enthusiast and builds end-to-end data-driven analytic systems. Sumit has worked for Microsoft (SQL server development team), Oracle (OLAP development team) and Verizon (Big Data analytics team) in a career spanning 22 years. Currently, he works for multiple clients advising them on their data architectures and big data solutions and does hands on coding with Spark, Scala, Java and Python. Sumit is based in Boston.


Dr Paul Katsande is a technical architect based in London working with Apache Spark, Scala and Data Science. Paul’s PhD research is based on image processing from the University of Manchester.



We have limited spaces. Please contact us at [email protected] if you want to take the next steps!







Weekly schedule


Week 0 March 15 Orientation, introductions, Personal learning plans, Platform signup
Week 1 Mar 21 Foundations: An analytics driven organization – IoT and Machine Learning – Data Science for IoT – Unique characteristics – Data Science for IoT – why now?
Mar 28 Machine Learning concepts, Deep Learning concepts
Apr 4 An introduction to IoT (Internet of Things)
Apr 11 IoT platforms – From sensor to Cloud
Apr  18 Concepts of Big Data Part One
Apr  25 Concepts of Big Data Part Two
May 2 Market drivers for IoT
May 9 Choosing a model – what technique to Use?
May 16 Use Cases  and IoT datasets (these will continue throughout the course)
May  23 Time series and NoSQL databases
May 30 Streaming analytics part One
June  6 Streaming analytics part two
June 13 Deep learning part one
June 20 Deep learning part two
June 27 Machine learning algorithms – part one
July 4 Machine learning algorithms – part two
July 11 Mathematical foundations – part one
July 18 Mathematical foundations – part two
July To Dec 31 Project





Week 0 Mar 15 Orientation, introductions, Personal learning plans, Platform signup
Week 1 Mar 21
Mar 28
Apr 4 Intro to R, Installations, Basics of R
Apr 11
Apr  18 Data Frames in R & Tabular Data
Apr  25
May 2 Data Processing & Data Visualization in R
May 9
May 16 Scala basics
May  23
May 30 Spark batch processing I
June  6
June 13 Spark Batch Processing II
June 20
June 27 Spark SQL
July 4
July 11 Spark Streaming
July 18
July To Dec 31 Projects


 Contact  us at [email protected] to signup