Book review: Time Series Databases and A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman

Introduction

This blog is a review of two books, both written by Ted Dunning and Ellen Friedman (published by O’Reilly) and available for free from the MapR site: Time Series Databases: New Ways to Store and Access Data and A New Look at Anomaly Detection.

The MapR platform is a key part of the Data Science for the Internet of Things (IoT) course at the University of Oxford, and I shall be covering these issues in the course.

In this post, I discuss the significance of time series databases from an IoT perspective, based on my review of these books. Specifically, we discuss classification and anomaly detection, which often go together in typical IoT applications. The books are easy to read, with analogies such as HAL (2001: A Space Odyssey), and I recommend them.

 

Time Series data

The idea of time series data is not new. Historically, time series data can be stored even in simple structures like flat files. The difference now is the huge volume of data and the future applications made possible by collecting this data – especially for IoT. These large-scale time series databases and applications are the focus of the book. Large-scale time series applications typically need a NoSQL database like Apache Cassandra, Apache HBase, MapR-DB, etc. The book’s focus is Apache HBase and MapR-DB for the collection, storage and access of large-scale time series data.

Essentially, time series data involves measurements or observations of events as a function of the time at which they occurred. The airline ‘black box’ is a good example of time series data. The black box records data many times per second for dozens of parameters throughout the flight, including altitude, flight path, engine temperature and power, indicated air speed, fuel consumption, and control settings. Each measurement includes the time it was made. The analogy applies to sensor data. Increasingly, with the proliferation of IoT, time series data is becoming more common and universal. The data acquired through sensors is typically stored in time series databases. A TSDB (time series database) is optimized for best performance for queries based on a range of time.
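As a mental model of why a TSDB is fast for such queries, the wide-table designs the book describes for Apache HBase and MapR-DB key each row by a series identifier plus a time bucket, so that a time-range query only has to touch a handful of rows. Here is a minimal Python sketch of that layout – my own toy illustration with an invented metric name and bucket size, not the book’s exact schema:

```python
from collections import defaultdict

BUCKET = 3600  # one row per metric per hour (illustrative bucket size)

# Toy wide-table layout: row key = (metric, hour bucket), columns keyed by the
# offset within the bucket. This loosely mimics the HBase/MapR-DB wide-row designs.
table = defaultdict(dict)

def put(metric, ts, value):
    row_key = (metric, ts - ts % BUCKET)
    table[row_key][ts % BUCKET] = value

def range_query(metric, start, end):
    """Return (ts, value) pairs for a metric in [start, end) by scanning whole buckets."""
    out = []
    for bucket in range(start - start % BUCKET, end, BUCKET):
        for offset, value in sorted(table[(metric, bucket)].items()):
            ts = bucket + offset
            if start <= ts < end:
                out.append((ts, value))
    return out

put("engine_temp", 1_420_000_000, 88.5)
put("engine_temp", 1_420_000_060, 89.1)
print(range_query("engine_temp", 1_420_000_000, 1_420_003_600))
```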

 

Time series data applications

Time series databases apply to many IoT use cases for example:

  • Trucking, to reduce taxes according to how much trucks drive on public roads (which sometimes incur a tax). It’s not just a matter of how many miles a truck drives but rather which miles.
  • A smart pallet can be a source of time series data that might record events of interest such as when the pallet was filled with goods, when it was loaded or unloaded from a truck, when it was transferred into storage in a warehouse, or even the environmental parameters involved, such as temperature.
  • Similarly, commercial waste containers, called dumpsters in the US, could be equipped with sensors to report on how full they are at different points in time.
  • Cell tower traffic can also be modelled as a time series, and anomalies such as flash-crowd events can be used to provide early warning.
  • Data centre monitoring can be modelled as a time series to predict outages and plan upgrades.
  • Similarly, satellites, robots and many more devices can be modelled as time series data.

From these readings captured in a Time Series database, we can derive analytics such as:

Prognosis: What are the short- and long-term trends for some measurement or ensemble of measurements?

Introspection: How do several measurements correlate over a period of time?

Prediction:  How do I build a machine-learning model based on the temporal behaviour of many measurements correlated to externally known facts?

Introspection:  Have similar patterns of measurements preceded similar events?

Diagnosis:  What measurements might indicate the cause of some event, such as a failure?

 

Classification and Anomaly detection for IoT

The books give examples of using anomaly detection and classification for IoT data.

For time-series IoT readings, anomaly detection and classification go together. Anomaly detection determines what normal looks like and how to detect deviations from it.

When searching for anomalies, we don’t know what their characteristics will be in advance. Once we know the characteristics, we can use a different form of machine learning, i.e. classification.

Anomaly in this context just means different from expected; it does not imply desirable or undesirable. Anomaly detection is a discovery process to help you figure out what is going on and what you need to look for. The anomaly-detection program must discover interesting patterns or connections in the data itself.

Anomaly detection and classification go together when it comes to finding a solution to real-world problems. Anomaly detection is used first, in the discovery phase, to help you figure out what is going on and what you need to look for. You could use the anomaly-detection model to spot outliers, then set up an efficient classification model to assign new examples to the categories you’ve already identified. You then update the anomaly detector to consider these new examples as normal and repeat the process.
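Here is a minimal sketch of that loop using scikit-learn – my choice of tooling for illustration, not something prescribed by the books. An anomaly detector flags candidate outliers; once those candidates have been inspected and named, a classifier is trained to recognise the now-known category:

```python
import numpy as np
from sklearn.ensemble import IsolationForest, RandomForestClassifier

rng = np.random.default_rng(0)

# Step 1 (discovery): model "normal" and flag deviations as candidate anomalies.
normal = rng.normal(loc=50.0, scale=2.0, size=(1000, 1))        # e.g. sensor readings
detector = IsolationForest(contamination=0.01, random_state=0).fit(normal)

new_readings = np.vstack([rng.normal(50, 2, (200, 1)),          # mostly normal
                          rng.normal(70, 1, (5, 1))])           # a few odd readings
candidate_anomalies = new_readings[detector.predict(new_readings) == -1]

# Step 2 (after inspecting and naming the flagged pattern): train a classifier for it,
# fold the new examples back into "normal", and repeat the cycle.
X = np.vstack([normal, candidate_anomalies])
y = np.array([0] * len(normal) + [1] * len(candidate_anomalies))
classifier = RandomForestClassifier(random_state=0).fit(X, y)
print(classifier.predict([[49.5], [71.0]]))   # expected: [0 1]
```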

The book goes on to give examples of using these techniques on EKG data.

For example, for the challenge of finding an approachable, practical way to model normal for a very complicated curve such as the EKG, we could use a type of machine learning known as deep learning.

Deep learning involves letting a system learn in several layers, in order to deal with large and complicated problems in approachable steps. Curves such as the EKG have repeated components separated in time rather than superposed. We take advantage of the repetitive and separated nature of an EKG curve in order to accurately model its complicated shape and detect normal patterns using deep learning.
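One way to exploit that repetitive, separated structure – a rough sketch in the spirit of the book’s windowed modelling idea, not its exact pipeline – is to cut the signal into windows, learn a small catalogue of normal window shapes, and flag any window that cannot be reconstructed well from the catalogue. The window size, synthetic signal and cluster count below are all illustrative:

```python
import numpy as np
from sklearn.cluster import KMeans

WINDOW = 32  # samples per window; real EKG work would align windows on heartbeats

def windows(signal, size=WINDOW):
    n = len(signal) // size
    return signal[: n * size].reshape(n, size)

# Train: learn a small catalogue of "normal" window shapes from a normal recording.
rng = np.random.default_rng(1)
t = np.linspace(0, 200 * np.pi, 100 * WINDOW)
normal_signal = np.sin(t) + 0.05 * rng.normal(size=t.size)      # stand-in for an EKG
shapes = KMeans(n_clusters=8, n_init=10, random_state=0).fit(windows(normal_signal))

# Score: reconstruct each window from its nearest learned shape; a large
# reconstruction error means the window does not look like any normal pattern.
def reconstruction_error(signal):
    w = windows(signal)
    nearest = shapes.cluster_centers_[shapes.predict(w)]
    return np.linalg.norm(w - nearest, axis=1)

test_signal = normal_signal.copy()
test_signal[10 * WINDOW: 11 * WINDOW] = 3.0                      # inject an abnormal window
print(np.argmax(reconstruction_error(test_signal)))              # -> 10, the injected window
```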

The book also refers to a data structure called t-digest for accurate calculation of extreme quantiles. t-digest was developed by one of the authors, Ted Dunning, as a way to accurately estimate extreme quantiles for very large data sets with limited memory use. This capability makes t-digest particularly useful for selecting a good threshold for anomaly detection. The t-digest algorithm is available in Apache Mahout as part of the Mahout math library. It is also available as open source at https://github.com/tdunning/t-digest
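In an anomaly-detection pipeline, the job t-digest does is essentially threshold selection: estimate an extreme quantile of the anomaly scores and flag anything above it. The sketch below shows the idea with an exact quantile on an in-memory array; t-digest performs (approximately) the same calculation over an unbounded stream using only a few kilobytes of state:

```python
import numpy as np

rng = np.random.default_rng(2)
scores = rng.exponential(scale=1.0, size=1_000_000)   # stand-in anomaly scores

# Pick a threshold at an extreme quantile: anything above it is flagged as anomalous.
# numpy computes this exactly on an in-memory array; t-digest approximates it on a stream.
threshold = np.quantile(scores, 0.999)
flagged = scores[scores > threshold]
print(threshold, flagged.size)   # roughly 0.1% of the points get flagged
```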

 

Anomaly detection is a complex field and needs a lot of data.

For example: what happens if you only save a month of sensor data at a time, but the critical events leading up to a catastrophic part failure happened six weeks or more before the event?

IoT from a large scale Data standpoint

To conclude, much of the complexity of IoT analytics comes from the management of large-scale data.

Collectively, Interconnected Objects and the data they share make up the Internet of Things (IoT).

Relationships between objects and people, between objects and other objects, conditions in the present, and histories of their condition over time can be monitored and stored for future analysis, but doing so is quite a challenge.

However, the rewards are also potentially enormous. That’s where machine learning and anomaly detection can provide a huge benefit.

For time series, the book covers themes such as:

  • Storing and Processing Time Series Data
  • The Direct Blob Insertion Design
  • Why Relational Databases Aren’t Quite Right
  • Architecture of OpenTSDB
  • Value Added: Direct Blob Loading for High Performance
  • Using SQL-on-Hadoop Tools
  • Using Apache Spark SQL
  • Advanced Topics for Time Series Databases (Stationary Data, Wandering Sources, Space-Filling Curves)

For anomaly detection:

  • Windows and Clusters
  • Anomalies in Sporadic Events
  • Website Traffic Prediction
  • Extreme Seasonality Effects
  • and more

 

Links again:

Time Series Databases: New Ways to Store and Access Data and A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman (published by O’Reilly).

Also see the link for the Data Science for the Internet of Things (IoT) course – University of Oxford, where I hope to cover these issues in more detail in the context of MapR.

Data Science for Internet of Things (IoT) course – University of Oxford

I am pleased to announce a unique course  – Data Science for the Internet of Things (IoT) course – University of Oxford

We are launching first with very limited places. We are already collaborating with MapR, Sigfox, Hypercat, Red Ninja and many others.

So the course will be based on practical insights from current systems.

Everyone finishing the course will receive a University of Oxford certificate showing that they have completed the course.

The course is fully online.

Have a look at Data Science for the Internet of Things (IoT) course – University of Oxford for more.

Feedback is welcome, and I will update a lot more over the next few weeks.

If you want to avail of this unique certification, please email me for more information: ajit.jaokar at futuretext.com

Infographic: Fascinating Advancements in Electrical/Computer Engineering by Ohio State University

Ohio University Online

Programming for Data Science – the Polyglot approach: Python vs. R OR Python + R + SQL

In this post, I discuss a possible new approach to teaching Programming for Data Science.

Programming for Data Science is focussed on the R vs. Python question.  Everyone seems to have a view including the venerable Nature journal (Programming – Pick up Python).

Here, I argue that we look beyond the Python vs. R debate and teach R, Python and SQL together. To do this, we need to look at the big picture first (the problem we are solving in Data Science) and then see how that problem is broken down and solved by different approaches. In doing so, we can more easily master multiple approaches and then even combine them if needed.

On first impressions, this Polyglot approach (ability to master multiple languages) sounds complex.

Why teach 3 languages together?  (For simplicity – I am including SQL as a language here)

Here is some background

Outside of Data Science, I also co-founded a social enterprise to teach Computer Science to kids, Feynlabs. At Feynlabs, we have been working on ways to accelerate learning to code. One way to do this is to compare and contrast multiple programming languages. This approach makes sense for Data Science also, because a learner can potentially approach Data Science from many directions.

To learn programming for Data Science, it would thus help to build up from an existing foundation the learner is already familiar with and then relate new ideas to this foundation through other approaches. From a pedagogical standpoint, this approach is similar to that of David Ausubel, who stressed the importance of prior knowledge in being able to learn new concepts: “The most important single factor influencing learning is what the learner already knows.”

But first, we address the problem we are trying to solve and how that problem can be broken down.

I also propose to make this approach part of the Data Science for IoT course/certification, but I also expect to teach it as a separate module – probably in a workshop format in London and the USA. If you are interested to know more, please sign up on the mailing list HERE

Data Science – the problem we are trying to solve

Data science involves the extraction of knowledge from data. Ideally, we need lots of data from a variety of sources.  Data Science lies at the intersection of multiple disciplines: Programming, Statistics, Algorithms, Data analysis etc. The quickest way to solve Data Science problems is to start analyzing data as soon as possible. However, Data Science also needs a good understanding of the theory – especially the machine learning approaches.

A Data Scientist typically approaches a problem using a methodology like OSEMN (Obtain, Scrub, Explore, Model, Interpret). Some of these steps are common to a classic data warehouse and are similar to the classic ETL (Extract, Transform, Load) approach. However, the modelling and interpreting stages are unique to Data Science. Modelling needs an understanding of Machine Learning algorithms and how they fit together, for example unsupervised algorithms (dimensionality reduction, clustering) and supervised algorithms (regression, classification).

To understand Data Science, I would expect some background in Programming. Certainly, one would not expect a Data Scientist to start from “Hello World”. But on the other hand, the syntax of a language is often over-rated. Languages have quirks – and they are easy to get around with most modern tools.

So, if we look at the problem / big picture first (i.e. the Obtain, Scrub, Explore, Model and Interpret stages), it is easier to fit the programming languages to the stages. Machine Learning has two phases: the model building phase and the prediction phase. We first build the model (often in batch mode, and it takes longer). We then perform predictions with the model in a dynamic/real-time mode. Thus, to understand programming for Data Science, we can divide the learning into four stages: the tool itself (IDE), data management, modelling and visualization.
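A minimal sketch of those two phases with scikit-learn – my own example, with invented data, rather than a prescribed workflow. The model is fitted and persisted in batch, then loaded (possibly by a different process) to serve predictions:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Phase 1 - model building (batch, slower, offline): fit and persist the model.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(10_000, 3))
y_train = X_train @ np.array([1.5, -2.0, 0.7]) + rng.normal(scale=0.1, size=10_000)
joblib.dump(LinearRegression().fit(X_train, y_train), "model.joblib")

# Phase 2 - prediction (online, fast, possibly in another process): load and score.
model = joblib.load("model.joblib")
print(model.predict([[0.2, -0.4, 1.0]]))
```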

Tools, IDE and Packages

After understanding the base syntax, it is easier to understand the language in terms of its packages and libraries. Both Python and R have a vast number of packages (such as statsmodels), often distributed as libraries (scikit-learn). Both languages are interpreted. Both have good IDEs, such as Spyder and IPython for Python and RStudio for R. If using Python, you would probably use a library like scikit-learn and a distribution of Python such as Anaconda. With R, you would use RStudio and install specific packages using R’s CRAN package repository.

Data management

Apart from R and Python, you would also need to use SQL. I include SQL because it plays a key role in the data scrubbing stage. Some have called this stage the janitor work of Data Science, and it takes a lot of time. SQL also plays a part in SQL-on-Hadoop approaches like Apache Drill, which allow users to write SQL queries against data stored in Hadoop and receive results.

With SQL, you are manipulating data in Sets. However, once the data is inside the Programming environment, it is treated differently depending on the language.

In R, everything is a vector, and R data structures and functions are vectorized. This means most functions in R work on vectors (i.e. on all the elements, not on individual elements in a loop). Thus, in R, you read your data into a data frame and use a built-in model (here are the steps/packages for linear regression). In Python, if you did not use a library like scikit-learn, you would need to make many decisions yourself, and that can be a lot harder. However, with a package like scikit-learn, you get a consistent, well-documented interface to the models. That makes your job a lot easier by letting you focus on usage.
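A small sketch of how the stages hand over to each other – SQL for set-based scrubbing, pandas for the data frame, scikit-learn for the model. I use sqlite3 purely as a stand-in for whatever SQL engine you actually run (Drill, a warehouse, etc.), and the readings table is invented:

```python
import sqlite3
import pandas as pd
from sklearn.linear_model import LinearRegression

# Obtain/Scrub with SQL: set-based filtering before any modelling happens.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE readings (device_id INTEGER, temperature REAL, power REAL);
    INSERT INTO readings VALUES (1, 20.5, 105.0), (1, 22.0, 110.5),
                                (2, 25.0, 121.0), (2, NULL, 130.0);
""")
df = pd.read_sql_query(
    "SELECT temperature, power FROM readings WHERE temperature IS NOT NULL", con)

# Model with Python: the scrubbed rows become a DataFrame, then a fitted model.
model = LinearRegression().fit(df[["temperature"]], df["power"])
print(model.coef_, model.intercept_)
```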

Data Exploration and Visualization

After the data modelling stage, we come to data exploration and visualization. For Python, the pandas package is a powerful tool for data exploration; here is a simple and quick intro to the power of Python pandas (YouTube video). Similarly, R uses the dplyr and ggplot2 packages for data exploration and visualization.
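A quick, minimal pandas exploration sketch on a toy sensor table (the column names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
df = pd.DataFrame({
    "device": rng.choice(["a", "b", "c"], size=500),
    "temperature": rng.normal(22, 3, size=500),
})

print(df.describe())                                   # quick summary statistics
print(df.groupby("device")["temperature"].agg(["mean", "std", "max"]))
df["temperature"].plot(kind="hist", bins=30)           # dplyr/ggplot2 play the same role in R
```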

A moving goalpost and a Polyglot approach

Finally, much of this discussion is a rapidly moving goalpost. For example, in R, large calculations traditionally need the data to be loaded into memory as a matrix (e.g. n×n matrix manipulation), but with platforms like Revolution Analytics that can be overcome. Especially with the acquisition of Revolution Analytics by Microsoft, and with Microsoft’s history of creating good developer tools, we can expect development in R to be simplified.

Also, since both R and Python operate in the context of Hadoop for Data Science, we would expect to leverage the Hadoop architecture through HDFS connectors, both for Python Hadoop frameworks and for R Hadoop integration. One could also argue that we are already living in a post-Hadoop/MapReduce world with Spark and Storm, especially for real-time calculations, and that at least some Hadoop functions may be replaced by Spark.

Here is a good introduction to Apache Spark and a post about getting started with Spark in Python. Interestingly, the Spark programming guide covers integration with three languages (Scala, Java and Python) but not R. The power of open source, however, means we have SparkR, which integrates R with Spark.

The approach of covering multiple languages has some support – for instance, with the Beaker notebook. You could also achieve the same effect by working at the command line, for example as in Data Science at the Command Line.

Conclusions

Even in a brief blog post, you can get a lot of insight when you look at the wider problem of Data Science and compare how different approaches address segments of that problem. You just need the bigger picture of how these languages fit together for Data Science and an understanding of the major differences (for example, vectorization in R).

Use of good IDEs, packages, etc. softens the impact of programming.

It then changes our role, as Data Scientists, to mixing and matching a palette of techniques as APIs – sometimes spanning languages.

I hope to teach this approach as part of Data Science for IoT course/certification

Programming for Data Science will also be a separate module/talk over the next few months at FabLab London, the London IT contractors meetup group, CREATE Miami (a venture accelerator at Miami Dade College), the City Sciences conference in Shanghai (as part of a larger paper) and MCS Madrid.

For more schedules and details please sign up HERE

Call for Papers Shanghai, 4-5 June 2015 – International Conference on City Sciences (ICCS 2015): New architectures, infrastructures and services for future cities

Call for Papers from the International Conference on City Sciences (ICCS 2015): New architectures, infrastructures and services for future cities, co-organized by City Sciences, where I teach.

Call for Papers – Shanghai, 4-5 June 2015

International Conference on City Sciences (ICCS 2015): New architectures, infrastructures and services for future cities

The new science of cities stands at a crossroads. It encompasses rather different, or even conflicting, approaches. Future cities place citizens at the core of the innovation process when creating new urban services, through “experience labs”, the development of urban apps or the provision of “open data”. But future cities also describe the modernisation of urban infrastructures and services such as transport, energy, culture, etc., through digital ICT technologies: ultra-fast fixed and mobile networks, the Internet of Things, smart grids, data centres, etc. In fact, during the last two decades local authorities have invested heavily in new infrastructures and services, for instance putting online more and more public services and trying to create links between still prevalent silo approaches, with the citizen taking an increasingly centre-stage role. However, so far the results of these investments have not lived up to expectations, and particularly the transformation of the city administration has not been as rapid nor as radical as anticipated. Therefore, it can be said that there is an increasing awareness of the need to deploy new infrastructures to support updated public services and of the need to develop new services able to share information and knowledge within and between organizations and citizens. In addition, urban planning and the urban landscape are increasingly perceived as a basic infrastructure, or rather a framework, on which the rest of the infrastructures and services rely. Thus, as an overarching consequence, there is an urgent need to discuss among practitioners and academicians successful cases and new approaches able to help build better future cities.

Taking place in Shanghai, the paradigm of challenges for future cities and a crossroads itself between East and West, the International Conference on City Sciences responds to these and other issues by bringing together academics, policy makers, industry analysts, providers and practitioners to present and discuss their findings. A broad range of topics related to infrastructures and services in the framework of city sciences are welcome as subjects for papers, posters and panel sessions:

  • Developments of new infrastructures and services of relevance in an urban context: broadband, wireless, sensors, data, energy, transport, housing, water, waste, and environment
  • City sustainability from infrastructures and services
  • ICT-enabled urban innovations
  • Smart city developments and cases
  • Citizen-centric social and economic developments
  • Renewed government services at a local level
  • Simulation and modelling of the urban context
  • Urban landscape as new infrastructure

Additional relevant topics are also welcome.

Authors of selected papers from the conference will be invited to submit to special issues of international peer-reviewed academic journals.

Important deadlines:

  • 20 February: Deadline for Abstracts and Panel Session Suggestions
  • 30 March: Notification of Acceptance
  • 30 April: Deadline for Final Papers and Panel Session Outlines
  • 4-5 June: International Conference on City Sciences at Tongji University in Shanghai, PR China

Submission of Abstracts:

Abstracts should be about 2 pages (800 to 1,000 words) in length and contain the following information:

  • Title of the contribution
  • A research question
  • Remarks on methodology
  • Outline of (expected) results
  • Bibliographical notes (up to 6 main references used in the paper)

All abstracts will be subject to blind peer review by at least two reviewers.

Conference link: International Conference on City Sciences (ICCS 2015): New architectures, infrastructures and services for future cities

Data Science at the Command Line – book and workshop

I am reading a great book called Data Science at the Command Line.

The author, Jeroen Janssens, has a workshop in London on Data Science at the command line, which I am attending.

Here is a brief outline of some of the reasons why I like this approach.

I have always liked the Command line .. from my days of starting with Unix machines. I must be one of the few people to actually want a command line mobile phone!

If you have worked with command line tools, you already know that they are powerful and fast. For data science especially, that’s relevant because of the need to manipulate data and to work with a range of products that can be invoked through a shell-like interface.

The book is based on the Data Science Toolbox, created by the author as an open source tool, and is brief and concise (187 pages). The book focuses on specific commands/strategies that can be linked together using simple but powerful command line interfaces. Examples include:

  • tools such as json2csv, the tapkee dimensionality reduction library, and Rio (created by the author) – Rio loads CSVs into R as a data.frame, executes given commands and gets the output as CSV or PNG
  • run_experiment – a scikit-learn command-line utility for running a series of learners on datasets specified in a configuration file
  • tools like topwords.R
  • and many others
By coincidence, I read this as I was working on this post: command line tools can be 235x faster than your Hadoop cluster.

I recommend both the book and the workshop.

 UPDATE:

a) I have been informed that there is a 50% discount offered for students, academics, startups and NGOs for the workshop
b) Jeroen says that:  The book is not really based on the Data Science Toolbox, but rather provides a modified one so that you don’t have to install everything yourself in order to get started. You can download the VM HERE

Data Science for IoT: The role of hardware in analytics

This post leads towards the vision for the Data Science for IoT course/certification. Please sign up on the link if you wish to know more when it launches in Feb.

Often, Data Science for IoT differs from conventional data science due to the presence of hardware. Hardware could be involved in integration with the Cloud or in processing at the Edge (which Cisco and others have called Fog Computing). Alternatively, we see entirely new classes of hardware specifically involved in Data Science for IoT (such as the SyNAPSE chip for Deep Learning).

Hardware will increasingly play an important role in Data Science for IoT. A good example is from a company called Cognimem, which natively implements classifiers (unfortunately, the company does not seem to be active any more, judging by their Twitter feed).

In IoT, speed and real-time response play a key role. Often it makes sense to process the data closer to the sensor. This allows a limited/summarized data set to be sent to the server if needed and also allows for localized decision making. This architecture leads to a flow of information out from the Cloud and the storage of information at nodes which may not reside on the physical premises of the Cloud.
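A minimal sketch of that edge pattern – the threshold, batch size and uplink list are all invented for illustration: raw readings stay local, a local rule raises alerts immediately, and only periodic summaries are shipped to the cloud:

```python
from statistics import mean

THRESHOLD = 80.0   # illustrative local decision rule
BATCH = 60         # e.g. summarise one reading per second into one record per minute

buffer, uplink = [], []   # "uplink" stands in for whatever cloud transport is used

def on_reading(value):
    """Called at the edge for every raw sensor reading."""
    buffer.append(value)
    if value > THRESHOLD:                     # localized decision, no cloud round-trip
        uplink.append({"type": "alert", "value": value})
    if len(buffer) == BATCH:                  # ship a summary, not the raw data
        uplink.append({"type": "summary", "mean": mean(buffer),
                       "max": max(buffer), "count": len(buffer)})
        buffer.clear()

for v in [70.1, 71.3, 95.2] + [72.0] * 57:    # 60 readings, one of them abnormal
    on_reading(v)
print(uplink)                                 # one alert plus one summary record
```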

In this post, I try to explore the various hardware touchpoints for Data analytics and IoT to work together.

Cloud integration: Making decisions at the Edge

Intel’s Wind River edge management system is certified to work with the Intel stack and includes capabilities such as data capture, rules-based data analysis and response, configuration, file transfer and remote device management.

The integration of Google Analytics into Lantronix hardware allows sensors to send real-time data to any node on the Internet or to a cloud-based application.

Microchip’s integration with Amazon Web Services uses an embedded application with the Amazon Elastic Compute Cloud (EC2) service, based on the Wi-Fi Client Module Development Kit. Languages like Python or Ruby can be used for development.

The integration of Freescale and Oracle consolidates data collected from multiple appliances from multiple Internet of Things service providers.

Libraries

Libraries are another avenue for analytics engines to be integrated into products, often at the point of creation of the device. Xively cloud services are an example of this strategy, through the Xively libraries.

APIs

In contrast, keen.io provides APIs for IoT devices to create their own analytics engines (e.g. the Pebble smartwatch’s use of keen.io) without locking equipment providers into a particular data architecture.

Specialized hardware

We see increasing deployment of specialized hardware for analytics, e.g. Egburt from Camgian, which uses sensor fusion technologies for IoT.

In the Deep Learning space, GPUs are widely used, and more specialized hardware is emerging, such as IBM’s SyNAPSE chip. Even more interesting hardware platforms are emerging, such as Nervana Systems, which creates hardware specifically for neural networks.

Ubuntu Core and IFTTT spark

Two more initiatives on my radar deserve a space of their own, even though neither of them currently has an analytics engine: Ubuntu Core (Docker containers plus a lightweight Linux distribution as an IoT OS) and the IFTTT Spark initiative.

Comments welcome

This post leads towards the vision for the Data Science for IoT course/certification. Please sign up on the link if you wish to know more when it launches in Feb.

Image source: cognimem

Understanding the nature of IoT data

This post is part of the series Twelve unique characteristics of IoT-based predictive analytics/machine learning.

I will be exploring these ideas in the Data Science for IoT course /certification program when it’s launched.

Here, we discuss IoT devices and the nature of IoT data

Definitions and terminology

Business Insider makes some bold predictions for IoT devices:

  • The Internet of Things will be the largest device market in the world.
  • By 2019 it will be more than double the size of the smartphone, PC, tablet, connected car, and wearable markets combined.
  • The IoT will result in $1.7 trillion in value added to the global economy in 2019.
  • Device shipments will reach 6.7 billion in 2019 for a five-year CAGR of 61%.
  • The enterprise sector will lead the IoT, accounting for 46% of device shipments this year, but that share will decline as the government and home sectors gain momentum.
  • The main benefit of growth in the IoT will be increased efficiency and lower costs.
  • The IoT promises increased efficiency within the home, city, and workplace by giving control to the user.

And others say Internet of Things investment will run to 140bn over the next five years.

 

Also, the term IoT has many definitions, but it’s important to remember that IoT is not the same as M2M (machine to machine). M2M is a telecoms term which implies that there is a radio (cellular) at one or both ends of the communication. On the other hand, IoT simply means connecting to the Internet. When we are speaking of IoT (billions of devices), we are really referring to Smart objects. So, what makes an object smart?

What makes an object smart?

Back in 2010, the then Chinese Premier Wen Jiabao said “Internet + Internet of Things = Wisdom of the Earth”. Indeed, the Internet of Things revolution promises to transform many domains. As the term Internet of Things (IoT) implies, IoT is about Smart objects.

 

For an object (say a chair) to be ‘smart’ it must have three things:

-       An identity (to be uniquely identifiable – via IPv6)

-       A communication mechanism (i.e. a radio), and

-       A set of sensors / actuators

 

For example, the chair may have a pressure sensor indicating that it is occupied.

Now, if it is able to know who is sitting on it, it could correlate more data by connecting to the person’s profile.

If it is in a cafe, whole new data sets can be correlated (about the venue, about who else is there, etc.).

Thus, IoT is all about Data.
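Those three ingredients can be captured in a small sketch – a toy data structure of my own for illustration, not a standard API – with the chair as the running example:

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

@dataclass
class SmartObject:
    identity: str                                  # e.g. an IPv6 address
    radio: str                                     # the communication mechanism
    sensors: Dict[str, Callable[[], float]] = field(default_factory=dict)

    def read_all(self) -> Dict[str, float]:
        return {name: read() for name, read in self.sensors.items()}

# The smart chair: identifiable, connected, and carrying a pressure sensor.
chair = SmartObject(
    identity="2001:db8::1",
    radio="Bluetooth Low Energy",
    sensors={"seat_pressure": lambda: 63.2},       # stub reading
)
print(chair.read_all())   # the data that could be correlated with a person's profile
```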

How will Smart objects communicate?

How will billions of devices communicate? Primarily through the ISM band and Bluetooth 4.0 / Bluetooth Low Energy – certainly not through the cellular network (hence the above distinction between M2M and IoT is important). Cellular will play a role in connectivity, and there will be many successful applications / connectivity models (e.g. Jasper Wireless). A more likely scenario is IoT-specific networks like Sigfox (which could be deployed by anyone, including telecom operators). Sigfox currently uses the most popular European ISM band on 868MHz (as defined by ETSI and CEPT), along with 902MHz in the USA (as defined by the FCC), depending on specific regional regulations.

Smart objects will generate a lot of data.

Understanding the nature of IoT data

In the ultimate vision of IoT, Things are identifiable, autonomous, and self-configurable. Objects communicate among themselves and interact with the environment. Objects can sense, actuate and predictively react to events.

Billions of devices will create massive volumes of streaming and geographically dispersed data. This data will often need real-time responses. There are primarily two modes of IoT data: periodic observations/monitoring and abnormal event reporting. Periodic observations present demands due to their high volumes and storage overheads. Events, on the other hand, are one-off but need a rapid response. If we consider video data (e.g. from surveillance cameras) as IoT data, we have some additional characteristics.

Thus, our goal is to understand the implications of predictive analytics for IoT data. This ultimately entails using IoT data to make better decisions.

I will be exploring these ideas in the Data Science for IoT course/certification program when it’s launched. Comments welcome. In the next part of this series, I will explore Time Series data.

 

Content and approach for a Data Science for IoT course/certification

UPDATE:

Feb 15: Applications are now open – Data Science for IoT professional development short course at Oxford University – more coming soon. Any questions, please email me at ajit.jaokar at futuretext.com

We are pleased to announce support from MapR, Sigfox, Hypercat and Red Ninja for the Data Science for IoT course. Everyone finishing the course will receive a University of Oxford certificate showing that they have completed the course. Places are limited, so please apply soon if interested.

In a previous post, I mentioned that I am exploring creating a course/certification for Data Science for IoT

Here are some more thoughts

I believe that this is the first attempt to create such a course/program

I use the phrase “Data Science” to collectively mean Machine Learning/Predictive analytics.

There are of course many Machine Learning courses – the most well known being Andrew Ng’s course at Coursera/Stanford – and the domain is complex enough as it is.

Thus, creating a course/certification covering both Machine Learning/Predictive analytics and IoT can be daunting.

However, the sector-specific focus gives us some unique advantages.

Already at UPM (Universidad Politécnica de Madrid) I teach Machine Learning/Predictive analytics for the Smart cities domain through their City Sciences program (the remit there being to create a role for the Data Scientist for a Smart city).

So, this idea is not totally new for me.

Based on my work at UPM (for Smart cities), teaching Data Science for a specific domain (like IoT) has challenges but also some unique advantages.

The challenges are: you have an extra level of complexity to deal with (in teaching IoT along with Predictive analytics).

But the advantages are:

a) The IoT domain focus allows us to be more pragmatic by addressing unique Data Science problems for IoT

b) We can take a context-based learning approach – a technique more common in Holland and Germany for teaching engineering disciplines – which I have used in teaching computer science to kids at Feynlabs

c)  We don’t need to cover the maths upfront

d)  The participant can become productive faster and apply ideas to industry sooner

Here are my thoughts on the elements such a program could cover based on the above approach: 

1) Unique characteristics – IoT ecosystem and data

2) Problems and datasets. This would cover specific scenarios and datasets needed (without addressing the predictive aspects)

3) An overview of Machine learning techniques and algorithms (Classification, Regression, Clustering, Dimensionality reduction etc) – this would also include the basic Math techniques needed for understanding algorithms

4) Programming: Python and scikit-learn

5) Specific platforms/case studies

Time series data (MapR)

Sensor fusion for IoT (Camgian – Egburt)

NoSQL data for IoT (e.g. MongoDB for IoT)

Managing very high volume IoT data (MapR loading a time series database at 100 million points per second)

I also include image processing with sensors / IoT (e.g. surveillance cameras)

Hence:

IBM – detecting skin cancer more quickly with visual machine learning

Real-time face recognition using Deep Learning algorithms

and even – combining the Internet of Things with deep learning / predictive algorithms (@numenta)

To conclude:

The above approach to teaching a course on Data Science for IoT would help focus Machine Learning / Predictive algorithms on real-life problem-solving scenarios for IoT.

Comments welcome.

You can sign up for more information at  futuretext and also follow me on twitter @ajitjaokar

Image source: wired

A business model for IoT retail (Beacons): ‘Datalogix-like’ insights which tie the social to the physical through Data Science and IoT?

This post is a part of my trends for the newsletter/course+certification I am launching in 2015 for “Data Science in IoT”.

Please sign up at futuretext if you want to know more as they develop in Jan

Note: in this post I am not interested in the Datalogix store-card model as such, but rather in the implications of what it could mean for IoT.

Late last year, Oracle acquired a company called Datalogix ..

A Christmas gift perhaps for Larry Ellison – but with profound and disruptive implications

Datalogix does something very unique, and it had been on my radar especially for its relationship to Facebook.

The EFF describes this process in more detail, which I summarize here (Deep dive: Facebook and Datalogix – what’s actually getting shared).

Datalogix is an advertising metrics company that describes its data set as including “almost every U.S. household and more than $1 trillion in consumer transactions.” It specifically relies on loyalty card data – cards anyone can get by filling out a form at a participating grocery store.

Data from such loyalty programs is the backbone of Datalogix’s advertising metrics business

What data is actually exchanged?

Datalogix assesses the impact of Facebook advertisements on shopping in the physical world.

Datalogix begins by providing Facebook with a (presumably enormous) dataset that includes hashed email addresses, hashed phone numbers, and Datalogix ID numbers for everyone they’re tracking. Using the information Facebook already has about its own users, Facebook then tests various email addresses and phone numbers against this dataset until it has a long list of the Datalogix ID numbers associated with different Facebook users.

Facebook then creates groups of users based on their online activity. For example, all users who saw a particular advertisement might be Group A, and all users who didn’t see that ad might be Group B. Then Facebook will give Datalogix a list of the Datalogix ID numbers associated with everyone in Groups A and B and ask Datalogix specific questions – for example, how many people in each group bought Ocean Spray cranberry juice? Datalogix then generates a report about how many people in Group A bought cranberry juice and how many people in Group B bought cranberry juice. This will provide Facebook with data about how well an ad is performing, but because the results are aggregated by groups, Facebook shouldn’t have details on whether a specific user bought a specific product. And Datalogix won’t know anything new about the users other than the fact that Facebook was interested in knowing whether they bought cranberry juice.
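A toy sketch of those mechanics – the identifiers and purchases are invented, and real systems differ in many details – showing how identifiers are hashed on both sides, matched to IDs, and reported back only as group-level counts:

```python
import hashlib
from collections import Counter

def h(email):
    # Both sides hash identifiers, so raw email addresses are never exchanged.
    return hashlib.sha256(email.strip().lower().encode()).hexdigest()

# Datalogix side: hashed email -> Datalogix ID, plus purchase records per ID.
dlx_ids = {h("alice@example.com"): "D1", h("bob@example.com"): "D2",
           h("carol@example.com"): "D3"}
purchases = {"D1": {"cranberry juice"}, "D2": set(), "D3": {"cranberry juice"}}

# Facebook side: its own users' hashed contact details, grouped by ad exposure.
group_a = [h("alice@example.com"), h("bob@example.com")]     # saw the ad
group_b = [h("carol@example.com")]                           # did not see the ad

def report(group):
    # Match hashes to IDs, then return only aggregate counts for the group.
    ids = [dlx_ids[x] for x in group if x in dlx_ids]
    return Counter("bought" if "cranberry juice" in purchases[i] else "did_not_buy"
                   for i in ids)

print("Group A:", report(group_a))   # e.g. Counter({'bought': 1, 'did_not_buy': 1})
print("Group B:", report(group_b))
```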

This is very interesting and powerful

But let’s think beyond store cards – think IoT / Beacons.

Substitute ‘store cards’ with ‘retail IoT’ and you have a unique model that could power IoT in retail, driven by IoT analytics.

Beacon-based shopping already exists via companies like Estimote.

So, my point is: the model (independent of Datalogix the company) could be used to close the loop between the physical and the social. IoT / Data Science / data analytics will play a key role here.

Comments welcome on twitter @ajitjaokar