Small Data: A Deterministic and Predictive Approach

 

Image source: Daniel Villatoro 

Abstract

In this post, I expand on the idea of ‘Small data’.

I present a generic model for Small data that combines deterministic and predictive components.

Although I present the ideas in the context of IoT (which I understand best), the same algorithms and approach could apply to domains such as Retail, Telecoms and Banking.

We could have a number of data sets which are individually small, yet it is possible to find value at their intersection. This approach is similar to the mobile industry / foursquare scenario of knowing the context in order to provide the best service or offer to a customer segment of one. That’s a powerful idea in itself and a reason to consider Small data. However, I wanted to extend the deterministic aspects of Small data (the intersection of many small data sets) by also considering the predictive aspects. The article describes a general approach for adding a predictive component to Small data, which comprises three steps: a) a limited set of features is extracted, b) their dimensionality is reduced (e.g. using clustering) and c) a classification and recognition method such as Hidden Markov Models is used to recognize a higher order metric (e.g. walking or footfall).

Introduction

Last week, I gave an invited talk on IoT and Machine Learning at the Bigdap conference organized by the Ontic project. The Ontic project is an EU FP7 project doing some interesting work on Big Data and Analytics, mainly from a Telco perspective.

The audience was technical, and this was reflected in the themes of the event (for example: techniques, models and algorithms for Big Data; scalable data mining and machine learning techniques and mechanisms; Big Data security and privacy challenges; cleaning Big Data (noise reduction), acquisition and integration; multidimensional Big Data; algorithms for enhancing data quality).

This blog post is inspired by some conversations following my talk with Daniel Villatoro (BBVA) and Dr Alberto Mozo (UPM/Ontic). It extends many of the ideas and papers I referenced in my talk.

Background

In his talk, Daniel referred to ‘small data’ (image from his slides, used with permission). In this context, as per the slide, Small data refers to the intersection of various elements such as customers, offers and social context for a small retailer. Small data is an interesting concept and I wanted to explore it more. So, I spent the weekend thinking more about it.

When you have the data elements, the concept of small data is deterministic. It is similar to the mobile industry / foursquare scenario of knowing the context in order to provide the best service or offer. Thus, given the right datasets, you can find value at their intersection. This works even if the individual data sets are small, as long as you find enough intersecting datasets to create a customer segment of one at their intersection.

That’s a powerful idea in itself and a reason to consider Small Data.
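
A minimal Python sketch of the deterministic idea: three small data sets are intersected to arrive at a segment of one. The customer ids, the data set names and the segment helper are all invented for illustration.

```python
# Hypothetical small data sets, each a set of customer ids (all values invented).
nearby_now = {"c01", "c07", "c12", "c31"}          # location context
bought_coffee_this_week = {"c07", "c12", "c19"}    # purchase history
opted_in_to_offers = {"c03", "c12", "c31"}         # consent list


def segment(*datasets):
    """Return the ids that appear in every data set."""
    return set.intersection(*datasets)


target = segment(nearby_now, bought_coffee_this_week, opted_in_to_offers)
print(target)  # {'c12'}
```

Each set on its own says little, but only one customer satisfies all three conditions at once, and that is where the offer can be targeted.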

However, I wanted to extend the deterministic aspects of Small data (intersection of many small data sets) by also considering the predictive aspects. For the predictive aspects, we want to infer insights from relatively limited data sets.

In addition, I was also looking for a good use case to teach my students @citysciences. Hence, this post explores the predictive aspects of Small data in an IoT context.

I believe the ideas I discuss could apply to any scenario (e.g. retail or banking) and indeed also to Big Data sets.

A caveat:

The examples I have considered below strictly apply to Wireless Sensor Networks (WSNs). WSNs differ from IoT because there is potentially communication between the nodes. The topology of a WSN can vary from a simple star network to an advanced multi-hop wireless mesh network. The propagation technique between the hops of the network can be routing or flooding. In contrast, IoT nodes do not necessarily communicate with each other in this way. For our purposes, however, the examples remain valid because we are interested in the insights inferred from the data.

Predictive characteristics of Small data

From a predictive standpoint, I propose that Small data will have the following characteristics:

1)      The data is missing or incomplete

2)      The data is limited

3)      Alternatively, we have large data sets which need to be reduced to a smaller data set that is more relevant to the problem at hand (e.g. for a small retailer)

4)      There is a need for inferred metrics, i.e. higher order metrics derived from raw data

This complements the deterministic aspects of Small data, i.e. finding a number of data sets and identifying the value at their intersection even if each data set is itself small.

So, based on the papers I reference below, I propose three techniques that can be used for understanding Small data from a predictive standpoint:

1)      Feature extraction

2)      Dimensionality reduction

3)      Feature Classification and recognition

To discuss these in detail, I use the problem of monitoring physical activity for assisted living patients. These patients live in an apartment that is monitored in a privacy-aware manner: we use sensors and infer behaviour from the sensor readings, yet we want to protect the privacy of the patient.

The papers I have referred to are (also in my talk):

  • Activity Recognition Using Inertial Sensing for Healthcare, Wellbeing and Sports Applications: A Survey – Akin Avci, Stephan Bosch, Mihai Marin-Perianu, Raluca Marin-Perianu, Paul Havinga University of Twente, The Netherlands
  • Robust Location-Aware Activity Recognition Using Wireless Sensor Network in an Attentive Home – Ching-Hu Lu and Li-Chen Fu

This problem is a ‘small data’ problem because we have limited data, some of it is missing (not all sensors can be monitoring at all times) and we have to infer behaviour based on raw sensor readings. We will complement this with the deterministic interpretation of Small Data (where we accurately know a reading).

Small data: Assisted Living Scenario

Source: Robust Location-Aware Activity Recognition Using Wireless Sensor Network in an Attentive Home, Ching-Hu Lu, Student Member, IEEE, and Li-Chen Fu, Fellow, IEEE

In an assisted living scenario, the goal is to recognize activity based on the observations of specific sensors. Traditionally, researchers used vision sensors for activity recognition. However, that is very privacy invasive.  The challenge is thus to recognize human behaviour based on raw readings / activity from multiple sensors. In addition, in an assisted living system, the subject being monitored may have a disorder (for example Cognitive disorders or Chronic conditions).

The techniques presented below could also apply to other scenarios, e.g. to detect Quality of Experience in Telecoms, or in general to any situation where we have to infer insights from relatively limited data sets (e.g. footfall).

The steps for retrieving activity information from raw sensor data are: preprocessing, segmentation, feature extraction, dimensionality reduction and classification.

In this post, we will consider the last three, i.e. feature extraction, dimensionality reduction and classification. We could use these three techniques in situations where we want to create a predictive component for ‘small data’.

 

Small data: Extracting predictive insights

In the above scenario, we could extract new insights using the following predictive techniques (even when we have less data)

 1)      Feature extraction

Feature extraction takes raw data readings as input and finds the main characteristics of a data segment that accurately represent the original data. This smaller set of features can be described as an abstraction of the raw data. The purpose of feature extraction is to transform large quantities of input data into a reduced set of features, represented as an n-dimensional feature vector. This feature vector is then used as the input to a classification algorithm.
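
As a rough sketch of what this can look like in Python, the block below cuts a one-dimensional sensor signal into fixed windows and summarises each window as a four-element feature vector. The choice of features (mean, standard deviation, energy, mean crossings), the window size and the synthetic signal are assumptions made for illustration, not the feature set used in the referenced papers.

```python
import statistics


def extract_features(window):
    """Summarise one window of raw readings as a small feature vector."""
    mean = statistics.fmean(window)
    std = statistics.pstdev(window)
    energy = sum(x * x for x in window) / len(window)
    # Count sign changes around the mean, a rough proxy for movement frequency.
    centred = [x - mean for x in window]
    crossings = sum(1 for a, b in zip(centred, centred[1:]) if a * b < 0)
    return [mean, std, energy, crossings]


def sliding_windows(signal, size, step):
    """Yield consecutive windows of `size` samples, advancing by `step`."""
    for start in range(0, len(signal) - size + 1, step):
        yield signal[start:start + size]


# Synthetic accelerometer-like signal (invented): 8 samples at rest, 8 moving.
resting = [1.0] * 8
moving = [0.0, 2.0] * 4
signal = resting + moving

features = [extract_features(w) for w in sliding_windows(signal, size=8, step=8)]
for vector in features:
    print(vector)
```

Sixteen raw readings become two short vectors, and the two windows already look different in their spread and crossing count even though their means are identical.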

 2)      Dimensionality Reduction

Dimensionality reduction methods aim to increase accuracy and reduce computational effort. If the dimensionality of a feature set is too high, some features may be irrelevant and provide no useful information for classification. By reducing the number of features involved in the classification process, less computational effort and memory are needed to perform the classification. The two general forms of dimensionality reduction are feature selection and feature transform.

 Feature selection methods select the features which are most discriminative and contribute most to the performance of the classifier, in order to create a subset of the existing features. For example, SVM-based feature selection has been used to pick out the most important features, with the conclusion that five attributes are enough to classify daily activities accurately. K-means clustering is a method for uncovering structure in a set of samples by grouping them according to a distance metric; in this context it is used to rank individual features according to their discriminative properties and their correlations.
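
To illustrate feature selection in a few lines of Python, the sketch below ranks features with a simple Fisher-style score (squared gap between class means divided by the summed class variances). This is a basic filter method swapped in for illustration; it is not the SVM-based or k-means-based selection mentioned above, and the sample values and names are invented.

```python
import statistics


def fisher_score(values_a, values_b):
    """Between-class separation divided by within-class spread for one feature."""
    gap = (statistics.fmean(values_a) - statistics.fmean(values_b)) ** 2
    spread = statistics.pvariance(values_a) + statistics.pvariance(values_b)
    return gap / (spread + 1e-12)  # small constant avoids division by zero


def rank_features(samples_a, samples_b, names):
    """Return feature names ordered from most to least discriminative."""
    scores = {}
    for i, name in enumerate(names):
        col_a = [row[i] for row in samples_a]
        col_b = [row[i] for row in samples_b]
        scores[name] = fisher_score(col_a, col_b)
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Invented feature vectors [mean, std, energy] for two activities.
names = ["mean", "std", "energy"]
sitting = [[1.0, 0.1, 1.0], [1.1, 0.2, 1.2], [0.9, 0.1, 0.9]]
walking = [[1.0, 0.9, 2.0], [1.2, 1.1, 2.6], [0.8, 1.0, 1.8]]

ranking = rank_features(sitting, walking, names)
print(ranking)  # ['std', 'energy', 'mean']
```

Here the mean is the same for both activities and so scores close to zero; keeping only the top one or two features is the "subset of the existing features" that selection aims for.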

 Feature transform methods: Feature transform techniques try to map the high dimensional feature space into a much lower dimension, yielding fewer features that are combinations of the original features. They are useful in situations where multiple features collectively provide good discrimination but would individually provide poor discrimination. Principal Component Analysis (PCA) is a well known and widely used statistical analysis method that can be used to transform the original features into a lower dimensional space.
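
A small Python sketch of the feature transform idea: it estimates the first principal component of a feature set by power iteration on the covariance matrix and projects every sample onto it, so two features become one combined feature. The data are invented and perfectly correlated to keep the result easy to check; a real project would more likely use a library implementation of PCA.

```python
import math


def first_principal_component(rows, iterations=100):
    """Estimate the direction of greatest variance by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centred = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Covariance matrix of the centred data.
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(d)]
           for i in range(d)]
    vec = [1.0] * d
    for _ in range(iterations):
        vec = [sum(cov[i][j] * vec[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(v * v for v in vec))  # assumes non-constant data
        vec = [v / norm for v in vec]
    return means, vec


def project(row, means, component):
    """Map one d-dimensional feature vector onto a single combined feature."""
    return sum((x - m) * c for x, m, c in zip(row, means, component))


# Invented two-feature samples where the second feature is twice the first.
rows = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
means, component = first_principal_component(rows)
reduced = [project(r, means, component) for r in rows]
print(component)  # roughly [0.447, 0.894]
print(reduced)    # one value per sample instead of two
```

The single projected value keeps the ordering and spacing of the samples, which is all the information these two redundant features carried.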

 3)     Classification and Recognition: The selected or reduced features from the dimensionality reduction process are used as inputs for the classification and recognition methods.  

For example: Nearest Neighbor (NN) algorithms (e.g. the k-NN algorithm) classify activities based on the closest training examples in the feature space.
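
A minimal k-NN sketch in Python, assuming two-element feature vectors and Euclidean distance. The training examples, activity labels and the knn_predict helper are invented for illustration.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Label a feature vector by majority vote among its k closest examples."""
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


# Invented training set of (feature vector, activity label) pairs.
train = [
    ([0.1, 1.0], "sitting"), ([0.2, 1.1], "sitting"), ([0.1, 0.9], "sitting"),
    ([1.0, 2.0], "walking"), ([1.1, 2.4], "walking"), ([0.9, 1.9], "walking"),
]

print(knn_predict(train, [0.15, 1.0]))  # sitting
print(knn_predict(train, [1.0, 2.1]))   # walking
```

There is no training step as such: the labelled examples are the model, which suits situations where only a small amount of labelled data is available.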

 Naïve Bayes is a simple probabilistic classifier based on Bayes’ theorem which can be used for classification.

 Support Vector Machines (SVMs) are supervised learning methods used for classification. In the assisted living scenario, an SVM-based activity recognition system using objects fitted with sensors can recognize drinking, phoning and writing activities.

 Hidden Markov Models (HMMs) are statistical models that can also be used for activity recognition. To explain hidden Markov analysis, I used a simple analogy from a paper that uses an HMM to infer temperature in the distant past from tree ring sizes.
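
The tree ring analogy can be sketched in Python with the Viterbi algorithm, which finds the most likely sequence of hidden states (hot or cold years) given a sequence of observations (ring sizes). The probabilities below are illustrative numbers chosen for this sketch and are not taken from this post; in the assisted living setting, the hidden states would be activities and the observations would be sensor-derived features.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden state sequence and its probability."""
    # best[t][s] = (probability of the best path ending in s at time t, previous state)
    best = [{s: (start_p[s] * emit_p[s][observations[0]], None) for s in states}]
    for obs in observations[1:]:
        column = {}
        for s in states:
            prob, prev = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s][obs], p) for p in states)
            column[s] = (prob, prev)
        best.append(column)
    # Trace back from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    prob = best[-1][last][0]
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path, prob


# Illustrative model parameters (assumed for this example).
states = ("hot", "cold")
start_p = {"hot": 0.6, "cold": 0.4}
trans_p = {"hot": {"hot": 0.7, "cold": 0.3},
           "cold": {"hot": 0.4, "cold": 0.6}}
emit_p = {"hot": {"small": 0.1, "medium": 0.4, "large": 0.5},
          "cold": {"small": 0.7, "medium": 0.2, "large": 0.1}}

rings = ["small", "medium", "small", "large"]
path, prob = viterbi(rings, states, start_p, trans_p, emit_p)
print(path)  # ['cold', 'cold', 'cold', 'hot']
```

The temperatures are never observed directly; they are inferred from the ring sizes, which is the same move as inferring "walking" from raw sensor readings.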

 Gaussian Mixture Models (GMMs) can be used to recognize transitions between activities.

 Artificial Neural Networks can also be used to detect occurrences, e.g. falls.

 Thus, we get a scenario as shown in the following two figures:

Figure: sensors (adapted from Activity Recognition Using Inertial Sensing for Healthcare, Wellbeing and Sports Applications: A Survey)

Figure: activity (adapted from Robust location-aware activity recognition: Lu and Fu)

Small Data: Complementing the Deterministic by the predictive

To conclude:

Small Data can be a deterministic problem when we know a number of datasets and the value lies at the intersection of these data sets. This strategy is possible with mobile context-based services and location-based services. The results so achieved can also be complemented by a predictive component of Small data.

In this case, a limited set of features is extracted, their dimensionality is reduced (e.g. using clustering) and finally we use a classification and recognition method such as Hidden Markov Models to recognize a higher order metric (e.g. walking, retail footfall etc).

I believe that these ideas could be adapted to many domains. Data science is an engineering problem. It’s like building a bridge where there is no fixed solution in advance: every bridge is different and presents a unique set of challenges. I like the blog post ‘Machine Learning is not a Kaggle competition’. The author (Julia Evans) correctly emphasizes that we need to understand the business problem first. So, I think the above approach could apply to many business scenarios, e.g. Retail (footfall), Healthcare and Airport lounges, by inferring predictive insights from data streams.

 

Ardusat, Countdown Institute at CTIA connected for Good event (part of super mobility week) in Las Vegas

In October, we fully launch the Countdown Institute in Miami (lab Miami) for STEM education

Countdown is based on Ardusat technology, which allows you to conduct experiments in space on a live Cubesat-based satellite

Essentially, the Ardusat is based on Cubesat and contains Arduino sensors, which allow us to learn Computer Science in the context of Space exploration experiments

Sunny Washington, President of Ardusat, is speaking at the CTIA connected for good event (part of the Super Mobility week) in Las Vegas today

It’s great to see this

The talk reflects the hard work our team in Miami has been putting in working with Ardusat (Richard, Jessica, Alex and also the faculty Nelson, Willie and Patrick)

If you are at CTIA – say Hi to the Ardusat team!

New futuretext web site is now live

Over the last two years, I have been refocussing my work and much of that is now complete

 

Have a look at the new futuretext site which reflects my emphasis on Machine Learning and IoT – both for projects and teaching

 

Why I signed a petition in favour of Amazon at change.org

I supported this change.org petition in favour of Amazon – Stop fighting low prices and fair wages – with the following comment:

While I may not agree with everything Amazon does, I think Amazon has created a level playing field for a whole set of new content creators. In that sense, in future, it will serve new content creators better and lead to more innovation. Existing publishers can never do that. I also agree with the ebook pricing argument from Amazon. Also, as a customer of Amazon, they have my goodwill and trust. I cannot say the same of any traditional publisher (with the exception of O’Reilly, who are very non-traditional anyway). Thus, I believe, from the past record, Amazon will continue to innovate and serve its content creators and customers better than existing publishers.

My slides for IoT and machine learning – Computational Intelligence conference #CIUUK14

I spoke at the Computational Intelligence conference on Saturday at BT HQ in St Paul’s, London, and it was a very interesting event

I was surprised to see more than 300 people in London on a sunny afternoon for what is essentially a VERY geeky topic!
My talk (IoT and Machine Learning) got a lot of positive feedback, as per these tweets:
Thura Z. Maung @thuramg: Enjoyed the talks #CIUUK14 today, particularly Artificial Super Intelligence and IoT/Machine Learning…
Brett Hutley @hitechnomad: I enjoyed the conference #CIUUK14 my favourite talk was probably the Internet of Things and Machine Learning
Robert Thomas @dizzybanjo: Arrived at #CIUUK14 interesting talk about machine learning pic.twitter.com/bV2W9aSVCb
Diogo Neves @DiogoSnows: .@AjitJaokar what a great talk you just gave! thanks!!!!
Joe Da Silva @joemagicdevelop: Brilliant talk by Ajit Jaokar on #MachineLearning applied to the #enterprise and #gov http://ow.ly/i/6mcmn #CIUUK14

Please sign up at futuretext
I am working on a larger paper on IoT and Machine Learning and shall email it when it’s released

About the feynlabs methodology

We have been working on feynlabs for about a year and a half, leading up to the launch of the new Computer Science syllabus in September

Contact details info at feynlabs.com OR ajit.jaokar at futuretext.com 

Here are more details:

 feynlabs develops apps for Computer Science education.

Specifically, we address the problem of accelerating the learning of Computer Science in schools. 

Many countries – including the UK, China, USA – are switching to a more enhanced Computer Science syllabus in Schools (ages 10 to 17). Both teachers and students have to navigate a steep learning curve due to this change.

Although learning to code is an important part of Computer Science, Computer Science is more than coding.

There are two aspects of Computer Science: Programmability (learning to code) and Computability (i.e. Physical Computing, Problem solving, Algorithmic thinking etc).

Our methodology combines these two aspects by reusing Concept maps for teaching Computer Science. Concept Mapping is a learning technique originally developed in the 1970s by Joseph Novak and Bob Gowin

In practice, we use Concept maps in two ways to accelerate the learning of Computer Science in schools:

a)       Feyncode:  feyncodes use the ideas of assimilation theory, i.e. stressing the importance of prior knowledge in learning new concepts (which is one of the foundations of concept mapping), to teach Programming. We start with the familiar and extend to new concepts. We begin with concept maps of one programming language (Python) and then extend this idea to other programming languages through similarities and differences. This allows learners to quickly master the familiar and then focus on the new elements in other languages by relating them back to existing knowledge. For example, we start with Python and then extend to JavaScript and C. This strategy extends the learning to the systems domain (C) and the Web domain (JavaScript) while starting from a familiar paradigm (Python). We also use the feyncodes technique to explore multiple languages implemented within a platform (e.g. the Raspberry Pi). The Pi already has many languages ported to it; more recently, Mathematica and the Go language have been ported to the Pi. The feyncodes mapping technique thus allows us to explore multiple languages in the context of the Raspberry Pi.

 

b)      Feynmaps: Using concept maps, feynmaps address the problem of Computational thinking and Problem solving. We look at common sets of problems solved by Physical devices like the Raspberry Pi and Arduino.  In the first instance, we identify the following categories:  Actuating , Entertainment,  Environment,  Home automation, Monitoring, Robotics, Sensing and  Software and Utilities. Feynmaps are concept maps for each of the above categories focussed on the problems being solved and how they relate back to the Computer Science syllabus for teaching. Thus, feynmaps enable the teacher and the learner to assimilate, learn and teach a large amount of information about specific Physical computing

Additional notes about the vision:

  • Feynmaps and feyncodes are released under Creative commons
  • We follow the UK Computer Science syllabus – specifically the UK CAS syllabus
  • We incorporate Physical computing especially Raspberry Pi and Arduino
  • We believe in the idea of ‘incomplete models’ for learning – most recently articulated in the book The Curiosity Cycle by Dr Jonathan Mugan. The Curiosity Cycle builds on the idea of ‘incomplete models’, i.e. the idea that an incorrect or incomplete model is better than no model at all – as long as the process of creating, assimilating and validating models, i.e. the curiosity cycle, is inculcated in a child
  • A 1984 book by Joseph Novak and Bob Gowin originally outlined Concept maps. A more recent version of the book is still available on Amazon

Testimonials from teachers and industry leaders

Industry leaders

Hung Ly – Head of Department at Sir John Cass Secondary School

“My name is Hung Ly and I am the Head of Department at Sir John Cass Secondary School and Sixth Form College in Tower Hamlets, East London. I have been asked to write a short testimonial to what I think of the free programming course run by Ajit Jaokar of feynlabs. To be honest this is really new to school especially with the introduction of the Raspberry Pi and Python coding. I really liked the hands on approached and the excellent communications that Ajit offers. Each session is communicated in advance and liaised with me to ensure that the resources are available and pitched at the right level with the combination of theory/concept/metaphor of programming to really trying out programming itself. We are coming towards the end of our sessions and I sure the students and other staff members will miss Ajit and his lectures. I would like to thank Ajit and his associations in providing such an invaluable insight into the world of programming and making it an experience that we will never forget and something that can grow at this school in the future.” 

Wiard Vasen – Teacher Computer Science Montessori Lyceum Amsterdam

“Ajit Jaokar feels the urge to help people, no matter what age, gender or race, to find their individual fulfilment and meaning in life and He does this with the art of Programming.”

Robert Mullins – co-founder of the Raspberry Pi Foundation

“Since early 2012, I have been following the work of Ajit Jaokar and feynlabs – as they use the Raspberry Pi in innovative ways in education. I watch this space with interest to see how their work evolves”

Peter Vesterbacka – Mighty Eagle at Rovio Mobile

“I was one of the first people to LIKE the feynlabs page on Facebook.  Angry Birds demonstrate that we need the next generation to understand computer science from the outset. Initiatives like this will encourage more young people to take up computer science – and it’s great to see the progress and uptake for feynlabs”

Carlos Domingo – Director of Product Development and Innovation – Telefonica

“As someone who follows innovation and start-ups worldwide and a recent father, I am conscious of the need for creating an interest in Computer Science in the next generation.  In this context, Ajit Jaokar and feynlabs are doing some great work.. and i hope it helps create more start-ups in future”

Dr Mike Short CBE FREng FIET – IET President 2011/2012

“Computer science and programming are more important to the Digital economy than ever before. Courses such as these go back to basics and can help prepare Digital citizens to inspire development and follow their interests in the modern world.”

 

Prof Peter Cochrane OBE

“The education system is broken!  Remembering facts and solving problems by ‘turning handles’ just doesn’t cut the mustard in the fast world of technology. We need a new breed who solve problems by thinking !  Feynlabs mission is to transform those constrained by a national curriculum and turn them into the problem solvers of tomorrow.”

Howard Rheingold – Internet Pioneer, Author and Thought leader

“Understanding programming is important for even (especially!) young students who are growing up in a digital world — either they learn how to shape that world, or will have to accept that their world will be shaped by others — and understanding computation, a powerful thinking tool in the tradition of logic and  geometry,is perhaps even more important in a world where knowing how to think and how to skillfully wield thinking tools is ever more important. The approach being explored by feynlabs could be crucially important — an experiment with potential social payback that far outweighs the risk of failure. Indeed, knowing how to deal with failure — and to use it to overcome obstacles — is essential to both programming and learning.”

Lawrence Lipsitz – founder, editor and publisher of “Educational Technology”

“Ajit Jaokar is a visionary who seeks to take advantage of the digital revolution now underway throughout the world in order to vastly improve education for all children and young adults. He believes that a deep knowledge of Programming — in all of its aspects — is becoming a necessity for both career advancement and everyday living in the world that is coming into clear view, for those, like Ajit, able and willing to see it.”

IoT and Machine Learning – participation, proceedings, case studies etc

The IoT and Machine Learning workshop at the IOT world event promises to be a truly special event.

We have some attendee passes (with a discount code which allows you to attend the day only) and opportunities for case studies/presentations

If you are interested in attending with the discount code or contributing – please contact me at ajit.jaokar at futuretext.com

 

Countdown: Coding for the Stars – By Ajit Jaokar and Aditya Jaokar – learn the Raspberry Pi and Arduino through space exploration

Here is more about our book (co-authored by me and my 10 year old son)

Extending the ideas in the book, working with Alex De Carvalho and the Lab Miami , we are also setting up an accompanying Multimedia research center in Miami for kids to learn Programming and Computer Science using Space technology

Story

Idea originally inspired by a NASA scientist who said ‘Space unites humanity .. ‘

A group of kids who are based globally decide to collaborate and launch a Satellite in space.

The Satellite is based on the Raspberry Pi, Arduino and other open source technologies

In doing so, they learn about specific technologies like Raspberry Pi and Arduino in context of space exploration.

Each child has a particular expertise and is based in a different part of the world (and often has some limitations/quirks).

The protagonist (a boy aged 10 based in London) – has the idea to launch a grand plan – a satellite in space based on Arduino / Raspberry Pi

He creates a group on social media – and asks to see who wants to join to help him create this satellite

A group of kids globally respond:

A boy from China who has great mechanical abilities
A girl from Philippines who is good at programming
A girl living in Miami who is originally from Brazil who is into design
A boy from Germany who is good at hardware
A boy from Russia who is also good at technology

The story is about this group of kids who collaborate to launch the satellite in space.

The story is told as a series of three books

a)      The plan – Design (book launched in Oct 2014 in Miami)

b)      The build – How to make the satellite

c)       Blast off – the launch

The vision

The world will be like this in future

Science and skill will unite humanity.

Talent will be found all over the world and people(even kids) will collaborate to create something amazing

Technically the idea of launching an Arduino based satellite into space is very much possible. We use this to teach design and programming to kids

The technology

Apart from the story line, from a technical perspective – the basic idea is:

You could teach kids about a temperature sensor in isolation OR you could teach them the same idea of a temperature sensor but in context of a satellite. Which one is better?

The idea is not as far-fetched as it may sound, i.e. this is very much doable. The NASA Elana program provides a possibility to explore these ideas based on the Cubesat standard, explained more here (NASA Elana program with Cubesat standard)

So, we have a story (A group of kids who collaborate globally to launch a Satellite in Space) – and the idea is to teach kids about programming and design using the Pi / Arduino using the NASA Elana program

So, while our story is fictional – it’s a great way for kids to learn about real programming in context. The book will also include real examples, exercises and code about the venture

Thoughts and comments welcome ..

Launch and Research center

The book will be launched as a Kickstarter project in June 2014. A research center will also be based in Miami on the ideas for this book. The center will enable kids to learn about the Raspberry Pi and Arduino through Space technology

We will also hold learning to code sessions in Miami in the week beginning June 9

We are grateful for the help and feedback from NASA and the European Space Agency for this project

We also thank Alex De Carvalho and the Lab Miami for their help in this project

If you are interested to know more – please email me at ajit.jaokar at futuretext.com

PS – I just saw this: Tiny KickSat Sprite satellites hitch ride into orbit. Hence, our idea of teaching kids to code using a story in the context of Space is nearer than we thought!

PPS: We can even fork the code on github

4th FOKUS Media Web Symposium

I am a regular speaker at the Media Web Symposium at Fraunhofer FOKUS in Berlin

Unfortunately, this year, I cannot make it – but as usual – they have a great speaker lineup

The registration is HERE

 

Using Satellites to teach Programming, Raspberry Pi, Arduino and engineering ..

I have explored this idea before – in the form of a fictional book

The story line of our book is for kids to learn Raspberry Pi and Arduino by learning how to design and launch a satellite

Apart from the story line, from a technical perspective – the basic idea is:

You could teach kids about a temperature sensor in isolation OR you could teach them the same idea of a temperature sensor but in context of a satellite. Which one is better?

The idea is not as far-fetched as it may sound, i.e. this is very much doable – and hence it is a story based on fact

The NASA Elana program provides a possibility to explore these ideas based on the  Cubesat standard explained more here (NASA Elana program with Cubesat standard)

(NASA’s Kennedy Space Center in Florida is adapting the Poly-Picosatellite Orbital Deployer, or PPOD, to put these CubeSats into orbit. This deployment system, designed and manufactured by the California Polytechnic State University in partnership with Stanford University, has flown previously on Department of Defense and commercial launch vehicles.)

So, we have a story (A group of kids who collaborate globally to launch a Satellite in Space) – and the idea is to teach kids about programming and design using the Pi / Arduino using the NASA Elana program

The Ardusat project used Arduino with Cubesat. Its Arduino processors may sample data from the following sensors:

one digital 3-axis magnetometer (MAG3110)

one digital 3-axis gyroscope (ITG-3200)

one 3-axis accelerometer (ADXL345)

one infrared temperature sensor with a wide sensing range (MLX90614)

four digital temperature sensors (TMP102) : 2 in the payload, 2 on the bottomplate

two luminosity sensors (TSL2561) covering both infrared and visible light : 1 on the bottomplate camera, 1 on the bottomplate slit

two geiger counter tubes (LND 716)

one optical spectrometer (Spectruino)

one 1.3MP camera (C439)
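
To give a flavour of the kind of exercise a sensor from this list could support, here is a short Python sketch that converts a raw TMP102 reading into degrees Celsius. It assumes the sensor's default 12-bit mode, where the reading is a two's complement value left-justified in two bytes with 0.0625 degrees C per count. The function name is hypothetical, and reading the two bytes from the sensor over I2C is left out.

```python
def tmp102_to_celsius(msb, lsb):
    """Convert the two bytes of a TMP102 temperature register to degrees Celsius.

    Assumes the default 12-bit mode: the value is left-justified in 16 bits,
    two's complement, with 0.0625 degrees C per count.
    """
    raw = ((msb << 8) | lsb) >> 4
    if raw & 0x800:          # sign bit set, so the value is negative
        raw -= 1 << 12
    return raw * 0.0625


print(tmp102_to_celsius(0x19, 0x00))  # 25.0
print(tmp102_to_celsius(0xE7, 0x00))  # -25.0
```

The same few lines touch on binary numbers, bit shifting and negative number representation, which is easier to motivate when the reading comes from a satellite.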

 So, while our story is fictional – it’s a great way for kids to learn about real programming in context, e.g. you could teach temperature sensors in isolation – or in the context of a satellite in space. Which is better?

Thoughts and comments welcome ..

Image source: NASA