Data Science at the command line – Book and workshop ..

I am reading a great book called Data Science at the Command Line.

The author, Jeroen Janssens, runs a workshop in London on Data Science at the Command Line, which I am attending.

Here is a brief outline of some of the reasons why I like this approach ..

I have always liked the command line, from my early days working with Unix machines. I must be one of the few people to actually want a command-line mobile phone!

If you have worked with command-line tools, you already know that they are powerful and fast.
For data science especially, that is relevant because of the need to manipulate data and to work with a range of products that can be invoked through a shell-like interface.
The book is based on the Data Science Toolbox – created by the author as an open-source tool – and is brief and concise (187 pages). It focuses on specific commands and strategies that can be linked together using simple but powerful command-line interfaces.
Examples include:
using tools such as json2csv, the tapkee dimensionality reduction library, and Rio (created by the author). Rio loads a CSV into R as a data.frame, executes the given commands and returns the output as CSV or PNG.
run_experiment – a scikit-learn command-line utility for running a series of learners on datasets specified in a configuration file.
tools like topwords.R
and many others
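To make the pattern concrete, here is a minimal Python sketch (not one of the tools above, just an illustrative stand-in) of a CSV-in/CSV-out filter in the same spirit as Rio: it reads CSV from stdin, writes a summary as CSV to stdout, and can therefore be chained with other tools in a shell pipeline.

```python
#!/usr/bin/env python
# csv_summary.py - hypothetical stand-in for a Rio-style filter:
# reads CSV from stdin, writes a per-column numeric summary as CSV to stdout,
# so it can be combined with other command-line tools via pipes.
import csv
import statistics
import sys

def main():
    reader = csv.DictReader(sys.stdin)
    columns = {}
    for row in reader:
        for name, value in row.items():
            try:
                columns.setdefault(name, []).append(float(value))
            except (TypeError, ValueError):
                pass  # ignore non-numeric cells

    writer = csv.writer(sys.stdout)
    writer.writerow(["column", "count", "mean", "stdev"])
    for name, values in columns.items():
        stdev = statistics.stdev(values) if len(values) > 1 else 0.0
        writer.writerow([name, len(values), statistics.mean(values), stdev])

if __name__ == "__main__":
    main()
```

Used in a pipeline, it composes like the tools above, e.g. `cat data.csv | python csv_summary.py`.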
By coincidence, I read this as I was working on this post: command-line tools can be 235x faster than your Hadoop cluster.

I recommend both the book and the workshop.

 UPDATE:

a) I have been informed that there is a 50% discount offered to students, academics, startups and NGOs for the workshop.
b) Jeroen says that the book is not really based on the Data Science Toolbox, but rather provides a modified one so that you don’t have to install everything yourself in order to get started. You can download the VM HERE.

Data Science for IoT: The role of hardware in analytics

This post leads up to the vision for a Data Science for IoT course/certification. Please sign up on the link if you wish to know more when it launches in Feb.

Often, Data Science for IoT differs from conventional data science due to the presence of hardware. Hardware could be involved in integration with the Cloud or in processing at the Edge (which Cisco and others have called Fog Computing). Alternatively, we see entirely new classes of hardware specifically involved in Data Science for IoT (such as IBM's SyNAPSE chip for Deep Learning).

Hardware will increasingly play an important role in Data Science for IoT. A good example is a company called Cognimem, which natively implements classifiers (unfortunately, the company does not seem to be active any more, going by its Twitter feed).

In IoT, speed and real-time response play a key role. Often it makes sense to process the data closer to the sensor. This allows a limited / summarized data set to be sent to the server if needed, and also allows for localized decision making. This architecture leads to a flow of information out from the Cloud and to the storage of information at nodes which may not reside on the physical premises of the Cloud.
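As a minimal sketch of this edge-processing idea (all names, thresholds and window sizes below are invented for illustration), periodic readings can be summarized locally and only summaries and abnormal events forwarded upstream:

```python
# Illustrative edge-side processing: summarize periodic sensor readings
# locally and forward only summaries and abnormal events, instead of
# streaming every raw reading to the cloud. Thresholds are hypothetical.
from statistics import mean

WINDOW_SIZE = 60          # readings per summary window (assumed)
ALERT_THRESHOLD = 75.0    # hypothetical limit that marks an abnormal event

def send_upstream(payload):
    # Placeholder for whatever transport the gateway actually uses (MQTT, HTTP, ...)
    print("sending:", payload)

def process_stream(readings):
    window = []
    for value in readings:
        if value > ALERT_THRESHOLD:
            # abnormal event: report immediately with the raw value
            send_upstream({"type": "event", "value": value})
        window.append(value)
        if len(window) == WINDOW_SIZE:
            # periodic observations: forward only a compact summary
            send_upstream({"type": "summary", "mean": mean(window),
                           "min": min(window), "max": max(window)})
            window = []

# Example with simulated temperature readings
process_stream([20.5 + (i % 100) * 0.6 for i in range(300)])
```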

In this post, I explore the various hardware touchpoints where data analytics and IoT work together.

Cloud integration: Making decisions at the Edge

The Intel Wind River edge management system is certified to work with the Intel stack and includes capabilities such as data capture, rules-based data analysis and response, configuration, file transfer and remote device management.

Integration of Google Analytics into Lantronix hardware allows sensors to send real-time data to any node on the Internet or to a cloud-based application.

Microchip's integration with Amazon Web Services uses an embedded application with the Amazon Elastic Compute Cloud (EC2) service, based on the Wi-Fi Client Module Development Kit. Languages like Python or Ruby can be used for development.

The integration of Freescale and Oracle consolidates data collected from multiple appliances across multiple Internet of Things service providers.

Libraries

Libraries are another avenue for analytics engines to be integrated into products – often at the point of creation of the device. Xively Cloud Services is an example of this strategy, through the Xively libraries.

APIs

In contrast, keen.io provides APIs that let IoT devices build their own analytics engines (for example, the Pebble smartwatch's use of keen.io) without locking equipment providers into a particular data architecture.
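As an illustration of the pattern (this is not keen.io's actual API – the endpoint, collection name and payload fields below are made up), a device might push events to an analytics service over plain HTTP:

```python
# Hypothetical example of an IoT device pushing an event to an analytics API.
# The URL and payload schema are invented for illustration; a real service
# such as keen.io defines its own endpoints and event format.
import json
import urllib.request

ANALYTICS_URL = "https://analytics.example.com/v1/collections/heart_rate"  # placeholder

def push_event(event):
    data = json.dumps(event).encode("utf-8")
    request = urllib.request.Request(
        ANALYTICS_URL,
        data=data,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.status

if __name__ == "__main__":
    print("server returned", push_event(
        {"device_id": "watch-001", "bpm": 72, "timestamp": "2015-01-15T10:00:00Z"}))
```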

Specialized hardware

We see increasing deployment of specialized hardware for analytics, for example Egburt from Camgian, which uses sensor fusion technologies for IoT.

In the Deep Learning space, GPUs are widely used, and more specialized hardware is emerging, such as IBM's SyNAPSE chip. Even more interesting hardware platforms are appearing, such as Nervana Systems, which creates hardware specifically for neural networks.

Ubuntu Core and IFTTT spark

Two more initiatives on my radar deserve a section of their own, even though neither currently has an analytics engine: Ubuntu Core (Docker containers plus a lightweight Linux distribution as an IoT OS) and the IFTTT Spark initiative.

Comments welcome

This post leads up to the vision for a Data Science for IoT course/certification. Please sign up on the link if you wish to know more when it launches in Feb.

Image source: cognimem

Understanding the nature of IoT data

This post is part of the series Twelve unique characteristics of IoT based Predictive analytics/machine learning.

I will be exploring these ideas in the Data Science for IoT course/certification program when it is launched.

Here, we discuss IoT devices and the nature of IoT data.

Definitions and terminology

Business Insider makes some bold predictions for IoT devices:

The Internet of Things will be the largest device market in the world.

By 2019 it will be more than double the size of the smartphone, PC, tablet, connected car, and the wearable market combined.

The IoT will result in $1.7 trillion in value added to the global economy in 2019.

Device shipments will reach 6.7 billion in 2019 for a five-year CAGR of 61%.

The enterprise sector will lead the IoT, accounting for 46% of device shipments this year, but that share will decline as the government and home sectors gain momentum.

The main benefit of growth in the IoT will be increased efficiency and lower costs.

The IoT promises increased efficiency within the home, city, and workplace by giving control to the user.

And others say that Internet of Things investment will run to 140bn over the next five years.

 

Also, the term IoT has many definitions – but it's important to remember that IoT is not the same as M2M (machine to machine). M2M is a telecoms term which implies that there is a radio (cellular) at one or both ends of the communication. IoT, on the other hand, simply means connecting to the Internet. When we speak of IoT (billions of devices) – we are really referring to Smart objects. So, what makes an object smart?

What makes an object smart?

Back in 2010, the then Chinese Premier Wen Jiabao said “Internet + Internet of Things = Wisdom of the Earth”. Indeed, the Internet of Things revolution promises to transform many domains .. As the term implies, IoT is about Smart objects.

 

For an object (say a chair) to be ‘smart’ it must have three things:

- An identity (to be uniquely identifiable – e.g. via IPv6)

- A communication mechanism (i.e. a radio) and

- A set of sensors / actuators

 

For example, the chair may have a pressure sensor indicating that it is occupied.

Now, if it is able to know who is sitting in it, it could correlate more data by connecting to the person’s profile.

If it is in a cafe, whole new data sets can be correlated (about the venue, about who else is there, etc.).

Thus, IoT is all about Data ..
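As a toy illustration of those three ingredients (the class, the IPv6 address and the sensor values are all made up for this example):

```python
# Toy model of a 'smart chair': an identity, a communication mechanism
# and a set of sensors. All values are invented for illustration.
from dataclasses import dataclass, field

@dataclass
class SmartChair:
    ipv6_address: str                             # identity
    radio: str = "Bluetooth LE"                   # communication mechanism
    sensors: dict = field(default_factory=dict)   # sensor readings

    def is_occupied(self):
        # the pressure sensor example from above
        return self.sensors.get("pressure_kg", 0) > 5

chair = SmartChair(ipv6_address="2001:db8::42")
chair.sensors["pressure_kg"] = 68
print("occupied:", chair.is_occupied())
```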

How will Smart objects communicate?

How will billions of devices communicate? Primarily through the ISM band and Bluetooth 4.0 / Bluetooth Low Energy. Certainly not through the cellular network (hence the distinction above between M2M and IoT is important). Cellular will play a role in connectivity, and there will be many successful applications / connectivity models (e.g. Jasper Wireless). A more likely scenario is IoT-specific networks like Sigfox (which could be deployed by anyone, including telecom operators). Sigfox currently uses the most popular European ISM band, 868MHz (as defined by ETSI and CEPT), along with 902MHz in the USA (as defined by the FCC), depending on specific regional regulations.

Smart objects will generate a lot of Data ..

Understanding the nature of IoT data

In the ultimate vision of IoT, Things are identifiable, autonomous, and self-configurable. Objects communicate among themselves and interact with the environment. Objects can sense, actuate and predictively react to events.

Billions of devices will create massive volumes of streaming and geographically dispersed data. This data will often need real-time responses. There are primarily two modes of IoT data: periodic observations/monitoring and abnormal event reporting. Periodic observations present demands due to their high volumes and storage overheads. Events, on the other hand, are one-off but need a rapid response. If we consider video data (e.g. from surveillance cameras) as IoT data, we have some additional characteristics.

Thus, our goal is to understand the implications of predictive analytics for IoT data. This ultimately entails using IoT data to make better decisions.

I will be exploring these ideas in the Data Science for IoT course/certification program when it is launched. Comments welcome. In the next part of this series, I will explore Time Series data.

 

Content and approach for a Data Science for IoT course/certification

UPDATE: 

Feb 15: Applications are now open for the Data Science for IoT professional development short course at Oxford University – more coming soon. Any questions, please email me at ajit.jaokar at futuretext.com

We are pleased to announce support from MapR, Sigfox, Hypercat and Red Ninja for the Data Science for IoT course. Everyone finishing the course will receive a University of Oxford certificate showing that they have completed the course. Places are limited – so please apply soon if interested.

In a previous post, I mentioned that I am exploring creating a course/certification for Data Science for IoT

Here are some more thoughts

I believe that this is the first attempt to create such a course/program

I use the phrase “Data Science” to collectively mean Machine Learning / Predictive analytics.

There are of course many Machine Learning courses – the best known being Andrew Ng’s course at Coursera/Stanford – and the domain is complex enough as it is.

Thus, creating a course/certification covering both Machine Learning/Predictive analytics and IoT can be daunting.

However, the sector-specific focus gives us some unique advantages.

Already at UPM (Universidad Politécnica de Madrid) I teach Machine Learning/Predictive analytics for the Smart cities domain through their citysciences program (the remit there being to create a role for the Data Scientist for a Smart city).

So, this idea is not totally new for me ..

Based on my work at UPM (for Smart cities) – teaching Data Science for a specific domain (like IoT) has both challenges and some unique advantages.

The challenge is that you have an extra level of complexity to deal with (teaching IoT along with Predictive analytics).

But the advantages are:

a) The IoT domain focus allows us to be more pragmatic by addressing unique Data Science problems for IoT

b) We can take a Context-Based Learning approach – a technique more common in Holland and Germany for teaching engineering disciplines – which I have used in teaching computer science to kids at feynlabs

c)  We don’t need to cover the maths upfront

d)  Participants can become productive faster and apply ideas to industry sooner

Here are my thoughts on the elements such a program could cover based on the above approach: 

1) Unique characteristics – IoT ecosystem and data

2) Problems and datasets. This would cover specific scenarios and datasets needed (without addressing the predictive aspects)

3) An overview of Machine learning techniques and algorithms (Classification, Regression, Clustering, Dimensionality reduction etc) – this would also include the basic Math techniques needed for understanding algorithms

4) Programming in Python with scikit-learn (a brief illustrative sketch follows this outline)

5) Specific platforms/case studies

Time series data (MapR)

Sensor fusion for IoT (Camgian – Egburt)

NoSQL data for IoT (e.g. MongoDB for IoT)

Managing very high-volume IoT data (MapR loading a time series database at 100 million points per second)

I also include image processing with sensors / IoT (e.g. surveillance cameras).

Hence, for example:

IBM – Detecting skin cancer more quickly with visual machine learning

Real time face recognition using Deep learning algorithms

and even – Combining the Internet of Things with deep learning / predictive algorithms @numenta 
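To give a flavour of the hands-on element (item 4 above), here is a minimal scikit-learn sketch; the data is synthetic and merely stands in for labelled IoT sensor readings:

```python
# Minimal scikit-learn classification sketch. The features and labels are
# synthetic placeholders for labelled IoT sensor data.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)

# Pretend features: [mean_temperature, vibration_level]; label 1 = faulty device
X = rng.rand(500, 2)
y = ((X[:, 0] + X[:, 1]) > 1.0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = RandomForestClassifier(n_estimators=50, random_state=42)
model.fit(X_train, y_train)

print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```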

To conclude:

The above approach to teaching a course on Data Science for IoT would help ground Machine Learning / Predictive algorithms in real-life problem-solving scenarios for IoT.

Comments welcome.

You can sign up for more information at  futuretext and also follow me on twitter @ajitjaokar

Image source: wired

A business model for IoT retail(Beacon) : ‘Datalogix like’ insights which tie the social to the physical through Data Science and IoT?

This post is a part of my Data Science for IoT course

Note: in this post I am not interested in the Datalogix store-card model as such, but rather in the implications of what it could mean for IoT.

Late last year, Oracle acquired a company called Datalogix ..

A Christmas gift perhaps for Larry Ellison – but with profound and disruptive implications

Datalogix does something unique .. and had been on my radar especially for its relationship with Facebook.

The EFF describes this process in more detail, which I summarize here (Deep dive: Facebook and Datalogix – what’s actually getting shared).

Datalogix is an advertising metrics company that describes its data set as including “almost every U.S. household and more than $1 trillion in consumer transactions.” It specifically relies on loyalty card data – cards anyone can get by filling out a form at a participating grocery store.

Data from such loyalty programs is the backbone of Datalogix’s advertising metrics business

What data is actually exchanged?

Datalogix assesses the impact of Facebook advertisements on shopping in the physical world.

Datalogix begins by providing Facebook with a (presumably enormous) dataset that includes hashed email addresses, hashed phone numbers, and Datalogix ID numbers for everyone they’re tracking. Using the information Facebook already has about its own users, Facebook then tests various email addresses and phone numbers against this dataset until it has a long list of the Datalogix ID numbers associated with different Facebook users.

Facebook then creates groups of users based on their online activity. For example, all users who saw a particular advertisement might be Group A, and all users who didn’t see that ad might be Group B. Then Facebook will give Datalogix a list of the Datalogix ID numbers associated with everyone in Groups A and B and ask Datalogix specific questions – for example, how many people in each group bought Ocean Spray cranberry juice? Datalogix then generates a report about how many people in Group A bought cranberry juice and how many people in Group B bought cranberry juice. This will provide Facebook with data about how well an ad is performing, but because the results are aggregated by groups, Facebook shouldn’t have details on whether a specific user bought a specific product. And Datalogix won’t know anything new about the users other than the fact that Facebook was interested in knowing whether they bought cranberry juice.
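A toy sketch of that matching-and-aggregation flow (purely illustrative – the identifiers, purchases and groups below are invented, and the real systems obviously do not run on Python dictionaries):

```python
# Toy illustration of hashed-identifier matching and aggregated reporting.
# All emails, purchases and group labels are invented.
import hashlib

def hashed(value):
    return hashlib.sha256(value.strip().lower().encode("utf-8")).hexdigest()

# 'Datalogix' side: hashed identifiers mapped to loyalty-card purchase records
purchases = {
    hashed("alice@example.com"): {"cranberry_juice": True},
    hashed("bob@example.com"): {"cranberry_juice": False},
    hashed("carol@example.com"): {"cranberry_juice": True},
}

# 'Facebook' side: Group A saw the ad, Group B did not
group_a = [hashed("alice@example.com"), hashed("bob@example.com")]
group_b = [hashed("carol@example.com")]

def aggregate(group):
    # Only aggregate counts leave this function - no individual-level rows.
    buyers = sum(1 for h in group if purchases.get(h, {}).get("cranberry_juice"))
    return {"group_size": len(group), "buyers": buyers}

print("Group A (saw ad):", aggregate(group_a))
print("Group B (no ad):", aggregate(group_b))
```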

This is very interesting and powerful

But let’s think beyond store cards .. Think IoT / Beacons.

Substitute ‘store cards’ with ‘Retail IoT’ and you have a unique model that could power IoT in retail, driven by IoT analytics.

Beacon-based shopping already exists via companies like Estimote.

So, my point is .. the model (independent of Datalogix the company) could be used to close the loop between the physical and the social. IoT / Data Science / data analytics will play a key role here.

Comments welcome on twitter @ajitjaokar

IoT and the Rise of the Predictive Organization

I will be launching a newsletter starting in Jan 2015 to cover these ideas in detail.

You can sign up for the newsletter at futuretext IoT Machine Learning – Predictive Analytics – newsletter

I will also be launching a course/certification for “Data Science in IoT” at Oxford, London and San Francisco – email me at ajit.jaokar at futuretext.com if you want to know more

 

In The Godfather Part II, Hyman Roth said to Michael Corleone:

“Michael – we are bigger than US Steel.”

Over the holiday season,  I said this to my friend Jeremy Geelan when I was comparing the Mobile industry to the IoT.

The term Internet of Things was coined by the British technologist Kevin Ashton in 1999 to describe a system where the Internet is connected to the physical world via ubiquitous sensors. After languishing in the depths of academia (at least here in Europe …), IoT had its Netscape moment early in 2014 when Google acquired Nest.

Mobile is huge and has dominated the Tech landscape for the last decade.

But the Internet of Things(IoT) will be bigger.

How big?

Here are some numbers. Source: adapted from David Wood’s blog.

By 2020, we are expected to have 50 billion connected devices

To put in context:

  • The first commercial citywide cellular network was launched in Japan by NTT in 1979.
  • The milestone of 1 billion mobile phone connections was reached in 2002.
  • The 2 billion mobile phone connections milestone was reached in 2005.
  • The 3 billion mobile phone connections milestone was reached in 2007.
  • The 4 billion mobile phone connections milestone was reached in February 2009.
  • We reached 7.2 billion active mobile connections in 2014.

So, 50 billion by 2020 is a massive number by any measure, and no one doubts that number any more.

But IoT is much more than the number of connections – it’s all about the Data and the intelligence that can be gleaned from the Data.

As more objects are becoming embedded with sensors and gain the ability to communicate, new business models emerge.

IoT also creates new pathways for information to travel – especially across an organization’s boundary, across its value chain and in engaging with its customers.

This Data – and the Intelligence gleaned from it – will fundamentally transform organizations, creating a new kind of ‘Predictive Organization’ which has Predictive analytics / Machine Learning at its core, i.e. algorithms that learn from experience.

Machine learning is the study of algorithms and systems that improve their performance with experience. There are broadly two ways for algorithms to learn: supervised learning (where the algorithm is trained in advance using labelled data sets) and unsupervised learning (with no prior training – e.g. with methods like clustering).
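A tiny illustration of the two modes, using toy data (scikit-learn assumed):

```python
# Toy contrast between supervised and unsupervised learning.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)   # labels are available here

# Supervised: learn from labelled examples
clf = LogisticRegression().fit(X, y)
print("supervised prediction:", clf.predict([[4.5, 5.2]]))

# Unsupervised: no labels - find structure (clusters) in the data
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("cluster assignment:", km.predict([[4.5, 5.2]]))
```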

Machine Learning algorithms take the billions of data points as inputs and extract actionable insights from the data. So, the Predictive Organization starts with the prediction process and then creates a feedback loop through measuring and managing. Crucially, this takes place across the boundary of the Enterprise.

I believe there are twelve unique characteristics of IoT-based Predictive analytics/machine learning:

1)     Time Series Data: Processing sensor data.

2)     Beyond sensing: Using Data for improving lives and businesses.

3)     Managing IoT Data.

4)     The Predictive Organization: Rethinking the edges of the Enterprise: Supply Chain and CRM impact

5)     Decisions at the ‘Edge’

6)     Real time processing.

7)     Cognitive computing – Image processing and beyond.

8)     Managing Massive Geographic scale.

9)     Cloud and Virtualization.

10)  Integration with Hardware.

11)  Rethinking existing Machine Learning Algorithms  for the IoT world.

12)  Correlating IoT data to social data – the Datalogix model for IoT

Indeed, one could argue that IoT leads to the creation of new types of organization – for instance, ones based on the sharing economy, converging the digital and the physical worlds.

I will be launching a newsletter starting in Jan 2015 to cover these ideas in detail.

You can sign up for the newsletter at futuretext IoT Machine Learning – Predictive Analytics – newsletter

I will also be launching a course/certification for “Data Science in IoT” at Oxford, London and San Francisco - email me at ajit.jaokar at futuretext.com if you want to know more

Image source: wikipedia

ForumOxford: Internet of Things Conference 2015 listed among 40 most important #IoT events to attend this year ..

What a nice way to end the year ..

Jeremy Geelan, who created a list of the top 40 Internet of Things conferences to attend in 2015, has added the ForumOxford 2015 Internet of Things Conference to the list.

Date: 6 November, 2014

Venue: Rewley House, University of Oxford
URL: forumoxford : 2015 Internet of Things conference

Co-chaired by me and Tomi Ahonen. Now in its 10th year. Mark the dates!

Here is the full list again: list of the top 40 Internet of Things Conferences to attend in 2015.

 

Infographic – The evolution of wireless networks

PS
I get many such requests to post infographics ..
But this one is good and comes from a reliable source (New Jersey Institute of Technology – Online Master of Science in Electrical Engineering).

Infographic – The evolution of wireless networks

New Jersey Institute of Technology’s Online Master of Science in Electrical Engineering

Space Clouds: Turtles in Space – Learning to Code

Here is something I have been thinking about as part of the Countdown Institute.

The Countdown Institute teaches young people aged 10 to 16 programming skills using Space exploration.

I have been a fan of Seymour Papert’s Turtles based on my work at feynlabs.

Turtles in Python (Python Turtles) and in general (Turtle Graphics) are a great way of learning to code.

Object-Oriented paradigms (like Turtles) are an easy way to start learning programming (as opposed to procedural paradigms) because they help to tie back to the problem / context easily. The Turtles concept also downplays the more complex aspects of OO programming such as Inheritance and Polymorphism.
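For example, a few lines of Python Turtle are enough to see an object with state and behaviour do something on screen:

```python
# Minimal Python Turtle example: the turtle is an object a child can drive
# with a handful of commands (behaviour) while it keeps its own position
# and heading (data).
import turtle

t = turtle.Turtle()
for _ in range(4):      # draw a square
    t.forward(100)
    t.right(90)

turtle.done()           # keep the window open until it is closed
```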

Countdown enables young people to learn coding by solving problems in a specific context – in this case, Space exploration.

But we need a simple and consistent way to model problems. Space Clouds is a data/modelling layer which relates Space exploration to coding. We can think of a Space Cloud as a unifying data layer of software objects/classes. It is a consistent way of modelling a problem and getting kids to code.

From a programmatic standpoint, we have varying space objects (Satellites, Drones, Planets, Space missions etc.).

Like an Object (such as a Turtle), each of these Objects has behaviour and data.

Each lesson starts with describing (modelling) the objects involved in the ‘world’ – for example, in a high-altitude balloon mission, the jet stream could be defined as part of the Space Cloud.

This is a very easy paradigm for a child to understand .. i.e. I switch on a device and the ‘sky lights up’, so to speak.

Depending on the problem, the Objects could be Planets, Satellites or missions (Orion, Rosetta).
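A sketch of what such an object might look like in Python (the class, attribute and method names are hypothetical, purely to illustrate the ‘behaviour and data’ idea):

```python
# Hypothetical Space Clouds object: a Satellite with data (attributes)
# and behaviour (methods), in the same spirit as a Turtle object.
class Satellite:
    def __init__(self, name, altitude_km):
        self.name = name                 # data
        self.altitude_km = altitude_km   # data
        self.readings = []

    def record_temperature(self, celsius):
        # behaviour: collect a sensor reading
        self.readings.append(celsius)

    def report(self):
        # behaviour: summarize what the object has sensed
        if not self.readings:
            return f"{self.name}: no readings yet"
        return f"{self.name} at {self.altitude_km} km, last reading {self.readings[-1]} C"

sat = Satellite("CubeSat-1", altitude_km=400)
sat.record_temperature(-10.5)
print(sat.report())
```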

Space Clouds is a simple, context-specific modelling language for space exploration, created with the goal of teaching young people to code. Space Clouds is programming-language agnostic. Current modelling languages like UML are designed for modelling entire systems and are not really suited to learning to code.

The idea of Space Clouds can be thought of as ‘Turtles in Space’.

A recent blog on learning to code said that No-fuss setups and Task Oriented tools are key features to get more kids to code.

Space Clouds takes a similar approach by simplifying (limiting) input in early stages and connecting to a specific context

Image source Valiant turtle – wikipedia

 

 

Implementing Tim Berners-Lee’s vision of Rich Data vs. Big Data

INTRODUCTION:

In a previous blog post (Magna Carta for the Web), I discussed the potential of Tim Berners-Lee’s vision of Rich Data.

When I met Tim at the EIF event in Brussels, I asked about the vision of Rich Data. I also thought more about how this vision could be actually implemented from a Predictive/Machine learning standpoint.

To recap the vision from the previous post:

So what is Rich Data? It’s Data (and Algorithms) that would empower the individual. According to Tim Berners-Lee: “If a computer collated data from your doctor, your credit card company, your smart home, your social networks, and so on, it could get a real overview of your life.” Berners-Lee was visibly enthusiastic about the potential applications of that knowledge, from living more healthily to picking better Christmas presents for his nephews and nieces. This, he said, would be “rich data”. (Motherboard)

This blog explores a possible way this idea could be implemented. I hope I can implement it, perhaps as part of an Open Data Institute incubated start-up.

To summarize my view here:

The world of Big Data needs to maintain large amounts of data because the past is used to predict the future. This is needed because we do not voluntarily share data and Intent. Here, I propose that to engender Trust, both the Algorithms and the ‘training’ should be transparent – which leads to greater Trust and greater sharing. This in turn removes the need to hold large amounts of Data (Big Data) to determine Predictions (Intents). Instead, Intents will be known (shared voluntarily) by people at the point of need. This would create a world of Rich Data – where the Intent is determined algorithmically using smaller data sets (and without the need to maintain a large amount of historical data).

BACKGROUND AND CHALLENGES:

Thus, to break it down further, here are some more thoughts:

a)      Big Data vs. Rich Data: To gain insights from data, we currently collect all the data we can lay our hands on (Big Data).  In contrast, for Rich Data, instead of collecting all data in one place in advance, you need access to many small data sets for a given person and situation. But crucially, this ‘linking of datasets’ should happen at the point of need and dynamically. For example:  Personal profile, Contextual information and risk profile ex for a person who is at a risk of Diabetes or a Stroke – only at the point of a medical emergency(vs. gathered in advance).

b)      Context already exists: Much of this information exists already. The mobile industry has done a great job of capturing contextual information accurately – for example location – and tying it to content (geo-tagged images).

c)       The ‘segment of one’ idea has been tried in many variants: Segmenting has been tried – with some success. In Retail (The future of Retail is segment of One), BCG perspective paper (Segment of One marketing – pdf) Inc magazine – Audience segmenting – targeting your customers . Segmentation is already possible

d)      Intents are not linked to context: The feedback loop is not complete because currently while context exists – it is not tied to Intent. Most people do not trust advertisers and others with their intent

e)      Intent (Predictions) are based on the past:  Because we do not trust providers with Intent – Intent is gleaned through Big Data. Intents are related to Predictions. Predictions are based on a large number of historical observations either of the individual or related individuals. To create accurate predictions in this way, we need large amounts of centralized data and any other forms of Data.  That’s the Big Data world we live in

f)       IoT: IoT will not solve the problem. It will create an order of magnitude more contextual information – but providers will not be trusted and datasets will not be shared. And we will continue to create larger datasets with bigger volumes.

CREATING A TRUST FRAMEWORK FOR SHARING DATA AT AN ALGORITHMIC LEVEL

To recap:

a)      To gain insights from data, we currently collect all the data we can lay our hands on. This is the world of Big Data.

b)      We take this approach because we do not know the Intent.

c)       Rather, we (as people) do not trust providers with Intent.

d)      Hence, in the world of Big Data, we need a lot of Data.  In contrast, for Rich Data, instead of collecting all data in one place in advance, you need access to many small data sets for a given person and situation. But crucially, this ‘linking of datasets’ should happen at the point of need and dynamically. For example:  Personal profile, Contextual information and risk profile ex for a person who is at a risk of Diabetes or a Stroke – only at the point of a medical emergency(vs. gathered in advance).

 

From an algorithmic standpoint, the overall objective is to determine the maximum likelihood of sharing under a Trust framework. Given a set of trust frameworks and a set of personas (for example, a person with a propensity for a stroke), we want to know the probability of sharing information, and under which trust framework.

We need a small number of observations for an individual

We need an inbuilt trust framework for sharing

We need the Calibration of Trust to be ‘people driven’ and not provider driven

POSSIBLE ALGORITHMIC APPROACH

A possible way to implement the above could be through a Naive Bayes Classifier.

  • In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.
  • Workings: Let {f1, . . . , fm} be a predefined set of m features. A classifier is a function f that maps input feature vectors x ∈ X to output class labels y ∈ {1, . . . , C} where X is the feature space. Our goal is to learn f from a labelled training set of N input-output pairs, (xn, yn), n = 1 : N; this is an example of supervised learning i.e. the algorithm has to be trained
  • An advantage of Naive Bayes is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  • This represents the basics of Naive Bayes. Tom Mitchell, in a Carnegie Mellon paper, says “A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value when Y is a Boolean variable. However, accurately estimating P(X|Y) typically requires many more examples.”
  • In addition, we need to consider feature selection and dimensionality reduction. Feature selection is the process of selecting a subset of relevant features for use in model construction. Feature selection is different from dimensionality reduction. Both methods seek to reduce the number of attributes in the dataset, but a dimensionality reduction method does so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them. Examples of dimensionality reduction methods include Principal Component Analysis (PCA).

IMPLEMENTATION

  • Thus, a combination of Naive Bayes and PCA may be a start to implementing Rich Data (see the sketch after this list). Naive Bayes needs a relatively small amount of data; PCA will reduce dimensionality.
  • How to incorporate Trust? Based on the above, Trust becomes a feature (an input vector) to the algorithm with an appropriate weighting. The output is then the probability of sharing under a Trust framework for a given persona.
  • Who calibrates the Trust? A related and bigger question is: how to calibrate Trust within the Algorithm? This is indeed the Holy Grail and underpins the foundation of the approach. Prediction in research has grown exponentially due to the availability of Data – but Predictive science is not perfect (good paper: The Good, the Bad, and the Ugly of Predictive). Predictive Algorithms gain their intelligence in two ways: supervised learning (like Naive Bayes, where the algorithm learns through training data) or unsupervised learning, where the algorithm tries to find hidden structure in unlabeled data.
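A minimal sketch of that combination (the data is synthetic, and the ‘trust score’ feature and persona label are invented placeholders for whatever a real trust framework would provide):

```python
# Sketch of the Naive Bayes + PCA idea with a 'trust' feature.
# All data is synthetic; feature names and the trust score are invented.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import make_pipeline

rng = np.random.RandomState(1)

# Columns: [age, blood_pressure, activity_level, trust_score]
X = rng.rand(200, 4)
# Label: 1 = willing to share data under the given trust framework
y = (0.6 * X[:, 3] + 0.2 * X[:, 0] + rng.normal(0, 0.05, 200) > 0.5).astype(int)

model = make_pipeline(PCA(n_components=3), GaussianNB())
model.fit(X, y)

new_person = np.array([[0.7, 0.4, 0.5, 0.9]])  # hypothetical persona with a high trust score
print("probability of sharing:", model.predict_proba(new_person)[0, 1])
```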

 

So, if we have to calibrate trust for a supervised learning algorithm, the workings must be open and the trust (propensity to share) must be created from the personas themselves – for example, people at risk of a stroke, the elderly, etc. Such an open algorithm that learns from the people, and whose workings are transparent, will engender trust. It will in turn lead to greater sharing – and a different type of predictive algorithm which will need smaller amounts of historical data, but will track a larger number of data streams to determine value at their intersection. This in turn will complete the feedback loop and tie intent to context.

Finally, I do not propose that a specific algorithm (such as Naive Bayes) is the answer – rather, I propose that both the Algorithms and the ‘training’ should be transparent, which leads to greater Trust and greater sharing. This in turn removes the need to hold large amounts of Data (Big Data) to determine Predictions (Intents). Instead, Intents will be known (shared voluntarily) by people at the point of need. This would create a world of Rich Data – where the Intent is determined algorithmically using smaller data sets (and without the need to maintain a large amount of historical data).

Comments welcome – at ajit.jaokar at futuretext.com