Implementing Tim Berners-Lee’s vision of Rich Data vs. Big Data

INTRODUCTION:

In a previous blog post (Magna Carta for the Web), I discussed the potential of Tim Berners-Lee’s vision of Rich Data.

When I met Tim at the EIF event in Brussels, I asked him about this vision of Rich Data, and I have since thought more about how it could actually be implemented from a Predictive/Machine Learning standpoint.

To recap the vision from the previous post:

So what is Rich Data? It’s Data (and Algorithms) that would empower the individual. According to Tim Berners-Lee: “If a computer collated data from your doctor, your credit card company, your smart home, your social networks, and so on, it could get a real overview of your life.” Berners-Lee was visibly enthusiastic about the potential applications of that knowledge, from living more healthily to picking better Christmas presents for his nephews and nieces. This, he said, would be “rich data”. (Motherboard)

This blog explores a possible way this idea could be implemented. I hope to implement it, perhaps as part of an Open Data Institute incubated start-up.

To summarize my view here:

The world of Big Data needs to maintain large amounts of Data because the past is used to predict the future. This is necessary because we do not voluntarily share our data and Intent. Here, I propose that to engender Trust, both the Algorithms and the ‘training’ should be transparent, which leads to greater Trust and greater sharing. That, in turn, removes the need to hold large amounts of Data (Big Data) to determine Predictions (Intents). Instead, Intents will be shared voluntarily by people at the point of need. This would create a world of Rich Data, where Intent is determined algorithmically using smaller data sets, without the need to maintain a large amount of historical data.

BACKGROUND AND CHALLENGES:

To break this down further, here are some more thoughts:

a)      Big Data vs. Rich Data: To gain insights from data, we currently collect all the data we can lay our hands on (Big Data). In contrast, for Rich Data, instead of collecting all data in one place in advance, you need access to many small data sets for a given person and situation. Crucially, this ‘linking of datasets’ should happen dynamically, at the point of need. For example: a personal profile, contextual information and a risk profile for a person at risk of Diabetes or a Stroke, linked only at the point of a medical emergency (vs. gathered in advance).

b)      Context already exists: Much of this information exists already. The mobile industry has done a great job of capturing contextual information accurately, for example capturing location and tying it to content (geo-tagged images).

c)       The ‘segment of one’ idea has been tried in many variants: Segmentation has been tried, with some success, in Retail (The future of Retail is segment of One), in a BCG perspective paper (Segment of One marketing – pdf) and in Inc magazine (Audience segmenting – targeting your customers). Segmentation is already possible.

d)      Intents are not linked to context: The feedback loop is not complete because, while context exists, it is not tied to Intent. Most people do not trust advertisers and other providers with their Intent.

e)      Intents (Predictions) are based on the past: Because we do not trust providers with Intent, Intent is gleaned through Big Data. Intents are related to Predictions, and Predictions are based on a large number of historical observations, either of the individual or of related individuals. To create accurate predictions in this way, we need large amounts of centralized data, plus any other data we can obtain. That is the Big Data world we live in.

f)       IoT: IoT will not solve the problem. It will create an order of magnitude more contextual information, but providers will not be trusted and datasets will not be shared. So we will continue to create ever larger datasets.

CREATING A TRUST FRAMEWORK FOR SHARING DATA AT AN ALGORITHMIC LEVEL

To recap:

a)      To gain insights from data, we currently collect all the data we can lay our hands on. This is the world of Big Data.

b)      We take this approach because we do not know the Intent.

c)       Or rather: we (as people) do not trust providers with our Intent.

d)      Hence, in the world of Big Data, we need a lot of Data. In contrast, for Rich Data, instead of collecting all data in one place in advance, you need access to many small data sets for a given person and situation. Crucially, this ‘linking of datasets’ should happen dynamically, at the point of need. For example: a personal profile, contextual information and a risk profile for a person at risk of Diabetes or a Stroke, linked only at the point of a medical emergency (vs. gathered in advance).

 

From an algorithmic standpoint, the overall objective is to determine the maximum likelihood of sharing under a Trust framework. Given a set of trust frameworks and a set of personas (for example, a person with a propensity for a stroke), we want to know the probability of sharing information, and under which trust framework it is highest.

  • We need a small number of observations for an individual
  • We need an inbuilt trust framework for sharing
  • We need the calibration of Trust to be ‘people driven’ and not provider driven
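
One rough way to state this objective in code (a sketch only; the function and argument names are hypothetical, and the estimator of the sharing probability is left abstract):

    # Rough formulation of the objective: for a given persona, estimate the
    # probability of sharing under each candidate trust framework and return
    # the framework under which sharing is most likely.
    def most_likely_sharing_framework(persona, trust_frameworks, estimate_p_share):
        # 'estimate_p_share(persona, framework)' is a placeholder for any
        # estimator of P(share | persona, framework), e.g. a trained classifier
        scored = {t: estimate_p_share(persona, t) for t in trust_frameworks}
        best = max(scored, key=scored.get)
        return best, scored[best]  # (framework, probability of sharing)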

POSSIBLE ALGORITHMIC APPROACH

One possible way to implement the above is through a Naive Bayes Classifier.

  • In machine learning, Naive Bayes classifiers are a family of simple probabilistic classifiers based on applying Bayes’ theorem with strong (naive) independence assumptions between the features.
  • Workings: Let {f1, . . . , fm} be a predefined set of m features. A classifier is a function h that maps input feature vectors x ∈ X to output class labels y ∈ {1, . . . , C}, where X is the feature space. Our goal is to learn h from a labelled training set of N input-output pairs (xn, yn), n = 1 : N; this is an example of supervised learning, i.e. the algorithm has to be trained.
  • An advantage of Naive Bayes is that it only requires a small amount of training data to estimate the parameters (means and variances of the variables) necessary for classification.
  • This represents the basics of Naive Bayes. Tom Mitchell in a Carnegie Mellon paper says “A hundred independently drawn training examples will usually suffice to obtain a maximum likelihood estimate of P(Y) that is within a few percent of its correct value when Y is a Boolean variable. However, accurately estimating P(X|Y) typically requires many more examples.”
  • In addition, we need to consider feature selection and dimensionality reduction. Feature selection is the process of selecting a subset of relevant features for use in model construction. Feature selection is different from dimensionality reduction: both seek to reduce the number of attributes in the dataset, but dimensionality reduction methods do so by creating new combinations of attributes, whereas feature selection methods include and exclude attributes present in the data without changing them. Principal Component Analysis (PCA) is an example of a dimensionality reduction method.
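
To make these building blocks concrete, here is a minimal sketch using scikit-learn. The data is randomly generated purely for illustration; no real features or personas are implied:

    # Minimal sketch: PCA for dimensionality reduction, followed by a
    # Gaussian Naive Bayes classifier, on randomly generated data.
    import numpy as np
    from sklearn.decomposition import PCA
    from sklearn.naive_bayes import GaussianNB
    from sklearn.pipeline import make_pipeline

    rng = np.random.default_rng(0)

    # 100 'observations' with 10 features each (e.g. contextual signals)
    X = rng.normal(size=(100, 10))
    # Binary label, e.g. 'shared' vs. 'did not share'
    y = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)

    # Reduce the 10 features to 3 principal components, then fit Naive Bayes
    model = make_pipeline(PCA(n_components=3), GaussianNB())
    model.fit(X, y)

    # Probability estimates for a new observation: [P(y=0), P(y=1)]
    print(model.predict_proba(rng.normal(size=(1, 10))))

The choice of 100 training examples here echoes the Mitchell quote above; in practice, estimating P(X|Y) accurately may need more data, which is where dimensionality reduction such as PCA helps.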

IMPLEMENTATION

  • Thus, a combination of Naive Bayes and PCA may be a start to implementing Rich Data. Naive Bayes needs a relatively small amount of training data, and PCA reduces dimensionality.
  • How to incorporate Trust? Based on the above, Trust becomes a feature (an element of the input vector) to the algorithm, with an appropriate weighting. The output is then the probability of sharing under a Trust framework for a given persona (a sketch follows this list).
  • Who calibrates the Trust? A related and bigger question is how to calibrate Trust within the Algorithm. This is the Holy Grail and underpins the foundation of the approach. The use of prediction in research has grown exponentially due to the availability of Data, but Predictive science is not perfect (a good paper: The Good, the Bad, and the Ugly of Predictive). Predictive Algorithms gain their intelligence in one of two ways: Supervised learning (like Naive Bayes, where the algorithm learns from training Data) or Unsupervised learning, where the algorithm tries to find hidden structure in unlabelled data.
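
To make the ‘Trust as a feature’ idea concrete, here is a minimal sketch. The personas, contexts, trust frameworks and data below are invented assumptions for illustration only, not real datasets or results:

    # Sketch: the trust framework is an input feature alongside persona and
    # context features; the classifier then outputs P(share) for each
    # candidate framework. All encodings and data are hypothetical.
    import numpy as np
    from sklearn.naive_bayes import CategoricalNB

    rng = np.random.default_rng(1)

    # Encoded categorical features: [persona, context, trust_framework]
    # persona:         0 = 'at risk of stroke', 1 = 'elderly'
    # context:         0 = 'routine',           1 = 'medical emergency'
    # trust_framework: 0 = 'provider driven',   1 = 'open / people driven'
    X = rng.integers(0, 2, size=(200, 3))

    # Toy labels: sharing is made more likely under the open framework and
    # in an emergency, to simulate the behaviour argued for in this post
    p_true = 0.15 + 0.35 * X[:, 2] + 0.30 * X[:, 1]
    y = (rng.random(200) < p_true).astype(int)

    clf = CategoricalNB().fit(X, y)

    # For one persona and context, compare P(share) under each framework
    persona, context = 0, 1  # at risk of stroke, medical emergency
    for framework in (0, 1):
        prob = clf.predict_proba([[persona, context, framework]])[0, 1]
        print(f"trust framework {framework}: P(share) ~ {prob:.2f}")

The output is what the bullets above describe: for a given persona, the probability of sharing under each candidate Trust framework, from which the most trusted framework can be chosen.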

 

So, if we are to calibrate trust for a Supervised learning algorithm, the workings must be open and the trust (the propensity to share) must be learned from the personas themselves, for example people at risk of a stroke, the elderly, etc. Such an Open algorithm, which learns from the people and whose workings are transparent, will engender trust. That trust will in turn lead to greater sharing, and to a different type of predictive algorithm: one that needs smaller amounts of historical data but tracks a larger number of Data streams to determine value at their intersection. This, in turn, completes the feedback loop and ties Intent to context.

Finally, I do not propose that a specific algorithm (such as Naive Bayes) is the answer. Rather, I propose that both the Algorithms and the ‘training’ should be transparent, which leads to greater Trust and greater sharing. This, in turn, removes the need to hold large amounts of Data (Big Data) to determine Predictions (Intents). Instead, Intents will be shared voluntarily by people at the point of need. This would create a world of Rich Data, where Intent is determined algorithmically using smaller data sets, without the need to maintain a large amount of historical data.

Comments welcome – at ajit.jaokar at futuretext.com