This article is part of an evolving theme. Here, I explain the basics of Deep Learning and how Deep Learning algorithms could apply to the IoT and Smart City domains. Specifically, as I discuss below, I am interested in complementing Deep Learning algorithms with IoT datasets. I elaborate on these ideas in the Data Science for Internet of Things program, which enables you to work towards being a Data Scientist for the Internet of Things (modelled on the course I teach at Oxford University and UPM – Madrid). I will also present these ideas at the International Conference on City Sciences at Tongji University in Shanghai and at the Data Science for IoT workshop at the IoT World event in San Francisco.
Please connect with me on LinkedIn if you want to stay in touch and receive future updates.
Deep learning is often described as a set of algorithms that 'mimics the brain'. A more accurate description would be algorithms that 'learn in layers'. Deep learning involves learning through layers, which allows a computer to build a hierarchy of complex concepts out of simpler concepts.
The obscure world of deep learning algorithms came into the public limelight when Google researchers fed 10 million random, unlabelled images from YouTube into their experimental Deep Learning system. They then instructed the system to recognize the basic elements of a picture and how these elements fit together. The system, comprising 16,000 CPUs, was able to identify images that shared similar characteristics (such as images of cats). This canonical experiment showed the potential of Deep Learning algorithms, which apply to many areas including computer vision, image recognition, pattern recognition, speech recognition and behaviour recognition.
To understand the significance of Deep Learning algorithms, it is important to understand how computers think and learn. Since the early days, researchers have attempted to create computers that could think. Until recently, this effort was rules-based, adopting a 'top-down' approach: write enough rules to cover all possible circumstances. This approach is obviously limited by the number of rules that can be written and by its finite rule base.
To overcome these limitations, a bottom-up approach was proposed: learn from experience. The experience is provided by 'labelled data', which is fed to a system that is trained based on the responses. This approach works for applications like spam filtering. However, most data (pictures, video feeds, sounds, etc.) is not labelled, and when it is, it is often not labelled well.
The other issue is handling problem domains which are not finite. For example, the problem domain in chess is complex but finite: there are a finite number of primitives (32 chess pieces) and a finite set of allowable actions (on 64 squares). But in real life, at any instant, we have a potentially large or even infinite number of alternatives. The problem domain is thus very large.
A problem like playing chess can be 'described' to a computer by a set of formal rules. In contrast, many real-world problems are easily understood by people (intuitive) but not easy to describe (represent) to a computer. Examples of such intuitive problems include recognizing words or faces in an image. Such problems are hard to describe to a computer because the problem domain is not finite. The problem description thus suffers from the curse of dimensionality: as the number of dimensions increases, the volume of the space increases so fast that the available data becomes sparse. For instance, a grid with just 10 bins per dimension has 10^d cells, so even a million samples leave most cells empty once there are 10 dimensions. Computers cannot be trained on sparse data: there is not enough data to adequately represent the combinations of the dimensions. Nevertheless, such 'infinite choice' problems are common in daily life.
Deep learning addresses 'hard/intuitive' problems which have few or no rules and high dimensionality. Here, the system must learn to cope with unforeseen circumstances without knowing the rules in advance. Many existing systems, like Siri's speech recognition and Facebook's face recognition, work on these principles. Deep learning systems are possible to implement now for three reasons: high CPU power, better algorithms and the availability of more data. Over the next few years, these factors will lead to more applications of Deep Learning systems.
Deep Learning algorithms are modelled on the workings of the brain. The brain may be thought of as a massively parallel analog computer containing about 10^10 simple processors (neurons), each of which requires a few milliseconds to respond to input. To model the workings of the brain, in theory each neuron could be designed as a small electronic device with a transfer function similar to a biological neuron. We could then connect each neuron to many other neurons to imitate the workings of the brain. In practice, it turns out that this model is not easy to implement and is difficult to train.
So, we make some simplifications in the model mimicking the brain. The resulting neural network is called a 'feed-forward back-propagation network'. The simplifications/constraints are: we arrange the neurons into distinct layers; each neuron in one layer is connected to every neuron in the next layer; signals flow in only one direction; and we simplify the neuron design so that it 'fires' based on simple, weight-driven inputs from other neurons. Such a simplified network (the feed-forward neural network model) is more practical to build and use. In this model:
a) Each neuron receives a signal from the neurons in the previous layer
b) Each of those signals is multiplied by a weight value.
c) The weighted inputs are summed, and passed through a limiting function which scales the output to a fixed range of values.
d) The output of the limiter is then broadcast to all of the neurons in the next layer.
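To make steps (a) to (d) concrete, here is a minimal sketch in Python/NumPy of one layer of such a feed-forward network. The specific weights, inputs and the choice of a sigmoid as the limiting function are illustrative assumptions, not a prescribed design.

```python
import numpy as np

def sigmoid(x):
    # The limiting function: scales any summed input to the fixed range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(inputs, weights):
    # (a), (b): each neuron receives every signal from the previous layer,
    # multiplied by that connection's weight value
    # (c): the weighted inputs are summed and passed through the limiter
    outputs = sigmoid(weights @ inputs)
    # (d): these outputs are broadcast to all neurons in the next layer
    return outputs

# Illustrative example: 3 input signals feeding a layer of 2 neurons
inputs = np.array([0.5, -1.0, 0.25])
weights = np.array([[0.1, 0.8, -0.3],   # weights into neuron 1
                    [0.4, -0.2, 0.9]])  # weights into neuron 2
print(layer_forward(inputs, weights))
```

Stacking such layers, with the output of one becoming the input of the next, gives the multi-layer networks discussed below.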
Image and parts of the description in this section adapted from the Seattle Robotics site.
The most common learning algorithm for artificial neural networks is called Back Propagation (BP), which stands for 'backward propagation of errors'. To use the neural network, we apply the input values to the first layer, allow the signals to propagate through the network and read the output. A BP network learns by example, i.e. we must provide a learning set that consists of input examples and the known correct output for each case. We use these input-output examples to show the network what type of behaviour is expected. The BP algorithm lets the network adapt by propagating the error value backwards through the network and adjusting the weights. Each link between neurons has a unique weighting value, and the 'intelligence' of the network lies in these weights. With each iteration of errors flowing backwards, the weights are adjusted. The whole process is repeated for each of the example cases. Thus, to detect an object, programmers would train a neural network by rapidly presenting many digitized versions of data (for example, images) containing those objects. If the network did not accurately recognize a particular pattern, the weights would be adjusted. The eventual goal of this training is to get the network to consistently recognize the patterns that we recognize (e.g. cats).
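As an illustration of 'errors flowing backwards', here is a hedged sketch of BP training for a single-layer network with a sigmoid output. The data, learning rate and iteration count are arbitrary assumptions; a real network would repeat this across many layers and many example cases.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One labelled example from the learning set: input values and the
# known correct output
x = np.array([0.5, -1.0, 0.25])
target = np.array([1.0])

weights = np.random.randn(1, 3) * 0.1  # the network's 'intelligence'
learning_rate = 0.5

for iteration in range(1000):
    # Forward pass: apply the inputs and read the output
    output = sigmoid(weights @ x)
    # Error between the known correct output and the actual output
    error = target - output
    # Backward pass: adjust each weight in proportion to its
    # contribution to the error (via the sigmoid's gradient)
    gradient = error * output * (1.0 - output)
    weights += learning_rate * np.outer(gradient, x)

print(sigmoid(weights @ x))  # now close to the target
```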
The whole objective of Deep Learning is to solve 'intuitive' problems, i.e. problems characterized by high dimensionality and no rules. The above mechanism demonstrates a supervised learning algorithm based on a limited modelling of neurons – but we need to understand more.
Deep learning allows computers to solve intuitive problems by building a hierarchy of complex concepts out of simpler ones.
This is similar to the way a child learns 'what a dog is': by understanding the sub-components of the concept, e.g. the behaviour (barking), the shape of the head, the tail, the fur, etc., and then putting these together into one bigger idea, i.e. the dog itself.
The (knowledge) representation problem is a recurring theme in Computer Science.
Knowledge representation incorporates theories from psychology which seek to understand how humans solve problems and represent knowledge. The idea is that if, like humans, computers could gather knowledge from experience, we would avoid the need for human operators to formally specify all of the knowledge that the computer needs to solve a problem.
For a computer, the choice of representation has an enormous effect on the performance of machine learning algorithms. For example, based on sound pitch it is possible to know whether the speaker is a man, a woman or a child. However, for many applications it is not easy to know what set of features represents the information accurately. For example, to detect pictures of cars in images, a wheel may be circular in shape – but actual pictures of wheels have variants (spokes, metal parts, etc.). So the idea of representation learning is to find both the mapping and the representation.
If we can find representations and their mappings automatically (i.e. without human intervention), we have a flexible design for solving intuitive problems. We can adapt to new tasks and even infer new insights without observation; for example, from the pitch of a voice we can infer an accent and hence a nationality. The mechanism is self-learning. Deep learning applications are best suited to situations which involve large amounts of data and complex relationships between different parameters. Training a neural network involves repeatedly showing it that 'given an input, this is the correct output'. If this is done enough times, a sufficiently trained network will mimic the function you are simulating. It will also ignore inputs that are irrelevant to the solution; conversely, it will fail to converge on a solution if you leave out critical inputs. This model can be applied to many scenarios, as we see below in a simplified example.
Deep learning involves learning through layers which allows a computer to build a hierarchy of complex concepts out of simpler concepts. This approach works for subjective and intuitive problems which are difficult to articulate.
Consider image data. Computers cannot understand the meaning of a collection of pixels, and the mapping from a collection of pixels to a complex object is complicated.
With deep learning, the problem is broken down into a series of hierarchical mappings – with each mapping described by a specific layer.
The input (representing the variables we actually observe) is presented at the visible layer. Then a series of hidden layers extracts increasingly abstract features from the input, with each layer concerned with a specific mapping. Note, however, that this process is not predefined, i.e. we do not specify what the layers select.
For example: From the pixels, the first hidden layer identifies the edges
From the edges, the second hidden layer identifies the corners and contours
From the corners and contours, the third hidden layer identifies the parts of objects
Finally, from the parts of objects, the fourth hidden layer identifies whole objects
Image and example source: Yoshua Bengio book – Deep Learning
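As a rough sketch of how such a stack of layers might look in code, here is a small convolutional network written with Keras (my choice of library; the layer sizes are arbitrary assumptions). As noted above, we only choose the architecture; the network itself learns what each layer extracts, so the comments describe what such layers tend to learn.

```python
from tensorflow.keras import layers, models

model = models.Sequential([
    layers.Conv2D(16, 3, activation='relu',
                  input_shape=(64, 64, 1)),   # from pixels: edge-like features
    layers.MaxPooling2D(),
    layers.Conv2D(32, 3, activation='relu'),  # from edges: corners and contours
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation='relu'),  # from contours: parts of objects
    layers.Flatten(),
    layers.Dense(10, activation='softmax'),   # whole-object categories
])
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.summary()
```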
In addition, we have limitations in the technology. For instance, we have a long way to go before a Deep Learning system can figure out that you are sad because your cat died (although it seems CogniToys, based on IBM Watson, is heading in that direction). The current focus is more on identifying photos or guessing a person's age from photos (as in Microsoft's Project Oxford API).
And we do indeed have a way to go, as Andrew Ng reminds us when he compares Artificial Intelligence to building a rocket ship:
“I think AI is akin to building a rocket ship. You need a huge engine and a lot of fuel. If you have a large engine and a tiny amount of fuel, you won’t make it to orbit. If you have a tiny engine and a ton of fuel, you can’t even lift off. To build a rocket you need a huge engine and a lot of fuel. The analogy to deep learning [one of the key processes in creating artificial intelligence] is that the rocket engine is the deep learning models and the fuel is the huge amounts of data we can feed to these algorithms.”
Today, we are still limited by technology from achieving scale. Google’s neural network that identified cats had 16,000 nodes. In contrast, a human brain has an estimated 100 billion neurons!
There are some scenarios to which back-propagation neural networks are well suited.
Given an IoT domain, we could consider the top-level questions:
Now, extending more deeply into the research domain, here are some areas of interest that I am following.
In essence, these techniques/strategies complement Deep learning algorithms with IoT datasets.
1) Deep learning algorithms and time series data: time series data (coming from sensors) can be thought of as a 1D grid taking samples at regular time intervals, while image data can be thought of as a 2D grid of pixels. This allows us to model time series data with Deep Learning algorithms (most sensor/IoT data is time series). It is still relatively uncommon to combine Deep Learning and time series, but there are some instances of this approach already (e.g. Deep Learning for Time Series Modelling to predict energy loads using only time and temperature data); see the sketch after this list.
2) Multiple modalities: multimodality in Deep Learning algorithms is being explored. Of particular interest is cross-modality feature learning, where better features for one modality (e.g. video) can be learned if multiple modalities (e.g. audio and video) are present at feature learning time.
3) Temporal patterns in Deep Learning: in a recent paper, Ph.D. student Huan-Kai Peng and Professor Radu Marculescu, from Carnegie Mellon University's Department of Electrical and Computer Engineering, propose a new way to identify the intrinsic dynamics of interaction patterns at multiple time scales. Their method involves building a deep-learning model that consists of multiple levels, each of which captures the relevant patterns of a specific temporal scale. The proposed model can also be used to explain the possible ways in which short-term patterns relate to long-term patterns. For example, it becomes possible to describe how a long-term pattern on Twitter can be sustained and enhanced by a sequence of short-term patterns, including characteristics like popularity, stickiness, contagiousness and interactivity. The paper can be downloaded HERE.
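Returning to point 1 above, here is a minimal sketch of treating sensor time series data as a 1D grid: the readings are sliced into fixed-length windows that a deep learning model can consume, much as an image model consumes a 2D grid of pixels. The window length and the synthetic sensor readings are illustrative assumptions.

```python
import numpy as np

def make_windows(series, window, horizon=1):
    """Slice a 1D sensor series into (input window, value to predict) pairs."""
    X, y = [], []
    for i in range(len(series) - window - horizon + 1):
        X.append(series[i:i + window])               # the 1D 'grid' of samples
        y.append(series[i + window + horizon - 1])   # the future value
    return np.array(X), np.array(y)

# Synthetic sensor readings sampled at regular time intervals
readings = np.sin(np.linspace(0, 20, 500)) + np.random.normal(0, 0.1, 500)
X, y = make_windows(readings, window=24)
print(X.shape, y.shape)  # (476, 24) (476,): windows ready for a model
```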
I see Smart Cities as an application domain for the Internet of Things. Many definitions exist for Smart Cities/future cities. From our perspective, Smart Cities refers to the use of digital technologies to enhance performance and wellbeing, to reduce costs and resource consumption, and to engage more effectively and actively with citizens (adapted from Wikipedia). Key 'smart' sectors include transport, energy, health care, water and waste. A more comprehensive list of Smart City/IoT application areas includes: intelligent transport systems (including autonomous vehicles), medical and healthcare, environment, waste management, air quality, water quality, accident and emergency services, and energy (including renewables). In all these areas we could find applications to which we could add an intuitive component based on the ideas above.
Typical domains will include computer vision, image recognition, pattern recognition, speech recognition and behaviour recognition. Of special interest are new areas such as self-driving cars – e.g. the Lutz pod – and even larger vehicles such as self-driving trucks.
Deep learning involves learning through layers, which allows a computer to build a hierarchy of complex concepts out of simpler concepts. Deep learning is used to address intuitive applications with high dimensionality. It is an emerging field, and over the next few years, due to advances in technology, we are likely to see many more applications in the Deep Learning space. I am specifically interested in how IoT datasets can be used to complement Deep Learning algorithms. This is an emerging area with some examples shown above. I believe it will have widespread applications, many of which we have not fully explored (as in the Smart City examples).
I see this article as part of an evolving theme. Future updates will explore how Deep Learning algorithms could apply to IoT and Smart City domains, and how Deep Learning algorithms can be complemented with IoT datasets.
I elaborate on these ideas in the Data Science for Internet of Things program (modelled on the course I teach at Oxford University and UPM – Madrid). I will also present these ideas at the International Conference on City Sciences at Tongji University in Shanghai and at the Data Science for IoT workshop at the IoT World event in San Francisco.
Please connect with me on LinkedIn if you want to stay in touch and receive future updates.
Something extraordinary happened last week
An app (Meerkat), which was a 'massive hit' at SXSW and was launched only two months ago, raised $14m in funding.
Three days after that, its popularity plunged rapidly after the launch of Twitter's Periscope.
Probably never to return to its height.
A few more days after that, Meerkat and Periscope were neck and neck.
In two months, an app went from launch, to funding ($14m), to plunge.
Some blame the Tech journalists – and there is some truth in that.
A whole ecosystem has grown up to support the ‘app economy’ – including the VCs, tech journalists, conference creators, hackathons and industry analysts who rank apps.
Sentiment changes rapidly.
Now, some articles call it the Schrödinger's Meerkat (is it dead or is it alive?).
Others have taken to defending the tech journalists themselves, e.g. from the Guardian: Tech journalists may have been wrong about Meerkat but they're right to get excited about new apps.
But there is a wider question here ..
App uptake metrics (e.g. downloads) have become a bit like the dot-com era obsessions ..
There is a lot of activity but it is transient (as we see in the case of Meerkat) because the value no longer lies in the App itself.
For long term success, the value (if it exists) lies beyond the app.
Here are some reasons why the app economy dynamic is changing and value is shifting away from the app:
a) Even when the app has been poor, the company has done well when the value lay beyond the app. The best example of this is LinkedIn, whose app and website I always find frustrating; I sometimes need wikiHow to work out even the basics, such as deleting a contact. The app could be a lot better – but we still use it despite the app.
b) APIs are becoming increasingly important and are managing much of the complexity, for example health care APIs. The app then becomes a simple interface; the APIs do the work.
c) 'App only' brands are hard to sustain and expand: unlike LinkedIn, where the value lies beyond the app, for Rovio (Angry Birds) the product (and the value) was the app itself. And 2014 was a bad year for Rovio. It is unclear if the popularity of the brand will ever return.
d) Content has a fleeting timescale, and it's getting even shorter: the diminishing popularity timescales apply to all online content. Gangnam Style broke the YouTube popularity counter – but look again. Gangnam Style was launched in July 2012; Google Trends shows that it peaked in December 2012, with a precipitous drop soon after, and it has been dropping in popularity ever since (even as cumulative views increase). Content apps may have the same problem: beyond the first year (or two), they appear to be from an older era, especially if the user base is younger. The Draw Something app had the same drop in popularity.
e) 'Which apps do IoT developers use?' is like focusing on the dashboard and ignoring the engine: it is the wrong question, because it places more emphasis on the app than on the vertical (IoT). It's like asking which web development technique someone used for their website – does it matter? IoT is a hugely complex domain. The same will apply to automotive apps, healthcare apps, etc.
f) Apps are not open: coming back to Meerkat, Twitter's move reminds us that apps and social media are not open. If Twitter does a deal with operators for 'sponsored data', that's even worse for innovation like Meerkat (and I expect that type of deal will become increasingly common, further suppressing Long Tail innovation).
Apps continue to drive Long tail innovation
But for the reasons mentioned above, there is a fundamental shift in the ecosystem
Value is now closely tied to the vertical
In some ways, it is a natural maturing of the ecosystem
But when tied to a specific vertical – the value apportioned to the app is relatively less
Knowledge of, and integration with, the vertical now becomes more important than the app in this maturing phase (leaving aside the openness issue).
For example, in IoT, IBM has bet $3 billion on the space – but the focus is on analyzing data coming from many different devices.
The skillsets to do this are not the same as for the app – although there will undoubtedly be an app interface.
So, does the app economy still exist?
Increasingly, not in the form we know it (across verticals)
In a more maturing phase, we will see deeper integration with specific verticals.
For other forms of apps – there is no way to predict economic value even over short periods
PS – if you are interested in IoT, have a look at this (upskill to Big Data, Data Science and IoT).
We will also have an online version. Please contact me at ajit.jaokar at futuretext.com
Great to be on this list http://www.onalytica.com/…/the-internet-of-things-landscap…/ (full list needs a free download) – I am No 90 (for individuals)
Good list of people and brands to follow
Over the past few years, I have been teaching a specialized course at Oxford University on Telecoms and Big Data.
This year, I have also started teaching a new course on Data Science and IoT.
Here, we apply predictive algorithms to IoT datasets.
It's a complex course, and currently we have launched it with a few corporates through Oxford.
Independent of the academic course, I have also launched a version with Fablab London.
The outline below gives you the approach, content and modules.
If you can commute to London and want to master Data Science for the Internet of Things – have a look at London Data Science for IoT.
Alternatively, we will have an online version for $600.
This blog is a review of two books, both available for free from the MapR site and written by Ted Dunning and Ellen Friedman (published by O'Reilly): Time Series Databases: New Ways to Store and Access Data, and A New Look at Anomaly Detection.
The MapR platform is a key part of the Data Science for the Internet of Things (IoT) course at the University of Oxford, and I shall be covering these issues in my course.
In this post, I discuss the significance of time series databases from an IoT perspective, based on my review of these books. Specifically, we discuss classification and anomaly detection, which often go together in typical IoT applications. The books are easy to read, with analogies like HAL (2001: A Space Odyssey), and I recommend them.
The idea of time series data is not new. Historically, time series data could be stored even in simple structures like flat files. The difference now is the huge volume of data and the future applications made possible by collecting it – especially for IoT. These large-scale time series databases and applications are the focus of the books. Large-scale time series applications typically need a NoSQL database like Apache Cassandra, Apache HBase or MapR-DB; the books focus on Apache HBase and MapR-DB for the collection, storage and access of large-scale time series data.
Essentially, time series data involves measurements or observations of events as a function of the time at which they occurred. The airline 'black box' is a good example: it records data many times per second for dozens of parameters throughout the flight, including altitude, flight path, engine temperature and power, indicated air speed, fuel consumption and control settings. Each measurement includes the time it was made. The analogy applies to sensor data. Increasingly, with the proliferation of IoT, time series data is becoming more common and universal. The data acquired through sensors is typically stored in a Time Series Database (TSDB), which is optimized for queries based on a range of time.
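As a minimal illustration of the kind of time-range query a TSDB is optimized for, here is a hedged sketch using pandas; a production system would use Apache HBase or MapR-DB as the books describe, but the access pattern is the same. The sensor name and values are made up.

```python
import numpy as np
import pandas as pd

# Sensor measurements, each tagged with the time it was made
times = pd.date_range('2015-01-01', periods=10000, freq='s')
readings = pd.Series(np.random.normal(90, 5, len(times)), index=times,
                     name='engine_temperature')

# The characteristic TSDB query: all values within a range of time
window = readings['2015-01-01 00:10':'2015-01-01 00:20']
print(window.mean(), window.max())
```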
Time series databases apply to many IoT use cases, for example:
From these readings, captured in a time series database, we can derive analytics such as:
Prognosis: What are the short- and long-term trends for some measurement or ensemble of measurements?
Introspection: How do several measurements correlate over a period of time?
Prediction: How do I build a machine-learning model based on the temporal behaviour of many measurements correlated to externally known facts?
Introspection: Have similar patterns of measurements preceded similar events?
Diagnosis: What measurements might indicate the cause of some event, such as a failure?
The books give examples of using anomaly detection and classification for IoT data.
For time series IoT readings, anomaly detection and classification go together. Anomaly detection determines what normal looks like, and how to detect deviations from normal.
When searching for anomalies, we don't know in advance what their characteristics will be. Once we know the characteristics, we can use a different form of machine learning, i.e. classification.
An anomaly in this context just means different than expected – it does not imply desirable or undesirable. Anomaly detection is a discovery process to help you figure out what is going on and what you need to look for; the anomaly-detection program must discover interesting patterns or connections in the data itself.
Anomaly detection and classification go together when it comes to solving real-world problems. Anomaly detection is used first, in the discovery phase, to help you figure out what is going on and what you need to look for. You could use the anomaly-detection model to spot outliers, then set up an efficient classification model to assign new examples to the categories you've already identified. You then update the anomaly detector to consider these new examples as normal, and repeat the process.
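Here is a hedged sketch of this two-step pattern using scikit-learn (the library choice, the models and the data are my assumptions, not the books' code): an anomaly detector first models what normal looks like, and a classifier is then trained on the categories discovered.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
normal = rng.normal(0, 1, size=(500, 2))   # readings that look 'normal'
faults = rng.normal(5, 1, size=(20, 2))    # rare deviations
readings = np.vstack([normal, faults])

# Discovery phase: model what normal looks like and flag deviations
detector = IsolationForest(contamination=0.05, random_state=0).fit(readings)
is_anomaly = detector.predict(readings) == -1  # -1 marks outliers

# Once the flagged examples are examined and labelled, switch to
# classification to assign new examples to the known categories
labels = is_anomaly.astype(int)
classifier = LogisticRegression().fit(readings, labels)
print(classifier.predict([[0.1, -0.2], [5.2, 4.8]]))  # e.g. [0 1]
```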
The books go on to give examples of using these techniques on EKG data.
For example, for the challenge of finding an approachable, practical way to model normal for a very complicated curve such as the EKG, we could use a type of machine learning known as deep learning.
Deep learning involves letting a system learn in several layers, in order to deal with large and complicated problems in approachable steps. Curves such as the EKG have repeated components that are separated in time rather than superposed. We take advantage of the repetitive and separated nature of an EKG curve to accurately model its complicated shape and detect normal patterns using Deep Learning.
The books also describe a data structure called t-digest for the accurate calculation of extreme quantiles. t-digest was developed by one of the authors, Ted Dunning, as a way to accurately estimate extreme quantiles for very large data sets with limited memory use. This capability makes t-digest particularly useful for selecting a good threshold for anomaly detection. The t-digest algorithm is available in Apache Mahout as part of the Mahout math library, and as open source at https://github.com/tdunning/t-digest
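To illustrate how an extreme quantile can serve as an anomaly threshold, here is a hedged sketch assuming the community Python tdigest package (the books point to the Java implementations in Apache Mahout and at the GitHub link above); the streamed scores are made-up data.

```python
import numpy as np
from tdigest import TDigest

digest = TDigest()
# Stream a large set of anomaly scores through the digest; the digest
# keeps only a compact summary, not the raw values
for score in np.random.exponential(1.0, 100000):
    digest.update(score)

# An extreme quantile gives a principled anomaly threshold, estimated
# accurately despite the limited memory use
threshold = digest.percentile(99.9)
print('flag readings as anomalous above', threshold)
```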
Anomaly detection is a complex field and needs a lot of data.
For example: what happens if you only save a month of sensor data at a time, but the critical events leading up to a catastrophic part failure happened six weeks or more before the event?
To conclude, much of the complexity for IoT analytics comes from the management of Large scale data.
Collectively, Interconnected Objects and the data they share make up the Internet of Things (IoT).
Relationships between objects and people, between objects and other objects, conditions in the present, and histories of their condition over time can be monitored and stored for future analysis, but doing so is quite a challenge.
However, the rewards are also potentially enormous. That’s where machine learning and anomaly detection can provide a huge benefit.
For time series, the books cover themes such as:
Storing and Processing Time Series Data
The Direct Blob Insertion Design
Why Relational Databases Aren’t Quite Right
Architecture of OpenTSDB
Value Added: Direct Blob Loading for High Performance
Using SQL-on-Hadoop Tools
Using Apache Spark SQL
Advanced Topics for Time Series Databases (Stationary Data, Wandering Sources, Space-Filling Curves)
For Anomaly detection:
Windows and Clusters
Anomalies in Sporadic Events
Website Traffic Prediction
Extreme Seasonality Effects
Time Series Databases: New Ways to Store and Access Data, and A New Look at Anomaly Detection, by Ted Dunning and Ellen Friedman (published by O'Reilly).
See also the Data Science for the Internet of Things (IoT) course – University of Oxford, where I hope to cover these issues in more detail in the context of MapR.
I am pleased to announce a unique course: the Data Science for the Internet of Things (IoT) course – University of Oxford.
We are launching first with very limited places. We are already collaborating with MapR, Sigfox, Hypercat, Red Ninja and many others.
So the course will be based on practical insights from current systems
Everyone finishing the course will receive a University of Oxford certificate showing that they have completed the course
Course is fully online
Have a look Data Science for the Internet of Things (IoT) course – University of Oxford for more
I welcome feedback and will update a lot more over the next few weeks.
If you want to avail yourself of this unique certification, please email me for more information: ajit.jaokar at futuretext.com
In this post, I discuss a possible new approach to teaching Programming for Data Science.
Here, I argue that we should look beyond the Python vs. R debate and teach R, Python and SQL together. To do this, we need to look at the big picture first (the problem we are solving in Data Science) and then see how that problem is broken down and solved by the different approaches. In doing so, we can more easily master multiple approaches, and even combine them if needed.
At first impression, this Polyglot approach (the ability to master multiple languages) sounds complex.
Why teach three languages together? (For simplicity, I am including SQL as a language here.)
Here is some background
Outside of Data Science, I also co-founded a social enterprise, Feynlabs, to teach Computer Science to kids. At Feynlabs, we have been working on ways to accelerate learning to code. One way to do this is to compare and contrast multiple programming languages. This approach makes sense for Data Science too, because a learner can potentially approach Data Science from many directions.
To learn programming for Data Science, it would thus help to build up from an existing foundation the learner is already familiar with, and then relate new ideas to this foundation through other approaches. From a pedagogical standpoint, this is similar to David Ausubel, who stressed the importance of prior knowledge in being able to learn new concepts: "The most important single factor influencing learning is what the learner already knows."
But first, we address what problem we are trying to solve and how that problem can be broken down.
I also propose to make this approach part of the Data Science for IoT course/certification, but I expect to teach it as a separate module as well – probably in a workshop format in London and the USA. If you are interested to know more, please sign up on the mailing list HERE.
Data science involves the extraction of knowledge from data. Ideally, we need lots of data from a variety of sources. Data Science lies at the intersection of multiple disciplines: Programming, Statistics, Algorithms, Data analysis etc. The quickest way to solve Data Science problems is to start analyzing data as soon as possible. However, Data Science also needs a good understanding of the theory – especially the machine learning approaches.
A Data Scientist typically approaches a problem using a methodology like OSEMN (Obtain, Scrub, Explore, Model, Interpret). Some of these steps are common to a classic data warehouse and are similar to the classic ETL (Extract, Transform, Load) approach. However, the modelling and interpreting stages are unique to Data Science. Modelling needs an understanding of machine learning algorithms and how they fit together – for example, unsupervised algorithms (dimensionality reduction, clustering) and supervised algorithms (regression, classification).
To understand Data Science, I would expect some background in Programming. Certainly, one would not expect a Data Scientist to start from “Hello World”. But on the other hand, the syntax of a language is often over-rated. Languages have quirks – and they are easy to get around with most modern tools.
So, if we look at the problem/big picture first (i.e. the Obtain, Scrub, Explore, Model and Interpret stages), it is easier to fit the programming languages to the stages. Machine learning has two phases: the model building phase and the prediction phase. We first build the model (often in batch mode, which takes longer); we then perform predictions on the model in a dynamic/real-time mode. Thus, to understand programming for Data Science, we can divide the learning into four stages: the tool itself (IDE), data management, modelling and visualization.
After understanding the basic syntax, it is easier to understand a language in terms of its packages and libraries. Both Python and R have a vast number of packages (such as statsmodels), often distributed as libraries (such as scikit-learn). Both languages are interpreted, and both have good IDEs, such as Spyder and IPython for Python and RStudio for R. If using Python, you would probably use a library like scikit-learn and a distribution of Python such as the Anaconda distribution. With R, you would use RStudio and install specific packages using R's CRAN package management system.
Apart from R and Python, you will also need SQL. I include SQL because it plays a key role in the data scrubbing stage – a stage some have called the janitor work of Data Science, and one that takes a lot of time. SQL also plays a part in SQL-on-Hadoop approaches like Apache Drill, which allow users to write SQL queries on data stored in Hadoop and receive results.
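As a small hedged sketch of SQL's set-based role in the scrubbing stage, using Python's built-in sqlite3 module purely for illustration (the table, the null reading and the -999 sentinel are made up):

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE readings (sensor TEXT, temp REAL)')
conn.executemany('INSERT INTO readings VALUES (?, ?)',
                 [('s1', 21.5), ('s1', None), ('s2', -999), ('s2', 22.1)])

# Set-based scrubbing: drop nulls and sentinel values in one statement,
# rather than looping over rows in the host language
clean = conn.execute("""
    SELECT sensor, AVG(temp) FROM readings
    WHERE temp IS NOT NULL AND temp > -100
    GROUP BY sensor
""").fetchall()
print(clean)  # [('s1', 21.5), ('s2', 22.1)]
```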
With SQL, you manipulate data in sets. However, once the data is inside the programming environment, it is treated differently depending on the language.
In R, everything is a vector: R data structures and functions are vectorized. This means most functions in R work on vectors (i.e. on all the elements, not on individual elements in a loop). Thus, in R, you read your data into a data frame and use a built-in model (here are the steps/packages for linear regression). In Python, if you did not use a library like scikit-learn, you would need to make many decisions yourself, and that can be a lot harder. However, with a package like scikit-learn, you get a consistent, well-documented interface to the models, which makes your job a lot easier by letting you focus on usage.
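For example, here is a hedged sketch of scikit-learn's consistent fit/predict interface for linear regression (the data is illustrative). Note how it also reflects the two machine learning phases mentioned earlier: model building, then prediction.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Model building phase (often batch): fit on known input/output examples
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])
model = LinearRegression().fit(X, y)

# Prediction phase: the same fit/predict interface applies to any
# scikit-learn estimator, which is what makes the library easy to use
print(model.predict(np.array([[5.0]])))  # roughly 10
```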
After the data modelling stage, we come to data exploration and visualization. For Python, the pandas package is a powerful tool for data exploration (here is a simple and quick intro to the power of Python pandas, a YouTube video). Similarly, R uses the dplyr and ggplot2 packages for data exploration and visualization.
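A brief hedged sketch of pandas for data exploration (the dataset is made up), mirroring the group-and-summarize workflow that dplyr offers in R:

```python
import pandas as pd

# A made-up sensor dataset, purely for illustration
df = pd.DataFrame({
    'sensor': ['s1', 's1', 's2', 's2'],
    'temp':   [21.5, 22.3, 19.8, 20.1],
})

# The core exploration loop: group, summarize, inspect
print(df.groupby('sensor')['temp'].describe())
print(df.groupby('sensor')['temp'].mean())
```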
Finally, much of this discussion is a rapidly moving goalpost. For example, in R, large calculations need the data to be loaded into a matrix (e.g. n×n matrix manipulation), but platforms like Revolution Analytics can overcome this. Especially with the acquisition of Revolution Analytics by Microsoft – and given Microsoft's history of creating good developer tools – we can expect development in R to be simplified.
Also, since both R and Python operate in the context of Hadoop for Data Science, we would expect to leverage the Hadoop architecture through HDFS connectors, both for Python Hadoop frameworks and for R Hadoop integration. One could also argue that we are already living in a post-Hadoop/MapReduce world, with Spark and Storm taking over, especially for real-time calculations, and that at least some Hadoop functions may be replaced by Spark.
Here is a good introduction to Apache Spark, and a post about getting started with Spark in Python. Interestingly, the Spark programming guide covers integration with three languages (Scala, Java and Python) but not R; however, the power of open source means we have SparkR, which integrates R with Spark.
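And a minimal hedged sketch of getting started with Spark from Python via PySpark (the local master and the toy readings are assumptions):

```python
from pyspark import SparkContext

sc = SparkContext('local', 'quickstart')

# Distribute a collection and run a parallel transformation on it
readings = sc.parallelize([20.1, 21.3, 54.9, 22.0, 19.8])
anomalies = readings.filter(lambda t: t > 50).collect()
print(anomalies)  # [54.9]
sc.stop()
```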
The approach of covering multiple languages has some support – for instance, the Beaker notebook. You could also achieve the same effect by working on the command line, for example as in Data Science at the Command Line.
Even in a brief blog post, you can get a lot of insight when you look at the wider problem of Data Science and compare how different approaches address segments of that problem. You just need the bigger picture of how these languages fit together for Data Science, and an understanding of the major differences (for example, vectorization in R).
The use of good IDEs, packages, etc. softens the impact of programming.
It then changes our role, as Data Scientists, to mixing and matching a palette of techniques exposed as APIs – sometimes spanning languages.
I hope to teach this approach as part of Data Science for IoT course/certification
Programming for Data Science will also be a separate module that I will teach over the next few months at Fablab London, the London IT contractors meetup group, CREATE Miami (a venture accelerator at Miami Dade College), the City Sciences conference in Shanghai (as part of a larger paper) and MCS Madrid.
For more schedules and details please sign up HERE
Call for Papers from the International Conference on City Sciences (ICCS 2015): New architectures, infrastructures and services for future cities, co-organized by City Sciences, where I teach.
Call for Papers Shanghai, 4-5 June 2015
International Conference on City Sciences (ICCS 2015): New architectures, infrastructures and services for future cities
The new science of cities stands at a crossroads. It encompasses rather different, or even conflicting, approaches. Future cities place citizens at the core of the innovation process when creating new urban services, through 'experience labs', the development of urban apps or the provision of 'open data'. But future cities also describe the modernisation of urban infrastructures and services such as transport, energy and culture through digital ICT technologies: ultra-fast fixed and mobile networks, the Internet of Things, smart grids, data centres, etc. In fact, during the last two decades local authorities have invested heavily in new infrastructures and services, for instance putting more and more public services online and trying to create links between still-prevalent silo approaches, with the citizen taking an increasingly centre-stage role. However, so far the results of these investments have not lived up to expectations; in particular, the transformation of city administration has not been as rapid or as radical as anticipated. There is therefore an increasing awareness of the need to deploy new infrastructures to support updated public services, and to develop new services able to share information and knowledge within and between organizations and citizens. In addition, urban planning and the urban landscape are increasingly perceived as a basic infrastructure, or rather a framework, on which the rest of the infrastructures and services rely. Thus, as an overarching consequence, there is an urgent need for practitioners and academics to discuss successful cases and new approaches that can help build better future cities.
Taking place in Shanghai – itself a paradigm of the challenges facing future cities and a crossroads between East and West – the International Conference on City Sciences responds to these and other issues by bringing together academics, policy makers, industry analysts, providers and practitioners to present and discuss their findings. A broad range of topics related to infrastructures and services in the framework of city sciences are welcome as subjects for papers, posters and panel sessions:
Authors of selected papers from the conference will be invited to submit to special issues of International peer-reviewed academic journals.
Submission of Abstracts:
Abstracts should be about 2 pages (800 to 1000 words) in length and contain the following