Book review: About Time Series Databases and A New Look at Anomaly Detection by Ted Dunning and Ellen Friedman

Introduction

This blog is a review of two books by Ted Dunning and Ellen Friedman (published by O'Reilly), both available for free from the MapR site: About Time Series Databases: New ways to store and access data, and A new look at Anomaly Detection.

The MapR platform is a key part of the Data Science for the Internet of Things (IoT) course at the University of Oxford, and I shall be covering these issues in that course.

In this post, I discuss the significance of time series databases from an IoT perspective, based on my review of these books. Specifically, I discuss classification and anomaly detection, which often go together in typical IoT applications. The books are easy to read, with analogies like HAL (from 2001: A Space Odyssey), and I recommend them.

 

Time Series data

The idea of time series data is not new. Historically, time series data could be stored even in simple structures like flat files. The difference now is the huge volume of data and the future applications made possible by collecting it, especially for IoT. These large-scale time series databases and applications are the focus of the first book. Large-scale time series applications typically need a NoSQL database such as Apache Cassandra, Apache HBase or MapR-DB. The book focuses on Apache HBase and MapR-DB for the collection, storage and access of large-scale time series data.

Essentially, time series data involves measurements or observations of events as a function of the time at which they occurred. The airline 'black box' is a good example of time series data. The black box records data many times per second for dozens of parameters throughout the flight, including altitude, flight path, engine temperature and power, indicated air speed, fuel consumption, and control settings. Each measurement includes the time it was made. The analogy applies to sensor data. With the proliferation of IoT, time series data is becoming increasingly common. The data acquired through sensors is typically stored in a time series database (TSDB), which is optimized for queries over a range of time.
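To make the idea of a query over a range of time concrete, here is a minimal sketch in plain Python. It is not how HBase, MapR-DB or any real TSDB is implemented; the readings and the range_query helper are hypothetical and only illustrate why keeping data sorted by timestamp makes time-range lookups cheap.

```python
from bisect import bisect_left, bisect_right

# Hypothetical black-box style readings: (timestamp in seconds, altitude in feet),
# kept sorted by timestamp, as a time series store would keep them.
readings = [(0, 1000), (1, 1010), (2, 1025), (3, 1040), (4, 1050), (5, 1045)]
timestamps = [t for t, _ in readings]

def range_query(start, end):
    """Return all readings with start <= timestamp <= end."""
    # Binary search finds both ends of the range without scanning every row.
    lo = bisect_left(timestamps, start)
    hi = bisect_right(timestamps, end)
    return readings[lo:hi]

window = range_query(2, 4)
print(window)
```

Because the data is ordered by time, the lookup touches only the rows inside the requested window, which is the access pattern a TSDB is designed around.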

 

Time series data applications

Time series databases apply to many IoT use cases, for example:

  • Trucking, to reduce taxes according to how much trucks drive on public roads (which sometimes incur a tax). It’s not just a matter of how many miles a truck drives but rather which miles.
  • A smart pallet can be a source of time series data that might record events of interest such as when the pallet was filled with goods, when it was loaded or unloaded from a truck, when it was transferred into storage in a warehouse, or even the environmental parameters involved, such as temperature.
  • Similarly, commercial waste containers, called dumpsters in the US, could be equipped with sensors to report on how full they are at different points in time.
  • Cell tower traffic can also be modelled as a time series, and anomalies such as flash crowd events can be detected to provide early warning.
  • Data center monitoring can be modelled as a time series to predict outages and plan upgrades.
  • Similarly, readings from satellites, robots and many more devices can be modelled as time series data.

From these readings captured in a Time Series database, we can derive analytics such as:

Prognosis: What are the short- and long-term trends for some measurement or ensemble of measurements?

Introspection: How do several measurements correlate over a period of time?

Prediction:  How do I build a machine-learning model based on the temporal behaviour of many measurements correlated to externally known facts?

Introspection:  Have similar patterns of measurements preceded similar events?

Diagnosis:  What measurements might indicate the cause of some event, such as a failure?
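As an illustration of the question "how do several measurements correlate over a period of time?", the sketch below computes a Pearson correlation between two series sampled at the same timestamps. The sensor names and values are made up for the example, and the books do not prescribe this particular statistic.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical readings taken at the same six timestamps.
engine_temp = [70, 72, 75, 79, 84, 90]
fuel_flow = [10, 11.2, 12.4, 14.6, 16.9, 20.1]

r = pearson(engine_temp, fuel_flow)
print(f"correlation between engine temperature and fuel flow: {r:.3f}")
```

A value close to 1 suggests the two measurements rise and fall together over the period; a value close to -1 suggests they move in opposite directions.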

 

Classification and Anomaly detection for IoT

The books give examples of using anomaly detection and classification on IoT data.

For time series readings from IoT devices, anomaly detection and classification go together. Anomaly detection determines what normal looks like, and how to detect deviations from normal.

When searching for anomalies, we don't know in advance what their characteristics will be. Once we know those characteristics, we can use a different form of machine learning, namely classification.

Anomaly in this context just means different than expected; it does not imply desirable or undesirable. Anomaly detection is a discovery process to help you figure out what is going on and what you need to look for. The anomaly-detection program must discover interesting patterns or connections in the data itself.

Anomaly detection and classification go together when it comes to finding a solution to real-world problems. Anomaly detection is used first, in the discovery phase, to help you figure out what is going on and what you need to look for. You could use the anomaly-detection model to spot outliers, then set up an efficient classification model to assign new examples to the categories you've already identified. You then update the anomaly detector to consider these new examples as normal and repeat the process.
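The loop described above can be sketched roughly as follows. This is a toy illustration under my own assumptions, not the method from the books: the detector is a simple z-score test against a baseline, the classifier is a nearest-centroid rule, and the names find_outliers, classify and the "overheat" category are hypothetical.

```python
from statistics import mean, stdev

def find_outliers(values, baseline, z_threshold=3.0):
    """Flag values more than z_threshold standard deviations from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    return [v for v in values if abs(v - mu) > z_threshold * sigma]

def classify(value, centroids, tolerance=2.0):
    """Assign a value to a known category if it lies close to that category's centre."""
    for label, centre in centroids.items():
        if abs(value - centre) <= tolerance:
            return label
    return None

# Step 1: discovery. Hypothetical temperature readings; the baseline defines "normal".
baseline = [20.1, 19.8, 20.3, 20.0, 19.9, 20.2, 20.1, 19.7]
new_readings = [20.0, 35.2, 19.9, 34.8, 20.2]
outliers = find_outliers(new_readings, baseline)

# Step 2: a person inspects the outliers and names the pattern,
# which becomes a category for the classifier.
centroids = {"overheat": mean(outliers)}

# Step 3: on the next batch, known patterns are classified, and only
# outliers that match no known category are reported as new anomalies.
next_batch = [20.1, 35.5, 5.0]
labels = [classify(v, centroids) for v in next_batch]
still_unknown = [v for v in find_outliers(next_batch, baseline)
                 if classify(v, centroids) is None]

print(outliers, labels, still_unknown)
```

The design point is the division of labour: the detector only has to surface what it cannot explain, while the classifier handles the patterns that have already been identified.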

The book goes on to give examples of applying these techniques to EKG (electrocardiogram) data.

For example, for the challenge of finding an approachable, practical way to model normal for a very complicated curve such as the EKG, we could use a type of machine learning known as deep learning.

Deep learning involves letting a system learn in several layers, in order to deal with large and complicated problems in approachable steps. Curves such as the EKG have repeated components separated in time rather than superposed. We take advantage of the repetitive and separated nature of an EKG curve to model its complicated shape accurately and so learn what its normal patterns look like.

The book also refers to a data structure called t-digest for the accurate calculation of extreme quantiles. t-digest was developed by one of the authors, Ted Dunning, as a way to accurately estimate extreme quantiles for very large data sets with limited memory use. This capability makes t-digest particularly useful for selecting a good threshold for anomaly detection. The t-digest algorithm is available in Apache Mahout as part of the Mahout math library. It's also available as open source at https://github.com/tdunning/t-digest
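To show what selecting a threshold from an extreme quantile means, here is a small sketch. Note that it does not use t-digest: it computes an exact nearest-rank quantile by sorting every value in memory, which is precisely the cost t-digest is designed to avoid on very large data sets. The anomaly scores are synthetic and the quantile helper is hypothetical.

```python
import math
import random

def quantile(values, q):
    """Exact nearest-rank quantile; a stand-in for t-digest that keeps all values in memory."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered)))
    return ordered[rank - 1]

# Synthetic anomaly scores (assumption: higher score means more unusual).
random.seed(42)
scores = [random.gauss(0.0, 1.0) for _ in range(10_000)]

# Use the 99.9th percentile as the alert threshold, so only roughly
# one reading in a thousand is flagged for attention.
threshold = quantile(scores, 0.999)
flagged = [s for s in scores if s > threshold]

print(f"threshold={threshold:.3f}, flagged={len(flagged)} of {len(scores)}")
```

Choosing the threshold from a quantile, rather than hard-coding a number, ties the alert rate to the data actually observed.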

 

Anomaly detection is a complex field and needs a lot of data.

For example: what happens if you only save a month of sensor data at a time, but the critical events leading up to a catastrophic part failure happened six weeks or more before the event?

IoT from a large scale Data standpoint

To conclude, much of the complexity of IoT analytics comes from the management of large-scale data.

Collectively, interconnected objects and the data they share make up the Internet of Things (IoT).

Relationships between objects and people, between objects and other objects, conditions in the present, and histories of their condition over time can be monitored and stored for future analysis, but doing so is quite a challenge.

However, the rewards are also potentially enormous. That’s where machine learning and anomaly detection can provide a huge benefit.

For time series, the book covers themes such as:

  • Storing and Processing Time Series Data
  • The Direct Blob Insertion Design
  • Why Relational Databases Aren’t Quite Right
  • Architecture of Open TSDB
  • Value Added: Direct Blob Loading for High Performance
  • Using SQL-on-Hadoop Tools
  • Using Apache Spark SQL
  • Advanced Topics for Time Series Databases (Stationary Data, Wandering Sources, Space-Filling Curves)

For anomaly detection, the themes include:

  • Windows and Clusters
  • Anomalies in Sporadic Events
  • Website Traffic Prediction
  • Extreme Seasonality Effects

 

Links again:

About Time Series Databases: New ways to store and access data and A new look at Anomaly Detection by Ted Dunning and Ellen Friedman (published by O'Reilly).

Also, the link for the Data Science for the Internet of Things (IoT) course at the University of Oxford, where I hope to cover these issues in more detail in the context of MapR.