An outlier is an observation that deviates so much from other observations as to arouse suspicion that it was generated by a different mechanism
Hawkins [Haw80I]

About the workshop

Detecting anomalies is of high interest in multiple industries for identifying safety and security risks, ensuring production quality, or finding new business opportunities. However, anomaly detection faces some unique challenges. First, identifying anomalies by hand is difficult, especially in multidimensional data. Second, anomalies are usually poorly represented in datasets.

Anomaly detection in predictive maintenance can prevent system failures.

Anomaly detection must therefore rely largely on unsupervised learning with possibly contaminated nominal data. These methods need additional assumptions about the data to be able to identify anomalies reliably. In this workshop, we will review common approaches for anomaly detection and discuss their strengths and weaknesses in different application areas.

We published the material for this workshop on GitHub, which includes a Codespace to start with the content right away. Furthermore, we’ve recorded the training and published it on our Learning Platform as well as YouTube.

To get the most out of the training, we recommend booking an iteration of the training at the top of this page. This setup includes regular discussions with our experts on the topic. Feel free to discuss recent research and personal projects in addition to the training’s content.

Learning outcomes

Understand qualitative and quantitative definitions of anomalies.
Get an overview of the theoretical foundations and practical implementations of multiple anomaly detection algorithms.
Understand which algorithms are suitable for which application areas.
Learn how to evaluate and compare performances of different algorithms.
Learn to find thresholds for anomaly detection using extreme value theory.

Structure of the workshop

Part 1: Introduction to anomaly detection

We start with a brief introduction to anomaly detection and its applications. We discuss the special challenges of anomaly detection and the different types of anomalies. We then introduce the contamination framework. Finally, we introduce evaluation metrics for anomaly detection and discuss the class imbalance problem.

Visualization of the “few, sparse, different” assumption under the contamination framework

Evaluation metrics for anomaly detection.

The informal notion of anomaly and definition attempts.
The contamination framework.
Class imbalance.
Evaluation metrics.
Mahalanobis distance.
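
The following sketch illustrates two of the topics above on toy data: scoring points by their Mahalanobis distance and the effect of class imbalance on a naive metric. The data, the 2% contamination rate, and the names `mahalanobis_scores` and `precision_at_k` are assumptions made for this illustration, not part of the workshop material. Note that the mean and covariance are estimated on the contaminated sample itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: 500 nominal points from a correlated Gaussian
# plus 10 anomalies placed far from the bulk (about 2% contamination).
cov = np.array([[1.0, 0.8], [0.8, 1.0]])
nominal = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=500)
anomalies = rng.uniform(low=6.0, high=8.0, size=(10, 2)) * np.array([1.0, -1.0])
X = np.vstack([nominal, anomalies])
y = np.r_[np.zeros(500, dtype=int), np.ones(10, dtype=int)]


def mahalanobis_scores(X):
    """Mahalanobis distance of every row to the sample mean."""
    mu = X.mean(axis=0)
    precision = np.linalg.inv(np.cov(X, rowvar=False))
    diff = X - mu
    return np.sqrt(np.einsum("ij,jk,ik->i", diff, precision, diff))


scores = mahalanobis_scores(X)

# Flag the k highest-scoring points, with k equal to the true number of anomalies.
k = int(y.sum())
flagged = np.zeros_like(y)
flagged[np.argsort(scores)[-k:]] = 1

# Fraction of flagged points that are true anomalies.
precision_at_k = (flagged & y).sum() / k

# Class imbalance: a detector that never flags anything still looks accurate.
accuracy_all_nominal = (y == 0).mean()

print(f"precision@k: {precision_at_k:.2f}")
print(f"accuracy of the trivial detector: {accuracy_all_nominal:.3f}")
```

The trivial detector reaches an accuracy of about 98% without finding a single anomaly, which is why ranking-based metrics such as precision at k are usually preferred over accuracy in this setting.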

Part 2: Anomaly detection via density estimation and robustness

Density estimation is a common approach for anomaly detection. It rests on the assumption that anomalies appear in unlikely areas of the feature space. We discuss kernel density estimation (KDE) as a generic example of a density estimation method. When the training data might contain unrecognized anomalies, robustness is an important property of the estimation procedure. We discuss robust variants of KDE and apply them to a real-world dataset with mislabelled data.

Kernel density estimation

Kernel density estimation.
Robust variants of kernel density estimation.
Example: Housing prices and mislabelled data.
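
As a minimal sketch of density-based scoring, the snippet below implements a plain (non-robust) Gaussian KDE in NumPy and uses the negative log-density as the anomaly score. The function name `kde_log_density`, the bandwidth of 0.5, and the toy data are assumptions chosen for illustration.

```python
import numpy as np


def kde_log_density(train, query, bandwidth):
    """Log-density of each `query` row under a Gaussian KDE fitted on `train`."""
    n, d = train.shape
    # Squared Euclidean distance between every query and every training point.
    sq_dist = ((query[:, None, :] - train[None, :, :]) ** 2).sum(axis=-1)
    log_kernel = -0.5 * sq_dist / bandwidth**2
    log_norm = np.log(n) + 0.5 * d * np.log(2.0 * np.pi * bandwidth**2)
    # Log-sum-exp over the training points for numerical stability.
    m = log_kernel.max(axis=1, keepdims=True)
    return m[:, 0] + np.log(np.exp(log_kernel - m).sum(axis=1)) - log_norm


rng = np.random.default_rng(0)
train = rng.normal(size=(300, 2))            # hypothetical nominal data
query = np.array([[0.0, 0.0], [5.0, 5.0]])   # one typical, one unlikely point

log_dens = kde_log_density(train, query, bandwidth=0.5)
scores = -log_dens                           # low density means high anomaly score
print(scores)
```

The point at (5, 5) lies in a region with almost no training data and therefore receives the larger score. Robust variants change how the training points are weighted so that unrecognized anomalies in `train` influence the estimate less.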

Part 3: Anomaly detection via isolation

The isolation forest algorithm is a tree-based approach for anomaly detection. It is based on the assumption that anomalies are rare and isolated. It has drawn a lot of attention in recent years and is considered a state-of-the-art algorithm for anomaly detection with excellent performance across a multitude of benchmarks. We use the isolation forest for network intrusion detection on the KDD99 dataset.

Partition diagram of an isolation tree

Isolation depth of a nominal point (green) and an anomaly (red) in an isolation tree

Isolation forest.
Example: Network intrusion detection with KDD99 dataset.
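
A minimal sketch of isolation-based scoring, assuming scikit-learn is installed. It uses `sklearn.ensemble.IsolationForest` on hypothetical two-dimensional toy data rather than KDD99; the ring of isolated points and all parameter values are illustrative choices.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Hypothetical toy data: a Gaussian bulk and 10 isolated points on a ring of radius 7.
nominal = rng.normal(size=(500, 2))
angles = rng.uniform(0.0, 2.0 * np.pi, size=10)
outliers = 7.0 * np.column_stack([np.cos(angles), np.sin(angles)])
X = np.vstack([nominal, outliers])

forest = IsolationForest(n_estimators=100, random_state=0).fit(X)

# score_samples returns lower values for more abnormal points,
# so we negate it to obtain "higher means more anomalous".
scores = -forest.score_samples(X)

print("mean score, nominal: ", scores[:500].mean())
print("mean score, outliers:", scores[500:].mean())
```

Points that are separated from the rest by few random splits have short average path lengths in the trees and therefore high scores.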

Part 4: Anomaly detection via reconstruction error

Anomaly detection is particularly challenging when the data is high-dimensional. The previously introduced methods suffer from the curse of dimensionality and usually degrade quickly in performance as the dimension rises above a few dozen. Auto-encoders are a class of neural networks that learn a representation of the data that is more compact than the original data. An auto-encoder trained on nominal data reconstructs nominal points with a small error, while anomalous points typically yield a larger reconstruction error, which can serve as an anomaly score. We apply auto-encoders to the MNIST dataset to detect corrupted images.

Schematic view of an auto-encoder

Auto-encoder reconstruction error

Auto-encoders
Example: Identify corrupted images in the MNIST dataset.
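
The idea of scoring by reconstruction error can be sketched without a deep learning framework. The snippet below uses a linear auto-encoder, whose optimal weights under squared loss span the principal subspace and can be obtained from an SVD, as a deliberately simple stand-in for a neural network. The synthetic data, the code dimension of 3, and the function name `reconstruction_error` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nominal data: 20-dimensional points close to a 3-dimensional subspace.
basis = rng.normal(size=(3, 20))
train = rng.normal(size=(1000, 3)) @ basis + 0.05 * rng.normal(size=(1000, 20))

mean = train.mean(axis=0)
# The top-3 right singular vectors give the weights of the optimal linear auto-encoder.
_, _, vt = np.linalg.svd(train - mean, full_matrices=False)
components = vt[:3]


def reconstruction_error(x):
    """Squared reconstruction error of each row of `x`."""
    code = (x - mean) @ components.T     # encoder: 20 -> 3
    recon = code @ components + mean     # decoder: 3 -> 20
    return ((x - recon) ** 2).sum(axis=1)


nominal_test = rng.normal(size=(5, 3)) @ basis + 0.05 * rng.normal(size=(5, 20))
corrupted = nominal_test + rng.normal(scale=1.0, size=(5, 20))  # noise off the subspace

err_nominal = reconstruction_error(nominal_test)
err_corrupted = reconstruction_error(corrupted)
print("nominal:  ", err_nominal.round(3))
print("corrupted:", err_corrupted.round(3))
```

A non-linear auto-encoder follows the same pattern: train on nominal data, then use the reconstruction error of new points as their anomaly score.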

Part 5: Anomaly detection in time series

Time series are a special type of data that is often used in anomaly detection. We discuss the challenges of anomaly detection in time series and give a bit of background in time series analysis that is useful for anomaly detection. We then introduce the SARIMA model as a simple forecasting model that can be used to detect anomalies in time series. Finally, we apply the SARIMA model to the New York taxi dataset to detect anomalies in the number of taxi rides.

Anomaly detection in time series with SARIMA

Introduction to time series analysis.
Anomaly types: Point, context and pattern anomalies.
Preprocessing techniques for anomaly detection in time series.
Anomaly detection via forecasting error: SARIMA models.
Example: Detecting anomalies in New York taxi data.
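
Detection via forecasting error follows the same recipe for any forecaster: predict, compute residuals, and flag unusually large residuals. The sketch below replaces SARIMA with a seasonal-naive forecast (each value is predicted by the value one period earlier) to keep the example dependency-free. The synthetic series, the period of 24, the injected spike, and the cut-off of 6 are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

period = 24                       # hypothetical: hourly data with a daily cycle
t = np.arange(period * 28)
series = 100 + 30 * np.sin(2 * np.pi * t / period) + rng.normal(scale=2.0, size=t.size)
series[500] += 40                 # injected point anomaly

# Seasonal-naive forecast: predict each value by the one observed a full period earlier.
forecast = series[:-period]
residual = series[period:] - forecast    # residual[i] belongs to time step i + period

# Robust z-score of the residuals based on the median and the
# median absolute deviation (MAD), so the anomaly barely affects the scale.
med = np.median(residual)
mad = np.median(np.abs(residual - med))
z = 0.6745 * (residual - med) / mad

flagged = np.flatnonzero(np.abs(z) > 6) + period
print(flagged)
```

With this forecaster the spike is expected to be flagged twice: once at its own position and once a period later, when the anomalous value is used as the forecast. A fitted SARIMA model would dampen this echo but not necessarily remove it.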

Part 6: Extreme value theory and GEV distributions

Most anomaly detection methods return a score for each data point. The score indicates how anomalous a point is. However, the algorithms usually do not provide thresholds for classifying a point as anomalous. We introduce extreme value theory (EVT) as a method for finding thresholds for anomaly detection. EVT is based on the assumption that the scores of anomalous points are significantly higher than the scores of nominal points. It estimates the tail of the score distribution and uses this to find a probabilistically interpretable threshold. We apply EVT to the New York taxi dataset to find a threshold for detecting anomalies in the number of taxi rides.

Peaks over threshold method

Fitting a GEV distribution

Relevance of EVT for anomaly detection.
GEV distributions.
Fitting GEV distributions.
Example: Find detection threshold for anomalies in New York taxi data.
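
A minimal sketch of threshold selection with a GEV distribution, assuming SciPy is installed. It fits `scipy.stats.genextreme` to block maxima of hypothetical nominal scores and reads off a high quantile as the threshold. The exponential scores, the block size of 100, and the 1% exceedance probability are illustrative assumptions.

```python
import numpy as np
from scipy.stats import genextreme

rng = np.random.default_rng(0)

# Hypothetical anomaly scores of nominal data.
scores = rng.exponential(scale=1.0, size=10_000)

# Block maxima: the largest score in each block of 100 consecutive points.
block_size = 100
maxima = scores.reshape(-1, block_size).max(axis=1)

# Maximum likelihood fit of the GEV distribution.
# Note: SciPy's shape parameter c has the opposite sign of the usual xi.
c, loc, scale = genextreme.fit(maxima)

# Threshold that the maximum of a block of nominal scores exceeds
# with a probability of only 1% under the fitted model.
threshold = genextreme.ppf(0.99, c, loc=loc, scale=scale)
print(f"shape={c:.3f}, loc={loc:.3f}, scale={scale:.3f}, threshold={threshold:.3f}")
```

The resulting threshold has a direct probabilistic reading, in contrast to a hand-picked cut-off: under the fitted model, a block of 100 nominal points produces a score above it in roughly one out of a hundred blocks.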

Prerequisites

We assume some prior exposure to machine learning and the underlying concepts.
Basic knowledge of Python is required to complete the exercises.

References

[Haw80I]

Identification of Outliers, D. M. Hawkins.

1980

Any applied statistician who has analysed a number of sets of real data is likely to have come across ‘outliers’. The intuitive definition of an outlier would be ‘an observation which deviates so much from other observations as to arouse suspicions that it was generated by a different mechanism’. An inspection of a sample containing outliers would show up such characteristics as large gaps between …

[Agg17O]

Outlier Analysis, Charu C. Aggarwal.

2017

Outliers are also referred to as abnormalities, discordants, deviants, or anomalies in the data mining and statistics literature. In most applications, the data is created by one or more generating processes, which could either reflect activity in the system or observations collected about entities. When the generating process behaves unusually, it results in the creation of outliers. Therefore, …

[Col13I]

An Introduction to Statistical Modeling of Extreme Values, Stuart Coles.

2013

Directly oriented towards real practical application, this book develops the basic theoretical framework of extreme value models and the statistical inference techniques for using these models in practice. Intended for statisticians and non-statisticians alike, the theoretical treatment is elementary, with heuristics often replacing detailed mathematical proof. Most aspects of extreme modeling …

[Cas97F]

Fitting the Generalized Pareto Distribution to Data, Enrique Castillo, Ali S. Hadi.

Dec 1997

The generalized Pareto distribution (GPD) was introduced by Pickands to model exceedances over a threshold. It has since been used by many authors to model data in several fields. The GPD has a scale parameter (σ > 0) and a shape parameter (−∞ < k < ∞). The estimation of these parameters is not generally an easy problem. When k > 1, the maximum likelihood estimates do not exist, and when k …

[Cha19D]

Deep Learning for Anomaly Detection: A Survey, Raghavendra Chalapathy, Sanjay Chawla.

Jan 2019

Anomaly detection is an important problem that has been well-studied within diverse research areas and application domains. The aim of this survey is two-fold, firstly we present a structured and comprehensive overview of research methods in deep learning-based anomaly detection. Furthermore, we review the adoption of these methods for anomaly across various application domains and assess their …

[Cha09A]

Anomaly detection: A survey, Varun Chandola, Arindam Banerjee, Vipin Kumar.

Jul 2009

Anomaly detection is an important problem that has been researched within diverse research areas and application domains. Many anomaly detection techniques have been specifically developed for certain application domains, while others are more generic. This survey tries to provide a structured and comprehensive overview of the research on anomaly detection. We have grouped existing techniques into …

[Cli11N]

Novelty Detection with Multivariate Extreme Value Statistics, David Andrew Clifton, Samuel Hugueny, Lionel Tarassenko.

Dec 2011

Novelty detection, or one-class classification, aims to determine if data are “normal” with respect to some model of normality constructed using examples of normal system behaviour. If that model is composed of generative probability distributions, the extent of “normality” in the data space can be described using Extreme Value Theory (EVT), a branch of statistics concerned with describing the …

[Dau18U]

The UCR time series classification archive, Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, Gustavo Batista.

Oct 2018

The UCR Time Series Archive - introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets but since that time, it has gone through periodic expansions. The last expansion took place in the summer of 2015 …