A Simple Way to Find Outliers in an array with Python
Using a basic definition of an outlier we can write a simple Python function to detect such values and highlight them on a plot.
In statistics, an outlier is an observation point that is distant from other observations. An outlier may be due to variability in the measurement or it may indicate experimental error.
The outlier definition is somewhat vague and subjective, so it's helpful to have a rule to help in considering if a data point truly is an outlier. One rule that is very simple to apply utilizes the interquartile range (or IQR): `IQR = Q_3 - Q_1`, where `Q_1`, `Q_3` - the lower and upper quartiles.
The rule of thumb is to define as a suspected outlier any data point outside the interval `[Q_1 - 1.5 * IQR, Q_3 + 1.5 * IQR]`.
Any potential outlier obtained by this method should be examined in the context of the entire set of data. Outliers are not necessary erroneous, sometimes they can be an indicator of something interesting in data and give a knowledge you can act on.
- Introduction of Clementine and Data Mining. Chapter 5: Outliers and Anomalous Data.
- Outlier detection, Irad Ben-Gal in Maimon O. and Rockach L. (Eds.) Data Mining and Knowledge Discovery Handbook: A Complete Guide for Practitioners and Researchers," Kluwer Academic Publishers, 2005, ISBN 0-387-24435-2.
- Lecture Notes for Chapter 10 from Introduction to Data Mining by Tan, Steinbach, Kumar.