Introduction
Anomaly simply means something unusual or abnormal. We often encounter anomalies in our daily life. It can be suspicious activities of an end-user on a network or malfunctioning of equipment. Sometimes it is vital to detect such anomalies to prevent a disaster. For example, detecting a bad user can prevent online fraud or detecting malfunctioning equipment can prevent system failure.
Classification Algorithms vs Anomaly Detection:
Machine learning provides us many techniques to classify things into classes, for example, we have algorithms like logistic regression and support vector machine for classification problems. But these algorithms fail to classify anomalous and non-anomalous problems. In a typical classification problem, we have almost an equal or a comparable number of positive and negative examples. Suppose we have a classification problem in which we have to decide whether a vehicle is a car or not. Then in our dataset the share of cars maybe 40-60%, similarly the share of non-car examples maybe 40-60%. So, we generally have a balanced amount of positive and negative examples, and we train our model on a good amount positive as well as negative examples. On the other hand, in anomaly detection problem we have a significantly lesser amount of positive (anomalous) examples than the negative (non-anomalous) examples. The positive examples may be less than 5% or even 1% (obviously that is why they are anomalous). In such case, a classification algorithm cannot be trained well on positive examples. Here comes the anomaly detection algorithm to rescue us.
Anomaly Detection Algorithm:
Anomaly detection algorithm works on probability distribution technique. Here we use Gaussian distribution to model our data. It is a bell-shaped function given by
Ɲ (µ, σ2)
µ = mean
σ2= Variance
σ = Standard Deviation
If the probability of χ is Gaussian with mean µ and variance σ2 then
χ ~ Ɲ ( µ, σ2 )
‘~’ is called tilde which is read as: “χ is distributed as“.
µ or mean describes the center of the curve.
σ or standard deviation and describes width of the curve.
The full function is as follows
p(x, µ, σ2) = (1/ (σ*sqrt(2π)))* e– (1/2)*(sqr(x-µ)/σ2)
Image Source: Wikipedia
Suppose we have data set X as follows
X = [ [ x11 x12 x13 ….. x1n ]
[ x21 x22 x23 ….. x2n ]
:
:
:
[ xm1 xm2 xm3 ….. xmn ] ]
here xij means jth feature ith example.
To find mean µ :
We do column-wise summation and divide it by a number of examples. We get a row matrix of dimension (1x n).
µ = (1/m) * ( Σmi=1 Xi ) ; ( Here subscript i represents column no.)
= [ [ µ1 µ2 µ3 ………. µn ] ]
To find variance σ2 :
σ2 = (1/m) * ( Σmi=1 (xi – µ)2 ) ; ( Here subscript i represents column no. )
In Octave/MATLAB syntax the above expression can be written as
σ2 = sum ((X- µ).^2) * (1/m)
= [ σ21 σ22 σ23 …. σ2n ]
Here we get σ2 as a row matrix of dimension (1x n).
Algorithm:
Training algorithm
- Choose features xi that you think to be indicative of anomalies.
- Split non-anomalous examples as 60% for training set, 20% for cross validation and 20% for test set.
- Split anomalous examples as 50% for cross validation set and 50% test set.
- Calculate µ and σ2 on training set.
4. Take an arbitrary value as threshold ɛ. - Check a every example x from cross validation set if it is anomalous or not calculate P(x) as follows
P(x)= Πnj=1 p(xj ; µj ; σj2 )
= Πnj=1{(1/ (σ*sqrt(2π)))* exp(- (1/2)*((xj-µj)2 /σj2))}
= p(x1 ; µ1 ; σ12 ) * p(x2 ; µ2 ; σ22 ) *p(x3 ; µ3 ; σ32 )*…
……*p(xn ; µn ; σn2 )
- If P(x) is less than threshold ɛ then it is an anomaly otherwise not.
- Calculate F1 score for current ɛ.
- Repeat steps 4,5 and 6 for different values of ɛ and select ɛ which has highest F1 score
After training, we have µ, σ2 and ɛ .
Algorithm to check whether a new example is anomaly or not
- Calculate P(x) for the new examples x as follows:
P(x)= Πnj=1 p(xj ; µj ; σj2 )
= Πnj=1{(1/ (σ*sqrt(2π)))* exp(- (1/2)*((xj-µj)2 /σj2))}
= p(x1 ; µ1 ; σ12 ) * p(x2 ; µ2 ; σ22 ) *p(x3 ; µ3 ; σ32 )*…
……*p(xn ; µn ; σn2 )
- If P(x) is less than threshold ɛ then it is an anomaly otherwise not.
F1 Score:
F1 score is an error metrics for skewed data. Skewed data is the data where either of positive and negative example is significantly large than the other (>90%).
F1 score can be given as follows:
F1 = 2*{(precision * recall)/(precision + recall) }
Where,
precision = True Positive/(True positive + False positive )
recall = True Positive/(True positive + False negative)
True Positive => When actual output is +ve and your algorithm predicts +ve
False Positive => When actual output is -ve and your algorithm predicts +ve
True Negative => When actual output is -ve and your algorithm predicts -ve
False Negative => When actual output is +ve and your algorithm predicts -ve
A good algorithm has high precision and recall. F1 tells us how well our algorithm works.
Higher the F1 score the better.
Image source: Wikipedia
When to use Anomaly Detection:
- When we have very small number of true examples.
- When we have different types of anomalies and it is difficult to learn from +ve examples what anomalies look like.
When to use Supervised Learning:
- When we have large number of both +ve and -ve examples
- When we have enough +ve examples to sense what new +ve example will look like
Choosing what features to use:
Selection of features affects how well your anomaly detection algorithm works. Select those features which are indicative of anomalies. The features you select must be gaussian. To check whether your features are gaussian or not, plot them. The plotted graph should be bell-shaped.
If your features are not gaussian then transform them into gaussian using any of the functions
- log(x)
- log(x+1)
- log(x+c)
- sqrt(x)
- x1/3