Local outlier factor from scratch in Python
Overview
Local outlier factor (LOF) is an anomaly detection algorithm which can give you an idea of how similar an item is to other items in a dataset.
The algorithm essentialy compares the density of an item's neighbourhood (how close the items in an area are to eachother) to the density of the item's neighbour's neighbourhoods.
Why use local outlier factor?
LOF is great for datasets with several clusters because it only compares an item to other items in it's locality.
This means that an item that is a large distance from it's nearest neighbours may not be classified as an outlier so long as it's neighbours are also large distances from their nearest neighbours.
This image shows how an outlier in one area of a dataset may not be an outlier in another area of the dataset:
Implenting the algorithm in Python
This is how LOF is implemented in the OpenHands project.
The algorithm is split into 3 methods corrosponding to the 3 maths functions from the algorithm's Wikipedia page.
The LOF method takes 4 arguments:
- A (the item to be classified, as a list of features)
- a list of A's nearest neighbours
- an instance of the KNN classifier
- a value for K
1. Reachability distance
The reachability distance between items A and B is the distance between B and it's Kth nearest neighbour or the distance between A and B (whichever is larger).
def reachability_distance(A, B, classifier, k):
B_n_neighbours = classifier.get_knn(B)
k_dist = max([distance.euclidean(B, n) for n in B_n_neighbours])
AB_dist = distance.euclidean(A, B)
r_distance = max(k_dist, AB_dist)
return r_distance
2. Local reachability density
Local reachability density is the inverse of the average reachability distance between A and each of it's K nearest neighbours.
def local_r_density(A, nearest_neighbours, classifier, k): # Local reachability density
r_distance_list = []
for B in nearest_neighbours:
B_r_distance = rDistance(A, B, classifier, k)
r_distance_list.append(B_r_distance)
avg_r_distance = sum(r_distance_list)/k
local_r_density = 1.0/avg_r_distance
return local_r_density
3. Local outlier factor
The local outlier factor of object A can then be defined as the average local reachability-density of each of A’s neighbours divided by the local reachability-density of A.
def local_outlier_factor(A, nearest_neighbours, classifier, k):
A_local_r_density = local_r_density(A, nearest_neighbours, classifier, k)
local_r_density_list = []
for B in nearest_neighbours:
B_n_neighbours = classifier.get_knn(B)
B_local_r_density = local_r_density(B, B_n_neighbours, classifier, k)
local_r_density_list.append(B_local_r_density)
local_r_density_total = sum(local_r_density_list)
divisor = A_local_r_density*k
local_outlier_factor = local_r_density_total/divisor
local_outlier_factor = -local_outlier_factor
return local_outlier_factor
Performance
LOF is slow because since it essentially runs KNN k²+1
times. If performance is important it may be worth looking at the local distance outlier factor algorithm instead.