Title: | Educational Outlier Package with Common Outlier Detection Algorithms |
---|---|
Description: | Provides implementations of some of the most important outlier detection algorithms. Includes a tutorial mode option that shows a description of each algorithm and provides a step-by-step execution explanation of how it identifies outliers from the given data with the specified input parameters. References include the works of Azzedine Boukerche, Lining Zheng, and Omar Alfandi (2020) <doi:10.1145/3381028>, Abir Smiti (2020) <doi:10.1016/j.cosrev.2020.100306>, and Xiaogang Su, Chih-Ling Tsai (2011) <doi:10.1002/widm.19>. |
Authors: | Andres Missiego Manjon [aut, cre], Juan Jose Cuadrado Gallego [aut] |
Maintainer: | Andres Missiego Manjon <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.0.0 |
Built: | 2025-01-19 05:15:52 UTC |
Source: | https://github.com/missiegobeats/outlierslearn |
This function implements the box & whiskers algorithm to detect outliers
boxandwhiskers(data, d, tutorialMode)
data | Input data. |
d | Degree of outlier or distance at which an event is considered an outlier. |
tutorialMode | If TRUE, the tutorial mode is activated: the algorithm includes an explanation of the theory behind the outlier detection algorithm and a step-by-step explanation of how the data is processed to obtain the outliers according to that theory. |
None, does not return any value
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
boxandwhiskers(inputData,2,FALSE) # Can be set to TRUE
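The rule behind a box-and-whiskers detector can be sketched in base R as follows. This is a minimal sketch, not the package's implementation: it assumes that `d` multiplies the interquartile range to set the limits `Q1 - d * IQR` and `Q3 + d * IQR`, it uses `stats::quantile()` with its default method instead of `quantile_outliersLearn()`, and the helper name `iqr_outliers_sketch` is hypothetical.

```r
# Hypothetical helper, not part of the package: returns the positions of the
# values that fall outside [Q1 - d * IQR, Q3 + d * IQR] in a numeric vector.
iqr_outliers_sketch <- function(x, d) {
  q1 <- unname(stats::quantile(x, 0.25))  # first quartile (default type 7)
  q3 <- unname(stats::quantile(x, 0.75))  # third quartile
  iqr <- q3 - q1
  lower <- q1 - d * iqr
  upper <- q3 + d * iqr
  which(x < lower | x > upper)
}

# Column "r" of the example data used throughout this manual
r_values <- c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14)
iqr_result <- iqr_outliers_sketch(r_values, d = 2)
print(iqr_result)  # position 7 (the value 14) lies above the upper limit
```

Under these assumptions the limits for column `r` are -1 and 11.75, so only the value 14 is flagged. The package function may treat a data.frame column by column and may compute quartiles differently.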
Outlier detection method using DBSCAN
DBSCAN_method(inputData, max_distance_threshold, min_pts, tutorialMode)
inputData | Input data (must be a data.frame). |
max_distance_threshold | Maximum Euclidean distance at which two points are considered neighbors. The distance between every pair of points is calculated, and a point is added to another point's neighbors when that distance is less than this threshold. |
min_pts | Minimum number of points required to form a dense region. |
tutorialMode | If TRUE, the tutorial mode is activated: the algorithm includes an explanation of the theory behind the outlier detection algorithm and a step-by-step explanation of how the data is processed to obtain the outliers according to that theory. |
None, does not return any value
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
eps = 4
min_pts = 3
DBSCAN_method(inputData, eps, min_pts, FALSE) # Can be set to TRUE
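How the two parameters interact can be illustrated with a short base R sketch. It is an illustration under stated assumptions, not the package's code: a point counts as its own neighbor, a core point has at least `min_pts` neighbors closer than the threshold, and a point is reported as noise when it has no core point among its neighbors. The helper name `dbscan_noise_sketch` is hypothetical.

```r
# Hypothetical helper, not part of the package: returns the row numbers of
# the points that DBSCAN would label as noise under the assumptions above.
dbscan_noise_sketch <- function(df, eps, min_pts) {
  dmat <- as.matrix(stats::dist(df))          # pairwise Euclidean distances
  is_neighbor <- dmat < eps                   # diagonal is TRUE: self counts
  is_core <- rowSums(is_neighbor) >= min_pts  # dense enough to be a core point
  # A point is reachable when at least one of its neighbors is a core point
  reachable <- rowSums(is_neighbor[, is_core, drop = FALSE]) > 0
  unname(which(!reachable))
}

input_data <- data.frame(
  r = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  d = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)
noise <- dbscan_noise_sketch(input_data, eps = 4, min_pts = 3)
print(noise)  # rows 2 and 7 have no other point within distance 4
```

Conventions differ on whether a point is counted among its own neighbors and on whether the comparison is strict, so results near the threshold can change with those choices.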
This function calculates the Euclidean distance between two points, which must have the same number of dimensions
euclidean_distance(p1, p2)
p1 | One of the points that will be used by the algorithm with N dimensions |
p2 | The other point that will be used by the algorithm with N dimensions |
Euclidean Distance calculated between the two N-dimensional points
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
point1 = inputData[1,]
point2 = inputData[4,]
distance = euclidean_distance(point1, point2)
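For reference, the Euclidean distance is the square root of the sum of squared coordinate differences. A minimal base R sketch follows; the helper name `euclidean_sketch` is hypothetical and the package function may validate or coerce its inputs differently.

```r
# Hypothetical helper, not part of the package.
euclidean_sketch <- function(p1, p2) {
  # unlist() lets the helper accept one-row data.frames as well as vectors
  p1 <- as.numeric(unlist(p1))
  p2 <- as.numeric(unlist(p2))
  stopifnot(length(p1) == length(p2))  # same number of dimensions required
  sqrt(sum((p1 - p2)^2))
}

# Rows 1 and 4 of the example data: (3, 2) and (5.2, 4.9)
dist_1_4 <- euclidean_sketch(c(3, 2), c(5.2, 4.9))
print(dist_1_4)  # sqrt(2.2^2 + 2.9^2) = sqrt(13.25)
```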
This function implements the k-nearest neighbors (KNN) algorithm for outlier detection
knn(data, d, K, tutorialMode)
data | Input data (must be a data.frame). |
d | Degree of outlier or distance at which an event is considered an outlier. |
K | The K-th nearest neighbor whose distance is compared against the degree of outlier d to decide whether an event is an outlier. |
tutorialMode | If TRUE, the tutorial mode is activated: the algorithm includes an explanation of the theory behind the outlier detection algorithm and a step-by-step explanation of how the data is processed to obtain the outliers according to that theory. |
None, does not return any value
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
knn(inputData,3,2,FALSE) # Can be changed to TRUE
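One common reading of the `d` and `K` parameters is sketched below in base R: a point is an outlier when the distance to its K-th nearest neighbor exceeds `d`. This reading is an assumption, as are the use of Euclidean distance and the exclusion of the point itself from its neighbors; the helper name `knn_outliers_sketch` is hypothetical and the package may decide differently.

```r
# Hypothetical helper, not part of the package: returns the row numbers whose
# distance to the K-th nearest neighbor is greater than d.
knn_outliers_sketch <- function(df, d, K) {
  dmat <- as.matrix(stats::dist(df))  # pairwise Euclidean distances
  diag(dmat) <- Inf                   # a point is not its own neighbor
  kth_distance <- apply(dmat, 1, function(row) sort(row)[K])
  unname(which(kth_distance > d))
}

input_data <- data.frame(
  r = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  d = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)
knn_result <- knn_outliers_sketch(input_data, d = 3, K = 2)
print(knn_result)
```

Under these assumptions rows 1, 2 and 7 are flagged, because their second-nearest neighbors are all more than 3 units away.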
Local Outlier Factor algorithm to detect outliers
lof(inputData, K, threshold, tutorialMode)
inputData | Input data (must be a data.frame). |
K | Number of the nearest neighbor used to calculate the density of each point. This value is chosen arbitrarily, and it is the responsibility of the data scientist/user to select a value adequate for the dataset. |
threshold | Value used to classify the points by comparing it with the ARD calculated for each point in the dataset. If the ARD is smaller than the threshold, the point is classified as an outlier. Otherwise, the point is classified as a normal point (inlier). |
tutorialMode | If TRUE, the tutorial mode is activated: the algorithm includes an explanation of the theory behind the outlier detection algorithm and a step-by-step explanation of how the data is processed to obtain the outliers according to that theory. |
None, does not return any value
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
lof(inputData,3,0.5,FALSE) # Can be changed to TRUE
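The comparison between the ARD and the threshold can be illustrated with a simplified density calculation. The sketch below rests on several assumptions that this manual does not confirm: ARD is taken to mean average relative density, the density of a point is the inverse of the mean Euclidean distance to its K nearest neighbors, and the ARD is that density divided by the mean density of those neighbors. The helper name `ard_sketch` is hypothetical.

```r
# Hypothetical helper, not part of the package: returns one relative-density
# score per row; low scores indicate points sparser than their neighbors.
ard_sketch <- function(df, K) {
  dmat <- as.matrix(stats::dist(df))  # pairwise Euclidean distances
  diag(dmat) <- Inf                   # a point is not its own neighbor
  n <- nrow(dmat)
  # Row numbers of the K nearest neighbors of each point
  neighbors <- lapply(seq_len(n), function(i) order(dmat[i, ])[seq_len(K)])
  # Density: inverse of the mean distance to the K nearest neighbors
  density <- sapply(seq_len(n), function(i) 1 / mean(dmat[i, neighbors[[i]]]))
  # Relative density: own density over the mean density of the neighbors
  sapply(seq_len(n), function(i) density[i] / mean(density[neighbors[[i]]]))
}

input_data <- data.frame(
  r = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  d = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)
ard <- ard_sketch(input_data, K = 3)
lof_result <- which(ard < 0.5)  # rows with a score below the threshold
print(round(ard, 2))
print(lof_result)
```

Points inside a cluster score close to or above 1, while isolated points score well below 1, which is why a smaller value marks an outlier.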
Calculates the Mahalanobis distance of a point given the sample mean and the sample covariance matrix
mahalanobis_distance(value, sample_mean, sample_covariance_matrix)
value | Point for which the Mahalanobis distance is calculated |
sample_mean | Sample mean |
sample_covariance_matrix | Sample covariance matrix |
Mahalanobis distance associated to the point
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
inputData = as.matrix(inputData)
sampleMeans = c()
for(i in 1:ncol(inputData)){
  column = inputData[,i]
  calculatedMean = sum(column)/length(column)
  print(sprintf("Calculated mean for column %d: %f", i, calculatedMean))
  sampleMeans = c(sampleMeans, calculatedMean)
}
covariance_matrix = cov(inputData)
distance = mahalanobis_distance(inputData[3,], sampleMeans, covariance_matrix)
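The textbook formula is the square root of (x - m)' S^-1 (x - m), where m is the sample mean and S the sample covariance matrix. The base R sketch below follows that formula; whether the package returns this value or its square is not stated here, so treat the square root as an assumption. The helper name `mahalanobis_sketch` is hypothetical. Note that base R's `stats::mahalanobis()` returns the squared distance.

```r
# Hypothetical helper, not part of the package.
mahalanobis_sketch <- function(value, sample_mean, sample_covariance_matrix) {
  diff <- as.numeric(value) - as.numeric(sample_mean)
  # Quadratic form (x - m)' S^-1 (x - m), reduced from a 1x1 matrix to a number
  squared <- as.numeric(t(diff) %*% solve(sample_covariance_matrix) %*% diff)
  sqrt(squared)
}

x <- cbind(
  r = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  d = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)
sample_means <- colMeans(x)
covariance_matrix <- stats::cov(x)
md_row3 <- mahalanobis_sketch(x[3, ], sample_means, covariance_matrix)
print(md_row3)
```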
Detect outliers using the Mahalanobis Distance method
mahalanobis_method(inputData, alpha, tutorialMode)
inputData | Input dataset that will be processed (with or without the step-by-step explanation) to obtain the underlying outliers. It must be a data.frame. |
alpha | Significance level alpha. This value indicates the expected proportion of outliers in the dataset. It must be in the range from 0 to 1. |
tutorialMode | If TRUE, the tutorial mode is activated: the algorithm includes an explanation of the theory behind the outlier detection algorithm and a step-by-step explanation of how the data is processed to obtain the outliers according to that theory. |
None, does not return any value
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
mahalanobis_method(inputData, 0.7, FALSE) # Can be set to TRUE
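A common way to turn a significance level into a decision rule is to compare each squared Mahalanobis distance with a chi-squared quantile. The sketch below uses that rule with `ncol(x)` degrees of freedom; this is an assumption about the general technique, not a description of how the package derives its critical value. The helper name `mahalanobis_outliers_sketch` is hypothetical.

```r
# Hypothetical helper, not part of the package: flags rows whose squared
# Mahalanobis distance exceeds the chi-squared quantile at level 1 - alpha.
mahalanobis_outliers_sketch <- function(df, alpha) {
  x <- as.matrix(df)
  # stats::mahalanobis() returns squared distances, one per row
  d2 <- stats::mahalanobis(x, center = colMeans(x), cov = stats::cov(x))
  cutoff <- stats::qchisq(1 - alpha, df = ncol(x))
  unname(which(d2 > cutoff))
}

input_data <- data.frame(
  r = c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14),
  d = c(2, 12, 4.1, 4.9, 6.1, 5.2, 5.3)
)
flagged_loose <- mahalanobis_outliers_sketch(input_data, alpha = 0.7)
flagged_strict <- mahalanobis_outliers_sketch(input_data, alpha = 0.01)
print(flagged_loose)
print(flagged_strict)
```

A larger `alpha` lowers the cutoff and flags more points. With only seven observations, a small `alpha` such as 0.01 gives a cutoff that no row can exceed, so nothing is flagged.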
Calculates the Manhattan distance between two 2D points
manhattan_dist(A, B)
A | One of the 2D points |
B | The other 2D point |
Manhattan distance calculated between point A and B
Andres Missiego Manjon
distance = manhattan_dist(c(1,2), c(3,4));
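The Manhattan distance is the sum of the absolute coordinate differences. A minimal base R sketch follows; the helper name `manhattan_sketch` is hypothetical and is not the package function.

```r
# Hypothetical helper, not part of the package.
manhattan_sketch <- function(A, B) {
  stopifnot(length(A) == length(B))  # both points need the same dimensions
  sum(abs(A - B))
}

manhattan_result <- manhattan_sketch(c(1, 2), c(3, 4))
print(manhattan_result)  # |1 - 3| + |2 - 4| = 4
```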
Calculates the mean of the given data vector
mean_outliersLearn(data)
data | Input Data that will be processed to calculate the mean. It must be a vector |
Mean of the input data
Andres Missiego Manjon
mean = mean_outliersLearn(c(2,3,2.3,7.8));
Function that obtains the 'v' quantile
quantile_outliersLearn(data, v)
data | Input data |
v | Quantile to obtain, given as a value from 0 to 1 (e.g. 0.25) |
Quantile v calculated
Andres Missiego Manjon
q = quantile_outliersLearn(c(12,2,3,4,1,13), 0.60)
Calculates the standard deviation of the input data given the mean.
sd_outliersLearn(data, mean)
data | Input Data that will be used to calculate the standard deviation. Must be a vector |
mean | Mean of the input data vector of the function. |
Standard Deviation of the input data
Andres Missiego Manjon
inputData = c(1,2,3,4,5,6,1)
mean = sum(inputData)/length(inputData)
sd = sd_outliersLearn(inputData, mean)
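The standard deviation is the square root of the mean squared deviation from the mean. This manual does not say whether the package divides by n (population) or by n - 1 (sample), so the sketch below exposes both through a hypothetical `sample` argument; the helper name `sd_sketch` is also hypothetical.

```r
# Hypothetical helper, not part of the package.
sd_sketch <- function(data, m, sample = FALSE) {
  n <- length(data)
  denominator <- if (sample) n - 1 else n  # n - 1 matches stats::sd()
  sqrt(sum((data - m)^2) / denominator)
}

values <- c(1, 2, 3, 4, 5, 6, 1)
m <- sum(values) / length(values)
sd_population <- sd_sketch(values, m)
sd_sample <- sd_sketch(values, m, sample = TRUE)
print(c(population = sd_population, sample = sd_sample))
```

The sample version is always the larger of the two, and the difference matters most for short vectors such as this one.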
Transform any type of data to a vector
transform_to_vector(data)
data | Input data that will be transformed into a vector |
Data formatted as a vector
Andres Missiego Manjon
numeric_data = c(1, 2, 3)
character_data = c("a", "b", "c")
logical_data = c(TRUE, FALSE, TRUE)
factor_data = factor(c("A", "B", "A"))
integer_data = as.integer(c(1, 2, 3))
complex_data = complex(real = c(1, 2, 3), imaginary = c(4, 5, 6))
list_data = list(1, "apple", TRUE)
data_frame_data = data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))
transformed_numeric = transform_to_vector(numeric_data)
transformed_character = transform_to_vector(character_data)
transformed_logical = transform_to_vector(logical_data)
transformed_factor = transform_to_vector(factor_data)
transformed_integer = transform_to_vector(integer_data)
transformed_complex = transform_to_vector(complex_data)
transformed_list = transform_to_vector(list_data)
transformed_data_frame = transform_to_vector(data_frame_data)
This function implements the outlier detection algorithm using standard deviation and mean
z_score_method(data, d, tutorialMode)
data | Input data that will be processed with or without the tutorial mode activated. |
d | Degree of outlier or distance at which an event is considered an outlier. |
tutorialMode | If TRUE, the tutorial mode is activated: the algorithm includes an explanation of the theory behind the outlier detection algorithm and a step-by-step explanation of how the data is processed to obtain the outliers according to that theory. |
None, does not return any value
Andres Missiego Manjon
inputData = t(matrix(c(3,2,3.5,12,4.7,4.1,5.2,4.9,7.1,6.1,6.2,5.2,14,5.3),2,7,dimnames=list(c("r","d"))))
inputData = data.frame(inputData)
z_score_method(inputData,2,FALSE) # Can be changed to TRUE
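The idea behind a mean and standard deviation detector can be sketched for a single numeric vector: a value is flagged when it lies more than `d` standard deviations from the mean. The sketch assumes the sample standard deviation from `stats::sd()` and a strict comparison; the package may use its own `mean_outliersLearn()` and `sd_outliersLearn()` helpers and may process each data.frame column separately. The helper name `z_score_outliers_sketch` is hypothetical.

```r
# Hypothetical helper, not part of the package: returns the positions of the
# values farther than d standard deviations from the mean.
z_score_outliers_sketch <- function(x, d) {
  m <- mean(x)
  s <- stats::sd(x)  # sample standard deviation (divides by n - 1)
  unname(which(abs(x - m) > d * s))
}

# Column "r" of the example data used throughout this manual
r_values <- c(3, 3.5, 4.7, 5.2, 7.1, 6.2, 14)
z_result <- z_score_outliers_sketch(r_values, d = 2)
print(z_result)  # position 7 (the value 14)
```

For this column the mean is about 6.24 and the standard deviation about 3.71, so only the value 14 lies more than two standard deviations away.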