Visualizations of the anomaly score in the neighbourhood of a specific data point

# S3 method for stranger
plot(x, type = "cluster", id = ".id", score = NULL, anomaly_id = NULL, ...)

# S3 method for fortifiedanomaly
plot(
  x,
  type = "feature_importance",
  id = ".id",
  anomaly_id = NULL,
  score = NULL,
  ...
)

# S3 method for anomalies
plot(x, type = "feature_importance", id = ".id", anomaly_id = NULL, ...)

# S3 method for singular
plot(x, type = "cluster", id = ".id", score = NULL, anomaly_id = NULL, ...)

Arguments

type

is the name of the visualization. Five types are available:
(1) A hierarchical clustering, named "cluster", showing which of the top n anomalies belong to the same cluster as a specific record. Finding the common pattern within the cluster may point to the origin of that record's score.
(2) A dot plot, named "neighbours", showing the relationship between the anomaly score and each feature for the k nearest neighbours of a specific record.
(3) A bar chart, named "feature_importance", showing how sensitive the anomaly score of a specific record is to each feature. This may help to identify the features behind the score.
(4) A dot plot, named "score_decline", showing the decrease in anomaly score among the k nearest neighbours of a specific record. The shape indicates how extreme and how frequent the anomaly score of a specific record is among its neighbours.
(5) A regression tree, named "regression_tree", showing the decision paths that lead to a high score around a specific record.

id

is the name of the column containing the record IDs

score

is the name of the column containing the anomaly score

anomaly_id

is the record ID you want to investigate

data

is an object of class data.frame, stranger or anomaly. It contains the observations: each row represents one observation and each variable is stored in one column. It must have at least one column with IDs and one column with the anomaly score for each ID.

check

logical indicating whether the object data should be checked for validity. The default is TRUE. The check is not necessary when data is known to be valid, for example when it is the direct result of stranger().

keep

character vector: names of columns to keep (filter)

drop

character vector: names of columns to drop (filter)

n.cluster

is the number of cluster groups to emphasize. This parameter should be specified only with type = "cluster".

n.anom

is the number of top anomalies to be considered. This parameter should be specified only with type = "cluster".

k

is the number of neighbours to be considered. This parameter must always be specified, except with type = "cluster".

n_label

specifies the number of data points to be labelled in the plot. This parameter should be specified only with type = "scores_decline".

Value

A plot

Details

Produces visualizations that help to understand the anomaly score locally, around a specific data point. We believe this should help people to trust the scores produced by models even if they don't fully understand them. Five visualizations are currently implemented: "cluster", "neighbours", "feature_importance", "score_decline" and "regression_tree". Each one is described under the type argument above.
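The idea behind the "cluster" visualization can be illustrated with base R alone. The sketch below is not the package's implementation: it does not call any stranger function, the anomaly score (mean distance to the 10 nearest neighbours) is an assumed stand-in, and the names features, scored, top and peers are hypothetical. It keeps the n.anom highest-scoring records, clusters them hierarchically, and lists the records that fall in the same group as the record under investigation.

```r
# Illustrative sketch only: base R, no 'stranger' functions are called.
features <- as.matrix(iris[, 1:4])

# Assumed stand-in score: mean distance to the 10 nearest neighbours.
d <- as.matrix(dist(features))
knn_mean <- apply(d, 1, function(row) mean(sort(row)[2:11]))  # position 1 is the record itself
scored <- data.frame(.id = seq_len(nrow(features)), score = knn_mean)

# Keep the n.anom highest-scoring records.
n.anom <- 50
n.cluster <- 4
top <- scored[order(scored$score, decreasing = TRUE)[seq_len(n.anom)], ]

# Hierarchical clustering restricted to those records,
# then cut the tree into n.cluster groups.
hc <- hclust(dist(features[top$.id, ]), method = "complete")
top$cluster <- cutree(hc, k = n.cluster)

# Records sharing a cluster with the record under investigation
# (here: the record with the highest score).
anomaly_id <- top$.id[1]
peers <- top$.id[top$cluster == top$cluster[top$.id == anomaly_id]]
print(peers)

# Dendrogram with the n.cluster groups outlined.
plot(hc, labels = as.character(top$.id), main = "Top anomalies")
rect.hclust(hc, k = n.cluster)
```

Comparing the feature values of the records in peers is one way to look for the common pattern mentioned above.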

Examples

# \dontrun{
data(iris)
library(dplyr)
data <- iris %>% select(-Species) %>% crazyfy()
anom1 <- data %>% strange()
result <- fortify(anom1)
# Each call below uses the plot() method documented in this topic
plot(result, type = "cluster", id = ".id", score = "knn_k_10_mean", anomaly_id = 10, n.cluster = 4, n.anom = 50)
plot(result, type = "neighbours", id = ".id", score = "knn_k_10_mean", anomaly_id = 10, k = 200)
plot(result, type = "feature_importance", id = ".id", score = "knn_k_10_mean", anomaly_id = 10, k = 100)
plot(result, type = "scores_decline", id = ".id", score = "knn_k_10_mean", anomaly_id = 10, k = 50, n_label = 10)
plot(result, type = "regression_tree", id = ".id", score = "knn_k_10_mean", anomaly_id = 10, k = 1000)
# }