Unsupervised Learning

 

The data is not labeled, and the machine tries to find groups or outliers in the data using the data's own characteristics.

Clustering

The clustering model will be trained on unlabeled data and will try to find different groups of instances (clusters). Later those clusters can be used to detect anomalies or even to label the data!

Because clustering algorithms are not biased toward any label, we get a "pure" model of the data's behavior. If the clustering model produces informative clusters, we can use them to investigate the data and understand it better. We might also find a correlation with a label, which can help us label the data.

Types of clustering algorithms:


Centroid-based clustering

The algorithm partitions the data into K clusters by minimizing the sum of distances from each instance to the center (centroid) of its cluster.

Algorithms: K-means
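
For a feel of how this looks in practice, here is a minimal K-means sketch in Python. Scikit-learn, the synthetic blobs and the choice K=3 are all assumptions made just for the example:

    import numpy as np
    from sklearn.cluster import KMeans

    rng = np.random.default_rng(42)
    # three synthetic, unlabeled blobs
    data = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(100, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(100, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(100, 2)),
    ])

    # K must be chosen up front; here we "know" there are 3 groups
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)

    print(kmeans.cluster_centers_)   # one centroid per cluster
    print(kmeans.labels_[:10])       # cluster index assigned to each instance

Each instance ends up in the cluster whose centroid is closest to it, which is exactly the quantity the algorithm minimizes.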


Density-based clustering

The algorithm constructs clusters based on how densely the instances are packed together: dense regions become clusters, and they are separated by sparser regions.

Algorithms: OPTICS, DBSCAN (Density-based spatial clustering of applications with noise)
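
A minimal DBSCAN sketch, again assuming scikit-learn; the eps (neighborhood radius) and min_samples values are arbitrary choices for this toy data:

    import numpy as np
    from sklearn.cluster import DBSCAN

    rng = np.random.default_rng(0)
    # two dense blobs plus a few scattered points that should end up as noise
    data = np.vstack([
        rng.normal(loc=(0, 0), scale=0.3, size=(100, 2)),
        rng.normal(loc=(4, 4), scale=0.3, size=(100, 2)),
        rng.uniform(low=-2, high=6, size=(10, 2)),
    ])

    # eps = neighborhood radius, min_samples = points needed to form a dense region
    db = DBSCAN(eps=0.5, min_samples=5).fit(data)

    print(set(db.labels_))                              # cluster ids; -1 marks noise
    print((db.labels_ == -1).sum(), "instances labeled as noise/outliers")

Note that the number of clusters is not given in advance, and instances in sparse regions are labeled as noise rather than forced into a cluster.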


Connectivity-based clustering – hierarchical

The algorithm builds clusters from previously generated clusters, resulting in a dendrogram – a cluster tree.

Agglomerative (bottom-up) – starts with each instance as a separate cluster and repeatedly merges the closest clusters

Divisive (top-down) – starts with all the data as a single cluster and repeatedly splits it

Algorithms: SLINK, CLINK
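
A minimal agglomerative (bottom-up) sketch using SciPy (assumed to be available); 'single' linkage corresponds to the criterion that SLINK optimizes, and cutting the tree at 2 clusters is an arbitrary choice for the toy data:

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    rng = np.random.default_rng(1)
    data = np.vstack([
        rng.normal(loc=(0, 0), scale=0.4, size=(50, 2)),
        rng.normal(loc=(3, 3), scale=0.4, size=(50, 2)),
    ])

    # bottom-up: start from single instances and repeatedly merge the closest clusters
    Z = linkage(data, method='single')        # Z encodes the dendrogram (cluster tree)

    # cut the dendrogram so that 2 flat clusters remain
    labels = fcluster(Z, t=2, criterion='maxclust')
    print(labels[:10], labels[-10:])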


Distribution-based clustering

The algorithm builds clusters from instances that appear to come from the same underlying distribution.

For example, EM – expectation–maximization:

The algorithm iterates between the following steps:

  • Expectation step – estimate, for each instance, the probability of belonging to each cluster, using a likelihood function based on the attributes
  • Maximization step – recalculate the parameters of each cluster (e.g. its mean) from those probabilities
  • Convergence test – after each iteration the change (error) between the new and the previous estimates is measured; when the change becomes small enough, the algorithm halts

Algorithms: Expectation maximization
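
A minimal sketch of distribution-based clustering with a Gaussian mixture fitted by EM, assuming scikit-learn; the two synthetic Gaussians and n_components=2 are assumptions for the example:

    import numpy as np
    from sklearn.mixture import GaussianMixture

    rng = np.random.default_rng(2)
    data = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(150, 2)),
        rng.normal(loc=(4, 1), scale=1.0, size=(150, 2)),
    ])

    # fitted with EM: E-step = per-instance cluster probabilities,
    # M-step = re-estimate each Gaussian's mean/covariance/weight,
    # repeated until the improvement between iterations drops below tol
    gm = GaussianMixture(n_components=2, tol=1e-3, max_iter=100, random_state=0).fit(data)

    print(gm.means_)                    # estimated cluster means
    print(gm.predict_proba(data[:3]))   # soft assignment: probability per cluster
    print(gm.converged_, gm.n_iter_)    # whether/when the EM iterations stopped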

 

Outlier

An outlier is an instance that lies at a large distance from the other instances.

Types of outlier detection algorithms:

Distance-based outlier

The algorithm finds outliers that are far from their neighbors and have a sparse neighborhood, while normal instances have a dense neighborhood.

Algorithms: KNN based, ROF
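
A minimal k-NN distance score, assuming scikit-learn and made-up data; k=5 is an arbitrary choice:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    rng = np.random.default_rng(3)
    normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
    outliers = np.array([[8.0, 8.0], [-7.0, 6.0]])       # far from everything else
    data = np.vstack([normal, outliers])

    k = 5
    nn = NearestNeighbors(n_neighbors=k + 1).fit(data)   # +1: each point is its own nearest neighbor
    distances, _ = nn.kneighbors(data)
    knn_score = distances[:, -1]                         # distance to the k-th real neighbor

    # the larger the k-NN distance, the sparser the neighborhood -> more outlier-like
    top = np.argsort(knn_score)[-2:]
    print("most outlying instances:", top, "scores:", knn_score[top])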


Depth-based outlier

The algorithm finds outliers on the border of the data space (normal instances lie near the center).

Algorithms: ISODEPTH, FDC
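
ISODEPTH and FDC are more involved, but the underlying idea can be illustrated with a naive convex-hull "peeling" sketch in 2-D (SciPy assumed): points on the outermost hull layers get the lowest depth and are the outlier candidates.

    import numpy as np
    from scipy.spatial import ConvexHull

    rng = np.random.default_rng(0)
    points = rng.normal(size=(200, 2))        # synthetic 2-D data
    remaining = np.arange(len(points))        # indices not yet assigned a depth
    depth = np.zeros(len(points), dtype=int)

    layer = 1
    while len(remaining) > 2:                 # need at least 3 points for a hull
        hull = ConvexHull(points[remaining])
        on_hull = remaining[hull.vertices]    # points on the current outermost layer
        depth[on_hull] = layer
        remaining = np.setdiff1d(remaining, on_hull)
        layer += 1
    depth[remaining] = layer                  # innermost leftovers

    outliers = np.where(depth == 1)[0]        # depth 1 = border of the data space
    print("instances on the border (potential outliers):", outliers)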


Deviation-based outlier

The algorithm finds outliers among the outermost points of the space; if these points are removed, the variance of the data is reduced significantly.
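
A tiny sketch of the idea on made-up 1-D data: score each instance by how much the variance drops when that instance is left out.

    import numpy as np

    rng = np.random.default_rng(4)
    data = np.concatenate([rng.normal(loc=0.0, scale=1.0, size=99), [15.0]])

    total_var = data.var()
    # variance reduction when each single instance is removed
    reduction = np.array([total_var - np.delete(data, i).var() for i in range(len(data))])

    # the instance whose removal shrinks the variance the most is the strongest outlier
    i = reduction.argmax()
    print("suspected outlier:", data[i], "variance drop:", reduction[i])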


Density-based outlier

The algorithm finds outliers whose local density is significantly different from (typically lower than) the density of their neighboring points.

Algorithms: LOF, LOCI
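
A minimal LOF sketch, assuming scikit-learn; the two groups with different densities and n_neighbors=20 are arbitrary choices for the example:

    import numpy as np
    from sklearn.neighbors import LocalOutlierFactor

    rng = np.random.default_rng(5)
    dense = rng.normal(loc=(0, 0), scale=0.3, size=(200, 2))
    sparse = rng.normal(loc=(3, 3), scale=1.5, size=(30, 2))
    odd = np.array([[0.0, 2.0]])                     # sits between the two groups
    data = np.vstack([dense, sparse, odd])

    # LOF compares each instance's local density with the density of its neighbors
    lof = LocalOutlierFactor(n_neighbors=20)
    labels = lof.fit_predict(data)                   # -1 = outlier, 1 = inlier

    print((labels == -1).sum(), "instances flagged")
    print("lowest scores (more negative = more outlying):",
          np.sort(lof.negative_outlier_factor_)[:5])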


High-dimensional outlier

The algorithm finds outliers in very sparse data with many attributes, where distances become similar and almost every instance looks far from the others, so the neighborhood-based approaches above are not effective.

For example:

ABOD – angle-based outlier detection – states that an outlier sees the other points in roughly similar directions (a small variance of the angles between them), while a normal point, surrounded by other points, sees them in many different directions.

Other Algorithms: ABOD, grid-based, SOD
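
A simplified sketch of the angle-based idea on synthetic data (the full ABOD score also weights the angles by distance): for each point, measure the variance of the angles at which it sees all other points; a small variance suggests an outlier.

    import numpy as np

    rng = np.random.default_rng(6)
    # 100 normal points in 20 dimensions plus one point far away in a corner
    data = np.vstack([rng.normal(size=(100, 20)), np.full((1, 20), 6.0)])

    def angle_variance(i, X):
        # variance of the cosines of the angles between difference vectors from point i
        diffs = np.delete(X, i, axis=0) - X[i]
        diffs = diffs / np.linalg.norm(diffs, axis=1, keepdims=True)
        cosines = diffs @ diffs.T                  # pairwise cosines of the angles
        iu = np.triu_indices_from(cosines, k=1)    # each pair counted once
        return cosines[iu].var()

    scores = np.array([angle_variance(i, data) for i in range(len(data))])
    # an outlier sees all other points in roughly the same direction -> small variance
    print("most outlying instance:", scores.argmin(), "score:", scores.min())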


Association rules

Discover rules and patterns in item-set data. Each instance is a set of items, and our goal is to find connections between the items and extract rules. The Apriori principle is used to discover these rules; it states that every subset of a frequent item-set must itself be frequent. The Apriori logic can be applied as follows:

  • Find the frequent 1-item-sets in the data (the most frequent single items), e.g. {A}, {B}, {D}
  • Find the frequent 2-item-sets built from the items of the previous step, e.g. {A, B}, {B, D}
  • Iteratively add (K+1)-item-sets based on the frequent K-item-sets, e.g. {A, B, D}
  • Generate rules from the resulting item-sets: A, B -> D | D, A -> B | B, D -> A
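
A minimal sketch using the third-party mlxtend library (an assumption, not something mentioned above) on a toy item-set; the support and confidence thresholds are arbitrary:

    import pandas as pd
    from mlxtend.preprocessing import TransactionEncoder
    from mlxtend.frequent_patterns import apriori, association_rules

    # toy item-set data: each instance is a set of items
    transactions = [
        ['A', 'B', 'D'],
        ['A', 'B', 'D'],
        ['A', 'B'],
        ['B', 'D'],
        ['A', 'C', 'D'],
    ]

    te = TransactionEncoder()
    onehot = pd.DataFrame(te.fit(transactions).transform(transactions), columns=te.columns_)

    # frequent item-sets, built level by level (1-item-sets, 2-item-sets, ...)
    frequent = apriori(onehot, min_support=0.4, use_colnames=True)

    # rules such as {A, B} -> {D}, filtered by confidence
    rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
    print(frequent)
    print(rules[["antecedents", "consequents", "support", "confidence"]])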

 

 

 


 


 
