With labeled data, the task is to learn the label and predict it for new, unseen instances. The model is trained on the labeled training data and then tested and evaluated on labeled test data.
Describes the relationship between the scalar attributes X and the continuous label Y. During training, the algorithm chooses attributes that have a linear connection with Y and calculates their coefficients, resulting in a linear function (the sum of products of each coefficient and its x value) that serves as the classifier.
For example, for a dataset of scalars with 3 attributes, the regression function for estimating y might be:
Y = 0.4*x1 + 0.1*x2 - 0.1*x3
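A minimal sketch of recovering such coefficients with a least-squares fit; the training data here is hypothetical and generated directly from the example function above.

```python
import numpy as np

# Hypothetical training data: 4 instances with 3 scalar attributes each.
X = np.array([[1.0, 2.0, 3.0],
              [2.0, 1.0, 0.0],
              [3.0, 3.0, 1.0],
              [4.0, 0.0, 2.0]])
# Labels generated from the example function y = 0.4*x1 + 0.1*x2 - 0.1*x3.
y = X @ np.array([0.4, 0.1, -0.1])

# Least-squares fit: find coefficients w that minimize ||X w - y||^2.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w)  # recovers approximately [0.4, 0.1, -0.1]
```

On real data the labels contain noise, so the fitted coefficients approximate rather than reproduce the underlying function.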
A very popular and easy-to-understand model. Each node in the tree is an "if" statement on an attribute, splitting on the attribute's values. The leaves are the classifications, and the path from root to leaf is the classification process for an instance.
There are different algorithms for constructing the tree, so the construction step can be treated as a black box. The most basic way of constructing a tree is a top-down approach, where we start from the root with all the training data.
For each attribute we calculate how informative it is about the label (via information gain / gain ratio). We choose the highest-ranked attribute to create a node and split the instances by its values.
We continue building the tree until no attributes remain, the remaining attributes are not informative, or all the instances share the same label. Finally, each leaf classifies by the majority rule (the most common instance label in the leaf).
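The attribute-ranking step above can be sketched with a small information-gain computation; the weather-style toy data here is hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Reduction in label entropy obtained by splitting on one attribute."""
    base = entropy(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return base - weighted

# Toy data: attribute 0 separates the labels perfectly, attribute 1 does not.
rows = [("sunny", "hot"), ("sunny", "cold"), ("rainy", "hot"), ("rainy", "cold")]
labels = ["yes", "yes", "no", "no"]
print(information_gain(rows, labels, 0))  # 1.0 (perfect split)
print(information_gain(rows, labels, 1))  # 0.0 (uninformative)
```

The tree builder picks the attribute with the highest gain, splits, and recurses on each branch.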
Avoiding overfitting – if the model has too many branches and classifies unseen instances poorly, we probably have overfitting. We can deal with it by modifying the algorithm: pre-pruning (not constructing unnecessary branches) or post-pruning (deleting nodes that are not statistically significant).
In the graph below, the X axis is feature no. 1, the Y axis is feature no. 2, and the color is the instance's label. By splitting the data at different X and Y values, we can separate the labels and then construct a corresponding decision tree (the decision tree graph).
Based on Bayes' theorem, these classifiers use the conditional distribution of the attributes and the label to train the model: what is the probability of a certain label given the value of an attribute?
The basic algorithm is called Naïve Bayes and is built on a probabilistic function for each attribute value for each label. When classifying a new instance, we calculate the product of these probabilities for each label and choose the label with the highest probability.
Bayes' basic conditional probability: P(A|B) = P(B|A) * P(A) / P(B)
For each instance, calculate P1 and P2 based on the training data:
- P1 – the probability that the instance is spam
- P2 – the probability that the instance is ham
If P1 > P2, classify as spam; otherwise as ham.
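The spam/ham decision rule above can be sketched as a tiny Naïve Bayes classifier; the word lists and labels here are a hypothetical toy corpus, and add-one (Laplace) smoothing is used so unseen words do not zero out the product.

```python
from collections import Counter

# Hypothetical training corpus: word lists with spam/ham labels.
train = [
    (["win", "money", "now"], "spam"),
    (["win", "prize"],        "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "tomorrow"],   "ham"),
]

labels = [label for _, label in train]
prior = {c: labels.count(c) / len(labels) for c in ("spam", "ham")}
word_counts = {c: Counter() for c in ("spam", "ham")}
for words, label in train:
    word_counts[label].update(words)
vocab = {w for words, _ in train for w in words}

def score(words, c):
    """P(c) times the product of P(word | c), with add-one smoothing."""
    total = sum(word_counts[c].values())
    p = prior[c]
    for w in words:
        p *= (word_counts[c][w] + 1) / (total + len(vocab))
    return p

msg = ["win", "money"]
p_spam, p_ham = score(msg, "spam"), score(msg, "ham")
print("spam" if p_spam > p_ham else "ham")  # → spam
```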
A more sophisticated model is the Bayesian network, which constructs a directed acyclic graph based on the probabilities: each node is a variable/possible state, and the edges are the conditional dependencies.
Instance-based learning is a very simple model, where classification is performed by choosing an instance's k nearest neighbors from the labeled data and applying the majority rule or a distance-based method. This is known as k-nearest neighbors (KNN).
In the 3-nearest-neighbors graph below, the training instances are spread over the feature space (in this case 2 features). For each instance we want to classify, we find its 3 nearest neighbors and classify it by their majority label. For example, point A will be classified as blue (2 blue, 1 red) and point B as red.
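A minimal KNN sketch over 2 features; the two clusters below are hypothetical stand-ins for the blue and red groups in the graph.

```python
import math
from collections import Counter

def knn_classify(train, point, k=3):
    """Classify by majority vote among the k nearest labeled instances."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical 2-feature training data: two well-separated clusters.
train = [((1, 1), "blue"), ((1, 2), "blue"), ((2, 1), "blue"),
         ((8, 8), "red"),  ((8, 9), "red"),  ((9, 8), "red")]

print(knn_classify(train, (2, 2)))  # → blue
print(knn_classify(train, (8, 7)))  # → red
```

Note there is no training phase at all: the model simply stores the data, which is why KNN is called a lazy classifier.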
Inspired by the biological neural network, an ANN is a network of neurons as nodes, connected by synapses as edges with weights W (w1, w2, …, wn). Each attribute of an instance is an input X (x1, x2, …, xn) connected by synapses to a neuron. Each neuron has a function f, and the output of a neuron is f(x) = K(sum(W*g(X))), where K is an activation function that normalizes the output.
To obtain a trained network, we need to determine the values of W that minimize the error. This process is called backpropagation.
Feedforward – simple network layers without cyclic connections (MLP, RBF)
Recurrent – layers can be connected back together and create a cycle (RNN)
Stochastic – derived from Bayesian networks; consists of an input layer, pattern layer, summation layer, and output layer used for classification (PNN, Boltzmann machine, RBM)
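The neuron output formula above can be sketched for a single neuron; the weights, inputs, and the sigmoid as the activation function K are hypothetical choices for illustration.

```python
import math

def sigmoid(z):
    """Activation function K: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron(weights, inputs, bias=0.0):
    """One neuron: weighted sum of the inputs passed through the activation."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)

# Hypothetical weights and inputs; in a real network, backpropagation
# would adjust the weights to minimize the error on the training data.
print(neuron([0.5, -0.3, 0.8], [1.0, 2.0, 0.5]))  # ≈ 0.574
```

A full network chains layers of such neurons, feeding each layer's outputs forward as the next layer's inputs.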
A class of algorithms that transforms the raw data into a new feature space using a kernel function that computes the inner product between pairs of instances.
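As one concrete example (my choice, not specified in the text), the RBF (Gaussian) kernel computes an inner product in an implicit high-dimensional feature space without ever constructing that space:

```python
import math

def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel: inner product of x and y in an implicit feature space."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Identical points have kernel value 1; distant points approach 0.
print(rbf_kernel((1, 2), (1, 2)))  # 1.0
print(rbf_kernel((1, 2), (5, 6)))  # close to 0
```

An SVM trained with such a kernel can separate classes that are not linearly separable in the original attribute space.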
The classifier is constructed from many weak classifiers; these are usually preferable because they are fast to compute and we do not depend on a single classifier.
Boosting – builds a strong classifier from weak classifiers. AdaBoost – a sequence of classifiers where each one focuses on the errors of the previous one.
Bagging (bootstrap aggregating) – building several classifiers on random samples of the data drawn with replacement and combining their classifications.
Random Forest – a very popular and effective ensemble classifier, constructed as follows: train T trees on bootstrap samples, then classify using majority voting.
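The bootstrap-then-vote scheme can be sketched as follows; to keep the example self-contained, a 1-nearest-neighbour "stump" is a hypothetical stand-in for training a real decision tree on each bootstrap sample.

```python
import random
from collections import Counter

def bootstrap(data, rng):
    """Sample the training set with replacement (a bootstrap replicate)."""
    return [rng.choice(data) for _ in data]

def majority_vote(predictions):
    """Combine the T individual classifications by majority rule."""
    return Counter(predictions).most_common(1)[0][0]

def train_stump(sample):
    """Hypothetical weak learner: 1-NN over the bootstrap sample."""
    def classify(x):
        nearest = min(sample, key=lambda item: abs(item[0] - x))
        return nearest[1]
    return classify

# Toy 1-feature data: class "a" clusters low, class "b" clusters high.
data = [(1, "a"), (2, "a"), (3, "a"), (7, "b"), (8, "b"), (9, "b")]
rng = random.Random(0)  # fixed seed so the run is reproducible
ensemble = [train_stump(bootstrap(data, rng)) for _ in range(11)]  # T = 11
print(majority_vote(clf(2.5) for clf in ensemble))  # → a
```

Each member sees a slightly different sample, so its errors differ; the majority vote averages those errors away.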
Supervised models comparison

| Model family | Example algorithms | Advantages | Disadvantages |
|---|---|---|---|
| Decision tree | J48 (C4.5) | Simple to understand; little data preparation; accepts different attribute types | Unbalanced data causes an unbalanced tree |
| Lazy classifiers | KNN | Easy to explain; deals with non-linearity | Model must store all the data; choosing a distance function |
| Probabilistic / Bayesian classifiers | Naïve Bayes | Good for many attributes and few instances; fast to train and apply | Conditional independence assumption |
| Artificial neural networks | MLP | | |
| Kernel-based classifiers | SVM | Deals with high dimensionality; can deal with overfitting | Needs many instances; choosing a kernel function; hard-to-understand model; memory and runtime |
| Ensemble-based classifiers | Random Forest, AdaBoost | Usually outperforms all others | |