Supervised learning works with labeled data: the task is to learn the label and predict it for new incoming instances. The model is trained on labeled data and then tested and evaluated on labeled test data.
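The train-then-evaluate workflow can be sketched as follows. This is a minimal illustration; the function names `train_test_split` and `accuracy`, the 25% test fraction, and the toy data are assumptions made for the example and are not part of the original text.

```python
import random


def train_test_split(instances, test_fraction=0.25, seed=0):
    """Shuffle the labeled instances and hold out a fraction for testing."""
    shuffled = instances[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The first n_test shuffled items become the test set, the rest the training set.
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predict, test_set):
    """Fraction of test instances whose predicted label equals the true label."""
    correct = sum(predict(x) == label for x, label in test_set)
    return correct / len(test_set)


# Toy labeled data: the label is 1 for odd numbers and 0 for even numbers.
data = [(i, i % 2) for i in range(20)]
train_set, test_set = train_test_split(data)

# A stand-in "model" that happens to know the rule, used only to show the evaluation step.
test_accuracy = accuracy(lambda x: x % 2, test_set)
```

In practice the `predict` function would come from a model fitted on `train_set` only, so that the test accuracy reflects performance on unseen instances.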
Linear Regression
Describes the relationship between the scalar attributes X and the continuous label Y. In training, the algorithm chooses attributes that have a linear connection with Y and calculates their coefficients, resulting in a linear function (the sum of the products of each coefficient and its x value) which is used as the predictor.
For example, for a dataset with 3 scalar attributes, the regression function for estimating Y is:
Y = 0.4*x1 + 0.1*x2 - 0.1*x3
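As a rough sketch of the idea, the prediction is a sum of products, and for a single attribute the coefficients can be found with the closed-form least-squares solution. The coefficients and data below are hypothetical values chosen for illustration; `predict` and `fit_simple` are names introduced here, not taken from any library.

```python
def predict(coefficients, x):
    """Linear prediction: the sum of coefficient * attribute value."""
    return sum(c * xi for c, xi in zip(coefficients, x))


def fit_simple(xs, ys):
    """Least-squares slope and intercept for a single attribute."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical coefficients for three attributes (illustration only).
example_coefficients = [0.5, 0.25, 2.0]
example_prediction = predict(example_coefficients, [2.0, 4.0, 1.0])

# Toy data lying exactly on the line y = 2x + 1.
slope, intercept = fit_simple([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

With several attributes the same least-squares idea applies, but the coefficients are usually obtained with a linear algebra routine rather than by hand.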
Decision trees
A very popular and easy-to-understand model. Each node in the tree is an “if” statement on an attribute, and the branches split the instances by the attribute's value. The leaves hold the classification, and the path from root to leaf is the classification process for an instance.
Several different algorithms can construct the tree, so the construction step can be treated as a black box. The most basic way of constructing a tree is a top-down approach, where we start from the root with all the training data.
For each attribute, we calculate how informative it is about the label (via information gain or gain ratio). We choose the highest-ranked attribute to create a node and split the instances according to its values.
We continue building the tree until there are no remaining attributes, the remaining attributes are not informative, or all the instances in a node share the same label. Finally, we use the leaves for classification by applying the majority rule (the most common label among the instances in the leaf).
Avoiding overfitting – if the model has too many branches and classifies unseen instances badly, it is probably overfitting. We can deal with overfitting by modifying the algorithm as follows: do not allow the construction of unnecessary branches, and delete nodes that are not statistically significant.
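The attribute-ranking step can be sketched as below, using entropy-based information gain. This is a minimal illustration on made-up data; the attribute names (`outlook`, `windy`) and labels are assumptions, and a full tree builder would apply `best_attribute` recursively to each split.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on one attribute."""
    total = len(labels)
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """The attribute with the highest information gain becomes the next node."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Toy data (hypothetical): the label depends only on "outlook".
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

root_attribute = best_attribute(rows, labels, ["outlook", "windy"])
```

Here `outlook` separates the labels perfectly while `windy` carries no information, so `outlook` is chosen for the root node.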
In the graph below, the X axis is feature 1, the Y axis is feature 2, and the color is the label of each case. By splitting the data at different X and Y values, we can separate the labels and then construct the corresponding decision tree (the decision tree graph).
Probabilistic/Bayesian classifiers
Based on Bayes' theorem, these classifiers use the conditional distribution of the attributes and the label to train the model: what is the probability of a certain label given the value of an attribute?
The basic algorithm is called naïve Bayes. It builds a probability function for each attribute value given each label. Then, when classifying a new instance, we calculate the product of these probabilities for each label and choose the label with the highest probability.
Bayes' basic conditional probability: P(A|B) = P(B|A) * P(A) / P(B)
For each instance, calculate P1 and P2 based on the training data:
 P1: probability for the instance to be spam
 P2: probability for the instance to be ham
If P1 > P2, classify the instance as spam; otherwise classify it as ham.
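The spam/ham comparison can be sketched as a small word-count naïve Bayes classifier. This is an illustrative sketch only: the training messages are invented, and add-one (Laplace) smoothing is an assumption added here so that unseen words do not force a probability of zero.

```python
from collections import Counter


def train(messages):
    """messages: list of (list_of_words, label). Returns the counts needed for scoring."""
    label_counts = Counter(label for _, label in messages)
    word_counts = {label: Counter() for label in label_counts}
    for words, label in messages:
        word_counts[label].update(words)
    vocabulary = {w for words, _ in messages for w in words}
    return label_counts, word_counts, vocabulary


def score(words, label, model):
    """Unnormalized P(label) * product of P(word | label), with add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    total_messages = sum(label_counts.values())
    total_words = sum(word_counts[label].values())
    p = label_counts[label] / total_messages
    for w in words:
        p *= (word_counts[label][w] + 1) / (total_words + len(vocabulary))
    return p


def classify(words, model):
    p1 = score(words, "spam", model)  # P1: score for spam
    p2 = score(words, "ham", model)   # P2: score for ham
    return "spam" if p1 > p2 else "ham"


# Hypothetical training data.
model = train([
    (["win", "money", "now"], "spam"),
    (["free", "money"], "spam"),
    (["meeting", "tomorrow"], "ham"),
    (["lunch", "tomorrow", "now"], "ham"),
])

first = classify(["free", "money"], model)
second = classify(["meeting", "tomorrow"], model)
```

The scores are not normalized probabilities, but because both share the same denominator, comparing them gives the same decision as comparing the true posteriors.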
A more sophisticated model is the Bayesian network, which constructs a directed acyclic graph based on the probabilities. Each node is a variable (an attribute or the label) and the edges are the conditional dependencies between them.
Lazy classifiers
Instance-based learning is a very simple model: classification is performed by finding the k nearest neighbors of an instance in the labeled data and applying the majority rule or a distance-based method. The best-known algorithm is k-nearest neighbors (KNN).
In the 3-nearest-neighbor graph below, the training instances are spread over the feature space (in this case 2 features). For each instance we want to classify, we find its 3 nearest neighbors and classify it by their labels. For example, point A will be classified as blue (2 blue, 1 red) and B will be classified as red.
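A 3-nearest-neighbor classification with the majority rule can be sketched as follows. The coordinates are made up to mimic the situation described (point A has 2 blue and 1 red neighbors); Euclidean distance is an assumption, and other distance functions can be substituted.

```python
import math
from collections import Counter


def knn_classify(train_points, query, k=3):
    """train_points: list of ((x, y), label). Returns the majority label of the k nearest."""
    by_distance = sorted(train_points, key=lambda item: math.dist(item[0], query))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical labeled instances in a 2-feature space.
train_points = [
    ((1.0, 1.0), "blue"),
    ((1.0, 2.0), "blue"),
    ((2.0, 2.0), "red"),
    ((5.0, 5.0), "red"),
    ((5.0, 6.0), "red"),
    ((6.0, 5.0), "red"),
]

label_a = knn_classify(train_points, (1.2, 1.5))  # neighbors: 2 blue, 1 red
label_b = knn_classify(train_points, (5.5, 5.5))  # neighbors: 3 red
```

Note that there is no training step: the "model" is the stored data itself, which is why these classifiers are called lazy.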
Artificial neural networks (ANN)
Inspired by biological neural networks, an ANN is a network constructed from neurons as nodes, connected by synapses as edges with weights W (w1, w2, …, wn). Each attribute of an instance is an input X (x1, x2, …, xn) connected by synapses to the neuron. Each neuron has a function f, and the output of a neuron is f(x) = K(sum(W*g(X))), where K is an activation function that normalizes the output.
To train the network, we need to find the W values that minimize the error. This is typically done with backpropagation, which propagates the output error backward through the network to adjust the weights.
Types of neural networks:
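The weight-fitting idea can be sketched for a single neuron, where backpropagation reduces to one application of the chain rule. Several choices here are assumptions for illustration: the sigmoid is used as the activation K, g is taken to be the identity, a bias term is added, the error is the squared error, and the learning rate and epoch count are arbitrary.

```python
import math


def sigmoid(z):
    """Activation function K: squashes the weighted sum into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(weights, bias, x):
    """Output of one neuron: K(sum(w * x) + bias)."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)


def train_neuron(samples, epochs=5000, learning_rate=0.5):
    """Gradient descent on the squared error of a single sigmoid neuron."""
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    for _ in range(epochs):
        for x, target in samples:
            out = neuron_output(weights, bias, x)
            # Chain rule: d(error)/d(weighted sum) for error = 0.5 * (out - target)^2.
            delta = (out - target) * out * (1.0 - out)
            weights = [w - learning_rate * delta * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * delta
    return weights, bias


# Toy task: learn the logical OR of two binary inputs.
or_samples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
weights, bias = train_neuron(or_samples)
```

In a multi-layer network the same `delta` term is passed backward from layer to layer, which is where the name backpropagation comes from.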
Feedforward – simple network layers without cyclic connections (MLP, RBF)
Recurrent – layers can be connected back together and create a cycle (RNN)
Stochastic – derived from Bayesian networks; consists of an input layer, pattern layer, summation layer and output layer, and is used for classification (PNN, Boltzmann machine, RBM)
Kernel based classifiers
A class of algorithms that implicitly map the raw data into a new feature space using a kernel function, which computes the inner product between pairs of instances in that space.
Support vector machine – the SVM model represents the instances as points in space, where the labels are separated by a line (more generally, a hyperplane). The kernel trick lets this separation be nonlinear in the original features.
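The kernel idea can be illustrated with a degree-2 polynomial kernel on 2-dimensional inputs: the kernel value computed directly from the raw vectors equals the inner product of explicitly transformed feature vectors. The function names and the choice of kernel are assumptions for this sketch; an SVM would use such kernel values without ever building the transformed vectors.

```python
import math


def poly2_kernel(a, b):
    """Degree-2 polynomial kernel: the squared inner product of the raw vectors."""
    return (a[0] * b[0] + a[1] * b[1]) ** 2


def poly2_features(v):
    """Explicit feature map whose inner product equals poly2_kernel."""
    return (v[0] ** 2, math.sqrt(2) * v[0] * v[1], v[1] ** 2)


def inner_product(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def gram_matrix(instances, kernel):
    """Kernel values between all pairs of instances."""
    return [[kernel(a, b) for b in instances] for a in instances]


a, b = (1.0, 2.0), (3.0, 4.0)
via_kernel = poly2_kernel(a, b)
via_features = inner_product(poly2_features(a), poly2_features(b))
matrix = gram_matrix([a, b], poly2_kernel)
```

Both routes give the same number, but the kernel route never leaves the original 2-dimensional space, which is what makes high-dimensional feature spaces affordable.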
Ensemble based classifiers
The classifier is constructed from many weak classifiers. Ensembles are usually preferable because the individual classifiers are fast to compute and the result does not depend on any single classifier.
Types of ensemble classifiers:
Boosting – builds a strong classifier from weak classifiers. For example, AdaBoost trains a sequence of classifiers where each one focuses on the errors of the previous ones
Bagging (bootstrap aggregating) – builds several classifiers on bootstrap samples of the data and combines their classifications
Random Forest – a very popular and effective ensemble classifier, constructed as follows: train T decision trees, each on a bootstrap sample, then classify using majority voting.
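Bagging with majority voting can be sketched with very weak base classifiers. In this illustration each base classifier is a one-feature threshold ("stump") that predicts 1 when the value is at or above the threshold; that simplification, the toy data, and the number of classifiers are all assumptions. A random forest follows the same pattern with full decision trees in place of the stumps.

```python
import random
from collections import Counter


def train_stump(sample):
    """Choose the threshold (from the sample's own values) with the fewest errors."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x >= threshold) != bool(label) for x, label in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging(data, n_classifiers=25, seed=0):
    """Train one stump per bootstrap sample (sampling with replacement)."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_classifiers):
        bootstrap = [rng.choice(data) for _ in data]
        stumps.append(train_stump(bootstrap))
    return stumps


def predict(stumps, x):
    """Majority vote over all the stumps."""
    votes = Counter(int(x >= threshold) for threshold in stumps)
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): low values are class 0, high values are class 1.
data = [(1.0, 0), (2.0, 0), (3.0, 0), (6.0, 1), (7.0, 1), (8.0, 1)]
stumps = bagging(data)

low_prediction = predict(stumps, 0.5)
high_prediction = predict(stumps, 9.0)
```

Using an odd number of classifiers avoids tied votes in the two-class case.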
Supervised models comparison
Model | Algorithms | Pros | Cons
Regression | Linear regression | Fast; widely accepted | Less accurate; cannot capture nonlinear connections; only numeric
Decision tree | J48, C4.5 | Simple to understand; little data preparation; accepts different attribute types | Overfitting; unstable; unbalanced data causes an unbalanced tree
Lazy classifiers | KNN | Easy to explain; deals with nonlinearity | Overfitting; choosing K; model must store all data; choosing the distance function; only numeric
Probabilistic / Bayesian classifiers | Naïve Bayes, Bayesian network | Good for many attributes and few instances; fast to train and apply | Conditional independence assumption; less accurate handling
Artificial neural networks | MLP, RNN, RBM | Nonlinearity | Overfitting; complicated; slow training; only numeric
Kernel based classifiers | SVM | Deals with high dimensionality; can deal with overfitting | Needs many instances; choosing the kernel function; hard-to-understand model; memory and runtime; only numeric
Ensemble based classifiers | Random Forest, AdaBoost | Usually outperforms all others; statistically better | Overfitting; complicated