Supervised Learning


In supervised learning the data is labeled: the task is to learn the label and predict it for new, incoming instances. The model is trained on labeled data and then tested and evaluated on separate labeled test data.

[Figure: supervised learning, 4 layers]

Linear Regression

Regression describes the relationship between the scalar attributes X and a continuous label Y. During training, the algorithm chooses the attributes that have a linear connection with Y and calculates their coefficients. The result is a linear function (the sum product of the coefficients and the x values), which serves as the classifier.

For example, for a dataset of scalars with 3 attributes, the regression function for estimating y is:

Y = 0.4*x1 + 0.1*x2 - 0.1*x3
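The coefficient calculation can be sketched with an ordinary least-squares fit. This is a minimal illustration, not necessarily the method the post has in mind: the synthetic data, the noise level and the helper name `predict` are assumptions made for the example, and NumPy is assumed to be available.

```python
import numpy as np

# Hypothetical data: 200 instances with 3 scalar attributes each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))

# Labels generated from the example function above, plus a little noise.
true_coef = np.array([0.4, 0.1, -0.1])
y = X @ true_coef + rng.normal(scale=0.01, size=200)

# Least-squares fit: find the coefficients that minimize the squared error.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(x):
    """Estimate y as the sum product of the fitted coefficients and x."""
    return float(np.dot(coef, x))

print(np.round(coef, 2))
print(predict([1.0, 1.0, 1.0]))
```

With enough instances and little noise, the fitted coefficients should land close to 0.4, 0.1 and -0.1.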

[Figure: regression classifier]


Decision trees

A very popular and easy-to-understand model. Each node in the tree is an “if” statement on an attribute, and its branches split the instances by that attribute's value. The leaves are the classifications, and the path from the root to a leaf is the classification process for an instance.

Several different algorithms can construct the tree, so the construction algorithm can be treated as a black box. The most basic way of constructing a tree is a top-down approach, where we start at the root with all the training data.

For each attribute we calculate how informative it is about the label (via information gain or gain ratio). We choose the highest-ranked attribute to create a node and split the instances by its values.

We continue building the tree until there are no remaining attributes, the remaining attributes are not informative, or all the instances in a node share the same label. Finally, the leaves classify by majority rule: each leaf predicts the most common label among its training instances.
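The top-down construction described here can be sketched as a small ID3-style builder. This is a simplified illustration that uses information gain only (no gain ratio, no pruning). The toy weather data and all function names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Entropy reduction obtained by splitting on attribute `attr`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attr], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

def majority(labels):
    """Most common label (the majority rule used in the leaves)."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attrs):
    # Stop: all instances share one label, or no attributes remain.
    if len(set(labels)) == 1 or not attrs:
        return majority(labels)
    best = max(attrs, key=lambda a: information_gain(rows, labels, a))
    # Stop: even the best attribute is not informative.
    if information_gain(rows, labels, best) <= 0:
        return majority(labels)
    node = {"attr": best, "children": {}, "default": majority(labels)}
    rest = [a for a in attrs if a != best]
    for value in {row[best] for row in rows}:
        idx = [i for i, row in enumerate(rows) if row[best] == value]
        node["children"][value] = build_tree(
            [rows[i] for i in idx], [labels[i] for i in idx], rest)
    return node

def classify(tree, row):
    """Follow the path from the root to a leaf."""
    while isinstance(tree, dict):
        tree = tree["children"].get(row[tree["attr"]], tree["default"])
    return tree

# Hypothetical toy data: two categorical attributes, one label.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
    {"outlook": "overcast", "windy": "no"},
    {"outlook": "overcast", "windy": "yes"},
]
labels = ["play", "play", "play", "stay", "play", "play"]

tree = build_tree(rows, labels, ["outlook", "windy"])
print(classify(tree, {"outlook": "rainy", "windy": "yes"}))
```

On this data "outlook" has the higher information gain, so it becomes the root, and only the "rainy" branch needs a second split on "windy". An unseen attribute value falls back to the majority label stored in the node.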

Avoiding overfitting – if the model has too many branches and classifies unseen instances badly, it is probably overfitting. We can deal with overfitting by modifying the algorithm in two ways: not allowing the construction of unnecessary branches (pre-pruning), or deleting nodes that are not statistically significant after the tree is built (post-pruning).

In the graph below, the X axis is feature no. 1, the Y axis is feature no. 2, and the color is the label of each case. By splitting the data at different X and Y values, we can separate the labels and then construct the corresponding decision tree (the decision tree graph).

[Figure: linear classifier]
[Figure: decision tree classifier]

Probabilistic/Bayesian classifiers

Based on Bayes' theorem, these classifiers train the model using the conditional distribution of the attributes and the label: what is the probability of a certain label given the value of an attribute?

The basic algorithm is called naïve Bayes. It builds a probability function for each attribute value under each label. When classifying a new instance, we calculate, for each label, the product of these probabilities over the instance's attribute values and choose the label with the highest probability.

Bayes' basic conditional probability:

P(A|B) = P(B|A) * P(A) / P(B)

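For each instance, calculate P1 and P2 from the training data, where x1 ... xn are the instance's attribute values (both scores omit the shared denominator, which does not affect the comparison):

  • P1 - probability that the instance is spam:

    P1 = P(Spam) * P(x1|Spam) * ... * P(xn|Spam)

  • P2 - probability that the instance is ham:

    P2 = P(Ham) * P(x1|Ham) * ... * P(xn|Ham)

If P1 > P2, classify the instance as spam; otherwise classify it as ham.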
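The spam/ham comparison can be sketched as a tiny word-count naïve Bayes classifier. Several things here are assumptions added for the example rather than part of the description above: the attributes are taken to be the words of a message, add-one (Laplace) smoothing keeps unseen words from zeroing the product, and log probabilities are summed instead of multiplying raw probabilities. The training messages and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train(messages):
    """messages: list of (list_of_words, label) pairs."""
    label_counts = Counter(label for _, label in messages)
    word_counts = defaultdict(Counter)
    vocab = set()
    for words, label in messages:
        word_counts[label].update(words)
        vocab.update(words)
    return label_counts, word_counts, vocab

def log_score(words, label, model):
    """log( P(label) * product of P(word | label) ) with add-one smoothing."""
    label_counts, word_counts, vocab = model
    total_msgs = sum(label_counts.values())
    score = math.log(label_counts[label] / total_msgs)  # prior P(label)
    total_words = sum(word_counts[label].values())
    for w in words:
        p = (word_counts[label][w] + 1) / (total_words + len(vocab))
        score += math.log(p)
    return score

def classify(words, model):
    """Choose the label with the highest score (P1 vs. P2 in the text)."""
    return max(model[0], key=lambda label: log_score(words, label, model))

# Hypothetical training data.
training = [
    ("win money now".split(), "spam"),
    ("free money offer".split(), "spam"),
    ("meeting at noon".split(), "ham"),
    ("lunch at noon tomorrow".split(), "ham"),
]
model = train(training)

print(classify("free money".split(), model))
print(classify("lunch at noon".split(), model))
```

Summing logarithms gives the same ranking as multiplying the probabilities, and it avoids numeric underflow when an instance has many attributes.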

A more sophisticated model is the Bayesian network, which represents the probabilities as a directed acyclic graph. Each node is a variable (an attribute or the label) and each edge is a conditional dependency.

Lazy classifiers

Instance-based learning is a very simple model: an instance is classified by finding its k nearest neighbors in the labeled data and combining their labels by majority rule or a distance-based method. The best-known example is k-nearest neighbors (KNN).

In the 3-nearest-neighbor graph below, the training instances are spread over the feature space (in this case 2 features). For each instance we want to classify, we find its 3 nearest neighbors and classify it by their majority label. For example, point A is classified as blue (2 blue, 1 red) and point B is classified as red.
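A 3-nearest-neighbor classifier with majority rule fits in a few lines. The coordinates below are made up to mimic the situation described (they are not read from the figure), and Euclidean distance is an assumed choice of distance function.

```python
import math
from collections import Counter

def knn_classify(train, point, k=3):
    """Label `point` by majority vote among its k nearest training instances."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical labeled instances in a 2-feature space.
train = [
    ((1.0, 1.0), "blue"), ((1.5, 2.0), "blue"), ((2.0, 1.0), "blue"),
    ((3.0, 1.5), "red"),
    ((6.0, 5.0), "red"), ((7.0, 6.0), "red"), ((6.5, 7.0), "red"),
]

A = (2.2, 1.4)  # its 3 nearest neighbors are 2 blue and 1 red
B = (6.4, 6.0)  # its 3 nearest neighbors are all red

print(knn_classify(train, A))
print(knn_classify(train, B))
```

There is no training step: the "model" is the stored data itself, which is why these classifiers are called lazy.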

[Figure: 3NN lazy classifier]

Artificial neural networks (ANN)

Inspired by biological neural networks, an ANN is a network of neurons (the nodes) connected by synapses (the edges), each with a weight W (w1, w2, …, wn). Each attribute of an instance is an input X (x1, x2, …, xn) connected by a synapse to the neuron. Each neuron has a function f, and its output is f(x) = K(sum(W*g(X))), where K is an activation function that normalizes the output.
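The output of a single neuron can be sketched as follows. Two assumptions are made for the example: g is taken to be the identity, and K is taken to be the sigmoid function, which squashes the weighted sum into the range (0, 1). The weights and inputs are arbitrary illustrative values.

```python
import math

def sigmoid(z):
    """Activation function K: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(weights, inputs, activation=sigmoid):
    """f(x) = K(sum(w_i * x_i)), with g assumed to be the identity."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return activation(weighted_sum)

# Hypothetical weights and one instance with 3 attributes.
weights = [0.5, -0.25, 0.1]
inputs = [1.0, 2.0, 3.0]

out = neuron_output(weights, inputs)
print(out)
```

Here the weighted sum is 0.5 - 0.5 + 0.3 = 0.3, which the sigmoid maps to a value slightly above 0.5.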

To get a trained network we need to find the W values that minimize the error. This is usually done by gradient descent, with the error gradients computed by a process called backpropagation.
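For a single neuron, finding the weights reduces to one application of the chain rule, which is the simplest case of what backpropagation does layer by layer. The sketch below trains one sigmoid neuron on the logical OR function. The squared-error loss, learning rate, epoch count and data set are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, inputs):
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)))

# Hypothetical training set: logical OR. The third input is fixed to 1.0
# so that its weight acts as a bias term.
data = [
    ((0.0, 0.0, 1.0), 0.0),
    ((0.0, 1.0, 1.0), 1.0),
    ((1.0, 0.0, 1.0), 1.0),
    ((1.0, 1.0, 1.0), 1.0),
]

weights = [0.0, 0.0, 0.0]
rate = 0.5  # learning rate (assumed value)

for epoch in range(2000):
    for inputs, target in data:
        out = forward(weights, inputs)
        # Chain rule for the error 0.5 * (out - target)^2:
        # d(error)/d(w_i) = (out - target) * out * (1 - out) * x_i
        delta = (out - target) * out * (1.0 - out)
        weights = [w - rate * delta * x for w, x in zip(weights, inputs)]

predictions = [1 if forward(weights, inputs) > 0.5 else 0 for inputs, _ in data]
print(predictions)
```

In a multi-layer network the same error signal is propagated backwards from the output layer, so every weight receives its own gradient.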

[Figure: simple neural network with one neuron]


Types of neural networks:

Feedforward – simple network layers without cyclic connections (MLP, RBF)

[Figure: ANN feedforward]

Recurrent – layers can be connected back together and create a cycle (RNN)


Stochastic – derived from Bayesian networks; the PNN, for example, consists of an input layer, a pattern layer, a summation layer and an output layer, and is used for classification (PNN, Boltzmann machine, RBM)


Kernel based classifiers

A class of algorithms that implicitly map the raw data into a new feature space using a kernel function, which computes the inner product between pairs of instances in that space.

Support vector machine – the SVM model represents the instances as points in space and separates the labels with a line (in general, a hyperplane); the kernel trick lets it find boundaries that are non-linear in the original features.
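The kernel trick can be checked numerically on a small case. For the degree-2 polynomial kernel k(a, b) = (a·b)^2 on 2-D inputs, the kernel value equals the inner product of the explicitly transformed vectors, so the transformation never has to be computed. The choice of kernel and the sample points are assumptions for illustration. This shows only the kernel identity, not a full SVM.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def phi(v):
    """Explicit feature map matching the degree-2 polynomial kernel in 2-D."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """k(a, b) = (a . b)^2, computed in the original 2-D space."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(a), phi(b))  # inner product in the 3-D feature space
implicit = poly_kernel(a, b)    # same value without building phi

print(explicit, implicit)
```

Both routes give 16 for these points (up to floating-point rounding). For kernels whose feature space is very large or infinite, only the implicit route is practical.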

[Figure: SVM classifier]

Ensemble based classifiers

The classifier is constructed from many weak classifiers. Such ensembles are often preferable because the individual classifiers are fast to compute and the result does not depend on any single classifier.

Types of ensemble classifiers:

Boosting – builds a strong classifier from weak classifiers. Example: AdaBoost, a sequence of classifiers in which each one focuses on the errors of the previous ones

Bagging – bootstrap aggregating: building several classifiers on bootstrap samples of the data (random samples drawn with replacement) and combining their classifications

Random Forest – a very popular and effective ensemble classifier, constructed as follows: train T trees, each on a bootstrap sample of the data, then classify using majority voting.
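Bootstrap sampling and majority voting can be sketched with very small trees. To keep the example short, each "tree" is a one-split decision stump on a single numeric feature, so this illustrates bagging rather than a full random forest (which also picks a random subset of attributes at each split). The data, the number of models and all function names are hypothetical.

```python
import random
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def train_stump(sample):
    """Pick the threshold (taken from the sample's x values) with fewest errors."""
    best = None
    for t in sorted({x for x, _ in sample}):
        left = [y for x, y in sample if x < t]
        right = [y for x, y in sample if x >= t]  # never empty: t is in the sample
        right_label = majority(right)
        left_label = majority(left) if left else right_label
        errors = sum((left_label if x < t else right_label) != y
                     for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, t, left_label, right_label)
    _, t, left_label, right_label = best
    return lambda x: left_label if x < t else right_label

def bagging(data, n_models, rng):
    """Train each model on a bootstrap sample (drawn with replacement)."""
    models = []
    for _ in range(n_models):
        sample = [rng.choice(data) for _ in data]
        models.append(train_stump(sample))
    return models

def predict(models, x):
    """Combine the individual classifications by majority voting."""
    return majority([m(x) for m in models])

# Hypothetical 1-feature data: label 0 for x in 0..9, label 1 for x in 10..19.
data = [(float(x), 0) for x in range(10)] + [(float(x), 1) for x in range(10, 20)]

rng = random.Random(0)  # fixed seed so the run is repeatable
models = bagging(data, 25, rng)

print(predict(models, 2.0))
print(predict(models, 19.0))
```

Because every stump sees a slightly different sample, their thresholds differ, and the vote smooths out the mistakes of any single one.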

[Figure: random forest classifier]


Supervised models comparison

| Model | Algorithms | Pros | Cons |
|---|---|---|---|
| Regression | Linear regression | Fast; widely accepted | Less accurate; cannot handle non-linear connections; only numeric |
| Decision tree | J48 (C4.5) | Simple to understand; little data preparation; accepts different attribute types | Unbalanced data causes an unbalanced tree |
| Lazy classifiers | KNN | Easy to explain; deals with non-linearity | Choosing K; model must store all the data; choosing the distance function; only numeric |
| Probabilistic / Bayesian classifiers | Naïve Bayes; Bayesian network | Good for many attributes and few instances; fast to train and apply | Conditional independence assumption; less accurate |
| Artificial neural networks | MLP | | Slow training; only numeric |
| Kernel based classifiers | SVM | Deals with high dimensionality; can deal with overfitting | Needs many instances; choosing the kernel function; hard-to-understand model; memory and runtime; only numeric |
| Ensemble based classifiers | Random Forest, AdaBoost | Usually outperforms all others; statistically better | |





