# Supervised Learning

Given labeled data, the task is to learn the label and predict it on new, incoming instances. The model is trained on labeled data and then tested and evaluated on separate labeled test data.

## Linear Regression

A regression model describes the relationship between the scalar attributes X and the continuous label Y. In training, the algorithm chooses the attributes that have a linear connection with Y and calculates their coefficients, resulting in a linear function (the sum of each coefficient multiplied by its x value) which serves as the predictor.

For example, for a dataset of scalars with 3 attributes, the regression function for estimating Y could be:

Y = 0.4*x1 + 0.1*x2 - 0.1*x3
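To make the example function concrete, here is a minimal sketch that evaluates it for one instance. The function name `predict` and the instance values are illustrative assumptions; only the coefficients come from the example above.

```python
def predict(x, coefficients):
    """Return the weighted sum of the attribute values (the regression estimate)."""
    if len(x) != len(coefficients):
        raise ValueError("x and coefficients must have the same length")
    return sum(c * v for c, v in zip(coefficients, x))

# Coefficients taken from the example function above.
COEFFICIENTS = [0.4, 0.1, -0.1]

# Hypothetical instance with x1=1.0, x2=2.0, x3=3.0.
estimate = predict([1.0, 2.0, 3.0], COEFFICIENTS)
print(round(estimate, 3))
```

In practice the coefficients are not chosen by hand but fitted to the training data, typically by least squares.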

## Decision trees

A very popular and easy-to-understand model. Each node in the tree is an “if” statement on an attribute, and the branches split the instances by that attribute's value. The leaves are the classifications, and the path from root to leaf is the classification process for an instance.

There are different algorithms for constructing the tree, so the construction step can be treated as a black box. The most basic way of constructing a tree is a top-down approach, where we start from the root with all the training data.

For each attribute we calculate how informative it is about the label (via information gain or gain ratio). We choose the highest-ranked attribute to create a node and split the instances according to its values.
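The ranking step can be sketched as follows, using information gain: the entropy of the labels minus the weighted entropy after splitting on an attribute. The toy data and attribute names (`outlook`, `windy`) are hypothetical, and instances are assumed to be dictionaries of categorical values.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy of the labels minus the weighted entropy after splitting on `attribute`."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

# Hypothetical toy data: each row is a dict of attribute values.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

gains = {a: information_gain(rows, labels, a) for a in ("outlook", "windy")}
best = max(gains, key=gains.get)  # the attribute chosen for the node
```

Here `outlook` separates the two labels perfectly while `windy` tells us nothing, so `outlook` is ranked highest.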

We continue building the tree until there are no remaining attributes, the remaining attributes are not informative, or all the instances in a node share the same label. Finally, we use the leaves for classification by applying the majority rule (the most common label among the instances in the leaf).
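The whole top-down procedure can be sketched as a short recursive function. This is a simplified illustration in the spirit of ID3, not a full C4.5 implementation: it assumes categorical attributes stored in dictionaries, labels that are not tuples, and the hypothetical helper names `build_tree` and `classify`.

```python
import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def majority(labels):
    """Most common label in a list (the majority rule)."""
    return Counter(labels).most_common(1)[0][0]

def build_tree(rows, labels, attributes):
    """Return a leaf label, or a node of the form (attribute, {value: subtree})."""
    # Stop: all instances share one label, or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return majority(labels)

    def gain(attribute):
        groups = {}
        for row, label in zip(rows, labels):
            groups.setdefault(row[attribute], []).append(label)
        remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
        return entropy(labels) - remainder

    best = max(attributes, key=gain)
    # Stop: even the best attribute is not informative.
    if gain(best) < 1e-12:
        return majority(labels)

    remaining = [a for a in attributes if a != best]
    children = {}
    for value in set(row[best] for row in rows):
        subset = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
        children[value] = build_tree([r for r, _ in subset],
                                     [l for _, l in subset],
                                     remaining)
    return (best, children)

def classify(tree, row, default=None):
    """Follow the path from the root to a leaf for one instance."""
    while isinstance(tree, tuple):
        attribute, children = tree
        if row[attribute] not in children:
            return default  # attribute value never seen in training
        tree = children[row[attribute]]
    return tree

# Hypothetical toy data.
rows = [
    {"outlook": "sunny", "windy": "no"},
    {"outlook": "sunny", "windy": "yes"},
    {"outlook": "rainy", "windy": "no"},
    {"outlook": "rainy", "windy": "yes"},
]
labels = ["play", "play", "stay", "stay"]

tree = build_tree(rows, labels, ["outlook", "windy"])
prediction = classify(tree, {"outlook": "sunny", "windy": "yes"})
```

The three stopping conditions in `build_tree` correspond directly to the three conditions listed above.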

Avoiding overfitting – if the model has too many branches and classifies unseen instances poorly, it is probably overfitting. We can deal with overfitting by modifying the algorithm in two ways: not allowing the construction of unnecessary branches (pre-pruning), or deleting nodes that are not statistically significant (post-pruning).

In the graph below, the X axis is feature no. 1, the Y axis is feature no. 2, and the color is the label of each case. By splitting the data at different X and Y values, we can separate the labels and then construct the corresponding decision tree (the decision tree graph).

## Probabilistic/Bayesian classifiers

Based on Bayes' theorem, these classifiers use the conditional distribution of the attributes and the label to train the model: what is the probability of a certain label given the value of an attribute?

The basic algorithm is called naive Bayes. It estimates, for each label, the probability of each attribute value. Then, when classifying a new instance, we calculate for each label the product of the probabilities of the instance's attribute values and choose the label with the highest probability.

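Bayes' basic conditional probability:

P(label | x) = P(x | label) * P(label) / P(x)

For each instance with attribute values x1, …, xn, calculate P1 and P2 based on the training data:

• P1 – probability for the instance to be spam: P(spam) * P(x1 | spam) * … * P(xn | spam)

• P2 – probability for the instance to be ham: P(ham) * P(x1 | ham) * … * P(xn | ham)

If P1 > P2, classify the instance as spam; otherwise classify it as ham.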
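The spam/ham comparison can be sketched as below. This is a minimal illustration, assuming each instance is a list of words and using add-one (Laplace) smoothing so that unseen words do not zero out the product. The smoothing, the function names, and the toy messages are assumptions, not part of the description above.

```python
from collections import Counter, defaultdict

def train(messages, labels):
    """Count messages per label and word occurrences per label."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for words, label in zip(messages, labels):
        word_counts[label].update(words)
    vocabulary = {w for counts in word_counts.values() for w in counts}
    return label_counts, word_counts, vocabulary

def score(words, label, model):
    """P(label) times the product of P(word | label), with add-one smoothing."""
    label_counts, word_counts, vocabulary = model
    probability = label_counts[label] / sum(label_counts.values())
    total_words = sum(word_counts[label].values())
    for word in words:
        probability *= (word_counts[label][word] + 1) / (total_words + len(vocabulary))
    return probability

def classify(words, model):
    p1 = score(words, "spam", model)  # P1: score for spam
    p2 = score(words, "ham", model)   # P2: score for ham
    return "spam" if p1 > p2 else "ham"

# Hypothetical training data.
messages = [
    ["win", "money", "now"],
    ["cheap", "money"],
    ["meeting", "at", "noon"],
    ["lunch", "at", "noon"],
]
labels = ["spam", "spam", "ham", "ham"]

model = train(messages, labels)
first = classify(["win", "money"], model)
second = classify(["lunch", "noon"], model)
```

The scores are proportional to the posterior probabilities rather than equal to them, because the shared denominator P(x) is dropped; this does not change which label wins. For long messages, real implementations usually sum logarithms instead of multiplying probabilities to avoid numeric underflow.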

A more sophisticated model is the Bayesian network, which constructs a directed acyclic graph based on the probabilities. Each node is a variable (an attribute or the label) and the edges are the conditional dependencies between them.

## Lazy classifiers

Instance-based learning is a very simple model, where classification is performed by finding the k nearest neighbors of the instance in the labeled data and applying the majority rule or a distance-based method. The best-known example is k-nearest neighbors (KNN).

In the 3-nearest-neighbor graph below, the training instances are spread over the feature space (in this case 2 features). For each instance we want to classify, we find its 3 nearest neighbors and classify it by their majority label. For example, point A will be classified as blue (2 blue, 1 red) and point B will be classified as red.
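A minimal sketch of 3-nearest-neighbor classification with the majority rule follows. Euclidean distance is assumed as the distance function, and the coordinates are hypothetical values chosen to mirror the A and B example.

```python
import math
from collections import Counter

def knn_classify(point, training_points, training_labels, k=3):
    """Label `point` by majority vote among its k nearest training instances."""
    distances = sorted(
        (math.dist(point, p), label)
        for p, label in zip(training_points, training_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical training instances in a 2-feature space.
training_points = [(1.0, 1.0), (1.5, 2.0), (2.0, 1.0),
                   (6.0, 6.0), (6.5, 7.0), (7.0, 6.0)]
training_labels = ["blue", "blue", "red", "red", "red", "blue"]

label_a = knn_classify((1.5, 1.2), training_points, training_labels)  # 2 blue, 1 red
label_b = knn_classify((6.5, 6.2), training_points, training_labels)  # 2 red, 1 blue
```

Note that there is no training step: the model is the stored data itself, which is why these classifiers are called lazy.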

## Artificial neural networks (ANN)

Inspired by biological neural networks, an ANN is a network constructed from neurons as nodes, connected by synapses as edges with weights W (w1, w2, …, wn). Each attribute of an instance is an input X (x1, x2, …, xn) connected by synapses to a neuron. Each neuron has a function f, and the output of a neuron is f(x) = K(sum(W*g(X))), where K is an activation function that normalizes the output.
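The output of a single neuron can be sketched as follows. The text does not define g or fix a particular K, so this example assumes g is the identity function and K is the sigmoid; both are illustrative choices and can be swapped out.

```python
import math

def sigmoid(z):
    """A common activation function that squashes any value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(x, w, g=lambda v: v, k=sigmoid):
    """Compute K(sum(W * g(X))) for one neuron.

    g defaults to the identity and k to the sigmoid (assumptions of this sketch).
    """
    return k(sum(wi * g(xi) for wi, xi in zip(w, x)))

# Hypothetical inputs and weights: the weighted sum is 0.5*1.0 - 0.25*2.0 = 0.
output = neuron_output([1.0, 2.0], [0.5, -0.25])
print(output)  # 0.5, since sigmoid(0) = 0.5
```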

To get a trained network, we need to determine the W values that minimize the error. The standard way to do this is backpropagation, which propagates the output error backwards through the network to adjust the weights.
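For a single neuron, the weight search reduces to a simple gradient-descent update, sketched below. This is not full backpropagation through several layers; it only shows the idea of nudging each weight against the error gradient. The sigmoid activation, squared-error loss, learning rate, epoch count, and the OR training set are all assumptions of the sketch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(samples, epochs=5000, learning_rate=0.5):
    """Fit the weights of one sigmoid neuron by gradient descent on squared error."""
    n_inputs = len(samples[0][0])
    weights = [0.0] * (n_inputs + 1)      # weights[0] is a bias weight
    for _ in range(epochs):
        for x, target in samples:
            inputs = [1.0] + list(x)      # constant input for the bias
            output = sigmoid(sum(w * v for w, v in zip(weights, inputs)))
            # Gradient of 0.5 * (output - target)^2 with respect to the weighted sum.
            delta = (output - target) * output * (1.0 - output)
            weights = [w - learning_rate * delta * v
                       for w, v in zip(weights, inputs)]
    return weights

def predict(weights, x):
    inputs = [1.0] + list(x)
    return sigmoid(sum(w * v for w, v in zip(weights, inputs)))

# Hypothetical training set: the logical OR of two binary inputs.
samples = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 1)]
weights = train_neuron(samples)
predictions = [round(predict(weights, x)) for x, _ in samples]
```

In a multi-layer network the same error signal is passed backwards layer by layer using the chain rule, which is where the name backpropagation comes from.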

### Types of neural networks:

Feedforward – simple network layers without cyclic connections (MLP, RBF)

Recurrent – layers can be connected back together and create a cycle (RNN)

Stochastic – derived from Bayesian networks; consists of an input layer, pattern layer, summation layer and output layer, and is used for classification (PNN, Boltzmann machine, RBM)

## Kernel based classifiers

A class of algorithms that implicitly transform the raw data into a new feature space using a kernel function, which computes the inner product between pairs of instances in that space.

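Support vector machine – the SVM model represents the instances as points in space and separates the labels with a line (in general, a hyperplane) that has the widest possible margin. The kernel trick allows it to do this even when the labels are not linearly separable in the original space.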
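The kernel trick can be illustrated with a small check: a kernel function returns the same value as an inner product in a transformed feature space, without ever building the transformed vectors. The sketch below assumes a degree-2 polynomial kernel on 2-dimensional inputs; the names `phi` and `poly_kernel` are illustrative.

```python
import math

def dot(a, b):
    """Inner product of two vectors."""
    return sum(x * y for x, y in zip(a, b))

def phi(v):
    """Explicit degree-2 feature map for a 2-dimensional vector."""
    x1, x2 = v
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def poly_kernel(a, b):
    """Homogeneous polynomial kernel of degree 2: (a . b)^2."""
    return dot(a, b) ** 2

# Hypothetical instances.
a, b = (1.0, 2.0), (3.0, 0.5)

explicit = dot(phi(a), phi(b))   # inner product in the new feature space
implicit = poly_kernel(a, b)     # same value, computed in the original space
```

Both values equal 16 here. For higher degrees or for kernels such as the RBF kernel, the explicit feature space becomes very large or infinite, which is why computing only the kernel is attractive.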

## Ensemble based classifiers

The classifier is constructed from many weak classifiers. Weak classifiers are usually preferred as building blocks because they are fast to compute, and the combined model does not depend on any single classifier.

### Types of ensemble classifiers:

Boosting – builds a strong classifier from weak classifiers. For example, AdaBoost trains a sequence of classifiers where each one focuses on the errors of the previous ones

Bagging – bootstrap aggregating – builds several classifiers on bootstrap samples of the data (random samples drawn with replacement) and combines their classifications

Random Forest – a very popular and effective ensemble classifier. It is constructed as follows: train T trees, each on a bootstrap sample of the data, then classify using majority voting.
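The bootstrap-and-vote scheme shared by bagging and random forests can be sketched as below. To keep the example short, the weak learner is a hypothetical one-feature threshold rule (`train_stump`) rather than a decision tree, so this is a bagging sketch and not a random forest. It also assumes 1-dimensional data with exactly two labels.

```python
import random
from collections import Counter

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    """Hypothetical weak learner: split at the midpoint of the two class means."""
    by_label = {}
    for x, label in sample:
        by_label.setdefault(label, []).append(x)
    if len(by_label) == 1:                 # the sample contains a single class
        only = next(iter(by_label))
        return lambda x: only
    means = {label: sum(xs) / len(xs) for label, xs in by_label.items()}
    low, high = sorted(means, key=means.get)
    threshold = (means[low] + means[high]) / 2
    return lambda x: low if x < threshold else high

def train_bagging(data, n_models, seed=0):
    """Train one weak learner per bootstrap sample."""
    rng = random.Random(seed)
    return [train_stump(bootstrap_sample(data, rng)) for _ in range(n_models)]

def predict(models, x):
    """Combine the individual classifications by majority voting."""
    votes = Counter(model(x) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical 1-dimensional training data: (value, label).
data = [(1.0, "a"), (1.5, "a"), (2.0, "a"), (8.0, "b"), (8.5, "b"), (9.0, "b")]
models = train_bagging(data, n_models=11)
low_prediction = predict(models, 1.2)
high_prediction = predict(models, 8.7)
```

An odd number of models avoids tied votes when there are two labels. A random forest follows the same pattern with decision trees as the base learners.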

## Supervised models comparison

| Model | Algorithms | Pros | Cons |
|---|---|---|---|
| Regression | Linear regression | Fast; widely accepted | Less accurate; non-linear connection; only numeric |
| Decision tree | J48, C4.5 | Simple to understand; little data preparation; accepts different attribute types | Overfitting; unstable; unbalanced data causes an unbalanced tree |
| Lazy classifiers | KNN | Easy to explain; deals with non-linearity | Overfitting; choosing K; model must store all data; choosing the distance function; only numeric |
| Probabilistic / Bayesian classifiers | Naïve Bayes, Bayesian network | Good for many attributes and few instances; fast to train and apply | Conditional independence assumption; less accurate handling |
| Artificial neural networks | MLP, RNN, RBM | Non-linearity | Overfitting; complicated; slow training; only numeric |
| Kernel based classifiers | SVM | Deals with high dimensionality; can deal with overfitting | Needs many instances; choosing the kernel function; hard-to-understand model; memory and runtime; only numeric |
| Ensemble based classifiers | Random Forest, AdaBoost | Usually outperforms all others; statistically better | Overfitting; complicated |