In machine learning, we learn from past data and use what we learned to make predictions about future events. In general, the work consists of solving different tasks with different models, each of which is implemented by an algorithm.
In practice, as shown in the ML 4-level approach, when looking for an ML solution to a problem we first think about what sort of task we are trying to solve, then which model is the most suitable, and afterward we choose an algorithm that will implement the model. The machine learning process must be validated and evaluated to find the best solution.
The general approach to working with machine learning is a grey/black box. As cyber data scientists, we don’t need to understand every detail of the algorithm, but we do need to understand how to model our problem and what task we are trying to solve.
Still, we must be familiar with the logic of the algorithms, especially those that give the best results.
The black box approach generally includes splitting the data into train and test datasets. The ML algorithm builds a model based on the train data, and the model is then evaluated on the test data by examining different performance criteria.
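The split-train-evaluate loop can be sketched in a few lines of plain Python. This is a minimal illustration only: the data, the one-feature threshold "model", and the function names `train_test_split`, `fit_threshold`, and `accuracy` are all assumptions made for this example, not part of any particular library.

```python
import random

def train_test_split(instances, test_ratio=0.25, seed=0):
    """Shuffle the instances and split them into (train, test) lists."""
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train):
    """'Train' a toy one-feature model: the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, data):
    """Performance criterion: the fraction of instances classified correctly."""
    correct = sum(1 for x, y in data if (1 if x > threshold else 0) == y)
    return correct / len(data)

# Hypothetical data: (feature value, label), where 1 = malicious and 0 = benign.
data = [(x / 10, 0) for x in range(0, 40)] + [(x / 10, 1) for x in range(60, 100)]

train, test = train_test_split(data)
threshold = fit_threshold(train)          # the model only ever sees the train data
print("test accuracy:", accuracy(threshold, test))
```

Accuracy is only one possible criterion; in a real setting you would likely also look at measures such as precision and recall, since malicious samples are usually much rarer than benign ones.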
One of the most essential principles in this field is Occam’s Razor, which is commonly explained as: choose the simplest theory that fits the data. Following this principle helps you avoid overfitting and get a more general model that can handle unseen instances.
Supervised learning is when the data is labeled, and the machine needs to predict the label for new instances.
In supervised learning, we try to assign a new instance to a group, for example deciding whether a file is malware (red) or not (blue). In the example below, the dots are instances plotted over two features (the axes). Their color is the label, and the black curve is the model that separates the samples. The model presented is a decision tree: during training it splits nodes by their features’ values, and it uses the leaves for classification.
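To make the idea of "splitting a node by a feature value" concrete, here is a sketch of a decision tree with a single split (a decision stump). The two features (file size and byte entropy), the sample values, and all function names are hypothetical, and Gini impurity is used as the split criterion only as one common choice.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 means the node is pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def partition(rows, labels, feature, threshold):
    """Split the labels into those whose feature value is <= or > the threshold."""
    left = [y for row, y in zip(rows, labels) if row[feature] <= threshold]
    right = [y for row, y in zip(rows, labels) if row[feature] > threshold]
    return left, right

def best_split(rows, labels):
    """Try every feature and observed value; keep the split with the lowest impurity."""
    best, best_score = None, float("inf")
    for feature in range(len(rows[0])):
        for threshold in sorted({row[feature] for row in rows}):
            left, right = partition(rows, labels, feature, threshold)
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            if score < best_score:
                best_score, best = score, (feature, threshold)
    return best

def majority(labels):
    """Leaf prediction: the most common label (0 if the leaf is empty)."""
    return 1 if labels and sum(labels) * 2 >= len(labels) else 0

def fit_stump(rows, labels):
    feature, threshold = best_split(rows, labels)
    left, right = partition(rows, labels, feature, threshold)
    return feature, threshold, majority(left), majority(right)

def predict_stump(stump, row):
    feature, threshold, left_label, right_label = stump
    return left_label if row[feature] <= threshold else right_label

# Hypothetical instances: (file size in KB, byte entropy); label 1 = malware.
rows = [(120, 3.1), (300, 4.0), (150, 7.2), (280, 7.8), (200, 3.5), (220, 7.5)]
labels = [0, 0, 1, 1, 0, 1]

stump = fit_stump(rows, labels)
print(stump, predict_stump(stump, (250, 7.0)))
```

A full decision tree applies the same split search recursively to each side until the leaves are pure enough or a depth limit is reached.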
In supervised learning, we might also want to use regression models to predict a score, for example a score for the potential damage of a malware file. In the graph, you can see a correlation between the label Y and the feature X that can be represented by a regression line.
The regression line is a linear formula: Y = 2.2*X + 0.2. In this formula, we can plug in X and get Y. The regression model also works with more than one feature, and the result is again a linear formula: Y = a1*x1 + a2*x2 + a3*x3 + b.
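As a rough sketch, the single-feature line can be recovered with ordinary least squares, and the multi-feature formula is just a weighted sum plus a bias. The sample points below are generated from Y = 2.2*X + 0.2 for illustration, and the names `fit_line` and `predict_linear` as well as the example weights are assumptions of this sketch.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def predict_linear(features, weights, bias):
    """Y = a1*x1 + a2*x2 + ... + b for any number of features."""
    return sum(a * x for a, x in zip(weights, features)) + bias

# Illustrative points that lie exactly on the line Y = 2.2*X + 0.2.
xs = [0, 1, 2, 3, 4]
ys = [2.2 * x + 0.2 for x in xs]

slope, intercept = fit_line(xs, ys)
print(round(slope, 3), round(intercept, 3))

# Three features with made-up coefficients a1, a2, a3 and bias b.
print(predict_linear([1, 2, 3], weights=[0.5, 1.0, 2.0], bias=0.1))
```

Real data will not lie exactly on a line, so the fitted coefficients are the ones that minimize the squared error rather than an exact match.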
In unsupervised learning, the data is not labeled and the machine tries to find groups (clusters) or outliers in the data.
In unsupervised learning, we use a clustering model to find groups of instances based only on their characteristics. A cyber example: profile users’ behavior as clusters and look for a malicious cluster.
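One common clustering model is k-means, sketched below in plain Python. The text does not name a specific algorithm, so k-means is used here purely as an example; the behavior features, the data points, and the naive choice of the first k points as starting centroids are assumptions made to keep the sketch short.

```python
import math

def kmeans(points, k, iterations=10):
    """Group points into k clusters by repeatedly assigning each point to the
    nearest centroid and moving each centroid to the mean of its cluster."""
    centroids = [points[i] for i in range(k)]  # naive init: the first k points
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical user behavior: (logins per day, GB uploaded per day).
points = [(1, 2), (9, 9), (2, 1), (1, 1), (8, 9), (9, 8)]

centroids, clusters = kmeans(points, k=2)
for centroid, cluster in zip(centroids, clusters):
    print(centroid, cluster)
```

No labels are used anywhere; whether one of the resulting clusters is actually malicious still has to be decided by an analyst or a later step.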
An outlier is an event that does not fit the pattern of normal events. Outlier detection can be used, for instance, to detect a new type of fraud. For example, in a credit card scenario, we would like to find an outlier transaction that does not fit in with the normal behavior.
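A simple way to flag such a transaction is a robust score based on the median and the median absolute deviation (MAD). This is only one of many possible outlier detectors; the transaction amounts, the function name `find_outliers`, and the cutoff of 3.5 (a commonly used rule of thumb) are assumptions of this sketch.

```python
import statistics

def find_outliers(values, cutoff=3.5):
    """Flag values whose modified z-score (based on median and MAD) exceeds the cutoff."""
    median = statistics.median(values)
    mad = statistics.median(abs(v - median) for v in values)
    if mad == 0:
        return []  # all values are (nearly) identical; nothing stands out
    return [v for v in values if abs(0.6745 * (v - median) / mad) > cutoff]

# Hypothetical credit card transaction amounts for one customer.
amounts = [25, 30, 22, 28, 31, 27, 24, 950]

outliers = find_outliers(amounts)
print(outliers)
```

The median is used instead of the mean because a single extreme transaction would otherwise inflate both the mean and the standard deviation and could hide itself.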
Semi-supervised learning is when we have many instances but only a small number of them are labeled, and we wish to train a good model based on “good” instances. This is a very common task because in the real world most data is not labeled, and labeling it takes a lot of effort from domain experts.
Reinforcement learning follows the same concept as planning and decision theory: the model gets feedback from the environment on its actions and tunes itself accordingly. The environment can be modeled as a Markov decision process (MDP). The goal is to maximize the utility function by choosing whether to explore the environment or to exploit what has already been learned (perform the action that currently looks best).
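The explore/exploit trade-off can be illustrated with an epsilon-greedy agent on a multi-armed bandit. Note that a bandit is a simplified, single-state setting rather than a full MDP; the reward probabilities, the value of epsilon, and the function name `epsilon_greedy` are all assumptions chosen for this sketch.

```python
import random

def epsilon_greedy(true_rewards, steps=2000, epsilon=0.1, seed=0):
    """With probability epsilon explore a random action; otherwise exploit the
    action with the highest estimated reward. Returns (estimates, counts)."""
    rng = random.Random(seed)
    n = len(true_rewards)
    estimates = [0.0] * n   # the agent's current belief about each action
    counts = [0] * n        # how many times each action was taken
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n)                              # explore
        else:
            action = max(range(n), key=lambda a: estimates[a])     # exploit
        # Feedback from the environment: reward 1 with the action's hidden probability.
        reward = 1.0 if rng.random() < true_rewards[action] else 0.0
        counts[action] += 1
        # Incremental average: move the estimate toward the observed reward.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts

# Hypothetical environment: three actions with hidden success probabilities.
estimates, counts = epsilon_greedy([0.2, 0.5, 0.8])
print(estimates, counts)
```

A larger epsilon means more exploration and slower commitment to the best-looking action; a smaller epsilon exploits sooner but risks settling on a poor action.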
Genetic programming (GP) imitates the evolutionary process of creating new generations from the old generation’s genes. The method genetically breeds a population of computer programs to solve a problem. Using GP algorithms, it is possible to iteratively transform a population of programs into a new generation. We start with an initial population and produce each new generation by applying:
- Reproduction – copying a program unchanged from the previous generation
- Crossover – combining parts from different programs into one
- Mutation – randomly changing a part of a program
- Architecture – changing the program’s architecture
Once we have enough programs, we can choose the best one that solves our problem.
In cybersecurity, it is possible to apply genetic programming to malware in order to anticipate forms this malware may take in the future.
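The generational loop can be sketched as follows. To keep the example short, it evolves fixed-length bit strings (a plain genetic algorithm) instead of real program trees, and it leaves out the architecture-altering operation; the toy fitness function, population size, and mutation rate are assumptions chosen only for illustration.

```python
import random

def fitness(individual):
    """Toy objective: the number of 1 bits (higher is better)."""
    return sum(individual)

def crossover(a, b, rng):
    """Combine the head of one parent with the tail of the other."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]

def mutate(individual, rng, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - bit if rng.random() < rate else bit for bit in individual]

def evolve(pop_size=20, length=16, generations=30, seed=0):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)] for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        survivors = population[: pop_size // 2]   # selection: keep the fitter half
        next_gen = [survivors[0][:]]              # reproduction: copy the best unchanged
        while len(next_gen) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            next_gen.append(mutate(crossover(parent_a, parent_b, rng), rng))
        population = next_gen
    return max(population, key=fitness), history

best, history = evolve()
print(fitness(best), history)
```

Because the best individual is always copied unchanged into the next generation, the best fitness can never decrease from one generation to the next. In genuine genetic programming the individuals would be programs (for example expression trees), and fitness would be measured by running them on the problem.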