Now that we understand the fundamentals of the cyber data scientist's work, we can put these skills and knowledge into action! In this module, we suggest a workflow for solving cybersecurity problems using machine learning.
This workflow combines several data science workflows but is mostly influenced by Dr. Philip Guo’s workflow.
First, we must distinguish between the four types of business questions we will most often encounter in our work:
- Simple – A question about basic properties of the data that can be answered with simple statistics.
Example: What is the average file size of a malware family?
- Hypothesis – An assumption we have regarding the data and the entities in it. Such hypotheses can usually be tested with statistical tests.
Example: The file size of malware Family A is bigger than that of malware Family B.
- Segmentation – A general set of assumptions about an entity in the data, describing what characterizes it.
Example: What are the characteristics of malware from Family A?
- Classification – How to classify a record that has never been seen before. This question can be answered using a machine learning model.
Example: Is a specific unlabeled file malware from Family A?
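The first two question types can be answered with a few lines of code. The following is a minimal sketch using only Python's standard library. The file sizes are synthetic values invented for illustration, and the permutation test (with the hypothetical helper `permutation_p_value`) is just one possible statistical test, not one prescribed by this module.

```python
import random
import statistics

# Synthetic file sizes in KB, invented for illustration only.
family_a = [512, 640, 580, 700, 615, 590, 660, 720]
family_b = [300, 420, 350, 390, 410, 330, 370, 360]

# Simple question: what is the average file size of each family?
mean_a = statistics.mean(family_a)
mean_b = statistics.mean(family_b)


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """One-sided permutation test for the hypothesis mean(a) > mean(b).

    Returns the fraction of random relabelings whose difference in means
    is at least as large as the observed difference.
    """
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = statistics.mean(pooled[:len(a)]) - statistics.mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    return hits / n_permutations


# Hypothesis question: are Family A files bigger than Family B files?
p_value = permutation_p_value(family_a, family_b)
print(f"mean A = {mean_a:.1f} KB, mean B = {mean_b:.1f} KB, p = {p_value:.4f}")
```

A small p-value suggests that a difference this large is unlikely to arise by chance, which supports the hypothesis on this toy data.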
As explained in the machine learning and statistics chapter of the fundamentals module, the general approach to working with machine learning is to treat it as a grey or black box. As data scientists, we don’t need to understand every detail of the algorithm, but we must understand how to model our problem and which task we are trying to solve. Still, we must be familiar with the logic of the algorithms, especially those that produce the best results.
The black box approach generally involves splitting the data into train and test datasets using a validation method. The ML algorithm builds a model from the train data, and the model is then evaluated on the test data by examining different performance criteria.
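The split, train, and evaluate cycle can be sketched as follows. This is a minimal illustration in plain Python, assuming a synthetic one-feature dataset (file size in KB) and a deliberately simple threshold model. The helper names `train_test_split`, `fit_threshold`, and `evaluate` are hypothetical and defined here, not taken from any library.

```python
import random

# Synthetic dataset invented for illustration: (file_size_kb, label),
# where label 1 means "Family A" and label 0 means "other".
rng = random.Random(42)
dataset = ([(rng.gauss(600, 60), 1) for _ in range(100)]
           + [(rng.gauss(350, 60), 0) for _ in range(100)])


def train_test_split(records, test_ratio=0.3, seed=0):
    """Hold-out validation: shuffle a copy, then split into train and test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


def fit_threshold(train):
    """Train a one-feature model: the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict(threshold, x):
    """Predict label 1 when the file size exceeds the learned threshold."""
    return 1 if x > threshold else 0


def evaluate(threshold, test):
    """Compute accuracy, precision, and recall on the held-out test data."""
    tp = sum(1 for x, y in test if predict(threshold, x) == 1 and y == 1)
    fp = sum(1 for x, y in test if predict(threshold, x) == 1 and y == 0)
    fn = sum(1 for x, y in test if predict(threshold, x) == 0 and y == 1)
    tn = sum(1 for x, y in test if predict(threshold, x) == 0 and y == 0)
    return {
        "accuracy": (tp + tn) / len(test),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


train, test = train_test_split(dataset)
threshold = fit_threshold(train)    # the model is built from train data only
metrics = evaluate(threshold, test)  # and judged on data it has not seen
print(metrics)
```

The important design point is that the test records never influence the threshold, so the reported metrics estimate how the model behaves on unseen files.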
In the fundamentals module, we focused on high-level machine learning tasks. In this module, we will explain how to create a validated ML process. You will learn how to take a cybersecurity problem and build a machine learning model for it using a relevant algorithm. We will go through the different models and the algorithms that implement them.
As mentioned, the workflow is quite generic and intuitive. First, we learn the Domain (1) of the problem we are trying to solve, which here is the cyber domain. Then we start working on our Data (2) by gathering it from the problem’s environment. Next, we Explore the values and try to understand them so we can identify the Task we are trying to solve. Afterward, we Process the data to extract relevant information and model it for that task.
After we have processed the raw data into a dataset, we choose a Validation (3) method to split the data correctly for later evaluation. Then we start Modeling (4). First, we Choose a Model we think can solve the task and pick a relevant algorithm to train on the data. Then, once we have a trained model, we test it on the data held out by the validation method and Evaluate (5) its performance against other models, selecting the best one for Deployment (6).
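The validation, modeling, and evaluation steps can be sketched as a comparison between candidate models. The example below is a hedged illustration in plain Python, assuming synthetic data and using k-fold cross-validation as the validation method (one option among several). The two candidate models and the helper `cross_val_accuracy` are hypothetical and exist only for this sketch.

```python
import random
from statistics import mean

# Synthetic records invented for illustration: (file_size_kb, label).
rng = random.Random(7)
data = ([(rng.gauss(600, 80), 1) for _ in range(90)]
        + [(rng.gauss(350, 80), 0) for _ in range(110)])
rng.shuffle(data)


def fit_majority(train):
    """Baseline model: always predict the most common training label."""
    ones = sum(y for _, y in train)
    label = 1 if ones * 2 >= len(train) else 0
    return lambda x: label


def fit_threshold(train):
    """Threshold model: cut at the midpoint between the two class means."""
    pos = [x for x, y in train if y == 1]
    neg = [x for x, y in train if y == 0]
    cut = (mean(pos) + mean(neg)) / 2
    return lambda x: 1 if x > cut else 0


def cross_val_accuracy(fit, records, k=5):
    """k-fold cross-validation: each fold serves exactly once as test data."""
    scores = []
    for i in range(k):
        test = records[i::k]
        train = [r for j, r in enumerate(records) if j % k != i]
        model = fit(train)
        scores.append(mean(1 if model(x) == y else 0 for x, y in test))
    return mean(scores)


# Train and evaluate every candidate, then keep the best one for deployment.
candidates = {"majority_baseline": fit_majority, "size_threshold": fit_threshold}
results = {name: cross_val_accuracy(fit, data) for name, fit in candidates.items()}
best_model = max(results, key=results.get)
print(results, "->", best_model)
```

Including a trivial baseline among the candidates is a useful sanity check: a model that cannot beat it is not worth deploying.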