As mentioned in the previous post, data-related issues take up about 80% of the time in a project. Everything starts and ends with the data.
This post covers one of the first components in our workflow. Once we have understood the cyber domain, the data, and the task we are trying to solve, it’s time to process the data for later use by machine learning.
The workflow components relevant to this post are marked with a red box:
Processing the data is an essential task in the workflow. The recommended techniques are based on the methods of Pyle 99 and Garcia15 (you can find links below) and are widely accepted:
Aggregation is used to model the data for the machine learning task. In the real world, raw data is rarely in a form that machines can learn from, and some adjustments are needed to present the problem as a dataset with attributes, instances and a label that the algorithm can work with. An aggregation consists of a “group by” column and an “aggregation function” that is applied to another column. The aggregated data becomes the dataset the machine learns from, where each row is an instance.
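To make the “group by” plus “aggregation function” idea concrete, here is a minimal sketch in plain Python. The event fields (`src_ip`, `bytes`, `dst_port`) and the choice of grouping by source IP are assumptions made for illustration, not something prescribed by the workflow.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events: one row per network connection.
raw_events = [
    {"src_ip": "10.0.0.1", "bytes": 500, "dst_port": 80},
    {"src_ip": "10.0.0.1", "bytes": 1500, "dst_port": 443},
    {"src_ip": "10.0.0.2", "bytes": 200, "dst_port": 22},
]


def aggregate(rows, group_by, column, functions):
    """Group rows by `group_by` and apply each aggregation function to `column`."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[group_by]].append(row[column])

    dataset = []
    for key, values in groups.items():
        instance = {group_by: key}
        for name, func in functions.items():
            instance[f"{column}_{name}"] = func(values)
        dataset.append(instance)
    return dataset


# Each row of `dataset` is one instance: a source IP described by aggregated attributes.
dataset = aggregate(
    raw_events,
    group_by="src_ip",
    column="bytes",
    functions={"count": len, "sum": sum, "mean": mean},
)
```

In this sketch the instance is "a source IP"; grouping by a different column (a user, a session, a time bucket) would present the same raw data as a different learning problem.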
Most of the time, the data is very noisy and dirty due to technical or security-related issues, which leaves us with several problems to address:
There are many techniques for dealing with missing values. If a column has many missing values, it may be better to eliminate it. Other approaches try to fill in the missing values, for example by choosing the mode (the most frequent value) or by building a predictive model from previous values and using it to predict the missing one.
Some attributes or instances contain outlier values due to faulty data collection or other real-world causes. Such instances should be excluded; otherwise they might harm the learning process. There are good outlier detection models that can be applied to the data.
Too many features can damage the model, and we might also want to balance our instances so that they represent the population better.
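The two simplest options above, dropping a sparse column and filling with the mode, can be sketched as follows. The 50% threshold, the use of `None` as the missing-value marker and the column names are all assumptions made for this example.

```python
from collections import Counter


def drop_sparse_columns(rows, max_missing_ratio=0.5):
    """Remove columns whose share of missing (None) values exceeds the threshold."""
    columns = list(rows[0].keys())
    keep = [
        c for c in columns
        if sum(r[c] is None for r in rows) / len(rows) <= max_missing_ratio
    ]
    return [{c: r[c] for c in keep} for r in rows]


def impute_mode(rows, column):
    """Fill missing values in `column` with its most frequent observed value."""
    observed = [r[column] for r in rows if r[column] is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [{**r, column: mode if r[column] is None else r[column]} for r in rows]


# Hypothetical log records with gaps.
rows = [
    {"protocol": "tcp", "user_agent": None},
    {"protocol": None, "user_agent": None},
    {"protocol": "tcp", "user_agent": "curl"},
    {"protocol": "udp", "user_agent": None},
]

# `user_agent` is missing in 3 of 4 rows, so it is dropped;
# the single missing `protocol` is filled with the mode, "tcp".
cleaned = impute_mode(drop_sparse_columns(rows), "protocol")
```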
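The post does not name a specific outlier detection model, so as an illustration only, here is a sketch of one simple and common rule: flag values that fall more than 1.5 interquartile ranges outside the quartiles. The factor of 1.5 and the sample values are assumptions; real data may call for a different method altogether.

```python
from statistics import quantiles


def iqr_outlier_bounds(values, factor=1.5):
    """Return (low, high) bounds based on the interquartile range."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    return q1 - factor * iqr, q3 + factor * iqr


def split_outliers(values, factor=1.5):
    """Split values into those inside the bounds and those outside them."""
    low, high = iqr_outlier_bounds(values, factor)
    kept = [v for v in values if low <= v <= high]
    removed = [v for v in values if not low <= v <= high]
    return kept, removed


# Hypothetical request sizes with one faulty reading.
request_sizes = [10, 12, 11, 13, 12, 11, 10, 500]
kept, removed = split_outliers(request_sizes)
```

In a security setting it is worth inspecting what lands in `removed` before discarding it, since an extreme value may be the attack rather than a collection error.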
- Feature selection – some algorithms calculate how informative a feature is and how much it contributes to the task. We can use these algorithms to select features for training the model.
- Imbalance handling – in many cases, especially in cybersecurity, the label is imbalanced (too few attacks, many benign instances). This may harm the training process, so we might choose different data sampling methods to deal with it.
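Both ideas can be sketched in a few lines. For feature selection, the example below scores a nominal feature by information gain, which is one of several possible scoring measures. For imbalance handling, it uses random undersampling of the larger classes, which is one of several sampling methods. The feature names, labels and seed are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(instances, feature, label_key="label"):
    """Reduction in label entropy obtained by splitting on a nominal feature."""
    labels = [inst[label_key] for inst in instances]
    groups = {}
    for inst in instances:
        groups.setdefault(inst[feature], []).append(inst[label_key])
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def undersample(instances, label_key="label", seed=0):
    """Randomly shrink every class to the size of the smallest class."""
    rng = random.Random(seed)
    by_label = {}
    for inst in instances:
        by_label.setdefault(inst[label_key], []).append(inst)
    smallest = min(len(group) for group in by_label.values())
    balanced = []
    for group in by_label.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)
    return balanced


# Hypothetical instances: `uses_tor` separates the labels, `weekday` does not.
sample = [
    {"uses_tor": True, "weekday": "mon", "label": "attack"},
    {"uses_tor": True, "weekday": "tue", "label": "attack"},
    {"uses_tor": False, "weekday": "mon", "label": "benign"},
    {"uses_tor": False, "weekday": "tue", "label": "benign"},
]
scores = {f: information_gain(sample, f) for f in ("uses_tor", "weekday")}

# Hypothetical imbalanced set: 95 benign instances, 5 attacks.
imbalanced = [{"id": i, "label": "benign"} for i in range(95)]
imbalanced += [{"id": 95 + i, "label": "attack"} for i in range(5)]
balanced = undersample(imbalanced)
```

Resampling is normally applied to the training data only, so that evaluation still reflects the real class distribution.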
Most of the time, there is more information we can squeeze out of the data through feature extraction (FE). Still, we don’t want a feature that is too complex to calculate, as that will harm the deployment. We must also avoid over-transforming the data or adding information that is not accessible in the real world (such as information from future instances).
There are three known FE methods we can implement:
1. Transformation FE
A function applied to one or several raw data attributes.
- Simple function – a linear or nonlinear function of one or more attributes
- Aggregation function – while performing the aggregation, we can extract more sophisticated aggregated features by applying a function to one or more attributes within each “group by” group
- Discretization – transform a numeric value into groups of binned data. This transformation significantly reduces the number of distinct values the attribute can take and facilitates the training process for models that use nominal attributes (such as decision trees). The binning can be performed in three ways:
  - Gain ratio discretization – if there is a label, the algorithm finds the most informative binning split of the attribute values
  - Equal frequency discretization – divide into k bins, where each bin has an equal number of instances
  - Equal width discretization – divide into k bins, where each bin has the same width w, w = (max(att) – min(att)) / k
- Normalization – scale the values to a common range, usually between zero and one. Normalized attributes are easier to compare with each other, which is very useful for algorithms that rely on comparisons between instances.
  - Min-max normalization – new value = (value – min(att)) / (max(att) – min(att))
  - Z-score normalization – using a normal distribution score – new value = (value – mean(att)) / std(att)
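The unsupervised discretization and normalization formulas above translate almost directly into code. The sketch below is a minimal version under a few assumptions: the attribute is not constant (otherwise the normalization functions would divide by zero), `std(att)` is taken to be the population standard deviation, and the sample values are made up. Gain ratio discretization is left out because it needs a label and a split search.

```python
from statistics import mean, pstdev


def equal_width_bins(values, k):
    """Assign each value to one of k bins of width w = (max - min) / k."""
    lo, hi = min(values), max(values)
    w = (hi - lo) / k
    if w == 0:
        return [0 for _ in values]
    # The maximum value would land in bin k, so clamp it into the last bin.
    return [min(int((v - lo) / w), k - 1) for v in values]


def equal_frequency_bins(values, k):
    """Assign bins by rank so each bin holds roughly the same number of instances.

    Note: equal values can end up in different bins in this simple version.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * k // len(values)
    return bins


def min_max(values):
    """new value = (value - min(att)) / (max(att) - min(att))"""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def z_score(values):
    """new value = (value - mean(att)) / std(att)"""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


# Hypothetical attribute with one very large value.
sizes = [0, 10, 20, 30, 40, 1000]

width_bins = equal_width_bins(sizes, k=4)      # [0, 0, 0, 0, 0, 3]
freq_bins = equal_frequency_bins(sizes, k=3)   # [0, 0, 1, 1, 2, 2]
scaled = min_max(sizes)
standardized = z_score(sizes)
```

The example also shows why the choice matters: with a skewed attribute, equal width binning puts almost everything into one bin, while equal frequency binning spreads the instances evenly.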
2. Historical FE
More complex features are calculated from historical instances in time series data. These features can help the model better understand the context of an instance.
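As a sketch of a historical feature, the function below counts, for each event, how many earlier events from the same source occurred within a time window. It only looks backwards, so no information from future instances leaks into the feature. The field names (`timestamp`, `src_ip`), the 60-second window and the assumption that events arrive sorted by time are all illustrative.

```python
from collections import defaultdict, deque


def count_recent_events(events, key, window_seconds):
    """For each event, count earlier events with the same key inside the window.

    Assumes `events` is sorted by ascending `timestamp`.
    """
    history = defaultdict(deque)
    features = []
    for event in events:
        past = history[event[key]]
        # Forget timestamps that are older than the window.
        while past and event["timestamp"] - past[0] > window_seconds:
            past.popleft()
        features.append(len(past))
        past.append(event["timestamp"])
    return features


# Hypothetical connection log, timestamps in seconds.
events = [
    {"timestamp": 0, "src_ip": "A"},
    {"timestamp": 10, "src_ip": "A"},
    {"timestamp": 20, "src_ip": "B"},
    {"timestamp": 50, "src_ip": "A"},
    {"timestamp": 100, "src_ip": "A"},
]

recent_counts = count_recent_events(events, key="src_ip", window_seconds=60)
# recent_counts == [0, 1, 0, 2, 1]
```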
3. Lookup FE
We can pre-calculate a lookup table over the values of the data and then extract the feature by using the value as a key into the lookup table.
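A minimal sketch of a lookup feature: the table below stores how frequent each value was in some historical data, and the feature is extracted by looking the value up. Using relative frequency as the table content, destination ports as the values and 0.0 as the default for unseen values are assumptions chosen for the example.

```python
from collections import Counter


def build_frequency_table(values):
    """Pre-calculate the relative frequency of each value."""
    counts = Counter(values)
    total = len(values)
    return {value: count / total for value, count in counts.items()}


def lookup_feature(table, value, default=0.0):
    """Extract the feature by using the value as a key into the table."""
    return table.get(value, default)


# Hypothetical destination ports seen in historical traffic.
historical_ports = [443, 443, 80, 22]
port_frequency = build_frequency_table(historical_ports)

common = lookup_feature(port_frequency, 443)    # 0.5
unseen = lookup_feature(port_frequency, 31337)  # 0.0, never seen before
```

Because the table is built ahead of time, extracting the feature at deployment is a single dictionary lookup, which keeps it cheap to calculate.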