Data-related issues can take up 80% of your project's time. Everything starts and ends with the data: first we must acquire it and understand it in order to state the task we are trying to solve.
Choosing a task is critical because it affects how we process and enrich the data, and it later shapes the solutions and algorithms.
The workflow components relevant to this post are marked in the red box:
- Raw data – the most basic representation of the data collected from the data source
- Dataset – an assembly of aggregated data that can later be used by models, constructed from:
- Instance = row – one sample of information from the dataset that the model will handle
- Attribute = column = feature = characteristic – one property of an instance
- Value – the actual data of an instance for a specific attribute
- Id – an attribute that identifies a specific instance or groups instances that belong to the same entity
- Schema – a formal language that describes the structure of the data
- Data Types:
- Integer – zero and the natural numbers
- Interval – a real number
- Ratio – an interval value bounded between zero and one (e.g., a proportion)
- Nominal / Categorical – a string with 2 or more values and no order
- Binomial – nominal with exactly 2 values
- Ordinal – nominal with an order
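The terms above can be sketched in code. This is a minimal, hypothetical example using only the Python standard library: the schema maps each attribute to one of the data types listed, and a single instance is validated against it (all attribute names and values are made up for illustration).

```python
# Schema: a formal description of the structured data -- each attribute
# name is mapped to its data type. All names here are hypothetical.
SCHEMA = {
    "id": "integer",            # identifies instances from the same entity
    "packet_count": "integer",
    "duration_sec": "interval",  # a real number
    "protocol": "nominal",       # string with no order ("tcp", "udp", ...)
    "severity": "ordinal",       # ordered categories ("low" < "high")
}

# One instance (= row): a single sample that the model will later handle.
instance = {
    "id": 42,
    "packet_count": 130,
    "duration_sec": 3.7,
    "protocol": "tcp",
    "severity": "low",
}

def validate(instance, schema):
    """Check that each value in the instance matches its attribute's type."""
    checks = {
        "integer": lambda v: isinstance(v, int),
        "interval": lambda v: isinstance(v, (int, float)),
        "nominal": lambda v: isinstance(v, str),
        "ordinal": lambda v: isinstance(v, str),
    }
    return all(checks[t](instance[attr]) for attr, t in schema.items())

print(validate(instance, SCHEMA))  # True
```

In practice a schema would also constrain the allowed values of nominal and ordinal attributes, not just their Python types.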
There are many ways to gather raw data: you might receive it already processed, record it yourself, or pull it from an external API. Each of these approaches holds its own challenges:
- Technical – we need technical skills to connect to the database or to record the data samples correctly. In cybersecurity, there is a wide range of tools for investigating an attack, and many of them are also used for recording data: sandboxes, packet sniffers, etc.
- Availability – the data collected for experiments will not necessarily be available at deployment due to regulation, privacy, and capacity constraints. We must take this into account and keep the experiment dataset as close as possible to what will be available in the real world, in real time.
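The availability point can be made concrete with a small sketch: restrict the experiment dataset to attributes we know will still exist in deployment, and drop the rest up front. The field names below are hypothetical.

```python
# Raw records as collected in the experiment (hypothetical fields).
raw_records = [
    {"src_ip": "10.0.0.1", "bytes": 512, "user_email": "a@x.com"},
    {"src_ip": "10.0.0.2", "bytes": 204, "user_email": "b@x.com"},
]

# user_email may be unavailable in production due to privacy regulation,
# so we drop it now rather than train on a feature we will never see live.
DEPLOYABLE = {"src_ip", "bytes"}

dataset = [
    {k: v for k, v in record.items() if k in DEPLOYABLE}
    for record in raw_records
]
print(dataset[0])  # {'src_ip': '10.0.0.1', 'bytes': 512}
```

Doing this filtering early keeps the offline experiment honest: any model trained on `dataset` only sees what the real-time system will see.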
In the exploration phase, we try to understand the data by visualizing its statistics and behavior with statistical graphs. Our goal is to tell the story behind the data and to find the holes and opportunities in it.
- Attribute types – what are the value types? Are they addressed appropriately in the schema?
- Data quality – we want values that spread nicely among the different entities, and to see whether attributes correlate with each other or with the label. We are looking for good attributes that might be useful and meaningful.
- Wrong attribute values – on the other hand, we want to flag tricky and problematic attributes: ones that are inconsistent, contain many missing values, or grow incrementally over time and therefore do not represent the underlying values.
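Two of the checks above can be sketched without any plotting library: counting missing values per attribute, and measuring how strongly an attribute correlates with the label. The rows and column names below are made up, and Pearson correlation is computed by hand to keep the example dependency-free.

```python
from math import sqrt

# Tiny hypothetical dataset; None marks a missing value.
rows = [
    {"bytes": 512, "duration": 3.1,  "label": 1},
    {"bytes": 204, "duration": None, "label": 0},
    {"bytes": 980, "duration": 7.9,  "label": 1},
    {"bytes": 130, "duration": 1.2,  "label": 0},
]

# Wrong/missing attribute values: count Nones per attribute.
missing = {col: sum(r[col] is None for r in rows) for col in rows[0]}

def pearson(xs, ys):
    """Plain Pearson correlation coefficient, no external libraries."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Data quality: does "bytes" correlate with the label?
r_bytes = pearson([r["bytes"] for r in rows], [r["label"] for r in rows])
print(missing, round(r_bytes, 2))
```

On real data you would run the same two checks for every attribute and sort the results, which quickly surfaces both the promising features and the problematic ones.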
Choosing a task is a vital decision that needs to be addressed early in the data phase, or even before it. As explained in the AI chapter, ML has two main task families: supervised and unsupervised learning. It is straightforward to distinguish between them at first, but this separation holds only in the beginning; as you go deeper into the data science world, you will find different variants and combinations.
In the most basic terms, if we have labeled data that we want to learn from, we use supervised learning: when the label is categorical we use classification, and when it is continuous we use regression. If the data is not labeled, we choose unsupervised learning: if we want to find groups we use clustering, and if we want to find anomalies we use outlier detection.
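The decision rule above can be written as a crude heuristic. This is a hedged sketch, not a real library API: the helper name is made up, and real projects decide the task from domain knowledge, not just label dtype.

```python
def choose_task(labels):
    """Crude task-selection heuristic, mirroring the text:
    no labels -> unsupervised (clustering / outlier detection),
    categorical (string) labels -> classification,
    continuous (numeric) labels -> regression."""
    if labels is None:
        return "unsupervised"
    if all(isinstance(y, str) for y in labels):
        return "classification"
    return "regression"

print(choose_task(["malware", "benign", "malware"]))  # classification
print(choose_task([3.1, 7.9, 1.2]))                   # regression
print(choose_task(None))                              # unsupervised
```

Note the heuristic's limits: an integer-encoded categorical label (0/1 classes) would be misrouted to regression, which is exactly why the "versions and combinations" mentioned above require human judgment.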