Machine learning (ML) in cybersecurity

This article explains the core elements of machine learning, including its definition, types, and challenges. It also describes the role of machine learning in cybersecurity, offers guidance on evaluating machine learning models, and reviews benefits and use cases.

What is machine learning?

Machine learning is a subset of artificial intelligence (AI) that allows systems to automatically identify features, classify information, find patterns in data, make determinations and predictions, and uncover insights. Algorithms are trained on historical data to create machine learning models, which can be retrained as new data arrives to improve accuracy.

The quality of a machine learning model depends on two key aspects, which are especially important for machine learning in cybersecurity:

  1. The quality of the input data (i.e., garbage in, garbage out)
  2. The algorithm’s alignment with the use case

The choice of algorithm for machine learning models depends on the type of data that is available and the specific task.

Examples of how algorithms are used for machine learning in cybersecurity include:

  • Decision tree algorithm—for detecting and classifying attacks
  • Dimensionality reduction algorithms—for removing noisy and irrelevant data
  • K-means clustering—for detecting malware
  • K-nearest neighbors classifier (kNN)—for facial recognition used for authentication
  • Linear regression—for predicting network security outcomes
  • Logistic regression—for fraud detection
  • Naïve Bayes algorithm—for intrusion detection
  • Random forest algorithm—for classifying phishing attacks
  • Support Vector Machine (SVM) algorithm—for classifying, detecting, and predicting blacklisted IP addresses and port addresses

Origin of the term “machine learning”

Arthur Samuel, an American scientist, coined the term “machine learning” in 1959. He defined it as “the field of study that gives computers the capability to learn without being explicitly programmed.” He developed one of the world’s first successful machine learning programs, the Samuel Checkers-playing Program, which learned to play checkers better than its author.

Types of machine learning

Supervised machine learning in cybersecurity

Supervised machine learning in cybersecurity is used to classify data or predict outcomes. It trains algorithms on labeled datasets in which both the inputs and the expected outputs are specified. As training data is fed in, the model adjusts its weights to fit the data, and cross-validation is used to check that the model neither overfits nor underfits.
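To make the cross-validation step more concrete, here is a minimal sketch in plain Python. The feature values, the 1-nearest-neighbor classifier, and the names `k_fold_indices`, `predict_1nn`, and `cross_validate` are illustrative assumptions, not taken from any particular security product.

```python
def k_fold_indices(n_samples, k):
    """Assign sample indices to k folds in round-robin order."""
    folds = [[] for _ in range(k)]
    for i in range(n_samples):
        folds[i % k].append(i)
    return folds


def predict_1nn(train, x):
    """Return the label of the training sample closest to x."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))
    nearest = min(train, key=lambda sample: squared_distance(sample[0], x))
    return nearest[1]


def cross_validate(samples, k):
    """Hold out each fold once and return the accuracy on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(samples), k):
        held = set(held_out)
        train = [s for i, s in enumerate(samples) if i not in held]
        test = [samples[i] for i in held_out]
        correct = sum(predict_1nn(train, x) == y for x, y in test)
        scores.append(correct / len(test))
    return scores


# Hypothetical labeled samples: (packets per second, failed logins) -> label
SAMPLES = [
    ((10, 0), "benign"), ((900, 40), "malicious"),
    ((12, 1), "benign"), ((950, 35), "malicious"),
    ((8, 0), "benign"), ((870, 50), "malicious"),
    ((15, 2), "benign"), ((990, 45), "malicious"),
]

scores = cross_validate(SAMPLES, k=4)
mean_accuracy = sum(scores) / len(scores)
print(scores, mean_accuracy)
```

Each sample is scored only by a model that never saw it during training, which is what makes the averaged accuracy a fairer estimate than accuracy measured on the training data itself.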

Supervised machine learning in cybersecurity is used in several ways, including:

  • Identifying unique labels of network risks, such as scanning and spoofing
  • Predicting or classifying a target variable for a specific security threat (e.g., a distributed denial of service, or DDoS, attack)
  • Training models on benign and malicious samples to help them predict whether new samples are malicious
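As a rough illustration of training on benign and malicious samples, the sketch below fits a small logistic regression model with stochastic gradient descent. The two features (a normalized file-entropy score and a fraction of suspicious API calls), the sample values, and the function names are all hypothetical; a production system would typically rely on an established library and far more data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(samples, labels, lr=0.5, epochs=1000):
    """Fit weights and a bias so that sigmoid(w . x + b) approximates the labels."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            p = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            error = p - y  # gradient of the log loss with respect to the logit
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict_malicious(weights, bias, x):
    """Return True when the estimated probability of 'malicious' is at least 0.5."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) >= 0.5


# Hypothetical features: (file entropy score, fraction of suspicious API calls)
X = [(0.2, 0.1), (0.3, 0.0), (0.1, 0.2), (0.25, 0.15),   # benign
     (0.9, 0.8), (0.8, 0.9), (0.95, 0.7), (0.85, 0.85)]  # malicious
y = [0, 0, 0, 0, 1, 1, 1, 1]

weights, bias = train_logistic(X, y)
print(predict_malicious(weights, bias, (0.9, 0.9)))
print(predict_malicious(weights, bias, (0.1, 0.1)))
```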

In addition to machine learning in cybersecurity, supervised machine learning can be used for:

  • Binary classification—dividing data into two categories
  • Multi-class classification—choosing between more than two types of answers
  • Regression modeling—predicting continuous values
  • Ensemble learning—combining the predictions of multiple machine learning models to produce an accurate prediction

Examples of techniques used for supervised machine learning in cybersecurity:

  • Adaptive boosting (AdaBoost)
  • Linear regression
  • Logistic regression
  • Naïve Bayes
  • Neural networks
  • Random forest
  • Support vector machines (SVM)

Reinforcement machine learning in cybersecurity

Reinforcement machine learning is a model used for machine learning in cybersecurity that is similar to supervised machine learning. However, reinforcement machine learning trains the algorithm by trial and error rather than using sample data. Positive or negative cues are given and registered along the way, with the algorithm programmed to seek affirmation and avoid penalties.

Reinforcement machine learning is often used to teach a machine to complete a multi-step process where the rules are clearly defined, such as training robots.
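The trial-and-error loop described above can be sketched with tabular Q-learning. The five-cell corridor environment, the reward values, and the hyperparameters below are assumptions chosen only to keep the example small; they do not model a real security system.

```python
import random

N_STATES = 5                  # states 0..4; state 4 is the goal
ACTIONS = ("left", "right")


def step(state, action):
    """Move one cell; reaching the goal is rewarded, every other step costs a little."""
    next_state = max(0, state - 1) if action == "left" else state + 1
    if next_state == N_STATES - 1:
        return next_state, 1.0, True
    return next_state, -0.01, False


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.2, seed=0):
    """Learn action values by acting, observing the reward, and updating estimates."""
    rng = random.Random(seed)
    q = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            if rng.random() < epsilon:      # explore: try a random action
                action = rng.choice(ACTIONS)
            else:                           # exploit: take the best-known action
                action = max(ACTIONS, key=lambda a: q[state][a])
            next_state, reward, done = step(state, action)
            best_next = 0.0 if done else max(q[next_state].values())
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


q_table = train()
policy = [max(ACTIONS, key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

No labeled examples are supplied here: the agent only receives a small penalty for each step and a reward at the goal, and the learned policy emerges from those cues.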

Reinforcement machine learning in cybersecurity is used in several ways, including:

  • Adversarial simulation to train ML models to identify and respond to attacks in real-time
  • Autonomous intrusion detection
  • Cyber-physical systems
  • Distributed denial of service (DDoS) defenses

In addition to machine learning for cybersecurity, reinforcement machine learning is often used in situations where:

  • A model of the environment is known, but an analytic solution is unavailable
  • Only a simulation model of the environment is given
  • The only way to collect environmental information is to interact with it

Examples of techniques used for reinforcement machine learning in cybersecurity:

  • Deep Deterministic Policy Gradient (DDPG)
  • Deep Q Network (DQN)

Unsupervised machine learning in cybersecurity

Unsupervised machine learning in cybersecurity is used to analyze and cluster unlabeled datasets (e.g., photo images, audio and video recordings, articles, or social media posts). It can identify hidden patterns or data groupings without human intervention.

The algorithm scans through data sets, looking for patterns that are used to group information into subsets. Unsupervised techniques are also used in some deep learning approaches.
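As one way to picture this grouping step, here is a minimal k-means sketch in plain Python. The event features (requests per minute, average payload size in KB), the hand-picked starting centroids, and the function names are assumptions made for illustration.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to the nearest centroid and re-centering."""
    centroids = list(centroids)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        for c in range(len(centroids)):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return labels, centroids


# Hypothetical unlabeled events: (requests per minute, average payload in KB)
EVENTS = [(20, 1), (25, 2), (22, 1), (400, 30), (420, 28), (390, 35)]

# Assumption: start from two hand-picked events rather than a random initialization.
labels, centroids = k_means(EVENTS, centroids=[EVENTS[0], EVENTS[3]])
print(labels)
```

The algorithm is never told which events are normal; it only groups similar events, and an analyst or a downstream rule still has to decide what each group means.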

Unsupervised machine learning in cybersecurity can be used in a number of ways, including:

  • Detecting unusual behavior
  • Identifying new attack patterns
  • Mitigating zero-day attacks

In addition to machine learning for cybersecurity, unsupervised machine learning can be used for:

  • Anomaly detection
  • Association mining
  • Clustering
  • Dimensionality reduction (i.e., reducing the number of variables in a data set)

Examples of techniques used for unsupervised machine learning in cybersecurity:

  • K-means clustering
  • Neural networks
  • Principal component analysis (PCA)
  • Probabilistic clustering
  • Singular value decomposition (SVD)

Semi-supervised machine learning in cybersecurity

Semi-supervised machine learning in cybersecurity blends supervised and unsupervised machine learning. It combines a small labeled data set with a larger, unlabeled data set for classification and feature extraction when there is not enough labeled data for a supervised learning algorithm. It is also used when labeling a data set is prohibitively expensive.
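A simple way to see how a few labels can be stretched across unlabeled data is self-training with pseudo-labels. The sketch below uses a nearest-centroid classifier and a distance-gap confidence rule; the data points, the `margin` threshold, and the function names are hypothetical, and the code assumes at least two labeled classes.

```python
def centroid(points):
    return tuple(sum(dim) / len(points) for dim in zip(*points))


def distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def self_train(labeled, unlabeled, margin=2.0, max_rounds=10):
    """Repeatedly pseudo-label the unlabeled points the current model is confident about."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        by_class = {}
        for x, y in labeled:
            by_class.setdefault(y, []).append(x)
        centroids = {y: centroid(xs) for y, xs in by_class.items()}
        confident, remaining = [], []
        for x in pool:
            ranked = sorted(centroids, key=lambda y: distance(x, centroids[y]))
            best, runner_up = ranked[0], ranked[1]
            # Confidence: how much closer the best class is than the second best.
            gap = distance(x, centroids[runner_up]) - distance(x, centroids[best])
            (confident if gap >= margin else remaining).append((x, best))
        if not confident:
            break
        labeled.extend(confident)
        pool = [x for x, _ in remaining]
    return labeled, pool


# Hypothetical data: two labeled samples and a larger unlabeled pool.
LABELED = [((1, 1), "benign"), ((9, 9), "malicious")]
UNLABELED = [(2, 1), (8, 9), (1, 2), (9, 8), (5, 5)]

final_labeled, still_unlabeled = self_train(LABELED, UNLABELED)
print(dict(final_labeled), still_unlabeled)
```

Points that sit between the classes stay unlabeled rather than being forced into a category, which limits the risk of the model reinforcing its own mistakes.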

Semi-supervised machine learning for cybersecurity can be used for:

  • Adversarial neural networks
  • Malicious and benign bot identification
  • Malware detection
  • Ransomware detection

In addition to machine learning for cybersecurity, semi-supervised learning can be used for:

  • Fraud detection
  • Labeling data
  • Machine translation

Examples of techniques used for semi-supervised learning in cybersecurity:

  • Consistency regularization
  • Label propagation
  • Pseudo-labeling
  • Self-training

Benefits of machine learning in cybersecurity

  • Enables BYOD (bring your own device) and CYOD (choose your own device) to be securely implemented
  • Automates cybersecurity processes
  • Detects threats in the early stages
  • Enables adaptable and proactive defense systems
  • Expedites threat detection and response times
  • Identifies hard-to-find network vulnerabilities
  • Internalizes learnings from previous attacks to prevent future attacks based on similar profiles
  • Makes it easier for security analysts to quickly identify, prioritize, and remediate attacks
  • Minimizes human errors
  • Powers sophisticated authentication mechanisms, such as facial recognition, fingerprint recognition, motion tracking, retinal scanners, and voice recognition
  • Helps prevent security threats against endpoints
  • Provides insights into advanced threats
  • Reduces workloads
  • Scans massive amounts of data to identify malware
  • Understands nuances of normal behavior to enable the detection of the smallest deviances

Machine learning in cybersecurity use cases

Detecting and preventing DDoS attacks and botnets

Models can be trained to analyze the large volumes of traffic between different endpoints to proactively identify and predict DDoS attacks (e.g., application, protocol, and volumetric attacks) and botnets.

Detecting web shells

Machine learning models can be trained to identify web shells despite sophisticated evasion techniques.

Machine learning has been reported to detect web shells more accurately than other approaches because the models generalize better to previously unseen pages.

Threat detection and classification

Machine learning is used in applications to facilitate and expedite detection and responses to attacks. Large datasets of security events are analyzed to identify patterns of malicious activities.

Datasets are drawn from a number of sources, such as indicators of compromise (IOCs) and security system log files. When an incident is detected, the machine learning model can automatically trigger a response.

Fighting malware

Models can be trained to help anti-virus solutions fight all types of malware, such as adware, backdoors, ransomware, spyware, and trojans.

Network risk scoring

Machine learning can be used to analyze previous cyberattack datasets to determine areas targeted by particular attacks and assign accurate risk scores that quantify an attack’s location, likelihood, and impact. This data helps organizations prioritize the allocation of resources and directs responses in the event of a pervasive attack.
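The likelihood-and-impact idea can be sketched with a simple frequency estimate. In the example below, a count of past attacks per network segment stands in for a trained model, and the segment names, impact ratings, and the `risk_scores` function are assumptions made for illustration.

```python
from collections import Counter


def risk_scores(past_attacks, impact_by_segment):
    """Score each segment as (share of past attacks) x (assumed impact rating)."""
    counts = Counter(past_attacks)
    total = sum(counts.values())
    scores = {}
    for segment, impact in impact_by_segment.items():
        likelihood = counts[segment] / total if total else 0.0
        scores[segment] = round(likelihood * impact, 3)
    return scores


# Hypothetical history: the segment targeted by each recorded attack.
PAST_ATTACKS = ["web", "web", "vpn", "web", "db", "vpn"]
# Hypothetical impact ratings on a 1-10 scale.
IMPACT = {"web": 5, "vpn": 8, "db": 10}

scores = risk_scores(PAST_ATTACKS, IMPACT)
ranked = sorted(scores, key=scores.get, reverse=True)
print(scores, ranked)
```

Sorting segments by score gives a rough priority order for allocating resources; a real model would estimate likelihood from many more signals than raw attack counts.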

Protecting against application attacks

Machine learning can be used to train models that detect anomalies in HTTP/S traffic and identify attacks such as SQL injection and cross-site scripting (XSS), protecting applications that are prone to Layer 7 attacks.

Securing mobile endpoints

Machine learning is used in a number of detection and response applications to address threats to mobile devices. Another use of sophisticated machine learning is to protect against attacks using voice-based commands by training models to differentiate between the owner’s voice and hackers’ voices.

Security operation centers (SOCs)

Machine learning supports SOCs in monitoring, detecting, and responding to security threats by automating the analysis of the large volumes of data that security tools generate.

Preventing phishing attacks

Machine learning can be used to analyze data in real time and to identify and stop phishing emails. Models trained on email headers, body copy, and punctuation patterns can learn to distinguish between harmful and harmless emails, identifying patterns that reveal possible phishing attacks. The models can also be trained to identify malicious URLs embedded in emails that appear benign.
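As a small illustration of learning from body copy and punctuation, the sketch below trains a multinomial Naive Bayes classifier with Laplace smoothing on a handful of made-up messages. The training texts, the tokenizer, and the `NaiveBayes` class are assumptions for this example; real filters use much larger corpora and additional signals such as headers and URLs.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase words plus a few punctuation marks often seen in phishing lures."""
    return re.findall(r"[a-z']+|[!?$]", text.lower())


class NaiveBayes:
    def fit(self, texts, labels):
        self.doc_counts = Counter(labels)
        self.word_counts = {label: Counter() for label in self.doc_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(self.doc_counts[label] / total_docs)
            for token in tokens:
                score += math.log((counts[token] + 1) / denom)  # Laplace smoothing
            scores[label] = score
        return max(scores, key=scores.get)


# Hypothetical training messages.
TEXTS = [
    "Urgent! Verify your account now!",
    "Your account is suspended, click here to verify!",
    "Win a $ prize, click now!",
    "Meeting notes attached for tomorrow",
    "Lunch on Friday?",
    "Here is the quarterly report you asked for",
]
LABELS = ["phishing"] * 3 + ["legitimate"] * 3

model = NaiveBayes().fit(TEXTS, LABELS)
print(model.predict("Click here to verify your account!"))
print(model.predict("Notes from the meeting tomorrow"))
```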

Task automation

Machine learning excels at automating time-consuming, repetitive, and error-prone security tasks, such as network log analysis, threat analysis, triaging intelligence, and vulnerability assessment. In addition to providing automation, machine learning can identify threats and anomalies in large volumes of data faster than human analysts can.

User and entity behavior analytics (UEBA)

UEBA leverages machine learning to provide visibility into users and entities, detect account compromises, and detect and mitigate malicious or anomalous insider activity. ML algorithms establish baselines for normal behavior patterns, which are then used to identify unusual activity, such as an employee logging in late at night, inconsistent remote access, or an unusually high number of downloads.
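A baseline check of this kind can be sketched with a z-score test. The daily download counts, the threshold of three standard deviations, and the `is_anomalous` function are assumptions for illustration; commercial UEBA tools model many more behaviors and use more robust statistics.

```python
from statistics import mean, stdev


def is_anomalous(history, observed, threshold=3.0):
    """Flag a value that lies more than `threshold` standard deviations from the baseline."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return observed != baseline
    return abs(observed - baseline) / spread > threshold


# Hypothetical daily download counts for one user.
DOWNLOAD_HISTORY = [12, 15, 11, 14, 13, 12, 16]

print(is_anomalous(DOWNLOAD_HISTORY, 14))
print(is_anomalous(DOWNLOAD_HISTORY, 120))
```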

Email monitoring and security

Natural language processing (NLP), a field that relies heavily on machine learning, is effective for monitoring and assessing email for malware and viruses without opening the message.

Evaluating machine learning models

In cases where a machine learning model is not pre-built into a solution, care must be taken when evaluating and selecting models for machine learning in cybersecurity. Considerations when searching for a machine learning model that suits the use case and data include:

  • Determine what resources are available to support machine learning models (e.g., training, monitoring, maintenance, and measuring success)
  • Establish the objective and identify potential data inputs
  • Evaluate outcomes of machine learning models for similar use cases
  • Understand how much data the model requires to be effective

Machine learning challenges

Machine learning in cybersecurity is indisputably a powerful and effective advancement. However, machine learning in cybersecurity does have challenges.

Some of the most commonly cited challenges related to machine learning include:

  • Algorithms trained on data sets that exclude certain information or contain errors can lead to inaccurate models.
  • Monitoring and maintenance are required to keep machine learning models performing optimally.
  • Overfitting and underfitting degrade machine learning models. Overfitting occurs when a model fits its training data too closely and captures noise and inaccuracies in that data, which hurts its performance on new data. Underfitting occurs when a model cannot fully learn the patterns in the training data and cannot deliver accurate results.
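The difference between overfitting and underfitting can be seen by comparing accuracy on training data with accuracy on held-out data. The sketch below uses a one-dimensional k-nearest-neighbors classifier on made-up data in which one training label, (4, 1), is deliberately noisy; all values and names are assumptions for illustration.

```python
from collections import Counter


def knn_predict(train, x, k):
    """Majority label among the k training samples closest to x."""
    neighbors = sorted(train, key=lambda sample: abs(sample[0] - x))[:k]
    return Counter(label for _, label in neighbors).most_common(1)[0][0]


def accuracy(train, data, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in data)
    return hits / len(data)


# Hypothetical data: the true class is 1 for values above 5.
# The training sample (4, 1) is a deliberately mislabeled, noisy point.
TRAIN = [(1, 0), (2, 0), (3, 0), (4, 1), (6, 1), (7, 1), (8, 1), (9, 1)]
HOLDOUT = [(1.5, 0), (2.5, 0), (3.6, 0), (6.5, 1), (7.5, 1), (8.5, 1)]

for k in (1, 3, len(TRAIN)):
    print(k, accuracy(TRAIN, TRAIN, k), accuracy(TRAIN, HOLDOUT, k))
```

With k=1 the model memorizes every training point, including the noisy one, so it scores perfectly on training data but worse on held-out data (overfitting). With k equal to the full training set it always predicts the majority class and misses the pattern entirely (underfitting). The middle setting tolerates the noisy label and generalizes best in this toy example.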

Machine learning myths

Myth: Machine learning in cybersecurity can fully replace human experts.
Reality: While powerful, machine learning cannot replace skilled cybersecurity professionals who offer contextual knowledge, creativity, critical thinking, intuition, and a nuanced understanding of complex attack vectors and cybercriminals’ thinking.

Myth: Machine learning can address all threats and vulnerabilities.
Reality: Certain types of attacks, such as zero-day exploits or highly targeted and sophisticated attacks, can be missed by machine learning models that lack training in that area.

Myth: Machine learning models in cybersecurity do not make mistakes.
Reality: Machine learning models are only as good as the datasets they are fed. The results will be subpar or incorrect if the data is incomplete or inaccurate.

Myth: Machine learning renders attacks ineffective.
Reality: While machine learning models can adjust defenses to counter cyberattack vectors, criminals continuously adjust their approaches with a high degree of efficacy.

Myth: Machine learning in cybersecurity is impervious to adversarial attacks.
Reality: Unfortunately, machine learning is susceptible to adversarial attacks. If an attacker is able to inject misleading or incorrect data into a training dataset, the machine learning model will generate inaccurate results or make erroneous predictions.

Myth: Machine learning is only available to large organizations.
Reality: Machine learning is available and in wide use. Any organization can use and benefit from machine learning at some level by leveraging user-friendly security tools, cloud-based security services, and pre-built models.

Myth: Machine learning in cybersecurity requires large datasets to provide value.
Reality: The efficacy of machine learning improves with the volume of data provided, but models can be used and trained with smaller quantities of quality data.

Machine learning in cybersecurity bolsters solutions that fight threats

Machine learning in cybersecurity gives solutions a special edge that allows them to adjust and become more effective with time and experience. Threat intelligence produced by machine learning not only supports proactive threat protection, but helps make the solutions even better. Machine learning is pervasive and is expected to be a standard part of many solutions.

Date: September 8, 2023
Reading time: 11 minutes
AI & Machine Learning