Machine Learning for Cyber Security resources

A curated list of tools and resources related to the use of machine learning for cyber security.

Applying machine learning to cyber security is difficult because advances in the field open up so many possibilities that it is hard to identify the use cases that are genuinely valuable for implementation and decision making. Moreover, intruders can use the same technologies to attack computer systems. The goal of this list is to collect tools and resources related to the use of machine learning for cyber security.

Machine Learning Cyber security resources

Datasets

Samples of Security Related Data
- Samples of various types of security-related data, covering
  - Network
  - Malware
  - System
  - Password
  - Threat Feeds
Stratosphere IPS Data Sets
- Stratosphere Research Laboratory
Open Data Sets
- Comprehensive, Multi-Source Cyber-Security Events
- Unified Host and Network Data Set
- User-Computer Authentication Associations in Time
Data Capture from the National Security Agency
- Datasets permitted by The National Security Agency
  - Snort Intrusion Detection Log
  - Domain Name Service Logs
  - Web Server Logs
  - Log Server Aggregate Log
The ADFA Intrusion Detection Data Sets
- The data sets cover both Linux and Windows and are designed for evaluating system-call-based host intrusion detection systems (HIDS)
NSL-KDD Data Sets
Malicious URLs Data Sets
- Detecting Malicious URLs
Multi-Source Cyber-Security Events
- This data set represents 58 consecutive days of de-identified event data collected from five sources within Los Alamos National Laboratory’s corporate, internal computer network.

Papers

Fast, Lean, and Accurate: Modeling Password Guessability Using Neural Networks
- Awarded Best Paper
Outside the Closed World: On Using Machine Learning for Network Intrusion Detection
- using machine learning tools to monitor network activity
Anomalous Payload-Based Network Intrusion Detection
- fully automatic payload-based anomaly detector
Malicious PDF detection using metadata and structural features
Adversarial support vector machine learning
Exploiting machine learning to subvert your spam filter
CAMP – Content Agnostic Malware Protection
Notos – Building a Dynamic Reputation System for DNS
Kopis – Detecting malware domains at the upper dns hierarchy
Pleiades – From Throw-away Traffic To Bots – Detecting The Rise Of DGA-based Malware
EXPOSURE – Finding Malicious Domains Using Passive DNS Analysis
Polonium – Tera-Scale Graph Mining for Malware Detection
Nazca – Detecting Malware Distribution in Large-Scale Networks
PAYL – Anomalous Payload-based Network Intrusion Detection
Anagram – A Content Anomaly Detector Resistant to Mimicry Attacks
Applications of Machine Learning in Cyber Security
- This study covers phishing detection, network intrusion detection, testing security properties of protocols, authentication with keystroke dynamics, cryptography, human interaction proofs, spam detection in social networks, smart meter energy consumption profiling, and security issues in machine learning techniques themselves.
An Investigation of Byte N-Gram Features for Malware Classification
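
As a rough illustration of the byte n-gram features named in the last paper above, the sketch below counts overlapping byte sequences in a buffer and turns them into relative frequencies over a fixed vocabulary. It is not the method of any paper in this list: the function names `byte_ngrams` and `ngram_features`, the sample bytes, and the three-entry vocabulary are all made up for the example.

```python
from collections import Counter


def byte_ngrams(data: bytes, n: int = 2) -> Counter:
    """Count every overlapping run of n consecutive bytes in data."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slicing bytes yields bytes, so each key is an n-byte string.
    return Counter(data[i:i + n] for i in range(len(data) - n + 1))


def ngram_features(data: bytes, vocabulary: list, n: int = 2) -> list:
    """Relative frequency of each vocabulary n-gram, in vocabulary order."""
    counts = byte_ngrams(data, n)
    total = sum(counts.values())
    if total == 0:
        # Input shorter than n bytes: no n-grams to count.
        return [0.0] * len(vocabulary)
    return [counts[gram] / total for gram in vocabulary]


# Made-up bytes for illustration, not a real executable.
sample = b"MZ\x90\x00MZ\x90\x00"
counts = byte_ngrams(sample, n=2)

# Hypothetical vocabulary; in practice it would be chosen from training data.
vocabulary = [b"MZ", b"\x90\x00", b"\xff\xff"]
features = ngram_features(sample, vocabulary, n=2)

print(counts.most_common(3))
print(features)
```

A fixed-length vector like `features` is the kind of input a classifier could consume. Choosing `n` and the vocabulary is the hard part in practice, and this sketch does not address it.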

Books

Data Mining and Machine Learning in Cybersecurity
- a well-organized book that appears to draw on extensive experience and research.
Machine Learning and Data Mining for Computer Security
- This book provides an overview of the current state of research in machine learning and data mining as it applies to problems in computer security.
Network Anomaly Detection: A Machine Learning Perspective
- this book presents machine learning techniques in depth to help you more effectively detect and counter network intrusion.
Machine Learning for Hackers: Case Studies and Algorithms to Get You Started
More Machine learning books

Videos

Tutorials

Big Data and Data Science for Security and Fraud Detection
- review of big data analytics tools and technologies that combine text mining, machine learning and network analysis for security threat prediction, detection and prevention at an early stage
Using deep learning to break a Captcha system
Data mining for network security and intrusion detection

Courses

Data Mining for Cyber Security by Stanford

Miscellaneous

Machine Learning Will Not Replace Other Cybersecurity Methods

5 Reasons Why Machine Learning Will Not Replace Other Cybersecurity Methods and Real-Life Examples of Effective ML for Data Protection

Where and How Machine Learning Is Used in Cybersecurity: 5 Practical Cases

Banks are widely regarded as the largest users and drivers of Big Data technologies and machine learning in cybersecurity. For example, machine learning helps Home Credit Bank’s IT specialists monitor the operation of banking systems and promptly identify abnormal activity by individual components or users. Machine learning (ML) methods are also actively used by other high-tech companies in developing specialized software.

The history of Sqrrl is particularly interesting. Sqrrl is a secure DBMS, a graph-oriented NoSQL database based on Apache Accumulo. This cyberthreat hunting platform uses machine learning to visualize the vulnerabilities of computer networks. In January 2018, Amazon acquired Sqrrl for its Amazon Web Services cloud business.

Demisto, a company promoting a Security Orchestration, Automation and Response (SOAR) approach to cybersecurity, uses ML algorithms in its platform’s visual dashboard to prioritize threat messages.

It is also worth noting the experience of the Russian IT company Kaspersky Lab, which actively integrates machine learning models into its anti-virus products. To reduce the number of false positives, improve the interpretability of results, and make the software more resistant to the actions of a potential attacker, Kaspersky Lab uses decision trees, locality-sensitive hashing, behavioral models, and ML clustering algorithms.

Likewise, Microsoft has created its own cybersecurity system, Windows Defender Advanced Threat Protection, for proactive protection, breach detection, automatic investigation, and threat response. This product is integrated into all Windows 10-based devices and is actively used together with the company’s cloud services. In addition, the ML system built into Windows Defender analyzes large volumes of behavioral data every day to prevent possible attacks. For example, when a malicious cryptominer is installed into the browser of an individual Windows user, the system recognizes and blocks the threat in just a few milliseconds. A similar threat at the enterprise level is repelled within a couple of seconds thanks to the effective use of machine learning methods.

How machine learning improves cybersecurity

Will Machine Learning Launch a Revolution in Cybersecurity and Why

Despite optimistic forecasts that automatic machine learning algorithms will soon replace human information security specialists, it is still too early to talk about this. The following reasons prevent the complete abandonment of previous cybersecurity methods in favor of machine learning:

- Neural network models behave like a “black box”, a “thing in itself”: they do not explain why a particular result was obtained from the given input data. This lack of direct feedback makes it impossible to abandon human control entirely in important areas such as information security, much as an airplane keeps manual controls even with a very smart autopilot.
- There are not enough datasets to train ML models correctly in all areas of cyber threats, from computer viruses to social engineering techniques.
- ML algorithms and the datasets they use can themselves be attacked, which can lead to wrong decisions, missed attacks, or false positives.
- Attackers also use machine learning algorithms to create malware, analyze user behavior, develop bots that collect personal data, search for vulnerabilities, guess passwords, spoof identity, bypass security systems, etc.

It is also worth noting a conflict between the requirements of the General Data Protection Regulation (GDPR), which protects the personal data of citizens and residents of the European Union, and the use of this information in ML models for cybersecurity. In particular, the GDPR gives users the opportunity to “be forgotten” if they do not consent to the collection of their personal data or decide to withdraw that consent. This requirement may be violated if an ML model automatically analyzes user behavior (cookies, data about the device, browser, etc.) to prevent threats without explicitly informing the user.

Thus, machine learning cannot replace the previously existing cybersecurity methods, but it significantly complements and expands them. In particular, ML models improve the accuracy of signature analysis, which processes queries quickly and does not require a long training period. You can therefore use signature analysis to identify queries with clear signs of an attack, and machine learning to analyze the rest of the queries. Combining the methods in this way gives anti-virus software high speed with a minimum number of false positives and missed attacks.
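
The combination described above can be sketched as a two-stage check: a fast signature pass first, then a model for whatever the signatures do not catch. This is a minimal sketch and not any vendor's implementation. It assumes the "queries" are web request strings. The three regular expressions, the `toy_model_score` stand-in (a simple character heuristic, not a trained model), and the `0.3` threshold are all hypothetical choices made for the example.

```python
import re
from typing import Callable

# Hypothetical signatures for illustration only; real rule sets are far larger.
SIGNATURES = [
    re.compile(r"union\s+select", re.IGNORECASE),  # SQL injection pattern
    re.compile(r"<script\b", re.IGNORECASE),       # script injection pattern
    re.compile(r"\.\./\.\./"),                     # path traversal pattern
]


def matches_signature(query: str) -> bool:
    """Fast path: True if any known attack pattern occurs in the query."""
    return any(sig.search(query) for sig in SIGNATURES)


def toy_model_score(query: str) -> float:
    """Stand-in for a trained model: share of unusual characters (0.0 to 1.0)."""
    if not query:
        return 0.0
    unusual = sum(1 for ch in query if not (ch.isalnum() or ch in "=&/?.-_ "))
    return unusual / len(query)


def classify(
    query: str,
    model: Callable[[str], float] = toy_model_score,
    threshold: float = 0.3,
) -> str:
    """Signature analysis first; the model only sees what signatures miss."""
    if matches_signature(query):
        return "blocked:signature"
    if model(query) >= threshold:
        return "blocked:model"
    return "allowed"


for q in [
    "id=1 UNION SELECT password FROM users",  # caught by a signature
    "q=machine+learning&page=2",              # passes both stages
    "x=';{}$()`|^~",                          # no signature, high model score
]:
    print(f"{classify(q):<18} {q}")
```

Because `classify` takes the model as a parameter, a real classifier could replace the heuristic without touching the signature stage. This ordering also means the slower model runs only on queries that the cheap check leaves undecided.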

Machine learning does not supplant old cybersecurity methods, but complements them