Introduction

Nowadays, cybersecurity is becoming more and more important in everyday life: technologies evolve continuously, and more malicious people, hackers, try to break those technologies and put users at risk.

Developing a tool that scrapes important information from CVE files, crawls a web server for the plugins, themes and other critical data a website needs to work properly, checks for potentially malicious files, and combines all of this together takes hard work and time.

Although the industry already has some good tools, I have tried another approach, a passive one. This new tool is multi-functional and contains:

  • CVE scraper, used for extracting all the necessary data from an exploit file and storing it in a MongoDB database;
  • Web Crawler, used for finding URLs and links on a starting page, extracting the server’s important data and repeating this for every newly discovered link;
  • API Interface, designed for people who want to search for particular exploits;
  • Malware Detection model, used to find potentially dangerous files stored on the servers.

Scraping and Crawling

Searching GitHub, I found a repository with all the exploits, https://github.com/offensive-security/exploitdb, which makes the CVE Scraper’s job a little easier. Because exploits move fast and more are discovered every day, I wrote a script that downloads the archive from GitHub and extracts all its content, another one that extracts information from the MITRE reference map at https://cve.mitre.org/data/refs/refmap/source-EXPLOIT-DB.html, and a script that takes every file from every folder of the downloaded archive.
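As a rough sketch of how the download-and-walk steps can be wired together (assuming the repository still publishes a branch archive; the URL, paths and function names below are illustrative, not the project’s actual scripts):

```python
# Hypothetical sketch: download the exploitdb archive, extract it, walk every file.
import io
import os
import zipfile

import requests

# Assumption: the repository still exposes a zip archive of its default branch.
ARCHIVE_URL = "https://github.com/offensive-security/exploitdb/archive/master.zip"
DEST_DIR = "exploitdb"


def download_and_extract(url: str = ARCHIVE_URL, dest: str = DEST_DIR) -> None:
    """Download the GitHub archive and unpack it into `dest`."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(response.content)) as archive:
        archive.extractall(dest)


def iter_exploit_files(root: str = DEST_DIR):
    """Yield the path of every file in every folder of the extracted archive."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)
```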

The hardest part of developing the project was, by far, scraping the files retrieved from GitHub. Walking through the folders, I was able to categorize the files by the language the exploit was written in, or in some cases by file extension. In that way, the project ended up with 48 scrapers, one for almost every known file extension (e.g. .rb, .py, .html).
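A small sketch of the dispatch idea, with placeholder scraper names (the project’s actual scrapers are far more elaborate):

```python
# Hypothetical sketch: route each exploit file to a scraper chosen by its extension.
import os


def scrape_python(path: str) -> dict:
    """Placeholder for the .py scraper: extract title, versions, paths, etc."""
    return {}


def scrape_ruby(path: str) -> dict:
    """Placeholder for the .rb scraper."""
    return {}


def scrape_generic(path: str) -> dict:
    """Fallback for extensions without a dedicated scraper."""
    return {}


SCRAPERS = {".py": scrape_python, ".rb": scrape_ruby}


def scrape_file(path: str) -> dict:
    """Pick the scraper registered for the file's extension and run it."""
    ext = os.path.splitext(path)[1].lower()
    return SCRAPERS.get(ext, scrape_generic)(path)
```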

I am using a MongoDB database to store the extracted data in a clean, human-readable form. The following screenshot shows the details:

MongoDB document
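For illustration, storing one scraped exploit with pymongo might look like the snippet below; the database, collection and field names are assumptions, not necessarily the ones the tool uses:

```python
# Hypothetical sketch of inserting one scraped exploit into MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
collection = client["exploits_db"]["exploits"]

document = {
    "edb_id": "12345",                           # Exploit-DB identifier
    "cve": ["CVE-2019-0000"],                    # CVEs from the MITRE reference map
    "title": "Example Plugin 1.2.3 - SQL Injection",
    "versions": ["1.2.3"],                       # versions found in title/description
    "paths": ["/wp-content/plugins/example/"],   # attacked paths, if any
}
collection.insert_one(document)
```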

For modularity, the Web Crawler was divided into small Python modules:

  • The Crawler itself;
  • A Queue, which stores the discovered URLs in a certain order, groups them by domain, reorders them and blacklists some domains like Facebook, Twitter, etc.;
  • The Checker, which compares the data found on the server with the information gathered from the scraped exploits and returns the matching vulnerabilities and their likelihood;
  • The Data scraper, which parses the given URL, if it was not already parsed, in order to get the technologies, plugins and other information the URL uses;
  • A Redis cache, which stores the data scraped from each domain and its respective vulnerabilities, reducing the time spent on a single URL and increasing efficiency;
  • The Extractor, which contains a single method that uses HTMLParser from the selectolax module to scrape the given page for URLs (see the sketch below).
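To illustrate the Extractor’s approach, here is a minimal sketch using selectolax’s HTMLParser; the function name and exact selection logic are assumptions, not the project’s actual method:

```python
# Minimal sketch: pull every link out of a page with selectolax.
from urllib.parse import urljoin

from selectolax.parser import HTMLParser


def extract_urls(html: str, base_url: str) -> set[str]:
    """Return the absolute URLs of all <a href> links found in the page."""
    tree = HTMLParser(html)
    urls = set()
    for node in tree.css("a[href]"):
        href = node.attributes.get("href")
        if href:
            urls.add(urljoin(base_url, href))
    return urls
```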

The Checker class is the one responsible for matching the results produced by the Data scraper against the information extracted by the CVE Scraper. This class holds all the exploits found and their likelihood. Depending on the data extracted and its origin, there are four levels of likelihood, which I compute according to the attacked technologies (a small sketch follows the list):

  • True vulnerabilities, with a 100% chance of success, are exploits for which the CVE Scraper found the version in the title and it matches the version found on the server. If the target is a plugin, the match must also include the version of the CMS, besides the version of the plugin. The last requirement, also fulfilled here, is the presence of the attacked path on the server.
  • Almost true vulnerabilities, with a 75% chance of success, are just like the true ones, except that the version was found in the description rather than the title, so the regexes used may introduce some errors.
  • Probable vulnerabilities, with a 50% chance of success, are exploits for which the CVE Scraper could not find any version. That does not mean the target is not vulnerable: if the server exposes a URL that the exploit targets, there may still be a chance of success.
  • Possible vulnerabilities, with a 25% chance of success, are exploits for which the CVE Scraper found all the targeted versions, but no attack path was found or the path is not present on the server.
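The sketch below shows one way these four levels could be computed; the field names and the simplified rules (the CMS-version check for plugins is omitted) are assumptions, not the tool’s actual code:

```python
# Illustrative sketch of the four likelihood levels described above.
def likelihood(exploit: dict, server: dict) -> int:
    """Return the chance of success (in %) for an exploit against a scanned server."""
    version_matches = bool(exploit.get("versions")) and any(
        v in server.get("versions", []) for v in exploit["versions"]
    )
    path_present = exploit.get("path") in server.get("paths", [])

    if exploit.get("version_in_title") and version_matches and path_present:
        return 100  # true vulnerability
    if exploit.get("version_in_description") and version_matches and path_present:
        return 75   # almost true: version came from the description
    if not exploit.get("versions") and path_present:
        return 50   # probable: no version known, but the attacked path exists
    if version_matches and not path_present:
        return 25   # possible: versions match, but no attack path on the server
    return 0
```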

After all possible URLs are parsed and the vulnerabilities are discovered, the tool shows all the information found for the domains in scope, as in the following screenshot:

Results

Malware Detection

Surfing the Internet across different websites, some legitimate, others malicious, users expose themselves to any of the following risks:

  • Losing control of their computers;
  • Identity theft;
  • Financial theft.

In the past, malware detection was done using signatures. Nowadays, it relies on advanced Machine Learning techniques, which makes detection more accurate.

The goal is to predict whether a given file is malicious or not. To do this, the model needs to learn to distinguish between malicious and clean files. For this purpose, I found a good dataset, https://drive.google.com/file/d/1HIJShr0GvQCUp_0R_kQe_WLG5PippurN/view, containing approximately 200,000 Windows PE samples divided into those two categories, each PE file described by 486 features. The dataset was created by evilsocket and published on his website.

I have used a Deep Learning model with 3 Dense layers, Dropout layers in between, the Adam optimizer and a binary cross-entropy loss. The model reached 97% accuracy on the test set, as the following pictures show:

Accuracy curve
Confusion Matrix
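For reference, a minimal Keras sketch of the described architecture; the layer widths and dropout rate are assumptions, not the exact values used in the project:

```python
# Sketch: 3 Dense layers with Dropout in between, Adam, binary cross-entropy.
from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(486,)),              # 486 features per PE sample
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(1, activation="sigmoid"),  # binary output: malicious vs. clean
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```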

Conclusions & Future Work

I have gone through a lot of information about the cybersecurity field
and machine learning to be able to present the tool in a way that everybody
can use it daily, without worrying about the problems that might occur.

Because attackers come up with new malicious code every day, adding more
sources of exploits would be a good starting point for future work. This way, the tool can find more vulnerabilities and stay up to date. Also, handling hand-crafted websites would be a good point to add to the improvements list.

Right now, the tool runs on a single main thread, which limits its performance. A good starting point for improving performance would be to scale the number of threads according to the workload.

Regarding the machine learning model, it can only detect Windows PE
malware, so ELF detection would be a useful new feature.

The tool can be found on my GitHub page.