(this is a very long post, sorry; took weeks to distill it into something that I hope is readable)
As promised, today I am finally going to demonstrate that piracy is good! (sometimes)
To do so, though, I need to start with a bit of a non sequitur…
There are two questions that today’s forensic and telemetry technologies fail to answer quickly, let alone clearly:
The first question is super important.
Before we even start that basic forensic triage, kick off these evidence-collecting scripts, and heat up the pipelines focused on automated forensic data processing… it would be really cool to read a basic summary of what that endpoint IS all about – showing us the ‘easy’ findings first:
Nah, just kidding… it’s a trap.
So many automated data-extraction approaches focus on all this unimportant, but easy-to-extract stuff that they almost always end up delivering something I vehemently hate: fluff.
Let’s remember that activity is not productivity.
That is, even with all these fancy auto-generated summaries available, the question posed above will still remain unanswered…
Why?
Because the true, honest answer to this question can take many forms, and none of them really care about the ‘easy to code, but fails to answer the basic question’ type of automation… What this means is that we don’t want yet another ‘quantity over quality’ tool in the house – an endpoint equivalent of a vulnerability scanner doing a forensic-wannabe job, but in the end delivering nothing but non-actionable, confusion-soaked nothingburgers.
So, without further ado, let’s demonstrate what a proper answer could look like (and it’s 100% hypothetical). I hope you will agree that any subset of the below could be helpful:
Do you see what I am talking about here?
The art of quick, early, and meaningful system profiling based on the existing forensic evidence!
And yeah, it’s not an easy task, it’s not fool-proof either, and no, we can’t rely solely on regular expressions or AI applied to the various forensic artifacts discovered during triage/preliminary analysis, but… anything goes… any decent commentary we can provide about the actual system’s content before any manual forensic exam starts is a decent start! Does it bring a bias to the exam? 100%. Does it make it easier to automate triage towards this bias? 100%.
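To make it a bit more tangible, here is a minimal sketch of what such a rule-driven profiling pass could look like – a handful of regexes applied to a triage file listing. The rules, paths, and conclusions below are made up purely for illustration; a real rule set would be much larger and tuned to your own environment:

```python
import re

# Hypothetical profiling rules: a regex applied to file paths harvested during
# triage -> a human-readable observation about the endpoint. Illustrative only.
PROFILE_RULES = [
    (r"\\Program Files\\Microsoft SQL Server\\", "Likely a database server (MS SQL Server present)"),
    (r"\\(Visual Studio|JetBrains|Git)\\",       "Likely a developer workstation (IDEs/VCS present)"),
    (r"\\inetpub\\wwwroot\\",                    "Hosts IIS web content"),
    (r"\\(OneDrive|Dropbox|Google Drive)\\",     "Cloud file-sync clients in use"),
    (r"(?i)keepass|lastpass|1password",          "Password manager present"),
]

def profile_endpoint(file_paths):
    """Return a de-duplicated list of 'easy findings' for a triage file listing."""
    observations = set()
    for path in file_paths:
        for pattern, note in PROFILE_RULES:
            if re.search(pattern, path):
                observations.add(note)
    return sorted(observations)

if __name__ == "__main__":
    # Example input: paths pulled from a triage collection ($MFT, dir listing, etc.)
    sample = [
        r"C:\Program Files\Microsoft SQL Server\MSSQL15.MSSQLSERVER\MSSQL\Binn\sqlservr.exe",
        r"C:\Users\jdoe\AppData\Local\JetBrains\Toolbox\toolbox.exe",
        r"C:\inetpub\wwwroot\web.config",
    ]
    for line in profile_endpoint(sample):
        print("-", line)
```

It is biased by design – that is the whole point.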
What I posit is this:
Given the advances in forensic technologies related to data acquisition, data collection, data processing, data triage and data analysis automation (plus maybe AI), are we in a position to move the evidence analysis flagpole forward towards… maybe not the ‘one click solution’ yet… but kinda towards it, anyway? Saving lots of person-hours in the process?
And if we extrapolate…
Question #2 is very fascinating as well…
What will I find inside this org?
Your asset inventory, your SBOM, your ad hoc queries combing through recent process/file/service creation telemetry all add value… BUT… none of it works long-term.
Why?
Collating information from various (very dynamic in nature) sources is HARD. The IT sector is still firmly stuck in a Don Quixote-esque notion that we can create a perfect asset inventory using the available people, process, and technology adjustments, but the reality is far more complicated and nuanced than that…
On a practical level…
The bottom line is: there is no such thing as an asset inventory. There is an asset inventory process. It’s a living thing, very dynamic and capricious in nature, and it’s time to start treating it as such. Does it sound familiar? Yes… security is a process, not a product/tool, too.
We can’t win all battles, but we can settle on winning the important ones.
If we think about it… new assets don’t appear out of nowhere – new employees join, cloud systems are created via API or UI, data center computers (both physical and virtual) are added, acquired, leased, deployed, new IoT devices are added to the network for one reason or another, and new devices are added to production – be it corp, dev, lab, or guest networks – at any time of day, and so on and so forth.
The very same can be said about asset decommissioning – a laptop’s battery gets swollen and the laptop is out, a laptop gets old and is out, a company laptop uplift/upgrade program replaces the old devices, sometimes a random laptop is lost or stolen and then it’s out, that specific cloud system that was active for one month only and then terminated by a script is out, an employee is let go and the employee’s laptop gets wiped and returned to the available pool of laptops, and so on and so forth. It all looks very complicated as a whole, but it becomes far easier when we realize that every single case involves SOMEONE or a SCRIPT doing SOMETHING that affects the state of the asset inventory snapshot.
If any sort of process to handle these use cases is actually defined, described, and present at any given time… everything else falls into place. Including the contribution to maintaining the best asset inventory ever. Sorry, I meant… following the asset inventory process, that is.
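Here is a minimal sketch of that way of thinking – an append-only log of ‘someone or a script did something’ events, with the inventory snapshot being nothing more than a projection of that log at a given moment. The event types and field names are invented for illustration:

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical event types; real processes (joiner/mover/leaver, cloud
# provisioning, laptop refresh, etc.) would each emit one of these.
ADDED, UPDATED, DECOMMISSIONED = "added", "updated", "decommissioned"

@dataclass
class AssetEvent:
    timestamp: datetime
    actor: str        # SOMEONE or a SCRIPT...
    action: str       # ...doing SOMETHING...
    asset_id: str     # ...that affects the state of a given asset
    details: dict

def snapshot(events):
    """Replay the event log and return the inventory as of 'now'."""
    assets = {}
    for e in sorted(events, key=lambda e: e.timestamp):
        if e.action == ADDED:
            assets[e.asset_id] = dict(e.details, last_actor=e.actor)
        elif e.action == UPDATED and e.asset_id in assets:
            assets[e.asset_id].update(e.details, last_actor=e.actor)
        elif e.action == DECOMMISSIONED:
            assets.pop(e.asset_id, None)
    return assets

# Example: a new laptop is deployed, then later lost/stolen and removed.
log = [
    AssetEvent(datetime(2024, 1, 10), "it-onboarding-script", ADDED,
               "LAPTOP-0042", {"owner": "jdoe", "model": "X1"}),
    AssetEvent(datetime(2024, 6, 2), "helpdesk", DECOMMISSIONED,
               "LAPTOP-0042", {"reason": "lost/stolen"}),
]
print(snapshot(log))   # {} - the snapshot is just a projection of the process
```

The ‘inventory’ is whatever the process says it is at the moment you ask.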
If we dig deeper, we should start looking at our application inventory too — you know, the asset inventory bit covering all the software being used at the company. The naive approach focuses on:
Of course, we miss the whole class of software that is marketed as ‘portable’ and can be installed and stored in some random directories on the system. Of course we miss the whole class of software that comes as browser plug-ins, email client plug-ins, and <any type of program> plug-ins. Of course we miss the whole class of cloud/web-based software. Of course we miss the whole class of software code that is hidden inside some random nim, pip, go packages. Of course, we miss a lot of software that is directly incorporated or embedded from many resources outside of our control…
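For contrast, this is roughly what the naive approach boils down to on Windows – walking the usual Uninstall registry keys. A sketch only; by design it will miss every single category listed above:

```python
import winreg  # Windows only

# The classic 'installed programs' enumeration: walk the Uninstall keys.
UNINSTALL_KEYS = [
    (winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall"),
    (winreg.HKEY_LOCAL_MACHINE, r"SOFTWARE\WOW6432Node\Microsoft\Windows\CurrentVersion\Uninstall"),
    (winreg.HKEY_CURRENT_USER,  r"SOFTWARE\Microsoft\Windows\CurrentVersion\Uninstall"),
]

def installed_programs():
    """Yield (name, version) for software that registered an uninstaller."""
    for hive, path in UNINSTALL_KEYS:
        try:
            root = winreg.OpenKey(hive, path)
        except OSError:
            continue
        for i in range(winreg.QueryInfoKey(root)[0]):
            sub = winreg.OpenKey(root, winreg.EnumKey(root, i))
            try:
                name = winreg.QueryValueEx(sub, "DisplayName")[0]
            except OSError:
                continue  # no DisplayName -> silently skipped, i.e. missed
            try:
                version = winreg.QueryValueEx(sub, "DisplayVersion")[0]
            except OSError:
                version = None
            yield name, version

if __name__ == "__main__":
    for name, version in installed_programs():
        print(name, version)
```

Anything that never registered an uninstaller simply does not exist as far as this approach is concerned.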
And there is more…
Have you ever heard of Homebrew or Chocolatey? Then there are legitimate app stores, dodgy app stores, warez sites, hacktools, people learning pentesting on the job and playing with hacking tools downloaded to, and executed from, random places, random software repos introduced by downloading and unpacking random archives, and… of course… the internal software developed at the company – lab and dev environments are code-rich and include lots of test, ad hoc compiled code that is often not very useful and adds nothing to the whole idea of an asset inventory, but may trigger AV/EDR alerts.
Still, the idea of an asset inventory snapshot comes with the territory. Be invasive, be scrupulous. If any of these programs use external, often unpatched libraries that may be vulnerable (f.ex. log4j, libcurl), we certainly want to ‘see’ them, too.
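A minimal sketch of that ‘be invasive’ sweep could look like this – matching well-known library file names across the filesystem. The patterns are just examples I picked for illustration; filename matching alone will obviously miss renamed copies and libraries embedded inside archives, so treat it as a starting point, not a solution:

```python
import re
from pathlib import Path

# Filename patterns for embedded/bundled libraries we want to 'see'.
# Purely illustrative; a real sweep would also hash files, look inside
# archives (e.g. log4j classes buried in fat JARs), and verify versions.
LIBRARY_PATTERNS = {
    "log4j":   re.compile(r"^log4j-core-(\d+\.\d+\.\d+)\.jar$", re.I),
    "libcurl": re.compile(r"^libcurl[-.\w]*\.(?:dll|so[.\d]*|dylib)$", re.I),
    "openssl": re.compile(r"^(?:libssl|libcrypto)[-.\w]*\.(?:dll|so[.\d]*|dylib)$", re.I),
}

def sweep(root):
    """Walk 'root' and yield (library, version_or_None, path) hits."""
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        for lib, pattern in LIBRARY_PATTERNS.items():
            m = pattern.match(path.name)
            if m:
                version = m.group(1) if m.groups() else None
                yield lib, version, str(path)

if __name__ == "__main__":
    for lib, version, where in sweep(r"C:\Program Files"):
        print(f"{lib:8} {version or '?':10} {where}")
```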
There are so many unknown unknowns out there that it is scary. And it should be.
This is why we can introduce a number of enhancements to our asset inventory process concept:
Enter the art of artifact collection and hoarding for the sake of forensic exclusivity.
Knowing as many known programs as possible is helpful.
Knowing means:
The second-to-last bullet point is where I am finally going to demonstrate that piracy is good! (sometimes).
You may collect lots of software packages for analysis, you may excel at unpacking and analyzing installers, but I think nothing will beat the software categorization that pirate sites offer. There is a lot of crowdsourced information available on warez sites that we can think of leveraging. Whether it is a torrent site, a magnet link site, a good old (S)FTP or Usenet site, or even one of the many prevalent open directory-type resources, they almost always come with some sort of categorization in place…
We can web-scrape and spider these sites, and we can build classifiers based on folder and file names, archive names, internal file names, and the directories and registry keys present in all these creations (some only visible inside the installers).
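A name-based classifier along these lines can be as simple as the sketch below – the categories and keywords intentionally mimic the kind of buckets warez/download sites use, but they are made up for illustration and nowhere near exhaustive:

```python
import re

# Category -> keywords typically seen in folder/archive/installer names.
# Illustrative buckets only; a real list would be scraped/curated.
CATEGORY_KEYWORDS = {
    "multimedia":       ["codec", "player", "dvd", "mp3", "video", "audio"],
    "graphics":         ["photo", "paint", "draw", "cad", "render"],
    "security":         ["antivirus", "firewall", "vpn", "password"],
    "development":      ["sdk", "ide", "compiler", "debugger", "python", "java"],
    "office":           ["office", "pdf", "spreadsheet", "word", "excel"],
    "system utilities": ["defrag", "backup", "partition", "registry", "cleaner"],
}

TOKEN = re.compile(r"[a-z0-9]+")

def classify(name):
    """Guess categories for a package/folder/archive name (best effort)."""
    tokens = set(TOKEN.findall(name.lower()))
    scores = {}
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in tokens)
        if hits:
            scores[category] = hits
    return sorted(scores, key=scores.get, reverse=True) or ["unknown"]

if __name__ == "__main__":
    for sample in ["Some.DVD.Player.Pro.v7.2-KEYGEN.zip",
                   "acme_pdf_to_excel_converter_setup.exe",
                   "totally_legit_tool.7z"]:
        print(sample, "->", classify(sample))
```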
Luckily, there is more.
The bitter truth is that anyone trying to classify 20-30K software packages ‘present’ or ‘discoverable’ inside your average org is going to have a hard time.
The ‘warez’ angle is useful, but it’s probably the least impressive/important one. So… piracy may be useful, but it doesn’t add that much value.
Why?
A more advanced and detailed software classification approach is already available on the internet. All over the place. Go to your random/favorite software download site and you can see all these software packages arranged and categorized. You can web-scrape/spider these too.
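A scraping sketch could look like the one below (assuming the requests and BeautifulSoup packages are available). Note that the URL, the page layout, and the selectors are all invented, so treat it as a template only, and respect the robots.txt/ToS of whatever site you actually target:

```python
# Deliberately generic scraping sketch. The URL, page structure, and CSS
# selectors below are INVENTED for illustration - any real site will differ.
import requests
from bs4 import BeautifulSoup

def scrape_categories(url):
    """Return {program_name: category} pairs from a hypothetical listing page."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    mapping = {}
    # Assumed layout: <div class="category"><h2>Audio</h2><a class="app">Foo</a>...</div>
    for section in soup.select("div.category"):
        heading = section.find("h2")
        if not heading:
            continue
        category = heading.get_text(strip=True)
        for link in section.select("a.app"):
            mapping[link.get_text(strip=True)] = category
    return mapping

if __name__ == "__main__":
    # Hypothetical URL - replace with a real, permitted source.
    print(scrape_categories("https://downloads.example.com/windows/"))
```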
Then there is the whole business of PAD files. It’s an old, XML-based software description standard developed by the Association of Software Professionals (ASP). The more PAD files you can download and parse, the easier it may become to classify software found in your org! Of course, the ASP ceased to function in 2021. The very useful http://repository.appvisor.com/ page is no longer in operation, but we can still find its snapshots on the Web Archive – last saved in March 2024. The categorization bit available there is gold and should be preserved.
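Parsing them is trivial; the sketch below pulls out the classification-relevant bits. The element names follow the PAD 3.x spec as commonly published (e.g. Program_Category_Class) – verify them against the archived samples you actually collect, since versions differ and plenty of files in the wild are malformed:

```python
# Minimal PAD file parser - element names assumed from the PAD 3.x spec.
import xml.etree.ElementTree as ET
from pathlib import Path

FIELDS = {
    "name":     "Program_Info/Program_Name",
    "version":  "Program_Info/Program_Version",
    "category": "Program_Info/Program_Category_Class",   # e.g. "Audio & Multimedia::Rippers & Converters"
    "keywords": "Program_Descriptions/English/Keywords",
}

def parse_pad(path):
    """Extract the classification-relevant fields from a single PAD XML file."""
    root = ET.parse(path).getroot()   # root element is typically XML_DIZ_INFO
    record = {}
    for field, xpath in FIELDS.items():
        node = root.find(xpath)
        record[field] = node.text.strip() if node is not None and node.text else None
    return record

def build_catalog(pad_dir):
    """Parse every *.xml PAD file in a directory into a lookup keyed by name."""
    catalog = {}
    for pad in Path(pad_dir).glob("*.xml"):
        try:
            rec = parse_pad(pad)
        except ET.ParseError:
            continue  # plenty of malformed PAD files in the wild
        if rec.get("name"):
            catalog[rec["name"].lower()] = rec
    return catalog

if __name__ == "__main__":
    # Hypothetical directory of PAD files harvested from archive snapshots.
    catalog = build_catalog("pad_files")
    print(len(catalog), "programs categorized")
```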
Then there is a completely new level of abstraction: SEO-driven web pages. There are a lot of websites that list a lot of references to categorized software packages, f.ex. TechStack.
These are great points of reference and we should utilize them as well.
And there are also websites that very much still live in the past: oldversion.com, majorgeeks.com, etc. After exploring these resources for a bit, you may realize that a lot of the software available today – and found on org endpoints – is not only already very well classified, but also hard to miss…
When you explore a couple of the ideas presented here, and the data available on a couple of other sites, you suddenly realize that we live in a very well-established software ecosystem – there is (actually) a very limited number of GOOD software packages for each type of software.
Now, I must be honest. The usefulness of all this is questionable today. The digital transformation has changed the way we use computers. Desktop computers, workstations, and even dedicated servers are becoming obsolete, and outside of some specialized tasks (gaming, research, hosting, etc.), modern portable devices are nothing but thin clients we use to access the ‘always most up-to-date’ versions of web-hosted software…
In a way, this article is an example of a cyber elephants’ graveyard. If you still need to do old-school endpoint/device forensics, it may inspire you. If you don’t, you will perhaps scratch your head and move on – this is a different type of ‘forensic exclusivity’, of course, but it’s a good one. Because everything we will ever witness in this game is subject to an ever-changing process. One that is always outside of our control.