Many forensic artifacts can be examined from many different angles. A few years ago I proposed a concept called filighting, which tries to solve the problem of finding unusual, orphaned, and potentially malicious files dropped into directories whose other files DO NOT reference them at all.
I really hope that forensic analysis tools will evolve and add features that help automate file system analysis based not only on lists of known hashes and/or file extensions, but also on paths, partial (relative) paths, file names, actual file types determined from their content, and ideas that rely on more complex algorithms: using prebuilt artifact collections, leveraging various correlations (ideas like filighting), and of course machine learning and AI.
Today I want to explore one more angle of looking at file system artifacts: classes of file content. There are many file formats out there: executables, documents, configuration files, database files, and many others. The classification I am focusing on today is slightly different though – the format itself doesn’t interest me too much, but the function of the file does…
My guinea pig will be the license file: the type of file that is all over the place, but that no one reads. And yes, removing license files from the examiner’s view (during file system analysis) may not add a lot of value, but they are used here only to illustrate the idea. There are many other file classes like this that are just noise to the examiner’s eyes, and if we start clustering them together, who knows, maybe we will save some person-hours…
I asked myself the following question:
– having a file system in front of me, how do I find all license files on it?
There are at least a few approaches I can think of:
All of them have their own challenges:
I am going to focus here on the second one.
Your typical license file is usually called license, license.txt, or eula.txt, and in the case of open source software we often see files named gpl.txt, license.gpl.txt, lgpl.txt, etc.
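To make the naive version of this idea concrete, here is a minimal Python sketch that checks a bare file name against only the handful of names mentioned above. The pattern and the function name `looks_like_license` are my own illustration, not part of any tool, and they deliberately ignore everything beyond these few examples.

```python
import re

# Hypothetical pattern covering only the example names listed above:
# license, license.txt, eula.txt, gpl.txt, license.gpl.txt, lgpl.txt.
# Real-world naming varies far more than this.
LICENSE_NAME_RE = re.compile(
    r"^(?:license(?:\.gpl)?|eula|l?gpl)(?:\.txt)?$",
    re.IGNORECASE,
)


def looks_like_license(file_name: str) -> bool:
    """Return True if the bare file name matches one of the typical names."""
    return LICENSE_NAME_RE.match(file_name) is not None


if __name__ == "__main__":
    for candidate in ("LICENSE", "license.gpl.txt", "lgpl.txt", "readme.txt"):
        print(candidate, looks_like_license(candidate))
```

A pattern like this is easy to write and just as easy to defeat: a file called licence.txt (the British spelling) already slips through.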
When you start researching file naming a bit more, you soon realize that there are a lot of variations. Many of the issues listed in the third point come into play as well, for example:
As usual, the more you look, the more complex the problem you see.
For this post I have compiled a large file containing possible license file names. You can download it here.
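If you want to try a list like this against a directory tree, a rough sketch could look like the one below. It assumes the list is plain text with one file name per line, which is my assumption about the format, and the helper names `load_names` and `find_by_name` are hypothetical. Matching is done on the lowercased base name only.

```python
import os
from typing import Iterable, Iterator, Set


def load_names(lines: Iterable[str]) -> Set[str]:
    """Build a lowercase lookup set from text lines, skipping blank ones."""
    return {line.strip().lower() for line in lines if line.strip()}


def find_by_name(root: str, names: Set[str]) -> Iterator[str]:
    """Yield paths under root whose base name is in the set (case-insensitive)."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if name.lower() in names:
                yield os.path.join(dirpath, name)


if __name__ == "__main__":
    # Inline sample standing in for the real list (assumed one name per line).
    known = load_names(["license.txt\n", "eula.txt\n", "gpl.txt\n"])
    for path in find_by_name(".", known):
        print(path)
```

Using a set keeps the lookup cheap even for a very large list, but it only finds exact names; anything the list does not contain stays invisible.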
Will it make anybody’s life easier?
I don’t know.
What matters is that we learned a little more about how difficult automated file system analysis really is. What started as a trivial and frivolous idea ended up being a quixotic attempt to formalize something that is impossible to tackle, even with a data-heavy approach…