This is a follow-up to my previous blog post looking at how to install/run the new John the Ripper Tokenizer attack [Link]. The focus of this post will be on performing a first-pass analysis of how the Tokenizer attack actually performs.
Before I dive into the tests, I want to take a moment to describe the goals of this testing. My independent research schedule is largely driven by what brings me joy. Because of that I'm trying to get better at scoping efforts to something I can finish in a couple of days. It's easy to be interested in something for a couple of days! Therefore, my current plan is to run a couple of tests to get a high level view of how the Tokenizer attack performs and then see where things go.
To that end, this particular blog post will focus on three main "tests" to answer a couple of targeted questions.
Test 1: Analyze how sensitive Tokenizer is to the size of the training data
Test 2: Compare a very basic Tokenizer attack against Incremental and OMEN.
Test 3: Compare Tokenizer and CutB as Part of a Larger Password Cracking Session
Note on Testing Tools:
Below is an example of how I run checkpass.py and use that to generate these graphs. Note: Checkpass can also create a list of uncracked passwords. This is helpful since it lets me chain together different attacks to simulate more complex cracking sessions.
Training: RockYou
Note on RockYou Dataset: The RockYou dataset contains duplicate passwords as well as all the encoding weirdness found in the original dump. I randomized the order of the passwords in it to avoid any correlations between passwords present in the original dump, and split it into 32 1-million subsets to allow training/testing against different passwords.
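As a rough illustration of that shuffle-and-split step, here is a minimal Python sketch. The function name and the tiny in-memory list are made up for the example; this is not the script I actually used. It assumes the passwords are already loaded into a list (with the real file you would likely want to read raw bytes, given the encoding weirdness), and it deliberately keeps duplicates.

```python
import random

def shuffle_and_split(passwords, chunk_size, seed=None):
    """Shuffle a list of passwords and cut it into fixed-size chunks."""
    rng = random.Random(seed)      # a seed makes the split repeatable
    shuffled = list(passwords)     # copy; duplicates are kept on purpose
    rng.shuffle(shuffled)
    # The last chunk may be shorter if the list does not divide evenly
    return [shuffled[i:i + chunk_size]
            for i in range(0, len(shuffled), chunk_size)]

# Toy stand-in for the real dataset
sample = ["123456", "password", "123456", "iloveyou", "qwerty"]
chunks = shuffle_and_split(sample, chunk_size=2, seed=1)
```

With the real dataset, `chunk_size` would be 1,000,000 and each chunk would be written out to its own training/testing file.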
Testing: LinkedIn 2012 Data Breach
Notes on LinkedIn 2012 Dataset:
Total Side Tangent on LinkedIn List + Hashmob Wordlists:
I'll be up front: Given the age of this dataset and the speed of the underlying hashing algorithm (raw-sha1), I was expecting the hashmob wordlist to crack over 96% of the hashes. So after seeing so many uncracked passwords, I decided to run a standard PCFG attack against the remaining hashes just to perform a sanity check. To my surprise I got a few quick hits almost immediately:
Noticing all the new cracks had non-ASCII characters, I then started up a new attack using the included Russian ruleset:
These aren't complicated passwords. For example, I believe снейка means "snake" in Russian. Wanting to dig into this more, I then ran my own cracked list, from when I was investigating this dataset in 2014, against the left list.
The actual cracked list was much larger, but what's interesting was that almost all of the new (or really old depending on how you look at it) cracks were of e-mail addresses. I talked with a couple other researchers, one of whom graciously provided me his cracked list, and I saw similar results. More e-mail addresses and more non-ASCII cracked passwords.
Current Theory: I suspect the Hashmob team strips e-mail addresses from the plain/cracked wordlists they provide to the public. I also suspect they run into issues creating a wordlist with all the weird encoding issues found with passwords in the wild, so their wordlist has gaps in non-ASCII cracks. I want to stress, all of these gaps are 100% totally reasonable, and when it comes to stripping e-mail addresses, commendable! But it's something to keep in mind when using these lists to conduct academic research.
Impact to these tests: While I'd like to have a higher crack percentage, given the fact that so many of the uncracked passwords likely contain non-ASCII characters or are e-mail addresses, this shouldn't have a big impact when analyzing how tokenizer performs. This is because as configured, my tokenizer attacks are unlikely to crack very many of these uncracked passwords. In the future I might run another "real" test of tokenizer against these hashes, but I'm going to put that off until I spend more time validating/improving my testing tools.
The two tokenizer attacks trained on 1 million passwords performed very similarly (you almost can't see the second line on the graph). This is a good result since it points to the attack being somewhat resilient to minor differences in the training data. You will notice though that the tokenizer attack trained on the full 32 million RockYou passwords does perform noticeably better.
There are a lot of additional questions that come to mind about this, but I'm going to let these results stand alone for your interpretation and move on to the next set of planned tests.
In my previous post I posted the first 25 guesses my training of tokenizer produced, and it looked "weird". SolarDesigner replied with what they were seeing when running their own copy which was very different (and looked more like what I originally expected) [Link]. I reran all my training, and then started getting similar results to Solar. Long story short, somewhere along the way with my troubleshooting and figuring out this attack I made a mistake. Here are the updated results of the first 25 guesses generated by tokenizer with the Rockyou training data above, along with the results Solar provided:
The guesses highlighted in green are guesses that were shared with one of the other training runs.
Training:
All three attack modes were trained on the same 1 million subset of RockYou passwords
Testing:
Not a lot of surprises here, which is good! OMEN is a very effective attack mode so that was always a tough one to beat. The challenge with OMEN is the lack of an indexing function (aka being able to tell it "generate password at position 2941932"), which leads to complications with pausing/restarting cracking sessions. So I generally use Incremental mode in my real password cracking sessions. It's just easier. Which means that having the Tokenizer attack improve upon standard Incremental mode is a big deal.
Side note: I try to point this out whenever talking about OMEN, but you'll notice the sawtooth success rate as it tends to crack more passwords at the start of each OMEN "level". This highlights significant room for improvement if any researchers want to look into this. Ideally you'd have a smoother graph to frontload all your effective guesses near the beginning of your cracking session.
For this last test I wanted to simulate a larger cracking session. For this I'm loosely going to base my attacks on EvilMog's "Random AD Methodology" described [Here]. By loosely I mean I'm just going to simulate the first three steps:
For the first step, I'm going to use the full RockYou wordlist (only unique words) and the "Hashcat" ruleset in John the Ripper. I figure that gets close to the intention of step #1 without having to resort to making 100k random rules up on the spot.
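To make "indexing function" concrete, here is a toy Python sketch for a plain fixed-length brute-force keyspace. This is an assumption-laden illustration only: it is not how Incremental, Tokenizer, or OMEN order their guesses, and `candidate_at` is a made-up name. The point is just that when a guess can be computed directly from its position, resuming a session only requires remembering one number.

```python
def candidate_at(index, charset, length):
    """Return the index-th fixed-length candidate in charset order."""
    keyspace = len(charset) ** length
    if not 0 <= index < keyspace:
        raise IndexError("index outside keyspace")
    chars = []
    # Treat the index as a number written in base len(charset)
    for _ in range(length):
        index, digit = divmod(index, len(charset))
        chars.append(charset[digit])
    return "".join(reversed(chars))

def resume_from(position, charset, length):
    """Yield every candidate from a saved position to the end."""
    for i in range(position, len(charset) ** length):
        yield candidate_at(i, charset, length)
```

An attack without this property has to regenerate (or otherwise track) everything that came before the restart point, which is where the pause/restart headaches come from.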
The John the Ripper "Hashcat" ruleset is actually a collection of rules from the Hashcat repo modified to work with JtR:
[List.Rules:hashcat]
.include [List.Rules:best64]
.include [List.Rules:d3ad0ne]
.include [List.Rules:dive]
.include [List.Rules:InsidePro]
.include [List.Rules:T0XlC]
.include [List.Rules:rockyou-30000]
.include [List.Rules:specific]
The challenge from an analysis perspective is that these attacks generate an absolute ton of guesses! Partly this is because they aren't optimized. None of these attacks have "reject" functions built into them, so every mangling rule is applied to every input word regardless of whether the mangling rule would actually change that word. This is also because there is significant overlap in the individual mangling rules across these different rulesets.
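A small Python sketch can show how no-op and overlapping rules inflate the guess count. The "rules" here are plain Python functions I made up for illustration; they are not JtR or Hashcat rule syntax and do not correspond to any specific rule in the rulesets above.

```python
from collections import Counter

# Hypothetical mangling rules, written as ordinary functions
RULES = {
    "identity": lambda w: w,
    "lowercase": str.lower,
    "capitalize": str.capitalize,
    "append_1": lambda w: w + "1",
    "strip_trailing_digits": lambda w: w.rstrip("0123456789"),
}

def mangle(words, rules=RULES):
    """Apply every rule to every word, with no reject logic at all."""
    for word in words:
        for rule in rules.values():
            yield rule(word)

def wasted_guesses(words, rules=RULES):
    """Return (duplicate guesses, total guesses) for a wordlist."""
    counts = Counter(mangle(words, rules))
    total = sum(counts.values())
    return total - len(counts), total

wasted, total = wasted_guesses(["password", "Password", "monkey1"])
```

Even on three words and five rules, a large share of the output is repeats, because "lowercase" does nothing to an already lowercase word and different rules can land on the same result.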
What I'm trying to say is if I ran this attack with the Rockyou wordlist on my research laptop and piped it into checkpass.py (which itself can be a bit slow), the attack would take me around two weeks to complete. To that end, I ran a "quick" attack of just 5 billion guesses which gets through the best64 ruleset and into d3ad0ne ruleset using checkpass.py simply because I wanted to compare that to my previous graphs. I then launched all these attacks for real on a different computer to create a potfile of all the passwords cracked using these attacks.
(Future Improvement): Hashcat supports the ability to record "guess position" in the outfiles (potfiles) it generates. I've never really used that, but I plan on looking into that feature in a future "improve my testing process" research sprint. For now though, it's just easier to launch JtR and let it run while I do other things.
While I could be more scientific about it, given the 16 million word wordlist (Rockyou-Unique) and the Best64 ruleset (which has slightly more than 64 rules), the Best64 ruleset finishes up somewhere around 1 billion guesses, which is pretty evident from the graph above. The other Hashcat rulesets are not nearly as optimized. This does highlight though that starting a password cracking session off with a "smart" dictionary attack is still one of the best ways to crack passwords quickly.
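The back-of-the-envelope math is just wordlist size times rule count. In the sketch below the rule count of 70 is a placeholder I picked to stand in for "slightly more than 64"; it is not the measured size of the ruleset.

```python
def wordlist_rule_guesses(word_count, rule_count):
    """Guesses from applying every rule to every word (no rejects)."""
    return word_count * rule_count

WORDS = 16_000_000        # Rockyou-Unique size quoted above
ASSUMED_RULES = 70        # hypothetical stand-in, not a measured value
RUN_BUDGET = 5_000_000_000

estimate = wordlist_rule_guesses(WORDS, ASSUMED_RULES)
share_of_run = estimate / RUN_BUDGET
```

Under that assumption the first ruleset is done a little past the 1 billion mark, or a bit over a fifth of the way into the 5 billion guess run.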
Success Ratio for Full Hashcat Rules vs. LinkedIn:
For this test, steps #2 and #3 involve using expander and cutb. If you are not familiar with these tools, they are part of Hashcat Utilities [Link].
While you can build the tools in Hashcat Utilities from source [Link], the latest release binaries are available [Here].
As to what Hashcat Utilities are, you can get more detailed information from the first link above, but at a high level they are a set of tools that each perform one specific task. Many of them can be chained together (or used stand-alone) to create targeted wordlists, which is how we'll be using them in this experiment.
Expander: This tool basically mangles and creates new combinations of words from individual characters found in each word in the input dictionary. The actual operation is a bit weird, but imagine you wrote the input word on a piece of paper and then folded the paper into a circle so the word is like a bracelet. Expander then creates new words by taking cuts out of that bracelet. So "password123" can generate the guess "3pas" as it wraps around. By default it will generate all 1-4 letter combinations from the input wordlist that is piped to it. You can increase this by changing a macro variable in the source and recompiling it. Some people will have multiple versions of expander built with the length of guesses they generate appended to the filename. For example "expander8.bin". Another approach to make longer guesses without having to recompile the code is to combine multiple runs of "length 4" expander using Hashcat's combinator mode to generate longer password guesses. Here is an example of me running expander with one input "word".
Expander will then return the following output (only showing a sample as the full output is 40 unique words):
Side note: I was really surprised by guesses Expander didn't make. For example "23pa" was not generated. So it's not an exhaustive list and there are some exceptions in the substrings it generates.
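For reference, here is the "bracelet" mental model written out as a Python sketch. To be clear, this is the idealized version of the idea and not a port of Expander's code: it produces every wrap-around substring, so it will include guesses like "23pa" that the real tool skipped. The function name is made up.

```python
def bracelet_substrings(word, min_len=1, max_len=4):
    """Every substring of min_len..max_len chars, reading the word as a loop."""
    results = set()
    n = len(word)
    # Never ask for a cut longer than the word itself
    for length in range(min_len, min(max_len, n) + 1):
        for start in range(n):
            # The modulo is what lets a cut wrap past the end of the word
            cut = "".join(word[(start + k) % n] for k in range(length))
            results.add(cut)
    return results

guesses = bracelet_substrings("password123")
```

Diffing this set against Expander's real output is a quick way to see exactly which cuts the tool leaves out.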
Expander is the basis of what's been called a "Fingerprint" attack. This was first described by pure_hate in the following blogpost where they used it as part of the 2010 CMIYC competition [Link]. A more modern take and example of using a Fingerprint attack can be found [Here].
Now, you generally need to be selective in the input wordlists you feed to Expander since this attack can very quickly get to the point where it's almost equivalent to a full dumb brute-force attack. You also need to make sure you "sort -u" the outputs of Expander since it often generates a ton of duplicate guesses. Because of this, I generally wouldn't recommend using Expander on normal password cracking wordlists. Instead, people will often use Expander on previously cracked passwords to get new cracks. For example:
To continue to build this out and target passwords greater than 8 characters long, you can re-run variations of the above commands as follows:
You can keep building this process out for longer guesses. Now you know how to run a fingerprint attack!
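The combining step at the heart of this can be sketched in a few lines of Python. This only illustrates the idea of joining every fragment from one list to every fragment from another (which is what a combinator attack does with two wordlists) while dropping duplicates; it is not the commands used in this post, and `combinator` is a name I made up for the example.

```python
from itertools import product

def combinator(left, right):
    """Yield each left fragment joined to each right fragment, once."""
    seen = set()
    for a, b in product(left, right):
        guess = a + b
        if guess not in seen:      # in-memory stand-in for "sort -u"
            seen.add(guess)
            yield guess

# Two 4-character fragment lists give 8-character guesses
fragments = ["pass", "word", "1234"]
eight_char = list(combinator(fragments, fragments))

# Feeding the result back in gives 12-character guesses, and so on
twelve_char = list(combinator(eight_char, fragments))
```

The guess count multiplies at every stage, which is why the input fragment lists need to stay selective.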
CutB: This tool allows you to "cut" substrings from an input wordlist for use in hashcat combinator and hybrid (rule-based) attacks. It's a lot easier than piping your wordlists into sed, awk, or other linux tools to retrieve substrings. I'd recommend checking out the Hashcat wiki for info on how to use it, but at a high level you can give it two numbers on the command line to specify which substrings you want to extract. Aka:
Often CutB will be run in a script to generate many, many, different subsections of a password guess. You may notice that CutB is pretty similar in operation to Expander, but it allows you much more flexibility to be somewhat targeted about how you apply your cuts.
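Conceptually, the two numbers are an offset and an optional length. Here is a Python sketch of that idea, assuming a negative offset counts back from the end of the word. This is my idealized model and not a reimplementation of CutB, so the real tool may behave differently on edge cases; the function names are made up.

```python
def cut(word, offset, length=None):
    """Return the substring at offset, optionally limited to length chars."""
    if offset < 0:
        # Negative offsets count from the end; clamp for short words
        start = max(len(word) + offset, 0)
    else:
        start = offset
    end = len(word) if length is None else start + length
    return word[start:end]

def cut_wordlist(words, offset, length=None):
    """Apply the same cut to every word, skipping empty results."""
    for word in words:
        piece = cut(word, offset, length)
        if piece:
            yield piece

prefixes = list(cut_wordlist(["password123", "monkey"], 0, 4))
suffixes = list(cut_wordlist(["password123", "monkey"], -3))
```

Running several of these cuts in a loop over different offsets and lengths is the scripted usage described above.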
Side note: CutB's code is weird, and it won't always perform like you'd expect. For example:
I really don't know what's going on with those two last guesses.....
Tokenizer_RockyouFull:
Tokenizer_LinkedinPot:
Expander:
CutB:
All attacks will be run against the remaining uncracked passwords from the 2012 LinkedIn password list after the JtR Hashcat rules with Rockyou-Unique wordlist have been run against it. Each attack will be run for 5 billion password guesses. This is a very short runtime for these attacks. Normally these attacks will generate trillions of password guesses. Future testing might include Hashcat's outfile debugging formats or running the attacks for a set time, but I figure 5 billion guesses can start to indicate how these attacks will compare to each other.
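The shape of this setup (cap each attack at a fixed number of guesses, then hand whatever is left uncracked to the next attack) can be sketched in Python. This is a simplified illustration that compares plaintext guesses against a set of plaintext targets; it is not checkpass.py, it does no hashing, and the names are invented for the example.

```python
from itertools import islice

def run_attack(guesses, targets, max_guesses):
    """Check up to max_guesses guesses; return (cracked, uncracked)."""
    remaining = set(targets)
    cracked = {}                   # password -> guess number that hit it
    capped = islice(guesses, max_guesses)
    for position, guess in enumerate(capped, start=1):
        if guess in remaining:
            remaining.remove(guess)
            cracked[guess] = position
    return cracked, remaining

targets = {"password", "letmein", "dragon"}

# First attack, capped at 3 guesses
first, left = run_attack(
    iter(["123456", "password", "dragon", "letmein"]), targets, 3)

# Second attack only sees what the first one missed
second, left = run_attack(["letmein"], left, 5)
```

Recording the guess number for each crack is what makes it possible to plot cracks against guesses afterwards.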
Quick summary of results:
I didn't expect Expander to do very well given the small number of guesses, but this low number really shocked me. I'm pretty sure just creating random wordlist rules using "hashcat -g 100000" would be more effective.
While it's never fun to say that the biggest finding is that your test setup is flawed, that's my main takeaway from these tests. Looking at the results, 5 billion guesses is way too low a number to analyze these attacks after trillions of guesses have already been made running wordlist attacks. Going back to Test 2 (quick disclaimer: this is not a direct comparison due to the different training sets for Tokenizer), Tokenizer cracked over 1 million passwords when it was run as the first attack. So when it cracks just 14k unique passwords more than the Hashcat Rules based attacks, that shows a strong overlap in the guesses that these two attacks are making.
This is a long way of saying, after an initial very long run using the Hashcat Rules attack against LinkedIn, I don't expect any non-wordlist based attack to do very well for just 5 billion guesses. So while it's easy for me to make fun of Expander, I really can't make any definitive statement about how these attacks perform in real life unless I run a cracking session that represents several days with a GPU.
Looking at the bright side, I'm glad I ran this test. It forced me to better understand some of the tools in Hashcat Utilities, and it helped me start to identify both what future tests should look like and the gaps in my testing strategies.
I'll be up front: The holidays are coming up, and I have a lot of other research items I'm working on that I would like to finish up [Spoiler/Link]. This basically means that while there are a ton of unanswered questions from this blog post, I'm probably not going to get around to investigating them anytime soon. As a note to my future self though, here are a couple of topics that jump out to me: