Disclaimer 1: This blog post is on a new and still under development toolset in John the Ripper. Results depict the state of the toolset as-is and may not reflect changes made as the toolset evolves.
Disclaimer 2: I really need to run some actual tests and password cracking sessions using this attack, but I'm splitting that analysis up into a separate blog post. Basically I have enough forgotten drafts sitting in my blogger account that I didn't want to add another one by trying to "finish" this post before hitting publish. So stay tuned for new posts if you want to see how effective this attack really is.
It's been about 15 years since I last wrote about John the Ripper's Markov based Incremental mode attacks [Link] [Link 2]. 15 years is a long time! A lot of work has been done applying Markov based attacks to password cracking sessions, ranging from the OMEN approach to Neural Network based password crackers. That's why I was so excited to see a new proof of concept (PoC) enhancement of JtR's Incremental mode that was just published by JtR's original creator SolarDesigner.
The name of this enhancement is likely going to change over time. Originally it was described in an e-mail thread as "Markov phrases". That's not a bad descriptor, but doesn't really get to the heart of the current PoC. Therefore in this blog post I'm going to call this attack after the new script SolarDesigner released (tokenize.pl) and just refer to it as a "Tokenizer Attack". I think this gets closer to conveying how the underlying enhancement differs from the original Incremental attack which the tokenizer attack is built on.
The next question of course is "What does this new enhancement do?" To take a step back, what make Markov attacks "Markovian" is they represent the conditional probability of tokens appearing together. For example, if you see the letter 'q' in an English word, the next letter is very likely to be a 'u'. How far you look ahead to apply this probability is measured by the "order" of the Markov chain. So the above example would describe a first-order Markov process. If we extended this and said that given a 'q' and then a 'u', the next most likely character would be a 'e', now we are talking about a second-order Markov process. Where this starts to play out compared to a first-order process is that 'qu' -> 'e' with a high probability but the highest probability next character for the substring 'tu' might be 'r'. Therefore the highest probable letter following a 'u' might change based on the letter before the 'u' Basically the order represents the "memory" of the Markov chain. It can remember X previous tokens of the string it is generating.
The reason I bring this up is that JtR's Incremental mode works with trigraphs which "roughly" can be represented by second-order Markov processes. I use "roughly" since like most Markov implementations, Incremental mode contains nuances and differences from the abstract/academic definition of Markov based processes. Those nuances aren't super important for this current investigation/blog-post so I'm going to gloss over them for now.
One interesting area of research is implementing "variable order" Markov processes. Aka some particularly likely substrings might be created by a third/fourth/fifth/sixth order Markov process, but other less likely transitions might be implemented in a lower order (first/second) Markov process. As an example of that, if you were using a second order Markov chain the initial letters 'or' might generate a next letter of 'k' with a high probability, but if you take a larger step back and see the earlier letters are 'passwor' then the next letter will almost certainly be a 'd' instead. Based on this, it's tempting to "apply the largest memory possible" to your Markov processes. The limiting factor though is if you extend things out too much you run into overtraining issues, not being able to generate substrings not seen in the training data, and general implementation issues (the size of your Markov grammar explodes). So being able to dynamically switch how much memory/state you are keeping track of can be very helpful when generating password guesses. While more research is needed, this is my current theory why the CMU Neural Network Markov based attacks [Link] outperform other Markov implementations. I strongly suspect the Neural Network makes use of a variable order Markov process when generating guesses.
That's a lot of words/background to say that JtR's Tokenizer attack can be thought of as a way to incorporate variable length Markov orders into JtR's current Incremental mode attack. It does this by identifying certain likely substrings (aka "tokens") and then replacing them in the training set with a "placeholder" character. So for example the likely substring "love" might be replaced with the hex value of x15. This would normally be an unprintable character in ASCII (NAK), but it allows JtR's normal Incremental mode charset to be trained using these replacements. The resulting Incremental charset will have probability information for generating the character "x15" (NAK) as part of a password guess. Now (NAK) isn't actually part of a real password (unless the user did something really cool). This means when running a password cracking session, you'll need to then apply an External mode to translate these guesses back into the full ASCII text outputs. For example the guess "I(x15)MyWife" would be translated by the External mode into "IloveMyWife" which can then be fed into a real password cracking session.
While I can certainly dive more into the details of the tokenizer attack, this intro is already too long. So lets instead look at how to run it!
Example output when training on one million passwords from RockYou. You may note that the results are slightly different than when SolarDesigner trained on the full RockYou list:
Example output of running the training session on one million training passwords:
Below is a screenshot of me running the tokenize.pl script again, with the External mode section highlighted.
Next is a screenshot of me pasting the resulting External mode into my John the Ripper config file.
To run the attack, you can use the following commands in addition to your normal attack commands:
Here is a screenshot comparing the first 25(ish) guesses generated by both the new Tokenize attack and the default JtR Incremental attack. It's interesting to note that the default Incremental attack seems to start off stronger with more likely passwords. What's important though is longer password cracking sessions, which I'll investigate in future tests.
The next step will be to run some actual tests. The experiments SolarDesigner ran seemed to imply a significant improvement when running the tokenize attack over the first two billion guesses compared to a standard Incremental attack. This assumes you run these attacks after cracking the "common" passwords first using a more traditional wordlist attack. I'm having to restrain myself from speculating about the results of future tests that I'm planning on running (better to just run them), but hopefully this blog post is helpful for anyone else who also wants to experiment with this new attack. It's a cool attack and a neat approach so I'm looking forward to seeing how it evolves going forward.