by Sina Pilehchiha, Concordia University
Trail of Bits has manually curated a wealth of data—years of security assessment reports—and now we’re exploring how to use this data to make the smart contract auditing process more efficient with Slither-simil.
Based on accumulated knowledge embedded in previous audits, we set out to detect similar vulnerable code snippets in new clients’ codebases. Specifically, we explored machine learning (ML) approaches to automatically improve on the performance of Slither, our static analyzer for Solidity, and make life a bit easier for both auditors and clients.
Currently, human auditors with expert knowledge of Solidity and its security nuances scan and assess Solidity source code to discover vulnerabilities and potential threats at different granularity levels. In our experiment, we explored how much we could automate security assessments to:
- Minimize the risk of recurring human error, i.e., the chance of overlooking known, recorded vulnerabilities.
- Help auditors sift through potential vulnerabilities faster and more easily while decreasing the rate of false positives.
Slither-simil
Slither-simil, the statistical addition to Slither, is a code similarity measurement tool that uses state-of-the-art machine learning to detect similar Solidity functions. When it began as an experiment last year under the codename crytic-pred, it was used to vectorize Solidity source code snippets and measure the similarity between them. This year, we’re taking it to the next level and applying it directly to vulnerable code.
Slither-simil currently uses Slither's own representation of Solidity code, SlithIR (Slither Intermediate Representation), to encode Solidity snippets at the granularity of functions. We thought function-level analysis was a good place to start our research, since it's not too coarse (like the file level) and not too detailed (like the statement or line level).
Figure 1: A high-level view of the process workflow of Slither-simil.
In the process workflow of Slither-simil, we first manually collected vulnerabilities from previously archived security assessments and transferred them to a vulnerability database. Note that these are the vulnerabilities auditors had found without automation.
After that, we compiled previous clients' codebases and matched the functions they contained against our vulnerability database via an automated function extraction and normalization script. By the end of this process, the vulnerabilities were represented as normalized SlithIR tokens, ready to serve as input to our ML system.
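As an illustration, here's a minimal sketch of what the extraction step can look like with Slither's Python API (the actual Slither-simil script differs; the file name is just an example):

from slither.slither import Slither

# Compile the contract with solc and load it into Slither.
slither = Slither("TurtleToken.sol")

# Walk every function and print its SlithIR operations.
for contract in slither.contracts:
    for function in contract.functions:
        print(f"Function {contract.name}.{function.full_name}")
        for node in function.nodes:
            for ir in node.irs:  # SlithIR operations for this node
                print(f"    {ir}")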
Here’s how we used Slither to transform a Solidity function to the intermediate representation SlithIR, then further tokenized and normalized it to be an input to Slither-simil:
function transferFrom(address _from, address _to, uint256 _value) public returns (bool success) {
    require(_value <= allowance[_from][msg.sender]);  // Check allowance
    allowance[_from][msg.sender] -= _value;
    _transfer(_from, _to, _value);
    return true;
}
Figure 2: A complete Solidity function from the contract TurtleToken.sol.
Function TurtleToken.transferFrom(address,address,uint256) (*)

Solidity Expression: require(bool)(_value <= allowance[_from][msg.sender])
SlithIR:
    REF_10(mapping(address => uint256)) -> allowance[_from]
    REF_11(uint256) -> REF_10[msg.sender]
    TMP_16(bool) = _value <= REF_11
    TMP_17 = SOLIDITY_CALL require(bool)(TMP_16)

Solidity Expression: allowance[_from][msg.sender] -= _value
SlithIR:
    REF_12(mapping(address => uint256)) -> allowance[_from]
    REF_13(uint256) -> REF_12[msg.sender]
    REF_13(-> allowance) = REF_13 - _value

Solidity Expression: _transfer(_from,_to,_value)
SlithIR:
    INTERNAL_CALL, TurtleToken._transfer(address,address,uint256)(_from,_to,_value)

Solidity Expression: true
SlithIR:
    RETURN True
Figure 3: The same function with its SlithIR expressions printed out.
First, we converted every statement or expression into its SlithIR equivalent, then tokenized the SlithIR sub-expressions and normalized them so that matches would occur despite superficial differences between the tokens of this function and those in the vulnerability database.
type_conversion(uint256)
binary(**)
binary(*)
(state_solc_variable(uint256)):=(temporary_variable(uint256))
index(uint256)
(reference(uint256)):=(state_solc_variable(uint256))
(state_solc_variable(string)):=(local_solc_variable(memory, string))
(state_solc_variable(string)):=(local_solc_variable(memory, string))
...
Figure 4: Normalized SlithIR tokens of the previous expressions.
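To give a feel for the normalization, here's a simplified sketch of the idea: each concrete SlithIR operation is collapsed into a coarse token so that superficially different functions can still match. (The encode_ir helper below is our illustration; slither-simil's real encoder covers many more operation types.)

from slither.slithir.operations import Binary, Index, InternalCall, TypeConversion

def encode_ir(ir):
    # Hypothetical helper: collapse each SlithIR operation into a coarse token.
    if isinstance(ir, Binary):
        return f"binary({ir.type})"           # e.g., binary(<=)
    if isinstance(ir, Index):
        return "index"
    if isinstance(ir, InternalCall):
        return "internal_call"
    if isinstance(ir, TypeConversion):
        return f"type_conversion({ir.type})"
    return "other"

def encode_function(function):
    # One token per SlithIR operation, in program order.
    return [encode_ir(ir) for node in function.nodes for ir in node.irs]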
After obtaining the final token representation of this function, we compared its structure to those of the vulnerable functions in our vulnerability database. Because Slither-simil is modular, we could use various ML architectures to measure the similarity between any number of functions.
$ slither-simil test etherscan_verified_contracts.bin --filename TurtleToken.sol --fname TurtleToken.transferFrom --input cache.npz --ntop 5

Output:
Reviewed 825062 functions, listing the 5 most similar ones:

filename           contract        function        score
...
TokenERC20.sol     TokenERC20      freeze          0.991
...
ETQuality.sol      StandardToken   transferFrom    0.936
...
NHST.sol           NHST            approve         0.889
Figure 5: Using Slither-simil to test a function from a smart contract with an array of other Solidity contracts.
Let’s take a look at the function transferFrom from the ETQuality.sol smart contract to see how its structure resembled our query function:
function transferFrom(address _from, address _to, uint256 _value) returns (bool success) {
    if (balances[_from] >= _value && allowed[_from][msg.sender] >= _value && _value > 0) {
        balances[_to] += _value;
        balances[_from] -= _value;
        allowed[_from][msg.sender] -= _value;
        Transfer(_from, _to, _value);
        return true;
    } else {
        return false;
    }
}
Figure 6: Function transferFrom from the ETQuality.sol smart contract.
Comparing the statements in the two functions, we can easily see that both contain, in the same order, a binary comparison operation (>= and <=) on the same types of operands, a similar assignment operation, an internal call statement, and an instance of returning a “true” value.
These structural similarities are observed less often as the similarity score drops toward 0; in the other direction, the two functions become more and more alike, and a score of 1.0 means their token representations are identical.
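Here's a minimal sketch of how such a score can be computed, assuming each function has already been embedded as a vector (the numbers below are made up for illustration): cosine similarity is 1.0 when two representations point in the same direction and falls toward 0 as they diverge.

import numpy as np

def similarity(vec_a, vec_b):
    # Cosine similarity between two function embeddings.
    return float(np.dot(vec_a, vec_b) / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

query  = np.array([0.12, -0.48, 0.33])   # embedding of the query function
target = np.array([0.10, -0.45, 0.30])   # embedding of a candidate function
print(f"score: {similarity(query, target):.3f}")  # close to 1.0 = near-identical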
Related Research
Research on automatic vulnerability discovery in Solidity has taken off in the past two years, and tools like Vulcan and SmartEmbed, which use ML approaches to discovering vulnerabilities in smart contracts, are showing promising results.
However, all the current related approaches focus on vulnerabilities already detectable by static analyzers like Slither and Mythril, while our experiment focused on the vulnerabilities these tools were not able to identify—specifically, those undetected by Slither.
Much of the academic research of the past five years has focused on taking ML concepts (usually from the field of natural language processing) and applying them in a development or code analysis context, an area typically referred to as code intelligence. Building on related work in this area, we aim to bridge the semantic gap between the performance of a human auditor and an ML detection system, thus complementing the work of Trail of Bits human auditors with automated approaches (i.e., Machine Programming, or MP).
Challenges
We still face the challenge of data scarcity, both in the number of smart contracts available for analysis and in the frequency with which interesting vulnerabilities appear in them. It's tempting to focus on the ML model because it's the exciting part, but that does us little good with Solidity, where the language itself is still young; we need to tread carefully with the amount of data we have at our disposal.
Archiving previous client data was a job in itself, since we had to deal with the different solc versions required to compile each project separately. For someone with limited experience in that area this was a challenge, and I learned a lot along the way. (The most important takeaway of my summer internship: if you're doing machine learning, you won't realize how major a bottleneck the data collection and cleaning phases are until you have to do them yourself.)
Figure 7: Distribution of 89 vulnerabilities found among 10 security assessments.
The pie chart shows how 89 vulnerabilities were distributed among the 10 client security assessments we surveyed. We documented both the notable vulnerabilities and those that were not discoverable by Slither.
The Road Ahead for Slither-simil
This past summer we resumed the development of Slither-simil and SlithIR with two goals in mind:
- Research purposes, i.e., the development of end-to-end similarity systems that require no manual feature engineering.
- Practical purposes, i.e., adding specificity to increase precision and recall.
We implemented a baseline text-based model with FastText, to be compared against improved models that should yield tangibly better results; in particular, models that do not rely on software complexity metrics but focus solely on graph-based representations, which are the most promising right now.
For this, we have proposed a slew of techniques to try out with the Solidity language at the highest abstraction level, namely, source code.
To develop ML models, we considered both supervised and unsupervised learning methods. First, we developed a baseline unsupervised model based on tokenizing source code functions and embedding them in a Euclidean space (Figure 8) to measure and quantify the distance (i.e., dissimilarity) between different tokens. Since functions are composed of tokens, we can simply add up the token distances to get the (dis)similarity between any two snippets of any size.
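Here's a sketch of that baseline under some assumptions: fastText's official Python bindings, and a file slithir_tokens.txt holding one normalized token sequence per function (as in Figure 4); the file name and token strings are illustrative.

import fasttext
import numpy as np

# Train unsupervised skip-gram embeddings over the SlithIR token corpus.
model = fasttext.train_unsupervised("slithir_tokens.txt", model="skipgram")

def embed(tokens):
    # A function's embedding is the sum of its token embeddings, so the distance
    # between two functions aggregates the distances between their tokens.
    return np.sum([model.get_word_vector(t) for t in tokens], axis=0)

f1 = embed(["binary(<=)", "index", "internal_call"])
f2 = embed(["binary(>=)", "index", "internal_call"])
print(np.linalg.norm(f1 - f2))  # Euclidean distance = dissimilarity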
The diagram below shows the SlithIR tokens from a set of training Solidity data spherized in a three-dimensional Euclidean space, with similar tokens closer to each other in vector distance. Each purple dot shows one token.
Figure 8: Embedding space containing SlithIR tokens from a set of training Solidity data
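One way to produce a plot like Figure 8 (a sketch only; the actual figure may use a different projection than the PCA shown here):

import fasttext
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Same illustrative token corpus as in the previous sketch.
model = fasttext.train_unsupervised("slithir_tokens.txt", model="skipgram")
vectors = np.array([model.get_word_vector(w) for w in model.words])
points = PCA(n_components=3).fit_transform(vectors)  # project tokens to 3-D

ax = plt.figure().add_subplot(projection="3d")
ax.scatter(points[:, 0], points[:, 1], points[:, 2], s=4)
plt.show()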
We are currently developing a proprietary database consisting of our previous clients' publicly available vulnerable smart contracts, along with vulnerabilities referenced in papers and other audits. Together, they'll form one unified, comprehensive database of Solidity vulnerabilities for querying, and for training and testing newer models.
We’re also working on other unsupervised and supervised models, using data labeled by static analyzers like Slither and Mythril. We’re examining deep learning models that offer much more expressivity for modeling source code: specifically, graph-based models utilizing abstract syntax trees and control flow graphs.
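As a sketch of that direction, control flow graphs are already recoverable from Slither; here's how one might export them for a graph model (networkx and the choice of node features are our assumptions, not slither-simil's current code):

import networkx as nx
from slither.slither import Slither

slither = Slither("TurtleToken.sol")

def function_cfg(function):
    # Build a directed graph over the function's CFG nodes.
    g = nx.DiGraph()
    for node in function.nodes:
        g.add_node(node.node_id, type=str(node.type))
        for son in node.sons:  # control-flow successors
            g.add_edge(node.node_id, son.node_id)
    return g

for contract in slither.contracts:
    for function in contract.functions:
        cfg = function_cfg(function)
        print(function.full_name, cfg.number_of_nodes(), cfg.number_of_edges())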
And we’re looking forward to checking out Slither-simil’s performance on new audit tasks to see how it improves our assurance team’s productivity (e.g., in triaging and finding the low-hanging fruit more quickly). We’re also going to test it on Mainnet when it gets a bit more mature and automatically scalable.
You can try Slither-simil now in this GitHub PR. For end users, it's a simple CLI tool:
- Input one or more smart contracts (a directory, a .zip file, or a single .sol file).
- Identify a pre-trained model, or separately train a model on a reasonable amount of smart contracts.
- Let the magic happen, and check out the similarity results.
$ slither-simil test etherscan_verified_contracts.bin --filename MetaCoin.sol --fname MetaCoin.sendCoin --input cache.npz
Conclusion
Slither-simil is a powerful tool with potential to measure the similarity between function snippets of any size written in Solidity. We are continuing to develop it, and based on current results and recent related research, we hope to see impactful real-world results before the end of the year.
Finally, I’d like to thank my supervisors Gustavo, Michael, Josselin, Stefan, Dan, and everyone else at Trail of Bits, who made this the most extraordinary internship experience I’ve ever had.