By Matt Schwager and Sam Alws
We are publishing a set of 30 custom Semgrep rules for Ansible playbooks, Java/Kotlin code, shell scripts, and Docker Compose configuration files. These rules were created and used to audit for common security vulnerabilities in the listed technologies. This new release of our Semgrep rules joins our public CodeQL queries and Testing Handbook in an effort to share our technical expertise with the security community. This blog post will briefly cover the new Semgrep rules, then go in depth on two lesser-known Semgrep features that were used to create these rules: generic mode and YAML support.
For this release of our internal Semgrep rules, we focused on issues like unencrypted network transport (HTTP, FTP, etc.), disabled SSL certificate verification, insecure flags specified for common command-line tools, unrestricted IP address binding, miscellaneous Java/Kotlin concerns, and more. Here are our new rules:
Mode | Rule ID | Rule description |
Generic | container-privileged | Found container command with extended privileges |
Generic | container-user-root | Found container command running as root |
Generic | curl-insecure | Found curl command disabling SSL verification |
Generic | curl-unencrypted-url | Found curl command with unencrypted URL (e.g., HTTP, FTP, etc.) |
Generic | gpg-insecure-flags | Found gpg command using insecure flags |
Generic | installer-allow-untrusted | Found installer command allowing untrusted installations |
Generic | openssl-insecure-flags | Found openssl command using insecure flags |
Generic | ssh-disable-host-key-checking | Found ssh command disabling host key checking |
Generic | tar-insecure-flags | Found tar command using insecure flags |
Generic | wget-no-check-certificate | Found wget command disabling SSL verification |
Generic | wget-unencrypted-url | Found wget command with unencrypted URL (e.g. HTTP, FTP, etc.) |
Java, Kotlin | gc-call | Calling gc suggests to the JVM that the garbage collector should be run, and memory should be reclaimed. This is only a suggestion, and there is no guarantee that anything will happen. Relying on this behavior for correctness or memory management is an anti-pattern. |
Java, Kotlin | mongo-hostname-verification-disabled | Found MongoDB client with SSL hostname verification disabled |
YAML (Ansible) | apt-key-unencrypted-url | Found apt key download with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | apt-key-validate-certs-disabled | Found apt key with SSL verification disabled |
YAML (Ansible) | apt-unencrypted-url | Found apt deb with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | dnf-unencrypted-url | Found dnf download with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | dnf-validate-certs-disabled | Found dnf with SSL verification disabled |
YAML (Ansible) | get-url-unencrypted-url | Found file download with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | get-url-validate-certs-disabled | Found file download with SSL verification disabled |
YAML (Ansible) | rpm-key-unencrypted-url | Found RPM key download with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | rpm-key-validate-certs-disabled | Found RPM key with SSL verification disabled |
YAML (Ansible) | unarchive-unencrypted-url | Found unarchive download with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | unarchive-validate-certs-disabled | Found unarchive download with SSL verification disabled |
YAML (Ansible) | wrm-cert-validation-ignore | Found Windows Remote Management connection with certificate validation disabled |
YAML (Ansible) | yum-unencrypted-url | Found yum download with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | yum-validate-certs-disabled | Found yum with SSL verification disabled |
YAML (Ansible) | zypper-repository-unencrypted-url | Found Zypper repository with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Ansible) | zypper-unencrypted-url | Found Zypper package with unencrypted URL (e.g., HTTP, FTP, etc.) |
YAML (Docker Compose) | port-all-interfaces | Service port is exposed on all interfaces |
Semgrep 201: intermediate features
Semgrep is a static analysis tool for finding code patterns. This includes security vulnerabilities, bug variants, secrets detection, performance and correctness concerns, and much more. While Semgrep includes a proprietary cloud offering and more advanced rules, Semgrep CLI is free to install and run locally. You can run Trail of Bits’ rules, including the rules mentioned above, with the following command:
semgrep scan --config p/trailofbits /path/to/code
This post will not go into all the details of each rule presented above. The basics of Semgrep have already been discussed extensively by both Trail of Bits and the broader security community, so this post will discuss two lesser-known Semgrep features in more depth: generic mode and YAML support.
generic mode
Semgrep’s generic mode provides an easy method for searching for arbitrary text. Unlike Semgrep’s syntactic support for programming languages like Java and Python, generic mode is glorified text search. Naturally, this provides both advantages and disadvantages: generic mode has a tendency to produce more false positives but also fewer false negatives. In other words, it produces more findings, but you may have to sift through them. Limiting rule paths is one way to avoid false positives. However, the primary reason for using generic mode is the breadth of data it can search.
generic mode can roughly be thought of as an ergonomic alternative to regular expressions. They both perform arbitrary text search, but generic mode offers improved handling of newlines and other whitespace. It also offers Semgrep’s familiar ellipsis operator, metavariables, and a tight integration with the rest of the Semgrep ecosystem for managing findings. Any text file or text-based data can be analyzed in generic mode, so it’s a great option when you want to analyze less commonly used formats such as Jinja templates, NGINX configuration files, HAML templates, TOML files, HTML content, or any other text-based format.
The primary disadvantage of generic mode is that it has no semantic understanding of the text it parses. This means that patterns may be detected in commented-out code or other unintended places, which produces false positives. For example, if we search for os.system(...) in both generic mode and python mode in the following code, we will get different results:
    import os

    # Uncomment when debugging
    # os.system("debugger")

    os.system("run_production")
Figure 1: Python code with a line commented out
    $ semgrep scan --lang python --pattern "os.system(...)" test.py
    ...
    test.py
      6┆ os.system("run_production")
    ...
    Ran 1 rule on 1 file: 1 finding.
Figure 2: python mode semantically understands the comment.
    $ semgrep scan --lang generic --pattern "os.system(...)" test.py
    ...
    test.py
      4┆ # os.system("debugger")
      ⋮┆----------------------------------------
      6┆ os.system("run_production")
    ...
    Ran 1 rule on 1 file: 2 findings.
Figure 3: generic mode does not semantically understand the comment.
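The same contrast can be reproduced outside Semgrep. The following is a minimal sketch, not a description of how Semgrep works internally: it compares a plain regular expression search with a walk over Python's own syntax tree for the code in Figure 1. The function names text_matches and syntax_matches are hypothetical and exist only for this illustration.

```python
import ast
import re

# The contents of test.py from Figure 1.
SOURCE = '''import os

# Uncomment when debugging
# os.system("debugger")

os.system("run_production")
'''


def text_matches(source):
    """Line numbers where the raw text 'os.system(' appears, comments included."""
    pattern = re.compile(r"os\.system\(")
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def syntax_matches(source):
    """Line numbers of real os.system(...) call nodes in the parsed syntax tree."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "system"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "os"
        ):
            hits.append(node.lineno)
    return sorted(hits)


print(text_matches(SOURCE))    # [4, 6]: the commented-out call is reported too
print(syntax_matches(SOURCE))  # [6]: comments never reach the syntax tree
```

The text search reports the same two lines as generic mode, while the syntax-aware search reports only the live call, as python mode does.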
Another disadvantage of generic mode is that it lacks Semgrep’s extensive list of equivalences. Despite this, we still felt it was the right tool for the job when searching for these specific patterns. Sifting through a few false positives is okay if it means we don’t miss a critical security bug.
Given generic mode’s disadvantages, why use it for many of the rules released in this post? After all, Semgrep has official language support for both Bash and Dockerfiles. But consider the ssh-disable-host-key-checking rule. Using generic mode will find SSH commands disabling StrictHostKeyChecking in Bash scripts, Dockerfiles, CI configuration, documentation files, system calls in various programming languages, or other places we may not even be considering. Using the official Bash or Dockerfile support will cover only a single use case. In other words, using generic mode gives us the broadest possible coverage for a relatively simple heuristic that is applicable in many different scenarios.
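To make the coverage argument concrete, here is a rough sketch of a format-agnostic heuristic applied to snippets from several kinds of files. It uses a regular expression as a stand-in for generic mode; the expression is our approximation for illustration and is not the pattern used by the published ssh-disable-host-key-checking rule. The file names and contents in SAMPLES are made up.

```python
import re

# Hypothetical approximation of the heuristic: an ssh invocation followed, on
# the same line, by StrictHostKeyChecking being set to "no".
HOST_KEY_CHECK_DISABLED = re.compile(
    r"\bssh\b[^\n]*StrictHostKeyChecking\s*=?\s*no\b", re.IGNORECASE
)

# Made-up snippets standing in for different file types in a repository.
SAMPLES = {
    "deploy.sh": 'ssh -o StrictHostKeyChecking=no user@host "uptime"',
    "Dockerfile": "RUN ssh -o StrictHostKeyChecking=no git@example.com",
    "ci.yml": "  run: ssh -o StrictHostKeyChecking=no deploy@example.com ./release.sh",
    "app.py": 'subprocess.run(["ssh", "-o", "StrictHostKeyChecking=no", host])',
    "README.md": "Connect with `ssh user@host` and verify the fingerprint.",
}


def flagged_files(samples):
    """Names of the samples in which the heuristic fires, in sorted order."""
    return sorted(
        name for name, text in samples.items() if HOST_KEY_CHECK_DISABLED.search(text)
    )


print(flagged_files(SAMPLES))  # ['Dockerfile', 'app.py', 'ci.yml', 'deploy.sh']
```

One text-level heuristic covers the shell script, the Dockerfile, the CI file, and the Python wrapper, whereas a Bash-only or Dockerfile-only rule would see just one of them.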
For more information, see Semgrep’s official documentation on generic pattern matching.
YAML support
In addition to generic mode, YAML support helps make Semgrep a one-stop shop for searching for code, or text, in basically any text-based file in your filesystem. And YAML is eating the world: Kubernetes configuration, AWS CloudFormation, Docker Compose, GitHub Actions, GitLab CI, Argo CD, Ansible, OpenAPI specifications, and yes, even Semgrep rules themselves are written in YAML. In fact, Semgrep has best-practice rules for Semgrep rules, themselves written as Semgrep rules. Sem-ception.
Of course, you could write a basic utility in your programming language of choice that uses a mainstream YAML library to parse YAML and search for basic heuristics, but then you would be missing out on the rest of the Semgrep ecosystem. The fact that you can manage all these different types of files and file formats in one place is Semgrep’s killer feature. YAML rules sit next to Python rules, which sit next to Java rules, which sit next to generic rules. They all run in CI together, and findings can be managed in the same place. Ten tools for 10 types of files are no longer necessary.
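For comparison, a hand-rolled utility of the kind described above might look like the following sketch. It checks Ansible tasks for get_url downloads with validate_certs set to false. Everything here is an assumption for illustration: the PLAYBOOK data is invented, it is hard-coded as the Python structure a YAML library would return (for example, PyYAML's yaml.safe_load) so the sketch runs with the standard library alone, and the check only recognizes a boolean false, not the string forms a real playbook might contain.

```python
# Stand-in for a parsed playbook; in practice this would come from a YAML
# library rather than being written out by hand.
PLAYBOOK = [
    {
        "name": "Install tooling",
        "hosts": "all",
        "tasks": [
            {
                "name": "Download installer",
                "ansible.builtin.get_url": {
                    "url": "https://example.com/installer.sh",
                    "dest": "/tmp/installer.sh",
                    "validate_certs": False,
                },
            },
            {
                "name": "Download checksum",
                "ansible.builtin.get_url": {
                    "url": "https://example.com/installer.sha256",
                    "dest": "/tmp/installer.sha256",
                },
            },
        ],
    }
]

# Short and fully qualified names of the module we care about.
GET_URL_MODULES = ("get_url", "ansible.builtin.get_url")


def tasks_disabling_cert_validation(playbook):
    """Names of tasks whose get_url arguments set validate_certs to false."""
    findings = []
    for play in playbook:
        for task in play.get("tasks", []):
            for module in GET_URL_MODULES:
                args = task.get(module)
                if isinstance(args, dict) and args.get("validate_certs") is False:
                    findings.append(task.get("name", "<unnamed task>"))
    return findings


print(tasks_disabling_cert_validation(PLAYBOOK))  # ['Download installer']
```

Even this small check needs its own traversal logic, its own output format, and its own place in CI, which is the overhead that keeping YAML rules inside Semgrep avoids.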
We were recently engaged in an audit that included a large Ansible implementation. With this in mind, we set out to cover many of the basic security concerns one may expect in the ansible.builtin namespace. Searching for YAML patterns using Semgrep’s YAML rule format has a tendency to make your head spin, but once you get used to it, it becomes relatively formulaic. The highly structured nature of formats like JSON and YAML makes searching for patterns straightforward. The Ansible rules presented at the top of this post are relatively clear-cut, so instead let’s consider the patterns of the port-all-interfaces rule, which highlight the YAML functionality more distinctly:
    patterns:
      - pattern-inside: |
          services:
            ...
      - pattern: |
          ports:
            - ...
            - "$PORT"
            - ...
      - focus-metavariable: $PORT
      - metavariable-regex:
          metavariable: $PORT
          regex: '^(?!127.\d{1,3}.\d{1,3}.\d{1,3}:).+'
Figure 4: patterns searching for ports listening on all interfaces
The | YAML block style indicator used in the pattern-inside and pattern operators states that the text below is a plaintext string, not additional Semgrep rule syntax. Semgrep then interprets this plaintext string as YAML. Again, the fact that this is YAML within YAML takes some squinting at first, but the rest of the rule is relatively straightforward Semgrep syntax.
The rule itself is looking for services binding to all interfaces. The Docker Compose documentation states that, by default, services will listen on 0.0.0.0 when specifying ports. This rule finds ports that don’t start with a loopback address, like 127.0.0.1, which generally indicates they listen on all interfaces. This is not always a problem, but it can lead to issues like firewall bypass in certain circumstances.
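To see what the metavariable-regex clause accepts and rejects, the sketch below applies the regular expression from Figure 4 to a handful of port strings using Python's re module. The port values are made up for illustration, and we assume Python's regex engine treats this particular expression the same way Semgrep does. Note that the dots in the expression are unescaped, so they match any character; that does not change the outcome for these inputs.

```python
import re

# The regular expression from the rule's metavariable-regex clause: a negative
# lookahead that rejects strings beginning with a 127.x.x.x host address.
NOT_LOOPBACK = re.compile(r"^(?!127.\d{1,3}.\d{1,3}.\d{1,3}:).+")

# Made-up Docker Compose short-syntax port entries.
PORTS = [
    "8080:80",                  # no host IP given
    "0.0.0.0:8080:80",          # explicitly all interfaces
    "127.0.0.1:8080:80",        # loopback only
    "127.0.0.1:5432:5432/tcp",  # loopback only, with protocol
    "443",                      # container port only
]


def flagged(ports):
    """Port entries that the rule's regular expression would match."""
    return [port for port in ports if NOT_LOOPBACK.search(port)]


print(flagged(PORTS))  # ['8080:80', '0.0.0.0:8080:80', '443']
```

Only the entries that begin with a 127.x.x.x address are skipped. An entry bound to some other specific address would still be flagged, which is one reason findings from this rule deserve a quick manual review.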
Extend your reach with Semgrep
Semgrep is a great tool for finding bugs across many disparate technologies. This post introduced 30 new Semgrep rules and discussed two lesser-known features: generic mode and YAML support. Adding YAML and generic searching to Semgrep’s extensive list of supported programming languages makes it an even more universal tool. Heuristics for problematic code or infrastructure and their corresponding findings can be managed in a single location.
If you’d like to read more about our work on Semgrep, we have used its capabilities in several ways, such as securing machine learning pipelines, discovering goroutine leaks, and securing Apollo GraphQL servers.
Contact us if you’re interested in custom Semgrep rules for your project.