30 new Semgrep rules: Ansible, Java, Kotlin, shell scripts, and more

By Matt Schwager and Sam Alws

We are publishing a set of 30 custom Semgrep rules for Ansible playbooks, Java/Kotlin code, shell scripts, and Docker Compose configuration files. These rules were created and used to audit for common security vulnerabilities in the listed technologies. This new release of our Semgrep rules joins our public CodeQL queries and Testing Handbook in an effort to share our technical expertise with the security community. This blog post will briefly cover the new Semgrep rules, then go in depth on two lesser-known Semgrep features that were used to create these rules: generic mode and YAML support.

For this release of our internal Semgrep rules, we focused on issues like unencrypted network transport (HTTP, FTP, etc.), disabled SSL certificate verification, insecure flags specified for common command-line tools, unrestricted IP address binding, miscellaneous Java/Kotlin concerns, and more. Here are our new rules:

Mode	Rule ID	Rule description
Generic	`container-privileged`	Found `container` command with extended privileges
Generic	`container-user-root`	Found `container` command running as root
Generic	`curl-insecure`	Found `curl` command disabling SSL verification
Generic	`curl-unencrypted-url`	Found `curl` command with unencrypted URL (e.g., HTTP, FTP, etc.)
Generic	`gpg-insecure-flags`	Found `gpg` command using insecure flags
Generic	`installer-allow-untrusted`	Found `installer` command allowing untrusted installations
Generic	`openssl-insecure-flags`	Found `openssl` command using insecure flags
Generic	`ssh-disable-host-key-checking`	Found `ssh` command disabling host key checking
Generic	`tar-insecure-flags`	Found `tar` command using insecure flags
Generic	`wget-no-check-certificate`	Found `wget` command disabling SSL verification
Generic	`wget-unencrypted-url`	Found `wget` command with unencrypted URL (e.g., HTTP, FTP, etc.)
Java, Kotlin	`gc-call`	Calling `gc` suggests to the JVM that the garbage collector should be run, and memory should be reclaimed. This is only a suggestion, and there is no guarantee that anything will happen. Relying on this behavior for correctness or memory management is an anti-pattern.
Java, Kotlin	`mongo-hostname-verification-disabled`	Found MongoDB client with SSL hostname verification disabled
YAML (Ansible)	`apt-key-unencrypted-url`	Found apt key download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`apt-key-validate-certs-disabled`	Found apt key with SSL verification disabled
YAML (Ansible)	`apt-unencrypted-url`	Found apt deb with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`dnf-unencrypted-url`	Found dnf download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`dnf-validate-certs-disabled`	Found dnf with SSL verification disabled
YAML (Ansible)	`get-url-unencrypted-url`	Found file download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`get-url-validate-certs-disabled`	Found file download with SSL verification disabled
YAML (Ansible)	`rpm-key-unencrypted-url`	Found RPM key download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`rpm-key-validate-certs-disabled`	Found RPM key with SSL verification disabled
YAML (Ansible)	`unarchive-unencrypted-url`	Found unarchive download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`unarchive-validate-certs-disabled`	Found unarchive download with SSL verification disabled
YAML (Ansible)	`wrm-cert-validation-ignore`	Found Windows Remote Management connection with certificate validation disabled
YAML (Ansible)	`yum-unencrypted-url`	Found yum download with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`yum-validate-certs-disabled`	Found yum with SSL verification disabled
YAML (Ansible)	`zypper-repository-unencrypted-url`	Found Zypper repository with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Ansible)	`zypper-unencrypted-url`	Found Zypper package with unencrypted URL (e.g., HTTP, FTP, etc.)
YAML (Docker Compose)	`port-all-interfaces`	Service port is exposed on all interfaces

Semgrep 201: intermediate features

Semgrep is a static analysis tool for finding code patterns. This includes security vulnerabilities, bug variants, secrets detection, performance and correctness concerns, and much more. While Semgrep includes a proprietary cloud offering and more advanced rules, Semgrep CLI is free to install and run locally. You can run Trail of Bits’ rules, including the rules mentioned above, with the following command:

semgrep scan --config p/trailofbits /path/to/code
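
For scripted use, the same scan can be driven from Python. This is a minimal sketch, assuming the `semgrep` CLI is on the PATH and that its `--json` output carries a top-level `results` list whose entries include `check_id`, `path`, and `start.line`. The helper names `build_scan_command`, `summarize_results`, and `run_scan` are hypothetical and not part of Semgrep.

```python
import json
import subprocess


def build_scan_command(target, config="p/trailofbits"):
    """Return the argv list for a Semgrep scan that emits JSON."""
    return ["semgrep", "scan", "--config", config, "--json", target]


def summarize_results(raw_json):
    """Turn Semgrep's JSON report into (rule ID, path, line) tuples."""
    report = json.loads(raw_json)
    return [
        (finding["check_id"], finding["path"], finding["start"]["line"])
        for finding in report.get("results", [])
    ]


def run_scan(target):
    """Run the scan and summarize its findings (requires semgrep to be installed)."""
    proc = subprocess.run(
        build_scan_command(target),
        capture_output=True,
        text=True,
        check=False,  # inspect the output ourselves instead of raising on exit code
    )
    # Raises json.JSONDecodeError if semgrep printed nothing parseable.
    return summarize_results(proc.stdout)
```

Calling `run_scan("/path/to/code")` would return one tuple per finding, which is convenient for filtering by rule ID in a CI script.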

This post will not go into all the details of each rule presented above. The basics of Semgrep have already been discussed extensively by both Trail of Bits and the broader security community, so this post will discuss two lesser-known Semgrep features in more depth: generic mode and YAML support.

generic mode

Semgrep’s generic mode provides an easy method for searching for arbitrary text. Unlike Semgrep’s syntactic support for programming languages like Java and Python, generic mode is glorified text search. Naturally, this provides both advantages and disadvantages: generic mode tends to produce more false positives but fewer false negatives. In other words, it produces more findings, but you may have to sift through them. Limiting the file paths a rule applies to is one way to reduce false positives. However, the primary reason for using generic mode is the breadth of data it can search.

generic mode can roughly be thought of as an ergonomic alternative to regular expressions. They both perform arbitrary text search, but generic mode offers improved handling of newlines and other whitespace. It also offers Semgrep’s familiar ellipsis operator, metavariables, and tight integration with the rest of the Semgrep ecosystem for managing findings. Any text file or text-based data can be analyzed in generic mode, so it’s a great option when you want to analyze less commonly used formats such as Jinja templates, NGINX configuration files, HAML templates, TOML files, HTML content, or any other text-based format.

The primary disadvantage of generic mode is that it has no semantic understanding of the text it parses. Patterns may therefore be detected in commented-out code or other unintended places, producing false positives. For example, if we search for os.system(...) in both generic mode and python mode in the following code, we will get different results:

import os

# Uncomment when debugging
# os.system("debugger")

os.system("run_production")

Figure 1: Python code with a line commented out

$ semgrep scan --lang python --pattern "os.system(...)" test.py 
...                       
    test.py 
            6┆ os.system("run_production")
...
Ran 1 rule on 1 file: 1 finding.

Figure 2: python mode semantically understands the comment.

$ semgrep scan --lang generic --pattern "os.system(...)" test.py 
...                       
    test.py 
            4┆ # os.system("debugger")
            ⋮┆----------------------------------------
            6┆ os.system("run_production")
...
Ran 1 rule on 1 file: 2 findings.

Figure 3: generic mode does not semantically understand the comment.
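
The difference between the two modes can be approximated outside of Semgrep. The sketch below is not how Semgrep is implemented: it uses Python's `re` module as a stand-in for text search and the `ast` module as a stand-in for syntax-aware matching, applied to the code from Figure 1. The function names are hypothetical.

```python
import ast
import re

SOURCE = '''import os

# Uncomment when debugging
# os.system("debugger")

os.system("run_production")
'''


def text_matches(source):
    """Line numbers where the raw text contains an os.system(...) call."""
    pattern = re.compile(r"os\.system\(.*\)")
    return [
        lineno
        for lineno, line in enumerate(source.splitlines(), start=1)
        if pattern.search(line)
    ]


def syntax_matches(source):
    """Line numbers of os.system(...) calls present in the parsed syntax tree."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "system"
            and isinstance(node.func.value, ast.Name)
            and node.func.value.id == "os"
        ):
            hits.append(node.lineno)
    return sorted(hits)


print(text_matches(SOURCE))    # [4, 6]: the commented-out call is matched too
print(syntax_matches(SOURCE))  # [6]: comments never reach the syntax tree
```

The text search reports the same two lines as Figure 3, while the parser-based search reports only the live call, as in Figure 2.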

Another disadvantage of generic mode is that it does not benefit from Semgrep’s extensive list of equivalences. Despite this, we still felt it was the right tool for the job when searching for these specific patterns. Sifting through a few false positives is okay if it means we don’t miss a critical security bug.

Given generic mode’s disadvantages, why use it for many of the rules released in this post? After all, Semgrep has official language support for both Bash and Dockerfiles. But consider the ssh-disable-host-key-checking rule. Using generic mode will find SSH commands disabling StrictHostKeyChecking in Bash scripts, Dockerfiles, CI configuration, documentation files, system calls in various programming languages, or other places we may not even be considering. Using the official Bash or Dockerfile support will cover only a single use case. In other words, using generic mode gives us the broadest possible coverage for a relatively simple heuristic that is applicable in many different scenarios.
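
To illustrate the breadth argument, here is a rough sketch of a single text heuristic applied to several kinds of files at once. The regular expression is a hypothetical simplification and not the published ssh-disable-host-key-checking rule, and the file names and contents are invented samples.

```python
import re

# Hypothetical heuristic: StrictHostKeyChecking set to "no" via "=" or whitespace.
HOST_KEY_CHECK_DISABLED = re.compile(r"StrictHostKeyChecking[=\s]+no\b")

# Invented sample files of different types; none of these come from a real project.
SAMPLES = {
    "deploy.sh": 'ssh -o StrictHostKeyChecking=no user@host "uptime"\n',
    "Dockerfile": "RUN ssh -o StrictHostKeyChecking=no git@example.com\n",
    ".gitlab-ci.yml": (
        "script:\n"
        "  - ssh -o StrictHostKeyChecking=no deploy@example.com ./release.sh\n"
    ),
    "run.py": (
        "import subprocess\n"
        'subprocess.run(["ssh", "-o", "StrictHostKeyChecking=no", "host"])\n'
    ),
    "safe.sh": "ssh -o StrictHostKeyChecking=yes user@host\n",
}


def find_disabled_host_key_checks(files):
    """Return (file name, line number) for every line matching the heuristic."""
    hits = []
    for name, text in files.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if HOST_KEY_CHECK_DISABLED.search(line):
                hits.append((name, lineno))
    return hits


for name, lineno in find_disabled_host_key_checks(SAMPLES):
    print(f"{name}:{lineno}")
```

One pattern covers the shell script, the Dockerfile, the CI configuration, and the Python subprocess call, which is the kind of coverage a language-specific rule would need four separate rules to reach.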

For more information, see Semgrep’s official documentation on generic pattern matching.

YAML support

In addition to generic mode, YAML support helps make Semgrep a one-stop shop for searching for code, or text, in basically any text-based file in your filesystem. And YAML is eating the world: Kubernetes configuration, AWS CloudFormation, Docker Compose, GitHub Actions, GitLab CI, Argo CD, Ansible, OpenAPI specifications, and yes, Semgrep rules themselves are even written in YAML. In fact, Semgrep has best practice rules written for Semgrep rules in Semgrep rules. Sem-ception.

Of course, you could write a basic utility in your programming language of choice that uses a mainstream YAML library to parse YAML and search for basic heuristics, but then you would be missing out on the rest of the Semgrep ecosystem. The fact that you can manage all these different types of files and file formats in one place is Semgrep’s killer feature. YAML rules sit next to Python rules, which sit next to Java rules, which sit next to generic rules. They all run in CI together, and findings can be managed in the same place. Ten tools for ten types of files are no longer necessary.

We were recently engaged in an audit that included a large Ansible implementation. With this in mind, we set out to cover many of the basic security concerns one may expect in the ansible.builtin namespace. Searching for YAML patterns using Semgrep’s YAML rule format has a tendency to make your head spin, but once you get used to it, it becomes relatively formulaic. The highly structured nature of formats like JSON and YAML makes searching for patterns straightforward. The Ansible rules presented at the top of this post are relatively clear-cut, so instead let’s consider the port-all-interfaces rule patterns, which highlight the YAML functionality more distinctly:

patterns:
  - pattern-inside: |
      services:
        ...
  - pattern: |
      ports:
        - ...
        - "$PORT"
        - ...
  - focus-metavariable: $PORT
  - metavariable-regex:
      metavariable: $PORT
      regex: '^(?!127\.\d{1,3}\.\d{1,3}\.\d{1,3}:).+'

Figure 4: patterns searching for ports listening on all interfaces

The | YAML block style indicator used in the pattern-inside and pattern operators states that the text below is a plaintext string, not additional Semgrep rule syntax. Semgrep then interprets this plaintext string as YAML. Again, the fact that this is YAML within YAML takes some squinting at first, but the rest of the rule is relatively straightforward Semgrep syntax.

The rule itself is looking for services binding to all interfaces. The Docker Compose documentation states that, by default, services will listen on 0.0.0.0 when specifying ports. This rule flags port entries that do not begin with a loopback address such as 127.0.0.1, which usually means they listen on all interfaces. This is not always a problem, but it can lead to issues like firewall bypass in certain circumstances.
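
The effect of the metavariable-regex filter can be checked in isolation. The sketch below applies the negative lookahead from Figure 4, written here with the dots escaped so that they match literal periods only, to a few invented port strings. Python's `re` module stands in for Semgrep's regex engine, which is an assumption that should hold for a pattern this simple. The function name is hypothetical.

```python
import re

# Lookahead idea from Figure 4, with dots escaped to match literal periods.
NOT_LOOPBACK = re.compile(r"^(?!127\.\d{1,3}\.\d{1,3}\.\d{1,3}:).+")


def binds_all_interfaces(port_spec):
    """True when a short-syntax port string lacks a 127.x.x.x host prefix."""
    return NOT_LOOPBACK.match(port_spec) is not None


# Invented examples of Docker Compose short-syntax port entries.
PORTS = [
    "8080:80",
    "0.0.0.0:8080:80",
    "127.0.0.1:8080:80",
    "127.0.0.1:5432:5432/tcp",
    "3000",
]

flagged = [port for port in PORTS if binds_all_interfaces(port)]
print(flagged)  # ['8080:80', '0.0.0.0:8080:80', '3000']
```

Only the entries prefixed with a 127.x.x.x address pass the filter unflagged. Everything else, including entries with no host address at all, would be reported.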

Extend your reach with Semgrep

Semgrep is a great tool for finding bugs across many disparate technologies. This post introduced 30 new Semgrep rules and discussed two lesser-known features: generic mode and YAML support. Adding YAML and generic searching to Semgrep’s extensive list of supported programming languages makes it an even more universal tool. Heuristics for problematic code or infrastructure and their corresponding findings can be managed in a single location.

If you’d like to read more about our work on Semgrep, we have used its capabilities in several ways, such as securing machine learning pipelines, discovering goroutine leaks, and securing Apollo GraphQL servers.