Ideas and Execution
2024-12-09 16:08:51 · Author: soatok.blog

I’ve been known to blog about ideas that I don’t have the time or energy to build myself–from using asynchronous ratcheting trees to support multicast networking in WireGuard (and other Noise-based protocols) to how Bluesky could introduce private accounts (or limited audiences) without changing their current architecture–in the hopes that someone might read them and be inspired to build something similar for themselves.

Business types are fond of saying, “Ideas are cheap, execution is everything.”

In their world, that’s probably sage advice. I immediately imagine people who see themselves as the “idea person” but contribute little to the actual work that needs to get done.

Thus, in the “business” world, a person who has ideas is probably seen as a nuisance, unless they’re also able and willing to execute on them.

But this blog doesn’t exist in their world, now does it?

Three panel comic.
First panel: Soatok (blue dhole) picks up a red book, confused.
Second panel: The cover of the book. It reads "Being a Yes-Man to Yacht Owners" with the subtitle, "The Rat Race Imagined As An Optimization Problem".
Third panel: Soatok smiles and tosses the book over his shoulder as if to say, "Don't need that shit!"
Sophie

For years, I’ve managed to avoid the perverse incentives that come with profit-seeking, though marketers still contact me all the time requesting I promote their nonsense.

This blog does not serve ads. In fact, it promotes the use of adblockers through this WordPress plugin.

I refuse to engage in so-called native advertising, or to allow other people to write posts here.

It’s not that I hate money. In fact, I’ve shared a link to let folks buy me a coffee for a while now, and a few people were kind enough to do so.

Recently, due to popular demand, I even decided to start offering Critiques.

Basically, you can pay for up to an hour of my time to give feedback on software, design docs, etc. in preparation for more formal review. The linked webpage has more details.

It’s there if anyone wants it, but nobody should feel obligated or pressured at all to even consider it. I have a full-time job as a security engineer.

Point being: If I’m not playing their game, I also don’t need to play by their rules. So here are a few free ideas that anyone can pick up and run with if they want.

Some of them could yield profitable small businesses (or startups, if you really want to play the venture capitalism / enshittification game). Others might make for neat open source software projects. Some might be total wastes of time, or require expertise that none of us have (myself especially).

I will describe each in as much detail as I’ve envisioned it.

ScruffKerfluff

Free Ideas (Take One)

Each of these ideas is free for the taking, should anyone wish to adopt them, with no strings or stipulations attached.

However, at the end of this post, I do have one request to anyone that succeeds at implementing any of them.

I was originally going to include a few more ideas, but they’re too similar to ones I’ve heard other people share, and I don’t want to step on friends’ toes.

Threshold Single Sign-On

Elevator Pitch: Use Threshold Signature Schemes and hardware tokens to provide SSO with built-in failover.

Threshold Signature Schemes, such as FROST (RFC 9591), enable use-cases where multiple participants must cooperate in order to produce a valid signature.

FROST specifically is attractive here because one of the ciphersuites produces a valid Ed25519 signature.

To date, most of the interest in threshold signatures comes from the cryptocurrency space, where non-custodial wallets are attractive.

What I’m proposing is to instead use them for building an identity provider that has a dedicated service component (what Okta and Auth0 satisfy today) along with an on-premise component for enterprise customers.

What Threshold SSO Might Look Like

  • Either a custom FIDO2-compatible hardware token, or software that uses the hmac-secret extension to derive wrapping keys to encrypt the end user’s shares.
  • An appropriate threshold where the end user and either the service provider or the on-premise component can produce a valid signature.

    For example: A 3-of-4 configuration where 2 of the shares are protected by the hardware token.

  • The message being signed, along with the valid signature, produces a valid credential. See also, OpenID Connect.
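FROST itself needs a vetted library, but the k-of-n reconstruction property these configurations rely on can be illustrated with a toy Shamir secret sharing sketch. Everything below is my own illustration (the field, function names, and the deterministic coefficient callback are invented for the demo); it is not FROST, and it is not secure code:

```javascript
// Toy Shamir k-of-n secret sharing over a prime field.
// Illustrates the threshold concept only -- NOT FROST, not
// constant-time, and no secure randomness hygiene.
const P = 2n ** 127n - 1n; // a Mersenne prime; fine for a demo

const mod = (a, m = P) => ((a % m) + m) % m;

// Modular inverse via Fermat's little theorem (P is prime).
function inv(a) {
  let result = 1n, base = mod(a), e = P - 2n;
  while (e > 0n) {
    if (e & 1n) result = mod(result * base);
    base = mod(base * base);
    e >>= 1n;
  }
  return result;
}

// Split `secret` into n shares; any k of them reconstruct it.
// `randomCoeff` must return uniformly random field elements in
// real use -- it is a parameter here so the demo is deterministic.
function split(secret, k, n, randomCoeff) {
  const coeffs = [mod(secret)];
  for (let i = 1; i < k; i++) coeffs.push(mod(randomCoeff(i)));
  return Array.from({ length: n }, (_, idx) => {
    const x = BigInt(idx + 1);
    let y = 0n, xp = 1n;
    for (const c of coeffs) { y = mod(y + c * xp); xp = mod(xp * x); }
    return { x, y };
  });
}

// Lagrange interpolation at x = 0 recovers the secret from k shares.
function combine(shares) {
  let secret = 0n;
  for (const { x: xi, y: yi } of shares) {
    let num = 1n, den = 1n;
    for (const { x: xj } of shares) {
      if (xj === xi) continue;
      num = mod(num * mod(-xj));
      den = mod(den * (xi - xj));
    }
    secret = mod(secret + yi * num * inv(den));
  }
  return secret;
}
```

With a 3-of-4 split like the example configuration, any three shares recover the secret and any two reveal nothing useful. FROST applies the same threshold idea to producing signatures without ever reconstructing the signing key in one place.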

“What does this get you?”

With threshold signature schemes, if the service is ever compromised, an attacker can at best obtain the service’s share. This is insufficient to generate a valid signature and impersonate users.

The same cannot be said of identity providers today.

Additionally, if the service goes offline or experiences some sort of outage (e.g., distributed denial-of-service attack), the types of enterprise customers that want resilience get automatic fail-over to their own authentication infrastructure, and this can be invisible to their end users.

Resilience, uptime, and a harder attack surface. What’s not to love?

“Why haven’t you built this yourself?”

I don’t have the time.

I’m already busy with other projects which are taking up a ton of my time, and I have a full-time job to boot.

There might be technical, financial, political, or logistical hurdles that I haven’t anticipated.

For starters, you need a library that implements FROST for signing and to define the protocol for enrolling shares into the system, and mapping the signature to a specific user. That might become complicated, in practice.

Encrypted Full-Text Search

Elevator Pitch: Full-text search for databases that only store ciphertext.

To fully understand what I’m talking about for this idea, you need to first read Database Cryptography Fur the Rest Of Us. Specifically, the section about truncated HMAC indexes for encrypted data.

OwO sticker
CMYKat
(Seriously, read it first.)

Briefly, you can build what AWS calls “beacons” out of HMAC, with a static key, if you truncate the output. This lets you query your database for records that have the same beacon, which doesn’t necessarily mean the same plaintext (if your beacon length is chosen appropriately). This lets you control the false positive rate (and thus, minimize performance penalty) without giving active attackers a crib on which plaintext certainly corresponds to which record.

You can also build beacons out of transformations of the plaintext. This lets you, for example, make a case-insensitive beacon if you normalize your inputs to lowercase (or uppercase).

Unfortunately, turning this into a secure full-text search is nontrivial. I have a few different ideas here, many of which may not actually work that well, but all of them start with the same premise. So I will share that much today.

Full-text Search, Hold the Encryption

First, you’re going to want to understand how PostgreSQL implements full text search.

Confused protogen
Yeah, I know. This is a lot of required reading.
(Art: Harubaki)

When you insert a record in PostgreSQL with fulltext search enabled, you create a tsvector of the text, and typically create a GIN index on the tsvector. Learn more.

When searching, you create a ts_query type out of your search parameters, and the database engine will figure out which records match. You can also rank the results of the query, based on arbitrary weights.

Now With Extra Beacon

Adapting the Beacon model for querying encrypted records to how PostgreSQL implements fulltext search is straightforward enough:

Specify a short beacon length, and transform each lexeme into a beacon.

“Short” here means a power of 2 larger than √n, where n is the number of all possible lexemes in your corpus of plaintext. For English text, n should be just above 2^17, so a beacon length of 9 bits might be appropriate.
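That arithmetic, as a quick sanity check (`beaconBits` is a name I made up for this sketch):

```javascript
// Smallest power of two >= sqrt(n), expressed in bits:
// ceil(log2(sqrt(n))). For n just above 2^17, sqrt(n) is just
// above 2^8.5, so the next power of two gives a 9-bit beacon.
function beaconBits(lexemeCount) {
  return Math.ceil(Math.log2(Math.sqrt(lexemeCount)));
}
```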

What’s not straightforward is which optimization or encoding strategies to use (if any at all).

Some might seem fine, but if you know that certain lexemes are weighted more heavily than others, you might be able to deduce some information with frequency analysis. Thus, it might be wise to run the beacon algorithm multiple times with independent static keys, and then only use some of them when querying.

It might also be worth storing a non-beacon tsvector (encrypted, of course) so that client-side software can filter out false positives more effectively. This would require an application-layer library that emits the same data structure the database software uses.

There’s a lot you can do from here. I’ve considered creating a sort of blinded “heat map” of the tsvectors. I’ve considered exploring ways to use Fast Fourier Transforms or lattice-based algorithms to encode the tsvectors in a way that permits faster matching.

But, as I said previously: I have a lot of other projects on my plate, and I haven’t had the time to flesh this idea out entirely.

Random-Access AEAD

Elevator Pitch: Enable authenticated reads to a subset of a ciphertext, without needing the whole blob.

This is kind of niche, but may be a fun research area.

The S3 Encryption Client supports so-called Ranged GETs. What this means is you have a large blob of ciphertext stored in Amazon S3, and then you ask S3 for part of it, and then decrypt that sequence of bytes.

Currently, if you use this feature, you’re handling unauthenticated ciphertext–even if the object in its entirety uses authenticated encryption.

This is a direct consequence of the construction: AES-GCM is, under the hood, just AES in Counter Mode with an authentication tag and specific usage requirements, so any slice of the ciphertext can be decrypted without the tag ever being checked.

So here’s the challenge: How can you prove the ciphertext hasn’t been tampered with? Especially if you cannot download the entire ciphertext blob?

How can you implement it in a way that remains backwards compatible with data that was stored using AES-GCM years ago? (This means you can add metadata, but you cannot change the encryption format at all.)

Possible Solution

One idea is to build Merkle Tree proofs of inclusion, where each leaf corresponds to the HMAC of a block of ciphertext. Each block uses a different subkey, calculated from a counter-based key derivation algorithm.

The inputs to this counter-based KDF should include the context of the object in question.

For example, using HMAC-SHA-512 then AES-256 in a CTR-like construction to derive a distinct subkey for each leaf:

const crypto = require('crypto');

// Unsigned 64-bit little-endian length prefix, to keep the
// multi-part hash unambiguous.
function le64(n) {
  const buf = Buffer.alloc(8);
  buf.writeBigUInt64LE(BigInt(n));
  return buf;
}

function getLeafIKM(key, objectName, gcmIv, gcmTag, leafBlockSize = 8192) {
  return crypto.createHmac('sha512', key)
    // Some protocol-specific constant
    .update(DOMAIN_SEPARATION_PREFIX)
    // What is the S3 object we're working with?
    .update(le64(objectName.length))
      .update(objectName)
    // The IV from the object metadata, used for AES-GCM
    .update(le64(gcmIv.length))
      .update(gcmIv)
    // The existing authentication tag
    .update(le64(gcmTag.length))
      .update(gcmTag)
    // The block size for each leaf
    .update(le64(leafBlockSize))
    // Finally, we only return 32 bytes:
    .digest().subarray(0, 32);
}

/**
 * Use the AES-256 keystream over two sequential blocks
 * as a fast way to derive multiple subkeys from the
 * value returned by getLeafIKM().
 */
function le128(n) {
  const buf = Buffer.alloc(16);
  buf.writeBigUInt64LE(BigInt(n)); // upper 8 bytes remain zero
  return buf;
}

function getLeafSubKey(leafIKM, blockIndex = 0) {
  const cipher = crypto.createCipheriv('aes-256-ecb', leafIKM, null);
  cipher.setAutoPadding(false);
  const lowerHalf = le128(blockIndex * 2);
  const upperHalf = le128(blockIndex * 2 + 1);
  return Buffer.concat([
    cipher.update(lowerHalf),
    cipher.update(upperHalf),
    cipher.final(),
  ]);
}

Once you have a fast way of calculating a different leaf subkey, calculate the HMAC tag of each block of ciphertext, then construct a Merkle tree out of these HMAC leaves.

Security-Relevant Notice

Each leaf must have a distinct key. This prevents block reordering.

It’s important not to YOLO your multi-part hashing.

Be sure to commit to the leaf block size and object name in your key derivation.

It’s not as important to commit to the GCM IV and authentication tag. I included all of them for completeness, and to get closer to the platonic ideal of “transcript hashing”, but they are strictly speaking “not necessary”.

Store the Merkle tree in the object metadata (or next to the object). You can do this with objects already stored encrypted this way.

On the Read path, fetch the hashes that correspond to the authentication paths for each of the blocks that overlap with the desired Range. Verify their proofs of inclusion.

Client-side, you will verify up to two blocks’ worth of excess bytes for your Ranged GET (the partial blocks at either end), then strip the undesired bytes (if any) after decrypting.

The advantage to this approach is you can staple the capability to authenticate your Ranged GETs onto existing data.

The disadvantage? It requires storing the entire Merkle tree, which is a number of hashes close to twice the number of blocks.

Additionally, each block in the range will require log2(n) hashes for its proof of inclusion.

Can We Do Better?

I don’t know. Maybe you have ideas that I don’t?

Differential Complexity Fingerprinting

Elevator Pitch: Measure code complexity across versions, flag anomalies (especially ones that violate the versioning scheme).

That’s enough cryptography nerd stuff. How about an idea for the application security nerds in the crowd?

Whether you’re a package manager (NPM, PyPI, Maven, Composer, etc.) or a browser extension ecosystem, you might stand to benefit from a differential complexity fingerprinting approach to catch crypto-miners and source code hijacks.

“What does that even mean?”

For each release of a piece of software:

  1. Measure the cyclomatic complexity of each unit of code throughout the codebase.
  2. Store this information alongside each tag for later analysis.

Later (or, rather, immediately after a new release), you can diff the two values and see what has changed.

Has there been a lot of complexity added in a patch release? You might have a crypto-miner.

Has the entire software structure changed? Perhaps their account was hacked, or the developer might have decided to pull a left-pad on the downstream packages.
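To make this concrete, here is a deliberately crude sketch: it counts branch keywords as a stand-in for real cyclomatic complexity, and the anomaly threshold is a number I made up. A production tool would compute complexity per function from the AST:

```javascript
// Crude cyclomatic-complexity proxy: count branching constructs.
// A real tool would walk the AST per function; this regex count is
// just enough to demonstrate the diffing idea.
const BRANCHES = /\b(if|for|while|case|catch)\b|&&|\|\||\?/g;

function complexity(source) {
  return 1 + (source.match(BRANCHES) ?? []).length;
}

// Compare two releases' per-file scores and flag suspicious jumps.
// `threshold` is an invented ratio marking how much growth counts
// as "anomalous" for, say, a patch release.
function diffComplexity(oldFiles, newFiles, threshold = 2.0) {
  const flags = [];
  for (const [path, newSrc] of Object.entries(newFiles)) {
    const oldScore = complexity(oldFiles[path] ?? '');
    const newScore = complexity(newSrc);
    if (newScore > oldScore * threshold) {
      flags.push({ path, oldScore, newScore });
    }
  }
  return flags;
}
```

Even this blunt a metric would light up when a one-line patch release suddenly ships a nest of loops.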

Just Cyclomatic Complexity?

You can probably get a lot of value out of just cyclomatic complexity measurements, but you can always take it a step further to detect bugdoors too:

  • Static analysis reports (e.g., via Semgrep)
  • Taint analysis scans (i.e., against dangerous functions or features)

But you don’t need to go this far. I’d wager that just studying how complexity changes over time will yield a lot of insight into rogue/compromised open source developers looking to infect or abuse downstream systems.

One Simple Request

Thanks for taking the time to read through and consider some of the ideas I’m sharing with the world today.

If anyone chooses to embark on the journey to bring any of these ideas to life, I have but one simple request:

At least one of your major releases should have a nickname, which is named after your favorite animal or a pun about your favorite animal.

Granted, there are a lot of other things I could ask of you. I could ask that you pay my kindness forward and help others out where you can. I could ask that you use a copyleft license and keep the spirit of free and open source software alive. I could ask that you structure any business ventures as a worker-owned co-operative, so that nobody gets screwed over.

But those suggestions all have a cost attached to them, even if they’re one you might be willing to accept.

It is not my wish to encumber anyone that tolerates my blog and entertains my ideas.

I’d rather make a silly, harmless request that costs nothing to consider. If anyone takes me up on this when they succeed, the world will be a slightly more fun place.

“These ideas are half-baked!”

You’re entitled to a full refund for the money you spent hearing them.


Header art by CMYKat.


Source: https://soatok.blog/2024/12/09/ideas-and-execution/