There are more than four ways to mask data, but these are the main ones to focus on in this post.
If we want to detect a compressed or encrypted stream of bytes but can’t rely on a file header for a signature, the best way is by using something like a Chi-Square test. The more uniform the data is, the more likely it is to be compressed or encrypted.
Steganography is better at masking. Some image formats already use lossless compression to reduce the size of files. The PNG format, for example, uses Zlib, and the high compression ratio will result in the file having a high amount of entropy. The GIF format uses LZW as its compression method. TIFF also uses LZW by default while WEBP uses Brotli compression.
In mathematics, an involution, or an involutory function, is a function that is its own inverse; For the following instructions, I’m merely using this word to describe what they do in practice. Executed once will mask data, and executing again will unmask. These are very common but also very weak when used alone.
1) eXclusive-OR bitwise operation xor eax, -1 2) Bitwise NOT not eax xor eax, -1 3) Bitwise Negation neg eax imul eax, -1 4) Circular Shift shrd eax, eax, 16 ror eax, 16 rol eax, 16 ror al, 4 5) Byte Swapping bswap eax xchg al, ah
The circular shift and byte swapping operations are much closer to a permutation. They could also be used on large arrays in addition to the shuffling.
Let’s imagine you want to shuffle a deck of cards for an online poker game. The shuffling algorithm must be unbiased, and the results can’t be predictable before a game begins. Many who have asked for such an algorithm know of the Fisher-Yates shuffle. It’s an algorithm for generating a random permutation of a finite sequence. It was proposed by Ronald Fisher and Frank Yates in their book Statistical Tables for Biological, Agricultural and Medical Research published in 1939. Richard Durstenfeld modified the algorithm in 1964, and Donald E. Knuth popularised it in his 1968 book The Art of Computer Programming, hence why some refer to it as the Knuth Shuffle.
The following code in C illustrates how one might shuffle a byte array. Here, we’re using the current time as a seed to initialise the PRNG, which wouldn’t be recommended for a poker game. 😀
void fisher_yates_shuffle(void *inbuf, size_t inlen) { uint8_t *in = (uint8_t*)inbuf; uint8_t t; srand(time(0)); for (size_t i = inlen - 1; i > 0; i--) { uint32_t j = rand() % (i + 1); uint8_t = in[i]; in[i] = in[j]; in[j] = t; } }
Obtaining a unique sequence of numbers to shuffle the array is problematic. Most software will use a pseudorandom number generator (PRNG). However, knowing how to generate the same sequence of numbers used to shuffle a deck of cards allows us to determine where every card is and even reverse the process. But that’s precisely what makes Fisher-Yates useful for masking. We want to unshuffle our masked data later; it’s just that rand() isn’t suitable. We need something else.
Apart from rand() being weak for shuffling, unshuffling the array would require starting with the last number returned by it. rand() doesn’t support this type of random access, therefore our unshuffling algorithm would be required to generate the exact same sequence of numbers and store each one in memory before starting to unshuffle. We need a function that can produce deterministic values based on a seed or key. Seeded or keyed shuffling and unshuffling is really what we need.
A PRNG is also a Deterministic Random Bit Generator (DRBG). The DRBG/PRNG-generated sequence is not truly random because an initial value, called the PRNG’s seed (which may include truly random values), entirely determines the output bits generated by it. Therefore, we can replace rand() with a stream cipher like RC4, ChaCha, or a block cipher like AES in Counter (CTR) mode and generate deterministic values.
NIST has defined how to construct a DRBG from CTR mode in SP 800-90Ar1, but it’s unnecessary to use this for masking. Rather than implement a DRBG, we just need to encrypt the range index using a secret key and then derive an unbiased number within that range from the ciphertext. The following code tries to demonstrate how it might be done in practice.
#if defined(_WIN64) // // SPECK128-256 // #define WORDLEN 64 #define PRNG_MAX_INT (INT64_MAX + 1) #define ENCRYPT_KEY_LEN 32 #define ENCRYPT_BLOCK_LEN 16 #define R(v,n)(((v)>>(n))|((v)<<(64-(n)))) typedef unsigned long long W; void encrypt(void*mk,void*p){ W k[4],*x=(W*)p,i,t; for (i=0; i<4; i++) k[i] = ((W*)mk)[i]; for (i=0; i<34; i++) { x[1] = (R(x[1], 8) + x[0]) ^ k[0], x[0] = R(x[0], 61) ^ x[1], k[1] = (R(k[1], 8) + k[0]) ^ i, k[0] = R(k[0], 61) ^ k[1]; t = k[1], k[1] = k[2], k[2] = k[3], k[3] = t; } } #else // // SPECK64-128 // #define WORDLEN 32 #define PRNG_MAX_INT (INT32_MAX + 1) #define ENCRYPT_KEY_LEN 16 #define ENCRYPT_BLOCK_LEN 8 #define R(v,n)(((v)>>(n))|((v)<<(32-(n)))) typedef unsigned int W; void encrypt(void* mk, void* p) { W k[4],*x=(W*)p,i,t; for (i=0; i<4; i++) k[i] = ((W*)mk)[i]; for (i=0; i<27; i++) { x[0] = (R(x[0], 8) + x[1]) ^ k[0], x[1] = R(x[1], 29) ^ x[0], t = k[3], k[3] = (R(k[1], 8) + k[0]) ^ i, k[0] = R(k[0], 29) ^ k[3], k[1] = k[2], k[2]=t; } } #endif W prng_word(void *key, W max) { W r, x[2], ctr = 1, d = ((-max) / max) + 1; if (d == 0) return 0; for (;;) { x[0] = max; x[1] = ctr++; encrypt(key, x); r = x[0] / d; if (r < max) return r; } } void shuffle(void *seed, void *inbuf, size_t inlen) { uint8_t *in = (uint8_t*)inbuf; for (size_t i = inlen - 1; i > 0; i--) { uint32_t j = prng_word(seed, (i + 1)); uint8_t t = in[i]; in[i] = in[j]; in[j] = t; } } void unshuffle(void *seed, void *inbuf, size_t inlen) { uint8_t *in = (uint8_t*)inbuf; for (size_t i = 0; i < inlen; i++) { uint32_t j = prng_word(seed, (i + 1)); uint8_t t = in[i]; in[i] = in[j]; in[j] = t; } }
There are times when elements of the array will remain in the same position after shuffling. This typically happens with small arrays. In that case, something else is required for masking. Now, if you know of a way to fix that, feel free to leave a comment or drop me an email.
Shuffling doesn’t provide any confidentiality for the masked data like encryption does and doesn’t reduce its size like compression does. However, shuffling a large enough array using a secure cipher and secret key to generate a sequence of numbers can probably make it difficult to recover the original data without the key used to initialise the PRNG. That seems helpful in masking data and better than an XOR. But of course, something like this is in no way intended or implied to be a suitable replacement for encryption and shouldn’t be used for any critical information!