Thursday, September 12, 2024

Identity Management Atoms: Randomness

The enemy of cryptographic systems is predictability. If you can find patterns in a bit of encrypted text, those patterns can be used to find other patterns, which can then, eventually, lead to the recovery of the original message.

To illustrate this, consider a popular puzzle found in newspapers: the cryptogram. Granted, this isn't a "serious" encryption scheme -- it is only, after all, the assigning of each letter in the alphabet to a different letter -- but it's also illustrative of why randomness is so important! Here is a cryptogram from one of my puzzle books:
JRCJ BRQPR UOOIU JRO ROQXRJ GS 
CXUMWTQJV QY GYO XOYOWCJQGY GSJOY 
KOPGIOU JRO ROQXRJ GS BQUTGI QY JRO 
YONJ.

This may seem a little intimidating, but if you squint long enough, you might notice "JRCJ" and "JRO" share a "JR" -- and recall that "THAT" is a common word that begins and ends with the same letter -- and guess that "JRCJ" is "THAT" and "JRO" is "THE".  From there, you might look at "ROQXRJ", which we now have letters for:  "H E -- -- H T" -- and wonder if that word is "HEIGHT" -- and from there, wonder whether "QY" might be "IN" or "IS" (it can't be "IT", though, because we already know that "J" is the code for "T"!).  Hopefully it's clear how recognizing these patterns can "break" little bits of this code, which can then be used to "break" other little bits, until the whole thing cascades and unravels completely.  Indeed, while it's challenging, it's approachable enough that this technique is enjoyed as a popular pastime!
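To make that cascade concrete, here's a small Python sketch (my own illustration, not from the puzzle book -- the function and variable names are mine) that applies the first few guesses to the cryptogram above, showing solved letters in lowercase and unsolved cipher letters in uppercase:

```python
CIPHERTEXT = ("JRCJ BRQPR UOOIU JRO ROQXRJ GS "
              "CXUMWTQJV QY GYO XOYOWCJQGY GSJOY "
              "KOPGIOU JRO ROQXRJ GS BQUTGI QY JRO "
              "YONJ.")

def apply_guesses(text, guesses):
    """Replace each cipher letter with its guessed plaintext letter
    (lowercase), leaving unsolved letters as uppercase cipher symbols."""
    return "".join(guesses.get(ch, ch) for ch in text)

# The deductions from the text so far: "JRCJ" = "that", "JRO" = "the".
guesses = {"J": "t", "R": "h", "C": "a", "O": "e"}
print(apply_guesses(CIPHERTEXT, guesses))
```

Each new guess added to the dictionary reveals more partial words, which suggest further guesses -- exactly the cascade described above.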

So, if we want to encrypt something, we need to make it look as random as possible.  It has been proven that the strongest encryption, hands down, is the "One Time Pad" -- a pair of notepads filled with duplicate copies of random numbers, shared with the person you wish to communicate with.  If you wish to send a message, you take a page from your pad, use it to encrypt your message, and send it to your friend, who uses the same page to decrypt the message.  Really simple!  And it has been proven mathematically to be absolutely unbreakable, if it's used right.
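A One Time Pad is easy to sketch in a few lines of Python.  Here the "page" is a run of random bytes, and encryption is XOR -- the usual way to do it with bits, though any reversible combination (like modular addition of the pad's numbers) works just as well:

```python
import secrets

def otp(message, pad):
    """XOR each message byte with the corresponding pad byte.
    The same operation both encrypts and decrypts."""
    return bytes(m ^ p for m, p in zip(message, pad))

message = b"MEET AT DAWN"
pad = secrets.token_bytes(len(message))  # one "page" of random bytes

ciphertext = otp(message, pad)
recovered = otp(ciphertext, pad)  # XOR with the same pad undoes itself
assert recovered == message
```

Because the pad bytes are uniformly random and used once, the ciphertext is equally likely to decrypt to *any* message of that length -- which is the intuition behind the unbreakability proof.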

Why don't we use this "one time pad" for all our communication needs, then?  There are several reasons:

  1. It can be difficult to get a One Time Pad to your friend, and thus, he'll have a difficult time reading your messages;
  2. You have to make sure you and your friend are literally on the "same page" for the messages;
  3. If you mess up, or get lazy, and use the same page for two messages, both of the encrypted messages can be combined to reveal both messages -- perhaps the most dramatic illustration of this I have seen uses two images encrypted with a One Time Pad to show just how both messages become apparent when combined;
  4. A One Time Pad has a limited number of pages -- and if you run out, you'll be tempted to reuse pages, or you will have to get another pad to your friend;
  5. If you want to communicate with another friend, you have to have another pair of One Time Pads, and this doesn't scale very well!
Modern cryptographic techniques are designed to get around these problems -- at the cost of being "less secure" than this gold standard -- but the methods are complex enough that they make pattern recovery pretty much impossible ... and yet much of the security of these methods still hinges on the ability to create random numbers!
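Point 3 above is easy to demonstrate in code: if the same pad page encrypts two messages, XORing the two ciphertexts makes the pad cancel out entirely, leaving the XOR of the two plaintexts -- with no randomness left at all.  A quick sketch (the messages are my own examples):

```python
import secrets

def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))

m1 = b"ATTACK AT DAWN"
m2 = b"RETREAT AT TEN"
pad = secrets.token_bytes(len(m1))  # the same page, reused: the mistake!

c1 = xor_bytes(m1, pad)
c2 = xor_bytes(m2, pad)

# The pad cancels out: c1 XOR c2 equals m1 XOR m2, which an attacker
# can pick apart with the pattern-finding tricks from the cryptogram.
assert xor_bytes(c1, c2) == xor_bytes(m1, m2)
```

This is the same effect as those dramatic two-image demonstrations: combine the two "encrypted" images and both pictures appear.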

Which raises a couple of questions:  How are random numbers generated, anyway?  And how do we know if something is "sufficiently random"?  While I said in the introduction that I'm not going to discuss the algorithms behind these "atoms" of cryptography, I'm going to make randomness somewhat of an exception, because it's at the core of everything!  And it's good to see how difficult it can be to be truly random.

I'll answer the second question first.  "Randomness" has a simple statistical definition:  if you have a list, or perhaps a way of generating a list, of allegedly random numbers, and you cannot reliably guess what the next number will be, the list of numbers is considered random.  Considering that statistics is the mathematical art of finding patterns in data, essentially this definition boils down to "we can't find any patterns, no matter how hard we try!"
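Here's a toy illustration of that definition (my own toy, far weaker than real statistical test suites): a simple "pattern hunter" that tries to guess each bit from the bits before it.  On patterned data it quickly wins; on (pseudo)random data it can't beat a coin flip:

```python
import random

def markov_accuracy(bits):
    """Order-1 predictor: guess the bit that has most often followed
    the previous bit so far.  Near 0.5 means 'no pattern found'."""
    counts = {0: [0, 0], 1: [0, 0]}  # counts[previous_bit][next_bit]
    correct = 0
    for i in range(1, len(bits)):
        prev, b = bits[i - 1], bits[i]
        guess = 1 if counts[prev][1] > counts[prev][0] else 0
        correct += (guess == b)
        counts[prev][b] += 1
    return correct / (len(bits) - 1)

rng = random.Random(0)  # seeded so the run is reproducible
random_bits = [rng.randint(0, 1) for _ in range(10_000)]
patterned_bits = [i % 2 for i in range(10_000)]  # 0, 1, 0, 1, ...

print(markov_accuracy(random_bits))     # hovers near 0.5
print(markov_accuracy(patterned_bits))  # nearly 1.0 -- pattern found!
```

Real randomness test batteries run dozens of far more sophisticated pattern hunters -- and a sequence only counts as random if *all* of them come up empty.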

The first random number "generator" was a simple polynomial, applied over and over, using the previous output as the "seed" for the next number -- but after a limited number of steps, the sequence would circle back on itself and repeat!  The mathematician who suggested this wasn't entirely serious about it -- he thought that the very notion of getting randomness out of a deterministic machine like a computer was particularly funny -- but it was a relatively simple way to provide "randomness" for simple games.
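One classic early scheme along these lines is the "middle-square" method: square the previous number (a polynomial step) and keep the middle digits as the next one.  A few lines of Python show how quickly it runs into a repeating value (the seed and digit count here are arbitrary choices of mine):

```python
def middle_square(x, digits=4):
    """One middle-square step: square x, pad to 2*digits digits,
    and return the middle digits as the next 'random' number."""
    s = str(x * x).zfill(2 * digits)
    start = (len(s) - digits) // 2
    return int(s[start:start + digits])

seen = []
x = 1234
while x not in seen:  # iterate until some value repeats
    seen.append(x)
    x = middle_square(x)
print(f"fell into a cycle after {len(seen)} distinct values")
```

With only four digits there are at most 10,000 possible states, so a repeat is guaranteed -- and in practice the sequence degenerates much sooner than that.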

Naturally, if we assume we cannot calculate random numbers, it seems reasonable to turn to nature itself for randomness.  These strategies have included watching a bit of radioactive material decay over time and watching a wall of lava lamps.  It might be tempting to object that this is just as bad as a computer, because theoretically the universe is "deterministic" -- but while quantum mechanics itself throws a wrench into this notion, it doesn't really matter whether these things are deterministic, because our ability to capture the "initial conditions" to the degree needed to make them predictable is woefully inadequate -- and chaos theory even suggests that it always will be.

Nonetheless, these methods have their own gotchas that can make them impractical.  For one thing, it's necessary to be very careful with sampling!  If we attempt to sample the events at too fast or too slow a rate, we can introduce extra "1"s and "0"s into the data that eat away at the randomness we're trying to capture.  (There are strategies for fixing this that I won't review here.)  Also, it's kind of hard to keep a wall of lava lamps in your laptop or pocket computer, so it's necessary to get these bits from the internet -- and if other people are getting these same bits at the same time, that can eat away at the randomness too.

Nowadays, we use a combination of the two approaches:  cryptographic functions like "hashes" and "symmetric ciphers" are designed to scramble data so thoroughly that their output is indistinguishable from random noise, and thus have made the notion of a "cryptographic pseudo-random number generator" possible, contrary to our original intuition!  But to keep the output unpredictable, these functions need a "seed" value to start with ... and that seed is provided by the random fluctuations of things like the start times and keyboard inputs of the computers themselves.
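As a sketch of the idea -- emphatically a toy, not a vetted design like the standardized DRBGs -- here's a generator that stretches a seed into an arbitrarily long stream by hashing the seed together with a counter:

```python
import hashlib

def hash_drbg(seed, n_bytes):
    """Toy hash-based generator: repeatedly hash seed||counter and
    concatenate the digests.  Unpredictable only if the seed is secret
    and random -- which is why seeding matters so much."""
    out = b""
    counter = 0
    while len(out) < n_bytes:
        out += hashlib.sha256(seed + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:n_bytes]

# The seed stands in for entropy gathered from boot times, keystrokes, etc.
stream = hash_drbg(b"boot time + keystroke timings", 64)
```

The same seed always yields the same stream -- the "pseudo" in pseudo-random -- but without the seed, predicting the stream means breaking the hash.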

Monday, September 2, 2024

An Interesting Insight into the Notion of Calculus

Yesterday, I had an interesting insight! An important notion of "calculus" is the notion of symbol -- and that the "calculus" of identity management would benefit from a "symbolic" approach. Before I explain why this is an interesting insight, I will have to go through a little digression.

Several months ago, I came across a math book in a used bookstore that had an interesting claim in its introduction: that mathematicians over the years have distanced themselves from the notion of "variable" -- that the notion is rather confusing, and they have come to think about things in different ways. My reaction at the time was "well, that is weird! I'm a mathematician, and I have never heard of this controversy". I put the book back on the shelf, even though I really wanted to get it for the introduction alone -- I couldn't justify, at the time, even that tiny expense, especially since I have so little space for yet another book.

Bah! I should have bought that book!

The notion wormed its way into my head as I wondered "what would I use, if not variables?" -- and then it hit me: "variables" imply too much -- the word implies that what you are working with can be expected to change -- so the notion you really need is symbols.

A "symbol" can be anything -- it can be a word, a letter, a wedge or dot or dash above a letter, a picture, a collection of lines and wedges pressed into clay, even a physical object sitting in front of you. Anything can be a symbol, and any given symbol can represent anything.
All you need to do, when you want to talk about a particular idea, is to choose a symbol, describe what you want that symbol to mean, and then use the symbol for that purpose.
Symbols are the foundation of civilization. When sequences of sounds are assigned "meaning", they make spoken language possible. When they are used as hieroglyphics for words or letters for sounds, they make writing possible. Those same words can capture abstract ideas, and whether they are manipulated by sentences or by letters, or combined with dots, circles, and lines, they become the foundations for exploring mathematical ideas. And they are one of the features that makes Common Lisp the powerful computer language it is -- you can create whatever symbol you need for the task at hand, and even create "hidden" symbols, used once for a very specific purpose, so that no one can "clobber" that symbol by accidentally using it.

If you think Algebra is challenging, try doing it without letters! There is a common notion that the devil introduced letters into math, but compare this: "consider a function that takes a number, multiplies it by itself, adds three of itself to that, and adds five" to this: "f(x) = x^2 + 3x + 5". In the latter statement, "f" is commonly recognized as a function, "x" as a number, "^2" as "multiply the number by itself", and so on -- and because of its compact form, it's far easier to work with! (Indeed, consider the "Cubic Formula" -- the equivalent of the "Quadratic Formula" used to find zeros of something like "f(x)" above. It's a big complicated jumble of cube roots and square roots, requiring the use of "imaginary" numbers to get to "real" number solutions -- solutions which may very well be plain vanilla integers disguised as a mix of square and cube roots that can be simplified in non-obvious ways. When one considers all this, what is amazing is that it was all done in words, well before letters and other symbols were introduced to mathematics!)

So, what does all this have to do with Identity Management? It has occurred to me that discussions of the various components of Identity Management can be simplified by treating them symbolically -- making it a lot easier to see what's going on than the way these things are typically handled in Cryptography texts.

Consider an example (putting aside that I haven't really discussed what each of these things mean!): Let M be a message, H be a hash function, and (Af, Ap) be Alice's Public-Private Key Pair -- I call the key "A" for "Alice", while the "f" means "free" and the "p" means "private" (it kind of drives me nuts that "public" and "private" start with the same letter; I may want to come up with a better way to designate the parts of the key) -- with these things in mind, how would I sign a message, and send it to Bob, who has a Public-Private key pair (Bf, Bp)?

First, Alice would need to sign the message, by obtaining the hash H(M) of the message, signing the hash with the private key, Ap(H(M)), and then appending the result to the message itself: [M|Ap(H(M))]. (Since this is an "exploration" on my part, I'm going to leave the parentheses in for now, but it might be easier to see what's going on by shortening all this to [M|ApHM].) Now, to send this to Bob, she'll want to encrypt it with Bob's public key: Bf([M|Ap(H(M))]). She then sends this message to Bob.

So, how does Bob read all this? He applies his private key to the message: Bp(Bf([M|Ap(H(M))])) -- and since Bp and Bf cancel each other out, he has the message [M|Ap(H(M))]. To confirm that the message really came from Alice, he then takes the hash of the message H(M), and "decrypts" the signature from Alice with her public key: Af(Ap(H(M))) -- the Af and Ap cancel each other out, so if the H(M) that Bob calculated matches the H(M) that Alice "signed", the message indeed came from Alice!
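The signing half of this exchange can be acted out in a few lines of Python, using textbook RSA with tiny numbers as a stand-in for the key pair -- a toy for tracing the symbols, not real cryptography, which needs proper padding and key sizes. The names Af, Ap, and H follow the notation above:

```python
import hashlib

# Toy RSA key pair: n = 61 * 53 = 3233, and e * d = 17 * 413 = 7021,
# which is 1 mod lcm(60, 52) = 780 -- so Af and Ap undo each other.
n, e, d = 3233, 17, 413
Af = lambda x: pow(x, e, n)  # Alice's public ("free") key
Ap = lambda x: pow(x, d, n)  # Alice's private key

def H(message):
    """Hash, reduced mod n so it fits in the toy key space."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % n

M = b"hello Bob"
signature = Ap(H(M))     # Alice signs the hash: Ap(H(M))
signed = (M, signature)  # the pair [M | Ap(H(M))]

# Bob's check: "decrypt" the signature with Alice's public key and
# compare it to the hash he computes himself -- Af and Ap cancel.
M_received, sig = signed
assert Af(sig) == H(M_received)
```

I've left out the Bf/Bp encryption layer for transport to Bob, but it's the very same cancellation trick, applied to the whole signed bundle.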

Ah, so the approach has promise! (And I'll definitely want to drop the parentheses!) Perhaps this approach has already been done -- I may be duplicating other people's efforts while creating an equivalent notation that no one else uses -- but if that's indeed the case, it means I'm on the right track! Having said that, while I'm not an expert in this field, I have read a few documents describing these principles over the years, and while I remember a lot of pretty pictures, I don't recall anyone doing this kind of thing.

And this approach will go a long way to justifying the use of the term "calculus" to describe the field!

And this may also mean I'll want to figure out how to display mathematical notation in Blogger ....