Computer memory has almost no structure: it's just a big field of "bits" collected into nice-sized chunks ("bytes", which are 8 bits, are the most well-known, but these collections have varied over time, with some as few as 6 bits, to modern computers having 32 or 64 bits) -- and "addresses" to specify which particular block of bits you are referring to. In terms of "raw data", these bits represent integers -- but not quite integers, because they can't hold negative values. Instead, it is up to us to decide that a particular value -- or several of values, if it suits us -- means "199" or "-32" or "17.125" or "3/4" or "2.5 + 5.7i" or ever "Q" or "v" or "fuschia with 25% transparency" or "440Hz at 37dB" or "seconds since January 1, 1970". And that is the beauty of this: numbers can mean anything!
Ok, so I'm going off on a tangent. I didn't quite mean to wax eloquent this -- but it's also tempting to pull out that paragraph and put it into its own atom: raw Integers. It is because we can encode anything as numbers, after all, that makes cryptography possible!
Now, while there's no structure to any of this at the lowest levels of the computer, one of the first things I learned in my Computer Science class is about data structures, which are made possible when one observes that an integer can represent a block of memory as readily as it can anything else. Thus, with the simple idea of a "Node" and "Data", we can create "arrays" and "vectors" and "lists" and "trees", each with special properties and perils for keeping track of data.
One of these data structures is called a "Hash Table". The idea is simple: for some of these data structures, it can take a long time to find an item -- but we can make it incredibly fast if we took the data we wanted to store, "scrambled" the bits somehow, and then stick it in memory based on this scrambled "hash".
My toy hash table had a simple "scrambling" algorithm: divide the number by 10, and use the remainder as a memory address in an array. Thus, 29 would be stuck in "slot 9" and 77 would be stuck in "slot 7". What happens if we try to stick "17" in there, when "77" is already there? We then have what's called a collision -- and there are different strategies for handling when that happens -- in my case, I used a "list" to store all the possible values at that location, for example. And this is a weakness of hash tables: if we have a lot of data to store, we're more likely to get collisions, which is going to slow the whole thing down.
Hence, my little implementation being merely a toy -- it's only good for a couple dozen data points, in particular -- but it's sufficiently large to get an idea of how these data structures work!
Of course, in cryptography, there's no need to worry about memory locations -- but there is a need to be able to scramble a lot of data and reduce it to a tiny thing! Functions that do that are called cryptographic hashes, and they typically take hundreds or even millions of bytes, and reduce them to a few dozen bytes. These are also sometimes called "one way" encryption schemes -- they encrypt data, but they aren't intended to be reversible -- and what's more, reversibility would even lead to weaknesses in our cryptographic schemes.
In order for a hash function to be cryptographic, it has to satisfy several properties:
- Given two slightly different inputs (eg, two duplicates of a document, but one has the character in position 937 changed from the letter "A" to the letter "B"), the outputs need to be wildly different,
- Collisions between two different documents need to be rare, and
- For a given a hash output, it needs to be nearly impossible to identify or create documents that match that hash output.
The last property isn't just a theoretical concern: a couple of popular algorithms, MD5 and SHA1, have already been compromised! I'm not entirely sure if any attack had been successfully made against systems using these hashes, but they have been replaced with algorithms SHA256 and SHA512 -- which, oddly enough, is just SHA1 repeated 256 or 512 times, respectively.
There are two kinds of cryptographic hashes, which to the best of my knowledge, aren't really "named" -- but I like names, and this is a distinction that's rather important! -- so the first kind I call "digest" hashes, and the second, "password" hashes.
The purpose of the first type of cryptographic hash is to identify documents -- and when you have a document of millions or even billions of bytes you want to identify, you need to do it quickly. And that's just what digest hashes do: they take a big document, and give you a result quickly. These types of hashes can be used in special data structures designed to detect changes quickly; they can be used to identify changes made to a project; they can be used to prove that a document was unchanged -- and because in this use-case, it's far easier to encrypt this little bit with a private key, than it is to encrypt an entire document, it is an important function for "signing" documents. (Because of this, I have been tempted to call these hash functions "signing" hashes, but that confuses the issue: signing is only one role, of several, of these functions, and all these roles are captured by the notion of "digest".)
But there are also situations where being fast can be disastrous! The above conditions are perfect for storing encrypted passwords -- hence, my desire to call them "password" hashes -- and, in particular, these hashes can guard passwords against a data breach, when a password cracker gets a hold of lots of a lot of usernames and passwords all at once -- this is our last line of defense against crackers, because once they are in the hands of evil people, they are at the mercy of all the tricks that can be deployed against them.
In this case, we have a major advantage over the nefarious schemes of a cracker: we don't really care if there's a few seconds of delay in getting our credentials approved -- but when a cracker has millions of password hashes, he's going to want to try as many passwords as he can, as fast as he can, to find the valid ones. In this a day and age of heavy computation power, particularly with GPUs that can run millions of operations in parallel, a cryptographic password hash function needs to be able to throw "wrenches" into these works. The latest algorithm for this purpose, called "scrypt", is not only specifically designed to be slow -- it requires a lot of memory in order to speed it up somewhat! Since GPUs in particular are designed to do millions of small, memory limited functions in parallel, this means that a GPU is just as ineffective at cracking these passwords in parallel as a single fast microprocessor.