A virtual teacher who reveals to you the great secrets of Base64

What is Base64?

Base64 is an encoding algorithm that allows you to transform any data into text written in an alphabet that consists of Latin letters, digits, plus, and slash. Thanks to it, you can convert Chinese characters, emoji, and even images into a “readable” string, which can be saved or transferred anywhere.
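
As a minimal sketch of this idea, the snippet below uses Python's standard `base64` module to turn a short text into a Base64 string and back. The sample text and the choice of UTF-8 are assumptions made for illustration; Base64 itself only sees bytes.

```python
import base64
import string

# Hypothetical sample text mixing Latin letters, Chinese characters, and an emoji.
text = "Hello, 世界 🙂"

# Base64 works on bytes, so the text is first turned into bytes.
# UTF-8 is an assumption here; both sides only need to agree on the encoding.
raw = text.encode("utf-8")
encoded = base64.b64encode(raw).decode("ascii")

# Every character of the result belongs to the Base64 alphabet:
# Latin letters, digits, "+", "/", plus "=" which is used for padding.
ALPHABET = set(string.ascii_letters + string.digits + "+/=")
assert set(encoded) <= ALPHABET

# Decoding restores the original text exactly.
decoded = base64.b64decode(encoded).decode("utf-8")
assert decoded == text

print(encoded)
```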

To understand figuratively why Base64 was invented, imagine that during a phone call Alice wants to send an image to Bob. The first problem is that she cannot simply describe how the image looks, because Bob needs an exact copy. Instead, Alice can convert the image to binary and dictate the binary digits (bits) to Bob, who can then convert them back into the original image. The second problem is that phone calls are expensive, and dictating each byte as 8 binary digits would take too long. To reduce costs, Alice and Bob agree on a more efficient method: a special alphabet that replaces every six binary digits with one “letter”.

To see the difference, look at a 1x1 image converted to binary digits:

010001 110100 100101 000110 001110 000011 011101 100001 000000 010000 000000 000001 000000 001111 000000 000000 000000 001111 111100 000000 000000 000000 000000 000000 000000 000010 110000 000000 000000 000000 000000 000000 000000 010000 000000 000001 000000 000000 000000 000010 000000 100100 010000 000001 000000 000011 101100

The same image converted to Base64 looks like this:

R0lGODdhAQABAPAAAP8AAAAAACwAAAAAAQABAAACAkQBADs

I think the difference is obvious. Even if you remove the spaces or the padding zeros from the binary digits, the Base64 string will still be shorter. I grouped the bits only to show that each group corresponds to one character of the Base64 string.
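
The correspondence between six-bit groups and characters can be sketched in Python. This is an illustrative re-implementation of the lookup step, not how a production encoder is written; the names `ALPHABET`, `groups`, and `letters` are chosen for this example only.

```python
import base64

# The 47-character string from the example above. One "=" is appended because
# the standard decoder expects the length to be a multiple of four.
encoded = "R0lGODdhAQABAPAAAP8AAAAAACwAAAAAAQABAAACAkQBADs"
data = base64.b64decode(encoded + "=")

# The standard Base64 alphabet: the position of a character is its 6-bit value.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

# Write every byte as 8 binary digits, then pad with zeros on the right
# so that the total number of bits is a multiple of six.
bits = "".join(f"{byte:08b}" for byte in data)
bits += "0" * (-len(bits) % 6)

# Cut the bit string into 6-bit groups and look each group up in the alphabet.
groups = [bits[i:i + 6] for i in range(0, len(bits), 6)]
letters = "".join(ALPHABET[int(group, 2)] for group in groups)

print(" ".join(groups[:4]))   # the groups behind the first four characters
print(letters == encoded)     # the manual lookup reproduces the Base64 string
```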

Of course, the story about Alice and Bob is just a made-up example to show what kind of problem the Base64 algorithm solves. In fact, it is a binary-to-text encoding whose task is to encode binary data as printable characters when the data transmission channel or the storage medium cannot handle 8-bit character encodings.

History

The history of Base64 started long ago, at a time when engineers were still arguing about how many bits a byte should have. Today we use eight-bit bytes, but seven-bit, six-bit, and even three-bit bytes were used before that. By the time eight-bit encoding was approved as a standard, many systems still used older encodings and did not support the “new standard”. As a result, some data was simply lost in transfer between new and old systems. For example, a mail server could discard the eighth bit of every byte when sending emails. Moreover, mail servers had another problem: they could only send text, not binary data (such as images, video, or archives). And so, as if by magic, clever minds developed an algorithm to solve these problems. Of course, other binary-to-text encodings were developed over time, but thanks to its simplicity, efficiency, and portability, Base64 became the most popular and is now used almost everywhere.

The algorithm was first described in 1987, in the document that specifies the PEM protocol (if you are interested in the details, see RFC 989, § 4.3). Since then, the algorithm has evolved, giving rise to new standards that are actively used throughout the world of IT.

Naming

Initially, the algorithm was called “printable encoding”, and only a few years later, in June 1992, did RFC 1341 define it as “Base64”. Since the algorithm uses 64 basic characters, it was not difficult to give it a name (especially since Base85 already existed). Therefore, I think you will have no trouble guessing what the names of algorithms such as Base16, Base32, Base36, Base58, Base91, or Base122 mean.

Size

During encoding, the Base64 algorithm replaces every three bytes with four characters and, if necessary, adds padding characters so that the length of the result is always a multiple of four. Simply put, the result is always about 33% larger than the original data (more exactly, it is 4/3 of the original size, rounded up to a multiple of four). The formula for calculating the length of the result string without padding is n * 4 / 3, rounded up to the nearest integer, where n is the length of the original data in bytes.
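
The size rules can be checked with a short Python sketch. The helper names `padded_length` and `unpadded_length` are hypothetical and exist only in this example; integer arithmetic is used to express "rounded up".

```python
import base64

def padded_length(n: int) -> int:
    """Length of the Base64 result with "=" padding for n input bytes."""
    return 4 * ((n + 2) // 3)      # 4 characters per started group of 3 bytes

def unpadded_length(n: int) -> int:
    """Length of the Base64 result without padding for n input bytes."""
    return (4 * n + 2) // 3        # n * 4 / 3, rounded up

# Compare both formulas with the real encoder for inputs of 0 to 49 bytes.
for n in range(50):
    encoded = base64.b64encode(bytes(n))   # n zero bytes as sample data
    assert len(encoded) == padded_length(n)
    assert len(encoded.rstrip(b"=")) == unpadded_length(n)

# A 35-byte input needs 47 characters without padding and 48 with padding.
print(unpadded_length(35), padded_length(35))
```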

Usage

Base64 is most commonly used to encode binary data (for example, images or sound files) for embedding into HTML, CSS, EML, and other text documents. In addition, Base64 is used to encode data that might otherwise be unsupported or damaged during transfer, storage, or output. Here are some of the applications of the algorithm:

  • Attach files when sending emails
  • Embed images in HTML or CSS via data URI
  • Preserve raw bytes of cryptographic functions
  • Output binary data as XML or JSON in API responses
  • Save binary files to a database when BLOB is unavailable
  • Hide secrets from prying eyes (really a very bad idea)
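
Two of the applications above, data URIs and JSON responses, can be sketched in Python as follows. The helper `to_data_uri`, the file name `pixel.gif`, and the JSON field names are hypothetical and serve only as an illustration; the sample bytes are the GIF image from the beginning of the article.

```python
import base64
import json

def to_data_uri(data: bytes, mime_type: str) -> str:
    """Build a data URI that embeds the given bytes as Base64."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{payload}"

# Sample binary data: the small GIF image used earlier in the article.
gif_bytes = base64.b64decode("R0lGODdhAQABAPAAAP8AAAAAACwAAAAAAQABAAACAkQBADs=")

# Embed the image directly in HTML, without a separate file.
data_uri = to_data_uri(gif_bytes, "image/gif")
img_tag = f'<img src="{data_uri}" alt="tiny GIF">'
print(img_tag)

# JSON has no binary type, so the bytes travel as a Base64 string.
api_response = json.dumps({
    "filename": "pixel.gif",
    "content": base64.b64encode(gif_bytes).decode("ascii"),
})

# The receiver decodes the field to get the exact original bytes back.
restored = base64.b64decode(json.loads(api_response)["content"])
assert restored == gif_bytes
```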

Security

Base64 is not an encryption algorithm, and under no circumstances should it be used to “hash” passwords or “encrypt” sensitive data, because it is reversible and the encoded data can be easily decoded by anyone. In a security context, Base64 is only appropriate for encoding the raw result of a cryptographic function (for example, a hash digest or a ciphertext) as text.

Roughly speaking, in terms of information security, Base64 is just a foreign language that some people do not understand. Nevertheless, even they can understand the meaning of the encoded message simply by using an online translator, which instantly returns the original message.
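
A short Python sketch makes the point concrete: decoding requires no key or password, so anyone who sees the string can read it. The sample secret and message are made up for this example, and the SHA-256 digest only illustrates encoding the raw output of a cryptographic function as text; it is not a recommendation for how to store passwords.

```python
import base64
import hashlib

# A hypothetical secret that someone tried to "hide" with Base64.
secret = "correct horse battery staple"
hidden = base64.b64encode(secret.encode("utf-8")).decode("ascii")

# No key is involved: a single call brings the secret back.
recovered = base64.b64decode(hidden).decode("utf-8")
assert recovered == secret

# A legitimate use: turning the raw bytes of a digest into printable text.
message = b"any message"                      # made-up sample input
digest = hashlib.sha256(message).digest()     # 32 raw bytes
digest_text = base64.b64encode(digest).decode("ascii")

print(hidden)
print(digest_text)
```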

Comments (2)

  • Alan,
    Thanks for a great explanatory article. This is something I've used by feel more than understanding, and it's nice to fill in the blanks in my knowledge. The only thing I'd add is under usage. Your API responses example touches on this at a high level, but I often find it useful for sanitizing string values that can include special characters ({}, <>, ', ;, newline, etc.) without using language specific methods to qualify strings.
    • Administrator,
      Hello Alan,
      Thank you for your comment. I'm glad you like this article.

      As for using Base64 to sanitize strings, this is a known practice, but since it has several drawbacks it should be used wisely.