Computers encode letters, digits, punctuation, and control characters as binary numbers. The most common encoding is ASCII (American Standard Code for Information Interchange), a 7-bit code covering 128 distinct characters.

ASCII only uses 7 bits, but characters get stored in 8-bit bytes: the ASCII code takes the low-order seven bits and the high bit is 0 (or used for parity in some systems).
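As a rough illustration of the byte layout described above, the sketch below packs a 7-bit ASCII code into an 8-bit byte and optionally fills the high bit with an even-parity bit. The helper name `with_even_parity` is made up for this example, and even parity is just one convention. Real systems that used the high bit for parity varied.

```python
def with_even_parity(code: int) -> int:
    """Return an 8-bit value: the 7-bit ASCII code plus an even-parity bit in bit 7.

    Hypothetical helper for illustration; not taken from any particular system.
    """
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")
    parity_bit = ones % 2            # 1 when the code has an odd number of 1 bits
    return (parity_bit << 7) | code


# Plain storage: every ASCII byte has its high bit clear.
plain = "Hello".encode("ascii")
print(all(b < 0x80 for b in plain))                 # True

# 'A' = 0b1000001 has two 1 bits, so the parity bit stays 0.
print(format(with_even_parity(ord("A")), "08b"))    # 01000001
# 'C' = 0b1000011 has three 1 bits, so the parity bit is set.
print(format(with_even_parity(ord("C")), "08b"))    # 11000011
```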

Why ASCII

Two properties make ASCII nice to work with.

  1. Sequential ordering: alphabetic characters (A–Z, a–z) and numeric characters (0–9) get codes in increasing order. Sort text by treating ASCII codes as unsigned binary numbers and you get alphabetical/numerical order for free, as long as the letters share a case (every uppercase letter sorts before every lowercase one).

  2. BCD embedded in the digit codes: the low-order four bits of each digit character’s ASCII code are exactly the binary-coded decimal of that digit. For example, '5' has ASCII code 0011 0101 (0x35): high nibble 0011 (digit-character marker), low nibble 0101 (5 in BCD).

So converting an ASCII digit character to its numeric value is just a mask that keeps the low 4 bits (c & 0x0F), and going the other way ORs the 0x30 digit marker back in (d | 0x30).
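Both properties can be sketched in a few lines of Python. The function names `digit_value` and `digit_char` are invented for this example. The sort uses the encoded bytes as the key, since Python compares `bytes` objects element by element as unsigned integers.

```python
def digit_value(ch: str) -> int:
    """Convert an ASCII digit character to its numeric value by keeping the low nibble."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:
        raise ValueError("not an ASCII digit")
    return code & 0x0F


def digit_char(value: int) -> str:
    """Convert a value 0..9 to its ASCII digit character by setting the 0x30 marker."""
    if not 0 <= value <= 9:
        raise ValueError("not a single decimal digit")
    return chr(0x30 | value)


# Property 1: sorting by raw code values gives alphabetical order within one case.
words = ["pear", "apple", "fig"]
print(sorted(words, key=lambda w: w.encode("ascii")))   # ['apple', 'fig', 'pear']

# Mixed case: uppercase codes (65-90) come before lowercase codes (97-122).
mixed = ["banana", "Cherry", "apple"]
print(sorted(mixed, key=lambda w: w.encode("ascii")))   # ['Cherry', 'apple', 'banana']

# Property 2: the low nibble of a digit character is its value.
print(digit_value("5"))   # 5
print(digit_char(7))      # 7
```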

Standard ASCII range

Code (decimal)  Range               Purpose
0–31            Control characters  Null, tab, line feed, carriage return, escape, etc.
32              Space               Whitespace
33–47           ! " # ... /         Punctuation
48–57           0–9                 Digit characters
58–64           : ; ... @           More punctuation
65–90           A–Z                 Uppercase letters
91–96           [ \ ] ^ _ `         Punctuation (95 = _, 96 = backtick)
97–122          a–z                 Lowercase letters
123–127         { | } ~ DEL         Punctuation (127 = DEL, a control character)
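The ranges in the table translate directly into a small classifier. This is a minimal sketch. The function name `ascii_class` and the category labels are choices made for this example, and DEL (127) is grouped with the control characters.

```python
def ascii_class(code: int) -> str:
    """Classify a 7-bit ASCII code using the ranges from the table."""
    if not 0 <= code <= 127:
        raise ValueError("outside the 7-bit ASCII range")
    if code <= 31 or code == 127:
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"      # 33-47, 58-64, 91-96, 123-126


for ch in ["\t", " ", "/", "7", "@", "Q", "`", "q", "~"]:
    print(ord(ch), ascii_class(ord(ch)))
```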

The gap between uppercase and lowercase is exactly 32 (0x20, a single bit), so for ASCII letters tolower(c) = c | 0x20 and toupper(c) = c & ~0x20. Handy bit twiddle in low-level code.
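A sketch of the bit trick, with a range check added because the identity only holds for letters. The names `ascii_lower` and `ascii_upper` are made up here; real code would normally call the language's own case-conversion routines.

```python
def ascii_lower(c: str) -> str:
    """Lowercase an ASCII uppercase letter by setting bit 5; leave anything else alone."""
    code = ord(c)
    return chr(code | 0x20) if 0x41 <= code <= 0x5A else c


def ascii_upper(c: str) -> str:
    """Uppercase an ASCII lowercase letter by clearing bit 5; leave anything else alone."""
    code = ord(c)
    return chr(code & ~0x20) if 0x61 <= code <= 0x7A else c


print(ascii_lower("G"), ascii_upper("g"))   # g G
print(ord("a") - ord("A"))                  # 32

# Without the range check the trick corrupts non-letters:
print(chr(ord("@") | 0x20))                 # ` (0x40 becomes 0x60)
print(chr(ord("[") | 0x20))                 # { (0x5B becomes 0x7B)
```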

Beyond ASCII

ASCII covers English fine but can’t handle international text. The extensions and replacements:

  • Extended ASCII / ISO 8859: uses the high bit for another 128 characters (accents, currency symbols, etc.), with many regional variants.
  • Unicode: assigns a code point to the characters of essentially every writing system in use, in the range U+0000 to U+10FFFF (1,114,112 values, fitting in 21 bits).
  • UTF-8: variable-length encoding of Unicode, backward-compatible with ASCII (ASCII characters use 1 byte, others 2–4). Dominant on the web.
  • UTF-16: 16-bit code units, with code points above U+FFFF taking two units (a surrogate pair); what Java and Windows use internally for strings.
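To make the size differences between these encodings concrete, the sketch below measures a few sample characters using Python's built-in codecs. The helper `unit_counts` is a name invented for this example, and the characters are written as escapes so the snippet does not depend on the file's own encoding.

```python
def unit_counts(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-be")) // 2   # 2 bytes per 16-bit code unit
    return utf8_bytes, utf16_units


samples = {
    "A": "A",                    # U+0041, plain ASCII
    "e-acute": "\u00e9",         # U+00E9, also in ISO 8859-1
    "euro sign": "\u20ac",       # U+20AC
    "emoji": "\U0001F600",       # U+1F600, above U+FFFF
}

for name, ch in samples.items():
    utf8_bytes, utf16_units = unit_counts(ch)
    print(f"{name:10} U+{ord(ch):04X}  UTF-8: {utf8_bytes} byte(s)  UTF-16: {utf16_units} unit(s)")

# ASCII text encodes to identical bytes under ASCII and UTF-8.
print("plain text".encode("ascii") == "plain text".encode("utf-8"))   # True

# ISO 8859-1 (Latin-1) stores U+00E9 as one byte with the high bit set.
print("\u00e9".encode("latin-1"))                                       # b'\xe9'
```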

New code almost always uses UTF-8, with ASCII falling out as a special case.

Encoding is invisible when everything works and a nightmare when it doesn’t. Read a UTF-8 file as Latin-1 and you get “mojibake”, characters interpreted by the wrong code page showing up as gibberish.
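The Latin-1 case is easy to reproduce with Python's built-in codecs. This is only a sketch of one common failure. The round-trip repair at the end works because Latin-1 maps every byte value to a character, so nothing is lost; other wrong-codec combinations can destroy data.

```python
original = "caf\u00e9"                    # "café", written with an escape for portability

raw = original.encode("utf-8")            # the bytes as stored in a UTF-8 file
print(raw)                                # b'caf\xc3\xa9'

# Wrong: decode the UTF-8 bytes as Latin-1. Each byte becomes its own character.
garbled = raw.decode("latin-1")
print(len(original), len(garbled))        # 4 5

# The two bytes of U+00E9 show up as U+00C3 and U+00A9.
print([f"U+{ord(c):04X}" for c in garbled[3:]])   # ['U+00C3', 'U+00A9']

# Undo the mistake by reversing the wrong step, then decoding correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)               # True
```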