Computers encode letters, digits, punctuation, and control characters as binary numbers. The classic encoding, and the base that most later encodings build on, is ASCII (American Standard Code for Information Interchange), a 7-bit code covering 128 distinct characters.
ASCII only uses 7 bits, but characters get stored in 8-bit bytes: the ASCII code takes the low-order seven bits and the high bit is 0 (or used for parity in some systems).
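As a rough illustration of how a 7-bit code sits inside an 8-bit byte, here is a minimal Python sketch. The helper names `is_ascii_byte` and `with_even_parity` are hypothetical, and even parity is only one of the schemes a system might have used for the high bit.

```python
def is_ascii_byte(b: int) -> bool:
    """True if the byte's high bit is 0, i.e. it holds a plain 7-bit ASCII code."""
    return 0 <= b <= 0xFF and (b & 0x80) == 0


def with_even_parity(code: int) -> int:
    """Store a 7-bit code in a byte, setting the high bit so the number of 1 bits is even."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    parity_bit = bin(code).count("1") & 1  # 1 when the code has an odd number of 1 bits
    return (parity_bit << 7) | code


print(is_ascii_byte(ord("A")))           # True
print(hex(with_even_parity(ord("A"))))   # 0x41 (0b1000001 already has two 1 bits)
print(hex(with_even_parity(ord("C"))))   # 0xc3 (0b1000011 has three, so the high bit is set)
```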
Why ASCII
Two properties make ASCII nice to work with.
- Sequential ordering: alphabetic characters (A–Z, a–z) and numeric characters (0–9) get codes in increasing order. Sort text by treating ASCII codes as unsigned binary numbers and you get numerical order for digits and alphabetical order within each case for free (all uppercase letters sort before all lowercase ones).
- BCD embedded in the digit codes: the low-order four bits of each digit character’s ASCII code are exactly the binary-coded decimal of that digit. For example, `'5'` has ASCII code 0x35 = 0011 0101: high nibble 0011 (digit-character marker), low nibble 0101 (5 in BCD).
So converting an ASCII digit character to its numeric value is just a mask that keeps the low 4 bits (c & 0x0F), and going the other way is an OR with 0x30.
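Both properties can be checked in a few lines of Python. This is a sketch, not a library API: `digit_value` and `digit_char` are hypothetical helper names, and the word list is made up for illustration.

```python
def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character: keep only the low nibble."""
    if len(ch) != 1 or not "0" <= ch <= "9":
        raise ValueError(f"not an ASCII digit: {ch!r}")
    return ord(ch) & 0x0F


def digit_char(value: int) -> str:
    """ASCII digit character for a value 0-9: OR in the high nibble 0x30."""
    if not 0 <= value <= 9:
        raise ValueError(f"not a single decimal digit: {value}")
    return chr(0x30 | value)


print(format(ord("5"), "08b"))  # 00110101
print(digit_value("5"))         # 5
print(digit_char(7))            # 7

# Sorting by raw code values: digits first, then uppercase, then lowercase.
words = ["pear", "Zebra", "apple", "42"]
print(sorted(words, key=lambda w: w.encode("ascii")))
# ['42', 'Zebra', 'apple', 'pear']
```

Note that "Zebra" lands before "apple": byte order is alphabetical only within one case.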
Standard ASCII range
| Code (decimal) | Characters | Purpose |
|---|---|---|
| 0–31 | Control characters | Null, tab, line feed, carriage return, escape, etc. |
| 32 | Space | Whitespace |
| 33–47 | `! " # ... /` | Punctuation |
| 48–57 | 0–9 | Digit characters |
| 58–64 | `: ; ... @` | More punctuation |
| 65–90 | A–Z | Uppercase letters |
| 91–96 | `[ \ ] ^ _` and `` ` `` | Punctuation (95 = `_`, 96 = backtick) |
| 97–122 | a–z | Lowercase letters |
| 123–127 | `{ \| } ~` and DEL | Punctuation (123–126); DEL (127) is a control character |
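
The ranges in the table translate directly into a lookup by code value. The function below is a minimal sketch; the name `classify` and the category labels are hypothetical choices for illustration.

```python
def classify(code: int) -> str:
    """Return the table category for a 7-bit ASCII code (0-127)."""
    if not 0 <= code <= 127:
        raise ValueError(f"not a 7-bit ASCII code: {code}")
    if code < 32 or code == 127:
        return "control"       # 0-31 plus DEL
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"       # everything else in 33-126


for ch in ["\t", " ", "/", "7", "Q", "`", "q", "~", "\x7f"]:
    print(ord(ch), classify(ord(ch)))
```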
The gap between uppercase and lowercase is exactly 32 (0x20, a single bit), so for letters tolower(c) = c | 0x20 and toupper(c) = c & ~0x20. It is a handy bit twiddle in low-level code, but only valid when c is already known to be an ASCII letter.
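A short Python sketch of the case trick, with the letter check made explicit. The names `ascii_lower`, `ascii_upper`, and `CASE_BIT` are hypothetical; each function expects a single character.

```python
CASE_BIT = 0x20  # the one bit that separates 'A' (0x41) from 'a' (0x61)


def ascii_lower(c: str) -> str:
    """Lowercase one ASCII letter by setting the case bit; leave anything else alone."""
    return chr(ord(c) | CASE_BIT) if "A" <= c <= "Z" else c


def ascii_upper(c: str) -> str:
    """Uppercase one ASCII letter by clearing the case bit; leave anything else alone."""
    return chr(ord(c) & ~CASE_BIT) if "a" <= c <= "z" else c


text = "Hello, World 42!"
print("".join(ascii_lower(ch) for ch in text))  # hello, world 42!
print("".join(ascii_upper(ch) for ch in text))  # HELLO, WORLD 42!

# Without the letter check the trick corrupts other characters:
print(chr(ord("@") | CASE_BIT))  # `
```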
Beyond ASCII
ASCII covers English fine but can’t handle international text. The extensions and replacements:
- Extended ASCII / ISO 8859: uses the high bit for another 128 characters (accents, currency symbols, etc.), with many regional variants.
- Unicode: assigns a code point to characters from essentially every writing system, in the range U+0000 to U+10FFFF (1,114,112 values, fitting in 21 bits).
- UTF-8: variable-length encoding of Unicode, backward-compatible with ASCII (ASCII characters use 1 byte, others 2–4). Dominant on the web.
- UTF-16: 16-bit code units, one per character up to U+FFFF and two (a surrogate pair) for anything above; what Java and Windows use internally for strings.
New code almost always uses UTF-8, with ASCII falling out as a special case.
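To make the size differences concrete, the sketch below measures a few sample characters with Python's built-in codecs. The helper name `encoded_sizes` and the sample characters are arbitrary choices for illustration.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 code-unit count) for one character."""
    utf8_bytes = len(ch.encode("utf-8"))
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 2 bytes per code unit, no BOM
    return utf8_bytes, utf16_units


# 'A', e-acute, euro sign, and an emoji outside the 16-bit range
for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:
    utf8_bytes, utf16_units = encoded_sizes(ch)
    print(f"U+{ord(ch):04X}: {utf8_bytes} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")
# U+0041: 1 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+00E9: 2 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+20AC: 3 UTF-8 byte(s), 1 UTF-16 unit(s)
# U+1F600: 4 UTF-8 byte(s), 2 UTF-16 unit(s)

# ASCII as a special case: pure-ASCII text has identical bytes in both encodings.
text = "plain ASCII text"
print(text.encode("ascii") == text.encode("utf-8"))  # True
```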
Encoding is invisible when everything works and a nightmare when it doesn’t. Read a UTF-8 file as Latin-1 and you get “mojibake”: characters decoded with the wrong code page show up as gibberish.
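The UTF-8-read-as-Latin-1 case is easy to reproduce in Python. This is a small sketch with a made-up sample string; the repair step works only because the wrong decoding happened exactly once and lost no bytes.

```python
original = "caf\u00e9"            # "café"
raw = original.encode("utf-8")    # b'caf\xc3\xa9': the accented letter takes two bytes

# Decoding with the wrong codec turns each of those two bytes into its own character.
garbled = raw.decode("latin-1")
print(garbled)                    # cafÃ©

# Reversing the wrong step recovers the text.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)       # True
```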