Character representation in a computer encodes letters, digits, punctuation, and control characters as binary numbers. The foundational encoding is ASCII (American Standard Code for Information Interchange), a 7-bit code that covers 128 distinct characters.
Although ASCII uses only 7 bits, characters are typically stored in 8-bit bytes — the ASCII code occupies the low-order seven bits and the high-order bit is set to 0 (or used for parity in some systems).
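As a rough illustration of how a 7-bit code sits inside an 8-bit byte, the sketch below builds a byte whose high bit carries a parity bit. The helper name `with_even_parity` and the choice of even parity are assumptions made for this example; real systems varied (odd parity, mark, space, or simply 0).

```python
def with_even_parity(code: int) -> int:
    """Return an 8-bit value: the 7-bit ASCII code plus an even-parity bit in bit 7.

    Hypothetical helper for illustration; even parity is an assumed convention.
    """
    if not 0 <= code < 128:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")   # number of 1-bits in the 7-bit code
    parity_bit = ones & 1         # 1 only when the count is odd
    return code | (parity_bit << 7)


# 'A' = 65 = 1000001 has two 1-bits, so the high bit stays 0.
print(f"{with_even_parity(ord('A')):08b}")  # 01000001
# 'C' = 67 = 1000011 has three 1-bits, so the high bit becomes 1.
print(f"{with_even_parity(ord('C')):08b}")  # 11000011
```

When parity is not used, the stored byte is simply the code itself, with the high bit equal to 0.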
Why ASCII
ASCII has two important properties:
- Sequential ordering: alphabetic characters (A–Z, a–z) and numeric characters (0–9) are assigned codes in increasing order, so sorting text by treating ASCII codes as unsigned binary numbers automatically gives alphabetical or numerical order. All uppercase letters sort before all lowercase ones, so the alphabetical order holds within a single case.
- BCD encoding embedded in digit codes: the low-order four bits of each digit character’s ASCII code are exactly the binary-coded decimal representation of that digit. So `'5'` has ASCII code `0x35` (`0011 0101`): the high nibble is `0011` (digit-character marker) and the low nibble is `0101` (5 in BCD).
This embedded BCD makes converting between ASCII digit characters and numeric values trivial: keep only the low 4 bits (`c & 0x0F`) to get the value, or OR the value with `0x30` to get the character.
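Both properties can be checked in a few lines. This is a minimal sketch; the helper names `digit_value` and `digit_char` are made up for the example and are not part of any library.

```python
def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character, taken from its low nibble."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:
        raise ValueError(f"{ch!r} is not an ASCII digit")
    return code & 0x0F            # keep the low four bits (the BCD digit)


def digit_char(value: int) -> str:
    """ASCII digit character for a value 0..9, built by adding the 0011 high nibble."""
    if not 0 <= value <= 9:
        raise ValueError("value must be in 0..9")
    return chr(0x30 | value)


print(f"{ord('5'):08b}")          # 00110101
print(digit_value("5"))           # 5
print(digit_char(7))              # 7

# Sequential ordering: comparing raw byte values gives alphabetical order
# within one case, with every uppercase letter ahead of every lowercase one.
print(sorted(["pear", "apple", "fig"], key=lambda w: w.encode("ascii")))
print(sorted(["b", "A", "a", "B"], key=lambda w: w.encode("ascii")))
```

The last line prints `['A', 'B', 'a', 'b']`, which shows why byte-wise sorting is not the same as case-insensitive alphabetical order.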
Standard ASCII range
| Code (decimal) | Characters | Purpose |
|---|---|---|
| 0–31 | Control characters | Null, tab, line feed, carriage return, escape, etc. |
| 32 | Space | Whitespace |
| 33–47 | `! " # ... /` | Punctuation |
| 48–57 | 0–9 | Digit characters |
| 58–64 | `: ; ... @` | More punctuation |
| 65–90 | A–Z | Uppercase letters |
| 91–96 | `[ \ ] ^ _` and backtick | Punctuation (95 = `_`, 96 = backtick) |
| 97–122 | a–z | Lowercase letters |
| 123–127 | `{ \| } ~` and DEL | Punctuation (123–126) and the delete control character (127) |
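Corresponding uppercase and lowercase letters differ by exactly 32 (bit 5, `0x20`), so for letters `tolower(c) = c | 0x20` and `toupper(c) = c & ~0x20`. This is another bit-twiddle that’s useful in low-level code, but it is only valid when `c` is already known to be an ASCII letter.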
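A small sketch of the case-bit trick follows. The function names `to_lower_ascii` and `to_upper_ascii` are illustrative, and the range checks are an added safeguard, since setting or clearing bit 5 on a non-letter yields a different character altogether.

```python
CASE_BIT = 0x20  # bit 5: the only bit that differs between 'A' and 'a'


def to_lower_ascii(c: str) -> str:
    """Lowercase a single ASCII letter by setting bit 5; other characters pass through."""
    code = ord(c)
    if 0x41 <= code <= 0x5A:              # 'A'..'Z'
        return chr(code | CASE_BIT)
    return c


def to_upper_ascii(c: str) -> str:
    """Uppercase a single ASCII letter by clearing bit 5; other characters pass through."""
    code = ord(c)
    if 0x61 <= code <= 0x7A:              # 'a'..'z'
        return chr(code & ~CASE_BIT)
    return c


print(ord("a") - ord("A"))                # 32
print(to_lower_ascii("Q"), to_upper_ascii("q"))

# Without the range check the trick misfires on non-letters:
# '@' is 0x40, and 0x40 | 0x20 is 0x60, the backtick.
print(chr(ord("@") | CASE_BIT))
```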
Beyond ASCII
ASCII covers English well but is inadequate for international text. Several extensions and replacements:
- Extended ASCII / ISO 8859: uses the high bit to encode an additional 128 characters (accents, currency symbols, etc.). Many regional variants.
- Unicode: aims to encode every character of every writing system. The standard defines code points in the range U+0000 to U+10FFFF (1,114,112 values, fitting in 21 bits).
- UTF-8: variable-length encoding of Unicode that’s backward-compatible with ASCII (ASCII characters use 1 byte, others use 2–4 bytes). The dominant encoding on the web.
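- UTF-16: variable-length encoding built from 16-bit code units (one unit for most characters, two for code points above U+FFFF); what Java and Windows use internally for strings.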
New software almost universally uses UTF-8. Because ASCII text is already valid UTF-8, ASCII survives as a special case.
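The size differences between these encodings are easy to observe with Python's built-in codecs. The sample characters below are arbitrary choices for illustration, written as escapes so the snippet does not depend on the source file's own encoding.

```python
samples = [
    "A",            # U+0041, plain ASCII
    "\u00e9",       # U+00E9, e with acute accent
    "\u20ac",       # U+20AC, euro sign
    "\U0001F600",   # U+1F600, an emoji outside the 16-bit range
]

for ch in samples:
    utf8_len = len(ch.encode("utf-8"))              # bytes in UTF-8
    utf16_units = len(ch.encode("utf-16-le")) // 2  # 16-bit code units in UTF-16
    print(f"U+{ord(ch):04X}: {utf8_len} UTF-8 byte(s), {utf16_units} UTF-16 unit(s)")

# Backward compatibility: pure ASCII text has identical bytes in both encodings.
print("hello".encode("ascii") == "hello".encode("utf-8"))  # True
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8 respectively, and only the last one needs two UTF-16 code units.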
Why this matters
Character encoding is invisible when everything works and a nightmare when it doesn’t. Mixing encodings (a UTF-8 file read as Latin-1, for example) produces “mojibake” — characters interpreted by the wrong code page show up as gibberish.
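The UTF-8-read-as-Latin-1 case can be reproduced directly. This is a minimal sketch with an arbitrary sample word; the repair step only works because Latin-1 maps every byte value to a character, so no information is lost in the wrong decode.

```python
original = "caf\u00e9"                    # "café", with U+00E9 as the last character

raw = original.encode("utf-8")            # b'caf\xc3\xa9': the accent takes two bytes
garbled = raw.decode("latin-1")           # each byte is misread as its own character

print(len(original), len(garbled))        # 4 5
print(garbled == "caf\u00c3\u00a9")       # True: the accent became two characters

# Undo the mistake by reversing the wrong decode, then decoding correctly.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)               # True
```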
For the related concept of how a character’s ASCII bits map to a digit value, see BCD Addition. For the broader context of how memory stores bytes, see Byte addressability and Endianness.