Character Representation

Computers encode letters, digits, punctuation, and control characters as binary numbers. The baseline encoding, and the one most later encodings build on, is ASCII (American Standard Code for Information Interchange), a 7-bit code covering 128 distinct characters.

ASCII only uses 7 bits, but characters get stored in 8-bit bytes: the ASCII code takes the low-order seven bits and the high bit is 0 (or used for parity in some systems).
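As a rough illustration of the high-bit usage described above, here is a minimal sketch that packs a 7-bit code into a byte. It assumes even parity, which is only one of the schemes a system might use; the helper names `with_even_parity` and `strip_high_bit` are made up for this example.

```python
def with_even_parity(code: int) -> int:
    """Return a byte holding the 7-bit code, with bit 7 set so the
    total number of 1 bits is even (even parity is an assumption here)."""
    if not 0 <= code <= 0x7F:
        raise ValueError("not a 7-bit ASCII code")
    ones = bin(code).count("1")      # number of 1 bits in the 7-bit code
    parity_bit = ones & 1            # 1 when that count is odd
    return code | (parity_bit << 7)  # place the parity bit in the high bit


def strip_high_bit(byte: int) -> int:
    """Recover the 7-bit ASCII code by clearing the high bit."""
    return byte & 0x7F


# 'A' is 0x41 (two 1 bits), so the high bit stays 0.
# 'C' is 0x43 (three 1 bits), so the high bit is set, giving 0xC3.
print(hex(with_even_parity(ord("A"))), hex(with_even_parity(ord("C"))))
```

On systems that do not use parity, the high bit is simply 0 and the stored byte equals the ASCII code.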

Why ASCII

Two properties make ASCII nice to work with.

Sequential ordering: alphabetic characters (A–Z, a–z) and numeric characters (0–9) get codes in increasing order. Sort text by treating ASCII codes as unsigned binary numbers and you get alphabetical/numerical order for free, as long as the text is all one case (every uppercase letter sorts before every lowercase one).
BCD embedded in the digit codes: the low-order four bits of each digit character’s ASCII code are exactly the binary-coded decimal of that digit. '5' has ASCII code $00110101$: high nibble $0011$ (digit-character marker), low nibble $0101$ ($5$ in BCD).

So converting an ASCII digit character to its numeric value is just a matter of keeping the low 4 bits (c & 0x0F); going the other way, OR the value with 0x30.
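Both properties can be sketched in a few lines of Python. The function names `digit_value` and `digit_char` are hypothetical, chosen only for this example, and the word list is made-up sample data.

```python
def digit_value(ch: str) -> int:
    """Numeric value of an ASCII digit character: keep the low nibble."""
    code = ord(ch)
    if not 0x30 <= code <= 0x39:
        raise ValueError("not an ASCII digit")
    return code & 0x0F


def digit_char(value: int) -> str:
    """ASCII digit character for a value 0-9: set the 0011 high nibble."""
    if not 0 <= value <= 9:
        raise ValueError("not a single decimal digit")
    return chr(0x30 | value)


# Sorting by raw byte values gives alphabetical order for same-case text.
words = ["pear", "apple", "fig"]
sorted_by_bytes = sorted(words, key=lambda w: w.encode("ascii"))

print(digit_value("5"), digit_char(7), sorted_by_bytes)
```

The byte-order sort only matches dictionary order when the case is uniform: with mixed case, "Cherry" sorts before "apple" because 'C' (67) is less than 'a' (97).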

Standard ASCII range

Code (decimal)	Characters	Purpose
0–31	Control characters	Null, tab, line feed, carriage return, escape, etc.
32	Space	Whitespace
33–47	`! " # ... /`	Punctuation
48–57	`0`–`9`	Digit characters
58–64	`: ; ... @`	More punctuation
65–90	`A`–`Z`	Uppercase letters
91–96	``[ \ ] ^ _ ` ``	Punctuation (95 = `_`, 96 = backtick)
97–122	`a`–`z`	Lowercase letters
123–127	`{ | } ~` DEL	Punctuation (127 = DEL, a control character)
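The ranges in the table translate directly into range checks. The following is a small sketch, not a replacement for a library routine; `ascii_class` and its category labels are names invented for this example.

```python
def ascii_class(code: int) -> str:
    """Classify a 7-bit ASCII code using the ranges from the table."""
    if not 0 <= code <= 127:
        raise ValueError("not a 7-bit ASCII code")
    if code <= 31 or code == 127:   # 0-31 plus DEL
        return "control"
    if code == 32:
        return "space"
    if 48 <= code <= 57:
        return "digit"
    if 65 <= code <= 90:
        return "uppercase"
    if 97 <= code <= 122:
        return "lowercase"
    return "punctuation"            # everything left in 33-126


for ch in ["\t", " ", "/", "7", "Q", "`", "q", "~"]:
    print(repr(ch), ord(ch), ascii_class(ord(ch)))
```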

The gap between uppercase and lowercase is exactly 32, so tolower(c) = c | 0x20 and toupper(c) = c & ~0x20 for letters. Handy bit twiddle in low-level code.
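A minimal sketch of the bit trick, with a range guard added because the trick is only valid for letters. The names `to_lower_ascii` and `to_upper_ascii` are hypothetical.

```python
def to_lower_ascii(ch: str) -> str:
    """Lowercase an ASCII letter by setting bit 5 (0x20)."""
    code = ord(ch)
    if 0x41 <= code <= 0x5A:        # 'A'..'Z'
        return chr(code | 0x20)
    return ch                       # leave everything else untouched


def to_upper_ascii(ch: str) -> str:
    """Uppercase an ASCII letter by clearing bit 5 (0x20)."""
    code = ord(ch)
    if 0x61 <= code <= 0x7A:        # 'a'..'z'
        return chr(code & ~0x20)
    return ch


print(to_lower_ascii("A"), to_upper_ascii("z"), to_lower_ascii("["))
```

Without the guard, the same OR would turn `[` (0x5B) into `{` (0x7B) and `@` (0x40) into a backtick (0x60), which is why the trick is restricted to letters.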

Beyond ASCII

ASCII covers English fine but can’t handle international text. The extensions and replacements:

Extended ASCII / ISO 8859: uses the high bit for another 128 characters (accents, currency symbols, etc.), with many regional variants.
Unicode: assigns a code point to each character across the world’s writing systems, in the range U+0000 to U+10FFFF (1,114,112 values, fitting in 21 bits).
UTF-8: variable-length encoding of Unicode, backward-compatible with ASCII (ASCII characters use 1 byte, others 2–4). Dominant on the web.
UTF-16: 16-bit code units, with characters above U+FFFF taking two units (a surrogate pair); what Java and Windows use internally for strings.

New code almost always uses UTF-8, with ASCII falling out as a special case.
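To make the size differences concrete, the sketch below measures a few sample characters with Python's built-in codecs. The helper names `utf8_length` and `utf16_units` and the choice of sample characters are just for illustration.

```python
def utf8_length(ch: str) -> int:
    """Number of bytes the character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def utf16_units(ch: str) -> int:
    """Number of 16-bit code units the character occupies in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2   # little-endian, no BOM


for ch in ["A", "\u00e9", "\u20ac", "\U0001F600"]:   # A, e-acute, euro sign, emoji
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} UTF-8 byte(s), "
          f"{utf16_units(ch)} UTF-16 unit(s)")

# Pure ASCII text produces identical bytes under ASCII and UTF-8.
same_bytes = "plain text".encode("ascii") == "plain text".encode("utf-8")
```

The four samples take 1, 2, 3, and 4 bytes in UTF-8 respectively; only the last, which lies above U+FFFF, needs two UTF-16 code units.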

Encoding is invisible when everything works and a nightmare when it doesn’t. Read a UTF-8 file as Latin-1 and you get “mojibake”: characters decoded with the wrong encoding showing up as gibberish.
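The failure is easy to reproduce. This sketch uses a made-up sample string, encodes it as UTF-8, and decodes the bytes as Latin-1, then reverses the mistake.

```python
original = "caf\u00e9"                    # "café"

raw = original.encode("utf-8")            # é becomes the two bytes C3 A9
wrong = raw.decode("latin-1")             # each byte read as its own character

# Reversing the wrong step recovers the text, provided nothing was lost.
repaired = wrong.encode("latin-1").decode("utf-8")

print(raw, wrong, repaired)
```

Here `wrong` comes out as "cafÃ©": the single character é has turned into two. The round trip works because Latin-1 maps every byte value to a character; with other wrongly assumed encodings some bytes may be undecodable or replaced, and the damage cannot always be undone.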

Idriss Rami — Notes
