Computer performance

Computer performance usually means how quickly a program executes. What influences it: the instruction set, the hardware design, the technology used to fabricate the chip, the operating system, and (for high-level code) the compiler.

Here I care about the architectural levers (the hardware and architecture choices that improve performance), not benchmarking methodology.

Technology

The biggest single lever historically: smaller transistors.

The speed at which logic gates switch between $0$ and $1$ depends largely on transistor size. Smaller transistors:

Switch faster (less capacitance to charge/discharge).
Pack more densely, allowing more complex circuits per chip.
Use less power per operation.

Decades of fabrication advances (driven by Very Large-Scale Integration, VLSI) have followed this curve. See Moore’s law for the doubling of transistor counts, roughly every two years, that has driven computing performance since the 1960s.

The shrinking is hitting physical limits: at the 5 nm node and below, quantum tunneling and atomic-scale variation start to matter. Gains increasingly come from architecture rather than pure shrinking.

Parallelism

Do multiple operations in parallel. Parallelism shows up at several levels.

Instruction-level parallelism

The simplest execution model finishes one instruction before starting the next, which is slow. Pipelining overlaps execution: while instruction $N$ is being executed, instruction $N + 1$ is being decoded, and $N + 2$ is being fetched. See Instruction execution cycle for the basic 5-stage cycle.
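The benefit of overlapping can be put in numbers with a small cycle-count sketch. It assumes an idealised pipeline: every stage takes exactly one cycle and there are no stalls or hazards. The function names are made up for illustration.

```python
def unpipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """Each instruction runs through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes every cycle (ideal case)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # 100 instructions on a 5-stage pipeline
    slow = unpipelined_cycles(n, k)  # 500 cycles
    fast = pipelined_cycles(n, k)    # 104 cycles
    print(f"unpipelined: {slow}, pipelined: {fast}, speedup: {slow / fast:.2f}x")
```

For long instruction streams the speedup approaches the number of stages. Real pipelines fall short of that because of hazards and stalls, which this sketch ignores.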

Superscalar processors dispatch multiple instructions per cycle to multiple functional units. Modern CPUs can issue 4–8 instructions per cycle, limited by both program-inherent constraints (data dependencies between instructions) and hardware constraints (number of execution ports, register-rename resources, decode width, branch-prediction accuracy, cache miss stalls). On real workloads the achieved instructions-per-cycle sits well below the issue width because of these combined limits.
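The program-inherent limit can be illustrated with a rough upper bound on instructions per cycle. The sketch assumes unit latency for every instruction and models only data dependencies and issue width (no caches, branches, or port conflicts). The dependency-list encoding and function names are hypothetical.

```python
def critical_path_length(deps: list[list[int]]) -> int:
    """deps[i] lists the earlier instructions whose results instruction i needs.
    Returns the length of the longest dependency chain, in instructions."""
    depth: list[int] = []
    for i, producers in enumerate(deps):
        if any(p >= i for p in producers):
            raise ValueError("an instruction may only depend on earlier ones")
        # An instruction can start one step after its latest producer.
        depth.append(1 + max((depth[p] for p in producers), default=0))
    return max(depth, default=0)


def ipc_upper_bound(deps: list[list[int]], issue_width: int) -> float:
    """IPC cannot exceed the issue width or (instructions / critical path)."""
    n = len(deps)
    if n == 0:
        return 0.0
    return min(float(issue_width), n / critical_path_length(deps))


if __name__ == "__main__":
    chain = [[], [0], [1], [2]]       # each instruction needs the previous one
    independent = [[], [], [], []]    # no dependencies at all
    print(ipc_upper_bound(chain, issue_width=4))        # 1.0
    print(ipc_upper_bound(independent, issue_width=4))  # 4.0
```

A fully dependent chain gets no benefit from a wide machine, while independent instructions can fill it. Real code sits somewhere in between.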

Multi-core processors

Modern processors contain multiple cores on a single chip, each a complete processor. Examples: dual-core, quad-core, octa-core, server CPUs with 64+ cores.

Each core runs an independent instruction stream, so a multi-threaded program can use several cores at once. A single-threaded program only uses one core no matter how many are available.
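One standard way to quantify this is Amdahl's law, which these notes do not name and which is brought in here only as an illustration. It assumes a fixed fraction of the program's run time can be spread perfectly across cores while the rest stays serial.

```python
def amdahl_speedup(parallel_fraction: float, n_cores: int) -> float:
    """Ideal speedup when `parallel_fraction` of the run time scales
    perfectly across `n_cores` and the remainder stays serial."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if n_cores < 1:
        raise ValueError("n_cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / n_cores)


if __name__ == "__main__":
    # A single-threaded program (nothing parallel) gains nothing from 64 cores.
    print(amdahl_speedup(0.0, 64))            # 1.0
    # A program that is 90% parallel, on 8 cores.
    print(f"{amdahl_speedup(0.9, 8):.2f}")    # 4.71
```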

Multiprocessors

Some systems contain multiple physical CPU chips, each with multiple cores. Shared-memory multiprocessors let all processors access the same main memory, sharing data and synchronizing via memory operations.

Pros: easy programming model (any processor can access any memory).

Cons: contention for memory bandwidth, complex cache coherence, scaling limits.
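A minimal sketch of the shared-memory programming model, using Python threads as stand-ins for processors. This shows only the model: all workers update one shared variable and synchronise with a lock. It is not a performance demonstration, since CPython threads do not run CPU-bound code in parallel. The function name is hypothetical.

```python
import threading


def shared_memory_sum(chunks: list[list[int]]) -> int:
    """Each worker sums its chunk, then adds the result to a shared total."""
    total = 0
    lock = threading.Lock()

    def worker(chunk: list[int]) -> None:
        nonlocal total
        partial = sum(chunk)
        # The lock makes the read-modify-write on the shared total atomic.
        with lock:
            total += partial

    threads = [threading.Thread(target=worker, args=(c,)) for c in chunks]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    data = [[1, 2, 3], [4, 5], [6]]
    print(shared_memory_sum(data))  # 21
```

Every worker can reach `total` directly, which is the convenience of the model. The lock is the price: shared data needs explicit synchronisation.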

Multicomputers (distributed systems)

Connect multiple complete computers over a network, each with its own private memory, and you get a multicomputer (also called a cluster). Communication uses message passing instead of shared memory.

Pros: scales to thousands or millions of nodes (datacenters, supercomputers).

Cons: programmer must explicitly handle communication, much higher latency than shared memory.
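For contrast, a sketch of the message-passing model. Threads and a `queue.Queue` stand in for separate machines and a network, which is an assumption made purely for illustration. Each node touches only its own private data and communicates by sending a message to a coordinator.

```python
import queue
import threading


def node(node_id: int, private_data: list[int], outbox: queue.Queue) -> None:
    """A node works on its own data and sends the result as a message."""
    outbox.put((node_id, sum(private_data)))


def message_passing_sum(partitions: list[list[int]]) -> int:
    """Coordinator: start one node per partition, then collect their messages."""
    inbox: queue.Queue = queue.Queue()
    threads = [
        threading.Thread(target=node, args=(i, part, inbox))
        for i, part in enumerate(partitions)
    ]
    for t in threads:
        t.start()

    total = 0
    for _ in partitions:
        _node_id, partial = inbox.get()  # blocks until a message arrives
        total += partial

    for t in threads:
        t.join()
    return total


if __name__ == "__main__":
    print(message_passing_sum([[1, 2, 3], [4, 5], [6]]))  # 21
```

No lock is needed because nothing is shared, but every exchange of data has to be written out explicitly as a send and a receive.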

What limits performance

Three things constrain how fast a computer can run a program:

Computation rate: how fast the cores execute instructions. Limited by clock speed, available ILP, and cache hit rate (misses stall the pipeline).
Memory bandwidth: how fast data flows between memory and CPU. Limited by bus width and DRAM speed.
Memory latency: how long it takes to fetch a value from memory. Hidden by caches when access patterns have locality.

The CPU/memory speed gap means memory often dominates. A cache miss that goes all the way to DRAM can cost on the order of hundreds of cycles, which is why caches matter so much.
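A back-of-the-envelope way to see this is the average memory access time: hit time plus miss rate times miss penalty. The numbers below (a 4-cycle hit and a 200-cycle miss penalty) are assumed for illustration, not measurements of any particular CPU.

```python
def average_access_cycles(
    hit_cycles: float, miss_rate: float, miss_penalty_cycles: float
) -> float:
    """Average cycles per memory access for a single cache level."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty_cycles


if __name__ == "__main__":
    HIT, PENALTY = 4, 200  # assumed values, in cycles
    for miss_rate in (0.02, 0.10, 1.00):
        avg = average_access_cycles(HIT, miss_rate, PENALTY)
        print(f"miss rate {miss_rate:.0%}: about {avg:.0f} cycles per access")
```

Under these assumptions, going from a 2% to a 10% miss rate triples the average access cost (about 8 cycles to about 24), and with no useful caching every access pays the full penalty.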

Beyond raw speed

Other performance dimensions:

Throughput: total work done per unit time. Servers care about this more than latency.
Latency: time from request to first response. Real-time systems care about this.
Energy efficiency: performance per watt. Critical for mobile and datacenter.
Cost-effectiveness: performance per dollar.

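Which dimension wins is workload-dependent. A user-facing Postgres deployment, for example, is usually tuned for tail latency under concurrent load rather than peak throughput: a query that takes 50 ms 999 times and 5 s once can be worse for user experience than one that always takes 80 ms, even though its mean (about 55 ms) is lower. A 1-in-1000 outlier sits beyond p99, so p99 alone would not expose it; the maximum, or a percentile above 99.9, would.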
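The numbers in that example can be checked directly. The sketch below assumes 1000 requests per query and a nearest-rank percentile definition, both chosen here for illustration. It shows that the mean and even p99 favour the spiky query, so the 5 s outlier only becomes visible in the maximum.

```python
from statistics import mean


def nearest_rank_percentile(samples: list[int], percent: int) -> int:
    """Smallest sample such that at least `percent`% of samples are <= it."""
    ordered = sorted(samples)
    n = len(ordered)
    rank = max(1, -(-percent * n // 100))  # ceil(percent * n / 100), in integers
    return ordered[rank - 1]


# Latencies in milliseconds.
spiky = [50] * 999 + [5000]   # 50 ms 999 times, 5 s once
steady = [80] * 1000          # always 80 ms

if __name__ == "__main__":
    for name, samples in (("spiky", spiky), ("steady", steady)):
        print(
            f"{name}: mean={mean(samples):.2f} ms, "
            f"p99={nearest_rank_percentile(samples, 99)} ms, "
            f"max={max(samples)} ms"
        )
```

Which summary statistic gets reported decides which query looks better, so the metric has to match what the workload actually cares about.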

Idriss Rami — Notes
