Computer performance usually means how quickly a program executes. What influences it: the instruction set, the hardware design, the technology used to fabricate the chip, the operating system, and (for high-level code) the compiler.
Here I care about the architectural levers: the hardware/architecture choices that improve performance, not benchmarking methodology.
Technology
The biggest single lever historically: smaller transistors.
The speed at which logic gates switch between 0 and 1 depends largely on transistor size. Smaller transistors:
- Switch faster (less capacitance to charge/discharge).
- Pack more densely, allowing more complex circuits per chip.
- Use less power per operation.
Decades of fabrication advances (driven by Very Large-Scale Integration, VLSI) have followed this curve. See Moore’s law for the doubling rate that has driven computing performance since the 1960s.
The shrinking is hitting physical limits: at 5nm and below, quantum tunneling and atomic-scale variation start to matter. Gains increasingly come from architecture rather than pure shrinking.
Parallelism
Do multiple operations in parallel. Parallelism shows up at several levels.
Instruction-level parallelism
The simplest execution model finishes one instruction before starting the next, which is slow. Pipelining overlaps execution: while instruction i is being executed, instruction i+1 is being decoded, and instruction i+2 is being fetched. See Instruction execution cycle for the basic 5-stage cycle.
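A rough cycle count makes the benefit of overlapping concrete. This is a minimal sketch that assumes every stage takes exactly one cycle and that there are no stalls or hazards; the function names are made up for illustration.

```python
def cycles_unpipelined(n_instructions: int, n_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return n_instructions * n_stages


def cycles_pipelined(n_instructions: int, n_stages: int) -> int:
    """The first instruction needs n_stages cycles to fill the pipeline;
    after that, one instruction completes per cycle (idealized, no stalls)."""
    if n_instructions == 0:
        return 0
    return n_stages + (n_instructions - 1)


n, k = 1000, 5
speedup = cycles_unpipelined(n, k) / cycles_pipelined(n, k)
print(f"{cycles_unpipelined(n, k)} vs {cycles_pipelined(n, k)} cycles, speedup {speedup:.2f}x")
```

For long instruction streams the idealized speedup approaches the number of stages. Real pipelines fall short of that because of the stalls this sketch ignores.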
Superscalar processors dispatch multiple instructions per cycle to multiple functional units. Modern CPUs can issue 4–8 instructions per cycle, limited by both program-inherent constraints (data dependencies between instructions) and hardware constraints (number of execution ports, register-rename resources, decode width, branch-prediction accuracy, cache miss stalls). On real workloads the achieved instructions-per-cycle sits well below the issue width because of these combined limits.
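The effect of data dependencies on achieved IPC can be sketched with a toy scheduler. The assumptions are loud ones: every instruction takes one cycle, the issue width is the only hardware limit, and `schedule` is a hypothetical helper, not a model of any real CPU.

```python
from collections import defaultdict


def schedule(deps, issue_width):
    """Return the number of cycles needed to issue all instructions.

    deps[i] lists the earlier instructions whose results instruction i needs.
    Every instruction is assumed to take one cycle (a simplification).
    """
    issued_in_cycle = defaultdict(int)  # cycle -> instructions issued so far
    cycle_of = []                       # cycle in which each instruction issues
    for needed in deps:
        # Earliest cycle after all producer instructions have finished.
        cycle = max((cycle_of[d] + 1 for d in needed), default=0)
        # Slide forward until an issue slot is free in that cycle.
        while issued_in_cycle[cycle] >= issue_width:
            cycle += 1
        issued_in_cycle[cycle] += 1
        cycle_of.append(cycle)
    return max(cycle_of) + 1 if cycle_of else 0


independent = [[] for _ in range(8)]               # no instruction needs another
chain = [[]] + [[i - 1] for i in range(1, 8)]      # each needs the previous one

for name, program in [("independent", independent), ("chain", chain)]:
    cycles = schedule(program, issue_width=4)
    print(f"{name}: {cycles} cycles, IPC {len(program) / cycles:.1f}")
```

With the same 4-wide machine, the independent program reaches the full issue width while the dependent chain is stuck at one instruction per cycle, which is the program-inherent limit the paragraph describes.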
Multi-core processors
Modern processors contain multiple cores on a single chip, each a complete processor. Examples: dual-core, quad-core, octo-core, server CPUs with 64+ cores.
Each core runs an independent instruction stream, so a multi-threaded program can use several cores at once. A single-threaded program only uses one core no matter how many are available.
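As a back-of-the-envelope illustration, here is an idealized runtime model. It assumes the work splits evenly across threads and ignores synchronization and memory effects; `runtime` and its parameters are hypothetical names.

```python
def runtime(work_units: float, threads: int, cores: int) -> float:
    """Idealized runtime: work splits evenly across threads, and at most
    `cores` threads make progress at the same time."""
    usable = min(threads, cores)
    return work_units / usable


print(runtime(80, threads=1, cores=8))   # single-threaded: extra cores do not help
print(runtime(80, threads=8, cores=8))   # one thread per core
print(runtime(80, threads=16, cores=8))  # more threads than cores: no further gain
```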
Multiprocessors
Some systems contain multiple physical CPU chips, each with multiple cores. Shared-memory multiprocessors let all processors access the same main memory, sharing data and synchronizing via memory operations.
Pros: easy programming model (any processor can access any memory).
Cons: contention for memory bandwidth, complex cache coherence, scaling limits.
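The shared-memory programming model can be sketched with threads that update one shared variable. Python threads stand in for processors here, so this illustrates only the programming model (shared state plus synchronization), not a real speedup.

```python
import threading

counter = 0                  # shared state, visible to every thread
lock = threading.Lock()      # serializes updates to the shared value


def worker(increments: int) -> None:
    global counter
    for _ in range(increments):
        with lock:           # without the lock, concurrent updates can be lost
            counter += 1


threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # 40000
```

Every worker reads and writes the same `counter` directly, which is what makes the model easy to program. The lock is a small-scale picture of the synchronization cost listed under the cons.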
Multicomputers (distributed systems)
Connect multiple complete computers over a network, each with its own private memory, and you get a multicomputer (also called a cluster). Communication uses message passing instead of shared memory.
Pros: scales to thousands or millions of nodes (datacenters, supercomputers).
Cons: programmer must explicitly handle communication, much higher latency than shared memory.
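For contrast, message passing can be sketched as follows. Threads and a `queue.Queue` stand in for separate machines and a network link, which is an assumption made purely for illustration; a real cluster would use a networking or MPI-style library.

```python
import queue
import threading


def node(node_id: int, private_data: list, outbox: queue.Queue) -> None:
    # Each node works only on its own data and sends the result as a message.
    outbox.put((node_id, sum(private_data)))


inbox = queue.Queue()        # stands in for the network link to a coordinator
chunks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
workers = [threading.Thread(target=node, args=(i, chunk, inbox))
           for i, chunk in enumerate(chunks)]
for w in workers:
    w.start()
for w in workers:
    w.join()

# The coordinator sees only the messages, never the nodes' private data.
partials = dict(inbox.get() for _ in chunks)
total = sum(partials.values())
print(total)  # 45
```

No node touches another node's data; everything that crosses between them is an explicit message, which is the extra work the programmer takes on.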
What limits performance
Three things constrain how fast a computer can run a program:
- Computation rate: how fast the cores execute instructions. Limited by clock speed, ILP, and cache hit rate.
- Memory bandwidth: how fast data flows between memory and CPU. Limited by bus width and DRAM speed.
- Memory latency: how long it takes to fetch a value from memory. Hidden by caches when access patterns have locality.
The CPU/memory speed gap means memory often dominates. A cache miss that has to go all the way to DRAM can cost hundreds of cycles, which is why caches matter so much.
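The standard average-access-time estimate shows how quickly misses come to dominate. The cycle counts below (4-cycle hit, 200-cycle miss penalty) are illustrative assumptions, not measurements of any particular machine.

```python
def average_access_cycles(hit_cycles: float, miss_rate: float,
                          miss_penalty_cycles: float) -> float:
    """Average cost of one memory access: every access pays the hit time,
    and a fraction `miss_rate` of them also pays the miss penalty."""
    return hit_cycles + miss_rate * miss_penalty_cycles


for miss_rate in (0.01, 0.05, 0.10):
    cost = average_access_cycles(hit_cycles=4, miss_rate=miss_rate,
                                 miss_penalty_cycles=200)
    print(f"miss rate {miss_rate:.0%}: {cost:.1f} cycles per access")
```

Under these assumed numbers, going from a 1% to a 10% miss rate roughly quadruples the average cost of an access.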
Beyond raw speed
Other performance dimensions:
- Throughput: total work done per unit time. Servers care about this more than latency.
- Latency: time from request to first response. Real-time systems care about this.
- Energy efficiency: performance per watt. Critical for mobile and datacenter.
- Cost-effectiveness: performance per dollar.
Which dimension wins is workload-dependent. A user-facing service backed by Postgres, for example, is often tuned for tail latency under concurrent load rather than peak throughput: a query that takes 50 ms 999 times and 5 s once has the lower average (about 55 ms) but can be worse for user experience than one that always takes 80 ms.
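The numbers in that example can be checked directly. This is a small sketch using a nearest-rank percentile; the `percentile` helper is written here for illustration and takes an integer percentage.

```python
import statistics


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest sample with at least p% of
    all samples at or below it (p is an integer from 1 to 100)."""
    ordered = sorted(samples)
    rank = max(1, -(-p * len(ordered) // 100))  # ceiling division
    return ordered[rank - 1]


spiky = [50] * 999 + [5000]   # 50 ms 999 times, 5 s once
steady = [80] * 1000          # always 80 ms

for name, samples in [("spiky", spiky), ("steady", steady)]:
    print(f"{name}: mean {statistics.mean(samples):.2f} ms, "
          f"p99 {percentile(samples, 99)} ms, max {max(samples)} ms")
```

The spiky query wins on the mean (54.95 ms against 80 ms), and with only one slow run in a thousand its p99 is still 50 ms. The 5 s outlier shows up only in the maximum or in percentiles above 99.9, so which statistic gets tracked decides whether the problem is visible at all.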