CPU vs GPU vs TPU vs NPU: Architecture, History and How to Choose for Your Workload

These four chip types underpin almost everything in modern compute. Understanding the differences is not just academic - it directly affects infrastructure cost, performance and what you can actually scale.

CPU: Built for Everything, Optimized for Nothing in Particular

The CPU came first - by decades. Before dedicated chips existed, computers used racks of vacuum tubes and later discrete transistors. The Intel 4004 in 1971 was the first commercially available microprocessor - a single chip that could execute general-purpose instructions. That was the breakthrough.

Intel’s 8086 in 1978 established the x86 architecture that still runs most of the world’s servers today. The 286, 386, 486 and Pentium generations through the 80s and 90s followed the same playbook: more transistors, faster clock speeds, better single-thread performance. AMD became a serious competitor with the Athlon in 1999. The multi-core era arrived with Intel’s Core 2 Duo in 2006 - the industry had hit physical limits on clock speed and started adding cores instead.

ARM has a different lineage. Designed by Acorn Computers in 1983 as a low-power processor for the BBC Micro, the first ARM chip shipped in 1985. Apple used it in the Newton PDA in 1992. When smartphones arrived, ARM’s power efficiency made it the only viable option - every iPhone and Android device runs ARM. The shift to ARM in cloud infrastructure is more recent, accelerating after AWS launched Graviton in 2018 and Apple’s M1 proved in 2020 that ARM could compete with x86 on raw performance.

CPUs are built for sequential execution, complex branching logic and low latency. A modern server CPU might have 8 to 64 cores, each with deep instruction pipelines and large caches. The goal is fast, flexible single-threaded performance.

x86 (Intel and AMD)

x86 is the dominant architecture in servers and desktops. The x64 extension allows larger memory addressing - critical for databases and memory-intensive workloads. The instruction set is CISC (Complex Instruction Set Computing), which means powerful single-thread performance at the cost of higher power draw.

ARM: The Quiet Takeover

ARM uses a RISC (Reduced Instruction Set Computing) design. Fewer, simpler instructions per clock cycle, but more efficient per watt. It started in mobile - every iPhone and Android device runs ARM. Now it’s taking over cloud infrastructure.

AWS Graviton instances run ARM and come in 20-40% cheaper than equivalent x86 instances for many workloads. Apple’s M4 delivers more compute per watt than most x86 chips. ARM is winning on cost efficiency, and cloud providers know it.

When to use a CPU: Web servers, databases, general application logic, anything with complex branching or sequential processing. The default choice for most workloads.

GPU: Parallel Compute at Scale

GPUs are newer than CPUs by about 25 years. Before dedicated graphics hardware existed, games rendered 3D graphics on the CPU - which is why early 3D games were slow and choppy. The 3dfx Voodoo in 1996 was the first widely adopted dedicated 3D accelerator, offloading graphics from the CPU entirely. NVIDIA’s GeForce 256 in 1999 coined the term GPU and was the first chip to handle hardware transform and lighting - meaning the chip itself calculated geometry, not your CPU.

The gaming hardware war through the 2000s drove enormous investment into parallel processing. By the mid-2000s, GPUs were powerful enough that researchers started asking if they could be used for non-graphics workloads. NVIDIA answered with CUDA in 2006 - a programming model that let developers write general-purpose code that ran on GPU cores. That was the moment everything changed.

The connection to modern AI is direct. In 2012, Alex Krizhevsky trained AlexNet on two NVIDIA GTX 580s and won ImageNet by a margin that shocked the research community. That result proved GPU-accelerated deep learning worked. The entire AI wave since then runs on the foundation CUDA built.

A GPU has thousands of smaller cores, each less capable than a CPU core but collectively able to execute massively parallel operations. The trade-off is deliberate: give up single-thread performance, gain parallelism at scale.

GPU Tiers Worth Knowing

Consumer cards like the RTX 4090 are surprisingly capable for local AI inference. Relatively affordable and practical for on-prem experimentation.

Professional workstation cards (NVIDIA A-series) sit in the middle - more VRAM, better reliability, higher price.

Data center cards are where production AI training happens. The A100 is the workhorse that ran most of the last generation of model training. The H100 is roughly 6x faster for transformer model training. If you’re renting GPU capacity for serious training runs, you’re almost certainly on H100s.

AMD ROCm is the open alternative to NVIDIA CUDA. It’s growing, but the CUDA ecosystem - libraries, tooling, community knowledge - is still dominant. If you’re building on-prem GPU infrastructure and NVIDIA is within budget, the ecosystem maturity alone justifies it.

When to use a GPU: AI/ML training and inference, scientific simulation, video rendering, any workload that is embarrassingly parallel. If you can express it as matrix math, a GPU will beat a CPU by orders of magnitude.

TPU: Designed for One Thing

The TPU is the newest of the three - and the most specialized. Google built the first one internally in 2015. The motivation was practical: Google’s search ranking had started using neural networks, and the volume of inference requests was so large that running them on GPUs would have required doubling their global data center footprint. They needed something faster and more power-efficient for a specific job.

TPU v1 shipped internally in 2015 and was inference-only - it could run trained models but not train them. Google announced it publicly at I/O 2016. TPU v2 in 2017 added training capability. v3 in 2018 added liquid cooling to handle the heat from higher performance. v4 in 2021 and v5 in 2023 continued the scaling trajectory. Each generation has been purpose-built around one insight: neural network training is mostly matrix multiplication, so build the chip around that operation and nothing else.

The design goal was narrow: accelerate TensorFlow neural network inference, specifically matrix multiplication.

TPUs contain Matrix Multiply Units (MXUs) that are purpose-built for tensor operations. They are not general purpose. You cannot use a TPU to run a database or serve a web request. What they do, they do faster and more efficiently than GPUs at scale - particularly for transformer model training.

TPU pods are available via Google Cloud. You rent them, you don’t buy them. The interconnect between TPU chips is designed from the ground up for distributed training, which makes scaling simpler than GPU clusters.

When to use a TPU: Large-scale model training on Google’s stack (JAX or TensorFlow), foundation model pretraining where cost per FLOP matters. At sufficient scale, TPU pods can be cheaper than equivalent H100 clusters.

Google Isn’t the Only One

TPU is Google’s brand name, but the category - custom AI silicon - is broader. AWS built Trainium for training and Inferentia for inference, with the same motivation: cut per-unit ML compute costs below what NVIDIA charges. Apple’s Neural Engine in every iPhone and M-series Mac is the same concept applied to on-device inference.

A few others worth knowing: Graphcore built the IPU (Intelligence Processing Unit) as a direct competitor before being acquired by SoftBank in 2023. Cerebras built the WSE (Wafer Scale Engine) - literally an entire silicon wafer as a single chip, designed for massive parallel training. Meta and Microsoft both have custom AI accelerator programs in development for internal use.

FPGAs (Field-Programmable Gate Arrays) are worth a mention here too. Intel and AMD/Xilinx sell FPGAs that can be programmed to accelerate specific ML inference workloads with extremely low latency - Microsoft ran Bing’s neural ranking on FPGAs for years, and they appear in financial trading and network processing where microsecond-level inference latency matters. Not the right tool for general AI workloads, but the right tool when fixed-function performance and deterministic latency are the constraint.

The pattern is consistent: at sufficient scale, every major tech company builds custom silicon to avoid paying NVIDIA’s margins. Google just got there first and named it in a way that stuck.

TPU vs GPU: The Market Reality

There is an ongoing debate about whether TPUs threaten NVIDIA’s dominance. The short answer is: not really, and here’s why.

Google’s internal cost per compute unit on TPUs is reportedly lower than NVIDIA’s top chips - roughly $3.50-4.38 vs $6.30 for equivalent NVIDIA capacity. On paper that looks like a threat. In practice, the CUDA ecosystem is the moat. Decades of libraries, tooling, researcher familiarity and community knowledge are built around CUDA. Switching to TPUs means rewriting pipelines in JAX or TensorFlow and accepting that you’re locked to Google Cloud.

Most organizations training and running models are not at a scale where the cost-per-FLOP difference justifies that migration. TPUs are a genuine advantage if you’re already deep in Google’s stack - and a serious consideration for very large training runs where the economics shift. For everyone else, H100s on whatever cloud you prefer is still the practical answer.

NPU: On-Device AI at the Edge

The NPU (Neural Processing Unit) is the newest category - and the one most likely to affect product decisions for anyone building software that runs on consumer hardware.

NPUs are dedicated inference engines embedded directly into consumer chips. Unlike TPUs, which are cloud-scale accelerators you rent from Google, NPUs live inside the devices your users already own. Apple’s Neural Engine shipped with the A11 Bionic in the iPhone X in 2017 and has been in every Apple chip since. Qualcomm’s Hexagon NPU is in most Android flagships. Intel’s AI Boost ships in Meteor Lake and Arrow Lake CPUs. AMD’s XDNA architecture is in Ryzen AI processors. MediaTek’s APU is in a large share of mid-range Android devices.

The design goal is narrow: run ML inference fast, with minimal power draw, without touching the CPU or GPU. A modern Apple M4 Neural Engine can execute 38 trillion operations per second (TOPS). Microsoft’s Copilot+ PC certification requires a minimum of 40 TOPS - which is now a purchasing criterion listed on laptop spec sheets alongside RAM and storage.

NPUs are matrix multiply engines, the same fundamental operation as TPUs, but sized for milliwatts rather than megawatts. On an iPhone, Face ID, Live Text, real-time camera processing and on-device Siri all run on the Neural Engine - not the CPU or GPU.

Why This Matters for Product Development

On-device inference via NPU has three advantages that cloud inference does not:

Latency: No network round-trip. Inference runs in microseconds, not hundreds of milliseconds.

Privacy: Data never leaves the device. For healthcare, legal or financial applications this is often a regulatory requirement, not a preference.

Cost: No per-token API fees. Once the model is on-device, inference is free at runtime.

The constraint is model size. NPUs work well with small, quantized models - typically a few gigabytes at most. You cannot run a 70B parameter model on a phone. Apple’s on-device models for Apple Intelligence are in the 3B range. Qualcomm’s Snapdragon on-device deployments target 7B quantized. The engineering challenge is fitting a useful model into the memory and compute budget available.

The Frameworks

CoreML is Apple’s framework for compiling models to run on the Neural Engine. If you’re building an iOS or macOS app with ML features, CoreML is how you target the NPU - and WWDC 2024 added on-device foundation model APIs that expose NPU-accelerated inference directly to app developers.

Qualcomm’s AI Hub and the ONNX Runtime with QNN execution provider handle Snapdragon NPU targeting. On Windows, DirectML and the Windows ML API surface AI Boost on Intel and AMD chips through a unified interface.

When to use an NPU: Any ML inference that needs to run on a user’s device - mobile apps, desktop applications, edge hardware. If your use case requires low latency, data privacy or offline capability, the NPU is the right layer. For cloud inference at scale, GPU or TPU remains the answer.

How Each Architecture Scales

CPUs scale two ways: vertically (larger instance) or horizontally (more instances behind a load balancer). Straightforward and predictable. Costs are well understood.

GPUs scale by adding more cards, but coordination overhead is non-trivial. NVLink handles communication between GPUs on the same node. InfiniBand connects multi-node clusters. NCCL manages collective operations across the lot. Getting efficient distributed GPU training requires real engineering effort.

TPU pods scale via Google’s internal interconnect - purpose-built for distributed matrix operations. Simpler to scale than GPU clusters, but you’re locked to Google Cloud and the JAX/TensorFlow ecosystem.

NPUs don’t scale in the traditional sense - they’re fixed silicon in a device. You scale NPU-backed inference by deploying to more devices, not by adding more chips. The relevant question isn’t throughput, it’s whether the model fits within the NPU’s memory and compute budget on the target hardware.

What the Cloud Providers Actually Run

AWS: General compute on x86 Intel/AMD (EC2), ARM via Graviton (EC2 Graviton - frequently the right default), NVIDIA GPU via P and G instance families. AWS Trainium handles model training and Inferentia2 handles inference - both are custom ML chips that sit below NVIDIA GPU pricing for throughput-oriented workloads. Inferentia2 in particular is worth pricing out if you’re running high-volume inference on a fixed model, since it can cut per-token cost significantly versus an H100 instance.

Google Cloud: x86 general instances, ARM via Tau T2A (Ampere), TPU v4 and v5 pods and A100/H100 GPU instances. The TPU offering is what differentiates GCP for large-scale training.

Azure: x86 general compute, Cobalt ARM instances, NDv4 (A100) and NDv5 (H100) for GPU workloads.

Cloudflare Workers: V8 isolates on x86. You never choose the chip - it’s fully abstracted. This is fine. Workers are not the right layer for compute-intensive work.

Serverless: Does the Chip Matter?

Mostly no. But there are a few cases where it does.

AWS Lambda runs x86 by default. You can opt into ARM (Graviton2) - it’s cheaper per GB-second and frequently faster for compute-bound functions. If your Lambda does real work rather than just proxying requests, switching to ARM is a low-effort win.

Cloudflare Workers, Vercel Edge Functions and similar platforms are fully abstracted. You get V8 isolates. The underlying chip is irrelevant to your code.

Google Cloud Run supports ARM now. Same trade-off as Lambda on Graviton - potentially cheaper for the right workloads.

Where the chip genuinely matters in serverless: if you’re running ML inference inside a Lambda (via ONNX or a small quantized model), x86 vs ARM will affect inference latency. If you need GPU for inference, Lambda does not offer it. You need a container-based runtime - ECS, Cloud Run, Kubernetes - or a dedicated inference endpoint via AWS SageMaker, Replicate or Modal.

Self-Hosted GPUs: The Hype Is Real, With Caveats

There is genuine momentum around running AI locally. Tools like Ollama, llama.cpp and LM Studio have made it accessible to run capable models on your own hardware without a cloud account. The reasons people are doing it are legitimate - data privacy, latency, cost at scale and simply not wanting a dependency on an API that can change pricing or go down.

Consumer GPUs

The RTX 4090 is the most discussed option for local AI. 24GB of VRAM, fast memory bandwidth and widely available. It runs 7B to 13B parameter models at good speeds and can handle 70B models quantized to 4-bit. At around $1,600-2,000 it pays for itself quickly if you are running significant inference volume that would otherwise go to an API.

The RTX 3090 and 3090 Ti are older but still capable - more affordable on the secondhand market and also carry 24GB VRAM.

AMD’s RX 7900 XTX has 24GB VRAM and is cheaper than the 4090, but ROCm support for popular AI tools is still inconsistent. Worth watching but not the default recommendation yet.

Apple Silicon

The M-series Macs are an underrated option for local AI. The unified memory architecture means the GPU and CPU share the same memory pool - an M2 Ultra with 192GB of RAM can run very large models that would require multiple enterprise GPUs to fit into VRAM on a traditional setup. Memory bandwidth is the constraint for LLM inference, and Apple’s architecture handles it efficiently.

llama.cpp has excellent Metal support. An M3 Max or M4 Max MacBook Pro is a genuinely capable inference machine for most local use cases - and it runs silently on a laptop.

On-Premise GPU Servers

For teams that want more than a consumer card, the options are:

Building your own: A workstation with one or two A100s or H100s is expensive upfront - A100 80GB cards run $10,000-15,000 each on the secondhand market, H100s more. But at serious inference volume, the economics can work out cheaper than cloud within 12-18 months.

Pre-built GPU servers: Lambda Labs, Bizon and Puget Systems sell workstations and rackmount servers configured for AI workloads. Less setup friction than building from scratch.

NVIDIA DGX systems: Purpose-built for enterprise AI. A DGX H100 ships with 8 H100s interconnected via NVLink. Expensive and overkill for most teams - relevant only if you are training large models in-house at significant scale.

The Software Stack

Hardware is only half of it. The tools that make local GPU inference practical:

Ollama: the easiest on-ramp. Pull a model, run it, get an OpenAI-compatible API locally. Works on Mac, Linux and Windows.
llama.cpp: lower level, more control, supports more quantization formats. What most tools are built on under the hood.
LM Studio: GUI for running local models, good for non-technical users or quick experimentation.
vLLM: production-grade inference server. If you are running a local model at scale with multiple concurrent users, vLLM handles batching and memory management properly.

When Self-Hosting Actually Makes Sense

The case for local GPU is strongest when: data cannot leave your network (healthcare, legal, finance), you are running high inference volume that makes API costs significant, or you need low latency without network round-trips.

The case against: upfront hardware cost, maintenance overhead, and cloud GPUs have gotten more accessible. Services like Replicate, Modal and RunPod let you spin up GPU instances on demand without managing hardware - useful middle ground between cloud APIs and full self-hosting.

Practical Takeaways

Web apps and APIs: ARM instances (Graviton on AWS, Tau T2A on GCP) are the default right answer today. Better price-to-performance than x86 for most stateless workloads.

AI inference at small to medium scale: A single A100 or H100 instance covers most production inference needs. For local or on-prem, an RTX 4090 is a cost-effective starting point.

AI training at scale: H100 cluster or TPU pods depending on your stack. If you’re on JAX or TensorFlow and training at serious scale, price out both.

Serverless workloads: Switch to the ARM runtime where available. Ignore the chip otherwise unless you’re doing on-function inference.

On-prem GPU: NVIDIA still dominates because of CUDA. AMD ROCm is maturing but ecosystem maturity matters more than raw specs when your team’s productivity is on the line.

On-device AI (mobile/desktop apps): Target the NPU. CoreML for Apple platforms, ONNX Runtime with QNN for Qualcomm, DirectML for Windows. If your workload needs low latency, privacy or offline capability, the NPU is the right layer - not a cloud API.

The decision framework is straightforward: CPU for general logic, GPU for parallel compute, TPU for large-scale tensor workloads on Google’s stack, NPU for on-device inference where latency or privacy matter. ARM over x86 whenever cost is a factor and your workload isn’t doing something x86-specific. That covers 90% of infrastructure decisions you’ll actually face.