On-Device AI Inference: Why the Next Shift Is Already in Your Pocket
The assumption that AI requires a cloud API is becoming less true every year. The hardware in modern phones and laptops is capable of running real models locally - and the tooling has caught up to the point where getting started is not a research project anymore.
This is worth paying attention to. Not because cloud AI is going away, but because on-device inference changes what you can build, who you can build it for and what you can charge.
The Hardware Doing the Work
Every major chip vendor now ships dedicated AI acceleration hardware. Increasingly, this work is not handled by the CPU or GPU alone - modern chips include separate units purpose-built for tensor operations.
Apple Neural Engine launched in the A11 Bionic in 2017 - the same chip that introduced Face ID. It started at 600 billion operations per second. The Neural Engine in M4 chips runs at 38 trillion operations per second. Every iPhone since the X has one, third-party developers have been able to target it since the A12 in the XS, and every Mac with Apple Silicon includes one. When Siri processes a request, when Photos identifies a face, when Live Text reads text from a camera frame - that is the Neural Engine, not the CPU.
Qualcomm Hexagon NPU powers Android flagship devices and, more recently, Windows laptops via the Snapdragon X Elite. The Snapdragon 8 Gen 3 in current Android flagships ships a Hexagon NPU rated in the tens of TOPS (tera operations per second). The Snapdragon X Elite in Windows ARM laptops is a significant step up - 45 TOPS of dedicated NPU alongside competitive CPU and GPU performance. This is what makes Windows on ARM a serious platform now rather than a novelty.
MediaTek APU ships in a large portion of mid-range Android devices. Less discussed than Qualcomm and Apple but covers a significant share of the global device market.
The pattern is consistent: dedicated silicon for AI inference, separate from the CPU and GPU, designed to run matrix operations efficiently at low power draw.
Quantization: How Large Models Fit on Small Hardware
A 7B parameter model in full 32-bit float precision takes roughly 28GB of memory. That does not fit on a phone. Quantization is the technique that makes on-device inference practical.
Quantization reduces the numerical precision of model weights - from 32-bit floats to 8-bit integers, or 4-bit integers. A 7B model at 4-bit quantization fits in around 4GB. Quality degrades, but less than you might expect. For most practical tasks - summarization, classification, translation, code completion - a well-quantized 7B model is competitive with much larger cloud-hosted models.
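The arithmetic above is worth making explicit. A minimal back-of-the-envelope sketch (the function name is mine, not from any library) that reproduces the 28GB and ~4GB figures:

```python
def model_memory_gb(params_billions: float, bits_per_weight: int) -> float:
    """Approximate weight memory: parameter count x bits per weight, in GB.

    This counts only the weights. Runtime overhead (KV cache, activation
    buffers) adds more in practice, so treat these as lower bounds.
    """
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at different precisions:
# model_memory_gb(7, 32) -> 28.0  (full float32: server territory)
# model_memory_gb(7, 8)  -> 7.0   (int8: fits on a laptop)
# model_memory_gb(7, 4)  -> 3.5   (int4: fits on a modern phone,
#                                  ~4GB once overhead is included)
```

Halving the bits roughly halves the memory, which is why the jump from 32-bit to 4-bit turns a server-class model into a phone-class one.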
The GGUF format (used by llama.cpp) is the most common format for quantized models. Hugging Face hosts thousands of pre-quantized models ready to run locally. You do not need to quantize models yourself to get started.
What Actually Runs Well On-Device
Be realistic about what fits where.
On current phones (8-12GB RAM): 1B to 3B parameter models run well. Apple’s own on-device models for Apple Intelligence are in this range. Google Gemini Nano on Pixel devices is around 1.8B parameters. These handle summarization, short-form generation, classification and translation competently.
On Apple Silicon Macs: 7B models run fast. 13B models are usable. An M3 Max or M4 Max with 48-64GB of unified memory can run 34B models at reasonable speeds. The unified memory architecture (CPU and GPU share the same pool) is the key advantage here - you are not constrained by VRAM.
Tasks that work well on-device: Translation, summarization, text classification, OCR and document parsing, code completion, image classification and object detection.
Tasks that still benefit from cloud: Complex multi-step reasoning, very long context windows, state-of-the-art coding tasks, anything requiring the latest frontier models.
How to Get Started as a Developer
The right starting point depends on what you’re building. Desktop, mobile and web all have distinct paths.
Desktop: Ollama
The fastest path to running a model locally. Install Ollama, pull a model and you have an OpenAI-compatible API running on localhost.
brew install ollama
ollama pull llama3.2
ollama run llama3.2
The API is compatible with the OpenAI SDK - swap the base URL and most existing code works without changes:
from openai import OpenAI
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Summarize this document..."}],
)
On Apple Silicon, Ollama uses Metal for GPU acceleration with no configuration needed. Note that this runs on the GPU, not the Neural Engine - the Neural Engine is only reachable through Core ML.
Desktop: llama.cpp
Lower level than Ollama, more control. Useful when you need to tune quantization levels, run headless on a server or integrate inference directly into a C/C++ application. llama.cpp is what most tools are built on under the hood.
The Python bindings via llama-cpp-python are straightforward:
pip install llama-cpp-python
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",
)
output = llm("What is on-device inference?", max_tokens=256)
Pass n_gpu_layers=-1 to offload all layers to the GPU on supported hardware (Metal on Apple Silicon, CUDA on NVIDIA).
Native Mobile: iOS with Core ML
For shipping on iOS, macOS, watchOS or tvOS, Core ML is the native path. Models run on the Neural Engine automatically - Apple’s runtime handles scheduling between CPU, GPU and Neural Engine based on the workload.
The workflow: convert a model to Core ML format using coremltools, then load it via the Core ML API in Swift or Objective-C.
import coremltools as ct
import torch

# Convert a PyTorch model to Core ML
model = YourPyTorchModel()
model.eval()  # switch to inference mode before tracing
traced = torch.jit.trace(model, example_input)
mlmodel = ct.convert(traced, inputs=[ct.TensorType(shape=example_input.shape)])
mlmodel.save("YourModel.mlpackage")
Apple also publishes pre-converted models for common tasks (text classification, image recognition, depth estimation) via their Core ML Models page - worth checking before converting your own.
Native Mobile: Android
Android has a few options depending on your stack. Google ML Kit is the easiest entry point - it wraps common tasks (translation, smart reply, barcode scanning, text recognition) behind a simple API without requiring you to manage models directly. For more control, ONNX Runtime Mobile and TensorFlow Lite (now LiteRT) both run on-device and can target the Hexagon NPU on Snapdragon devices.
For React Native, react-native-executorch brings Meta’s ExecuTorch runtime to RN apps. Still early but the direction is clear - Meta wants PyTorch models running natively on mobile without a server.
Web Apps: transformers.js and WebGPU
On-device inference in the browser is more capable than most developers expect.
transformers.js by Hugging Face runs models entirely in the browser via WebAssembly - no server, no API key, works fully offline. The API mirrors the Python transformers library:
import { pipeline } from '@huggingface/transformers';
const classifier = await pipeline('sentiment-analysis');
const result = await classifier('This is surprisingly fast.');
WebGPU (now shipping in Chrome and Safari) gives browser code direct GPU access, pushing inference speeds close to native for many tasks. For common use cases like object detection, pose estimation and text classification, MediaPipe’s web runtime is a solid option with good cross-browser support.
The main constraint with web is model download size. Quantized models in the 100-400MB range are practical. Anything larger starts to hurt the user experience on first load - though caching via the Cache API helps on repeat visits.
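To make the first-load concern concrete, transfer time is just size over bandwidth. A quick sketch (function name and example figures are mine):

```python
def first_load_seconds(model_mb: float, bandwidth_mbps: float) -> float:
    """Estimated transfer time: size in megabits / bandwidth in Mbit/s.

    Ignores latency, connection ramp-up and decompression, so real
    first loads will be somewhat slower.
    """
    return model_mb * 8 / bandwidth_mbps

# first_load_seconds(100, 50) -> 16.0   (tolerable with a progress bar)
# first_load_seconds(400, 50) -> 64.0   (pushing it for a first visit)
```

A 100MB model on a 50 Mbit/s connection is roughly a 16-second first load - workable with a progress indicator, and effectively free on repeat visits once cached.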
Cross-Platform: ONNX Runtime
If you’re targeting desktop, mobile and web from a single model, ONNX Runtime is the most practical cross-platform option. Convert your model to ONNX format and the runtime handles execution across CPU, GPU and NPU on Windows, Linux, macOS, iOS and Android. On Windows with Snapdragon X Elite, the QNN execution provider routes inference to the Hexagon NPU automatically.
ExecuTorch (Meta)
Worth calling out separately: ExecuTorch is Meta’s end-to-end framework for taking a PyTorch model from training to on-device deployment. Covers the full pipeline - export, optimize, quantize and deploy to iOS, Android or embedded targets. Still maturing but the most coherent story for teams already deep in PyTorch.
Privacy Is a First-Class Feature
On-device inference means data never leaves the device. For healthcare apps processing patient notes, legal tools analyzing contracts, finance apps reading transaction data or any personal productivity tool - this is not just a nice-to-have. It removes a category of compliance risk entirely.
This is one of the strongest arguments for on-device over cloud API for certain product categories. An AI feature that processes sensitive data locally is a fundamentally different product than one that sends that data to a third-party API.
Where This Is Heading
Models are getting smaller faster than most expect. Apple Intelligence runs capable generative AI on-device on a three-year-old iPhone. Google Gemini Nano handles real tasks on mid-range Android phones. The trajectory points toward most everyday AI tasks - summarization, translation, classification, local search - moving on-device within the next two to three years.
The cloud is not going away for complex reasoning, training and frontier model access. But the assumption that AI requires a network call is already outdated for a growing set of use cases.
If you’re building products that handle user data, operate in low-connectivity environments or need sub-100ms response times - on-device inference is worth evaluating now, not later.