May 2026 · 8 min read

The Cache Cliff: Why Edge AI Latency Isn't Linear

research edge-ai ml-systems icccn-2026


Everyone benchmarks edge AI models by accuracy. Almost nobody benchmarks where the latency actually comes from. That gap is why most encoder selection decisions are wrong before the hardware even boots.

This is the research I published at ICCCN 2026 in Manchester. The finding changed how I think about model selection entirely.

The Sensory Bottleneck

The common assumption is that the language model decoder is the expensive part of a VLLM. It's not. The vision encoder is.

Across my benchmarks, the encoder accounted for 70 to 85 percent of Time-to-First-Token (TTFT). The LLM portion, which everyone obsesses over, was secondary. I'm calling this the Sensory Bottleneck Hypothesis: the image tokenisation stage, not the autoregressive decoder, determines the real-time viability of edge VLLM deployments.

The full pipeline breaks down like this:

Input Processing (Resize · Normalize): 224px or 384px input; resolution sets encoder cost. L_pre ≈ trivial.

Vision Encoder (primary bottleneck): CNN / ViT / Hybrid; 576-1024 visual tokens; 70-85% of total TTFT. L_enc: 1.6 to 98 ms.

Projector (MLP or Q-Former): aligns encoder space with LLM token space; Q-Former compresses 16×. L_proj: 15 to 80 ms.

LLM Decoder (Prefill + Generation): autoregressive decode; prefill scales with T, the visual token count. L_prefill(T).

TTFT = L_enc + L_proj + L_prefill(T), where L_enc dominates.

Source: vllmarchitect.yashvardhan.dev · ICCCN 2026, Manchester UK

TTFT = L_enc + L_proj + L_prefill(T), where L_enc is the encoder latency (the bottleneck, 1.6 ms to 98 ms), L_proj is the latency of the projector that aligns encoder embeddings with the LLM token space (15 to 80 ms), and L_prefill(T) is the LLM prefill latency for the T visual tokens passed to it.
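A minimal sketch of this decomposition, assuming the three terms are measured separately and simply add. The class name and the prefill value are hypothetical; the encoder value is the ViT-Base figure and the projector value is the low end of the range quoted above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTFTBreakdown:
    l_enc_ms: float      # vision encoder latency
    l_proj_ms: float     # projector latency
    l_prefill_ms: float  # LLM prefill latency for T visual tokens

    @property
    def total_ms(self) -> float:
        # TTFT = L_enc + L_proj + L_prefill(T)
        return self.l_enc_ms + self.l_proj_ms + self.l_prefill_ms

    @property
    def encoder_share(self) -> float:
        # Fraction of TTFT spent in the vision encoder
        return self.l_enc_ms / self.total_ms


# l_prefill_ms is a hypothetical placeholder, not a measured value.
example = TTFTBreakdown(l_enc_ms=98.0, l_proj_ms=15.0, l_prefill_ms=10.0)
print(f"TTFT = {example.total_ms:.1f} ms, encoder share = {example.encoder_share:.0%}")
```

Tracking the share rather than the absolute latency makes it easy to see whether optimising the decoder can move TTFT at all on a given platform.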

What I Benchmarked

I ran 8 vision encoder architectures across 4 hardware platforms: Raspberry Pi 5, Raspberry Pi 5 with Google Coral TPU, NVIDIA Jetson Orin Nano, and iPhone 15 Pro. Every number below is averaged over 100 inference passes after a 10-pass warm-up, with variance under 8%.
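The measurement protocol can be sketched roughly as follows. This is not the harness used for the paper: `benchmark` and `dummy_inference` are hypothetical names, and reading "variance under 8%" as a coefficient of variation (standard deviation divided by mean) is an assumption.

```python
import statistics
import time
from typing import Callable, Tuple


def benchmark(run_once: Callable[[], None], warmup: int = 10, passes: int = 100) -> Tuple[float, float]:
    """Return (mean latency in ms, coefficient of variation) over the timed passes."""
    for _ in range(warmup):
        run_once()  # warm-up passes are executed but not timed

    samples_ms = []
    for _ in range(passes):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.fmean(samples_ms)
    cv = statistics.stdev(samples_ms) / mean_ms if mean_ms > 0 else 0.0
    return mean_ms, cv


def dummy_inference() -> None:
    # Hypothetical stand-in for a real encoder forward pass
    sum(i * i for i in range(10_000))


mean_ms, cv = benchmark(dummy_inference)
print(f"mean = {mean_ms:.3f} ms, spread = {cv:.1%}")
```

On real accelerators the call inside the timed loop would also need to wait for the device to finish, which this CPU-only sketch does not cover.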

Backbone	Platform	Precision	Latency (ms)	Accuracy
EfficientNet-B0	Raspberry Pi 5	TFLite INT8	28-32	77.1%
EfficientNet-B3	Jetson Orin Nano	TensorRT FP16	14.2	81.6%
EfficientNet-B4	Jetson Orin Nano	TensorRT FP16	48.5	82.9%
MobileNetV3	Legacy Pi 4	TFLite INT8	42-56	75.2%
MobileViT-XS	iPhone 15 Pro	CoreML NPU	7.2	78.9%
ViT-Base/16	Jetson Orin Nano	TensorRT FP16	98.0	81.0%
FastViT-HD	Jetson Orin Nano	TensorRT FP16	18.5	82.2%
EfficientFormer-L1	RPi 5 + Coral TPU	TFLite + NNAPI	1.6	79.2%

That EfficientFormer-L1 number is the one that stuck with me: 1.6 ms on a Coral TPU, 79.2% accuracy. ViT-Base gets 81.0% accuracy on a Jetson at 98 ms. You're trading roughly 60x the latency for 1.8 percentage points. The chart below plots all 8 models against the Pareto frontier:

EfficientNet-B3 and FastViT-HD sit on the Pareto frontier. EfficientNet-B4 falls off it entirely, which is where the Cache Cliff becomes visible.

The Cache Cliff

The most important finding is what happens between EfficientNet-B3 and EfficientNet-B4 on the Jetson Orin Nano.

B3 runs at 14.2 ms. B4 runs at 48.5 ms. That's a 3.4x jump in latency for 1.3 percentage points of accuracy gain (81.6% to 82.9%).
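The arithmetic behind that comparison, as a small sketch. The `tradeoff` helper and the "extra milliseconds per accuracy point" metric are illustrative additions; the inputs are the B3 and B4 rows from the table.

```python
from typing import Tuple


def tradeoff(fast_ms: float, fast_acc: float, slow_ms: float, slow_acc: float) -> Tuple[float, float, float]:
    """Return (latency ratio, accuracy gain in points, extra ms per point gained)."""
    ratio = slow_ms / fast_ms
    gain = slow_acc - fast_acc
    ms_per_point = (slow_ms - fast_ms) / gain
    return ratio, gain, ms_per_point


# EfficientNet-B3 vs EfficientNet-B4 on the Jetson Orin Nano
ratio, gain, cost = tradeoff(14.2, 81.6, 48.5, 82.9)
print(f"{ratio:.1f}x latency for {gain:.1f} points ({cost:.1f} ms per point)")
```

Expressed this way, each additional accuracy point between B3 and B4 costs roughly 26 ms, which is more than B3's entire latency.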

This is not a linear scaling penalty. It's a cliff. When the model's working set exceeds the L2/L3 cache on the NPU, you get a non-linear jump as the system spills to DRAM. I'm calling this the Cache Cliff phenomenon: the point where model size crosses the 2-8 MB L2/L3 cache constraint and latency jumps discontinuously.

The practical consequence is that you can't interpolate between benchmarks. You need to know where the cliff is for each specific hardware platform you're targeting.
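One way to look for a cliff in your own measurements is to flag any step where latency grows much faster than the working set. The sketch below is a simple heuristic, not the analysis from the paper: `find_cliffs`, the 1.25 threshold, and the sweep values are all hypothetical.

```python
from typing import List, Tuple


def find_cliffs(sweep: List[Tuple[float, float]], threshold: float = 1.25) -> List[int]:
    """Return indices i where latency outgrows size between points i-1 and i.

    sweep holds (working_set_mb, latency_ms) pairs sorted by working set.
    A step is flagged when latency_ratio / size_ratio exceeds threshold.
    """
    cliffs = []
    for i in range(1, len(sweep)):
        (size_prev, lat_prev), (size_cur, lat_cur) = sweep[i - 1], sweep[i]
        size_ratio = size_cur / size_prev
        latency_ratio = lat_cur / lat_prev
        if latency_ratio / size_ratio > threshold:
            cliffs.append(i)
    return cliffs


# Synthetic sweep for illustration only; these are not measurements.
synthetic = [(1.0, 2.0), (2.0, 4.1), (4.0, 8.3), (8.0, 40.0), (16.0, 82.0)]
print(find_cliffs(synthetic))
```

In the synthetic data the only flagged step is the one from 4 MB to 8 MB, where latency rises almost 5x for a 2x larger working set. Scaling is close to linear on both sides of it, which is why interpolating across the cliff gives the wrong answer.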

Three Governing Design Principles

From the formal complexity analysis and Roofline modeling, I derived three principles that hold across all four platforms.

The Law of Resolution: Encoder latency scales quadratically with input resolution for CNNs and quartically for ViTs. Doubling resolution increases CNN latency by roughly 4x and ViT latency by roughly 16x. For OCR tasks, EfficientNet-B0 at 384px consistently outperforms EfficientNet-B4 at 224px because higher resolution preserves the spatial detail that downsampling destroys.
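The scaling rule can be written as a one-line estimate. This is a sketch of the asymptotic rule only, assuming the dominant cost follows the stated exponent; real models also contain layers that scale more gently, so treat the output as a rough upper-end guide. The function and constant names are illustrative.

```python
def scaled_latency(base_ms: float, base_res: int, new_res: int, exponent: int) -> float:
    """Scale a measured latency by (new_res / base_res) ** exponent."""
    return base_ms * (new_res / base_res) ** exponent


CNN_EXPONENT = 2  # cost tracks pixel count
VIT_EXPONENT = 4  # attention cost tracks the square of the token count

for name, exponent in (("CNN", CNN_EXPONENT), ("ViT", VIT_EXPONENT)):
    factor = scaled_latency(1.0, 224, 448, exponent)
    print(f"{name}: doubling resolution multiplies latency by {factor:.0f}x")
```

Applied to the two resolutions used in this post, the same rule gives roughly 2.9x for a CNN and roughly 8.6x for a ViT when moving from 224px to 384px.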

The Law of Depth vs Width: On edge NPUs, inference latency is dominated by sequential memory access overhead, not arithmetic computation. Deep networks create compute unit starvation. ResNet-101 leaves processing units idle waiting for weight transfers from off-chip DRAM. Wide, shallow networks like MobileNetV3 and EfficientFormer maximise SIMD utilisation instead.
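The Roofline model makes this concrete: attainable throughput is the smaller of the compute roof and the memory roof. The sketch below uses the standard Roofline bound; the accelerator figures (1000 GFLOP/s peak, 50 GB/s bandwidth) are hypothetical and do not describe any platform in the benchmark.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: min(compute roof, bandwidth * arithmetic intensity).

    intensity is arithmetic intensity in FLOPs per byte moved from memory.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> bool:
    """True when the memory roof sits below the compute roof."""
    return bandwidth_gbs * intensity < peak_gflops


# Hypothetical accelerator, for illustration only
PEAK_GFLOPS, BANDWIDTH_GBS = 1000.0, 50.0

for intensity in (2.0, 20.0, 200.0):
    bound = attainable_gflops(PEAK_GFLOPS, BANDWIDTH_GBS, intensity)
    label = "memory-bound" if is_memory_bound(PEAK_GFLOPS, BANDWIDTH_GBS, intensity) else "compute-bound"
    print(f"{intensity:>5.0f} FLOP/byte -> {bound:>6.0f} GFLOP/s ({label})")
```

A layer at 2 FLOP/byte can use only a tenth of this hypothetical chip's peak, however many arithmetic units it has. That is the starvation effect described above.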

The Hybridization Principle: Pure ViTs have disproportionate latency on edge accelerators because of quadratic attention complexity and suboptimal operator support. The Coral TPU has to offload ViT-specific ops like Softmax and reshape to the CPU. Hybrid CNN-Transformer architectures use a convolutional stem for spatial reduction before any attention blocks, getting most of the accuracy with far better hardware utilisation. EfficientFormer-L1 achieves 79.2% at 1.6 ms versus ViT-Base at 81.0% at 98 ms.
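To see why spatial reduction before attention matters, count tokens and token pairs. This sketch assumes a square input and non-overlapping patches. The stride of 32 for the hybrid case is a hypothetical value chosen for illustration, not the configuration of EfficientFormer or any other model in the benchmark.

```python
def visual_tokens(resolution: int, stride: int) -> int:
    """Tokens produced for a square input at the given total downsampling stride."""
    side = resolution // stride
    return side * side


def attention_pairs(tokens: int) -> int:
    """Token-to-token interactions in one self-attention layer (quadratic in tokens)."""
    return tokens * tokens


plain = visual_tokens(384, 16)   # ViT-style 16x16 patches
hybrid = visual_tokens(384, 32)  # hypothetical conv stem with total stride 32

print(f"plain ViT: {plain} tokens, {attention_pairs(plain)} pairs per layer")
print(f"hybrid:    {hybrid} tokens, {attention_pairs(hybrid)} pairs per layer")
print(f"attention work ratio: {attention_pairs(plain) // attention_pairs(hybrid)}x")
```

Halving the spatial side length before attention cuts the token count by 4x and the pairwise attention work by 16x.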

The Encoder Selection Matrix

Based on everything above, I built a hardware-aware selection matrix across three latency regimes:

Regime	Target Platform	Recommended Encoder	Accuracy	Notes
Real-Time < 15ms	IoT / Raspberry Pi	MobileViT-XS, EfficientFormer-L1	78-79%	Only hybrid models are feasible here
Interactive < 50ms	Coral TPU, Jetson Nano	EfficientNet-B0 to B3, FastViT	79-82%	Sweet spot for balanced performance
Batch < 200ms	Jetson Orin, Edge GPU	EfficientNet-B4+	82-83%	Watch the cache cliff beyond B3
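The matrix is small enough to encode directly. The sketch below is a hypothetical encoding, not the logic behind the live tool. It reads each regime's limit as a hard ceiling and returns the loosest regime whose ceiling still fits the budget, which is a deliberately conservative interpretation.

```python
from typing import NamedTuple, Optional, Tuple


class Regime(NamedTuple):
    name: str
    ceiling_ms: float
    encoders: Tuple[str, ...]


# Rows copied from the selection matrix, ordered from tightest to loosest
MATRIX = (
    Regime("Real-Time", 15.0, ("MobileViT-XS", "EfficientFormer-L1")),
    Regime("Interactive", 50.0, ("EfficientNet-B0 to B3", "FastViT")),
    Regime("Batch", 200.0, ("EfficientNet-B4+",)),
)


def regime_for(budget_ms: float) -> Optional[Regime]:
    """Return the loosest regime whose latency ceiling fits within budget_ms."""
    chosen = None
    for regime in MATRIX:
        if regime.ceiling_ms <= budget_ms:
            chosen = regime
    return chosen


for budget in (10.0, 30.0, 60.0, 250.0):
    match = regime_for(budget)
    print(budget, "->", match.name if match else "no regime fits")
```

A 30 ms budget maps to the Real-Time row rather than Interactive, because Interactive encoders are only guaranteed to stay under 50 ms.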

The tool I built to navigate this matrix interactively is live at vllmarchitect.yashvardhan.dev. Select your hardware platform and latency budget, and it returns a specific encoder recommendation with estimated latency.

What I Got Wrong Initially

My first instinct was to rank encoders by accuracy and work backwards from there. That's the wrong frame. On edge hardware, you're not picking the most accurate model that fits your latency budget. You're picking the model that sits below the cache cliff on your specific hardware, and then checking whether its accuracy is sufficient.

The "best" encoder is always relative to your cache topology, your resolution requirements, and whether you need real-time or batch throughput. ViT-Base is genuinely great on a server. On a Jetson Orin with a 15ms budget, it's unusable.
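That ordering of constraints (latency first, accuracy second) can be sketched as a filter. As a simplification, the latency budget stands in here for "below the cliff", which still has to be located per platform. `select_encoder` is a hypothetical helper; the candidates are the Jetson Orin Nano rows from the benchmark table.

```python
from typing import List, NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    latency_ms: float
    accuracy: float


# Jetson Orin Nano rows (TensorRT FP16) from the benchmark table
JETSON = [
    Candidate("EfficientNet-B3", 14.2, 81.6),
    Candidate("EfficientNet-B4", 48.5, 82.9),
    Candidate("ViT-Base/16", 98.0, 81.0),
    Candidate("FastViT-HD", 18.5, 82.2),
]


def select_encoder(candidates: List[Candidate], budget_ms: float, min_accuracy: float) -> Optional[Candidate]:
    """Filter by latency budget first, then by accuracy; prefer the fastest survivor."""
    feasible = [c for c in candidates if c.latency_ms <= budget_ms]
    adequate = [c for c in feasible if c.accuracy >= min_accuracy]
    return min(adequate, key=lambda c: c.latency_ms, default=None)


print(select_encoder(JETSON, budget_ms=15.0, min_accuracy=80.0))
print(select_encoder(JETSON, budget_ms=15.0, min_accuracy=82.0))
```

With a 15 ms budget only EfficientNet-B3 survives the first filter, so an 82% accuracy requirement returns nothing. In that case the decision is to relax the budget or the requirement; ViT-Base never becomes an option.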

What's Next

The paper is accepted at ICCCN 2026 in Manchester. The project site has the interactive encoder selector, full benchmark tables, and the architecture taxonomy. If you're building anything that involves vision on constrained hardware, I think the cache cliff framing is worth understanding before you commit to an architecture.
