Everyone benchmarks edge AI models by accuracy. Almost nobody benchmarks where the latency actually comes from. That gap is why most encoder selection decisions are wrong before the hardware even boots.
This is the research I published at ICCCN 2026 in Manchester. The finding changed how I think about model selection entirely.
The common assumption is that the language model decoder is the expensive part of a VLLM. It's not. The vision encoder is.
Across my benchmarks, the encoder accounted for 70 to 85 percent of Time-to-First-Token (TTFT). The LLM portion, which everyone obsesses over, was secondary. I'm calling this the Sensory Bottleneck Hypothesis: the image tokenisation stage, not the autoregressive decoder, determines the real-time viability of edge VLLM deployments.
The full pipeline breaks down like this:
TTFT = L_enc + L_proj + L_prefill(T)
Here L_enc is the encoder latency (the bottleneck, 1.6 ms to 98 ms), L_proj is the latency of the projector that aligns encoder embeddings with the LLM embedding space (15 to 80 ms), and L_prefill(T) is the LLM prefill latency over T, the visual token count passed to prefill.
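To make the decomposition concrete, here is a minimal sketch that sums the three terms and reports the encoder's share of TTFT. The class and field names are my own for illustration, and the prefill value in the example is a made-up placeholder rather than a measured number.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TTFTBreakdown:
    """Latency components of Time-to-First-Token, all in milliseconds."""
    enc_ms: float      # L_enc: vision encoder latency
    proj_ms: float     # L_proj: projector latency
    prefill_ms: float  # L_prefill(T): LLM prefill over T visual tokens

    @property
    def ttft_ms(self) -> float:
        """Total TTFT as the sum of the three stages."""
        return self.enc_ms + self.proj_ms + self.prefill_ms

    @property
    def encoder_share(self) -> float:
        """Fraction of TTFT spent in the vision encoder."""
        return self.enc_ms / self.ttft_ms


# Hypothetical inputs: enc_ms and proj_ms are endpoints of the ranges quoted
# in the text; prefill_ms is an invented placeholder, not a measurement.
example = TTFTBreakdown(enc_ms=98.0, proj_ms=15.0, prefill_ms=10.0)
print(f"TTFT = {example.ttft_ms:.1f} ms, encoder share = {example.encoder_share:.0%}")
```

Swapping in your own per-stage timings shows quickly whether the encoder or the LLM side deserves the optimisation effort.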
I ran 8 vision encoder architectures across 4 hardware platforms: Raspberry Pi 5, Raspberry Pi 5 with Google Coral TPU, NVIDIA Jetson Orin Nano, and iPhone 15 Pro. Every number below is averaged over 100 inference passes with a 10-pass warm-up, variance under 8%.
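The measurement protocol can be sketched roughly as follows. This is not the harness used for the paper: `benchmark` and `dummy_inference` are hypothetical names, the workload is a CPU stand-in, and I'm assuming the spread figure is read as a coefficient of variation (standard deviation divided by the mean).

```python
import statistics
import time
from typing import Callable


def benchmark(run_once: Callable[[], object], warmup: int = 10, passes: int = 100) -> dict:
    """Time `run_once` after a warm-up phase; report mean latency and spread."""
    for _ in range(warmup):
        run_once()  # warm caches and clocks; these results are discarded

    samples_ms = []
    for _ in range(passes):
        start = time.perf_counter()
        run_once()
        samples_ms.append((time.perf_counter() - start) * 1000.0)

    mean_ms = statistics.fmean(samples_ms)
    # Coefficient of variation: standard deviation relative to the mean.
    cv = statistics.stdev(samples_ms) / mean_ms if mean_ms > 0 else 0.0
    return {"mean_ms": mean_ms, "cv": cv, "passes": passes}


def dummy_inference() -> int:
    """Stand-in workload; replace with a real encoder forward pass."""
    return sum(i * i for i in range(10_000))


result = benchmark(dummy_inference)
print(f"{result['mean_ms']:.3f} ms mean, {result['cv']:.1%} spread over {result['passes']} passes")
```

On an accelerator, `run_once` would also need to block until the device has finished, otherwise the timer only captures the time to enqueue the work.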
| Backbone | Platform | Precision | Latency (ms) | Accuracy |
|---|---|---|---|---|
| EfficientNet-B0 | Raspberry Pi 5 | TFLite INT8 | 28-32 | 77.1% |
| EfficientNet-B3 | Jetson Orin Nano | TensorRT FP16 | 14.2 | 81.6% |
| EfficientNet-B4 | Jetson Orin Nano | TensorRT FP16 | 48.5 | 82.9% |
| MobileNetV3 | Legacy Pi 4 | TFLite INT8 | 42-56 | 75.2% |
| MobileViT-XS | iPhone 15 Pro | CoreML NPU | 7.2 | 78.9% |
| ViT-Base/16 | Jetson Orin Nano | TensorRT FP16 | 98.0 | 81.0% |
| FastViT-HD | Jetson Orin Nano | TensorRT FP16 | 18.5 | 82.2% |
| EfficientFormer-L1 | RPi 5 + Coral TPU | TFLite + NNAPI | 1.6 | 79.2% |
That EfficientFormer-L1 number is the one that stuck with me: 1.6 ms on a Coral TPU, 79.2% accuracy. ViT-Base gets 81.0% accuracy on a Jetson at 98 ms. You're paying roughly 61x the latency for 1.8 percentage points of accuracy. The chart below shows all 8 models plotted against the Pareto frontier:
EfficientNet-B3 and FastViT-HD sit on the Pareto frontier. EfficientNet-B4 falls off it entirely, which is where the Cache Cliff becomes visible.
The most important finding is what happens between EfficientNet-B3 and EfficientNet-B4 on the Jetson Orin Nano.
B3 runs at 14.2 ms. B4 runs at 48.5 ms. That's a 3.4x jump in latency for 1.3 percentage points of accuracy gain (81.6% to 82.9%).
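The arithmetic behind these trade-offs is easy to reproduce. The helper below is a small illustrative sketch (the function name is mine) that takes two (latency, accuracy) pairs from the figures quoted above and returns the latency multiple and the accuracy gain.

```python
def tradeoff(fast: tuple[float, float], slow: tuple[float, float]) -> tuple[float, float]:
    """Return (latency multiple, accuracy gain in points) going from fast to slow.

    Each argument is (latency_ms, accuracy_percent).
    """
    latency_multiple = slow[0] / fast[0]
    accuracy_gain = slow[1] - fast[1]
    return latency_multiple, accuracy_gain


# Figures quoted in the text above.
b3_to_b4 = tradeoff(fast=(14.2, 81.6), slow=(48.5, 82.9))
effformer_to_vit = tradeoff(fast=(1.6, 79.2), slow=(98.0, 81.0))

comparisons = [("B3 -> B4", b3_to_b4), ("EfficientFormer-L1 -> ViT-Base", effformer_to_vit)]
for name, (multiple, gain) in comparisons:
    print(f"{name}: {multiple:.1f}x latency for {gain:+.1f} points")
```

Note that the second comparison spans two different platforms, so it describes a deployment choice rather than a like-for-like architecture comparison.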
This is not a linear scaling penalty. It's a cliff. When the model's working set exceeds the L2/L3 cache on the NPU, you get a non-linear jump as the system spills to DRAM. I'm calling this the Cache Cliff phenomenon: the point where model size crosses the 2-8 MB L2/L3 cache constraint and latency jumps discontinuously.
The practical consequence is that you can't interpolate between benchmarks. You need to know where the cliff is for each specific hardware platform you're targeting.
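One way to look for the cliff on a given platform is to sweep working-set size and flag the first step where latency grows much faster than size. The sketch below assumes you already have such a sweep; `find_cliff`, the `factor` threshold, and the example numbers are all hypothetical and invented for illustration, not taken from my measurements.

```python
def find_cliff(sweep: list[tuple[float, float]], factor: float = 1.5):
    """Return the first adjacent pair of sizes where latency outpaces size.

    `sweep` holds (working_set_mb, latency_ms) points sorted by working set.
    A step is flagged when latency_ratio > factor * size_ratio.
    """
    for (size_a, lat_a), (size_b, lat_b) in zip(sweep, sweep[1:]):
        size_ratio = size_b / size_a
        latency_ratio = lat_b / lat_a
        if latency_ratio > factor * size_ratio:
            return (size_a, size_b)
    return None  # latency scaled roughly in proportion to size throughout


# Hypothetical sweep, invented for illustration only (not measured data).
sweep = [(1.0, 2.0), (2.0, 4.1), (4.0, 8.3), (8.0, 41.0), (16.0, 85.0)]
print(find_cliff(sweep))
```

In this made-up sweep the jump lands between 4 MB and 8 MB, so any model whose working set falls in that interval would need to be measured directly rather than interpolated.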
From the formal complexity analysis and Roofline modeling, I derived three principles that hold across all four platforms.
The Law of Resolution: Encoder latency scales quadratically with input resolution for CNNs and quartically for ViTs. Doubling resolution increases CNN latency by roughly 4x and ViT latency by roughly 16x. For OCR tasks, EfficientNet-B0 at 384px consistently outperforms EfficientNet-B4 at 224px because higher resolution preserves the spatial detail that downsampling destroys.
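As a first-order planning aid, the scaling rule can be written as a one-line extrapolation. This is only a sketch under the exponents stated above (2 for CNNs, 4 for ViTs); the function name and the 10 ms baseline are hypothetical, and the estimate ignores cache effects entirely.

```python
def scaled_latency(base_ms: float, base_res: int, new_res: int, family: str) -> float:
    """Extrapolate encoder latency to a new input resolution.

    Uses exponent 2 for CNNs and 4 for ViTs, as stated in the text.
    """
    exponent = {"cnn": 2, "vit": 4}[family]
    return base_ms * (new_res / base_res) ** exponent


# Doubling resolution from a hypothetical 10 ms baseline at 224 px.
print(scaled_latency(10.0, 224, 448, "cnn"))  # 4x the baseline
print(scaled_latency(10.0, 224, 448, "vit"))  # 16x the baseline
```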
The Law of Depth vs Width: On edge NPUs, inference latency is dominated by sequential memory access overhead, not arithmetic computation. Deep networks create compute unit starvation. ResNet-101 leaves processing units idle waiting for weight transfers from off-chip DRAM. Wide, shallow networks like MobileNetV3 and EfficientFormer maximise SIMD utilisation instead.
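The standard Roofline bound makes the memory-versus-compute distinction explicit: attainable throughput is the smaller of peak compute and memory bandwidth times arithmetic intensity. The sketch below uses that textbook formula; the accelerator figures are hypothetical placeholders, not the specifications of any platform in my benchmarks.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> float:
    """Roofline bound: throughput is capped by compute or by memory traffic.

    `intensity` is arithmetic intensity in FLOPs per byte moved.
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def is_memory_bound(peak_gflops: float, bandwidth_gbs: float, intensity: float) -> bool:
    """True when intensity is below the ridge point peak / bandwidth."""
    return intensity < peak_gflops / bandwidth_gbs


# Hypothetical accelerator: 1000 GFLOP/s peak, 50 GB/s DRAM bandwidth.
PEAK, BW = 1000.0, 50.0
for intensity in (5.0, 20.0, 80.0):
    bound = "memory" if is_memory_bound(PEAK, BW, intensity) else "compute"
    print(f"{intensity:5.1f} FLOP/B -> {attainable_gflops(PEAK, BW, intensity):7.1f} GFLOP/s ({bound}-bound)")
```

Layers that land left of the ridge point leave compute units idle no matter how fast the arithmetic is, which is the starvation effect described above.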
The Hybridization Principle: Pure ViTs have disproportionate latency on edge accelerators because of quadratic attention complexity and suboptimal operator support. The Coral TPU has to offload ViT-specific ops like Softmax and reshape to the CPU. Hybrid CNN-Transformer architectures use a convolutional stem for spatial reduction before any attention blocks, getting most of the accuracy with far better hardware utilisation. EfficientFormer-L1 achieves 79.2% at 1.6 ms versus ViT-Base at 81.0% at 98 ms.
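The effect of spatial reduction before attention can be seen from token counts alone. In this rough sketch, `attention_pairs` is a hypothetical helper, the stride-32 stem is an illustrative assumption rather than the configuration of any model named above, and the class token is ignored.

```python
def attention_pairs(image_px: int, stride: int) -> tuple[int, int]:
    """Return (token count, pairwise attention scores per head per layer).

    `stride` is the total spatial downsampling before attention: the patch
    size for a plain ViT, or the cumulative stride of a convolutional stem.
    """
    side = image_px // stride
    tokens = side * side
    return tokens, tokens * tokens


print(attention_pairs(224, 16))  # plain ViT with 16 px patches
print(attention_pairs(224, 32))  # hypothetical hybrid with a stride-32 stem
```

Halving the token grid side cuts the token count by 4x and the number of pairwise attention scores by 16x, which is why a convolutional stem pays off so heavily on edge hardware.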
Based on everything above, I built a hardware-aware selection matrix across three latency regimes:
| Regime | Target Platform | Recommended Encoder | Accuracy | Notes |
|---|---|---|---|---|
| Real-Time < 15ms | IoT / Raspberry Pi | MobileViT-XS, EfficientFormer-L1 | 78-79% | Only hybrid models are feasible here |
| Interactive < 50ms | Coral TPU, Jetson Nano | EfficientNet-B0 to B3, FastViT | 79-82% | Sweet spot for balanced performance |
| Batch < 200ms | Jetson Orin, Edge GPU | EfficientNet-B4+ | 82-83% | Watch the cache cliff beyond B3 |
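The matrix can also be expressed as a small lookup. This is an illustrative sketch, not the interactive selector on the project site: `REGIMES` and `recommend` are hypothetical names, and the bounds and encoder lists are copied from the table above.

```python
REGIMES = [
    # (regime name, upper latency bound in ms, recommended encoders)
    ("Real-Time", 15.0, ["MobileViT-XS", "EfficientFormer-L1"]),
    ("Interactive", 50.0, ["EfficientNet-B0 to B3", "FastViT"]),
    ("Batch", 200.0, ["EfficientNet-B4+"]),
]


def recommend(target_ms: float):
    """Classify a target latency into the tightest regime that contains it."""
    for name, bound_ms, encoders in REGIMES:
        if target_ms < bound_ms:
            return name, encoders
    return None  # 200 ms or more: outside the matrix


for target in (10.0, 30.0, 120.0):
    print(target, recommend(target))
```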
My first instinct was to rank encoders by accuracy and work backwards from there. That's the wrong frame. On edge hardware, you're not picking the most accurate model that fits your latency budget. You're picking the model that sits below the cache cliff on your specific hardware, and then checking whether its accuracy is sufficient.
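That two-step rule can be sketched as a filter. Several things here are assumptions of mine for illustration: the function name, the use of a latency threshold (`cliff_ms`) as a stand-in for the cache cliff, the 40 ms value, and the tie-break of preferring the fastest acceptable model. The candidate rows are the Jetson Orin Nano figures from the benchmark table.

```python
def select_encoder(candidates, cliff_ms: float, min_accuracy: float):
    """Apply the two-step rule: stay below the cliff, then check accuracy.

    `candidates` is a list of (name, latency_ms, accuracy_percent) measured on
    ONE target platform. `cliff_ms` is a latency threshold standing in for the
    cache cliff; locating it is platform-specific and assumed done elsewhere.
    """
    below_cliff = [c for c in candidates if c[1] < cliff_ms]
    good_enough = [c for c in below_cliff if c[2] >= min_accuracy]
    # Among acceptable models, prefer the fastest (an illustrative choice).
    return min(good_enough, key=lambda c: c[1], default=None)


# Jetson Orin Nano rows from the benchmark table.
jetson = [
    ("EfficientNet-B3", 14.2, 81.6),
    ("EfficientNet-B4", 48.5, 82.9),
    ("ViT-Base/16", 98.0, 81.0),
    ("FastViT-HD", 18.5, 82.2),
]
# cliff_ms=40.0 is a hypothetical threshold, not a measured cliff location.
print(select_encoder(jetson, cliff_ms=40.0, min_accuracy=82.0))
```

If no candidate below the cliff meets the accuracy bar, the function returns `None`, which is the signal to revisit the accuracy requirement or the hardware rather than to reach past the cliff.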
The "best" encoder is always relative to your cache topology, your resolution requirements, and whether you need real-time or batch throughput. ViT-Base is genuinely great on a server. On a Jetson Orin with a 15ms budget, it's unusable.
The paper has been accepted at ICCCN 2026 in Manchester. The project site has the interactive encoder selector, full benchmark tables, and the architecture taxonomy. If you're building anything that involves vision on constrained hardware, I think the cache cliff framing is worth understanding before you commit to an architecture.