
Understanding On-Device AI Accuracy: Why the Same Model Performs Differently


You've spent weeks training your neural network. The validation metrics look great - 93% accuracy on your test set. You quantize the model, optimize it for mobile deployment, and ship it to production. Then the bug reports start rolling in: users on certain devices are seeing terrible results, while others work perfectly. A quick investigation reveals the shocking truth: your model's accuracy has dropped to 71% on some chipsets, even though every device is running the exact same quantized weights.

This isn't a hypothetical scenario. It's a real problem that catches many ML practitioners off guard when they move from cloud deployment to edge devices. The assumption that a model file will behave identically across different hardware is one of the most expensive mistakes in on-device AI deployment.

In this post, we'll break down why identical AI models can show dramatically different accuracy across hardware platforms, what factors affect on-device inference, and most importantly, what you need to test before deploying your model to production.

The Quantization Reality Check

When you quantize a model from FP32 (32-bit floating point) to INT8 (8-bit integer), you're making a trade-off: smaller model size and faster inference in exchange for some precision loss. In theory, this precision loss should be consistent - if you lose 2% accuracy during quantization, you should see that same 2% drop everywhere.
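Concretely, INT8 quantization maps each float to an integer via a scale and a zero-point, and the round trip loses at most half a scale step of precision per value. A minimal pure-Python sketch (the scale and zero-point below are made-up calibration results, for illustration only):

```python
def quantize(x, scale, zero_point):
    # q = round(x / scale) + zero_point, clamped to the INT8 range.
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    # Map the integer code back to the float it approximates.
    return (q - zero_point) * scale

scale, zero_point = 0.05, 0  # hypothetical calibration results
x = 1.337
q = quantize(x, scale, zero_point)
print(q, dequantize(q, scale, zero_point))  # → 27 1.35: off by 0.013, under scale/2
```

In theory that small, bounded per-value error is the whole story; in practice, different hardware performs the rounding and accumulation steps differently.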

But that's not what happens in practice.

The problem starts with how different hardware handles quantized operations. Modern mobile chipsets include specialized AI accelerators - Qualcomm's Hexagon DSP, Apple's Neural Engine, MediaTek's APU, and others. Each of these accelerators implements quantized operations slightly differently:

  • Rounding strategies: Some chips round to nearest, others truncate. This affects every single operation in your network.
  • Accumulation precision: During convolution operations, intermediate values are accumulated. Some accelerators use 32-bit accumulators, others use 16-bit, and the difference compounds across layers.
  • Activation function implementations: A ReLU6 or sigmoid might be implemented as a lookup table on one chip and calculated on another, leading to slightly different outputs.

These small differences wouldn't matter much if they occurred in isolation. But in a deep neural network with dozens or hundreds of layers, these tiny variations compound. By the time you reach the final output layer, what started as a 0.01% difference per operation has snowballed into a 20%+ accuracy drop.
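As a back-of-envelope check, the worst-case compounding is geometric: if each of n sequential operations can contribute a relative error of roughly 0.01%, the accumulated worst case is (1 + 0.0001)^n - 1. The per-op figure is illustrative, not measured:

```python
# Worst-case compounding of a tiny per-operation relative error.
per_op_error = 1e-4  # 0.01% per operation, purely illustrative
for depth in (50, 200, 1000):
    worst_case = (1 + per_op_error) ** depth - 1
    print(f"{depth} ops -> {worst_case:.2%} worst-case drift")
```

In practice errors partially cancel rather than all pushing the same way, but activations sitting near a decision boundary only need a small nudge to flip the final prediction.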

The Hardware-Specific Optimization Problem

Here's where things get even more complicated: not all quantization schemes are created equal, and different hardware prefers different approaches.

Symmetric vs. Asymmetric Quantization

Symmetric quantization assumes your weights and activations are centered around zero. It's simpler and faster, but if your actual data distribution is skewed, you're wasting precious bits. Asymmetric quantization handles arbitrary ranges better but requires more computation.
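The difference shows up directly in the quantization parameters. A sketch using the common convention of a signed INT8 range and a ReLU6-style activation range of [0, 6]; exact conventions vary by toolchain:

```python
def symmetric_params(lo, hi):
    # Symmetric: zero_point is fixed at 0 and the range is forced to [-m, m].
    m = max(abs(lo), abs(hi))
    return m / 127.0, 0

def asymmetric_params(lo, hi):
    # Asymmetric: the scale covers the actual [lo, hi] range, and a
    # zero_point shifts it so that real 0.0 maps exactly to an integer code.
    scale = (hi - lo) / 255.0
    zero_point = round(-lo / scale) - 128
    return scale, zero_point

# Post-ReLU6 activations never go negative, so the range is skewed.
print(symmetric_params(0.0, 6.0))   # half the INT8 codes cover values that never occur
print(asymmetric_params(0.0, 6.0))  # twice the resolution over the real range
```

With the symmetric scheme the step size is 6/127 ≈ 0.047; the asymmetric scheme halves it to 6/255 ≈ 0.024, at the cost of carrying a zero-point through every operation.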

A Snapdragon 8 Gen 1 might handle asymmetric quantization efficiently, while an older Snapdragon 865 might see better results with symmetric quantization. Use the wrong scheme for your target hardware, and you'll see accuracy degradation even though the model architecture is identical.

Per-Channel vs. Per-Tensor Quantization

When quantizing weights, you can use a single scale factor for an entire tensor (per-tensor) or different scale factors for each output channel (per-channel). Per-channel quantization preserves more information and typically gives better accuracy, but not all hardware supports it efficiently.
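The effect is easiest to see with a toy weight tensor whose output channels have very different ranges (the values below are invented for illustration):

```python
# Three output channels of a hypothetical conv layer's weights.
channels = [
    [0.9, -1.1, 0.5],     # moderate range
    [0.05, -0.04, 0.02],  # tiny range
    [10.0, -8.0, 6.0],    # large range
]

def scale_for(values):
    # Symmetric scale: one INT8 step per max(|v|) / 127.
    return max(abs(v) for v in values) / 127.0

per_tensor = scale_for([v for ch in channels for v in ch])
per_channel = [scale_for(ch) for ch in channels]

print(per_tensor)    # one coarse scale, dominated by the largest channel
print(per_channel)   # each channel gets a scale matched to its own range
```

With the single per-tensor scale (10/127 ≈ 0.079), every weight in the second channel quantizes to 0 or ±1 and its information is essentially destroyed; per-channel, that channel keeps roughly the full 8 bits of resolution.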

If you quantize per-channel and deploy to hardware that only accelerates per-tensor operations, the runtime might fall back to a slower, less accurate execution path. Your model runs, but not the way you intended.

The Software Stack Matters Too

Even with identical hardware, different software stacks can produce different results. Consider these scenarios:

Runtime Versions

TensorFlow Lite 2.12 might implement quantized convolution differently than version 2.8. If you quantized your model with one version but your app uses another, you might see accuracy differences. This becomes especially tricky when you're supporting multiple Android versions, each potentially using different TFLite runtime versions.

Delegate Selection

Most mobile ML frameworks use "delegates" to route operations to specialized hardware. But delegate selection isn't always straightforward:

  • The NNAPI delegate on Android might route some operations to the GPU, others to the DSP, and fall back to CPU for unsupported ops.
  • Each transition between hardware units introduces quantization/dequantization overhead and potential precision loss.
  • A model that runs entirely on the DSP might be more accurate than one split between DSP and GPU, even though both are "accelerated."
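The precision cost of those transitions comes from the tensor being dequantized and requantized with each unit's own parameters. A pure-Python sketch, with made-up quantization parameters standing in for the DSP and GPU partitions:

```python
def quant(x, scale, zero_point):
    return max(-128, min(127, round(x / scale) + zero_point))

def dequant(q, scale, zero_point):
    return (q - zero_point) * scale

# Hypothetical: the two partitions quantize the same tensor differently.
dsp = (0.047, 0)
gpu = (0.051, -2)

x = 2.0
q1 = quant(x, *dsp)      # quantized inside the DSP partition
x1 = dequant(q1, *dsp)   # dequantized at the hand-off
q2 = quant(x1, *gpu)     # requantized for the GPU partition
x2 = dequant(q2, *gpu)
print(x1, x2)  # each boundary crossing adds its own rounding error
```

Here the first rounding costs about 0.021 of error and the hand-off grows it to about 0.04; a model split across three hardware units pays this toll at every boundary.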

Operator Coverage

Not every operation in your model is necessarily supported by the hardware accelerator. When the runtime encounters an unsupported operation, it has to fall back to CPU execution. This creates a mixed-precision scenario: some layers run in INT8 on the accelerator, others run in FP32 on the CPU, with conversions happening at the boundaries.

These conversions are expensive and introduce additional quantization error. A model with 95% operator coverage on the accelerator might perform worse than a simpler model with 100% coverage.

Real-World Deployment Variations

Let's look at a concrete example: deploying a MobileNetV2 image classifier quantized to INT8.

Snapdragon 888 (2021):

  • Hexagon 780 DSP with improved INT8 support
  • Full per-channel quantization support
  • Observed accuracy: 91-93%

Snapdragon 765G (2020):

  • Hexagon 696 DSP with basic INT8 support
  • Limited per-channel support, falls back to per-tensor for some layers
  • Observed accuracy: 85-88%

Snapdragon 730 (2019):

  • Older Hexagon 688 DSP
  • Minimal INT8 optimization
  • Some operations fall back to GPU or CPU
  • Observed accuracy: 71-76%

Same model file. Same quantization scheme. Dramatically different results.

The accuracy drop isn't just about raw compute power - the 730 isn't that much slower than the 765G. It's about how the hardware handles the specific operations your model uses and how the software stack adapts (or fails to adapt) to hardware limitations.

Testing Strategy: What You Need to Verify

Given all these variables, how do you ensure your model will work across different devices? Here's a practical testing approach:

1. Identify Your Target Hardware

Don't try to support everything. Look at your user analytics and identify the top 5-10 device models or chipset families. Focus your testing efforts there.

2. Test on Physical Devices

Emulators and simulators won't catch these issues. You need real hardware running real Android/iOS versions. Device farms can help, but having a few key devices in-house for iterative testing is invaluable.

3. Measure More Than Just Latency

Most deployment guides focus on inference speed, but accuracy matters more. For each target device:

  • Run your validation dataset through the on-device model
  • Compare outputs to your reference (non-quantized) model
  • Track not just overall accuracy but per-class metrics
  • Look for systematic errors that might indicate hardware-specific issues
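The per-class point deserves emphasis: an aggregate accuracy number can look acceptable while one class quietly collapses on a particular chipset. A sketch of the bookkeeping, with invented labels and predictions:

```python
from collections import defaultdict

def per_class_accuracy(labels, preds):
    # Track accuracy per class, not just overall.
    correct = defaultdict(int)
    total = defaultdict(int)
    for y, p in zip(labels, preds):
        total[y] += 1
        if y == p:
            correct[y] += 1
    return {c: correct[c] / total[c] for c in total}

# Hypothetical validation labels and on-device predictions.
labels = ["cat", "cat", "dog", "dog", "dog", "bird", "bird"]
device = ["cat", "cat", "dog", "cat", "dog", "cat", "cat"]

print(per_class_accuracy(labels, device))  # "bird" collapses to 0% while "cat" stays perfect
```

Overall accuracy here is 4/7 ≈ 57%, but the real story is that every bird is misclassified - exactly the kind of systematic, hardware-specific error the aggregate number hides.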

4. Profile Your Model

Use profiling tools to understand how your model executes:

  • Which operations run on which hardware?
  • Are there unexpected CPU fallbacks?
  • How much time is spent on data conversion?

Tools like Android GPU Inspector, Qualcomm's Snapdragon Profiler, or TensorFlow Lite's built-in benchmarking can reveal execution patterns you didn't expect.

5. Consider Multiple Quantization Strategies

Don't assume one quantized model will work everywhere. You might need:

  • An INT8 model optimized for newer Snapdragon chips
  • A hybrid FP16/INT8 model for mid-range devices
  • A conservative quantization scheme for older hardware

Yes, this means maintaining multiple model variants, but it's better than shipping a model that doesn't work for 30% of your users.

Debugging Accuracy Issues

When you discover accuracy problems on specific hardware, here's how to diagnose them:

Compare Layer-by-Layer Outputs

Don't just look at final predictions. Extract intermediate layer outputs from both your reference model and the on-device model. This helps you identify which layers are causing problems.

If accuracy is fine through the first 10 layers but degrades after that, you know where to focus your optimization efforts.
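A small helper makes this bisection mechanical: walk the layers in order and report the first one whose activations diverge from the reference beyond a tolerance. The activations below are invented; in a real setup you would capture them from your reference runtime and from the on-device interpreter:

```python
def first_divergent_layer(ref_acts, dev_acts, tol=0.05):
    # Return (layer_index, max_abs_diff) for the first layer whose output
    # differs from the reference by more than tol; None if all layers match.
    for i, (ref, dev) in enumerate(zip(ref_acts, dev_acts)):
        max_diff = max(abs(r - d) for r, d in zip(ref, dev))
        if max_diff > tol:
            return i, max_diff
    return None

# Hypothetical flattened per-layer activations from both runs.
ref = [[0.10, 0.20], [0.40, 0.50], [0.90, 1.00]]
dev = [[0.10, 0.21], [0.43, 0.50], [0.60, 1.40]]

print(first_divergent_layer(ref, dev))  # divergence starts at layer index 2
```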

Check for Numerical Overflow

INT8 has a limited range (-128 to 127). If your activations or intermediate values exceed this range, they'll clip, causing accuracy loss. This is especially common in:

  • Models with aggressive batch normalization
  • Networks with large activation values
  • Architectures that weren't designed with quantization in mind
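Clipping is easy to demonstrate: once the scale is fixed by calibration, the largest representable value is 127 × scale, and anything beyond it saturates. The scale here is a made-up calibration result:

```python
def quantize_clipped(x, scale, zero_point=0):
    # Values beyond the representable range saturate at the INT8 limits.
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

scale = 0.05  # hypothetical calibration: representable range tops out at 6.35
for x in (3.0, 6.3, 9.0):
    q = quantize_clipped(x, scale)
    print(f"{x} -> {q * scale:.2f}")  # 9.0 saturates at 6.35; the excess is simply lost
```

If layer statistics show activations routinely exceeding the calibrated range, that layer is a prime suspect for hardware-specific accuracy loss.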

Verify Calibration Data

Post-training quantization requires calibration data to determine optimal scale factors. If your calibration dataset doesn't match your actual use case, the quantization might be suboptimal.

For example, if you calibrated on well-lit images but deploy to an app used in low-light conditions, the quantization ranges might be completely wrong for real-world data.
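A toy sketch of that failure mode, with invented activation samples standing in for the "bright" calibration set and the "dark" deployment data:

```python
def calibrate(samples):
    # Post-training quantization derives the scale and offset
    # from the range observed in the calibration data.
    lo, hi = min(samples), max(samples)
    return (hi - lo) / 255.0, lo

def quant_error(x, scale, lo):
    # Quantize to an unsigned 8-bit code and measure the reconstruction error.
    q = max(0, min(255, round((x - lo) / scale)))
    return abs(x - (lo + q * scale))

bright = [0.4, 0.6, 0.8, 1.0]    # hypothetical well-lit calibration activations
dark = [0.01, 0.02, 0.03, 0.05]  # hypothetical low-light deployment activations

scale, lo = calibrate(bright)
print(max(quant_error(x, scale, lo) for x in dark))  # huge: dark values clip to 0.4

scale, lo = calibrate(dark)
print(max(quant_error(x, scale, lo) for x in dark))  # tiny: range actually matches
```

Every dark-scene activation falls below the calibrated range and clips to its lower edge, so those layers effectively see a constant input.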

The Path Forward

On-device AI accuracy isn't just about model architecture or quantization techniques - it's about understanding the entire hardware-software stack and testing rigorously across your target deployment environment.

The 93% to 71% accuracy variation isn't a bug; it's a feature of the complex, heterogeneous landscape of mobile AI acceleration. The good news is that with proper testing and optimization, you can achieve consistent, reliable performance across different devices.

Before you deploy your next on-device model, make sure you've tested on real hardware, profiled execution patterns, and validated accuracy on your target chipsets. The time you invest in deployment testing will save you from user complaints, bad reviews, and emergency patches down the line.

The future of AI is increasingly on-device, but getting there requires treating deployment as seriously as model training. Your model might be state-of-the-art in the lab, but it only matters if it works in users' hands.
