There are two ways to build a sound classifier that runs all day: send the audio to a server and classify it there, or ship the classifier inside the app and run it on the phone. Both approaches work. Only one is tenable for an app that deaf and hard of hearing users depend on. What follows is the case for why.
The cloud argument
The appeal of cloud processing is simple: a server can run a much larger model than a phone can. You can deploy state-of-the-art architectures, update the model any time you want, and let every user benefit from the same inference engine. The per-inference cost is also lower at scale than burning the phone's CPU and battery.
If the thing you are classifying is occasional and non-sensitive, say, a one-off photo search, that tradeoff can make sense. The latency is acceptable and the privacy surface is manageable.
Sound awareness is not that problem.
Why cloud does not work for always-on audio
Latency is a killer. A smoke alarm alert that arrives 800 ms after it should have is not a smoke alarm alert. On a typical cellular connection, round-trip time for a server inference is hundreds of milliseconds before you add the actual model compute. On a flaky network, it is seconds. On no network, the app is dead. The user discovers this at the worst possible moment.
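To make the gap concrete, here is a back-of-envelope latency budget. The specific figures (round-trip time, server inference time, queueing delay, on-device inference time) are illustrative assumptions, not measurements from SoundSense:

```python
# Back-of-envelope latency budget for one alert, cloud vs. on-device.
# All figures are illustrative assumptions, not measurements.

def cloud_latency_ms(rtt_ms=300, server_infer_ms=30, queue_ms=20):
    """Network round trip dominates even a fast server-side model."""
    return rtt_ms + server_infer_ms + queue_ms

def on_device_latency_ms(neural_engine_infer_ms=80):
    """No network hop: latency is just local inference time."""
    return neural_engine_infer_ms

print(cloud_latency_ms())              # 350 -- a decent cellular link
print(cloud_latency_ms(rtt_ms=1500))   # 1550 -- a flaky network
print(on_device_latency_ms())          # 80 -- regardless of network
```

Note what the sketch shows: even with a generously fast server, the network floor alone exceeds the entire on-device budget, and the cloud path degrades unboundedly while the on-device path does not.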
Continuous upload is not acceptable. Sound awareness requires the mic on all day. If every second of audio is streaming to a server, that audio exists somewhere outside the user's control. It can be intercepted, subpoenaed, leaked, or reprocessed. Even with strong TLS and strict data-deletion policies, the fact that the audio exists at all is a privacy surface most users correctly find unacceptable.
Battery economics flip. People assume cloud is cheaper on battery because you are not running ML locally. For continuous audio streams, the opposite is true. Keeping a radio awake to upload audio 24/7 drains the battery far faster than running a lightweight classifier during active hours. Cellular radios are expensive; the Neural Engine is cheap.
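A rough energy comparison makes the flip visible. The power figures below are illustrative assumptions (not device measurements), but the structure of the math holds: streaming keeps the radio drawing power continuously, while local inference only pays for a small duty cycle:

```python
# Rough daily energy comparison: keeping the cellular radio active to
# stream audio vs. running periodic on-device inference.
# Power figures are illustrative assumptions, not device measurements.

RADIO_ACTIVE_MW = 800   # assumed average draw while uploading over cellular
NPU_INFER_MW = 300      # assumed draw during a Neural Engine inference
INFER_MS = 100          # assumed per-inference time
INFERS_PER_SEC = 1      # classify one audio window per second
HOURS = 16              # active listening hours per day

def mwh(power_mw, seconds):
    """Convert an average power draw over a duration to milliwatt-hours."""
    return power_mw * seconds / 3600.0

streaming_mwh = mwh(RADIO_ACTIVE_MW, HOURS * 3600)
duty = INFERS_PER_SEC * INFER_MS / 1000.0        # fraction of time the NPU is busy
on_device_mwh = mwh(NPU_INFER_MW * duty, HOURS * 3600)

print(round(streaming_mwh))   # 12800 mWh -- on the order of a full battery
print(round(on_device_mwh))   # 480 mWh -- a few percent of one
```

Under these assumptions the streaming path costs over twenty times more energy per day, and that is before counting retransmissions on weak signal, which make the radio's real-world draw worse, not better.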
Trust is binary. A cloud-processed audio app can claim "we delete the audio immediately." The user has no way to verify that claim. An on-device app can prove it by working with no network at all. That is not a policy promise; it is a physical constraint.
What Apple's silicon made possible
On-device sound classification is not a 2024 idea. What has changed is the economics of doing it well.
Modern iPhones ship with a dedicated Neural Engine that can run moderately sized ML models at roughly the throughput of a desktop GPU from ten years ago, at a fraction of the power. Core ML lets us ship models in formats that the Neural Engine can execute directly, without routing through the CPU.
In practical terms: the classifier SoundSense uses fits in under 20 MB, runs at well under 100 ms per inference on an iPhone XS or newer, and draws a rounding-error amount of power during active inference.
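Those figures imply a lot of real-time headroom. As a sketch, assume a one-second analysis window (the window length is an assumption for illustration; the post only states that inference takes well under 100 ms):

```python
# How much real-time headroom the on-device classifier has.
# The window length is an assumption for illustration; 100 ms is the
# worst-case inference time stated in the post.

WINDOW_SEC = 1.0     # assumed audio analysis window
INFER_SEC = 0.100    # worst-case per-inference time

headroom = WINDOW_SEC / INFER_SEC    # windows that could be classified per window
duty_cycle = INFER_SEC / WINDOW_SEC  # fraction of time the Neural Engine is busy

print(headroom)              # 10.0 -- ten times faster than real time
print(f"{duty_cycle:.0%}")   # 10% -- the silicon idles the other 90%
```

Ten-times-real-time headroom at worst-case inference speed is what "rounding-error power draw" looks like in practice: the hardware spends most of its time asleep.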
This was not true in 2018. It was just becoming true in 2022. It is comfortably true in 2026, and it is the reason SoundSense could be built as an on-device app without compromising accuracy.
What we give up
Being honest: on-device processing has tradeoffs.
Model size is bounded. We cannot ship a two-gigabyte model in a phone app. The classifier has to be efficient enough to run on the user's phone while leaving headroom for everything else their phone is doing. That is a real constraint.
Updates require app updates. When we improve the classifier, users need to update the app to get the improvements. We cannot push a new model overnight the way a cloud app can push a new version of an inference stack.
Older devices are not supported. SoundSense requires an iPhone XS or newer, with iOS 16+. Older devices do not have the Neural Engine capability we depend on. This is a real exclusion, and we do not love it. It is the price of the architectural guarantee.
What we gain
For a sound awareness app built for a community where privacy and reliability are non-negotiable, the gain from on-device is straightforward.
The audio never leaves the phone. Not in airplane mode, not on bad Wi-Fi, not ever. The app works without the network. Latency is dominated by how fast the Neural Engine can run, which is plenty fast. The privacy surface is bounded by iOS's sandbox, not by our server-side data-retention policy, because we do not have a server-side data-retention policy for audio, because we do not have audio to retain.
That is the architecture. That is why SoundSense looks the way it does. The tradeoff is an old one, and for the people who will actually use the app, the right call is clear.