There are two ways to build a sound classifier that runs all day: send the audio to a server and classify it there, or ship the classifier inside the app and run it on the phone. Both approaches work. Only one is tenable for an app that deaf and hard of hearing users depend on. What follows is the case for why.
The cloud argument
The appeal of cloud processing is simple: a server can run a much larger model than a phone can. You can deploy state-of-the-art architectures, update the model any time you want, and let every user benefit from the same inference engine. The per-inference cost is also lower at scale than burning the phone's CPU and battery.
If the thing you are classifying is occasional and non-sensitive, say, a one-off photo search, that tradeoff can make sense. The latency is acceptable and the privacy surface is manageable.
Sound awareness is not that problem.
Why cloud does not work for always-on audio
Latency is a killer. A smoke alarm alert that arrives 800 ms after it should have is not a smoke alarm alert. On a typical cellular connection, round-trip time for a server inference is hundreds of milliseconds before you add the actual model compute. On a flaky network, it is seconds. On no network, the app is dead. The user discovers this at the worst possible moment.
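To make the gap concrete, here is a back-of-envelope latency budget. The specific figures (round-trip time, server inference time, queueing delay, on-device inference time) are illustrative assumptions, not measurements from SoundSense:

```python
# Back-of-envelope latency budget for one alert, cloud vs. on-device.
# All figures are illustrative assumptions, not measurements.

def cloud_latency_ms(rtt_ms=300, server_infer_ms=30, queue_ms=20):
    """Network round trip dominates even a fast server-side model."""
    return rtt_ms + server_infer_ms + queue_ms

def on_device_latency_ms(neural_engine_infer_ms=80):
    """No network hop: latency is just local inference time."""
    return neural_engine_infer_ms

print(cloud_latency_ms())              # 350 -- a decent cellular link
print(cloud_latency_ms(rtt_ms=1500))   # 1550 -- a flaky network
print(on_device_latency_ms())          # 80 -- regardless of network
```

Note what the sketch shows: even with a generously fast server, the network floor alone exceeds the entire on-device budget, and the cloud path degrades unboundedly while the on-device path does not.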
Continuous upload is not acceptable. Sound awareness requires the mic on all day. If every second of audio is streaming to a server, that audio exists somewhere outside the user's control. It can be intercepted, subpoenaed, leaked, or reprocessed. Even with strong TLS and strict data-deletion policies, the fact that the audio exists at all is a privacy surface most users correctly find unacceptable.
Battery economics flip. People assume cloud is cheaper on battery because you are not running ML locally. For continuous audio streams, the opposite is true. Keeping a radio awake to upload audio 24/7 drains the battery far faster than running a lightweight classifier during active hours. Cellular radios are expensive; the Neural Engine is cheap.
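A rough energy comparison makes the flip visible. The power figures below are illustrative assumptions (not device measurements), but the structure of the math holds: streaming keeps the radio drawing power continuously, while local inference only pays for a small duty cycle:

```python
# Rough daily energy comparison: keeping the cellular radio active to
# stream audio vs. running periodic on-device inference.
# Power figures are illustrative assumptions, not device measurements.

RADIO_ACTIVE_MW = 800   # assumed average draw while uploading over cellular
NPU_INFER_MW = 300      # assumed draw during a Neural Engine inference
INFER_MS = 100          # assumed per-inference time
INFERS_PER_SEC = 1      # classify one audio window per second
HOURS = 16              # active listening hours per day

def mwh(power_mw, seconds):
    """Convert an average power draw over a duration to milliwatt-hours."""
    return power_mw * seconds / 3600.0

streaming_mwh = mwh(RADIO_ACTIVE_MW, HOURS * 3600)
duty = INFERS_PER_SEC * INFER_MS / 1000.0        # fraction of time the NPU is busy
on_device_mwh = mwh(NPU_INFER_MW * duty, HOURS * 3600)

print(round(streaming_mwh))   # 12800 mWh -- on the order of a full battery
print(round(on_device_mwh))   # 480 mWh -- a few percent of one
```

Under these assumptions the streaming path costs over twenty times more energy per day, and that is before counting retransmissions on weak signal, which make the radio's real-world draw worse, not better.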
Trust is binary. A cloud-processed audio app can claim "we delete the audio immediately." The user has no way to verify that claim. An on-device app can prove it by working with no network at all. That is not a policy promise; it is a physical constraint.
What Apple's silicon made possible
On-device sound classification is not a 2024 idea. What has changed is the economics of doing it well.
Modern iPhones ship with a dedicated Neural Engine that can run moderately sized ML models at roughly the throughput of a desktop GPU from ten years ago, at a fraction of the power. Core ML lets us ship models in formats that the Neural Engine can execute directly, without routing through the CPU.
In practical terms: the classifier SoundSense uses fits in under 20 MB, runs at well under 100 ms per inference on an iPhone XS or newer, and draws a rounding-error amount of power during active inference.
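Those figures imply a lot of real-time headroom. As a sketch, assume a one-second analysis window (the window length is an assumption for illustration; the post only states that inference takes well under 100 ms):

```python
# How much real-time headroom the on-device classifier has.
# The window length is an assumption for illustration; 100 ms is the
# worst-case inference time stated in the post.

WINDOW_SEC = 1.0     # assumed audio analysis window
INFER_SEC = 0.100    # worst-case per-inference time

headroom = WINDOW_SEC / INFER_SEC    # windows that could be classified per window
duty_cycle = INFER_SEC / WINDOW_SEC  # fraction of time the Neural Engine is busy

print(headroom)              # 10.0 -- ten times faster than real time
print(f"{duty_cycle:.0%}")   # 10% -- the silicon idles the other 90%
```

Ten-times-real-time headroom at worst-case inference speed is what "rounding-error power draw" looks like in practice: the hardware spends most of its time asleep.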
This was not true in 2018. It was just becoming true in 2022. It is comfortably true in 2026, and it is the reason SoundSense could be built as an on-device app without compromising accuracy.
What we give up
Being honest: on-device processing has tradeoffs.
Model size is bounded. We cannot ship a two-gigabyte model in a phone app. The classifier has to be efficient enough to run on the user's phone while leaving headroom for everything else their phone is doing. That is a real constraint.
Updates require app updates. When we improve the classifier, users need to update the app to get the improvements. We cannot push a new model overnight the way a cloud app can push a new version of an inference stack.
Older devices are not supported. SoundSense requires an iPhone XS or newer, with iOS 16+. Older devices do not have the Neural Engine capability we depend on. This is a real exclusion, and we do not love it. It is the price of the architectural guarantee.
What we gain
For a sound awareness app built for a community where privacy and reliability are non-negotiable, the gain from on-device is straightforward.
The audio never leaves the phone. Not in airplane mode, not on bad Wi-Fi, not ever. The app works without the network. Latency is dominated by how fast the Neural Engine can run, which is plenty fast. The privacy surface is bounded by iOS's sandbox, not by our server-side data-retention policy, because we do not have a server-side data-retention policy for audio, because we do not have audio to retain.
That is the architecture. That is why SoundSense looks the way it does. The tradeoff is an old one, and for the people who will actually use the app, the right call is clear.