Efficient Multimodal Neural Networks for Trigger-Free Voice Assistants

The adoption of multimodal interaction by voice assistants (VAs) is growing rapidly to improve human-computer interaction. Smartwatches now include trigger-free methods of summoning VAs, such as Raise To Speak (RTS), where the user raises the watch and speaks without an explicit trigger phrase. Current state-of-the-art RTS systems rely on heuristics and hand-designed finite-state machines to fuse gesture and audio data for multimodal decision making. However, these methods have limitations, including limited adaptability, poor scalability, and inherent human bias. In this work, we propose a neural-network-based multimodal audio-gesture fusion system that (1) better captures the temporal correlation between audio and gesture data, yielding more accurate invocations; (2) generalizes to a wide range of environments and scenarios; (3) is lightweight and deployable on low-power devices such as smartwatches, with fast startup time; and (4) improves productivity in asset development processes.
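To make the fusion idea concrete, the sketch below shows one plausible shape such a system could take: audio and gesture features are concatenated per time step (early fusion) and passed through a small recurrent cell that models their temporal correlation, ending in a single invocation probability. All dimensions, parameter names, and the single-layer recurrent cell are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

# Hypothetical sketch of neural audio-gesture fusion for Raise To Speak.
# Every dimension and the cell design below are assumptions for illustration.

rng = np.random.default_rng(0)

T = 20           # time steps in the decision window -- assumed
AUDIO_DIM = 8    # per-step audio features (e.g. log-mel summary) -- assumed
GESTURE_DIM = 6  # per-step motion features (accelerometer/gyro) -- assumed
HIDDEN = 16      # recurrent state size -- assumed

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fuse_and_score(audio, gesture, params):
    """Early-fuse the two streams per time step, run a simple recurrent
    cell over the window, and emit an invocation probability."""
    W_x, W_h, b, w_out, b_out = params
    h = np.zeros(HIDDEN)
    for t in range(T):
        x = np.concatenate([audio[t], gesture[t]])  # per-step fusion
        h = np.tanh(W_x @ x + W_h @ h + b)          # temporal modeling
    return sigmoid(w_out @ h + b_out)               # P(invoke the VA)

# Randomly initialized (untrained) parameters, shown only for their shapes.
params = (
    rng.normal(0.0, 0.1, (HIDDEN, AUDIO_DIM + GESTURE_DIM)),
    rng.normal(0.0, 0.1, (HIDDEN, HIDDEN)),
    np.zeros(HIDDEN),
    rng.normal(0.0, 0.1, HIDDEN),
    0.0,
)

audio = rng.normal(size=(T, AUDIO_DIM))
gesture = rng.normal(size=(T, GESTURE_DIM))
p = fuse_and_score(audio, gesture, params)
print(0.0 < p < 1.0)  # a valid probability
```

Fusing the streams step by step (rather than scoring each modality separately and combining the results) is what lets such a model learn how the wrist-raise motion aligns in time with the onset of speech; a learned model of this kind can also be retrained for new environments instead of re-tuning hand-written state-machine thresholds.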
