Espressif Systems recently introduced the ESP32-S3-BOX, an AI voice development kit built around the ESP32-S3 system-on-chip for monitoring smart IoT devices. The hardware platform has a touch screen for human-computer interaction and can control various sensors and smart devices. As voice assistant-based smart devices are increasingly adopted to control the surrounding environment, Alexa and Google Home have seen widespread community support. However, the Espressif ESP32-S3-BOX differs from cloud-native voice services: the hardware platform has several onboard modules and a rich set of peripherals for easy interaction with these smart IoT devices.
As the name suggests, the ESP32-S3-BOX is built around Espressif’s in-house ESP32-S3 system-on-chip, which features a dual-core Xtensa LX7 processor clocked at up to 240 MHz. The ESP32-S3 SoC comes with integrated 2.4 GHz IEEE 802.11b/g/n Wi-Fi and Bluetooth 5 (Low Energy) wireless connectivity. It also supports high-speed octal SPI flash storage and PSRAM with configurable data and instruction caches. One of the key highlights of the ESP32-S3 is its support for vector instructions, which accelerate neural network computing and signal processing workloads.
The ESP32-S3-BOX can run Espressif’s in-house audio front-end algorithms and ESP-Skainet, an offline voice-assistant SDK, along with the Alexa-for-IoT SDK, to provide both offline and online voice functionality. The Espressif audio front-end is a high-performance set of audio algorithms that enables a voice user interface and gives developers the flexibility to build low-cost voice-assisted applications. Amazon recognizes it as a “Software Audio Front-End” solution for Alexa built-in devices. The voice-optimized solution runs comfortably on the ESP32-S3.
The Espressif audio front-end algorithms deliver three key features: acoustic echo cancellation, blind source separation, and noise suppression. Acoustic echo cancellation removes echoes from the audio captured by the microphone. The blind source separation algorithm uses multiple microphones to detect the direction of the incoming audio, which helps improve the quality of the desired audio source in a noisy environment. The noise suppression algorithm works on a single-channel audio signal and eliminates unwanted non-human noise to improve the signal that needs to be processed.
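To make the idea of acoustic echo cancellation more concrete, here is a minimal sketch in Python using a textbook normalized least-mean-squares (NLMS) adaptive filter. This is not Espressif's implementation, and the article does not say which algorithm the audio front-end uses internally. The function names, the filter length, the step size, and the three-tap "room" echo path are all assumptions chosen for illustration.

```python
import random


def nlms_echo_cancel(mic, ref, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal.

    mic: samples captured by the microphone (echo, plus speech in a real system)
    ref: samples that were sent to the loudspeaker (the echo reference)
    """
    weights = [0.0] * taps   # adaptive estimate of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for d, x in zip(mic, ref):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                      # what remains after removing the echo
        norm = sum(h * h for h in history) + eps   # normalize by reference energy
        step = mu * e / norm
        weights = [w + step * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual


def demo():
    """Simulate a pure-echo recording and compare power before and after."""
    rng = random.Random(0)
    ref = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
    echo_path = [0.6, 0.3, -0.2]  # hypothetical room response, for illustration only
    mic = [
        sum(h * ref[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
        for n in range(len(ref))
    ]
    residual = nlms_echo_cancel(mic, ref)

    def power(signal):
        return sum(v * v for v in signal) / len(signal)

    # Compare only the tail, after the filter has had time to adapt.
    return power(mic[-500:]), power(residual[-500:])
```

Calling `demo()` returns the echo power at the microphone and the residual power after cancellation; in this noise-free simulation the residual should be far smaller once the filter has adapted. A real front-end also has to cope with near-end speech and background noise, which this sketch deliberately ignores.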
Another key piece of software on the ESP32-S3-BOX is ESP-Skainet, an intelligent voice assistant that supports a wake word engine and speech command recognition. Espressif’s wake word engine, WakeNet, is designed to provide high-performance wake word detection with a low memory footprint. This lets IoT devices listen for wake words at all times. Espressif’s speech command recognition model, MultiNet, in turn provides flexible offline speech commands for smart IoT devices. It allows the user to add custom speech commands easily, without having to train the model again.
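The two-stage flow described above, where the device waits for a wake word and only then listens for a command, can be sketched as a small state machine. The Python below is a toy model and not the ESP-Skainet API: the class and method names are hypothetical, and the actual WakeNet and MultiNet detectors are replaced by plain string matching on already-recognized phrases.

```python
from enum import Enum


class State(Enum):
    WAITING_FOR_WAKE_WORD = 1
    LISTENING_FOR_COMMAND = 2


class VoicePipeline:
    """Toy wake-word-then-command pipeline (hypothetical, for illustration)."""

    def __init__(self, wake_word, commands, timeout_frames=3):
        self.wake_word = wake_word
        self.commands = dict(commands)        # phrase -> action identifier
        self.timeout_frames = timeout_frames  # frames to wait for a command
        self.state = State.WAITING_FOR_WAKE_WORD
        self.frames_left = 0

    def add_command(self, phrase, action):
        # Adding a command only extends the lookup table; nothing is retrained.
        self.commands[phrase] = action

    def feed(self, phrase):
        """Process one recognized phrase; return an action or None."""
        if self.state is State.WAITING_FOR_WAKE_WORD:
            if phrase == self.wake_word:
                self.state = State.LISTENING_FOR_COMMAND
                self.frames_left = self.timeout_frames
            return None  # commands are ignored until the wake word is heard

        # LISTENING_FOR_COMMAND: match a command or count down to the timeout.
        if phrase in self.commands:
            self.state = State.WAITING_FOR_WAKE_WORD
            return self.commands[phrase]
        self.frames_left -= 1
        if self.frames_left <= 0:
            self.state = State.WAITING_FOR_WAKE_WORD
        return None
```

For example, with `VoicePipeline("hi esp", {"turn on the light": "LIGHT_ON"})`, feeding "turn on the light" alone returns nothing, while feeding "hi esp" followed by "turn on the light" returns `"LIGHT_ON"`. The timeout returns the pipeline to the waiting state if no known command follows the wake word.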
ESP RainMaker, a complete system for building AIoT products with minimal coding, is also available on the ESP32-S3-BOX and can be used to configure GPIOs and offline commands, providing control via phone applications or voice assistants. Alongside the flagship ESP32-S3-BOX, Espressif has also launched a simplified version of the AI voice development kit, the ESP32-S3-BOX-Lite. The hardware platform is very similar but lacks the capacitive touch panel and mute button. Instead, the ESP32-S3-BOX-Lite comes with three function buttons that can be customized by the user.
Espressif has made all technical resources publicly available, including the hardware reference design and user guides. The hardware platform can be purchased on Amazon by US-based customers, and on AliExpress and Adafruit by customers in the rest of the world.