DIY Alexa: Create Your Own Voice Assistant with ESP32 & TensorFlow Lite!

27 min read

HELP SUPPORT MY WORK: If you're feeling flush then please stop by Patreon Or you can make a one off donation via ko-fi

Learn how to build an Alexa-like system with wake word detection, audio capture, and intent recognition using TensorFlow Lite, ESP32, and wit.ai.

[0:01] “Marvin - turn on the lights”
[0:06] OK
[0:08] “Marvin - turn off the bedroom”
[0:13] OK
[0:14] “Marvin - turn off the kitchen”
[0:21] OK
[0:22] “Marvin - tell me a joke”
[0:28] What goes up and down but does not move?
[0:31] stairs
[0:34] “Marvin - turn off the lights”
[0:40] OK
[0:42] Hey everyone
[0:43] So, if you’ve been playing along at home
[0:46] you’ll know that we’ve been building towards something
[0:50] we’ve covered getting audio into the ESP32
[0:54] getting audio out of the ESP32
[0:57] and we’ve looked at getting some AI running using TensorFlow Lite
[1:01] This has all been leading up to building an Alexa-type system
[1:06] So, what actually is an Alexa system?
[1:09] What components do we need to plug together to get something working?
[1:13] The first thing we’re going to need is some kind of wake word detection system.
[1:19] This will continuously listen to audio waiting for a trigger phrase or word
[1:25] When it hears this word it will wake up the rest of the system
[1:28] and start recording audio
[1:30] to capture whatever instructions the user has
[1:34] Once the audio has been captured it will send it off to a server to be recognized
[1:39] The server processes the audio and works out what the user is asking for
[1:44] The server may process the user’s request and may trigger actions in other services
[1:48] In the system we’re building we’ll just be using the server
[1:52] to work out what the user’s intention was
[1:55] This intention is then sent back to the device
[1:58] and the device tries to perform what the user asked it to do
[2:01] So we need three components:
[2:04] Wake word detection
[2:07] Audio capture and intent recognition
[2:09] And intent execution
[2:12] Let’s start off with the wake word detection
[2:17] We’re going to be using TensorFlow Lite for our wake word detection
[2:22] and as with any machine learning problem our first port of call
[2:25] is to find some data to train against.
[2:28] Now, fortunately, the good folk at Google have already done the heavy lifting for us and
[2:33] collated a speech commands data set.
[2:38] This data set contains over 100,000 audio files
[2:42] consisting of a set of 20 core command words such as “up”, “down”, “left”, “right”, “yes” and “no”
[2:49] and a set of extra words
[2:52] each of the samples is one second long
[2:55] There’s one word in particular that looks like a good candidate for a wake word.
[2:59] I’ve chosen to use the word “Marvin” as my wake word
[3:04] Oh God I’m so depressed
[3:07] Let’s have a listen to a couple of the files:
[3:10] Marvin
[3:12] Marvin
[3:14] Marvin
[3:17] Marvin
[3:19] Seven
[3:21] Seven
[3:23] Seven
[3:26] I’ve also recorded a large sample of ambient background noise
[3:31] consisting of tv and radio shows and general office noise
[3:37] So now we’ve got our training data
[3:39] we need to work out what features to train our neural network against
[3:44] it’s unlikely that feeding in raw audio samples will give us a good result
[3:49] Reading around and looking at some TensorFlow samples
[3:53] a good approach seems to be to treat the problem as an image recognition problem
[3:57] We need to turn our audio samples into something that looks like an image
[4:02] to do this we can take a spectrogram of the audio sample
[4:06] To get a spectrogram of an audio sample we break the sample into small sections
[4:12] we then perform a discrete Fourier transform on each of these sections
[4:16] this gives us the frequencies that are present in that slice of audio
[4:21] putting these frequency slices together gives us a spectrogram of the sample
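The slicing and transform steps just described can be sketched in a few lines of NumPy. This is only an illustration: the 16 kHz sample rate, the 320-sample window and the 160-sample step are assumptions, not values taken from the notebook.

```python
import numpy as np

def spectrogram(samples, window_size=320, step=160):
    """Magnitude spectrogram: one DFT per overlapping section of audio."""
    slices = []
    for start in range(0, len(samples) - window_size + 1, step):
        section = samples[start:start + window_size]
        # rfft of real input gives window_size // 2 + 1 frequency bins
        slices.append(np.abs(np.fft.rfft(section)))
    return np.array(slices)  # shape: (time slices, frequency bins)

# One second of a 1 kHz test tone at the assumed 16 kHz sample rate
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
tone = np.sin(2 * np.pi * 1000 * t)
spec = spectrogram(tone)
print(spec.shape)  # (99, 161)
```

Each row is one slice of time and each column one frequency bin. With these assumed settings a bin is 50 Hz wide, so the test tone shows up in bin 20 of every row.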
[4:29] I’ve created a Jupyter notebook to create the training data
[4:34] As always the first thing we do is import the libraries we’re going to need and set up some constants
[4:40] We’ve got a list of words in our training data along with a dummy word for the background noise
[4:46] I’ve made some helper functions for getting all the files for a word
[4:50] and also for detecting if the file actually contains voice data
[4:54] some of the samples are not exactly one second long
[4:57] and some of them have truncated audio data
[5:01] We then have our function for generating the spectrogram for an audio sample
[5:06] we first make sure the audio sample is normalized and then we compute the spectrogram
[5:11] we reduce the result of this by applying average pooling
[5:16] we finally take a log of the spectrogram
[5:19] so that we don’t feed extreme values into our neural network
[5:22] which might make it harder to train
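A rough sketch of those three steps (normalize, average pool, log) might look like the following. The pooling size, the small epsilon added before the log, and the helper names are all assumptions made for illustration; the notebook may do this differently.

```python
import numpy as np

def normalize(samples):
    """Remove the DC offset and scale so the loudest sample has magnitude 1."""
    samples = samples - np.mean(samples)
    peak = np.max(np.abs(samples))
    return samples / peak if peak > 0 else samples

def average_pool(spec, pool=(1, 6)):
    """Shrink a 2D spectrogram by averaging non-overlapping pool-sized blocks."""
    rows, cols = pool
    h = spec.shape[0] // rows * rows  # drop any ragged edge
    w = spec.shape[1] // cols * cols
    trimmed = spec[:h, :w]
    return trimmed.reshape(h // rows, rows, w // cols, cols).mean(axis=(1, 3))

def log_scale(spec, epsilon=1e-6):
    """Compress the dynamic range; epsilon (assumed) avoids log(0)."""
    return np.log10(spec + epsilon)

demo = np.arange(24, dtype=float).reshape(2, 12)
pooled = average_pool(demo)  # averages each run of 6 frequency bins
features = log_scale(pooled)
```

Pooling along the frequency axis keeps the time resolution while cutting the number of inputs the network has to handle.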
[5:24] For each file we collect training data from we apply some random modifications
[5:30] we randomly shift the audio sample in its one-second segment
[5:34] this makes sure that our neural network generalizes around the audio position
[5:39] we also add in some random sample of background noise
[5:43] this helps our neural network work out
[5:45] the unique features of our target word and ignore any background noise
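As a sketch of these two modifications, assuming the audio is held in NumPy arrays: the maximum shift, the noise volume range and the choice to fill the gap left by a shift with silence are all guesses for illustration.

```python
import numpy as np

def random_shift(samples, rng, max_shift):
    """Move the audio earlier or later in its segment, padding with silence."""
    shift = int(rng.integers(-max_shift, max_shift + 1))
    shifted = np.zeros_like(samples)
    if shift > 0:
        shifted[shift:] = samples[:-shift]
    elif shift < 0:
        shifted[:shift] = samples[-shift:]
    else:
        shifted[:] = samples
    return shifted

def add_background(samples, noise, rng, max_volume=0.1):
    """Mix in a randomly chosen stretch of background noise at a random volume."""
    start = int(rng.integers(0, len(noise) - len(samples) + 1))
    volume = rng.uniform(0, max_volume)
    return samples + volume * noise[start:start + len(samples)]

rng = np.random.default_rng(0)
word = np.zeros(16000)
word[8000] = 1.0               # stand-in for a one-second voice sample
noise = np.ones(48000)         # stand-in for a background recording
augmented = add_background(random_shift(word, rng, 1600), noise, rng)
```

Because the shift and the noise are drawn at random each time, repeating the same source file produces a different training example on every pass.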
[5:51] Now, we need to add more samples of the Marvin word to our data set
[5:55] as it would otherwise be swamped by the other words in our training data
[5:59] so we repeat it multiple times. This also helps our neural network generalize
[6:04] as there will be multiple samples of the word with different background noises
[6:08] and in different positions in the one-second sample
[6:12] we then add in samples from our background noise. We run through
[6:16] each file of background noise and chop it into one-second segments
[6:20] and then we also generate some random utterances from our background noise
[6:25] once again this should help our network distinguish between the word Marvin and random noises
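Chopping a long background recording into one-second pieces is a one-liner once the audio is in an array. A minimal sketch, assuming 16,000 samples per second and that any leftover partial second is simply dropped:

```python
import numpy as np

def chop_into_segments(samples, segment_length=16000):
    """Split a recording into whole segments, discarding the ragged tail."""
    count = len(samples) // segment_length
    return samples[:count * segment_length].reshape(count, segment_length)

recording = np.arange(40000)  # stand-in for 2.5 seconds of background noise
segments = chop_into_segments(recording)
print(segments.shape)  # (2, 16000)
```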
[6:32] During my testing of the system, I found that there were
[6:35] some particular noises that seemed to trigger false detection of the word Marvin
[6:39] These seem to consist of low-frequency humming and strange scraping sounds.
[6:45] I’ve collected some of these sounds as more negative samples for the training process
[6:51] With all this data we end up with reasonably sized training, validation and testing data sets.
[6:57] So we can save this to disk for use in our training workbook
[7:01] We can also have a look at the spectrograms for different words in our training data
[7:05] So here are some examples of Marvin
[7:09] and here are some examples of the word yes
[7:14] So that’s our training data prepared
[7:16] let’s have a look at how we train our model up
[7:20] I’ve created another Jupyter notebook for training our model
[7:24] Once again we have to pull in the imports we need.
[7:28] we also set up TensorBoard so that we can visualize the training of our model
[7:32] we’ve got our list of words. It’s
[7:35] important that this is in the same order as in the training workbook
[7:39] and we have the code to load up our training data
[7:46] if we plot a histogram of the training data
[7:49] you can see that we have a lot of examples of the word at position 16
[7:53] and quite a few at position 35
[7:56] combining this with our words we can see that this matches up to the
[8:00] word Marvin and to our background noise
[8:04] now for our system, we only really care about detecting the word Marvin
[8:08] so we’ll modify our y labels so that it contains a one for Marvin and a zero for everything else
[8:15] plotting another histogram we can see that we now have a fairly
[8:19] balanced set of training data with examples of our positive and negative classes
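The relabelling step is easy to show on a toy example. The word list and labels below are made up; in the real data the wake word sits at a different index.

```python
import numpy as np

# Hypothetical, shortened word list; order must match the training data
words = ["yes", "no", "marvin", "_background"]
y = np.array([0, 2, 1, 2, 3, 2])  # made-up labels indexing into words

print(np.bincount(y, minlength=len(words)))  # per-word histogram: [1 1 3 1]

# 1 for the wake word, 0 for everything else
marvin_index = words.index("marvin")
y_binary = (y == marvin_index).astype(np.float32)
print(np.bincount(y_binary.astype(int)))  # [3 3]
```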
[8:24] We can now feed our data into TensorFlow datasets.
[8:28] We set up our training data to repeat forever, shuffle randomly and come out in batches.
[8:37] Now we create our model. I’ve played around with a few different model architectures
[8:42] and ended up with this as a trade-off between time to train, accuracy and model size
[8:48] We have a convolution layer followed by a max-pooling layer
[8:53] followed by another convolution layer with a max-pooling layer
[8:56] and the result of this is fed into a densely connected layer and finally to our output neuron
[9:04] looking at the summary of our model we can see how many parameters it has
[9:09] this gives us a fairly good indication of how large the model will be when we convert it to TensorFlow Lite
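To see where a parameter count comes from, here is the arithmetic for a network with the same layer order (convolution, pooling, convolution, pooling, dense, output). Every size below is an assumption picked for illustration, including the input shape, filter counts and dense width; the real model's numbers will differ.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel convolution."""
    return kernel * kernel * in_channels * filters + filters

def dense_params(inputs, units):
    """One weight per input per unit, plus one bias per unit."""
    return inputs * units + units

# Assumed input: 99 time slices x 43 pooled frequency bins x 1 channel
height, width, channels = 99, 43, 1
total = 0

total += conv2d_params(3, channels, 4)               # conv, 'same' padding
height, width, channels = height // 2, width // 2, 4  # 2x2 max pooling

total += conv2d_params(3, channels, 4)               # second conv
height, width, channels = height // 2, width // 2, 4  # second 2x2 pooling

flat = height * width * channels                      # flatten for the dense layer
total += dense_params(flat, 80)                       # densely connected layer
total += dense_params(80, 1)                          # single output neuron

print(flat, total)  # 960 77149
```

With these made-up sizes almost all of the parameters sit in the dense layer. If the converted model stores roughly one byte per weight after 8-bit quantization, the count also gives a ballpark for the model size in bytes.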
[9:16] finally, we compile our model, set up the TensorBoard logging, and kick off the training
[9:25] With our training completed, we can now take a look at how well it has performed
[9:30] looking at TensorBoard we can see that our training performance
[9:34] is pretty close to our validation performance. There is a bit of noise on
[9:38] the unsmoothed lines. Ideally we should probably try and
[9:41] increase the size of our training and validation data
[9:44] Let’s see how well it does on the testing data set
[9:48] I’m going to use the best model that was found during training and work from that
[9:53] You can see that we get pretty good results
[9:56] checking the confusion matrix we can see how many false positives
[10:00] and how many false negatives we get. These are pretty good results as well
[10:04] I would rather get more false negatives than false positives
[10:08] as we don’t want to be randomly waking up from background noises
[10:12] let’s try it with a higher threshold and see how that performs
[10:16] this is probably what we will go for in our code. We will get a lot more false
[10:20] negatives but also far fewer false positives
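The effect of the threshold on false positives and false negatives can be seen with a handful of made-up scores. The numbers here are invented purely to show the trade-off; they are not results from the model.

```python
import numpy as np

def confusion_counts(y_true, scores, threshold):
    """Return (true pos, false pos, false neg, true neg) at a given threshold."""
    predicted = scores >= threshold
    actual = y_true == 1
    tp = int(np.sum(predicted & actual))
    fp = int(np.sum(predicted & ~actual))
    fn = int(np.sum(~predicted & actual))
    tn = int(np.sum(~predicted & ~actual))
    return tp, fp, fn, tn

y_true = np.array([1, 1, 1, 0, 0, 0])                 # 1 = wake word present
scores = np.array([0.95, 0.7, 0.4, 0.6, 0.2, 0.05])   # invented model outputs

print(confusion_counts(y_true, scores, 0.5))  # (2, 1, 1, 2)
print(confusion_counts(y_true, scores, 0.9))  # (1, 0, 2, 3)
```

Raising the threshold from 0.5 to 0.9 removes the false positive at the cost of an extra missed detection, which is the trade being made here.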
[10:25] So, as we don’t seem to be overfitting, I’m happy to train the model
[10:29] on our complete data set: training, validation and testing
[10:34] all combined into one large data set
[10:44] let’s see how this performs on all our data
[10:47] once again we have pretty good results. Our next step
[10:51] is to convert the model to TensorFlow Lite for use on the ESP32
[10:56] Let’s jump into another workbook for this
[11:00] We have our imports to bring in TensorFlow and NumPy
[11:04] we’re also going to need our data
[11:06] we need this so that the converter can quantize our model accurately
[11:12] once the model has been converted we can run a command line tool
[11:15] to generate the C code and we can now compile that into our project
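The command line tool isn't named here, but all the C code generation step has to do is dump the converted model's bytes into an array that can be compiled into the firmware. A small Python stand-in for that step, with a made-up array name and a few dummy bytes in place of a real model file:

```python
def to_c_array(data, name, per_line=12):
    """Render raw bytes as a C unsigned char array plus a length constant."""
    lines = []
    for start in range(0, len(data), per_line):
        chunk = data[start:start + per_line]
        lines.append("  " + ", ".join(f"0x{b:02x}" for b in chunk) + ",")
    body = "\n".join(lines)
    return (f"const unsigned char {name}[] = {{\n{body}\n}};\n"
            f"const unsigned int {name}_len = {len(data)};\n")

# Dummy bytes standing in for the contents of a converted model file
source = to_c_array(bytes([0x1c, 0x00, 0x00, 0x00, 0x54, 0x46]), "converted_model")
print(source)
```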
[11:20] We’ll take a look at the wake word detection code on the ESP32 side of things later.
[11:25] First, we need to get to another building block
[11:29] Once we’ve detected the wake word
[11:31] we need to record some audio and work out what the user wants us to do.
[11:35] we’re going to need something that can understand speech
[11:41] So, to do the heavy lifting of actually recognizing the text
[11:46] we’re going to be using a service from Facebook called “wit.ai”
[11:50] This is a free service that will analyze speech and work out what the intention is behind the speech
[11:59] We log in using Facebook
[12:06] and then the first thing we need to do is create a new application
[12:11] So let’s just call this Marvin and we’ll make it private for now
[12:22] Now we need to train our application to work out what it is we’re trying to do
[12:29] so let’s add a few sample phrases
[12:32] let’s try turning something on
[12:39] we need to create an intent
[12:48] and then we need to start highlighting some of the bits of text
[12:53] let’s try and pull out the device that we’re trying to turn on
[12:58] I’ll create an entity
[13:01] Now we have an entity called device
[13:04] and we’ve highlighted bedroom as the piece of text that should correspond to that device
[13:09] now we can add a trait for on and off
[13:13] This is the built-in trait that’s supplied by wit
[13:16] and we want to say that this should be turned on
[13:20] So let’s train this
[13:26] Now let’s try adding another piece of text
[13:33] So you can see that it’s worked out the device already
[13:37] and it’s worked out the value should be off
[13:39] so let’s add that to our “turn off and on” intent
[13:47] let’s try adding another one
[13:50] let’s try and turn on the kitchen
[13:53] so it’s worked out that it’s an on-off trait
[13:57] and it’s worked out on and then let’s
[13:59] highlight this and tell it that’s the device and we’ll
[14:03] train that as well
[14:09] let’s try another one
[14:11] “turn off the kitchen”
[14:14] so it’s improved its understanding
[14:16] now it can see the device is kitchen and the trait is off
[14:20] so let’s train and validate that
[14:23] you can keep adding more utterances to improve the performance of your application
[14:28] but I think for now that should be enough for our use case
[14:33] so let’s try this out with some real text
[14:38] I’ve made some sample utterances and recorded them to WAV files
[14:43] I have a turn-off, a turn-on and another example turn-on
[14:49] Let’s have a quick listen to these files
[14:54] “turn off the bedroom”
[14:56] so hopefully that should turn off the bedroom
[14:59] Let’s try running this through wit
[15:04] So, we have a curl command here that will post the WAV file up to the back-end service
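The same request can be put together in Python. This sketch only builds the request and does not send it. It assumes the wit.ai speech endpoint at https://api.wit.ai/speech with a bearer token and an audio/wav content type; the token and the audio bytes are placeholders, and the function name is made up.

```python
import urllib.request

WIT_SPEECH_URL = "https://api.wit.ai/speech"  # assumed endpoint

def build_speech_request(wav_bytes, token):
    """Prepare (but do not send) a POST of WAV audio to the speech endpoint."""
    return urllib.request.Request(
        WIT_SPEECH_URL,
        data=wav_bytes,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "audio/wav",
        },
        method="POST",
    )

# Placeholder token and audio; a real call would read a recorded WAV file
request = build_speech_request(b"RIFF....WAVE", "YOUR_WIT_AI_TOKEN")
print(request.get_method(), request.full_url)
```

Passing the request to `urllib.request.urlopen` would send it; the reply is JSON describing what wit.ai thinks was said.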
[15:15] You can see it’s detected that the device is bedroom
[15:20] and it’s detected that we want to turn off the bedroom
[15:24] let’s try another one
[15:26] so we’ll try
[15:30] “turn on the lights”
[15:33] so this should turn on the light
[15:37] so let’s try that
[15:43] so we can see once again it’s detected the device
[15:46] it tells us that it’s the light, it’s found the intent turn on device and
[15:52] it says we want to turn it on
[15:54] Let’s try our last sample
[15:58] so turn on 2
[16:03] This should turn on the bedroom as well
[16:06] Let’s check that works
[16:16] It’s found the device and it’s worked out we want to turn it on
[16:21] So, I think this wit application should work for us
[16:26] Let’s integrate it into our code
[16:31] So, that’s our building blocks completed.
[16:34] We have something that will detect a wake word
[16:36] and we have something that will work out what the user’s intention was
[16:41] let’s have a look at how this is all
[16:43] wired up on the ESP32 side of things
[16:47] I’ve created a set of libraries for the main components of the project
[16:52] We have the tfmicro library, which includes everything needed to run a TensorFlow Lite model
[16:58] and we have a wrapper library to make it slightly easier to use
[17:02] here’s our trained model converted into C code and
[17:06] here are the functions that we’ll use to communicate with it
[17:09] we have one to get the input buffer and another to run a prediction on the input data
[17:15] we’ve covered this in more detail in a
[17:17] previous video so I won’t go into too many details on this now
[17:22] moving on we have a couple of helper libraries for getting audio in and out of the system
[17:27] we can support both I2S microphones directly
[17:31] and analog microphones using the analog to digital converter
[17:37] samples from the microphone are read into a circular buffer
[17:40] with room for just over one second’s worth of audio
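The firmware itself is C++, but the idea of a circular buffer is easier to show in a few lines of Python. This is a toy sketch with made-up names, not the project's class: new samples overwrite the oldest ones, and a reader can step back from the write position to get the most recent audio.

```python
class RingBuffer:
    """Fixed-size buffer where new samples overwrite the oldest ones."""

    def __init__(self, capacity):
        self.data = [0] * capacity
        self.write_pos = 0  # index where the next sample will be stored

    def write(self, samples):
        for sample in samples:
            self.data[self.write_pos] = sample
            self.write_pos = (self.write_pos + 1) % len(self.data)

    def read_last(self, count):
        """Rewind count samples from the write position and read forwards."""
        start = (self.write_pos - count) % len(self.data)
        return [self.data[(start + i) % len(self.data)] for i in range(count)]

buffer = RingBuffer(8)
buffer.write(range(10))      # the first two samples get overwritten
print(buffer.read_last(4))   # [6, 7, 8, 9]
```

Sizing the buffer slightly above one second of samples means a full second of history is always available, even while new samples are arriving.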
[17:45] our audio output library supports playing
[17:48] WAV files from SPIFFS via an I2S amplifier
[17:53] we’ve then got our audio processing code. This needs to recreate the same process
[17:58] that we used for our training data.
[18:00] The first thing we need to do is work
[18:03] out the mean and max values of the samples
[18:05] so that we can normalize the audio
[18:07] we then step through the one second of audio
[18:11] extracting a window of samples on each step
[18:14] the input samples are normalized and copied into our FFT input buffer
[18:19] The input to the FFT has to be a power of 2 in length, so there is a blank area beyond the window that we need to zero out
[18:26] before performing the FFT we apply a Hamming window
[18:31] and then once we have done the FFT we
[18:33] extract the energy in each frequency bin
[18:36] we follow that by the same average pooling process as in training
[18:40] and then finally we take the log
[18:43] this gives us the set of features that
[18:46] our neural network is expecting to see
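The per-window part of this pipeline can be mocked up in Python to check the idea before writing it in C++. This is a sketch under assumptions: a 320-sample window, the Hamming window applied across just the window samples, and energy taken as real squared plus imaginary squared.

```python
import numpy as np

def next_power_of_two(n):
    """Smallest power of two that is >= n."""
    return 1 << (n - 1).bit_length()

def window_energy(window_samples):
    """Zero-pad a window to a power-of-two FFT size and return bin energies."""
    window_size = len(window_samples)
    fft_size = next_power_of_two(window_size)
    fft_input = np.zeros(fft_size)  # the unused tail stays zeroed
    fft_input[:window_size] = window_samples * np.hamming(window_size)
    spectrum = np.fft.rfft(fft_input)
    return spectrum.real ** 2 + spectrum.imag ** 2

energy = window_energy(np.ones(320))
print(next_power_of_two(320), energy.shape)  # 512 (257,)
```

Whatever is done here has to match the Python training pipeline closely, otherwise the network sees features at run time that look different from the ones it was trained on.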
[18:50] finally, we have the code for talking to wit.ai
[18:54] to avoid having to buffer the entire audio sample in memory
[18:57] we need to perform a chunked upload of the data
[19:01] we create a connection to wit.ai and then upload the chunks of data
[19:05] until we’ve collected sufficient audio to capture the user’s command
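Assuming the chunked upload uses standard HTTP/1.1 chunked transfer encoding, each block of audio goes out framed by its length in hexadecimal, and a zero-length chunk marks the end. A small Python illustration of that framing, with made-up function names:

```python
def encode_chunk(payload):
    """Frame one chunk: hex length, CRLF, the data, CRLF."""
    return f"{len(payload):X}".encode("ascii") + b"\r\n" + payload + b"\r\n"

LAST_CHUNK = b"0\r\n\r\n"  # zero-length chunk terminates the body

def chunked_body(chunks):
    """Yield the wire bytes for a sequence of audio chunks."""
    for chunk in chunks:
        if chunk:  # an empty chunk would end the body early, so skip it
            yield encode_chunk(chunk)
    yield LAST_CHUNK

wire = b"".join(chunked_body([b"ab", b"c" * 16]))
print(wire)
```

Because each chunk carries its own length, the total size never has to be known up front, which is what lets the audio be streamed as it is recorded.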
[19:10] we decode the results from wit.ai and extract the pieces of information
[19:14] that we are interested in
[19:16] we only care about the intent, the device and whether the user wants to turn the device on or off
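Pulling those three pieces out of the reply might look something like this in Python. The JSON layout (an `intents` list, entities keyed as `device:device`, and the built-in trait keyed as `wit$on_off`) is an assumption about the response format, and the sample reply, including the intent name, is made up.

```python
import json

def extract_intent(response_text):
    """Pick out the intent name, the device and the on/off value, if present."""
    result = json.loads(response_text)
    intents = result.get("intents", [])
    devices = result.get("entities", {}).get("device:device", [])
    on_off = result.get("traits", {}).get("wit$on_off", [])
    return {
        "intent": intents[0]["name"] if intents else None,
        "device": devices[0]["value"] if devices else None,
        "on_off": on_off[0]["value"] if on_off else None,
    }

# Invented reply in the assumed format
sample = json.dumps({
    "text": "turn off the bedroom",
    "intents": [{"name": "turn_off_and_on", "confidence": 0.99}],
    "entities": {"device:device": [{"value": "bedroom"}]},
    "traits": {"wit$on_off": [{"value": "off"}]},
})
print(extract_intent(sample))
```

Missing fields come back as `None`, so a reply with no recognizable intent doesn't crash the caller.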
[19:24] That’s all the components of our application
[19:27] let’s see how these are all coordinated
[19:30] in our setup function, we do all the normal work of setting up the serial port
[19:34] connecting to Wi-Fi and starting up SPIFFS
[19:38] we configure the audio input and the audio output
[19:42] and we set up some devices and map them onto GPIO ports
[19:46] finally, we create a task that will
[19:49] delegate onto our application class before we kick off the audio input
[19:55] our application task is woken up every time the audio input
[19:59] fills up one of the sections of the ring buffer
[20:02] every time that happens it services the application
[20:08] our application consists of a very simple state machine
[20:11] we can be in one of two states: we can either be waiting for the wake word
[20:16] or we can be recognizing a command
[20:19] let’s have a look at the detect wake word state
[20:23] the first thing we do is get hold of the ring buffer
[20:26] we rewind by one second’s worth of samples
[20:29] and then generate the spectrogram
[20:32] this spectrogram is fed directly into the neural network’s input buffer
[20:36] so we can run the prediction
[20:39] if the neural network thinks the wake word occurred
[20:41] then we move on to the next state otherwise we stay in the current state
[20:48] for the command recognition state,
[20:50] when we enter the state we make a connection to wit.ai. This can take up to 1.5 seconds
[20:56] as making an SSL connection on the ESP32 is quite slow
[21:02] we then start streaming samples to the server
[21:04] to allow for the SSL connection time we go back one second into the past
[21:08] so we don’t miss too much of what the user said
[21:12] once we have streamed three seconds of samples we ask wit.ai what the user said
[21:17] we could be more clever here and we could wait until
[21:21] we think the user has stopped speaking but that’s probably work for a future version
[21:27] wit.ai processes the audio and tells us what the user asked
[21:32] we pass that onto our intent processor
[21:34] to interpret the request and move on to the next state which will
[21:38] put us back into waiting for the wake word
[21:42] our intent processor simply looks at the intent name that wit.ai provides us
[21:47] and carries out the appropriate action
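The two-state machine is simple enough to model in Python. This is a sketch of the control flow only, with made-up names; the detector, recognizer and intent processor are passed in as plain functions so the example stays self-contained.

```python
from enum import Enum, auto

class State(Enum):
    DETECT_WAKE_WORD = auto()
    RECOGNISE_COMMAND = auto()

class Application:
    def __init__(self, detect_wake_word, recognise_command, process_intent):
        self.detect_wake_word = detect_wake_word    # returns True when heard
        self.recognise_command = recognise_command  # returns an intent, or None while still streaming
        self.process_intent = process_intent
        self.state = State.DETECT_WAKE_WORD

    def service(self):
        """Called each time a new section of audio is available."""
        if self.state is State.DETECT_WAKE_WORD:
            if self.detect_wake_word():
                self.state = State.RECOGNISE_COMMAND
        else:
            intent = self.recognise_command()
            if intent is not None:
                self.process_intent(intent)
                self.state = State.DETECT_WAKE_WORD

# Scripted stand-ins: wake word heard on the second call, intent ready on the second poll
wake_results = iter([False, True])
command_results = iter([None, "turn_on_lights"])
handled = []
app = Application(lambda: next(wake_results), lambda: next(command_results), handled.append)
for _ in range(4):
    app.service()
print(handled, app.state)
```

After four calls the scripted run has detected the wake word, handled one intent and dropped back to waiting for the wake word again.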
[21:53] “Marvin tell me about life”
[21:58] Life, don’t talk to me about life
[22:05] So, there we have it a DIY Alexa.
[22:08] How well does it actually work?
[22:12] It works reasonably well
[22:14] We have a very lightweight wake word detection system
[22:17] It runs in around 100 milliseconds and there’s still room for lots of optimization
[22:24] Accuracy on the wake word is okay
[22:26] We do need more training data to make it really robust
[22:30] You can easily trick it into activating by using similar words to Marvin such as “marvellous”, “martin”, “marlin”
[22:39] More negative examples of words would help with this problem
[22:43] The wit.ai system works very well and you can easily add your own intents and traits
[22:48] and build a very powerful system
[22:51] There are also alternative paid services which you can use instead. One is
[22:56] available from Microsoft, and Google and Amazon also have similar services
[23:02] All the code is on GitHub. The link is in the description
[23:07] All you actually need is a microphone to get audio data into the ESP32
[23:12] You don’t necessarily need a speaker. You can just comment out the sections that try and talk to you
[23:19] Let me know how you get on in the comments section
[23:22] As always, thanks for watching
[23:25] I hope you enjoyed this video as much as I enjoyed making it
[23:27] and please hit the subscribe button if you did and I’ll keep on making videos


Written by

Chris Greening

Supported by

atomic14

A collection of slightly mad projects, instructive/educational videos, and generally interesting stuff. Building projects around the Arduino and ESP32 platforms - we'll be exploring AI, Computer Vision, Audio, 3D Printing - it may get a bit eclectic...
