Learn how to build an Alexa-like system with wake word detection, audio capture, and intent recognition using TensorFlow Lite, ESP32, and wit.ai.
[0:01] “Marvin - turn on the lights”
[0:06] OK
[0:08] “Marvin - turn off the bedroom”
[0:13] OK
[0:14] “Marvin - turn off the kitchen”
[0:21] OK
[0:22] “Marvin - tell me a joke”
[0:28] What goes up and down but does not move?
[0:31] stairs
[0:34] “Marvin - turn off the lights”
[0:40] OK
[0:42] Hey everyone
[0:43] So, if you’ve been playing along at home
[0:46] you’ll know that we’ve been building towards something
[0:50] we’ve covered getting audio into the ESP32
[0:54] getting audio out of the ESP32
[0:57] and we’ve looked at getting some AI running using TensorFlow Lite
[1:01] This has all been building towards an Alexa-type system
[1:06] So, what actually is an Alexa system?
[1:09] What components do we need to plug together to get something working?
[1:13] The first thing we’re going to need is some kind of wake word detection system.
[1:19] This will continuously listen to audio waiting for a trigger phrase or word
[1:25] When it hears this word it will wake up the rest of the system
[1:28] and start recording audio
[1:30] to capture whatever instructions the user has
[1:34] Once the audio has been captured it will send it off to a server to be recognized
[1:39] The server processes the audio and works out what the user is asking for
[1:44] The server may process the user’s request and may trigger actions in other services
[1:48] In the system we’re building we’ll just be using the server
[1:52] to work out what the user’s intention was
[1:55] This intention is then sent back to the device
[1:58] and the device tries to perform what the user asked it to do
[2:01] So we need three components:
[2:04] Wake word detection
[2:07] Audio capture and intent recognition
[2:09] And intent execution
[2:12] Let’s start off with the wake word detection
[2:17] We’re going to be using TensorFlow Lite for our wake word detection
[2:22] and as with any machine learning problem our first port of call
[2:25] is to find some data to train against.
[2:28] Now, fortunately, the good folk at Google have already done the heavy lifting for us and
[2:33] collated a speech commands data set.
[2:38] This data set contains over 100,000 audio files
[2:42] consisting of a set of 20 core command words such as “up”, “down”, “left”, “right”, “yes” and “no”
[2:49] and a set of extra words
[2:52] each of the samples is one second long
[2:55] There’s one word in particular that looks like a good candidate for a wake word.
[2:59] I’ve chosen to use the word “Marvin” as my wake word
[3:04] Oh God I’m so depressed
[3:07] Let’s have a listen to a couple of the files:
[3:10] Marvin
[3:12] Marvin
[3:14] Marvin
[3:17] Marvin
[3:19] Seven
[3:21] Seven
[3:23] Seven
[3:26] I’ve also recorded a large sample of ambient background noise
[3:31] consisting of TV and radio shows and general office noise
[3:37] So now we’ve got our training data
[3:39] we need to work out what features to train our neural network against
[3:44] it’s unlikely that feeding in raw audio samples will give us a good result
[3:49] Reading around and looking at some TensorFlow samples
[3:53] a good approach seems to be to treat the problem as an image recognition problem
[3:57] We need to turn our audio samples into something that looks like an image
[4:02] to do this we can take a spectrogram of the audio sample
[4:06] To get a spectrogram of an audio sample we break the sample into small sections
[4:12] we then perform a discrete fourier transform on each of these sections
[4:16] this gives us the frequencies that are present in that slice of audio
[4:21] putting these frequency slices together gives us a spectrogram of the sample
[4:29] I’ve created a Jupyter notebook to create the training data
[4:34] As always the first thing we do is import the libraries we’re going to need and set up some constants
[4:40] We’ve got a list of words in our training data along with a dummy word for the background noise
[4:46] I’ve made some helper functions for getting all the files for a word
[4:50] and also for detecting if the file actually contains voice data
[4:54] some of the samples are not exactly one second long
[4:57] and some of them have truncated audio data
[5:01] We then have our function for generating the spectrogram for an audio sample
[5:06] we first make sure the audio sample is normalized and then we compute the spectrogram
[5:11] we reduce the result of this by applying average pooling
[5:16] we finally take a log of the spectrogram
[5:19] so that we don’t feed extreme values into our neural network
[5:22] which might make it harder to train
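As a rough illustration of that feature pipeline, here is a minimal Python sketch. The window size, pooling factor and library calls are assumptions for illustration, not the exact values from the notebook:

```python
import numpy as np
from scipy import signal

def get_spectrogram(audio, sample_rate=16000, pooling=6):
    # Normalize the one-second sample
    audio = audio - np.mean(audio)
    audio = audio / (np.max(np.abs(audio)) + 1e-9)

    # Short-time Fourier transform: slice the audio into overlapping windows
    # and get the frequencies present in each slice
    _, _, spec = signal.stft(audio, fs=sample_rate, nperseg=320, noverlap=160)
    spec = np.abs(spec)  # shape: (frequency_bins, time_steps)

    # Average pooling to shrink the spectrogram "image"
    bins = (spec.shape[0] // pooling) * pooling
    spec = spec[:bins].reshape(-1, pooling, spec.shape[1]).mean(axis=1)

    # Take the log so extreme values don't make training harder
    return np.log10(spec + 1e-6)
```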
[5:24] For each file we collect training data from we apply some random modifications
[5:30] we randomly shift the audio sample in its one-second segment
[5:34] this makes sure that our neural network generalizes around the audio position
[5:39] we also add in some random sample of background noise
[5:43] this helps our neural network work out
[5:45] the unique features of our target word and ignore any background noise
[5:51] Now, we need to add more samples of the Marvin word to our data set
[5:55] as it would otherwise be swamped by the other words in our training data
[5:59] so we repeat it multiple times. This also helps our neural network generalize
[6:04] as there will be multiple samples of the word with different background noises
[6:08] and in different positions in the one second sample
[6:12] we then add in samples from our background noise. We run through
[6:16] each file of background noise and chop it into one-second segments
[6:20] and then we also generate some random utterances from our background noise
[6:25] once again this should help our network distinguish between the word Marvin and random noises
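Here is a hedged NumPy sketch of what those augmentation steps can look like. The shift range, mixing volume and helper names are illustrative rather than the exact ones from the notebook:

```python
import numpy as np

SAMPLE_RATE = 16000  # all samples are one second at 16 kHz

def random_shift(audio, max_shift=4000):
    # Slide the word around inside its one-second window, padding with silence
    shift = np.random.randint(-max_shift, max_shift + 1)
    shifted = np.zeros_like(audio)
    if shift >= 0:
        shifted[shift:] = audio[:len(audio) - shift]
    else:
        shifted[:shift] = audio[-shift:]
    return shifted

def add_background(audio, background, max_volume=0.1):
    # Mix in a random one-second slice of a longer background recording
    start = np.random.randint(0, len(background) - SAMPLE_RATE)
    noise = background[start:start + SAMPLE_RATE]
    return audio + max_volume * np.random.rand() * noise

def chop_background(background):
    # Split a long background recording into one-second negative samples
    return [background[i:i + SAMPLE_RATE]
            for i in range(0, len(background) - SAMPLE_RATE, SAMPLE_RATE)]
```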
[6:32] During my testing of the system, I found that there were
[6:35] some particular noises that seemed to trigger false detection of the word Marvin
[6:39] These seem to consist of low-frequency humming and strange scraping sounds.
[6:45] I’ve collected some of these sounds as more negative samples for the training process
[6:51] With all this data we end up with a reasonably sized training, validation and testing data set.
[6:57] So we can save this to disk for use in our training workbook
[7:01] We can also have a look at the spectrograms for different words in our training data
[7:05] So here are some examples of Marvin
[7:09] and here are some examples of the word yes
[7:14] So that’s our training data prepared
[7:16] let’s have a look at how we train our model up
[7:20] I’ve created another Jupyter notebook for training our model
[7:24] Once again we have to pull in the imports we need.
[7:28] we also set up TensorBoard so that we can visualize the training of our model
[7:32] we’ve got our list of words. It’s
[7:35] important that this is in the same order as in the training workbook
[7:39] and we have the code to load up our training data
[7:46] if we plot a histogram of the training data
[7:49] you can see that we have a lot of examples of the word at position 16
[7:53] and quite a few at position 35
[7:56] combining this with our words we can see that this matches up to the
[8:00] word Marvin and to our background noise
[8:04] now for our system, we only really care about detecting the word Marvin
[8:08] so we’ll modify our y labels so that it contains a one for Marvin and a zero for everything else
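In notebook terms that relabelling is just a comparison against the Marvin index. A minimal sketch, assuming the `words` list and the `Y_train`/`Y_validate`/`Y_test` arrays loaded earlier in the workbook:

```python
import numpy as np

MARVIN_INDEX = words.index('marvin')

# 1 for Marvin, 0 for every other word and for the background noise
Y_train_binary    = (np.array(Y_train)    == MARVIN_INDEX).astype(np.float32)
Y_validate_binary = (np.array(Y_validate) == MARVIN_INDEX).astype(np.float32)
Y_test_binary     = (np.array(Y_test)     == MARVIN_INDEX).astype(np.float32)
```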
[8:15] plotting another histogram we can see that we now have a fairly
[8:19] balanced set of training data with examples of our positive and negative classes
[8:24] We can now feed our data into TensorFlow datasets.
[8:28] We set up our training data to repeat forever, randomly shuffle, and come out in batches.
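A minimal sketch of that pipeline with tf.data, following on from the binary labels above; the batch size and variable names are assumptions:

```python
import tensorflow as tf

BATCH_SIZE = 30  # illustrative

train_ds = (
    tf.data.Dataset.from_tensor_slices((X_train, Y_train_binary))
    .repeat()                # repeat forever - epochs are driven by steps_per_epoch
    .shuffle(len(X_train))   # randomly shuffle the samples
    .batch(BATCH_SIZE)       # deliver the data in batches
)

validation_ds = (
    tf.data.Dataset.from_tensor_slices((X_validate, Y_validate_binary))
    .batch(BATCH_SIZE)
)
```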
[8:37] Now we create our model. I’ve played around with a few different model architectures
[8:42] and ended up with this as a trade-off between time to train, accuracy and model size
[8:48] We have a convolution layer followed by a max-pooling layer
[8:53] followed by another convolution layer with a max-pooling layer
[8:56] and the result of this is fed into a densely connected layer and finally to our output neuron
[9:04] looking at the summary of our model we can see how many parameters it has
[9:09] this gives us a fairly good indication of how large the model will be when we convert it to TensorFlow Lite
[9:16] finally, we compile our model, set up the TensorBoard logging, and kick off the training
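Here is a Keras sketch of that kind of model and training setup. The filter counts, input shape, dense layer size and epoch count are illustrative placeholders rather than the exact values used in the video:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# Input is the spectrogram "image": width x height x 1 channel (sizes are illustrative)
model = models.Sequential([
    layers.Conv2D(4, 3, padding='same', activation='relu', input_shape=(99, 43, 1)),
    layers.MaxPooling2D(2),
    layers.Conv2D(4, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(40, activation='relu'),
    layers.Dense(1, activation='sigmoid'),  # single output neuron: Marvin or not
])
model.summary()

model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

callbacks = [
    tf.keras.callbacks.TensorBoard(log_dir='./logs'),
    # keep the best model seen during training for later evaluation
    tf.keras.callbacks.ModelCheckpoint('best_model', save_best_only=True,
                                       monitor='val_accuracy'),
]
model.fit(train_ds, validation_data=validation_ds,
          steps_per_epoch=len(X_train) // BATCH_SIZE, epochs=30,
          callbacks=callbacks)
```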
[9:25] With our training completed, we can now take a look at how well it has performed
[9:30] looking at the tensorboard we can see that our training performance
[9:34] is pretty close to our validation performance. There is a bit of noise on
[9:38] the unsmoothed lines; ideally we should probably try to
[9:41] increase the size of our training and validation data
[9:44] Let’s see how well it does on the testing data set
[9:48] I’m going to use the best model that was found during training and work from that
[9:53] You can see that we get pretty good results
[9:56] checking the confusion matrix we can see how many false positives
[10:00] and how many false negatives we get. These are pretty good results as well
[10:04] I would rather get more false negatives than false positives
[10:08] as we don’t want to be randomly waking up from background noises
[10:12] let’s try it with a higher threshold and see how that performs
[10:16] this is probably what we will go for in our code we will get a lot more false
[10:20] negatives but also far fewer false positives
[10:25] So, as we don’t seem to be overfitting, I’m happy to train the model
[10:29] on our complete data set: training, validation and testing
[10:34] all combined into one large data set
[10:44] let’s see how this performs on all our data
[10:47] once again we have pretty good results our next step
[10:51] is to convert the model to TensorFlow Lite for use on the ESP32
[10:56] Let’s jump into another workbook for this
[11:00] We have our imports to bring in TensorFlow and NumPy
[11:04] we’re also going to need our data
[11:06] we need this so that the converter can quantize our model accurately
[11:12] once the model has been converted we can run a command line tool
[11:15] to generate the C code and we can now compile that into our project
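A sketch of that conversion step; the saved-model path is a placeholder, and xxd is just one common way of generating the C array (the video does not name the specific command line tool):

```python
import tensorflow as tf

def representative_dataset():
    # Feed real spectrograms through so the converter can calibrate the quantization
    for spectrogram in X_train[:500]:
        yield [spectrogram.reshape(1, *spectrogram.shape).astype('float32')]

converter = tf.lite.TFLiteConverter.from_saved_model('fully_trained_model')
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
tflite_model = converter.convert()

with open('model.tflite', 'wb') as f:
    f.write(tflite_model)

# One common way to turn this into C code for the ESP32 project:
#   xxd -i model.tflite > model_data.cc
```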
[11:20] We’ll take a look at the wake word detection code on the ESP32 side of things later.
[11:25] First, we need to get to another building block
[11:29] Once we’ve detected the wake word
[11:31] we need to record some audio and work out what the user wants us to do.
[11:35] we’re going to need something that can understand speech
[11:41] So, to do the heavy lifting of actually recognizing the text
[11:46] we’re going to be using a service from Facebook called “wit.ai”
[11:50] This is a free service that will analyze speech and work out what the intention is behind the speech
[11:59] We log in using Facebook
[12:06] and then the first thing we need to do is create a new application
[12:11] So let’s just call this Marvin and we’ll make it private for now
[12:22] Now we need to train our application to work out what it is we’re trying to do
[12:29] so let’s add a few sample phrases
[12:32] let’s try turning something on
[12:39] we need to create an intent
[12:48] and then we need to start highlighting some of the bits of text
[12:53] let’s try and pull out the device that we’re trying to turn on
[12:58] I’ll create an entity
[13:01] Now we have an entity called device
[13:04] and we’ve highlighted bedroom as the piece of text that should correspond to that device
[13:09] now we can add a trait for on and off
[13:13] This is the built-in trait that’s supplied by wit
[13:16] and we want to say that this should be turned on
[13:20] So let’s train this
[13:26] Now let’s try adding another piece of text
[13:33] So you can see that it’s worked out the device already
[13:37] and it’s worked out the value should be off
[13:39] so let’s add that to our “turn off and on” intent
[13:47] let’s try adding another one
[13:50] let’s try and turn on the kitchen
[13:53] so it’s worked out that it’s an on-off trait
[13:57] and it’s worked out on and then let’s
[13:59] highlight this and tell it that’s the device and we’ll
[14:03] train that as well
[14:09] let’s try another one
[14:11] “turn off the kitchen”
[14:14] so it’s improved its understanding
[14:16] now it can see the device is kitchen and the trait is off
[14:20] so let’s train and validate that
[14:23] you can keep adding more utterances to improve the performance of your application
[14:28] but I think for now that should be enough for our use case
[14:33] so let’s try this out with some real text
[14:38] I’ve made some sample utterances and recorded them to WAV files
[14:43] I have a turn-off, a turn-on and another example turn-on
[14:49] Let’s have a quick listen to these files
[14:54] “turn off the bedroom”
[14:56] so hopefully that should turn off the bedroom
[14:59] Let’s try running this through wit
[15:04] So, we have a curl command here that will post the WAV file up to the back-end service
[15:15] You can see it’s detected that the device is bedroom
[15:20] and it’s detected that we want to turn off the bedroom
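For reference, the same request can be made from Python. The endpoint and headers below follow wit.ai's documented HTTP speech API at the time of writing; the token and file name are placeholders:

```python
import requests

WIT_ACCESS_TOKEN = 'YOUR_SERVER_ACCESS_TOKEN'  # placeholder - from the wit.ai app settings

with open('turn_off_the_bedroom.wav', 'rb') as wav_file:  # placeholder file name
    response = requests.post(
        'https://api.wit.ai/speech',
        headers={
            'Authorization': f'Bearer {WIT_ACCESS_TOKEN}',
            'Content-Type': 'audio/wav',
        },
        data=wav_file,  # stream the WAV file as the request body
    )

# Depending on the API version the body may contain one or more JSON objects
print(response.text)
```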
[15:24] let’s try another one
[15:26] so we’ll try
[15:30] “turn on the lights”
[15:33] so this should turn on the light
[15:37] so let’s try that
[15:43] so we can see once again it’s detected the device
[15:46] it tells us that it’s the light, it’s found the intent “turn on device”, and
[15:52] it says we want to turn it on
[15:54] Let’s try our last sample
[15:58] so, this is the second turn-on sample
[16:03] This should turn on the bedroom as well
[16:06] Let’s check that works
[16:16] It’s found the device and it’s worked out we want to turn it on
[16:21] So, I think this wit application should work for us
[16:26] Let’s integrate it into our code
[16:31] So, that’s our building blocks completed.
[16:34] We have something that will detect a wake word
[16:36] and we have something that will work out what the user’s intention was
[16:41] let’s have a look at how this is all
[16:43] wired up on the ESP32 side of things
[16:47] I’ve created a set of libraries for the main components of the project
[16:52] We have the tfmicro library which includes everything needed to run a TensorFlow Lite model
[16:58] and we have a wrapper library to make it slightly easier to use
[17:02] here’s our trained model converted into C code and
[17:06] here are the functions that we’ll use to communicate with it
[17:09] we have one to get the input buffer and another to run a prediction on the input data
[17:15] we’ve covered this in more detail in a
[17:17] previous video so I won’t go into too many details on this now
[17:22] moving on we have a couple of helper libraries for getting audio in and out of the system
[17:27] we can support both I2S microphones directly
[17:31] and analog microphones using the analog to digital converter
[17:37] samples from the microphone are read into a circular buffer
[17:40] with room for just over one second’s worth of audio
[17:45] our audio output library supports playing
[17:48] WAV files from SPIFFS via an I2S amplifier
[17:53] we’ve then got our audio processing code. This needs to recreate the same process
[17:58] that we used for our training data.
[18:00] The first thing we need to do is work
[18:03] out the mean and max values of the samples
[18:05] so that we can normalize the audio
[18:07] we then step through the one second of audio
[18:11] extracting a window of samples on each step
[18:14] the input samples are normalized and copied into our FFT input buffer
[18:19] The input to the FFT needs to be a power of 2, so there is a blank area that we need to zero out
[18:26] before performing the FFT we apply a hamming window
[18:31] and then once we have done the FFT we
[18:33] extract the energy in each frequency bin
[18:36] we follow that by the same average pooling process as in training
[18:40] and then finally we take the log
[18:43] this gives us the set of features that
[18:46] our neural network is expecting to see
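The device-side code is C++, but the per-window maths is easier to compare against the training features if it is sketched in Python. The window size, FFT size and pooling factor below are assumptions for illustration:

```python
import numpy as np

WINDOW_SIZE = 320   # samples per step (20 ms at 16 kHz) - illustrative
FFT_SIZE = 512      # next power of 2; the remainder is zero padding
POOLING = 6

def process_window(samples, mean, max_value):
    # samples is one window of WINDOW_SIZE values from the one-second buffer;
    # normalize using the mean and max of the whole buffer
    window = (samples - mean) / max_value

    # Apply a Hamming window, then zero-pad up to the FFT size
    fft_input = np.zeros(FFT_SIZE)
    fft_input[:WINDOW_SIZE] = window * np.hamming(WINDOW_SIZE)

    # FFT, then the energy in each frequency bin
    energy = np.abs(np.fft.rfft(fft_input)) ** 2

    # Average pooling followed by a log, as in training
    bins = (len(energy) // POOLING) * POOLING
    pooled = energy[:bins].reshape(-1, POOLING).mean(axis=1)
    return np.log10(pooled + 1e-6)
```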
[18:50] finally, we have the code for talking to wit.ai
[18:54] to avoid having to buffer the entire audio sample in memory
[18:57] we need to perform a chunked upload of the data
[19:01] we create a connection to wit.ai and then upload the chunks of data
[19:05] until we’ve collected sufficient audio to capture the user’s command
[19:10] we decode the results from wit.ai and extract the pieces of information
[19:14] that we are interested in
[19:16] we only care about the intent, the device and whether the user wants to turn the device on or off
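A sketch of that decoding step in Python. The exact JSON layout depends on the wit.ai API version, so the entity and trait keys below ('device:device', 'wit$on_off') are assumptions based on a typical response:

```python
import json

def parse_wit_response(body):
    """Pull out just the bits we care about: intent name, device and on/off."""
    result = json.loads(body)

    intent = result['intents'][0]['name'] if result.get('intents') else None

    entities = result.get('entities', {})
    device_values = entities.get('device:device', [])
    device = device_values[0]['value'] if device_values else None

    traits = result.get('traits', {})
    on_off_values = traits.get('wit$on_off', [])
    is_on = on_off_values[0]['value'] == 'on' if on_off_values else None

    return intent, device, is_on
```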
[19:24] That’s all the components of our application
[19:27] let’s see how these are all coordinated
[19:30] in our setup function, we do all the normal work of setting up the serial port
[19:34] connecting to wi-fi and starting up SPIFFS
[19:38] we configure the audio input and the audio output
[19:42] and we set up some devices and map them onto GPIO ports
[19:46] finally, we create a task that will
[19:49] delegate onto our application class before we kick off the audio input
[19:55] our application task is woken up every time the audio input
[19:59] fills up one of the sections of the ring buffer
[20:02] every time that happens it services the application
[20:08] our application consists of a very simple state machine
[20:11] we can be in one of two states: we can either be waiting for the wake word
[20:16] or we can be recognizing a command
[20:19] let’s have a look at the detect wake word state
[20:23] the first thing we do is get hold of the ring buffer
[20:26] we rewind by one second’s worth of samples
[20:29] and then generate the spectrogram
[20:32] this spectrogram is fed directly into the neural network’s input buffer
[20:36] so we can run the prediction
[20:39] if the neural network thinks the wake word occurred
[20:41] then we move on to the next state otherwise we stay in the current state
[20:48] for the command recognition state,
[20:50] when we enter the state we make a connection to wit.ai; this can take up to 1.5 seconds
[20:56] as making an SSL connection on the ESP32 is quite slow
[21:02] we then start streaming samples to the server
[21:04] to allow for the SSL connection time we go back one second into the past
[21:08] so we don’t miss too much of what the user said
[21:12] once we have streamed three seconds of samples we ask wit.ai what the user said
[21:17] we could be more clever here and we could wait until
[21:21] we think the user has stopped speaking but that’s probably work for a future version
[21:27] wit.ai processes the audio and tells us what the user asked
[21:32] we pass that onto our intent processor
[21:34] to interpret the request and move on to the next state which will
[21:38] put us back into waiting for the wake word
[21:42] our intent processor simply looks at the intent name that wit.ai provides us
[21:47] and carries out the appropriate action
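The real implementation is C++ on the ESP32, but the coordination logic boils down to a two-state machine along these lines. This is a Python sketch with illustrative class and method names, not the project's actual API:

```python
WAITING_FOR_WAKE_WORD = 0
RECOGNISING_COMMAND = 1

class Application:
    def __init__(self, wake_word_detector, command_recogniser, intent_processor):
        self.state = WAITING_FOR_WAKE_WORD
        self.detector = wake_word_detector
        self.recogniser = command_recogniser
        self.intent_processor = intent_processor

    def service(self, ring_buffer):
        """Called every time the audio input fills a section of the ring buffer."""
        if self.state == WAITING_FOR_WAKE_WORD:
            # Rewind one second, build the spectrogram and run the model
            if self.detector.detect(ring_buffer):
                self.recogniser.start()            # open the wit.ai connection
                self.state = RECOGNISING_COMMAND
        else:
            # Keep streaming samples; once enough audio has been sent,
            # ask wit.ai for the intent and go back to waiting
            if self.recogniser.stream(ring_buffer):
                intent = self.recogniser.get_intent()
                self.intent_processor.process(intent)   # e.g. toggle a GPIO
                self.state = WAITING_FOR_WAKE_WORD
```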
[21:53] “Marvin tell me about life”
[21:58] Life, don’t talk to me about life
[22:05] So, there we have it: a DIY Alexa.
[22:08] How well does it actually work?
[22:12] It works reasonably well
[22:14] We have a very lightweight wake word detection system
[22:17] It runs in around 100 milliseconds and there’s still room for lots of optimization
[22:24] Accuracy on the wake word is okay
[22:26] We do need more training data to make it really robust
[22:30] You can easily trick it into activating by using similar words to Marvin such as “marvellous”, “martin”, “marlin”
[22:39] More negative examples of words would help with this problem
[22:43] The wit.ai system works very well and you can easily add your own intents and traits
[22:48] and build a very powerful system
[22:51] There are also alternative paid services which you can use instead: one is
[22:56] available from Microsoft, and Google and Amazon also have equivalent services
[23:02] All the code is on GitHub; the link is in the description
[23:07] All you actually need is a microphone to get audio data into the ESP32
[23:12] You don’t necessarily need a speaker. You can just comment out the sections that try and talk to you
[23:19] Let me know how you get on in the comments section
[23:22] As always, thanks for watching
[23:25] I hope you enjoyed this video as much as I enjoyed making it
[23:27] and please hit the subscribe button if you did and I’ll keep on making videos