Learn how to create a voice-controlled robot using an ESP32 and TensorFlow Lite with this step-by-step guide covering building the neural network, generating training data, and implementing the firmware.
[0:00] “Forward”
[0:02] “Right”
[0:04] “Forward”
[0:06] “Right”
[0:08] “Forward”
[0:10] “Left”
[0:12] “Backward”
[0:15] “Backward”
[0:17] “Left”
[0:19] “Forward”
[0:21] Hey Everyone,
[0:22] We’re back with another dive into some speech recognition.
[0:26] In the last video, we built our very own Alexa using wake word detection running on the ESP32.
[0:33] “Marvin”
[0:35] “Tell me a joke”
[0:39] “What goes up and down but does not move?”
[0:43] “Stairs…”
[0:45] The actual processing of the user’s request is performed by a service called Wit.ai
[0:51] which takes speech and converts it into an intention that can be executed by the ESP32.
[0:58] In this video we’re going to do some limited voice recognition on the ESP32 and build a
[1:04] voice controlled robot!
[1:09] Once again we’ll be using the Commands Dataset as our training data.
[1:13] I’ve selected a set of words that would be suitable for controlling a small robot:
[1:17] “left”, “right”, “forward”, and “backward”
[1:21] We’ll train up a neural network to recognise these words and then run that model on the
[1:27] ESP32 using TensorFlow Lite.
[1:31] We’re going to be able to reuse a lot of the code from our previous video with some minor
[1:36] modifications.
[1:37] Let’s have a quick look at generating our training data.
[1:41] We have our standard set of imports and some constants.
[1:45] In a departure from our previous Alexa work we’re going to split the words into two sections,
[1:51] command words and nonsense words.
[1:53] We’ll train our model to recognise the command words and reject the nonsense words and background noise.
[2:00] We have the same set of helper functions for getting the list of files and validating the audio
[2:05] and we have our function for extracting the spectrogram from audio data.
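Roughly, the spectrogram extraction looks something like the sketch below - the frame length, frame step, pooling factor and normalisation here are assumptions, not necessarily the exact values used in the project:

```python
import numpy as np
import tensorflow as tf

def get_spectrogram(audio, frame_length=320, frame_step=160, pool_size=6):
    # Normalise the one-second clip so loud and quiet recordings look similar
    audio = np.asarray(audio, dtype=np.float32)
    audio = (audio - audio.mean()) / (audio.std() + 1e-6)
    # Short-time Fourier transform gives a [frames, bins] magnitude spectrogram
    spectrogram = tf.abs(tf.signal.stft(audio, frame_length=frame_length,
                                        frame_step=frame_step))
    # Average-pool along the frequency axis to keep the model input small
    spectrogram = tf.nn.pool(spectrogram[tf.newaxis, :, :, tf.newaxis],
                             window_shape=[1, pool_size],
                             strides=[1, pool_size],
                             pooling_type='AVG',
                             padding='SAME')
    # Log scale compresses the dynamic range
    return tf.math.log(spectrogram + 1e-6).numpy().squeeze()
```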
[2:10] Once again, we’re going to augment our data - we’ll randomly reposition the word within the audio segment
[2:17] and we’ll add some random background noise to the word.
[2:21] To get sufficient data for our command words we’ll repeat them multiple times;
[2:26] this will give our neural network more data to train on and should help it to generalise.
[2:32] A couple of the words - forward and backward - have fewer examples, so I’ve repeated these more often.
[2:40] For our nonsense words we won’t bother repeating them as we have quite a few examples.
[2:46] As before we’ll include background noise and we’ll also include the same problem noises
[2:51] we identified in the previous project.
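As a sketch of what the augmentation might look like (the noise scaling and the repeat counts below are illustrative assumptions, not the project’s exact values; the word clip is assumed to be at most one second long):

```python
import numpy as np

SAMPLE_RATE = 16000
EXPECTED_SAMPLES = SAMPLE_RATE  # one-second window

def augment(word_audio, background_audio, noise_level=0.1):
    # Drop the word in at a random position within the one-second segment
    samples = np.zeros(EXPECTED_SAMPLES, dtype=np.float32)
    start = np.random.randint(0, EXPECTED_SAMPLES - len(word_audio) + 1)
    samples[start:start + len(word_audio)] = word_audio
    # Mix in a random slice of background noise at a random volume
    offset = np.random.randint(0, len(background_audio) - EXPECTED_SAMPLES)
    noise = background_audio[offset:offset + EXPECTED_SAMPLES]
    return samples + noise_level * np.random.uniform(0.5, 1.0) * noise

# Command words are repeated more often than nonsense words; forward and
# backward get extra repeats as they have fewer recordings (counts are illustrative)
REPEATS = {'left': 40, 'right': 40, 'forward': 70, 'backward': 70}
```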
[2:53] With the training data generation completed we just save it to disk.
[2:58] Here are some examples of the words in their spectrogram format.
[3:05] In our previous project we trained the model to recognise just one word; now we want to recognise multiple words.
[3:11] Once again we have our usual includes, and we have the lists of words that we want to recognise.
[3:16] We load up our data and if we plot a histogram we can see the distribution of words.
[3:22] Ideally we’d have a bit more of a balanced dataset but having more negative examples
[3:26] may actually help us.
[3:28] We have a fairly simple convolutional neural network, with 2 convolution layers followed
[3:33] by a fully connected layer which is then followed by our output layer.
[3:38] As we are now trying to recognise multiple different words we use the “softmax” activation
[3:42] function and we use the “CategoricalCrossentropy” as our loss function.
[3:47] I do have a couple of introductory videos on TensorFlow that explain these terms in a bit more detail.
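The overall shape of the model - two convolution layers, a fully connected layer, and a softmax output over the four command words plus an “invalid” class, trained with categorical cross-entropy - looks roughly like this; the filter counts, kernel sizes and input shape below are assumptions:

```python
import tensorflow as tf
from tensorflow.keras import layers, models

# "left", "right", "forward", "backward" plus a class for everything else
NUM_CLASSES = 5

model = models.Sequential([
    layers.Input(shape=(99, 43, 1)),           # spectrogram dimensions are assumptions
    layers.Conv2D(4, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(2),
    layers.Conv2D(4, 3, padding='same', activation='relu'),
    layers.MaxPooling2D(2),
    layers.Flatten(),
    layers.Dense(40, activation='relu'),       # fully connected layer
    layers.Dense(NUM_CLASSES, activation='softmax'),
])

model.compile(optimizer='adam',
              loss=tf.keras.losses.CategoricalCrossentropy(),
              metrics=['accuracy'])
```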
[3:54] After training our model we get just under 92% accuracy on our training data and just
[3:59] over 92% accuracy on our validation data.
[4:04] Our test dataset gives us a similar level of performance.
[4:10] Looking at the confusion matrix we can see that it’s mostly misclassifying our words as invalid.
[4:15] This is probably what we’d prefer as ideally we’d like to err on the side of false negatives
[4:20] instead of false positives.
[4:23] Since we don’t appear to be overfitting, I’ve trained the model on the complete dataset.
[4:28] This gives us a final accuracy of around 94% and looking at the confusion matrix we see a lot better results.
[4:35] It’s possible that now we might have some overfitting, but let’s try it in the real world.
[4:41] For that we are going to need a robot!
[4:45] I’m going to build a very simple two-wheeled robot.
[4:48] We’re going to use two continuous-rotation servos and a small power cell.
[4:52] We’ll need quite a wide wheelbase as the breadboard with the ESP32 on it is quite large.
[4:58] After a couple of iterations, I’ve ended up with something that looks like it will work.
[5:04] Assembly is pretty straightforward: we just bolt the two servos onto the chassis
[5:09] and attach the wheels.
[5:11] The breadboard just sits on top of the whole contraption.
[5:18] motor noises…
[5:24] Let’s have a look at the firmware.
[5:27] We have some helper libraries:
[5:29] The tfmicro library contains all the TensorFlow Lite code.
[5:33] We have a wrapper around that to make it slightly easier to use.
[5:37] This library contains the trained model exported as C code along with a helper class to run
[5:43] the neural network prediction.
[5:45] We then have our audio processing.
[5:47] This recreates the code that we used when we generated the training data.
[5:52] This processes a one-second window of samples and generates the spectrogram that will be
[5:56] used by the neural network.
[5:59] Finally, we have our audio input library.
[6:03] This will read samples either from the internal ADC for analogue microphones or from the I2S
[6:08] interface for digital microphones.
[6:12] In the main application code we have the setup function which creates our command processor and our command detector.
[6:20] The command detector is run by a task that waits for audio samples to become available
[6:25] and then services it.
[6:28] Our command detector rewinds the audio data by one second, gets the spectrogram and then
[6:34] runs the prediction.
[6:36] To improve the robustness of our detection we sample the prediction over multiple audio segments
[6:42] and also reject any detections that happen within one second of a previous detection.
[6:48] If we detect a command then we queue it up for processing by the command processor.
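The firmware itself is C++, but the detection logic is simple enough to sketch in Python: average the predictions over the last few windows, only accept a command above a confidence threshold, and ignore anything within a second of the previous detection. The window size and threshold below are assumptions:

```python
import time

class CommandDetector:
    def __init__(self, threshold=0.9, window=3, cooldown=1.0):
        self.threshold = threshold   # minimum averaged probability to accept a command
        self.window = window         # number of consecutive predictions to average over
        self.cooldown = cooldown     # seconds to ignore detections after a hit
        self.history = []
        self.last_detection = 0.0

    def process(self, probabilities):
        """probabilities: the model output for the latest one-second window."""
        self.history.append(probabilities)
        self.history = self.history[-self.window:]
        averaged = [sum(p[i] for p in self.history) / len(self.history)
                    for i in range(len(probabilities))]
        best = max(range(len(averaged)), key=lambda i: averaged[i])
        now = time.monotonic()
        if averaged[best] > self.threshold and now - self.last_detection > self.cooldown:
            self.last_detection = now
            return best              # index of the detected command
        return None                  # nothing confident enough yet
```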
[6:54] Our command processor runs a task that listens on this queue for commands;
[6:58] when a command arrives it changes the PWM signal that is being sent to the motors to either stop them
[7:04] or set the required direction.
[7:07] To move forward we drive both motors forward, for backwards we drive both motors backward.
[7:13] For left we reverse the left motor and drive the right motor forward and for right we do
[7:18] the opposite, right motor reverse and left motor forward.
[7:22] With our continuous servos a pulse width of 1500us should hold them stopped; lower than
[7:29] this should reverse them and higher should drive them forward.
[7:34] I’ve slightly tweaked the forward value for the right motor as it was not turning
[7:38] as fast as the left motor, which caused the robot to veer off to one side.
[7:44] Note that because we have the right motor upside down
[7:48] to drive it forward we actually run it in reverse
[7:51] and to drive it backwards we run it forward.
[7:54] You may need to calibrate your own motors to get the robot to go in a straight line.
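Again the firmware is C++, but the mapping from commands to servo pulse widths can be sketched in Python; the speed offset and trim values below are assumptions to be tuned for your own servos:

```python
STOP = 1500    # microseconds; roughly holds a continuous-rotation servo stopped
SPEED = 200    # offset from STOP for full speed (a value to tune, not from the video)
TRIM = 20      # extra speed for the right servo so the robot tracks straight (assumption)

def left_pulse(direction):
    # direction: +1 robot-forward, -1 robot-backward, 0 stop
    return STOP + direction * SPEED

def right_pulse(direction):
    # the right servo is mounted upside down, so robot-forward means reversing it
    return STOP - direction * (SPEED + TRIM)

# (left pulse, right pulse) in microseconds for each command
COMMANDS = {
    'forward':  (left_pulse(+1), right_pulse(+1)),
    'backward': (left_pulse(-1), right_pulse(-1)),
    'left':     (left_pulse(-1), right_pulse(+1)),   # left wheel back, right wheel forward
    'right':    (left_pulse(+1), right_pulse(-1)),   # left wheel forward, right wheel back
    'stop':     (left_pulse(0),  right_pulse(0)),
}
```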
[8:01] So, that’s the firmware code. Let’s see the robot in action again!
[8:37] How well does it actually work?
[8:40] Reasonably well…
[8:41] It’s a nice technology demonstration and fun project.
[8:45] It does occasionally confuse words and mix up left and right.
[8:48] It’s got a mind of its own and will just start wandering around if you don’t talk to it.
[9:05] We’re starting to reach the limits of what’s really possible.
[9:08] We have a limited amount of RAM to play with and the models are starting to get very big.
[9:14] We also have a limited amount of CPU to play with.
[9:16] The larger models take longer to process, making real-time detection harder.
[9:21] Having said that, there are a lot of improvements that can be made.
[9:25] So, thanks for watching, I hope you found the video useful and interesting, please subscribe if you did.
[9:32] All the code is on GitHub - let me know how you get on in the comments!
[9:36] See you in the next video!