View All Posts
Want to keep up to date with the latest posts and videos? Subscribe to the newsletter
HELP SUPPORT MY WORK: If you're feeling flush then please stop by Patreon Or you can make a one off donation via ko-fi

Discover how to recreate a Sudoku app using browser APIs and ecosystem advancements, and explore the image processing pipeline technique for extracting Sudoku puzzles and solving them.

Related Content

[0:00] Hey Everyone!
[0:01] A long time ago I wrote an app for the iPhone that let you take a grab a sudoku puzzle using
[0:08] your iPhone’s camera.
[0:10] Recently when I was investigating my self organising Christmas lights project I realised
[0:16] that the browser APIs and ecosystem had advanced to the point where it was possible
[0:22] to recreate my Sudoku application running purely in the browser.
[0:27] As you can see it works pretty well - you can try it out for yourself using the links
[0:32] in the description and as always, all the code is in GitHub for you to try out for yourself.
[0:38] Hopefully, this video will give you a good idea how the system works and the thinking behind what I’ve done.
[0:46] Let’s take a look under the hood…
[0:49] Let’s define our problem:
[0:52] Given an image identify and extract the Sudoku puzzle, recognise the numbers in each square
[0:59] of the puzzle, solve the puzzle, render the results on top of the original image.
[1:06] As a classically trained image processing guy, my first port of call is to remove redundant information.
[1:14] We don’t care about colours, so we’ll convert the image to greyscale.
[1:18] As you can see we’ve not really lost any information.
[1:22] Looking at the image, we only ready care about paper and ink,
[1:26] so it makes sense to binarise the image and make black or white.
[1:31] Next we need to identify the blob that is the puzzle.
[1:35] Once we’ve got that we can extract the image of the puzzle and extract the contents of
[1:40] each box of the puzzle.
[1:41] Applying some OCR to the contents of each box, we can then solve the puzzle and our job is done.
[1:50] It’s a pretty straightforward image processing pipeline!
[1:54] Let’s dive into each of these steps in turn and see how they work.
[1:58] We’re taking a feed from the camera on the device.
[2:02] This comes into us as an RGB image.
[2:05] We’re not really interested in colour as we’re working with printed puzzlestypically
[2:09] which will be printed in black and white.
[2:13] So our first step is to convert from RGB to greyscale.
[2:18] If we look at the standard formula for this we get this fomula (
[2:21] We can see that a large portion of the value is coming from the green channel.
[2:26] So we can apply a little shortcut to our RGB to greyscale conversion and just take the
[2:31] green channel of the image.
[2:33] We’re going to be using morphological operations for locating the puzzle
[2:39] typically these work on black and white binary images, so our next step is binarise our image.
[2:46] This involves applying a threshold to seperate out foreground from background pixels.
[2:52] I’ve simulated an image here that has been taken in poor lighting conditions.
[2:58] As you can see from the histogram there’s not a clear seperation of ink and paper.
[3:03] This makes it very hard to apply a global threshold to the whole image.
[3:09] However, if we look at a small section of the image we can see from the histogram that
[3:14] we do have an obvious collection of ink pixels and paper pixels.
[3:19] What we need to do is examine each pixel in the context of its surrounding area.
[3:24] A simple way to do this is to apply a blur to the image and then compare the original
[3:31] image with this blurred image.
[3:35] Doing this gives us a very clean segmented image even when lighting conditions are not ideal
[3:42] We’ve now got a fairly clean image that contains our puzzle along with a bunch of random other
[3:48] printed elements.
[3:50] A classic approach would be to jump straight in here with something like a hough transform to detect lines
[3:56] However, we can be a bit smarter about what we are doing and take a bit of a shortcut.
[4:03] We know that whoever is taking the picture should be pointing the camera at a puzzle.
[4:07] So we can assume that the puzzle would be the largest object in the image
[4:13] We can also apply simple heuristics to filter out elements that probably aren’t a puzzle.
[4:21] If we scan the image pulling out all connected components.
[4:26] The connected component that contains the puzzle should be the one with the most points. And matches our heuristics.
[4:32] This lets us isolate the puzzle grid.
[4:36] With the puzzle grid identified we now need to identify the four corners of the puzzle.
[4:43] Once again we can apply a fairly simple heuristic to this problem.
[4:47] The pixels that are in the four corners should have the smallest manhatten distance from
[4:53] the corners of the max extents of the extracted component.
[4:58] You can see the calculation of manhattan distance in the animations.
[5:04] Runing this algorithm for all four extents of our located puzzle gives us the four corners.
[5:13] We know where the puzzle is in our image.
[5:17] We have the location of the four corner points.
[5:21] To extract the puzzle image we can reformulate our problem into one of a homographic transform.
[5:29] We want to map from the camera image of the puzzle to an ideal image of the puzzle that
[5:35] is not distorted by perpective or rotations.
[5:40] We do this by computing a homography between the two planes.
[5:46] The formula shown can be transformed into this new formula - as you can see we need four points
[5:54] - this maps nicely onto the corner points that we’ve already located for the puzzle.
[6:02] If we rewrite the matrices algebraically we can find h using this algorithm.
[6:07] Please see the linked paper for details on how this is derived
[6:14] and alternative more stable methods for calculating the homography.
[6:19] Now that we have the homography between our ideal image and the camera image we can map pixels between them.
[6:28] This lets us extract a nice square puzzle image from the camera image.
[6:33] Now that we have the square puzzle image we need to extract the contents of each individual cell.
[6:40] We can use the thresholded version of the image to help us with this.
[6:45] Looking at each box in turn we extract the bounds of any connected region starting from
[6:51] the center of each cell.
[6:54] We can then use this bounds to extract an image of the digit from the square greyscale image.
[7:01] If there is no connected region in the center of the cell then we know that it is empty.
[7:07] We now have an image for each populated cell of the puzzle.
[7:13] We’re going to use a neural network to perform OCR on each image.
[7:18] I’m going to use tensorflow to train the network and then tensorflow js to run the network
[7:24] in the browser.
[7:25] Our neural network is going to be trained to recognise the digits 1 through 9.
[7:32] I’ve synthesized a large number of training examples from a wide selection of fonts rendering
[7:38] a greyscale image of each digit.
[7:42] I’ve also added a small amount of blur and some Gaussian noise to each image.
[7:47] Hopefully this will provide a reasonable representation of what we will be getting in a live environment.
[7:54] I’ve also sperated out 10% of the images into a testing data set that we can use to evaluate
[8:01] how well our network performs.
[8:06] I’m using an interactive jupyter notebook to train the network.
[8:11] We’ll import TensorFlow and then set up the batch size
[8:15] and the number of epochs for our training session.
[8:19] I’m going to use a batch size of 32 and run the training for 100 epochs.
[8:26] We’re also going to resize our images to 20 by 20 pixels.
[8:32] We need to remember to also do this in the javascript code so that it matches.
[8:39] I’m augmenting our training data to help the network generalise.
[8:44] This will add some random mutations to our input data set.
[8:49] We’ll split out 20% of our images into a validation set for use during training and also normalises the pixel values.
[9:00] We’ll also have a data generator that does no augmentation and only normalises the pixels values.
[9:08] We can now load the data, splitting our training images into training and validation subsets.
[9:14] Here’s a small sample of our augmented training data.
[9:25] For our problem we don’t need a very complicated model, I have built a single convolution layer
[9:31] followed by a dense hidden layer and then finally an output layer with nine outputs
[9:38] one output for each digit.
[9:40] We’ll log the trainign out to tensorboard and also create a confusion matrix to show
[9:46] us which digits are causing problems.
[9:49] For training, we need to use CategoricalCrossentropy loss function.
[9:57] Running the training we can see that we have pretty good accuracy
[10:01] on both training and validation.
[10:05] Looking at the confusion matrix we can see that “1”s and “7”s are currently getting confused.
[10:15] Once the training is complete we can see that we have good performance on both the training data
[10:21] and the validation data.
[10:24] This indicates that our network is performing well on data it has seen as well as data it
[10:31] has never seen before.
[10:33] Looking at the confusion matrix there are still some issues between 1s and 7s.
[10:40] Let’s save the model and see which images it’s actually failing on.
[10:45] Looking at the failed images I’m pretty happy with the performance.
[10:49] I think the fonts that it is failing on would not be used in the real world so I’m happy
[10:54] to ignore these failures and move on with this network.
[10:58] As I’m happy with neural network structure I’m going to train it on all the data and
[11:06] use the resulting network in my production code.
[11:12] The final training gives us really good acuuracy.
[11:16] Checking to see if any images fail.
[11:19] We now see that no images are incorrect.
[11:23] We’ll convert this model for use in TensorFlow.js.
[11:28] To run our model we need to perform the same image resizing and normalisations as we did
[11:34] during training and then we can simply feed our images into the model and ask it to predict
[11:41] the digits.
[11:44] We can solve a sudoku puzzle using Donald Knuths application of Dancing Links to his
[11:50] Algorithm X for solving the exact cover problem.
[11:55] Dancing links uses the seemingly very simple realisation that we can remove and put back
[12:01] nodes from a doubly-linked list very efficiently.
[12:06] We can see that when we remove node B from the list it still has pointers back to the
[12:12] nodes that used to link to it.
[12:15] This means that node B contains all the information required to add it back into the list.
[12:23] So what is Algorithm X?
[12:27] Algorithm X is an algorithm that solves the exact cover problem.
[12:33] The exact cover problem is solved by finding the selection of rows in the grid that will
[12:37] combine so that there is a 1 in each column.
[12:41] This example is taken from Wikipedia -
[12:48] Looking at our grid Column 1 has the fewest entries so we select it.
[12:56] This gives us a choice of adding rows A or B to our solution.
[13:03] We try adding row A to our potential solution set.
[13:06] We can see it has an entry in columns 1, 4 and 7.
[13:10] And these columns have entries in Rows A, B, C, E and F
[13:17] Removing these columns leave us with row D.
[13:21] We now have zero rows in column 2 so we have failed to find a solution.
[13:27] We need to backtrack and try again.
[13:32] This time we try adding Row B to our solution.
[13:36] This flags column 1 and 4 for removal which means that rows A, B and C should also be removed.
[13:44] Doing this leaves us with rows D, E and F and columns 2, 3, 5,6 and 7.
[13:53] Now column 5 has the lowest number of entries so we select it.
[13:59] This gives us a choice of Row D. Selecting row D flags columns 3, 5 and 6 for removal
[14:07] which means that rows D and E should also be removed.
[14:12] We now have Row F remaining.
[14:14] We include Row F in our solution which means that columns 2 and 7 are flagged for removal
[14:21] which removes row F.
[14:23] We now have an empty matix which means we have found a solution.
[14:29] The final solution is rows B, D and F.
[14:33] And if you look in the grid you can see that combining those rows together would give you a 1 in each column.
[14:41] So, how does dancing links fit into this?
[14:45] Dancing links solves the problem of backtracking in an efficient way.
[14:50] We can encode the sparse matrix as a doubly linked circular list going left and right
[14:55] for the rows and up and down for the columns.
[14:59] Since these are now linked lists we can use fact that we can efficiently remove and re-add
[15:04] elements to perform the column and row removals and now when we backtrack it can be done very efficiently.
[15:10] This is what a Sudoku puzzle looks like when it’s been turned into a set of constraints.
[15:17] You can see as we add the known numbers.
[15:20] The number of rows and columns decreases.
[15:23] As we run the search algorithm you can see how quickly the search space is reduced.
[15:29] You can also see that it occasionally needs to backtrack before we find the complete solution.
[15:36] To render our results we could draw our text into a square image and then project each
[15:42] pixel onto the image coming from the camera.
[15:45] This would be quite computationally intensive.
[15:49] Alternatively, for each puzzle cell, we could take an imaginary line through the centre of the cell.
[15:55] We can then project that using our homographic transform and use this project line to calculate
[16:01] an approximate height and angle for the cell.
[16:04] We then simply draw the digit at the projected location of the cell with the calculated height and angle.
[16:14] So that’s it for this video.
[16:16] All the code for this project is on GitHub - check the video description for the link.
[16:24] I hope you’ve enjoyed this video - please hit the subscribe button and leave any thoughts
[16:29] you might have in the comments.
[16:31] Thanks for watching and I’ll see you in the next video!

HELP SUPPORT MY WORK: If you're feeling flush then please stop by Patreon Or you can make a one off donation via ko-fi
Want to keep up to date with the latest posts and videos? Subscribe to the newsletter
Blog Logo

Chris Greening

> Image


A collection of slightly mad projects, instructive/educational videos, and generally interesting stuff. Building projects around the Arduino and ESP32 platforms - we'll be exploring AI, Computer Vision, Audio, 3D Printing - it may get a bit eclectic...

View All Posts