In my search for the fastest JPEG decoder for the ESP32, I’ve gone from the original JPEG decoder library at 109 milliseconds, to the faster TJPEG decoder at 55 milliseconds, and then to the impressive JPEG dec library at just 32 milliseconds. But wait, there’s more. A GitHub issue suggested decoding JPEGs with SIMD, which brings decode time down to just 20 milliseconds. There’s a catch when it comes to drawing though: the decoding doesn’t overlap with the pixel transfer to the display, so in practice it’s no faster to use than JPEG dec. It still pays off for streaming JPEGs, because decoding of the next frame can start while the current one is being sent to the screen. The bigger issue is memory: a fully decoded frame needs around 130 kilobytes, which is a lot to ask of a standard ESP32 module. Promising improvements are in the works though, so stay tuned!
[0:00] I’ve been sent a link to a very fast JPEG decoder for the ESP32, but just how fast is
[0:05] it and is it actually practical?
[0:07] We’ve been decoding JPEGs on the ESP32.
[0:09] We’ve gone from the fairly slow to the pretty snappy.
[0:13] We’ve got the original memory optimised JPEG decoder library clocking in at 109 milliseconds.
[0:19] And then we’ve got the improved version, TJPEG decoder, which is based on the tiny
[0:23] JPEG decoder codebase.
[0:24] This is about twice as fast at only 55 milliseconds.
[0:28] And then we’ve got the amazing JPEG dec library that only takes 32 milliseconds.
[0:33] But can we do any better?
[0:34] Well, you wouldn’t be watching this if we couldn’t.
[0:37] I received a slightly vague GitHub issue on my ESP32 TV project, decoding JPEG with SIMD,
[0:44] along with a link to some sample code.
[0:46] Now SIMD stands for Single Instruction Multiple Data.
[0:49] Basically it can perform the same operation on multiple data elements simultaneously.
[0:54] The ESP32 has a bunch of these instructions and the ESP32-S3 has a bunch more.
[0:59] Now decoding JPEG can really take advantage of these instructions, potentially giving
[1:03] a huge performance improvement.
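To make that concrete: the hot loops in a JPEG decoder do the same small piece of arithmetic over and over for every pixel. Here’s a plain scalar sketch (not the library’s actual code) of one such loop, the YCbCr to RGB565 colour conversion; a SIMD build does the same maths on a whole vector of pixels with each instruction, which is where the big speed-up comes from.

```cpp
#include <stdint.h>
#include <stddef.h>

static inline uint8_t clamp255(int v) {
  return v < 0 ? 0 : (v > 255 ? 255 : (uint8_t)v);
}

// Scalar YCbCr -> RGB565 conversion, one pixel per loop iteration.
// A SIMD version performs the same multiplies and adds on many pixels
// per instruction instead of one at a time.
void ycbcr_to_rgb565(const uint8_t *y, const uint8_t *cb, const uint8_t *cr,
                     uint16_t *out, size_t count) {
  for (size_t i = 0; i < count; i++) {
    int Y  = y[i];
    int Cb = cb[i] - 128;
    int Cr = cr[i] - 128;
    // Fixed-point approximation of the JFIF conversion matrix
    int r = clamp255(Y + ((91881 * Cr) >> 16));
    int g = clamp255(Y - ((22554 * Cb + 46802 * Cr) >> 16));
    int b = clamp255(Y + ((116130 * Cb) >> 16));
    out[i] = (uint16_t)(((r & 0xF8) << 8) | ((g & 0xFC) << 3) | (b >> 3));
  }
}
```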
[1:05] If we used the library suggested in the GitHub issue, we’d be down to just 20 milliseconds
[1:09] to decode a JPEG image.
[1:11] Now that is pretty amazing.
[1:13] But there’s some really interesting timing when we combine decoding with drawing.
[1:18] This chart shows the overhead added when we start sending pixels to the display.
[1:22] Our first two libraries need an extra 10 milliseconds when we draw to the screen.
[1:27] The JPEG dec library needs just 6 milliseconds extra.
[1:31] But for some reason our super fast library needs an extra 17 milliseconds to draw to
[1:35] the screen.
[1:36] So what’s going on?
[1:37] Surely we’re sending the same number of pixels.
[1:39] Why is there a difference?
[1:41] Well this is pretty interesting.
[1:43] We’re using DMA to draw the pixels to the display.
[1:46] This means that once the DMA process has been started, the CPU can get back to doing work,
[1:51] so potentially we can overlap decoding the JPEG with drawing the JPEG.
[1:55] The first two libraries decode the JPEG in 16 by 16 blocks of pixels at a time.
[2:00] I’ve slowed down this process here so you can see it drawing.
[2:03] We can send these 256 pixels off to the display using DMA and while that’s happening, the
[2:09] CPU can be working on the next block of pixels.
[2:12] This DMA transfer is pretty small so we end up waiting for the CPU to give us more data
[2:16] to send to the display.
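This isn’t the project’s exact drawing code, but the pattern looks roughly like this, assuming TJpg_Decoder with TFT_eSPI’s DMA calls (pushImageDMA and dmaWait); swap in whatever your display driver provides:

```cpp
#include <cstring>
#include <TFT_eSPI.h>
#include <TJpg_Decoder.h>

TFT_eSPI tft;
// The DMA engine reads from this buffer while the decoder works on the next
// block, so the pixels must stay valid until the transfer has finished.
static uint16_t dmaBuffer[16 * 16];

// Called by TJpg_Decoder for each decoded block of pixels.
bool onBlockDecoded(int16_t x, int16_t y, uint16_t w, uint16_t h, uint16_t *bitmap) {
  tft.dmaWait();                                    // let the previous block finish
  memcpy(dmaBuffer, bitmap, w * h * sizeof(uint16_t));
  tft.pushImageDMA(x, y, w, h, dmaBuffer);          // start this transfer...
  return true;                                      // ...and return so the CPU decodes the next block
}

void drawJpeg(const uint8_t *jpegData, uint32_t jpegSize) {
  // tft.init() and tft.initDMA() are assumed to have been called in setup()
  TJpgDec.setCallback(onBlockDecoded);
  tft.startWrite();                                 // keep the display selected for the DMA transfers
  TJpgDec.drawJpg(0, 0, jpegData, jpegSize);
  tft.dmaWait();
  tft.endWrite();
}
```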
[2:18] The JPEG dec library takes this even further.
[2:21] It decodes 128 by 16 pixels at a time, which is 2048 pixels.
[2:27] This is much better.
[2:28] We can really take advantage of DMA to blast the pixels out to the display while the CPU
[2:32] is doing some work.
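JPEG dec (JPEGDEC) follows the same callback pattern, just with much bigger blocks. Here’s a minimal sketch; displayPushDMA is a hypothetical stand-in for whatever DMA routine your display driver offers:

```cpp
#include <JPEGDEC.h>

// Hypothetical placeholder: push a rectangle of RGB565 pixels to the display over DMA
void displayPushDMA(int x, int y, int w, int h, uint16_t *pixels);

JPEGDEC jpeg;

// JPEGDEC hands the callback up to 128x16 pixels at a time, so each DMA
// transfer is ~4K and the display stays busy while the CPU decodes the next block.
int onMCUBlock(JPEGDRAW *pDraw) {
  displayPushDMA(pDraw->x, pDraw->y, pDraw->iWidth, pDraw->iHeight, pDraw->pPixels);
  return 1;  // non-zero means keep decoding
}

void drawJpeg(uint8_t *jpegData, int jpegSize) {
  if (jpeg.openRAM(jpegData, jpegSize, onMCUBlock)) {
    jpeg.decode(0, 0, 0);  // x offset, y offset, options
    jpeg.close();
  }
}
```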
[2:34] So why does our new super fast JPEG decoder seem so slow to draw?
[2:38] Well, the fast JPEG decoder decodes the entire image in one go and then we have to draw all
[2:43] the pixels to the screen.
[2:45] This means that unlike the other libraries, we don’t get any overlap between the processing
[2:49] and the sending of the pixels.
[2:51] Now at first this may seem a bit disappointing.
[2:53] We’ve got a really fast JPEG decoder but it’s not actually any faster to use than
[2:57] JPEG dec.
[2:59] However, for our TV project we’re streaming JPEGs so we can still take advantage of overlapping
[3:04] DMA and CPU work.
[3:06] We can start the DMA transfer for one frame and immediately start decoding the next frame.
[3:11] This could potentially work really well.
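A rough sketch of that streaming loop, with two full-frame buffers used as a ping-pong pair. Everything here apart from the structure is a hypothetical placeholder: nextJpeg, decodeFrame, displayPushFrameDMA and displayWaitDMA stand in for the real streaming, decoding and display code.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical placeholders for the real project code
const uint8_t *nextJpeg(size_t *size);                               // fetch the next JPEG frame from the stream
void decodeFrame(const uint8_t *jpeg, size_t size, uint16_t *out);   // decode into a full-frame buffer
void displayPushFrameDMA(const uint16_t *frame);                     // start a DMA transfer of a whole frame
void displayWaitDMA();                                               // block until the current transfer is done

static uint16_t *frameBuffers[2];  // two full-frame buffers (~130KB each)

void streamLoop() {
  int current = 0;
  size_t size;
  const uint8_t *jpeg = nextJpeg(&size);
  decodeFrame(jpeg, size, frameBuffers[current]);       // prime the first frame
  for (;;) {
    displayWaitDMA();                                   // previous frame fully sent?
    displayPushFrameDMA(frameBuffers[current]);         // start sending this frame...
    current ^= 1;
    jpeg = nextJpeg(&size);
    decodeFrame(jpeg, size, frameBuffers[current]);     // ...while the CPU decodes the next one
  }
}
```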
[3:13] Now there is of course a massive trade-off going on here.
[3:16] Our slow libraries don’t need much RAM.
[3:19] At 2 bytes per pixel (RGB565), a 16 by 16 block is only 512 bytes.
[3:23] A 128 by 16 block of pixels is only 4K.
[3:26] But a fully decoded image is around 130 kilobytes.
[3:31] The ESP32-S3 module I’m using does have PSRAM built in, so it’s not a huge problem
[3:35] for me, but for a regular ESP32 module this would cause quite a problem.
[3:40] It’s going to be pretty difficult to get that much free RAM in one contiguous block,
[3:44] and it would be near impossible to get two blocks of RAM that size for overlapping decoding.
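If you do have PSRAM, the frame buffers can live in external RAM rather than the internal heap. Here’s a minimal sketch using ESP-IDF’s heap_caps_malloc; the 280 by 240 display size is my assumption to match the roughly 130 kilobyte figure:

```cpp
#include <cstdint>
#include <cstddef>
#include "esp_heap_caps.h"

constexpr int WIDTH = 280, HEIGHT = 240;  // assumed display size: 280 * 240 * 2 bytes ≈ 131KB per frame
constexpr size_t FRAME_BYTES = WIDTH * HEIGHT * sizeof(uint16_t);

static uint16_t *frameBuffers[2];  // the ping-pong pair from the streaming sketch above

bool allocateFrameBuffers() {
  for (int i = 0; i < 2; i++) {
    // Ask for the buffer in PSRAM so the much smaller internal heap is left alone
    frameBuffers[i] = (uint16_t *)heap_caps_malloc(FRAME_BYTES, MALLOC_CAP_SPIRAM);
    if (frameBuffers[i] == nullptr) {
      return false;  // no PSRAM (or not enough) - a regular ESP32 will usually fail here
    }
  }
  return true;
}
```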
[3:50] But stay tuned.
[3:52] The library is being actively worked on, and partial decoding and other improvements are in the
[3:56] works.
[3:57] It’s pretty exciting stuff.