With GFX, when you need to render a rectangle of highly varied pixels, such as a gradient, or other pattern, you can use batching to dramatically increase performance and smooth your draws. This project demonstrates asynchronous batching on an ESP32 with an ILI9341.
Introduction
GFX has had draw destination batching since its inception, but you couldn't readily take advantage of it directly, and there are limited situations in which GFX can use it on its own. This limited your ability to push a rectangular window of pixels to the display as fast as possible. Usually, you'd have to allocate a temporary bitmap, write to the bitmap, and then send it all at once, memory permitting.
That approach adds code complexity, requires enough memory to complete the operation, and has no fallback mechanism.
User level batching solves that problem by handling the bitmap technique for you, while falling back to driver level batching when there's not enough memory to use bitmaps.
With this article, I will endeavor to explain how to use it.
Concepts
The slowest part of displaying any graphics is communicating over the bus. This raw I/O is ultimately the determining factor of performance, but you can increase performance if you can reduce this traffic.
Little IoT display controllers almost all work roughly the same way, and due to the way they work, sending a pixel takes a significant amount of overhead, while sending a whole rectangle of pixels takes very little extra overhead.
Batching in GFX is the process of specifying a rectangular window to write to, and then writing out the pixels in order, left to right within each row and row by row from top to bottom, without specifying the coordinates of the individual pixels. Not specifying the coordinates is where we see the reduction in bus traffic.
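To put rough numbers on that saving, here is a minimal sketch that counts the bytes crossing the bus for an ILI9341-style controller. It assumes each address window costs 11 bytes (column address set, page address set, and memory write commands with their parameters) and each RGB565 pixel costs 2 bytes. The function names are illustrative and are not part of GFX.

```cpp
#include <cstdint>

// Assumed ILI9341-style costs: column address set (1 command + 4 data bytes),
// page address set (1 + 4) and memory write (1 command byte).
constexpr uint32_t window_overhead_bytes = 5 + 5 + 1;
constexpr uint32_t bytes_per_pixel = 2;  // RGB565

// Traffic when every pixel is addressed individually.
constexpr uint32_t per_pixel_traffic(uint32_t width, uint32_t height) {
    return width * height * (window_overhead_bytes + bytes_per_pixel);
}

// Traffic when the window is set once and the pixels are streamed in order.
constexpr uint32_t batched_traffic(uint32_t width, uint32_t height) {
    return window_overhead_bytes + width * height * bytes_per_pixel;
}

static_assert(per_pixel_traffic(320, 240) == 998400, "13 bytes per pixel");
static_assert(batched_traffic(320, 240) == 153611, "one window, then raw pixels");
```

Under these assumptions, a full 320x240 refresh drops from roughly 1 MB of traffic to about 150 KB, a factor of about 6.5.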
While low level/draw destination level batching is more efficient than writing individual pixels, you can get more efficient still by sending bitmaps. This isn't because of bus traffic, but because of SPI transaction overhead on the MCU itself, which we also want to reduce. Basically, sending a whole stream of data in one go is faster than sending, say, 16 bits of it at a time, even though the traffic on the bus is the same.
The bottom line is that bitmaps are the preferred mechanism, memory permitting. We want to fall back to low level batching if there's no memory for the bitmaps, and only go pixel by pixel if batching isn't supported at all, or if the destination supports blitting, in which case we don't need to worry about a bus.
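The order of preference described above can be sketched as a small decision function. This is only one reading of the paragraph, not GFX's actual implementation. The `write_strategy` enum, the `destination_caps` struct, and `choose_strategy()` are hypothetical names invented for this illustration.

```cpp
#include <cstddef>

// Hypothetical names for the three ways of getting pixels to a destination.
enum class write_strategy { per_pixel, driver_batch, bitmap_buffer };

// Hypothetical capability flags for a draw destination.
struct destination_caps {
    bool supports_blt;       // in-memory target, so no bus is involved
    bool supports_batching;  // driver exposes windowed batch writes
};

// Pick a strategy: blittable targets are written directly, otherwise prefer a
// bitmap when it fits in memory, then driver batching, then single pixels.
constexpr write_strategy choose_strategy(destination_caps caps,
                                         std::size_t bytes_needed,
                                         std::size_t bytes_free) {
    return caps.supports_blt            ? write_strategy::per_pixel
         : bytes_needed <= bytes_free   ? write_strategy::bitmap_buffer
         : caps.supports_batching       ? write_strategy::driver_batch
                                        : write_strategy::per_pixel;
}
```

For example, a full-screen batch on the ILI9341 needs 320*240*2 = 153600 bytes for a bitmap. With less free memory than that, this sketch would select driver level batching instead.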
One of the other advantages of using bitmaps is we can do asynchronous DMA transfers as you write out pixels, such that it's sending in the background while you're writing, increasing efficiency. In order to take advantage of this, you have to set up your driver's bus to enable DMA.
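The idea of sending in the background while you keep writing can be illustrated with two alternating buffers. The following is a simplified sketch rather than how GFX does it internally: `mock_bus` stands in for a DMA-capable bus (its "transfer" only completes in `wait()`), and `ping_pong_writer` is a hypothetical helper written for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Stand-in for a DMA-capable bus. On real hardware start_send() would return
// immediately and wait() would block until the transfer finished. Here the
// data is only copied in wait(), which models the buffer being "in flight".
struct mock_bus {
    std::vector<uint16_t> wire;  // everything that has been sent so far
    const uint16_t* pending = nullptr;
    std::size_t pending_count = 0;
    void start_send(const uint16_t* data, std::size_t count) {
        pending = data;
        pending_count = count;
    }
    void wait() {
        if (pending != nullptr) {
            wire.insert(wire.end(), pending, pending + pending_count);
        }
        pending = nullptr;
        pending_count = 0;
    }
};

// Fills one buffer while the other one is still being sent.
template <std::size_t N>
class ping_pong_writer {
    mock_bus& bus_;
    uint16_t buffers_[2][N];
    int active_ = 0;
    std::size_t used_ = 0;
    void flush() {
        if (used_ == 0) return;
        bus_.wait();  // the other buffer must be finished before we reuse it
        bus_.start_send(buffers_[active_], used_);
        active_ ^= 1;  // keep writing into the buffer that is now free
        used_ = 0;
    }
public:
    explicit ping_pong_writer(mock_bus& bus) : bus_(bus) {}
    void write(uint16_t pixel) {
        buffers_[active_][used_++] = pixel;
        if (used_ == N) flush();
    }
    // Sends any partial buffer and waits for the last transfer to complete.
    void commit() {
        flush();
        bus_.wait();
    }
};
```

The important design point is that a buffer is never written to while it is still being transferred, which is why `flush()` waits for the previous transfer before starting the next one.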
The Demo
What it looks like:
And now onto the meat:
main.cpp
First include our headers and import the namespaces:
#include <Arduino.h>
#include <tft_io.hpp>
#include <ili9341.hpp>
#include <gfx_cpp14.hpp>
#include "DEFTONE.hpp"
using namespace arduino;
using namespace gfx;
Now here's our wiring and configuration. The project supports the ESP WROVER KIT 4.1, or a standard ESP32 with the display wired to the default pins for VSPI: MOSI 23, MISO 19, SCLK 18. In addition the other pins are CS 5, DC 2, RST 4, and BCKL 15.
#define LCD_HOST VSPI
#ifdef ESP_WROVER_KIT // don't change these
#define PIN_NUM_MISO 25
#define PIN_NUM_MOSI 23
#define PIN_NUM_CLK 19
#define PIN_NUM_CS 22
#define PIN_NUM_DC 21
#define PIN_NUM_RST 18
#define PIN_NUM_BCKL 5
#define BCKL_HIGH false
#else // change these to your setup. below is fastest
#define PIN_NUM_MISO 19
#define PIN_NUM_MOSI 23
#define PIN_NUM_CLK 18
#define PIN_NUM_CS 5
#define PIN_NUM_DC 2
#define PIN_NUM_RST 4
#define PIN_NUM_BCKL 15
#define BCKL_HIGH true
#endif
If you don't see any display, try changing BCKL_HIGH to false, as some displays require the backlight pin to be pulled low instead of high.
Now we declare our bus and driver using the above settings:
using bus_t = tft_spi_ex<VSPI,
PIN_NUM_CS,
PIN_NUM_MOSI,
PIN_NUM_MISO,
PIN_NUM_CLK,
SPI_MODE0,
false,
320*240*2+8,
2>;
using lcd_t = ili9341<PIN_NUM_DC,
PIN_NUM_RST,
PIN_NUM_BCKL,
bus_t,
1,
BCKL_HIGH,
400, 200>;
On the ESP32, the DMA figure is derived by computing the total bytes needed to hold the framebuffer plus 8, in this case 320*240*2+8, because each pixel is 2 bytes at 320x240. We like DMA channel 2 on this platform because sometimes DMA channel 1 is used for other purposes.
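The arithmetic behind that figure can be written out as a small compile-time helper. This is just a sketch of the calculation described above. `dma_buffer_size()` is a hypothetical name and is not part of the driver library.

```cpp
#include <cstddef>

// Bytes needed to hold a full frame plus the 8 extra bytes described above.
constexpr std::size_t dma_buffer_size(std::size_t width,
                                      std::size_t height,
                                      std::size_t bytes_per_pixel) {
    return width * height * bytes_per_pixel + 8;
}

// 320 x 240 pixels at 2 bytes each is 153600 bytes, plus 8.
static_assert(dma_buffer_size(320, 240, 2) == 153608,
              "matches the 320*240*2+8 literal used in bus_t");
```

Since the result is a constant expression, a helper like this could stand in for the literal when experimenting with other display sizes.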
The following just makes it so we can get X11 colors for this display easily. For example, we can do color_t::sky_blue.
using color_t = color<typename lcd_t::pixel_type>;
After that, we have some settings that dictate the behavior of the application:
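const bool gradient = false; // true draws a gradient, false draws a checkerboard
const bool async = true; // true uses batch_async(), false uses batch()
const char* text = "hello world!"; // the text drawn over the background
const uint16_t text_height = 75; // text height passed to the font's scale()
const open_font& text_font = DEFTONE_ttf; // the font from DEFTONE.hpp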
Hopefully, the comments make it clear, but you can always play around with the values to get a better idea of how they work.
On to the global variables, of which we have several:
lcd_t lcd;
float hue;
srect16 text_rect;
float text_scale;
These hold the driver instance, the current hue, the rectangle where the text will be drawn, and the precomputed scale factor for the text, respectively.
Next we just premeasure the text in setup. Since the text is always the same, we do it here so it's only done once.
void setup() {
    Serial.begin(115200);
    text_scale = text_font.scale(text_height);
    text_rect = text_font.measure_text(ssize16::max(),
                                       spoint16::zero(),
                                       text,
                                       text_scale)
                    .bounds()
                    .center((srect16)lcd.bounds());
}
Finally, we get to the good stuff, the batching:
void loop() {
    // start from the current hue with S and V both set to 1
    hsv_pixel<24> px(true,hue,1,1);
    // begin a batch that covers the entire display
    auto ba = (async)?draw::batch_async(lcd,lcd.bounds()):
                      draw::batch(lcd,lcd.bounds());
    if(gradient) {
        // saturation varies by row, value varies by column
        for(int y = 0;y<lcd.dimensions().height;++y) {
            px.template channelr<channel_name::S>(((double)y)/lcd.bounds().y2);
            for(int x = 0;x<lcd.dimensions().width;++x) {
                px.template channelr<channel_name::V>(((double)x)/lcd.bounds().x2);
                ba.write(px);
            }
        }
    } else {
        // 16x16 checkerboard, still written left to right, top to bottom
        for(int y = 0;y<lcd.dimensions().height;y+=16) {
            for(int yy=0;yy<16;++yy) {
                for(int x = 0;x<lcd.dimensions().width;x+=16) {
                    for(int xx=0;xx<16;++xx) {
                        if (0 != ((x + y) % 32)) {
                            ba.write(px);
                        } else {
                            ba.write(color_t::white);
                        }
                    }
                }
            }
        }
    }
    // finish the batch before drawing anything else
    ba.commit();
    // pick a hue offset by .5 from the background for the text
    float hue2 = hue+.5;
    if(hue2>1.0) {
        hue2-=1.0;
    }
    px=hsv_pixel<24>(true,hue2,1,1);
    rgba_pixel<32> px2,px3;
    // convert to RGBA and set the alpha channel to .75
    convert(px,&px2);
    px2.channelr<channel_name::A>(.75);
    draw::text(lcd,text_rect,spoint16::zero(),text,text_font,text_scale,px2);
    // advance the hue for the next frame
    hue+=.1;
    if(hue>1.0) {
        hue = 0.0;
    }
}
You can see most of the complexity here is the actual drawing, and part of that is the requirement that the pixels be in order from left to right, top to bottom. Particularly with the checkerboard, it required a little creativity via those quadruple nested loops. The actual batching is quite simple, requiring three calls: batch<>()/batch_async<>(), write<>(), and commit(). If you don't commit explicitly, the batch will be committed when it goes out of scope. However, if you attempt to draw other things while a batch is in progress, the results are undefined, so it's best to commit explicitly in order to avoid the situation.
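To see why the quadruple nested loops satisfy the ordering requirement, the sketch below reproduces the checkerboard decision without the display and compares it against a plain row-by-row loop. It assumes the width and height are multiples of 16, as they are at 320x240. Both function names are made up for this example, and `true` stands for a white pixel.

```cpp
#include <vector>

// Same loop structure as the demo: true means white, false means the hue color.
std::vector<bool> checkerboard_nested(int width, int height) {
    std::vector<bool> out;
    for (int y = 0; y < height; y += 16) {
        for (int yy = 0; yy < 16; ++yy) {
            for (int x = 0; x < width; x += 16) {
                for (int xx = 0; xx < 16; ++xx) {
                    out.push_back(0 == ((x + y) % 32));
                }
            }
        }
    }
    return out;
}

// Equivalent two-loop form: a square is white when the sum of its 16-pixel
// column index and row index is even.
std::vector<bool> checkerboard_flat(int width, int height) {
    std::vector<bool> out;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            out.push_back(((x / 16 + y / 16) % 2) == 0);
        }
    }
    return out;
}
```

Both functions emit exactly width*height values in left to right, top to bottom order, so either loop shape could feed `write()`. The flat form also behaves sensibly when the dimensions are not multiples of 16, whereas the nested form would emit extra pixels.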
It should be noted that there's a much faster way to draw the checkerboard pattern without batching: simply draw the squares using draw::filled_rectangle<>(). That takes less bus traffic than updating every pixel of the display, which is what we do when we batch in this case. It was done this way here purely to demonstrate batching.
History
- 25th April, 2022 - Initial submission