Sometimes, you need to process a lot of data a little at a time. We do that here with a streaming JSON reader built on my LexContext class, ported to the Arduino platform for 32 bit devices and targeted and tested on an ESP32.
Introduction
JSON is everywhere these days. IoT gadgets are slowly but surely becoming similarly ubiquitous. Naturally, the two are going to collide. There is an Arduino library for processing JSON out there, but it doesn't stream very well because it's not a pull parser, which is what this little library gives you. This library is a port of part of the JsonTextReader in my Json library, which lets you read JSON data of virtually any size by examining just a little bit of it at a time.
Furthermore, this article is intended to give you a technique for simplifying lexing and parsing on the fly over a streaming source. To that end, I have also ported LexContext to this platform and adapted the JsonReader offering to use it.
You might think it strange to port something from C# to C++, but these two classes were perfect candidates to port to the Arduino SDK because they were already simple, small and fast. They just needed a little bit of witchcraft to transform them to C++ and make them function on a tiny device.
With these handy widgets, you will now be able to scan and parse big JSON freely, or build your own efficient parsers over a streaming source.
Update: Small code cleanup, large addition to article exploring the inner workings
Update 2: Improved error handling, bug fixes. Note that the code in the article does not reflect these changes. The error handling significantly clutters the code, so I decided to leave it out here. Use lastError() to get the error code and value() to get the error text.
Update 3: Bugfix with reporting incorrect error message during some out of memory conditions
Update 4: Bugfix with skipping (partial parsing) - removal of non-canonical skip since it wasn't needed. Changed the name of Key to Field for consistency. Updated article code.
Conceptualizing this Mess
Pull parsing works by grabbing a little bit of a structured document at a time and doing just enough bookkeeping to know what you're on top of and what your next move is. This usually involves building a finite state machine to do the heavy lifting. That's fancy language for a switch/case over an integer member variable that we update as we go:
int _state = -1;

bool read() {
    switch (_state) {
    case -1:            // initial state: prime things, then fall through
        _state = 0;
    case 0:
        _state = 1;
        return true;
    case 1:
        if (textUnderCursorIsB()) {
            _state = 1; // stay here as long as we see a "b"
            return true;
        }
        _state = -2;    // end of input
        return false;
    }
    return false;       // -2 or any unknown state
}
It's spaghetti. It's not very friendly to humans but CPUs eat this stuff up. We're working pretty close to the metal on these little devices so complications like state machines are justifiable if they are efficient.
The upside, though, is that we can now call it iteratively simply by doing this:
while(parser.read()) { ... }
Each call to read() executes one step of the parse. The general idea is that we set up the state machine so that read() returns true until there is no more data to be read.
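To make the pattern concrete, here's a tiny self-contained sketch of the idea in plain C++. Nothing here is from the library - the class name and states are made up: a pull scanner whose read() consumes one run of digits, or one separator character, per call, and returns false at end of input.

```cpp
#include <cctype>
#include <string>

// Hypothetical pull scanner over an in-memory string: each read() call
// consumes one run of digits or one non-digit character and returns
// false once the input is exhausted. token() holds what was consumed.
class DigitRunReader {
public:
    explicit DigitRunReader(std::string src) : _src(std::move(src)) {}
    bool read() {
        switch (_state) {
        case -1:            // initial: prime the machine, fall through
            _state = 0;
        case 0:             // scanning
            if (_pos >= _src.size()) { _state = -2; return false; }
            _token.clear();
            if (std::isdigit(static_cast<unsigned char>(_src[_pos]))) {
                while (_pos < _src.size() &&
                       std::isdigit(static_cast<unsigned char>(_src[_pos])))
                    _token += _src[_pos++];
            } else {
                _token += _src[_pos++];
            }
            return true;
        default:            // -2: end of input
            return false;
        }
    }
    const std::string& token() const { return _token; }
private:
    int _state = -1;
    std::size_t _pos = 0;
    std::string _src, _token;
};
```

Calling it in a `while (r.read()) { ... }` loop yields one token per iteration, exactly the usage shape shown above.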
That's parsing, but before we even get that far, we need a way to manage a streaming cursor. Previously, LexContext wrapped things like a TextReader or an IEnumerable<char> source. Now, we simply wrap the Arduino Stream class, since classes like File and WiFiClient derive from it.
LexContext provides facilities for basic operations over a streaming source: capturing the current character under the cursor, advancing along the input, reading and skipping whitespace, and reading or skipping until a certain character is encountered. When we create one, we instantiate it with a fixed length buffer to hold capture data. This buffer should be as long as the maximum amount of data you anticipate working with at once - for JSON, that might be the length of a field name or scalar value. A parser can then use this to manage its input while parsing.
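As a rough illustration of the shape of such a cursor - not the real LexContext, just a made-up miniature over an in-memory string - something like this captures the core operations:

```cpp
#include <cstring>

// Illustrative LexContext-style cursor with a fixed capture buffer,
// reading from an in-memory C string instead of an Arduino Stream.
// The method names mirror the article's API; the class itself is a sketch.
template<size_t S>
class MiniLexContext {
public:
    static const int EndOfInput = -1;
    void begin(const char* src) {
        _src = src; _pos = 0; _len = 0;
        _current = EndOfInput; _buf[0] = 0;
    }
    int current() const { return _current; }
    // Moves to the next character and returns it, or EndOfInput.
    int advance() {
        _current = _src[_pos] ? (unsigned char)_src[_pos++] : EndOfInput;
        return _current;
    }
    // Appends the current character to the capture buffer; false if full.
    bool capture() {
        if (_current == EndOfInput || _len + 1 >= S) return false;
        _buf[_len++] = (char)_current;
        _buf[_len] = 0;
        return true;
    }
    const char* captureBuffer() const { return _buf; }
    void clearCapture() { _len = 0; _buf[0] = 0; }
private:
    const char* _src = "";
    size_t _pos = 0, _len = 0;
    int _current = EndOfInput;
    char _buf[S] = {0};
};
```

The fixed template size plays the same role as the real class's capture buffer length: it bounds the largest single field name or scalar value you can work with at once.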
Coding this Mess
To set this up, wire an SD card reader to your 32 bit device's primary SPI port and primary CS/SS. Copy the included data.json file to the root directory of a FAT32-formatted SD card, and then you can run the demo.
LexContext
LexContext works such that current() always retrieves the character under the cursor, while advance() moves to the next position and returns that character. capture() appends the current character to the buffer, and captureBuffer() gives us a string from that buffer. line(), column() and position() track the cursor's location. We start by declaring it with a capture buffer size:
LexContext<1024> lc;
Here's an example of using it to read from the serial port until a non-digit is encountered:
lc.begin(Serial);
while (LexContext<1024>::EndOfInput != lc.advance() && isdigit((char)lc.current())) {
    Serial.print((char)lc.current());
}
Serial.println();
Above, we initialize it with Serial as the input source. We then advance and check the character - first that it's not the end of input marker, and then that it's a digit. If it is, we print it and keep going. Keep in mind that we advance before doing anything else: the cursor must be primed by advancing once after it's initialized. If your routine needs to ensure this happened but doesn't know whether it has, you can call ensureStarted().
The API is essentially the same as the C# API, with the casing style changed to fit. Please see that article for more details. The main differences are initializing it with the capture buffer size and then calling begin() with the input source. Keep in mind that unlike the C# API, there is no corresponding close() mechanism here. Due to differences in the underlying architecture, the input source must be closed from outside this class when you're finished.
JsonReader
Now that we've seen our cursor management, let's come back around to the JSON parsing. Pull parsers are very efficient, but they take some getting used to. We saw above that we run partial parses in a loop until there's no more data. Inside that loop is where the interesting things happen.
First, we probably want to examine nodeType() to see what kind of node we're on. This can be Initial, Value, Field, Array, EndArray, Object, EndObject, EndDocument or Error. These constants are accessed off the JsonReader class itself. If it's a Value node, we might want to examine valueType() as well, which can be String, Boolean, Number, or Null. If it's a String, you'll most likely call undecorate() to remove the quotes and translate the escapes into real characters. Note that after you do this, future calls to valueType() on the same node are unreliable. Finally, you can call value() to get the value as a char*, numericValue() to get it as a double, or booleanValue() to get it as a bool. Note that undecorate() makes calls to these unreliable as well. The reason undecorate() affects all of these functions is that it modifies the string value in place to save space.
Let's take a look at the provided ino file:
#include <SD.h>
#include <Json.h>

typedef JsonReader<2048> JsonReader2k;
JsonReader2k jsonReader;

void dumpJsonFile() {
  File file = SD.open("/data.json", FILE_READ);
  if (!file) {
    Serial.println(F("/data.json not found or could not be opened"));
    while (true);
  }
  jsonReader.begin(file);
  while (jsonReader.read()) {
    switch (jsonReader.nodeType()) {
      case JsonReader2k::Value:
        Serial.print("Value ");
        switch (jsonReader.valueType()) {
          case JsonReader2k::String:
            Serial.print("String: ");
            jsonReader.undecorate();
            Serial.println(jsonReader.value());
            break;
          case JsonReader2k::Number:
            Serial.print("Number: ");
            Serial.println(jsonReader.numericValue());
            break;
          case JsonReader2k::Boolean:
            Serial.print("Boolean: ");
            Serial.println(jsonReader.booleanValue());
            break;
          case JsonReader2k::Null:
            Serial.print("Null: ");
            Serial.println("null");
            break;
        }
        break;
      case JsonReader2k::Field:
        Serial.print("Field ");
        Serial.println(jsonReader.value());
        break;
      case JsonReader2k::Object:
        Serial.println("Object");
        break;
      case JsonReader2k::EndObject:
        Serial.println("End Object");
        break;
      case JsonReader2k::Array:
        Serial.println("Array");
        break;
      case JsonReader2k::EndArray:
        Serial.println("End Array");
        break;
      case JsonReader2k::Error:
        Serial.print("Error: (");
        Serial.print(jsonReader.lastError());
        Serial.print(") ");
        Serial.println(jsonReader.value());
        break;
    }
  }
  file.close();
}

void dumpId(bool recurse) {
  File file = SD.open("/data.json", FILE_READ);
  if (!file) {
    Serial.println(F("/data.json not found or could not be opened"));
    while (true);
  }
  jsonReader.begin(file);
  if (jsonReader.skipToField("id", recurse) && jsonReader.read()) {
    Serial.println((int32_t)jsonReader.numericValue(), DEC);
  }
  file.close();
}

void dumpShowName() {
  File file = SD.open("/data.json", FILE_READ);
  if (!file) {
    Serial.println(F("/data.json not found or could not be opened"));
    while (true);
  }
  jsonReader.begin(file);
  if (jsonReader.skipToField("name") && jsonReader.read()) {
    jsonReader.undecorate();
    Serial.println(jsonReader.value());
  }
  file.close();
}

void setup() {
  Serial.begin(115200);
  if (!SD.begin()) {
    Serial.println(F("SD card mount failed"));
    while (true);
  }
  dumpJsonFile();
}

void loop() {
  if (Serial.available()) {
    LexContext<1024> lc;
    lc.begin(Serial);
    while (LexContext<1024>::EndOfInput != lc.advance() && isdigit((char)lc.current())) {
      Serial.print((char)lc.current());
    }
    Serial.println();
  }
}
We've got a lot going on here. Of particular interest are dumpShowName(), dumpId() and dumpJsonFile().
In the latter, we go through the motions of reading data out of the file. Note that we can get type information for our values and retrieve typed values by calling the appropriate methods. All numbers currently resolve to a double. If you want an int, you will have to use atoi() on value().
The other routines show you how to move around the document. They don't demonstrate all of the navigation features, but they do demonstrate an important one: skipToField(). This method finds the next field in the document with the given name, optionally searching subelements. Usually, once you find it, you'll want to read() once to get to the next element - the field's value. We do that in the ino above.
There's also skipSubtree(), which skips over the entire subtree you're on, plus skipToEndArray() and skipToEndObject(), which skip to the closing array or object marker on the current level, respectively.
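Underneath, those skip routines amount to stepping the cursor while counting matched brackets and hopping over quoted strings. A standalone sketch of that idea (hypothetical code, not the library's - it works over a plain C string and returns an index instead of advancing a stream):

```cpp
#include <cstddef>

// Illustrative depth counter: given a string whose first character is '{',
// returns the index one past the matching '}', or -1 if unterminated.
// Quoted strings are skipped wholesale so braces inside them don't
// affect the depth - the same division of labor as the library's
// object-skipping and string-skipping routines.
long skipObjectSketch(const char* s) {
    if (s[0] != '{') return -1;
    long i = 0;
    int depth = 0;
    for (; s[i]; ++i) {
        char ch = s[i];
        if (ch == '\"') {                 // skip string, honoring \" escapes
            for (++i; s[i] && s[i] != '\"'; ++i)
                if (s[i] == '\\' && s[i + 1]) ++i;
            if (!s[i]) return -1;         // unterminated string
        } else if (ch == '{') {
            ++depth;
        } else if (ch == '}') {
            if (--depth == 0) return i + 1;
        }
    }
    return -1;                            // unterminated object
}
```

Because the depth counter only ever moves forward, a skip costs no capture buffer space at all, which is what makes partial parsing cheap.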
One of the things I'd like to add is an in-memory tree representation that integrates with this, or perhaps integration with ArduinoJson. I may first create a JSONPath to C++ code generator that can generate code that uses this library to fulfill JSONPath queries. All in time.
Cool, But How Does It Work?
As I said at the beginning, the whole mess is basically a state machine over a LexContext, which manages a cursor over the streaming input. The state machine starts in the Initial state, and each state corresponds directly to a value returned from nodeType().
bool read() {
int16_t qc;
int16_t ch;
switch (_state) {
case JsonReader<S>::Error:
case JsonReader<S>::EndDocument:
return false;
case JsonReader<S>::Initial:
_lc.ensureStarted();
_state = Value;
case JsonReader<S>::Value:
value_case:
_lc.clearCapture();
switch (_lc.current()) {
case LexContext<S>::EndOfInput:
_state = EndDocument;
return false;
case ']':
_lc.advance();
_lc.trySkipWhiteSpace();
_lc.clearCapture();
_state = EndArray;
return true;
case '}':
_lc.advance();
_lc.trySkipWhiteSpace();
_lc.clearCapture();
_state = EndObject;
return true;
case ',':
_lc.advance();
_lc.trySkipWhiteSpace();
if (!read()) { _lastError = JSON_ERROR_UNTERMINATED_ARRAY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_ARRAY_MSG,S-1);
_state = Error;
}
return true;
case '[':
_lc.advance();
_lc.trySkipWhiteSpace();
_state = Array;
return true;
case '{':
_lc.advance();
_lc.trySkipWhiteSpace();
_state = Object;
return true;
case '-':
case '.':
case '0':
case '1':
case '2':
case '3':
case '4':
case '5':
case '6':
case '7':
case '8':
case '9':
qc = _lc.current();
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
while (LexContext<S>::EndOfInput != _lc.advance() &&
('E' == _lc.current() ||
'e' == _lc.current() ||
'+' == _lc.current() ||
'.' == _lc.current() ||
isdigit((char)_lc.current()))) {
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
}
_lc.trySkipWhiteSpace();
return true;
case '\"':
_lc.capture();
_lc.advance();
if(!_lc.tryReadUntil('\"', '\\', true)) {
if(LexContext<S>::EndOfInput==_lc.current()) {
_lastError = JSON_ERROR_UNTERMINATED_STRING;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_STRING_MSG,S-1);
_state = Error;
return true;
} else {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
}
_lc.trySkipWhiteSpace();
if (':' == _lc.current())
{
_lc.advance();
_lc.trySkipWhiteSpace();
if (LexContext<S>::EndOfInput == _lc.current()) {
_lastError = JSON_ERROR_FIELD_NO_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_FIELD_NO_VALUE_MSG,S-1);
_state = Error;
return true;
}
_state = Field;
}
return true;
case 't':
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('r' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('u' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('e' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
_lc.advance();
_lc.trySkipWhiteSpace();
ch = _lc.current();
if (',' != ch && ']' != ch && '}' != ch && LexContext<S>::EndOfInput != ch) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
}
return true;
case 'f':
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('a' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('l' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('s' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('e' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
_lc.advance();
_lc.trySkipWhiteSpace();
ch = _lc.current();
if (',' != ch && ']' != ch && '}' != ch && LexContext<S>::EndOfInput != ch) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
}
return true;
case 'n':
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('u' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('l' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
if ('l' != _lc.advance()) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
if (!_lc.capture()) {
_lastError = JSON_ERROR_OUT_OF_MEMORY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_OUT_OF_MEMORY_MSG,S-1);
_state = Error;
return true;
}
_lc.advance();
_lc.trySkipWhiteSpace();
ch = _lc.current();
if (',' != ch && ']' != ch && '}' != ch && LexContext<S>::EndOfInput != ch) {
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
}
return true;
default:
_lastError = JSON_ERROR_UNEXPECTED_VALUE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNEXPECTED_VALUE_MSG,S-1);
_state = Error;
return true;
}
default:
_state = Value;
goto value_case;
}
}
Each part examines the current state and the character under the cursor to determine what to do next. This form of parsing is similar to LL(1). Actually, since JSON is even simpler than that to parse, it's not much more difficult than matching regular expressions. Since we don't have a separate lexer, our parser handles the lexing as well, except for low level chores like trySkipWhiteSpace(), which it delegates to LexContext. The tedious bits are the parts that scan a number and the parts that determine whether we encountered true, false, or null. Other than that, it's pretty straightforward.
As an optimization, this parser supports partial parsing while skipping over parts of the document. Skipping does just enough parsing to determine that the document is well formed, but captures and normalizes nothing, which speeds up the operation.
We have two routines for skipping over nested objects and arrays. They delegate to each other recursively when arrays are nested in objects and vice versa. Since they are nearly identical, we'll explore one:
void skipObjectPart()
{
int depth = 1;
while (Error!=_state && LexContext<S>::EndOfInput != _lc.current())
{
switch (_lc.current())
{
case '[':
if(LexContext<S>::EndOfInput==_lc.advance()) {
_lastError = JSON_ERROR_UNTERMINATED_ARRAY;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_ARRAY_MSG,S-1);
_state = Error;
return;
}
skipArrayPart();
break;
case '{':
++depth;
_lc.advance();
if(LexContext<S>::EndOfInput==_lc.current()) {
_lastError = JSON_ERROR_UNTERMINATED_OBJECT;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_OBJECT_MSG,S-1);
_state = Error;
}
break;
case '\"':
skipString();
break;
case '}':
--depth;
_lc.advance();
if (depth == 0)
{
_lc.trySkipWhiteSpace();
return;
}
if(LexContext<S>::EndOfInput==_lc.current()) {
_lastError = JSON_ERROR_UNTERMINATED_OBJECT;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNTERMINATED_OBJECT_MSG,S-1);
_state = Error;
}
break;
default:
_lc.advance();
break;
}
}
}
These, in turn, are used by skipSubtree():
bool skipSubtree()
{
switch (_state)
{
case JsonReader<S>::Error:
return false;
case JsonReader<S>::EndDocument:
return false;
case JsonReader<S>::Initial:
if (read())
return skipSubtree();
return false;
case JsonReader<S>::Value:
return true;
case JsonReader<S>::Field:
if (!read())
return false;
return skipSubtree();
case JsonReader<S>::Array:
skipArrayPart();
_lc.trySkipWhiteSpace();
_state = EndArray;
return true;
case JsonReader<S>::EndArray:
return true;
case JsonReader<S>::Object:
skipObjectPart();
_lc.trySkipWhiteSpace();
_state = EndObject;
return true;
case JsonReader<S>::EndObject:
return true;
default:
_lastError = JSON_ERROR_UNKNOWN_STATE;
strncpy_P(_lc.captureBuffer(),JSON_ERROR_UNKNOWN_STATE_MSG,S-1);
_state = Error;
return true;
}
}
Here, we skip the next subtree depending on where we are. If we're on the initial node, the subtree is the entire document. If we're on a value, it will already be skipped by the next read() call. If we're on a field, we read to the next element - the field's value - and skip that. If we're on an array or object, we use the nested skip routines outlined just above.
To search, we provide methods like skipToIndex() and skipToField(). These allow you to move through the document by querying for field names, or for indices within arrays:
bool skipToIndex(int index) {
if (Initial == _state || Field == _state)
if (!read())
return false;
if (Array == _state) {
if (0 == index) {
if (!read())
return false;
} else {
for (int i = 0; i < index; ++i) {
if (EndArray == _state)
return false;
if (!read())
return false;
if (!skipSubtree())
return false;
}
if ((EndObject == _state || EndArray == _state) && !read())
return false;
}
return true;
}
return false;
}
Note that the above is for arrays. The following is for objects:
bool skipToField(const char* field, bool searchDescendants = false) {
if (searchDescendants) {
while (read()) {
if (Field == _state) {
undecorate();
if (!strcmp(field, value()))
return true;
}
}
return false;
}
switch (_state)
{
case JsonReader<S>::Initial:
if (read())
return skipToField(field);
return false;
case JsonReader<S>::Object:
while (read() && Field == _state) {
undecorate();
if (strcmp(field, value()))
skipSubtree();
else
break;
}
return Field == _state;
case JsonReader<S>::Field:
undecorate();
if (!strcmp(field, value()))
return true;
else if (!skipSubtree())
return false;
while (read() && Field == _state) {
undecorate();
if (strcmp(field, value()))
skipSubtree();
else
break;
}
return Field == _state;
default:
return false;
}
}
This is considerably more involved than skipToIndex(), simply because there are so many corner cases to deal with. Also, unlike the previous method, this one needs to be able to search descendants or skip over them. It seems odd until you think about it, but it's actually easier to "recursively" search for a field, because you don't have to skip subtrees to stay on the same level of the hierarchy.
As far as storing and retrieving scalar values and field names goes, we use the LexContext's captureBuffer() for that. Doing so saves precious RAM versus copying the data out of the buffer. One additional thing we do to save RAM is stipulate that any conversion of the data must happen in place where possible. This is why we have an undecorate() function, which removes the quotes from a string and translates its escapes. It does so in place, knowing that the result will always be shorter than the input, because unescaped characters are never longer than their escapes and because the quotes are always stripped:
void undecorate() {
char *src = _lc.captureBuffer();
char *dst = src;
char ch = *src;
if ('\"' != ch)
return;
++src;
uint16_t uu;
while ((ch = *src) && ch != '\"') {
switch (ch) {
case '\\':
ch = *(++src);
switch (ch) {
case '\'':
case '\"':
case '\\':
case '/':
*(dst++) = ch;
++src;
break;
case 'r':
*(dst++) = '\r';
++src;
break;
case 'n':
*(dst++) = '\n';
++src;
break;
case 't':
*(dst++) = '\t';
++src;
break;
case 'b':
*(dst++) = '\b';
++src;
break;
case 'u':
uu = 0;
ch = *(++src);
if (isHexChar(ch)) {
uu = fromHexChar(ch);
ch = *(++src);
uu *= 16;
if (isHexChar(ch)) {
uu |= fromHexChar(ch);
ch = *(++src);
uu *= 16;
if (isHexChar(ch)) {
uu |= fromHexChar(ch);
ch = *(++src);
uu *= 16;
if (isHexChar(ch)) {
uu |= fromHexChar(ch);
ch = *(++src);
}
}
}
}
if (0 < uu) {
if (256 > uu) {
*(dst++) = (char)uu;
} else
*(dst++) = '?';
}
}
break;
default:
*dst = ch;
++dst;
++src;
}
}
*dst = 0;
}
That's not very nice. What it's doing is managing two cursors over the same buffer. The destination cursor dst trails the source cursor src by at least one position because of the leading quote. Basically, we just copy characters from source to destination, translating whenever we hit an escape. When we find the closing quote, we're done, and we reterminate the string at its new end.
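The two-cursor trick is easier to see in miniature. This hypothetical helper (not the library's undecorate(), and handling only a few escapes) rewrites a quoted string in place the same way:

```cpp
#include <cstring>

// Illustrative in-place unquote/unescape: dst trails src within the same
// buffer, so the result never needs more room than the input. Handles
// only \n, \t, \\ and \" for brevity; the library handles more.
void unquoteInPlace(char* s) {
    char* src = s;
    char* dst = s;
    if (*src != '\"') return;   // not a quoted string; leave it alone
    ++src;                      // skip the opening quote
    while (*src && *src != '\"') {
        char ch = *src++;
        if (ch == '\\' && *src) {
            char esc = *src++;
            switch (esc) {
            case 'n': ch = '\n'; break;
            case 't': ch = '\t'; break;
            default:  ch = esc;  break;  // \\, \" and the rest: literal
            }
        }
        *dst++ = ch;
    }
    *dst = 0;                   // reterminate at the new, shorter end
}
```

Since every escape sequence is at least two characters long and produces at most one, dst can never overtake src, which is what makes the in-place rewrite safe.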
Another interesting function is valueType(), which tells us what sort of JSON value we're looking at. Note that it should not be called after undecorate():
int8_t valueType() {
char *sz = _lc.captureBuffer();
char ch = *sz;
if('\"'==ch)
return String;
if('t'==ch || 'f'==ch)
return Boolean;
if('n'==ch)
return Null;
return Number;
}
We take a number of liberties here. All we ever do is examine the first character, and for a number we don't even do that - we get there by process of elimination. This is only reliable because we already validated these values while parsing. For example, if the value starts with t, we know it's going to be true, simply because nothing else is allowed to start with t unless it's surrounded by quotes. We already know it's not tree, because the parser would have errored out earlier. Now you can see the damage undecorate() does if it's called before this!
We've now covered the meat of the entire library, and where you go from here is up to you. I hope you enjoy this contribution and that your code is lean, pretty and bug resistant.
History
- 9th December, 2020 - Initial submission
- 9th December, 2020 - Update: added "How It Works" section
- 10th December, 2020 - Update 2: added better error handling and bug fixes
- 10th December, 2020 - Update 3: fixed bug with incorrect error message during some out of memory conditions
- 11th December, 2020 - Update 4: fixed bug with skipping, changed Key to Field, removed non-canonical skipping and updated article code