Introduction
A common programming task (the only task?) is moving and manipulating data, sometimes between different languages. This can be a chore since language creators don't implement any standard serializing format for nested data such as arrays. In fact, many languages don't support any native serializing at all.
Scripting is all the rage these days. Perhaps, it's just the particular field I work in, but I seem to run into more and more scripting languages. Yes, there are the ubiquitous languages, JavaScript, ASP, PHP, etc... but there are countless other scripting languages for industrial robotics control, camera control, phone switching, on and on.
At some point, you want to get data into or out of these languages, and the choices can be vague or less than optimal. XML is standard, but XML data is bulky and hard to parse. An XML parser is not something I really want to add to ACME's Burger Flipper Scripting Language, just so you can get burger flipping statistics into your Total Meal Performance package. Ultra simple string formats are not extensible, and require you to constantly upgrade everything to handle new parameters. Other custom formats I see used are often inefficient, limited in usefulness, or buggy.
The goal of this article then, is to provide an easy to implement minimal format for simple data exchange especially between new and obscure scripting languages.
Instead of calling it ETIMFFSDEEBNAOSL, which sounds German, I'm going to refer to this format formally as Simple Cross-language Serializing or SCS, so we have a shorthand, and later, in design reviews, we can use cool phrases like, 'I'll just SCS that data to you' and make nearby management people feel dumb.
Goals
Unfortunately, the section above already one-upped this one by not only mentioning the goal, but making a joke about it. So, instead of restating it, I'll just go into more detail.
Above all, simplicity and flexibility will be stressed. There will be ways in which the protocol could be expanded to reduce the size of the serialized data, such as using binary, or compression, etc.... But the trade off would be to increase the complexity of the encoder / decoder, and thus increase the implementation / debugging time and raise the risk of compatibility flaws and lazy implementations.
Since this data format is chiefly for scripting formats that are new and/or obscure, we want something that is quickly and easily implemented.
So, steps to obtaining our goal will be to...
- Define a compact portable data format for nested data.
- Outline an easy to implement encoder / decoder.
- Provide example encoder / decoder in C++, PHP, and JavaScript.
Use
To make this article a little easier to read and understand, I'm going to cover a few example uses before I go further, since I don't have a cool picture to put on top of the article. Here is a simple example of how SCS will be used for, say, passing data from PHP to JavaScript, for instance...
$person_info = array( 'Fred'=>array( 'hair'=>'blonde', 'eye'=>'blue', 'age'=>26 ) );
echo 'var javaArray = DeserializeArray( "' . SerializeArray( $person_info ) . '" );';
Or how about sending data from PHP to C++:
$person_info = QueryDatabase( "fred" );
echo SerializeArray( $person_info );
CPsPropertyBag cPersonInfo( GetPhpReturnString() );
long lAge = cPersonInfo[ "Fred" ][ "Age" ].ToLong();
Why another format?
If you Google 'pass array PHP JavaScript', substituting your favorite languages, you'll find various implementations including many lazy implementations of standards like XML. The goal here is to define the simplest possible practical implementation. In fact, ideally, a lazier implementation will not even be possible.
Note on XML. It seems I've run into many people that claim XML is the end all format. I've never seen such strong claims of one-size-fits-all before, despite the many thousands of formats that have gone before. But, it seems many are determined to make XML the only format for data exchange. Someone has even once suggested to me that I base-64 encode streaming video data and wrap it in XML to make it 'standard'. I don't share, and am not going to consider, such a narrow view on any format or language. There is a clear trade-off between a formats feature set and the complexity of implementation. I will offer what I hope is an objective comparison with XML for the challenges within the scope of this article.
What about other standards? There is a huge list of similar protocols, check out XML Alternatives for a start. After many hours of pouring through many formats, I was unable to find a good fit for the objectives explained here.. If I've missed an identical solution, you're welcome to rub it in. But, I doubt you will be able to say it was obvious. At some point, one just has to bite the bullet and get things done; at least, I'm sharing in an obvious place...
Custom solutions. Another thing I'm going to look at, is a common runaway encoding technique many people use on custom implementations. It seems that their implementation started as a flat representation, such as a=1,b=2,c=3, then recursion was added to handle nested data. Though the simplicity is hard to beat, the data expansion from nested encoding can be pretty significant. It also makes the nested values hard for a human to read. Because of these two factors, I will stray from simplicity to solve this one problem.
Here is such a runaway implementation in PHP:
function RunawayEncode( &$params, $sep = '&', $arr = '?' )
{
$ret = '';
foreach( $params as $key => $val )
{
if ( strlen( $key ) && $val )
{
if ( strlen( $ret ) ) $ret .= $sep;
if ( $arr && is_array( $val ) )
$ret .= $key . '=' . $arr . rawurlencode( MakeParams( $val, $sep, $arr ) );
else $ret .= $key . '=' . rawurlencode( $val );
}
}
return $ret;
}
function RunawayDecode( $params, $sep = '&', $arr = '?' )
{
$ret = array();
$parr = split( $sep, $params );
foreach( $parr as $val )
{
$kv = split( '=', $val );
if ( !isset( $kv[ 0 ] ) || !strlen( $kv[ 0 ] ) );
else if ( !isset( $kv[ 1 ] ) )
$ret[ $kv[ 0 ] ] = 1;
else if ( !$arr || $kv[ 1 ][ 0 ] != $arr )
$ret[ $kv[ 0 ] ] = rawurldecode( $kv[ 1 ] );
else $ret[ $kv[ 0 ] ] =
ParseParams( rawurldecode( substr( $kv[ 1 ], 1 ) ), $sep, $arr );
}
return $ret;
}
The data format
The encoding rules for SCS data:
- The equality character '=' will separate name from value.
- The comma character ',' will separate name/value pairs.
- The curly bracket characters '{' and '}' will enclose nested data.
- All data will be encoded as strings using the URL encoding scheme as described in RFC 1738.
Pretty simple? I choose RFC 1738 encoding because many script languages have built-in functions, and if not, it's easy to implement (see the C++ ScsSerialize.h header for an example). Additionally, much of the same logic behind using this encoding in URLs applies here. This format also gives us the advantage of being able to read most data rather easily. Here is an example of an encoded array:
Mary{Married=No,DOB=7-2-82}
The PHP declaration of the above array would be:
$test_array = array(
'Mary'=>
array(
'Married'=>'No',
'DOB'=>'7-2-82',
),
);
Here's a slightly more complex example demonstrating nested encoding. Notice how data is only encoded once.
Mary{Married=No,DOB=7-2-82,Pets{Dog=1},Invalid%20Characters=%21%40%23%24%25%5E%26%2A%28%29}
Declared in PHP...
$test_array = array(
'Mary'=>
array(
'Married'=>'No',
'DOB'=>'7-2-82',
'Pets'=>
array(
'Dog'=>1,
),
'Invalid Characters'=>'!@#$%^&*()'
),
);
Parsing
As mentioned, the big pro here is that this can be easily parsed. Here is an example PHP implementation. You'll notice that the encoding function is similar in complexity to the above runaway example; however, the decoding function is more complex. This extra complexity avoids the re-encoding of data. Myself and a few others attempted a simpler decoder and struck out. I'd be very interested if anyone is able to do better.
function ScsSerialize( &$params )
{
$ret = '';
foreach( $params as $key => $val )
{
if ( $ret ) $ret .= ',';
$ret .= rawurlencode( $key );
if ( is_array( $val ) ) $ret .= '{' . ScsSerialize( $val ) . '}';
else if ( strlen( $val ) ) $ret .= '=' . rawurlencode( $val );
}
return $ret;
}
function ScsDeserialize( $params, &$last = 0 )
{
$arr = array();
$l = strlen( $params ); $s = 0; $e = 0;
while ( $e < $l )
{
switch( $params[ $e ] )
{
case ',' : case '}' :
{
if ( 1 < $e - $s )
{
$a = split( '=', substr( $params, $s, $e - $s ) );
if ( !isset( $a[ 0 ] ) ) $a[ 0 ] = 0;
else $a[ 0 ] = rawurldecode( $a[ 0 ] );
if ( !isset( $a[ 1 ] ) ) $arr[ $a[ 0 ] ] = '';
else $arr[ $a[ 0 ] ] = rawurldecode( $a[ 1 ] );
}
$s = $e + 1;
if ( '}' == $params[ $e ] )
{ if ( $last ) $last = $e + 1; return $arr; }
} break;
case '{' :
{
$k = rawurldecode( substr( $params, $s, $e - $s ) );
if ( isset( $k ) )
{
$end_array = 1;
$arr[ $k ] = ScsDeserialize( substr( $params, $e + 1 ), $end_array );
$e += $end_array;
}
$s = $e + 1;
} break;
}
$e++;
}
return $arr;
}
Here's the JavaScript version. Not quite a one to one, as JavaScript apparently doesn't support references of generic types.
function ScsSerialize( x_params )
{
var ret = '';
for ( var key in x_params )
{
if ( key && x_params[ key ] )
{
if ( ret ) ret += ',';
ret += escape( key );
if( x_params[ key ].constructor == Array ||
x_params[ key ].constructor == Object )
{
ret += '{' + ScsSerialize( x_params[ key ] ) + '}';
}
else ret += '=' + escape( x_params[ key ] );
}
}
return ret;
}
function ScsDeserialize( x_params, x_arr )
{
var l = x_params.length, s = 0, e = 0;
while ( e < l )
{
switch( x_params[ e ] )
{
case ',' : case '}' :
{
var a = x_params.substr( s, e - s ).split( '=' );
if ( 1 < e - s )
{
if ( null == a[ 0 ] ) a[ 0 ] = 0;
else a[ 0 ] = unescape( a[ 0 ] );
if ( null == a[ 1 ] ) x_arr[ 0 ] = '';
else x_arr[ a[ 0 ] ] = unescape( a[ 1 ] );
}
s = e + 1;
if ( '}' == x_params[ e ] ) return e + 1;
} break;
case '{' :
{
var k = x_params.substr( s, e - s );
if ( k.length )
{
k = unescape( k );
x_arr[ k ] = Array();
e += ScsDeserialize( x_params.substr( e ), x_arr[ k ] );
}
s = e + 1;
} break;
}
e++;
}
return e;
}
I know it's kinda long, but I'll go ahead and post the C++ version just so this article is as complete as possible. The mad cut-and-pasters will appreciate it, I'm sure. The actual encode / decode functions are about the same, but I added functions for converting from strings to integers and doubles etc... Just to make things easy to use. C++ does not have built-in support of this type.
#include <map>
#include <string>
template < class T > class TScsPropertyBag
{
public:
template < class T > class CAutoMem
{
public:
CAutoMem() { m_p = NULL; }
~CAutoMem() { release(); }
void release() { if ( m_p ) { delete m_p; m_p = NULL; } }
T& Obj() { if ( !m_p ) m_p = new T; return *m_p; }
operator T&() { return Obj(); }
private:
T *m_p;
};
typedef std::basic_string< T > t_String;
typedef std::map< t_String, CAutoMem< TScsPropertyBag< T > > > t_StringArray;
public:
TScsPropertyBag() { }
TScsPropertyBag( t_String sStr )
{ deserialize( sStr );
}
static bool IsStdChar( T ch )
{ return ( _T( 'a' ) <= ch && _T( 'z' ) >= ch ) ||
( _T( 'A' ) <= ch && _T( 'Z' ) >= ch ) ||
( _T( '0' ) <= ch && _T( '9' ) >= ch ) ||
_T( '_' ) == ch || _T( '-' ) == ch ||
_T( '.' ) == ch;
}
static t_String urlencode( t_String sStr )
{
t_String sRes;
long lLen = sStr.length(), i = 0;
T tmp[ 256 ];
while ( i < lLen )
{
if ( IsStdChar( sStr[ i ] ) ) sRes += sStr[ i ];
else
{ _stprintf( tmp, _T( "%%%02lX" ), (long)sStr[ i ] );
sRes += tmp;
}
i++;
}
return sRes;
}
static t_String urldecode( t_String sStr )
{
t_String sRes;
long lLen = sStr.length(), i = 0;
T tmp[ 256 ];
while ( i < lLen )
{
if ( _T( '%' ) != sStr[ i ] ) sRes += sStr[ i ];
else
{
tmp[ 0 ] = sStr[ ++i ]; tmp[ 1 ] = sStr[ ++i ]; tmp[ 2 ] = 0;
sRes += (TCHAR)( _tcstoul( tmp, NULL, 16 ) );
}
i++;
}
return sRes;
}
void destroy() { m_lstSub.clear(); m_str.release(); }
t_String serialize()
{
t_String sRes;
if ( !IsArray() ) return m_str.Obj();
t_StringArray::iterator pos = m_lstSub.begin();
while ( pos != m_lstSub.end() )
{
if ( sRes.length() ) sRes += _T( ',' );
sRes += pos->first;
if ( pos->second.Obj().IsArray() )
{
sRes += _T( '{' );
sRes += pos->second.Obj().serialize();
sRes += _T( '}' );
}
else sRes += _T( '=' ), sRes += urlencode( (LPCTSTR)pos->second.Obj() );
pos++;
}
return sRes;
}
LONG deserialize( t_String sStr, BOOL bMerge = FALSE,
LONG *pLast = NULL, TScsPropertyBag *pPs = NULL )
{
if ( !pPs ) pPs = this;
if ( !bMerge ) pPs->destroy();
LONG lItems = 0;
long lLen = sStr.length(), s = 0, e = 0;
while ( e < lLen )
{
switch( sStr[ e ] )
{
case ',' : case '}' :
{
if ( 1 < e - s )
{
long a = s; while ( a < e && '=' != sStr[ a ] ) a++;
t_String sKey, sVal;
if ( a == s )
sKey = urldecode( t_String( &sStr.c_str()[ s + 1 ], e - s - 1 ) );
else sKey = urldecode( t_String( &sStr.c_str()[ s ], a - s ) );
if ( 1 >= e - a ) (*pPs)[ sKey ] = _T( "" );
else (*pPs)[ sKey ] =
urldecode( t_String( &sStr.c_str()[ a + 1 ], e - a - 1 ) );
lItems++;
}
s = e + 1;
if ( '}' == sStr[ e ] )
{ if ( pLast ) *pLast = e + 1; return lItems; }
} break;
case '{' :
{
t_String sKey =
urldecode( t_String( &sStr.c_str()[ s ], e - s ) );
if ( sKey.length() )
{
LONG lEnd = 0;
lItems += deserialize( t_String( &sStr.c_str()[ e + 1 ] ),
TRUE, &lEnd, &(*pPs)[ sKey ] );
e += lEnd;
}
s = e + 1;
} break;
}
e++;
}
return lItems;
}
TScsPropertyBag& operator []( LPCTSTR pKey ) { return m_lstSub[ pKey ]; }
TScsPropertyBag& operator []( t_String sKey ) { return m_lstSub[ sKey.c_str() ]; }
TScsPropertyBag& operator []( long n )
{ TCHAR szKey[ 256 ] = _T( "" );
_stprintf( szKey, _T( "%li" ), n );
return m_lstSub[ szKey ];
}
TScsPropertyBag& operator []( unsigned long n )
{ TCHAR szKey[ 256 ] = _T( "" );
_stprintf( szKey, _T( "%lu" ), n );
return m_lstSub[ szKey ];
}
TScsPropertyBag& operator []( double n )
{ TCHAR szKey[ 256 ] = _T( "" );
_stprintf( szKey, _T( "%g" ), n );
return m_lstSub[ szKey ];
}
t_String operator = ( t_String sStr )
{ m_str.Obj() = sStr.c_str(); return m_str.Obj(); }
t_String operator = ( LPCTSTR pStr )
{ m_str.Obj() = pStr; return m_str.Obj(); }
t_String operator = ( long lVal )
{ T num[ 256 ] = _T( "" );
_stprintf( num, _T( "%li" ), lVal );
m_str.Obj() = num; return m_str.Obj();
}
t_String operator = ( unsigned long ulVal )
{ T num[ 256 ] = _T( "" );
_stprintf( num, _T( "%lu" ), ulVal );
m_str.Obj() = num; return m_str.Obj();
}
t_String operator = ( double dVal )
{ T num[ 256 ] = _T( "" );
_stprintf( num, _T( "%g" ), dVal );
m_str.Obj() = num; return m_str.Obj();
}
operator LPCTSTR() { return ToStr(); }
LPCTSTR ToStr() { return m_str.Obj().c_str(); }
long ToLong() { return _tcstol( ToStr(), NULL, 10 ); }
long ToULong() { return _tcstoul( ToStr(), NULL, 10 ); }
long ToDouble() { return _tcstod( ToStr(), NULL ); }
BOOL IsArray() { return 0 < m_lstSub.size(); }
private:
CAutoMem< t_String > m_str;
t_StringArray m_lstSub;
};
typedef TScsPropertyBag< TCHAR > CScsPropertyBag;
Comparison
In terms of simplicity, it's hard to get much simpler. The only simpler versions I have seen are of the runaway type, or language specific. Such as manually outputting a JavaScript array, for instance. In this case, our work is lost if we want to now switch to another target language.
One way in which XML excels, as seen below, is human readability. Although it is possible to decipher the SCS string, it is not as clear unless you add new line characters. It would have been possible to make the decoder white space agnostic, but it would have required tokenizing the data. This would have just been something someone could leave out of the implementation, and thus we would have strayed from our goals. Also, the introduction of white space could potentially cause problems when pasting data as strings into source files. This has priority here as being closer to our goals of cross-language communication. Though we attempt to make it somewhat readable, take into account that human readability is not a priority for SCS when considering your options.
In terms of bandwidth, say for an AJAX project. Consider the following array...
$A = array( 'Department'=>
array(
'Accounting'=>
array(
'John'=>
array(
'Married'=>'Yes',
'DOB'=>'1-14-78',
'Pets'=>
array(
'Fish'=>8,
'Dog'=>1,
'Cat'=>2
),
'ValidCharacters'=>'.-_',
'InvalidCharacters'=>'[,=]'
),
'Mary'=>
array(
'Married'=>'No',
'DOB'=>'7-2-82',
'Pets'=>
array(
'Dog'=>1,
),
'InvalidCharacters'=>'!@#$%^&*()'
),
),
),
);
Our SCS implementation comes in at 218 bytes. 42% less than the XML equivalent. But is harder to read. It looks like this:
Department{Accounting{John{Married=Yes,DOB=1-14-78,Pets{Fish=8,Dog=1,Cat=2},\
ValidCharacters=.-_,InvalidCharacters=%5B%2C%3D%5D},\
Mary{Married=No,DOB=7-2-82,Pets{Dog=1},\
InvalidCharacters=%21%40%23%24%25%5E%26%2A%28%29}}}
This typical XML output weighs in at 517 bytes. I struggled a little with whether or not to remove the header and formatting characters. I decided to leave them since this really is a lot of the argument for using XML, to be 'standard'. This is actually cheating a little since I used the same URL encoding instead of the more common base-64. But, XML allows me this.
="1.0"="UTF-8"
<Department>
<Accounting>
<John>
<Married>Yes</Married>
<DOB>1-14-78</DOB>
<Pets>
<Fish>8</Fish>
<Dog>1</Dog>
<Cat>2</Cat>
</Pets>
<ValidCharacters>.-_</ValidCharacters>
<InvalidCharacters>%5B%2C%3D%5D</InvalidCharacters>
</John>
<Mary>
<Married>No</Married>
<DOB>7-2-82</DOB>
<Pets>
<Dog>1</Dog>
</Pets>
<InvalidCharacters>%21%40%23%24%25%5E%26%2A%28%29</InvalidCharacters>
</Mary>
</Accounting>
</Department>
The typical runaway implementation. It should be noted that this example particularly amplifies the redundant encoding issue. There are other instances where it would be competitive though never significantly better. Notice the severe mangling due to the recursive encoding.
Department=?Accounting%3D%3FJohn%253D%253FMarried%25253DYes%252526DOB%25253D1-14-78%252526\
Pets%25253D%25253FFish%2525253D8%25252526Dog%2525253D1%25252526Cat%2525253D2%252526\
ValidCharacters%25253D.-_%252526InvalidCharacters%25253D%2525255B%2525252C%2525253D%\
2525255D%2526Mary%253D%253FMarried%25253DNo%252526DOB%25253D7-2-82%252526\
Pets%25253D%25253FDog%2525253D1%252526InvalidCharacters%25253D%25252521%25252540\
%25252523%25252524%25252525%2525255E%25252526%2525252A%25252528%25252529
Flexibility
We are not going to attempt to encode variable types or other properties such as minimum and maximum values at the parser level. But these things can still be done in the framework of the current protocol. For example, consider the following XML:
<variable name=x type=float min=-10 max=10>3.14</variable>
We can represent this type of information by just adding a sub array. In the case of XML, the content or value field is implicit between the tags. We will need to add an explicit 'value' field. And the result is actually shorter than the minimal XML.
variable{name=x,type=float,min=-10,max=10,value=3.14}
Or better still...
x{type=float,min=-10,max=10,value=3.14}
Also, there cannot be similar names at a given scope. For instance...
<table><tr><td>One</td><td>Two</td></tr>
<tr><td>Three</td></tr><table>
Would have to be represented as something like:
table{tr{0{td{0=One,1=Two}},1{td{0=three}}}}
// For clarity
table
{
tr
{
0
{
td
{
0 = One,
1 = Two
}
},
1
{
td
{
0 = three
}
}
}
}
You'll find most data structures can be represented well enough in this protocol. It's usually just a matter of efficiency, especially when dealing with high-bandwidth, binary data like live video or audio. Then again, what format covers everything well?
Conclusion
I think that supplies a good idea of what was being attempted, and what was achieved. A few notes...
Notice that the supplied functions allow you to easily serialize parts of the array as well as the whole array. Also, you can decode one array into a larger array. This is a subtle but powerful construct.
The property bag concept achieved in the C++ implementation is a powerful addition to the language. It can severely cut development time when dealing with data. The nice thing about C++ is that you can describe how exactly you want operators to behave. I actually use a more advanced form of this class that allows serializing/deserializing into lots of formats like the Windows Registry, INI files, URL GET and POST variables, MIME formats, database, etc... This can be an enormously powerful way to handle generic data. I know I didn't invent this by the way, there are many examples out there...
I'd like to add more languages to this example. Perl, Python, VB, come to mind. If anyone wants to donate, please feel free.
Thanks everybody!