Passing parameters between C++, PHP, JavaScript, etc...

Robert Umbehant

4.25/5 (9 votes)

2 Nov 2006CPOL9 min read

565

Easy to implement minimal format for simple data exchange, especially between new and obscure scripting languages.

Download source code - 7.1 KB

Introduction

A common programming task (the only task?) is moving and manipulating data, sometimes between different languages. This can be a chore since language creators don't implement any standard serializing format for nested data such as arrays. In fact, many languages don't support any native serializing at all.

Scripting is all the rage these days. Perhaps, it's just the particular field I work in, but I seem to run into more and more scripting languages. Yes, there are the ubiquitous languages, JavaScript, ASP, PHP, etc... but there are countless other scripting languages for industrial robotics control, camera control, phone switching, on and on.

At some point, you want to get data into or out of these languages, and the choices can be vague or less than optimal. XML is standard, but XML data is bulky and hard to parse. An XML parser is not something I really want to add to ACME's Burger Flipper Scripting Language, just so you can get burger flipping statistics into your Total Meal Performance package. Ultra simple string formats are not extensible, and require you to constantly upgrade everything to handle new parameters. Other custom formats I see used are often inefficient, limited in usefulness, or buggy.

The goal of this article then, is to provide an easy to implement minimal format for simple data exchange especially between new and obscure scripting languages.

Instead of calling it ETIMFFSDEEBNAOSL, which sounds German, I'm going to refer to this format formally as Simple Cross-language Serializing or SCS, so we have a shorthand, and later, in design reviews, we can use cool phrases like, 'I'll just SCS that data to you' and make nearby management people feel dumb.

Goals

Unfortunately, the section above already one-upped this one by not only mentioning the goal, but making a joke about it. So, instead of restating it, I'll just go into more detail.

Above all, simplicity and flexibility will be stressed. There will be ways in which the protocol could be expanded to reduce the size of the serialized data, such as using binary, or compression, etc.... But the trade off would be to increase the complexity of the encoder / decoder, and thus increase the implementation / debugging time and raise the risk of compatibility flaws and lazy implementations.

Since this data format is chiefly for scripting formats that are new and/or obscure, we want something that is quickly and easily implemented.

So, steps to obtaining our goal will be to...

Define a compact portable data format for nested data.
Outline an easy to implement encoder / decoder.
Provide example encoder / decoder in C++, PHP, and JavaScript.

Use

To make this article a little easier to read and understand, I'm going to cover a few example uses before I go further, since I don't have a cool picture to put on top of the article. Here is a simple example of how SCS will be used for, say, passing data from PHP to JavaScript, for instance...

// Create PHP array
$person_info = array( 'Fred'=>array( 'hair'=>'blonde', 'eye'=>'blue', 'age'=>26 ) );

// Send to Javascript
echo 'var javaArray = DeserializeArray( "' . SerializeArray( $person_info ) . '" );';

Or how about sending data from PHP to C++:

// PHP - Return data on fred

$person_info = QueryDatabase( "fred" );
echo SerializeArray( $person_info );

// C++ - Use fred's age

CPsPropertyBag  cPersonInfo( GetPhpReturnString() );
long lAge = cPersonInfo[ "Fred" ][ "Age" ].ToLong();

Why another format?

If you Google 'pass array PHP JavaScript', substituting your favorite languages, you'll find various implementations including many lazy implementations of standards like XML. The goal here is to define the simplest possible practical implementation. In fact, ideally, a lazier implementation will not even be possible.

Note on XML. It seems I've run into many people that claim XML is the end all format. I've never seen such strong claims of one-size-fits-all before, despite the many thousands of formats that have gone before. But, it seems many are determined to make XML the only format for data exchange. Someone has even once suggested to me that I base-64 encode streaming video data and wrap it in XML to make it 'standard'. I don't share, and am not going to consider, such a narrow view on any format or language. There is a clear trade-off between a formats feature set and the complexity of implementation. I will offer what I hope is an objective comparison with XML for the challenges within the scope of this article.

What about other standards? There is a huge list of similar protocols, check out XML Alternatives for a start. After many hours of pouring through many formats, I was unable to find a good fit for the objectives explained here.. If I've missed an identical solution, you're welcome to rub it in. But, I doubt you will be able to say it was obvious. At some point, one just has to bite the bullet and get things done; at least, I'm sharing in an obvious place...

Custom solutions. Another thing I'm going to look at, is a common runaway encoding technique many people use on custom implementations. It seems that their implementation started as a flat representation, such as a=1,b=2,c=3, then recursion was added to handle nested data. Though the simplicity is hard to beat, the data expansion from nested encoding can be pretty significant. It also makes the nested values hard for a human to read. Because of these two factors, I will stray from simplicity to solve this one problem.

Here is such a runaway implementation in PHP:

function RunawayEncode( &$params, $sep = '&', $arr = '?' )
{
    $ret = '';
    foreach( $params as $key => $val )
    {
        if ( strlen( $key ) && $val )
        {
            // Continue link

            if ( strlen( $ret ) ) $ret .= $sep;

            // Multidimensional assignment

            if ( $arr && is_array( $val ) )
                $ret .= $key . '=' . $arr . rawurlencode( MakeParams( $val, $sep, $arr ) );

            // Flat assignment

            else $ret .= $key . '=' . rawurlencode( $val );

        } // end if


    } // end foreach


    return $ret;
}

function RunawayDecode( $params, $sep = '&', $arr = '?' )
{
    $ret = array();
    $parr = split( $sep, $params );

    foreach( $parr as $val )
    {
        $kv = split( '=', $val );

        // NULL param

        if ( !isset( $kv[ 0 ] ) || !strlen( $kv[ 0 ] ) );

        // One dimensional

        else if ( !isset( $kv[ 1 ] ) )
            $ret[ $kv[ 0 ] ] = 1;

        // Flat assignment

        else if ( !$arr || $kv[ 1 ][ 0 ] != $arr )
            $ret[ $kv[ 0 ] ] = rawurldecode( $kv[ 1 ] );

        // Multi dimensional assignment

        else $ret[ $kv[ 0 ] ] = 
            ParseParams( rawurldecode( substr( $kv[ 1 ], 1 ) ), $sep, $arr );

    }

    return $ret;
}

The data format

The encoding rules for SCS data:

The equality character '=' will separate name from value.
The comma character ',' will separate name/value pairs.
The curly bracket characters '{' and '}' will enclose nested data.
All data will be encoded as strings using the URL encoding scheme as described in RFC 1738.

Pretty simple? I choose RFC 1738 encoding because many script languages have built-in functions, and if not, it's easy to implement (see the C++ ScsSerialize.h header for an example). Additionally, much of the same logic behind using this encoding in URLs applies here. This format also gives us the advantage of being able to read most data rather easily. Here is an example of an encoded array:

Mary{Married=No,DOB=7-2-82}

The PHP declaration of the above array would be:

$test_array = array( 
                        'Mary'=>
                            array(
                                    'Married'=>'No',
                                    'DOB'=>'7-2-82',
                                ),
                    );

Here's a slightly more complex example demonstrating nested encoding. Notice how data is only encoded once.

Mary{Married=No,DOB=7-2-82,Pets{Dog=1},Invalid%20Characters=%21%40%23%24%25%5E%26%2A%28%29}

Declared in PHP...

$test_array = array( 
                        'Mary'=>
                            array(
                                    'Married'=>'No',
                                    'DOB'=>'7-2-82',
                                    'Pets'=>
                                        array(
                                                'Dog'=>1,
                                            ),
                                    'Invalid Characters'=>'!@#$%^&*()'
                                ),
                    );

Parsing

As mentioned, the big pro here is that this can be easily parsed. Here is an example PHP implementation. You'll notice that the encoding function is similar in complexity to the above runaway example; however, the decoding function is more complex. This extra complexity avoids the re-encoding of data. Myself and a few others attempted a simpler decoder and struck out. I'd be very interested if anyone is able to do better.

function ScsSerialize( &$params )
{
    $ret = '';

    foreach( $params as $key => $val )
    {
        if ( $ret ) $ret .= ',';

        // Save the key

        $ret .= rawurlencode( $key );

        // Save array

        if ( is_array( $val ) ) $ret .= '{' . ScsSerialize( $val ) . '}';

        // Save single value if any

        else if ( strlen( $val ) ) $ret .= '=' . rawurlencode( $val );

    } // end foreach


    return $ret;
}

function ScsDeserialize( $params, &$last = 0 )
{
    $arr = array();

    $l = strlen( $params ); $s = 0; $e = 0;

    while ( $e < $l )
    {
        switch( $params[ $e ] )
        {
            case ',' : case '}' : 
            {
                // Any data here?

                if ( 1 < $e - $s )
                {
                    // Divide

                    $a = split( '=', substr( $params, $s, $e - $s ) );
                    
                    // Valid?

                    if ( !isset( $a[ 0 ] ) ) $a[ 0 ] = 0;

                    else $a[ 0 ] = rawurldecode( $a[ 0 ] );

                    // Single value?

                    if ( !isset( $a[ 1 ] ) ) $arr[ $a[ 0 ] ] = '';

                    // Key / value pair

                    else $arr[ $a[ 0 ] ] = rawurldecode( $a[ 1 ] );

                } // end if


                // Move start

                $s = $e + 1;

                // Punt if end of array

                if ( '}' == $params[ $e ] ) 
                {   if ( $last ) $last = $e + 1; return $arr; }

            } break;

            case '{' :
            {
                $k = rawurldecode( substr( $params, $s, $e - $s ) );

                if ( isset( $k ) ) 
                {
                    $end_array = 1;

                    $arr[ $k ] = ScsDeserialize( substr( $params, $e + 1 ), $end_array );

                    $e += $end_array; 

                } // end if

                
                $s = $e + 1;
                
            } break;

        } // end switch


        // Next e

        $e++;

    } // end while


    return $arr;
}

Here's the JavaScript version. Not quite a one to one, as JavaScript apparently doesn't support references of generic types.

JavaScript

function ScsSerialize( x_params )
{
    var ret = '';
    for ( var key in x_params )
    {
        if ( key && x_params[ key ] )
        {
            // Continue link

            if ( ret ) ret += ',';

            // Save the key

            ret += escape( key );

            if( x_params[ key ].constructor == Array ||
                x_params[ key ].constructor == Object )
            {
                ret += '{' + ScsSerialize( x_params[ key ] ) + '}'; 
            }
            else ret += '=' + escape( x_params[ key ] );

        } // end if

    }

    return ret;
}

function ScsDeserialize( x_params, x_arr )
{
    var l = x_params.length, s = 0, e = 0;

    while ( e < l )
    {
        switch( x_params[ e ] )
        {
            case ',' : case '}' :
            {
                var a = x_params.substr( s, e - s ).split( '=' );

                if ( 1 < e - s )
                {
                    // Valid?

                    if ( null == a[ 0 ] ) a[ 0 ] = 0;

                    // Decode

                    else a[ 0 ] = unescape( a[ 0 ] );

                    // Single value?

                    if ( null == a[ 1 ] ) x_arr[ 0 ] = '';

                    // Key / value pair

                    else x_arr[ a[ 0 ] ] = unescape( a[ 1 ] );

                } // end if


                // Next data

                s = e + 1;

                // Punt if end of array

                if ( '}' == x_params[ e ] ) return e + 1;

            } break;

            case '{' :
            {
                // Get the key

                var k = x_params.substr( s, e - s );

                if ( k.length )
                {
                    // Decode the key

                    k = unescape( k );

                    // Decode array

                    x_arr[ k ] = Array();
                    e += ScsDeserialize( x_params.substr( e ), x_arr[ k ] );

                } // end if


                // Next data

                s = e + 1;

            } break;

        } // end switch


        // Next e

        e++;

    } // end while


    return e;
}

I know it's kinda long, but I'll go ahead and post the C++ version just so this article is as complete as possible. The mad cut-and-pasters will appreciate it, I'm sure. The actual encode / decode functions are about the same, but I added functions for converting from strings to integers and doubles etc... Just to make things easy to use. C++ does not have built-in support of this type.

#include <map>

#include <string>


//==================================================================

// TScsPropertyBag

//

/// Implements a multi-dimensional property bag with nested serialization

/**

    This class provides functionality for a multi-dimensional
    property bag. It also provides automatic type conversions
    and, hopefully, easily ported serialization.

    Typical use

    CScsPropertyBag arr1, arr2;

    arr1[ "A" ][ "AA" ] = "Hello World!";
    arr1[ "A" ][ "AB" ] = (long)1;
    arr1[ "B" ][ "BA" ] = (double)3.14159;
    
    for ( long i = 0; i < 4; i++ )
        arr1[ "list" ][ i ] = i * 2;

    // Encode
    CScsPropertyBag::t_String str = arr.serialize();

    // Let's have a look at the encoded string...
    TRACE( str.c_str() ); TRACE( _T( "\n" ) );

    // Decode
    arr2.deserialize( str );    

    // 'Hello World!' check...
    TRACE( arr2[ "A" ][ "AA" ] ); TRACE( _T( "\n" ) );

    // Get long value
    long lVal = arr2[ "A" ][ "AB" ].ToLong();

    // Get double
    double dVal = arr2[ "B" ][ "BA" ].ToDouble();

    // Get string value
    LPCTSTR pString = arr2[ "list" ][ 0 ];

*/
//==================================================================

template < class T > class TScsPropertyBag
{
public:

    //==================================================================

    // CAutoMem

    //

    /// Just a simple auto pointer

    /**
        This class is a simple auto pointer. It has properties that I
        particularly like for this type of job. I'll quit making my
        own when boost comes with VC... 
    */
    //==================================================================

    template < class T > class CAutoMem
    {
        public:

            /// Default constructor

            CAutoMem() { m_p = NULL; }

            /// Destructor

            ~CAutoMem() { release(); }

            /// Release allocated object

            void release() { if ( m_p ) { delete m_p; m_p = NULL; } }

            /// Returns a pointer to encapsulated object

            T& Obj() { if ( !m_p ) m_p = new T; return *m_p; }

            /// Returns a pointer to the encapsulated object

            operator T&() { return Obj(); }

        private:

            /// Contains a pointer to the controlled object

            T       *m_p;
            
    };

    /// Unicode friendly string

    typedef std::basic_string< T > t_String;

    /// Our multi-dimensional string array type

    typedef std::map< t_String, CAutoMem< TScsPropertyBag< T > > > t_StringArray;

public:
    
    /// Default constructor

    TScsPropertyBag() { }

    //==============================================================

    // TScsPropertyBag()

    //==============================================================

    /// Constructos object from encoded string

    /**
        \param [in] sStr    -   Encoded array
    */
    TScsPropertyBag( t_String sStr )
    {   deserialize( sStr );
    }

    //==============================================================

    // IsStdChar()

    //==============================================================

    /// Returns non-zero if the character does *not* need encoding

    /**
        \param [in] ch  -   Character to check
    */
    static bool IsStdChar( T ch ) 
    {   return  ( _T( 'a' ) <= ch && _T( 'z' ) >= ch ) ||
                ( _T( 'A' ) <= ch && _T( 'Z' ) >= ch ) ||
                ( _T( '0' ) <= ch && _T( '9' ) >= ch ) ||
                _T( '_' ) == ch || _T( '-' ) == ch || 
                _T( '.' ) == ch;
    }


    //==============================================================

    // urlencode()

    //==============================================================

    /// Returns URL encoded version of a string.

    /**
        \param [in] sStr    -   String to encode
        
        \return TSend_string object containing encoded string
    */
    static t_String urlencode( t_String sStr )
    {
        t_String sRes;
        long lLen = sStr.length(), i = 0;

        T tmp[ 256 ];
        while ( i < lLen )
        {
            if ( IsStdChar( sStr[ i ] ) ) sRes += sStr[ i ];
            else
            {   _stprintf( tmp, _T( "%%%02lX" ), (long)sStr[ i ] );
                sRes += tmp;
            } // end else            


            i++;

        } // end while


        return sRes;
    }


    //==============================================================

    // urldecode()

    //==============================================================

    /// Decodes URL encoded string

    /**
        \param [in] sStr    -   URL encoded string to decode
        
        \return Decoded string
    */
    static t_String urldecode( t_String sStr )
    {
        t_String sRes;
        long lLen = sStr.length(), i = 0;

        T tmp[ 256 ];
        while ( i < lLen )
        {
            if ( _T( '%' ) != sStr[ i ] ) sRes += sStr[ i ];
            else
            {
                tmp[ 0 ] = sStr[ ++i ]; tmp[ 1 ] = sStr[ ++i ]; tmp[ 2 ] = 0;

                sRes += (TCHAR)( _tcstoul( tmp, NULL, 16 ) );

            } // end else            


            i++;

        } // end while


        return sRes;
    }


    //==============================================================

    // destroy()

    //==============================================================

    /// Releases all memory resources and prepares class for reuse.

    void destroy() { m_lstSub.clear(); m_str.release(); }


    //==============================================================

    // serialize()

    //==============================================================

    /// Serializes the array

    /**
        \return Serialized array.
    
        \see 
    */
    t_String serialize()
    {
        t_String sRes;
        
        // Just return our value if we're not an array

        if ( !IsArray() ) return m_str.Obj();

        // Iterator

        t_StringArray::iterator pos = m_lstSub.begin();

        // For each array element

        while ( pos != m_lstSub.end() )
        {
            // Add separator if needed

            if ( sRes.length() ) sRes += _T( ',' );

            sRes += pos->first;

            // Is it an array?

            if ( pos->second.Obj().IsArray() )
            { 
                sRes += _T( '{' ); 
                sRes += pos->second.Obj().serialize(); 
                sRes += _T( '}' ); 
            }
                
            // Serialize the value

            else sRes += _T( '=' ), sRes += urlencode( (LPCTSTR)pos->second.Obj() );

            // Next array element

            pos++;

        } // end while

        
        return sRes;
    }


    //==============================================================

    // deserialize()

    //==============================================================

    /// Deserializes an array from string

    /**
        \param [in] sStr    -   Serialized array string.
        \param [in] bMerge  -   Non-zero if array should be merged
                                into current data. Set to zero to
                                replace current array.
        \param [in] pLast   -   Receives the number of bytes decoded.
        \param [in] pPs     -   Property bag that receives any decoded
                                characters. We could also have just
                                called this function on the object,
                                but this way provides a little extra
                                flexibility for later.
        
        \return Number of items deserialized.
    
        \see 
    */
    LONG deserialize(   t_String sStr, BOOL bMerge = FALSE, 
                        LONG *pLast = NULL, TScsPropertyBag *pPs = NULL )
    {
        // Ensure object

        if ( !pPs ) pPs = this;

        // Do we want to merge?

        if ( !bMerge ) pPs->destroy();

        LONG lItems = 0;
        long lLen = sStr.length(), s = 0, e = 0;

        while ( e < lLen )
        {
            switch( sStr[ e ] )
            {
                case ',' : case '}' :
                {
                    if ( 1 < e - s )
                    {
                       // Find '='

                       long a = s; while ( a < e && '=' != sStr[ a ] ) a++;

                       t_String sKey, sVal;

                       // First character is separator

                       if ( a == s ) 
                        sKey = urldecode( t_String( &sStr.c_str()[ s + 1 ], e - s - 1 ) );

                       else sKey = urldecode( t_String( &sStr.c_str()[ s ], a - s ) );
     
                       // Single token

                       if ( 1 >= e - a ) (*pPs)[ sKey ] = _T( "" );

                       // Both tokens present

                       else (*pPs)[ sKey ] = 
                        urldecode( t_String( &sStr.c_str()[ a + 1 ], e - a - 1 ) );

                       // Count one item

                       lItems++;

                    } // end if


                    // Next element

                    s = e + 1;

                    // Time to exit?

                    if ( '}' == sStr[ e ] )
                    {   if ( pLast ) *pLast = e + 1; return lItems; }

                } break;

                case '{' :
                {              
                    // Get key

                    t_String sKey = 
                        urldecode( t_String( &sStr.c_str()[ s ], e - s ) );

                    // Do we have a key?

                    if ( sKey.length() )
                    {
                        // This will point to the end of the array we're about to decode

                        LONG lEnd = 0;

                        // Get the sub array

                        lItems += deserialize(  t_String( &sStr.c_str()[ e + 1 ] ), 
                                                TRUE, &lEnd, &(*pPs)[ sKey ] );

                        // Skip the array we just decoded

                        e += lEnd;

                    } // end if


                    // Skip this token

                    s = e + 1; 

                } break;

            } // end switch


            // Next i

            e++;

        } // end while


        return lItems;
    }


    //==============================================================

    // operator []()

    //==============================================================

    /// Indexes into sub array

    /**
        \param [in] pKey    -   Index key
        
        \return Reference to sub class.
    
        \see 
    */
    TScsPropertyBag& operator []( LPCTSTR pKey ) { return m_lstSub[ pKey ]; }

    //==============================================================

    // operator []()

    //==============================================================

    /// Indexes into sub array

    /**
        \param [in] sKey    -   Index key
        
        \return Reference to sub class.
    
        \see 
    */
    TScsPropertyBag& operator []( t_String sKey ) { return m_lstSub[ sKey.c_str() ]; }

    //==============================================================

    // operator []()

    //==============================================================

    /// Indexes into sub array

    /**
        \param [in] n   -   Index key
        
        \return Reference to sub class.
    
        \see 
    */
    TScsPropertyBag& operator []( long n ) 
    {   TCHAR szKey[ 256 ] = _T( "" ); 
        _stprintf( szKey, _T( "%li" ), n );
        return m_lstSub[ szKey ]; 
    }

    //==============================================================

    // operator []()

    //==============================================================

    /// Indexes into sub array

    /**
        \param [in] n   -   Index key
        
        \return Reference to sub class.
    
        \see 
    */
    TScsPropertyBag& operator []( unsigned long n ) 
    {   TCHAR szKey[ 256 ] = _T( "" ); 
        _stprintf( szKey, _T( "%lu" ), n );
        return m_lstSub[ szKey ]; 
    }

    //==============================================================

    // operator []()

    //==============================================================

    /// Indexes into sub array

    /**
        \param [in] n   -   Index key
        
        \return Reference to sub class.
    
        \see 
    */
    TScsPropertyBag& operator []( double n ) 
    {   TCHAR szKey[ 256 ] = _T( "" ); 
        _stprintf( szKey, _T( "%g" ), n );
        return m_lstSub[ szKey ]; 
    }

    //==============================================================

    // operator = ()

    //==============================================================

    /// Conversion from string object

    t_String operator = ( t_String sStr ) 
    {   m_str.Obj() = sStr.c_str(); return m_str.Obj(); }


    //==============================================================

    // operator = ()

    //==============================================================

    /// Conversion from string

    t_String operator = ( LPCTSTR pStr ) 
    {   m_str.Obj() = pStr; return m_str.Obj(); }


    //==============================================================

    // operator = ()

    //==============================================================

    /// Conversion from long

    t_String operator = ( long lVal )
    {   T num[ 256 ] = _T( "" );
        _stprintf( num, _T( "%li" ), lVal );
        m_str.Obj() = num; return m_str.Obj(); 
    }

    //==============================================================

    // operator = ()

    //==============================================================

    /// Conversion from unsigned long

    t_String operator = ( unsigned long ulVal )
    {   T num[ 256 ] = _T( "" );
        _stprintf( num, _T( "%lu" ), ulVal );
        m_str.Obj() = num; return m_str.Obj(); 
    }

    //==============================================================

    // operator = ()

    //==============================================================

    /// Conversion from double

    t_String operator = ( double dVal )
    {   T num[ 256 ] = _T( "" );
        _stprintf( num, _T( "%g" ), dVal );
        m_str.Obj() = num; return m_str.Obj(); 
    }


    //==============================================================

    // LPCTSTR()

    //==============================================================

    /// Conversion to string

    operator LPCTSTR() { return ToStr(); }


    //==============================================================

    // ToStr()

    //==============================================================

    /// Returns local string object

    LPCTSTR ToStr() { return m_str.Obj().c_str(); }

    //==============================================================

    // ToLong()

    //==============================================================

    /// Converts to long

    long ToLong() { return _tcstol( ToStr(), NULL, 10 ); }

    //==============================================================

    // ToULong()

    //==============================================================

    /// Converts to unsigned long

    long ToULong() { return _tcstoul( ToStr(), NULL, 10 ); }

    //==============================================================

    // ToDouble()

    //==============================================================

    /// Converts to double

    long ToDouble() { return _tcstod( ToStr(), NULL ); }

    //==============================================================

    // IsArray()

    //==============================================================

    /// Returns non-zero if array elements are present

    BOOL IsArray() { return 0 < m_lstSub.size(); }

private:

    /// Our value

    CAutoMem< t_String >            m_str;

    /// Array of strings

    t_StringArray                   m_lstSub; 

};

/// Property bag type

/** \see TScsPropertyBag */
typedef TScsPropertyBag< TCHAR > CScsPropertyBag;

Comparison

In terms of simplicity, it's hard to get much simpler. The only simpler versions I have seen are of the runaway type, or language specific. Such as manually outputting a JavaScript array, for instance. In this case, our work is lost if we want to now switch to another target language.

One way in which XML excels, as seen below, is human readability. Although it is possible to decipher the SCS string, it is not as clear unless you add new line characters. It would have been possible to make the decoder white space agnostic, but it would have required tokenizing the data. This would have just been something someone could leave out of the implementation, and thus we would have strayed from our goals. Also, the introduction of white space could potentially cause problems when pasting data as strings into source files. This has priority here as being closer to our goals of cross-language communication. Though we attempt to make it somewhat readable, take into account that human readability is not a priority for SCS when considering your options.

In terms of bandwidth, say for an AJAX project. Consider the following array...

JavaScript

// Test array
$A = array( 'Department'=>
                array(
                    'Accounting'=>
                        array(
                            'John'=>
                                array(
                                    'Married'=>'Yes',
                                    'DOB'=>'1-14-78',
                                    'Pets'=>
                                        array(
                                            'Fish'=>8,
                                            'Dog'=>1,
                                            'Cat'=>2
                                        ),
                                    'ValidCharacters'=>'.-_',
                                    'InvalidCharacters'=>'[,=]'
                                ),
                            'Mary'=>
                                array(
                                    'Married'=>'No',
                                    'DOB'=>'7-2-82',
                                    'Pets'=>
                                        array(
                                            'Dog'=>1,
                                        ),
                                    'InvalidCharacters'=>'!@#$%^&*()'
                                ),
                        ),
                ),
            );

Our SCS implementation comes in at 218 bytes. 42% less than the XML equivalent. But is harder to read. It looks like this:

JavaScript

Department{Accounting{John{Married=Yes,DOB=1-14-78,Pets{Fish=8,Dog=1,Cat=2},\
    ValidCharacters=.-_,InvalidCharacters=%5B%2C%3D%5D},\
    Mary{Married=No,DOB=7-2-82,Pets{Dog=1},\
    InvalidCharacters=%21%40%23%24%25%5E%26%2A%28%29}}}

This typical XML output weighs in at 517 bytes. I struggled a little with whether or not to remove the header and formatting characters. I decided to leave them since this really is a lot of the argument for using XML, to be 'standard'. This is actually cheating a little since I used the same URL encoding instead of the more common base-64. But, XML allows me this.

XML

<?xml version="1.0" encoding="UTF-8" ?>

<Department>
    <Accounting>
        <John>
            <Married>Yes</Married>
            <DOB>1-14-78</DOB>

            <Pets>
                <Fish>8</Fish>
                <Dog>1</Dog>
                <Cat>2</Cat>
            </Pets>
            <ValidCharacters>.-_</ValidCharacters>
            <InvalidCharacters>%5B%2C%3D%5D</InvalidCharacters>

        </John>
        <Mary>
            <Married>No</Married>
            <DOB>7-2-82</DOB>
            <Pets>
                <Dog>1</Dog>
            </Pets>

            <InvalidCharacters>%21%40%23%24%25%5E%26%2A%28%29</InvalidCharacters>
        </Mary>
    </Accounting>
</Department>

The typical runaway implementation. It should be noted that this example particularly amplifies the redundant encoding issue. There are other instances where it would be competitive though never significantly better. Notice the severe mangling due to the recursive encoding.

Department=?Accounting%3D%3FJohn%253D%253FMarried%25253DYes%252526DOB%25253D1-14-78%252526\
    Pets%25253D%25253FFish%2525253D8%25252526Dog%2525253D1%25252526Cat%2525253D2%252526\
    ValidCharacters%25253D.-_%252526InvalidCharacters%25253D%2525255B%2525252C%2525253D%\
    2525255D%2526Mary%253D%253FMarried%25253DNo%252526DOB%25253D7-2-82%252526\
    Pets%25253D%25253FDog%2525253D1%252526InvalidCharacters%25253D%25252521%25252540\
    %25252523%25252524%25252525%2525255E%25252526%2525252A%25252528%25252529

Flexibility

We are not going to attempt to encode variable types or other properties such as minimum and maximum values at the parser level. But these things can still be done in the framework of the current protocol. For example, consider the following XML:

XML

<variable name=x type=float min=-10 max=10>3.14</variable>

We can represent this type of information by just adding a sub array. In the case of XML, the content or value field is implicit between the tags. We will need to add an explicit 'value' field. And the result is actually shorter than the minimal XML.

variable{name=x,type=float,min=-10,max=10,value=3.14}

Or better still...

x{type=float,min=-10,max=10,value=3.14}

Also, there cannot be similar names at a given scope. For instance...

XML

<table><tr><td>One</td><td>Two</td></tr>
<tr><td>Three</td></tr><table>

Would have to be represented as something like:

table{tr{0{td{0=One,1=Two}},1{td{0=three}}}}

// For clarity
table
{
    tr
    {
        0
        {
            td
            {
                0 = One,
                1 = Two
            }
        },

        1
        {
            td
            {
                0 = three
            }
        }
    }
}

You'll find most data structures can be represented well enough in this protocol. It's usually just a matter of efficiency, especially when dealing with high-bandwidth, binary data like live video or audio. Then again, what format covers everything well?

Conclusion

I think that supplies a good idea of what was being attempted, and what was achieved. A few notes...

Notice that the supplied functions allow you to easily serialize parts of the array as well as the whole array. Also, you can decode one array into a larger array. This is a subtle but powerful construct.

The property bag concept achieved in the C++ implementation is a powerful addition to the language. It can severely cut development time when dealing with data. The nice thing about C++ is that you can describe how exactly you want operators to behave. I actually use a more advanced form of this class that allows serializing/deserializing into lots of formats like the Windows Registry, INI files, URL GET and POST variables, MIME formats, database, etc... This can be an enormously powerful way to handle generic data. I know I didn't invent this by the way, there are many examples out there...

I'd like to add more languages to this example. Perl, Python, VB, come to mind. If anyone wants to donate, please feel free.

Thanks everybody!

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)