Byte Sex

Paul M Watt

29 Mar 2014, CPOL, 9 min read

Byte endianness

Byte-gender; not, "Yes! Please!"
Good! Now that I have your attention, let's solve a relatively simple problem: byte sex. A less sensational name for this concept is byte endianness. This is one of those concepts that you should at least be aware of, even if you don't have to pay much attention to it in your day-to-day work. Each CPU architecture defines its own conventions for how data is laid out in memory. One of these conventions is the endianness of multi-byte values. This is the first issue that I address for Network Alchemy.

Endianness

Endianness defines the order in which bytes are arranged within a word in machine hardware. There are two commonly used types: big-endian and little-endian. Other formats do exist; however, they are rare. Endianness is often mentioned in the context of network communication. Be aware that it must actually be addressed for any medium used to transfer data from one machine to another, including binary file formats.

Big Endian: Big Endian format places the most significant byte at the left of the byte sequence. In other words, the byte sequence starts with the most significant byte and ends with the least significant byte. This matches the way we normally write numbers.
Little Endian: The least significant byte is at the left of the byte sequence.
Network Byte-order: The byte-order format used to transfer data between two different machines. Big-endian byte-order format is used for network byte-order by convention.
Host Byte-order: The local machine's byte order format.
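As a rough illustration of the two layouts, the sketch below writes a 32-bit value out byte by byte in each order and then checks which of the two the host actually uses in memory. The helper names ToBigEndianBytes, ToLittleEndianBytes, and IsLittleEndianHost are made up for this example and are not part of Network Alchemy.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helpers, for illustration only.

// Most significant byte first: the order used for network byte-order.
inline std::array<std::uint8_t, 4> ToBigEndianBytes(std::uint32_t value)
{
  return {{ static_cast<std::uint8_t>(value >> 24),
            static_cast<std::uint8_t>(value >> 16),
            static_cast<std::uint8_t>(value >> 8),
            static_cast<std::uint8_t>(value) }};
}

// Least significant byte first.
inline std::array<std::uint8_t, 4> ToLittleEndianBytes(std::uint32_t value)
{
  return {{ static_cast<std::uint8_t>(value),
            static_cast<std::uint8_t>(value >> 8),
            static_cast<std::uint8_t>(value >> 16),
            static_cast<std::uint8_t>(value >> 24) }};
}

// Inspects how the host stores a multi-byte value in memory.
inline bool IsLittleEndianHost()
{
  const std::uint32_t probe = 1;
  std::uint8_t first = 0;
  std::memcpy(&first, &probe, 1);
  return first == 1;
}

inline void ByteLayoutExample()
{
  const std::uint32_t value = 0x600D1DEA;
  const std::array<std::uint8_t, 4> big    = ToBigEndianBytes(value);
  const std::array<std::uint8_t, 4> little = ToLittleEndianBytes(value);
  assert(big[0] == 0x60 && big[3] == 0xEA);
  assert(little[0] == 0xEA && little[3] == 0x60);

  // The host's own in-memory layout matches one of the two sequences
  // (assuming the host is not one of the rare mixed-endian machines).
  std::array<std::uint8_t, 4> memory;
  std::memcpy(memory.data(), &value, sizeof(value));
  assert(memory == (IsLittleEndianHost() ? little : big));
}
```

The shift-based helpers give the same answer on every machine, which is why they are convenient for serialization; only the memcpy-based probe depends on the host.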

The description given at Wikipedia contains a more thorough treatment of this subject.

Traditional Solution

A small set of functions is usually provided with the socket libraries for a platform. These functions perform a byte-order conversion for integers represented with multi-byte formats. The set of supported functions varies between platforms. You can generally count on this set of four to exist:

htons: Converts a 16-bit integer (short) from host to network byte-order
ntohs: Converts a 16-bit integer (short) from network to host byte-order
htonl: Converts a 32-bit integer (long) from host to network byte-order
ntohl: Converts a 32-bit integer (long) from network to host byte-order

C++

unsigned short s_before = 0xC0DE;
unsigned long  l_before = 0x600D1DEA;  
unsigned short s_after  = htons(s_before); 
unsigned long  l_after  = htonl(l_before);

Endianness	Type	Before (hex)	Before (decimal)	After (hex)	After (decimal)
Big	short	0xC0DE	49374	0xC0DE	49374
Big	long	0x600D1DEA	1611472362	0x600D1DEA	1611472362
Little	short	0xC0DE	49374	0xDEC0	57024
Little	long	0x600D1DEA	1611472362	0xEA1D0D60	3927772512

The results for values converted by htonXX functions

Little-endian systems are the only systems that require the data to be modified. Calls to the htonXX() functions will not generate any code on big-endian systems. The results in the table are the same if the same data is passed to the ntohXX() functions. In fact, ntohs() could be implemented like this:

C++

inline
unsigned short ntohs(
  unsigned short netshort
)
{
  return htons(netshort);
}
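Because reversing the bytes twice restores the original value, one swap routine can serve both directions. The sketch below uses a hypothetical Swap16 helper in place of the socket functions so that it builds without platform headers. It always swaps, which is what htons() and ntohs() do on a little-endian host.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for htons()/ntohs() on a little-endian host.
inline std::uint16_t Swap16(std::uint16_t value)
{
  return static_cast<std::uint16_t>((value << 8) | (value >> 8));
}

// Usage: the same routine converts in both directions.
inline void Swap16Example()
{
  const std::uint16_t host = 0xC0DE;
  const std::uint16_t net  = Swap16(host);  // "host to network"
  assert(net == 0xDEC0);
  assert(Swap16(net) == host);              // "network to host"
}
```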

Potential Problems

I have experienced these two issues, related to the byte-order conversion functions, creeping into a project. This message structure and byte-order conversion function will be used for reference to demonstrate these problems.

C++

struct DataMsg
{
  int           count;
  short         index;
  unsigned char code;
  unsigned long value;
};

void ConvertMsgToHost(DataMsg &msg)
{
  msg.count = (int)ntohl((unsigned long)msg.count);
  msg.index = (short)ntohs((unsigned short)msg.index);
  msg.value = ntohl(msg.value);
}

Inconsistent Conversion Calls

If a strong pattern of implementation is not created and followed with the message transport process, more care will be required to keep track of the state of a message buffer. Has the buffer already been converted to network byte-order? If a buffer that has already been converted to network byte-order is converted a second time, the result will be returned to the host byte-order.

Mistake #1

C++

// Calling the convert function an even number of times
// will return the data to its original byte-order:
ConvertMsgToHost(msg);

// ... later called on the same message.
ConvertMsgToHost(msg);

Mistake #2

C++

// Developer is unaware that
// ConvertMsgToHost exists or is called:
ConvertMsgToHost(msg);

// ... later
// ERROR: The result returned is in network order.
long value = ntohl(msg.value);

Field Type Changes

Over time, the field types in a message may change. It is possible for the type of a field to be changed while the byte-order conversion function remains unchanged. This is especially likely because of the tendency to use explicit casts with these functions. If the change goes undetected, data values will be truncated and incorrect values will be transported across the network.

C++

// Noted types are changed in the message format below:
struct DataMsg 
{
  int   count; 
  size_t index;        // Changed from short 
  unsigned short code; // Changed from unsigned char 
  unsigned long value; 
}; 
 
// The mistake occurs if the conversion 
// function is not properly updated: 
void ConvertMsgToHost(DataMsg &msg) 
{ 
  msg.count = (int)ntohl((unsigned long)msg.count);
  
  // The explicit cast will force this call to succeed. 
  // Error: data is truncated and in wrong order 
  msg.index = (short)ntohs((unsigned short)msg.index); 
  msg.value = ntohl(msg.value); 
  // The value msg.code will not be properly converted  
}
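To make the truncation concrete, the sketch below applies the stale 16-bit conversion to a field that has grown to size_t. Swap16 and StaleConvertIndex are hypothetical names. Swap16 stands in for ntohs() on a little-endian host so that the example builds without socket headers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for ntohs() on a little-endian host.
inline std::uint16_t Swap16(std::uint16_t value)
{
  return static_cast<std::uint16_t>((value << 8) | (value >> 8));
}

// Mirrors the outdated line in ConvertMsgToHost: the casts silence the
// compiler, so only the low 16 bits of the wider field are kept and swapped.
inline std::size_t StaleConvertIndex(std::size_t index)
{
  return static_cast<std::size_t>(
    static_cast<short>(Swap16(static_cast<unsigned short>(index))));
}

inline void StaleConvertExample()
{
  // 0x00012345 needs more than 16 bits; the leading 0x0001 is lost
  // and only 0x2345 is swapped.
  assert(StaleConvertIndex(0x00012345u) == 0x4523u);
}
```

Nothing in the build flags this, which is the point: the compiler trusts the casts.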

Generic Solution

One way to improve the maintainability of your code is to use solutions that are highly consistent. Compared to the traditional approach, our solution will be more consistent in the way it handles the context-sensitive information involved in a conversion. I am referring to:

the data-type to be converted
the host byte-order
the target byte-order of the data

If any of the field types in our DataMsg struct changes, our converter function will continue to be valid without any required changes. The EndianSwap function simply knows what to do. This does not yet resolve the issue of inconsistently converted messages. We will address that after we have a robust and consistent method to swap the byte-order of data fields.

C++

void ConvertMsgToHost(DataMsg &msg) 
{ 
  msg.count = EndianSwap(msg.count); 
  msg.index = EndianSwap(msg.index);  
  msg.code  = EndianSwap(msg.code);
  msg.value = EndianSwap(msg.value); 
}

We can create a function like EndianSwap that will take the appropriate action regardless of the type of field passed into it. The socket byte conversion functions are written to be compatible with C. Therefore, a different name must be given to each function, for each type supported. Function overloading in C++ will allow us to create a set of functions that can be used to generically convert the byte-order of many different types of fields. This still leaves the problem that calling the convert function an even number of times returns the data to its original byte-order. We will revisit this once I create the first part of the solution.

Byte-Order Swap

Because a large variety of types may be encoded in data, we will start with a template-based approach to create a generic EndianSwap function. This will allow us to create sets of solutions and reuse them, rather than duplicating code and giving a new function prototype to each function. The base implementation will provide an empty solution that simply returns the input value. The compiler will optimize away the function call and result assignment. This effectively turns this call into a no-op:

C++

template <typename T>
inline
T EndianSwap(T input)
{
  return input;
}

A specialization of this template will be implemented for each type that requires byte-order conversion logic. Here is the implementation for unsigned 16-bit and unsigned 32-bit integers.

C++

template <>
inline
uint16_t EndianSwap(uint16_t input)
{
  return  (input << convert::k_8bits)
        | (input >> convert::k_8bits);
}

template <>
inline
uint32_t EndianSwap(uint32_t input)
{
  return  (input  << convert::k_24bits)
        | ((input >> convert::k_8bits) & 0x0000FF00)
        | ((input << convert::k_8bits) & 0x00FF0000)
        | (input  >> convert::k_24bits);
}
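The convert::k_8bits and convert::k_24bits constants are defined elsewhere in the library source. For readers who want to try the idea in isolation, here is a minimal self-contained sketch that assumes they are plain shift widths of 8 and 24 and substitutes literals for them.

```cpp
#include <cassert>
#include <cstdint>

// Base template: types without a specialization pass through unchanged.
template <typename T>
inline T EndianSwap(T input)
{
  return input;
}

template <>
inline std::uint16_t EndianSwap(std::uint16_t input)
{
  // Assumption: convert::k_8bits == 8.
  return static_cast<std::uint16_t>((input << 8) | (input >> 8));
}

template <>
inline std::uint32_t EndianSwap(std::uint32_t input)
{
  // Assumption: convert::k_8bits == 8 and convert::k_24bits == 24.
  return  (input << 24)
        | ((input >> 8) & 0x0000FF00u)
        | ((input << 8) & 0x00FF0000u)
        | (input >> 24);
}

inline void EndianSwapExample()
{
  assert(EndianSwap(std::uint16_t(0xC0DE)) == 0xDEC0);
  assert(EndianSwap(std::uint32_t(0x600D1DEA)) == 0xEA1D0D60u);
  // No specialization exists for single-byte types, so nothing changes.
  assert(EndianSwap(static_cast<unsigned char>(0x7F)) == 0x7F);
}
```

The explicit cast in the 16-bit version is there because the shifts are performed on a promoted int, and the result has to be narrowed back to 16 bits.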

I chose to implement my own set of byte-order swap functions for a couple of reasons.

To remain independent of system socket libraries. The functions are not portably implemented or universally available.
There is now only one byte-order conversion function rather than two with different names.
Added flexibility: the socket functions become no-ops on big-endian systems, whereas EndianSwap will always swap.

Another way we will improve upon the existing byte-order conversion functions is by providing a specialization for the signed types. This is necessary to eliminate the need to provide a cast with calls to these functions. The signed versions can be implemented in terms of the unsigned definitions. However, care must be taken to avoid conditions that would cause an integer overflow. Overflow with signed integers can result in truncated data.

C++

template < >
inline
int16_t EndianSwap(int16_t input)
{
  return static_cast < int16_t >(
    EndianSwap(static_cast < uint16_t >(input))
  );
}
 
template < >
inline
int32_t EndianSwap(int32_t input)
{
  return static_cast < int32_t >(
    EndianSwap(static_cast < uint32_t >(input))
  );
}
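Here is a minimal sketch of the signed wrappers in use, again with literal shift widths assumed in place of the convert:: constants. Converting an out-of-range unsigned value back to a signed type is implementation-defined before C++20, although mainstream compilers wrap it as two's complement; the round-trip checks on negative values lean on that behaviour.

```cpp
#include <cassert>
#include <cstdint>

template <typename T>
inline T EndianSwap(T input) { return input; }

template <>
inline std::uint16_t EndianSwap(std::uint16_t input)
{
  return static_cast<std::uint16_t>((input << 8) | (input >> 8));
}

template <>
inline std::uint32_t EndianSwap(std::uint32_t input)
{
  return  (input << 24) | ((input >> 8) & 0x0000FF00u)
        | ((input << 8) & 0x00FF0000u) | (input >> 24);
}

// Signed versions convert the value to unsigned, swap, and convert back,
// so no shift is ever applied to a negative number.
template <>
inline std::int16_t EndianSwap(std::int16_t input)
{
  return static_cast<std::int16_t>(
    EndianSwap(static_cast<std::uint16_t>(input)));
}

template <>
inline std::int32_t EndianSwap(std::int32_t input)
{
  return static_cast<std::int32_t>(
    EndianSwap(static_cast<std::uint32_t>(input)));
}

inline void SignedSwapExample()
{
  // No casts are needed at the call site.
  assert(EndianSwap(std::int16_t(0x0102)) == 0x0201);
  assert(EndianSwap(std::int32_t(0x01020304)) == 0x04030201);
  // Swapping twice restores negative values as well.
  assert(EndianSwap(EndianSwap(std::int16_t(-2))) == -2);
  assert(EndianSwap(EndianSwap(std::int32_t(-123456))) == -123456);
}
```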

Manage Context Information

We now have a consistent method to swap the byte-order for any data-type that we choose. However, we still need to account for the other types of context sensitive information for this solution to be useful. We have two additional pieces of context information:

Big Endian / Little Endian
Host Byte-order / Network Byte-order

Two types of context information with binary values means we will have 4 possible solutions. Here is a first attempt to create a solution that incorporates all of the pieces.

Constants deduced from compiler settings and platform header files:

C++

/// Platform Endian types
enum Endianess
{
  k_big_endian    = 0,
  k_little_endian = 1
}; 
 
/// Constant indicates machine endianess.
const Endianess k_endianess = Endianess(NA_ENDIANESS);

Host-order conversion function:

C++

template < typename T>
inline
T ToHostOrder(T input)
{
  if (k_little_endian == k_endianess)
  {
    return EndianSwap(input);
  }
  else
  {
    return input;
  } 
} 

// A function called ToNetworkOrder is also
// implemented with the identical implementation.

At this point, we have two separate functions, that both handle two cases. This manages our four distinct possibilities for combinations of context-sensitive information. However, this is not the final solution. In fact, this solution has all of the same issues as the socket library calls, except we can handle any type of data. We have even added a runtime cost to the implementation with an if statement.

Improve the Solution

We are going to put the compiler to work for us to improve this solution. Templates are an excellent option when you have all of the information you need at compile-time. The solution in the previous step used two runtime mechanisms to manage the context-sensitive information, a function call and a conditional. Let's create a template that will let the compiler decide if it is necessary to swap the order of bytes for a data-type rather than a runtime conditional statement.

Generic EndianType template that swaps based upon the endian-type of the machine.

C++

// Default template implementation
// This version always performs a swap.
template < typename T, bool isSwap >
struct EndianType
{
  static
  T SwapOrder(const T& value)
  {
    return EndianSwap(value);
  }
};

EndianType specialization that does not swap the order of bytes.

C++

// The false value is set for the isSwap field
// This is the selector to prevent the byte-order swap
template < typename T >
struct EndianType < T,false>
{
  static
  T SwapOrder(const T& value)
  {
    return value;
  }
};
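To see the compile-time selection at work, here is a small self-contained sketch that condenses the two EndianType definitions and pairs them with a 16-bit EndianSwap. The literal shift width is an assumption standing in for the library's constant.

```cpp
#include <cassert>
#include <cstdint>

template <typename T>
inline T EndianSwap(T input) { return input; }

template <>
inline std::uint16_t EndianSwap(std::uint16_t input)
{
  return static_cast<std::uint16_t>((input << 8) | (input >> 8));
}

// Primary template: performs the swap.
template <typename T, bool isSwap>
struct EndianType
{
  static T SwapOrder(const T& value) { return EndianSwap(value); }
};

// Partial specialization selected when isSwap is false: pass through.
template <typename T>
struct EndianType<T, false>
{
  static T SwapOrder(const T& value) { return value; }
};

inline void EndianTypeExample()
{
  const std::uint16_t value = 0xC0DE;
  // The bool is a compile-time constant, so the choice between the two
  // versions is made by the compiler and no branch is evaluated at runtime.
  assert((EndianType<std::uint16_t, true>::SwapOrder(value)  == 0xDEC0));
  assert((EndianType<std::uint16_t, false>::SwapOrder(value) == 0xC0DE));
}
```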

Template meta-programming solutions require all of the information to be known up front, and any decisions need to be calculated by the compiler. Therefore, constants and data-types become the primary mechanisms used to store information rather than variables. The compiler must have all of the answers fixed in-place to determine which pieces to use for construction of the logical statements. We now have a new struct with a static function to conditionally swap the order of bytes in a data-type. Let's connect it to the final piece of the solution.

We will shift another piece of the context-sensitive information from a variable to a type. The message's byte-order will be encoded in this type. This object handles conversion of data that is in host byte-order. The ToHost function call is a no-op, and ToNetwork provides the byte-order conversion.

C++

template < Endianess E > 
struct HostByteOrderT
{
  static const
    Endianess order = E; 
 
  static const
    bool       isHost = true; 
 
  template < typename T>
  static
  T ToNetwork(const T& input)
  {
    return EndianType < T,
                        (k_big_endian != order)
                      >::SwapOrder(input); 
  }

  template < typename T>
  static
  T ToHost(const T& input)
  {
    return input;
  }
};

Here is the corresponding implementation for the network byte-order type.

C++

template < Endianess E >
struct NetworkByteOrderT 
{ 
  static const 
    Endianess order = E; 

  static const
    bool       isHost = false; 
 
  template < typename T> 
  static
  T ToHost(const T& input)
  {
    return EndianType < T,
                        (k_big_endian != order) 
                      >::SwapOrder(input); 
  } 

  template < typename T>
  static 
  T ToNetwork(const T& input)
  {
    return input; 
  } 
};

These two typedefs will simplify usage:

C++

typedef HostByteOrderT    < k_endianess > HostByteOrder;
typedef NetworkByteOrderT < k_endianess > NetByteOrder;
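The typedefs bind the templates to the build machine through k_endianess, whose value comes from the NA_ENDIANESS macro that is not shown in this entry. To see what each combination produces without depending on the build machine, this sketch instantiates the host-order template with both endian values explicitly. It is a condensed restatement of the host-order code, with a literal shift width assumed for the 16-bit swap.

```cpp
#include <cassert>
#include <cstdint>

enum Endianess { k_big_endian = 0, k_little_endian = 1 };

template <typename T>
inline T EndianSwap(T input) { return input; }

template <>
inline std::uint16_t EndianSwap(std::uint16_t input)
{
  return static_cast<std::uint16_t>((input << 8) | (input >> 8));
}

template <typename T, bool isSwap>
struct EndianType
{
  static T SwapOrder(const T& value) { return EndianSwap(value); }
};

template <typename T>
struct EndianType<T, false>
{
  static T SwapOrder(const T& value) { return value; }
};

template <Endianess E>
struct HostByteOrderT
{
  static const Endianess order = E;

  template <typename T>
  static T ToNetwork(const T& input)
  {
    return EndianType<T, (k_big_endian != order)>::SwapOrder(input);
  }

  template <typename T>
  static T ToHost(const T& input) { return input; }
};

inline void HostByteOrderExample()
{
  const std::uint16_t value = 0xC0DE;
  // Little-endian host: going to network order swaps the bytes.
  assert(HostByteOrderT<k_little_endian>::ToNetwork(value) == 0xDEC0);
  // Big-endian host: host order already is network order.
  assert(HostByteOrderT<k_big_endian>::ToNetwork(value) == 0xC0DE);
  // Data that is already in host order is never touched by ToHost.
  assert(HostByteOrderT<k_little_endian>::ToHost(value) == 0xC0DE);
}
```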

Usage

All of the pieces are in place to have the compiler generate context-sensitive code based upon the machine architecture and the desired target byte-order. It may not be obvious, but all of the components now exist for us to create a conversion function that safely converts data between byte-order types. We can even prevent the mishap of inadvertently converting the same data twice.

The key is in the types that allow us to indicate whether a value is in host or network byte-order. To create a message whose byte-order is properly managed, encode its endian byte-order with a template.

C++

template < typename T>
struct DataMsg
  : T
{
  int   count;
  short index;
  unsigned char code;
  unsigned long value; 
}; 

typedef DataMsg< HostByteOrder > DataMsg_Host;
typedef DataMsg< NetByteOrder >  DataMsg_Net;

The corresponding conversion function for network-to-host conversion:

C++

// This function will only convert
// network types to host-order.
// The output is into a host-order type.
// The conversion is selected by the byte-order
// type that the input message derives from.
inline
void ConvertMsgToHost(
  const DataMsg_Net  &in,
        DataMsg_Host &out
)
{
  out.count = DataMsg_Net::ToHost(in.count);
  out.code  = DataMsg_Net::ToHost(in.code);
  out.index = DataMsg_Net::ToHost(in.index);
  out.value = DataMsg_Net::ToHost(in.value);
}
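Here is a compact sketch of how the tagged message types keep a conversion from being applied twice. To stay short, it replaces the byte-order classes with two empty tag types (HostOrderTag and NetOrderTag, both hypothetical), uses a two-field message, and assumes a little-endian host so that the swap is unconditional.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Hypothetical tags standing in for HostByteOrder and NetByteOrder.
struct HostOrderTag {};
struct NetOrderTag  {};

template <typename OrderTag>
struct SmallMsg : OrderTag
{
  std::uint16_t index;
  std::uint32_t value;
};

typedef SmallMsg<HostOrderTag> SmallMsg_Host;
typedef SmallMsg<NetOrderTag>  SmallMsg_Net;

inline std::uint16_t Swap16(std::uint16_t v)
{
  return static_cast<std::uint16_t>((v << 8) | (v >> 8));
}

inline std::uint32_t Swap32(std::uint32_t v)
{
  return  (v << 24) | ((v >> 8) & 0x0000FF00u)
        | ((v << 8) & 0x00FF0000u) | (v >> 24);
}

// Accepts only a network-order message and writes only a host-order one.
// Assumption: little-endian host, so every multi-byte field is swapped.
inline void ConvertMsgToHost(const SmallMsg_Net& in, SmallMsg_Host& out)
{
  out.index = Swap16(in.index);
  out.value = Swap32(in.value);
}

inline void TypedMsgExample()
{
  static_assert(!std::is_same<SmallMsg_Host, SmallMsg_Net>::value,
                "host and network messages are distinct types");

  SmallMsg_Net net;
  net.index = 0xDEC0;
  net.value = 0xEA1D0D60u;

  SmallMsg_Host host;
  ConvertMsgToHost(net, host);
  assert(host.index == 0xC0DE);
  assert(host.value == 0x600D1DEAu);

  // ConvertMsgToHost(host, host);  // would not compile: wrong input type
}
```

Since the converted data lands in a different type, a second call on the result is rejected by the compiler instead of silently undoing the first.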

Summary

The solution created in this entry is a fairly independent piece of code. The goal of this solution is to provide a generic structure for how byte-order conversions are performed. This will ensure they are performed consistently over time, and hopefully reduce the chances of misuse. The topic of IPC is already dancing near the edges of type-safe data. Anything that we can do to keep type information around as long as possible will help improve the quality and correctness of our software.

Before we can progress much further with the Network Alchemy library, I will need to provide some foundation in template meta-programming concepts. Therefore, the next few entries related to this library will be focused on concepts, and a design for the next phase of the library.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)