The Mystery of the Missing Bytes

Coral Kashri

0.00/5 (No votes)

22 Apr 2023CPOL6 min read

1.4K

In this article you will see how to save bytes and alignment when containing some types inside a structure.

Here you will also learn about techniques to avoid wasting space on unique_ptr deleter and on other structures allocators like std vector or map, taking into account Empty Base Optimization (EBO) and no_unique_address attribute sice C++20.

It’s been a long dark (mode) night. The last thing I can remember is the pointers I used. There was a tiny space that left for me to use, 8 bytes only. The exact size of a pointer. I wanted to protect it, and used std::unique_ptr. Everything seemed working, and then it hit me. std::unique_ptr has to store an additional space for its deleter. And that’s where our adventure begins.

Chapter 1: The Phenomenon

Before investigating a phenomenon, we have to dig deeper and understand the exact case which is in front of us.

So this phenomenon happens only when using an empty captured lambda or an empty functor by value as a deleter. In any other case, the std::unique_ptr size get changed. Whether we pass the functor by reference, or using std::function / function pointer as deleters.

So it actually seems like some kind of optimization, but which one? You have no such available optimization for a case when you store an instance as a class member, at least not before C++20.

The first thing that came to my head is “Do compilers use a hidden optimization technique that no one knows about?”

There is nothing more deceptive than an obvious fact.
Sherlock Holmes

Chapter 2: Known Techniques

Always look for a possible alternative and be prepared for it.
Sherlock Holmes

Anytime I notice an unknown or unexplained phenomenon, the first thing I do is to look for a similar behavior. And fortunately, there is one, pre C++20 (and one more that is added in 20) that might explain this phenomenon.

Empty Base Optimization

This optimization is applied when we have a derived class, whose base class doesn’t contain any non-static data members. Whenever we have such a case, the compiler can optimize the space usage, and avoid allocating a byte for the base class (although the rule forces any instance to be at least 1 byte size). Examples (taken from cppreference):

C++

struct Base {}; // empty class
 
struct Derived1 : Base {
    int i;
};
 
int main()
{
    // the size of any object of empty class type is at least 1
    static_assert(sizeof(Base) >= 1);
 
    // empty base optimization applies
    static_assert(sizeof(Derived1) == sizeof(int));
}

This optimization has some limitations and sometimes, its behavior is compiler specific. For example, this optimization won’t apply if the first non-static data member’s type is the same of the base class or derived from it:

C++

struct Base {}; // empty class
 
struct Derived1 : Base {
    int i;
};
 
struct Derived2 : Base {
    Base c;     // Base, occupies 1 byte, followed by padding for i
    int i;
};
 
struct Derived3 : Base {
    Derived1 c; // derived from Base, occupies sizeof(int) bytes
    int i;
};
 
int main()
{
    // empty base optimization does not apply,
    // base occupies 1 byte, Base member (c) occupies 1 byte
    // followed by 2 bytes of padding to satisfy int alignment requirements
    static_assert(sizeof(Derived2) == 2*sizeof(int));
 
    // empty base optimization does not apply,
    // base takes up at least 1 byte plus the padding
    // to satisfy alignment requirement of the first member (whose
    // alignment is the same as int)
    static_assert(sizeof(Derived3) == 3*sizeof(int));
}

For more information about this optimization, see cppreference – Empty Base Optimization.

[[no_unique_address]] // Since C++20

This attribute takes the empty base optimization some steps ahead, and allowing us to perform the same optimization (and a little bit more) on contained data members, without the need to inherit from them. Moreover, any tail padding of the marked data member, may be used for other data members. Examples from cppreference:

C++

struct Empty {}; // empty class
 
struct X {
    int i;
    Empty e;
};
 
struct Y {
    int i;
    [[no_unique_address]] Empty e;
};
 
struct Z {
    char c;
    [[no_unique_address]] Empty e1, e2;
};
 
struct W {
    char c[2];
    [[no_unique_address]] Empty e1, e2;
};
 
int main()
{
    // the size of any object of empty class type is at least 1
    static_assert(sizeof(Empty) >= 1);
 
    // at least one more byte is needed to give e a unique address
    static_assert(sizeof(X) > sizeof(int));
 
    // empty member optimized out
    std::cout << "sizeof(Y) == sizeof(int) is " << std::boolalpha 
              << (sizeof(Y) == sizeof(int)) << '\n';
 
    // e1 and e2 cannot share the same address because they have the
    // same type, even though they are marked with [[no_unique_address]]. 
    // However, either may share address with c.
    static_assert(sizeof(Z) >= 2);
 
    // e1 and e2 cannot have the same address, but one of them can share with
    // c[0] and the other with c[1]
    std::cout << "sizeof(W) == 2 is " << (sizeof(W) == 2) << '\n';
}

Note about the test (sizeof(W) == 2): Currently gcc, clang and msvc return false for this test. It’s unclear if it’s an incomplete feature, or a mistake in cppreference.

One more note (thanks to obsidian_golem): In MSVC compiler, this attribute doesn’t exist, instead we should use the attribute [[msvc::no_unique_address]].

Chapter 3: The Bridge between Possible and Impossible

This optimization is easy to implement using [[no_unique_address]]. But compilers can use this feature only when compiled with C++20 standard, and because this optimization is applied also in old standard versions, we didn’t finish our research yet.

Now, we left with the first option of empty base optimization. And it’s time to get inside the compilers to see exactly how it’s done.

LLVM Technique

LLVM uses something they called (and probably taken from boost) compressed_pair. Using a wrapper for each element in the pair, they can decide if they inherit the type or contain it inside. The wrapper is written in the following way:

C++

template <class _Tp, int _Idx,
          bool _CanBeEmptyBase = is_empty<_Tp>::value && !__libcpp_is_final<_Tp>::value>
struct __compressed_pair_elem {
public:
    const _Tp& get() { return __value_; }
private:
    _Tp __value_;
};
template <class _Tp, int _Idx>
struct __compressed_pair_elem<_Tp, _Idx, true> : private _Tp {
public:
    const _Tp& get() { return *this; }
};

Let’s split it to multiple parts:

C++

template <class _Tp, int _Idx,
          bool _CanBeEmptyBase = is_empty<_Tp>::value && !__libcpp_is_final<_Tp>::value>

Here, we can see the identifier for a class type. _Tp is the type we want to store (whether it’s the deleter or the managed object). _Idx is used as a class key (we’ll see where it comes to use in few moments. And one more interesting non-type template parameter _CanBeEmptyBase which has a significant hint in the name (something like Empty Base Optimization?). The default value it gets is:

C++

is_empty<_Tp>::value && !__libcpp_is_final<_Tp>::value>

It might be a little confusing, but don’t forget that we are inside std namespace. So we are actually using here std::is_empty and std::__libcpp_is_final functions.

std::is_empty has value true if it satisfies some conditions. The main condition that is important for our understanding is that the type (and its base classes hierarchy) doesn’t have non-static data members inside. For more conditions and explanations, you can refer to cppreference.

std::__libcpp_is_final is another hint we get here, because a final class cannot be inherited to another class.

So the combination to get a true value to _CanBeEmptyBase parameter is if the type is empty and not final. Which means if we can inherit it and use the empty base class optimization on it.

In the default case for this class, we will use the first implementation (which is holding the type as a private data member: _Tp __value_;. And the specialization for true value in _CanBeEmptyBase is derived from this type:

C++

template <class _Tp, int _Idx>
struct __compressed_pair_elem<_Tp, _Idx, true> : private _Tp

And now that we have compressed_pair_element, we can construct a compressed_pair out of it:

C++

template <class _T1, class _T2>
class __compressed_pair : private __compressed_pair_elem<_T1, 0>,
                          private __compressed_pair_elem<_T2, 1> {
    typedef _LIBCPP_NODEBUG_TYPE __compressed_pair_elem<_T1, 0> _Base1;
    typedef _LIBCPP_NODEBUG_TYPE __compressed_pair_elem<_T2, 1> _Base2;
    _LIBCPP_INLINE_VISIBILITY
  typename _Base1::reference first() _NOEXCEPT {
    return static_cast<_Base1&>(*this).__get();
    }
    _LIBCPP_INLINE_VISIBILITY
  typename _Base1::const_reference first() const _NOEXCEPT {
    return static_cast<_Base1 const&>(*this).__get();
    }
    _LIBCPP_INLINE_VISIBILITY
  typename _Base2::reference second() _NOEXCEPT {
    return static_cast<_Base2&>(*this).__get();
    }
    _LIBCPP_INLINE_VISIBILITY
  typename _Base2::const_reference second() const _NOEXCEPT {
    return static_cast<_Base2 const&>(*this).__get();
    }
};

I admit, it might seem a little intimidating at first look (and it's much more intimidating at the source where they declared also the constructors and a swap function). But let’s clear that implementation a little to see what it is really about.

C++

template <class _T1, class _T2>
class __compressed_pair : private __compressed_pair_elem<_T1, 0>,
                          private __compressed_pair_elem<_T2, 1> {
    typedef __compressed_pair_elem<_T1, 0> _Base1;
    typedef __compressed_pair_elem<_T2, 1> _Base2;
    inline typename _Base1::reference first() noexcept() {
    return static_cast<_Base1&>(*this).__get();
    }
    inline typename _Base1::const_reference first() const noexcept() {
    return static_cast<_Base1 const&>(*this).__get();
    }
    inline typename _Base2::reference second() noexcept() {
    return static_cast<_Base2&>(*this).__get();
    }
    inline typename _Base2::const_reference second() const noexcept() {
    return static_cast<_Base2 const&>(*this).__get();
    }
};

So now, we’ve got here a class that inherits from compressed_pair_elem twice. And here is the explanation for the _Idx: This is necessary to enable inherit _compressed_pair_elem that holds the same inner type multiple times. Because we can't inherit from the same class twice, we have to give the class an additional unique key, and LLVM chose an integer (which can also be combined with variadic template and __make_tuple_indices or different similar approaches). Note that in this specific implementation of compressed_pair, there is an additional static_assert that forbids the two types to be the same:

C++

// NOTE: This static assert should never fire because __compressed_pair
// is *almost never* used in a scenario where it's possible for T1 == T2.
// (The exception is std::function where it is possible that the function
//  object and the allocator have the same type).
static_assert((!is_same<_T1, _T2>::value),
  "__compressed_pair cannot be instantiated when T1 and T2 are the same type; "
  "The current implementation is NOT ABI-compatible with the previous "
  "implementation for this configuration");

Now, all we have to do is to use this magnificent creature inside the unique_ptr class, and to potentially save some bytes:

C++

template <class _Tp, class _Dp = default_delete<_Tp> >
class unique_ptr {
public:
    typedef _Tp element_type;
    typedef _Dp deleter_type;
    typedef typename __pointer_type<_Tp, deleter_type>::type pointer;
private:
    __compressed_pair<pointer, deleter_type> __ptr_;
    /* ... */
};

GCC Technique

In GCC, the technique is a little bit different, but based on the same type of optimization. Inside the unique_ptr class, they used the following data member declaration:

C++

tuple<_Tp, _Dp> _M_t;

A tuple in some compilers uses empty base class optimization, and it’s time to look over tuples in GCC.

C++

// Use the Empty Base-class Optimization for empty, non-final types.
template<typename _Tp>
    using __empty_not_final
    = __conditional_t<__is_final(_Tp), false_type,
		      __is_empty_non_tuple<_Tp>>;
template<size_t _Idx, typename _Head,
	   bool = __empty_not_final<_Head>::value>
    struct _Head_base;
#if __has_cpp_attribute(__no_unique_address__)
    template<size_t _Idx, typename _Head>
    struct _Head_base<_Idx, _Head, true>
    {
        static constexpr _Head&
      _M_head(_Head_base& __b) noexcept { return __b._M_head_impl; }
        static constexpr const _Head&
      _M_head(const _Head_base& __b) noexcept { return __b._M_head_impl; }
        [[__no_unique_address__]] _Head _M_head_impl;
    };
#else
    template<size_t _Idx, typename _Head>
    struct _Head_base<_Idx, _Head, true>
    : public _Head
    {
        static constexpr _Head&
      _M_head(_Head_base& __b) noexcept { return __b; }
        static constexpr const _Head&
      _M_head(const _Head_base& __b) noexcept { return __b; }
    };
#endif
template<size_t _Idx, typename _Head>
struct _Head_base<_Idx, _Head, false>
{
    static constexpr _Head&
      _M_head(_Head_base& __b) noexcept { return __b._M_head_impl; }
    static constexpr const _Head&
      _M_head(const _Head_base& __b) noexcept { return __b._M_head_impl; }
    _Head _M_head_impl;
};

And we got a twist – GCC checked if they are allowed to use no_unique_address attribute, and in case of true, they implemented a really similar class to the class specialization that doesn’t use the empty base optimization, only with this attribute.

Now we have the optimization in _Head_base (similarly to _compressed_pair_elem in LLVM implementation), and we can inherit it in _Tuple_impl:

C++

template<size_t _Idx, typename... _Elements>
struct _Tuple_impl;
/**
 * Recursive tuple implementation. Here we store the @c Head element
 * and derive from a @c Tuple_impl containing the remaining elements
 * (which contains the @c Tail).
 */
template<size_t _Idx, typename _Head, typename... _Tail>
struct _Tuple_impl<_Idx, _Head, _Tail...>
  : public _Tuple_impl<_Idx + 1, _Tail...>,
    private _Head_base<_Idx, _Head>
{ /* ... */ };

Here, we can see the unpacking of the tail, in a classic recursion way, and each unpacked _Head is being wrapped with _Head_base class, which is being derived to the _Tuple_impl.

Chapter 4: There’ll Always be More Rivers to Cross and More Bridges to Build

std::unique_ptr is just one of many std containers which accept allocator or deleter. And each one of them may use this ability in order to save some bytes. We won’t cover the way it’s done in other containers here, however the idea is the same.

Hope you enjoyed reading. If you have any thoughts about it, or feel that this knowledge assists you in your development, feel free to share it in the comments section.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)