Not-So-Common C Syntax/Extension to the Rescue (or Just for Fun)

14 Mar 2021, CPOL
In this post, you will find some C syntax and language extensions that I stumbled upon over time. Some are actively used in projects such as the Linux kernel; the others are less common but still have interesting aspects of their own.

Introduction

Some might say C is an "ancient" language that is not relevant to most modern-day projects, but you cannot deny that C still holds a firm position in many areas, such as OS kernels and embedded devices. This may be the major factor that keeps the language evolving to accommodate its users, namely in the form of language extensions and new syntax. If you're not deeply involved in any low-level project, most of these are things you've never heard of, or may never have a single chance to use. Still, knowing them may save you from being completely confused when you happen to inspect low-level C code. And anyhow, some of them are just fun to know about.

1. Ellipsis in Switch-case

C++
switch (*lptr) {
  case 1 ... 31:
    // ...
    break;
  case 128 ... 255:
    // ...
    break;
  case '\\':
    // ...
    break;
}

You may know that standard C requires the switch selector (*lptr in the example code) to be an integer expression, and each case label to be a single integer constant. Because of this, the standard way to use a switch statement is to write one case label for each relevant integer. You can reduce the redundancy by taking advantage of fallthrough, but you still need to type in as many case labels as there are relevant integers. What if a whole range of integers should be handled the same way? If I were in that situation, I'd rather use if statements and forget about the switch-case madness.

But it seems that many other C programmers had the same problem and thought that the if statement solution was not very aesthetically appealing. Seriously, if a snippet of code is written with a switch statement, we can easily sense that it depends on a single integer value. On the other hand, if it's written with if statements, anything goes; who knows whether one of your colleagues messed up the range expressions in the if conditions?

For this reason (at least I think so), GCC introduced a language extension that lets you specify a range of integers with a single case label, and clang accepts it as well. For example, if you want control to reach a specific block of code whenever the switch selector is between 1 and 31 (both inclusive), you can simply write a case label like case 1 ... 31:.

This especially comes in handy when you deal with character variables (char) or integer variables that you effectively use like an enum. The extension is used extensively in AFL, one of the most recognized security fuzzers, and I suppose other system applications use it as well.
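As a minimal sketch of the extension, here is a byte classifier built on case ranges. The function name classify_byte and the enum names are made up for illustration and are not taken from AFL or GCC. Note that GCC's documentation recommends writing spaces around the ..., since it may otherwise be misparsed next to integer literals.

```c
/* Sketch only: classify_byte and the enum names are hypothetical. */
enum byte_class { BYTE_NUL, BYTE_CONTROL, BYTE_PRINTABLE, BYTE_DEL, BYTE_HIGH };

enum byte_class classify_byte(unsigned char c)
{
    switch (c) {
    case 0:
        return BYTE_NUL;
    case 1 ... 31:          /* GNU extension: inclusive range 1..31 */
        return BYTE_CONTROL;
    case 32 ... 126:        /* printable ASCII */
        return BYTE_PRINTABLE;
    case 127:
        return BYTE_DEL;
    case 128 ... 255:
        return BYTE_HIGH;
    }
    return BYTE_HIGH;       /* not reached when char is 8 bits wide */
}
```

Calling classify_byte('A') yields BYTE_PRINTABLE, while classify_byte(31) yields BYTE_CONTROL because both ends of a range are inclusive.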

2. Ellipsis in Array Initialization

C++
#include <stdint.h>

/* Element 0 is set to 1; every element from 1 through 255 is set to 128. */
static const uint8_t simplify_lookup[256] = {
  [0]          = 1,
  [1 ... 255]  = 128
};

If we can use ellipses in a switch statement, why not in array initialization? You may not even know that you can selectively initialize some elements of an array by specifying the desired indices and their values (elements you don't mention are set to zero). These designated initializers have been part of the standard since C99. But they pose a problem similar to the one we encountered with switch statements: what if we need to initialize a range of indices with the same value?

So GCC provides yet another language extension, where you can cover a range of indices with an ellipsis. This can also be found in the AFL source code, as well as in the Linux kernel source code, which relies heavily on compile-time array initialization all over the project.
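The following sketch shows range designators building a small character-class table. The table name char_kind, the helper kind_of, and the enum values are hypothetical, and the character ranges assume an ASCII-compatible character set.

```c
#include <stdint.h>

/* Sketch only: the table, the helper, and the enum are hypothetical. */
enum { KIND_OTHER = 0, KIND_DIGIT = 1, KIND_ALPHA = 2 };

static const uint8_t char_kind[256] = {
    ['0' ... '9'] = KIND_DIGIT,   /* GNU extension: range designator */
    ['A' ... 'Z'] = KIND_ALPHA,
    ['_']         = KIND_ALPHA,   /* ordinary C99 designator */
    ['a' ... 'z'] = KIND_ALPHA,
    /* every index not listed is implicitly zero, i.e. KIND_OTHER */
};

uint8_t kind_of(unsigned char c)
{
    return char_kind[c];
}
```

The ranges are deliberately non-overlapping: a later designator may override an earlier one, but GCC can warn about that, so relying on the implicit zero default is the quieter choice.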

3. Ternary Operator without True Value

C++
machine->max_cpus = machine->max_cpus ?: 1;

If you look into the Linux kernel source code carefully enough, you may find an operator that looks like "?:" here and there. When I first stumbled upon it, I didn't even know how to search for this weird-looking operator on Google, since I hadn't realized that it is a variant of the ternary operator without a true value. Unlike the extensions above, whose meanings and purposes are fairly straightforward, this operator (strictly speaking, this variant of the ternary operator) requires some explanation of both.

First, the meaning. This (again) GCC-extended operator evaluates to its left-hand-side expression if that has a non-zero value (hence true), and to the right-hand-side expression otherwise. In other words, a ?: b behaves like a ? a : b, except that a is evaluated only once. In the example, the left-hand side is machine->max_cpus and the right-hand side is 1. If machine->max_cpus is non-zero, the result is machine->max_cpus itself. If it's zero, the result is 1, because the right-hand side dictates it.
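A small sketch of these semantics follows. All function names are hypothetical; the counter only exists to show that the left-hand side is evaluated a single time.

```c
/* Sketch only: every name here is hypothetical. */
static int calls;                 /* how many times next_id() has run */

static int next_id(void)
{
    calls++;
    return 7;
}

int cpus_or_default(int max_cpus)
{
    return max_cpus ?: 1;         /* max_cpus if non-zero, otherwise 1 */
}

int id_or_fallback(void)
{
    /* next_id() runs once here; next_id() ? next_id() : -1 would run it twice */
    return next_id() ?: -1;
}

int call_count(void)
{
    return calls;
}
```

The single evaluation matters whenever the left-hand side has side effects or is expensive to compute.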

Second, why the hell would you need this operator? The example code is supposedly from the Android source code, but you can find more use cases in the Linux kernel. There, it is a prevalent pattern that a piece of code returns some value only if nothing went wrong; if something did, it returns an error code instead of the desired value. You can certainly write this pattern with ordinary if statements, but you may not want to type in tons of if statements just to implement this stereotypical pattern. So most of the use cases in the Linux kernel look like this:

C++
(return_value) = (error_checker_function) ?: (desired_value);

Here, the error checker function returns the corresponding error code (which is non-zero by convention) if something is wrong, and zero if there is no error. In this regard, remember that zero usually means "no error" in the Linux world, while every error code has a non-zero value. What I'm not sure about is whether this usage in the Linux kernel made GCC introduce the extension, or GCC made the extension first and the Linux kernel adopted it later.
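The pattern can be sketched as follows. The functions check_buffer and checked_length are hypothetical and only mimic the kernel convention of returning a negative errno value on failure and zero on success.

```c
#include <errno.h>
#include <stddef.h>

/* Sketch only: check_buffer and checked_length are hypothetical. */
static int check_buffer(const char *buf, size_t len)
{
    if (buf == NULL)
        return -EINVAL;     /* non-zero: an error occurred */
    if (len == 0)
        return -ERANGE;
    return 0;               /* zero: no error */
}

int checked_length(const char *buf, size_t len)
{
    /* A non-zero error code wins; otherwise the desired value is returned. */
    return check_buffer(buf, len) ?: (int)len;
}
```

One design caveat: the caller has to be able to tell error codes from valid results, which is why this sketch keeps errors negative and valid lengths positive.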

Some people say that its purpose is somewhat similar to the "??" operator in C#, which evaluates to the left-hand-side reference if it is non-null and to the right-hand-side expression otherwise (usually an instance creation). You can indeed use the "?:" operator in C like the "??" operator in C#. But I don't think this usage is as common as in C#, because there are no constructors in C and you need to do some follow-up initialization after allocating memory. If you already use something like the factory pattern, it may help though.

4. Global Constructor & Destructor

C++
#include <stdio.h>

/* Priorities 0 to 100 are reserved for the implementation, so start at 101. */
__attribute__((constructor(101))) void foo(void)
{
	printf("foo is running before main\n");
}

int main(void)
{
	printf("main is running\n");
	return 0;
}

In C++, classes can have constructors and destructors that initialize and clean up the data associated with them at the beginning and the end of each instance's lifetime. This comes in handy, because you frequently need to prepare something before working with a new instance, and sometimes something has to be done before a class instance dies. This is largely because the lifetime of a class instance is usually shorter than the lifetime of the entire program.

But in C, do we need counterparts of the class constructor and destructor? You may think that structs could have something like constructors, but C is not designed to be object-oriented; being merely chunks of data, structs don't initialize or destroy themselves. You initialize and destroy them. Seeing this, it seems like there is no room in C for constructors and destructors to squeeze into.

But what if I say you sometimes need something to be done before and after main? In most situations you don't, but consider the case where your own logic has to analyze an open-source project at runtime by running along with it. Since your analysis logic is a new piece of source code, it may require some initialization and cleanup of its own, but the original main function certainly wouldn't call them. So what should we do in this situation?

You can (of course) tap into the beginning and the end of the main function, but there is a more elegant solution. In GCC and compatible compilers (like clang), you can mark functions to be called before or after main. These are called global constructors and destructors, and you can put whatever code you need in them without touching the existing logic at all. You can even define several constructors and destructors and give each a priority: constructors with smaller priority numbers run first, and destructors run in the opposite order.
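Here is a rough sketch of prioritized constructors plus a destructor. All names, including the order log, are hypothetical; the log is only there so the startup order can be inspected after main has started.

```c
#include <stdio.h>

/* Sketch only: every name here is hypothetical. */
static int order_log[4];    /* records which constructor ran, in order */
static int order_len;

__attribute__((constructor(101))) static void init_first(void)
{
    order_log[order_len++] = 101;
}

__attribute__((constructor(102))) static void init_second(void)
{
    order_log[order_len++] = 102;
}

__attribute__((destructor)) static void cleanup(void)
{
    /* Runs after main returns or exit() is called. */
    fputs("cleanup is running after main\n", stderr);
}

/* Returns the priority of the i-th constructor that ran, or -1 if none. */
int startup_step(int i)
{
    return i < order_len ? order_log[i] : -1;
}
```

By the time main starts, startup_step(0) reports 101 and startup_step(1) reports 102, since the smaller priority number runs first.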

5. Zero-length Array Member

C++
struct thread_info {
    struct task_struct  *task;
    struct exec_domain  *exec_domain;
    unsigned long       flags;
    __u32           status;
    __u32           cpu;
    int         preempt_count;
    mm_segment_t        addr_limit;
    struct restart_block    restart_block;
    void __user     *sysenter_return;
    unsigned long           previous_esp;
    __u8            supervisor_stack[0];
};

Some structs in the Linux kernel have an array member at the end whose length is specified as zero. It denotes an array member whose length is unknown at compile time; you decide its length when you allocate the struct at runtime.

C++
struct thread_info *ti = malloc(sizeof(struct thread_info) + array_length);

Notice that you can achieve the same effect with a regular array member of non-zero length by allocating the struct with a larger (or even smaller) size than sizeof reports. The difference is that a zero-length member tells other programmers at a glance that the array is meant to have a flexible length rather than a constant one, and since it occupies no space of its own, the allocation size arithmetic stays simple. Standard C, unlike GCC, has always forbidden 0 as an array length; only since C99 can programmers declare a flexible array member in standard C, by specifying no length at all (namely []).
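The allocation idiom can be sketched like this. struct message and message_new are hypothetical; the sketch uses the C99 spelling data[], and GCC would accept data[0] in the same position.

```c
#include <stdlib.h>
#include <string.h>

/* Sketch only: struct message and message_new are hypothetical. */
struct message {
    size_t len;
    char   data[];      /* C99 flexible array member; GCC also accepts data[0] */
};

struct message *message_new(const char *text)
{
    size_t len = strlen(text);
    /* sizeof(*m) does not cover data, so the payload size is added on top. */
    struct message *m = malloc(sizeof(*m) + len + 1);

    if (m == NULL)
        return NULL;
    m->len = len;
    memcpy(m->data, text, len + 1);     /* copy the terminating NUL too */
    return m;
}
```

A single malloc covers both the header and the payload, so one free(m) releases everything.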

6. Nested Functions

C++
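#include <stdio.h>

int main(void) {
  int x = 30;
  /* GCC-only: foo is nested in main and can read main's local x. */
  void foo(void) { printf("You can do this in GCC. %d\n", x); }
  foo();
  return 0;
}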

If you're familiar with functional languages like OCaml, you may also be familiar with the concept of nested functions. Just as it sounds, a nested function is a function defined inside another function. It effectively serves as a local function that can only be referenced from the outer function, and by being local, it also has access to the local variables of the outer function. So if you need to write algorithmic code that is subdivided into several pieces of logic sharing the same set of local state, nested functions may make your code much more straightforward to write (and read).

But unfortunately, C is not a functional language, and (supposedly) it was built on the philosophy that "all functions are mutually equivalent," meaning that no function contains another function. You can still write such an algorithm without nested functions, but well. Can't we just have them in C? It doesn't look so complicated to implement, so why not?

I think GCC has made an extension out of everything programmers might once have imagined. Again, GCC introduced a language extension for nested functions in C. It looks surprisingly similar to what every programmer would picture if C had nested functions, and so is the way you use it. A nested function is almost exactly the same as a regular function, except that it's defined inside another function and can reference the local variables of the outer function.

Great! So can we enjoy this feature and write algorithmic programs a little more conveniently? You guessed it. Since it's not a standard feature, you have to consider some (serious) limitations before you use it in your daily projects. First, you cannot compile your project with any compiler other than GCC if it contains even a single nested function. I checked whether clang-7.0 can compile nested functions and guess what, it doesn't. Considering that an increasing number of projects now use clang as their base compiler, this is a non-trivial problem if there is even a small chance that your code ends up being built as part of such a project.

Second, if the address of at least one nested function is taken somewhere, your program now has an executable stack. I'm not going to cover the details here, but in short, GCC typically implements such a function pointer with a small piece of code (a "trampoline") that it writes onto the stack at runtime, so the stack has to be executable. An executable stack opens the window wide for security attacks; the attacker can simply put her own code in stack memory and jump to it. The attack is made that simple!

7. Blocks

C++
/* blocks-test.c */
#include <stdio.h>
#include <Block.h>
/* Type of block taking nothing returning an int */
typedef int (^IntBlock)();

IntBlock MakeCounter(int start, int increment) {
    __block int i = start;
    
    return Block_copy( ^(void) {
        int ret = i;
        i += increment;
        return ret;
    });
    
}

int main(void) {
    IntBlock mycounter = MakeCounter(5, 2);
    printf("First call: %d\n", mycounter());
    printf("Second call: %d\n", mycounter());
    printf("Third call: %d\n", mycounter());
    
    /* because it was copied, it must also be released */
    Block_release(mycounter);
    
    return 0;
}

So far, all of the extensions come from GCC. Does that mean clang is faithfully compliant with standard C? I don't know how compliant it actually is, but clang has at least one non-standard extension of its own. It's called blocks, and it's sort of like a nested function because its function-like definition is located inside another function, but it's a lot closer to the lambda functions that you can find in more recent languages (like modern C++). Plus, it does not require an executable stack even if you pass it around. In fact, you cannot take its address the way you would for a regular function, because blocks are different entities from regular functions.

Technically speaking, a block reference is not a function pointer. It points to a block object, a compiler-generated structure that bundles a pointer to the instructions the block executes with the variables the block captured from its enclosing scope. Since a regular function pointer holds nothing but the address of the instructions, the two kinds of pointers are not compatible with each other.

This makes the portability problem worse than the one nested functions in GCC pose. The incompatibility of GCC's nested functions arises largely at the language parsing level (i.e., standard C states that a function definition must not appear inside another function), whereas the incompatibility of blocks stems from the semantics of the blocks themselves; most other compilers have no concept of block objects at all, nor the runtime support (Block_copy, Block_release) that manages them.
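The idea that a block bundles code with captured state can be mimicked by hand in portable C. The sketch below re-creates the counter from the example above; it is a hand-rolled analogue with hypothetical names, not the layout that clang actually generates for blocks.

```c
#include <stdlib.h>

/* Sketch only: a hand-written analogue, NOT clang's real block layout. */
struct counter {
    int (*invoke)(struct counter *self);    /* the code the "block" runs */
    int i;                                  /* captured, mutable state */
    int increment;                          /* captured by value */
};

static int counter_invoke(struct counter *self)
{
    int ret = self->i;

    self->i += self->increment;
    return ret;
}

struct counter *make_counter(int start, int increment)
{
    struct counter *c = malloc(sizeof(*c));

    if (c == NULL)
        return NULL;
    c->invoke = counter_invoke;
    c->i = start;
    c->increment = increment;
    return c;
}
```

A caller writes c->invoke(c) where the blocks version writes mycounter(), and free(c) where it writes Block_release(mycounter); the compiler extension essentially generates this plumbing for you.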

According to its wiki page, this feature was designed by Apple for its platforms and their main language, Objective-C. Apple has been a major backer of the LLVM project (including clang) for years, and blocks seem to be one of the cases where Apple took advantage of having a compiler tailored to its platforms. Personally, being not at all familiar with any iOS application code, I've never seen this feature being used in the wild.

8. Statement Expression

You may have already heard about the comma operator. It's rarely used in practice (at least in my case), but it does exist. The comma operator chains multiple sub-expressions into a sequence and evaluates to the result of the last one. For example, if you initialize a variable like this, x is initialized to 30 regardless of the other operands. (The parentheses are required; without them, the commas would be read as separators between declarators.)

C++
int x = (10, foo(), 30);

The logic behind the scenes is that the program first evaluates the operand 10, which has no side effect whatsoever, and discards the result. Then it moves on to the second operand foo(), where it invokes the function and again discards the result. The only thing that matters for the final result is the very last operand, 30, because the comma operator evaluates to the result of its last operand. This is no different from simply writing int x = 30; if the function foo() is free of side effects, but it's very much different if foo() has a side effect of its own, such as updating global variables.
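A short sketch makes the side-effect point concrete. The names bump, comma_demo, and side_effect_count are hypothetical, and the side-effect-free operand 10 is left out only because compilers tend to warn about it.

```c
/* Sketch only: every name here is hypothetical. */
static int side_effects;        /* global state that bump() updates */

static int bump(void)
{
    side_effects++;
    return 99;                  /* discarded by the comma operator below */
}

int comma_demo(void)
{
    int x = (bump(), 30);       /* bump() runs, then x takes the last operand */

    return x;
}

int side_effect_count(void)
{
    return side_effects;
}
```

After one call to comma_demo(), the result is 30 and side_effect_count() reports 1, so the discarded operand still ran.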

You see, you can pack whatever side effects you need inside an expression using this operator, but you need to pay attention to the fact that it only accepts expressions as its operands. This alone excludes tons of things that you might have wanted to achieve with it. For example, suppose you want to embed an auxiliary declaration inside an if condition, so that a declaration you only need for that specific condition doesn't clutter the rest of the code.

C++
if (int *p = get_pointer(), p ? *p : 0)

Can you do this with a comma operator?

The answer is no, because declarations are not expressions. As a little grammatical background, an expression is a grammatical element that can be embedded in a statement, whereas a declaration is a separate construct of its own that cannot appear where an expression is expected. This means you can't pack a declaration inside an if condition.

So the comma operator is of little value, which is why it's not frequently used in practice, right? Well, the comma operator maybe, but the statement expression that GCC offers maybe not. It upsets the grammatical hierarchy a little, but we can ignore that as long as it works as intended. What it essentially does is let you pack a whole chunk of C code, written as a brace-enclosed block inside parentheses, into an expression. For example, the code below is legal at least in GCC. Note the doubled parentheses: the outer pair belongs to the if statement and the inner pair to the statement expression.

if (({ int *p = get_pointer(); p ? *p : 0; }))

Like the comma operator, which takes the result of its last operand, the statement expression takes the result of its last statement as its value. In this example, since p ? *p : 0; is the last statement, the entire if condition boils down to the result of this ternary operator. Also, auxiliary declarations like p in this example are only visible inside the statement expression.
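Below is a compilable sketch of a statement expression used as an if condition. get_pointer and the variables behind it are hypothetical stand-ins. The statement expression must be wrapped in its own parentheses, in addition to the parentheses that belong to the if statement.

```c
#include <stddef.h>

/* Sketch only: every name here is hypothetical. */
static int stored_value = 42;
static int have_value = 1;

static int *get_pointer(void)
{
    return have_value ? &stored_value : NULL;
}

void set_have_value(int flag)
{
    have_value = flag;
}

int value_is_nonzero(void)
{
    /* p exists only inside the statement expression */
    if (({ int *p = get_pointer(); p ? *p : 0; }))
        return 1;
    return 0;
}
```

With a value available, value_is_nonzero() returns 1; after set_have_value(0), get_pointer() yields NULL and the function returns 0 without dereferencing anything.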

I personally have seen this expression twice in real life. The first time was in an automated instrumentation script that instruments existing if conditions at the source level [LAVA paper]. Since you can't always guarantee that the line above an if condition is executed just before the condition (you can certainly write multiple if statements on one line!), it is safer to embed the instrumentation code inside the if condition itself. That was when I first learned about statement expressions. The second time was somewhere I cannot remember right now, but the code used statement expressions to define function-like macros. This is actually the usage exemplified in the official GCC documentation. If your macro has to return some value (unlike ordinary macros that exist purely for their side effects) and the comma operator is not suitable for your needs, the statement expression may be an option.
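The macro use case can be sketched as follows, in the spirit of the max example in the GCC documentation. MAX_ONCE and the helper functions are hypothetical names, and __typeof__ is another GNU extension that clang also accepts.

```c
/* Sketch only: MAX_ONCE and the helpers are hypothetical. */
#define MAX_ONCE(a, b) ({           \
    __typeof__(a) _a = (a);         \
    __typeof__(b) _b = (b);         \
    _a > _b ? _a : _b;              \
})

static int evaluations;             /* how many times next_value() has run */

static int next_value(void)
{
    evaluations++;
    return 5;
}

int max_with_side_effect(void)
{
    /* next_value() is evaluated once, unlike with ((a) > (b) ? (a) : (b)) */
    return MAX_ONCE(next_value(), 3);
}

int evaluation_count(void)
{
    return evaluations;
}
```

Because each argument is copied into a local variable first, an argument with side effects runs exactly once, which a plain ternary macro cannot guarantee.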

As an additional note, C++17 has a standard syntax reminiscent of statement expressions, the if statement with an initializer, whose main purpose is pretty much the same as in the example I introduced above. I'm just roughly guessing that both were invented to meet the same underlying need.

Final Notes

These are the language extensions and relatively new syntax that I know of as of now. I left out some extensions that I personally don't find very interesting. If you know other extensions or exotic syntax, you may edit this article and add more sections for them too. :)

History

  • 3rd March, 2021: Initial version
  • 4th March, 2021: Added "nested functions" and "blocks" sections.
  • 14th March, 2021: Added a "statement expression" section.

License

This article, along with any associated source code and files, is licensed under The Code Project Open License (CPOL)