Martin Uecker, 2026-01-06
Nested functions are extremely useful, which is why basically any computer language since ALGOL60 has them. Except C.
In particular, nested functions are useful to write kernels for passing them to higher-order functions. The kernels can access local state, and this is important because it then does not have to be passed as a void pointer to the higher-order function.
void tree_increase_all(tree(int) *t, int increment)
{
void update(int *value) { (*value) += increment; }
tree_walk(t, update);
}
For very simply nested functions it makes also sense to use a lambda expression, i.e. an anonymous nested function.
void tree_increase_all(tree(int) *t, int increment)
{
tree_walk(t, (void(int *value)){
(*value) += increment);
});
}
The nested function gets passed in a special register ‐ the static chain register ‐ a link to its environment. It can then access the variables of the parent function(s) via this pointer. But how does the calling function know what to pass in the static chain register? The obvious solution is to have this information contained in the function pointer, which would be allowed by the C standard. But on most platform the standard application binary interface (ABI) does not allow this. There are different solutions, as we will see below.
GCC supports nested function as an extension (but not lambdas). C++ has
lambdas with a different syntax, but those are actually objects where
each instance has its own unique type. C++'s lambdas are therefor most
useful in when used as argument to template functions. To be able to pass
them to a non-template function, its unique type needs to be erased
by wrapping it in std::function. Why C++'s lambdas are
an ingenious design, but have various properties which make them a
terrible fit for C. But this is a story for another day.
When creating a pointer to a nested function to pass it to another function,
a program compiled with GCC creates at runtime a small piece of code - a so-called
trampoline based on an idea from the eighties (Breuel et al.). These trampolines
were traditionally placed on the stack, which then has to be executable. For
security reason, it is commonly agreed on that an executable stack should be avoided.
An alternative solution is to create the trampoline somewhere else, e.g. on memory
allocated on the heap. In fact, GCC now can support this using the flag
-ftrampoline-impl=heap. Unfortunately, this is computationally more
expensive and it can also lead to memory leaks when the cleanup path is skipped
over in via longjmp. It therefor makes sense to look for alternative
solutions.
Instead of a trampoline, one can use a special function pointer type, which is
essentially what C++ does with std::function and also what
Apple's Block language extension does using its special syntax.
There are essentially two ways how this can work. Either, the new function type
is a wide pointer that consists of two parts, a pointer to the machine code of
the function just as in a regular function pointer and another pointer that
points to the environment of the function, i.e. the static chain pointer. When
the function is called, the value for the static chain register is extracted from
the wide pointer and loaded into the static chain register (r10 on x86_64)
before the code pointer is loaded and the function is called.
A alternative is to build a function descriptor that contains code pointer and environment pointer, and then construct a regular pointer to this descriptor. A major advantage is that the pointer has the same size as any regular pointer. A disadvantage is that calling the function requires a second indirection to first load code and data pointers from the descriptor. Another disadvantage is that a conversion of a regular pointer then requires the construction of a descriptor, which needs to be stored somewhere, while it could be converted directly to a wide pointer simply by setting the static chain to NULL.
Defining a wide pointer type for a function R(A) is not difficult.
#define wide(R, A) \
struct wideptr_##R##_##A { \
void *chain; \
R (*code)(A); \
}
Given such a wide pointer, we can then call this function using a
built-in function __builtin_call_with_static_chain that GCC
already provides to support other languages that make use of this ABI.
#define CALL(ptr, arg) \
__builtin_call_with_static_chain(ptr.code(args), ptr.chain)
But where do we get the two pointer when constructing the wideptr? Surprisingly, there is no built-in function for this. In the past, I used a grotesque hack to read the two pointer from the trampoline that GCC has created. But this still required the creation of the trampoline even though it is then never used. The technique is also not portable, as the trampoline differs for each CPU architecture, or may change between GCC versions. Instead, I have developed a preliminary patch which adds two new built-ins to GCC that makes it possible to directly obtain the code pointer of the function (instead of the pointer to the trampoline) and the static chain pointer directly. I can then build by wide pointer as in the following example.
#define CLOSURE(R, A, x) \
(wide(R, A)){ __builtin_static_chain(x), __builtin_nested_code(x) }
int baz(wide(int, int) p, int x)
{
return CALL(p, (x));
}
int foo(int k)
{
int bar(int x) { return k + x; }
return baz(CLOSURE(int, int, bar), 2 * k);
}
This is then essentially transformed by the compiler to the following, except that
the chain is passed in the special static chain register.
Example: Godbolt
static int baz(wide(int, int) p, int x)
{
return p.code(p.chain, x);
}
struct frame_foo { int k; };
static int bar(void *_chain, int x)
{
struct frame_foo *chain = _chain;
return chain->k + x;
}
int foo(int _k)
{
struct frame_foo frame = { _k; };
return baz((wide(int, int){ &frame, bar }, 2 * k.frame);
}
In this case (and also the version using my built-in functions), the compiler will happily optimize this to a single assembly instruction that implements a multiplication by three.
foo:
lea eax, [rdi+rdi*2]
ret
To test these ideas in a situation where it is not optimized away, I ran Knuth's Man-Or-Boy test (which was meant to test correctness of ALGOL compilers).
typedef int (*fun)();
int manorboy(int k, fun x1, fun x2, fun x3, fun x4, fun x5)
{
int B() { return manorboy(--k, B, x1, x2, x3, x4); }
return (k <= 0) ? x4() + x5() : B();
}
bool test()
{
int pone () { return +1; }
int mone () { return -1; }
int zero () { return 0; }
return -67 != manorboy(10, pone, mone, mone, pone, zero);
}
You can find the results
here together
with my patch to GCC.
I will discuss these another time in detail looking at the generated code, but so far
it looks that wide pointers have excellent performance, matching the direct implementation
which is also very efficient.
The descriptor version is the fastest, essentially because the
nested function in this toy example merely shuffles pointers around, and if those have
double size this is more costly. This is likely also one of the reasons why the version
using stack-based trampolines is actually fairly good on some CPUs, even though it needs
to create the trampolines. I still have bad experience using C++'s lambdas in terms of
performance, usability, and correctness (probably my own fault because I am not understanding
the language anymore), but it seems one can get competitive performance similar to
a C wide pointer version with std::function_ref and deducing this
in C++ 26.
All this could be made more convenient with a bit more language support. My vision is to simply add a function qualifer to indicate a wide pointer. That's it. The example would then look almost exactly like the nested function code for GCC, but not need any trampoline.
typedef int (*fun)() wide;
int manorboy(int k, fun x1, fun x2, fun x3, fun x4, fun x5)
{
int B() { return manorboy(--k, B, x1, x2, x3, x4); }
return (k <= 0) ? x4() + x5() : B();
}
bool test()
{
int pone () { return +1; }
int mone () { return -1; }
int zero () { return 0; }
return -67 != manorboy(10, pone, mone, mone, pone, zero);
}
Why use a qualifier on the function type? The reason is that implicit conversions from a regular (unqualified) function pointer to a wide pointer would then work correctly according to the usual conversion rules. Internally, the compiler could simple extend the regular pointer by a static chain pointer which is NULL. The opposite conversion is dangerous because the static chain pointer would get thrown away, and the usual qualifier rules would forbid this as an implicit conversion. So this would fit in perfectly without the need for any special rules.