Using GCC's Nested Functions with Wide Pointers and no Trampolines

Martin Uecker, 2026-01-06

Introduction

Nested functions are extremely useful, which is why basically any computer language since ALGOL60 has them. Except C.

In particular, nested functions are useful to write kernels for passing them to higher-order functions. The kernels can access local state, and this is important because it then does not have to be passed as a void pointer to the higher-order function.


	void tree_increase_all(tree(int) *t, int increment)
	{
		void update(int *value) { (*value) += increment; }

		tree_walk(t, update);
	}
	

For very simply nested functions it makes also sense to use a lambda expression, i.e. an anonymous nested function.


	void tree_increase_all(tree(int) *t, int increment)
	{
		tree_walk(t, (void(int *value)){

			(*value) += increment);
		});
	}
	

How does this work?

The nested function gets passed in a special register ‐ the static chain register ‐ a link to its environment. It can then access the variables of the parent function(s) via this pointer. But how does the calling function know what to pass in the static chain register? The obvious solution is to have this information contained in the function pointer, which would be allowed by the C standard. But on most platform the standard application binary interface (ABI) does not allow this. There are different solutions, as we will see below.

GCC supports nested function as an extension (but not lambdas). C++ has lambdas with a different syntax, but those are actually objects where each instance has its own unique type. C++'s lambdas are therefor most useful in when used as argument to template functions. To be able to pass them to a non-template function, its unique type needs to be erased by wrapping it in std::function. Why C++'s lambdas are an ingenious design, but have various properties which make them a terrible fit for C. But this is a story for another day.

The Problem with GCC's Nested Functions

When creating a pointer to a nested function to pass it to another function, a program compiled with GCC creates at runtime a small piece of code - a so-called trampoline based on an idea from the eighties (Breuel et al.). These trampolines were traditionally placed on the stack, which then has to be executable. For security reason, it is commonly agreed on that an executable stack should be avoided. An alternative solution is to create the trampoline somewhere else, e.g. on memory allocated on the heap. In fact, GCC now can support this using the flag -ftrampoline-impl=heap. Unfortunately, this is computationally more expensive and it can also lead to memory leaks when the cleanup path is skipped over in via longjmp. It therefor makes sense to look for alternative solutions.

Wide Pointers and Descriptors

Instead of a trampoline, one can use a special function pointer type, which is essentially what C++ does with std::function and also what Apple's Block language extension does using its special syntax. There are essentially two ways how this can work. Either, the new function type is a wide pointer that consists of two parts, a pointer to the machine code of the function just as in a regular function pointer and another pointer that points to the environment of the function, i.e. the static chain pointer. When the function is called, the value for the static chain register is extracted from the wide pointer and loaded into the static chain register (r10 on x86_64) before the code pointer is loaded and the function is called.

A alternative is to build a function descriptor that contains code pointer and environment pointer, and then construct a regular pointer to this descriptor. A major advantage is that the pointer has the same size as any regular pointer. A disadvantage is that calling the function requires a second indirection to first load code and data pointers from the descriptor. Another disadvantage is that a conversion of a regular pointer then requires the construction of a descriptor, which needs to be stored somewhere, while it could be converted directly to a wide pointer simply by setting the static chain to NULL.

Preliminary Implementation in GCC

Defining a wide pointer type for a function R(A) is not difficult.


	#define wide(R, A)		\
	struct wideptr_##R##_##A {	\
		void *chain;		\
		R (*code)(A);		\
	}
	

Given such a wide pointer, we can then call this function using a built-in function __builtin_call_with_static_chain that GCC already provides to support other languages that make use of this ABI.


	#define CALL(ptr, arg) \
		__builtin_call_with_static_chain(ptr.code(args), ptr.chain)
	

But where do we get the two pointer when constructing the wideptr? Surprisingly, there is no built-in function for this. In the past, I used a grotesque hack to read the two pointer from the trampoline that GCC has created. But this still required the creation of the trampoline even though it is then never used. The technique is also not portable, as the trampoline differs for each CPU architecture, or may change between GCC versions. Instead, I have developed a preliminary patch which adds two new built-ins to GCC that makes it possible to directly obtain the code pointer of the function (instead of the pointer to the trampoline) and the static chain pointer directly. I can then build by wide pointer as in the following example.


	#define CLOSURE(R, A, x) \
		(wide(R, A)){ __builtin_static_chain(x), __builtin_nested_code(x) }
	


	int baz(wide(int, int) p, int x)
	{
		return CALL(p, (x));
	}

	int foo(int k)
	{
		int bar(int x) { return k + x; }
    		return baz(CLOSURE(int, int, bar), 2 * k);
	}
	

This is then essentially transformed by the compiler to the following, except that the chain is passed in the special static chain register.

Example: Godbolt


	static int baz(wide(int, int) p, int x)
	{
		return p.code(p.chain, x);
	}

	struct frame_foo { int k; };

	static int bar(void *_chain, int x) 
	{
		struct frame_foo *chain = _chain;
		return chain->k + x;
	}

	int foo(int _k)
	{
		struct frame_foo frame = { _k; };
	
		return baz((wide(int, int){ &frame, bar }, 2 * k.frame);
	}
	

In this case (and also the version using my built-in functions), the compiler will happily optimize this to a single assembly instruction that implements a multiplication by three.


	foo:
        	lea     eax, [rdi+rdi*2]
        	ret
	

Experiments

To test these ideas in a situation where it is not optimized away, I ran Knuth's Man-Or-Boy test (which was meant to test correctness of ALGOL compilers).


	typedef int (*fun)();

	int manorboy(int k, fun x1, fun x2, fun x3, fun x4, fun x5)
	{
		int B() { return manorboy(--k, B, x1, x2, x3, x4); }

		return (k <= 0) ? x4() + x5() : B();
	}

	bool test()
	{
		int pone () { return +1; }
		int mone () { return -1; }
		int zero () { return 0; }

		return -67 != manorboy(10, pone, mone, mone, pone, zero);
	}	
	

You can find the results here together with my patch to GCC. I will discuss these another time in detail looking at the generated code, but so far it looks that wide pointers have excellent performance, matching the direct implementation which is also very efficient. The descriptor version is the fastest, essentially because the nested function in this toy example merely shuffles pointers around, and if those have double size this is more costly. This is likely also one of the reasons why the version using stack-based trampolines is actually fairly good on some CPUs, even though it needs to create the trampolines. I still have bad experience using C++'s lambdas in terms of performance, usability, and correctness (probably my own fault because I am not understanding the language anymore), but it seems one can get competitive performance similar to a C wide pointer version with std::function_ref and deducing this in C++ 26.

A Wide Function Qualifier

All this could be made more convenient with a bit more language support. My vision is to simply add a function qualifer to indicate a wide pointer. That's it. The example would then look almost exactly like the nested function code for GCC, but not need any trampoline.


	typedef int (*fun)() wide;

	int manorboy(int k, fun x1, fun x2, fun x3, fun x4, fun x5)
	{
		int B() { return manorboy(--k, B, x1, x2, x3, x4); }

		return (k <= 0) ? x4() + x5() : B();
	}

	bool test()
	{
		int pone () { return +1; }
		int mone () { return -1; }
		int zero () { return 0; }

		return -67 != manorboy(10, pone, mone, mone, pone, zero);
	}
	

Why use a qualifier on the function type? The reason is that implicit conversions from a regular (unqualified) function pointer to a wide pointer would then work correctly according to the usual conversion rules. Internally, the compiler could simple extend the regular pointer by a static chain pointer which is NULL. The opposite conversion is dangerous because the static chain pointer would get thrown away, and the usual qualifier rules would forbid this as an implicit conversion. So this would fit in perfectly without the need for any special rules.

Literature