The performance observation is real but the two approaches are not equivalent, and the article doesn't mention what you're actually trading away, which is the part that matters.
The C++11 threadsafety guarantee on static initialization is explicitly scoped to block local statics. That's not an implementation detail, that's the guarantee.
The __cxa_guard_acquire/release machinery in the assembly is the standard fulfilling that contract. Move to a private static data member and you're outside that guarantee entirely. You've quietly handed that responsibility back to yourself.
Then there's the static initialization order fiasco, which is the whole reason the meyers singleton with a local static became canonical. Block local static initializes on first use, lazily, deterministically, thread safely. A static data member initializes at startup in an order that is undefined across translation units. If anything touches Instance() during its own static initialization from a different TU, you're in UB territory. The article doesn't mention this.
Real world singleton designs also need: deferred/configuration-driven initialization, optional instantiation, state recycling, controlled teardown. A block local static keeps those doors open. A static data member initializes unconditionally at startup, you've lost lazy-init, you've lost the option to not initialize it, and configuration based instantiation becomes awkward by design.
Honestly, if you're bottlenecking on singleton access, that's design smell worth addressing, not the guard variable.
Excellent points about the initialization order fiasco. I've been bitten by this in embedded systems where startup timing is critical.
One thing I'd add: the guard overhead can actually matter in high-frequency scenarios. I once profiled a logging singleton that was called millions of times per second in a real-time system - the atomic check was showing up as ~3% CPU. But your point stands: if you're hitting that bottleneck, you probably want to reconsider the architecture entirely.
The lazy initialization guarantee is usually worth more than the performance gain, especially since most singletons aren't accessed in tight loops. The static member approach feels like premature optimization unless you've actually measured the guard overhead in your specific use case.
It's a bit contrived, but a global with a nontrivial constructor can spawn a thread that uses another global, and without synchronization the thread can see an uninitialized or partially initialized value.
I haven't written C++ in a long time, but isn't the issue here that the initialization order of globals in different translation units is unspecified? Lazy initialization avoids that problem at very modest cost.
I liked using singletons back in the day, but now I simply make a struct with static members which serves the same purpose with less verbose code. Initialization order doesn't matter if you add one explicit (and also static) init function, or a lazy initialization check.
Honestly the guard overhead is a non-issue in practice — it's one atomic check after first init. The real problem with the static data member approach is
initialization order across translation units. If singleton A touches singleton B during startup you get fun segfaults that only show up in release builds with a
different link order.
I ended up using std::call_once for those cases. More boilerplate but at least you're not debugging init order at 2am.
Came here to say the same thing. Static is OK as long as the object has no dependencies but as soon as it does you're asking for trouble. Second the call_once approach. Another approach is an explicit initialization order system that ensures dependencies are set up in the right order, but that's more complex and only works for binaries you control.
Nice breakdown. I’m curious how often the guard check for a function-local static actually shows up in real profiles. In most codebases Instance() isn’t called in tight loops, so the safety of lazy initialization might matter more than a few extra instructions. Has anyone run into this being a real bottleneck in practice?
The C++11 threadsafety guarantee on static initialization is explicitly scoped to block local statics. That's not an implementation detail, that's the guarantee.
The __cxa_guard_acquire/release machinery in the assembly is the standard fulfilling that contract. Move to a private static data member and you're outside that guarantee entirely. You've quietly handed that responsibility back to yourself.
Then there's the static initialization order fiasco, which is the whole reason the meyers singleton with a local static became canonical. Block local static initializes on first use, lazily, deterministically, thread safely. A static data member initializes at startup in an order that is undefined across translation units. If anything touches Instance() during its own static initialization from a different TU, you're in UB territory. The article doesn't mention this.
Real world singleton designs also need: deferred/configuration-driven initialization, optional instantiation, state recycling, controlled teardown. A block local static keeps those doors open. A static data member initializes unconditionally at startup, you've lost lazy-init, you've lost the option to not initialize it, and configuration based instantiation becomes awkward by design.
Honestly, if you're bottlenecking on singleton access, that's design smell worth addressing, not the guard variable.
One thing I'd add: the guard overhead can actually matter in high-frequency scenarios. I once profiled a logging singleton that was called millions of times per second in a real-time system - the atomic check was showing up as ~3% CPU. But your point stands: if you're hitting that bottleneck, you probably want to reconsider the architecture entirely.
The lazy initialization guarantee is usually worth more than the performance gain, especially since most singletons aren't accessed in tight loops. The static member approach feels like premature optimization unless you've actually measured the guard overhead in your specific use case.
I might be responding to a llm so...
I ended up using std::call_once for those cases. More boilerplate but at least you're not debugging init order at 2am.