

shared_ptr<T>: the (not always) atomic reference counted smart pointer
source link: https://www.tuicool.com/articles/hit/j6VZFbq
Go to the source link to view the article. You can view the picture content, updated content and better typesetting reading experience. If the link is broken, please click the button below to view the snapshot at that time.

Introduction
This is a write-up of the “behavioral analysis” of shared_ptr<T>
reference count in GNU’s libstdc++. This smart pointer is used to share references to the same underlaying pointer.
The mechanism beneath works by tracking the amount of references through a reference count so the pointer gets freed only after the last reference is destructed. It is usually used in multi-threaded programs (in conjunction with other types) because of the guarantees of having its reference count tracked atomically.
Story time
A few months ago, I was running a micro-benchmark on data structures in Rust vs C++ ones.
At one point, I found that my Rust port of an immutable RB tree insertion was significantly slower than the C++ one. It was unexpected to me as both codebases were idiomatic and Rustc optimizes very well usually matching C++ speed.
I proceeded to re-check that my code was correct. At first I thought that my re-balancing code could be wrong so I put it side by side with the C++ one but couldn’t find any defect.
Profiling
The second day, I started profiling with callgrind and cachegrind . Here is where I got the aha moment. Every part of the code that was copying shared_ptr<T>
was being much faster than my equivalent Arc::clone
calls in Rust.
Inside KCachegrind, I saw something unexpected, the code was straightforward but before increasing shared_ptr
’s reference count during a pointer copy, there was a branch to decide if it should do an atomic addition or a non-atomic one. The code-path being taken was the non atomic one!
Certainly, my knowledge about shared_ptr
was being challenged. As far as I knew, the reference count should be atomic so it could be used in parallel programs sharing the value without the risk of racing the count and ending up with dangling pointers or memory leaks.
Tracking the code
Simplified C++ poc:
const auto tree = make_shared<Tree<int>>(10); for(auto i = 0; i < 100; i++) { const shared_ptr<Tree<int>> tree_copy = tree; // black_box(tree_copy); }
In Rust it is almost the same line by line:
let tree = Arc::new(Tree::new(10)); for _i in 0..100 { let tree_copy = tree.clone(); // test::black_box(tree_copy); }
To understand what was happening better, I compiled the C++ without optimizations which gave me some disassembly to follow:
loop code:
d7b: cmp DWORD PTR [rbp-0x14],0x63 d7f: jg d9a <main+0x59> d81: lea rdx,[rbp-0x30] d85: lea rax,[rbp-0x40] d89: mov rsi,rdx d8c: mov rdi,rax d8f: call f1c <std::shared_ptr<Tree<int> >::operator=(std::shared_ptr<Tree<int> > const&)> #### (*) d94: add DWORD PTR [rbp-0x14],0x1 d98: jmp d7b <main+0x3a>
operator=
:
1021: mov rax,QWORD PTR [rbp-0x8] 1025: mov rdi,rax 1028: call 1158 <std::_Sp_counted_base<(__gnu_cxx::_Lock_policy)2>::_M_add_ref_copy()> #### (*) 102d: mov rax,QWORD PTR [rbp-0x18] 1031: mov rax,QWORD PTR [rax]
And following _M_add_ref_copy()
:
116c: mov esi,0x1 1171: mov rdi,rax 1174: call cfd <__gnu_cxx::__atomic_add_dispatch(int*, int)>
__atomic_add_dispatch
:
0000000000000cfd <__gnu_cxx::__atomic_add_dispatch(int*, int)>: cfd: push rbp cfe: mov rbp,rsp d01: sub rsp,0x10 d05: mov QWORD PTR [rbp-0x8],rdi d09: mov DWORD PTR [rbp-0xc],esi d0c: call c20 <__gthread_active_p()> #### (1) d11: test eax,eax d13: setne al d16: test al,al d18: je d2d d1a: mov edx,DWORD PTR [rbp-0xc] d1d: mov rax,QWORD PTR [rbp-0x8] d21: mov esi,edx d23: mov rdi,rax d26: call c59 <__gnu_cxx::__atomic_add(int volatile*, int)> #### (2) d2b: jmp d3e d2d: mov edx,DWORD PTR [rbp-0xc] d30: mov rax,QWORD PTR [rbp-0x8] d34: mov esi,edx d36: mov rdi,rax d39: call c9b <__gnu_cxx::__atomic_add_single(int*, int)> #### (3) d3e: nop d3f: leave d40: ret
Once here, I found a very interesting pattern. Depending on the return of __gthread_active_p()
(1), it could either call atomic_add
(2) or atomic_add_single
(3).
atomic_add
does what I expect:
c6b: lock add DWORD PTR [rax],edx
but atomic_add_single
does not:
caa: mov edx,DWORD PTR [rax] cac: mov eax,DWORD PTR [rbp-0xc] caf: add edx,eax cb1: mov rax,QWORD PTR [rbp-0x8] cb5: mov DWORD PTR [rax],edx
There are no atomic operations inside that function which opened these new question:
- Why is the C++ standard library optimizing the atomic addition?
- Is this even safe?
To atomic or not to
As expected, atomic_add
and atomic_add_single
were both straightforward:
static inline void __atomic_add_single(_Atomic_word* __mem, int __val) { *__mem += __val; }
static inline void __atomic_add(volatile _Atomic_word* __mem, int __val) { __atomic_fetch_add(__mem, __val, __ATOMIC_ACQ_REL); }
static inline void __attribute__ ((__unused__)) __atomic_add_dispatch(_Atomic_word* __mem, int __val) { if (__gthread_active_p()) __atomic_add(__mem, __val); else __atomic_add_single(__mem, __val); }
Now the new set of questions were about __gthread_active_p()
. After a quick grep
, I found that many functions were depending on its return value to go with a thread-safe operation or not. Finding all of them is an exercise for the reader.
To find the right implementation of __gthread_active_p
, I preprocessed the file with g++ -E main.cpp
and landed on /usr/include/x86_64-linux-gnu/c++/6/bits/gthr-default.h:246
:
static __typeof(pthread_key_create) __gthrw___pthread_key_create __attribute__ ((__weakref__("__pthread_key_create")));
static inline int __gthread_active_p (void) { static void *const __gthread_active_ptr = __extension__ (void *) &__gthrw___pthread_key_create; return __gthread_active_ptr != 0; }
weakref
and pthread_key_create
__weakref__
is an attribute to declare a weak symbol . It means that if it is referenced in another place, it becomes available, but while it isn’t, it’s a NULL pointer. It can be used for defining external symbols which may or may not be available. Can also be used for defining functions that could be intercepted by other more specialized ones. There is a blog post with more information about ithere.
__pthread_key_create
is a function used to assign values into the local thread storage.
I’m sure you have discovered by now what’s happening, but just in case C++ developers left a comment:
For a program to be multi-threaded the only thing that it certainly must be using is pthread_create. However, there may be other libraries that intercept pthread_create with their own definitions to wrap pthreads functionality for some purpose. In those cases, pthread_create being defined might not necessarily mean that libpthread is actually linked in. For the GNU C library, we can use a known internal name. This is always available in the ABI, but no other library would define it. That is ideal, since any public pthread function might be intercepted just as pthread_create might be. __pthread_key_create is an “internal” implementation symbol, but it is part of the public exported ABI. Also, it’s among the symbols that the static libpthread.a always links in whenever pthread_create is used, so there is no danger of a false negative result in any statically-linked, multi-threaded program. For others, we choose pthread_cancel as a function that seems unlikely to be redefined by an interceptor library. The bionic (Android) C library does not provide pthread_cancel, so we do use pthread_create there (and interceptor libraries lose).
So basically, what is happening here is checking if pthread_create
is being imported into the program. If it is, the weak reference becomes available, otherwise it is NULL. Checking this variable, it is easy to see if the program is using threads or not.
Sound or not
What if a program uses parallelism without bringing pthread_key_create
symbol into context? Is it possible?
We can theorize…
Parallelism without pthread
It is possible to create threads by using the OS syscalls bypassing completely the requirement of pthead. (Un)fortunately, I couldn’t find any popular libraries that implement the functionality by using the syscall interface instead of relying on pthread. OpenMP and a few other runtimes I checked all depend on it.
It might exist but doesn’t seem to be very common.
Shared library
Code compiled into a dynamic library can be called from other programs that might introduce external parallelism and expecting the library to be thread-safe because it uses shared_ptr
.
To get more into the question, I created an object that doesn’t use pthread_create because it expects all the parallelism to be external. After an objdump of the symbol table I can see that the symbol is still imported as weak (the w in the second column):
0000000000000000 w F *UND* 0000000000000000 __pthread_key_create@@GLIBC_2.2.5
Shared library loaded by static binary
The programs that would introduce the external parallelism would also load pthread
and enable the weak symbol in the library too. However, if the loading program has compiled pthread statically, the dynamic loader has no way of knowing if pthread_create
is used by the program and wouldn’t make the weak symbol in the loaded library available. If this happens, the assumption would be broken and shared_ptr would be behaving erronously.
I assume that this is also a very rare case and from a quick googling, I can see tons of problems caused by using dlopen
in statically compiled binaries.
In conclusion, I’ll assume this is not a typical scenario and it is mostly safe.
Why not go further with the optimization efforts?
If a program can assume that all its threading is happening through pthread with runtime checks, its implementation can be adapted to also detect when more than a thread is running.
Speculating, pthread could be updating a global count with the amount of running threads whenever threads are created, (un)suspended, or canceled. As far as I can think about it, the check doesn’t have to be atomic because when there is only one thread creating another one, the count doesn’t have to be atomic (from 0 to 1). Then when it’s suspending other threads, it can suspend it and later decrement the count. By the time the count is synchronized with the other thread, there is only one active again.
Is this a missed optimization opportunity?, probably not… However, the C++ standard is quite clear about the atomic operations in thelibraries:
Library Wide Requirements - (2) Requirements specified in terms of interactions between threads do not apply to programs having only a single thread of execution.
Other C++ implementations
After my adventure with libstdc++ , I decided to check VisualC++ and libcxx ones.
Libcxx has a compilation check with a macro to disable threads completely by the flag _LIBCPP_HAS_NO_THREADS
. If it is set, all atomic operations will fallback to non-atomic ones. There is more information in the documentation .
VisualC++ doesn’t have its source code available, but from disassembling shared_ptr::operator=
, I can see that the increment is only atomic and there is no runtime check to fall-back into a non-atomic one. It’s unclear to me if other versions provide them.
De-optimizing the micro-benchmark
This was an easy step, I just referenced pthread_create
in the program and the reference count became atomic again.
Although uninteresting to the topic of the blog post, after the modifications, both programs performed very similarly in the benchmarks.
Add this optimization to Rust!
Not so fast! Arc actually means Atomic Reference Counted so it would be a plain lie if it hadn’t use atomic operations on the reference count.
Furthermore, Rust’s std offers both Rc and Arc which share similar APIs so they can be used interchangeably whenever necessary and the type system would get your back if you had sent Rc between tasks due to it being !Send
(not send).
Conclusion
This was another failed case of micro-benchmarking. Optimizations go beyond your simple -O3
. In this case, I didn’t know that the libstdc++ was changing its behaviour depending on if pthread_create
was imported by the program or not.
While I’m probably not going to spend any more time on this, I’ve found a personally unknown thing about C++ standard library (GNU) and wanted to document it because it was interesting to track down.
Unfortunately, I cannot conclude if shared_ptr<T>
behaviour is completely safe in uncommon environments.
Also, my teammates should be preparing themselves for all my new weakref
optimizations that I’m introducing in my code…
Thanks for reading.
- Follow me on Twitter: @snfernandez
- Contact me at Gmail: sebanfernandez
- Secure is better:GPG Key
Recommend
-
1120
Have you heard the term “lamp post metric”? [i] This is a measurement that is easy t...
-
7
Why getting voting right is hard, Part II: Hand-Counted Paper Ballots Eric Rescorla December 14,...
-
6
Article io_uring [5.11] Exploring null pointer de-reference in io_uring_create Palash Oswal 17 Apr 2021 • 6 min read ...
-
5
Walk The First 3D-Printed Bridge And Be Counted Way back in 2018, we brought you news of a 3D-printed stainless steel pedestrian bridge being planned to span a Dutch canal in Amsterdam.
-
6
New issue Fix suggestion to borrow when casting from pointer to reference #89528
-
5
Calin Baenen Posted on Nov 16...
-
3
V2EX › C++ 《C++ Primer》关于 reference 和 pointer 部分看的人“生气” vcfghtyjc · 7...
-
5
You watched, we counted! The AdBlitz 2022 winners are… Search Input ...
-
10
Get the function pointer to run in a shared library I did not load directly advertisements My Linux application (A) links a...
-
3
Inside STL: The different types of shared pointer control blocks
About Joyk
Aggregate valuable and interesting links.
Joyk means Joy of geeK