CodingBison

One of the pain points of having multiple threads is dealing with deadlocks. A deadlock occurs when we have a circular dependency between threads and resources; these resources can be as simple as Pthread mutexes. Deadlocks are bad because the threads involved in the deadlock do nothing but sit there forever. Not exactly what we had in mind, when we created them!

For example, let us say we have a thread, (say t1) and that is holding a resource (say r1). But as per the application logic, t1 also needs to acquire another resource (say r2). Now as luck would have it, r2 is being held by a second thread (say t2) and that is actually waiting for r1. Thus, t1 is waiting for r2 which is held by t2, which in turn, is waiting for r1 that is held by t1! Due to this, threads t1 and t2 would wait indefinitely. The following picture illustrates this deadlock.



Figure: Deadlock between two threads (t1 and t2)

Having two threads wait for each other is the simplest case for a deadlock. It is possible that we can have N threads run into a deadlock such that each one is waiting for a resource that is needed by the other thread. Think of this as a loop. The first thread waits for a resource held by the second thread, the second thread waits for a resource held by third thread, and the last thread waits for the resource held by the first thread.

Having threads wait for a resource in a circular fashion is at the heart of a deadlock. Further, for deadlocks to happen, two or more resources should be such that they cannot be shared among multiple threads. If we take a mutex as a resource, then this sounds right because a mutex can have only one owner and the thread that locks it is the only thread that can unlock it. The ownership is one of the key reasons why we end up with deadlocks.

Investigating deadlocks

Debugging a deadlock when dealing with multiple threads might be one of the hardest things a programmer would ever have to do! When threads decide to sit in a deadlock, refusing to move an inch further, you can forget about relying upon debugs. With threads sitting in a blocked state, they would not execute any statements, and so there is absolutely no way, they would print any debugs.

When dealing with a deadlock, one way to start debugging is to consider all possible paths, where threads can go and thereby deadlock. Clearly, when we have scores of threads, this method can have its own limitation. For such cases, one approach is to fire GDB and when threads sit in a stuck state, look at the state of each thread, and each resource (e.g. mutex).

With that goal in mind, we provide an example that artificially creates a deadlock and then debug it using GDB. The example (provided below) has a single goal of creating a deadlock. It would be safe to say that this example does not do anything even remotely useful!

The example has two threads, kept in an array thread[]. Each thread has its own callback function; callback1() and callback2(). The example uses two mutexes as resources: mutex1 and mutex2.

The callback functions are similar in nature that they both sleep for random times and when they wake up, they try to lock both mutexes, one after another. However, the ordering of the mutexes is different. The first callback attempts to lock mutex1 first, followed by mutex2. The second callback attempts to lock mutex2 first, followed by mutex1. Due to this, the first thread locks mutex1 and goes to sleep. The second thread locks mutex2 and goes to sleep. When thread1 wakes up, it tries to lock mutex2, but that is already being held by thread2. Likewise, when thread2 wakes up, it tries to lock mutex1, which is being held by thread1. In other words, we get into a deadlock!

Each of these callback functions sit within a while loop. The reason for that is it is possible that the two threads could miss the deadlock in the first pass. This can happen if one of the threads sleep longer than the two combined sleep of the other thread. Let us say, thread1 sleeps for 1 second and after locking, it sleeps for 2 seconds. If thread2 were to sleep for 3 second, then it would wake up exactly when thread1 is done in the first pass. So, if the threads miss the deadlock, having a loop forces them to retry achieving deadlock in subsequent passes!

 #include <stdio.h>
 #include <pthread.h>
 #include <time.h>

 #define MAX_SLEEP 10
 #define MAX_THREADS 2

 pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
 pthread_mutex_t mutex2 = PTHREAD_MUTEX_INITIALIZER;

 void *callback1 (void *arg) {
     int random_time;

     while (1) { 
         srand(time(NULL));
         random_time = rand() % MAX_SLEEP;
         printf("[%s]Sleeping for %d seconds\n", __FUNCTION__, random_time);
         sleep(random_time);
         printf("[%s]Trying to acquire mutex1 (holding none)\n", __FUNCTION__);
         pthread_mutex_lock(&mutex1);
         printf("[%s]Acquired mutex1\n", __FUNCTION__);

         random_time = rand() % MAX_SLEEP;
         printf("[%s]Sleeping for %d seconds\n", __FUNCTION__, random_time);
         sleep(random_time);

         printf("[%s]Trying to acquire mutex2 (holding mutex1) \n", __FUNCTION__);
         pthread_mutex_lock(&mutex2);
         printf("[%s]Acquired mutex2\n\n", __FUNCTION__);

         pthread_mutex_unlock(&mutex2);
         pthread_mutex_unlock(&mutex1);
     }
     return NULL;
 }

 void *callback2 (void *arg) {
     int random_time;

     while (1) { 
         srand(time(NULL));
         random_time = rand() % MAX_SLEEP;
         printf("[%s]Sleeping for %d seconds\n", __FUNCTION__, random_time);
         sleep(random_time);
         printf("[%s]Trying to acquire mutex2 (holding none)\n", __FUNCTION__);
         pthread_mutex_lock(&mutex2);
         printf("[%s]Acquired mutex2\n", __FUNCTION__);

         random_time = rand() % MAX_SLEEP;
         printf("[%s]Sleeping for %d seconds\n", __FUNCTION__, random_time);
         sleep(random_time);

         printf("[%s]Trying to acquire mutex1 (holding mutex2) \n", __FUNCTION__);
         pthread_mutex_lock(&mutex1);
         printf("[%s]Acquired mutex1\n\n", __FUNCTION__);

         pthread_mutex_unlock(&mutex1);
         pthread_mutex_unlock(&mutex2);
     }
     return NULL;
 }

 int main () {
     pthread_t thread[MAX_THREADS];
     int status, i;

     status = pthread_create(&thread[0], NULL, callback1, NULL);
     if (status != 0) {
         fprintf(stderr, "pthread_create failed [status: %d]\n", status);
         return -1;
     }

     status = pthread_create(&thread[1], NULL, callback2, NULL);
     if (status != 0) {
         fprintf(stderr, "pthread_create failed [status: %d]\n", status);
         return -1;
     }

     for (i=0; i < MAX_THREADS; i++) { 
         status = pthread_join(thread[i], NULL);
         if (status != 0) {
             fprintf(stderr, "pthread_join failed for thread1 [%d]\n", status);
             return -1;
         }
     }
     return 0;
 }

Let us now compile and run the program. We use the pthread option with gcc to add support for Pthreads library.

 [codingbison]$ gcc -pthread deadlock_two_threads.c  -o deadlock
 [codingbison]$ 
 [codingbison]$ ./deadlock
 [callback1]Sleeping for 4 seconds
 [callback2]Sleeping for 4 seconds
 [callback1]Trying to acquire mutex1 (holding none)
 [callback1]Acquired mutex1
 [callback1]Sleeping for 2 seconds
 [callback2]Trying to acquire mutex2 (holding none)
 [callback2]Acquired mutex2
 [callback2]Sleeping for 9 seconds
 [callback1]Trying to acquire mutex2 (holding mutex1) 
 [callback2]Trying to acquire mutex1 (holding mutex2) 


 ^C
 [codingbison]$ 

The output confirms what we expected. The thread1 sleeps for 4 seconds, wakes up and locks mutex1. Then it sleeps for 2 seconds again and when wakes up, tries to lock mutex2. The thread2 has a similar story. After the last line in the output, both threads sit in a deadlock and if program is not killed (using Control-C), then they would sit there waiting for mutexes, for ever!

We can use GDB to look into the states of these threads and mutexes. For that, let us recompile the program by passing "-g" option to gcc; this option enables gcc to build symbols which are later used by GDB. We provide the output first and explain it after that.

 [codingbison]$ gcc -g -pthread deadlock_two_threads.c  -o deadlock
 [codingbison]$ 
 [codingbison]$ gdb deadlock 
 GNU gdb (GDB) Fedora (7.3.50.20110722-16.fc16)
 Copyright (C) 2011 Free Software Foundation, Inc.
 License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
 This is free software: you are free to change and redistribute it.
 There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
 and "show warranty" for details.
 This GDB was configured as "i686-redhat-linux-gnu".
 For bug reporting instructions, please see:
 <http://www.gnu.org/software/gdb/bugs/>...
 Reading symbols from /home/codingbison/threads/deadlock...done.
 (gdb) 
 (gdb) run
 Starting program: /home/codingbison/threads/deadlock 
 [Thread debugging using libthread_db enabled]
 Using host libthread_db library "/lib/libthread_db.so.1".
 [New Thread 0xb7fe8b40 (LWP 4786)]
 [callback1]Sleeping for 7 seconds
 [New Thread 0xb77e7b40 (LWP 4787)]
 [callback2]Sleeping for 7 seconds
 [callback1]Trying to acquire mutex1 (holding none)
 [callback1]Acquired mutex1
 [callback1]Sleeping for 6 seconds
 [callback2]Trying to acquire mutex2 (holding none)
 [callback2]Acquired mutex2
 [callback2]Sleeping for 2 seconds
 [callback2]Trying to acquire mutex1 (holding mutex2) 
 [callback1]Trying to acquire mutex2 (holding mutex1) 

 ^C
 Program received signal SIGINT, Interrupt.
 0xb7fff424 in __kernel_vsyscall ()
 Missing separate debuginfos, use: debuginfo-install glibc-2.14.90-24.fc16.9.i686
 (gdb) 
 (gdb) info threads
   Id   Target Id         Frame 
   3    Thread 0xb77e7b40 (LWP 4787) "deadlock" 0xb7fff424 in __kernel_vsyscall ()
   2    Thread 0xb7fe8b40 (LWP 4786) "deadlock" 0xb7fff424 in __kernel_vsyscall ()
 * 1    Thread 0xb7fe9900 (LWP 4783) "deadlock" 0xb7fff424 in __kernel_vsyscall ()
 (gdb) 
 (gdb) thread apply all bt full

 Thread 3 (Thread 0xb77e7b40 (LWP 4787)):
 #0  0xb7fff424 in __kernel_vsyscall ()
 #1  0x450f34d2 in __lll_lock_wait () from /lib/libpthread.so.0
 #2  0x450eee63 in _L_lock_693 () from /lib/libpthread.so.0
 #3  0x450eeca8 in pthread_mutex_lock () from /lib/libpthread.so.0
 #4  0x080488ae in callback2 (arg=0x0) at deadlock_two_threads.c:54
         random_time = 2
         __FUNCTION__ = "callback2"
 #5  0x450eccd3 in start_thread () from /lib/libpthread.so.0
 #6  0x45029d7e in clone () from /lib/libc.so.6

 Thread 2 (Thread 0xb7fe8b40 (LWP 4786)):
 #0  0xb7fff424 in __kernel_vsyscall ()
 #1  0x450f34d2 in __lll_lock_wait () from /lib/libpthread.so.0
 #2  0x450eee63 in _L_lock_693 () from /lib/libpthread.so.0
 #3  0x450eeca8 in pthread_mutex_lock () from /lib/libpthread.so.0
 #4  0x08048750 in callback1 (arg=0x0) at deadlock_two_threads.c:28
         random_time = 6
         __FUNCTION__ = "callback1"
 #5  0x450eccd3 in start_thread () from /lib/libpthread.so.0
 #6  0x45029d7e in clone () from /lib/libc.so.6

 Thread 1 (Thread 0xb7fe9900 (LWP 4783)):
 #0  0xb7fff424 in __kernel_vsyscall ()
 #1  0x450edde5 in pthread_join () from /lib/libpthread.so.0
 #2  0x080489b9 in main () at deadlock_two_threads.c:80
         thread = {3086912320, 3078519616}
         status = 0
         i = 0
 (gdb) 
 (gdb) thread 3
 [Switching to thread 3 (Thread 0xb77e7b40 (LWP 4787))]
 #0  0xb7fff424 in __kernel_vsyscall ()
 (gdb)  
 (gdb) print mutex1
 $1 = {__data = {__lock = 2, __count = 0, __owner = 4786, __kind = 0, 
         __nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, 
         __size = "\002\000\000\000\000\000\000\000\262\022\000\000\000\000\
         000\000\001\000\000\000\000\000\000", __align = 2}
 (gdb) 
 (gdb) print mutex2
 $2 = {__data = {__lock = 2, __count = 0, __owner = 4787, __kind = 0, 
         __nusers = 1, {__spins = 0, __list = {__next = 0x0}}}, 
         __size = "\002\000\000\000\000\000\000\000\263\022\000\000\000\000\
         000\000\001\000\000\000\000\000\000", __align = 2}
 (gdb)  
 (gdb) info threads
   Id   Target Id         Frame 
 * 3    Thread 0xb77e7b40 (LWP 4787) "deadlock" 0xb7fff424 in __kernel_vsyscall ()
   2    Thread 0xb7fe8b40 (LWP 4786) "deadlock" 0xb7fff424 in __kernel_vsyscall ()
   1    Thread 0xb7fe9900 (LWP 4783) "deadlock" 0xb7fff424 in __kernel_vsyscall ()
 (gdb) 
 (gdb) bt
 #0  0xb7fff424 in __kernel_vsyscall ()
 #1  0x450f34d2 in __lll_lock_wait () from /lib/libpthread.so.0
 #2  0x450eee63 in _L_lock_693 () from /lib/libpthread.so.0
 #3  0x450eeca8 in pthread_mutex_lock () from /lib/libpthread.so.0
 #4  0x080488ae in callback2 (arg=0x0) at deadlock_two_threads.c:54
 #5  0x450eccd3 in start_thread () from /lib/libpthread.so.0
 #6  0x45029d7e in clone () from /lib/libc.so.6
 (gdb) 
 (gdb) continue
 Continuing.

First, we need to run the program with GDB as "gdb deadlock" -- once we hit the GDB prompt, we enter "run" to run the program under GDB. After sometime, we would hit the deadlock and the program would halt. At that point, we can use Control-C to get into GDB prompt. Don't worry, even with Control-C, the GDB is still running the program and to get back to program, all we need to do is enter "continue"!

Once we get into GDB prompt, we can use "info threads" to see all the threads belonging to a program. In the output, each of the threads has an LWP id: thread 1 (LWP id 4783), thread 2 (LWP id 4786), and thread 3 (LWP id 4787); LWP stands for Light Weight Process. We can also see this from the GDB log that threads 2 and 3 were created from the main thread, when we issued the "run" command: "[New Thread 0xb7fe8b40 (LWP 4786)]" and "[New Thread 0xb77e7b40 (LWP 4787)]".

The asterisk (*) in the "info threads" output shows us the current thread in the GDB context. If needed, we can go from one thread to another by specifying the thread number to the "thread" command. Thus, "thread 2" would take us into the context of thread 2.

The next command "thread apply all bt full", basically prints backtrace of all threads; "bt" is short for backtrace. So, when we run it, we see backtrace of all the three threads. It shows that thread 1 started in the main(), thread 2 started in callback1(), and thread 3 started in callback2(). It also shows that two of the threads (2 and 3) are waiting on the pthread_mutex_lock() -- this should be our first hint (but not definitive hint) that we might have a deadlock.

We should also make a note of backtrace number 4 for threads 2 and 3. For thread 2, it says " #4 0x08048750 in callback1 (arg=0x0) at deadlock_two_threads.c:28", which means that it is stuck at line number 28, where we are calling the pthread_mutex_lock for mutex2. Similarly, for thread 3, it says "4 0x080488ae in callback2 (arg=0x0) at deadlock_two_threads.c:54", which means that it is stuck at line number 54, where we are calling the pthread_mutex_lock for mutex1. Another potent hint that we might be approaching the forbidden deadlock zone!

The last hint comes when we switch to thread 3 and print values of mutex1 and mutex2. When we print mutex1, we see that it has an owner with an LWP value of 4786 and LWP 4786 represents thread 2. Along the same lines, when we print mutex2, we see that it has an owner with an LWP value of 4787 and LWP 4787 represents thread 3. It is this data point that confirms we have a deadlock. From the traceback, we see that thread 2 is waiting for mutex2 and from printing the mutex2, we know that mutex2 is owned by thread 3. Similarly, from the traceback, we see that thread 3 is waiting for mutex1 and from printing the mutex, we know that mutex1 is owned by thread 2. Hence, the deadlock!

Fixing deadlocks

Finding the root cause of a deadlock is an important step towards fixing the deadlock. But, once we know that a deadlock exists, we also need to provide a fix for it. The fix varies from case to case. Let us discuss some of the methods that can help us avoid deadlock or to break an ongoing deadlock.

First, for cases where multiple resources (e.g. mutexes) are to be allocated for a given thread, we should ensure that the resource-acquisition happens in a pre-specified order. For our above example, specifying the order could mean once a thread has acquired the first mutex, then it is the same thread that will also acquire the second mutex. In this way, the second thread would not wait for the second mutex since it needs to acquire the first mutex first.

Needless to say that if we have certain threads that need to acquire only one of the resource during their life-time, then they can continue doing so without having to bother acquiring all the threads. So, enforcing acquisition of resources in a predefined order should be enforced only on those threads that need to acquire multiple resources.

Second, for applications where it is possible to know all the resources needed by a thread in advance, we can also use what is known as Banker's algorithm to avoid deadlocks. This algorithm does a book-keeping of resources and whenever it allocates a resource, it checks whether it can likely cause a deadlock.

Third, we do not have to wait indefinitely for a resource. For code-locations, where trying to acquire a resource would likely lead to a deadlock, we can replace the calls with pthread_mutex_trylock() and if it fails because the mutex is already locked, then we can retry later. If the call still fails, then that might be a good hint that we might be hitting a deadlock. In that case, we can release the resources held by the thread and either return from the call or resume the call later.

In fact, if the deadlock occurs rarely, then depending upon the application logic, it might be reasonable to simply return with an error instead of resuming the call. For example, let us say that we have two threads that are trying to send socket data. If the two threads run into a deadlock, then with the above approach, one of them can simply return without sending data. This would probably be okay for applications that can handle a rare packet loss.

Lastly, it is also common for some implementations to have a watch dog. A watch dog can be a thread that runs independently and monitors all the other threads. If certain threads are blocked for a long period of time, then the watchdog can detect that and conclude that there might be a deadlock.





comments powered by Disqus