Re: Strange hanging bug in a simple milter

From: Stephen Frost
Subject: Re: Strange hanging bug in a simple milter
Msg-id: 20130910213944.GA2706@tamriel.snowman.net
In response to: Re: Strange hanging bug in a simple milter  (Heikki Linnakangas <hlinnakangas@vmware.com>)
Responses: Re: Strange hanging bug in a simple milter  (Stephen Frost <sfrost@snowman.net>)
List: pgsql-hackers

Heikki,

* Heikki Linnakangas (hlinnakangas@vmware.com) wrote:
> Hmm. Are you sure you're getting an SSL connection? Run it with
> something like this to make sure:

sslmode=require doesn't help on Unix domain connections. :)

Was able to get it to hang with both 9.2.4 and master, and with both
versions of the SSL library (1.0.1c-4ubuntu8.1 and 1.0.1e-3).

The lock that I got showed these stacks:

Thread #2: pthread_mutex_lock init_ssl_system ; fe-secure.c:917

Thread #3: pthread_mutex_lock pq_lockingcallback  ... SSL_new pqsecure_open_client ; fe-secure.c:275

Thread #2 waiting at 917 makes sense: it's waiting on the ssl_config_mutex
lock that the other thread holds before moving on to set up its own
SSL connection.  What's odd is that thread #3 is waiting on a lock in the
lock array.  Both threads agree that ssl_open_connections is only 1
(thread #3's; thread #2 hasn't gotten to incrementing it yet).  Looking
at the lock array, only one of the locks is taken and its owner is
thread #3, meaning that SSL apparently caused a deadlock by trying to
take a lock which it has already taken.
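
To illustrate the self-deadlock described above: a thread that re-locks a
default (non-recursive) pthread mutex it already owns can simply hang.  The
sketch below is not libpq or OpenSSL code; second_lock_result() is a
hypothetical helper, and it uses an error-checking mutex so that the second
lock attempt reports EDEADLK instead of blocking forever.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>

/* Hypothetical helper (not part of libpq): lock an error-checking mutex
 * twice from the same thread and return the result of the second attempt. */
int second_lock_result(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_ERRORCHECK);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);      /* first lock succeeds */
    assert(rc == 0);

    /* Second lock by the owner: an error-checking mutex returns EDEADLK
     * here, whereas a default mutex may block forever. */
    rc = pthread_mutex_lock(&m);

    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
    return rc;
}
```

Swapping an error-checking mutex into a test build is one way to turn a
silent hang of this kind into a visible error code.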

Changing the lock type to recursive masks the self-locking
issue, but then I got a case where, with the same stack traces as above,
the lock in the array was held by thread #2 instead, where thread #2 is
in init_ssl_system, well before it has made any calls into SSL since
the previous PQfinish() happened.
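
As a sketch of why a recursive lock only masks the problem: the owning
thread can re-lock without blocking, but any other thread still finds the
mutex busy until the owner has unlocked it as many times as it locked it.
The names below (recursive_relock_demo, try_from_other_thread) are
hypothetical and this is plain pthreads, not the libpq locking callback.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

/* Runs in a second thread: try to take the mutex without blocking and
 * hand the result code back through the thread's return value. */
static void *try_from_other_thread(void *arg)
{
    pthread_mutex_t *m = arg;
    int rc = pthread_mutex_trylock(m);

    if (rc == 0)
        pthread_mutex_unlock(m);
    return (void *) (intptr_t) rc;
}

/* Hypothetical demo: reports what the owner's second lock returned and
 * what a different thread saw while the owner still held the mutex. */
void recursive_relock_demo(int *relock_rc, int *other_thread_rc)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t m;
    pthread_t t;
    void *res = NULL;
    int rc;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    pthread_mutex_init(&m, &attr);
    pthread_mutexattr_destroy(&attr);

    rc = pthread_mutex_lock(&m);
    assert(rc == 0);
    *relock_rc = pthread_mutex_lock(&m);   /* owner re-locks: no self-deadlock */

    rc = pthread_create(&t, NULL, try_from_other_thread, &m);
    assert(rc == 0);
    pthread_join(t, &res);
    *other_thread_rc = (int) (intptr_t) res;

    /* A recursive mutex must be unlocked once per successful lock. */
    pthread_mutex_unlock(&m);
    pthread_mutex_unlock(&m);
    pthread_mutex_destroy(&m);
}
```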

I've also caught a thread still holding a lock when it drops into
destroy_ssl_system(), by simply trying to unlock all of the locks in
the array there.  With the recursive lock type, all such attempts should
simply error out (either it's locked by someone else, or it's already
unlocked) and so I checked for *successful* unlocks:

2: DEBUG: database connection established
2: DEBUG: about to call PQfinish()
successfully unlocked mutex!
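
The check above can be sketched roughly as follows.  This is an assumption
about its shape, not the actual debugging patch: count_successful_unlocks()
and leaked_lock_demo() are hypothetical names, and a small local array
stands in for pq_lockarray.  It relies on pthread_mutex_unlock() failing on
a recursive mutex that the calling thread does not own.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>

/* Try to unlock every mutex in the array.  For recursive mutexes an unlock
 * only succeeds if the calling thread currently holds that lock, so any
 * success means this thread was still holding it. */
int count_successful_unlocks(pthread_mutex_t *locks, int n)
{
    int i, released = 0;

    for (i = 0; i < n; i++)
        if (pthread_mutex_unlock(&locks[i]) == 0)
            released++;
    return released;
}

/* Hypothetical demo: three recursive locks, one of which is "leaked"
 * (still held) when the scan runs. */
int leaked_lock_demo(void)
{
    pthread_mutexattr_t attr;
    pthread_mutex_t locks[3];
    int i, released;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    for (i = 0; i < 3; i++)
        pthread_mutex_init(&locks[i], &attr);
    pthread_mutexattr_destroy(&attr);

    pthread_mutex_lock(&locks[1]);            /* simulate the leaked lock */
    released = count_successful_unlocks(locks, 3);

    for (i = 0; i < 3; i++)
        pthread_mutex_destroy(&locks[i]);
    return released;
}
```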

Having that happen can then cause the deadlock: the other thread
can end up waiting on the lock that we're still holding while we're in
destroy_ssl_system(), waiting on the ssl_config_mutex lock
that the other thread has.  Even with that code left in, unlocking
all the locks during destroy_ssl_system(), it has still deadlocked
on me with the same stack trace as above, with thread #2 holding a lock
in the pq_lockarray which thread #3 is trying to get (while thread #3
holds the ssl_config_mutex lock that thread #2 is waiting on).
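
The cycle described above (each thread holding the lock the other one
wants) can be reproduced in miniature.  This is only a sketch under stated
assumptions: config_mutex and array_lock are stand-ins for ssl_config_mutex
and one entry of pq_lockarray, cross_lock_demo() is a hypothetical name, and
it uses pthread_mutex_trylock() so that the would-be deadlock shows up as
EBUSY on both sides instead of hanging.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stddef.h>

struct cross_args {
    pthread_mutex_t   *mine;     /* lock this thread takes first */
    pthread_mutex_t   *theirs;   /* lock this thread wants next */
    pthread_barrier_t *barrier;
    int                try_rc;   /* result of the second lock attempt */
};

static void *hold_and_try(void *p)
{
    struct cross_args *a = p;

    pthread_mutex_lock(a->mine);
    pthread_barrier_wait(a->barrier);   /* both threads now hold their first lock */
    a->try_rc = pthread_mutex_trylock(a->theirs);
    pthread_barrier_wait(a->barrier);   /* both attempts finish before any unlock */
    if (a->try_rc == 0)
        pthread_mutex_unlock(a->theirs);
    pthread_mutex_unlock(a->mine);
    return NULL;
}

/* Hypothetical demo: reports what each thread got when it tried to take
 * the lock held by the other one. */
void cross_lock_demo(int *rc_a, int *rc_b)
{
    pthread_mutex_t config_mutex, array_lock;
    pthread_barrier_t barrier;
    pthread_t ta, tb;
    struct cross_args a = { &config_mutex, &array_lock, &barrier, 0 };
    struct cross_args b = { &array_lock, &config_mutex, &barrier, 0 };

    pthread_mutex_init(&config_mutex, NULL);
    pthread_mutex_init(&array_lock, NULL);
    pthread_barrier_init(&barrier, NULL, 2);

    pthread_create(&ta, NULL, hold_and_try, &a);
    pthread_create(&tb, NULL, hold_and_try, &b);
    pthread_join(ta, NULL);
    pthread_join(tb, NULL);

    *rc_a = a.try_rc;
    *rc_b = b.try_rc;

    pthread_barrier_destroy(&barrier);
    pthread_mutex_destroy(&array_lock);
    pthread_mutex_destroy(&config_mutex);
}
```

With blocking pthread_mutex_lock() calls in place of the trylock, both
threads would wait on each other indefinitely, which matches the shape of
the stack traces above.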

Very curious.  Out of time right now to look into it, but will probably
be back at it later tonight.
Thanks,
    Stephen
