8.3.5: Crash in CountActiveBackends() - lockless race?

Поиск
Список
Период
Сортировка
От Marko Kreen
Тема 8.3.5: Crash in CountActiveBackends() - lockless race?
Дата
Msg-id e51f66da0903300311x18ababa4rdc584593aa33cb91@mail.gmail.com
обсуждение исходный текст
Ответы Re: 8.3.5: Crash in CountActiveBackends() - lockless race?
Список pgsql-hackers
We got a crash in our test-server, which has huge number of backends
running:

(gdb) bt
#0  CountActiveBackends () at procarray.c:1094
#1  0x0000000000475f45 in RecordTransactionCommit () at xact.c:945
#2  0x000000000047601c in CommitTransaction () at xact.c:1675
#3  0x0000000000476247 in CommitTransactionCommand () at xact.c:2373
#4  0x00000000005b7872 in finish_xact_command () at postgres.c:2322
#5  0x00000000005b8865 in exec_simple_query (query_string=0xa23070
"END") at postgres.c:1017
#6  0x00000000005ba1b1 in PostgresMain (argc=4, argv=<value optimized
out>, username=0x90b180 "replicator") at postgres.c:3577
#7  0x000000000058ea8b in ServerLoop () at postmaster.c:3207
#8  0x000000000058f7ae in PostmasterMain (argc=5, argv=0x9061e0) at
postmaster.c:1029
#9  0x0000000000545865 in main (argc=5, argv=<value optimized out>) at
main.c:188

$ uname -a
Linux test 2.6.23.17 #1 SMP Fri Oct 31 10:36:17 GMT 2008 x86_64 GNU/Linux

Postgres 8.3.5 / Debian etch / gcc 3.3.5 (Debian 1:3.3.5-13)

------------------------------

My theory:

CountActiveBackends() tries to do lockless access:
   /*    * Note: for speed, we don't acquire ProcArrayLock.  This is a little bit    * bogus, but since we are only
testingfields for zero or nonzero, it    * should be OK.  The result is only used for heuristic purposes anyway...
*/  for (index = 0; index < arrayP->numProcs; index++)   {       volatile PGPROC *proc = arrayP->procs[index];
 
       if (proc == MyProc)           continue;           /* do not count myself */       if (proc->pid == 0) <--
  continue;           /* do not count prepared xacts */
 



ProcArrayAdd() does proper locking, but does not consider lockless access:
   LWLockAcquire(ProcArrayLock, LW_EXCLUSIVE);
   ....
   arrayP->procs[arrayP->numProcs] = proc;   arrayP->numProcs++;  // numProcs can be visible before 'proc' ptr
   LWLockRelease(ProcArrayLock);

Because there is no memory barrier between setting the ptr and
numProcs++, the numProcs can be visible before the ptr to
CountActivebackends & co.  Also, as the ProcArray does not clear
the ptr, which means the once-used slots will point into shared mem,
the race is only fatal if it happens for unused slots.

I see 2 ways to fix it:

1.  Add memory barrier to ProcArrayAdd/ProcArrayRemove between pointer   and count update.  This guarantees that
partialslots will not be seen.
 

2.  Always clear the pointer in ProcArrayRemove and check for NULL   in all "lockless" access points.  This guarantees
thatpartial slots   will be either NULL or just-freed ones, before the barrier in   LWLockRelease(), which means the
contentsshould be still sensible.
 

#1 seems to require platform-specific code, which we don't have yet?
So #2 may be easier solution.

-- 
marko


В списке pgsql-hackers по дате отправления:

Предыдущее
От: Zdenek Kotala
Дата:
Сообщение: Re: Solaris getopt_long and PostgreSQL
Следующее
От: Kedar Potdar
Дата:
Сообщение: Re: Partitioning feature ...