Yes, WaitLatch is vulnerable to weak-memory-ordering bugs

From: Tom Lane
Subject: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs
Date:
Msg-id: 24241.1312739269@sss.pgh.pa.us
Replies: Re: Yes, WaitLatch is vulnerable to weak-memory-ordering bugs  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
I suspected $SUBJECT from the beginning, and I've now put in enough work
to be able to prove it.  The attached test program reliably fails within
a few minutes of being started, when run with 8 worker processes on an
8-core PPC machine.  It's a pretty simple "token passing ring" protocol,
and at some point one of the processes sees its latch set without seeing
its flag set, so it goes back to sleep and the token stops getting passed.

The original worry about the latch code was that there was some internal
race condition of this sort, but I think we were failing to see the forest
for the trees.  The specification of how to use a latch reads like this:

 * The correct pattern to wait for an event is:
 *
 * for (;;)
 * {
 *       ResetLatch();
 *       if (work to do)
 *           Do Stuff();
 *
 *       WaitLatch();
 * }
 *
 * It's important to reset the latch *before* checking if there's work to
 * do. Otherwise, if someone sets the latch between the check and the
 * ResetLatch call, you will miss it and Wait will block.
 *
 * To wake up the waiter, you must first set a global flag or something
 * else that the main loop tests in the "if (work to do)" part, and call
 * SetLatch *after* that.

Any protocol of that sort has *obviously* got a race condition between
changes of the latch state and changes of the separate flag state, if run
in a weak-memory-ordering machine.
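
To spell out one failing interleaving, with B as the setter and A as the
waiter (the names are illustrative; the flag stands for whatever the
"if (work to do)" test reads):

    B:  work_to_do = true;          /* store still sitting in B's write queue */
    B:  SetLatch(&A_latch);         /* latch store becomes visible first */
    A:  (wakes) ResetLatch(&A_latch);
    A:  if (work_to_do) ...         /* still reads the stale "false" */
    A:  WaitLatch(&A_latch);        /* goes back to sleep; the wakeup is lost */

Nothing on a weak-memory-ordering machine guarantees that B's two stores
become visible in program order, nor that A's flag load isn't satisfied
before its latch manipulation completes, so the flag update can lag behind
the latch update and the waiter misses the work.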

At least on the hardware I'm testing, it seems that the key requirement
for a failure to be seen is that there's a lot of contention for the cache
line containing the flag, but not for the cache line containing the latch.
This allows the write to the flag to take longer to propagate to main
memory than the write to the latch.  I could not get it to fail until
I added the extra shared-memory traffic in the delay loop around line 175.
This probably also explains why it doesn't fail with small numbers of
workers --- you need enough contending CPUs to saturate the memory
hardware.  (On Red Hat's test machines, 7 or 8 workers would reliably
fail within a few minutes, but 6 or fewer didn't always.)

I'm not sure whether we need to do anything about this for 9.1; it may be
that none of the existing uses of latches are subject to the kind of
contention that would create a risk.  But I'm entirely convinced that we
need to insert some memory-barrier instructions into the latch code before
we expand our uses of latches much more.  Since we were already finding
that memory barriers will be needed for performance reasons elsewhere,
I think it's time to bite the bullet and develop that infrastructure.
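
For concreteness, the ordering we'd have to enforce looks like this when
written out at the call sites of the pattern quoted above (just a sketch:
pg_write_barrier() and pg_memory_barrier() stand for the not-yet-existing
primitives such infrastructure would supply, and in practice they'd more
likely live inside SetLatch and ResetLatch so callers don't have to think
about it):

    /* setter side */
    work_to_do = true;
    pg_write_barrier();         /* hypothetical: the flag store must become
                                 * visible before the latch store does */
    SetLatch(&waiter_latch);

    /* waiter side */
    for (;;)
    {
        ResetLatch(&my_latch);
        pg_memory_barrier();    /* hypothetical: the flag load below must not
                                 * be satisfied with a value fetched before
                                 * the latch was reset */
        if (work_to_do)
            DoStuff();

        WaitLatch(&my_latch, WL_LATCH_SET | WL_TIMEOUT, 60000000L);
    }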

            regards, tom lane


/*
 * Simple test program to exercise SetLatch/WaitLatch.
 *
 * This just passes a "token" flag around among some number of worker
 * processes.
 *
 * To use: compile this into a .so, then execute
 *
 * CREATE FUNCTION start_worker() RETURNS void AS '/path/to/.so' LANGUAGE c;
 *
 * Then, for each of up to MAX_WORKERS sessions, execute
 *
 * psql -c "SELECT start_worker()" dbname &
 *
 * About one session per physical CPU should be good.
 *
 * Watch the postmaster log for awhile to confirm nobody drops the ball.
 * If anyone does, all the processes will exit with a timeout failure
 * after about a minute.  If you get bored, "pg_ctl stop -m immediate"
 * is the easiest way out.
 *
 * Be sure to restart the postmaster between attempts, since there's no
 * logic for re-initializing a workspace that's already present.
 *
 * Note: this will NOT work with the Windows implementation of latches, since
 * we are cheesy about how we initialize the shared memory.  Doesn't matter
 * for now, since Windows doesn't run on any machines with weak memory
 * ordering anyway.
 */

#include "postgres.h"

#include "fmgr.h"
#include "miscadmin.h"
#include "storage/latch.h"
#include "storage/shmem.h"

PG_MODULE_MAGIC;


#define MAX_WORKERS 16

#define BIAS 500000                /* to reduce time-to-failure */


/*
 * Arrays have padding to try to ensure active values are in different cache
 * lines
 */
typedef struct
{
    int            flag;
    int            pad2[4];
    int            junk;
    char        pad[333];
} TokStruct;

typedef struct
{
    Latch        latch;
    char        pad[333];
} LatchStruct;

typedef struct
{
    int            num_workers;

    TokStruct    flags[MAX_WORKERS];

    LatchStruct    latches[MAX_WORKERS];
} MySharedSpace;

volatile long    gsum = 0;

Datum        start_worker(PG_FUNCTION_ARGS);

PG_FUNCTION_INFO_V1(start_worker);

Datum
start_worker(PG_FUNCTION_ARGS)
{
    volatile MySharedSpace *workspace;
    bool        found;
    int            i;
    int            my_number;
    long        counter = 0;
    unsigned short xseed[3];

    /* Set up non-interlocked RNG */
    xseed[0] = xseed[1] = xseed[2] = 42;

    /* Create or access the shared data */
    workspace = (MySharedSpace *)
        ShmemInitStruct("Latch Testing", sizeof(MySharedSpace), &found);

    if (!found)
    {
        /* I'm the first, initialize */
        memset((void *) workspace, 0, sizeof(MySharedSpace));

        for (i = 0; i < MAX_WORKERS; i++)
            InitSharedLatch(&(workspace->latches[i].latch));

        /* Give the token to the first worker (ie, me) */
        workspace->flags[0].flag = 1;
    }

    /*
     * Add self to the arrays.  This assumes no two would-be workers do it
     * concurrently, which should be OK if they're started manually.  Note
     * that we do NOT initialize flags[my_number], as that was set up
     * correctly during the workspace creation above.
     */
    my_number = workspace->num_workers;
    if (my_number >= MAX_WORKERS)
        elog(ERROR, "too many workers");

    OwnLatch(&(workspace->latches[my_number].latch));
    workspace->num_workers++;

    /*
     * And play ball ...
     */
    for (;;)
    {
        ResetLatch(&(workspace->latches[my_number].latch));

        /* If I have the token ... */
        if (workspace->flags[my_number].flag)
        {
            int        next_worker;
            int        cnt;

            /* delete it ... */
            workspace->flags[my_number].flag = 0;

            /*
             * ... and pass it on to the next guy.  Note we might fetch a
             * stale value of num_workers here, but that's okay, it'll just
             * mean the recently-added guy doesn't get the token during this
             * cycle.
             */
            next_worker = my_number + 1;
            if (next_worker >= workspace->num_workers)
                next_worker = 0;

            workspace->flags[next_worker].flag = 1;

            SetLatch(&(workspace->latches[next_worker].latch));

            /* Report that we're still alive every so often */
            counter++;
            if (counter % (256*1024) == 0)
                elog(LOG, "worker %d still alive after %ld cycles",
                     my_number, counter);

            CHECK_FOR_INTERRUPTS();

            /*
             * Compute for a little while, hoping our latch will be set
             * again before we wait.  The delay is random with a range that
             * grows slowly as we run.
             *
             * While we're at it, create a lot of r/w traffic on the
             * TokStructs to try to saturate their cache lines.  This is
             * needed to cause other CPUs to not see flag updates promptly
             * compared to latch updates.
             */
            cnt = (counter + BIAS) / 64 +
                (int) (pg_erand48(xseed) * ((counter + BIAS) / 64));

            for (i = 0; i < cnt; i++)
            {
                int        k = i % MAX_WORKERS;

                gsum += workspace->flags[k].flag;
                workspace->flags[k].junk++;
            }
        }

        if (!(WaitLatch(&(workspace->latches[my_number].latch),
                        WL_LATCH_SET | WL_TIMEOUT, 60000000L) & WL_LATCH_SET))
            elog(ERROR, "WaitLatch timed out in worker %d after %ld cycles, with flag %d latch %d",
                 my_number, counter,
                 workspace->flags[my_number].flag,
                 workspace->latches[my_number].latch.is_set);
    }

    PG_RETURN_NULL();
}
