Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex

Поиск
Список
Период
Сортировка
От Tom Lane
Тема Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex
Дата
Msg-id 25859.1248117925@sss.pgh.pa.us
обсуждение исходный текст
Ответ на Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex  (Stefan Kaltenbrunner <stefan@kaltenbrunner.cc>)
Ответы Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: [PATCH v4] Avoid manual shift-and-test logic in AllocSetFreeIndex  (Greg Stark <gsstark@mit.edu>)
Список pgsql-hackers
Stefan Kaltenbrunner <stefan@kaltenbrunner.cc> writes:
> Tom Lane wrote:
>> and it turns out that Intel hasn't seen fit to put a lot of effort into
>> the BSR instruction.  It's constant time, all right, but on most of
>> their CPUs that constant time is like 8 or 16 times slower than an ADD;
>> cf http://www.intel.com/Assets/PDF/manual/248966.pdf

> hmm interesting - I don't have the exact numbers any more but that
> patch(or a previous version of it) definitly showed a noticable
> improvement when I tested with sysbench on a current generation Intel
> Nehalem...

Hmm.  I may be overestimating the importance of the smaller size
categories.  To try to get some trustworthy numbers, I made a quick-hack
patch (attachment #1) to count the actual numbers of calls to
AllocSetFreeIndex, and measured the totals for a run of the regression
tests on both a 32-bit machine and a 64-bit machine.  On 32 I got
these totals:

0    5190113
1    5663980
2    3573261
3    4476398
4    4246178
5    1100634
6    386501
7    601654
8    44884
9    52372
10    202801

and on 64 these:

0    2139534
1    5994692
2    5711479
3    3289034
4    4550931
5    2573389
6    487566
7    588470
8    155148
9    52750
10    202597

If you want to do the same in some other workload, feel free.  I
wouldn't trust the results from a single-purpose benchmark too much,
though.

I then put together a test harness that exercises AllocSetFreeIndex
according to these distributions (attachments #2,#3).  (Note: I found
out that it's necessary to split the test into two files --- otherwise
gcc will inline AllocSetFreeIndex and partially const-fold the work,
leading to skewed results.)

What I'm seeing with this harness on my x86 machines is that
__builtin_clz is indeed a bit faster than a naive loop, but not by
very much --- it saves maybe 25% of the runtime.  It's better on an
old PPC Mac; saves about 50%.  Still, these are not impressive numbers
for a microbenchmark that is testing *only* AllocSetFreeIndex.

I'm still interested in the idea of doing a manual unroll instead of
relying on a compiler-specific feature.  However, some quick testing
didn't find an unrolling that helps much.

            regards, tom lane

*** src/backend/storage/ipc/ipc.c.orig    Thu Jun 11 10:55:12 2009
--- src/backend/storage/ipc/ipc.c    Mon Jul 20 13:56:06 2009
***************
*** 183,188 ****
--- 183,190 ----
                                    on_proc_exit_list[on_proc_exit_index].arg);

      on_proc_exit_index = 0;
+
+     AllocSetPrintStats();
  }

  /* ------------------
*** src/backend/utils/mmgr/aset.c.orig    Thu Jun 11 10:55:25 2009
--- src/backend/utils/mmgr/aset.c    Mon Jul 20 13:56:19 2009
***************
*** 255,260 ****
--- 255,271 ----
  #define AllocAllocInfo(_cxt, _chunk)
  #endif

+ static unsigned long allocsizes[ALLOCSET_NUM_FREELISTS];
+
+ void
+ AllocSetPrintStats()
+ {
+     int i;
+
+     for (i = 0; i < ALLOCSET_NUM_FREELISTS; i++)
+         fprintf(stderr, "category %2d count %lu\n", i, allocsizes[i]);
+ }
+
  /* ----------
   * AllocSetFreeIndex -
   *
***************
*** 277,282 ****
--- 288,294 ----
              size >>= 1;
          }
          Assert(idx < ALLOCSET_NUM_FREELISTS);
+         allocsizes[idx]++;
      }

      return idx;
/*
 * usage: time ./testit N
 *    N is a repeat count, set it large enough to get repeatable timings
 */

#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS        3    /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS    11

/* number of calls to AllocSetFreeIndex with each result category */
static const int numoccurs[ALLOCSET_NUM_FREELISTS] = {
#ifdef USE_64BIT_COUNTS
2139534,
5994692,
5711479,
3289034,
4550931,
2573389,
487566,
588470,
155148,
52750,
202597
#else
5190113,
5663980,
3573261,
4476398,
4246178,
1100634,
386501,
601654,
44884,
52372,
202801
#endif
};


extern int AllocSetFreeIndex(size_t size);


int
main(int argc, char **argv)
{
    int        repeat = atoi(argv[1]);

    while (repeat-- > 0)
    {
        int        k;

        for (k = 0; k < ALLOCSET_NUM_FREELISTS; k++)
        {
            int        n = numoccurs[k];
            size_t    siz = (1 << (ALLOC_MINBITS + k));

            while (n-- > 0)
            {
                AllocSetFreeIndex(siz);
            }
        }
    }

    return 0;
}
#include <stdio.h>
#include <stdlib.h>

#define ALLOC_MINBITS        3    /* smallest chunk size is 8 bytes */
#define ALLOCSET_NUM_FREELISTS    11
#define BITS_PER_BYTE 8

/* ----------
 * AllocSetFreeIndex -
 *
 *        Depending on the size of an allocation compute which freechunk
 *        list of the alloc set it belongs to.  Caller must have verified
 *        that size <= ALLOC_CHUNK_LIMIT.
 * ----------
 */
int
AllocSetFreeIndex(size_t size)
{
    int            idx = 0;

    if (size > (1 << ALLOC_MINBITS))
    {
        size = (size - 1) >> ALLOC_MINBITS;

#ifdef HAVE_BUILTIN_CLZ
        idx = sizeof(unsigned int) * BITS_PER_BYTE -
            __builtin_clz((unsigned int)size);
#else
        do
        {
            idx++;
            size >>= 1;
        }
        while (size != 0);
#endif
    }

    return idx;
}

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Joshua Brindle
Дата:
Сообщение: Re: [PATCH] SE-PgSQL/tiny rev.2193
Следующее
От: Greg Stark
Дата:
Сообщение: Re: MIN/MAX optimization for partitioned table