WIP: parallel GiST index builds

From: Tomas Vondra
Subject: WIP: parallel GiST index builds
Date:
Msg-id: 0e4f4af8-1088-4995-b2fb-8f92b5c6cef9@enterprisedb.com
List: pgsql-hackers
Hi,

After looking into parallel builds for BRIN and GIN indexes, I was
wondering if there's a way to do parallel builds for GiST too. I knew
next to nothing about how GiST works, but I gave it a shot and here's
what I have - the attached patch allows parallel GiST builds for the
"unsorted" case (i.e. when the opclass does not include sortsupport),
and does not support buffered builds.


unsorted builds only
--------------------

Addressing only the unsorted case may seem a bit weird, but I did it
this way for two reasons. First, parallel sort is a solved problem, and
adding it to the patch seems quite straightforward - it's what btree
does, for example. Second, I was not sure how common sorted builds are -
we do have sortsupport for points, but I had no idea whether e.g. the
PostGIS opclasses define sorting. My guess was "no", but I've been told
that's no longer true, so sorted builds are probably more widely
applicable than I thought.

In any case, I'm not in a rush to parallelize sorted builds. It can be
added later, as an improvement, IMHO. In fact, it's a well isolated part
of the patch, which might make it a good choice for someone looking for
an idea for their first patch ...


buffered builds
---------------

The lack of support for buffered builds is a very different thing. The
basic idea is that we don't push the index entries all the way to the
leaf pages right away, but accumulate them in buffers half-way through.
This combines writes and reduces random I/O, which is nice.

Unfortunately, the way it's implemented does not work with parallel
builds at all - all the state is kept in private memory, and it assumes
the backend doing the build is the only one that can split a page (at
which point the buffers need to be split too, etc.). For parallel builds
that's obviously not true.

I'm not saying parallel builds can't do similar buffering, but it
requires moving the buffers into shared memory, and introducing locking
to coordinate accesses to the buffers. (Or perhaps it might be enough to
only "notify" the workers about page splits, with buffers still in
private memory?). Anyway, it seems far too complicated for v1.
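
Just to make that a bit more concrete, here is a rough sketch of what
the shared state might look like if we went down that path. This is
purely illustrative - none of these names exist in the patch, and the
actual tuple storage would need much more thought:

  /*
   * Hypothetical shared state for a buffered parallel build - not part
   * of the attached patch, just to show what "moving the buffers into
   * shared memory" would roughly mean.
   */
  #include "storage/lwlock.h"
  #include "utils/dsa.h"

  typedef struct GISTSharedBuildBuffers
  {
      LWLock      lock;        /* coordinates access to the buffers */
      int         nbuffers;    /* number of internal-page buffers */
      dsa_pointer buffers;     /* per-buffer descriptors and queued
                                * index tuples, kept in a DSA area */
  } GISTSharedBuildBuffers;

Every worker would then have to take the lock (or some per-buffer lock)
before adding tuples or relocating them after a split - exactly the
coordination the serial code currently gets for free.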

In fact, I'm not sure the buffering is entirely necessary - maybe the
increase in the amount of RAM in typical systems makes this less of an
issue? If the index fits into shared buffers (or at least the page
cache), maybe the extra I/O is not that bad? I'm sure there are cases
really affected by this, but maybe it's acceptable to tell people to
disable parallel builds in those cases?


gistGetFakeLSN
--------------

One more thing - GiST disables WAL-logging during the build, and only
WAL-logs the index once at the end. For serial builds this is fine,
because there are no concurrent splits, so we don't need to rely on page
LSNs to detect them (in fact, it uses a bogus LSN value).

But for parallel builds this would not work - we need page LSNs that
actually change, otherwise we'd miss page splits, and the index build
would either fail or produce a broken index. The existing is_build flag
affects both the WAL-logging and the LSN handling, so I had to introduce
a new "is_parallel" flag which only affects the page LSN part, by using
the gistGetFakeLSN() function, previously used only for unlogged
indexes.

This means we'll produce WAL during the index build (because
gistGetFakeLSN() writes a trivial message into WAL). Compared to the
serial builds this produces maybe 25-75% more WAL, but it's an order of
magnitude less than with "full" WAL logging (is_build=false).

For example, a serial build of a 5GB index needs ~5GB of WAL. A parallel
build may need ~7GB, while a parallel build with "full" logging would
use 50GB. I think this is a reasonable trade-off.

There's one "strange" thing, though - the amount of WAL decreases with
the number of parallel workers. Consider for example an index on a
numeric field, where the index is ~9GB, but the amount of WAL changes
like this (0 workers means serial builds):

  parallel workers      0      1      3      5      7
          WAL (GB)    5.7    9.2    7.6    7.0    6.8

The explanation for this is fairly simple (AFAIK) - gistGetFakeLSN
determines if it needs to actually assign a new LSN (and write stuff to
WAL) by comparing the last LSN assigned (in a given worker) to the
current insert LSN. But the current insert LSN might have been updated
by some other worker, in which case we simply use that. Which means that
multiple workers may use the same fake LSN, and the likelihood increases
with the number of workers - and this is consistent with the observed
behavior of the WAL decreasing as the number of workers increases
(because more workers use the same LSN).
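
To illustrate, here is a simplified sketch of the relevant branch of
gistGetFakeLSN(), written from memory rather than copied from
gistutil.c, so take the details with a grain of salt:

  #include "postgres.h"
  #include "access/gistxlog.h"    /* gistXLogAssignLSN() */
  #include "access/xlog.h"        /* GetXLogInsertRecPtr() */

  /*
   * Simplified sketch of the fake-LSN logic for a WAL-skipped build of
   * a permanent relation (not the exact PostgreSQL code).
   */
  static XLogRecPtr
  gist_fake_lsn_sketch(void)
  {
      static XLogRecPtr lastlsn = InvalidXLogRecPtr;  /* backend-local */
      XLogRecPtr  currlsn = GetXLogInsertRecPtr();

      /*
       * Only write a dummy WAL record if the insert LSN has not moved
       * since our last call. If some other backend (e.g. another worker
       * doing the same thing) advanced it in the meantime, we simply
       * reuse the current value - which is why multiple workers can end
       * up with the same fake LSN, and why the WAL volume drops as the
       * number of workers grows.
       */
      if (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn)
          currlsn = gistXLogAssignLSN();

      lastlsn = currlsn;
      return currlsn;
  }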

I'm not entirely sure if this is OK or a problem. I was worried two
workers might end up using the same LSN for the same page, leading to
other workers not noticing the split. But after a week of pretty
intensive stress testing, I haven't seen a single such failure ...

If this turns out to be a problem, the fix is IMHO quite simple - it
should be enough to force gistGetFakeLSN() to produce a new fake LSN
whenever is_parallel=true.
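
In terms of the sketch above, that could be as simple as this
(hypothetical - the is_parallel flag would have to be passed down to
the function one way or another):

      /*
       * Hypothetical fix: in parallel builds, always emit a dummy
       * record, so every call returns a fresh, distinct LSN.
       */
      if (is_parallel ||
          (!XLogRecPtrIsInvalid(lastlsn) && lastlsn == currlsn))
          currlsn = gistXLogAssignLSN();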


performance
-----------

Obviously, the primary goal of the patch is to speed up the builds, so
does it actually do that? For indexes of different sizes I got these
timings in seconds (the columns are the number of parallel workers, 0
meaning a serial build):

   scale      type           0         1        3        5        7
  ------------------------------------------------------------------
   small      inet          13         7        4        4        2
              numeric      239       122       67       46       36
              oid           15         8        5        3        2
              text          71        35       19       13       10
   medium     inet         207       111       59       42       32
              numeric     3409      1714      885      618      490
              oid          214       114       60       43       33
              text         940       479      247      180      134
   large      inet        2167      1459      865      632      468
              numeric    38125     20256    10454     7487     5846
              oid         2647      1490      808      594      475
              text       10987      6298     3376     2462     1961

Here "small" means a ~100-200MB index, "medium" a 1-2GB one, and "large"
a 10-20GB one, depending on the data type.

The raw durations are not particularly easy to interpret, so let's look
at the "actual speedup" instead, calculated as

   (serial duration) / (parallel duration)
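
For example, the large "text" index takes 10987 seconds serially and
1961 seconds with 7 workers, so the speedup is 10987 / 1961 ~= 5.6.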

For the whole test matrix, the speedup looks like this:

   scale        type         1          3          5          7
  --------------------------------------------------------------
   small        inet       1.9        3.3        3.3        6.5
                numeric    2.0        3.6        5.2        6.6
                oid        1.9        3.0        5.0        7.5
                text       2.0        3.7        5.5        7.1
   medium       inet       1.9        3.5        4.9        6.5
                numeric    2.0        3.9        5.5        7.0
                oid        1.9        3.6        5.0        6.5
                text       2.0        3.8        5.2        7.0
   large        inet       1.5        2.5        3.4        4.6
                numeric    1.9        3.6        5.1        6.5
                oid        1.8        3.3        4.5        5.6
                text       1.7        3.3        4.5        5.6

Ideally (if the build scaled linearly with the number of workers), we'd
get the number of workers + 1 (because the leader participates too).
Obviously, it's not quite that good - for example, for the large text
index we get 3.3 instead of 4.0 with 3 workers, and 5.6 instead of 8.0
with 7 workers.

But I think those numbers are actually pretty good - I'd definitely not
complain if my index builds got 5x faster.

But those are synthetic tests on random data, using the btree_gist
opclasses. It'd be interesting if people could do their own testing on
real-world data sets.



regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company