Bulk loading performance improvements

From: Simon Riggs
Subject: Bulk loading performance improvements
Date:
Msg-id: 1204037198.4252.305.camel@ebony.site
Responses: Re: Bulk loading performance improvements (Gregory Stark <stark@enterprisedb.com>)
List: pgsql-hackers
Looking at a profile of COPY and then a profile of CREATE TABLE AS
SELECT (CTAS), there is clearly too much overhead caused by inserting
the rows one at a time.

Flat profile of CTAS: (2 cols of this output removed for clarity)

Each sample counts as 0.01 seconds.
  %   cumulative   self
 time   seconds   seconds      calls  name
 5.63      0.52      0.52   21302604  LWLockRelease
 5.63      1.04      0.52   21193900  LWLockAcquire
 4.77      1.48      0.44   10386370  hash_any
 4.77      1.92      0.44   10139843  ReadBuffer_common
 4.66      2.35      0.43   10000842  PinBuffer
 4.55      2.77      0.42   10000001  ExecProject
 4.44      3.18      0.41   10000014  heap_insert
 4.23      3.57      0.39   10143770  UnpinBuffer
 3.79      3.92      0.35   10000057  MarkBufferDirty
 3.36      4.23      0.31   10000001  ExecMakeFunctionResult
 3.25      4.53      0.30   10000014  RelationGetBufferForTuple
 2.71      4.78      0.25   10000028  heap_fill_tuple
 2.60      5.02      0.24   10528273  hash_search_with_hash_value
 2.38      5.24      0.22   10000042  PageAddItem
 2.38      5.46      0.22   10000000  heap_form_tuple
 1.73      5.82      0.16   10143788  ResourceOwnerForgetBuffer
 1.73      5.98      0.16   10000014  pgstat_count_heap_insert
 1.63      6.13      0.15   20280380  LockBuffer
 1.63      6.28      0.15   10139682  ReleaseBuffer
 1.63      6.43      0.15   10000001  ExecProcNode
 1.52      6.57      0.14   10004864  AllocSetAlloc
 1.52      6.71      0.14   10003013  AllocSetFree
 1.52      6.85      0.14   10000003  ExecEvalConst
 1.52      6.99      0.14   10000000  intorel_receive
 1.41      7.12      0.13   20139096  BufferGetBlockNumber
 1.41      7.25      0.13   10000029  CacheInvalidateHeapTuple
 

The above profile is probably fairly close to the worst case, and it
differs between CTAS and COPY. The overhead also shrinks as row length
increases.

Thin tables are an important use case, since many associative
(many-to-many) tables are larger than the tables that they link
together, and those tables are often just two integer columns. So the
above results are relevant to many large database tables.

The system is beginning to be I/O bound during these operations, but CPU
savings are still helpful.

If we loaded data one block at a time, it seems we could make the
following calls once per block rather than once per row. That would
reduce their cost to perhaps 1% of the current figure, depending upon
row length: a two-integer-column row occupies roughly 30-40 bytes on
disk, so an 8 KB block holds on the order of 200 such rows.
 5.63      0.52      0.52   21302604  LWLockRelease
 5.63      1.04      0.52   21193900  LWLockAcquire
 4.77      1.48      0.44   10386370  hash_any
 4.77      1.92      0.44   10139843  ReadBuffer_common
 4.66      2.35      0.43   10000842  PinBuffer
 4.44      3.18      0.41   10000014  heap_insert
 4.23      3.57      0.39   10143770  UnpinBuffer
 3.79      3.92      0.35   10000057  MarkBufferDirty
 3.25      4.53      0.30   10000014  RelationGetBufferForTuple
 1.73      5.82      0.16   10143788  ResourceOwnerForgetBuffer
 1.73      5.98      0.16   10000014  pgstat_count_heap_insert
 1.63      6.13      0.15   20280380  LockBuffer
 1.63      6.28      0.15   10139682  ReleaseBuffer
 1.41      7.12      0.13   20139096  BufferGetBlockNumber
 1.41      7.25      0.13   10000029  CacheInvalidateHeapTuple
 

which together account for more than 50% of CPU.

So that means we would keep the buffer pinned and locked while we do
everything else during loading. That is difficult to achieve if we have
indexes on the table, and it is also of questionable value in concurrent
situations.

Discussing this with Heikki: if we just kept the buffer pinned, but
locked and unlocked it for each insert, we would save on most of the
above. The following possible savings would still exist:
 5.63      0.52      0.52   21302604  LWLockRelease
 5.63      1.04      0.52   21193900  LWLockAcquire
 3.79      3.92      0.35   10000057  MarkBufferDirty
 1.73      5.98      0.16   10000014  pgstat_count_heap_insert
 1.63      6.13      0.15   20280380  LockBuffer
 1.63      6.28      0.15   10139682  ReleaseBuffer
 

which account for about 18% of CPU. 

So the proposals for improving COPY are:

1) When running a COPY operation into a table with indexes and/or
triggers defined, we will do this for each block:
* identify the block to insert into
* pin it and keep it pinned
* perform all inserts until full
* unpin it

I'm posting a patch to pgsql-patches now that does this: "Bulk Insert
tuning". I measure gains of around 20% for CTAS and 15-17% for COPY,
though the variability of the results is fairly high.
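
To put rough numbers on (1), here is a minimal standalone sketch; it is
not the patch itself, just a model of the buffer-manager call counts,
and the 8 KB block size, the ~36 bytes per thin-table row and the 10
million rows are assumptions chosen to mimic the profile above:

/*
 * Toy model: how many pin/lock calls does inserting N thin rows take
 * when we pin and lock for every row, versus pinning once per block
 * and only locking/unlocking per row (proposal 1)?
 */
#include <stdio.h>

#define PAGE_SIZE   8192    /* assumed block size */
#define TUPLE_SIZE  36      /* assumed: two int4 columns + tuple header + line pointer */

int
main(void)
{
    long    rows = 10000000;    /* same order as the call counts in the profile */
    long    rows_per_page = PAGE_SIZE / TUPLE_SIZE;
    long    pages = (rows + rows_per_page - 1) / rows_per_page;

    /* current behaviour: pin + lock + unlock + unpin for every row */
    long    per_row_calls = rows * 4;

    /* proposal (1): pin/unpin once per block, lock/unlock per row */
    long    per_block_calls = pages * 2 + rows * 2;

    printf("rows per page             : %ld\n", rows_per_page);
    printf("pin+lock calls, per row   : %ld\n", per_row_calls);
    printf("pin+lock calls, per block : %ld (%.0f%% of current)\n",
           per_block_calls, 100.0 * per_block_calls / per_row_calls);
    return 0;
}

With those assumptions the pin/unpin traffic almost disappears and what
remains is the per-row lock/unlock, i.e. roughly half of the original
pin+lock call count.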

2) When running CREATE TABLE AS SELECT or COPY into a table with zero
indexes and zero triggers (yet), we will *also* do these actions for
each block:
* load data into a private buffer, then when full
* identify block to insert into
* lock buffer, memcpy() block into place, set tuple pointers, unlock
* mark buffer dirty once at end
* copy whole buffer to WAL

That will remove all of the block lock/unlock overhead, reduce WAL
locking and reduce overall WAL volume from large loads.

Objective: reduce CPU overhead for initial loads by around 25-40%, or
15-20% more benefit than achieved by (1) above.
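
As an illustration of the shape of (2) only, here is a tiny standalone
sketch; the sizes, the dummy fixed-width tuples and the stand-ins for
the shared buffer and the WAL are all assumptions, not backend code:

/*
 * Toy sketch of proposal (2): build a whole page in backend-private
 * memory, then publish it with a single lock/unlock, a single "mark
 * dirty" and one whole-block copy to WAL.
 */
#include <stdio.h>
#include <string.h>

#define PAGE_SIZE 8192

static char shared_buffer[PAGE_SIZE];   /* stand-in for a shared buffer */
static char wal[PAGE_SIZE];             /* stand-in for the WAL stream */
static int  lock_calls;
static int  dirty_calls;

static void lock_buffer(void)   { lock_calls++; }
static void unlock_buffer(void) { lock_calls++; }
static void mark_dirty(void)    { dirty_calls++; }

int
main(void)
{
    char    local_page[PAGE_SIZE];      /* private staging buffer */
    size_t  used = 0;
    int     tuple = 0;

    /* fill the private page with fixed-size dummy tuples until full */
    while (used + sizeof(tuple) <= PAGE_SIZE)
    {
        memcpy(local_page + used, &tuple, sizeof(tuple));
        used += sizeof(tuple);
        tuple++;
    }

    /* publish: one lock, one memcpy into place, one dirty mark */
    lock_buffer();
    memcpy(shared_buffer, local_page, used);
    unlock_buffer();
    mark_dirty();

    /* one whole-block WAL record instead of one per tuple */
    memcpy(wal, local_page, used);

    printf("tuples staged: %d, lock calls: %d, dirty marks: %d\n",
           tuple, lock_calls, dirty_calls);
    return 0;
}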

3) Maintain a full-block list within the buffer manager, the opposite
of a freelist. Any bulk operation that finishes writing a block adds it
to the fulllist. bgwriter cleans blocks on the fulllist first, before it
starts its normal clock sweep, up to a limit of half of the blocks it is
allowed to clean. This ensures that backends which are the source of
many dirty blocks are targeted more accurately (though not exclusively)
by the bgwriter. It also potentially improves the chances of the OS/disk
controller being able to combine writes and perform sequential write
I/O.

Objective: improve hit rate of bgwriter during bulk insert ops
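
A minimal sketch of what the fulllist might look like, assuming a plain
FIFO of block numbers and made-up names throughout; the real thing would
live inside the buffer manager and need proper locking:

/*
 * Toy sketch of proposal (3): a "fulllist" of recently filled blocks,
 * the opposite of a freelist.  bgwriter drains it first, up to half of
 * its cleaning quota, before falling back to the normal clock sweep.
 */
#include <stdio.h>

#define FULLLIST_MAX 1024

static int  fulllist[FULLLIST_MAX];
static int  head;
static int  tail;

/* called by a bulk loader when it finishes writing a block */
static void
fulllist_add(int blkno)
{
    if ((tail + 1) % FULLLIST_MAX != head)
    {
        fulllist[tail] = blkno;
        tail = (tail + 1) % FULLLIST_MAX;
    }
}

static int
fulllist_pop(int *blkno)
{
    if (head == tail)
        return 0;
    *blkno = fulllist[head];
    head = (head + 1) % FULLLIST_MAX;
    return 1;
}

int
main(void)
{
    int     quota = 10;         /* blocks bgwriter may clean this cycle */
    int     cleaned = 0;
    int     blkno;

    for (blkno = 100; blkno < 120; blkno++)
        fulllist_add(blkno);    /* a bulk load just filled these blocks */

    /* bgwriter: spend at most half the quota on the fulllist... */
    while (cleaned < quota / 2 && fulllist_pop(&blkno))
    {
        printf("bgwriter writes full block %d\n", blkno);
        cleaned++;
    }

    /* ...then fall back to the normal clock sweep for the rest */
    printf("%d blocks left for the clock sweep\n", quota - cleaned);
    return 0;
}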

4) Allow bulk insert operations to utilise a BufferAccessStrategy so
that they reuse the same buffers in the buffer pool and avoid spoiling
the cache for everybody else. On its own this will slow down bulk
operations, since they will be forced to write out their own dirty
blocks; together with (3) it should actually speed up bulk ops.

Objective: avoid cache spoiling and attempt to improve L2 cache usage
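
To show the reuse pattern (4) is after, here is a toy standalone model;
the ring size and all names are assumptions, and in the backend this
would be a BufferAccessStrategy like the ring that sequential scans and
VACUUM already use:

/*
 * Toy sketch of proposal (4): a small private ring of buffers that a
 * bulk insert reuses round-robin, rather than evicting ever more pages
 * from the shared cache.
 */
#include <stdio.h>

#define RING_SIZE 16            /* assumed: small enough to stay cache-resident */

int
main(void)
{
    int     ring[RING_SIZE];    /* stand-ins for buffer ids */
    long    blocks_written = 10000;
    long    distinct_buffers = 0;
    long    blk;

    for (blk = 0; blk < RING_SIZE; blk++)
        ring[blk] = -1;

    for (blk = 0; blk < blocks_written; blk++)
    {
        int     slot = (int) (blk % RING_SIZE);     /* reuse the same few buffers */

        if (ring[slot] == -1)
            distinct_buffers++;     /* first use of this slot */
        /*
         * Otherwise the previous block held in this slot must be written
         * out by this backend before the slot is reused, which is the
         * cost that (3) is meant to offset.
         */
        ring[slot] = (int) blk;
    }

    printf("%ld blocks written using only %ld shared buffers\n",
           blocks_written, distinct_buffers);
    return 0;
}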

These would be done as separate patches, in that order. 

I'm assuming that we'll be able to tune index maintenance as well,
though I'll leave that to Heikki and/or Itagaki.

Comments?

-- 
 Simon Riggs
 2ndQuadrant   http://www.2ndQuadrant.com


