Re: HEAD seems to generate larger WAL regarding GIN index

From: Heikki Linnakangas
Subject: Re: HEAD seems to generate larger WAL regarding GIN index
Date:
Msg-id: 53270614.5050804@vmware.com
In reply to: Re: HEAD seems to generate larger WAL regarding GIN index  (Fujii Masao <masao.fujii@gmail.com>)
Responses: Re: HEAD seems to generate larger WAL regarding GIN index  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
On 03/17/2014 03:20 PM, Fujii Masao wrote:
> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:

> I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
> and then found its max seems more than 256B. Am I missing something?
>
> What I observed is
>
> [In HEAD]
> At first, the size of GIN-related WAL is gradually increasing up to about 1400B.
>      rmgr: Gin         len (rec/tot):     48/    80, tx:       1813,
> lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
>      rmgr: Gin         len (rec/tot):     56/    88, tx:       1813,
> lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
>      rmgr: Gin         len (rec/tot):     64/    96, tx:       1813,
> lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
>      ...
>      rmgr: Gin         len (rec/tot):   1376/  1408, tx:       1813,
> lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
>      rmgr: Gin         len (rec/tot):   1392/  1424, tx:       1813,
> lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting
> tree, node: 1663/12945/16441 blkno: 4

This corresponds to the stage where the items are stored in-line in the 
entry-tree. After it reaches a certain size, a posting tree is created.

> Then the size decreases to about 100B and is gradually increasing
> again up to 320B.
>
>      rmgr: Gin         len (rec/tot):    116/   148, tx:       1813,
> lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
> 1372 (compressed)
>      rmgr: Gin         len (rec/tot):     40/    72, tx:       1813,
> lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
>      ...
>      rmgr: Gin         len (rec/tot):    118/   150, tx:       1813,
> lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length:
> 1374 (compressed)
>      ...
>      rmgr: Gin         len (rec/tot):    288/   320, tx:       1813,
> lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node:
> 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280
> length: 1544 (compressed)
>
> Then the size decreases to 66B and is gradually increasing again up to 320B.
> This increase and decrease of WAL size seems to continue.

Here the new items are appended to posting tree pages. This is where the 
maximum of 256 bytes I mentioned applies. 256 bytes is the maximum size 
of one compressed posting list; the WAL record containing it also 
includes some other data, which brings the total to the 320 bytes you saw.
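
To make that arithmetic concrete, here is a small sketch that breaks down the three "isdata: T" records quoted above. The numbers are copied from the pg_xlogdump output; the function name `breakdown` and the way I label the three parts are only illustrative assumptions, not terms from the GIN code.

```python
# (rec_len, tot_len, unmodified, length) copied from the quoted HEAD records.
records = [
    (116, 148, 1280, 1372),
    (118, 150, 1280, 1374),
    (288, 320, 1280, 1544),
]

def breakdown(rec_len, tot_len, unmodified, length):
    """Split one record into rewritten page bytes and two kinds of overhead."""
    rewritten = length - unmodified         # page bytes after "unmodified"
    payload_overhead = rec_len - rewritten  # rest of the record payload
    header_overhead = tot_len - rec_len     # difference between rec and tot
    return rewritten, payload_overhead, header_overhead

for r in records:
    print(r, "->", breakdown(*r))
```

In all three records the part beyond the rewritten page bytes comes to the same 24 + 32 bytes, so the growth from 148 to 320 bytes is entirely due to the rewritten posting list data getting longer.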

> [In 9.3]
> At first, the size of GIN-related WAL is gradually increasing up to about 2700B.
>
>      rmgr: Gin         len (rec/tot):     52/    84, tx:       1812,
> lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node:
> 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T
> isdelete F updateBlkno:4294967295
>      rmgr: Gin         len (rec/tot):     60/    92, tx:       1812,
> lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node:
> 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T
> isdelete T updateBlkno:4294967295
>      ...
>      rmgr: Gin         len (rec/tot):   2740/  2772, tx:       1812,
> lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node:
> 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T
> isdelete T updateBlkno:4294967295
>      rmgr: Gin         len (rec/tot):   2714/  2746, tx:       1812,
> lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting
> tree, node: 1663/12896/16441 blkno: 6
>
> The size decreases to 66B and then is never changed.

The same mechanism applies on 9.3, but there the insertions to the 
posting tree pages are of constant size.

>>> That could be optimized, but I figured we can live with it, thanks to the
>>> fastupdate feature. Fastupdate allows amortizing that cost over several
>>> insertions. But of course, you explicitly disabled that...
>>
>> Let me know if you want me to write patch addressing this issue.
>
> Yeah, I really want you to address this problem! That's definitely useful
> for every users disabling FASTUPDATE option for some reasons.

Ok, let's think about it a little bit. I think there are three fairly 
simple ways to address this:

1. The GIN data leaf "recompress" record contains an offset called 
"unmodifiedlength", and the data that comes after that offset. 
Currently, the record is written so that unmodifiedlength points to the 
end of the last compressed posting list stored on the page that was not 
modified, followed by all the modified ones. The straightforward way to 
cut down the WAL record size would be to be more fine-grained than that, 
and for the posting lists that were modified, only store the difference 
between the old and new version.

To make this approach work well for random insertions, not just 
appending to the end, we would also need to make the logic in 
leafRepackItems a bit smarter so that it would not re-encode all the 
posting lists after the first modified one.

2. Instead of storing the new compressed posting list in the WAL record, 
store only the new item pointers added to the page. WAL replay would 
then have to duplicate the work done in the main insertion code path: 
find the right posting lists to insert to, decode them, add the new 
items, and re-encode.

The upside of that would be that the WAL format would be very compact. 
It would be quite simple to implement - you just need to call the same 
functions we use in the main insertion code path to insert the new items. 
It could be more expensive, CPU-wise, to replay the records, however.

This record format would be higher-level, in the sense that we would not 
store the physical copy of the compressed posting list as it was formed 
originally. The same work would be done at WAL replay. As the code 
stands, it will produce exactly the same result, but that's not 
guaranteed if we make bugfixes to the code later, and a master and 
standby are running different minor versions. There's not necessarily 
anything wrong with that, but it's something to keep in mind.

3. Just reduce the GinPostingListSegmentMaxSize constant from 256 to, 
say 128. That would halve the typical size of a WAL record that appends 
to the end. However, it would not help with insertions in the middle of 
a posting list, only appends to the end, and it would bloat the pages 
somewhat, as you would waste more space on the posting list headers.
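
The trade-off in option 3 can be illustrated with a toy model. This is only a sketch: the 8-byte per-segment header (`HEADER_BYTES`), the 7440-byte payload, and the function `segment_cost` are assumptions chosen for illustration, and the model ignores alignment and everything else on the page.

```python
import math

HEADER_BYTES = 8  # assumed per-segment header size, for illustration only

def segment_cost(payload_bytes, seg_max):
    """Return (segments, header_bytes, worst_case_append_bytes) in a toy model.

    payload_bytes: compressed item data to store on the page
    seg_max: maximum size of one segment, header included
    """
    per_segment_payload = seg_max - HEADER_BYTES
    segments = math.ceil(payload_bytes / per_segment_payload)
    # An append rewrites at most the last segment, i.e. up to seg_max bytes.
    return segments, segments * HEADER_BYTES, seg_max

for seg_max in (256, 128):
    print(seg_max, segment_cost(7440, seg_max))
```

Under these assumptions, halving the segment size halves the worst-case amount of segment data an append has to log, while roughly doubling the space spent on segment headers (240 bytes versus 496 bytes for this payload).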


I'm leaning towards option 2. Alexander, what do you think?
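
To illustrate what option 2 means, here is a minimal sketch. It assumes a simplified posting list format (varbyte-encoded deltas between sorted integers standing in for item pointers) and a hypothetical record layout; `encode`, `decode` and `insert_items` are illustrative names, not the actual GIN functions.

```python
def encode(items):
    """Varbyte-encode the deltas between sorted, distinct non-negative ints."""
    out = bytearray()
    prev = 0
    for item in items:
        delta = item - prev
        prev = item
        while True:
            byte = delta & 0x7F
            delta >>= 7
            if delta:
                out.append(byte | 0x80)  # high bit set: more bytes follow
            else:
                out.append(byte)
                break
    return bytes(out)

def decode(data):
    """Inverse of encode(): rebuild the sorted list of integers."""
    items, value, shift, prev = [], 0, 0, 0
    for byte in data:
        value |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            prev += value
            items.append(prev)
            value, shift = 0, 0
    return items

def insert_items(segment, new_items):
    """Shared code path: decode, merge the new items, re-encode."""
    merged = sorted(set(decode(segment)) | set(new_items))
    return encode(merged)

# Primary: modify the page and log only the new items.
old_segment = encode([3, 10, 500, 70000])
new_items = [11, 600]
wal_record = {"new_items": new_items}  # hypothetical record layout
primary_page = insert_items(old_segment, new_items)

# Standby: replay by calling the same function on its copy of the page.
standby_page = insert_items(old_segment, wal_record["new_items"])
```

The two pages come out byte-for-byte identical only as long as `insert_items` behaves the same on both sides, which is exactly the caveat about different minor versions mentioned under option 2.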

- Heikki


