Thread: HEAD seems to generate larger WAL regarding GIN index
Hi,

I executed the following statements in HEAD and 9.3, and compared
the size of WAL which were generated by data insertion in GIN index.

---------------------
CREATE EXTENSION pg_trgm;
CREATE TABLE hoge (col1 text);
CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
(FASTUPDATE = off);

CHECKPOINT;
SELECT pg_switch_xlog();
SELECT pg_switch_xlog();

SELECT pg_current_xlog_location();
INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
SELECT pg_current_xlog_location();
---------------------

The results of WAL size are

    960 MB (9.3)
    2113 MB (HEAD)

The WAL size in HEAD was more than two times bigger than that in 9.3.
Recently the source code of GIN index has been changed dramatically.
Is the increase in GIN-related WAL intentional or a bug?

Regards,

--
Fujii Masao
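The WAL figures above are just the difference between the two pg_current_xlog_location() readings. A minimal sketch of that arithmetic (helper names are illustrative, not from the thread), using PostgreSQL's "hi/lo" hexadecimal LSN format, where the part before the slash is the high 32 bits and the part after is the low 32 bits:

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert an LSN like '0/2A7F458' to an absolute byte position."""
    hi, lo = lsn.split("/")
    return (int(hi, 16) << 32) | int(lo, 16)

def wal_generated(start: str, end: str) -> int:
    """WAL bytes written between two pg_current_xlog_location() readings."""
    return lsn_to_bytes(end) - lsn_to_bytes(start)

# Hypothetical readings one GiB apart:
print(wal_generated("0/02000000", "0/42000000") // (1024 * 1024), "MB")  # 1024 MB
```

(9.3 also provides pg_xlog_location_diff() to do this subtraction server-side.)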
On 03/15/2014 08:40 PM, Fujii Masao wrote:
> Hi,
>
> I executed the following statements in HEAD and 9.3, and compared
> the size of WAL which were generated by data insertion in GIN index.
>
> ---------------------
> CREATE EXTENSION pg_trgm;
> CREATE TABLE hoge (col1 text);
> CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
> (FASTUPDATE = off);
>
> CHECKPOINT;
> SELECT pg_switch_xlog();
> SELECT pg_switch_xlog();
>
> SELECT pg_current_xlog_location();
> INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
> SELECT pg_current_xlog_location();
> ---------------------
>
> The results of WAL size are
>
> 960 MB (9.3)
> 2113 MB (HEAD)
>
> The WAL size in HEAD was more than two times bigger than that in 9.3.
> Recently the source code of GIN index has been changed dramatically.
> Is the increase in GIN-related WAL intentional or a bug?

It was somewhat expected. Updating individual items on the new-format GIN
pages requires decompressing and recompressing the page, and the
recompressed posting lists need to be WAL-logged. Which generates much
larger WAL records.

That said, I didn't expect the difference to be quite that big when you're
appending to the end of the table. When the new entries go to the end of
the posting lists, you only need to recompress and WAL-log the last
posting list, which is max 256 bytes long. But I guess that's still a lot
more WAL than in the old format.

That could be optimized, but I figured we can live with it, thanks to the
fastupdate feature. Fastupdate allows amortizing that cost over several
insertions. But of course, you explicitly disabled that...

- Heikki
On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 03/15/2014 08:40 PM, Fujii Masao wrote:
>> Hi,
>>
>> I executed the following statements in HEAD and 9.3, and compared
>> the size of WAL which were generated by data insertion in GIN index.
>>
>> ---------------------
>> CREATE EXTENSION pg_trgm;
>> CREATE TABLE hoge (col1 text);
>> CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
>> (FASTUPDATE = off);
>>
>> CHECKPOINT;
>> SELECT pg_switch_xlog();
>> SELECT pg_switch_xlog();
>>
>> SELECT pg_current_xlog_location();
>> INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
>> SELECT pg_current_xlog_location();
>> ---------------------
>>
>> The results of WAL size are
>>
>> 960 MB (9.3)
>> 2113 MB (HEAD)
>>
>> The WAL size in HEAD was more than two times bigger than that in 9.3.
>> Recently the source code of GIN index has been changed dramatically.
>> Is the increase in GIN-related WAL intentional or a bug?
>
> It was somewhat expected. Updating individual items on the new-format GIN
> pages requires decompressing and recompressing the page, and the
> recompressed posting lists need to be WAL-logged. Which generates much
> larger WAL records.
>
> That said, I didn't expect the difference to be quite that big when
> you're appending to the end of the table. When the new entries go to the
> end of the posting lists, you only need to recompress and WAL-log the
> last posting list, which is max 256 bytes long. But I guess that's still
> a lot more WAL than in the old format.
>
> That could be optimized, but I figured we can live with it, thanks to the
> fastupdate feature. Fastupdate allows amortizing that cost over several
> insertions. But of course, you explicitly disabled that...

Let me know if you want me to write patch addressing this issue.

------
With best regards,
Alexander Korotkov.
On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
<aekorotkov@gmail.com> wrote:
> On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 03/15/2014 08:40 PM, Fujii Masao wrote:
>>> Hi,
>>>
>>> I executed the following statements in HEAD and 9.3, and compared
>>> the size of WAL which were generated by data insertion in GIN index.
>>>
>>> ---------------------
>>> CREATE EXTENSION pg_trgm;
>>> CREATE TABLE hoge (col1 text);
>>> CREATE INDEX hogeidx ON hoge USING gin (col1 gin_trgm_ops) WITH
>>> (FASTUPDATE = off);
>>>
>>> CHECKPOINT;
>>> SELECT pg_switch_xlog();
>>> SELECT pg_switch_xlog();
>>>
>>> SELECT pg_current_xlog_location();
>>> INSERT INTO hoge SELECT 'POSTGRESQL' FROM generate_series(1, 1000000);
>>> SELECT pg_current_xlog_location();
>>> ---------------------
>>>
>>> The results of WAL size are
>>>
>>> 960 MB (9.3)
>>> 2113 MB (HEAD)
>>>
>>> The WAL size in HEAD was more than two times bigger than that in 9.3.
>>> Recently the source code of GIN index has been changed dramatically.
>>> Is the increase in GIN-related WAL intentional or a bug?
>>
>> It was somewhat expected. Updating individual items on the new-format GIN
>> pages requires decompressing and recompressing the page, and the
>> recompressed posting lists need to be WAL-logged. Which generates much
>> larger WAL records.
>>
>> That said, I didn't expect the difference to be quite that big when you're
>> appending to the end of the table. When the new entries go to the end of the
>> posting lists, you only need to recompress and WAL-log the last posting
>> list, which is max 256 bytes long. But I guess that's still a lot more WAL
>> than in the old format.

I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
and then found its max seems to be more than 256B. Am I missing something?

What I observed is:

[In HEAD]
At first, the size of GIN-related WAL gradually increases up to about 1400B.

rmgr: Gin   len (rec/tot):    48/    80, tx: 1813, lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
rmgr: Gin   len (rec/tot):    56/    88, tx: 1813, lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
rmgr: Gin   len (rec/tot):    64/    96, tx: 1813, lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
...
rmgr: Gin   len (rec/tot):  1376/  1408, tx: 1813, lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
rmgr: Gin   len (rec/tot):  1392/  1424, tx: 1813, lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting tree, node: 1663/12945/16441 blkno: 4

Then the size decreases to about 100B and gradually increases again up to 320B.

rmgr: Gin   len (rec/tot):   116/   148, tx: 1813, lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1372 (compressed)
rmgr: Gin   len (rec/tot):    40/    72, tx: 1813, lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
...
rmgr: Gin   len (rec/tot):   118/   150, tx: 1813, lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1374 (compressed)
...
rmgr: Gin   len (rec/tot):   288/   320, tx: 1813, lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280 length: 1544 (compressed)

Then the size decreases to 66B and gradually increases again up to 320B.
This increase and decrease of the WAL size seems to continue.

[In 9.3]
At first, the size of GIN-related WAL gradually increases up to about 2700B.

rmgr: Gin   len (rec/tot):    52/    84, tx: 1812, lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T isdelete F updateBlkno:4294967295
rmgr: Gin   len (rec/tot):    60/    92, tx: 1812, lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
...
rmgr: Gin   len (rec/tot):  2740/  2772, tx: 1812, lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
rmgr: Gin   len (rec/tot):  2714/  2746, tx: 1812, lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting tree, node: 1663/12896/16441 blkno: 6

The size then decreases to 66B and never changes again.

rmgr: Gin   len (rec/tot):    34/    66, tx: 1812, lsn: 0/026D9F00, prev 0/026D9EB8, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 6 offset: 451 nitem: 1 isdata: T isleaf T isdelete F updateBlkno:4294967295
rmgr: Gin   len (rec/tot):    34/    66, tx: 1812, lsn: 0/026D9F48, prev 0/026D9F00, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 7 offset: 451 nitem: 1 isdata: T isleaf T isdelete F updateBlkno:4294967295
rmgr: Gin   len (rec/tot):    34/    66, tx: 1812, lsn: 0/026D9F90, prev 0/026D9F48, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 8 offset: 451 nitem: 1 isdata: T isleaf T isdelete F updateBlkno:4294967295
...

This difference in GIN-related WAL size seems to be what causes HEAD to
generate more than two times bigger WAL. Unfortunately the gap in WAL size
would keep increasing :(

>> That could be optimized, but I figured we can live with it, thanks to the
>> fastupdate feature. Fastupdate allows amortizing that cost over several
>> insertions. But of course, you explicitly disabled that...
>
> Let me know if you want me to write patch addressing this issue.

Yeah, I really want you to address this problem! That's definitely useful
for all users who disable FASTUPDATE for some reason.

Regards,

--
Fujii Masao
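The per-record sizes above were read off pg_xlogdump output by eye; a rough sketch of summing the Gin totals from such output (the regex and column layout are assumed from the sample lines above, not from any pg_xlogdump contract):

```python
import re

# Matches the "len (rec/tot):   48/    80" part of a pg_xlogdump line.
LEN_RE = re.compile(r"len \(rec/tot\):\s*(\d+)/\s*(\d+)")

def total_gin_wal(lines):
    """Sum the total lengths of Gin-related records in pg_xlogdump output."""
    total = 0
    for line in lines:
        if not line.startswith("rmgr: Gin"):
            continue
        m = LEN_RE.search(line)
        if m:
            total += int(m.group(2))  # total length, incl. header/backup blocks
    return total

sample = [
    "rmgr: Gin   len (rec/tot):    48/    80, tx: 1813, lsn: 0/020020D8, ...",
    "rmgr: Heap  len (rec/tot):    35/    67, tx: 1813, lsn: 0/02002128, ...",
    "rmgr: Gin   len (rec/tot):  1376/  1408, tx: 1813, lsn: 0/02A7EE90, ...",
]
print(total_gin_wal(sample))  # 1488
```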
Fujii Masao wrote:
> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
> >> That could be optimized, but I figured we can live with it, thanks to the
> >> fastupdate feature. Fastupdate allows amortizing that cost over several
> >> insertions. But of course, you explicitly disabled that...
> >
> > Let me know if you want me to write patch addressing this issue.
>
> Yeah, I really want you to address this problem! That's definitely useful
> for all users who disable FASTUPDATE for some reason.

Users that disable FASTUPDATE, in my experience, do so because their
stock work_mem is way too high and GIN searches become too slow due to
having to scan too large a list.  I think it might make sense to invest
a modest amount of time in getting FASTUPDATE to be sized completely
differently from today -- perhaps base it on a hardcoded factor of
BLCKSZ, rather than work_mem.  Or, if we really need to make it
configurable, then let it have its own parameter.

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
On 03/17/2014 03:20 PM, Fujii Masao wrote:
> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
> <aekorotkov@gmail.com> wrote:
>> On Sat, Mar 15, 2014 at 11:27 PM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>
> I ran "pg_xlogdump | grep Gin" and checked the size of GIN-related WAL,
> and then found its max seems to be more than 256B. Am I missing something?
>
> What I observed is:
>
> [In HEAD]
> At first, the size of GIN-related WAL gradually increases up to about 1400B.
>
> rmgr: Gin   len (rec/tot):    48/    80, tx: 1813, lsn: 0/020020D8, prev 0/02000070, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: F
> rmgr: Gin   len (rec/tot):    56/    88, tx: 1813, lsn: 0/02002440, prev 0/020023F8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
> rmgr: Gin   len (rec/tot):    64/    96, tx: 1813, lsn: 0/020044D8, prev 0/02004490, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 1 isdata: F isleaf: T isdelete: T
> ...
> rmgr: Gin   len (rec/tot):  1376/  1408, tx: 1813, lsn: 0/02A7EE90, prev 0/02A7E910, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 2 isdata: F isleaf: T isdelete: T
> rmgr: Gin   len (rec/tot):  1392/  1424, tx: 1813, lsn: 0/02A7F458, prev 0/02A7F410, bkp: 0000, desc: Create posting tree, node: 1663/12945/16441 blkno: 4

This corresponds to the stage where the items are stored in-line in the
entry tree. After it reaches a certain size, a posting tree is created.

> Then the size decreases to about 100B and gradually increases again up to 320B.
>
> rmgr: Gin   len (rec/tot):   116/   148, tx: 1813, lsn: 0/02A7F9E8, prev 0/02A7F458, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1372 (compressed)
> rmgr: Gin   len (rec/tot):    40/    72, tx: 1813, lsn: 0/02A7FA80, prev 0/02A7F9E8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 3 isdata: F isleaf: T isdelete: T
> ...
> rmgr: Gin   len (rec/tot):   118/   150, tx: 1813, lsn: 0/02A83BA0, prev 0/02A83B58, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 4 isdata: T isleaf: T unmodified: 1280 length: 1374 (compressed)
> ...
> rmgr: Gin   len (rec/tot):   288/   320, tx: 1813, lsn: 0/02AEDE28, prev 0/02AEDCE8, bkp: 0000, desc: Insert item, node: 1663/12945/16441 blkno: 14 isdata: T isleaf: T unmodified: 1280 length: 1544 (compressed)
>
> Then the size decreases to 66B and gradually increases again up to 320B.
> This increase and decrease of the WAL size seems to continue.

Here the new items are appended to posting tree pages. This is where the
maximum of 256 bytes I mentioned applies. 256 bytes is the max size of one
compressed posting list; the WAL record containing it includes some other
stuff too, which adds up to that 320 bytes.

> [In 9.3]
> At first, the size of GIN-related WAL gradually increases up to about 2700B.
>
> rmgr: Gin   len (rec/tot):    52/    84, tx: 1812, lsn: 0/02000430, prev 0/020003D8, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 11 nitem: 1 isdata: F isleaf T isdelete F updateBlkno:4294967295
> rmgr: Gin   len (rec/tot):    60/    92, tx: 1812, lsn: 0/020004D0, prev 0/02000488, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 1 offset: 1 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
> ...
> rmgr: Gin   len (rec/tot):  2740/  2772, tx: 1812, lsn: 0/026D1670, prev 0/026D0B98, bkp: 0000, desc: Insert item, node: 1663/12896/16441 blkno: 5 offset: 2 nitem: 1 isdata: F isleaf T isdelete T updateBlkno:4294967295
> rmgr: Gin   len (rec/tot):  2714/  2746, tx: 1812, lsn: 0/026D21A8, prev 0/026D2160, bkp: 0000, desc: Create posting tree, node: 1663/12896/16441 blkno: 6
>
> The size then decreases to 66B and never changes again.

Same mechanism on 9.3, but the insertions to the posting tree pages are
constant size.

>>> That could be optimized, but I figured we can live with it, thanks to the
>>> fastupdate feature. Fastupdate allows amortizing that cost over several
>>> insertions. But of course, you explicitly disabled that...
>>
>> Let me know if you want me to write patch addressing this issue.
>
> Yeah, I really want you to address this problem! That's definitely useful
> for all users who disable FASTUPDATE for some reason.

Ok, let's think about it a little bit. I think there are three fairly
simple ways to address this:

1. The GIN data leaf "recompress" record contains an offset called
"unmodifiedlength", and the data that comes after that offset. Currently,
the record is written so that unmodifiedlength points to the end of the
last compressed posting list on the page that was not modified, followed
by all the modified ones. The straightforward way to cut down the WAL
record size would be to be more fine-grained than that, and for the
posting lists that were modified, only store the difference between the
old and new versions. To make this approach work well for random
insertions, not just appending to the end, we would also need to make the
logic in leafRepackItems a bit smarter, so that it would not re-encode all
the posting lists after the first modified one.

2. Instead of storing the new compressed posting list in the WAL record,
store only the new item pointers added to the page. WAL replay would then
have to duplicate the work done in the main insertion code path: find the
right posting lists to insert to, decode them, add the new items, and
re-encode. The upside of that would be that the WAL format would be very
compact. It would be quite simple to implement - you just need to call the
same functions we use in the main insertion codepath to insert the new
items. It could be more expensive, CPU-wise, to replay the records,
however.

This record format would be higher-level, in the sense that we would not
store the physical copy of the compressed posting list as it was formed
originally. The same work would be done at WAL replay. As the code stands,
it will produce exactly the same result, but that's not guaranteed if we
make bugfixes to the code later, and a master and standby are running
different minor versions. There's not necessarily anything wrong with
that, but it's something to keep in mind.

3. Just reduce the GinPostingListSegmentMaxSize constant from 256 to, say,
128. That would halve the typical size of a WAL record that appends to the
end. However, it would not help with insertions in the middle of a posting
list, only appends to the end, and it would bloat the pages somewhat, as
you would waste more space on the posting list headers.

I'm leaning towards option 2. Alexander, what do you think?

- Heikki
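A toy model of option 2's idea (illustrative only, not the actual C code): a "page" is modeled as segments of sorted integers, the WAL record carries just the new items, and replay re-runs the same merge-and-split logic. Master and standby then converge only as long as both run identical re-encoding code, which is exactly the version-skew concern raised above:

```python
def insert_items(segments, new_items, max_seg=256):
    """Flatten the page, merge in the new items, and re-split into segments
    of up to max_seg items (a stand-in for the 256-byte posting-list cap)."""
    items = sorted(set(x for seg in segments for x in seg) | set(new_items))
    return [items[i:i + max_seg] for i in range(0, len(items), max_seg)]

# Master applies the insert ...
master = insert_items([[1, 2, 3], [10, 11]], [4, 12])
# ... and the standby replays the same record with the same code:
standby = insert_items([[1, 2, 3], [10, 11]], [4, 12])
assert master == standby  # holds only while both sides run identical logic
```

If a later minor release changed how insert_items splits segments, a standby replaying old WAL would reconstruct a physically different page, which is the hazard discussed in the replies below.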
Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> 2. Instead of storing the new compressed posting list in the WAL record,
> store only the new item pointers added to the page. WAL replay would
> then have to duplicate the work done in the main insertion code path:
> find the right posting lists to insert to, decode them, add the new
> items, and re-encode.

That sounds fairly dangerous ... is any user-defined code involved in
those decisions?

> This record format would be higher-level, in the sense that we would not
> store the physical copy of the compressed posting list as it was formed
> originally. The same work would be done at WAL replay. As the code
> stands, it will produce exactly the same result, but that's not
> guaranteed if we make bugfixes to the code later, and a master and
> standby are running different minor versions. There's not necessarily
> anything wrong with that, but it's something to keep in mind.

Version skew would be a hazard too, all right.  I think it's important
that WAL replay be a pretty mechanical, predictable process.

			regards, tom lane
On 03/17/2014 04:33 PM, Tom Lane wrote:
> Heikki Linnakangas <hlinnakangas@vmware.com> writes:
>> 2. Instead of storing the new compressed posting list in the WAL record,
>> store only the new item pointers added to the page. WAL replay would
>> then have to duplicate the work done in the main insertion code path:
>> find the right posting lists to insert to, decode them, add the new
>> items, and re-encode.
>
> That sounds fairly dangerous ... is any user-defined code involved in
> those decisions?

No.

>> This record format would be higher-level, in the sense that we would not
>> store the physical copy of the compressed posting list as it was formed
>> originally. The same work would be done at WAL replay. As the code
>> stands, it will produce exactly the same result, but that's not
>> guaranteed if we make bugfixes to the code later, and a master and
>> standby are running different minor versions. There's not necessarily
>> anything wrong with that, but it's something to keep in mind.
>
> Version skew would be a hazard too, all right.  I think it's important
> that WAL replay be a pretty mechanical, predictable process.

Yeah. One particular point to note is that if in one place we do the more
"high level" thing and have WAL replay re-encode the page as it sees fit,
then we can *not* rely on the page being byte-by-byte identical in other
places. Like in vacuum, where items are deleted.

Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
the page, instead of making a physical copy of the modified parts. And
_bt_restore_page even inserts the items physically in a different order
than the normal codepath does. So for good or bad, there is some precedent
for this.

The imminent danger I see is if we change the logic on how the items are
divided into posting lists, and end up in a situation where a master
server adds an item to a page, and it just fits, but with the compression
logic the standby version has, it cannot make it fit.

As an escape hatch for that, we could have the WAL replay code try the
compression again, with a larger max. posting list size, if it doesn't fit
at first. And/or always leave something like 10 bytes of free space on
every data page to make up for small differences in the logic.

- Heikki
On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
> the page, instead of making a physical copy of the modified parts. And
> _bt_restore_page even inserts the items physically in a different order
> than the normal codepath does. So for good or bad, there is some precedent
> for this.

Yikes.

> The imminent danger I see is if we change the logic on how the items are
> divided into posting lists, and end up in a situation where a master
> server adds an item to a page, and it just fits, but with the compression
> logic the standby version has, it cannot make it fit. As an escape hatch
> for that, we could have the WAL replay code try the compression again,
> with a larger max. posting list size, if it doesn't fit at first. And/or
> always leave something like 10 bytes of free space on every data page to
> make up for small differences in the logic.

That scares the crap out of me. I don't see any intrinsic problem with
relying on the existing page contents to figure out how to roll forward,
as PageAddItem does; after all, we do FPIs precisely so that the page is
in a known good state when we start. However, I really think we ought to
try hard to make this deterministic in terms of what the resulting state
of the page is; anything else seems like it's playing with fire, and I
bet we'll get burned sooner rather than later.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Robert Haas <robertmhaas@gmail.com> writes:
> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> Heap and B-tree WAL records also rely on PageAddItem etc. to reconstruct
>> the page, instead of making a physical copy of the modified parts. And
>> _bt_restore_page even inserts the items physically in a different order
>> than the normal codepath does. So for good or bad, there is some
>> precedent for this.

> Yikes.

Yeah.  I think it's arguably a bug that _bt_restore_page works like that,
even though we've not been burnt up to now.

>> The imminent danger I see is if we change the logic on how the items are
>> divided into posting lists, and end up in a situation where a master
>> server adds an item to a page, and it just fits, but with the compression
>> logic the standby version has, it cannot make it fit. As an escape hatch
>> for that, we could have the WAL replay code try the compression again,
>> with a larger max. posting list size, if it doesn't fit at first. And/or
>> always leave something like 10 bytes of free space on every data page to
>> make up for small differences in the logic.

> That scares the crap out of me.

Likewise.  Saving some WAL space is *not* worth this kind of risk.

			regards, tom lane
On 03/17/2014 05:35 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> The imminent danger I see is if we change the logic on how the items are
>>> divided into posting lists, and end up in a situation where a master
>>> server adds an item to a page, and it just fits, but with the compression
>>> logic the standby version has, it cannot make it fit. As an escape hatch
>>> for that, we could have the WAL replay code try the compression again,
>>> with a larger max. posting list size, if it doesn't fit at first. And/or
>>> always leave something like 10 bytes of free space on every data page to
>>> make up for small differences in the logic.
>
>> That scares the crap out of me.
>
> Likewise.  Saving some WAL space is *not* worth this kind of risk.

One fairly good compromise would be to only include the new items, not the
whole modified compressed lists, and let the replay logic do the
re-encoding of the posting lists. But also include the cutoff points of
each posting list in the WAL record. That way the replay code would have
no freedom in how it decides to split the items into compressed lists;
that would be fully specified by the WAL record.

Here's a refresher for those who didn't follow the development of the new
page format: The data page basically contains a list of ItemPointers. The
items are compressed, to save disk space. However, to make random access
faster, all the items on the page are not compressed as one big list.
Instead, the big array of items is split into roughly equal chunks, and
each chunk is compressed separately. The chunks are stored on the page one
after another. (The chunks are called "posting lists" in the code; the
struct is called GinPostingListData.)

The compression is completely deterministic (each item is stored as a
varbyte-encoded delta from the previous item), but there are no hard rules
on how the items on the page ought to be divided into the posting lists.
Currently, the code tries to maintain a max size of 256 bytes per list -
but it will cope with any size it finds on disk. This is where the danger
lies: we could end up with a different physical page after WAL replay, if
we just include the new items in the WAL record. The WAL replay might
decide to split the items into posting lists differently than was
originally done. (As the code stands, it would always make the same
decision, completely deterministically, but that might change in a minor
version if we're not careful.)

We can tie WAL replay's hands about that, if we include a list of the
items that form the posting lists in the WAL record. That adds some bloat,
compared to only including the new items, but not too much. (And we still
only need to do that for posting lists following the first modified one.)

Alexander, would you like to give that a shot, or will I?

- Heikki
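A simplified sketch of the varbyte delta encoding described above (illustrative only: the real code encodes 48-bit ItemPointers and caps each segment at 256 encoded bytes, not at an item count):

```python
def encode_varbyte(value):
    """Encode a non-negative integer as a varbyte sequence:
    7 data bits per byte, high bit set on every byte except the last."""
    out = bytearray()
    while True:
        byte = value & 0x7F
        value >>= 7
        if value:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_posting_list(items):
    """Delta-compress a sorted list of item numbers (a toy stand-in for
    the ItemPointers in a GinPostingListData segment)."""
    out = bytearray()
    prev = 0
    for item in items:
        out += encode_varbyte(item - prev)  # deltas are small, so bytes are few
        prev = item
    return bytes(out)

# Dense ascending items compress to roughly one byte each:
print(len(encode_posting_list(list(range(1, 201)))))  # 200
```

The encoding itself is fully deterministic; the only degree of freedom, as noted above, is where the segment boundaries are drawn.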
On Mon, Mar 17, 2014 at 10:44 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Fujii Masao wrote:
>> On Sun, Mar 16, 2014 at 7:15 AM, Alexander Korotkov
>> <aekorotkov@gmail.com> wrote:
>> >> That could be optimized, but I figured we can live with it, thanks to the
>> >> fastupdate feature. Fastupdate allows amortizing that cost over several
>> >> insertions. But of course, you explicitly disabled that...
>> >
>> > Let me know if you want me to write patch addressing this issue.
>>
>> Yeah, I really want you to address this problem! That's definitely useful
>> for all users who disable FASTUPDATE for some reason.
>
> Users that disable FASTUPDATE, in my experience, do so because their
> stock work_mem is way too high and GIN searches become too slow due to
> having to scan too large a list.

Yes. Another reason that I've heard from users so far is that the size of
a GIN index with FASTUPDATE=off is likely to be smaller than that with
FASTUPDATE=on.

> I think it might make sense to invest
> a modest amount of time in getting FASTUPDATE to be sized completely
> differently from today -- perhaps base it on a hardcoded factor of
> BLCKSZ, rather than work_mem.  Or, if we really need to make it
> configurable, then let it have its own parameter.

I prefer to have the parameter. When users create multiple GIN indexes
for various uses, they might want to use different thresholds of the
pending list for each index. So, a GIN index parameter might be better
than a GUC one.

Regards,

--
Fujii Masao
On 15/03/14 20:27, Heikki Linnakangas wrote:
> That said, I didn't expect the difference to be quite that big when
> you're appending to the end of the table. When the new entries go to
> the end of the posting lists, you only need to recompress and WAL-log
> the last posting list, which is max 256 bytes long. But I guess that's
> still a lot more WAL than in the old format.
>
> That could be optimized, but I figured we can live with it, thanks to
> the fastupdate feature. Fastupdate allows amortizing that cost over
> several insertions. But of course, you explicitly disabled that...

In a concurrent update environment, fastupdate as it is in 9.2 is not
really useful. It may be that you can bulk up insertion, but you have no
control over who ends up paying the debt. Doubling the amount of WAL from
GIN indexing would be pretty tough for us; in 9.2 we generate roughly 1 TB
of WAL per day, keeping it for some weeks to be able to do PITR. The WAL
is mainly due to GIN index updates as new data is added and needs to be
searchable by users. We do run gzip, which cuts it down to 25-30% before
we keep it for the longer term, but doubling this is going to be a
migration challenge.

If fastupdate could be made to work in an environment where we both have
users searching the index and manually updating it, and 4+ backend
processes updating the index concurrently, then it would be a good
benefit to gain.

The GIN index currently contains 70+ million records with an average
tsvector of 124 terms.

--
Jesper .. trying to add some real-world info
I came up with the attached patch, to reduce the WAL volume of GIN
insertions. It became fairly large, but I guess that's not too surprising,
as the old WAL-logging method was basically to dump the whole page into
the WAL record. This is now a lot more fine-grained and smarter. I
separated constructing the WAL record from copying the changes back to the
disk page, which IMHO is a readability improvement even though it's more
code.

There are two parts to this patch:

* leafRepackItems has been rewritten. The previous coding basically
searched for the first modified item, and decoded and re-encoded
everything on the page after that. Now it tries harder to avoid
re-encoding segments that are still reasonably sized (between 128 and 384
bytes, with the target for new segments being 256 bytes). This ought to
make random updates faster as a bonus, but I didn't performance test that.

* Track more carefully which segments on the page have been modified. The
in-memory structure used to manipulate a page now keeps an action code for
each segment, indicating if the segment is completely new, deleted, or
replaced with new content, or if just some new items have been added to
it. These same actions are WAL-logged, and replayed in the redo routine.

This brings the WAL volume back into the same ballpark as 9.3. Or better,
depending on the operation.

Fujii, Alexander, how does this look to you?

- Heikki
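A rough model of the per-segment action codes described above, and of why "add items" records stay small when appending. The action names and byte costs here are placeholders chosen to mirror the description, not the patch's actual identifiers:

```python
from enum import Enum

class SegmentAction(Enum):
    UNMODIFIED = 0   # not included in the WAL record at all
    DELETE = 1       # segment removed from the page
    INSERT = 2       # brand-new segment, full content logged
    REPLACE = 3      # segment re-encoded, full new content logged
    ADDITEMS = 4     # only the appended items logged

def wal_payload_size(actions, seg_size=256, item_size=6):
    """Rough WAL payload for one page update: full segment content for
    INSERT/REPLACE, only the new items for ADDITEMS, one byte for DELETE."""
    total = 0
    for action, nitems in actions:
        if action in (SegmentAction.INSERT, SegmentAction.REPLACE):
            total += seg_size
        elif action is SegmentAction.ADDITEMS:
            total += nitems * item_size
        elif action is SegmentAction.DELETE:
            total += 1
    return total

# Appending a few items touches only the last segment of the page:
print(wal_payload_size([(SegmentAction.UNMODIFIED, 0)] * 5 +
                       [(SegmentAction.ADDITEMS, 3)]))  # 18
```

Under this model, the common append case logs a handful of bytes per insertion instead of a full re-encoded segment, which matches the claim that the patch brings the WAL volume back into the 9.3 ballpark.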
Attachments
On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote: > On 15/03/14 20:27, Heikki Linnakangas wrote: >> That said, I didn't expect the difference to be quite that big when you're >> appending to the end of the table. When the new entries go to the end of the >> posting lists, you only need to recompress and WAL-log the last posting >> list, which is max 256 bytes long. But I guess that's still a lot more WAL >> than in the old format. >> >> That could be optimized, but I figured we can live with it, thanks to the >> fastupdate feature. Fastupdate allows amortizing that cost over several >> insertions. But of course, you explicitly disabled that... > > In a concurrent update environment, fastupdate as it is in 9.2 is not really > useful. It may be that you can bulk up insertion, but you have no control > over who ends up paying the debt. Doubling the amount of wal from > gin-indexing would be pretty tough for us, in 9.2 we generate roughly 1TB > wal / day, keeping it > for some weeks to be able to do PITR. The wal are mainly due to gin-index > updates as new data is added and needs to be searchable by users. We do run > gzip that cuts it down to 25-30% before keeping the for too long, but > doubling this is going to be a migration challenge. > > If fast-update could be made to work in an environment where we both have > users searching the index and manually updating it and 4+ backend processes > updating the index concurrently then it would be a good benefit to gain. > > the gin index currently contains 70+ million records with and average > tsvector of 124 terms. Should we try to install some hack around fastupdate for 9.4? I fear the divergence between reasonable values of work_mem and reasonable sizes for that list is only going to continue to get bigger. I'm sure there's somebody out there who has work_mem = 16GB, and stuff like 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the appeal of large values. 
-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
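Jesper's and Robert's point about amortization can be made concrete with a back-of-envelope model. Aside from the 256-byte segment cap and the 124-term average tsvector, which come from the thread, all constants below are invented for illustration; in particular, the 40-byte pending-list entry cost is an assumption, not a measured value:

```python
# Back-of-envelope model of why fastupdate amortizes GIN WAL volume:
# without it, every heap insert re-logs one recompressed posting-list
# segment per key it touches; with it, inserts append cheap pending-list
# entries and each touched segment is logged once per batch merge.

SEGMENT_BYTES = 256        # max recompressed segment size (from the thread)
KEYS_PER_ROW = 124         # Jesper's average tsvector length
ROWS = 1_000_000

def wal_without_fastupdate():
    # one recompressed segment logged per (row, key) pair
    return ROWS * KEYS_PER_ROW * SEGMENT_BYTES

def wal_with_fastupdate(batch_rows, pending_entry_bytes=40):
    # pending-list appends are small WAL records; merges pay the
    # per-key segment cost only once per batch (assumed entry size)
    pending = ROWS * KEYS_PER_ROW * pending_entry_bytes
    merges = (ROWS // batch_rows) * KEYS_PER_ROW * SEGMENT_BYTES
    return pending + merges

# Larger batches mean fewer per-key segment writes, hence less WAL.
assert wal_with_fastupdate(1000) < wal_without_fastupdate()
```

This is also why disabling fastupdate, as in Fujii's test, exposes the full per-insertion recompression cost, and why whoever triggers the pending-list flush "pays the debt" for everyone, as Jesper complains.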
On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Mar 20, 2014 at 1:12 PM, Jesper Krogh <jesper@krogh.cc> wrote:
>> On 15/03/14 20:27, Heikki Linnakangas wrote:
>>> That said, I didn't expect the difference to be quite that big when
>>> you're appending to the end of the table. When the new entries go to
>>> the end of the posting lists, you only need to recompress and WAL-log
>>> the last posting list, which is max 256 bytes long. But I guess that's
>>> still a lot more WAL than in the old format.
>>>
>>> That could be optimized, but I figured we can live with it, thanks to
>>> the fastupdate feature. Fastupdate allows amortizing that cost over
>>> several insertions. But of course, you explicitly disabled that...
>>
>> In a concurrent update environment, fastupdate as it is in 9.2 is not
>> really useful. It may be that you can bulk up insertion, but you have no
>> control over who ends up paying the debt. Doubling the amount of WAL from
>> GIN indexing would be pretty tough for us: in 9.2 we generate roughly 1TB
>> of WAL per day, keeping it for some weeks to be able to do PITR. The WAL
>> is mainly due to GIN index updates as new data is added and needs to be
>> searchable by users. We do run gzip, which cuts it down to 25-30% before
>> we keep it for long, but doubling this is going to be a migration
>> challenge.
>>
>> If fastupdate could be made to work in an environment where we both have
>> users searching the index and manually updating it, and 4+ backend
>> processes updating the index concurrently, then it would be a good
>> benefit to gain.
>>
>> The GIN index currently contains 70+ million records with an average
>> tsvector of 124 terms.
>
> Should we try to install some hack around fastupdate for 9.4? I fear
> the divergence between reasonable values of work_mem and reasonable
> sizes for that list is only going to continue to get bigger. I'm sure
> there's somebody out there who has work_mem = 16GB, and stuff like
> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
> appeal of large values.

Controlling the threshold of the pending-list size only by a GUC doesn't
seem reasonable. Users may want to increase the threshold only for GIN
indexes which are updated heavily, and decrease it otherwise. So I think
it's better to add a new storage parameter for GIN indexes to control the
threshold, or both a storage parameter and a GUC.

Regards,

-- 
Fujii Masao
Fujii Masao <masao.fujii@gmail.com> writes:
> On Tue, Apr 1, 2014 at 1:41 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> Should we try to install some hack around fastupdate for 9.4? I fear
>> the divergence between reasonable values of work_mem and reasonable
>> sizes for that list is only going to continue to get bigger. I'm sure
>> there's somebody out there who has work_mem = 16GB, and stuff like
>> 263865a48973767ce8ed7b7788059a38a24a9f37 is only going to increase the
>> appeal of large values.

> Controlling the threshold of the pending-list size only by a GUC doesn't
> seem reasonable. Users may want to increase the threshold only for GIN
> indexes which are updated heavily, and decrease it otherwise. So I think
> it's better to add a new storage parameter for GIN indexes to control the
> threshold, or both a storage parameter and a GUC.

Yeah, -1 for a GUC. A GIN-index-specific storage parameter seems more
appropriate. Or we could just hard-wire some maximum limit. Is it really
likely that users would take the trouble to set such a parameter if it
existed?

			regards, tom lane
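The precedence Fujii proposes mirrors how existing reloptions such as fillfactor interact with server-wide defaults: the per-index setting, if present, wins. A minimal sketch of that lookup, assuming a hypothetical reloption name "pending_list_limit" (no such parameter existed at the time of this thread):

```python
# Hypothetical per-index override of a global pending-list threshold.
# "pending_list_limit" is an invented reloption name for illustration.

def effective_pending_list_limit(guc_kb, index_reloptions):
    """index_reloptions: dict modeling CREATE INDEX ... WITH (...) options.
    The per-index value, if set, overrides the global (GUC-style) default."""
    return index_reloptions.get("pending_list_limit", guc_kb)

# heavily-updated index gets a bigger pending list; others keep the default
assert effective_pending_list_limit(4096, {"pending_list_limit": 16384}) == 16384
assert effective_pending_list_limit(4096, {}) == 4096
```

Tom's alternative, a hard-wired maximum, would correspond to dropping the dict lookup entirely and clamping to a compile-time constant.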