Re: HEAD seems to generate larger WAL regarding GIN index

From: Heikki Linnakangas
Subject: Re: HEAD seems to generate larger WAL regarding GIN index
Date:
Msg-id: 53271B14.2030206@vmware.com
In reply to: Re: HEAD seems to generate larger WAL regarding GIN index  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: HEAD seems to generate larger WAL regarding GIN index  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
On 03/17/2014 05:35 PM, Tom Lane wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> On Mon, Mar 17, 2014 at 10:54 AM, Heikki Linnakangas
>> <hlinnakangas@vmware.com> wrote:
>>> The imminent danger I see is if we change the logic on how the items are
>>> divided into posting lists, and end up in a situation where a master server
>>> adds an item to a page, and it just fits, but with the compression logic the
>>> standby version has, it cannot make it fit. As an escape hatch for that, we
>>> could have the WAL replay code try the compression again, with a larger max.
>>> posting list size, if it doesn't fit at first. And/or always leave something
>>> like 10 bytes of free space on every data page to make up for small
>>> differences in the logic.
>
>> That scares the crap out of me.
>
> Likewise.  Saving some WAL space is *not* worth this kind of risk.

One fairly good compromise would be to only include the new items, not 
the whole modified posting lists, and let the replay logic do the 
re-encoding of the posting lists. But also include the cutoff points of 
each posting list in the WAL record. That way the replay code would have 
no freedom in how it decides to split the items into compressed lists; 
that would be fully specified by the WAL record.
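
To make that concrete, the WAL record for a data leaf insertion could 
then carry roughly the following payload. This is only a sketch; the 
struct and field names are hypothetical, not the actual ginxlog.h 
definitions:

#include "postgres.h"
#include "storage/itemptr.h"

/*
 * Hypothetical WAL payload for inserting items into a GIN data leaf
 * page.  Instead of logging the re-encoded posting lists themselves,
 * we log only the newly added items plus the cutoff (last) item of
 * each posting list, so that replay must split the page contents at
 * exactly the boundaries the master chose.
 */
typedef struct
{
    uint16      nnewitems;  /* number of new items inserted */
    uint16      nlists;     /* number of posting lists on the page */

    /* followed by:
     *   ItemPointerData newitems[nnewitems];  the items being added
     *   ItemPointerData cutoffs[nlists];      last item of each list
     */
} ginxlogInsertDataLeafSketch;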

Here's a refresher for those who didn't follow the development of the 
new page format: The data page basically contains a list of 
ItemPointers. The items are compressed to save disk space. However, to 
make random access faster, the items on the page are not all compressed 
as one big list. Instead, the big array of items is split into roughly 
equal chunks, and each chunk is compressed separately. The chunks are 
stored on the page one after another. (The chunks are called "posting 
lists" in the code; the struct is called GinPostingListData.)

The compression is completely deterministic (each item is stored as a 
varbyte-encoded delta from the previous item), but there are no hard 
rules on how the items on the page ought to be divided into the posting 
lists. Currently, the code tries to maintain a max size of 256 bytes per 
list - but it will cope with any size it finds on disk. This is where 
the danger lies: if we just include the new items in the WAL record, we 
could end up with a different physical page after WAL replay, because 
the replay code might decide to split the items into posting lists 
differently than was originally done. (As the code stands, it would 
always make the same decision, completely deterministically, but that 
might change in a minor version if we're not careful.)
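
To make "varbyte-encoded delta" concrete, the encoding is essentially the 
following. This is a simplified sketch of the idea only, not the exact 
code in ginpostinglist.c (the real code first maps each ItemPointer to a 
64-bit integer before taking deltas):

#include "postgres.h"

/*
 * Write one delta, 7 bits per byte, least significant bits first, with
 * the high bit set on every byte except the last.  Decoding is the
 * exact inverse, which is why the compression itself is completely
 * deterministic: the same items always produce the same bytes.
 */
static unsigned char *
encode_delta(unsigned char *ptr, uint64 delta)
{
    while (delta > 0x7F)
    {
        *ptr++ = (unsigned char) ((delta & 0x7F) | 0x80);
        delta >>= 7;
    }
    *ptr++ = (unsigned char) delta;     /* continuation bit clear */
    return ptr;
}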

We can tie WAL replay's hands about that by including in the WAL record 
the list of items that form each posting list. That adds some bloat, 
compared to only including the new items, but not too much. (And we 
still only need to do that for posting lists following the first 
modified one.)
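
The replay code could then do something like the following. This is a 
hypothetical sketch; compress_items stands in for whatever helper 
varbyte-encodes one chunk:

#include "postgres.h"
#include "storage/itemptr.h"

/* hypothetical helper: encodes nitems items at *dst, returns bytes written */
extern int compress_items(ItemPointerData *items, int nitems, char *dst);

/*
 * Re-encode the page's items into posting lists, splitting exactly at
 * the boundaries recorded in the WAL record rather than letting the
 * replay code pick its own split points.
 */
static void
replay_reencode(ItemPointerData *items, int nitems,
                ItemPointerData *cutoffs, int ncutoffs,
                char *dst)
{
    int     i = 0;
    int     seg;

    for (seg = 0; seg < ncutoffs; seg++)
    {
        int     start = i;

        /* find the last item that still belongs to this posting list */
        while (i < nitems &&
               ItemPointerCompare(&items[i], &cutoffs[seg]) <= 0)
            i++;

        /* re-encode exactly the items [start, i) as one posting list */
        dst += compress_items(&items[start], i - start, dst);
    }
}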

Alexander, would you like to give that a shot, or will I?

- Heikki


