BRIN summarization vs. WAL logging

From: Tomas Vondra
Subject: BRIN summarization vs. WAL logging
Date:
Msg-id: 64a5bfb0-909e-0a14-3ac9-14dc74673074@enterprisedb.com
Responses: Re: BRIN summarization vs. WAL logging  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Hi,

In a thread about sequences and sync replication [1], I've explained 
that the issue we're observing is due to not waiting for WAL at commit 
if the transaction only did nextval(). In that case we don't flush WAL 
in RecordTransactionCommit, don't wait for the sync replica, etc., so 
the WAL may get lost in case of a crash.
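
For illustration, a minimal session hitting this path might look like 
this (the sequence name is made up):

   create sequence s;

   begin;
   select nextval('s');  -- writes a WAL record, but the xact gets no XID
   commit;               -- returns without flushing or waiting for sync rep

   -- a crash shortly after the commit may lose the nextval() WAL entirely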

As I explained in the other thread, there are various other cases where 
a transaction generates WAL but has no XID, and having no XID is enough 
for us to skip the flush/wait at transaction commit. Some of those cases 
are probably fine (e.g. there are comments explaining why this is fine 
for the PRUNE record).

But other cases (discovered by running regression tests with extra 
logging) looked a bit suspicious - particularly those that write 
multiple WAL messages, because what if we lose just some of those?

So I looked at two cases related to BRIN, mostly because those were 
fairly simple, and I think at least the brin_summarize_range() is 
somewhat broken.


1) brin_desummarize_range()

This is pretty simple, because this function generates a single WAL 
record, without waiting for it to be flushed:

DESUMMARIZE pagesPerRange 1, heapBlk 0, page offset 9, blkref #0: ...

But if the cluster/VM/... crashes right after you ran the function (and 
it completed just fine, possibly even in an explicit transaction), that 
change will get lost. Not really a serious data corruption/loss, and you 
can simply run it again, but IMHO rather surprising.
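
To make that concrete, a sketch (the index name is just an example):

   select brin_desummarize_range('t_a_idx', 1);  -- one DESUMMARIZE record
   -- commit returns without flushing it; crash the cluster right here,
   -- and after recovery the range is simply summarized again, as if the
   -- call never happened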

Of course, most people are unlikely to run brin_desummarize_range() very 
often, so maybe it's acceptable? But then again - if we expect this to be 
a very rare operation, why skip the WAL flush at all?


2) brin_summarize_range()

Now to the issue that I think is more serious, more likely to happen, and 
harder to fix. When summarizing a range, we write two WAL records:

INSERT heapBlk 2 pagesPerRange 2 offnum 2, blkref #0: rel 1663/63 ...
SAMEPAGE_UPDATE offnum 2, blkref #0: rel 1663/63341/73957 blk 2

So, what happens if we lose the second WAL record, e.g. due to a crash? 
To experiment with this, I wrote a trivial patch (attached) that allows 
crashing on a WAL message of a certain type simply by setting a GUC.

Now, consider this example:

   create table t (a int);
   insert into t select i from generate_series(1,5000) s(i);
   create index on t using brin (a);
   select brin_desummarize_range('t_a_idx', 1);

   set crash_on_wal_message = 'SAMEPAGE_UPDATE';

   select brin_summarize_range('t_a_idx', 5);

   PANIC:  crashing before 'SAMEPAGE_UPDATE' WAL message
   server closed the connection unexpectedly
   ...

After recovery, this is what we have:

   select * from brin_page_items(get_raw_page('t_a_idx', 2), 't_a_idx');

    ...  | allnulls | hasnulls | placeholder | value
    ... -+----------+----------+-------------+-------
    ...  | t        | f        | t           |
    (1 row)

So the BRIN tuple is still marked as placeholder, which is a problem 
because that means we'll always consider it as matching, making the 
bitmap index scan less efficient. And we'll *never* fix this, because 
just summarizing the range does nothing:

    select brin_summarize_range('t_a_idx', 5);
    brin_summarize_range
   ----------------------
                       0
   (1 row)

So it's still marked as a placeholder, and to fix it you have to 
explicitly desummarize the range first.
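
That is, the only way to get rid of the stale placeholder seems to be to 
do it in two steps:

   select brin_desummarize_range('t_a_idx', 5); -- drop the stale placeholder
   select brin_summarize_range('t_a_idx', 5);   -- now this actually summarizes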

The reason for this seems obvious - only the process that created the 
placeholder tuple is expected to mark it as "placeholder=false", but that 
operation is described by two separate WAL records. And if we lose the 
update, the tuple will stay marked as a placeholder forever.

Of course, this requires a crash while something is summarizing ranges. 
But consider that summarization is often done by autovacuum, so it's not 
just about hitting this from a manually executed brin_summarize_range().
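
For instance, with the table from the example above even a plain VACUUM 
will summarize the not-yet-summarized ranges (IIRC in brinvacuumcleanup), 
so a crash at the wrong moment there leaves the same stale placeholder 
behind:

   vacuum t;   -- summarizes unsummarized ranges of t_a_idx as a side effect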

I'm not quite sure what to do about this. Just doing XLogFlush() does 
not really fix this - it makes it less likely, but the root cause is that 
the change is described by multiple WAL messages that are not linked 
together in any way. We may lose the last message without noticing, and 
the flush does not fix that.

I didn't look at the other cases mentioned in [1], but I wouldn't be 
surprised if some had a similar issue (e.g. the GIN pending list cleanup 
seems like another candidate).


regards

[1] 
https://www.postgresql.org/message-id/0f827a71-a01b-bcf9-fe77-3047a9d4a93c%40enterprisedb.com

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
