Обсуждение: Unlogged relations and WAL-logging

Поиск
Список
Период
Сортировка

Unlogged relations and WAL-logging

От
Heikki Linnakangas
Дата:
Unlogged relations are not WAL-logged, but creating the init-fork is. 
There are a few things around that seem sloppy:

1. In index_build(), we do this:

> 
>     /*
>      * If this is an unlogged index, we may need to write out an init fork for
>      * it -- but we must first check whether one already exists.  If, for
>      * example, an unlogged relation is truncated in the transaction that
>      * created it, or truncated twice in a subsequent transaction, the
>      * relfilenode won't change, and nothing needs to be done here.
>      */
>     if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
>         !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
>     {
>         smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
>         indexRelation->rd_indam->ambuildempty(indexRelation);
>     }

Shouldn't we call log_smgrcreate() here? Creating the init fork is 
otherwise not WAL-logged at all. In practice, all the ambuildempty() 
implementations create and WAL-log a metapage and replay of that will 
implicitly create the underlying relation. But if you created an index 
access method whose ambuildempty() function does nothing, i.e. the init 
fork is just an empty file, there would be no trace in the WAL about 
creating the init fork, which is bad. In fact, 
src/test/modules/dummy_index_am is an example of that, but because it 
does nothing at all with the file, you cannot see any ill effect from 
the missing init fork.

2. Some implementations of ambuildempty() use the buffer cache (hash, 
gist, gin, brin), while others bypass it and call smgrimmedsync() 
instead (btree, spgist, bloom). I don't see any particular reason for 
those decisions, it seems to be based purely on which example the author 
happened to copy-paste.

Using the buffer cache seems better to me, because it avoids the 
smgrimmedsync() call. That makes creating unlogged indexes faster. 
That's not significant for any real world application, but still.

3. Those ambuildempty implementations that bypass the buffer cache use 
smgrwrite() to write the pages. That doesn't make any difference in 
practice, but in principle it's wrong: You are supposed to use 
smgrextend() when extending a relation. Using smgrwrite() skips updating 
the relation size cache, which is harmless in this case because it 
wasn't initialized previously either, and I'm not sure if we ever call 
smgrnblocks() on the init-fork. But if you build with 
CHECK_WRITE_VS_EXTEND, you get an assertion failure.

4. Also, the smgrwrite() calls are performed before WAL-logging the 
pages, so the page that's written to disk has 0/0 as the LSN, not the 
LSN of the WAL record. That's harmless too, but seems a bit sloppy.

(I remember we had a discussion once whether we should always extend the 
relation first, and WAL-log only after that, but I can't find the thread 
right now. The point is to avoid the situation that the operation fails 
because you run out of disk space, and then you crash and WAL replay 
also fails because you are still out of disk space. But most places 
currently call log_newpage() first, and smgrextend() after that.)

5. In heapam_relation_set_new_filenode(), we do this:

> 
>     /*
>      * If required, set up an init fork for an unlogged table so that it can
>      * be correctly reinitialized on restart.  An immediate sync is required
>      * even if the page has been logged, because the write did not go through
>      * shared_buffers and therefore a concurrent checkpoint may have moved the
>      * redo pointer past our xlog record.  Recovery may as well remove it
>      * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>      * record. Therefore, logging is necessary even if wal_level=minimal.
>      */
>     if (persistence == RELPERSISTENCE_UNLOGGED)
>     {
>         Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>                rel->rd_rel->relkind == RELKIND_MATVIEW ||
>                rel->rd_rel->relkind == RELKIND_TOASTVALUE);
>         smgrcreate(srel, INIT_FORKNUM, false);
>         log_smgrcreate(newrnode, INIT_FORKNUM);
>         smgrimmedsync(srel, INIT_FORKNUM);
>     }

The comment doesn't make much sense, we haven't written nor WAL-logged 
any page here, with nor without the buffer cache. It made more sense 
before commit fa0f466d53.

This is what actually led me to discover the bug I just reported at [1], 
with regular tables. If we fix that bug I like I proposed there, then 
smgrcreate() will register and fsync request, and we won't need 
smgrimmedsync here anymore.

Attached is a patch to tighten those up. The third one should arguably 
be part of [1], not this thread, but here it goes too.

[1] 
https://www.postgresql.org/message-id/d47d8122-415e-425c-d0a2-e0160829702d%40iki.fi

- Heikki
Вложения

Re: Unlogged relations and WAL-logging

От
Robert Haas
Дата:
On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
> Unlogged relations are not WAL-logged, but creating the init-fork is.
> There are a few things around that seem sloppy:
>
> 1. In index_build(), we do this:
>
> >        */
> >       if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
> >               !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
> >       {
> >               smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
> >               indexRelation->rd_indam->ambuildempty(indexRelation);
> >       }
>
> Shouldn't we call log_smgrcreate() here? Creating the init fork is
> otherwise not WAL-logged at all.

Yes, that's a bug.

> 2. Some implementations of ambuildempty() use the buffer cache (hash,
> gist, gin, brin), while others bypass it and call smgrimmedsync()
> instead (btree, spgist, bloom). I don't see any particular reason for
> those decisions, it seems to be based purely on which example the author
> happened to copy-paste.

I thought that this inconsistency was odd when I was developing the
unlogged feature, but I tried to keep each routine's ambuildempty()
consistent with whatever ambuild() was doing. I don't mind if you want
to change it, though.

> 3. Those ambuildempty implementations that bypass the buffer cache use
> smgrwrite() to write the pages. That doesn't make any difference in
> practice, but in principle it's wrong: You are supposed to use
> smgrextend() when extending a relation.

That's a mistake on my part.

> 4. Also, the smgrwrite() calls are performed before WAL-logging the
> pages, so the page that's written to disk has 0/0 as the LSN, not the
> LSN of the WAL record. That's harmless too, but seems a bit sloppy.

That is also a mistake on my part.

> 5. In heapam_relation_set_new_filenode(), we do this:
>
> >
> >       /*
> >        * If required, set up an init fork for an unlogged table so that it can
> >        * be correctly reinitialized on restart.  An immediate sync is required
> >        * even if the page has been logged, because the write did not go through
> >        * shared_buffers and therefore a concurrent checkpoint may have moved the
> >        * redo pointer past our xlog record.  Recovery may as well remove it
> >        * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
> >        * record. Therefore, logging is necessary even if wal_level=minimal.
> >        */
> >       if (persistence == RELPERSISTENCE_UNLOGGED)
> >       {
> >               Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
> >                          rel->rd_rel->relkind == RELKIND_MATVIEW ||
> >                          rel->rd_rel->relkind == RELKIND_TOASTVALUE);
> >               smgrcreate(srel, INIT_FORKNUM, false);
> >               log_smgrcreate(newrnode, INIT_FORKNUM);
> >               smgrimmedsync(srel, INIT_FORKNUM);
> >       }
>
> The comment doesn't make much sense, we haven't written nor WAL-logged
> any page here, with nor without the buffer cache. It made more sense
> before commit fa0f466d53.

Well, it seems to me (and perhaps I am just confused) that complaining
that there's no page written here might be a technicality. The point
is that there's no synchronization between the work we're doing here
-- which is creating a fork, not writing a page -- and any concurrent
checkpoint. So we both need to log it, and also sync it immediately.

-- 
Robert Haas
EDB: http://www.enterprisedb.com



Re: Unlogged relations and WAL-logging

От
Heikki Linnakangas
Дата:
On 28/01/2022 15:57, Robert Haas wrote:
> On Thu, Jan 27, 2022 at 2:32 PM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>> Unlogged relations are not WAL-logged, but creating the init-fork is.
>> There are a few things around that seem sloppy:
>>
>> 1. In index_build(), we do this:
>>
>>>         */
>>>        if (indexRelation->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
>>>                !smgrexists(RelationGetSmgr(indexRelation), INIT_FORKNUM))
>>>        {
>>>                smgrcreate(RelationGetSmgr(indexRelation), INIT_FORKNUM, false);
>>>                indexRelation->rd_indam->ambuildempty(indexRelation);
>>>        }
>>
>> Shouldn't we call log_smgrcreate() here? Creating the init fork is
>> otherwise not WAL-logged at all.
> 
> Yes, that's a bug.

Pushed and backpatched this patch (commit 3142a8845b).

>> 2. Some implementations of ambuildempty() use the buffer cache (hash,
>> gist, gin, brin), while others bypass it and call smgrimmedsync()
>> instead (btree, spgist, bloom). I don't see any particular reason for
>> those decisions, it seems to be based purely on which example the author
>> happened to copy-paste.
> 
> I thought that this inconsistency was odd when I was developing the
> unlogged feature, but I tried to keep each routine's ambuildempty()
> consistent with whatever ambuild() was doing. I don't mind if you want
> to change it, though.
> 
>> 3. Those ambuildempty implementations that bypass the buffer cache use
>> smgrwrite() to write the pages. That doesn't make any difference in
>> practice, but in principle it's wrong: You are supposed to use
>> smgrextend() when extending a relation.
> 
> That's a mistake on my part.
> 
>> 4. Also, the smgrwrite() calls are performed before WAL-logging the
>> pages, so the page that's written to disk has 0/0 as the LSN, not the
>> LSN of the WAL record. That's harmless too, but seems a bit sloppy.
> 
> That is also a mistake on my part.

I'm still sitting on these fixes. I think the patch I posted still makes 
sense, but I got carried away with a more invasive approach that 
introduces a whole new set of functions for bulk-creating a relation, 
which would handle WAL-logging, smgrimmedsync() and all that (see 
below). We have some repetitive, error-prone code in all the index build 
functions for that. But that's not backpatchable, so I'll rebase the 
original approach next week.

>> 5. In heapam_relation_set_new_filenode(), we do this:
>>
>>>
>>>        /*
>>>         * If required, set up an init fork for an unlogged table so that it can
>>>         * be correctly reinitialized on restart.  An immediate sync is required
>>>         * even if the page has been logged, because the write did not go through
>>>         * shared_buffers and therefore a concurrent checkpoint may have moved the
>>>         * redo pointer past our xlog record.  Recovery may as well remove it
>>>         * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>>>         * record. Therefore, logging is necessary even if wal_level=minimal.
>>>         */
>>>        if (persistence == RELPERSISTENCE_UNLOGGED)
>>>        {
>>>                Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>>>                           rel->rd_rel->relkind == RELKIND_MATVIEW ||
>>>                           rel->rd_rel->relkind == RELKIND_TOASTVALUE);
>>>                smgrcreate(srel, INIT_FORKNUM, false);
>>>                log_smgrcreate(newrnode, INIT_FORKNUM);
>>>                smgrimmedsync(srel, INIT_FORKNUM);
>>>        }
>>
>> The comment doesn't make much sense, we haven't written nor WAL-logged
>> any page here, with nor without the buffer cache. It made more sense
>> before commit fa0f466d53.
> 
> Well, it seems to me (and perhaps I am just confused) that complaining
> that there's no page written here might be a technicality. The point
> is that there's no synchronization between the work we're doing here
> -- which is creating a fork, not writing a page -- and any concurrent
> checkpoint. So we both need to log it, and also sync it immediately.

I see. I pushed the fix from the other thread that makes smgrcreate() 
call register_dirty_segment (commit 4b4798e13). I believe that makes 
this smgrimmedsync() unnecessary. If a concurrent checkpoint happens 
with a redo pointer greater than this WAL record, it must've received 
the fsync request created by smgrcreate(). That depends on the fact that 
we write the WAL record *after* smgrcreate(). Subtle..

Hmm, we have a similar smgrimmedsync() call after index build, because 
we have written pages directly with smgrextend(skipFsync=true). If no 
checkpoints have occurred during the index build, we could call 
register_dirty_segment() instead of smgrimmedsync(). That would avoid 
the fsync() latency when creating an index on an empty or small index.

This is all very subtle to get right though. That's why I'd like to 
invent a new bulk-creation facility that would handle this stuff, and 
make the callers less error-prone.

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Unlogged relations and WAL-logging

От
Heikki Linnakangas
Дата:
On 07/07/2023 18:21, Heikki Linnakangas wrote:
> On 28/01/2022 15:57, Robert Haas wrote:
>>> 4. Also, the smgrwrite() calls are performed before WAL-logging the
>>> pages, so the page that's written to disk has 0/0 as the LSN, not the
>>> LSN of the WAL record. That's harmless too, but seems a bit sloppy.
>>
>> That is also a mistake on my part.
> 
> I'm still sitting on these fixes. I think the patch I posted still makes
> sense, but I got carried away with a more invasive approach that
> introduces a whole new set of functions for bulk-creating a relation,
> which would handle WAL-logging, smgrimmedsync() and all that (see
> below). We have some repetitive, error-prone code in all the index build
> functions for that. But that's not backpatchable, so I'll rebase the
> original approach next week.

Committed this fix to master and v16. Didn't seem worth backpatching 
further than that, given that there is no live user-visible issue here.

>>> 5. In heapam_relation_set_new_filenode(), we do this:
>>>
>>>>
>>>>         /*
>>>>          * If required, set up an init fork for an unlogged table so that it can
>>>>          * be correctly reinitialized on restart.  An immediate sync is required
>>>>          * even if the page has been logged, because the write did not go through
>>>>          * shared_buffers and therefore a concurrent checkpoint may have moved the
>>>>          * redo pointer past our xlog record.  Recovery may as well remove it
>>>>          * while replaying, for example, XLOG_DBASE_CREATE or XLOG_TBLSPC_CREATE
>>>>          * record. Therefore, logging is necessary even if wal_level=minimal.
>>>>          */
>>>>         if (persistence == RELPERSISTENCE_UNLOGGED)
>>>>         {
>>>>                 Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>>>>                            rel->rd_rel->relkind == RELKIND_MATVIEW ||
>>>>                            rel->rd_rel->relkind == RELKIND_TOASTVALUE);
>>>>                 smgrcreate(srel, INIT_FORKNUM, false);
>>>>                 log_smgrcreate(newrnode, INIT_FORKNUM);
>>>>                 smgrimmedsync(srel, INIT_FORKNUM);
>>>>         }
>>>
>>> The comment doesn't make much sense, we haven't written nor WAL-logged
>>> any page here, with nor without the buffer cache. It made more sense
>>> before commit fa0f466d53.
>>
>> Well, it seems to me (and perhaps I am just confused) that complaining
>> that there's no page written here might be a technicality. The point
>> is that there's no synchronization between the work we're doing here
>> -- which is creating a fork, not writing a page -- and any concurrent
>> checkpoint. So we both need to log it, and also sync it immediately.
> 
> I see. I pushed the fix from the other thread that makes smgrcreate()
> call register_dirty_segment (commit 4b4798e13). I believe that makes
> this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
> with a redo pointer greater than this WAL record, it must've received
> the fsync request created by smgrcreate(). That depends on the fact that
> we write the WAL record *after* smgrcreate(). Subtle..
> 
> Hmm, we have a similar smgrimmedsync() call after index build, because
> we have written pages directly with smgrextend(skipFsync=true). If no
> checkpoints have occurred during the index build, we could call
> register_dirty_segment() instead of smgrimmedsync(). That would avoid
> the fsync() latency when creating an index on an empty or small index.
> 
> This is all very subtle to get right though. That's why I'd like to
> invent a new bulk-creation facility that would handle this stuff, and
> make the callers less error-prone.

Having a more generic and less error-prone bulk-creation mechanism is 
still on my long TODO list..

-- 
Heikki Linnakangas
Neon (https://neon.tech)




Re: Unlogged relations and WAL-logging

От
Peter Eisentraut
Дата:
Is the patch 
0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patch still 
relevant, or can this commitfest entry be closed?

On 23.08.23 16:40, Heikki Linnakangas wrote:
>>>> 5. In heapam_relation_set_new_filenode(), we do this:
>>>>
>>>>>
>>>>>         /*
>>>>>          * If required, set up an init fork for an unlogged table 
>>>>> so that it can
>>>>>          * be correctly reinitialized on restart.  An immediate 
>>>>> sync is required
>>>>>          * even if the page has been logged, because the write did 
>>>>> not go through
>>>>>          * shared_buffers and therefore a concurrent checkpoint may 
>>>>> have moved the
>>>>>          * redo pointer past our xlog record.  Recovery may as well 
>>>>> remove it
>>>>>          * while replaying, for example, XLOG_DBASE_CREATE or 
>>>>> XLOG_TBLSPC_CREATE
>>>>>          * record. Therefore, logging is necessary even if 
>>>>> wal_level=minimal.
>>>>>          */
>>>>>         if (persistence == RELPERSISTENCE_UNLOGGED)
>>>>>         {
>>>>>                 Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
>>>>>                            rel->rd_rel->relkind == RELKIND_MATVIEW ||
>>>>>                            rel->rd_rel->relkind == 
>>>>> RELKIND_TOASTVALUE);
>>>>>                 smgrcreate(srel, INIT_FORKNUM, false);
>>>>>                 log_smgrcreate(newrnode, INIT_FORKNUM);
>>>>>                 smgrimmedsync(srel, INIT_FORKNUM);
>>>>>         }
>>>>
>>>> The comment doesn't make much sense, we haven't written nor WAL-logged
>>>> any page here, with nor without the buffer cache. It made more sense
>>>> before commit fa0f466d53.
>>>
>>> Well, it seems to me (and perhaps I am just confused) that complaining
>>> that there's no page written here might be a technicality. The point
>>> is that there's no synchronization between the work we're doing here
>>> -- which is creating a fork, not writing a page -- and any concurrent
>>> checkpoint. So we both need to log it, and also sync it immediately.
>>
>> I see. I pushed the fix from the other thread that makes smgrcreate()
>> call register_dirty_segment (commit 4b4798e13). I believe that makes
>> this smgrimmedsync() unnecessary. If a concurrent checkpoint happens
>> with a redo pointer greater than this WAL record, it must've received
>> the fsync request created by smgrcreate(). That depends on the fact that
>> we write the WAL record *after* smgrcreate(). Subtle..
>>
>> Hmm, we have a similar smgrimmedsync() call after index build, because
>> we have written pages directly with smgrextend(skipFsync=true). If no
>> checkpoints have occurred during the index build, we could call
>> register_dirty_segment() instead of smgrimmedsync(). That would avoid
>> the fsync() latency when creating an index on an empty or small index.
>>
>> This is all very subtle to get right though. That's why I'd like to
>> invent a new bulk-creation facility that would handle this stuff, and
>> make the callers less error-prone.
> 
> Having a more generic and less error-prone bulk-creation mechanism is 
> still on my long TODO list..
> 




Re: Unlogged relations and WAL-logging

От
Heikki Linnakangas
Дата:
On 01/09/2023 15:49, Peter Eisentraut wrote:
> Is the patch
> 0003-Remove-unnecessary-smgrimmedsync-when-creating-unlog.patch still
> relevant, or can this commitfest entry be closed?

Yes. Pushed it now, thanks!

-- 
Heikki Linnakangas
Neon (https://neon.tech)