Re: Unlogged relation copy is not fsync'd

Поиск
Список
Период
Сортировка
От Noah Misch
Тема Re: Unlogged relation copy is not fsync'd
Дата
Msg-id 20231008022204.cc@rfd.leadboat.com
обсуждение исходный текст
Ответ на Re: Unlogged relation copy is not fsync'd  (Noah Misch <noah@leadboat.com>)
Список pgsql-hackers
On Wed, Sep 20, 2023 at 11:22:10PM -0700, Noah Misch wrote:
> On Fri, Sep 15, 2023 at 02:47:45PM +0300, Heikki Linnakangas wrote:
> > On 05/09/2023 21:20, Robert Haas wrote:
> 
> > Thinking about this some more, I think this is still not 100% correct, even
> > with the patch I posted earlier:
> > 
> > >    /*
> > >     * When we WAL-logged rel pages, we must nonetheless fsync them.  The
> > >     * reason is that since we're copying outside shared buffers, a CHECKPOINT
> > >     * occurring during the copy has no way to flush the previously written
> > >     * data to disk (indeed it won't know the new rel even exists).  A crash
> > >     * later on would replay WAL from the checkpoint, therefore it wouldn't
> > >     * replay our earlier WAL entries. If we do not fsync those pages here,
> > >     * they might still not be on disk when the crash occurs.
> > >     */
> > >    if (use_wal || relpersistence == RELPERSISTENCE_UNLOGGED)
> > >        smgrimmedsync(dst, forkNum);
> > 
> > Let's walk through each case one by one:
> > 
> > 1. Temporary table. No fsync() needed. This is correct.
> > 
> > 2. Unlogged rel, main fork. Needs to be fsync'd, because we skipped WAL, and
> > also because we bypassed the buffer manager. Correct.
> 
> Agreed.
> 
> > 3. Unlogged rel, init fork. Needs to be fsync'd, even though we WAL-logged
> > it, because we bypassed the buffer manager. Like the comment explains. This
> > is now correct with the patch.
> 
> This has two subtypes:
> 
> 3a. Unlogged rel, init fork, use_wal (wal_level!=minimal). Matches what
> you wrote.
> 
> 3b. Unlogged rel, init fork, !use_wal (wal_level==minimal). Needs to be
> fsync'd because we didn't WAL-log it and RelationCreateStorage() won't call
> AddPendingSync().  (RelationCreateStorage() could start calling
> AddPendingSync() for this case.  I think we chose not to do that because there
> will never be additional writes to the init fork, and smgrDoPendingSyncs()
> would send the fork to FlushRelationsAllBuffers() even though the fork will
> never appear in shared buffers.  On the other hand, grouping the sync with the
> other end-of-xact syncs could improve efficiency for some filesystems.  Also,
> the number of distinguishable cases is unpleasantly high.)
> 
> > 4. Permanent rel, use_wal == true. Needs to be fsync'd, even though we
> > WAL-logged it, because we bypassed the buffer manager. Like the comment
> > explains. Correct.
> > 
> > 5. Permanent rel, use_wal == false. We skip fsync, because it means that
> > it's a new relation, so we have a sync request pending for it. (An assertion
> > for that would be nice). At the end of transaction, in smgrDoPendingSyncs(),
> > we will either fsync it or we will WAL-log all the pages if it was a small
> > relation. I believe this is *wrong*. If smgrDoPendingSyncs() decides to
> > WAL-log the pages, we have the same race condition that's explained in the
> > comment, because we bypassed the buffer manager:
> > 
> > 1. RelationCopyStorage() copies some of the pages.
> > 2. Checkpoint happens, which fsyncs the relation (smgrcreate() registered a
> > dirty segment when the relation was created)
> > 3. RelationCopyStorage() copies the rest of the pages.
> > 4. smgrDoPendingSyncs() WAL-logs all the pages.
> 
> smgrDoPendingSyncs() delegates to log_newpage_range().  log_newpage_range()
> loads each page into the buffer manager and calls MarkBufferDirty() on each.
> Hence, ...
> 
> > 5. Another checkpoint happens. It does *not* fsync the relation.
> 
> ... I think this will fsync the relation.  No?
> 
> > 6. Crash.

While we're cataloging gaps, I think the middle sentence is incorrect in the
following heapam_relation_set_new_filelocator() comment:

    /*
     * If required, set up an init fork for an unlogged table so that it can
     * be correctly reinitialized on restart.  Recovery may remove it while
     * replaying, for example, an XLOG_DBASE_CREATE* or XLOG_TBLSPC_CREATE
     * record.  Therefore, logging is necessary even if wal_level=minimal.
     */
    if (persistence == RELPERSISTENCE_UNLOGGED)
    {
        Assert(rel->rd_rel->relkind == RELKIND_RELATION ||
               rel->rd_rel->relkind == RELKIND_MATVIEW ||
               rel->rd_rel->relkind == RELKIND_TOASTVALUE);
        smgrcreate(srel, INIT_FORKNUM, false);
        log_smgrcreate(newrlocator, INIT_FORKNUM);
    }

XLOG_DBASE_CREATE_FILE_COPY last had that problem before fbcbc5d (2005-06)
made it issue a checkpoint.  XLOG_DBASE_CREATE_WAL_LOG never had that problem.
XLOG_TBLSPC_CREATE last had that problem before 97ddda8a82 (2021-08).  In
general, if we reintroduced such a bug, it would affect all new rels under
wal_level=minimal, not just init forks.  Having said all that,
log_smgrcreate() calls are never conditional on wal_level=minimal; the above
code is correct.



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Andy Fan
Дата:
Сообщение: Re: Draft LIMIT pushdown to Append and MergeAppend patch
Следующее
От: vignesh C
Дата:
Сообщение: Re: Invalidate the subscription worker in cases where a user loses their superuser status