Re: making relfilenodes 56 bits

From: Dilip Kumar
Subject: Re: making relfilenodes 56 bits
Msg-id: CAFiTN-u8d4EHZCE8YEc=iHUaa9_ei1=N02SJOAt1Ow0+9TdPfg@mail.gmail.com
In reply to: Re: making relfilenodes 56 bits  (Andres Freund <andres@anarazel.de>)
List: pgsql-hackers
On Sun, Jul 3, 2022 at 12:59 AM Andres Freund <andres@anarazel.de> wrote:

> Hm. Now that I think about it, isn't the XLogFlush() in
> XLogPutNextRelFileNumber() problematic performance-wise? Yes, we'll spread the
> cost across a number of GetNewRelFileNumber() calls, but still, an additional
> f[data]sync for every 64 relfilenodes assigned isn't cheap - today there's
> zero fsyncs when creating a sequence or table inside a transaction (there are
> some for indexes, but there's patches to fix that).
>
> Not that I really see an obvious alternative.

I think to see the impact we need a workload that frequently creates
relfilenodes; maybe, in parallel to a regular pgbench run, we can
keep creating relations/indexes and check how much of a performance
hit we see.  And if we do see an impact, then increasing the
VAR_RFN_PREFETCH value can help in resolving it.
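
Something like the quick libpq driver below could work (just a rough
sketch I have not benchmarked; the table name and loop count are
arbitrary).  Each CREATE TABLE allocates a fresh relfilenode, so
running a few copies of this alongside pgbench should exercise
GetNewRelFileNumber() and the XLogFlush() path heavily:

/* rough sketch: hammer relfilenode allocation while pgbench runs */
#include <stdio.h>
#include <libpq-fe.h>

int
main(void)
{
	/* connection parameters are taken from the environment */
	PGconn	   *conn = PQconnectdb("");

	if (PQstatus(conn) != CONNECTION_OK)
	{
		fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
		return 1;
	}

	for (int i = 0; i < 100000; i++)
	{
		/* each CREATE TABLE consumes one new relfilenode */
		PGresult   *res = PQexec(conn,
								 "CREATE TABLE rfn_stress (a int); "
								 "DROP TABLE rfn_stress;");

		if (PQresultStatus(res) != PGRES_COMMAND_OK)
		{
			fprintf(stderr, "%s", PQerrorMessage(conn));
			PQclear(res);
			break;
		}
		PQclear(res);
	}

	PQfinish(conn);
	return 0;
}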

> I guess we could try to invent a flush-log-before-write type logic for
> relfilenodes somehow? So that the first block actually written to a file needs
> to ensure the WAL record that created the relation is flushed. But getting
> that to work reliably seems nontrivial.
>
> One thing that would be good is to add an assertion to a few places ensuring
> that relfilenodes aren't above ->nextRelFileNumber, most importantly somewhere
> in the recovery path.

Yes, it makes sense.
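For the recovery path, maybe something like this wherever redo
resolves a block reference (a rough sketch only; I am assuming the
patch's ShmemVariableCache->nextRelFileNumber field and a
RelFileNumberIsValid() macro here, and the exact placement is to be
decided):

	/*
	 * Sketch: any relfilenumber seen during redo must already have been
	 * allocated (and WAL-logged) by GetNewRelFileNumber(), so it has to
	 * be below the next value we would hand out.
	 */
	Assert(!RelFileNumberIsValid(relNumber) ||
		   relNumber < ShmemVariableCache->nextRelFileNumber);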

> Why did you choose a quite small value for VAR_RFN_PREFETCH? VAR_OID_PREFETCH
> is 8192, but you chose 64 for VAR_RFN_PREFETCH?

Earlier it was 8192, but then there was a comment from Robert that an
Oid can be used for many other things, like an identifier for a
catalog tuple or a TOAST chunk, whereas a RelFileNumber requires a
filesystem operation, so the amount of work needed to use up 8192
RelFileNumbers is a lot bigger than the amount of work required to
use up 8192 OIDs.

I think that made sense, so I reduced it to 64, but now I tend to
think that we also need to consider that after consuming
VAR_RFN_PREFETCH values we are going to do an XLogFlush(), so it's
better to keep it high.  And as Robert said upthread, even if we keep
it at 8192 we can still crash 2^41 (about 2 trillion) times before we
completely run out of numbers (each crash can waste at most 8192 =
2^13 values, and 2^41 * 2^13 = 2^54 is still only a quarter of the
2^56 space).  So I think we can easily keep it at 8192, and I don't
think we really need to worry much about the performance impact of
XLogFlush().

> I'd spell out RFN in VAR_RFN_PREFETCH btw, it took me a bit to expand RFN to
> relfilenode.

Okay.

> This is embarrassing. Trying to keep alignment match between C and catalog
> alignment on AIX, without actually making the system understand the alignment
> rules, is a remarkably shortsighted approach.
>
> I started a separate thread about it, since it's not really relevant to this thread:
> https://postgr.es/m/20220702183354.a6uhja35wta7agew%40alap3.anarazel.de
>
> Maybe we could at least make the field order to be something like
>   oid, relam, relfilenode, relname

Yeah, we can do that.

> that should be ok alignment wise, keep oid first, and seems to make sense from
> an "importance" POV? Can't really interpret later fields without knowing relam
> etc.

Right.

> > I think due to wraparound, if a relfilenode gets reused by another
> > relation in the same checkpoint, then there was an issue in crash
> > recovery with wal_level=minimal.  But the root of the issue is the
> > wraparound, right?
>
> I'm not convinced the tombstones were required solely in the oid wraparound
> case before, despite the comment saying so, with wal_level=minimal. I gotta do
> some non-work stuff for a bit, so I need to stop pondering this now :)
>
> I think it might be a good idea to have a few weeks in which we do *not*
> remove the tombstones, but have assertion checks against such files existing
> when we don't expect them to. I.e. commit 1-3, add the asserts, then commit 4
> a bit later.

I think this is a good idea.
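And for the assert, maybe something as simple as this near the start
of mdcreate() (a sketch only; "path" and "isRedo" as in the existing
function):

	/*
	 * Sketch: with never-reused relfilenumbers, the file must not exist
	 * yet, except during redo where replay may legally re-create it.
	 * Hitting this would indicate a leftover file that the old
	 * tombstone mechanism used to paper over.
	 */
	Assert(isRedo || access(path, F_OK) != 0);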

> > Okay, so you mean to say that we can first drop the remaining
> > segments, and drop segment 0 last, right?
>
> I'd use the approach Robert suggested and delete from the end, going down.

Yeah, I got it, thanks.
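Just to make sure we are on the same page, this is roughly what I
have in mind (a sketch, not the actual patch code):

static void
unlink_all_segments(const char *basepath, int last_segno)
{
	/*
	 * Remove segments from the highest segment number down to segment 0,
	 * so an interruption can never leave segment 0 missing while higher
	 * segments survive, which would make the relation look truncated
	 * with stray data beyond the end.
	 */
	for (int segno = last_segno; segno >= 0; segno--)
	{
		char		path[MAXPGPATH];

		if (segno == 0)
			snprintf(path, sizeof(path), "%s", basepath);
		else
			snprintf(path, sizeof(path), "%s.%d", basepath, segno);

		if (unlink(path) < 0 && errno != ENOENT)
			elog(LOG, "could not remove file \"%s\": %m", path);
	}
}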

-- 
Regards,
Dilip Kumar
EnterpriseDB: http://www.enterprisedb.com


