Re: Unhappy about API changes in the no-fsm-for-small-rels patch

Поиск
Список
Период
Сортировка
От John Naylor
Тема Re: Unhappy about API changes in the no-fsm-for-small-rels patch
Дата
Msg-id CACPNZCvE9S4ghaLkr76u1+7H1U1swghvyeyH_47KD3TXfK0C2g@mail.gmail.com
обсуждение исходный текст
Ответ на Re: Unhappy about API changes in the no-fsm-for-small-rels patch  (Tom Lane <tgl@sss.pgh.pa.us>)
Ответы Re: Unhappy about API changes in the no-fsm-for-small-rels patch  (Amit Kapila <amit.kapila16@gmail.com>)
Re: Unhappy about API changes in the no-fsm-for-small-rels patch  (Andres Freund <andres@anarazel.de>)
Список pgsql-hackers
On Wed, Apr 17, 2019 at 2:04 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> I'm somewhat unhappy in how much the no-fsm-for-small-rels exposed
> complexity that looks like it should be purely in freespacemap.c to
> callers.
>
>
>  extern Size GetRecordedFreeSpace(Relation rel, BlockNumber heapBlk);
> -extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded);
> +extern BlockNumber GetPageWithFreeSpace(Relation rel, Size spaceNeeded,
> +                    bool check_fsm_only);
>
> So now freespace.c has an argument that says we should only check the
> fsm. That's confusing. And it's not explained to callers what that
> argument means, and when it should be set.

When first looking for free space, it's "false": Within
GetPageWithFreeSpace(), we call RelationGetNumberOfBlocks() if the FSM
returns invalid.

If we have to extend, after acquiring the lock to extend the relation,
we call GetPageWithFreeSpace() again to see if another backend already
extended while waiting on the lock. If there's no FSM, the thinking
is, it's not worth it to get the number of blocks again.

> @@ -176,20 +269,44 @@ RecordAndGetPageWithFreeSpace(Relation rel, BlockNumber oldPage,
>   * Note that if the new spaceAvail value is higher than the old value stored
>   * in the FSM, the space might not become visible to searchers until the next
>   * FreeSpaceMapVacuum call, which updates the upper level pages.
> + *
> + * Callers have no need for a local map.
>   */
>  void
> -RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk, Size spaceAvail)
> +RecordPageWithFreeSpace(Relation rel, BlockNumber heapBlk,
> +                       Size spaceAvail, BlockNumber nblocks)
>
> There's no explanation as to what that "nblocks" argument is. One
> basically has to search other callers to figure it out. It's not even
> clear to which fork it relates to. Nor that one can set it to
> InvalidBlockNumber if one doesn't have the relation size conveniently
> reachable.  But it's not exposed to RecordAndGetPageWithFreeSpace(), for
> a basically unexplained reason.  There's a comment above
> fsm_allow_writes() - but that's  file-local function that external
> callers basically have need to know about.

Okay.

> I can't figure out what "Callers have no need for a local map." is
> supposed to mean.

It was meant to contrast with [RecordAnd]GetPageWithFreeSpace(), but I
see how it's confusing.

> +/*
> + * Clear the local map.  We must call this when we have found a block with
> + * enough free space, when we extend the relation, or on transaction abort.
> + */
> +void
> +FSMClearLocalMap(void)
> +{
> +   if (FSM_LOCAL_MAP_EXISTS)
> +   {
> +       fsm_local_map.nblocks = 0;
> +       memset(&fsm_local_map.map, FSM_LOCAL_NOT_AVAIL,
> +              sizeof(fsm_local_map.map));
> +   }
> +}
> +
>
> So now there's a new function one needs to call after successfully using
> the block returned by [RecordAnd]GetPageWithFreeSpace().  But it's not
> referenced from those functions, so basically one has to just know that.

Right.

> +/* Only create the FSM if the heap has greater than this many blocks */
> +#define HEAP_FSM_CREATION_THRESHOLD 4
>
> Hm, this seems to be tying freespace.c closer to heap than I think is
> great - think of new AMs like zheap, that also want to use it.

Amit and I kept zheap in mind when working on the patch. You'd have to
work around the metapage, but everything else should work the same.

> I think this is mostly fallout about the prime issue I'm unhappy
> about. There's now some global variable in freespacemap.c that code
> using freespace.c has to know about and maintain.
>
>
> +static void
> +fsm_local_set(Relation rel, BlockNumber cur_nblocks)
> +{
> +   BlockNumber blkno,
> +               cached_target_block;
> +
> +   /* The local map must not be set already. */
> +   Assert(!FSM_LOCAL_MAP_EXISTS);
> +
> +   /*
> +    * Starting at the current last block in the relation and working
> +    * backwards, mark alternating blocks as available.
> +    */
> +   blkno = cur_nblocks - 1;
>
> That comment explains very little about why this is done, and why it's a
> good idea.

Short answer: performance -- it's too expensive to try every block.
The explanation is in storage/freespace/README -- maybe that should be
referenced here?

> +/* Status codes for the local map. */
> +
> +/* Either already tried, or beyond the end of the relation */
> +#define FSM_LOCAL_NOT_AVAIL 0x00
> +
> +/* Available to try */
> +#define FSM_LOCAL_AVAIL        0x01
>
> +/* Local map of block numbers for small heaps with no FSM. */
> +typedef struct
> +{
> +   BlockNumber nblocks;
> +   uint8       map[HEAP_FSM_CREATION_THRESHOLD];
> +}          FSMLocalMap;
> +
>
> Hm, given realistic HEAP_FSM_CREATION_THRESHOLD, and the fact that we
> really only need one bit per relation, it seems like map should really
> be just a uint32 with one bit per page.

I fail to see the advantage of that.

> +static bool
> +fsm_allow_writes(Relation rel, BlockNumber heapblk,
> +                BlockNumber nblocks, BlockNumber *get_nblocks)
>
> +       if (rel->rd_rel->relpages != InvalidBlockNumber &&
> +           rel->rd_rel->relpages > HEAP_FSM_CREATION_THRESHOLD)
> +           return true;
> +       else
> +           skip_get_nblocks = false;
> +   }
>
> This badly needs a comment explaining that these values can be basically
> arbitrarily out of date. Explaining why it's correct to rely on them
> anyway (Presumably because creating an fsm unnecessarily is ok, it just
> avoid susing this optimization).

Agreed, and yes, your presumption is what I had in mind.

> I'm kinda thinking that this is the wrong architecture.
>
> 1) Unless I miss something, this will trigger a
>    RelationGetNumberOfBlocks(), which in turn ends up doing an lseek(),
>    once for each page we add to the relation.

That was true previously anyway if the FSM returned InvalidBlockNumber.

>    That strikes me as pretty
>    suboptimal. I think it's even worse if multiple backends do the
>    insertion, because the RelationGetTargetBlock(relation) logic will
>    succeed less often.

Could you explain why it would succeed less often?

> 2) We'll repeatedly re-encounter the first few pages, because we clear
>    the local map after each successful RelationGetBufferForTuple().

Not exactly sure what you mean? We only set the map if
RelationGetTargetBlock() returns InvalidBlockNumber, or if it returned
a valid block, and inserting there already failed. So, not terribly
often, I imagine.

> 3) The global variable based approach means that we cannot easily do
>    better. Even if we had a cache that lives across
>    RelationGetBufferForTuple() calls.
>
> 4) fsm_allow_writes() does a smgrexists() for the FSM in some
>    cases. That's pretty darn expensive if it's already open.

(from earlier)
> Isn't this like really expensive? mdexists() closes the relations and
> reopens it from scratch. Shouldn't we at the very least check
> smgr_fsm_nblocks beforehand, so this is only done once?

Hmm, I can look into that.


On Wed, Apr 17, 2019 at 3:16 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-04-16 14:31:25 -0400, Tom Lane wrote:
  <snip>
> > and in any case it seems pretty inefficient for workloads that insert
> > into multiple tables.
>
> As is, it's inefficient for insertions into a *single* relation. The
> RelationGetTargetBlock() makes it not crazily expensive, but it's still
> plenty expensive.

Performance testing didn't reveal any performance regression. If you
have a realistic benchmark in mind that stresses this logic more
heavily, I'd be happy to be convinced otherwise.


-- 
John Naylor                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



В списке pgsql-hackers по дате отправления:

Предыдущее
От: "Ideriha, Takeshi"
Дата:
Сообщение: RE: Copy data to DSA area
Следующее
От: Jiří Fejfar
Дата:
Сообщение: Re: extensions are hitting the ceiling