Re: [PoC] Improve dead tuple storage for lazy vacuum

From: Masahiko Sawada
Subject: Re: [PoC] Improve dead tuple storage for lazy vacuum
Date:
Msg-id: CAD21AoDCuTO6_wKF9mPm2hb9Fo-dc9kwbec6iXpwptsYaoFdyA@mail.gmail.com
In response to: Re: [PoC] Improve dead tuple storage for lazy vacuum  (Masahiko Sawada <sawada.mshk@gmail.com>)
Responses: Re: [PoC] Improve dead tuple storage for lazy vacuum  (John Naylor <john.naylor@enterprisedb.com>)
List: pgsql-hackers
On Mon, Feb 20, 2023 at 2:56 PM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
>
> On Thu, Feb 16, 2023 at 6:23 PM John Naylor
> <john.naylor@enterprisedb.com> wrote:
> >
> > On Thu, Feb 16, 2023 at 10:24 AM Masahiko Sawada <sawada.mshk@gmail.com> wrote:
> > >
> > > On Tue, Feb 14, 2023 at 8:24 PM John Naylor
> > > <john.naylor@enterprisedb.com> wrote:
> >
> > > > > I can think that something like traversing a HOT chain could visit
> > > > > offsets out of order. But fortunately we prune such collected TIDs
> > > > > before heap vacuum in heap case.
> > > >
> > > > Further, currently we *already* assume we populate the tid array in order (for binary search), so we can just continue assuming that (with an assert added since it's more public in this form). I'm not sure why such basic common sense evaded me a few versions ago...
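[For illustration only: a minimal sketch of the ordering assertion described above. The names (TidSketch, tid_array_append) are hypothetical, not from the patch; the point is that a binary-searchable array can assert ascending input at append time.]

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for an ItemPointer: block + offset. */
typedef struct { uint32_t block; uint16_t offset; } TidSketch;

static uint64_t
tid_as_uint64(TidSketch tid)
{
    return ((uint64_t) tid.block << 16) | tid.offset;
}

/* Append with the ordering assertion the mail describes:
 * binary search over the array only works if input stays sorted. */
static void
tid_array_append(TidSketch *array, int *n, TidSketch tid)
{
    assert(*n == 0 || tid_as_uint64(tid) > tid_as_uint64(array[*n - 1]));
    array[(*n)++] = tid;
}
```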
> > >
> > > Right. TidStore is implemented not only for heap, so loading
> > > out-of-order TIDs might be important in the future.
> >
> > That's what I was probably thinking about some weeks ago, but I'm having a hard time imagining how it would come up, even for something like the conveyor-belt concept.
> >
> > > We have the following WIP comment in test_radixtree:
> > >
> > > // WIP: compiles with warnings because rt_attach is defined but not used
> > > // #define RT_SHMEM
> > >
> > > How about unsetting RT_SCOPE to suppress warnings for unused rt_attach
> > > and friends?
> >
> > Sounds good to me, and the other fixes make sense as well.
>
> Thanks, I merged them.
>
> >
> > > FYI I've briefly tested the TidStore with blocksize = 32kb, and it
> > > seems to work fine.
> >
> > That was on my list, so great! How about the other end -- nominally we allow 512b. (In practice it won't matter, but this would make sure I didn't mess anything up when forcing all MaxTuplesPerPage to encode.)
>
> According to the doc, the minimum block size is 1kB. It seems to work
> fine with 1kB blocks.
>
> >
> > > You removed the vacuum integration patch from v27, is there any reason for that?
> >
> > Just an oversight.
> >
> > Now for some general comments on the tid store...
> >
> > + * TODO: The caller must be certain that no other backend will attempt to
> > + * access the TidStore before calling this function. Other backend must
> > + * explicitly call tidstore_detach to free up backend-local memory associated
> > + * with the TidStore. The backend that calls tidstore_destroy must not call
> > + * tidstore_detach.
> > + */
> > +void
> > +tidstore_destroy(TidStore *ts)
> >
> > Do we need to do anything for this todo?
>
> Since it's practically no problem, I think we can live with it for
> now. dshash also has the same todo.
>
> >
> > It might help readability to have a concept of "off_upper/off_lower", just so we can describe things more clearly. The key is block + off_upper, and the value is a bitmap of all the off_lower bits. I hinted at that in my addition of encode_key_off(). Along those lines, maybe s/TIDSTORE_OFFSET_MASK/TIDSTORE_OFFSET_LOWER_MASK/. Actually, I'm not even sure the TIDSTORE_ prefix is valuable for these local macros.
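[For illustration only: a sketch of the key/value split described above, under the assumption of a 64-bit bitmap value, so the low 6 bits of the offset select a bit and the rest join the block number to form the radix tree key. The constants and names here are illustrative, not the patch's.]

```c
#include <stdint.h>

#define OFFSET_LOWER_BITS 6                             /* 64-bit bitmap value */
#define OFFSET_LOWER_MASK ((1 << OFFSET_LOWER_BITS) - 1)

/* key = block# concatenated with off_upper */
static uint64_t
encode_key(uint32_t block, uint16_t offset)
{
    return ((uint64_t) block << (16 - OFFSET_LOWER_BITS)) |
        (offset >> OFFSET_LOWER_BITS);
}

/* value: one bit per off_lower within that key's bitmap */
static uint64_t
offset_lower_bit(uint16_t offset)
{
    return UINT64_C(1) << (offset & OFFSET_LOWER_MASK);
}
```

Offsets 0..63 of the same block share one key and differ only in which bit of the bitmap is set; offset 64 rolls over to the next key.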
> >
> > The word "value" as a variable name is pretty generic in this context, and it might be better to call it the
off_lower_bitmap,at least in some places. The "key" doesn't have a good short term for naming, but in comments we
shouldmake sure we're clear it's "block# + off_upper". 
> >
> > I'm not a fan of the name "tid_i", even as a temp variable -- maybe "compressed_tid"?
> >
> > maybe s/tid_to_key_off/encode_tid/ and s/encode_key_off/encode_block_offset/
> >
> > It might be worth using typedefs for key and value type. Actually, since key type is fixed for the foreseeable future, maybe the radix tree template should define a key typedef?
> >
> > The term "result" is probably fine within the tidstore, but as a public name used by vacuum, it's not very
descriptive.I don't have a good idea, though. 
> >
> > Some files in backend/access use CamelCase for public functions, although it's not consistent. I think doing that for tidstore would help readability, since they would stand out from rt_* functions and vacuum functions. It's a matter of taste, though.
> >
> > I don't understand the control flow in tidstore_iterate_next(), or when BlockNumberIsValid() is true. If this is the best way to code this, it needs more commentary.
>
> The attached 0008 patch addressed all above comments on tidstore.
>
> > Some comments on vacuum:
> >
> > I think we'd better get some real-world testing of this, fairly soon.
> >
> > I had an idea: If it's not too much effort, it might be worth splitting it into two parts: one that just adds the store (not caring about its memory limits or progress reporting etc). During index scan, check both the new store and the array and log a warning (we don't want to exit or crash, better to try to investigate while live if possible) if the result doesn't match. Then perhaps set up an instance and let something like TPC-C run for a few days. The second patch would just restore the rest of the current patch. That would help reassure us it's working as designed.
>
> Yeah, I did a similar thing in an earlier version of the tidstore patch.
> Since we're trying to introduce two new components, radix tree and
> tidstore, I sometimes find it hard to investigate failures happening
> during lazy (parallel) vacuum due to a bug in either tidstore or the
> radix tree. If there is a bug in lazy vacuum, we cannot even do initdb.
> So it might be a good idea to do such checks in USE_ASSERT_CHECKING (or
> with another macro, say DEBUG_TIDSTORE) builds. For example, TidStore
> stores tids in both the radix tree and an array, and checks whether the
> results match on lookup or iteration. It will use more memory, but that
> would not be a big problem in USE_ASSERT_CHECKING builds. It would
> also be great if we can enable such checks on some bf animals.

I've tried this idea. Enabling this check on all debug builds (i.e.,
with the USE_ASSERT_CHECKING macro) doesn't seem like a good idea, so
I used a dedicated macro for it, TIDSTORE_DEBUG. I think we can define
this macro on some bf animals (or possibly a new bf animal).
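[For illustration only: a minimal sketch of the TIDSTORE_DEBUG cross-check idea, with a plain array standing in for both the primary store and the parallel debug copy. The names (TidStoreSketch, tidstore_sketch_lookup) are hypothetical, not from the patch.]

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TIDSTORE_DEBUG

typedef struct
{
    uint64_t   *items;          /* stand-in for the radix tree */
    int         nitems;
#ifdef TIDSTORE_DEBUG
    uint64_t   *debug_items;    /* parallel plain array kept in sync */
    int         debug_nitems;
#endif
} TidStoreSketch;

static bool
array_contains(const uint64_t *a, int n, uint64_t key)
{
    for (int i = 0; i < n; i++)
        if (a[i] == key)
            return true;
    return false;
}

/* Lookup that cross-checks both representations in debug builds. */
static bool
tidstore_sketch_lookup(const TidStoreSketch *ts, uint64_t key)
{
    bool        found = array_contains(ts->items, ts->nitems, key);

#ifdef TIDSTORE_DEBUG
    /* The debug copy must always agree with the primary store. */
    assert(found == array_contains(ts->debug_items, ts->debug_nitems, key));
#endif
    return found;
}
```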

Regards,

--
Masahiko Sawada
Amazon Web Services: https://aws.amazon.com

Attachments
