Re: Adding facility for injection points (or probe points?) for more advanced tests

From Michael Paquier
Subject Re: Adding facility for injection points (or probe points?) for more advanced tests
Date
Msg-id ZVNftCZgry5pSDzD@paquier.xyz
In reply to Re: Adding facility for injection points (or probe points?) for more advanced tests  (Andres Freund <andres@anarazel.de>)
Responses Re: Adding facility for injection points (or probe points?) for more advanced tests  (Alvaro Herrera <alvherre@alvh.no-ip.org>)
List pgsql-hackers
On Fri, Nov 10, 2023 at 06:32:27PM -0800, Andres Freund wrote:
> I would like to see a few example tests using this facility - without that
> it's a bit hard to judge how the impact on core code would be and how easy
> tests are to write.

Sure.  I was wondering if people would be interested in that first.

> It also seems like there's a few bits and pieces missing to actually be able
> to write interesting tests. It's one thing to be able to inject code, but what
> you commonly want to do for tests is to actually wait for such a spot in the
> code to be reached, then perhaps wait inside the "modified" code, and do
> something else in the test script. But as-is a decent amount of C code would
> need to be written to write such a test, from what I can tell?

Depends on what you'd want to achieve.  As I mentioned at the top of
the thread, errors, fatals, panics and hardcoded waits are the most
common cases I've seen in the last few years.  Conditional waits are
not in the main patch, but they are simple to support (as done in the
attached 0003, with a TAP example).

While at it, I have extended the patch so that the hash table stores a
library name and a function name, so that the callback is loaded each
time an injection point is run.  (Perhaps the list of callbacks already
loaded in a process should be saved in a session-level static
list/array to avoid loading the same callbacks again.  I am not sure
that's worth doing for a test facility, assuming that the number of
times a callback is called in a single session is usually very
limited.  Anyway, that would be simple to add if people prefer this
addition.)

Anyway, here is a short list of commits that could have benefited
from this facility.  There are many more, but that's a list I
grabbed quickly from my notes:
1) 8a4237908c0f
2) cb0cca188072
3) 7863ee4def65 (See https://postgr.es/m/YnT/Y2sEYj7pyOdc@paquier.xyz
where an expensive TAP test was included, and I've seen users facing
this bug in real life).  Revert of the original is clean here as well.
The trick is simple: stop a restartpoint during a promotion, and let
the restartpoint finish after the promotion.
4) 409f9ca44713, where injecting an error would stress the consistency
of the data reset (I mentioned an error injection at
https://postgr.es/m/YWZk6nmAzQZS4B/z@paquier.xyz).  This reverts
cleanly even today.
5) b4721f39505b, quite similar (mentioned an error injection exactly
here: https://postgr.es/m/20181011033810.GB23570@paquier.xyz).  This
one requires an error when a transaction is started, something that
can be achieved if the error is triggered conditionally (note that a
hard failure would prevent the transaction from beginning with the
initial snapshot taken in InitPostgres, but the module could just use
a static variable to track that).
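
The conditional error described in 5) could look roughly like the
following sketch.  It is in Python rather than C, and it rests on an
assumption about what the static variable tracks: here the callback lets
the first transaction start through (the one associated with the initial
snapshot) and fails on every later one.  The names InjectedError,
error_on_transaction_start and _seen_first_start are hypothetical.

```python
class InjectedError(Exception):
    """Error raised by the injection point callback."""


# Stand-in for a static variable in the test module.
_seen_first_start = False


def error_on_transaction_start(name):
    """Fail on every transaction start except the first one seen."""
    global _seen_first_start
    if not _seen_first_start:
        # Let the initial transaction through, and remember it happened.
        _seen_first_start = True
        return None
    raise InjectedError(f"error triggered by injection point {name}")
```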

Among these, I have implemented two examples on top of the main patch
set in 0002 and 0003: 4) as a TAP test with replication commands and
an error injection, and 3) that relies on a custom wait event and a
condition variable to make the test posted on the other thread
cheaper, with an injection point waiting for a condition variable in
the middle of a restartpoint in the checkpointer.  I don't necessarily
mean to include all that in the upstream tree; these are just here
for reference for now.

3) is the most interesting in this set, for sure.  That was a nasty
problem, and some cheap coverage in the core tree could be really good
for it, so I'd like to propose it for commit after more polishing.  The
test for bug 3) that I am referring to originally took 30~45s to run,
and it was unstable as it could time out.  With an injection point it
takes 1~2s.  (Note that test_injection_points gains wait/wake logic
to be able to use condition variables to wait on the restartpoint of a
promoted standby.)  Neither test is ready for prime time yet, but
that's enough for a set of examples IMHO to show what can be done.
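
To make the wait/wake idea concrete, here is a minimal sketch of the
pattern, in Python with threads rather than the C module with backend
processes.  It is not the code of 0003; WaitPoint, wait_until_reached and
demo are hypothetical names.  A worker thread plays the role of the
restartpoint and blocks on a condition variable at the injection point,
and the "test script" waits for that spot to be reached, does its own
work, then wakes the worker.

```python
import threading


class WaitPoint:
    """Injection point that blocks its caller until it is woken up."""

    def __init__(self):
        self._cond = threading.Condition()
        self._waiting = False
        self._released = False

    def wait(self, timeout=5.0):
        """Called from the injected code; blocks until wake() is called."""
        with self._cond:
            self._waiting = True
            self._cond.notify_all()  # tell the test the point was reached
            woken = self._cond.wait_for(lambda: self._released, timeout)
            self._waiting = False
            return woken

    def wait_until_reached(self, timeout=5.0):
        """Called from the test; blocks until some caller sits in wait()."""
        with self._cond:
            return self._cond.wait_for(lambda: self._waiting, timeout)

    def wake(self):
        """Called from the test; releases the caller blocked in wait()."""
        with self._cond:
            self._released = True
            self._cond.notify_all()


def demo():
    """Stop a fake restartpoint midway, 'promote', then let it finish."""
    point = WaitPoint()
    events = []

    def restartpoint():
        events.append("restartpoint started")
        point.wait()
        events.append("restartpoint finished")

    worker = threading.Thread(target=restartpoint)
    worker.start()
    if not point.wait_until_reached():
        raise RuntimeError("injection point was never reached")
    events.append("promotion done")
    point.wake()
    worker.join()
    return events
```

The ordering is deterministic because the test only proceeds once the
worker is known to be blocked inside the point, which is what removes the
need for long timing-based sleeps.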

Does it answer your questions?
--
Michael
