Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From: Marina Polyakova
Subject: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors
Date:
Msg-id: 976a25a46420c93866ada4051976b1be@postgrespro.ru
In reply to: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors  (Fabien COELHO <coelho@cri.ensmp.fr>)
Responses: Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors  (Alvaro Herrera <alvherre@2ndquadrant.com>)
           Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors  (Fabien COELHO <coelho@cri.ensmp.fr>)
List: pgsql-hackers
On 11-07-2018 16:24, Fabien COELHO wrote:
> Hello Marina,
> 
>>> * -d/--debug: I'm not in favor in requiring a mandatory text argument 
>>> on this option.
>> 
>> As you wrote in [1], adding an additional option is also a bad idea:
> 
> Hey, I'm entitled to some internal contradictions:-)

... and discussions will continue forever %-)

>>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
>>> already there and should just be reused.
> 
> I was thinking that you could just use the existing --debug, not
> change its syntax. My point was that --debug exists, and you could
> just print
> the messages when under --debug.

Now I understand you better, thanks. I think it would be useful to be 
able to receive only the messages about failures, because they and the 
progress reports can get lost among the many other debug messages such as 
"client %d sending ..." / "client %d executing ..." / "client %d receiving".

>> Maybe it's better to use an optional argument/arguments for 
>> compatibility (--debug[=fails] or --debug[=NUM])? But if we use the 
>> numbers, now I can see only 2 levels, and there's no guarantee that 
>> they will not change..
> 
> Optional arguments to options (!) are not really clean things, so I'd
> like to avoid going onto this path, esp. as I cannot see any other
> instance in pgbench or elsewhere in postgres,

AFAICS they are used in pg_waldump (option --stats[=record]) and in psql 
(option --help[=topic]).

> and I personally
> consider these a bad idea.

> So if absolutely necessary, a new option is still better than changing
> --debug syntax. If not necessary, then it is better:-)

Ok!

>>> * I'm reserved about the whole ereport thing, see comments in other
>>> messages.
>> 
>> Thank you, I'll try to implement the error reporting in the way you 
>> suggested.
> 
> Dunno if it is a good idea either. The committer's word is the final
> one in the end:-)

I agree with you that ereport has good reasons to be non-trivial in the 
backend, and that those reasons do not apply in pgbench.

>>> * doCustom changes.
> 
>>> 
>>> On CSTATE_FAILURE, the next command is possibly started. Although 
>>> there is some consistency with the previous point, I think that it 
>>> totally breaks the state automaton where now a command can start 
>>> while the whole transaction is in failing state anyway. There was no 
>>> point in starting it in the first place.
>>> 
>>> So, for me, the FAILURE state should record/count the failure, then 
>>> skip
>>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
>>> This is much clearer that way.
>>> 
>>> Then RETRY should reinstate the global state and proceed to start the 
>>> *first* command again.
>>> <...>
>>> 
>>> It is unclear to me why backslash command errors are turned to 
>>> FAILURE
>>> instead of ABORTED: there is no way they are going to be retried, so
>>> maybe they should/could skip directly to ABORTED?
> 
>> So do you propose to execute the command "ROLLBACK" without 
>> calculating its latency etc. if we are in a failed transaction and 
>> clear the conditional stack after each failure?
> 
>> Also just to be clear: do you want to have the state CSTATE_ABORTED 
>> for client abortion and another state for interrupting the current 
>> transaction?
> 
> I do not understand what "interrupting the current transaction" means.
> A transaction is either committed or rolled back, I do not know about
> "interrupted".

I mean that IIUC the server usually only reports the error, and you must 
manually send the command "END" or "ROLLBACK" to roll back a failed 
transaction.
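For illustration, a libpq client has to do something like this itself (a 
minimal sketch; "conn" is an open connection):

/*
 * After an error inside an explicit transaction block, the connection
 * stays in the failed state until the client sends ROLLBACK (or END).
 */
if (PQtransactionStatus(conn) == PQTRANS_INERROR)
{
    PGresult *res = PQexec(conn, "ROLLBACK");

    PQclear(res);
}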

> When it is rolled back, probably some stats will be
> collected in passing, I'm fine with that.
> 
> If there is an error in a pgbench script, the transaction is aborted,
> which means for me that the script execution is stopped where it was,
> and either it is restarted from the beginning (retry) or counted as
> failure (not retry, just aborted, really).
> 
> If by interrupted you mean that one script begins a transaction and
> another ends it, as I said in the review I think that this strange
> case should be forbidden, so that all the code and documentation
> trying to
> manage that can be removed.

Ok!
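If I understand the proposal correctly, the transitions in doCustom() 
would then look roughly like this (a sketch only; st->failures, 
canRetryError() and RestoreVariables() are illustrative names, not code 
from the patch):

case CSTATE_FAILURE:
    /* record/count the failure, nothing else */
    st->failures++;
    st->state = canRetryError(st) ? CSTATE_RETRY : CSTATE_ABORTED;
    break;

case CSTATE_RETRY:
    /* reinstate the global state and replay from the *first* command */
    st->retries++;
    RestoreVariables(st);
    st->command = 0;
    st->state = CSTATE_START_COMMAND;
    break;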

>>> The current RETRY state does memory allocations to generate a message
>>> with buffer allocation and so on. This looks like a costly and 
>>> useless
>>> operation. If the user required "retries", then this is normal 
>>> behavior,
>>> the retries are counted and will be printed out in the final report,
>>> and there is no point in printing out every single one of them.
>>> Maybe you want that debugging, but then costly operations should be 
>>> guarded.
>> 
>> I think we need these debugging messages because, for example,
> 
> Debugging messages should cost only when under debug. When not under
> debug, there should be no debugging message, and there should be no
> cost for building and discarding such messages in the executed code
> path beyond
> testing whether program is under debug.
> 
>> if you use the option --latency-limit, we will never know in 
>> advance whether the serialization/deadlock failure will be retried or 
>> not.
> 
> ISTM that it will be shown in the final report. If I want debug, I ask for
> --debug, otherwise I think that the command should do what it was
> asked for, i.e. run scripts, collect performance statistics and show
> them at the end.
> 
> In particular, when running with retries enabled, the user is
> expecting deadlock/serialization errors, so that they are not "errors"
> as such for them.
> 
>> They also help to understand which limit of retries was violated or 
>> how close we were to these limits during the execution of a specific 
>> transaction. But I agree with you that they are costly and can be 
>> skipped if the failure type is never retried. Maybe it is better to 
>> split them into multiple error function calls?..
> 
> Debugging message costs should only be incurred when under --debug,
> not otherwise.

Ok! IIUC instead of this part of the code

initPQExpBuffer(&errmsg_buf);
printfPQExpBuffer(&errmsg_buf,
                  "client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
if (max_tries)
    appendPQExpBuffer(&errmsg_buf, "/%d", max_tries);
if (latency_limit)
{
    appendPQExpBuffer(&errmsg_buf,
                      ", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
}
appendPQExpBufferStr(&errmsg_buf, ")\n");
pgbench_error(DEBUG_FAIL, "%s", errmsg_buf.data);
termPQExpBuffer(&errmsg_buf);

can we try something like this?

PGBENCH_ERROR_START(DEBUG_FAIL)
{
    PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
    if (max_tries)
        PGBENCH_ERROR("/%d", max_tries);
    if (latency_limit)
    {
        PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
    }
    PGBENCH_ERROR(")\n");
}
PGBENCH_ERROR_END();
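where the macros might be defined along these lines (just a sketch; 
"log_min_level" and the exact guard are assumptions, the only point being 
that nothing is allocated or formatted unless the level is enabled):

#define PGBENCH_ERROR_START(level)          \
    if ((level) >= log_min_level)           \
    {                                       \
        int elevel_ = (level);              \
        PQExpBufferData errbuf_;            \
        initPQExpBuffer(&errbuf_);

#define PGBENCH_ERROR(...) \
        appendPQExpBuffer(&errbuf_, __VA_ARGS__)

#define PGBENCH_ERROR_END()                          \
        pgbench_error(elevel_, "%s", errbuf_.data);  \
        termPQExpBuffer(&errbuf_);                   \
    }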

>>> You have added 20-column alignment prints. This looks like too much 
>>> and
>>> generates much too long lines. Probably 10 (billion) would be 
>>> enough.
>> 
>> I have already asked you about this in [2]:
> 
> Probably:-)
> 
>>> The variables for the numbers of failures and retries are of type 
>>> int64
>>> since the variable for the total number of transactions has the same
>>> type. That's why such a large alignment (as I understand it now, 20 
>>> characters is enough). Do you prefer floating alignments, depending on the
>>> maximum number of failures/retries for any command in any script?
> 
> An int64 counter is not likely to reach its limit anytime soon:-) If
> the column display limit is ever reached, ISTM that then the text is
> just misaligned, which is a minor and rare inconvenience. If very wide
> columns are used, then it does not fit my terminal and the report text
> will always be wrapped around, which makes it harder to read, every
> time.

Ok!
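For what it is worth, a 10-character column for a 64-bit counter can be 
printed portably like this (a sketch using the standard <inttypes.h> 
macro, not code from the patch):

#include <inttypes.h>
#include <stdio.h>

/* a right-aligned, 10-character-wide column for an int64_t counter */
printf("%10" PRId64 " failures\n", failures);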

>>> The test with the latency limit set to 900 ms is a bad idea because 
>>> it takes a lot of time. I did such tests before and they were removed 
>>> by Tom Lane because of determinism and time issues. I would comment 
>>> this test out for now.
>> 
Ok! If it doesn't bother you, can you tell me more about the causes of 
>> these determinism issues?.. Tests for some other failures that cannot 
>> be retried are already added to 001_pgbench_with_server.pl.
> 
> Some buildfarm animals are very slow, so you cannot really assume much
> about timing one way or another.

Thanks!

>>> I do not understand why there is so much text about the in-failed-sql-
>>> transaction stuff, while we are mainly interested in serialization & 
>>> deadlock errors, and this only falls in some "other" category. There 
>>> seem to be more details about other errors than about deadlock & 
>>> serialization errors.
>>> 
>>> The reporting should focus on what is of interest, either all errors, 
>>> or some detailed split of these errors.
>>> 
>>> <...>
>>> 
>>> * "errors_in_failed_tx" is some subcounter of "errors", for a special
>>> case. Why it is there escapes me [I finally understood, and I think it
>>> should be removed, see end of review]. If we wanted to distinguish,
>>> then we should distinguish homogeneously: maybe just count the
>>> different error types, eg have things like "deadlock_errors",
>>> "serializable_errors", "other_errors", "internal_pgbench_errors" 
>>> which
>>> would be orthogonal one to the other, and "errors" could be 
>>> recomputed
>>> from these.
>> 
>> Thank you, I agree with you. Unfortunately each new error type adds 1 
>> or 2 new columns of maximum width 20 to the per-statement report
> 
> The fact that some data are collected does not mean that they should
> all be reported in detail. We can have detailed error counts and report
> the sum of these errors for instance, or have some more
> verbose/detailed reports
> as options (eg --latencies does just that).

Ok!
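For instance, the detailed counters could be kept internally and only 
their sum reported by default (a sketch; the struct and field names are 
mine, not from the patch):

typedef struct ErrorCounters
{
    int64 serialization_errors;
    int64 deadlock_errors;
    int64 other_errors;         /* everything that is never retried */
} ErrorCounters;

/* "errors" is recomputed from the orthogonal counters */
static int64
total_errors(const ErrorCounters *ec)
{
    return ec->serialization_errors + ec->deadlock_errors +
           ec->other_errors;
}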

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

