Discussion: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

[HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
Hello, hackers!

At present, in pgbench we can test only transactions with the Read Committed 
isolation level, because client sessions are disconnected forever on 
serialization failures. There have been some proposals and discussions about 
this (see the message here [1] and the thread here [2]).

I suggest a patch where pgbench client sessions are not disconnected 
because of serialization or deadlock failures, and these failures are 
mentioned in reports. In detail:
- a transaction with one of these failures continues to run normally, but its 
result is rolled back;
- if there were such failures during script execution, this "transaction" is 
marked appropriately in the logs;
- the numbers of "transactions" with these failures are printed in the progress 
output, in the aggregation logs and at the end with the other results (in 
total and for each script);

Advanced options:
- mostly for testing built-in scripts: you can set the default 
transaction isolation level with the appropriate benchmarking option (-I);
- for more detailed reports: to see per-statement serialization and 
deadlock failures, you can use the appropriate benchmarking option 
(--report-failures).

Also included: TAP tests for the new functionality and updated documentation 
with new examples.

Patches are attached. Any suggestions are welcome!

P.S. Does this use case (do not retry transaction with serialization or 
deadlock failure) is most interesting or failed transactions should be 
retried (and how much times if there seems to be no hope of success...)?

[1] 
https://www.postgresql.org/message-id/4EC65830020000250004323F%40gw.wicourts.gov
[2] 

https://www.postgresql.org/message-id/flat/alpine.DEB.2.02.1305182259550.1473%40localhost6.localdomain6#alpine.DEB.2.02.1305182259550.1473@localhost6.localdomain6

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Robert Haas
Date:
On Wed, Jun 14, 2017 at 4:48 AM, Marina Polyakova
<m.polyakova@postgrespro.ru> wrote:
> Now in pgbench we can test only transactions with Read Committed isolation
> level because client sessions are disconnected forever on serialization
> failures. There were some proposals and discussions about it (see message
> here [1] and thread here [2]).
>
> I suggest a patch where pgbench client sessions are not disconnected because
> of serialization or deadlock failures and these failures are mentioned in
> reports. In details:
> - transaction with one of these failures continue run normally, but its
> result is rolled back;
> - if there were these failures during script execution this "transaction" is
> marked
> appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in
> aggregation logs and in the end with other results (all and for each
> script);
>
> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);
> - for more detailed reports: to know per-statement serialization and
> deadlock failures you can use the appropriate benchmarking option
> (--report-failures).
>
> Also: TAP tests for new functionality and changed documentation with new
> examples.
>
> Patches are attached. Any suggestions are welcome!

Sounds like a good idea.  Please add to the next CommitFest and review
somebody else's patch in exchange for having your own patch reviewed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Sounds like a good idea.

Thank you!

> Please add to the next CommitFest

Done: https://commitfest.postgresql.org/14/1170/

> and review
> somebody else's patch in exchange for having your own patch reviewed.

Of course, I remember about it.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Andres Freund
Date:
Hi,

On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
> Now in pgbench we can test only transactions with Read Committed isolation
> level because client sessions are disconnected forever on serialization
> failures. There were some proposals and discussions about it (see message
> here [1] and thread here [2]).

> I suggest a patch where pgbench client sessions are not disconnected because
> of serialization or deadlock failures and these failures are mentioned in
> reports.

I think that's a good idea and sorely needed.


> In details:


> - if there were these failures during script execution this "transaction" is
> marked
> appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in
> aggregation logs and in the end with other results (all and for each
> script);

I guess that'll include a 'rolled-back %' or 'retried %' somewhere?


> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);

I'm less convinced of the need for that; you can already set arbitrary
connection options with
PGOPTIONS='-c default_transaction_isolation=serializable' pgbench


> P.S. Does this use case (do not retry transaction with serialization or
> deadlock failure) is most interesting or failed transactions should be
> retried (and how much times if there seems to be no hope of success...)?

I can't quite parse that sentence, could you restate?

- Andres



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Kevin Grittner
Date:
On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:

>> I suggest a patch where pgbench client sessions are not disconnected because
>> of serialization or deadlock failures and these failures are mentioned in
>> reports.
>
> I think that's a good idea and sorely needed.

+1

>> P.S. Does this use case (do not retry transaction with serialization or
>> deadlock failure) is most interesting or failed transactions should be
>> retried (and how much times if there seems to be no hope of success...)?
>
> I can't quite parse that sentence, could you restate?

The way I read it was that the most interesting solution would retry
a transaction from the beginning on a serialization failure or
deadlock failure.  Most people who use serializable transactions (at
least in my experience) run through a framework that does that
automatically, regardless of what client code initiated the
transaction.  These retries are generally hidden from the client
code -- it just looks like the transaction took a bit longer.
Sometimes people will have a limit on the number of retries.  I
never used such a limit and never had a problem, because our
implementation of serializable transactions will not throw a
serialization failure error until one of the transactions involved
in causing it has successfully committed -- meaning that the retry
can only hit this again on a *new* set of transactions.

Essentially, the transaction should only count toward the TPS rate
when it eventually completes without a serialization failure.

Marina, did I understand you correctly?

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:

> >> P.S. Does this use case (do not retry transaction with serialization or
> >> deadlock failure) is most interesting or failed transactions should be
> >> retried (and how much times if there seems to be no hope of success...)?
> >
> > I can't quite parse that sentence, could you restate?
> 
> The way I read it was that the most interesting solution would retry
> a transaction from the beginning on a serialization failure or
> deadlock failure.

As far as I understand her proposal, it is exactly the opposite -- if a
transaction fails, it is discarded.  And this P.S. note is asking
whether this is a good idea, or would we prefer that failing
transactions are retried.

I think it's pretty obvious that transactions that failed with
some serializability problem should be retried.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Thomas Munro
Date:
On Fri, Jun 16, 2017 at 9:18 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
>
>> >> P.S. Does this use case (do not retry transaction with serialization or
>> >> deadlock failure) is most interesting or failed transactions should be
>> >> retried (and how much times if there seems to be no hope of success...)?
>> >
>> > I can't quite parse that sentence, could you restate?
>>
>> The way I read it was that the most interesting solution would retry
>> a transaction from the beginning on a serialization failure or
>> deadlock failure.
>
> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.
>
> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

+1 for retry with reporting of retry rates

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Kevin Grittner
Date:
On Thu, Jun 15, 2017 at 4:18 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.
>
> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

Agreed all around.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Hi,

Hello!

> I think that's a good idea and sorely needed.

Thanks, I'm very glad to hear it!

>> - if there were these failures during script execution this 
>> "transaction" is
>> marked
>> appropriately in logs;
>> - numbers of "transactions" with these failures are printed in 
>> progress, in
>> aggregation logs and in the end with other results (all and for each
>> script);
> 
> I guess that'll include a "rolled-back %' or 'retried %' somewhere?

Not exactly, see documentation:

+   If transaction has serialization / deadlock failure or them both 
(last thing
+   is possible if used script contains several transactions; see
+   <xref linkend="transactions-and-scripts"
+   endterm="transactions-and-scripts-title"> for more information), its
+   <replaceable>time</> will be reported as <literal>serialization 
failure</> /
+   <literal>deadlock failure</> /
+   <literal>serialization and deadlock failures</> appropriately.

+   Example with serialization, deadlock and both these failures:
+<screen>
+1 128 24968 0 1496759158 426984
+0 129 serialization failure 0 1496759158 427023
+3 129 serialization failure 0 1496759158 432662
+2 128 serialization failure 0 1496759158 432765
+0 130 deadlock failure 0 1496759159 460070
+1 129 serialization failure 0 1496759160 485188
+2 129 serialization and deadlock failures 0 1496759160 485339
+4 130 serialization failure 0 1496759160 485465
+</screen>

I have understood from the proposals in the next messages of this thread that 
the most interesting case is to retry a failed transaction. Do you think it's 
better to write, for example, 'rolled-back after % retries (serialization 
failure)' or 'time (retried % times, serialization and deadlock 
failures)'?

>> Advanced options:
>> - mostly for testing built-in scripts: you can set the default 
>> transaction
>> isolation level by the appropriate benchmarking option (-I);
> 
> I'm less convinced of the need of htat, you can already set arbitrary
> connection options with
> PGOPTIONS='-c default_transaction_isolation=serializable' pgbench

Oh, thanks, I forgot about it =[

>> P.S. Does this use case (do not retry transaction with serialization 
>> or
>> deadlock failure) is most interesting or failed transactions should be
>> retried (and how much times if there seems to be no hope of 
>> success...)?

> I can't quite parse that sentence, could you restate?

Álvaro Herrera, later in this thread, understood my text correctly:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

With his explanation, has my text become clearer?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
>>> P.S. Does this use case (do not retry transaction with serialization 
>>> or
>>> deadlock failure) is most interesting or failed transactions should 
>>> be
>>> retried (and how much times if there seems to be no hope of 
>>> success...)?
>> 
>> I can't quite parse that sentence, could you restate?
> 
> The way I read it was that the most interesting solution would retry
> a transaction from the beginning on a serialization failure or
> deadlock failure.  Most people who use serializable transactions (at
> least in my experience) run though a framework that does that
> automatically, regardless of what client code initiated the
> transaction.  These retries are generally hidden from the client
> code -- it just looks like the transaction took a bit longer.
> Sometimes people will have a limit on the number of retries.  I
> never used such a limit and never had a problem, because our
> implementation of serializable transactions will not throw a
> serialization failure error until one of the transactions involved
> in causing it has successfully committed -- meaning that the retry
> can only hit this again on a *new* set of transactions.
> 
> Essentially, the transaction should only count toward the TPS rate
> when it eventually completes without a serialization failure.
> 
> Marina, did I understand you correctly?

Álvaro Herrera, in the next message of this thread, understood my text 
correctly:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

And thank you very much for your explanation of how and why failed 
transactions should be retried! I'll try to implement all of it.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
>> >> P.S. Does this use case (do not retry transaction with serialization or
>> >> deadlock failure) is most interesting or failed transactions should be
>> >> retried (and how much times if there seems to be no hope of success...)?
>> >
>> > I can't quite parse that sentence, could you restate?
>> 
>> The way I read it was that the most interesting solution would retry
>> a transaction from the beginning on a serialization failure or
>> deadlock failure.
> 
> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

Yes, that's what I meant, thank you!

> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

Thank you for your vote :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Kevin Grittner
Date:
On Fri, Jun 16, 2017 at 5:31 AM, Marina Polyakova
<m.polyakova@postgrespro.ru> wrote:

> And thank you very much for your explanation how and why transactions with
> failures should be retried! I'll try to implement all of it.

To be clear, part of "retrying from the beginning" means that if a
result from one statement is used to determine the content (or
whether to run) a subsequent statement, that first statement must be
run in the new transaction and the results evaluated again to
determine what to use for the later statement.  You can't simply
replay the statements that were run during the first try.  For
examples, to help get a feel of why that is, see:

https://wiki.postgresql.org/wiki/SSI
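
For instance, a minimal sketch (hypothetical table, column and values) of a
transaction where a later statement depends on an earlier result, so the
whole thing has to be re-evaluated on a retry:

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  -- Read the current balance; the client inspects the returned value...
  SELECT balance FROM accounts WHERE id = 1;
  -- ...and issues the withdrawal only if that balance was sufficient.
  -- On a retry the SELECT must be run again and the decision re-evaluated,
  -- because the balance may have changed since the failed attempt.
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  COMMIT;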

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> To be clear, part of "retrying from the beginning" means that if a
> result from one statement is used to determine the content (or
> whether to run) a subsequent statement, that first statement must be
> run in the new transaction and the results evaluated again to
> determine what to use for the later statement.  You can't simply
> replay the statements that were run during the first try.  For
> examples, to help get a feel of why that is, see:


Thank you again! :))

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

A few comments about the submitted patches.

I agree that improving the error handling ability of pgbench is a good 
thing, although I'm not sure about the implications...

About the "retry" discussion: I agree that retry is the relevant option 
from an application point of view.

ISTM that the retry implementation should be implemented somehow in the 
automaton, restarting the same script from the beginning.

As pointed out in the discussion, the same values/commands should be 
executed, which suggests that randomly generated values should be the same 
on the retry runs, so that for a simple script the same operations are 
attempted. This means that the random generator state must be kept & 
reinstated for a client on retries. Currently the random state is in the 
thread, which is not convenient for this purpose, so it should be moved into 
the client so that it can be saved at transaction start and reinstated on 
retries.

The number of retries and maybe failures should be counted, maybe with 
some adjustable maximum, as suggested.

About 0001:

In accumStats, just use one level of if; the two levels bring nothing.

In doLog, added columns should be at the end of the format. The number of 
columns MUST NOT change when different issues arise, so that the output works 
well with cut/... Unix commands; inserting a sentence such as "serialization 
and deadlock failures" is a bad idea.

threadRun: the point of the progress format is to fit on one not too wide 
line on a terminal and to allow some simple automatic processing. Adding a 
verbose sentence in the middle of it is not the way to go.

About tests: I do not understand why test 003 includes 2 transactions. 
It would seem more logical to have two scripts.

About 0003:

I'm not sure that there should be a new option to report failures; the 
information, when relevant, should be integrated in a clean format into the 
existing reports... Maybe the "per command latency" report/option should 
be renamed if it becomes more general.

About 0004:

The documentation must not be in a separate patch, but in the same patch 
as the corresponding code.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Hello Marina,

Hello, Fabien!

> A few comments about the submitted patches.

Thank you very much for them!

> I agree that improving the error handling ability of pgbench is a good
> thing, although I'm not sure about the implications...

Could you be a little more specific? What implications are you 
worried about?

> About the "retry" discussion: I agree that retry is the relevant
> option from an application point of view.

I'm glad to hear it!

> ISTM that the retry implementation should be implemented somehow in
> the automaton, restarting the same script for the beginning.

If there are several transactions in this script, don't you think that 
we should restart only the failed transaction?

> As pointed out in the discussion, the same values/commands should be
> executed, which suggests that random generated values should be the
> same on the retry runs, so that for a simple script the same
> operations are attempted. This means that the random generator state
> must be kept & reinstated for a client on retries. Currently the
> random state is in the thread, which is not convenient for this
> purpose, so it should be moved in the client so that it can be saved
> at transaction start and reinstated on retries.

I think about it in the same way =)

> The number of retries and maybe failures should be counted, maybe with
> some adjustable maximum, as suggested.

If we fix the maximum number of attempts, the maximum number of failures 
for one script execution will be bounded above by 
(number_of_transactions_in_script * maximum_number_of_attempts). Do you 
think we should add a program option to limit this number 
further?

> About 0001:
> 
> In accumStats, just use one level if, the two levels bring nothing.

Thanks, I agree =[

> In doLog, added columns should be at the end of the format.

I inserted them earlier because these columns are not optional. Do 
you think they should be optional?

> The number
> of column MUST NOT change when different issues arise, so that it
> works well with cut/... unix commands, so inserting a sentence such as
> "serialization and deadlock failures" is a bad idea.

Thanks, I agree again.

> threadRun: the point of the progress format is to fit on one not too
> wide line on a terminal and to allow some simple automatic processing.
> Adding a verbose sentence in the middle of it is not the way to go.

I was thinking about it.. Thanks, I'll try to make it shorter.

> About tests: I do not understand why test 003 includes 2 transactions.
> It would seem more logical to have two scripts.

Ok!

> About 0003:
> 
> I'm not sure that there should be an new option to report failures,
> the information when relevant should be integrated in a clean format
> into the existing reports... Maybe the "per command latency"
> report/option should be renamed if it becomes more general.

I have tried not to change other parts of the program as much as possible. 
But if you think that it would be more useful to change the option, I'll 
do it.

> About 0004:
> 
> The documentation must not be in a separate patch, but in the same
> patch as their corresponding code.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

>> I agree that improving the error handling ability of pgbench is a good
>> thing, although I'm not sure about the implications...
>
> Could you tell a little bit more exactly.. What implications are you worried 
> about?

The current error handling is either "close connection" or maybe in some 
cases even "exit". If this is changed, then the client may continue 
execution in some unforeseen state and behave unexpectedly. We'll see.

>> ISTM that the retry implementation should be implemented somehow in
>> the automaton, restarting the same script for the beginning.
>
> If there are several transactions in this script - don't you think that we 
> should restart only the failed transaction?..

On some transaction failures based on their status. My point is that the 
retry process must be implemented clearly with a new state in the client 
automaton. Exactly when the transition to this new state must be taken is 
another issue.

>> The number of retries and maybe failures should be counted, maybe with
>> some adjustable maximum, as suggested.
>
> If we fix the maximum number of attempts the maximum number of failures for 
> one script execution will be bounded above (number_of_transactions_in_script 
> * maximum_number_of_attempts). Do you think we should make the option in 
> program to limit this number much more?

Probably not. I think that there should be a configurable maximum of 
retries on a transaction, which may be 0 by default if we want to be 
upward compatible with the current behavior, or maybe something else.

>> In doLog, added columns should be at the end of the format.
>
> I have inserted it earlier because these columns are not optional. Do you 
> think they should be optional?

I think that new non-optional columns should be at the end of the 
existing non-optional columns, so that existing scripts which process 
the output do not need to be updated.

>> I'm not sure that there should be an new option to report failures,
>> the information when relevant should be integrated in a clean format
>> into the existing reports... Maybe the "per command latency"
>> report/option should be renamed if it becomes more general.
>
> I have tried do not change other parts of program as much as possible. But if 
> you think that it will be more useful to change the option I'll do it.

I think that the option should change if its naming becomes less relevant, 
which is to be determined. AFAICS, ISTM that new measures should be added 
to the various existing reports unconditionally (i.e. without a new 
option), so maybe no new option would be needed.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> The current error handling is either "close connection" or maybe in
> some cases even "exit". If this is changed, then the client may
> continue execution in some unforseen state and behave unexpectedly.
> We'll see.

Thanks, now I understand this.

>>> ISTM that the retry implementation should be implemented somehow in
>>> the automaton, restarting the same script for the beginning.
>> 
>> If there are several transactions in this script - don't you think 
>> that we should restart only the failed transaction?..
> 
> On some transaction failures based on their status. My point is that
> the retry process must be implemented clearly with a new state in the
> client automaton. Exactly when the transition to this new state must
> be taken is another issue.

About it, I agree with you that it should be done in this way.

>>> The number of retries and maybe failures should be counted, maybe 
>>> with
>>> some adjustable maximum, as suggested.
>> 
>> If we fix the maximum number of attempts the maximum number of 
>> failures for one script execution will be bounded above 
>> (number_of_transactions_in_script * maximum_number_of_attempts). Do 
>> you think we should make the option in program to limit this number 
>> much more?
> 
> Probably not. I think that there should be a configurable maximum of
> retries on a transaction, which may be 0 by default if we want to be
> upward compatible with the current behavior, or maybe something else.

I propose the option --max-attempts-number=NUM, where NUM cannot be less 
than 1. I propose it because I think that, for example, 
--max-attempts-number=100 is better than --max-retries-number=99. And 
maybe it's better to set its default value to 1 too, because retrying 
shell commands can produce new errors..

>>> In doLog, added columns should be at the end of the format.
>> 
>> I have inserted it earlier because these columns are not optional. Do 
>> you think they should be optional?
> 
> I think that new non-optional columns it should be at the end of the
> existing non-optional columns so that existing scripts which may
> process the output may not need to be updated.

Thanks, I agree with you :)

>>> I'm not sure that there should be an new option to report failures,
>>> the information when relevant should be integrated in a clean format
>>> into the existing reports... Maybe the "per command latency"
>>> report/option should be renamed if it becomes more general.
>> 
>> I have tried do not change other parts of program as much as possible. 
>> But if you think that it will be more useful to change the option I'll 
>> do it.
> 
> I think that the option should change if its naming becomes less
> relevant, which is to be determined. AFAICS, ISTM that new measures
> should be added to the various existing reports unconditionnaly (i.e.
> without a new option), so maybe no new option would be needed.

Thanks! I didn't think about it in this way..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
>>>> The number of retries and maybe failures should be counted, maybe with
>>>> some adjustable maximum, as suggested.
>>> 
>>> If we fix the maximum number of attempts the maximum number of failures 
>>> for one script execution will be bounded above 
>>> (number_of_transactions_in_script * maximum_number_of_attempts). Do you 
>>> think we should make the option in program to limit this number much more?
>> 
>> Probably not. I think that there should be a configurable maximum of
>> retries on a transaction, which may be 0 by default if we want to be
>> upward compatible with the current behavior, or maybe something else.
>
> I propose the option --max-attempts-number=NUM which NUM cannot be less than 
> 1. I propose it because I think that, for example, --max-attempts-number=100 
> is better than --max-retries-number=99. And maybe it's better to set its 
> default value to 1 too because retrying of shell commands can produce new 
> errors..

Personally, I like counting retries because it also counts the number of 
times the transaction actually failed for some reason. But this is a 
marginal preference, and one can be switched to the other easily.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alexander Korotkov
Date:
On Thu, Jun 15, 2017 at 10:16 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);

I'm less convinced of the need for that; you can already set arbitrary
connection options with
PGOPTIONS='-c default_transaction_isolation=serializable' pgbench

Right, there is already a way to specify the default isolation level using environment variables.
However, once we make pgbench work with various isolation levels, users may want to run pgbench multiple times in a row with different isolation levels.  A command-line option would be very convenient in this case.
In addition, the isolation level is a vital parameter for interpreting benchmark results correctly.  Often, graphs with pgbench results are titled with the pgbench command line.  Having the isolation level specified on the command line would naturally fit into this titling scheme.
Of course, this is solely a usability question, and it's fair enough to live without such a command-line option.  But I'm +1 for adding this option.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
Hello everyone!

Here is the second version of my patch for pgbench. Now transactions 
with serialization and deadlock failures are rolled back and retried 
until they end successfully or their number of attempts reaches the maximum.

In detail:
- You can set the maximum number of attempts with the appropriate 
benchmarking option (--max-attempts-number). Its default value is 1, 
partly because retrying shell commands can produce new errors.
- Statistics on attempts and failures are printed in the progress output, in 
the transaction / aggregation logs and at the end with the other results (in 
total and for each script). A transaction failure is reported here only if 
the last retry of this transaction fails.
- Failures and average numbers of transaction attempts are also printed 
per command with average latencies if you use the appropriate 
benchmarking option (--report-per-command, -r) (it replaces the option 
--report-latencies, as I was advised here [1]). Average numbers of 
transaction attempts are printed only for commands which start 
transactions.

As usual: TAP tests for the new functionality and updated documentation with 
new examples.

Patch is attached. Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707031321370.3419%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> There's the second version of my patch for pgbench. Now transactions 
> with serialization and deadlock failures are rolled back and retried 
> until they end successfully or their number of attempts reaches maximum.

> In details:
>  - You can set the maximum number of attempts by the appropriate 
> benchmarking option (--max-attempts-number). Its default value is 1 
> partly because retrying of shell commands can produce new errors.
>
>  - Statistics of attempts and failures is printed in progress, in 
> transaction / aggregation logs and in the end with other results (all 
> and for each script). The transaction failure is reported here only if 
> the last retry of this transaction fails.
>
> - Also failures and average numbers of transactions attempts are printed 
> per-command with average latencies if you use the appropriate 
> benchmarking option (--report-per-command, -r) (it replaces the option 
> --report-latencies as I was advised here [1]). Average numbers of 
> transactions attempts are printed only for commands which start 
> transactions.

> As usual: TAP tests for new functionality and changed documentation with 
> new examples.

Here are a round of comments on the current version of the patch:

* About the feature

There is a latent issue about what is a transaction. For pgbench a transaction 
is a full script execution. For PostgreSQL, it is a statement or a BEGIN/END 
block, several of which may appear in a script. From a retry perspective, you 
may retry from a SAVEPOINT within a BEGIN/END block... I'm not sure how to 
make general sense of all this, so this is just a comment without attached 
action for now.
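
For illustration, a sketch of a custom script (using the standard pgbench 
tables) in which one pgbench "transaction", i.e. one script execution, spans 
two SQL transactions, each of which can fail independently:

  \set aid random(1, 100000)
  -- first SQL transaction within the script
  BEGIN;
  UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
  END;
  -- second SQL transaction within the same script
  BEGIN;
  UPDATE pgbench_branches SET bbalance = bbalance + 1 WHERE bid = 1;
  END;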

As the default is not to retry, which is the upward-compatible behavior, I 
think that the changes should not change the current output much, bar counting 
the number of failures.

I would consider using "try/tries" instead of "attempt/attempts" as it is 
shorter. A native English speaker's opinion would be welcome on that point.

* About the code

ISTM that the code interacts significantly with various patches under review or ready for committers.
Not sure how to deal with that, there will be some rebasing work...

I'm fine with renaming "is_latencies" to "report_per_command", which is more logical & generic.

"max_attempt_number": I'm against typing fields again in their name, aka "hungarian naming". I'd suggest
"max_tries" or "max_attempts".

"SimpleStats attempts": I disagree with using this floating poiunt oriented structures to count integers.
I would suggest "int64 tries" instead, which should be enough for the 
purpose.

LastBeginState -> RetryState? I'm not sure why this state is a pointer in 
CState. Putting the struct would avoid malloc/free cycles. Index "-1" may 
be used to tell it is not set if necessary.

"CSTATE_RETRY_FAILED_TRANSACTION" -> "CSTATE_RETRY" is simpler and clear enough.

In CState and some code, a failure is a failure, maybe one boolean would 
be enough. It need only be differentiated when counting, and you have 
(deadlock_failure || serialization_failure) everywhere.

Some variables, such as "int attempt_number", should be in the client 
structure, not in the client? Generally, try to use block variables if 
possible to keep the state clearly disjoints. If there could be NO new 
variable at the doCustom level that would be great, because that would 
ensure that there is no machine state mixup hidden in these variables.

I'm wondering whether the RETRY & FAILURE states could/should be merged:

  on RETRY:
    -> count retry
    -> actually retry if < max_tries (reset client state, jump to command)
    -> else count failure and skip to end of script

The start and end of transaction detection seem expensive (malloc, ...) 
and assume one statement per command (what about "BEGIN \; ... \; 
COMMIT;"?), which is not necessarily the case; this limitation should be 
documented. ISTM that the space normalization should be avoided, and 
something simpler/lighter should be devised. Possibly it should consider 
handling SAVEPOINT.

I disagree about the exit in ParseScript if the transaction block is not 
completed, especially as it misses out on combined statements/queries 
("BEGIN \; stuff... \; COMMIT") and would break an existing feature.

There are strange characters in comments, e.g. "??ontinuous".

Option "max-attempt-number" -> "max-tries"

I would put the client random state initialization with the state 
initialization, not with the connection.

* About tracing

Progress is expected to be short, not detailed. Only add the number of 
failures and retries if max retry is not 1.

* About reporting

I think that too much is reported. I advised to do that, but nevertheless 
it is a little bit steep.

At least, it should not report the number of tries/attempts when the max 
number is one. Simple counting should be reported for failures, not 
floats...

I would suggest a more compact one-line report about failures:
  "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"

* About the TAP tests

They are too expensive, with 3 initdb. I think that they should be 
integrated in the existing tests, as a patch has been submitted to rework 
the whole pgbench tap test infrastructure.

For now, at most one initdb and several small tests inside.

* About the documentation

I'm not sure that the feature needs pre-eminence in the documentation, 
because most of the time there is no retry as none is needed, there is no 
failure, so this is rather a special (although useful) case for people 
playing with serializable and other advanced features.

Smaller updates, without dedicated examples, should be enough.

If a transaction is skipped, there were no tries, so the corresponding 
number of attempts is 0, not one.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
> LastBeginState -> RetryState? I'm not sure why this state is a pointer in 
> CState. Putting the struct would avoid malloc/free cycles. Index "-1" may be 
> used to tell it is not set if necessary.

Another detail I forgot about this point: there may be a memory leak on 
variable copies; ISTM that the "variables" array is never freed.

I was not convinced by the overall memory management around variables to 
begin with, and it is even less so with their new copy management. Maybe 
having a clean "Variables" data structure could help improve the 
situation.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Here are a round of comments on the current version of the patch:

Thank you very much again!

> There is a latent issue about what is a transaction. For pgbench a
> transaction is a full script execution.
> For postgresql, it is a statement or a BEGIN/END block, several of
> which may appear in a script. From a retry
> perspective, you may retry from a SAVEPOINT within a BEGIN/END
> block... I'm not sure how to make general sense
> of all this, so this is just a comment without attached action for now.

Yes, it is. That's why I wrote several notes about it in the documentation 
where there may be a misunderstanding:

+        Transactions with serialization or deadlock failures (or with 
both
+        of them if used script contains several transactions; see
+        <xref linkend="transactions-and-scripts"
+        endterm="transactions-and-scripts-title"> for more information) 
are
+        marked separately and their time is not reported as for skipped
+        transactions.

+ <refsect2 id="transactions-and-scripts">
+  <title id="transactions-and-scripts-title">What is the 
<quote>Transaction</> Actually Performed in 
<application>pgbench</application>?</title>

+    If a transaction has serialization and/or deadlock failures, its
+   <replaceable>time</> will be reported as <literal>serialization 
failure</>,
+   <literal>deadlock failure</>, or
+   <literal>serialization and deadlock failures</>, respectively.   </para>
+  <note>
+   <para>
+     Transactions can have both serialization and deadlock failures if 
the
+     used script contained several transactions.  See
+     <xref linkend="transactions-and-scripts"
+     endterm="transactions-and-scripts-title"> for more information.
+    </para>
+  </note>

+  <note>
+   <para>
+    The number of transactions attempts within the interval can be 
greater than
+    the number of transactions within this interval multiplied by the 
maximum
+    attempts number.  See <xref linkend="transactions-and-scripts"
+    endterm="transactions-and-scripts-title"> for more information.
+   </para>
+  </note>

+       <note>
+         <para>The total sum of per-command failures of each type can 
be greater
+         than the number of transactions with reported failures.
+         See <xref linkend="transactions-and-scripts"
+         endterm="transactions-and-scripts-title"> for more 
information.
+         </para>
+       </note>

And I didn't make rollbacks to savepoints after the failure because they 
cannot help with serialization failures at all: after a rollback to 
savepoint, a new attempt will always be unsuccessful.

> I would consider using "try/tries" instead of "attempt/attempts" as it
> is shorter. An English native speaker
> opinion would be welcome on that point.

Thank you, I'll change it.

> I'm fine with renaming "is_latencies" to "report_per_command", which
> is more logical & generic.

Glad to hear it!

> "max_attempt_number": I'm against typing fields again in their name,
> aka "hungarian naming". I'd suggest
> "max_tries" or "max_attempts".

Ok!

> "SimpleStats attempts": I disagree with using this floating poiunt
> oriented structures to count integers.
> I would suggest "int64 tries" instead, which should be enough for the 
> purpose.

I'm not sure that it is enough. Firstly, there may be several transactions 
in a script, so to count the average number of attempts you should know the 
total number of run transactions. Secondly, I think that the stddev of the 
number of attempts can be quite interesting, and often it is not close to 
zero.

> LastBeginState -> RetryState? I'm not sure why this state is a pointer
> in CState. Putting the struct would avoid malloc/free cycles. Index
> "-1" may be used to tell it is not set if necessary.

Thanks, I agree that it's better to do in this way.

> "CSTATE_RETRY_FAILED_TRANSACTION" -> "CSTATE_RETRY" is simpler and 
> clear enough.

Ok!

> In CState and some code, a failure is a failure, maybe one boolean
> would be enough. It need only be differentiated when counting, and you
> have (deadlock_failure || serialization_failure) everywhere.

I agree with you. I'll change it.

> Some variables, such as "int attempt_number", should be in the client
> structure, not in the client? Generally, try to use block variables if
> possible to keep the state clearly disjoints. If there could be NO new
> variable at the doCustom level that would be great, because that would
> ensure that there is no machine state mixup hidden in these variables.

Do you mean the code cleanup for the doCustom function? Because if I do so, 
there will be two code styles for state blocks and their variables in 
this function..

> I wondering whether the RETRY & FAILURE states could/should be merged:
> 
>   on RETRY:
>     -> count retry
>     -> actually retry if < max_tries (reset client state, jump to 
> command)
>     -> else count failure and skip to end of script
> 
> The start and end of transaction detection seem expensive (malloc,
> ...) and assume a one statement per command (what about "BEGIN \; ...
> \; COMMIT;", which is not necessarily the case, this limitation should
> be documented. ISTM that the space normalization should be avoided,
> and something simpler/lighter should be devised? Possibly it should
> consider handling SAVEPOINT.

I divided these states because if there's a failed transaction block you 
should end it before retrying. That means going through the states 
CSTATE_START_COMMAND -> CSTATE_WAIT_RESULT -> CSTATE_END_COMMAND with 
the appropriate command. How do you propose not to go through these states?

About malloc - I agree with you that it should be done without 
malloc/free.

About savepoints - as I wrote earlier, I didn't make rollbacks to 
savepoints after the failure, because they cannot help with serialization 
failures at all: after a rollback to savepoint, a new attempt will always 
be unsuccessful.

> I disagree about exit in ParseScript if the transaction block is not
> completed, especially as it misses out on combined statements/queries
> (BEGIN \; stuff... \; COMMIT") and would break an existing feature.

Thanks, I'll fix it for usual transaction blocks that don't end in the 
scripts.

> There are strange characters things in comments, eg "??ontinuous".

Oh, I'm sorry. I'll fix it too.

> Option "max-attempt-number" -> "max-tries"

> I would put the client random state initialization with the state
> intialization, not with the connection.

> * About tracing
> 
> Progress is expected to be short, not detailed. Only add the number of
> failures and retries if max retry is not 1.

Ok!

> * About reporting
> 
> I think that too much is reported. I advised to do that, but
> nevertheless it is a little bit steep.
> 
> At least, it should not report the number of tries/attempts when the
> max number is one.

Ok!

> Simple counting should be reported for failures,
> not floats...
> 
> I would suggest a more compact one-line report about failures:
> 
>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"

I think there may be a misunderstanding, because a script can contain 
several transactions and get both failures.

> * About the TAP tests
> 
> They are too expensive, with 3 initdb. I think that they should be
> integrated in the existing tests, as a patch has been submitted to
> rework the whole pgbench tap test infrastructure.
> 
> For now, at most one initdb and several small tests inside.

Ok!

> * About the documentation
> 
> I'm not sure that the feature needs pre-emminence in the
> documentation, because most of the time there is no retry as none is
> needed, there is no failure, so this rather a special (although
> useful) case for people playing with serializable and other advanced
> features.
> 
> Smaller updates, without dedicated examples, should be enough.

Maybe there should be some examples to prepare people for what they can see 
in the output of the program? Of course, right now failures are special cases 
because they disconnect their clients until the end of the program and ruin
all the results. I hope that if this patch is committed there will be 
many more cases with retried failures.

> If a transaction is skipped, there was no tries, so the corresponding
> number of attempts is 0, not one.

Oh, I'm sorry, it is a typo in the documentation.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Another detail I forgot about this point: there may be a memory leak
> on variables copies, ISTM that the "variables" array is never freed.
> 
> I was not convinced by the overall memory management around variables
> to begin with, and it is even less so with their new copy management.
> Maybe having a clean "Variables" data structure could help improve the
> situation.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello,

> [...] I didn't make rollbacks to savepoints after the failure because 
> they cannot help for serialization failures at all: after rollback to 
> savepoint a new attempt will be always unsuccessful.

Not necessarily? It depends on where the locks triggering the issue are 
set; if they are all set after the savepoint, it could work on a second 
attempt.

>> "SimpleStats attempts": I disagree with using this floating poiunt 
>> oriented structures to count integers. I would suggest "int64 tries" 
>> instead, which should be enough for the purpose.
>
> I'm not sure that it is enough. Firstly it may be several transactions in 
> script so to count the average attempts number you should know the total 
> number of runned transactions. Secondly I think that stddev for attempts 
> number can be quite interesting and often it is not close to zero.

I would prefer to have a real motivation to add this complexity in the 
report and in the code. Without that, a simple int seems better for now. 
It can be improved later if the need really arises.

>> Some variables, such as "int attempt_number", should be in the client
>> structure, not in the client? Generally, try to use block variables if
>> possible to keep the state clearly disjoints. If there could be NO new
>> variable at the doCustom level that would be great, because that would
>> ensure that there is no machine state mixup hidden in these variables.
>
> Do you mean the code cleanup for doCustom function? Because if I do so there 
> will be two code styles for state blocks and their variables in this 
> function..

I think that any variable shared between states is a recipe for bugs if it 
is not reset properly, so they should be avoided. Maybe there are already 
too many of them, then too bad, not a reason to add more. The status 
before the automaton was a nightmare.

>> I wondering whether the RETRY & FAILURE states could/should be merged:
>
> I divided these states because if there's a failed transaction block you 
> should end it before retrying.

Hmmm. Maybe I'm wrong. I'll think about it.

>> I would suggest a more compact one-line report about failures:
>>
>>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"
>
> I think, there may be a misunderstanding. Because script can contain several 
> transactions and get both failures.

I do not understand. Both failure numbers are on the compact line I 
suggested.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
>> I was not convinced by the overall memory management around variables
>> to begin with, and it is even less so with their new copy management.
>> Maybe having a clean "Variables" data structure could help improve the
>> situation.
>
> Ok!

Note that there is something for psql (src/bin/psql/variable.c) which may 
or may not be shared. It should be checked before possibly recoding the 
same thing.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 13-07-2017 19:32, Fabien COELHO wrote:
> Hello,

Hi!

>> [...] I didn't make rollbacks to savepoints after the failure because 
>> they cannot help for serialization failures at all: after rollback to 
>> savepoint a new attempt will be always unsuccessful.
> 
> Not necessarily? It depends on where the locks triggering the issue
> are set, if they are all set after the savepoint it could work on a
> second attempt.

Don't you mean the deadlock failures, where a rollback to savepoint can 
really help? And could you please give an example where a rollback to 
savepoint can help to end its subtransaction successfully after a 
serialization failure?

>>> "SimpleStats attempts": I disagree with using this floating poiunt 
>>> oriented structures to count integers. I would suggest "int64 tries" 
>>> instead, which should be enough for the purpose.
>> 
>> I'm not sure that it is enough. Firstly it may be several transactions 
>> in script so to count the average attempts number you should know the 
>> total number of runned transactions. Secondly I think that stddev for 
>> attempts number can be quite interesting and often it is not close to 
>> zero.
> 
> I would prefer to have a real motivation to add this complexity in the
> report and in the code. Without that, a simple int seems better for
> now. It can be improved later if the need really arises.

Ok!

>>> Some variables, such as "int attempt_number", should be in the client
>>> structure, not in the client? Generally, try to use block variables 
>>> if
>>> possible to keep the state clearly disjoints. If there could be NO 
>>> new
>>> variable at the doCustom level that would be great, because that 
>>> would
>>> ensure that there is no machine state mixup hidden in these 
>>> variables.
>> 
>> Do you mean the code cleanup for doCustom function? Because if I do so 
>> there will be two code styles for state blocks and their variables in 
>> this function..
> 
> I think that any variable shared between state is a recipee for bugs
> if it is not reset properly, so they should be avoided. Maybe there
> are already too many of them, then too bad, not a reason to add more.
> The status before the automaton was a nightmare.

Ok!

>>> I would suggest a more compact one-line report about failures:
>>> 
>>>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"
>> 
>> I think, there may be a misunderstanding. Because script can contain 
>> several transactions and get both failures.
> 
> I do not understand. Both failures number are on the compact line I 
> suggested.

I mean that the sum of transactions with serialization failures and 
transactions with deadlock failures can be greater than the total number 
of transactions with failures. But if you think it's ok, I'll change it 
and write the appropriate note in the documentation.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
>>> I was not convinced by the overall memory management around variables
>>> to begin with, and it is even less so with their new copy management.
>>> Maybe having a clean "Variables" data structure could help improve 
>>> the
>>> situation.
> 
> Note that there is something for psql (src/bin/psql/variable.c) which
> may or may not be shared. It should be checked before recoding
> eventually the same thing.

Thank you very much for pointing out this file! As I checked, this is another 
structure: there it is a simple list, while in pgbench we should know 
whether the list is sorted and the number of elements in the list. What do 
you think: is it a good idea to name the variables structure in pgbench in 
the same way (VariableSpace), or should it be different to avoid confusion 
(Variables, for example)?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

>> Not necessarily? It depends on where the locks triggering the issue
>> are set, if they are all set after the savepoint it could work on a
>> second attempt.
>
> Don't you mean the deadlock failures where can really help rollback to

Yes, I mean that on deadlock failures one can roll back to a savepoint and 
work on a second attempt.

> And could you, please, give an example where a rollback to savepoint can 
> help to end its subtransaction successfully after a serialization 
> failure?

I do not know whether this is possible with serialization failures.
It might be, if the stuff before and after the savepoint is somehow 
unrelated...

> [...] I mean that the sum of transactions with serialization failures and 
> transactions with deadlock failures can be greater than the total number 
> of transactions with failures.

Hmmm. Ok.

A "failure" is a transaction (in the sense of pgbench) that could not made 
it to the end, even after retries. If there is a rollback and the a retry 
which works, it is not a failure.

Now deadlock or serialization errors, which trigger retries, are worth 
counting as well, although they are not "failures". So my format proposal 
was over-optimistic, and the number of deadlocks and serializations had 
better be on a retry count line.

Maybe something like:
  ...
  number of failures: 12 (0.004%)
  number of retries: 64 (deadlocks: 29, serialization: 35)

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>> Note that there is something for psql (src/bin/psql/variable.c) which 
>> may or may not be shared. It should be checked before recoding 
>> eventually the same thing.
>
> Thank you very much for pointing this file! As I checked this is another 
> structure: here there's a simple list, while in pgbench we should know 
> if the list is sorted and the number of elements in the list. How do you 
> think, is it a good idea to name a variables structure in pgbench in the 
> same way (VariableSpace) or it should be different not to be confused 
> (Variables, for example)?

Given that the number of variables of a pgbench script is expected to be 
pretty small, I'm not sure that the sorting stuff is worth the effort.

My suggestion is really to look at both implementations and to answer the 
question "should pgbench share its variable implementation with psql?".

If the answer is yes, then the relevant part of the implementation should 
be moved to fe_utils, and that's it.

If the answer is no, then implement something in pgbench directly.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
>>> Not necessarily? It depends on where the locks triggering the issue
>>> are set, if they are all set after the savepoint it could work on a
>>> second attempt.
>> 
>> Don't you mean the deadlock failures where can really help rollback to
> 
> Yes, I mean deadlock failures can rollback to a savepoint and work on
> a second attempt.
> 
>> And could you, please, give an example where a rollback to savepoint 
>> can help to end its subtransaction successfully after a serialization 
>> failure?
> 
> I do not know whether this is possible with serialization 
> failures.
> It might be if the stuff before and after the savepoint are somehow 
> unrelated...

If you mean, for example, updates of different tables - a rollback 
to a savepoint doesn't help.

And I'm not sure that we should do all the stuff for savepoint 
rollbacks because:
- as I see it now, it only makes sense for the deadlock failures;
- if there's a failure, which savepoint should we roll back to before 
starting the execution again? Maybe go to the last one, and if that is 
not successful go to the previous one, etc.
Retrying the entire transaction may take less time..

>> [...] I mean that the sum of transactions with serialization failures 
>> and transactions with deadlock failures can be greater than the total 
>> number of transactions with failures.
> 
> Hmmm. Ok.
> 
> A "failure" is a transaction (in the sense of pgbench) that could not
> made it to the end, even after retries. If there is a rollback and the
> a retry which works, it is not a failure.
> 
> Now deadlock or serialization errors, which trigger retries, are worth
> counting as well, although they are not "failures". So my format
> proposal was over optimistic, and the number of deadlocks and
> serializations should better be on a retry count line.
> 
> Maybe something like:
>   ...
>   number of failures: 12 (0.004%)
>   number of retries: 64 (deadlocks: 29, serialization: 35)

Ok! How do you like the idea of using the same format (the total number of 
transactions with failures and the number of retries for each failure 
type) in other places (log, aggregation log, progress) if the values are 
not "default" (= no failures and no retries)?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Given that the number of variables of a pgbench script is expected to
> be pretty small, I'm not sure that the sorting stuff is worth the
> effort.

I think it is good insurance in case there are many variables..

> My suggestion is really to look at both implementations and to answer
> the question "should pgbench share its variable implementation with
> psql?".
> 
> If the answer is yes, then the relevant part of the implementation
> should be moved to fe_utils, and that's it.
> 
> If the answer is no, then implement something in pgbench directly.

The structure of variables is different, the container structure of the 
variables is different, so I think that the answer is no.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>> If the answer is no, then implement something in pgbench directly.
>
> The structure of variables is different, the container structure of the 
> variables is different, so I think that the answer is no.

Ok, fine. My point was just to check before proceeding.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> And I'm not sure that we should do all the stuff for savepoints rollbacks 
> because:
> - as I see it now it only makes sense for the deadlock failures;
> - if there's a failure what savepoint we should rollback to and start the 
> execution again?

ISTM that this is the point of having a savepoint in the first place: the 
ability to restart the transaction at that point if something failed?

> Maybe to go to the last one, if it is not successful go to the previous 
> one etc. Retrying the entire transaction may take less time..

Well, I do not know about that. My 0.02 € is that if there was a savepoint 
then it is the natural restarting point of a transaction which hits some 
recoverable error.

Well, the short version may be to only do a full transaction retry, to 
document that for now savepoints are not handled, and to leave that for 
future work if the need arises.

>> Maybe something like:
>>   ...
>>   number of failures: 12 (0.004%)
>>   number of retries: 64 (deadlocks: 29, serialization: 35)
>
> Ok! How do you like the idea of using the same format (the total number of 
> transactions with failures and the number of retries for each failure type) 
> in other places (log, aggregation log, progress) if the values are not 
> "default" (= no failures and no retries)?

For progress the output must be short and readable, and probably we do not 
care about whether retries came from this or that, so I would leave that 
out.

For log and aggregated log possibly that would make more sense, but it 
must stay easy to parse.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Ok, fine. My point was just to check before proceeding.

And I'm very grateful for that :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Well, the short version may be to only do a full transaction retry and
> to document that for now savepoints are not handled, and to let that
> for future work if need arises.

I agree with you.

> For progress the output must be short and readable, and probably we do
> not care about whether retries came from this or that, so I would let
> that out.
> 
> For log and aggregated log possibly that would make more sense, but it
> must stay easy to parse.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello again!

Here is the third version of the patch for pgbench, thanks to Fabien 
Coelho's comments. As in the previous one, transactions with serialization 
and deadlock failures are rolled back and retried until they end 
successfully or their number of tries reaches the maximum.

Differences from the previous version:
* Some code cleanup :) In particular, the Variables structure for 
managing client variables and only one new TAP tests file (as 
recommended here [1] and here [2]).
* There's no error if the last transaction in the script is not 
completed. But transactions started in previous scripts and/or 
not ending in the current script are not rolled back and retried after 
the failure. Such a script try is reported as failed because it contains a 
failure that was not rolled back and retried.
* Usually the retries and/or failures are printed if they are not equal 
to zero. In transaction/aggregation logs the failures are always 
printed, and the retries are printed if max_tries is greater than 1. This 
is done to keep the general format of the log consistent during the 
execution of the program.

Patch is attached. Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121338090.12795%40lancre
[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Andres Freund
Дата:
Hi,

On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> Here is the third version of the patch for pgbench thanks to Fabien Coelho
> comments. As in the previous one, transactions with serialization and
> deadlock failures are rolled back and retried until they end successfully or
> their number of tries reaches maximum.

Just had a need for this feature, and took this to a short test
drive. So some comments:
- it'd be useful to display a retry percentage of all transactions,
  similar to what's displayed for failed transactions.
- it appears that we now unconditionally do not disregard a connection
  after a serialization / deadlock failure. Good. But that's useful far
  beyond just deadlocks / serialization errors, and should probably be
  exposed.
- it'd be useful to also conveniently display the number of retried
  transactions, rather than the total number of retries.

Nice feature!

- Andres



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alexander Korotkov
Дата:
On Fri, Aug 11, 2017 at 10:50 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> Here is the third version of the patch for pgbench thanks to Fabien Coelho
> comments. As in the previous one, transactions with serialization and
> deadlock failures are rolled back and retried until they end successfully or
> their number of tries reaches maximum.

Just had a need for this feature, and took this to a short test
drive. So some comments:
- it'd be useful to display a retry percentage of all transactions,
  similar to what's displayed for failed transactions.
- it appears that we now unconditionally do not disregard a connection
  after a serialization / deadlock failure. Good. But that's useful far
  beyond just deadlocks / serialization errors, and should probably be exposed.

Yes, it would be nice not to disregard a connection after other errors 
too.  However, I'm not sure if we should retry the *same* transaction on 
errors beyond deadlocks / serialization errors.  For example, in case of a 
division by zero or a unique violation error it would be more natural to 
give up on the current transaction and continue with the next one.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Hi,

Hello!

> Just had a need for this feature, and took this to a short test
> drive. So some comments:
> - it'd be useful to display a retry percentage of all transactions,
>   similar to what's displayed for failed transactions.

> - it'd be useful to also conveniently display the number of retried
>   transactions, rather than the total number of retries.

Ok!

> - it appears that we now unconditionally do not disregard a connection
>   after a serialization / deadlock failure. Good. But that's useful far
>   beyond just deadlocks / serialization errors, and should probably be 
> exposed.

I agree that it would be useful. But how do you propose to print the 
results if there are many types of errors? I'm afraid that the progress 
report could become very long, although it is expected to be rather 
short [1]. The per-statement report can also be very long..

> Nice feature!

Thanks and thank you for your comments :)

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello,

> Here is the third version of the patch for pgbench thanks to Fabien Coelho 
> comments. As in the previous one, transactions with serialization and 
> deadlock failures are rolled back and retried until they end successfully or 
> their number of tries reaches maximum.

Here is some partial review.

Patch applies cleanly.

It compiles with warnings, please fix them:

   pgbench.c:2624:28: warning: ‘failure_status’ may be used uninitialized in this function
   pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in this function

I do not think that the error handling feature needs preeminence in the
final report, compared to scale, number of clients and so on. The number
of tries should be put further down.

I would spell "number of tries" instead of "tries number" which seems to
suggest that each try is attributed a number. "sql" -> "SQL".

For the per-statement latency final report, I do not think it is worth 
distinguishing the kind of retry at this level, because ISTM that 
serialization & deadlocks are unlikely to appear simultaneously. I would 
just report total failures and total tries on this report. We only have 2 
errors now, but if more are added I'm pretty sure that we would not want 
to have more columns... Moreover the 25-character alignment is ugly; 
better to use a much smaller alignment.

I'm okay with having details shown in the "log to file" group report.

The documentation does not seem consistent. It discusses "the very last 
fields" and seems to suggest that there are two, but the example trace 
below just adds one field.

If you want a paragraph you should add <para>, skipping a line does not
work (around "All values are computed for ...").

I do not understand the second note of the --max-tries documentation.
It seems to suggest that some scripts may not end their own transaction...
which should be an error in my opinion? Some explanations would be welcome.

I'm not sure that "Retries" deserves a type of its own for two counters.
The "retries" in RetriesState may be redundant with these.
The failures are counted on simple counters while retries have a type,
this is not consistent. I suggest to just use simple counters everywhere.
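
I.e. something like this (a sketch only; these would be plain fields, e.g.
in the existing stats structures):

   int64   retries;                  /* total number of retries */
   int64   retried;                  /* transactions retried at least once */
   int64   serialization_failures;   /* failed due to serialization errors */
   int64   deadlock_failures;        /* failed due to deadlocks */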

I'm ok with having the detail report tell about failures & retries only
when some occurred.

typo: sucessufully -> successfully

If a native English speaker could provide an opinion on that, and more
generally review the whole documentation, it would be great.

I think that the rand functions should really take a random_state pointer
argument, not a Thread or Client.
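
For instance, something along these lines (a sketch only; the exact names
do not matter, the point is the explicit state argument):

   typedef struct RandomState
   {
       unsigned short xseed[3];
   } RandomState;

   /*
    * sketch: uniform random int64 in [min, max], state passed explicitly;
    * pg_erand48() comes from src/port and is already used by pgbench.
    */
   static int64
   getrand(RandomState *random_state, int64 min, int64 max)
   {
       return min + (int64) ((max - min + 1) * pg_erand48(random_state->xseed));
   }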

I'm at odds that FailureStatus does not have a clean NO_FAILURE state,
and that it is merged with misc failures.

I'm not sure that initRetries, mergeRetries, getAllRetries really
deserve a function.

I do not think that there should be two accum functions. Just extend
the existing one, and adding zero to zero is not a problem.

I guess that in the end pgbench & psql variables will have to be merged
if pgbench expression engine is to be used by psql as well, but this is
not linked to this patch.

The TAP tests seem over-complicated and heavy, with two pgbench runs in
parallel... I'm not sure we really want all that complexity for this
somehow small feature. Moreover pgbench can run several scripts; I'm not
sure why two pgbench instances would need to be invoked. Could something
much simpler and lighter be proposed instead to test the feature?

The added code does not conform to Pg C style. For instance, an if brace
should be aligned with the if. Please conform to the project style.

The is_transaction_block_end seems simplistic. ISTM that it would not
work with compound commands. It should be clearly documented somewhere.

Also find attached two scripts I used for some testing:

   psql < dl_init.sql
   pgbench -f dl_trans.sql -c 8 -T 10 -P 1

-- 
Fabien.

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Hello,

Hi! I'm very sorry that I did not answer for so long, I was very busy with 
the release of Postgres Pro 10 :(

>> Here is the third version of the patch for pgbench thanks to Fabien 
>> Coelho comments. As in the previous one, transactions with 
>> serialization and deadlock failures are rolled back and retried until 
>> they end successfully or their number of tries reaches maximum.
> 
> Here is some partial review.

Thank you very much for it!

> It compiles with warnings, please fix them:
> 
>   pgbench.c:2624:28: warning: ‘failure_status’ may be used
> uninitialized in this function
>   pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in
> this function

Ok!

> I do not think that the error handling feature needs preeminence in the
> final report, compare to scale, number of clients and so. The number
> of tries should be put further on.

I added it here only because both this field and the field "transaction 
type" are transaction characteristics. I have some doubts about where to 
add it. On the one hand, the number of clients, the number of transactions 
per client and the number of transactions actually processed form a good 
logical block which I don't want to divide. On the other hand, the 
number of clients and the number of transactions per client are 
parameters, but the number of transactions actually processed is one of 
the program results. Where, in your opinion, would it be better to add 
the maximum number of transaction tries?

> I would spell "number of tries" instead of "tries number" which seems 
> to
> suggest that each try is attributed a number. "sql" -> "SQL".

Ok!

> For the per statement latency final report, I do not think it is worth
> distinguishing the kind of retry at this level, because ISTM that
> serialization & deadlocks are unlikely to appear simultaneously. I
> would just report total failures and total tries on this report. We
> only have 2 errors now, but if more are added I'm pretty sure that we
> would not want to have more columns...

Thanks, I agree with you.

> Moreover the 25 characters
> alignment is ugly, better use a much smaller alignment.

The variables for the numbers of failures and retries are of type int64 
since the variable for the total number of transactions has the same 
type. That's why there is such a large alignment (as I understand it now, 
20 characters would be enough). Do you prefer floating alignments, 
depending on the maximum number of failures/retries for any command in 
any script?

> I'm okay with having details shown in the "log to file" group report.

I think that the output format of the retries statistics should be the 
same everywhere, so I would just like to output the total number of 
retries here.

> The documentation does not seem consistent. It discusses "the very last 
> fields"
> and seem to suggest that there are two, but the example trace below 
> just
> adds one field.

I'm sorry, I do not understand what you are talking about. I used the 
commands and the files from the end of your message ("psql < 
dl_init.sql" and "pgbench -f dl_trans.sql -c 8 -T 10 -P 1"), and I got 
this output from pgbench:

starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
progress: 1.0 s, 14.0 tps, lat 9.094 ms stddev 5.304
progress: 2.0 s, 25.0 tps, lat 284.934 ms stddev 450.692, 1 failed
progress: 3.0 s, 21.0 tps, lat 337.942 ms stddev 473.210, 1 failed
progress: 4.0 s, 11.0 tps, lat 459.041 ms stddev 499.908, 2 failed
progress: 5.0 s, 28.0 tps, lat 220.219 ms stddev 411.390, 2 failed
progress: 6.0 s, 5.0 tps, lat 402.695 ms stddev 492.526, 2 failed
progress: 7.0 s, 24.0 tps, lat 343.249 ms stddev 626.181, 2 failed
progress: 8.0 s, 14.0 tps, lat 505.396 ms stddev 501.836, 1 failed
progress: 9.0 s, 40.0 tps, lat 180.080 ms stddev 381.335, 1 failed
progress: 10.0 s, 1.0 tps, lat 0.000 ms stddev 0.000, 1 failed
transaction type: dl_trans.sql
transaction maximum tries number: 1
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 1
duration: 10 s
number of transactions actually processed: 191
number of failures: 14 (7.330 %)
latency average = 356.701 ms
latency stddev = 564.942 ms
tps = 18.735807 (including connections establishing)
tps = 18.744898 (excluding connections establishing)

As I understand it, in the documentation "the very last fields" refer to 
the aggregation logging which is not used here. So what's the problem?

> If you want a paragraph you should add <para>, skipping a line does not
> work (around "All values are computed for ...").

Sorry, thanks =[

> I do not understand the second note of the --max-tries documentation.
> It seems to suggest that some script may not end their own 
> transaction...
> which should be an error in my opinion? Some explanations would be 
> welcome.

As you told me here [1], "I disagree about exit in ParseScript if the 
transaction block is not completed <...> and would break an existing 
feature.". Maybe it's be better to say this:

In pgbench you can use scripts in which the transaction blocks do not 
end. Be careful in this case because transactions that span over more 
than one script are not rolled back and will not be retried in case of 
an error. In such cases, the script in which the error occurred is 
reported as failed.

?

> I'm not sure that "Retries" deserves a type of its own for two 
> counters.

Ok!

> The "retries" in RetriesState may be redundant with these.

The "retries" in RetriesState have a different goal: they sum up not all 
the retries during the execution of the current script but the retries 
for the current transaction.

> The failures are counted on simple counters while retries have a type,
> this is not consistent. I suggest to just use simple counters 
> everywhere.

Ok!

> I'm ok with having the detail report tell about failures & retries only
> when some occured.

Ok!

> typo: sucessufully -> successfully

Thanks! =[

> If a native English speaker could provide an opinion on that, and more
> generally review the whole documentation, it would be great.

I agree with you))

> I think that the rand functions should really take a random_state 
> pointer
> argument, not a Thread or Client.

Thanks, I agree.

> I'm at odds that FailureStatus does not have a clean NO_FAILURE state,
> and that it is merged with misc failures.

:) It is funny but for the code it really did not matter)

> I'm not sure that initRetries, mergeRetries, getAllRetries really
> deserve a function.

Ok!

> I do not thing that there should be two accum Functions. Just extend
> the existing one, and adding zero to zero is not a problem.

Ok!

> I guess that in the end pgbench & psql variables will have to be merged
> if pgbench expression engine is to be used by psql as well, but this is
> not linked to this patch.

Ok!

> The tap tests seems over-complicated and heavy with two pgbench run in
> parallel... I'm not sure we really want all that complexity for this
> somehow small feature. Moreover pgbench can run several scripts, I'm 
> not
> sure why two pgbench would need to be invoked. Could something much
> simpler and lighter be proposed instead to test the feature?

Firstly, two pgbench instances need to be invoked because we don't know 
which of them will get a deadlock failure. Secondly, I tried much simpler 
tests but all of them sometimes failed although everything was ok:
- tests in which pgbench runs 5 clients and 10 transactions per client, 
expecting a serialization/deadlock failure on some client (sometimes there 
are no failures when they are expected);
- tests in which pgbench runs 30 clients and 400 transactions per client, 
expecting a serialization/deadlock failure on some client (sometimes there 
are no failures when they are expected);
- tests in which a psql session starts concurrently and sleep commands are 
used to wait for pgbench for 10 seconds (sometimes it does not work).
Only advisory locks help me avoid such errors in the tests :(

> The added code does not conform to Pg C style. For instance, if brace
> should be aligned to the if. Please conform the project style.

I'm sorry, thanks =[

> The is_transaction_block_end seems simplistic. ISTM that it would not
> work with compound commands. It should be clearly documented somewhere.

Thanks, I'll fix it.

> Also find attached two scripts I used for some testing:
> 
>   psql < dl_init.sql
>   pgbench -f dl_trans.sql -c 8 -T 10 -P 1

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Teodor Sigaev
Дата:
> I suggest a patch where pgbench client sessions are not disconnected because of 
> serialization or deadlock failures and these failures are mentioned in reports. 
> In details:
> - transaction with one of these failures continue run normally, but its result 
> is rolled back;
> - if there were these failures during script execution this "transaction" is marked
> appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in 
> aggregation logs and in the end with other results (all and for each script);
Hm, I took a look at both threads about the patch and it seems to me that now 
it's overcomplicated. With the recently committed enhancements of pgbench (\if, 
\when) it becomes close to impossible to retry a transaction in case of failure. 
So, the initial approach of just rolling back such a transaction looks more 
attractive.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> Hm, I took a look at both threads about the patch and it seems to me that now 
> it's overcomplicated. With the recently committed enhancements of pgbench (\if, 
> \when) it becomes close to impossible to retry a transaction in case of 
> failure. So, the initial approach of just rolling back such a transaction 
> looks more attractive.

Yep.

I think that the best approach for now is simply to reset (command zero, 
random generator) and start over the whole script, without attempting to 
be more intelligent. The limitations should be clearly documented (one 
transaction per script), though. That would be a significant enhancement 
already.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 25-03-2018 15:23, Fabien COELHO wrote:
>> Hm, I took a look at both threads about the patch and it seems to me that 
>> now it's overcomplicated. With the recently committed enhancements of pgbench 
>> (\if, \when) it becomes close to impossible to retry a transaction in 
>> case of failure. So, the initial approach of just rolling back such a 
>> transaction looks more attractive.
> 
> Yep.

Many thanks to both of you! I'm working on a patch in this direction..

> I think that the best approach for now is simply to reset (command
> zero, random generator) and start over the whole script, without
> attempting to be more intelligent. The limitations should be clearly
> documented (one transaction per script), though. That would be a
> significant enhancement already.

I'm not sure that we can always do this, because we can get new errors 
until we finish the failed transaction block, and we need to destroy the 
conditional stack..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> Many thanks to both of you! I'm working on a patch in this direction..
>
>> I think that the best approach for now is simply to reset (command
>> zero, random generator) and start over the whole script, without
>> attempting to be more intelligent. The limitations should be clearly
>> documented (one transaction per script), though. That would be a
>> significant enhancement already.
>
> I'm not sure that we can always do this, because we can get new errors until 
> we finish the failed transaction block, and we need to destroy the conditional 
> stack..

Sure. I'm suggesting, in order to simplify things, that on failures the retry 
would always restart from the beginning of the script by resetting everything, 
indeed including the conditional stack, the random generator state, the 
variable values, and so on.
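
In code terms the retry path would be roughly this (a rough sketch with
made-up field and helper names, not actual patch code):

   /* rough sketch: restore what was saved at transaction start, then retry */
   static void
   resetForRetry(CState *st)
   {
       st->random_state = st->retry_state.random_state; /* same random choices */
       copyVariables(&st->variables, &st->retry_state.variables);
       conditional_stack_reset(st->cstack);  /* forget any pending \if state */
       st->command = 0;                      /* restart at the first command */
       st->state = CSTATE_START_TX;
   }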

This means somehow enforcing that one script is one transaction.

If the user does not do that, it would be their decision and the result 
becomes unpredictable on errors (eg some sub-transactions could be 
executed more than once).

Then if more is needed, that could be for another patch.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 26-03-2018 18:53, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> Many thanks to both of you! I'm working on a patch in this direction..
>> 
>>> I think that the best approach for now is simply to reset (command
>>> zero, random generator) and start over the whole script, without
>>> attempting to be more intelligent. The limitations should be clearly
>>> documented (one transaction per script), though. That would be a
>>> significant enhancement already.
>> 
>> I'm not sure that we can always do this, because we can get new errors 
>> until we finish the failed transaction block, and we need to destroy the 
>> conditional stack..
> 
> Sure. I'm suggesting, in order to simplify things, that on failures the retry
> would always restart from the beginning of the script by resetting
> everything, indeed including the conditional stack, the random
> generator state, the variable values, and so on.
> 
> This means somehow enforcing that one script is one transaction.
> 
> If the user does not do that, it would be their decision and the
> result becomes unpredictable on errors (eg some sub-transactions could
> be executed more than once).
> 
> Then if more is needed, that could be for another patch.

Here is the fifth version of the patch for pgbench (based on the commit 
4b9094eb6e14dfdbed61278ea8e51cc846e43579) where I tried to implement 
these ideas, thanks to your comments and those of Teodor Sigaev. Since 
we may need to execute commands to complete a failed transaction block, 
the script is now always executed completely. If there is a 
serialization/deadlock failure which can be retried, the script is 
executed again with the same random state and array of variables as 
before its first run. Meta command errors as well as all SQL errors do 
not cause the client to abort. The first failure in the current 
script execution determines whether the script run will be retried or 
not, so only such failures (they have a retry) or errors (they are not 
retried) are reported.

I tried to make fixes in accordance with your previous reviews ([1], 
[2], [3]):

> I'm unclear about the example added in the documentation. There
> are 71% errors, but 100% of transactions are reported as processed. If
> there were errors, then it is not a success, so the transaction were
> not
> processed? To me it looks inconsistent. Also, while testing, it seems
> that
> failed transactions are counted in tps, which I think is not
> appropriate:
> 
> 
> About the feature:
> 
>  sh> PGOPTIONS='-c default_transaction_isolation=serializable' \
>        ./pgbench -P 1 -T 3 -r -M prepared -j 2 -c 4
>  starting vacuum...end.
>  progress: 1.0 s, 10845.8 tps, lat 0.091 ms stddev 0.491, 10474 failed
>  # NOT 10845.8 TPS...
>  progress: 2.0 s, 10534.6 tps, lat 0.094 ms stddev 0.658, 10203 failed
>  progress: 3.0 s, 10643.4 tps, lat 0.096 ms stddev 0.568, 10290 failed
>  ...
>  number of transactions actually processed: 32028 # NO!
>  number of errors: 30969 (96.694 %)
>  latency average = 2.833 ms
>  latency stddev = 1.508 ms
>  tps = 10666.720870 (including connections establishing) # NO
>  tps = 10683.034369 (excluding connections establishing) # NO
>  ...
> 
> For me this is all wrong. I think that the tps report is about
> transactions
> that succeeded, not mere attempts. I cannot say that a transaction
> which aborted
> was "actually processed"... as it was not.

Fixed

> The order of reported elements is not logical:
> 
>  maximum number of transaction tries: 100
>  scaling factor: 10
>  query mode: prepared
>  number of clients: 4
>  number of threads: 2
>  duration: 3 s
>  number of transactions actually processed: 967
>  number of errors: 152 (15.719 %)
>  latency average = 9.630 ms
>  latency stddev = 13.366 ms
>  number of transactions retried: 623 (64.426 %)
>  number of retries: 32272
> 
> I would suggest to group everything about error handling in one block,
> eg something like:
> 
>  scaling factor: 10
>  query mode: prepared
>  number of clients: 4
>  number of threads: 2
>  duration: 3 s
>  number of transactions actually processed: 967
>  number of errors: 152 (15.719 %)
>  number of transactions retried: 623 (64.426 %)
>  number of retries: 32272
>  maximum number of transaction tries: 100
>  latency average = 9.630 ms
>  latency stddev = 13.366 ms

Fixed

> Also, percent character should be stuck to its number: 15.719% to have
> the style more homogeneous (although there seems to be pre-existing
> inhomogeneities).
> 
> I would replace "transaction tries/retried" by "tries/retried",
> everything
> is about transactions in the report anyway.
> 
> Without reading the documentation, the overall report semantics is
> unclear,
> especially given the absurd tps results I got with the my first
> attempt,
> as failing transactions are counted as "processed".

Fixed

> About the code:
> 
> I'm at a loss with the 7 states added to the automaton, where I would
> have hoped
> that only 2 (eg RETRY & FAIL, or even less) would be enough.

Fixed

> I'm wondering whether the whole feature could be simplified by
> considering that one script is one "transaction" (it is from the
> report point of view at least), and that any retry is for the full
> script only, from its beginning. That would remove the trying to guess
> at transactions begin or end, avoid scanning manually for subcommands,
> and so on.
>  - Would it make sense?
>  - Would it be ok for your use case?

Fixed

> The proposed version of the code looks unmaintainable to me. There are
> 3 levels of nested "switch/case" with state changes at the deepest
> level.
> I cannot even see it on my screen which is not wide enough.

Fixed

> There should be a typedef for "random_state", eg something like:
> 
>   typedef struct { unsigned short data[3]; } RandomState;
> 
> Please keep "const" declarations, eg "commandFailed".
> 
> I think that choosing script should depend on the thread random state,
> not
> the client random state, so that a run would generate the same pattern
> per
> thread, independently of which client finishes first.
> 
> I'm sceptical of the "--debug-fails" options. ISTM that --debug is
> already there
> and should just be reused.

Fixed

> I agree that function naming style is a already a mess, but I think
> that
> new functions you add should use a common style, eg "is_compound" vs
> "canRetry".

Fixed

> Translating error strings to their enum should be put in a function.

Removed

> I'm not sure this whole thing should be done anyway.

The processing of compound commands is removed.

> The "node" is started but never stopped.

Fixed

> For file contents, maybe the << 'EOF' here-document syntax would help
> instead
> of using concatenated backslashed strings everywhere.

I'm sorry, but I could not get it to work with regular expressions :(

> I'd start by stating (i.e. documenting) that the features assumes that 
> one
> script is just *one* transaction.
> 
> Note that pgbench somehow already assumes that one script is one
> transaction when it reports performance anyway.
> 
> If you want 2 transactions, then you have to put them in two scripts,
> which looks fine with me. Different transactions are expected to be
> independent, otherwise they should be merged into one transaction.

Fixed

> Under these restrictions, ISTM that a retry is something like:
> 
>    case ABORTED:
>       if (we want to retry) {
>          // do necessary stats
>          // reset the initial state (random, vars, current command)
>          state = START_TX; // loop
>       }
>       else {
>         // count as failed...
>         state = FINISHED; // or done.
>       }
>       break;
...
> I'm fine with having END_COMMAND skipping to START_TX if it can be done
> easily and cleanly, esp without code duplication.

I did not want to add additional if-expressions to most of 
the code in CSTATE_START_TX/CSTATE_END_TX/CSTATE_END_COMMAND, so 
CSTATE_FAILURE is used instead of CSTATE_END_COMMAND in case of failure,
and CSTATE_RETRY is entered before CSTATE_END_TX if there was a failure 
during the current script execution.

> ISTM that ABORTED & FINISHED are currently exactly the same. That would
> put a particular use to aborted. Also, there are many points where the
> code may go to "aborted" state, so reusing it could help avoid 
> duplicating
> stuff on each abort decision.

To end and roll back the failed transaction block, the script is always 
executed completely, and after the failure the following script commands 
are still executed..

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre
[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121309300.10810%40lancre
[3] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121607310.13422%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Teodor Sigaev
Дата:
The conception of the max-retry option seems strange to me. If the number of 
retries reaches the max-retry option, then we just increment the counter of 
failed transactions and try again (possibly with different random numbers). At 
the end we should distinguish the number of error transactions from the number 
of failed transactions; to find this difference the documentation suggests 
rerunning pgbench with debugging on.

Maybe I didn't catch the idea, but it seems to me that max-tries should be 
removed. On a transaction serialization or deadlock error pgbench should 
increment the counter of failed transactions, reset the conditional stack, 
variables, etc. but not the random generator, and then start a new transaction 
from the first line of the script.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> The conception of the max-retry option seems strange to me. If the number of 
> retries reaches the max-retry option, then we just increment the counter of 
> failed transactions and try again (possibly with different random numbers). At 
> the end we should distinguish the number of error transactions from the number 
> of failed transactions; to find this difference the documentation suggests 
> rerunning pgbench with debugging on.
>
> Maybe I didn't catch the idea, but it seems to me that max-tries should be 
> removed. On a transaction serialization or deadlock error pgbench should 
> increment the counter of failed transactions, reset the conditional stack, 
> variables, etc. but not the random generator, and then start a new 
> transaction from the first line of the script.

ISTM that the idea is that the client application should give up 
at some point and report an error to the end user, kind of a "timeout" on 
trying, and that max-retry would implement this logic of giving up: the 
transaction which was intended, represented by a given initial random 
generator state, could not be committed even after some iterations.

Maybe the max retry should rather be expressed in time rather than in number 
of attempts, or both approaches could be implemented? But there is a logic 
of retrying the same thing (try again what the client wanted) vs retrying 
something different (another client need is served).

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 29-03-2018 22:39, Fabien COELHO wrote:
>> The conception of the max-retry option seems strange to me. If the number 
>> of retries reaches the max-retry option, then we just increment the counter 
>> of failed transactions and try again (possibly with different random 
>> numbers).

Then the client starts another script, but by chance, or because of the 
number of scripts, it can be the same one.

>> At the end we should distinguish the number of error transactions and 
>> failed transactions; to find this difference the documentation suggests 
>> rerunning pgbench with debugging on.

If I understood you correctly, this difference is the total number of 
retries and this is included in all reports.

>> Maybe I didn't catch the idea, but it seems to me that max-tries should be 
>> removed. On a transaction serialization or deadlock error pgbench 
>> should increment the counter of failed transactions, reset the conditional 
>> stack, variables, etc. but not the random generator, and then start a new 
>> transaction from the first line of the script.

When I sent the first version of the patch there were only rollbacks, 
and the idea of retrying failed transactions was approved (see [1], [2], 
[3], [4]). And thank you: I fixed the patch to reset the client 
variables in case of errors too, and not only in case of retries (see 
the attached version, which is based on the commit 
3da7502cd00ddf8228c9a4a7e4a08725decff99c).

> ISTM that the idea is that the client application should give
> up at some point and report an error to the end user, kind of a
> "timeout" on trying, and that max-retry would implement this logic of
> giving up: the transaction which was intended, represented by a given
> initial random generator state, could not be committed even after
> some iterations.
> 
> Maybe the max retry should rather be expressed in time rather than in
> number of attempts, or both approaches could be implemented? But there
> is a logic of retrying the same thing (try again what the client wanted)
> vs retrying something different (another client need is served).

I'm afraid that we will have a problem in debugging mode: should we 
report a failure (which will be retried) or an error (which will not be 
retried)? Because only after executing the following script commands (to 
roll back this transaction block) will we know how much time we spent on 
the execution of the current script..

[1] 
https://www.postgresql.org/message-id/CACjxUsOfbn72EaH4i_OuzdY-0PUYfg1Y3o8G27tEA8fJOaPQEw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20170615211806.sfkpiy2acoavpovl%40alvherre.pgsql
[3] 
https://www.postgresql.org/message-id/CAEepm%3D3TRTc9Fy%3DfdFThDa4STzPTR6w%3DRGfYEPikEkc-Lcd%2BMw%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/CACjxUsOQw%3DvYjPWZQ29GmgWU8ZKj336OGiNQX5Z2W-AcV12%2BNw%40mail.gmail.com

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello, hackers!

Here is the seventh version of the patch for error handling and 
retrying of transactions with serialization/deadlock failures in pgbench 
(based on the commit a08dc711952081d63577fc182fcf955958f70add). I added 
the option --max-tries-time, which is an implementation of Fabien Coelho's 
proposal in [1]: a transaction with a serialization or deadlock failure 
can be retried if the total time of all its tries is less than this 
limit (in ms). This option can be combined with the option --max-tries. 
But if neither of them is used, failed transactions are not retried at 
all.
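
The combined check is roughly the following (a simplified sketch; the
variable and field names are only for illustration):

   /* sketch: decide whether a failed transaction may be tried again */
   static bool
   canRetry(CState *st, int64 tries_duration_ms)
   {
       /* if neither option is used, failed transactions are not retried */
       if (max_tries == 0 && max_tries_time == 0)
           return false;
       if (max_tries > 0 && st->tries >= max_tries)
           return false;    /* the maximum number of tries is reached */
       if (max_tries_time > 0 && tries_duration_ms >= max_tries_time)
           return false;    /* the tries already took longer than the limit */
       return true;
   }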

Also:
* Now when the first failure occurs in a transaction it is always 
reported as a failure, since only after the remaining commands of this 
transaction are executed do we find out whether we can try again or not. 
Therefore the messages about retrying or ending the failed transaction 
are added at the "fails" debugging level, so you can distinguish 
failures (which are retried) from errors (which are not retried).
* Fix the report of the latency average, because the total time includes 
time for both errors and successful transactions.
* Code cleanup (including tests).

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1803292134380.16472%40lancre

> Maybe the max retry should rather be expressed in time rather than 
> number
> of attempts, or both approach could be implemented?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Ildus Kurbangaliev
Дата:
On Wed, 04 Apr 2018 16:07:25 +0300
Marina Polyakova <m.polyakova@postgrespro.ru> wrote:

> Hello, hackers!
> 
> Here there's a seventh version of the patch for error handling and 
> retrying of transactions with serialization/deadlock failures in
> pgbench (based on the commit
> a08dc711952081d63577fc182fcf955958f70add). I added the option
> --max-tries-time which is an implementation of Fabien Coelho's
> proposal in [1]: the transaction with serialization or deadlock
> failure can be retried if the total time of all its tries is less
> than this limit (in ms). This option can be combined with the option
> --max-tries. But if none of them are used, failed transactions are
> not retried at all.
> 
> Also:
> * Now when the first failure occurs in the transaction it is always 
> reported as a failure since only after the remaining commands of this 
> transaction are executed we find out whether we can try again or not. 
> Therefore add the messages about retrying or ending the failed 
> transaction to the "fails" debugging level so you can distinguish 
> failures (which are retried) and errors (which are not retried).
> * Fix a report on the latency average because the total time includes 
> time for both errors and successful transactions.
> * Code cleanup (including tests).
> 
> [1] 
> https://www.postgresql.org/message-id/alpine.DEB.2.20.1803292134380.16472%40lancre
> 
> > Maybe the max retry should rather be expressed in time rather than 
> > number
> > of attempts, or both approach could be implemented?  
> 

Hi, I did a little review of your patch. It seems to work as
expected, documentation and tests are there. Still I have a few comments.

There are a lot of checks like "if (debug_level >= DEBUG_FAILS)" with a
corresponding fprintf(stderr, ...). I think it's time to do it like in the
main code: wrap them with some function like log(level, msg).

In the CSTATE_RETRY state, used_time is used only for printing but is
calculated more often than needed.

In my opinion Debuglevel should be renamed to DebugLevel, which looks
nicer; there is also DEBUGLEVEl (where the last letter is in lower case),
which is very confusing.

I have checked the overall functionality of this patch, but haven't checked
any special cases yet.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Hi, I did a little review of your patch. It seems to work as
> expected, documentation and tests are there. Still I have few comments.

Hello! Thank you very much! I attached the fixed version of the patch 
(based on the commit 94c1f9ba11d1241a2b3b2be7177604b26b08bc3d). Also, 
thanks to Fabien Coelho's comments outside of this thread, I removed the 
option --max-tries-time; the option --latency-limit can now be used to 
limit the time of transaction tries.

> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

I agree, fixed.

> In CSTATE_RETRY state used_time is used only in printing but calculated
> more than needed.

Sorry, fixed.

> In my opinion Debuglevel should be renamed to DebugLevel that looks
> nicer, also there DEBUGLEVEl (where last letter is in lower case) which
> is very confusing.

Sorry for these typos =[ Fixed.

> I have checked overall functionality of this patch, but haven't checked
> any special cases yet.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

FYI the v8 patch does not apply anymore, mostly because of a recent perl 
reindentation.

I think that I'll have time for a round of review in the first half of 
July. Providing a rebased patch before then would be nice.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
Fabien COELHO wrote:

> I think that I'll have time for a round of review in the first half of July.
> Providing a rebased patch before then would be nice.

Note that even in the absence of a rebased patch, you can apply to an
older checkout if you have some limited window of time for a review.

Looking over the diff, I find that this patch tries to do too much and
needs to be split up.  At a minimum there is a preliminary patch that
introduces the error reporting stuff (errstart etc); there are other
thread-related changes (for example to the random generation functions)
that probably belong in a separate one too.  Not sure if there are other
smaller patches hidden inside the rest.

On elog/errstart: we already have a convention for what ereport() calls
look like; I suggest to use that instead of inventing your own.  With
that, is there a need for elog()?  In the backend we have it because
$HISTORY but there's no need for that here -- I propose to lose elog()
and use only ereport everywhere.  Also, I don't see that you need
errmsg_internal() at all; let's lose it too.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Alvaro,

>> I think that I'll have time for a round of review in the first half of July.
>> Providing a rebased patch before then would be nice.

> Note that even in the absence of a rebased patch, you can apply to an
> older checkout if you have some limited window of time for a review.

Yes, sure. I'd like to bring this feature to be committable, so it will 
have to be rebased at some point anyway.

> Looking over the diff, I find that this patch tries to do too much and
> needs to be split up.

Yep, I agree that it would help the reviewing process. On the other hand I 
have bad memories about maintaining dependent patches which interfere 
significantly. Maybe that is not the case with this feature.

Thanks for the advice.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
Hello,

Fabien COELHO wrote:

> > Looking over the diff, I find that this patch tries to do too much and
> > needs to be split up.
> 
> Yep, I agree that it would help the reviewing process. On the other hand I
> have bad memories about maintaining dependent patches which interfere
> significantly.

Sure.  I suggest not posting these patches separately -- instead, post
as a series of commits in a single email, attaching files from "git
format-patch".

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
Hello!

Fabien and Alvaro, thank you very much! And sorry for such a late reply 
(I was a bit busy and implementing ereport took some time..) :-( Below is a 
rebased version of the patch (commit 
9effb63e0dd12b0704cd8e11106fe08ff5c9d685) divided into several smaller 
patches:

v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).

v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch
- a patch for the ereport() macro (this is used to report client 
failures that do not cause an abort, and this depends on the level of 
debugging).
- implementation: if possible, use the local ErrorData structure during 
the errstart()/errmsg()/errfinish() calls. Otherwise use a static 
variable protected by a mutex if necessary. To do all of this, the 
function appendPQExpBufferVA is exported from libpq.

v9-0004-Pgbench-errors-and-serialization-deadlock-retries.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

Any suggestions are welcome!

On 08-05-2018 9:00, Fabien COELHO wrote:
> Hello Marina,
> 
> FYI the v8 patch does not apply anymore, mostly because of a recent
> perl reindentation.
> 
> I think that I'll have time for a round of review in the first half of
> July. Providing a rebased patch before then would be nice.

They are attached, but a little delayed due to testing..

On 08-05-2018 13:58, Alvaro Herrera wrote:
> Looking over the diff, I find that this patch tries to do too much and
> needs to be split up.  At a minimum there is a preliminary patch that
> introduces the error reporting stuff (errstart etc); there are other
> thread-related changes (for example to the random generation functions)
> that probably belong in a separate one too.  Not sure if there are 
> other
> smaller patches hidden inside the rest.

Here is an attempt to do it..

> On elog/errstart: we already have a convention for what ereport() calls
> look like; I suggest to use that instead of inventing your own.  With
> that, is there a need for elog()?  In the backend we have it because
> $HISTORY but there's no need for that here -- I propose to lose elog()
> and use only ereport everywhere.  Also, I don't see that you need
> errmsg_internal() at all; let's lose it too.

I agree, done. But there are some changes needed to make such a design 
thread-safe..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
> - a patch for the RandomState structure (this is used to reset a client's 
> random seed during the repeating of transactions after serialization/deadlock 
> failures).

A few comments about this first patch.

Patch applies cleanly, compiles, global & pgbench "make check" ok.

I'm mostly ok with the changes, which cleanly separate the different use 
of random between threads (script choice, throttle delay, sampling...) and 
client (random*() calls).

This change is necessary so that a client can restart a transaction 
deterministically (at the client level at least), which is the ultimate 
aim of the patch series.
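
As a minimal sketch of this idea (illustrative names, not the patch's 
exact code), the client's generator state lives in its own small struct, 
a copy is taken when the transaction starts, and the copy is restored 
before a retry:

   /* per-client PRNG state, assuming the pg_erand48() family (src/port/erand48.c) */
   typedef struct
   {
       unsigned short xseed[3];
   } RandomState;

   typedef struct
   {
       RandomState random_state;    /* state consumed by the random*() functions */
       RandomState retry_state;     /* snapshot taken when the transaction starts */
       /* ... other per-client fields ... */
   } Client;

   static void
   saveRandomState(Client *c)
   {
       c->retry_state = c->random_state;    /* plain struct assignment suffices */
   }

   static void
   restoreRandomState(Client *c)
   {
       /* a retried transaction then draws exactly the same random values */
       c->random_state = c->retry_state;
   }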

A few remarks:

The RandomState struct is 6 bytes, which will induce some padding when 
used. This is life and pre-existing. No problem.

ISTM that the struct itself does not need a name, ie. "typedef struct { 
... } RandomState" is enough.

There could be clear comments, say in the TState and CState structs, about 
what randomness is impacted (i.e. script choices, etc.).

getZipfianRand, computeHarmonicZipfian: The "thread" parameter was 
justified because it was used for two fields. As the random state is 
separated, I'd suggest that the other argument should be a zipfcache 
pointer.

While reading your patch, it occurs to me that a run is not deterministic 
at the thread level under throttling and sampling, because the random 
state is solicited differently depending on when a transaction ends. This 
suggests that maybe each use of the thread's random state should have its 
own random state.

In passing, and totally unrelated to this patch:

I've always been a little puzzled about why a quite small 48-bit internal 
state random generator is used. I understand the need for pg to have a 
portable & state-controlled thread-safe random generator, but why this 
particular small one fails me. The source code (src/port/erand48.c, 
copyright in 1993...) looks optimized for 16-bit architectures, which is 
probably pretty inefficient to run on 64-bit architectures. Maybe this 
could be updated with something more consistent with today's processors, 
providing more quality at a lower cost.
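
As one example of what such an update could look like (purely 
illustrative, not something this patch proposes), the public-domain 
splitmix64 generator keeps 64-bit state and needs only a few 64-bit 
multiplications and shifts per value:

   #include <stdint.h>

   static uint64_t
   splitmix64_next(uint64_t *state)
   {
       uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));

       z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
       z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
       return z ^ (z >> 31);
   }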

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
> - a patch for the Variables structure (this is used to reset client variables 
> during the repeating of transactions after serialization/deadlock failures).

About this second patch:

This extracts the variable-holding structure, so that it is somewhat easier 
to reset the variables to their initial state on transaction failures, the 
management of which is the ultimate aim of this patch series.

It is also cleaner this way.

Patch applies cleanly on top of the previous one (there is no real 
interactions with it). It compiles cleanly. Global & pgbench "make check" 
are both ok.

The structure typedef does not need a name. "typedef struct { } V...".

I tend to disagree with naming things after their type, eg "array". I'd 
suggest "vars" instead. "nvariables" could be "nvars" for consistency with 
that and "vars_sorted", and because "foo.variables->nvariables" starts 
looking heavy.

I'd suggest to put the "Variables" type declaration just after the 
"Variable" type declaration in the file.
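
Putting these naming suggestions together, the holder type could look 
roughly like this (a sketch only; Variable stands for pgbench's existing 
per-variable struct):

   #include <stdbool.h>

   typedef struct Variable Variable;   /* assumed to be declared just above */

   typedef struct
   {
       Variable   *vars;           /* array of variables */
       int         nvars;          /* number of variables */
       bool        vars_sorted;    /* is the array sorted by variable name? */
   } Variables;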

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 09-06-2018 9:55, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's random seed during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> A few comments about this first patch.

Thank you very much!

> Patch applies cleanly, compiles, global & pgbench "make check" ok.
> 
> I'm mostly ok with the changes, which cleanly separate the different
> use of random between threads (script choice, throttle delay,
> sampling...) and client (random*() calls).

Glad to hear it :)

> This change is necessary so that a client can restart a transaction
> deterministically (at the client level at least), which is the
> ultimate aim of the patch series.
> 
> A few remarks:
> 
> The RandomState struct is 6 bytes, which will induce some padding when
> used. This is life and pre-existing. No problem.
> 
> ISTM that the struct itself does not need a name, ie. "typedef struct
> { ... } RandomState" is enough.

Ok!

> There could be clear comments, say in the TState and CState structs,
> about what randomness is impacted (i.e. script choices, etc.).

Thank you, I'll add them.

> getZipfianRand, computeHarmonicZipfian: The "thread" parameter was
> justified because it was used for two fieds. As the random state is
> separated, I'd suggest that the other argument should be a zipfcache
> pointer.

I agree with you and I will change it.

> While reading your patch, it occurs to me that a run is not
> deterministic at the thread level under throttling and sampling,
> because the random state is sollicited differently depending on when
> transaction ends. This suggest that maybe each thread random_state use
> should have its own random state.

Thank you, I'll fix this.

> In passing, and totally unrelated to this patch:
> 
> I've always been a little puzzled about why a quite small 48-bit
> internal state random generator is used. I understand the need for pg
> to have a portable & state-controlled thread-safe random generator,
> but why this particular small one fails me. The source code
> (src/port/erand48.c, copyright in 1993...) looks optimized for 16 bits
> architectures, which is probably pretty inefficent to run on 64 bits
> architectures. Maybe this could be updated with something more
> consistent with today's processors, providing more quality at a lower
> cost.

This sounds interesting, thanks!
*went to look for a multiplier and a summand that are large enough and 
are mutually prime..*

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 09-06-2018 16:31, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
>> - a patch for the Variables structure (this is used to reset client 
>> variables during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> About this second patch:
> 
> This extracts the variable holding structure, so that it is somehow
> easier to reset them to their initial state on transaction failures,
> the management of which is the ultimate aim of this patch series.
> 
> It is also cleaner this way.
> 
> Patch applies cleanly on top of the previous one (there is no real
> interactions with it). It compiles cleanly. Global & pgbench "make
> check" are both ok.

:-)

> The structure typedef does not need a name. "typedef struct { } V...".

Ok!

> I tend to disagree with naming things after their type, eg "array".
> I'd suggest "vars" instead. "nvariables" could be "nvars" for
> consistency with that and "vars_sorted", and because
> "foo.variables->nvariables" starts looking heavy.
> 
> I'd suggest but "Variables" type declaration just after "Variable"
> type declaration in the file.

Thank you, I agree and I'll fix all this.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch
> - a patch for the ereport() macro (this is used to report client failures 
> that do not cause an aborts and this depends on the level of debugging).

ISTM that abort() is called under FATAL.

> - implementation: if possible, use the local ErrorData structure during the 
> errstart()/errmsg()/errfinish() calls. Otherwise use a static variable 
> protected by a mutex if necessary. To do all of this export the function 
> appendPQExpBufferVA from libpq.

This patch applies cleanly on top of the other ones (there are minimal 
interactions), compiles cleanly, global & pgbench "make check" are ok.

IMO this patch is more controversial than the other ones.

It is not really related to the aim of the patch series, which could do 
without, couldn't it? Moreover, it changes pgbench current behavior, which 
might be admissible, but should be discussed clearly.

I'd suggest that it should be an independent submission, unrelated to the 
pgbench error management patch.

The code adapts/duplicates existing server-side "ereport" stuff and brings 
it to the frontend, where the logging needs are somehow quite different.

I'd prefer to avoid duplication and/or have some code sharing. If it 
really needs to be duplicated, I'd suggest to put all this stuff in 
separate files. If we want to do that, I think that it would belong to 
fe_utils, where it could/should be used by all front-end programs.

I do not understand why names are changed, eg ELEVEL_FATAL instead of 
FATAL. ISTM that part of the point of the move would be to be homogeneous, 
which suggests that the same names should be reused.

For logging purposes, ISTM that the "elog" macro interface is nicer, 
closer to the existing "fprintf(stderr", as it would not introduce the 
additional parentheses hack for "rest".

I see no actual value in creating a dynamic buffer on the fly through 
plenty of macros and functions, as the end result is just to print the 
message out to stderr in the end.

   errfinishImpl: fprintf(stderr, "%s", error->message.data);

This looks like overkill. From reading the code, this does not look
like an improvement:

   fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con));

vs

   ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", PQerrorMessage(st->con))));

The whole complexity of the server-side interface only makes sense because 
of the TRY/CATCH stuff and complex logging requirements (eg several 
outputs) in the backend. The patch adds quite some code and complexity 
without clear added value that I can see.

The semantics of the existing code are changed: the FATAL level calls 
abort() and replaces existing exit(1) calls. Maybe you want an ERROR level 
as well.

My 0.02€: maybe you just want to turn

   fprintf(stderr, format, ...);
   // then possibly exit or abort depending...

into

   elog(level, format, ...);

which maybe would exit or abort depending on level, and possibly not 
actually report under some levels and/or some conditions. For that, it 
could be enough to just provide a nice "elog" function.

In conclusion, which you can disagree with because maybe I have missed 
something... anyway I currently think that:

  - it should be an independent submission

  - possibly at "fe_utils" level

  - possibly just a nice "elog" function is enough, if so just do that.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 10-06-2018 10:38, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch
>> - a patch for the ereport() macro (this is used to report client 
>> failures that do not cause an aborts and this depends on the level of 
>> debugging).
> 
> ISTM that abort() is called under FATAL.

If you mean aborting the client, this is not an abort of the main 
program.

>> - implementation: if possible, use the local ErrorData structure 
>> during the errstart()/errmsg()/errfinish() calls. Otherwise use a 
>> static variable protected by a mutex if necessary. To do all of this 
>> export the function appendPQExpBufferVA from libpq.
> 
> This patch applies cleanly on top of the other ones (there are minimal
> interactions), compiles cleanly, global & pgbench "make check" are ok.

:-)

> IMO this patch is more controversial than the other ones.
> 
> It is not really related to the aim of the patch series, which could
> do without, couldn't it?

> I'd suggest that it should be an independent submission, unrelated to
> the pgbench error management patch.

I suppose that this is related; because of my patch there may be a lot 
of such code (see v7 in [1]):

-            fprintf(stderr,
-                    "malformed variable \"%s\" value: \"%s\"\n",
-                    var->name, var->svalue);
+            if (debug_level >= DEBUG_FAILS)
+            {
+                fprintf(stderr,
+                        "malformed variable \"%s\" value: \"%s\"\n",
+                        var->name, var->svalue);
+            }

-        if (debug)
+        if (debug_level >= DEBUG_ALL)
              fprintf(stderr, "client %d sending %s\n", st->id, sql);

That's why it was suggested to make the error function which hides all 
these things (see [2]):

There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with 
corresponding fprintf(stderr..) I think it's time to do it like in the 
main code, wrap with some function like log(level, msg).

> Moreover, it changes pgbench current
> behavior, which might be admissible, but should be discussed clearly.

> The semantics of the existing code is changed, the FATAL levels calls
> abort() and replace existing exit(1) calls. Maybe you want an ERROR
> level as well.

Oh, thanks, I agree with you. And I do not want to change the program 
exit code without good reasons, but I'm sorry I may not know all pros 
and cons in this matter..

Or did you also mean other changes?

> The code adapts/duplicates existing server-side "ereport" stuff and
> brings it to the frontend, where the logging needs are somehow quite
> different.
> 
> I'd prefer to avoid duplication and/or have some code sharing.

I was recommended to use the same interface in [3]:

On elog/errstart: we already have a convention for what ereport() calls 
look like; I suggest to use that instead of inventing your own.

> If it
> really needs to be duplicated, I'd suggest to put all this stuff in
> separated files. If we want to do that, I think that it would belong
> to fe_utils, and where it could/should be used by all front-end
> programs.

I'll try to do it..

> I do not understand why names are changed, eg ELEVEL_FATAL instead of
> FATAL. ISTM that part of the point of the move would be to be
> homogeneous, which suggests that the same names should be reused.

Ok!

> For logging purposes, ISTM that the "elog" macro interface is nicer,
> closer to the existing "fprintf(stderr", as it would not introduce the
> additional parentheses hack for "rest".

I was also recommended to use ereport() instead of elog() in [3]:

With that, is there a need for elog()?  In the backend we have it 
because $HISTORY but there's no need for that here -- I propose to lose 
elog() and use only ereport everywhere.

> I see no actual value in creating on the fly a dynamic buffer through
> plenty macros and functions as the end result is just to print the
> message out to stderr in the end.
> 
>   errfinishImpl: fprintf(stderr, "%s", error->message.data);
> 
> This looks like overkill. From reading the code, this does not look
> like an improvement:
> 
>   fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con));
> 
> vs
> 
>   ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", 
> PQerrorMessage(st->con))));
> 
> The whole complexity of the server-side interface only make sense
> because TRY/CATCH stuff and complex logging requirements (eg several
> outputs) in the backend. The patch adds quite some code and complexity
> without clear added value that I can see.

> My 0.02€: maybe you just want to turn
> 
>   fprintf(stderr, format, ...);
>   // then possibly exit or abort depending...
> 
> into
> 
>   elog(level, format, ...);
> 
> which maybe would exit or abort depending on level, and possibly not
> actually report under some levels and/or some conditions. For that, it
> could enough to just provide an nice "elog" function.

I agree that elog() can be coded in this way. To use ereport() I need a 
structure to store the error level as a condition to exit.
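
For illustration, here is a stripped-down, single-threaded sketch of that 
idea (the real patch builds the message in a buffer, protects shared state 
with a mutex, and uses its own set of ELEVEL_* names; everything below is 
simplified and the names are placeholders):

   #include <stdarg.h>
   #include <stdbool.h>
   #include <stdio.h>
   #include <stdlib.h>

   typedef enum { ELEVEL_DEBUG, ELEVEL_LOG, ELEVEL_ERROR, ELEVEL_FATAL } ErrorLevel;

   static ErrorLevel cur_elevel;       /* stores the level of the message being built */

   static bool
   errstart(ErrorLevel elevel)
   {
       cur_elevel = elevel;
       return elevel > ELEVEL_DEBUG;   /* here: suppress only debug messages */
   }

   static int
   errmsg(const char *fmt, ...)
   {
       va_list ap;

       /* a real implementation would append to a buffer instead of printing */
       va_start(ap, fmt);
       vfprintf(stderr, fmt, ap);
       va_end(ap);
       fputc('\n', stderr);
       return 0;                       /* return value is ignored, as in the backend */
   }

   static void
   errfinish(void)
   {
       if (cur_elevel >= ELEVEL_FATAL)
           exit(1);                    /* the stored level decides whether to exit */
   }

   /* backend-style call syntax: ereport(LEVEL, (errmsg(...))); */
   #define ereport(elevel, rest) \
       do { if (errstart(elevel)) { (void) (rest); errfinish(); } } while (0)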

> In conclusion, which you can disagree with because maybe I have missed
> something... anyway I currently think that:
> 
>  - it should be an independent submission
> 
>  - possibly at "fe_utils" level
> 
>  - possibly just a nice "elog" function is enough, if so just do that.

I hope I answered all this above..

[1] 
https://www.postgresql.org/message-id/453fa52de88477df2c4a2d82e09e461c%40postgrespro.ru
[2] 
https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain
[3] 
https://www.postgresql.org/message-id/20180508105832.6o3uf3npfpjgk5m7%40alvherre.pgsql

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> I suppose that this is related; because of my patch there may be a lot of 
> such code (see v7 in [1]):
>
> -            fprintf(stderr,
> -                    "malformed variable \"%s\" value: \"%s\"\n",
> -                    var->name, var->svalue);
> +            if (debug_level >= DEBUG_FAILS)
> +            {
> +                fprintf(stderr,
> +                        "malformed variable \"%s\" value: \"%s\"\n",
> +                        var->name, var->svalue);
> +            }
>
> -        if (debug)
> +        if (debug_level >= DEBUG_ALL)
>             fprintf(stderr, "client %d sending %s\n", st->id, sql);

I'm not sure that debug messages need to be kept after debug, if it is 
about debugging pgbench itself. That is debatable.

> That's why it was suggested to make the error function which hides all these 
> things (see [2]):
>
>> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with 
>> corresponding fprintf(stderr..) I think it's time to do it like in the 
>> main code, wrap with some function like log(level, msg).

Yep. I did not write that, but I agree with an "elog" suggestion to switch

   if (...) { fprintf(...); exit/abort/continue/... }

to a simpler:

   elog(level, ...)

>> Moreover, it changes pgbench current behavior, which might be 
>> admissible, but should be discussed clearly.
>
>> The semantics of the existing code is changed, the FATAL levels calls
>> abort() and replace existing exit(1) calls. Maybe you want an ERROR
>> level as well.
>
> Oh, thanks, I agree with you. And I do not want to change the program exit 
> code without good reasons, but I'm sorry I may not know all pros and cons in 
> this matter..
>
> Or did you also mean other changes?

AFAICR I meant switching exit to abort in some cases.

>> The code adapts/duplicates existing server-side "ereport" stuff and
>> brings it to the frontend, where the logging needs are somehow quite
>> different.
>> 
>> I'd prefer to avoid duplication and/or have some code sharing.
>
> I was recommended to use the same interface in [3]:
>
>>> On elog/errstart: we already have a convention for what ereport() 
>>> calls look like; I suggest to use that instead of inventing your own.

The "elog" interface already exists, it is not an invention. "ereport" is 
a hack which is somehow necessary in some cases. I prefer a simple 
function call if possible for the purpose, and ISTM that this is the case.

>> If it really needs to be duplicated, I'd suggest to put all this stuff 
>> in separated files. If we want to do that, I think that it would belong 
>> to fe_utils, and where it could/should be used by all front-end 
>> programs.
>
> I'll try to do it..

Dunno. If you only need one "elog" function which prints a message to 
stderr and decides whether to abort/exit/whatever, maybe it can just be 
kept in pgbench. If there are several complicated functions and 
macros, better with a file. So I'd say it depends.

>> For logging purposes, ISTM that the "elog" macro interface is nicer,
>> closer to the existing "fprintf(stderr", as it would not introduce the
>> additional parentheses hack for "rest".
>
> I was also recommended to use ereport() instead of elog() in [3]:

Probably. Are you hoping that advice from different reviewers should be 
consistent? That seems optimistic:-)

>>> With that, is there a need for elog()?  In the backend we have it 
>>> because $HISTORY but there's no need for that here -- I propose to 
>>> lose elog() and use only ereport everywhere.

See commit 8a07ebb3c172 which turns some ereport into elog...

>> My 0.02€: maybe you just want to turn
>>
>>   fprintf(stderr, format, ...);
>>   // then possibly exit or abort depending...
>> 
>> into
>>
>>   elog(level, format, ...);
>> 
>> which maybe would exit or abort depending on level, and possibly not
>> actually report under some levels and/or some conditions. For that, it
>> could enough to just provide an nice "elog" function.
>
> I agree that elog() can be coded in this way. To use ereport() I need a 
> structure to store the error level as a condition to exit.

Yep. That is a lot of complication which is justified server-side where 
logging requirements are special, but in this case I see it as overkill.

So my current view is that if you only need an "elog" function, it is 
simpler to add it to "pgbench.c".

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
On 2018-Jun-13, Fabien COELHO wrote:

> > > > With that, is there a need for elog()?  In the backend we have
> > > > it because $HISTORY but there's no need for that here -- I
> > > > propose to lose elog() and use only ereport everywhere.
> 
> See commit 8a07ebb3c172 which turns some ereport into elog...

For context: in the backend, elog() is only used for internal messages
(i.e. "can't-happen" conditions), and ereport() is used for user-facing
messages.  There are many things ereport() has that elog() doesn't, such
as additional message fields (HINT, DETAIL, etc) that I think could have
some use in pgbench as well.  If you use elog() then you can't have that.

Another difference is that in the backend, elog() messages are never
translated, while ereport() messages are translated.  Since pgbench is
translatable I think it would be best to keep those things in sync, to
avoid confusion. (Although of course you could do it differently in
pgbench than backend.)

One thing that just came to mind is that pgbench uses some src/fe_utils
stuff.  I hope having ereport() doesn't cause a conflict with that ...

BTW I think abort() is not the right thing, as it'll cause core dumps if
enabled.  Why not just exit(1)?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 13-06-2018 22:59, Alvaro Herrera wrote:
> For context: in the backend, elog() is only used for internal messages
> (i.e. "can't-happen" conditions), and ereport() is used for user-facing
> messages.  There are many things ereport() has that elog() doesn't, 
> such
> as additional message fields (HINT, DETAIL, etc) that I think could 
> have
> some use in pgbench as well.  If you use elog() then you can't have 
> that.

AFAIU it was not originally intended that the pgbench error messages 
have these fields, so would it be good to change the final output to 
stderr?.. For example:

-        fprintf(stderr, "%s", PQerrorMessage(con));
-        fprintf(stderr, "(ignoring this error and continuing anyway)\n");
+        ereport(LOG,
+                (errmsg("Ignoring the server error and continuing anyway"),
+                 errdetail("%s", PQerrorMessage(con))));

-            fprintf(stderr, "%s", PQerrorMessage(con));
-            if (sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0)
-            {
-                fprintf(stderr, "Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n", PQdb(con));
-            }
-
-            exit(1);
+            ereport(ERROR,
+                    (errmsg("Server error"),
+                     errdetail("%s", PQerrorMessage(con)),
+                     sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0 ?
+                     errhint("Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n",
+                             PQdb(con)) : 0));

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 13-06-2018 22:44, Fabien COELHO wrote:
> Hello Marina,
> 
>> I suppose that this is related; because of my patch there may be a lot 
>> of such code (see v7 in [1]):
>> 
>> -            fprintf(stderr,
>> -                    "malformed variable \"%s\" value: \"%s\"\n",
>> -                    var->name, var->svalue);
>> +            if (debug_level >= DEBUG_FAILS)
>> +            {
>> +                fprintf(stderr,
>> +                        "malformed variable \"%s\" value: \"%s\"\n",
>> +                        var->name, var->svalue);
>> +            }
>> 
>> -        if (debug)
>> +        if (debug_level >= DEBUG_ALL)
>>             fprintf(stderr, "client %d sending %s\n", st->id, sql);
> 
> I'm not sure that debug messages needs to be kept after debug, if it
> is about debugging pgbench itself. That is debatable.

AFAICS it is not about debugging pgbench itself, but about more detailed 
information that can be used to understand what exactly happened during 
its launch. In the case of errors this helps to distinguish between 
failures and errors by type (including which limit for retries was 
violated and how far it was exceeded for the serialization/deadlock 
errors).

>>> The code adapts/duplicates existing server-side "ereport" stuff and
>>> brings it to the frontend, where the logging needs are somehow quite
>>> different.
>>> 
>>> I'd prefer to avoid duplication and/or have some code sharing.
>> 
>> I was recommended to use the same interface in [3]:
>> 
>>>> On elog/errstart: we already have a convention for what ereport() 
>>>> calls look like; I suggest to use that instead of inventing your 
>>>> own.
> 
> The "elog" interface already exists, it is not an invention. "ereport"
> is a hack which is somehow necessary in some cases. I prefer a simple
> function call if possible for the purpose, and ISTM that this is the
> case.

> That is a lot of complication which are justified server side
> where logging requirements are special, but in this case I see it as
> overkill.

I think we need ereport() if we want to make detailed error messages 
(see examples in [1])..

>>> If it really needs to be duplicated, I'd suggest to put all this 
>>> stuff in separated files. If we want to do that, I think that it 
>>> would belong to fe_utils, and where it could/should be used by all 
>>> front-end programs.
>> 
>> I'll try to do it..
> 
> Dunno. If you only need one "elog" function which prints a message to
> stderr and decides whether to abort/exit/whatevrer, maybe it can just
> be kept in pgbench. If there are are several complicated functions and
> macros, better with a file. So I'd say it depends.

> So my current view is that if you only need an "elog" function, it is
> simpler to add it to "pgbench.c".

Thank you!

>>> For logging purposes, ISTM that the "elog" macro interface is nicer,
>>> closer to the existing "fprintf(stderr", as it would not introduce 
>>> the
>>> additional parentheses hack for "rest".
>> 
>> I was also recommended to use ereport() instead of elog() in [3]:
> 
> Probably. Are you hoping that advises from different reviewers should
> be consistent? That seems optimistic:-)

To make the patch committable there should be no objection to it..

[1] 
https://www.postgresql.org/message-id/c89fcc380a19380260b5ea463efc1416%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Alvaro,

> For context: in the backend, elog() is only used for internal messages
> (i.e. "can't-happen" conditions), and ereport() is used for user-facing
> messages.  There are many things ereport() has that elog() doesn't, such
> as additional message fields (HINT, DETAIL, etc) that I think could have
> some use in pgbench as well.  If you use elog() then you can't have that.
> [...]

Ok. Then forget elog, but I'm pretty against having a kind of ereport 
which looks greatly overkill to me, because:

  (1) the syntax is pretty heavy, and does not look like a function.

  (2) the implementation allocates a string buffer for the message;
      this is greatly overkill for pgbench, which only needs to print
      to stderr once.

This makes sense server-side because the generated message may be output 
several times (eg stderr, file logging, to the client), and the 
implementation has to work with cpp implementations which do not handle 
varargs (and maybe other reasons).

So I would be in favor of having just a simpler error function. 
Incidentally, one already exists, "pgbench_error", and could be improved, 
extended, or replaced. There is also "syntax_error".

> One thing that just came to mind is that pgbench uses some src/fe_utils
> stuff.  I hope having ereport() doesn't cause a conflict with that ...

Currently ereport does not exist client-side. I do not think that this 
patch is the right moment to decide to do that. Also, there are some 
"elog" calls in libpq, but they are compiled out with a "#ifndef FRONTEND".

> BTW I think abort() is not the right thing, as it'll cause core dumps if
> enabled.  Why not just exit(1)?

Yes, I agree and already reported that.

Conclusion:

My current opinion is that I'm pretty against bringing "ereport" to the 
front-end on this specific pgbench patch. I agree with you that "elog" 
would be misleading there as well, for the arguments you developed above.

I'd suggest to have just one clean and simple pgbench internal function to 
handle errors and possibly exit, debug... Something like

   void pgb_error(FATAL, "error %d raised", 12);

Implemented as

   void pgb_error(int/enum XXX level, const char * format, ...)
   {
      test level and maybe return immediately (eg debug);
      print to stderr;
      exit/abort/return depending;
   }
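
A self-contained sketch of such a function could look as follows (the 
level names and the debug flag are placeholders, and exit(1) is used 
rather than abort(), as suggested elsewhere in this thread):

   #include <stdarg.h>
   #include <stdbool.h>
   #include <stdio.h>
   #include <stdlib.h>

   typedef enum { PGB_DEBUG, PGB_LOG, PGB_FATAL } pgb_level;

   static bool debug = false;          /* stands in for pgbench's -d flag */

   static void
   pgb_error(pgb_level level, const char *fmt, ...)
   {
       va_list ap;

       if (level == PGB_DEBUG && !debug)
           return;                     /* nothing to report at this level */

       va_start(ap, fmt);
       vfprintf(stderr, fmt, ap);
       va_end(ap);

       if (level == PGB_FATAL)
           exit(1);                    /* give up on the whole run */
   }

A call like pgb_error(PGB_FATAL, "error %d raised\n", 12); would then 
print the message and terminate the run.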

Then if some advanced error handling is introduced for front-end programs, 
possibly through some macros, then it would be time to improve upon that.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0004-Pgbench-errors-and-serialization-deadlock-retries.patch
> - the main patch for handling client errors and repetition of transactions 
> with serialization/deadlock failures (see the detailed description in the 
> file).

Here is a review for the last part of your v9 version.

Patch does not "git apply" (maybe not anymore):
   error: patch failed: doc/src/sgml/ref/pgbench.sgml:513
   error: doc/src/sgml/ref/pgbench.sgml: patch does not apply

However I could get it to apply with the "patch" command.

Then patch compiles, global & pgbench "make check" are ok.

Feature
=======

The patch adds the ability to restart transactions (i.e. the full script)
on some errors, which is a good thing as it allows to exercise postgres
performance in more realistic scenarios.

* -d/--debug: I'm not in favor of requiring a mandatory text argument on this
option. It is not practical, the user has to remember it, and it is a change.
I'm sceptical of the overall debug handling changes. Maybe we could have
multiple -d which lead to a higher debug level, but I'm not sure that it can be
made to work for this case and still be compatible with the previous behavior.
Maybe you need a specific option for your purpose, eg "--debug-retry"?


Code
====

* The implementation is less complex than the previous submission, which 
is a good thing. I'm not sure that all the remaining complexity is still 
fully needed.

* I'm reserved about the whole ereport thing, see comments in other
messages.

Levels ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
In particular, the "CLIENT" part is not very useful. If the
distinction makes sense, I would have kept "LOG" for the initial one and
added other ones for ABORT and PGBENCH, maybe.

* There are no comments about "retries" in StatData, CState and Command
structures.

* Also, for StatData, I would like to understand the logic between cnt,
skipped, retries, retried, errors, ... so a clear information about the
expected invariant if any would be welcome. One has to go in the code to
understand how these fields relate one to the other.

* "errors_in_failed_tx" is some subcounter of "errors", for a special 
case. Why it is there escapes me [I finally understood, and I think it 
should be removed, see end of review]. If we wanted to distinguish, then 
we should distinguish homogeneously: maybe just count the different error 
types, eg have things like "deadlock_errors", "serializable_errors", 
"other_errors", "internal_pgbench_errors" which would be orthogonal one to 
the other, and "errors" could be recomputed from these.

* How "errors" differs from "ecnt" is unclear to me.

* FailureStatus states are not homogeneously named. I'd suggest to use 
*_FAILURE for all cases. The miscellaneous case should probably be the 
last. I do not understand the distinction between ANOTHER_FAILURE & 
IN_FAILED_SQL_TRANSACTION. Why should it be needed? [again, see and of 
review]

* I do not understand the comments on CState enum: "First, remember the failure
in CSTATE_FAILURE. Then process other commands of the failed transaction if any"
Why would other commands be processed at all if the transaction is aborted?
For me any error must lead to the rollback and possible retry of the
transaction. This comment needs to be clarified. It should also say
that on FAILURE, it will go either to RETRY or ABORTED. See below my 
comments about doCustom.

It is unclear to me why there could be several failures within a 
transaction, as I would have thought that it would be aborted on the first 
one.

* I do not understand the purpose of first_failure. The comment should explain
why it would need to be remembered. From my point of view, I'm not fully
convinced that it should.

* commandFailed: I think that it should be kept much simpler. In 
particular, having errors on errors does not help much: on ELEVEL_FATAL, 
it ignores the actual reported error and generates another error of the 
same level, so that the initial issue is hidden. Even if these are can't 
happen cases, hiding the origin if it occurs looks unhelpful. Just print 
it directly, and maybe abort if you think that it is a can't happen case.

* copyRandomState: just use sizeof(RandomState) instead of making assumptions
about the contents of the struct. Also, this function looks pretty useless,
why not just do a plain assignment?

* copyVariables: lacks comments to explain that the destination is cleaned up
and so on. The cleanup phase could probably be in a distinct function, so that
the code would be clearer. Maybe the function variable names are too long.

   if (current_source->svalue)

in the context of a guard for a strdup, maybe:

   if (current_source->svalue != NULL)

* executeCondition: this hides client automaton state changes which were
clearly visible beforehand in the switch, and the different handling of
if & elif is also hidden.

I'm against this unnecessary restructuring and hiding such information;
all state changes should be clearly seen in the state switch so that it is
easier to understand and follow.

I do not see why touching the conditional stack on internal errors
(evaluateExpr failure) brings anything, the whole transaction will be aborted
anyway.

* doCustom changes.

On CSTATE_START_COMMAND, it considers whether to retry at the end.
For me, this cannot happen: if some command failed, then it should have
skipped directly to the RETRY state, so that you cannot get to the end
of the script with an error. Maybe you could assert that the state of the
previous command is NO_FAILURE, though.

On CSTATE_FAILURE, the next command is possibly started. Although there is some
consistency with the previous point, I think that it totally breaks the state
automaton where now a command can start while the whole transaction is
in failing state anyway. There was no point in starting it in the first 
place.

So, for me, the FAILURE state should record/count the failure, then skip
to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
This is much clearer that way.

Then RETRY should reinstate the global state and proceed to start the *first*
command again.

The current RETRY state does memory allocations to generate a message
with buffer allocation and so on. This looks like a costly and useless
operation. If the user required "retries", then this is normal behavior,
the retries are counted and will be printed out in the final report,
and there is no point in printing out every single one of them.
Maybe you want that for debugging, but then costly operations should be guarded.

It is unclear to me why backslash command errors are turned to FAILURE
instead of ABORTED: there is no way they are going to be retried, so
maybe they should/could skip directly to ABORTED?

Function executeCondition is a bad idea, as stated above.

* reporting

The number of transactions above the latency limit report can be simplified.
Remove the if and just use one printf with a %s for the optional comment.
I'm not sure this optional comment is useful there.

Before the patch, ISTM that all lines relied on one printf. You have 
changed to a style where a collection of printfs is used to compose a line. 
I'd suggest to keep to the previous one-printf-prints-one-line style, 
where possible.

You have added 20-columns alignment prints. This looks like too much and
generates much too large lines. Probably 10 (billion) would be enough.

Some people try to parse the output, so it should be deterministic. I'd add
the needed columns always if appropriate (i.e. under retry), even if none
occurred.

* processXactStats: An else is replaced by detailed stats, with the initial
"no detailed stats" comment kept. The function is called both in the then
& else branch. The structure does not make sense anymore. I'm not sure 
this change was needed.

* getLatencyUsed: declared "double" so "return 0.0".

* typo: ruin -> run; probably others, I did not check for them in detail.


TAP Tests
=========

On my laptop, tests last 5.5 seconds before the patch, and about 13 seconds
after. This is much too large. Pgbench TAP tests do not deserve to take over
twice as much time as before just on this patch.

One reason which explains this large time is that there is a new script with 
a newly created instance. I'd suggest to append tests to the existing 2 
scripts, depending on whether they need a running instance or not.

Secondly, I think that the design of the tests is too heavy. For such a 
feature, ISTM enough to check that it works, i.e. one test for deadlocks 
(trigger one or a few deadlocks), idem for serializable, maybe idem for 
other errors if any.

The challenge is to do that reliably and efficiently, i.e. so that the test does
not rely on chance and is still quite efficient.

The trick you use is to run an interactive psql in parallel to pgbench so as to
play with concurrent locks. That is interesting, but deserves more comments
and explanation, eg before the test functions.

Maybe this could be achieved within pgbench by using some wait stuff in 
PL/pgSQL so that concurrent clients can wait for one another based on data 
in an unlogged table updated by a CALL within an "embedded" transaction? Not 
sure. Otherwise, maybe a (simple) pgbench-side thread barrier could help, 
but this would require more thinking.

Anyway, TAP tests should be much lighter (in total time), and if possible 
much simpler.

Trying a latency limit of 900 ms is a bad idea because it takes a lot of time.
I did such tests before and they were removed by Tom Lane because of determinism
and time issues. I would comment this test out for now.

Documentation
=============

Not looked at in much details for now. Just a few comments:

Having the "most important settings" on line 1-6 and 8 (i.e. skipping 7) looks
silly. The important ones should simply be the first ones, and the 8th is not
that important, or it is in 7th position.

I do not understand why there is so much text about in failed sql transaction
stuff, while we are mainly interested in serialization & deadlock errors, and
this only falls in some "other" category. There seem to be more details about
other errors than about deadlocks & serializable errors.

The reporting should focus on what is of interest, either all errors, or some
detailed split of these errors. The documentation should state clearly what
are the counted errors, and then what are their effects on the reported stats.
The "Errors and Serialization/Deadlock Retries" section is a good start in that
direction, but it does not talk about pgbench internal errors (eg "cos(true)").
I think it should more explicit about errors.

Option --max-tries default value should be spelled out in the doc.

"Client's run is aborted", do you mean "Pgbench run is aborted"?

"If a failed transaction block does not terminate in the current script":
this just looks like a very bad idea, and explains my general ranting
above about this error condition. ISTM that the only reasonable option
is that a pgbench script should be enforced as a transaction, or a set of
transactions, but cannot be a "piece" of transaction, i.e. pgbench script
with "BEGIN;" but without a corresponding "COMMIT" is a user error and
warrants an abort, so that there is no need to manage these "in aborted
transaction" errors every where and report about them and document them
extensively.

This means adding a check when a script is finished or starting that
PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if not
with a fatal error. Then we can forget about these "in tx errors" counting,
reporting and so on, and just have to document the restriction.
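
A sketch of that check using libpq (the function name and message wording 
are illustrative):

   #include <stdio.h>
   #include <stdlib.h>
   #include <libpq-fe.h>

   /* called when a client reaches the end of its script */
   static void
   checkScriptEndedTransaction(PGconn *con, int client_id)
   {
       if (PQtransactionStatus(con) != PQTRANS_IDLE)
       {
           fprintf(stderr,
                   "client %d: script did not close its transaction block\n",
                   client_id);
           exit(1);                    /* fatal: the script itself is a user error */
       }
   }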

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 09-07-2018 16:05, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

> Here is a review for the last part of your v9 version.

Thank you very much for this!

> Patch does not "git apply" (may anymore):
>   error: patch failed: doc/src/sgml/ref/pgbench.sgml:513
>   error: doc/src/sgml/ref/pgbench.sgml: patch does not apply

Sorry, I'll send a new version soon.

> However I could get it to apply with the "patch" command.
> 
> Then patch compiles, global & pgbench "make check" are ok.

:-)

> Feature
> =======
> 
> The patch adds the ability to restart transactions (i.e. the full 
> script)
> on some errors, which is a good thing as it allows to exercice postgres
> performance in more realistic scenarii.
> 
> * -d/--debug: I'm not in favor in requiring a mandatory text argument 
> on this
> option. It is not pratical, the user has to remember it, and it is a 
> change.
> I'm sceptical of the overall debug handling changes. Maybe we could 
> have
> multiple -d which lead to higher debug level, but I'm not sure that it 
> can be
> made to work for this case and still be compatible with the previous 
> behavior.
> Maybe you need a specific option for your purpose, eg "--debug-retry"?

As you wrote in [1], adding an additional option is also a bad idea:

> I'm sceptical of the "--debug-fails" options. ISTM that --debug is
> already there
> and should just be reused.

Maybe it's better to use an optional argument/arguments for 
compatibility (--debug[=fails] or --debug[=NUM])? But if we use the 
numbers, now I can see only 2 levels, and there's no guarantee that they 
will not change..

> Code
> ====
> 
> * The implementation is less complex that the previous submission,
> which is a good thing. I'm not sure that all the remaining complexity
> is still fully needed.
> 
> * I'm reserved about the whole ereport thing, see comments in other
> messages.

Thank you, I'll try to implement the error reporting in the way you 
suggested.

> Leves ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
> In particular, the "CLIENT" part is not very useful. If the
> distinction makes sense, I would have kept "LOG" for the initial one 
> and
> add other ones for ABORT and PGBENCH, maybe.

Ok!

> * There are no comments about "retries" in StatData, CState and Command
> structures.
> 
> * Also, for StatData, I would like to understand the logic between cnt,
> skipped, retries, retried, errors, ... so a clear information about the
> expected invariant if any would be welcome. One has to go in the code 
> to
> understand how these fields relate one to the other.
> 
> <...>
> 
> * How "errors" differs from "ecnt" is unclear to me.

Thank you, I'll fix this.

> * commandFailed: I think that it should be kept much simpler. In
> particular, having errors on errors does not help much: on
> ELEVEL_FATAL, it ignores the actual reported error and generates
> another error of the same level, so that the initial issue is hidden.
> Even if these are can't happen cases, hidding the origin if it occurs
> looks unhelpful. Just print it directly, and maybe abort if you think
> that it is a can't happen case.

Oh, thanks, my mistake(

> * copyRandomState: just use sizeof(RandomState) instead of making 
> assumptions
> about the contents of the struct. Also, this function looks pretty 
> useless,
> why not just do a plain assignment?
> 
> * copyVariables: lacks comments to explain that the destination is 
> cleaned up
> and so on. The cleanup phase could probaly be in a distinct function, 
> so that
> the code would be clearer. Maybe the function variable names are too 
> long.

Thank you, I'll fix this.

>   if (current_source->svalue)
> 
> in the context of a guard for a strdup, maybe:
> 
>   if (current_source->svalue != NULL)

I'm sorry, I'll fix this.

> * I do not understand the comments on CState enum: "First, remember the 
> failure
> in CSTATE_FAILURE. Then process other commands of the failed 
> transaction if any"
> Why would other commands be processed at all if the transaction is 
> aborted?
> For me any error must leads to the rollback and possible retry of the
> transaction. This comment needs to be clarified. It should also say
> that on FAILURE, it will go either to RETRY or ABORTED. See below my
> comments about doCustom.
> 
> It is unclear to me why their could be several failures within a
> transaction, as I would have stopped that it would be aborted on the
> first one.
> 
> * I do not undestand the purpose of first_failure. The comment should 
> explain
> why it would need to be remembered. From my point of view, I'm not 
> fully
> convinced that it should.
> 
> <...>
> 
> * executeCondition: this hides client automaton state changes which 
> were
> clearly visible beforehand in the switch, and the different handling of
> if & elif is also hidden.
> 
> I'm against this unnecessary restructuring and to hide such an 
> information,
> all state changes should be clearly seen in the state switch so that it 
> is
> easier to understand and follow.
> 
> I do not see why touching the conditional stack on internal errors
> (evaluateExpr failure) brings anything, the whole transaction will be 
> aborted
> anyway.
> 
> * doCustom changes.
> 
> On CSTATE_START_COMMAND, it considers whether to retry at the end.
> For me, this cannot happen: if some command failed, then it should have
> skipped directly to the RETRY state, so that you cannot get to the end
> of the script with an error. Maybe you could assert that the state of 
> the
> previous command is NO_FAILURE, though.
> 
> On CSTATE_FAILURE, the next command is possibly started. Although there 
> is some
> consistency with the previous point, I think that it totally breaks the 
> state
> automaton where now a command can start while the whole transaction is
> in failing state anyway. There was no point in starting it in the first 
> place.
> 
> So, for me, the FAILURE state should record/count the failure, then 
> skip
> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
> This is much clearer that way.
> 
> Then RETRY should reinstate the global state and proceed to start the 
> *first*
> command again.
> 
> <...>
> 
> It is unclear to me why backslash command errors are turned to FAILURE
> instead of ABORTED: there is no way they are going to be retried, so
> maybe they should/could skip directly to ABORTED?
> 
> Function executeCondition is a bad idea, as stated above.

So do you propose to execute the command "ROLLBACK" without calculating 
its latency etc. if we are in a failed transaction and clear the 
conditional stack after each failure?

Also just to be clear: do you want to have the state CSTATE_ABORTED for 
client abortion and another state for interrupting the current 
transaction?
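
For reference, a hand-written sketch (not the patch's code) of the 
control flow suggested above, with simplified state names and 
transitions:

#include <stdbool.h>

typedef enum
{
    CSTATE_START_COMMAND,
    CSTATE_FAILURE,
    CSTATE_RETRY,
    CSTATE_ABORTED
} ClientState;

/* FAILURE only records the failure, then goes to RETRY or ABORTED. */
static ClientState
on_failure(bool retry_decided)
{
    return retry_decided ? CSTATE_RETRY : CSTATE_ABORTED;
}

/* RETRY reinstates the saved state and restarts from the first command. */
static ClientState
on_retry(int *current_command)
{
    *current_command = 0;       /* back to the script's first command */
    return CSTATE_START_COMMAND;
}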

> The current RETRY state does memory allocations to generate a message
> with buffer allocation and so on. This looks like a costly and useless
> operation. If the user required "retries", then this is normal 
> behavior,
> the retries are counted and will be printed out in the final report,
> and there is no point in printing out every single one of them.
> Maybe you want that debugging, but then costly operations should be 
> guarded.

I think we need these debugging messages because, for example, if you 
use the option --latency-limit, we will never know in advance 
whether the serialization/deadlock failure will be retried or not. They 
also help to understand which limit of retries was violated or how close 
we were to these limits during the execution of a specific transaction. 
But I agree with you that they are costly and can be skipped if the 
failure type is never retried. Maybe it is better to split them into 
multiple error function calls?..

> * reporting
> 
> The number of transactions above the latency limit report can be 
> simplified.
> Remove the if and just use one printf with a %s for the optional 
> comment.
> I'm not sure this optional comment is useful there.

Oh, thanks, my mistake(

> Before the patch, ISTM that all lines relied on one printf. You have
> changed to a style where a collection of printf is used to compose a
> line. I'd suggest to keep to the previous one-printf-prints-one-line
> style, where possible.

Ok!

> You have added 20-columns alignment prints. This looks like too much 
> and
> generates much too large lines. Probably 10 (billion) would be enough.

I have already asked you about this in [2]:
> The variables for the numbers of failures and retries are of type int64
> since the variable for the total number of transactions has the same
> type. That's why such a large alignment (as I understand it now, 20
> characters is enough). Do you prefer floating alignments, depending on the
> maximum number of failures/retries for any command in any script?

> Some people try to parse the output, so it should be deterministic. I'd 
> add
> the needed columns always if appropriate (i.e. under retry), even if 
> none
> occurred.

Ok!

> * processXactStats: An else is replaced by detailed stats, with the 
> initial
> "no detailed stats" comment kept. The function is called both in the 
> then
> & else branch. The structure does not make sense anymore. I'm not sure
> this change was needed.
> 
> * getLatencyUsed: declared "double" so "return 0.0".
> 
> * typo: ruin -> run; probably others, I did not check for them in 
> detail.

Oh, thanks, my mistakes(

> TAP Tests
> =========
> 
> On my laptop, tests last 5.5 seconds before the patch, and about 13 
> seconds
> after. This is much too large. Pgbench TAP tests do not deserve to take 
> over
> twice as much time as before just on this patch.
> 
> One reason which explains this large time is that there is a new script
> with a newly created instance. I'd suggest to append tests to the
> existing 2 scripts, depending on whether they need a running instance
> or not.

Ok! All new tests that do not need a running instance are already added 
to the file 002_pgbench_no_server.pl.

> Secondly, I think that the design of the tests is too heavy. For such
> a feature, ISTM enough to check that it works, i.e. one test for
> deadlocks (trigger one or a few deadlocks), idem for serializable,
> maybe idem for other errors if any.
> 
> <...>
> 
> The 900 ms latency limit test is a bad idea because it takes a lot of 
> time.
> I did such tests before and they were removed by Tom Lane because of 
> determinism
> and time issues. I would comment this test out for now.

Ok! If it doesn't bother you - can you tell me more about the causes of 
these determinism issues?.. Tests for some other failures that cannot be 
retried are already added to 001_pgbench_with_server.pl.

> The challenge is to do that reliably and efficiently, i.e. so that the 
> test does
> not rely on chance and is still quite efficient.
> 
> The trick you use is to run an interactive psql in parallel to pgbench 
> so as to
> play with concurrent locks. That is interesting, but deserves more 
> comments
> and explanation, eg before the test functions.
> 
> Maybe this could be achieved within pgbench by using some wait stuff
> in PL/pgSQL so that concurrent clients can wait for one another based on
> data in an unlogged table updated by a CALL within "embedded"
> transactions? Not sure.
> 
> <...>
> 
> Anyway, TAP tests should be much lighter (in total time), and if
> possible much simpler.

I'll try, thank you..

> Otherwise, maybe (simple) pgbench-side thread
> barrier could help, but this would require more thinking.

Tests must pass if we use --disable-thread-safety..

> Documentation
> =============
> 
> Not looked at in much details for now. Just a few comments:
> 
> Having the "most important settings" on line 1-6 and 8 (i.e. skipping 
> 7) looks
> silly. The important ones should simply be the first ones, and the 8th 
> is not
> that important, or it is in 7th position.

Ok!

> I do not understand why there is so much text about in failed sql 
> transaction
> stuff, while we are mainly interested in serialization & deadlock 
> errors, and
> this only falls in some "other" category. There seems to be more 
> details about
> other errors that about deadlocks & serializable errors.
> 
> The reporting should focus on what is of interest, either all errors, 
> or some
> detailed split of these errors.
> 
> <...>
> 
> * "errors_in_failed_tx" is some subcounter of "errors", for a special
> case. Why it is there fails me [I finally understood, and I think it
> should be removed, see end of review]. If we wanted to distinguish,
> then we should distinguish homogeneously: maybe just count the
> different error types, eg have things like "deadlock_errors",
> "serializable_errors", "other_errors", "internal_pgbench_errors" which
> would be orthogonal one to the other, and "errors" could be recomputed
> from these.

Thank you, I agree with you. Unfortunately each new error type adds 1 or 
2 new columns of maximum width 20 to the per-statement report (to 
report errors and possibly retries of this type in this statement) and 
we already have 2 new columns for all errors and retries. So I'm not 
sure that we need to add anything other than statistics only about all the 
errors and all the retries in general.

> The documentation should state clearly what
> are the counted errors, and then what are their effects on the reported 
> stats.
> The "Errors and Serialization/Deadlock Retries" section is a good start 
> in that
> direction, but it does not talk about pgbench internal errors (eg 
> "cos(true)").
> I think it should be more explicit about errors.

Thank you, I'll try to improve it.

> Option --max-tries default value should be spelled out in the doc.

If you mean that it is set to 1 if neither of the options --max-tries or 
--latency-limit is explicitly used, I'll fix this.

> "Client's run is aborted", do you mean "Pgbench run is aborted"?

No, other clients continue their run as usual.

> * FailureStatus states are not homogeneously named. I'd suggest to use
> *_FAILURE for all cases. The miscellaneous case should probably be the
> last. I do not understand the distinction between ANOTHER_FAILURE &
> IN_FAILED_SQL_TRANSACTION. Why should it be needed? [again, see end of
> review]
> 
> <...>
> 
> "If a failed transaction block does not terminate in the current 
> script":
> this just looks like a very bad idea, and explains my general ranting
> above about this error condition. ISTM that the only reasonable option
> is that a pgbench script should be enforced as a transaction, or a set 
> of
> transactions, but cannot be a "piece" of transaction, i.e. pgbench 
> script
> with "BEGIN;" but without a corresponding "COMMIT" is a user error and
> warrants an abort, so that there is no need to manage these "in aborted
> transaction" errors every where and report about them and document them
> extensively.
> 
> This means adding a check when a script is finished or starting that
> PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if 
> not
> with a fatal error. Then we can forget about these "in tx errors" 
> counting,
> reporting and so on, and just have to document the restriction.

Ok!
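
A minimal sketch of such a check; PQtransactionStatus() and PQTRANS_IDLE 
are the standard libpq API, while the helper and field names below are 
just assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

/* Called when a script ends: enforce that it closed its own transaction. */
static void
check_script_ended_idle(PGconn *con, int client_id)
{
    if (PQtransactionStatus(con) != PQTRANS_IDLE)
    {
        fprintf(stderr,
                "client %d: script did not close its transaction block\n",
                client_id);
        exit(1);
    }
}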

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre
[2] 
https://www.postgresql.org/message-id/e4c5e8cefa4a8e88f1273b0f1ee29e56@postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

>> * -d/--debug: I'm not in favor of requiring a mandatory text argument on 
>> this option.
>
> As you wrote in [1], adding an additional option is also a bad idea:

Hey, I'm entitled to some internal contradictions:-)

>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
>> already there and should just be reused.

I was thinking that you could just use the existing --debug, not change 
its syntax. My point was that --debug exists, and you could just print
the messages when under --debug.

> Maybe it's better to use an optional argument/arguments for compatibility 
> (--debug[=fails] or --debug[=NUM])? But if we use the numbers, at the moment 
> I can see only 2 levels, and there's no guarantee that they will not change..

Optional arguments to options (!) are not really clean things, so I'd like 
to avoid going down this path, esp. as I cannot see any other instance in 
pgbench or elsewhere in postgres, and I personally consider these as a bad 
idea.

So if absolutely necessary, a new option is still better than changing 
--debug syntax. If not necessary, then it is better:-)

>> * I'm reserved about the whole ereport thing, see comments in other
>> messages.
>
> Thank you, I'll try to implement the error reporting in the way you 
> suggested.

Dunno if it is a good idea either. The committer's word is the good one in 
the end:-)

> Thank you, I'll fix this.
> I'm sorry, I'll fix this.

You do not have to thank me or be sorry about every comment I make; once 
for the former is enough, and there is no need for the latter.

>> * doCustom changes.

>> 
>> On CSTATE_FAILURE, the next command is possibly started. Although there 
>> is some consistency with the previous point, I think that it totally 
>> breaks the state automaton where now a command can start while the 
>> whole transaction is in failing state anyway. There was no point in 
>> starting it in the first place.
>> 
>> So, for me, the FAILURE state should record/count the failure, then skip
>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
>> This is much clearer that way.
>> 
>> Then RETRY should reinstate the global state and proceed to start the 
>> *first* command again.
>> <...>
>> 
>> It is unclear to me why backslash command errors are turned to FAILURE
>> instead of ABORTED: there is no way they are going to be retried, so
>> maybe they should/could skip directly to ABORTED?

> So do you propose to execute the command "ROLLBACK" without calculating its 
> latency etc. if we are in a failed transaction and clear the conditional 
> stack after each failure?

> Also just to be clear: do you want to have the state CSTATE_ABORTED for 
> client abortion and another state for interrupting the current transaction?

I do not understand what "interrupting the current transaction" means. A 
transaction is either committed or rolled back, I do not know about 
"interrupted". When it is rolled back, probably some stats will be 
collected in passing, I'm fine with that.

If there is an error in a pgbench script, the transaction is aborted, 
which means for me that the script execution is stopped where it was, and 
either it is restarted from the beginning (retry) or counted as failure 
(not retry, just aborted, really).

If by interrupted you mean that one script begins a transaction and 
another ends it, as I said in the review I think that this strange case 
should be forbidden, so that all the code and documentation trying to
manage that can be removed.

>> The current RETRY state does memory allocations to generate a message
>> with buffer allocation and so on. This looks like a costly and useless
>> operation. If the user required "retries", then this is normal behavior,
>> the retries are counted and will be printed out in the final report,
>> and there is no point in printing out every single one of them.
>> Maybe you want that debugging, but then costly operations should be 
>> guarded.
>
> I think we need these debugging messages because, for example,

Debugging messages should cost only when under debug. When not under debug, 
there should be no debugging message, and there should be no cost for 
building and discarding such messages in the executed code path beyond
testing whether the program is under debug.

> if you use the option --latency-limit, we will never know in advance 
> whether the serialization/deadlock failure will be retried or not.

ISTM that it will be shown in the final report. If I want debug, I ask for 
--debug, otherwise I think that the command should do what it was asked 
for, i.e. run scripts, collect performance statistics and show them at the 
end.

In particular, when running with retries enabled, the user is expecting 
deadlock/serialization errors, so that they are not "errors" as such for
them.

> They also help to understand which limit of retries was violated or how 
> close we were to these limits during the execution of a specific 
> transaction. But I agree with you that they are costly and can be 
> skipped if the failure type is never retried. Maybe it is better to 
> split them into multiple error function calls?..

Debugging message costs should only be incurred when under --debug, not 
otherwise.

>> You have added 20-columns alignment prints. This looks like too much and
>> generates much too large lines. Probably 10 (billion) would be enough.
>
> I have already asked you about this in [2]:

Probably:-)

>> The variables for the numbers of failures and retries are of type int64
>> since the variable for the total number of transactions has the same
>> type. That's why such a large alignment (as I understand it now, 20
>> characters is enough). Do you prefer floating alignments, depending on the
>> maximum number of failures/retries for any command in any script?

An int64 counter is not likely to reach its limit anytime soon:-) If the 
column display limit is ever reached, ISTM that then the text is just 
misaligned, which is a minor and rare inconvenience. If very wide columns 
are used, then it does not fit my terminal and the report text will always 
be wrapped around, which makes it harder to read, every time.

>> The 900 ms latency limit test is a bad idea because it takes a lot of 
>> time. I did such tests before and they were removed by Tom Lane because 
>> of determinism and time issues. I would comment this test out for now.
>
> Ok! If it doesn't bother you - can you tell me more about the causes of these 
> determinism issues?.. Tests for some other failures that cannot be retried 
> are already added to 001_pgbench_with_server.pl.

Some farm animals are very slow, so you cannot really assume much about 
time one way or another.

>> Otherwise, maybe (simple) pgbench-side thread
>> barrier could help, but this would require more thinking.
>
> Tests must pass if we use --disable-thread-safety..

Sure. My wording was misleading. I just meant a synchronisation barrier 
between concurrent clients, which could be managed with one thread. 
Anyway, it is probably overkill for the problem at hand, so just forget.

>> I do not understand why there is so much text about in failed sql 
>> transaction stuff, while we are mainly interested in serialization & 
>> deadlock errors, and this only falls in some "other" category. There 
>> seem to be more details about other errors than about deadlocks & 
>> serializable errors.
>> 
>> The reporting should focus on what is of interest, either all errors, 
>> or some detailed split of these errors.
>> 
>> <...>
>> 
>> * "errors_in_failed_tx" is some subcounter of "errors", for a special
>> case. Why it is there fails me [I finally understood, and I think it
>> should be removed, see end of review]. If we wanted to distinguish,
>> then we should distinguish homogeneously: maybe just count the
>> different error types, eg have things like "deadlock_errors",
>> "serializable_errors", "other_errors", "internal_pgbench_errors" which
>> would be orthogonal one to the other, and "errors" could be recomputed
>> from these.
>
> Thank you, I agree with you. Unfortunately each new error type adds 1 or 2 
> new columns of maximum width 20 to the per-statement report

The fact that some data are collected does not mean that they should all 
be reported in detail. We can have detailed error counts and report the sum 
of these errors for instance, or have some more verbose/detailed reports
as options (eg --latencies does just that).

>> <...>
>> 
>> "If a failed transaction block does not terminate in the current script":
>> this just looks like a very bad idea, and explains my general ranting
>> above about this error condition. ISTM that the only reasonable option
>> is that a pgbench script should be enforced as a transaction, or a set of
>> transactions, but cannot be a "piece" of transaction, i.e. pgbench script
>> with "BEGIN;" but without a corresponding "COMMIT" is a user error and
>> warrants an abort, so that there is no need to manage these "in aborted
>> transaction" errors every where and report about them and document them
>> extensively.
>> 
>> This means adding a check when a script is finished or starting that
>> PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if not
>> with a fatal error. Then we can forget about these "in tx errors" counting,
>> reporting and so on, and just have to document the restriction.
>
> Ok!

Good:-) ISTM that this would remove a significant amount of complexity 
from the code and documentation.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 16:24, Fabien COELHO wrote:
> Hello Marina,
> 
>>> * -d/--debug: I'm not in favor of requiring a mandatory text argument 
>>> on this option.
>> 
>> As you wrote in [1], adding an additional option is also a bad idea:
> 
> Hey, I'm entitled to some internal contradictions:-)

... and discussions will continue forever %-)

>>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
>>> already there and should just be reused.
> 
> I was thinking that you could just use the existing --debug, not
> change its syntax. My point was that --debug exists, and you could
> just print
> the messages when under --debug.

Now I understand you better, thanks. I think it will be useful to 
receive only messages about failures, because they and progress reports 
can be lost in many other debug messages such as "client %d sending ..." 
/ "client %d executing ..." / "client %d receiving".

>> Maybe it's better to use an optional argument/arguments for 
>> compatibility (--debug[=fails] or --debug[=NUM])? But if we use the 
>> numbers, at the moment I can see only 2 levels, and there's no guarantee 
>> that they will not change..
> 
> Optional arguments to options (!) are not really clean things, so I'd
> like to avoid going down this path, esp. as I cannot see any other
> instance in pgbench or elsewhere in postgres,

AFAICS they are used in pg_waldump (option --stats[=record]) and in psql 
(option --help[=topic]).

> and I personally
> consider these as a bad idea.

> So if absolutely necessary, a new option is still better than changing
> --debug syntax. If not necessary, then it is better:-)

Ok!

>>> * I'm reserved about the whole ereport thing, see comments in other
>>> messages.
>> 
>> Thank you, I'll try to implement the error reporting in the way you 
>> suggested.
> 
> Dunno if it is a good idea either. The committer's word is the good one
> in the end:-)

I agree with you that ereport has good reasons to be non-trivial in the 
backend and it does not have the same reasons in pgbench..

>>> * doCustom changes.
> 
>>> 
>>> On CSTATE_FAILURE, the next command is possibly started. Although 
>>> there is some consistency with the previous point, I think that it 
>>> totally breaks the state automaton where now a command can start 
>>> while the whole transaction is in failing state anyway. There was no 
>>> point in starting it in the first place.
>>> 
>>> So, for me, the FAILURE state should record/count the failure, then 
>>> skip
>>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
>>> This is much clearer that way.
>>> 
>>> Then RETRY should reinstate the global state and proceed to start the 
>>> *first* command again.
>>> <...>
>>> 
>>> It is unclear to me why backslash command errors are turned to 
>>> FAILURE
>>> instead of ABORTED: there is no way they are going to be retried, so
>>> maybe they should/could skip directly to ABORTED?
> 
>> So do you propose to execute the command "ROLLBACK" without 
>> calculating its latency etc. if we are in a failed transaction and 
>> clear the conditional stack after each failure?
> 
>> Also just to be clear: do you want to have the state CSTATE_ABORTED 
>> for client abortion and another state for interrupting the current 
>> transaction?
> 
> I do not understand what "interrupting the current transaction" means.
> A transaction is either committed or rolled back, I do not know about
> "interrupted".

I mean that IIUC the server usually only reports the error and you must 
manually send the command "END" or "ROLLBACK" to rollback a failed 
transaction.
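
A small libpq sketch (hand-written, not patch code) of that behavior: 
after an error inside an explicit transaction block the connection stays 
in the failed-transaction state until the client sends ROLLBACK (or END):

#include <stdio.h>
#include "libpq-fe.h"

static void
demo_failed_transaction(PGconn *con)
{
    PQclear(PQexec(con, "BEGIN"));
    PQclear(PQexec(con, "SELECT 1/0"));     /* fails; block is now aborted */

    /* the server only reports the error, the block is still open */
    if (PQtransactionStatus(con) == PQTRANS_INERROR)
        fprintf(stderr, "transaction failed, sending ROLLBACK\n");

    PQclear(PQexec(con, "ROLLBACK"));       /* client must end the block */
}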

> When it is rolled back, probably some stats will be
> collected in passing, I'm fine with that.
> 
> If there is an error in a pgbench script, the transaction is aborted,
> which means for me that the script execution is stopped where it was,
> and either it is restarted from the beginning (retry) or counted as
> failure (not retry, just aborted, really).
> 
> If by interrupted you mean that one script begins a transaction and
> another ends it, as I said in the review I think that this strange
> case should be forbidden, so that all the code and documentation
> trying to
> manage that can be removed.

Ok!

>>> The current RETRY state does memory allocations to generate a message
>>> with buffer allocation and so on. This looks like a costly and 
>>> useless
>>> operation. If the user required "retries", then this is normal 
>>> behavior,
>>> the retries are counted and will be printed out in the final report,
>>> and there is no point in printing out every single one of them.
>>> Maybe you want that debugging, but then costly operations should be 
>>> guarded.
>> 
>> I think we need these debugging messages because, for example,
> 
> Debugging messages should cost only when under debug. When not under
> debug, there should be no debugging message, and there should be no
> cost for building and discarding such messages in the executed code
> path beyond
> testing whether the program is under debug.
> 
>> if you use the option --latency-limit, we will never know in 
>> advance whether the serialization/deadlock failure will be retried or 
>> not.
> 
> ISTM that it will be shown in the final report. If I want debug, I ask for
> --debug, otherwise I think that the command should do what it was
> asked for, i.e. run scripts, collect performance statistics and show
> them at the end.
> 
> In particular, when running with retries enabled, the user is
> expecting deadlock/serialization errors, so that they are not "errors"
> as such for
> them.
> 
>> They also help to understand which limit of retries was violated or 
>> how close we were to these limits during the execution of a specific 
>> transaction. But I agree with you that they are costly and can be 
>> skipped if the failure type is never retried. Maybe it is better to 
>> split them into multiple error function calls?..
> 
> Debugging message costs should only be incurred when under --debug,
> not otherwise.

Ok! IIUC instead of this part of the code

initPQExpBuffer(&errmsg_buf);
printfPQExpBuffer(&errmsg_buf,
                  "client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
if (max_tries)
    appendPQExpBuffer(&errmsg_buf, "/%d", max_tries);
if (latency_limit)
{
    appendPQExpBuffer(&errmsg_buf,
                      ", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
}
appendPQExpBufferStr(&errmsg_buf, ")\n");
pgbench_error(DEBUG_FAIL, "%s", errmsg_buf.data);
termPQExpBuffer(&errmsg_buf);

can we try something like this?

PGBENCH_ERROR_START(DEBUG_FAIL)
{
    PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
    if (max_tries)
        PGBENCH_ERROR("/%d", max_tries);
    if (latency_limit)
    {
        PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
    }
    PGBENCH_ERROR(")\n");
}
PGBENCH_ERROR_END();

>>> You have added 20-columns alignment prints. This looks like too much 
>>> and
>>> generates much too large lines. Probably 10 (billion) would be 
>>> enough.
>> 
>> I have already asked you about this in [2]:
> 
> Probably:-)
> 
>>> The variables for the numbers of failures and retries are of type 
>>> int64
>>> since the variable for the total number of transactions has the same
>>> type. That's why such a large alignment (as I understand it now, 20
>>> characters is enough). Do you prefer floating alignments, depending on the
>>> maximum number of failures/retries for any command in any script?
> 
> An int64 counter is not likely to reach its limit anytime soon:-) If
> the column display limit is ever reached, ISTM that then the text is
> just misaligned, which is a minor and rare inconvenience. If very wide
> columns are used, then it does not fit my terminal and the report text
> will always be wrapped around, which makes it harder to read, every
> time.

Ok!

>>> The 900 ms latency limit test is a bad idea because it takes a lot 
>>> of time. I did such tests before and they were removed by Tom Lane
>>> because of determinism and time issues. I would comment this test out 
>>> for now.
>> 
>> Ok! If it doesn't bother you - can you tell me more about the causes of 
>> these determinism issues?.. Tests for some other failures that cannot 
>> be retried are already added to 001_pgbench_with_server.pl.
> 
> Some farm animals are very slow, so you cannot really assume much
> about time one way or another.

Thanks!

>>> I do not understand why there is so much text about in failed sql 
>>> transaction stuff, while we are mainly interested in serialization & 
>>> deadlock errors, and this only falls in some "other" category. There 
>>> seem to be more details about other errors than about deadlocks & 
>>> serializable errors.
>>> 
>>> The reporting should focus on what is of interest, either all errors, 
>>> or some detailed split of these errors.
>>> 
>>> <...>
>>> 
>>> * "errors_in_failed_tx" is some subcounter of "errors", for a special
>>> case. Why it is there fails me [I finally understood, and I think it
>>> should be removed, see end of review]. If we wanted to distinguish,
>>> then we should distinguish homogeneously: maybe just count the
>>> different error types, eg have things like "deadlock_errors",
>>> "serializable_errors", "other_errors", "internal_pgbench_errors" 
>>> which
>>> would be orthogonal one to the other, and "errors" could be 
>>> recomputed
>>> from these.
>> 
>> Thank you, I agree with you. Unfortunately each new error type adds 1 or 
>> 2 new columns of maximum width 20 to the per-statement report
> 
> The fact that some data are collected does not mean that they should
> all be reported in detail. We can have detailed error counts and report
> the sum of these errors for instance, or have some more
> verbose/detailed reports
> as options (eg --latencies does just that).

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Jul-11, Marina Polyakova wrote:

> can we try something like this?
> 
> PGBENCH_ERROR_START(DEBUG_FAIL)
> {
>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
>                   st->id, st->retries + 1);
>     if (max_tries)
>         PGBENCH_ERROR("/%d", max_tries);
>     if (latency_limit)
>     {
>         PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
>                       getLatencyUsed(st, &now));
>     }
>     PGBENCH_ERROR(")\n");
> }
> PGBENCH_ERROR_END();

I didn't quite understand what these PGBENCH_ERROR() functions/macros
are supposed to do.  Care to explain?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
Just a quick skim while refreshing what were those error reporting API
changes about ...

On 2018-May-21, Marina Polyakova wrote:

> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
> - a patch for the RandomState structure (this is used to reset a client's
> random seed during the repeating of transactions after
> serialization/deadlock failures).

LGTM, though I'd rename the random_state struct members so that it
wouldn't look as confusing.  Maybe that's just me.

> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
> - a patch for the Variables structure (this is used to reset client
> variables during the repeating of transactions after serialization/deadlock
> failures).

Please don't allocate Variable structs one by one.  First time allocate
some decent number (say 8) and then enlarge by doubling the size.  That
way you save realloc overhead.  We use this technique everywhere else,
no reason to do differently here.  Other than that, LGTM.
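
A minimal sketch of that allocation strategy, with hypothetical names 
(vars / nvars / max_vars) that are not necessarily the patch's:

#include <stdlib.h>

typedef struct Variable { char *name; char *svalue; } Variable;

typedef struct Variables
{
    Variable   *vars;       /* array of variables */
    int         nvars;      /* number of slots in use */
    int         max_vars;   /* allocated size of the array */
} Variables;

/* Grow the array geometrically instead of realloc'ing on every insert. */
static Variable *
new_variable_slot(Variables *v)
{
    if (v->nvars >= v->max_vars)
    {
        v->max_vars = (v->max_vars == 0) ? 8 : v->max_vars * 2;
        v->vars = realloc(v->vars, v->max_vars * sizeof(Variable));
        if (v->vars == NULL)
            abort();        /* real code would report the OOM properly */
    }
    return &v->vars[v->nvars++];
}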

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:

> can we try something like this?
>
> PGBENCH_ERROR_START(DEBUG_FAIL)
> {
>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",

Argh, no? I was thinking of something much more trivial:

    pgbench_error(DEBUG, "message format %d %s...", 12, "hello world");

If you really need some complex dynamic buffer, and I would prefer 
that you avoid that, then the fallback is:

    if (level >= DEBUG)
    {
       initPQstuff(&msg);
       ...
       pgbench_error(DEBUG, "fixed message... %s\n", msg);
       freePQstuff(&msg);
    }

The point is to avoid building the message with dynamic allocation and so
on if in the end it is not used.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 20:49, Alvaro Herrera wrote:
> On 2018-Jul-11, Marina Polyakova wrote:
> 
>> can we try something like this?
>> 
>> PGBENCH_ERROR_START(DEBUG_FAIL)
>> {
>>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
>>                   st->id, st->retries + 1);
>>     if (max_tries)
>>         PGBENCH_ERROR("/%d", max_tries);
>>     if (latency_limit)
>>     {
>>         PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
>>                       getLatencyUsed(st, &now));
>>     }
>>     PGBENCH_ERROR(")\n");
>> }
>> PGBENCH_ERROR_END();
> 
> I didn't quite understand what these PGBENCH_ERROR() functions/macros
> are supposed to do.  Care to explain?

It is used only to print a string with the given arguments to stderr. 
Probably it could just be the function pgbench_error and not a macro..

P.S. This is my mistake, I did not take into account that PGBENCH_ERROR_END 
does not know the elevel, so it cannot call exit(1) if the elevel >= ERROR.
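
For what it's worth, one way such macros could be spelled (this is a 
hypothetical sketch, not the submitted code), which also shows the 
problem just mentioned: the END macro has no elevel in scope, so it 
cannot decide whether to call exit(1):

#include <stdio.h>

enum { DEBUG_FAIL, LOG_LEVEL, ERROR_LEVEL };    /* made-up level names */
static int log_min_messages = DEBUG_FAIL;       /* assumed level variable */

#define PGBENCH_ERROR_START(elevel)  if ((elevel) >= log_min_messages)
#define PGBENCH_ERROR(...)           fprintf(stderr, __VA_ARGS__)
#define PGBENCH_ERROR_END()          ((void) 0)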

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 21:04, Alvaro Herrera wrote:
> Just a quick skim while refreshing what were those error reporting API
> changes about ...

Thank you!

> On 2018-May-21, Marina Polyakova wrote:
> 
>> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's
>> random seed during the repeating of transactions after
>> serialization/deadlock failures).
> 
> LGTM, though I'd rename the random_state struct members so that it
> wouldn't look as confusing.  Maybe that's just me.

IIUC, do you like "xseed" instead of "data"?

  typedef struct RandomState
  {
-    unsigned short data[3];
+    unsigned short xseed[3];
  } RandomState;

Or do you want to rename "random_state" in the structures RetryState / 
CState / TState? Thanks to Fabien Coelho's comments in [1], TState can 
contain several RandomStates for different purposes, something like 
this:

/*
  * Thread state
  */
typedef struct
{
...
    /*
     * Separate randomness for each thread. Each thread option uses its own
     * random state to make all of them independent of each other and
     * therefore deterministic at the thread level.
     */
    RandomState choose_script_rs;    /* random state for selecting a script */
    RandomState throttling_rs;    /* random state for transaction throttling */
    RandomState sampling_rs;    /* random state for log sampling */
...
} TState;

>> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
>> - a patch for the Variables structure (this is used to reset client
>> variables during the repeating of transactions after 
>> serialization/deadlock
>> failures).
> 
> Please don't allocate Variable structs one by one.  First time allocate
> some decent number (say 8) and then enlarge by duplicating size.  That
> way you save realloc overhead.  We use this technique everywhere else,
> no reason do different here.  Other than that, LGTM.

Ok!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806090810090.5307%40lancre

> While reading your patch, it occurs to me that a run is not 
> deterministic
> at the thread level under throttling and sampling, because the random
> state is solicited differently depending on when a transaction ends. 
> This
> suggests that maybe each thread random_state use should have its own 
> random
> state.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 22:34, Fabien COELHO wrote:
>> can we try something like this?
>> 
>> PGBENCH_ERROR_START(DEBUG_FAIL)
>> {
>>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
> 
> Argh, no? I was thinking of something much more trivial:
> 
>    pgbench_error(DEBUG, "message format %d %s...", 12, "hello world");
> 
> If you really need some complex dynamic buffer, and I would prefer
> that you avoid that, then the fallback is:
> 
>    if (level >= DEBUG)
>    {
>       initPQstuff(&msg);
>       ...
>       pgbench_error(DEBUG, "fixed message... %s\n", msg);
>       freePQstuff(&msg);
>    }
> 
> The point is to avoid building the message with dynamic allocation and 
> so
> on if in the end it is not used.

Ok! About avoidance - I'm afraid there's one more piece of debugging 
code with the same problem:

else if (command->type == META_COMMAND)
{
...
    initPQExpBuffer(&errmsg_buf);
    printfPQExpBuffer(&errmsg_buf, "client %d executing \\%s",
                      st->id, argv[0]);
    for (i = 1; i < argc; i++)
        appendPQExpBuffer(&errmsg_buf, " %s", argv[i]);
    appendPQExpBufferChar(&errmsg_buf, '\n');
    ereport(ELEVEL_DEBUG, (errmsg("%s", errmsg_buf.data)));
    termPQExpBuffer(&errmsg_buf);

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>> The point is to avoid building the message with dynamic allocation and so
>> on if in the end it is not used.
>
> Ok! About avoidance - I'm afraid there's one more piece of debugging code 
> with the same problem:

Indeed. I'd like to avoid all instances, so that PQExpBufferData is not 
needed anywhere, if possible. If not possible, then too bad, but I'd 
prefer to make do with formatted prints only, for simplicity.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello, hackers!

Here is the tenth version of the patch for error handling and 
retrying of transactions with serialization/deadlock failures in pgbench 
(based on the commit e0ee93053998b159e395deed7c42e02b1f921552) thanks to 
the comments of Fabien Coelho and Alvaro Herrera in this thread.

v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).

v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch
- a patch for a separate error reporting function (this is used to 
report client failures that do not cause an abort and this depends on 
the level of debugging).

v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

As Fabien wrote in [5], some of the new tests were too slow. Earlier on 
my laptop they increased the testing time of pgbench from 5.5 seconds to 
12.5 seconds. In the new version the testing time of pgbench takes about 
7 seconds. These tests include one test for serialization failure and 
retry, as well as one test for deadlock failure and retry. Both of them 
are in file 001_pgbench_with_server.pl, each test uses only one pgbench 
run, and they use PL/pgSQL scripts instead of a parallel psql session.

Any suggestions are welcome!

All that was fixed from the previous version:

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806090810090.5307%40lancre

> ISTM that the struct itself does not need a name, ie. "typedef struct {
> ... } RandomState" is enough.

> There could be clear comments, say in the TState and CState structs, 
> about
> what randomness is impacted (i.e. script choices, etc.).

> getZipfianRand, computeHarmonicZipfian: The "thread" parameter was
> justified because it was used for two fields. As the random state is
> separated, I'd suggest that the other argument should be a zipfcache
> pointer.

> While reading your patch, it occurs to me that a run is not 
> deterministic
> at the thread level under throttling and sampling, because the random
> state is solicited differently depending on when a transaction ends. 
> This
> suggests that maybe each thread random_state use should have its own 
> random
> state.

[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806091514060.3655%40lancre

> The structure typedef does not need a name. "typedef struct { } V...".

> I tend to disagree with naming things after their type, eg "array". I'd
> suggest "vars" instead. "nvariables" could be "nvars" for consistency 
> with
> that and "vars_sorted", and because "foo.variables->nvariables" starts
> looking heavy.

> I'd suggest but "Variables" type declaration just after "Variable" type
> declaration in the file.

[3] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806100837380.3655%40lancre

> The semantics of the existing code is changed: the FATAL level calls
> abort() and replaces existing exit(1) calls. Maybe you want an ERROR
> level as well.

> I do not understand why names are changed, eg ELEVEL_FATAL instead of
> FATAL. ISTM that part of the point of the move would be to be 
> homogeneous,
> which suggests that the same names should be reused.

[4] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807081014260.17811%40lancre

> I'd suggest to have just one clean and simple pgbench internal function 
> to
> handle errors and possibly exit, debug... Something like
> 
>    void pgb_error(FATAL, "error %d raised", 12);
> 
> Implemented as
> 
>    void pgb_error(int/enum XXX level, const char * format, ...)
>    {
>       test level and maybe return immediately (eg debug);
>       print to stderr;
>       exit/abort/return depending;
>    }

[5] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807091451520.17811%40lancre

> Levels ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
> In particular, the "CLIENT" part is not very useful. If the
> distinction makes sense, I would have kept "LOG" for the initial one 
> and
> add other ones for ABORT and PGBENCH, maybe.

> * There are no comments about "retries" in StatData, CState and Command
> structures.

> * Also, for StatData, I would like to understand the logic between cnt,
> skipped, retries, retried, errors, ... so clear information about the
> expected invariant if any would be welcome. One has to go in the code 
> to
> understand how these fields relate one to the other.

> * "errors_in_failed_tx" is some subcounter of "errors", for a special
> case. Why it is there fails me [I finally understood, and I think it
> should be removed, see end of review]. If we wanted to distinguish, 
> then
> we should distinguish homogeneously: maybe just count the different 
> error
> types, eg have things like "deadlock_errors", "serializable_errors",
> "other_errors", "internal_pgbench_errors" which would be orthogonal one 
> to
> the other, and "errors" could be recomputed from these.

> * How "errors" differs from "ecnt" is unclear to me.

> * FailureStatus states are not homogeneously named. I'd suggest to use
> *_FAILURE for all cases. The miscellaneous case should probably be the
> last.

> * I do not understand the comments on CState enum: "First, remember the 
> failure
> in CSTATE_FAILURE. Then process other commands of the failed 
> transaction if any"
> Why would other commands be processed at all if the transaction is 
> aborted?
> For me any error must lead to the rollback and possible retry of the
> transaction.
> ...
> So, for me, the FAILURE state should record/count the failure, then 
> skip
> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
> This is much clearer that way.
> 
> Then RETRY should reinstate the global state and proceed to start the 
> *first*
> command again.

> * commandFailed: I think that it should be kept much simpler. In
> particular, having errors on errors does not help much: on 
> ELEVEL_FATAL,
> it ignores the actual reported error and generates another error of the
> same level, so that the initial issue is hidden. Even if these are 
> can't
> happen cases, hiding the origin if it occurs looks unhelpful. Just 
> print
> it directly, and maybe abort if you think that it is a can't happen 
> case.

> * copyRandomState: just use sizeof(RandomState) instead of making 
> assumptions
> about the contents of the struct. Also, this function looks pretty 
> useless,
> why not just do a plain assignment?

> * copyVariables: lacks comments to explain that the destination is 
> cleaned up
> and so on. The cleanup phase could probably be in a distinct function, 
> so that
> the code would be clearer. Maybe the function variable names are too 
> long.
> 
>    if (current_source->svalue)
> 
> in the context of a guard for a strdup, maybe:
> 
>    if (current_source->svalue != NULL)

> * executeCondition: this hides client automaton state changes which 
> were
> clearly visible beforehand in the switch, and the different handling of
> if & elif is also hidden.
> 
> I'm against this unnecessary restructuring and against hiding such 
> information,
> all state changes should be clearly seen in the state switch so that it 
> is
> easier to understand and follow.
> 
> I do not see why touching the conditional stack on internal errors
> (evaluateExpr failure) brings anything, the whole transaction will be 
> aborted
> anyway.

> The current RETRY state does memory allocations to generate a message
> with buffer allocation and so on. This looks like a costly and useless
> operation. If the user required "retries", then this is normal
> behavior,
> the retries are counted and will be printed out in the final report,
> and there is no point in printing out every single one of them.
> Maybe you want that debugging, but then costly operations should be 
> guarded.

> The number of transactions above the latency limit report can be 
> simplified.
> Remove the if and just use one printf with a %s for the optional 
> comment.
> I'm not sure this optional comment is useful there.

> Before the patch, ISTM that all lines relied on one printf. You have
> changed to a style where a collection of printf is used to compose a 
> line.
> I'd suggest to keep to the previous one-printf-prints-one-line style,
> where possible.

> You have added 20-columns alignment prints. This looks like too much 
> and
> generates much too large lines. Probably 10 (billion) would be enough.
> 
> Some people try to parse the output, so it should be deterministic. I'd 
> add
> the needed columns always if appropriate (i.e. under retry), even if 
> none
> occurred.

> * processXactStats: An else is replaced by detailed stats, with the 
> initial
> "no detailed stats" comment kept. The function is called both in the 
> then
> & else branch. The structure does not make sense anymore. I'm not sure
> this change was needed.

> * getLatencyUsed: declared "double" so "return 0.0".

> * typo: ruin -> run; probably others, I did not check for them in 
> detail.

> On my laptop, tests last 5.5 seconds before the patch, and about 13 
> seconds
> after. This is much too large. Pgbench TAP tests do not deserve to take 
> over
> twice as much time as before just on this patch.
> 
> One reason which explains this large time is that there is a new script with 
> a
> newly created instance. I'd suggest to append tests to the existing 2
> scripts, depending on whether they need a running instance or not.
> 
> Secondly, I think that the design of the tests is too heavy. For such 
> a
> feature, ISTM enough to check that it works, i.e. one test for 
> deadlocks
> (trigger one or a few deadlocks), idem for serializable, maybe idem for
> other errors if any.
> 
> The challenge is to do that reliably and efficiently, i.e. so that the 
> test does
> not rely on chance and is still quite efficient.
> 
> The trick you use is to run an interactive psql in parallel to pgbench 
> so as to
> play with concurrent locks. That is interesting, but deserves more 
> comments
> and explanation, eg before the test functions.
> 
> Maybe this could be achieved within pgbench by using some wait stuff in
> PL/pgSQL so that concurrent clients can wait for one another based on data 
> in
> an unlogged table updated by a CALL within "embedded" transactions? Not
> sure. ...
> 
> Anyway, TAP tests should be much lighter (in total time), and if 
> possible
> much simpler.
> 
> The 900 ms latency limit test is a bad idea because it takes a lot of 
> time.
> I did such tests before and they were removed by Tom Lane because of 
> determinism
> and time issues. I would comment this test out for now.

> Documentation
> ...
> Having the "most important settings" on line 1-6 and 8 (i.e. skipping 
> 7) looks
> silly. The important ones should simply be the first ones, and the 8th 
> is not
> that important, or it is in 7th position.
> 
> I do not understand why there is so much text about in failed sql 
> transaction
> stuff, while we are mainly interested in serialization & deadlock 
> errors, and
> this only falls in some "other" category. There seems to be more 
> details about
> other errors that about deadlocks & serializable errors.
> 
> The reporting should focus on what is of interest, either all errors, 
> or some
> detailed split of these errors. The documentation should state clearly 
> what
> are the counted errors, and then what are their effects on the reported 
> stats.
> The "Errors and Serialization/Deadlock Retries" section is a good start 
> in that
> direction, but it does not talk about pgbench internal errors (eg 
> "cos(true)").
> I think it should be more explicit about errors.
> 
> Option --max-tries default value should be spelled out in the doc.

[6] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807111435250.27883%40lancre

> So if absolutely necessary, a new option is still better than changing
> --debug syntax. If not necessary, then it is better:-)

> The fact that some data are collected does not mean that they should 
> all
> be reported in detail. We can have detailed error counts and report the 
> sum
> of these errors for instance, or have some more verbose/detailed reports
> as options (eg --latencies does just that).

[7] 
https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7%40alvherre.pgsql

> LGTM, though I'd rename the random_state struct members so that it
> wouldn't look as confusing.  Maybe that's just me.

> Please don't allocate Variable structs one by one.  First time allocate
> some decent number (say 8) and then enlarge by doubling the size.  That
> way you save realloc overhead.  We use this technique everywhere else,
> no reason to do differently here.  Other than that, LGTM.

[8] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807112124210.27883%40lancre

> If you really need some complex dynamic buffer, and I would prefer
> that you avoid that, then the fallback is:
> 
>     if (level >= DEBUG)
>     {
>        initPQstuff(&msg);
>        ...
>        pgbench_error(DEBUG, "fixed message... %s\n", msg);
>        freePQstuff(&msg);
>     }
> 
> The point is to avoid building the message with dynamic allocation and 
> so
> on if in the end it is not used.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> - a patch for the RandomState structure (this is used to reset a client's 
> random seed during the repeating of transactions after serialization/deadlock 
> failures).

About this v10 part 1:

Patch applies cleanly, compile, global & local make check both ok.

The random state is cleanly separated so that it will be easy to reset it 
on client error handling. ISTM that the pgbench side is deterministic with
the separation of the seeds for different uses.

Code is clean, comments are clear.

I'm wondering what is the rationale for the "xseed" field name? In 
particular, what does the "x" stand for?

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 07-08-2018 19:21, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

>> v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's random seed during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> About this v10 part 1:
> 
> Patch applies cleanly, compile, global & local make check both ok.
> 
> The random state is cleanly separated so that it will be easy to reset
> it on client error handling. ISTM that the pgbench side is
> deterministic with
> the separation of the seeds for different uses.
> 
> Code is clean, comments are clear.

:-)

> I'm wondering what is the rationale for the "xseed" field name? In
> particular, what does the "x" stand for?

I called it "...seed" instead of "data" because perhaps the "data" is 
too general a name for use here (but I'm not entirely sure what Alvaro 
Herrera meant in [1], see my answer in [2]). I called it "xseed" to 
combine it with the arguments of the functions _dorand48 / pg_erand48 / 
pg_jrand48 in the file erand48.c. IIUC they use a linear congruential 
generator and perhaps "xseed" means the sequence with the name X of 
pseudorandom values of size 48 bits (X_0, X_1, ... X_n) where X_0 is the 
seed / the start value.

[1] 
https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7@alvherre.pgsql

> LGTM, though I'd rename the random_state struct members so that it
> wouldn't look as confusing.  Maybe that's just me.

[2] 
https://www.postgresql.org/message-id/cb2cde10e4e7a10a38b48e9cae8fbd28%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch
> - a patch for a separate error reporting function (this is used to report 
> client failures that do not cause an abort and this depends on the level of 
> debugging).

Patch applies cleanly, compiles, global & local make check ok.

This patch improves/homogenizes logging & error reporting in pgbench, in 
preparation for another patch which will manage transaction restarts in 
some cases.

However ISTM that it is not as necessary as the previous one, i.e. we 
could do without it to get the desired feature, so I see it more as a 
refactoring done "in passing", and I'm wondering whether it is 
really worth it because it adds some new complexity, so I'm not sure of 
the net benefit.

Anyway, I still have quite a few comments/suggestions on this version.

* ErrorLevel

If ErrorLevel is used for things which are not errors, should its name not 
include "Error"? Maybe "LogLevel"?

I'm at odds with the proposed levels. ISTM that pgbench internal errors 
which warrant an immediate exit should be dubbed "FATAL", which would 
leave the "ERROR" name for... errors, eg SQL errors. I'd suggest to use an 
INFO level for the PGBENCH_DEBUG function, and to keep LOG for main 
program messages, so that all use cases are separate. Or, maybe the 
distinction between LOG/INFO is unclear so info is not necessary.

I'm unsure about the "log_min_messages" variable name, I'd suggest 
"log_level".

I do not see the asserts on LOG >= log_min_messages as useful, because the 
level can only be LOG or DEBUG anyway.

This point also suggests that maybe "pgbench_error" is misnamed as well 
(ok, I know I suggested it in place of ereport, but e stands for error 
there), as it is called on errors, but also on other things. Maybe 
"pgbench_log"? Or just simply "log" or "report", as it is really an local 
function, which does not need a prefix? That would mean that 
"pgbench_simple_error", which is indeed called on errors, could keep its 
initial name "pgbench_error", and be called on errors.

Alternatively, the debug/logging code could be left as it is (i.e. direct 
print to stderr) and the function only called when there is some kind of 
error, in which case it could be named with "error" in its name (or 
elog/ereport...).

* PQExpBuffer

I still do not see a positive value from importing PQExpBuffer complexity 
and cost into pgbench, as the resulting code is not very readable and it 
adds malloc/free cycles, so I'd try to avoid using PQExpBuf as much as 
possible. ISTM that all usages could be avoided in the patch, and most 
should be avoided even if ExpBuffer is imported because it is really 
useful somewhere.

- to call pgbench_error from pgbench_simple_error, you can do a 
pgbench_log_va(level, format, va_list) version called both from 
pgbench_error & pgbench_simple_error (a minimal sketch follows below).

- for the PGBENCH_DEBUG function, do separate calls per type; the 
very small partial code duplication is worth it to avoid ExpBuf IMO.

- for doCustom debug: I'd just leave the printf as it is, with a comment, as 
it is really very internal stuff for debug. Or I'd just snprintf 
something into a static buffer.

- for syntax_error: it should terminate, so it should call
pgbench_error(FATAL, ...). Idem, I'd either keep the printf then call
pgbench_error(FATAL, "syntax error found\n") for a final message,
or snprintf in a static buffer.

- for listAvailableScript: I'd simply call "pgbench_error(LOG" several 
times, once per line.

I see building a string with a format (printfExpBuf..) and then calling 
the pgbench_error function with just a "%s" format on the result as not 
very elegant, because the second format is somehow hacked around.
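
Here is a minimal sketch of the pgbench_log_va idea mentioned above (the
level names, the "log_level" variable and the exact behaviour are
assumptions for illustration only, not the actual patch):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum LogLevel
{
    LOG_DEBUG,                  /* detailed internal messages */
    LOG_MAIN,                   /* main program messages */
    LOG_FATAL                   /* internal errors: report and exit */
} LogLevel;

static LogLevel log_level = LOG_MAIN;

/* single worker shared by all reporting entry points */
static void
pgbench_log_va(LogLevel level, const char *fmt, va_list args)
{
    if (level >= log_level)
        vfprintf(stderr, fmt, args);
    if (level == LOG_FATAL)
        exit(1);
}

/* varargs wrapper; pgbench_simple_error would forward the same way */
static void
pgbench_error(LogLevel level, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    pgbench_log_va(level, fmt, args);
    va_end(args);
}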

* bool client

I'm unconvinced by this added boolean just to switch the level on 
encountered errors.

I'd suggest to let lookupCreateVariable, putVariable* as they are, call 
pgbench_error with a level which does not stop the execution, and abort if 
necessary from the callers with an "aborted because of putVariable/eval/... 
error" message, as it was done before.

pgbench_error calls pgbench_error. Hmmm, why not.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 09-08-2018 12:28, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch
>> - a patch for a separate error reporting function (this is used to 
>> report client failures that do not cause an abort, and this depends on 
>> the level of debugging).
> 
> Patch applies cleanly, compiles, global & local make check ok.

:-)

> This patch improves/homogenizes logging & error reporting in pgbench,
> in preparation for another patch which will manage transaction
> restarts in some cases.
> 
> However ISTM that it is not as necessary as the previous one, i.e. we
> could do without it to get the desired feature, so I see it more as a
> refactoring done "in passing", and I'm wondering whether it is really
> worth it because it adds some new complexity, so I'm not sure of the
> net benefit.

We discussed this starting with [1]:

>>>> IMO this patch is more controversial than the other ones.
>>>> 
>>>> It is not really related to the aim of the patch series, which could
>>>> do without, couldn't it?
>>> 
>>>> I'd suggest that it should be an independent submission, unrelated 
>>>> to
>>>> the pgbench error management patch.
>>> 
>>> I suppose that this is related; because of my patch there may be a 
>>> lot
>>> of such code (see v7 in [1]):
>>> 
>>> -            fprintf(stderr,
>>> -                    "malformed variable \"%s\" value: \"%s\"\n",
>>> -                    var->name, var->svalue);
>>> +            if (debug_level >= DEBUG_FAILS)
>>> +            {
>>> +                fprintf(stderr,
>>> +                        "malformed variable \"%s\" value: \"%s\"\n",
>>> +                        var->name, var->svalue);
>>> +            }
>>> 
>>> -        if (debug)
>>> +        if (debug_level >= DEBUG_ALL)
>>>              fprintf(stderr, "client %d sending %s\n", st->id, sql);
>> 
>> I'm not sure that debug messages need to be kept after debug, if it is
>> about debugging pgbench itself. That is debatable.
> 
> AFAICS it is not about debugging pgbench itself, but about more 
> detailed
> information that can be used to understand what exactly happened during
> its launch. In the case of errors this helps to distinguish between
> failures or errors by type (including which limit for retries was
> violated and how far it was exceeded for the serialization/deadlock
> errors).
> 
>>> That's why it was suggested to make the error function which hides 
>>> all
>>> these things (see [2]):
>>> 
>>> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
>>> corresponding fprintf(stderr..) I think it's time to do it like in 
>>> the
>>> main code, wrap with some function like log(level, msg).
>> 
>> Yep. I did not write that, but I agree with an "elog" suggestion to 
>> switch
>> 
>>    if (...) { fprintf(...); exit/abort/continue/... }
>> 
>> to a simpler:
>> 
>>    elog(level, ...)

> Anyway, I still have quite a few comments/suggestions on this version.

Thank you very much for them!

> * ErrorLevel
> 
> If ErrorLevel is used for things which are not errors, its name should
> not include "Error"? Maybe "LogLevel"?

On the one hand, this sounds better to me too. On the other hand, won't 
this conflict in some way with the error level codes in elog.h?..

/* Error level codes */
#define DEBUG5        10            /* Debugging messages, in categories of
                                 * decreasing detail. */
#define DEBUG4        11
...

> I'm at odds with the proposed levels. ISTM that pgbench internal
> errors which warrant an immediate exit should be dubbed "FATAL",

Ok!

> which
> would leave the "ERROR" name for... errors, eg SQL errors. I'd suggest
> to use an INFO level for the PGBENCH_DEBUG function, and to keep LOG
> for main program messages, so that all use case are separate. Or,
> maybe the distinction between LOG/INFO is unclear so info is not
> necessary.

The messages of the errors in SQL and meta commands are printed only if 
the option --debug-fails is used, so I'm not sure that they should have a 
higher error level than main program messages (ERROR vs LOG). About an 
INFO level for the PGBENCH_DEBUG function - ISTM that some main program 
messages such as "dropping old tables...\n" or "... tuples (%d%%) done 
(elapsed %.2f s, remaining %.2f s)\n" can also use it.. About keeping all 
use cases separate - in the current version the level LOG also includes 
messages about abortions of the clients.

> I'm unsure about the "log_min_messages" variable name, I'd suggest 
> "log_level".
> 
> I do not see the asserts on LOG >= log_min_messages as useful, because
> the level can only be LOG or DEBUG anyway.

Ok!

> This point also suggest that maybe "pgbench_error" is misnamed as well
> (ok, I know I suggested it in place of ereport, but e stands for error
> there), as it is called on errors, but is also on other things. Maybe
> "pgbench_log"? Or just simply "log" or "report", as it is really an
> local function, which does not need a prefix? That would mean that
> "pgbench_simple_error", which is indeed called on errors, could keep
> its initial name "pgbench_error", and be called on errors.

About the name "log" - we already have the function doLog, so perhaps 
the name "report" will be better.. But like with ErrorLevel will not 
this be in some kind of conflict with ereport which is also used for the 
levels DEBUG... / LOG / INFO?

> Alternatively, the debug/logging code could be let as it is (i.e.
> direct print to stderr) and the function only called when there is
> some kind of error, in which case it could be named with "error" in
> its name (or elog/ereport...).

As I wrote in [2]:

> because of my patch there may be a lot
> of such code (see v7 in [1]):
> 
> -            fprintf(stderr,
> -                    "malformed variable \"%s\" value: \"%s\"\n",
> -                    var->name, var->svalue);
> +            if (debug_level >= DEBUG_FAILS)
> +            {
> +                fprintf(stderr,
> +                        "malformed variable \"%s\" value: \"%s\"\n",
> +                        var->name, var->svalue);
> +            }
> 
> -        if (debug)
> +        if (debug_level >= DEBUG_ALL)
>              fprintf(stderr, "client %d sending %s\n", st->id, sql);
> 
> That's why it was suggested to make the error function which hides all
> these things (see [2]):
> 
> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

And IIUC macros will not help in the absence of __VA_ARGS__.
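
For instance, with C99 variadic macros the wrapping could have been as
simple as (purely illustrative; "log_level" is an assumed variable):

#define pgbench_log(level, ...) \
    do { \
        if ((level) >= log_level) \
            fprintf(stderr, __VA_ARGS__); \
    } while (0)

but without __VA_ARGS__ the wrapper has to be a real variadic function
that forwards a va_list.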

> * PQExpBuffer
> 
> I still do not see a positive value from importing PQExpBuffer
> complexity and cost into pgbench, as the resulting code is not very
> readable and it adds malloc/free cycles, so I'd try to avoid using
> PQExpBuf as much as possible. ISTM that all usages could be avoided in
> the patch, and most should be avoided even if ExpBuffer is imported
> because it is really useful somewhere.
> 
> - to call pgbench_error from pgbench_simple_error, you can do a
> pgbench_log_va(level, format, va_list) version called both from
> pgbench_error & pgbench_simple_error.
> 
> - for PGBENCH_DEBUG function, do separate calls per type, the very
> small partial code duplication is worth avoiding ExpBuf IMO.
> 
> - for doCustom debug: I'd just let the printf as it is, with a
> comment, as it is really very internal stuff for debug. Or I'd just
> snprintf a something in a static buffer.
> 
> - for syntax_error: it should terminate, so it should call
> pgbench_error(FATAL, ...). Idem, I'd either keep the printf then call
> pgbench_error(FATAL, "syntax error found\n") for a final message,
> or snprintf in a static buffer.
> 
> - for listAvailableScript: I'd simply call "pgbench_error(LOG" several
> time, once per line.
> 
> I see building a string with a format (printfExpBuf..) and then
> calling the pgbench_error function with just a "%s" format on the
> result as not very elegant, because the second format is somehow
> hacked around.

Ok! About using a static buffer in doCustom debug or in syntax_error - 
I'm not sure that this is always possible because ISTM that the variable 
name can be quite long.

> * bool client
> 
> I'm unconvince by this added boolean just to switch the level on
> encountered errors.
> 
> I'd suggest to let lookupCreateVariable, putVariable* as they are,
> call pgbench_error with a level which does not stop the execution, and
> abort if necessary from the callers with a "aborted because of
> putVariable/eval/... error" message, as it was done before.

There's one more problem: if this is a client failure, an error message 
inside any of these functions should be printed at the level 
DEBUG_FAILS; otherwise it should be printed at the level LOG. Or do you 
suggest using the error level as an argument for these functions?

> pgbench_error calls pgbench_error. Hmmm, why not.

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806100837380.3655%40lancre
[2] 
https://www.postgresql.org/message-id/b692de21caaed13c59f31c06d0098488%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

>> I'd suggest to let lookupCreateVariable, putVariable* as they are,
>> call pgbench_error with a level which does not stop the execution, and
>> abort if necessary from the callers with a "aborted because of
>> putVariable/eval/... error" message, as it was done before.
>
> There's one more problem: if this is a client failure, an error message 
> inside any of these functions should be printed at the level DEBUG_FAILS; 
> otherwise it should be printed at the level LOG. Or do you suggest using the 
> error level as an argument for these functions?

No. I suggest that the called function does only one simple thing, 
probably "DEBUG", and that the *caller* prints a message if it is unhappy 
about the failure of the called function, as it is currently done. This 
allows to provide context as well from the caller, eg "setting variable %s 
failed while <some specific context>". The user can rerun under debug for 
precision if they need it.

I'm still not over enthusiastic with these changes, and still think that 
it should be an independent patch, not submitted together with the "retry 
on error" feature.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 10-08-2018 11:33, Fabien COELHO wrote:
> Hello Marina,
> 
>>> I'd suggest to let lookupCreateVariable, putVariable* as they are,
>>> call pgbench_error with a level which does not stop the execution, 
>>> and
>>> abort if necessary from the callers with a "aborted because of
>>> putVariable/eval/... error" message, as it was done before.
>> 
>> There's one more problem: if this is a client failure, an error 
>> message inside any of these functions should be printed at the level 
>> DEBUG_FAILS; otherwise it should be printed at the level LOG. Or do 
>> you suggest using the error level as an argument for these functions?
> 
> No. I suggest that the called function does only one simple thing,
> probably "DEBUG", and that the *caller* prints a message if it is
> unhappy about the failure of the called function, as it is currently
> done. This allows to provide context as well from the caller, eg
> "setting variable %s failed while <some specific context>". The user
> call rerun under debug for precision if they need it.

Ok!

> I'm still not over enthousiastic with these changes, and still think
> that it should be an independent patch, not submitted together with
> the "retry on error" feature.

In the next version I will put the error patch last, so it will be 
possible to compare the "retry on error" feature with and without it, 
and let the committer decide which is better)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Arthur Zakirov
Дата:
On Thu, Aug 09, 2018 at 06:17:22PM +0300, Marina Polyakova wrote:
> > * ErrorLevel
> > 
> > If ErrorLevel is used for things which are not errors, its name should
> > not include "Error"? Maybe "LogLevel"?
> 
> On the one hand, this sounds better for me too. On the other hand, will not
> this be in some kind of conflict with error level codes in elog.h?..

I think it shouldn't, because those error levels are backend levels.
pgbench is a client-side utility with its own code; it shares some code
with libpq and other utilities, but elog.h isn't one of them.

> > This point also suggest that maybe "pgbench_error" is misnamed as well
> > (ok, I know I suggested it in place of ereport, but e stands for error
> > there), as it is called on errors, but is also on other things. Maybe
> > "pgbench_log"? Or just simply "log" or "report", as it is really an
> > local function, which does not need a prefix? That would mean that
> > "pgbench_simple_error", which is indeed called on errors, could keep
> > its initial name "pgbench_error", and be called on errors.
> 
> About the name "log" - we already have the function doLog, so perhaps the
> name "report" will be better.. But like with ErrorLevel will not this be in
> some kind of conflict with ereport which is also used for the levels
> DEBUG... / LOG / INFO?

+1 from me to keep the initial name "pgbench_error". "pgbench_log" for the
new function looks nice to me. I think it is better than just "log",
because "log" may conflict with the natural logarithm function (see "man 3
log").

> > pgbench_error calls pgbench_error. Hmmm, why not.

I agree with Fabien. Calling pgbench_error() inside pgbench_error()
could be dangerous. I think "fmt" checking could be removed, or we may
use Assert() or fprintf()+exit(1) at least.
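
For illustration, a minimal sketch of the fprintf()+exit(1) variant of
that check (the level type and names are placeholders, not the patch
code):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { ELEVEL_LOG, ELEVEL_FATAL } ErrorLevel;   /* placeholders */

static void
pgbench_error(ErrorLevel level, const char *fmt, ...)
{
    va_list args;

    /* guard against an empty format without recursing into pgbench_error */
    if (fmt == NULL || *fmt == '\0')
    {
        fprintf(stderr, "pgbench_error called with an empty format\n");
        exit(1);
    }

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);

    if (level == ELEVEL_FATAL)
        exit(1);
}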

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 10-08-2018 15:53, Arthur Zakirov wrote:
> On Thu, Aug 09, 2018 at 06:17:22PM +0300, Marina Polyakova wrote:
>> > * ErrorLevel
>> >
>> > If ErrorLevel is used for things which are not errors, its name should
>> > not include "Error"? Maybe "LogLevel"?
>> 
>> On the one hand, this sounds better for me too. On the other hand, 
>> will not
>> this be in some kind of conflict with error level codes in elog.h?..
> 
> I think it shouldn't because those error levels are backends levels.
> pgbench is a client side utility with its own code, it shares some code
> with libpq and other utilities, but elog.h isn't one of them.

I agree with you on this :) I just meant that maybe it would be better 
to name this group in the same way because in general they are used for 
the same purpose?..

>> > This point also suggest that maybe "pgbench_error" is misnamed as well
>> > (ok, I know I suggested it in place of ereport, but e stands for error
>> > there), as it is called on errors, but is also on other things. Maybe
>> > "pgbench_log"? Or just simply "log" or "report", as it is really an
>> > local function, which does not need a prefix? That would mean that
>> > "pgbench_simple_error", which is indeed called on errors, could keep
>> > its initial name "pgbench_error", and be called on errors.
>> 
>> About the name "log" - we already have the function doLog, so perhaps 
>> the
>> name "report" will be better.. But like with ErrorLevel will not this 
>> be in
>> some kind of conflict with ereport which is also used for the levels
>> DEBUG... / LOG / INFO?
> 
> +1 from me to keep initial name "pgbench_error". "pgbench_log" for new
> function looks nice to me. I think it is better than just "log",
> because "log" may conflict with natural logarithmic function (see "man 
> 3
> log").

Do you think that pgbench_log (or another name that speaks only about 
logging) will look good, for example, with FATAL? Because this means 
that the logging function also processes errors and calls exit(1) if 
necessary..

>> > pgbench_error calls pgbench_error. Hmmm, why not.
> 
> I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> could be dangerous. I think "fmt" checking could be removed, or we may
> use Assert()

I would like not to use Assert in this case because IIUC they are mostly 
used for testing.

> or fprintf()+exit(1) at least.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Arthur Zakirov
Дата:
On Fri, Aug 10, 2018 at 04:46:04PM +0300, Marina Polyakova wrote:
> > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new
> > function looks nice to me. I think it is better than just "log",
> > because "log" may conflict with natural logarithmic function (see "man 3
> > log").
> 
> Do you think that pgbench_log (or another whose name speaks only about
> logging) will look good, for example, with FATAL? Because this means that
> the logging function also processes errors and calls exit(1) if necessary..

Yes, why not. "_log" just means that you want to log some message with
the specified log level. Moreover, those messages sometimes aren't errors:

pgbench_error(LOG, "starting vacuum...");

> > I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> > could be dangerous. I think "fmt" checking could be removed, or we may
> > use Assert()
> 
> I would like not to use Assert in this case because IIUC they are mostly
> used for testing.

I'd vote to remove this check altogether. I don't see any place where it is
possible to call pgbench_error() passing an empty "fmt".

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 10-08-2018 17:19, Arthur Zakirov wrote:
> On Fri, Aug 10, 2018 at 04:46:04PM +0300, Marina Polyakova wrote:
>> > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new
>> > function looks nice to me. I think it is better than just "log",
>> > because "log" may conflict with natural logarithmic function (see "man 3
>> > log").
>> 
>> Do you think that pgbench_log (or another whose name speaks only about
>> logging) will look good, for example, with FATAL? Because this means 
>> that
>> the logging function also processes errors and calls exit(1) if 
>> necessary..
> 
> Yes, why not. "_log" just means that you want to log some message with
> the specified log level. Moreover those messages sometimes aren't 
> error:
> 
> pgbench_error(LOG, "starting vacuum...");

"pgbench_log" is already used as the default filename prefix for 
transaction logging.

>> > I agree with Fabien. Calling pgbench_error() inside pgbench_error()
>> > could be dangerous. I think "fmt" checking could be removed, or we may
>> > use Assert()
>> 
>> I would like not to use Assert in this case because IIUC they are 
>> mostly
>> used for testing.
> 
> I'd vote to remove this check at all. I don't see any place where it is
> possible to call pgbench_error() passing empty "fmt".

pgbench_error(..., "%s", PQerrorMessage(con)); ?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch
> - a patch for the Variables structure (this is used to reset client variables 
> during the repeating of transactions after serialization/deadlock failures).

This patch adds an explicit structure to manage Variables, which is useful 
to reset these on pgbench script retries, which is the purpose of the 
whole patch series.

About part 3:

Patch applies cleanly,

* typo in comments: "varaibles"

* About enlargeVariables:

multiple INT_MAX error handling looks strange, especially as this code can 
never be triggered because pgbench would be dead long before having 
allocated INT_MAX variables. So I would not bother to add such checks.

ISTM that if something is amiss it will fail in pg_realloc anyway. Also I 
do not like the ExpBuf stuff, as usual.

I'm not sure that the size_t casts here and there are useful for any 
practical values likely to be encountered by pgbench.

The exponential allocation seems overkill. I'd simply add a constant 
number of slots, with a simple rule:

   /* reallocated with a margin */
   if (max_vars < needed) max_vars = needed + 8;

So in the end the function should be much simpler.
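
Something along these lines, perhaps (a rough sketch with simplified
types; the struct fields are assumptions, not the actual patch, and
pg_realloc() is the frontend allocator that exits on failure):

typedef struct Variable
{
    char       *name;
    char       *svalue;
    /* other fields omitted in this sketch */
} Variable;

typedef struct Variables
{
    Variable   *vars;       /* array of variables */
    int         nvars;      /* number of variables in use */
    int         max_vars;   /* allocated size of the array */
} Variables;

static void
enlargeVariables(Variables *variables, int needed)
{
    /* reallocate with a constant margin instead of growing exponentially */
    if (variables->max_vars < variables->nvars + needed)
    {
        variables->max_vars = variables->nvars + needed + 8;
        variables->vars = (Variable *)
            pg_realloc(variables->vars, variables->max_vars * sizeof(Variable));
    }
}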

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> About part 3:
>
> Patch applies cleanly,

I forgot: compiles, global & local "make check" are ok.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 12-08-2018 12:14, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

>> v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch
>> - a patch for the Variables structure (this is used to reset client 
>> variables during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> This patch adds an explicit structure to manage Variables, which is
> useful to reset these on pgbench script retries, which is the purpose
> of the whole patch series.
> 
> About part 3:
> 
> Patch applies cleanly,

On 12-08-2018 12:17, Fabien COELHO wrote:
>> About part 3:
>> 
>> Patch applies cleanly,
> 
> I forgot: compiles, global & local "make check" are ok.

I'm glad to hear it :-)

> * typo in comments: "varaibles"

I'm sorry, I'll fix it.

> * About enlargeVariables:
> 
> multiple INT_MAX error handling looks strange, especially as this code
> can never be triggered because pgbench would be dead long before
> having allocated INT_MAX variables. So I would not bother to add such
> checks.
> ...
> I'm not sure that the size_t cast here and there are useful for any
> practical values likely to be encountered by pgbench.

Looking at the code of functions such as ParseScript and psql_scan_setup, 
where an integer variable is used for the size of the entire script - 
ISTM that you are right.. Therefore the size_t casts will also be 
removed.

> ISTM that if something is amiss it will fail in pg_realloc anyway.

IIUC, if physical RAM is not enough, this may depend on the size of the 
swap.

> Also I do not like the ExpBuf stuff, as usual.

> The exponential allocation seems overkill. I'd simply add a constant
> number of slots, with a simple rule:
> 
> /* reallocated with a margin */
> if (max_vars < needed) max_vars = needed + 8;
> 
> So in the end the function should be much simpler.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch
> - the main patch for handling client errors and repetition of transactions 
> with serialization/deadlock failures (see the detailed description in the 
> file).

Patch applies cleanly.

It allows retrying a script (considered as a transaction) on serialization 
and deadlock errors, which is a very interesting extension but also 
impacts pgbench significantly.

I'm waiting for the feature to be right before fully checking the 
documentation and tests. There are still some issues to resolve before
checking that.

Anyway, tests look reasonable. Taking advantage of transaction control 
from PL/pgsql is a good use of this new feature.

A few comments about the doc.

According to the documentation, the feature is triggered by --max-tries and
--latency-limit. I disagree with the latter, because it means that having
a latency limit without retrying is not supported anymore.

Maybe you can allow an "unlimited" max-tries, say with special value zero,
and the latency limit does its job if set, over all tries.
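
For example (hypothetical semantics, following this suggestion), a run
such as "pgbench --max-tries=0 --latency-limit=10 ..." would retry
serialization/deadlock failures without limit, while still giving up on a
transaction once the 10 ms latency limit is exceeded over all of its
tries.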

Doc: "error in meta commands" -> "meta command errors", for homogeneity with
other cases?

Detailed -r report. I understand from the doc that the retry number on the
detailed per-statement report is to identify at what point errors occur?
Probably this is more or less always at the same point on a given script,
so that the most interesting feature is to report the number of retries at the
script level.

Doc: "never occur.." -> "never occur", or eventually "...".

Doc: "Directly client errors" -> "Direct client errors".


I'm still in favor of asserting that the SQL connection is idle (no tx in
progress) at the beginning and/or end of a script, and reporting a user error
if not, instead of writing complex caveats.

If someone has a use-case for that, then maybe it can be changed, but I
cannot see any in a benchmarking context, and I can see how easy it is
to have a buggy script with this allowed.

I do not think that the RETRIES_ENABLED macro is a good thing. I'd suggest
to write the condition four times.

ISTM that "skipped" transactions are NOT "successful" so there are a problem
with comments. I believe that your formula are probably right, it has more to do
with what is "success". For cnt decomposition, ISTM that "other transactions"
are really "directly successful transactions".

I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise "another"
does not make sense yet. I'd suggest to name it "OTHER_SQL_FAILURE".

In TState, field "uint32 retries": maybe it would be simpler to count "tries",
which can be compared directly to max tries set in the option?

ErrorLevel: I have already commented about this in the review of 10.2. I'm not sure of
the LOG -> DEBUG_FAIL changes. I do not understand the name "DEBUG_FAIL", as it
is not related to debug; they just seem to be internal errors. META_ERROR maybe?

inTransactionBlock: I disagree with any function other than doCustom changing
the client state, because it makes understanding the state machine harder. There
is already one exception to that (threadRun) that I wish to remove. All state
changes must be performed explicitly in doCustom.

The automaton skips to FAILURE on every possible error. I'm wondering whether
it could do so only on SQL errors, because other fails will lead to ABORTED
anyway? If there is no good reason to skip to FAILURE from some errors, I'd
suggest to keep the previous behavior. Maybe the good reason is to do some
counting, but this means that on eg metacommand errors now the script would
loop over instead of aborting, which does not look like a desirable change
of behavior.

PQexec("ROOLBACK"): you are inserting a synchronous command, for which the
thread will have to wait for the result, in a middle of a framework which
takes great care to use only asynchronous stuff so that one thread can
manage several clients efficiently. You cannot call PQexec there.
From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to
a new state CSTATE_WAIT_ABORT_RESULT which would be similar to
CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead
of proceeding to the next command.

ISTM that it would be more logical to only get into RETRY if there is a retry,
i.e. move the test RETRY/ABORT in FAILURE. For that, instead of "canRetry",
maybe you want "doRetry", which tells that a retry is possible (the error
is serializable or deadlock) and that the current parameters allow it
(timeout, max retries).
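
A rough sketch of such a predicate, with the relevant state passed in
explicitly for illustration (all names, parameters and the enum are
assumptions, not the patch):

#include <stdbool.h>
#include <stdint.h>

typedef enum { NO_FAILURE, SERIALIZATION_FAILURE, DEADLOCK_FAILURE } FailureKind;

/*
 * A retry makes sense only if the error kind is retriable and the run
 * parameters (--max-tries, --latency-limit over all tries) still allow
 * another try; here 0 means "no limit" for both parameters.
 */
static bool
doRetry(FailureKind failure, uint32_t tries, uint32_t max_tries,
        int64_t elapsed_us, int64_t latency_limit_us)
{
    if (failure != SERIALIZATION_FAILURE && failure != DEADLOCK_FAILURE)
        return false;
    if (max_tries != 0 && tries >= max_tries)
        return false;
    if (latency_limit_us != 0 && elapsed_us >= latency_limit_us)
        return false;
    return true;
}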


* Minor C style comments:

if / else if / else if ... on *_FAILURE: I'd suggest a switch.

The following line removal does not seem useful, I'd have kept it:

   stats->cnt++;
  -
   if (skipped)

copyVariables: I'm not convinced that source_vars & nvars variables are that
useful.

   memcpy(&(st->retry_state.random_state), &(st->random_state), sizeof(RandomState));

Is there a problem with "st->retry_state.random_state = st->random_state;"
instead of memcpy? ISTM that simple assignments work in C. Idem in the reverse
copy under RETRY.
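
Indeed, plain struct assignment copies every member, including array
members; a standalone illustration (with a simplified stand-in type):

#include <stdio.h>

typedef struct RandomState
{
    unsigned short xseed[3];
} RandomState;

int
main(void)
{
    RandomState a = {{1, 2, 3}};
    RandomState b;

    b = a;      /* the whole xseed array is copied by the assignment */
    printf("%hu %hu %hu\n", b.xseed[0], b.xseed[1], b.xseed[2]);    /* 1 2 3 */
    return 0;
}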

   if (!copyVariables(&st->retry_state.variables, &st->variables)) {
     pgbench_error(LOG, "client %d aborted when preparing to execute a transaction\n", st->id);

The message could be more precise, eg "client %d failed while copying
variables", unless copyVariables already printed a message. As this is really
an internal error from pgbench, I'd rather do a FATAL (direct exit) there.
ISTM that the only possible failure is OOM here, and pgbench is in a very bad
shape if it gets into that.

commandFailed: I'm not thrilled by the added boolean, which is partially
redundant with the second argument.

          if (per_script_stats)
  -               accumStats(&sql_script[st->use_file].stats, skipped, latency, lag);
  +       {
  +               accumStats(&sql_script[st->use_file].stats, skipped, latency, lag,
  +                                  st->failure_status, st->retries);
  +       }
   }

I do not see the point of changing the style here.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 15-08-2018 11:50, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch
>> - the main patch for handling client errors and repetition of 
>> transactions with serialization/deadlock failures (see the detailed 
>> description in the file).
> 
> Patch applies cleanly.
> 
> It allows retrying a script (considered as a transaction) on
> serializable and deadlock errors, which is a very interesting
> extension but also impacts pgbench significantly.
> 
> I'm waiting for the feature to be right before checking in full the
> documentation and tests. There are still some issues to resolve before
> checking that.
> 
> Anyway, tests look reasonable. Taking advantage of of transactions
> control from PL/pgsql is a good use of this new feature.

:-)

> A few comments about the doc.
> 
> According to the documentation, the feature is triggered by --max-tries 
> and
> --latency-limit. I disagree with the later, because it means that 
> having
> latency limit without retrying is not supported anymore.
> 
> Maybe you can allow an "unlimited" max-tries, say with special value 
> zero,
> and the latency limit does its job if set, over all tries.
> 
> Doc: "error in meta commands" -> "meta command errors", for homogeneity 
> with
> other cases?
> ...
> Doc: "never occur.." -> "never occur", or eventually "...".
> 
> Doc: "Directly client errors" -> "Direct client errors".
> ...
> inTransactionBlock: I disagree with any function other than doCustom 
> changing
> the client state, because it makes understanding the state machine 
> harder. There
> is already one exception to that (threadRun) that I wish to remove. All 
> state
> changes must be performed explicitely in doCustom.
> ...
> PQexec("ROOLBACK"): you are inserting a synchronous command, for which 
> the
> thread will have to wait for the result, in a middle of a framework 
> which
> takes great care to use only asynchronous stuff so that one thread can
> manage several clients efficiently. You cannot call PQexec there.
> From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to
> a new state CSTATE_WAIT_ABORT_RESULT which would be similar to
> CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead
> of proceeding to the next command.
> ...
>   memcpy(&(st->retry_state.random_state), &(st->random_state),
> sizeof(RandomState));
> 
> Is there a problem with "st->retry_state.random_state = 
> st->random_state;"
> instead of memcpy? ISTM that simple assignments work in C. Idem in the 
> reverse
> copy under RETRY.

Thank you, I'll fix this.

> Detailed -r report. I understand from the doc that the retry number on 
> the
> detailed per-statement report is to identify at what point errors 
> occur?
> Probably this is more or less always at the same point on a given 
> script,
> so that the most interesting feature is to report the number of retries 
> at the
> script level.

This may depend on various factors.. for example:

transaction type: pgbench_test_serialization.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 10 s
number of transactions actually processed: 266
number of errors: 10 (3.623%)
number of serialization errors: 10 (3.623%)
number of retried: 75 (27.174%)
number of retries: 75
maximum number of tries: 2
latency average = 72.734 ms (including errors)
tps = 26.501162 (including connections establishing)
tps = 26.515082 (excluding connections establishing)
statement latencies in milliseconds, errors and retries:
          0.012           0           0  \set delta random(-5000, 5000)
          0.001           0           0  \set x1 random(1, 100000)
          0.001           0           0  \set x3 random(1, 2)
          0.001           0           0  \set x2 random(1, 1)
         19.837           0           0  UPDATE xy1 SET y = y + :delta 
WHERE x = :x1;
         21.239           5          36  UPDATE xy3 SET y = y + :delta 
WHERE x = :x3;
         21.360           5          39  UPDATE xy2 SET y = y + :delta 
WHERE x = :x2;

And you can always get the number of retries at the script level from 
the main report (if only one script is used) or from the report for each 
script (if multiple scripts are used).

> I'm still in favor of asserting that the sql connection is idle (no tx 
> in
> progress) at the beginning and/or end of a script, and report a user 
> error
> if not, instead of writing complex caveats.
> 
> If someone has a use-case for that, then maybe it can be changed, but I
> cannot see any in a benchmarking context, and I can see how easy it is
> to have a buggy script with this allowed.
> 
> I do not think that the RETRIES_ENABLED macro is a good thing. I'd 
> suggest
> to write the condition four times.

Ok!

> ISTM that "skipped" transactions are NOT "successful" so there are a 
> problem
> with comments. I believe that your formula are probably right, it has 
> more to do
> with what is "success". For cnt decomposition, ISTM that "other 
> transactions"
> are really "directly successful transactions".

I agree with you, but I also think that skipped transactions should not 
be considered errors. So we can write something like this:

All the transactions are divided into several types depending on their 
execution. Firstly, they can be divided into transactions that we 
started to execute, and transactions which were skipped (it was too late 
to execute them). Secondly, running transactions fall into 2 main types: 
is there any command that got a failure during the last execution of the 
transaction script or not? Thus

the number of all transactions =
   skipped (it was too late to execute them) +
   cnt (the number of successful transactions) +
   ecnt (the number of failed transactions).

A successful transaction can have several unsuccessful tries before a
successful run. Thus

cnt (the number of successful transactions) =
   retried (they got serialization or deadlock failure(s), but were
            successfully retried from the very beginning) +
   directly successful transactions (they were successfully completed on
                                     the first try).
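
Using the sample run above as a worked example: skipped = 0 (no --rate),
cnt = 266 and ecnt = 10, so the number of all transactions is 0 + 266 +
10 = 276, which matches the reported percentages (10/276 = 3.623%, 75/276
= 27.174%); and of the 266 successful transactions, 75 were retried and
191 succeeded directly on the first try.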

> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise 
> "another"
> does not make sense yet.

Maybe first put a general group, and then the special cases?...

> I'd suggest to name it "OTHER_SQL_FAILURE".

Ok!

> In TState, field "uint32 retries": maybe it would be simpler to count 
> "tries",
> which can be compared directly to max tries set in the option?

If you mean retries in CState - on the one hand, yes, on the other hand, 
statistics always use the number of retries...

> ErrorLevel: I have already commented about in review about 10.2. I'm 
> not sure of
> the LOG -> DEBUG_FAIL changes. I do not understand the name 
> "DEBUG_FAIL", has it
> is not related to debug, they just seem to be internal errors. 
> META_ERROR maybe?

As I wrote to you in [1]:

>> I'm at odds with the proposed levels. ISTM that pgbench internal
>> errors which warrant an immediate exit should be dubbed "FATAL",
> 
> Ok!
> 
>> which
>> would leave the "ERROR" name for... errors, eg SQL errors.
>> ...
> 
> The messages of the errors in SQL and meta commands are printed only if
> the option --debug-fails is used so I'm not sure that they should have 
> a
> higher error level than main program messages (ERROR vs LOG).

Perhaps we can rename the levels DEBUG_FAIL and LOG to LOG and 
LOG_PGBENCH respectively. In this case the client error messages do not 
use debug error levels and the term "logging" is already used for 
transaction/aggregation logging... Therefore perhaps we can also combine 
the options --errors-detailed and --debug-fails into the option 
--fails-detailed=none|groups|all_messages. Here --fails-detailed=groups 
can be used to group errors in reports or logs by basic types. 
--fails-detailed=all_messages can add to this all error messages in the
SQL/meta commands, and messages for processing the failed transaction 
(its end/retry).

> The automaton skips to FAILURE on every possible error. I'm wondering 
> whether
> it could do so only on SQL errors, because other fails will lead to 
> ABORTED
> anyway? If there is no good reason to skip to FAILURE from some errors, 
> I'd
> suggest to keep the previous behavior. Maybe the good reason is to do 
> some
> counting, but this means that on eg metacommand errors now the script 
> would
> loop over instead of aborting, which does not look like a desirable 
> change
> of behavior.

Even in the case of meta command errors we must prepare for 
CSTATE_END_TX and the execution of the next script: if necessary, clear 
the conditional stack and roll back the current transaction block.

> ISTM that it would be more logical to only get into RETRY if there is a 
> retry,
> i.e. move the test RETRY/ABORT in FAILURE. For that, instead of 
> "canRetry",
> maybe you want "doRetry", which tells that a retry is possible (the 
> error
> is serializable or deadlock) and that the current parameters allow it
> (timeout, max retries).
> 
> * Minor C style comments:
> 
> if / else if / else if ... on *_FAILURE: I'd suggest a switch.
> 
> The following line removal does not seem useful, I'd have kept it:
> 
>   stats->cnt++;
>  -
>   if (skipped)
> 
> copyVariables: I'm not convinced that source_vars & nvars variables are 
> that
> useful.

>   if (!copyVariables(&st->retry_state.variables, &st->variables)) {
>     pgbench_error(LOG, "client %d aborted when preparing to execute a
> transaction\n", st->id);
> 
> The message could be more precise, eg "client %d failed while copying
> variables", unless copyVariables already printed a message. As this is 
> really
> an internal error from pgbench, I'd rather do a FATAL (direct exit) 
> there.
> ISTM that the only possible failure is OOM here, and pgbench is in a 
> very bad
> shape if it gets into that.

Ok!

> commandFailed: I'm not thrilled by the added boolean, which is 
> partially
> redundant with the second argument.

Do you mean that it is partially redundant with the argument "cmd" and 
that, for example, meta command errors never cause the abortion of the 
client?

>          if (per_script_stats)
>  -               accumStats(&sql_script[st->use_file].stats, skipped,
> latency, lag);
>  +       {
>  +               accumStats(&sql_script[st->use_file].stats, skipped,
> latency, lag,
>  +                                  st->failure_status, st->retries);
>  +       }
>   }
> 
> I do not see the point of changing the style here.

If in such cases one command is placed on several lines, ISTM that the 
code is more understandable if curly brackets are used...

[1] 
https://www.postgresql.org/message-id/fcc2512cdc9e6bc49d3b489181f454da%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

>> Detailed -r report. I understand from the doc that the retry number on 
>> the detailed per-statement report is to identify at what point errors 
>> occur? Probably this is more or less always at the same point on a 
>> given script, so that the most interesting feature is to report the 
>> number of retries at the script level.
>
> This may depend on various factors.. for example:
> [...]
>        21.239           5          36  UPDATE xy3 SET y = y + :delta WHERE x 
> = :x3;
>        21.360           5          39  UPDATE xy2 SET y = y + :delta WHERE x 
> = :x2;

Ok, not always the same point, and you confirm that it identifies where 
the error is raised which leads to a retry.

> And you can always get the number of retries at the script level from the 
> main report (if only one script is used) or from the report for each script 
> (if multiple scripts are used).

Ok.

>> ISTM that "skipped" transactions are NOT "successful" so there are a 
>> problem with comments. I believe that your formula are probably right, 
>> it has more to do with what is "success". For cnt decomposition, ISTM 
>> that "other transactions" are really "directly successful 
>> transactions".
>
> I agree with you, but I also think that skipped transactions should not be 
> considered errors.

I'm ok with having a special category for them in the explanations, which 
is neither success nor error.

> So we can write something like this:

> All the transactions are divided into several types depending on their 
> execution. Firstly, they can be divided into transactions that we started to 
> execute, and transactions which were skipped (it was too late to execute 
> them). Secondly, running transactions fall into 2 main types: is there any 
> command that got a failure during the last execution of the transaction 
> script or not? Thus

Here is an attempt at having a more precise and shorter version, not sure 
it is much better than yours, though:

"""
Transactions are counted depending on their execution and outcome. First
a transaction may have started or not: skipped transactions occur under 
--rate and --latency-limit when the client is too late to execute them. 
Secondly, a started transaction may ultimately succeed or fail on some 
error, possibly after some retries when --max-tries is not one. Thus
"""

> the number of all transactions =
>  skipped (it was too late to execute them)
>  cnt (the number of successful transactions) +
>  ecnt (the number of failed transactions).
>
> A successful transaction can have several unsuccessful tries before a
> successfull run. Thus
>
> cnt (the number of successful transactions) =
>  retried (they got a serialization or a deadlock failure(s), but were
>           successfully retried from the very beginning) +
>  directly successfull transactions (they were successfully completed on
>                                     the first try).

The above description is clearer for me.

>> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise 
>> "another" does not make sense yet.
>
> Maybe firstly put a general group, and then special cases?...

I understand it more as a catch-all default "none of the above" case.

>> In TState, field "uint32 retries": maybe it would be simpler to count 
>> "tries", which can be compared directly to max tries set in the option?
>
> If you mean retries in CState - on the one hand, yes, on the other hand, 
> statistics always use the number of retries...

Ok.


>> The automaton skips to FAILURE on every possible error. I'm wondering 
>> whether it could do so only on SQL errors, because other fails will 
>> lead to ABORTED anyway? If there is no good reason to skip to FAILURE 
>> from some errors, I'd suggest to keep the previous behavior. Maybe the 
>> good reason is to do some counting, but this means that on eg 
>> metacommand errors now the script would loop over instead of aborting, 
>> which does not look like a desirable change of behavior.
>
> Even in the case of meta command errors we must prepare for CSTATE_END_TX and 
> the execution of the next script: if necessary, clear the conditional stack 
> and rollback the current transaction block.

Seems ok.

>> commandFailed: I'm not thrilled by the added boolean, which is partially
>> redundant with the second argument.
>
> Do you mean that it is partially redundant with the argument "cmd" and, for 
> example, the meta commands errors always do not cause the abortions of the 
> client?

Yes. And also I'm not sure we should want this boolean at all.

> [...]
> If in such cases one command is placed on several lines, ISTM that the code 
> is more understandable if curly brackets are used...

Hmmm. Such basic style changes are avoided because they break 
backpatching, so we try to avoid gratuitous changes unless there is a 
strong added value, which does not seem to be the case here.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 17-08-2018 10:49, Fabien COELHO wrote:
> Hello Marina,
> 
>>> Detailed -r report. I understand from the doc that the retry number 
>>> on the detailed per-statement report is to identify at what point 
>>> errors occur? Probably this is more or less always at the same point 
>>> on a given script, so that the most interesting feature is to report 
>>> the number of retries at the script level.
>> 
>> This may depend on various factors.. for example:
>> [...]
>>        21.239           5          36  UPDATE xy3 SET y = y + :delta 
>> WHERE x = :x3;
>>        21.360           5          39  UPDATE xy2 SET y = y + :delta 
>> WHERE x = :x2;
> 
> Ok, not always the same point, and you confirm that it identifies
> where the error is raised which leads to a retry.

Yes, I confirm this. I'll try to write more clearly about this in the 
documentation...

>> So we can write something like this:
> 
>> All the transactions are divided into several types depending on their 
>> execution. Firstly, they can be divided into transactions that we 
>> started to execute, and transactions which were skipped (it was too 
>> late to execute them). Secondly, running transactions fall into 2 main 
>> types: is there any command that got a failure during the last 
>> execution of the transaction script or not? Thus
> 
> Here is an attempt at having a more precise and shorter version, not
> sure it is much better than yours, though:
> 
> """
> Transactions are counted depending on their execution and outcome. 
> First
> a transaction may have started or not: skipped transactions occur
> under --rate and --latency-limit when the client is too late to
> execute them. Secondly, a started transaction may ultimately succeed
> or fail on some error, possibly after some retries when --max-tries is
> not one. Thus
> """

Thank you!

>>> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, 
>>> otherwise "another" does not make sense yet.
>> 
>> Maybe firstly put a general group, and then special cases?...
> 
> I understand it more as a catch all default "none of the above" case.

Ok!

>>> commandFailed: I'm not thrilled by the added boolean, which is 
>>> partially
>>> redundant with the second argument.
>> 
>> Do you mean that it is partially redundant with the argument "cmd" 
>> and, for example, the meta commands errors always do not cause the 
>> abortions of the client?
> 
> Yes. And also I'm not sure we should want this boolean at all.

Perhaps we can use a separate function to print the messages about the 
client's abortion, something like this (it is assumed that all abortions 
happen when processing SQL commands):

static void
clientAborted(CState *st, const char *message)
{
    pgbench_error(...,
                  "client %d aborted in command %d (SQL) of script %d; %s\n",
                  st->id, st->command, st->use_file, message);
}

Or perhaps we can use a more detailed failure status so that for each type 
of failure we always know the command name (argument "cmd") and whether the 
client is aborted. Something like this (but in comparison with the first 
variant ISTM overly complicated):

/*
 * For the failures during script execution.
 */
typedef enum FailureStatus
{
    NO_FAILURE = 0,

    /*
     * Failures in meta commands. In these cases the failed transaction is
     * terminated.
     */
    META_SET_FAILURE,
    META_SETSHELL_FAILURE,
    META_SHELL_FAILURE,
    META_SLEEP_FAILURE,
    META_IF_FAILURE,
    META_ELIF_FAILURE,

    /*
     * Failures in SQL commands. In cases of serialization/deadlock
     * failures a failed transaction is re-executed from the very
     * beginning if possible; otherwise the failed transaction is
     * terminated.
     */
    SERIALIZATION_FAILURE,
    DEADLOCK_FAILURE,
    OTHER_SQL_FAILURE,            /* other failures in SQL commands that are not
                                 * listed by themselves above */

    /*
     * Failures while processing SQL commands. In this case the client is
     * aborted.
     */
    SQL_CONNECTION_FAILURE
} FailureStatus;

>> [...]
>> If in such cases one command is placed on several lines, ISTM that the 
>> code is more understandable if curly brackets are used...
> 
> Hmmm. Such basic style changes are avoided because they break
> backpatching, so we try to avoid gratuitous changes unless there is a
> strong added value, which does not seem to be the case here.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>>>> commandFailed: I'm not thrilled by the added boolean, which is partially
>>>> redundant with the second argument.
>>> 
>>> Do you mean that it is partially redundant with the argument "cmd" and, 
>>> for example, the meta commands errors always do not cause the abortions of 
>>> the client?
>> 
>> Yes. And also I'm not sure we should want this boolean at all.
>
> Perhaps we can use a separate function to print the messages about client's 
> abortion, something like this (it is assumed that all abortions happen when 
> processing SQL commands):
>
> static void
> clientAborted(CState *st, const char *message)

Possibly.

> Or perhaps we can use a more detailed failure status so for each type of 
> failure we always know the command name (argument "cmd") and whether the 
> client is aborted. Something like this (but in comparison with the first 
> variant ISTM overly complicated):

I agree. I do not think that it would be useful given that the same thing 
is done on all meta-command error cases in the end.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 17-08-2018 14:04, Fabien COELHO wrote:
> ...
>> Or perhaps we can use a more detailed failure status so for each type 
>> of failure we always know the command name (argument "cmd") and 
>> whether the client is aborted. Something like this (but in comparison 
>> with the first variant ISTM overly complicated):
> 
> I agree., I do not think that it would be useful given that the same
> thing is done on all meta-command error cases in the end.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello, hackers!

This is the eleventh version of the patch for error handling and 
retrying of transactions with serialization/deadlock failures in pgbench 
(based on the commit 14e9b2a752efaa427ce1b400b9aaa5a636898a04) thanks to 
the comments of Fabien Coelho and Arthur Zakirov in this thread.

v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).

v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

v11-0004-Pgbench-errors-use-a-separate-function-to-report.patch
- a patch for a separate error reporting function (this is used to 
report client failures that do not cause an abort, and this depends on 
the level of debugging). Although this is an attempt to fix duplicated 
code for debug messages (see [1]), it may seem mostly like refactoring and 
therefore may not seem very necessary for this set of patches (see [2], 
[3]), so this patch is now the last one and is optional.

Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain

> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre

> However ISTM that it is not as necessary as the previous one, i.e. we
> could do without it to get the desired feature, so I see it more as a
> refactoring done "in passing", and I'm wondering whether it is
> really worth it because it adds some new complexity, so I'm not sure of
> the net benefit.

[3] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre

> I'm still not over enthousiastic with these changes, and still think 
> that
> it should be an independent patch, not submitted together with the 
> "retry
> on error" feature.

All that was fixed from the previous version:

[4] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre

> I'm at odds with the proposed levels. ISTM that pgbench internal
> errors which warrant an immediate exit should be dubbed "FATAL",

> I'm unsure about the "log_min_messages" variable name, I'd suggest
> "log_level".
> 
> I do not see the asserts on LOG >= log_min_messages as useful, because
> the level can only be LOG or DEBUG anyway.

> * PQExpBuffer
> 
> I still do not see a positive value from importing PQExpBuffer
> complexity and cost into pgbench, as the resulting code is not very
> readable and it adds malloc/free cycles, so I'd try to avoid using
> PQExpBuf as much as possible. ISTM that all usages could be avoided in
> the patch, and most should be avoided even if ExpBuffer is imported
> because it is really useful somewhere.
> 
> - to call pgbench_error from pgbench_simple_error, you can do a
> pgbench_log_va(level, format, va_list) version called both from
> pgbench_error & pgbench_simple_error.
> 
> - for PGBENCH_DEBUG function, do separate calls per type, the very
> small partial code duplication is worth avoiding ExpBuf IMO.
> 
> - for doCustom debug: I'd just let the printf as it is, with a
> comment, as it is really very internal stuff for debug. Or I'd just
> snprintf a something in a static buffer.
> 
> ...
> 
> - for listAvailableScript: I'd simply call "pgbench_error(LOG" several
> times, once per line.
> 
> I see building a string with a format (printfExpBuf..) and then
> calling the pgbench_error function with just a "%s" format on the
> result as not very elegant, because the second format is somehow
> hacked around.

[5] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre

> I suggest that the called function does only one simple thing,
> probably "DEBUG", and that the *caller* prints a message if it is 
> unhappy
> about the failure of the called function, as it is currently done. This
> allows providing context as well from the caller, eg "setting variable 
> %s
> failed while <some specific context>". The user can rerun under debug 
> for
> more detail if they need it.

[6] 
https://www.postgresql.org/message-id/20180810125327.GA2374%40zakirov.localdomain

> I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> could be dangerous. I think "fmt" checking could be removed, or we may
> use Assert() or fprintf()+exit(1) at least.

[7] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808121057540.6189%40lancre

> * typo in comments: "varaibles"
> 
> * About enlargeVariables:
> 
> multiple INT_MAX error handling looks strange, especially as this code 
> can
> never be triggered because pgbench would be dead long before having
> allocated INT_MAX variables. So I would not bother to add such checks.

> I'm not sure that the size_t cast here and there are useful for any
> practical values likely to be encountered by pgbench.
> 
> The exponential allocation seems overkill. I'd simply add a constant
> number of slots, with a simple rule:
> 
>    /* reallocated with a margin */
>    if (max_vars < needed) max_vars = needed + 8;

[8] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808151046090.30050%40lancre

> A few comments about the doc.
> 
> According to the documentation, the feature is triggered by --max-tries 
> and
> --latency-limit. I disagree with the latter, because it means that 
> having
> latency limit without retrying is not supported anymore.
> 
> Maybe you can allow an "unlimited" max-tries, say with special value 
> zero,
> and the latency limit does its job if set, over all tries.
> 
> Doc: "error in meta commands" -> "meta command errors", for homogeneity 
> with
> other cases?

> Doc: "never occur.." -> "never occur", or eventually "...".
> 
> Doc: "Directly client errors" -> "Direct client errors".
> 
> I'm still in favor of asserting that the sql connection is idle (no tx 
> in
> progress) at the beginning and/or end of a script, and report a user 
> error
> if not, instead of writing complex caveats.

> I do not think that the RETRIES_ENABLED macro is a good thing. I'd 
> suggest
> to write the condition four times.
> 
> ISTM that "skipped" transactions are NOT "successful" so there are a 
> problem
> with comments. I believe that your formula are probably right, it has 
> more to do
> with what is "success". For cnt decomposition, ISTM that "other 
> transactions"
> are really "directly successful transactions".
> 
> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise 
> "another"
> does not make sense yet. I'd suggest to name it "OTHER_SQL_FAILURE".

> I'm not sure of
> the LOG -> DEBUG_FAIL changes. I do not understand the name 
> "DEBUG_FAIL", has it
> is not related to debug, they just seem to be internal errors.

> inTransactionBlock: I disagree with any function other than doCustom 
> changing
> the client state, because it makes understanding the state machine 
> harder. There
> is already one exception to that (threadRun) that I wish to remove. All 
> state
> changes must be performed explicitly in doCustom.

> PQexec("ROOLBACK"): you are inserting a synchronous command, for which 
> the
> thread will have to wait for the result, in a middle of a framework 
> which
> takes great care to use only asynchronous stuff so that one thread can
> manage several clients efficiently. You cannot call PQexec there.
> From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to
> a new state CSTATE_WAIT_ABORT_RESULT which would be similar to
> CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead
> of proceeding to the next command.
> 
> ISTM that it would be more logical to only get into RETRY if there is a 
> retry,
> i.e. move the test RETRY/ABORT in FAILURE. For that, instead of 
> "canRetry",
> maybe you want "doRetry", which tells that a retry is possible (the 
> error
> is serializable or deadlock) and that the current parameters allow it
> (timeout, max retries).
> 
> * Minor C style comments:
> 
> if / else if / else if ... on *_FAILURE: I'd suggest a switch.
> 
> The following line removal does not seem useful, I'd have kept it:
> 
>    stats->cnt++;
>   -
>    if (skipped)
> 
> copyVariables: I'm not convinced that source_vars & nvars variables are 
> that
> useful.
> 
>    memcpy(&(st->retry_state.random_state), &(st->random_state), 
> sizeof(RandomState));
> 
> Is there a problem with "st->retry_state.random_state = 
> st->random_state;"
> instead of memcpy? ISTM that simple assignments work in C. Idem in the 
> reverse
> copy under RETRY.

> commandFailed: I'm not thrilled by the added boolean, which is 
> partially
> redundant with the second argument.
> 
>           if (per_script_stats)
>   -               accumStats(&sql_script[st->use_file].stats, skipped, 
> latency, lag);
>   +       {
>   +               accumStats(&sql_script[st->use_file].stats, skipped, 
> latency, lag,
>   +                                  st->failure_status, st->retries);
>   +       }
>    }
> 
> I do not see the point of changing the style here.

[9] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808170917510.20841%40lancre

> Here is an attempt at having a more precise and shorter version, not 
> sure
> it is much better than yours, though:
> 
> """
> Transactions are counted depending on their execution and outcome. 
> First
> a transaction may have started or not: skipped transactions occur under
> --rate and --latency-limit when the client is too late to execute them.
> Secondly, a started transaction may ultimately succeed or fail on some
> error, possibly after some retries when --max-tries is not one. Thus
> """

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

About the first two preparatory patches.

> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> - a patch for the RandomState structure (this is used to reset a client's 
> random seed during the repeating of transactions after serialization/deadlock 
> failures).

Same version as the previous one, which was ok. Still applies, compiles, 
passes tests. Fine with me.

> v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
> - a patch for the Variables structure (this is used to reset client variables 
> during the repeating of transactions after serialization/deadlock failures).

Simpler version, applies cleanly on top of previous patch, compiles and 
global & local "make check" are ok. Fine with me as well.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
> - the main patch for handling client errors and repetition of transactions 
> with serialization/deadlock failures (see the detailed description in the 
> file).

About patch v11-3.

Patch applies cleanly on top of the other two. Compiles, global and local
"make check" are ok.

* Features

As far as the actual retry feature is concerned, I'd say we are nearly 
there. However, I have an issue with changing the behavior on meta-command 
and other SQL errors, which I find undesirable.

When a meta-command fails, before the patch the command is aborted and 
there is a convenient error message:

   sh> pgbench -T 10 -f bad-meta.sql
   bad-meta.sql:1: unexpected function name (false) in command "set" [...]
   \set i false + 1 [...]

After the patch it is simply counted, pgbench loops on the same error until 
the run completes, and there is no clue about the actual issue:

   sh> pgbench -T 10 -f bad-meta.sql
   starting vacuum...end.
   transaction type: bad-meta.sql
   duration: 10 s
   number of transactions actually processed: 0
   number of failures: 27993953 (100.000%)
   ...

Same thing about SQL errors, an immediate abort...

   sh> pgbench -T 10 -f bad-sql.sql
   starting vacuum...end.
   client 0 aborted in command 0 of script 0; ERROR:  syntax error at or near ";"
   LINE 1: SELECT 1 + ;

... is turned into counting, without aborting and without error messages, 
so that there is no clue that the user was asking for something bad.

   sh> pgbench -T 10 -f bad-sql.sql
   starting vacuum...end.
   transaction type: bad-sql.sql
   scaling factor: 1
   query mode: simple
   number of clients: 1
   number of threads: 1
   duration: 10 s
   number of transactions actually processed: 0
   number of failures: 274617 (100.000%)
   # no clue that there was a syntax error in the script

I do not think that these changes of behavior are desirable. Meta-command and
miscellaneous SQL errors should result in immediately aborting the whole run,
because the client test code itself could not run correctly or the SQL sent
was somehow wrong, which is also the client's fault, and the server 
performance bench does not make much sense in such conditions.

ISTM that the focus of this patch should only be to handle some server 
runtime errors that can be retried, but not to change pgbench behavior on 
other kinds of errors. If these are to be changed, ISTM that it would be a 
distinct patch and would require some discussion, and possibly an option 
to enable it or not if some use case emerges. As far as this patch is 
concerned, I'd suggest leaving that out.


Doc says "you cannot use an infinite number of retries without latency-limit..."

Why should this be forbidden? At least if the -T timeout takes precedence and
shortens the execution, ISTM that there could be good reason to test that.
Maybe it could be blocked only under -t if this would lead to a non-ending
run.


As "--print-errors" is really for debug, maybe it could be named
"--debug-errors". I'm not sure that having "--debug" implying this option
is useful: As there are two distinct options, the user may be allowed
to trigger one or the other as they wish?


* Code

The following remarks are linked to the change of behavior discussed above:
the makeVariableValue error message is not for debug, but must be kept in all
cases, and a false return must result in an immediate abort. Same thing about
lookupCreateVariable: an invalid name is a user error which warrants an immediate
abort. Same thing again about the coerce* functions or evalStandardFunc...
Basically, most/all added "debug_level >= DEBUG_ERRORS" checks are not desirable.

sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is
really about the transaction script, not dealing with errors, esp as there is no
significant advantage in preparing a "ROLLBACK" statement which is short and has
no parameters. I'd suggest to remove this function and just issue
PQsendQuery("ROLLBACK;") in all cases.
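To sketch the suggested simplification (illustrative only, not the patch's 
actual code): a plain asynchronous ROLLBACK keeps the one-thread-many-clients 
model intact, and the state machine then waits for its result like for any 
other query:

    #include <stdbool.h>
    #include <libpq-fe.h>

    /* dispatch an asynchronous ROLLBACK; returns false if submission failed */
    static bool
    send_rollback(PGconn *con)
    {
        /* PQsendQuery() returns 1 on successful dispatch, 0 on failure */
        return PQsendQuery(con, "ROLLBACK;") == 1;
    }

The caller would then move the client into a CSTATE_WAIT_RESULT-like state 
and decide between retry and abort once the result arrives.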

In copyVariables, I'd simplify

  + if (source_var->svalue == NULL)
  +   dest_var->svalue = NULL;
  + else
  +   dest_var->svalue = pg_strdup(source_var->svalue);

as:

   dest_var->svalue = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue);

  + if (sqlState)   ->   if (sqlState != NULL) ?


Function getTransactionStatus name does not seem to correspond fully to what the
function does. There is a passthru case which should be either avoided or
clearly commented.


About:

  - commandFailed(st, "SQL", "perhaps the backend died while processing");
  + clientAborted(st,
  +              "perhaps the backend died while processing");

keep on one line?


About:

  + if (doRetry(st, &now))
  +   st->state = CSTATE_RETRY;
  + else
  +   st->state = CSTATE_FAILURE;

-> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;


* Comments

"There're different types..." -> "There are different types..."

"after the errors and"... -> "after errors and"...

"the default value of max_tries is set to 1" -> "the default value
of max_tries is 1"

"We cannot retry the transaction" -> "We cannot retry a transaction"

"may ultimately succeed or get a failure," -> "may ultimately succeed or fail,"

Overall, the comment text in StatsData is very clear. However, the comments are not
clearly linked to the struct fields. I'd suggest that each field, when used,
should be quoted, so as to separate English from code, and the struct name
should always be used explicitly when possible.

I'd insist in a comment that "cnt" does not include "skipped" transactions
(anymore).


* Documentation:

Some suggestions which may be improvements, although I'm not a native English
speaker.

ISTM that there are too many "the":
  - "turns on the option ..." -> "turns on option ..."
  - "When the option ..." -> "When option ..."
  - "By default the option ..." -> "By default option ..."
  - "only if the option ..." -> "only if option ..."
  - "combined with the option ..." -> "combined with option ..."
  - "without the option ..." -> "without option ..."
  - "is the sum of all the retries" -> "is the sum of all retries"

"infinite" -> "unlimited"

"not retried at all" -> "not retried" (maybe several times).

"messages of all errors" -> "messages about all errors".

"It is assumed that the scripts used do not contain" ->
"It is assumed that pgbench scripts do not contain"


About v11-4: I do not feel that these changes are very useful/important 
for now. I'd propose that you prioritize updating v11-3 so that we can 
have another round on it as soon as possible, and keep that one for later.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 08-09-2018 10:17, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

> About the two first preparatory patches.
> 
>> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's random seed during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> Same version as the previous one, which was ok. Still applies,
> compiles, passes tests. Fine with me.
> 
>> v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
>> - a patch for the Variables structure (this is used to reset client 
>> variables during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> Simpler version, applies cleanly on top of previous patch, compiles
> and global & local "make check" are ok. Fine with me as well.

Glad to hear it :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 08-09-2018 16:03, Fabien COELHO wrote:
> Hello Marina,
> 
>> v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
>> - the main patch for handling client errors and repetition of 
>> transactions with serialization/deadlock failures (see the detailed 
>> description in the file).
> 
> About patch v11-3.
> 
> Patch applies cleanly on top of the other two. Compiles, global and 
> local
> "make check" are ok.

:-)

> * Features
> 
> As far as the actual retry feature is concerned, I'd say we are nearly
> there. However I have an issue with changing the behavior on meta
> command and other sql errors, which I find not desirable.
> 
> When a meta-command fails, before the patch the command is aborted and
> there is a convenient error message:
> 
>   sh> pgbench -T 10 -f bad-meta.sql
>   bad-meta.sql:1: unexpected function name (false) in command "set" 
> [...]
>   \set i false + 1 [...]
> 
> After the patch it is simply counted, pgbench loops on the same error
> till the time is completed, and there are no clue about the actual
> issue:
> 
>   sh> pgbench -T 10 -f bad-meta.sql
>   starting vacuum...end.
>   transaction type: bad-meta.sql
>   duration: 10 s
>   number of transactions actually processed: 0
>   number of failures: 27993953 (100.000%)
>   ...
> 
> Same thing about SQL errors, an immediate abort...
> 
>   sh> pgbench -T 10 -f bad-sql.sql
>   starting vacuum...end.
>   client 0 aborted in command 0 of script 0; ERROR:  syntax error at or 
> near ";"
>   LINE 1: SELECT 1 + ;
> 
> ... is turned into counting without aborting nor error messages, so
> that there is no clue that the user was asking for something bad.
> 
>   sh> pgbench -T 10 -f bad-sql.sql
>   starting vacuum...end.
>   transaction type: bad-sql.sql
>   scaling factor: 1
>   query mode: simple
>   number of clients: 1
>   number of threads: 1
>   duration: 10 s
>   number of transactions actually processed: 0
>   number of failures: 274617 (100.000%)
>   # no clue that there was a syntax error in the script
> 
> I do not think that these changes of behavior are desirable. Meta 
> command and
> miscellaneous SQL errors should result in immediatly aborting the whole 
> run,
> because the client test code itself could not run correctly or the SQL 
> sent
> was somehow wrong, which is also the client's fault, and the server
> performance bench does not make much sense in such conditions.
> 
> ISTM that the focus of this patch should only be to handle some server
> runtime errors that can be retryed, but not to change pgbench behavior
> on other kind of errors. If these are to be changed, ISTM that it
> would be a distinct patch and would require some discussion, and
> possibly an option to enable it or not if some use case emerge. AFA
> this patch is concerned, I'd suggest to let that out.
...
> The following remarks are linked to the change of behavior discussed 
> above:
> makeVariableValue error message is not for debug, but must be kept in 
> all
> cases, and the false returned must result in an immediate abort. Same
> thing about
> lookupCreateVariable, an invalid name is a user error which warrants
> an immediate
> abort. Same thing again about coerce* functions or evalStandardFunc...
> Basically, most/all added "debug_level >= DEBUG_ERRORS" are not 
> desirable.

Hmm, but we can say the same for serialization or deadlock errors that 
were not retried (the client test code itself could not run correctly or 
the SQL sent was somehow wrong, which is also the client's fault), can't 
we? Why not handle client errors that can occur (but may also not 
occur) the same way? (For example, always abort the client, or 
conversely never abort in these cases.) Here's an example of such an 
error:

starting vacuum...end.
transaction type: pgbench_rare_sql_error.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 250
number of transactions actually processed: 2500/2500
maximum number of tries: 1
latency average = 0.375 ms
tps = 26695.292848 (including connections establishing)
tps = 27489.678525 (excluding connections establishing)
statement latencies in milliseconds and failures:
          0.001           0  \set divider random(-1000, 1000)
          0.245           0  SELECT 1 / :divider;

starting vacuum...end.
client 5 got an error in command 1 (SQL) of script 0; ERROR:  division 
by zero

client 0 got an error in command 1 (SQL) of script 0; ERROR:  division 
by zero

client 7 got an error in command 1 (SQL) of script 0; ERROR:  division 
by zero

transaction type: pgbench_rare_sql_error.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 250
number of transactions actually processed: 2497/2500
number of failures: 3 (0.120%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 3 (0.120%)
maximum number of tries: 1
latency average = 0.579 ms (including failures)
tps = 17240.662547 (including connections establishing)
tps = 17862.090137 (excluding connections establishing)
statement latencies in milliseconds and failures:
          0.001           0  \set divider random(-1000, 1000)
          0.338           3  SELECT 1 / :divider;

Maybe we can limit the number of failures in one statement, and abort 
the client if this limit is exceeded?...
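A minimal sketch of that idea (purely hypothetical, not part of the 
submitted patches): the decision could be a per-statement failure counter 
compared against a cap, for example:

    #include <stdbool.h>
    #include <stdio.h>

    /* hypothetical check: abort the client once one statement has failed
       too many times, so a buggy script cannot loop on the same error for
       the whole run */
    static bool
    should_abort_client(int client_id, int statement_failures, int max_failures)
    {
        if (statement_failures >= max_failures)
        {
            fprintf(stderr, "client %d aborted: statement failed %d times\n",
                    client_id, statement_failures);
            return true;
        }
        return false;
    }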

To get a clue about the actual issue you can use the options 
--failures-detailed (to find out whether this is a serialization 
failure / deadlock failure / other SQL failure / meta command failure) 
and/or --print-errors (to get the complete error message).

> Doc says "you cannot use an infinite number of retries without 
> latency-limit..."
> 
> Why should this be forbidden? At least if -T timeout takes precedent 
> and
> shortens the execution, ISTM that there could be good reason to test 
> that.
> Maybe it could be blocked only under -t if this would lead to an 
> non-ending
> run.
...
> * Comments
> 
> "There're different types..." -> "There are different types..."
> 
> "after the errors and"... -> "after errors and"...
> 
> "the default value of max_tries is set to 1" -> "the default value
> of max_tries is 1"
> 
> "We cannot retry the transaction" -> "We cannot retry a transaction"
> 
> "may ultimately succeed or get a failure," -> "may ultimately succeed 
> or fail,"
...
> * Documentation:
> 
> Some suggestions which may be improvements, although I'm not a native 
> English
> speaker.
> 
> ISTM that there are too many "the":
>  - "turns on the option ..." -> "turns on option ..."
>  - "When the option ..." -> "When option ..."
>  - "By default the option ..." -> "By default option ..."
>  - "only if the option ..." -> "only if option ..."
>  - "combined with the option ..." -> "combined with option ..."
>  - "without the option ..." -> "without option ..."
>  - "is the sum of all the retries" -> "is the sum of all retries"
> 
> "infinite" -> "unlimited"
> 
> "not retried at all" -> "not retried" (maybe several times).
> 
> "messages of all errors" -> "messages about all errors".
> 
> "It is assumed that the scripts used do not contain" ->
> "It is assumed that pgbench scripts do not contain"

Thank you, I'll fix this.

If you use the option --latency-limit, the time spent on tries will be limited 
regardless of whether the option -t is used. Therefore ISTM that an unlimited 
number of tries can be used only if the time spent on tries is limited by the 
options -T and/or -L.
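A sketch of the corresponding cross-option check (variable names are 
assumptions for illustration): unlimited tries are rejected unless -T 
and/or -L bound the time spent on tries:

    #include <stdio.h>
    #include <stdlib.h>

    /* max_tries == 0 means "unlimited"; duration is -T, latency_limit is -L */
    static void
    check_retry_options(int max_tries, int duration, double latency_limit)
    {
        if (max_tries == 0 && duration <= 0 && latency_limit <= 0.0)
        {
            fprintf(stderr,
                    "an unlimited number of transaction tries can only be used "
                    "with --latency-limit or a duration (-T)\n");
            exit(1);
        }
    }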

> As "--print-errors" is really for debug, maybe it could be named
> "--debug-errors".

Ok!

> I'm not sure that having "--debug" implying this option
> is useful: As there are two distinct options, the user may be allowed
> to trigger one or the other as they wish?

I'm not sure that the main debugging output will give a good clue about 
what happened without full messages about errors, retries and 
failures...

> * Code
> 
> <...>
>
> sendRollback(): I'd suggest to simplify. The prepare/extended statement 
> stuff is
> really about the transaction script, not dealing with errors, esp as 
> there is no
> significant advantage in preparing a "ROLLBACK" statement which is 
> short and has
> no parameters. I'd suggest to remove this function and just issue
> PQsendQuery("ROLLBACK;") in all cases.

Ok!

> In copyVariables, I'd simplify
> 
>  + if (source_var->svalue == NULL)
>  +   dest_var->svalue = NULL;
>  + else
>  +   dest_var->svalue = pg_strdup(source_var->svalue);
> 
> as:
> 
>   dest_var->value = (source_var->svalue == NULL) ? NULL :
> pg_strdup(source_var->svalue);

> About:
> 
>  + if (doRetry(st, &now))
>  +   st->state = CSTATE_RETRY;
>  + else
>  +   st->state = CSTATE_FAILURE;
> 
> -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;

These lines are quite long - do you suggest wrapping them this way?

+        dest_var->svalue = ((source_var->svalue == NULL) ? NULL :
+                            pg_strdup(source_var->svalue));

+                        st->state = (doRetry(st, &now) ? CSTATE_RETRY :
+                                     CSTATE_FAILURE);

>  + if (sqlState)   ->   if (sqlState != NULL) ?

Ok!

> Function getTransactionStatus name does not seem to correspond fully to 
> what the
> function does. There is a passthru case which should be either avoided 
> or
> clearly commented.

I don't quite understand you - do you mean that in fact this function 
finds out whether we are in a (failed) transaction block or not? Or do 
you mean that the case of PQTRANS_INTRANS is also ok?...

> About:
> 
>  - commandFailed(st, "SQL", "perhaps the backend died while 
> processing");
>  + clientAborted(st,
>  +              "perhaps the backend died while processing");
> 
> keep on one line?

I tried not to break the limit of 80 characters, but if you think that 
this is better, I'll change it.

> Overall, the comment text in StatsData is very clear. However they are 
> not
> clearly linked to the struct fields. I'd suggest that earch field when 
> used
> should be quoted, so as to separate English from code, and the struct 
> name
> should always be used explicitely when possible.

Ok!

> I'd insist in a comment that "cnt" does not include "skipped" 
> transactions
> (anymore).

If you mean CState.cnt I'm not sure if this is practically useful 
because the code uses only the sum of all client transactions including 
skipped and failed... Maybe we can rename this field to nxacts or 
total_cnt?

> About v11-4. I'm do not feel that these changes are very
> useful/important for now. I'd propose that your prioritize on updating
> 11-3 so that we can have another round about it as soon as possible,
> and keep that one later.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-09-2018 16:47, Marina Polyakova wrote:
> On 08-09-2018 16:03, Fabien COELHO wrote:
>> Hello Marina,
>> I'd insist in a comment that "cnt" does not include "skipped" 
>> transactions
>> (anymore).
> 
> If you mean CState.cnt I'm not sure if this is practically useful
> because the code uses only the sum of all client transactions
> including skipped and failed... Maybe we can rename this field to
> nxacts or total_cnt?

Sorry, I misread your proposal the first time. Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> Hmm, but we can say the same for serialization or deadlock errors that were 
> not retried (the client test code itself could not run correctly or the SQL 
> sent was somehow wrong, which is also the client's fault), can't we?

I think not.

If a client asks for something "legal", but some other client in parallel 
happens to make an incompatible change which results in a serialization or 
deadlock error, the clients are not responsible for the raised errors; it 
is just that they happen to ask for something incompatible at the same 
time. So there is no user error per se, but the server is reporting its 
(temporary) inability to process what was asked for. For these errors, 
retrying is fine. If the client were alone, there would be no such errors: 
you cannot deadlock with yourself. This is really an isolation issue 
linked to parallel execution.
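In code terms this distinction boils down to the SQLSTATE of the error; a 
sketch (using libpq's PQresultErrorField(), with the standard codes 40001 
for serialization_failure and 40P01 for deadlock_detected) could be:

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>    /* PG_DIAG_SQLSTATE comes via postgres_ext.h */

    /* only errors caused by concurrent activity are worth retrying */
    static bool
    error_is_retryable(const PGresult *res)
    {
        const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

        if (sqlstate == NULL)
            return false;    /* e.g. connection-level failure */

        return strcmp(sqlstate, "40001") == 0 ||    /* serialization_failure */
               strcmp(sqlstate, "40P01") == 0;      /* deadlock_detected */
    }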

> Why not handle client errors that can occur (but they may also not 
> occur) the same way? (For example, always abort the client, or 
> conversely do not make aborts in these cases.) Here's an example of such 
> error:

> client 5 got an error in command 1 (SQL) of script 0; ERROR:  division by zero

This is an interesting case. For me we must stop the script because the 
client is asking for something "stupid", and retrying the same won't 
change the outcome: the division will still be by zero. It is the client's 
responsibility not to ask for something stupid; the bench script is buggy, 
it should not submit illegal SQL queries. This is quite different from 
submitting something legal which happens to fail.

> Maybe we can limit the number of failures in one statement, and abort the 
> client if this limit is exceeded?...

I think this is quite debatable, and that the best option is to leave 
this point out of the current patch, so that we could have retry on 
serialization/deadlock errors.

Then you can submit another patch for a feature about other errors if you 
feel that there is a use case for going on in some cases. I think that the 
previous behavior made sense, and that changing it should only be 
considered as an option. As it involves discussion and is not obvious, 
later is better.

> To get a clue about the actual issue you can use the options 
> --failures-detailed (to find out out whether this is a serialization failure 
> / deadlock failure / other SQL failure / meta command failure) and/or 
> --print-errors (to get the complete error message).

Yep, but for me it should have stopped immediately, as it did before.

> If you use the option --latency-limit, the time of tries will be limited 
> regardless of the use of the option -t. Therefore ISTM that an unlimited 
> number of tries can be used only if the time of tries is limited by the 
> options -T and/or -L.

Indeed, I'm ok with forbidding unlimited retries when under -t.

>> I'm not sure that having "--debug" implying this option
>> is useful: As there are two distinct options, the user may be allowed
>> to trigger one or the other as they wish?
>
> I'm not sure that the main debugging output will give a good clue of what's 
> happened without full messages about errors, retries and failures...

I'm arguing more for letting the user decide what they want.

> These lines are quite long - do you suggest to wrap them this way?

Sure, if it is too long, then wrap.

>> Function getTransactionStatus name does not seem to correspond fully to 
>> what the function does. There is a passthru case which should be either 
>> avoided or clearly commented.
>
> I don't quite understand you - do you mean that in fact this function finds 
> out whether we are in a (failed) transaction block or not? Or do you mean 
> that the case of PQTRANS_INTRANS is also ok?...

The former: although the function is named "getTransactionStatus", it does 
not really return the "status" of the transaction (aka PQstatus()?).
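For illustration, what such a helper conceptually computes (a sketch built 
on libpq's PQtransactionStatus(), not the patch's actual function):

    #include <stdbool.h>
    #include <libpq-fe.h>

    /* report whether the connection is inside a transaction block,
       whether that block is still valid or already failed */
    static bool
    in_transaction_block(const PGconn *con)
    {
        switch (PQtransactionStatus(con))
        {
            case PQTRANS_INTRANS:    /* inside a valid transaction block */
            case PQTRANS_INERROR:    /* inside a failed transaction block */
                return true;
            case PQTRANS_IDLE:       /* no transaction in progress */
            case PQTRANS_ACTIVE:     /* a command is currently in flight */
            case PQTRANS_UNKNOWN:    /* e.g. bad connection */
            default:
                return false;
        }
    }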

> I tried not to break the limit of 80 characters, but if you think that this 
> is better, I'll change it.

Hmmm. 80 columns, indeed...

>> I'd insist in a comment that "cnt" does not include "skipped" transactions
>> (anymore).
>
> If you mean CState.cnt I'm not sure if this is practically useful because the 
> code uses only the sum of all client transactions including skipped and 
> failed... Maybe we can rename this field to nxacts or total_cnt?

I'm fine with renaming the field if it makes things clearer. They are all 
counters, so naming them "cnt" or "total_cnt" does not help much. Maybe 
"succeeded" or "success" to show what is really counted?

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-09-2018 18:29, Fabien COELHO wrote:
> Hello Marina,
> 
>> Hmm, but we can say the same for serialization or deadlock errors that 
>> were not retried (the client test code itself could not run correctly 
>> or the SQL sent was somehow wrong, which is also the client's fault), 
>> can't we?
> 
> I think not.
> 
> If a client asks for something "legal", but some other client in
> parallel happens to make an incompatible change which result in a
> serialization or deadlock error, the clients are not responsible for
> the raised errors, it is just that they happen to ask for something
> incompatible at the same time. So there is no user error per se, but
> the server is reporting its (temporary) inability to process what was
> asked for. For these errors, retrying is fine. If the client was
> alone, there would be no such errors, you cannot deadlock with
> yourself. This is really an isolation issue linked to parallel
> execution.

You can get other errors that cannot happen with only one client if you 
use shell commands in meta commands:

starting vacuum...end.
transaction type: pgbench_meta_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
maximum number of tries: 1
latency average = 6.953 ms
tps = 287.630161 (including connections establishing)
tps = 303.232242 (excluding connections establishing)
statement latencies in milliseconds and failures:
          1.636           0  BEGIN;
          1.497           0  \setshell var mkdir my_directory && echo 1
          0.007           0  \sleep 1 us
          1.465           0  \setshell var rmdir my_directory && echo 1
          1.622           0  END;

starting vacuum...end.
mkdir: cannot create directory ‘my_directory’: File exists
mkdir: could not read result of shell command
client 1 got an error in command 1 (setshell) of script 0; execution of 
meta-command failed
transaction type: pgbench_meta_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 19/20
number of failures: 1 (5.000%)
number of meta-command failures: 1 (5.000%)
maximum number of tries: 1
latency average = 11.782 ms (including failures)
tps = 161.269033 (including connections establishing)
tps = 167.733278 (excluding connections establishing)
statement latencies in milliseconds and failures:
          2.731           0  BEGIN;
          2.909           1  \setshell var mkdir my_directory && echo 1
          0.231           0  \sleep 1 us
          2.366           0  \setshell var rmdir my_directory && echo 1
          2.664           0  END;

Or if you use untrusted procedural languages in SQL expressions (see the 
attached file):

starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
client 1 got an error in command 0 (SQL) of script 0; ERROR:  could not 
create the directory "my_directory": File exists at line 3.
CONTEXT:  PL/Perl anonymous code block

client 1 got an error in command 0 (SQL) of script 0; ERROR:  could not 
create the directory "my_directory": File exists at line 3.
CONTEXT:  PL/Perl anonymous code block

transaction type: pgbench_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 18/20
number of failures: 2 (10.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 2 (10.000%)
maximum number of tries: 1
latency average = 3.282 ms (including failures)
tps = 548.437196 (including connections establishing)
tps = 637.662753 (excluding connections establishing)
statement latencies in milliseconds and failures:
          1.566           2  DO $$

starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
transaction type: pgbench_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
maximum number of tries: 1
latency average = 2.760 ms
tps = 724.746078 (including connections establishing)
tps = 853.131985 (excluding connections establishing)
statement latencies in milliseconds and failures:
          1.893           0  DO $$

Or if you try to create a function and perhaps replace an existing one:

starting vacuum...end.
client 0 got an error in command 0 (SQL) of script 0; ERROR:  duplicate 
key value violates unique constraint "pg_proc_proname_args_nsp_index"
DETAIL:  Key (proname, proargtypes, pronamespace)=(my_function, , 2200) 
already exists.

client 0 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 0 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 0 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

transaction type: pgbench_create_function.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/20
number of failures: 10 (50.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 10 (50.000%)
maximum number of tries: 1
latency average = 82.881 ms (including failures)
tps = 12.065492 (including connections establishing)
tps = 12.092216 (excluding connections establishing)
statement latencies in milliseconds and failures:
         82.549          10  CREATE OR REPLACE FUNCTION my_function() 
RETURNS integer AS 'select 1;' LANGUAGE SQL;

>> Why not handle client errors that can occur (but they may also not 
>> occur) the same way? (For example, always abort the client, or 
>> conversely do not make aborts in these cases.) Here's an example of 
>> such error:
> 
>> client 5 got an error in command 1 (SQL) of script 0; ERROR:  division 
>> by zero
> 
> This is an interesting case. For me we must stop the script because
> the client is asking for something "stupid", and retrying the same
> won't change the outcome, the division will still be by zero. It is
> the client responsability not to ask for something stupid, the bench
> script is buggy, it should not submit illegal SQL queries. This is
> quite different from submitting something legal which happens to fail.
> ...
>>> I'm not sure that having "--debug" implying this option
>>> is useful: As there are two distinct options, the user may be allowed
>>> to trigger one or the other as they wish?
>> 
>> I'm not sure that the main debugging output will give a good clue of 
>> what's happened without full messages about errors, retries and 
>> failures...
> 
> I'm more argumenting about letting the user decide what they want.
> 
>> These lines are quite long - do you suggest to wrap them this way?
> 
> Sure, if it is too long, then wrap.

Ok!

>>> Function getTransactionStatus name does not seem to correspond fully 
>>> to what the function does. There is a passthru case which should be 
>>> either avoided or clearly commented.
>> 
>> I don't quite understand you - do you mean that in fact this function 
>> finds out whether we are in a (failed) transaction block or not? Or do 
>> you mean that the case of PQTRANS_INTRANS is also ok?...
> 
> The former: although the function is named "getTransactionStatus", it
> does not really return the "status" of the transaction (aka
> PQstatus()?).

Thank you, I'll think about how to improve it. Perhaps the name 
checkTransactionStatus would be better...

>>> I'd insist in a comment that "cnt" does not include "skipped" 
>>> transactions
>>> (anymore).
>> 
>> If you mean CState.cnt I'm not sure if this is practically useful 
>> because the code uses only the sum of all client transactions 
>> including skipped and failed... Maybe we can rename this field to 
>> nxacts or total_cnt?
> 
> I'm fine with renaming the field if it makes thinks clearer. They are
> all counters, so naming them "cnt" or "total_cnt" does not help much.
> Maybe "succeeded" or "success" to show what is really counted?

Perhaps renaming StatsData.cnt is better than just adding a comment 
to this field. But IMO we have the same problem ("They are all counters, 
so naming them "cnt" or "total_cnt" does not help much.") with CState.cnt, 
which cannot be named in the same way because it also includes skipped 
and failed transactions.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> You can get other errors that cannot happen for only one client if you use 
> shell commands in meta commands:

> Or if you use untrusted procedural languages in SQL expressions (see the used 
> file in the attachments):

> Or if you try to create a function and perhaps replace an existing one:

Sure. Indeed there can be shell errors, perl errors, create function 
conflicts... I do not understand what your point is wrt these.

I'm mostly saying that your patch should focus on implementing the retry 
feature when appropriate, and avoid changing the behavior (error 
displayed, abort or not) on features unrelated to serialization & deadlock 
errors.

Maybe there are inconsistencies, and "bugs"/"features" worth fixing, but if 
so that should be a separate patch, if possible, and if these are bugs 
they could be backpatched.

For now I'm still convinced that pgbench should keep on aborting on "\set" 
or SQL syntax errors, and show clear error messages on these, and your 
examples have not changed my mind on that point.

>> I'm fine with renaming the field if it makes thinks clearer. They are
>> all counters, so naming them "cnt" or "total_cnt" does not help much.
>> Maybe "succeeded" or "success" to show what is really counted?
>
> Perhaps renaming of StatsData.cnt is better than just adding a comment to 
> this field. But IMO we have the same problem (They are all counters, so 
> naming them "cnt" or "total_cnt" does not help much.) for CState.cnt which 
> cannot be named in the same way because it also includes skipped and failed 
> transactions.

Hmmm. CState's cnt seems only used to implement -t anyway? I'm okay if it 
has a different name, esp if it has different semantics. I think I was 
arguing only about cnt in StatsData.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 12-09-2018 17:04, Fabien COELHO wrote:
> Hello Marina,
> 
>> You can get other errors that cannot happen for only one client if you 
>> use shell commands in meta commands:
> 
>> Or if you use untrusted procedural languages in SQL expressions (see 
>> the used file in the attachments):
> 
>> Or if you try to create a function and perhaps replace an existing 
>> one:
> 
> Sure. Indeed there can be shell errors, perl errors, create functions
> conflicts... I do not understand what is your point wrt these.
> 
> I'm mostly saying that your patch should focus on implementing the
> retry feature when appropriate, and avoid changing the behavior (error
> displayed, abort or not) on features unrelated to serialization &
> deadlock errors.
> 
> Maybe there are inconsistencies, and "bug"/"feature" worth fixing, but
> if so that should be a separate patch, if possible, and if these are
> bugs they could be backpatched.
> 
> For now I'm still convinced that pgbench should keep on aborting on
> "\set" or SQL syntax errors, and show clear error messages on these,
> and your examples have not changed my mind on that point.
> 
>>> I'm fine with renaming the field if it makes thinks clearer. They are
>>> all counters, so naming them "cnt" or "total_cnt" does not help much.
>>> Maybe "succeeded" or "success" to show what is really counted?
>> 
>> Perhaps renaming of StatsData.cnt is better than just adding a comment 
>> to this field. But IMO we have the same problem (They are all 
>> counters, so naming them "cnt" or "total_cnt" does not help much.) for 
>> CState.cnt which cannot be named in the same way because it also 
>> includes skipped and failed transactions.
> 
> Hmmm. CState's cnt seems only used to implement -t anyway? I'm okay if
> it has a different name, esp if it has a different semantics.

Ok!

> I think
> I was arguing only about cnt in StatsData.

The discussion about this has become entangled from the beginning, 
because as I wrote in [1] at first I misread your original proposal...

[1] 
https://www.postgresql.org/message-id/d318cdee8f96de6b1caf2ce684ffe4db%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Michael Paquier
Дата:
On Wed, Sep 12, 2018 at 06:12:29PM +0300, Marina Polyakova wrote:
> The discussion about this has become entangled from the beginning, because
> as I wrote in [1] at first I misread your original proposal...

The last emails are about the last reviews of Fabien, which have remained
unanswered for the last couple of weeks.  I am marking this patch as
returned with feedback for now.
--
Michael

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Sep-05, Marina Polyakova wrote:

> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> - a patch for the RandomState structure (this is used to reset a client's
> random seed during the repeating of transactions after
> serialization/deadlock failures).

Pushed this one with minor stylistic changes (the most notable of which
is the move of initRandomState to where the rest of the random generator
infrastructure is, instead of in a totally random place).  Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 2018-11-16 22:59, Alvaro Herrera wrote:
> On 2018-Sep-05, Marina Polyakova wrote:
> 
>> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's
>> random seed during the repeating of transactions after
>> serialization/deadlock failures).
> 
> Pushed this one with minor stylistic changes (the most notable of which
> is the move of initRandomState to where the rest of the random 
> generator
> infrastructure is, instead of in a totally random place).  Thanks,

Thank you very much! I'm going to send a new patch set by the end of 
this week (I'm sorry, I was very busy with the release of Postgres Pro 
11...).

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Nov-19, Marina Polyakova wrote:

> On 2018-11-16 22:59, Alvaro Herrera wrote:
> > On 2018-Sep-05, Marina Polyakova wrote:
> > 
> > > v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> > > - a patch for the RandomState structure (this is used to reset a
> > > client's
> > > random seed during the repeating of transactions after
> > > serialization/deadlock failures).
> > 
> > Pushed this one with minor stylistic changes (the most notable of which
> > is the move of initRandomState to where the rest of the random generator
> > infrastructure is, instead of in a totally random place).  Thanks,
> 
> Thank you very much! I'm going to send a new patch set until the end of this
> week (I'm sorry I was very busy in the release of Postgres Pro 11...).

Great, thanks.

I also think that the pgbench_error() patch should go in before the main
one.  It seems a bit pointless to introduce code using a bad API only to
fix the API together with all the new callers immediately afterwards.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Alvaro,

> I also think that the pgbench_error() patch should go in before the main
> one.  It seems a bit pointless to introduce code using a bad API only to
> fix the API together with all the new callers immediately afterwards.

I'm not that keen on this part of the patch, because ISTM that it introduces 
significant and possibly costly malloc/free cycles when handling errors, 
which do not currently exist in pgbench.

Previously an error was basically the end of the script, but with the 
feature being introduced by Marina some errors are handled, in which case 
we end up paying these costs in the test loop. Also, refactoring 
error handling is not necessary for the new feature. That is why I advised 
moving it out and possibly keeping it for later.

Related to Marina's patch (triggered by reviewing the patches), I have 
submitted a refactoring patch which aims at cleaning up the internal state 
machine, so that additions, and checking that all is well, are simpler.

     https://commitfest.postgresql.org/20/1754/

It has been reviewed; I think I answered the reviewer's concerns, but the 
reviewer did not update the patch state on the cf app, so I do not know 
whether he is unsatisfied or if it was just forgotten.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Nov-19, Fabien COELHO wrote:

> 
> Hello Alvaro,
> 
> > I also think that the pgbench_error() patch should go in before the main
> > one.  It seems a bit pointless to introduce code using a bad API only to
> > fix the API together with all the new callers immediately afterwards.
> 
> I'm not that keen on this part of the patch, because ISTM that introduces
> significant and possibly costly malloc/free cycles when handling error,
> which do not currently exist in pgbench.

Oh, I wasn't aware of that.

> Related to Marina patch (triggered by reviewing the patches), I have
> submitted a refactoring patch which aims at cleaning up the internal state
> machine, so that additions and checking that all is well is simpler.
> 
>     https://commitfest.postgresql.org/20/1754/

Let me look at this one.

> It has been reviewed, I think I answered to the reviewer concerns, but the
> reviewer did not update the patch state on the cf app, so I do not know
> whether he is unsatisfied or if it was just forgotten.

Feel free to update a patch status to "needs review" yourself after
submitting a new version that in your opinion responds to a reviewer's
comments.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> Feel free to update a patch status to "needs review" yourself after
> submitting a new version that in your opinion respond to a reviewer's
> comments.

Sure, I do that. But I will not switch any of my patches to "Ready". AFAICR 
the concerns were mostly about imprecise comments in the code, and a few 
questions that I answered.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Thomas Munro
Дата:
On Mon, Mar 9, 2020 at 10:00 AM Marina Polyakova
<m.polyakova@postgrespro.ru> wrote:
> On 2018-11-16 22:59, Alvaro Herrera wrote:
> > On 2018-Sep-05, Marina Polyakova wrote:
> >
> >> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> >> - a patch for the RandomState structure (this is used to reset a
> >> client's
> >> random seed during the repeating of transactions after
> >> serialization/deadlock failures).
> >
> > Pushed this one with minor stylistic changes (the most notable of which
> > is the move of initRandomState to where the rest of the random
> > generator
> > infrastructure is, instead of in a totally random place).  Thanks,
>
> Thank you very much! I'm going to send a new patch set until the end of
> this week (I'm sorry I was very busy in the release of Postgres Pro
> 11...).

Is anyone interested in rebasing this, and summarising what needs to
be done to get it in?  It's arguably a bug or at least quite
unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
that a couple of forks already ship Marina's patch set.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Thomas,

>> Thank you very much! I'm going to send a new patch set until the end of
>> this week (I'm sorry I was very busy in the release of Postgres Pro
>> 11...).
>
> Is anyone interested in rebasing this, and summarising what needs to
> be done to get it in?  It's arguably a bug or at least quite
> unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> that a couple of forks already ship Marina's patch set.

I'm a reviewer on this patch, which I find a good thing (tm) and which was 
converging to a reasonable and simple enough addition, IMHO.

If I proceed in place of Marina, who is going to do the reviews?

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Thomas Munro
Дата:
On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> >> Thank you very much! I'm going to send a new patch set until the end of
> >> this week (I'm sorry I was very busy in the release of Postgres Pro
> >> 11...).
> >
> > Is anyone interested in rebasing this, and summarising what needs to
> > be done to get it in?  It's arguably a bug or at least quite
> > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> > that a couple of forks already ship Marina's patch set.
>
> I'm a reviewer on this patch, that I find a good thing (tm), and which was
> converging to a reasonable and simple enough addition, IMHO.
>
> If I proceed in place of Marina, who is going to do the reviews?

Hi Fabien,

Cool.  I'll definitely take it for a spin if you post a fresh patch
set.  Any place that we arbitrarily don't support SERIALIZABLE, I
consider a bug, so I'd like to commit this if we can agree it's ready.
It sounds like it's actually in pretty good shape.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hi hackers,

On Tue, 10 Mar 2020 09:48:23 +1300
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > >> Thank you very much! I'm going to send a new patch set until the end of
> > >> this week (I'm sorry I was very busy in the release of Postgres Pro
> > >> 11...).
> > >
> > > Is anyone interested in rebasing this, and summarising what needs to
> > > be done to get it in?  It's arguably a bug or at least quite
> > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> > > that a couple of forks already ship Marina's patch set.

I got interested in this and am now looking into the patch and the past discussion. 
If no one else will do it and there are no objections, I would like to rebase
this. Is that okay?

Regards,
Yugo NAGATA

> >
> > I'm a reviewer on this patch, that I find a good thing (tm), and which was
> > converging to a reasonable and simple enough addition, IMHO.
> >
> > If I proceed in place of Marina, who is going to do the reviews?
> 
> Hi Fabien,
> 
> Cool.  I'll definitely take it for a spin if you post a fresh patch
> set.  Any place that we arbitrarily don't support SERIALIZABLE, I
> consider a bug, so I'd like to commit this if we can agree it's ready.
> It sounds like it's actually in pretty good shape.


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hi hackers,

On Mon, 24 May 2021 11:29:10 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> Hi hackers,
> 
> On Tue, 10 Mar 2020 09:48:23 +1300
> Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> > On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > > >> Thank you very much! I'm going to send a new patch set until the end of
> > > >> this week (I'm sorry I was very busy in the release of Postgres Pro
> > > >> 11...).
> > > >
> > > > Is anyone interested in rebasing this, and summarising what needs to
> > > > be done to get it in?  It's arguably a bug or at least quite
> > > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> > > > that a couple of forks already ship Marina's patch set.
> 
> I got interested in this and am now looking into the patch and the past discussion.
> If no one else will do it and there are no objections, I would like to rebase
> this. Is that okay?

I rebased and fixed the previous patches (v11) written by Marina Polyakova,
and attached the revised version (v12).

v12-0001-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v12-0002-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

These are the revised versions of v11-0002 and v11-0003. v11-0001
(for the RandomState structure) is not included because it has already
been committed (40923191944). v11-0004 (for a separate error reporting
function) is not included either because pgbench now uses the common logging
APIs (30a3e772b40).

In addition to rebasing on master, I updated the patch according to the
review from Fabien COELHO [1] and the discussions that followed. I also added
some other fixes from my own review of the previous patch.

[1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809081450100.10506%40lancre

The following are fixes based on Fabien's review.

> * Features

> As far as the actual retry feature is concerned, I'd say we are nearly 
> there. However I have an issue with changing the behavior on meta command 
> and other sql errors, which I find not desirable.
...
> I do not think that these changes of behavior are desirable. Meta command and
> miscellaneous SQL errors should result in immediatly aborting the whole run,
> because the client test code itself could not run correctly or the SQL sent
> was somehow wrong, which is also the client's fault, and the server 
> performance bench does not make much sense in such conditions.
> 
> ISTM that the focus of this patch should only be to handle some server 
> runtime errors that can be retryed, but not to change pgbench behavior on 
> other kind of errors. If these are to be changed, ISTM that it would be a 
> distinct patch and would require some discussion, and possibly an option 
> to enable it or not if some use case emerge. AFA this patch is concerned, 
> I'd suggest to let that out.

Previously, all SQL and meta command errors could be retried, but I changed
this so that only serialization and deadlock errors can be retried.
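To make that scope concrete, here is a minimal sketch (not the patch's actual code) of such a classification. The SQLSTATE values are the standard PostgreSQL codes for serialization_failure and deadlock_detected; the function name and the idea of reading the code from the libpq result (PG_DIAG_SQLSTATE) are only illustrative assumptions:

    #include <stdbool.h>
    #include <string.h>

    /* Sketch only: decide whether an error is worth retrying.  PostgreSQL
     * reports serialization failures as SQLSTATE 40001 and deadlocks as
     * 40P01; everything else aborts the client as before. */
    static bool
    canRetryError(const char *sqlState)
    {
        return sqlState != NULL &&
               (strcmp(sqlState, "40001") == 0 ||   /* serialization_failure */
                strcmp(sqlState, "40P01") == 0);    /* deadlock_detected */
    }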

> Doc says "you cannot use an infinite number of retries without latency-limit..."
> 
> Why should this be forbidden? At least if -T timeout takes precedent and
> shortens the execution, ISTM that there could be good reason to test that.
> Maybe it could be blocked only under -t if this would lead to an non-ending
> run.

I changed this to allow --max-tries to be used with the -T option even if
--latency-limit is not used.

> As "--print-errors" is really for debug, maybe it could be named
> "--debug-errors". I'm not sure that having "--debug" implying this option
> is useful: As there are two distinct options, the user may be allowed
> to trigger one or the other as they wish?

--print-errors was renamed to --debug-errors.

> makeVariableValue error message is not for debug, but must be kept in all
> cases, and the false returned must result in an immediate abort. Same thing about
> lookupCreateVariable, an invalid name is a user error which warrants an immediate
> abort. Same thing again about coerce* functions or evalStandardFunc...
> Basically, most/all added "debug_level >= DEBUG_ERRORS" are not desirable.

"DEBUG_ERRORS" messages unrelated to serialization & deadlock errors were removed.

> sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is
> really about the transaction script, not dealing with errors, esp as there is no
> significant advantage in preparing a "ROLLBACK" statement which is short and has
> no parameters. I'd suggest to remove this function and just issue
> PQsendQuery("ROLLBACK;") in all cases.

Now, we just issue PQsendQuery("ROLLBACK;").

> In copyVariables, I'd simplify
>
>  + if (source_var->svalue == NULL)
>  +   dest_var->svalue = NULL;
>  + else
>  +   dest_var->svalue = pg_strdup(source_var->svalue);
>
>as:
>   dest_var->value = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue);

Fixed using a ternary operator.

>  + if (sqlState)   ->   if (sqlState != NULL) ?

Fixed.

> Function getTransactionStatus name does not seem to correspond fully to what the
> function does. There is a passthru case which should be either avoided or
> clearly commented.

This was renamed to checkTransactionStatus according to [2].

[2] https://www.postgresql.org/message-id/c262e889315625e0fc0d77ca78fe2eac%40postgrespro.ru

>  - commandFailed(st, "SQL", "perhaps the backend died while processing");
>  + clientAborted(st,
>  +              "perhaps the backend died while processing");
>
> keep on one line?

The change that replaced commandFailed with clientAborted was removed
(see below).

>  + if (doRetry(st, &now))
>  +   st->state = CSTATE_RETRY;
>  + else
>  +   st->state = CSTATE_FAILURE;
>
> -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;

Fixed using a ternary operator.

> * Comments

> "There're different types..." -> "There are different types..."
> "after the errors and"... -> "after errors and"...
> "the default value of max_tries is set to 1" -> "the default value
> of max_tries is 1"
> "We cannot retry the transaction" -> "We cannot retry a transaction"
> "may ultimately succeed or get a failure," -> "may ultimately succeed or fail,"

Fixed.

> Overall, the comment text in StatsData is very clear. However they are not
> clearly linked to the struct fields. I'd suggest that earch field when used
> should be quoted, so as to separate English from code, and the struct name
> should always be used explicitely when possible.

The comment in StatsData was fixed to clarify what each field in this struct
represents.

> I'd insist in a comment that "cnt" does not include "skipped" transactions
> (anymore).

StatsData.cnt has a comment "number of successful transactions, not including
'skipped'", and CState.cnt has a comment "skipped and failed transactions are
also counted here".
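For readers without the patch at hand, a rough sketch of how these counters could fit together; field names beyond "cnt" and "skipped" are assumptions rather than the patch's exact layout:

    #include <stdint.h>

    /* Sketch only: per-interval statistics once failures and retries are
     * tracked.  "cnt" counts successful transactions and does NOT include
     * skipped ones; failed transactions are counted separately. */
    typedef struct StatsDataSketch
    {
        int64_t cnt;                    /* successful transactions, excluding skipped */
        int64_t skipped;                /* skipped under --latency-limit */
        int64_t retries;                /* total retries over all transactions */
        int64_t retried;                /* transactions retried at least once */
        int64_t serialization_failures; /* transactions that finally failed on serialization */
        int64_t deadlock_failures;      /* transactions that finally failed on deadlock */
    } StatsDataSketch;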

> * Documentation:

> ISTM that there are too many "the":
>   - "turns on the option ..." -> "turns on option ..."
>   - "When the option ..." -> "When option ..."
>   - "By default the option ..." -> "By default option ..."
>   - "only if the option ..." -> "only if option ..."
>   - "combined with the option ..." -> "combined with option ..."
>   - "without the option ..." -> "without option ..."

The previous patch used a lot of "the option xxxx", but I changed these
to "the xxxx option" because I found that the documentation
refers to options that way. For example,

- You can (and, for most purposes, probably should) increase the number
  of rows by using the <option>-s</option> (scale factor) option. 
- The prefix can be changed by using the <option>--log-prefix</option> option.
- If the <option>-j</option> option is 2 or higher, so that there are multiple
  worker threads,

>   - "is the sum of all the retries" -> "is the sum of all retries"
> "infinite" -> "unlimited" 
> "not retried at all" -> "not retried" (maybe several times). 
> "messages of all errors" -> "messages about all errors". 
> "It is assumed that the scripts used do not contain" ->
> "It is assumed that pgbench scripts do not contai

Fixed.


The following are additional fixes based on my review of the previous patch.

* About error reporting

In the previous patch, commandFailed() was changed to report an error
that doesn't immediately abort the client, and clientAborted() was
added to report an abort of the client. In the attached patch,
the behaviour around errors other than serialization and deadlock is
not changed and such errors cause the client to abort, so commandFailed()
is used without any changes to report a client abort, and commandError()
is added to report an error that can be retried under --debug-errors.
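A minimal sketch of the split being described; plain fprintf stands in for pgbench's logging calls and the signatures are assumptions, not the actual API:

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch only: a fatal error; the client is aborted as before. */
    static void
    commandFailedSketch(int client_id, const char *cmd, const char *message)
    {
        fprintf(stderr, "client %d aborted in command \"%s\": %s\n",
                client_id, cmd, message);
    }

    /* Sketch only: a retryable (serialization/deadlock) error; the message
     * is shown only when the error-verbosity option is enabled. */
    static void
    commandErrorSketch(int client_id, const char *cmd, const char *message,
                       bool debug_errors)
    {
        if (debug_errors)
            fprintf(stderr, "client %d got a retryable error in command \"%s\": %s\n",
                    client_id, cmd, message);
    }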

* About progress reporting

In the previous patch, the number of failures was reported only when some
transaction failed, and retry statistics were reported only when some
transaction was retried. This means the number of columns in the report
differed depending on the interval, which was odd and made the output
harder to parse.

In the attached patch, the number of failures is always reported, and
the retry statistics are reported when max-tries is not 1.

* About result outputs

In the previous patch, the number of failed transactions, the number
of retried transactions, and the total number of retries were reported
as:

 number of failures: 324 (3.240%)
 ...
 number of retried: 5629 (56.290%)
 number of retries: 103299

I think this was confusing. In particular, it was unclear to me what
"retried" and "retries" represent respectively. Therefore, in the
attached patch, they are reported as:

 number of transactions failed: 324 (3.240%)
 ...
 number of transactions retried: 5629 (56.290%)
 number of total retries: 103299

which clarifies that the first two are numbers of transactions and the
last one is the number of retries over all transactions.

* About average connection time

In the previous patch, this was calculated as "conn_total_duration / total->cnt",
where conn_total_duration is the cumulative connection time summed over threads and
total->cnt is the number of transactions that were successfully processed.

However, the average connection time could be overestimated because
conn_total_duration includes the connection time of transactions that failed
due to serialization and deadlock errors. So, in the attached patch,
this is calculated as "conn_total_duration / (total->cnt + failures)".
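As a sketch of the corrected formula (the names are assumptions; the point is only that failed transactions are added to the denominator):

    #include <stdint.h>

    /* Sketch only: average connection time per started transaction,
     * counting both successful and failed ones in the denominator. */
    static double
    average_connection_time(int64_t conn_total_duration, int64_t cnt, int64_t failures)
    {
        if (cnt + failures == 0)
            return 0.0;
        return (double) conn_total_duration / (double) (cnt + failures);
    }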


Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

Thanks a lot for continuing this work started by Marina!

I'm planning to review it for the July CF. I've just added an entry there:

     https://commitfest.postgresql.org/33/3194/

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

On Tue, 22 Jun 2021 20:03:58 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san,
> 
> Thanks a lot for continuing this work started by Marina!
> 
> I'm planning to review it for the July CF. I've just added an entry there:
> 
>      https://commitfest.postgresql.org/33/3194/

Thanks!

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san:

# About v12.1

This is a refactoring patch, which creates a separate structure for 
holding variables. This will become handy in the next patch. There is also 
a benefit from a software engineering point of view, so it has merit on 
its own.

## Compilation

Patch applies cleanly, compiles, global & local checks pass.

## About the code

Fine.

I'm wondering whether we could use "vars" instead of "variables" as a 
struct field name and function parameter name, so that it is shorter and 
more distinct from the type name "Variables". What do you think?

## About comments

Remove the comment on enlargeVariables about "It is assumed …" the issue 
of trying MAXINT vars is more than remote and is not worth mentioning. In 
the same function, remove the comments about MARGIN, it is already on the 
macro declaration, once is enough.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 23 Jun 2021 10:38:43 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san:
> 
> # About v12.1
> 
> This is a refactoring patch, which creates a separate structure for 
> holding variables. This will become handy in the next patch. There is also 
> a benefit from a software engineering point of view, so it has merit on 
> its own.

> ## Compilation
> 
> Patch applies cleanly, compiles, global & local checks pass.
> 
> ## About the code
> 
> Fine.
> 
> I'm wondering whether we could use "vars" instead of "variables" as a 
> struct field name and function parameter name, so that it is shorter and 
> more distinct from the type name "Variables". What do you think?

The struct "Variables" has a field named "vars" which is an array of
"Variable" type. I guess this is a reason why "variables" is used instead
of "vars" as a name of "Variables" type variable so that we could know
a variable's type is Variable or Variables.  Also, in order to refer to
the field, we would use

 vars->vars[vars->nvars]

and there are nested "vars". Could this make a codereader confused?
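For reference, a minimal sketch of the two levels being discussed; the real definitions in the patch may differ in detail:

    #include <stdbool.h>

    /* Sketch only: one client variable, a name with a string value. */
    typedef struct Variable
    {
        char *name;     /* variable's name */
        char *svalue;   /* its value in string form, if known */
    } Variable;

    /* Sketch only: the wrapper introduced by the patch, so that the whole
     * set can be saved and restored around a retried transaction. */
    typedef struct Variables
    {
        Variable *vars;         /* array of variable definitions */
        int       nvars;        /* number of variables */
        bool      vars_sorted;  /* are variables sorted by name? */
    } Variables;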


> ## About comments
> 
> Remove the comment on enlargeVariables about "It is assumed …" the issue 
> of trying MAXINT vars is more than remote and is not worth mentioning. In 
> the same function, remove the comments about MARGIN, it is already on the 
> macro declaration, once is enough.

Sure. I'll remove them.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

>> I'm wondering whether we could use "vars" instead of "variables" as a
>> struct field name and function parameter name, so that it is shorter and
>> more distinct from the type name "Variables". What do you think?
>
> The struct "Variables" has a field named "vars" which is an array of
> "Variable" type. I guess this is a reason why "variables" is used instead
> of "vars" as a name of "Variables" type variable so that we could know
> a variable's type is Variable or Variables.  Also, in order to refer to
> the field, we would use
>
> vars->vars[vars->nvars]
>
> and there are nested "vars". Could this make a codereader confused?

Hmmm… Probably. Let's keep "variables" then.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

# About v12.2

## Compilation

Patch seems to apply cleanly with "git apply", but does not compile on my 
host: "undefined reference to `conditional_stack_reset'".

However it works better when using the "patch". I'm wondering why git 
apply fails silently…

When compiling there are warnings about "pg_log_fatal", which does not 
expect a FILE* on pgbench.c:4453. Remove the "stderr" argument.

Global and local checks ok.

> number of transactions failed: 324 (3.240%)
> ...
> number of transactions retried: 5629 (56.290%)
> number of total retries: 103299

I'd suggest: "number of failed transactions". "total number of retries" or 
just "number of retries"?

## Feature

The overall code structure changes to implement the feature seem 
reasonable to me, as we are at the 12th iteration of the patch.

Comments below are somehow about details and asking questions
about choices, and commenting…

## Documentation

There is a lot of documentation, which is good. I'll review these
separately. It looks good, but having a native English speaker/writer
would really help!

Some output examples do not correspond to actual output for
the current version. In particular, there is always one TPS figure
given now, instead of the confusing two shown before.

## Comments

transactinos -> transactions.

## Code

By default max_tries = 0. Should not the initialization be 1,
as the documentation argues that it is the default?

Counter comments, missing + in the formula on the skipped line.

Given that we manage errors, ISTM that we should not necessarily
stop on other not retried errors, but rather count/report them and
possibly proceed.  Eg with something like:

   -- server side random fail
   DO LANGUAGE plpgsql $$
   BEGIN
     IF RANDOM() < 0.1 THEN
       RAISE EXCEPTION 'unlucky!';
     END IF;
   END;
   $$;

Or:

   -- client side random fail
   BEGIN;
   \if random(1, 10) <= 1
   SELECT 1 +;
   \else
   SELECT 2;
   \endif
   COMMIT;

We could count the fail, rollback if necessary, and go on.  What do you think?
Maybe such behavior would deserve an option.

--report-latencies -> --report-per-command: should we keep supporting
the previous option?

--failures-detailed: if we bother to run with handling failures, should
it always be on?

--debug-errors: I'm not sure we should want a special debug mode for that,
I'd consider integrating it with the standard debug, or just for development.
Also, should it use pg_log_debug?

doRetry: I'd separate the 3 no retries options instead of mixing max_tries and
timer_exceeded, for clarity.

Tries vs retries: I'm at odds with having tries & retries and + 1 here
and there to handle that, which is a little bit confusing. I'm wondering whether
we could only count "tries" and adjust to report what we want later?

advanceConnectionState: ISTM that ERROR should logically be before others which
lead to it.

Variables management: it looks expensive, with copying and freeing variable arrays.
I'm wondering whether we should think of something more clever. Well, that would be
for some other patch.

"Accumulate the retries" -> "Count (re)tries"?

Currently, ISTM that the retry on error mode is implicitly always on.
Do we want that? I'd say yes, but maybe people could disagree.

## Tests

There are tests, good!

I'm wondering whether something simpler could be devised to trigger
serialization or deadlock errors, eg with a SEQUENCE and an \if.

See the attached files for generating deadlocks reliably (start with 2 clients).
What do you think? The PL/pgSQL version is minimal; it is really client-code 
oriented.

Given that deadlocks are detected about every second, the test runs
would take some time. Let it be for now.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

On Sat, 26 Jun 2021 12:15:38 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san,
> 
> # About v12.2
> 
> ## Compilation
> 
> Patch seems to apply cleanly with "git apply", but does not compile on my 
> host: "undefined reference to `conditional_stack_reset'".
> 
> However it works better when using the "patch". I'm wondering why git 
> apply fails silently…

Hmm, I don't know why your compilation fails... I can apply and compile
successfully using git.

> When compiling there are warnings about "pg_log_fatal", which does not 
> expect a FILE* on pgbench.c:4453. Remove the "stderr" argument.

Ok.

> Global and local checks ok.
> 
> > number of transactions failed: 324 (3.240%)
> > ...
> > number of transactions retried: 5629 (56.290%)
> > number of total retries: 103299
> 
> I'd suggest: "number of failed transactions". "total number of retries" or 
> just "number of retries"?

Ok. I fixed to use "number of failed transactions" and "total number of retries".

> ## Feature
> 
> The overall code structure changes to implement the feature seem 
> reasonable to me, as we are at the 12th iteration of the patch.
> 
> Comments below are somehow about details and asking questions
> about choices, and commenting…
> 
> ## Documentation
> 
> There is a lot of documentation, which is good. I'll review these
> separately. It looks good, but having a native English speaker/writer
> would really help!
> 
> Some output examples do not correspond to actual output for
> the current version. In particular, there is always one TPS figure
> given now, instead of the confusing two shown before.

Fixed.

> ## Comments
> 
> transactinos -> transactions.

Fixed.

> ## Code
> 
> By default max_tries = 0. Should not the initialization be 1,
> as the documentation argues that it is the default?

Ok. I fixed the default value to 1.

> Counter comments, missing + in the formula on the skipped line.

Fixed.

> Given that we manage errors, ISTM that we should not necessarily
> stop on other not retried errors, but rather count/report them and
> possibly proceed.  Eg with something like:
> 
>    -- server side random fail
>    DO LANGUAGE plpgsql $$
>    BEGIN
>      IF RANDOM() < 0.1 THEN
>        RAISE EXCEPTION 'unlucky!';
>      END IF;
>    END;
>    $$;
> 
> Or:
> 
>    -- client side random fail
>    BEGIN;
>    \if random(1, 10) <= 1
>    SELECT 1 +;
>    \else
>    SELECT 2;
>    \endif
>    COMMIT;
> 
> We could count the fail, rollback if necessary, and go on.  What do you think?
> Maybe such behavior would deserve an option.

This feature to count failures that could occur at runtime seems nice. However,
as discussed in [1], I think it is better to focus only on failures that can be
retried in this patch, and introduce the feature to handle other failures in a
separate patch.

[1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809121519590.13887%40lancre

> --report-latencies -> --report-per-command: should we keep supporting
> the previous option?

Ok. Although the option is now not only about latencies, considering users who
are using the existing option, I'm fine with this. I reverted it to the
previous name.

> --failures-detailed: if we bother to run with handling failures, should
> it always be on?

If we print other failures that cannot be retried in the future, it could produce a lot
of lines and might make some users who don't need details of failures annoyed.
Moreover, some users would always need information about detailed failures in the log,
and others would need only the total numbers of failures. 

Currently we handle only serialization and deadlock failures, so the number of
lines printed and the number of columns in the logging are not large even under
failures-detailed, but if we have a chance to handle other failures in the future,
ISTM adding this option makes sense considering users who would like simple
outputs.
 
> --debug-errors: I'm not sure we should want a special debug mode for that,
> I'd consider integrating it with the standard debug, or just for development.

I think --debug is a debug option for telling users about pgbench's internal
behaviour, that is, which client is doing what. On the other hand, --debug-errors
is for telling users what error caused a retry or a failure in detail. For
users who are not interested in pgbench's internal behaviour (sending a command, 
receiving a result, ...) but are interested in the actual errors raised while running 
the script, this option seems useful.

> Also, should it use pg_log_debug?

If we use pg_log_debug, the message is printed only under --debug.
Therefore, I fixed to use pg_log_info instead of pg_log_error or fprintf.
 
> doRetry: I'd separate the 3 no retries options instead of mixing max_tries and
> timer_exceeded, for clarity.

Ok. I fixed to separate them.
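A minimal sketch of the separated conditions; names such as timer_exceeded and the exact latency check are assumptions based on the discussion, not the patch's code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch only: decide whether a failed transaction should be tried again.
     * The three "do not retry" conditions are kept separate for clarity. */
    static bool
    doRetrySketch(uint32_t tries, uint32_t max_tries, bool timer_exceeded,
                  double latency_limit, double elapsed)
    {
        /* the limit on the number of tries is exhausted (0 means unlimited) */
        if (max_tries != 0 && tries >= max_tries)
            return false;
        /* the -T duration has expired */
        if (timer_exceeded)
            return false;
        /* the transaction has already exceeded --latency-limit */
        if (latency_limit > 0 && elapsed >= latency_limit)
            return false;
        return true;
    }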
 
> Tries vs retries: I'm at odds with having tries & retries and + 1 here
> and there to handle that, which is a little bit confusing. I'm wondering whether
> we could only count "tries" and adjust to report what we want later?

I fixed to use "tries" instead of "retries" in CState. However, we still use
"retries" in StatsData and Command because the number of retries is printed
in the final result. Is it less confusing than the previous?
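A small sketch of the "count tries, report retries" bookkeeping being discussed; names are assumptions:

    #include <stdint.h>

    /* Sketch only: when a transaction ends, fold the client's per-transaction
     * try counter into the per-script retry statistics.  The number of
     * retries is the number of tries minus the first attempt. */
    static void
    accountTriesSketch(int64_t *stats_retries, int64_t *stats_retried, uint32_t tries)
    {
        if (tries > 1)
        {
            *stats_retries += tries - 1;  /* total retries over all transactions */
            *stats_retried += 1;          /* this transaction was retried at least once */
        }
    }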

> advanceConnectionState: ISTM that ERROR should logically be before others which
> lead to it.

Sorry, I couldn't understand your suggestion. Is this about the order of case
statements or pg_log_error?
 
> Variables management: it looks expensive, with copying and freeing variable arrays.
> I'm wondering whether we should think of something more clever. Well, that would be
> for some other patch.

Well, indeed there may be a more efficient way. For example, instead of clearing all
vars in dest, it might be possible to copy or clear only the part that differs between
dest and source and leave the unchanged part of dest alone. Anyway, I think this work
should be done in another patch.
 
> "Accumulate the retries" -> "Count (re)tries"?

Fixed.
 
> Currently, ISTM that the retry on error mode is implicitly always on.
> Do we want that? I'd say yes, but maybe people could disagree.

The default value of max-tries is 1, so the retry on error is off.
Failed transactions are retried only when the user wants it and
specifies a valid value for max-tries.
 
> ## Tests
> 
> There are tests, good!
> 
> I'm wondering whether something simpler could be devised to trigger
> serialization or deadlock errors, eg with a SEQUENCE and an \if.
> 
> See the attached files for generating deadlocks reliably (start with 2 clients).
> What do you think? The PL/pgSQL version is minimal; it is really client-code 
> oriented.
> 
> Given that deadlocks are detected about every second, the test runs
> would take some time. Let it be for now.

Sorry, but I cannot find the attached file. I don't have a good idea 
for a simpler test for now, but I can fix the test based on your idea
after getting the file.


I attached the updated patch according to your suggestions.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

Thanks for the update!

>> Patch seems to apply cleanly with "git apply", but does not compile on my
>> host: "undefined reference to `conditional_stack_reset'".
>>
>> However it works better when using the "patch". I'm wondering why git
>> apply fails silently…
>
> Hmm, I don't know why your compilation fails... I can apply and compile
> successfully using git.

Hmmm. Strange!

>> Given that we manage errors, ISTM that we should not necessarily stop 
>> on other not retried errors, but rather count/report them and possibly 
>> proceed.  Eg with something like: [...] We could count the fail, 
>> rollback if necessary, and go on.  What do you think? Maybe such 
>> behavior would deserve an option.
>
> This feature to count failures that could occur at runtime seems nice. However,
> as discussed in [1], I think it is better to focus only on failures that can be
> retried in this patch, and introduce the feature to handle other failures in a
> separate patch.

Ok.

>> --report-latencies -> --report-per-command: should we keep supporting
>> the previous option?
>
> Ok. Although now the option is not only for latencies, considering users who
> are using the existing option, I'm fine with this. I got back this to the
> previous name.

Hmmm. I liked the new name! My point was whether we need to support the 
old one as well for compatibility, or whether we should not bother. I'm 
still wondering. As I think that the new name is better, I'd suggest to 
keep it.

>> --failures-detailed: if we bother to run with handling failures, should
>> it always be on?
>
> If we print other failures that cannot be retried in the future, it could produce a lot
> of lines and might make some users who don't need details of failures annoyed.
> Moreover, some users would always need information of detailed failures in log,
> and others would need only total numbers of failures.

Ok.

> Currently we handle only serialization and deadlock failures, so the number of
> lines printed and the number of columns of logging is not large even under the
> failures-detail, but if we have a chance to handle other failures in future,
> ISTM adding this option makes sense considering users who would like simple
> outputs.

Hmmm. What kind of failures could be managed with retries? I guess that on 
a connection failure we can try to reconnect, but otherwise it is less 
clear that other failures make sense to retry.

>> --debug-errors: I'm not sure we should want a special debug mode for that,
>> I'd consider integrating it with the standard debug, or just for development.
>
> I think --debug is a debug option for telling users the pgbench's internal
> behaviors, that is, which client is doing what. On other hand, --debug-errors
> is for telling users what error caused a retry or a failure in detail. For
> users who are not interested in pgbench's internal behavior (sending a command,
> receiving a result, ... ) but interested in actual errors raised during running
> script, this option seems useful.

Ok. This is not really about debug per se, but a verbosity setting?
Maybe --verbose-errors would make more sense? I'm unsure. I'll think about 
it.

>> Also, should it use pg_log_debug?
>
> If we use pg_log_debug, the message is printed only under --debug.
> Therefore, I fixed to use pg_log_info instead of pg_log_error or fprintf.

Ok, pg_log_info seems right.

>> Tries vs retries: I'm at odds with having tries & retries and + 1 here
>> and there to handle that, which is a little bit confusing. I'm wondering whether
>> we could only count "tries" and adjust to report what we want later?
>
> I fixed to use "tries" instead of "retries" in CState. However, we still use
> "retries" in StatsData and Command because the number of retries is printed
> in the final result. Is it less confusing than the previous?

I'm going to think about it.

>> advanceConnectionState: ISTM that ERROR should logically be before others which
>> lead to it.
>
> Sorry, I couldn't understand your suggestion. Is this about the order of case
> statements or pg_log_error?

My sentence got mixed up. My point was about the case order, so that they 
are put in a more logical order when reading all the cases.

>> Currently, ISTM that the retry on error mode is implicitely always on.
>> Do we want that? I'd say yes, but maybe people could disagree.
>
> The default values of max-tries is 1, so the retry on error is off.

> Failed transactions are retried only when the user wants it and
> specifies a valid value for max-tries.

Ok. My point is that we do not stop on such errors, whereas before ISTM 
that we would have stopped, so somehow the default behavior has changed 
and the previous behavior cannot be reinstated with an option. Maybe that 
is not bad, but this is a behavioral change which needs to be documented 
and justified.

>> See the attached files for generating deadlocks reliably (start with 2 
>> clients). What do you think? The PL/pgSQL version is minimal; it is really 
>> client-code oriented.
>
> Sorry, but I cannot find the attached file.

Sorry. Attached to this mail. The serialization stuff does not seem to 
work as well as the deadlock one. Run with 2 clients.

-- 
Fabien.
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> I attached the updated patch according to your suggestions.

v13 patches gave a compiler warning...

$ make >/dev/null
pgbench.c: In function ‘commandError’:
pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
  const Command *command = sql_script[st->use_file].commands[st->command];
                 ^~~~~~~

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> v13 patches gave a compiler warning...
> 
> $ make >/dev/null
> pgbench.c: In function ‘commandError’:
> pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
>   const Command *command = sql_script[st->use_file].commands[st->command];
>                  ^~~~~~~

There is a typo in the doc (more over ->  moreover).

>        of all transaction tries; more over, you cannot use an unlimited number

        of all transaction tries; moreover, you cannot use an unlimited number

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
I have found an interesting result from patched pgbench (I have set
the isolation level to REPEATABLE READ):

$ pgbench -p 11000 -c 10  -T 30  --max-tries=0 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 30 s
number of transactions actually processed: 2586
number of failed transactions: 9 (0.347%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
number of transactions retried: 1892 (72.909%)
total number of retries: 21819
latency average = 115.551 ms (including failures)
initial connection time = 35.268 ms
tps = 86.241799 (without initial connection time)

I ran pgbench with 10 concurrent sessions. In this case pgbench always
reports 9 failed transactions regardless of the setting of the -T
option. This is because at the end of a pgbench session, only 1 out of
10 transactions succeeded while 9 transactions failed due to
serialization errors without any chance to retry because -T expires.

This is a little bit disappointing because I wanted to see a result where
all transactions succeeded with retries.  I tried -t instead of -T but
-t cannot be used with --max-tries=0.

Also I think this behavior is somewhat inconsistent with the existing
behavior of pgbench. When pgbench runs without the --max-tries option,
pgbench continues to run transactions even after -T expires:

$ time pgbench -p 11000 -T 10 -f pgbench.sql test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: pgbench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 2
maximum number of tries: 1
latency average = 7009.006 ms
initial connection time = 8.045 ms
tps = 0.142674 (without initial connection time)

real    0m14.067s
user    0m0.010s
sys    0m0.004s

$ cat pgbench.sql
SELECT pg_sleep(7);

So pgbench does not stop transactions after 10 seconds have passed but
waits for the last transaction to complete. To be consistent with this
behavior, shouldn't we retry until the last transaction finishes when
--max-tries=0?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Ishii-san,

On Thu, 01 Jul 2021 09:03:42 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > v13 patches gave a compiler warning...
> > 
> > $ make >/dev/null
> > pgbench.c: In function ‘commandError’:
> > pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
> >   const Command *command = sql_script[st->use_file].commands[st->command];
> >                  ^~~~~~~

Hmm, we'll get the warning when --enable-cassert is not specified.
I'll fix it.

> There is a typo in the doc (more over ->  moreover).
> 
> >        of all transaction tries; more over, you cannot use an unlimited number
> 
>         of all transaction tries; moreover, you cannot use an unlimited number
> 

Thanks. I'll fix.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Ishii-san,

On Fri, 02 Jul 2021 09:25:03 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> I have found an interesting result from patched pgbench (I have set
> the isolation level to REPEATABLE READ):
> 
> $ pgbench -p 11000 -c 10  -T 30  --max-tries=0 test
> pgbench (15devel, server 13.3)
> starting vacuum...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 1
> query mode: simple
> number of clients: 10
> number of threads: 1
> duration: 30 s
> number of transactions actually processed: 2586
> number of failed transactions: 9 (0.347%)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> number of transactions retried: 1892 (72.909%)
> total number of retries: 21819
> latency average = 115.551 ms (including failures)
> initial connection time = 35.268 ms
> tps = 86.241799 (without initial connection time)
> 
> I ran pgbench with 10 concurrent sessions. In this case pgbench always
> reports 9 failed transactions regardless the setting of -T
> option. This is because at the end of a pgbench session, only 1 out of
> 10 transaction succeeded but 9 transactions failed due to
> serialization error without any chance to retry because -T expires.
> 
> This is a little bit disappointing because I wanted to see a result where
> all transactions succeeded with retries.  I tried -t instead of -T but
> -t cannot be used with --max-tries=0.
> 
> Also I think this behavior is somewhat inconsistent with existing
> behavior of pgbench. When pgbench runs without --max-tries option,
> pgbench continues to run transactions even after -T expires:
> 
> $ time pgbench -p 11000 -T 10 -f pgbench.sql test
> pgbench (15devel, server 13.3)
> starting vacuum...end.
> transaction type: pgbench.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 2
> maximum number of tries: 1
> latency average = 7009.006 ms
> initial connection time = 8.045 ms
> tps = 0.142674 (without initial connection time)
> 
> real    0m14.067s
> user    0m0.010s
> sys    0m0.004s
> 
> $ cat pgbench.sql
> SELECT pg_sleep(7);
> 
> So pgbench does not stop transactions after 10 seconds passed but
> waits for the last transaction completes. If we consistent with
> behavior when --max-tries=0, shouldn't we retry until the last
> transaction finishes?

I changed the previous patch so that the -T option can terminate
a retrying transaction and so that --max-tries=0 can be specified without
--latency-limit if -T is given, according to the following comment.

> Doc says "you cannot use an infinite number of retries without latency-limit..."
> 
> Why should this be forbidden? At least if -T timeout takes precedent and
> shortens the execution, ISTM that there could be good reason to test that.
> Maybe it could be blocked only under -t if this would lead to an non-ending
> run.

Indeed, as Ishii-san pointed out, some users might not want to terminate
retrying transactions due to -T. However, the actual negative effect is only
printing the number of failed transactions. The other results that users want to
know, such as tps, are almost unaffected because they are measured over
transactions processed successfully. Actually, the percentage of failed
transactions is very small, only 0.347%.

In the existing behaviour, running transactions are never terminated due to
the -T option. However, ISTM that this is based on the assumption
that the latency of each transaction is small and that the point at which we can
finish the benchmark will come soon.  On the other hand, when transactions can
be retried an unlimited number of times, the run may take much longer than expected,
and we cannot guarantee that it will finish successfully in limited time. Therefore,
terminating the benchmark by giving up retrying the transaction after the time
expires seems reasonable under unlimited retries.  In the sense that we don't
terminate running transactions forcibly, this doesn't change the existing behaviour.

If you don't want to print the number of transactions failed due to -T, we can
forbid using -T without --latency-limit under max-tries=0 to avoid a possible
never-ending benchmark. In this case, users have to limit the number of
transaction retries by specifying --latency-limit or max-tries (>0). However, if some
users would like to benchmark while simply allowing unlimited retries, using -T and
max-tries=0 seems the most straightforward way, so I think it is better that they can be
used together.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Indeed, as Ishii-san pointed out, some users might not want to terminate
> retrying transactions due to -T. However, the actual negative effect is only
> printing the number of failed transactions. The other result that users want to
> know, such as tps, are almost not affected because they are measured for
> transactions processed successfully. Actually, the percentage of failed
> transaction is very little, only 0.347%.

Well, "that's very little, let's ignore it" is not technically a right
direction IMO.

> In the existing behaviour, running transactions are never terminated due to
> the -T option. However, ISTM that this would be based on an assumption
> that a latency of each transaction is small and that a timing when we can
> finish the benchmark would come soon.  On the other hand, when transactions can 
> be retried unlimitedly, it may take a long time more than expected, and we can
> not guarantee that this would finish successfully in limited time.Therefore,  
> terminating the benchmark by giving up to retry the transaction after time
> expiration seems reasonable under unlimited retries.

That's not necessarily true in practice. By the time when -T is about to
expire, transactions are all finished in finite time as you can see
the result I showed. So it's reasonable that the very last cycle of
the benchmark will finish in finite time as well.

Of course if a benchmark cycle takes infinite time, this will be a
problem. However, the same thing can be said of non-retry
benchmarks. Theoretically it is possible that *one* benchmark cycle
takes forever. In this case the only solution will be just hitting ^C
to terminate pgbench. Why can't we make the same assumption in the
--max-tries=0 case?

> In the sense that we don't
> terminate running transactions forcibly, this don't change the existing behaviour. 

This statement seems to depend on your personal assumption.

I still don't understand why you think that the --max-tries non-zero case
will *certainly* finish in finite time whereas the --max-tries=0 case will
not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 07 Jul 2021 16:11:23 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Indeed, as Ishii-san pointed out, some users might not want to terminate
> > retrying transactions due to -T. However, the actual negative effect is only
> > printing the number of failed transactions. The other result that users want to
> > know, such as tps, are almost not affected because they are measured for
> > transactions processed successfully. Actually, the percentage of failed
> > transaction is very little, only 0.347%.
> 
> Well, "that's very little, let's ignore it" is not technically a right
> direction IMO.

Hmmm, it seems to me these failures are ignorable because, with regard to failures
due to -T, they occur only in the last transaction of each client and do not affect
results such as TPS and the latency of successfully processed transactions
(although I am not sure in what sense you use the word "technically"...).

However, maybe I am missing something. Could you please tell me what you think
the actual harm for users of failures due to -D is?

> > In the existing behaviour, running transactions are never terminated due to
> > the -T option. However, ISTM that this would be based on an assumption
> > that a latency of each transaction is small and that a timing when we can
> > finish the benchmark would come soon.  On the other hand, when transactions can 
> > be retried unlimitedly, it may take a long time more than expected, and we can
> > not guarantee that this would finish successfully in limited time.Therefore,  
> > terminating the benchmark by giving up to retry the transaction after time
> > expiration seems reasonable under unlimited retries.
> 
> That's not necessarily true in practice. By the time when -T is about to
> expire, transactions are all finished in finite time as you can see
> the result I showed. So it's reasonable that the very last cycle of
> the benchmark will finish in finite time as well.

Your script may finish in finite time, but others may not. However,
considering only serialization and deadlock errors, almost all transactions
would finish in finite time eventually. In the previous version of the
patch, errors other than serialization or deadlock could be retried, which
easily caused unlimited retrying. Now, only these two kinds of errors
can be retried; nevertheless, it is unclear to me whether we can assume
that retrying will finish in finite time. If we can assume it, maybe
we can remove the restriction that --max-tries=0 must be used with
--latency-limit or -T.

> Of course if a benchmark cycle takes infinite time, this will be a
> problem. However same thing can be said to non-retry
> benchmarks. Theoretically it is possible that *one* benchmark cycle
> takes forever. In this case the only solution will be just hitting ^C
> to terminate pgbench. Why can't we have same assumption with
> --max-tries=0 case?

Indeed, it is possible that the execution of a query takes a long or infinite
time. However, its cause would be a problematic query in the custom script
or other problems occurring on the server side. These are not problems of
pgbench, and pgbench itself can't control them either. On the other hand, the
unlimited number of tries is a behaviour specified by a pgbench option,
so I think pgbench itself should internally avoid problems caused by its own
behaviour. That is, if max-tries=0 could cause an infinite or much longer
benchmark time than the user expected due to too many retries, I think
pgbench should avoid it.

> > In the sense that we don't
> > terminate running transactions forcibly, this don't change the existing behaviour. 
> 
> This statement seems to depend on your personal assumption.

Ok. If we regard a transaction as still running even while it is being
retried after an error, terminating the retry may imply terminating the running
transaction forcibly.

> I still don't understand why you think that --max-tries non 0 case
> will *certainly* finish in finite time whereas --max-tries=0 case will
> not.

I just mean that --max-tries greater than zero will prevent pgbench from retrying a
transaction forever.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> Well, "that's very little, let's ignore it" is not technically a right
>> direction IMO.
> 
> Hmmm, It seems to me these failures are ignorable because with regard to failures
> due to -T they occur only the last transaction of each client and do not affect
> the result such as TPS and latency of successfully processed transactions.
> (although I am not sure for what sense you use the word "technically"...)

"My application button does not respond once in 100 times. It's just
1% error rate. You should ignore it." I would say this attitude is not
technically correct.

> However, maybe I am missing something. Could you please tell me what do you think
> the actual harm for users about failures due to -D is?

I don't know why you are referring to -D.

>> That's not necessarily true in practice. By the time when -T is about to
>> expire, transactions are all finished in finite time as you can see
>> the result I showed. So it's reasonable that the very last cycle of
>> the benchmark will finish in finite time as well.
> 
> Your script may finish in finite time, but others may not.

That's why I said "practically". In other words "in most cases the
scenario will finish in finite time".

> Indeed, it is possible an execution of a query takes a long or infinite
> time. However, its cause would a problematic query in the custom script
> or other problems occurs on the server side. These are not problem of
> pgbench and, pgbench itself can't control either. On the other hand, the
> unlimited number of tries is a behaviours specified by the pgbench option,
> so I think pgbench itself should internally avoid problems caused from its
> behaviours. That is, if max-tries=0 could cause infinite or much longer
> benchmark time more than user expected due to too many retries, I think
> pgbench should avoid it.

I would say that's the user's responsibility, to avoid infinitely running
benchmarks. Remember, pgbench is a tool for serious users, not for
novice users.

Or, we should terminate the last cycle of the benchmark, regardless of whether
it is retrying or not, if -T expires. This will make pgbench behave much
more consistently.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 07 Jul 2021 21:50:16 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >> Well, "that's very little, let's ignore it" is not technically a right
> >> direction IMO.
> > 
> > Hmmm, It seems to me these failures are ignorable because with regard to failures
> > due to -T they occur only the last transaction of each client and do not affect
> > the result such as TPS and latency of successfully processed transactions.
> > (although I am not sure for what sense you use the word "technically"...)
> 
> "My application button does not respond once in 100 times. It's just
> 1% error rate. You should ignore it." I would say this attitude is not
> technically correct.

I cannot understand what you want to say. Can reporting the number of transactions
that failed intentionally be treated the same as the error rate of your
application's button?

> > However, maybe I am missing something. Could you please tell me what do you think
> > the actual harm for users about failures due to -D is?
> 
> I don't know why you are referring to -D.

Sorry. It's just a typo, as you can imagine.
I am asking what you think the actual harm for users of terminating
retries due to the -T option is.

> >> That's not necessarily true in practice. By the time when -T is about to
> >> expire, transactions are all finished in finite time as you can see
> >> the result I showed. So it's reasonable that the very last cycle of
> >> the benchmark will finish in finite time as well.
> > 
> > Your script may finish in finite time, but others may not.
> 
> That's why I said "practically". In other words "in most cases the
> scenario will finish in finite time".

Sure.

> > Indeed, it is possible an execution of a query takes a long or infinite
> > time. However, its cause would a problematic query in the custom script
> > or other problems occurs on the server side. These are not problem of
> > pgbench and, pgbench itself can't control either. On the other hand, the
> > unlimited number of tries is a behaviours specified by the pgbench option,
> > so I think pgbench itself should internally avoid problems caused from its
> > behaviours. That is, if max-tries=0 could cause infinite or much longer
> > benchmark time more than user expected due to too many retries, I think
> > pgbench should avoid it.
> 
> I would say that's user's responsibility to avoid infinite running
> benchmarking. Remember, pgbench is a tool for serious users, not for
> novice users.

Of course, users themselves should be careful about problematic scripts, but it
would be better for pgbench itself to avoid problems beforehand if it can.
 
> Or, we should terminate the last cycle of benchmark regardless it is
> retrying or not if -T expires. This will make pgbench behaves much
> more consistent.

Hmmm, indeed this might make the behaviour a bit more consistent, but I am not
sure such a behavioural change benefits users.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

I attached the updated patch (v14)!

On Wed, 30 Jun 2021 17:33:24 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> >> --report-latencies -> --report-per-command: should we keep supporting
> >> the previous option?
> >
> > Ok. Although the option is now not only about latencies, considering users who
> > are using the existing option, I'm fine with this. I reverted it to the
> > previous name.
> 
> Hmmm. I liked the new name! My point was whether we need to support the 
> old one as well for compatibility, or whether we should not bother. I'm 
> still wondering. As I think that the new name is better, I'd suggest to 
> keep it.

Ok. I misunderstood it. I returned the option name to report-per-command.

If we keep report-latencies, I can imagine the following choices:
- use report-latencies to print only latency information
- use report-latencies as an alias of report-per-command for compatibility
  and remove it at an appropriate time (that is, treat it as deprecated)

Among these, I prefer the latter because ISTM we would not need many options
for reporting information per command. However, actually, I wonder whether we
have to keep the previous one at all if we plan to remove it eventually.

> >> --failures-detailed: if we bother to run with handling failures, should
> >> it always be on?
> >
> > If we print other failures that cannot be retried in the future, it could produce a lot
> > of lines and might make some users who don't need details of failures annoyed.
> > Moreover, some users would always need information of detailed failures in log,
> > and others would need only total numbers of failures.
> 
> Ok.
> 
> > Currently we handle only serialization and deadlock failures, so the number of
> > lines printed and the number of columns of logging is not large even under the
> > failures-detail, but if we have a chance to handle other failures in future,
> > ISTM adding this option makes sense considering users who would like simple
> > outputs.
> 
> Hmmm. What kind of failures could be managed with retries? I guess that on 
> a connection failure we can try to reconnect, but otherwise it is less 
> clear that other failures make sense to retry.

Indeed, there would be few failures that we should retry, and I cannot imagine
any other than serialization, deadlock, and connection failures for now. However,
considering reporting the number of failed transactions and their causes in the future,
as you said

> Given that we manage errors, ISTM that we should not necessarily
> stop on other not retried errors, but rather count/report them and
> possibly proceed. 

, we could define a few more kinds of failures. At least we can consider
meta-command and other SQL command errors in addition to serialization,
deadlock, and connection failures. So, the total number of kinds of failures would
be at least five, and always reporting all of them would result in a lot of lines
and columns in the logging.

> >> --debug-errors: I'm not sure we should want a special debug mode for that,
> >> I'd consider integrating it with the standard debug, or just for development.
> >
> > I think --debug is a debug option for telling users the pgbench's internal
> > behaviors, that is, which client is doing what. On other hand, --debug-errors
> > is for telling users what error caused a retry or a failure in detail. For
> > users who are not interested in pgbench's internal behavior (sending a command,
> > receiving a result, ... ) but interested in actual errors raised during running
> > script, this option seems useful.
> 
> Ok. This is not really about debug per se, but a verbosity setting?

I think so.

> Maybe --verbose-errors would make more sense? I'm unsure. I'll think about 
> it.

Agreed. This seems more appropriate than the previous one, so I changed the name to
--verbose-errors.

> > Sorry, I couldn't understand your suggestion. Is this about the order of case
> > statements or pg_log_error?
> 
> My sentence got mixed up. My point was about the case order, so that they 
> are put in a more logical order when reading all the cases.

Ok. Considering the logical order, I moved WAIT_ROLLBACK_RESULT between
ERROR and RETRY, because WAIT_ROLLBACK_RESULT comes after the ERROR state,
and RETRY comes after ERROR or WAIT_ROLLBACK_RESULT.
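
For readers following along, a minimal sketch of the resulting logical order of
the error-handling states (the CSTATE_* names are those used in the patch; the
surrounding states of the real enum are omitted and the comments are mine):

/* Illustrative excerpt only: error-handling states in their logical order. */
typedef enum
{
	/* ... the regular per-command states come first ... */
	CSTATE_ERROR,					/* a command failed; decide what to do next */
	CSTATE_WAIT_ROLLBACK_RESULT,	/* a ROLLBACK was sent, wait for its result */
	CSTATE_RETRY,					/* the failed transaction will be run again */
	CSTATE_FAILURE,					/* no more retries: record a failed transaction */
	CSTATE_END_TX					/* end of the transaction */
	/* ... CSTATE_ABORTED and CSTATE_FINISHED follow ... */
} IllustrativeConnectionState;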

> >> Currently, ISTM that the retry on error mode is implicitely always on.
> >> Do we want that? I'd say yes, but maybe people could disagree.
> >
> > The default values of max-tries is 1, so the retry on error is off.
> 
> > Failed transactions are retried only when the user wants it and
> > specifies a valid value to max-treis.
> 
> Ok. My point is that we do not stop on such errors, whereas before ISTM 
> that we would have stopped, so somehow the default behavior has changed 
> and the previous behavior cannot be reinstated with an option. Maybe that 
> is not bad, but this is a behavioral change which needs to be documented 
> and argumented.

Understood. Indeed, there is a behavioural change in whether we abort
the client after some types of errors. Now serialization / deadlock
errors do not abort the client and are recorded as failures, whereas other
errors still cause the client to abort.

If we want to record other errors as failures in the future, we will need
a new option to specify which types of failures (or maybe all types of errors)
should be reported. Until then, ISTM we can treat serialization and
deadlock errors as special cases that are reported as failures.
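
As an illustration of that split, a sketch of the error classification (roughly
following the names discussed in this thread; treat the exact definitions as an
approximation, not the committed code):

#include <stdbool.h>

typedef enum EStatus
{
	ESTATUS_NO_ERROR = 0,
	ESTATUS_META_COMMAND_ERROR,		/* aborts the client */
	/* SQL errors */
	ESTATUS_SERIALIZATION_ERROR,	/* SQLSTATE 40001: may be retried */
	ESTATUS_DEADLOCK_ERROR,			/* SQLSTATE 40P01: may be retried */
	ESTATUS_OTHER_SQL_ERROR			/* aborts the client */
} EStatus;

/* Only serialization and deadlock errors are candidates for a retry. */
static bool
canRetryError(EStatus estatus)
{
	return (estatus == ESTATUS_SERIALIZATION_ERROR ||
			estatus == ESTATUS_DEADLOCK_ERROR);
}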

I rewrote "Failures and Serialization/Deadlock Retries" section a bit to
emphasis that such errors are treated differently than other errors. 

> >> See the attached files for generating deadlocks reliably (start with 2 
> >> clients). What do you think? The PL/pgSQL minimal, it is really 
> >> client-code oriented.
> >
> > Sorry, but I cannot find the attached file.
> 
> Sorry. Attached to this mail. The serialization stuff does not seem to 
> work as well as the deadlock one. Run with 2 clients.

Hmmm, your test didn't work well for me. Both tests got stuck in
pgbench_deadlock_wait() and pgbench didn't finish. 


Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
I have played with the v14 patch. I previously complained that pgbench
always reported 9 errors (actually the number is always the number
specified by "-c" minus 1 in my case).

$ pgbench -p 11000 -c 10  -T 10  --max-tries=0 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 10 s
number of transactions actually processed: 974
number of failed transactions: 9 (0.916%)
number of transactions retried: 651 (66.226%)
total number of retries: 8482
latency average = 101.317 ms (including failures)
initial connection time = 44.440 ms
tps = 97.796487 (without initial connection time)

To reduce the number of errors I provided "--max-tries=9000", since
pgbench reported 8482 retries.

$ pgbench -p 11000 -c 10  -T 10 --max-tries=9000 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 10 s
number of transactions actually processed: 1133
number of failed transactions: 9 (0.788%)
number of transactions retried: 755 (66.112%)
total number of retries: 9278
maximum number of tries: 9000
latency average = 88.570 ms (including failures)
initial connection time = 23.384 ms
tps = 112.015219 (without initial connection time)

Unfortunately this didn't work. There were still 9 errors because pgbench
terminated the last round of the run.

Then I gave up on -T and switched to -t. The number of transactions for
the -t option was calculated from the total number of transactions actually
processed (1133) divided by the number of clients (10), i.e. about 113.3 per
client, which I rounded up to 120. The result:

$ pgbench -p 11000 -c 10  -t 120 --max-tries=9000 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 120
number of transactions actually processed: 1200/1200
number of transactions retried: 675 (56.250%)
total number of retries: 8524
maximum number of tries: 9000
latency average = 93.777 ms
initial connection time = 14.120 ms
tps = 106.635908 (without initial connection time)

Finally I was able to get a result without any errors.  This is not a
super simple way to obtain pgbench results without errors, but
probably I can live with it.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello,

> Of course, users themselves should be careful of problematic script, but it
> would be better that pgbench itself avoids problems if pgbench can beforehand.
>
>> Or, we should terminate the last cycle of benchmark regardless it is
>> retrying or not if -T expires. This will make pgbench behaves much
>> more consistent.

I would tend to agree with this behavior, that is not to start any new 
transaction or transaction attempt once -T has expired.

I'm a little hesitant about how to count and report such transactions left
unfinished because of the benchmark timeout, though. Not counting them seems to
be the best option.

> Hmmm, indeed this might make the behaviour a bit consistent, but I am not
> sure such behavioural change benefit users.

The user benefit would be that if they asked for a 100s benchmark, pgbench
makes a reasonable effort not to overshoot that?

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>>> Or, we should terminate the last cycle of benchmark regardless it is
>>> retrying or not if -T expires. This will make pgbench behaves much
>>> more consistent.
> 
> I would tend to agree with this behavior, that is not to start any new
> transaction or transaction attempt once -T has expired.
> 
> I'm a little hesitant about how to count and report such unfinished
> because of bench timeout transactions, though. Not counting them seems
> to be the best option.

I agree.

>> Hmmm, indeed this might make the behaviour a bit consistent, but I am
>> not
>> sure such behavioural change benefit users.
> 
> The user benefit would be that if they asked for a 100s benchmark,
> pgbench does a reasonable effort not to overshot that?

Right.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Tue, 13 Jul 2021 13:00:49 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >>> Or, we should terminate the last cycle of benchmark regardless it is
> >>> retrying or not if -T expires. This will make pgbench behaves much
> >>> more consistent.
> > 
> > I would tend to agree with this behavior, that is not to start any new
> > transaction or transaction attempt once -T has expired.

That is the behavior in the latest patch. Once -T has expired, no new
transaction or retry is started.

IIUC, Ishii-san's proposal was to change pgbench's behavior when -T has
expired so that any running transaction is terminated immediately, regardless
of retrying. I am not sure we should do that in this patch. If we want this
change, it should be done in another patch as an improvement of the -T option.

> > I'm a little hesitant about how to count and report such unfinished
> > because of bench timeout transactions, though. Not counting them seems
> > to be the best option.
> 
> I agree.

I also agree. Although I couldn't get an answer about what he thinks the actual
harm for users is when retrying is terminated by the -T option, I guess the
complaint was just about reporting the termination of retrying as failures.
Therefore, I will change the code to finish the benchmark when the time is over
during retrying, that is, change the state to CSTATE_FINISHED instead of
CSTATE_ERROR in such cases.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> > I would tend to agree with this behavior, that is not to start any new
>> > transaction or transaction attempt once -T has expired.
> 
> That is the behavior in the latest patch. Once -T has expired, any new
> transaction or retry does not start. 

Actually v14 has not changed the behavior in this regard, as explained
in a different email:

> $ pgbench -p 11000 -c 10  -T 10  --max-tries=0 test
> pgbench (15devel, server 13.3)
> starting vacuum...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 1
> query mode: simple
> number of clients: 10
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 974
> number of failed transactions: 9 (0.916%)
> number of transactions retried: 651 (66.226%)
> total number of retries: 8482
> latency average = 101.317 ms (including failures)
> initial connection time = 44.440 ms
> tps = 97.796487 (without initial connection time)

>> > I'm a little hesitant about how to count and report such unfinished
>> > because of bench timeout transactions, though. Not counting them seems
>> > to be the best option.
>> 
>> I agree.
> 
> I also agree. Although I  couldn't get an answer what does he think the actual
> harm for users due to termination of retrying by the -T option is, I guess it just
> complained about reporting the termination of retrying  as failures. Therefore,
> I will fix to finish the benchmark when the time is over during retrying, that is,
> change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.

I guess Fabien wanted it differently. Suppose "-c 10 and -T 30" and we
have 100 successful transactions by time 25. At time 25 pgbench starts the
next benchmark cycle, and by time 30 there are 10 failing transactions
(because they are retrying). pgbench stops the execution at time
30. According to your proposal (change the state to CSTATE_FINISHED
instead of CSTATE_ERROR) the total number of successful transactions would
be 100 + 10 = 110, right? I guess Fabien wants the number to
be 100 rather than 110.

Fabien,
Please correct me if you think differently.

Also, actually I have explained the harm a number of times but you have
kept ignoring it because "it's subtle". My request has been pretty
simple.

> number of failed transactions: 9 (0.916%)

I don't like this and want the number of failed transactions to be 0.
Who wants a benchmark result with errors?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Tue, 13 Jul 2021 14:35:00 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >> > I would tend to agree with this behavior, that is not to start any new
> >> > transaction or transaction attempt once -T has expired.
> > 
> > That is the behavior in the latest patch. Once -T has expired, any new
> > transaction or retry does not start. 
> 
> Actually v14 has not changed the behavior in this regard as explained
> in different email:

Right. Neither v13 nor v14 starts any new transaction or retry once
-T has expired.

> >> > I'm a little hesitant about how to count and report such unfinished
> >> > because of bench timeout transactions, though. Not counting them seems
> >> > to be the best option.
> >> 
> >> I agree.
> > 
> > I also agree. Although I  couldn't get an answer what does he think the actual
> > harm for users due to termination of retrying by the -T option is, I guess it just
> > complained about reporting the termination of retrying  as failures. Therefore,
> > I will fix to finish the benchmark when the time is over during retrying, that is,
> > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.
> 
> I guess Fabien wanted it differently. Suppose "-c 10 and -T 30" and we
> have 100 success transactions by time 25. At time 25 pgbench starts
> next benchmark cycle and by time 30 there are 10 failing transactions
> (because they are retrying). pgbench stops the execution at time
> 30. According your proposal (change the state to CSTATE_FINISHED
> instead of CSTATE_ERROR) the total number of success transactions will
> be 100 + 10 = 110, right? 

No. The last failed transaction is not counted because CSTATE_END_TX is
bypassed, so please don't worry.

> Also actually I have explained the harm number of times but you have
> kept on ignoring it because "it's subtle". My request has been pretty
> simple.
> 
> > number of failed transactions: 9 (0.916%)
> 
> I don't like this and want to have the failed transactions to be 0.
> Who wants a benchmark result having errors?

I was asking because I wanted to confirm what you were really complaining
about: whether the problem is that a retrying transaction is terminated by the
-T option, or that pgbench reports it in the number of failed transactions? But
now I understand it is the latter: you don't want to count the termination
of retrying as a failure. Thanks.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello,

I attached the updated patch.

On Tue, 13 Jul 2021 15:50:52 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:
 
> > >> > I'm a little hesitant about how to count and report such unfinished
> > >> > because of bench timeout transactions, though. Not counting them seems
> > >> > to be the best option.

> > > I will fix to finish the benchmark when the time is over during retrying, that is,
> > > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.

Done.
(I wrote CSTATE_ERROR, but correctly it is CSTATE_FAILURE.)
 
Now, once the timer has expired while retrying a failed transaction, pgbench never
starts a new retry. If the transaction succeeds, it will be counted in the result.
Otherwise, if the transaction fails again, it is not counted.


In addition, I fixed the patch to work well with pipeline mode. Previously, pipeline
mode was not sufficiently considered and ROLLBACK was not sent correctly. I fixed the
error handling in pipeline mode, and now it works.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> I attached the updated patch.

# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

  - v15.1: refactoring is a definite improvement.
    Good, even if it is not very useful (see below).

    While restructuring, maybe predefined variables could be made read-only
    so that a script which updates them would fail, which would be a
    good thing. This is probably material for an independent patch.

  - v15.2: see detailed comments below

# Doc

Doc build is ok.

ISTM that "number of tries" line would be better placed between the 
#threads and #transactions lines. What do you think?

Aggregate logging description: "{ failures | ... }" seems misleading 
because it suggests we have one or the other, whereas it can also be 
empty. I suggest: "{ | failures | ... }" to show the empty case.

Having a full example with retries in the doc is a good thing, and
illustrates in passing that running with a number of clients on a small
scale does not make much sense because of the contention on
tellers/branches. I'd wonder whether the number of tries is set too high,
though; ISTM that an application should give up before 100? I like that
the feature is also limited by the latency limit.

Minor editing:

"there're" -> "there are".

"the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already said,
proofreading by a native speaker would be nice.

From a technical writing point of view, maybe the documentation could be improved a bit,
but I'm not at ease on that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." is unclear,
please use a positive sentence to tell what is true instead of what is not and the reader
has to guess. Maybe: "The latency figures include failed transactions which have reached
the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." ISTM that
this is a pain for scripts which would like to process these reports data, because the data
may or may not be there. I'm sure to write such scripts, which explains my concern:-)

"If the total number of retried transactions is non-zero…" should it rather be "not one",
because zero means unlimited retries?

The section describing the various type of errors that can occur is a good addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be on, but
this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variable
states… no clearing, no copying should be needed. The restarted
transaction simply overrides the existing variables, which is what the
previous version was doing anyway. The scripts should write their own
variables before using them, and if they don't then it is the user's
problem. This is important for performance, because it means that after a
client has executed all scripts once the variable array is stable and does
not incur significant maintenance costs. The only thing that needs saving
for retry is the pseudo-random generator state. This suggests simplifying
or removing "RetryState".

# CODE

The semantics of "cnt" is changed. Ok, the overall counters and their 
relationships make sense, and it simplifies the reporting code. Good.

In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and
could be dealt with in the default case. We are only interested in
serialization/deadlocks, which are fatal errors?

doRetry: for consistency, given the assert, ISTM that it should return 
false if duration has expired, by testing end_time or timer_exceeded.

checkTransactionStatus: this function does several things at once with 2
booleans, which makes it not very clear to me. Maybe it would be clearer if
it would just return an enum (in trans, not in trans, conn error, other 
error). Another reason to do that is that on connection error pgbench 
could try to reconnect, which would be an interesting later extension, so 
let's pave the way for that.  Also, I do not think that the function 
should print out a message, it should be the caller decision to do that.

verbose_errors: there is more or less repeated code under RETRY and 
FAILURE, which should be factored out in a separate function. The 
advanceConnectionState function is long enough. Once this is done, there is no
need for a getLatencyUsed function.

I'd put cleaning up the pipeline in a function. I do not understand why
the pipeline mode is not exited in all cases; the code checks the
pipeline status twice within a few lines. I'd put this cleanup in the sync
function as well, and report to the caller (advanceConnectionState) if there
was an error, which would be managed there.

WAIT_ROLLBACK_RESULT: consuming results in a while loop could be a function to
avoid code repetition (there and in the "error:" label in 
readCommandResponse). On the other hand, I'm not sure why the loop is 
needed: we can only get there by submitting a "ROLLBACK" command, so there 
should be only one result anyway?

report_per_command: please always count retries and failures of commands
even if they will not be reported in the end; the code will be simpler and
more efficient.

doLog: the format has changed, including a new string on failures which
replaces the time field. Hmmm. Cannot say I like it much, but why not. ISTM
that the string could be shortened to "deadlock" or "serialization". ISTM
that the documentation example should include a line with a failure, to 
make it clear what to expect.

I'm okay with always getting computing thread stats.

# COMMENTS

struct StatsData comment is helpful.
  - "failed transactions" -> "unsuccessfully retried transactions"?
  - 'cnt' decomposition: first term is field 'retried'? if so say it
    explicitly?

"Complete the failed transaction" sounds strange: If it failed, it could 
not complete? I'd suggest "Record a failed transaction".

# TESTS

I suggested to simplify the tests by using conditionals & sequences. You 
reported that you got stuck. Hmmm.

I tried again my tests which worked fine when started with 2 clients, 
otherwise they get stuck because the first client waits for the other one 
which does not exist (the point is to generate deadlocks and other
errors). Maybe this is your issue?

Could you try with:

   psql < deadlock_prep.sql
   pgbench -t 4 -c 2 -f deadlock.sql
   # note: each deadlock detection takes 1 second

   psql < deadlock_prep.sql
   pgbench -t 10 -c 2 -f serializable.sql
   # very quick 50% serialization errors

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

Thank you so much for your review. 

Sorry for the late reply. I had stopped working on it due to other
jobs, but I have come back to it. I attached the updated patch. I would
appreciate it if you could review it again.

On Mon, 19 Jul 2021 20:04:23 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> # About pgbench error handling v15
> 
> Patches apply cleanly. Compilation, global and local tests ok.
> 
>   - v15.1: refactoring is a definite improvement.
>     Good, even if it is not very useful (see below).

Ok, we don't need to save variables in order to implement the
retry feature in pgbench, as you suggested. Well, should we completely
separate these two patches, and should I fix v15.2 so it does not rely on v15.1?

>     While restructuring, maybe predefined variables could be make readonly
>     so that a script which would update them would fail, which would be a
>     good thing. Maybe this is probably material for an independent patch.

Yes, it should be material for an independent patch.

>   - v15.2: see detailed comments below
> 
> # Doc
> 
> Doc build is ok.
> 
> ISTM that "number of tries" line would be better placed between the 
> #threads and #transactions lines. What do you think?

Agreed. Fixed.

> Aggregate logging description: "{ failures | ... }" seems misleading 
> because it suggests we have one or the other, whereas it can also be 
> empty. I suggest: "{ | failures | ... }" to show the empty case.

The description is correct because either "failures" or "both of
serialization_failures and deadlock_failures" should appear in aggregate
logging. If "failures" were printed only when some transaction failed,
each line in the aggregate logging could have a different number of columns,
which would make it difficult to parse the results.

> I'd wonder whether the number of tries is set too high, 
> though, ISTM that an application should give up before 100? 

Indeed, max-tries=100 seems too high for a practical system.

Also, I noticed that the sum of the latencies of each command (= 15.839 ms)
is significantly larger than the latency average (= 10.870 ms),
because the "per command" results in the documentation were fixed.

So, I retook a measurement on my machine for more accurate documentation. I
used max-tries=10.

> Minor editing:
> 
> "there're" -> "there are".
> 
> "the --time" -> "the --time option".

Fixed.

> "The latency for failed transactions and commands is not computed separately." is unclear,
> please use a positive sentence to tell what is true instead of what is not and the reader
> has to guess. Maybe: "The latency figures include failed transactions which have reached
> the maximum number of tries or the transaction latency limit.".

I'm not the original author of this description, but I guess this means "The latency is
measured only for successful transactions and commands but not for failed transactions
or commands.".

> "The main report contains the number of failed transactions if it is non-zero." ISTM that
> this is a pain for scripts which would like to process these reports data, because the data
> may or may not be there. I'm sure to write such scripts, which explains my concern:-)

I agree with you. I changed the behavior to always report the number of failed
transactions regardless of whether it is non-zero.

> "If the total number of retried transactions is non-zero…" should it rather be "not one",
> because zero means unlimited retries?

I guess that this means the actual number of retried transactions, not max-tries, so
"non-zero" was correct. However, for the same reason as above, I changed the behavior
to always report the retry statistics regardless of the actual retry numbers.

> 
> # FEATURES
 
> Copying variables: ISTM that we should not need to save the variables 
> states… no clearing, no copying should be needed. The restarted 
> transaction simply overrides the existing variables which is what the 
> previous version was doing anyway. The scripts should write their own 
> variables before using them, and if they don't then it is the user 
> problem. This is important for performance, because it means that after a 
> client has executed all scripts once the variable array is stable and does 
> not incur significant maintenance costs. The only thing that needs saving 
> for retry is the speudo-random generator state. This suggest simplifying 
> or removing "RetryState".

Yes. Saving the variable states is not necessary because we retry the
whole script. It was necessary in the initial patch because that patch
planned to retry a single transaction within the script. I removed
RetryState and copyVariables.
 
> # CODE
 
> In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and 
> could be dealt with the default case. We are only interested in 
> serialization/deadlocks which are fatal errors?

We need PGRES_NONFATAL_ERROR to save st->estatus. It is used outside
readCommandResponse to determine whether we should abort or not.
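
Roughly, the error status is derived from the SQLSTATE attached to the error
result, so PGRES_NONFATAL_ERROR and PGRES_FATAL_ERROR can go through the same
path. A sketch, reusing the EStatus classification sketched earlier (the helper
name and details are an approximation of the patch, not the committed code):

#include <string.h>
#include "libpq-fe.h"			/* also provides PG_DIAG_SQLSTATE */

/* Map the SQLSTATE of a failed command to an error status. */
static EStatus
getSQLErrorStatus(const char *sqlState)
{
	if (sqlState != NULL)
	{
		if (strcmp(sqlState, "40001") == 0)
			return ESTATUS_SERIALIZATION_ERROR;
		if (strcmp(sqlState, "40P01") == 0)
			return ESTATUS_DEADLOCK_ERROR;
	}
	return ESTATUS_OTHER_SQL_ERROR;
}

/* in readCommandResponse(), on an error result: */
/*     st->estatus = getSQLErrorStatus(PQresultErrorField(res, PG_DIAG_SQLSTATE)); */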

> doRetry: for consistency, given the assert, ISTM that it should return 
> false if duration has expired, by testing end_time or timer_exceeded.

Ok. I fixed doRetry to check timer_exceeded again.
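
For illustration, a sketch of the resulting retry decision; the globals
(timer_exceeded, max_tries, latency_limit) and the CState fields are
assumptions based on the pgbench sources discussed here, and the committed
function may differ in detail:

/*
 * Decide whether the failed transaction should be run once more:
 * every configured limit must still allow it.
 */
static bool
doRetry(CState *st, pg_time_usec_t *now)
{
	/* never start a retry once the -T duration has expired */
	if (timer_exceeded)
		return false;

	/* respect --max-tries (0 means unlimited) */
	if (max_tries && st->tries >= max_tries)
		return false;

	/* respect --latency-limit, if given */
	if (latency_limit)
	{
		pg_time_now_lazy(now);
		if (*now - st->txn_scheduled > latency_limit)
			return false;
	}

	return true;
}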
 
> checkTransactionStatus: this function does several things at once with 2 
> booleans, which make it not very clear to me. Maybe it would be clearer if 
> it would just return an enum (in trans, not in trans, conn error, other 
> error). Another reason to do that is that on connection error pgbench 
> could try to reconnect, which would be an interesting later extension, so 
> let's pave the way for that.  Also, I do not think that the function 
> should print out a message, it should be the caller decision to do that.

OK. I added a new enum type TStatus and I fixed the function to return it.
Also, I changed the function name to getTransactionStatus because the
actual check is done by the caller.
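
A minimal sketch of that interface, built on libpq's PQtransactionStatus (the
enum values are illustrative and may differ slightly from the committed names):

#include "libpq-fe.h"

typedef enum TStatus
{
	TSTATUS_IDLE,			/* not inside a transaction block */
	TSTATUS_IN_BLOCK,		/* inside a (possibly failed) transaction block */
	TSTATUS_CONN_ERROR,		/* the connection is broken */
	TSTATUS_OTHER_ERROR		/* unexpected state */
} TStatus;

/* Report the transaction status; the caller decides how to react and what to log. */
static TStatus
getTransactionStatus(PGconn *con)
{
	switch (PQtransactionStatus(con))
	{
		case PQTRANS_IDLE:
			return TSTATUS_IDLE;
		case PQTRANS_INTRANS:
		case PQTRANS_INERROR:
			return TSTATUS_IN_BLOCK;
		case PQTRANS_UNKNOWN:
			/* PQTRANS_UNKNOWN is expected when the connection is gone */
			if (PQstatus(con) == CONNECTION_BAD)
				return TSTATUS_CONN_ERROR;
			return TSTATUS_OTHER_ERROR;
		default:
			/* PQTRANS_ACTIVE should not happen in a synchronous client */
			return TSTATUS_OTHER_ERROR;
	}
}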

> verbose_errors: there is more or less repeated code under RETRY and 
> FAILURE, which should be factored out in a separate function. The 
> advanceConnectionFunction is long enough. Once this is done, there is no 
> need for a getLatencyUsed function.

OK. I made a function to print verbose error messages and removed the
getLatencyUsed function.
 
> I'd put cleaning up the pipeline in a function. I do not understand why 
> the pipeline mode is not exited in all cases, the code checks for the 
> pipeline status twice in a few lines. I'd put this cleanup in the sync 
> function as well, report to the caller (advanceConnectionState) if there 
> was an error, which would be managed there.

I changed the code to exit the pipeline whenever we hit an error in pipeline mode.
Also, I added a PQpipelineSync call that was missing in the previous patch.
 
> WAIT_ROLLBACK_RESULT: consumming results in a while could be a function to 
> avoid code repetition (there and in the "error:" label in 
> readCommandResponse). On the other hand, I'm not sure why the loop is 
> needed: we can only get there by submitting a "ROLLBACK" command, so there 
> should be only one result anyway?

Right. We should receive just one PGRES_COMMAND_OK and a NULL following it.
I eliminated the loop.
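
For illustration, a sketch of consuming that single expected result;
consumeRollbackResult is a hypothetical helper name, the patch does this
inline in the WAIT_ROLLBACK_RESULT branch:

#include <stdbool.h>
#include "libpq-fe.h"

/* Consume the one PGRES_COMMAND_OK expected from the ROLLBACK that was sent. */
static bool
consumeRollbackResult(PGconn *con)
{
	PGresult   *res = PQgetResult(con);
	bool		ok = (res != NULL && PQresultStatus(res) == PGRES_COMMAND_OK);

	PQclear(res);

	/* only one command was submitted, so the next result must be NULL */
	if (PQgetResult(con) != NULL)
		ok = false;

	return ok;
}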
 
> report_per_command: please always count retries and failures of commands 
> even if they will not be reported in the end, the code will be simpler and 
> more efficient.

Ok. I changed the code to count retries and failures of commands even when
report_per_command is false.
 
> doLog: the format has changed, including a new string on failures which 
> replace the time field. Hmmm. Cannot say I like it much, but why not. ISTM 
> that the string could be shorten to "deadlock" or "serialization". ISTM 
> that the documentation example should include a line with a failure, to 
> make it clear what to expect.

I fixed getResultString to return "deadlock" or "serialization" instead of
"deadlock_failure" or "serialization_failure". Also, I added an output
example to the documentation.
 
> I'm okay with always getting computing thread stats.
> 
> # COMMENTS
> 
> struct StatsData comment is helpful.
>   - "failed transactions" -> "unsuccessfully retried transactions"?

This seems an accurate description. However, "failed transaction" is
short and simple, and it is used in several places, so instead of
replacing them I added the following statement to define it:

"a failed transaction is defined as an unsuccessfully retried transaction."

>   - 'cnt' decomposition: first term is field 'retried'? if so say it
>     explicitely?

No. 'retried' includes unsuccessfully retried transactions, but 'cnt'
includes only successfully retried transactions.

> "Complete the failed transaction" sounds strange: If it failed, it could 
> not complete? I'd suggest "Record a failed transaction".

Sounds good. Fixed.

> # TESTS
> 
> I suggested to simplify the tests by using conditionals & sequences. You 
> reported that you got stuck. Hmmm.
> 
> I tried again my tests which worked fine when started with 2 clients, 
> otherwise they get stuck because the first client waits for the other one 
> which does not exists (the point is to generate deadlocks and other 
> errors). Maybe this is your issue?

That seems to be right. It got stuck when I used the -T option rather than -t;
I guess that was because the number of transactions on each thread was
different.

> Could you try with:
> 
>    psql < deadlock_prep.sql
>    pgbench -t 4 -c 2 -f deadlock.sql
>    # note: each deadlock detection takes 1 second
> 
>    psql < deadlock_prep.sql
>    pgbench -t 10 -c 2 -f serializable.sql
>    # very quick 50% serialization errors

That works. However, it still hangs when --max-tries = 2,
so I don't think we can use it for testing the retry
feature....

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
Hi Yugo and Fabien,

It seems the patch is ready for committer except for the item below. Do you guys
want to do more on it?

>> # TESTS
>> 
>> I suggested to simplify the tests by using conditionals & sequences. You 
>> reported that you got stuck. Hmmm.
>> 
>> I tried again my tests which worked fine when started with 2 clients, 
>> otherwise they get stuck because the first client waits for the other one 
>> which does not exists (the point is to generate deadlocks and other 
>> errors). Maybe this is your issue?
> 
> That seems to be right. It got stuck when I used -T option rather than -t,
> it was because, I guess, the number of transactions on each thread was
> different.
> 
>> Could you try with:
>> 
>>    psql < deadlock_prep.sql
>>    pgbench -t 4 -c 2 -f deadlock.sql
>>    # note: each deadlock detection takes 1 second
>> 
>>    psql < deadlock_prep.sql
>>    pgbench -t 10 -c 2 -f serializable.sql
>>    # very quick 50% serialization errors
> 
> That works. However, it still gets hang when --max-tries = 2,
> so maybe I would not think we can use it for testing the retry
> feature....

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Tatsuo-san,

> It seems the patch is ready for committer except below. Do you guys want 
> to do more on below?

I'm planning a new review of this significant patch, possibly over the 
next week-end, or the next.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

About Pgbench error handling v16:

This patch set needs a minor rebase because of 506035b0. Otherwise, patch 
compiles, global and local "make check" are ok. Doc generation is ok.

This patch is in good shape, the code and comments are clear.
Some minor remarks below, including typos and a few small suggestions.


## About v16-1

This refactoring patch adds a struct for managing pgbench variables, instead of
mixing fields into the client state (CState) struct.

Patch compiles, global and local "make check" are both ok.

Although this patch is not necessary to add the feature, I'm fine with it as
it improves pgbench source code readability.


## About v16-2

This last patch adds handling of serialization and deadlock errors to pgbench
transactions. This feature is desirable because it enlarges the performance testing
options, and makes pgbench behave more like a database client application.

Possible future extensions enabled by this patch include handling disconnection
errors by trying to reconnect, for instance.

The documentation is clear and well written, at least for my non-native speaker
eyes and ears.

English: "he will be aborted" -> "it will be aborted".

I'm fine with renaming --report-latencies to --report-per-command as the later
is clearer about what the options does.

I'm still not sure I like the "failure detailed" option, ISTM that the report
could be always detailed. That would remove some complexity and I do not think
that people executing a bench with error handling would mind having the details.
No big deal.

printVerboseErrorMessages: I'd make the buffer static and initialized only once
so that there is no significant malloc/free cycle involved when calling the function.

advanceConnectionState: I'd really prefer not to add new variables (res, status)
in the loop scope, and only declare them when actually needed in the state branches,
so as to avoid any unwanted interaction between states.

typo: "fullowing" -> "following"

Pipeline cleaning: the advance function is already soooo long, I'd put that in a
separate function and call it.

I think that the report should not remove data when they are 0, otherwise it makes
it harder to script around it (in failures_detailed on line 6284).

The tests cover the different cases. I tried to suggest a simpler approach
in a previous round, but it seems not so simple to do. They could be
simplified later, if possible.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

On Sat, 12 Mar 2022 15:54:54 +0100 (CET)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Hello Yugo-san,
> 
> About Pgbench error handling v16:

Thank you for your review! I attached the updated patches.
 
> This patch set needs a minor rebase because of 506035b0. Otherwise, patch 
> compiles, global and local "make check" are ok. Doc generation is ok.

I rebased it.

> ## About v16-2
 
> English: "he will be aborted" -> "it will be aborted".

Fixed.

> I'm still not sure I like the "failure detailed" option, ISTM that the report
> could be always detailed. That would remove some complexity and I do not think
> that people executing a bench with error handling would mind having the details.
> No big deal.

I didn't change it because I think those who don't expect any failures when using a
well-designed script may not need the details of failures. I think reporting such
details will be required only for benchmarks where failures are expected.

> printVerboseErrorMessages: I'd make the buffer static and initialized only once
> so that there is no significant malloc/free cycle involved when calling the function.

OK. I fixed printVerboseErrorMessages to use a static variable.
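
A sketch of the change (the signature and message text are approximations;
CState and pg_time_usec_t are pgbench's internal types):

#include "pqexpbuffer.h"
#include "common/logging.h"

static void
printVerboseErrorMessages(CState *st, pg_time_usec_t *now, bool is_retry)
{
	static PQExpBuffer buf = NULL;

	if (buf == NULL)
		buf = createPQExpBuffer();	/* allocated once, reused on later calls */

	/* printfPQExpBuffer() resets the buffer before writing into it */
	printfPQExpBuffer(buf, "client %d ", st->id);
	appendPQExpBufferStr(buf, is_retry
						 ? "repeats the transaction after the error"
						 : "ends the failed transaction");
	/* ... the error status and the retry/failure counters are appended here ... */

	pg_log_info("%s", buf->data);
}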

> advanceConnectionState: I'd really prefer not to add new variables (res, status)
> in the loop scope, and only declare them when actually needed in the state branches,
> so as to avoid any unwanted interaction between states.

I changed the code to declare the variables in the case statement blocks.

> typo: "fullowing" -> "following"

fixed.

> Pipeline cleaning: the advance function is already soooo long, I'd put that in a
> separate function and call it.

Ok. I made a new function "discardUntilSync" for the pipeline cleaning.

> I think that the report should not remove data when they are 0, otherwise it makes
> it harder to script around it (in failures_detailed on line 6284).

I changed the report to always include both serialization and deadlock failures,
even when they are 0.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
Hi Yugo,

I have looked into the patch and I noticed that <xref
linkend=... endterm=...> is used in pgbench.sgml. e.g.

<xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>

AFAIK this is the only place where "endterm" is used. In other places
"link" tag is used instead:

<link linkend="failures-and-retries">Failures and Serialization/Deadlock Retries</link>

Note that the rendered result is identical. Do we want to use the link tag as well?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
Hi Yugo,

I tested the serialization error scenario by setting:
default_transaction_isolation = 'repeatable read'
The result was:

$ pgbench -t 10 -c 10 --max-tries=10 test
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 10
number of transactions per client: 10
number of transactions actually processed: 100/100
number of failed transactions: 0 (0.000%)
number of transactions retried: 35 (35.000%)
total number of retries: 74
latency average = 5.306 ms
initial connection time = 15.575 ms
tps = 1884.516810 (without initial connection time)

I had a hard time understanding what those numbers mean:
number of transactions retried: 35 (35.000%)
total number of retries: 74

It seems "total number of retries" matches with the number of ERRORs
reported in PostgreSQL. Good. What I am not sure is "number of
transactions retried". What does this mean?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Hi Yugo,
> 
> I tested with serialization error scenario by setting:
> default_transaction_isolation = 'repeatable read'
> The result was:
> 
> $ pgbench -t 10 -c 10 --max-tries=10 test
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 10
> query mode: simple
> number of clients: 10
> number of threads: 1
> maximum number of tries: 10
> number of transactions per client: 10
> number of transactions actually processed: 100/100
> number of failed transactions: 0 (0.000%)
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> latency average = 5.306 ms
> initial connection time = 15.575 ms
> tps = 1884.516810 (without initial connection time)
> 
> I had hard time to understand what those numbers mean:
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> 
> It seems "total number of retries" matches with the number of ERRORs
> reported in PostgreSQL. Good. What I am not sure is "number of
> transactions retried". What does this mean?

Oh, ok. I see it now. It turned out that "number of transactions
retried" does not actually mean the number of transactions
retried. Suppose pgbench executes the following in a session:

BEGIN;    -- transaction A starts
:
(ERROR)
ROLLBACK; -- transaction A aborts

(retry)

BEGIN;    -- transaction B starts
:
(ERROR)
ROLLBACK; -- transaction B aborts

(retry)

BEGIN;    -- transaction C starts
:
END;    -- finally succeeds

In this case "total number of retries:" = 2 and "number of
transactions retried:" = 1. In this patch transactions A, B and C are
regarded as "same" transaction, so the retried transaction count
becomes 1. But it's confusing to use the language "transaction" here
because A, B and C are different transactions. I would think it's
better to use different language instead of "transaction", something
like "cycle"? i.e.

number of cycles retried: 35 (35.000%)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hi Ishii-san,

On Sun, 20 Mar 2022 09:52:06 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> Hi Yugo,
> 
> I have looked into the patch and I noticed that <xref
> linkend=... endterm=...> is used in pgbench.sgml. e.g.
> 
> <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>
> 
> AFAIK this is the only place where "endterm" is used. In other places
> "link" tag is used instead:

Thank you for pointing it out.

I've checked other places using <xref/> referring to <refsect2>, and found
that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it 
in this style.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Sun, 20 Mar 2022 16:11:43 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Hi Yugo,
> > 
> > I tested with serialization error scenario by setting:
> > default_transaction_isolation = 'repeatable read'
> > The result was:
> > 
> > $ pgbench -t 10 -c 10 --max-tries=10 test
> > transaction type: <builtin: TPC-B (sort of)>
> > scaling factor: 10
> > query mode: simple
> > number of clients: 10
> > number of threads: 1
> > maximum number of tries: 10
> > number of transactions per client: 10
> > number of transactions actually processed: 100/100
> > number of failed transactions: 0 (0.000%)
> > number of transactions retried: 35 (35.000%)
> > total number of retries: 74
> > latency average = 5.306 ms
> > initial connection time = 15.575 ms
> > tps = 1884.516810 (without initial connection time)
> > 
> > I had hard time to understand what those numbers mean:
> > number of transactions retried: 35 (35.000%)
> > total number of retries: 74
> > 
> > It seems "total number of retries" matches with the number of ERRORs
> > reported in PostgreSQL. Good. What I am not sure is "number of
> > transactions retried". What does this mean?
> 
> Oh, ok. I see it now. It turned out that "number of transactions
> retried" does not actually means the number of transactions
> rtried. Suppose pgbench exectutes following in a session:
> 
> BEGIN;    -- transaction A starts
> :
> (ERROR)
> ROLLBACK; -- transaction A aborts
> 
> (retry)
> 
> BEGIN;    -- transaction B starts
> :
> (ERROR)
> ROLLBACK; -- transaction B aborts
> 
> (retry)
> 
> BEGIN;    -- transaction C starts
> :
> END;    -- finally succeeds
> 
> In this case "total number of retries:" = 2 and "number of
> transactions retried:" = 1. In this patch transactions A, B and C are
> regarded as "same" transaction, so the retried transaction count
> becomes 1. But it's confusing to use the language "transaction" here
> because A, B and C are different transactions. I would think it's
> better to use different language instead of "transaction", something
> like "cycle"? i.e.
> 
> number of cycles retried: 35 (35.000%)

In the original patch by Marina Polyakova it was "number of retried",
but I changed it to "number of transactions retried" because I felt
it was easily confused with "number of retries". I chose the word "transaction"
because a transaction ends in any one of successful commit, skip, or
failure, after possible retries.

Well, I agree that the wording is somewhat confusing. If we can find
a nicer word to resolve the confusion, I don't mind changing it.
Maybe we can use "executions" as well as "cycles". However, I am not sure
that the situation would be improved by such a word, because what it
exactly means would still be unclear to users.

Another idea is to instead report only "the number of successfully
retried transactions", which does not include "failed transactions",
that is, transactions that failed after retries, like this:

 number of transactions actually processed: 100/100
 number of failed transactions: 0 (0.000%)
 number of successfully retried transactions: 35 (35.000%)
 total number of retries: 74 

The meaning is clear and there seems to be no confusion.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> On Sun, 20 Mar 2022 16:11:43 +0900 (JST)
> Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> 
>> > Hi Yugo,
>> > 
>> > I tested with serialization error scenario by setting:
>> > default_transaction_isolation = 'repeatable read'
>> > The result was:
>> > 
>> > $ pgbench -t 10 -c 10 --max-tries=10 test
>> > transaction type: <builtin: TPC-B (sort of)>
>> > scaling factor: 10
>> > query mode: simple
>> > number of clients: 10
>> > number of threads: 1
>> > maximum number of tries: 10
>> > number of transactions per client: 10
>> > number of transactions actually processed: 100/100
>> > number of failed transactions: 0 (0.000%)
>> > number of transactions retried: 35 (35.000%)
>> > total number of retries: 74
>> > latency average = 5.306 ms
>> > initial connection time = 15.575 ms
>> > tps = 1884.516810 (without initial connection time)
>> > 
>> > I had hard time to understand what those numbers mean:
>> > number of transactions retried: 35 (35.000%)
>> > total number of retries: 74
>> > 
>> > It seems "total number of retries" matches with the number of ERRORs
>> > reported in PostgreSQL. Good. What I am not sure is "number of
>> > transactions retried". What does this mean?
>> 
>> Oh, ok. I see it now. It turned out that "number of transactions
>> retried" does not actually means the number of transactions
>> rtried. Suppose pgbench exectutes following in a session:
>> 
>> BEGIN;    -- transaction A starts
>> :
>> (ERROR)
>> ROLLBACK; -- transaction A aborts
>> 
>> (retry)
>> 
>> BEGIN;    -- transaction B starts
>> :
>> (ERROR)
>> ROLLBACK; -- transaction B aborts
>> 
>> (retry)
>> 
>> BEGIN;    -- transaction C starts
>> :
>> END;    -- finally succeeds
>> 
>> In this case "total number of retries:" = 2 and "number of
>> transactions retried:" = 1. In this patch transactions A, B and C are
>> regarded as "same" transaction, so the retried transaction count
>> becomes 1. But it's confusing to use the language "transaction" here
>> because A, B and C are different transactions. I would think it's
>> better to use different language instead of "transaction", something
>> like "cycle"? i.e.
>> 
>> number of cycles retried: 35 (35.000%)

I realized that the same argument can be applied even to "number of
transactions actually processed", because with the retry feature a
"transaction" can comprise multiple transactions.

But if we go forward and replace those "transactions" with "cycles"
(or whatever) altogether, it could probably bring enough confusion to
users who have been using pgbench. We should probably give up on changing
the terminology and instead redefine "transaction" when the retry feature is
enabled, along the lines of "when the retry feature is enabled, each transaction
can consist of multiple retried transactions."

> In the original patch by Marina Polyakova it was "number of retried", 
> but I changed it to "number of transactions retried" is because I felt
> it was confusing with "number of retries". I chose the word "transaction"
> because a transaction ends in any one of successful commit , skipped, or
> failure, after possible retries. 

Ok.

> Well, I agree with that it is somewhat confusing wording. If we can find
> nice word to resolve the confusion, I don't mind if we change the word. 
> Maybe, we can use "executions" as well as "cycles". However, I am not sure
> that the situation is improved by using such word because what such word
> exactly means seems to be still unclear for users. 
> 
> Another idea is instead reporting only "the number of successfully
> retried transactions" that does not include "failed transactions", 
> that is, transactions failed after retries, like this;
> 
>  number of transactions actually processed: 100/100
>  number of failed transactions: 0 (0.000%)
>  number of successfully retried transactions: 35 (35.000%)
>  total number of retries: 74 
> 
> The meaning is clear and there seems to be no confusion.

Thank you for the suggestion. But I think it would be better to leave it
as it is, for the reason I mentioned above.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Tue, 22 Mar 2022 09:08:15 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> Hi Ishii-san,
> 
> On Sun, 20 Mar 2022 09:52:06 +0900 (JST)
> Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> 
> > Hi Yugo,
> > 
> > I have looked into the patch and I noticed that <xref
> > linkend=... endterm=...> is used in pgbench.sgml. e.g.
> > 
> > <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>
> > 
> > AFAIK this is the only place where "endterm" is used. In other places
> > "link" tag is used instead:
> 
> Thank you for pointing out it. 
> 
> I've checked other places using <xref/> referring to <refsect2>, and found
> that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it 
> in this style.

I attached the updated patch. I also fixed the following paragraph which I had
forgotten to fix in the previous patch.

 The first seven lines report some of the most important parameter settings.
 The sixth line reports the maximum number of tries for transactions with
 serialization or deadlock errors

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> I've checked other places using <xref/> referring to <refsect2>, and found
>> that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it 
>> in this style.
> 
> I attached the updated patch. I also fixed the following paragraph which I had
> forgotten to fix in the previous patch.
> 
>  The first seven lines report some of the most important parameter settings.
>  The sixth line reports the maximum number of tries for transactions with
>  serialization or deadlock errors

Thank you for the updated patch. I think the patches look good and now
it's ready for commit. If there's no objection, I would like to
commit/push the patches.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> I attached the updated patch. I also fixed the following paragraph which I had
>> forgotten to fix in the previous patch.
>> 
>>  The first seven lines report some of the most important parameter settings.
>>  The sixth line reports the maximum number of tries for transactions with
>>  serialization or deadlock errors
> 
> Thank you for the updated patch. I think the patches look good and now
> it's ready for commit. If there's no objection, I would like to
> commit/push the patches.

The patch has been pushed. Thank you!

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> The patch Pushed. Thank you!

My hoary animal prairiedog doesn't like this [1]:

#   Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)/'
#   at t/001_pgbench_with_server.pl line 1229.
#                   'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
...
# pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR:  could not serialize access due to concurrent update
...
# '
#     doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)'
# Looks like you failed 1 test of 425.

I'm not sure what the "\\b.*\\g1" part of this regex is meant to
accomplish, but it seems to be assuming more than it should
about the output format of TAP messages.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2022-03-23%2013%3A21%3A44



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 23 Mar 2022 14:26:54 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> > The patch Pushed. Thank you!
> 
> My hoary animal prairiedog doesn't like this [1]:
> 
> #   Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)/'
> #   at t/001_pgbench_with_server.pl line 1229.
> #                   'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
> ...
> # pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR:  could not serialize access due to concurrent update
> ...
> # '
> #     doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)'
> # Looks like you failed 1 test of 425.
> 
> I'm not sure what the "\\b.*\\g1" part of this regex is meant to
> accomplish, but it seems to be assuming more than it should
> about the output format of TAP messages.

I had edited the test code from the original patch by mistake, but
I did not notice it because the test somehow passes on my machine without any
errors.

I attached a patch to restore the test as it was in the original patch, where
backreferences are used to check that the same query is retried.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> My hoary animal prairiedog doesn't like this [1]:
>> 
>> #   Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)/'
>> #   at t/001_pgbench_with_server.pl line 1229.
>> #                   'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
>> ...
>> # pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR:  could not serialize access due to concurrent update
>> ...
>> # '
>> #     doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)'
>> # Looks like you failed 1 test of 425.
>> 
>> I'm not sure what the "\\b.*\\g1" part of this regex is meant to
>> accomplish, but it seems to be assuming more than it should
>> about the output format of TAP messages.
> 
> I have edited the test code from the original patch by mistake, but
> I did not notice because the test somehow works on my machine without
> any errors.
> 
> I attached a patch to restore the test as it was in the original patch,
> where backreferences are used to check that the same query is retried.

My machine (Ubuntu 20) did not complain either. Maybe a Perl version
difference?  Anyway, the fix has been pushed. Let's see how prairiedog feels.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> My hoary animal prairiedog doesn't like this [1]:

> My machine (Ubuntu 20) did not complain either. Maybe a Perl version
> difference?  Anyway, the fix has been pushed. Let's see how prairiedog feels.

Still not happy.  After some digging in man pages, I believe the
problem is that its old version of Perl does not understand "\gN"
backreferences.  Is there a good reason to be using that rather
than the traditional "\N" backref notation?

            regards, tom lane
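
A minimal Perl sketch of the portability point (not from the thread; the
sample string is invented): the \gN spelling of backreferences was only added
in Perl 5.10, so older Perls such as 5.8.x do not treat it as a backreference,
while the traditional \N form works everywhere.

use strict;
use warnings;

my $log = "client 0 sending UPDATE\n"
        . "client 0 got an error\n";

# Traditional backreference: understood by every Perl version.
print "matched client $1\n" if $log =~ /client (0|1) sending.*client \1 got/s;

# The \g1 spelling below means the same thing but needs Perl >= 5.10;
# on older Perls it is not parsed as a backreference, so the pattern
# does not match (the buildfarm failure above).
# print "matched\n" if $log =~ /client (0|1) sending.*client \g1 got/s;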



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> My machine (Ubuntu 20) did not complain either. Maybe a Perl version
>> difference?  Anyway, the fix has been pushed. Let's see how prairiedog feels.
> 
> Still not happy.  After some digging in man pages, I believe the
> problem is that its old version of Perl does not understand "\gN"
> backreferences.  Is there a good reason to be using that rather
> than the traditional "\N" backref notation?

I don't see a reason to use "\gN" either. Actually, after applying the
attached patch, my machine is still happy with the pgbench test.

Yugo?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 60cae1e843..22a23489e8 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1224,7 +1224,7 @@ my $err_pattern =
     "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*"
   . "client \\g2 got an error in command 3 \\(SQL\\) of script 0; "
   . "ERROR:  could not serialize access due to concurrent update\\b.*"
-  . "\\g1";
+  . "\\1";
 
 $node->pgbench(
     "-n -c 2 -t 1 -d --verbose-errors --max-tries 2",

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> I don't see a reason to use "\gN" either. Actually, after applying the
> attached patch, my machine is still happy with the pgbench test.

Note that the \\g2 just above also needs to be changed.

            regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Note that the \\g2 just above also needs to be changed.

Oops. Thanks. New patch attached. Test has passed on my machine.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 60cae1e843..ca71f968dc 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1222,9 +1222,9 @@ local $ENV{PGOPTIONS} = "-c default_transaction_isolation=repeatable\\ read";
 # delta variable in the next try
 my $err_pattern =
     "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*"
-  . "client \\g2 got an error in command 3 \\(SQL\\) of script 0; "
+  . "client \\2 got an error in command 3 \\(SQL\\) of script 0; "
   . "ERROR:  could not serialize access due to concurrent update\\b.*"
-  . "\\g1";
+  . "\\1";
 
 $node->pgbench(
     "-n -c 2 -t 1 -d --verbose-errors --max-tries 2",

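A minimal sketch of what this pattern checks (not part of the patch; the
stderr text below is invented, only shaped so the pattern can match): group 2
ties the error message to the client that sent the UPDATE, and the trailing
\1 requires the retried attempt to repeat exactly the same statement, i.e.
the same delta.

use strict;
use warnings;

my $err_pattern =
    "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*"
  . "client \\2 got an error in command 3 \\(SQL\\) of script 0; "
  . "ERROR:  could not serialize access due to concurrent update\\b.*"
  . "\\1";

# Invented stderr: the failed UPDATE and its retry use the same delta (17).
my $stderr =
    "client 0 sending UPDATE xy SET y = y + 17\n"
  . "client 0 got an error in command 3 (SQL) of script 0; "
  . "ERROR:  could not serialize access due to concurrent update\n"
  . "client 0 sending UPDATE xy SET y = y + 17\n";

print "retry of the same query detected\n" if $stderr =~ /$err_pattern/s;

If the retried UPDATE used a different delta, the trailing \1 would not match
and the check would fail.
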
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Oops. Thanks. New patch attached. Test has passed on my machine.

I reproduced the failure on another machine with perl 5.8.8,
and I can confirm that this patch fixes it.

            regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Fri, 25 Mar 2022 09:14:00 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Note that the \\g2 just above also needs to be changed.
> 
> Oops. Thanks. New patch attached. Test has passed on my machine.

This patch works for me. I think it is ok to use \N instead of \gN.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> I reproduced the failure on another machine with perl 5.8.8,
> and I can confirm that this patch fixes it.

Thank you for the test. I have pushed the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> Oops. Thanks. New patch attached. Test has passed on my machine.
> 
> This patch works for me. I think it is ok to use \N instead of \gN.

Thanks. Patch pushed.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Thanks. Patch pushed.

This patch has caused the PDF documentation to fail to build cleanly:

[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)

It's complaining about this:

<synopsis>
<replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
</synopsis>

which runs much too wide in HTML format too, even though that toolchain
doesn't tell you so.

We could silence the warning by inserting an arbitrary line break or two,
or refactoring the syntax description into multiple parts.  Either way
seems to create a risk of confusion.

TBH, I think the *real* problem is that the complexity of this log format
has blown past "out of hand".  Can't we simplify it?  Who is really going
to use all these numbers?  I pity the poor sucker who tries to write a
log analysis tool that will handle all the variants.

            regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> This patch has caused the PDF documentation to fail to build cleanly:
> 
> [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)
> 
> It's complaining about this:
> 
> <synopsis>
> <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
> </synopsis>
> 
> which runs much too wide in HTML format too, even though that toolchain
> doesn't tell you so.

Yeah.

> We could silence the warning by inserting an arbitrary line break or two,
> or refactoring the syntax description into multiple parts.  Either way
> seems to create a risk of confusion.

I think we can fold the line nicely. Here is the rendered image.

Before:
interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

After:
interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

Note that before it was like this:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ]

So the newly added items are "{ failures | serialization_failures deadlock_failures }" and "[ retried retries ]".

> TBH, I think the *real* problem is that the complexity of this log format
> has blown past "out of hand".  Can't we simplify it?  Who is really going
> to use all these numbers?  I pity the poor sucker who tries to write a
> log analysis tool that will handle all the variants.

Well, the extra logging items above only appear when the retry feature
is enabled. For those who do not use the feature, the only new logging
item is "failures". For those who use the feature, the extra logging
items are clearly necessary. For example, if we write an application
using the repeatable read or serializable transaction isolation mode,
retrying transactions that failed due to a serialization error is an
essential technique. Also, the retry rate of transactions will strongly
affect performance, and in such use cases the newly added items will be
precious information. I would suggest leaving the log items as they are.

Patch attached.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ebdb4b3f46..b65b813ebe 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -2398,10 +2398,11 @@ END;
 
   <para>
    With the <option>--aggregate-interval</option> option, a different
-   format is used for the log files:
+   format is used for the log files (note that the actual log line is not folded).
 
 <synopsis>
-<replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
+  <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable>
+  { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
 </synopsis>
 
    where

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Sun, 27 Mar 2022 15:28:41 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > This patch has caused the PDF documentation to fail to build cleanly:
> > 
> > [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)
> > 
> > It's complaining about this:
> > 
> > <synopsis>
> > <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
> > </synopsis>
> > 
> > which runs much too wide in HTML format too, even though that toolchain
> > doesn't tell you so.
> 
> Yeah.
> 
> > We could silence the warning by inserting an arbitrary line break or two,
> > or refactoring the syntax description into multiple parts.  Either way
> > seems to create a risk of confusion.
> 
> I think we can fold the line nicely. Here is the rendered image.
> 
> Before:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]
> 
> After:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
>   { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]
> 
> Note that before it was like this:
> 
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ]
> 
> So the newly added items are "{ failures | serialization_failures deadlock_failures }" and "[ retried retries ]".
> 
> > TBH, I think the *real* problem is that the complexity of this log format
> > has blown past "out of hand".  Can't we simplify it?  Who is really going
> > to use all these numbers?  I pity the poor sucker who tries to write a
> > log analysis tool that will handle all the variants.
> 
> Well, the extra logging items above only appear when the retry feature
> is enabled. For those who do not use the feature, the only new logging
> item is "failures". For those who use the feature, the extra logging
> items are clearly necessary. For example, if we write an application
> using the repeatable read or serializable transaction isolation mode,
> retrying transactions that failed due to a serialization error is an
> essential technique. Also, the retry rate of transactions will strongly
> affect performance, and in such use cases the newly added items will be
> precious information. I would suggest leaving the log items as they are.
> 
> Patch attached.

Even after applying this patch, "make postgres-A4.pdf" still raises the warning
on my machine. After some investigation, I found that the previous document had
a break after 'num_transactions', but it was removed by this commit. So, I
would like to put it back as it was. I attached the patch.

Regards,
Yugo Nagata


-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Even after applying this patch, "make postgres-A4.pdf" still raises the warning
> on my machine. After some investigation, I found that the previous document had
> a break after 'num_transactions', but it was removed by this commit.

Yes, your patch removed "&zwsp;".

> So, I would like to put it back as it was. I attached the patch.

This produces errors. Needs ";" postfix?

ref/pgbench.sgml:2404: parser error : EntityRef: expecting ';'
le>interval_start</replaceable> <replaceable>num_transactions</replaceable>&zwsp
                                                                               ^
ref/pgbench.sgml:2781: parser error : chunk is not well balanced

^
reference.sgml:251: parser error : Failure to process entity pgbench
   &pgbench;
            ^
reference.sgml:251: parser error : Entity 'pgbench' not defined
   &pgbench;
            ^
reference.sgml:296: parser error : chunk is not well balanced

^
postgres.sgml:240: parser error : Failure to process entity reference
 &reference;
            ^
postgres.sgml:240: parser error : Entity 'reference' not defined
 &reference;
            ^
make: *** [Makefile:135: html-stamp] エラー 1

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Mon, 28 Mar 2022 12:17:13 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Even after applying this patch, "make postgres-A4.pdf" still raises the warning
> > on my machine. After some investigation, I found that the previous document had
> > a break after 'num_transactions', but it was removed by this commit.
> 
> Yes, your patch removed "&zwsp;".
> 
> > So, I would like to put it back as it was. I attached the patch.
> 
> This produces errors. Needs ";" postfix?

Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
I attached the fixed patch.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> > Even after applying this patch, "make postgres-A4.pdf" still raises the warning
>> > on my machine. After some investigation, I found that the previous document had
>> > a break after 'num_transactions', but it was removed by this commit.
>> 
>> Yes, your patch removed "&zwsp;".
>> 
>> > So, I would like to put it back as it was. I attached the patch.
>> 
>> This produces errors. Needs ";" postfix?
> 
> Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
> I attached the fixed patch.

The basic problem with this patch is that it may solve the issue with PDF
generation, but it does not solve the issue with HTML generation. The
PDF manual of pgbench has a ridiculously long line, which Tom Lane also
complained about:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

Why can't we just use line feeds instead of &zwsp;? Although it's not
a command usage, the SELECT manual already uses line feeds to nicely
break its synopsis into multiple lines.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>>> > Even after applying this patch, "make postgres-A4.pdf" still raises the warning
>>> > on my machine. After some investigation, I found that the previous document had
>>> > a break after 'num_transactions', but it was removed by this commit.
>>> 
>>> Yes, your patch removed "&zwsp;".
>>> 
>>> > So, I would like to put it back as it was. I attached the patch.
>>> 
>>> This produces errors. Needs ";" postfix?
>> 
>> Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
>> I attached the fixed patch.
> 
> The basic problem with this patch is that it may solve the issue with PDF
> generation, but it does not solve the issue with HTML generation. The
> PDF manual of pgbench has a ridiculously long line, which Tom Lane also
I meant "HTML manual" here.

> complained about:
> 
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]
> 
> Why can't we just use line feeds instead of &zwsp;? Although it's not
> a command usage, the SELECT manual already uses line feeds to nicely
> break its synopsis into multiple lines.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
Hello,

On 2022-Mar-27, Tatsuo Ishii wrote:

> After:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
>   { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

You're showing an indentation, but looking at the HTML output there is
no such indentation.  Is the HTML processor eating leading whitespace or
something like that?

I think that the explanatory paragraph is way too long now, particularly
since it explains --failures-detailed starting in the middle.  Also, the
example output doesn't include the failures-detailed mode.  I suggest
that this should be broken down even more; first to explain the output
without failures-detailed, including an example, and then the output
with failures-detailed, and an example of that.  Something like this,
perhaps:

Aggregated Logging
With the --aggregate-interval option, a different format is used for the log files (note that the actual log line is not folded).

  interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  failures [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

where interval_start is the start of the interval (as a Unix epoch time
stamp), num_transactions is the number of transactions within the interval,
sum_latency is the sum of the transaction latencies within the interval,
sum_latency_2 is the sum of squares of the transaction latencies within the
interval, min_latency is the minimum latency within the interval, and
max_latency is the maximum latency within the interval; failures is the
number of transactions that ended with a failed SQL command within the
interval.

The next fields, sum_lag, sum_lag_2, min_lag, and max_lag, are only present
if the --rate option is used. They provide statistics about the time each
transaction had to wait for the previous one to finish, i.e., the difference
between each transaction's scheduled start time and the time it actually
started. The next field, skipped, is only present if the --latency-limit
option is used, too. It counts the number of transactions skipped because
they would have started too late. The retried and retries fields are present
only if the --max-tries option is not equal to 1. They report the number of
retried transactions and the sum of all retries after serialization or
deadlock errors within the interval. Each transaction is counted in the
interval when it was committed.

Notice that while the plain (unaggregated) log file shows which script was
used for each transaction, the aggregated log does not. Therefore if you need
per-script data, you need to aggregate the data on your own.

Here is some example output:

1345828501 5601 1542744 483552416 61 2573 0
1345828503 7884 1979812 565806736 60 1479 0
1345828505 7208 1979422 567277552 59 1391 0
1345828507 7685 1980268 569784714 60 1398 0
1345828509 7073 1979779 573489941 236 1411 0

If you use option --failures-detailed, instead of the sum of all failed
transactions you will get more detailed statistics for the failed
transactions:

  interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  serialization_failures deadlock_failures [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

This is similar to the above, but here the single 'failures' figure is
replaced by serialization_failures, which is the number of transactions that
got a serialization error and were not retried after this, and
deadlock_failures, which is the number of transactions that got a deadlock
error and were not retried after this. The other fields are as above. Here is
some example output:

[example with detailed failures]

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."                   (New York Times, about Microsoft PowerPoint)
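
A minimal sketch (not from the thread; it reuses the example line quoted
above and assumes the default aggregated format, i.e. no --rate,
--latency-limit, --failures-detailed, or retries) of how a log analysis
script might read such a line:

use strict;
use warnings;

# One line of the example aggregated output shown above:
# interval_start num_transactions sum_latency sum_latency_2
# min_latency max_latency failures
my $line = "1345828501 5601 1542744 483552416 61 2573 0";

my ($interval_start, $num_transactions, $sum_latency, $sum_latency_2,
    $min_latency, $max_latency, $failures) = split ' ', $line;

my $avg_latency = $num_transactions ? $sum_latency / $num_transactions : 0;
printf "interval %s: %d transactions, average latency %.1f, %d failures\n",
    $interval_start, $num_transactions, $avg_latency, $failures;

Lines produced with --rate, --latency-limit, --failures-detailed, or retries
enabled would carry the additional fields described above, so a real tool
would need to know which options were used.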