Discussion: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

[HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
Hello, hackers!

At present, in pgbench we can test only transactions with the Read Committed 
isolation level, because client sessions are disconnected forever on 
serialization failures. There have been some proposals and discussions about 
this (see the message here [1] and the thread here [2]).

I suggest a patch where pgbench client sessions are not disconnected 
because of serialization or deadlock failures, and these failures are 
mentioned in reports. In detail:
- a transaction with one of these failures continues to run normally, but its 
result is rolled back;
- if there were such failures during script execution, this "transaction" is 
marked appropriately in the logs;
- the numbers of "transactions" with these failures are printed in the progress 
output, in the aggregation logs and at the end with the other results (in 
total and for each script);

Advanced options:
- mostly for testing built-in scripts: you can set the default 
transaction isolation level with the appropriate benchmarking option (-I);
- for more detailed reports: to see per-statement serialization and 
deadlock failures, you can use the appropriate benchmarking option 
(--report-failures).

Also included: TAP tests for the new functionality and updated documentation 
with new examples.

Patches are attached. Any suggestions are welcome!

P.S. Does this use case (do not retry transaction with serialization or 
deadlock failure) is most interesting or failed transactions should be 
retried (and how much times if there seems to be no hope of success...)?

[1] 
https://www.postgresql.org/message-id/4EC65830020000250004323F%40gw.wicourts.gov
[2] 

https://www.postgresql.org/message-id/flat/alpine.DEB.2.02.1305182259550.1473%40localhost6.localdomain6#alpine.DEB.2.02.1305182259550.1473@localhost6.localdomain6

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Robert Haas
Date:
On Wed, Jun 14, 2017 at 4:48 AM, Marina Polyakova
<m.polyakova@postgrespro.ru> wrote:
> Now in pgbench we can test only transactions with Read Committed isolation
> level because client sessions are disconnected forever on serialization
> failures. There were some proposals and discussions about it (see message
> here [1] and thread here [2]).
>
> I suggest a patch where pgbench client sessions are not disconnected because
> of serialization or deadlock failures and these failures are mentioned in
> reports. In details:
> - transaction with one of these failures continue run normally, but its
> result is rolled back;
> - if there were these failures during script execution this "transaction" is
> marked
> appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in
> aggregation logs and in the end with other results (all and for each
> script);
>
> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);
> - for more detailed reports: to know per-statement serialization and
> deadlock failures you can use the appropriate benchmarking option
> (--report-failures).
>
> Also: TAP tests for new functionality and changed documentation with new
> examples.
>
> Patches are attached. Any suggestions are welcome!

Sounds like a good idea.  Please add to the next CommitFest and review
somebody else's patch in exchange for having your own patch reviewed.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Sounds like a good idea.

Thank you!

> Please add to the next CommitFest

Done: https://commitfest.postgresql.org/14/1170/

> and review
> somebody else's patch in exchange for having your own patch reviewed.

Of course, I remember about it.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Andres Freund
Date:
Hi,

On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
> Now in pgbench we can test only transactions with Read Committed isolation
> level because client sessions are disconnected forever on serialization
> failures. There were some proposals and discussions about it (see message
> here [1] and thread here [2]).

> I suggest a patch where pgbench client sessions are not disconnected because
> of serialization or deadlock failures and these failures are mentioned in
> reports.

I think that's a good idea and sorely needed.


> In details:


> - if there were these failures during script execution this "transaction" is
> marked
> appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in
> aggregation logs and in the end with other results (all and for each
> script);

I guess that'll include a 'rolled-back %' or 'retried %' somewhere?


> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);

I'm less convinced of the need for that; you can already set arbitrary
connection options with
PGOPTIONS='-c default_transaction_isolation=serializable' pgbench


> P.S. Does this use case (do not retry transaction with serialization or
> deadlock failure) is most interesting or failed transactions should be
> retried (and how much times if there seems to be no hope of success...)?

I can't quite parse that sentence, could you restate?

- Andres



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Kevin Grittner
Date:
On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:

>> I suggest a patch where pgbench client sessions are not disconnected because
>> of serialization or deadlock failures and these failures are mentioned in
>> reports.
>
> I think that's a good idea and sorely needed.

+1

>> P.S. Does this use case (do not retry transaction with serialization or
>> deadlock failure) is most interesting or failed transactions should be
>> retried (and how much times if there seems to be no hope of success...)?
>
> I can't quite parse that sentence, could you restate?

The way I read it was that the most interesting solution would retry
a transaction from the beginning on a serialization failure or
deadlock failure.  Most people who use serializable transactions (at
least in my experience) run through a framework that does that
automatically, regardless of what client code initiated the
transaction.  These retries are generally hidden from the client
code -- it just looks like the transaction took a bit longer.
Sometimes people will have a limit on the number of retries.  I
never used such a limit and never had a problem, because our
implementation of serializable transactions will not throw a
serialization failure error until one of the transactions involved
in causing it has successfully committed -- meaning that the retry
can only hit this again on a *new* set of transactions.

Essentially, the transaction should only count toward the TPS rate
when it eventually completes without a serialization failure.

Marina, did I understand you correctly?

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
Kevin Grittner wrote:
> On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote:
> > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:

> >> P.S. Does this use case (do not retry transaction with serialization or
> >> deadlock failure) is most interesting or failed transactions should be
> >> retried (and how much times if there seems to be no hope of success...)?
> >
> > I can't quite parse that sentence, could you restate?
> 
> The way I read it was that the most interesting solution would retry
> a transaction from the beginning on a serialization failure or
> deadlock failure.

As far as I understand her proposal, it is exactly the opposite -- if a
transaction fails, it is discarded.  And this P.S. note is asking
whether this is a good idea, or would we prefer that failing
transactions are retried.

I think it's pretty obvious that transactions that failed with
some serializability problem should be retried.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Thomas Munro
Date:
On Fri, Jun 16, 2017 at 9:18 AM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:
>> On Thu, Jun 15, 2017 at 2:16 PM, Andres Freund <andres@anarazel.de> wrote:
>> > On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
>
>> >> P.S. Does this use case (do not retry transaction with serialization or
>> >> deadlock failure) is most interesting or failed transactions should be
>> >> retried (and how much times if there seems to be no hope of success...)?
>> >
>> > I can't quite parse that sentence, could you restate?
>>
>> The way I read it was that the most interesting solution would retry
>> a transaction from the beginning on a serialization failure or
>> deadlock failure.
>
> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.
>
> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

+1 for retry with reporting of retry rates

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Kevin Grittner
Date:
On Thu, Jun 15, 2017 at 4:18 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Kevin Grittner wrote:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.
>
> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

Agreed all around.

-- 
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Hi,

Hello!

> I think that's a good idea and sorely needed.

Thanks, I'm very glad to hear it!

>> - if there were these failures during script execution this 
>> "transaction" is
>> marked
>> appropriately in logs;
>> - numbers of "transactions" with these failures are printed in 
>> progress, in
>> aggregation logs and in the end with other results (all and for each
>> script);
> 
> I guess that'll include a "rolled-back %' or 'retried %' somewhere?

Not exactly, see documentation:

+   If transaction has serialization / deadlock failure or them both 
(last thing
+   is possible if used script contains several transactions; see
+   <xref linkend="transactions-and-scripts"
+   endterm="transactions-and-scripts-title"> for more information), its
+   <replaceable>time</> will be reported as <literal>serialization 
failure</> /
+   <literal>deadlock failure</> /
+   <literal>serialization and deadlock failures</> appropriately.

+   Example with serialization, deadlock and both these failures:
+<screen>
+1 128 24968 0 1496759158 426984
+0 129 serialization failure 0 1496759158 427023
+3 129 serialization failure 0 1496759158 432662
+2 128 serialization failure 0 1496759158 432765
+0 130 deadlock failure 0 1496759159 460070
+1 129 serialization failure 0 1496759160 485188
+2 129 serialization and deadlock failures 0 1496759160 485339
+4 130 serialization failure 0 1496759160 485465
+</screen>

I have understood from the proposals in the next messages of this thread that 
the most interesting case is to retry a failed transaction. Do you think it's 
better to write, for example, 'rolled-back after % retries (serialization 
failure)' or 'time (retried % times, serialization and deadlock 
failures)'?

>> Advanced options:
>> - mostly for testing built-in scripts: you can set the default 
>> transaction
>> isolation level by the appropriate benchmarking option (-I);
> 
> I'm less convinced of the need of htat, you can already set arbitrary
> connection options with
> PGOPTIONS='-c default_transaction_isolation=serializable' pgbench

Oh, thanks, I forgot about it =[

>> P.S. Does this use case (do not retry transaction with serialization 
>> or
>> deadlock failure) is most interesting or failed transactions should be
>> retried (and how much times if there seems to be no hope of 
>> success...)?

> I can't quite parse that sentence, could you restate?

Álvaro Herrera, later in this thread, understood my text correctly:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

With his explanation, has my text become clearer?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
>>> P.S. Does this use case (do not retry transaction with serialization 
>>> or
>>> deadlock failure) is most interesting or failed transactions should 
>>> be
>>> retried (and how much times if there seems to be no hope of 
>>> success...)?
>> 
>> I can't quite parse that sentence, could you restate?
> 
> The way I read it was that the most interesting solution would retry
> a transaction from the beginning on a serialization failure or
> deadlock failure.  Most people who use serializable transactions (at
> least in my experience) run though a framework that does that
> automatically, regardless of what client code initiated the
> transaction.  These retries are generally hidden from the client
> code -- it just looks like the transaction took a bit longer.
> Sometimes people will have a limit on the number of retries.  I
> never used such a limit and never had a problem, because our
> implementation of serializable transactions will not throw a
> serialization failure error until one of the transactions involved
> in causing it has successfully committed -- meaning that the retry
> can only hit this again on a *new* set of transactions.
> 
> Essentially, the transaction should only count toward the TPS rate
> when it eventually completes without a serialization failure.
> 
> Marina, did I understand you correctly?

Álvaro Herrera, in the next message of this thread, understood my text 
correctly:

> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

And thank you very much for your explanation of how and why failed 
transactions should be retried! I'll try to implement all of it.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
>> >> P.S. Does this use case (do not retry transaction with serialization or
>> >> deadlock failure) is most interesting or failed transactions should be
>> >> retried (and how much times if there seems to be no hope of success...)?
>> >
>> > I can't quite parse that sentence, could you restate?
>> 
>> The way I read it was that the most interesting solution would retry
>> a transaction from the beginning on a serialization failure or
>> deadlock failure.
> 
> As far as I understand her proposal, it is exactly the opposite -- if a
> transaction fails, it is discarded.  And this P.S. note is asking
> whether this is a good idea, or would we prefer that failing
> transactions are retried.

Yes, that's what I meant, thank you!

> I think it's pretty obvious that transactions that failed with
> some serializability problem should be retried.

Thank you for your vote :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Kevin Grittner
Date:
On Fri, Jun 16, 2017 at 5:31 AM, Marina Polyakova
<m.polyakova@postgrespro.ru> wrote:

> And thank you very much for your explanation how and why transactions with
> failures should be retried! I'll try to implement all of it.

To be clear, part of "retrying from the beginning" means that if a
result from one statement is used to determine the content (or
whether to run) a subsequent statement, that first statement must be
run in the new transaction and the results evaluated again to
determine what to use for the later statement.  You can't simply
replay the statements that were run during the first try.  For
examples, to help get a feel of why that is, see:

https://wiki.postgresql.org/wiki/SSI
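
For instance, a minimal sketch (hypothetical table, column and values) of a
transaction where a later statement depends on an earlier result, so the
whole thing has to be re-evaluated on a retry:

  BEGIN ISOLATION LEVEL SERIALIZABLE;
  -- Read the current balance; the client inspects the returned value...
  SELECT balance FROM accounts WHERE id = 1;
  -- ...and issues the withdrawal only if that balance was sufficient.
  -- On a retry the SELECT must be run again and the decision re-evaluated,
  -- because the balance may have changed since the failed attempt.
  UPDATE accounts SET balance = balance - 100 WHERE id = 1;
  COMMIT;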

--
Kevin Grittner
VMware vCenter Server
https://www.vmware.com/



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> To be clear, part of "retrying from the beginning" means that if a
> result from one statement is used to determine the content (or
> whether to run) a subsequent statement, that first statement must be
> run in the new transaction and the results evaluated again to
> determine what to use for the later statement.  You can't simply
> replay the statements that were run during the first try.  For
> examples, to help get a feel of why that is, see:


Thank you again! :))

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

A few comments about the submitted patches.

I agree that improving the error handling ability of pgbench is a good 
thing, although I'm not sure about the implications...

About the "retry" discussion: I agree that retry is the relevant option 
from an application point of view.

ISTM that the retry implementation should be implemented somehow in the 
automaton, restarting the same script from the beginning.

As pointed out in the discussion, the same values/commands should be 
executed, which suggests that randomly generated values should be the same 
on the retry runs, so that for a simple script the same operations are 
attempted. This means that the random generator state must be kept & 
reinstated for a client on retries. Currently the random state is in the 
thread, which is not convenient for this purpose, so it should be moved into 
the client so that it can be saved at transaction start and reinstated on 
retries.

The number of retries and maybe failures should be counted, maybe with 
some adjustable maximum, as suggested.

About 0001:

In accumStats, just use one level of if; the two levels bring nothing.

In doLog, added columns should be at the end of the format. The number of 
columns MUST NOT change when different issues arise, so that the output works 
well with cut/... Unix commands; inserting a sentence such as "serialization 
and deadlock failures" is a bad idea.

threadRun: the point of the progress format is to fit on one not too wide 
line on a terminal and to allow some simple automatic processing. Adding a 
verbose sentence in the middle of it is not the way to go.

About tests: I do not understand why test 003 includes 2 transactions. 
It would seem more logical to have two scripts.

About 0003:

I'm not sure that there should be a new option to report failures; the 
information, when relevant, should be integrated in a clean format into the 
existing reports... Maybe the "per command latency" report/option should 
be renamed if it becomes more general.

About 0004:

The documentation must not be in a separate patch, but in the same patch 
as the corresponding code.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Hello Marina,

Hello, Fabien!

> A few comments about the submitted patches.

Thank you very much for them!

> I agree that improving the error handling ability of pgbench is a good
> thing, although I'm not sure about the implications...

Could you be a little more specific? What implications are you 
worried about?

> About the "retry" discussion: I agree that retry is the relevant
> option from an application point of view.

I'm glad to hear it!

> ISTM that the retry implementation should be implemented somehow in
> the automaton, restarting the same script for the beginning.

If there are several transactions in this script, don't you think that 
we should restart only the failed transaction?

> As pointed out in the discussion, the same values/commands should be
> executed, which suggests that random generated values should be the
> same on the retry runs, so that for a simple script the same
> operations are attempted. This means that the random generator state
> must be kept & reinstated for a client on retries. Currently the
> random state is in the thread, which is not convenient for this
> purpose, so it should be moved in the client so that it can be saved
> at transaction start and reinstated on retries.

I think about it in the same way =)

> The number of retries and maybe failures should be counted, maybe with
> some adjustable maximum, as suggested.

If we fix the maximum number of attempts, the maximum number of failures 
for one script execution will be bounded above by 
(number_of_transactions_in_script * maximum_number_of_attempts). Do you 
think we should add a program option to limit this number 
further?

> About 0001:
> 
> In accumStats, just use one level if, the two levels bring nothing.

Thanks, I agree =[

> In doLog, added columns should be at the end of the format.

I inserted them earlier because these columns are not optional. Do 
you think they should be optional?

> The number
> of column MUST NOT change when different issues arise, so that it
> works well with cut/... unix commands, so inserting a sentence such as
> "serialization and deadlock failures" is a bad idea.

Thanks, I agree again.

> threadRun: the point of the progress format is to fit on one not too
> wide line on a terminal and to allow some simple automatic processing.
> Adding a verbose sentence in the middle of it is not the way to go.

I was thinking about it.. Thanks, I'll try to make it shorter.

> About tests: I do not understand why test 003 includes 2 transactions.
> It would seem more logical to have two scripts.

Ok!

> About 0003:
> 
> I'm not sure that there should be an new option to report failures,
> the information when relevant should be integrated in a clean format
> into the existing reports... Maybe the "per command latency"
> report/option should be renamed if it becomes more general.

I have tried not to change other parts of the program as much as possible. 
But if you think that it would be more useful to change the option, I'll 
do it.

> About 0004:
> 
> The documentation must not be in a separate patch, but in the same
> patch as their corresponding code.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

>> I agree that improving the error handling ability of pgbench is a good
>> thing, although I'm not sure about the implications...
>
> Could you tell a little bit more exactly.. What implications are you worried 
> about?

The current error handling is either "close connection" or maybe in some 
cases even "exit". If this is changed, then the client may continue 
execution in some unforeseen state and behave unexpectedly. We'll see.

>> ISTM that the retry implementation should be implemented somehow in
>> the automaton, restarting the same script for the beginning.
>
> If there are several transactions in this script - don't you think that we 
> should restart only the failed transaction?..

On some transaction failures based on their status. My point is that the 
retry process must be implemented clearly with a new state in the client 
automaton. Exactly when the transition to this new state must be taken is 
another issue.

>> The number of retries and maybe failures should be counted, maybe with
>> some adjustable maximum, as suggested.
>
> If we fix the maximum number of attempts the maximum number of failures for 
> one script execution will be bounded above (number_of_transactions_in_script 
> * maximum_number_of_attempts). Do you think we should make the option in 
> program to limit this number much more?

Probably not. I think that there should be a configurable maximum of 
retries on a transaction, which may be 0 by default if we want to be 
upward compatible with the current behavior, or maybe something else.

>> In doLog, added columns should be at the end of the format.
>
> I have inserted it earlier because these columns are not optional. Do you 
> think they should be optional?

I think that new non-optional columns should be at the end of the 
existing non-optional columns, so that existing scripts which process 
the output do not need to be updated.

>> I'm not sure that there should be an new option to report failures,
>> the information when relevant should be integrated in a clean format
>> into the existing reports... Maybe the "per command latency"
>> report/option should be renamed if it becomes more general.
>
> I have tried do not change other parts of program as much as possible. But if 
> you think that it will be more useful to change the option I'll do it.

I think that the option should change if its naming becomes less relevant, 
which is to be determined. AFAICS, ISTM that new measures should be added 
to the various existing reports unconditionally (i.e. without a new 
option), so maybe no new option would be needed.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> The current error handling is either "close connection" or maybe in
> some cases even "exit". If this is changed, then the client may
> continue execution in some unforseen state and behave unexpectedly.
> We'll see.

Thanks, now I understand this.

>>> ISTM that the retry implementation should be implemented somehow in
>>> the automaton, restarting the same script for the beginning.
>> 
>> If there are several transactions in this script - don't you think 
>> that we should restart only the failed transaction?..
> 
> On some transaction failures based on their status. My point is that
> the retry process must be implemented clearly with a new state in the
> client automaton. Exactly when the transition to this new state must
> be taken is another issue.

About it, I agree with you that it should be done in this way.

>>> The number of retries and maybe failures should be counted, maybe 
>>> with
>>> some adjustable maximum, as suggested.
>> 
>> If we fix the maximum number of attempts the maximum number of 
>> failures for one script execution will be bounded above 
>> (number_of_transactions_in_script * maximum_number_of_attempts). Do 
>> you think we should make the option in program to limit this number 
>> much more?
> 
> Probably not. I think that there should be a configurable maximum of
> retries on a transaction, which may be 0 by default if we want to be
> upward compatible with the current behavior, or maybe something else.

I propose the option --max-attempts-number=NUM, where NUM cannot be less 
than 1. I propose it because I think that, for example, 
--max-attempts-number=100 is better than --max-retries-number=99. And 
maybe it's better to set its default value to 1 too, because retrying 
shell commands can produce new errors..

>>> In doLog, added columns should be at the end of the format.
>> 
>> I have inserted it earlier because these columns are not optional. Do 
>> you think they should be optional?
> 
> I think that new non-optional columns it should be at the end of the
> existing non-optional columns so that existing scripts which may
> process the output may not need to be updated.

Thanks, I agree with you :)

>>> I'm not sure that there should be an new option to report failures,
>>> the information when relevant should be integrated in a clean format
>>> into the existing reports... Maybe the "per command latency"
>>> report/option should be renamed if it becomes more general.
>> 
>> I have tried do not change other parts of program as much as possible. 
>> But if you think that it will be more useful to change the option I'll 
>> do it.
> 
> I think that the option should change if its naming becomes less
> relevant, which is to be determined. AFAICS, ISTM that new measures
> should be added to the various existing reports unconditionnaly (i.e.
> without a new option), so maybe no new option would be needed.

Thanks! I didn't think about it in this way..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
>>>> The number of retries and maybe failures should be counted, maybe with
>>>> some adjustable maximum, as suggested.
>>> 
>>> If we fix the maximum number of attempts the maximum number of failures 
>>> for one script execution will be bounded above 
>>> (number_of_transactions_in_script * maximum_number_of_attempts). Do you 
>>> think we should make the option in program to limit this number much more?
>> 
>> Probably not. I think that there should be a configurable maximum of
>> retries on a transaction, which may be 0 by default if we want to be
>> upward compatible with the current behavior, or maybe something else.
>
> I propose the option --max-attempts-number=NUM which NUM cannot be less than 
> 1. I propose it because I think that, for example, --max-attempts-number=100 
> is better than --max-retries-number=99. And maybe it's better to set its 
> default value to 1 too because retrying of shell commands can produce new 
> errors..

Personally, I like counting retries because it also counts the number of 
times the transaction actually failed for some reason. But this is a 
marginal preference, and one can be switched to the other easily.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alexander Korotkov
Date:
On Thu, Jun 15, 2017 at 10:16 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-06-14 11:48:25 +0300, Marina Polyakova wrote:
> Advanced options:
> - mostly for testing built-in scripts: you can set the default transaction
> isolation level by the appropriate benchmarking option (-I);

I'm less convinced of the need for that; you can already set arbitrary
connection options with
PGOPTIONS='-c default_transaction_isolation=serializable' pgbench

Right, there is already a way to specify the default isolation level using environment variables.
However, once we make pgbench work with various isolation levels, users may want to run pgbench multiple times in a row with different isolation levels.  A command-line option would be very convenient in this case.
In addition, the isolation level is a vital parameter for interpreting benchmark results correctly.  Often, graphs with pgbench results are titled with the pgbench command line.  Having the isolation level specified on the command line would naturally fit into this titling scheme.
Of course, this is solely a usability question, and it's fair enough to live without such a command-line option.  But I'm +1 for adding this option.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
Hello everyone!

Here is the second version of my patch for pgbench. Now transactions 
with serialization and deadlock failures are rolled back and retried 
until they end successfully or their number of attempts reaches the maximum.

In detail:
- You can set the maximum number of attempts with the appropriate 
benchmarking option (--max-attempts-number). Its default value is 1, 
partly because retrying shell commands can produce new errors.
- Statistics on attempts and failures are printed in the progress output, in 
the transaction / aggregation logs and at the end with the other results (in 
total and for each script). A transaction failure is reported here only if 
the last retry of this transaction fails.
- Failures and average numbers of transaction attempts are also printed 
per command with average latencies if you use the appropriate 
benchmarking option (--report-per-command, -r) (it replaces the option 
--report-latencies, as I was advised here [1]). Average numbers of 
transaction attempts are printed only for commands which start 
transactions.

As usual: TAP tests for the new functionality and updated documentation with 
new examples.

Patch is attached. Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707031321370.3419%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> There's the second version of my patch for pgbench. Now transactions 
> with serialization and deadlock failures are rolled back and retried 
> until they end successfully or their number of attempts reaches maximum.

> In details:
>  - You can set the maximum number of attempts by the appropriate 
> benchmarking option (--max-attempts-number). Its default value is 1 
> partly because retrying of shell commands can produce new errors.
>
>  - Statistics of attempts and failures is printed in progress, in 
> transaction / aggregation logs and in the end with other results (all 
> and for each script). The transaction failure is reported here only if 
> the last retry of this transaction fails.
>
> - Also failures and average numbers of transactions attempts are printed 
> per-command with average latencies if you use the appropriate 
> benchmarking option (--report-per-command, -r) (it replaces the option 
> --report-latencies as I was advised here [1]). Average numbers of 
> transactions attempts are printed only for commands which start 
> transactions.

> As usual: TAP tests for new functionality and changed documentation with 
> new examples.

Here are a round of comments on the current version of the patch:

* About the feature

There is a latent issue about what is a transaction. For pgbench a transaction 
is a full script execution. For PostgreSQL, it is a statement or a BEGIN/END 
block, several of which may appear in a script. From a retry perspective, you 
may retry from a SAVEPOINT within a BEGIN/END block... I'm not sure how to 
make general sense of all this, so this is just a comment without attached 
action for now.
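
For illustration, a sketch of a custom script (using the standard pgbench 
tables) in which one pgbench "transaction", i.e. one script execution, spans 
two SQL transactions, each of which can fail independently:

  \set aid random(1, 100000)
  -- first SQL transaction within the script
  BEGIN;
  UPDATE pgbench_accounts SET abalance = abalance + 1 WHERE aid = :aid;
  END;
  -- second SQL transaction within the same script
  BEGIN;
  UPDATE pgbench_branches SET bbalance = bbalance + 1 WHERE bid = 1;
  END;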

As the default is not to retry, which is the upward-compatible behavior, I 
think that the changes should not change the current output much, bar counting 
the number of failures.

I would consider using "try/tries" instead of "attempt/attempts" as it is 
shorter. A native English speaker's opinion would be welcome on that point.

* About the code

ISTM that the code interacts significantly with various patches under review or ready for committers.
Not sure how to deal with that, there will be some rebasing work...

I'm fine with renaming "is_latencies" to "report_per_command", which is more logical & generic.

"max_attempt_number": I'm against typing fields again in their name, aka "hungarian naming". I'd suggest
"max_tries" or "max_attempts".

"SimpleStats attempts": I disagree with using this floating poiunt oriented structures to count integers.
I would suggest "int64 tries" instead, which should be enough for the 
purpose.

LastBeginState -> RetryState? I'm not sure why this state is a pointer in 
CState. Putting the struct would avoid malloc/free cycles. Index "-1" may 
be used to tell it is not set if necessary.

"CSTATE_RETRY_FAILED_TRANSACTION" -> "CSTATE_RETRY" is simpler and clear enough.

In CState and some code, a failure is a failure, maybe one boolean would 
be enough. It need only be differentiated when counting, and you have 
(deadlock_failure || serialization_failure) everywhere.

Some variables, such as "int attempt_number", should be in the client 
structure, not in the client? Generally, try to use block variables if 
possible to keep the state clearly disjoints. If there could be NO new 
variable at the doCustom level that would be great, because that would 
ensure that there is no machine state mixup hidden in these variables.

I'm wondering whether the RETRY & FAILURE states could/should be merged:

  on RETRY:
    -> count retry
    -> actually retry if < max_tries (reset client state, jump to command)
    -> else count failure and skip to end of script

The start and end of transaction detection seem expensive (malloc, ...) 
and assume one statement per command (what about "BEGIN \; ... \; 
COMMIT;"?), which is not necessarily the case; this limitation should be 
documented. ISTM that the space normalization should be avoided, and 
something simpler/lighter should be devised. Possibly it should consider 
handling SAVEPOINT.

I disagree about the exit in ParseScript if the transaction block is not 
completed, especially as it misses out on combined statements/queries 
("BEGIN \; stuff... \; COMMIT") and would break an existing feature.

There are strange characters in comments, e.g. "??ontinuous".

Option "max-attempt-number" -> "max-tries"

I would put the client random state initialization with the state 
initialization, not with the connection.

* About tracing

Progress is expected to be short, not detailed. Only add the number of 
failures and retries if max retry is not 1.

* About reporting

I think that too much is reported. I advised to do that, but nevertheless 
it is a little bit steep.

At least, it should not report the number of tries/attempts when the max 
number is one. Simple counting should be reported for failures, not 
floats...

I would suggest a more compact one-line report about failures:
  "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"

* About the TAP tests

They are too expensive, with 3 initdb. I think that they should be 
integrated in the existing tests, as a patch has been submitted to rework 
the whole pgbench tap test infrastructure.

For now, at most one initdb and several small tests inside.

* About the documentation

I'm not sure that the feature needs pre-eminence in the documentation, 
because most of the time there is no retry as none is needed, there is no 
failure, so this is rather a special (although useful) case for people 
playing with serializable and other advanced features.

Smaller updates, without dedicated examples, should be enough.

If a transaction is skipped, there were no tries, so the corresponding 
number of attempts is 0, not one.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
> LastBeginState -> RetryState? I'm not sure why this state is a pointer in 
> CState. Putting the struct would avoid malloc/free cycles. Index "-1" may be 
> used to tell it is not set if necessary.

Another detail I forgot about this point: there may be a memory leak on 
variable copies; ISTM that the "variables" array is never freed.

I was not convinced by the overall memory management around variables to 
begin with, and it is even less so with their new copy management. Maybe 
having a clean "Variables" data structure could help improve the 
situation.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Here are a round of comments on the current version of the patch:

Thank you very much again!

> There is a latent issue about what is a transaction. For pgbench a
> transaction is a full script execution.
> For postgresql, it is a statement or a BEGIN/END block, several of
> which may appear in a script. From a retry
> perspective, you may retry from a SAVEPOINT within a BEGIN/END
> block... I'm not sure how to make general sense
> of all this, so this is just a comment without attached action for now.

Yes, it is. That's why I wrote several notes about it in the documentation 
where there may be a misunderstanding:

+        Transactions with serialization or deadlock failures (or with 
both
+        of them if used script contains several transactions; see
+        <xref linkend="transactions-and-scripts"
+        endterm="transactions-and-scripts-title"> for more information) 
are
+        marked separately and their time is not reported as for skipped
+        transactions.

+ <refsect2 id="transactions-and-scripts">
+  <title id="transactions-and-scripts-title">What is the 
<quote>Transaction</> Actually Performed in 
<application>pgbench</application>?</title>

+    If a transaction has serialization and/or deadlock failures, its
+   <replaceable>time</> will be reported as <literal>serialization 
failure</>,
+   <literal>deadlock failure</>, or
+   <literal>serialization and deadlock failures</>, respectively.   </para>
+  <note>
+   <para>
+     Transactions can have both serialization and deadlock failures if 
the
+     used script contained several transactions.  See
+     <xref linkend="transactions-and-scripts"
+     endterm="transactions-and-scripts-title"> for more information.
+    </para>
+  </note>

+  <note>
+   <para>
+    The number of transactions attempts within the interval can be 
greater than
+    the number of transactions within this interval multiplied by the 
maximum
+    attempts number.  See <xref linkend="transactions-and-scripts"
+    endterm="transactions-and-scripts-title"> for more information.
+   </para>
+  </note>

+       <note>
+         <para>The total sum of per-command failures of each type can 
be greater
+         than the number of transactions with reported failures.
+         See <xref linkend="transactions-and-scripts"
+         endterm="transactions-and-scripts-title"> for more 
information.
+         </para>
+       </note>

And I didn't make rollbacks to savepoints after the failure because they 
cannot help with serialization failures at all: after a rollback to 
savepoint, a new attempt will always be unsuccessful.

> I would consider using "try/tries" instead of "attempt/attempts" as it
> is shorter. An English native speaker
> opinion would be welcome on that point.

Thank you, I'll change it.

> I'm fine with renaming "is_latencies" to "report_per_command", which
> is more logical & generic.

Glad to hear it!

> "max_attempt_number": I'm against typing fields again in their name,
> aka "hungarian naming". I'd suggest
> "max_tries" or "max_attempts".

Ok!

> "SimpleStats attempts": I disagree with using this floating poiunt
> oriented structures to count integers.
> I would suggest "int64 tries" instead, which should be enough for the 
> purpose.

I'm not sure that it is enough. Firstly, there may be several transactions 
in a script, so to count the average number of attempts you should know the 
total number of run transactions. Secondly, I think that the stddev of the 
number of attempts can be quite interesting, and often it is not close to 
zero.

> LastBeginState -> RetryState? I'm not sure why this state is a pointer
> in CState. Putting the struct would avoid malloc/free cycles. Index
> "-1" may be used to tell it is not set if necessary.

Thanks, I agree that it's better to do in this way.

> "CSTATE_RETRY_FAILED_TRANSACTION" -> "CSTATE_RETRY" is simpler and 
> clear enough.

Ok!

> In CState and some code, a failure is a failure, maybe one boolean
> would be enough. It need only be differentiated when counting, and you
> have (deadlock_failure || serialization_failure) everywhere.

I agree with you. I'll change it.

> Some variables, such as "int attempt_number", should be in the client
> structure, not in the client? Generally, try to use block variables if
> possible to keep the state clearly disjoints. If there could be NO new
> variable at the doCustom level that would be great, because that would
> ensure that there is no machine state mixup hidden in these variables.

Do you mean the code cleanup for the doCustom function? Because if I do so, 
there will be two code styles for state blocks and their variables in 
this function..

> I wondering whether the RETRY & FAILURE states could/should be merged:
> 
>   on RETRY:
>     -> count retry
>     -> actually retry if < max_tries (reset client state, jump to 
> command)
>     -> else count failure and skip to end of script
> 
> The start and end of transaction detection seem expensive (malloc,
> ...) and assume a one statement per command (what about "BEGIN \; ...
> \; COMMIT;", which is not necessarily the case, this limitation should
> be documented. ISTM that the space normalization should be avoided,
> and something simpler/lighter should be devised? Possibly it should
> consider handling SAVEPOINT.

I divided these states because if there's a failed transaction block you 
should end it before retrying. That means going through the states 
CSTATE_START_COMMAND -> CSTATE_WAIT_RESULT -> CSTATE_END_COMMAND with 
the appropriate command. How do you propose not to go through these states?

About malloc - I agree with you that it should be done without 
malloc/free.

About savepoints - as I wrote earlier, I didn't make rollbacks to 
savepoints after the failure, because they cannot help with serialization 
failures at all: after a rollback to savepoint, a new attempt will always 
be unsuccessful.

> I disagree about exit in ParseScript if the transaction block is not
> completed, especially as it misses out on combined statements/queries
> (BEGIN \; stuff... \; COMMIT") and would break an existing feature.

Thanks, I'll fix it for usual transaction blocks that don't end in the 
scripts.

> There are strange characters things in comments, eg "??ontinuous".

Oh, I'm sorry. I'll fix it too.

> Option "max-attempt-number" -> "max-tries"

> I would put the client random state initialization with the state
> intialization, not with the connection.

> * About tracing
> 
> Progress is expected to be short, not detailed. Only add the number of
> failures and retries if max retry is not 1.

Ok!

> * About reporting
> 
> I think that too much is reported. I advised to do that, but
> nevertheless it is a little bit steep.
> 
> At least, it should not report the number of tries/attempts when the
> max number is one.

Ok!

> Simple counting should be reported for failures,
> not floats...
> 
> I would suggest a more compact one-line report about failures:
> 
>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"

I think there may be a misunderstanding, because a script can contain 
several transactions and get both failures.

> * About the TAP tests
> 
> They are too expensive, with 3 initdb. I think that they should be
> integrated in the existing tests, as a patch has been submitted to
> rework the whole pgbench tap test infrastructure.
> 
> For now, at most one initdb and several small tests inside.

Ok!

> * About the documentation
> 
> I'm not sure that the feature needs pre-emminence in the
> documentation, because most of the time there is no retry as none is
> needed, there is no failure, so this rather a special (although
> useful) case for people playing with serializable and other advanced
> features.
> 
> Smaller updates, without dedicated examples, should be enough.

Maybe there should be some examples to prepare people for what they can see 
in the output of the program? Of course, right now failures are special cases 
because they disconnect their clients until the end of the program and ruin
all the results. I hope that if this patch is committed there will be 
many more cases with retried failures.

> If a transaction is skipped, there was no tries, so the corresponding
> number of attempts is 0, not one.

Oh, I'm sorry, it is a typo in the documentation.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Another detail I forgot about this point: there may be a memory leak
> on variables copies, ISTM that the "variables" array is never freed.
> 
> I was not convinced by the overall memory management around variables
> to begin with, and it is even less so with their new copy management.
> Maybe having a clean "Variables" data structure could help improve the
> situation.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello,

> [...] I didn't make rollbacks to savepoints after the failure because 
> they cannot help for serialization failures at all: after rollback to 
> savepoint a new attempt will be always unsuccessful.

Not necessarily? It depends on where the locks triggering the issue are 
set; if they are all set after the savepoint, it could work on a second 
attempt.

>> "SimpleStats attempts": I disagree with using this floating poiunt 
>> oriented structures to count integers. I would suggest "int64 tries" 
>> instead, which should be enough for the purpose.
>
> I'm not sure that it is enough. Firstly it may be several transactions in 
> script so to count the average attempts number you should know the total 
> number of runned transactions. Secondly I think that stddev for attempts 
> number can be quite interesting and often it is not close to zero.

I would prefer to have a real motivation to add this complexity in the 
report and in the code. Without that, a simple int seems better for now. 
It can be improved later if the need really arises.

>> Some variables, such as "int attempt_number", should be in the client
>> structure, not in the client? Generally, try to use block variables if
>> possible to keep the state clearly disjoints. If there could be NO new
>> variable at the doCustom level that would be great, because that would
>> ensure that there is no machine state mixup hidden in these variables.
>
> Do you mean the code cleanup for doCustom function? Because if I do so there 
> will be two code styles for state blocks and their variables in this 
> function..

I think that any variable shared between states is a recipe for bugs if it 
is not reset properly, so they should be avoided. Maybe there are already 
too many of them, then too bad, not a reason to add more. The status 
before the automaton was a nightmare.

>> I wondering whether the RETRY & FAILURE states could/should be merged:
>
> I divided these states because if there's a failed transaction block you 
> should end it before retrying.

Hmmm. Maybe I'm wrong. I'll think about it.

>> I would suggest a more compact one-line report about failures:
>>
>>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"
>
> I think, there may be a misunderstanding. Because script can contain several 
> transactions and get both failures.

I do not understand. Both failure numbers are on the compact line I 
suggested.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
>> I was not convinced by the overall memory management around variables
>> to begin with, and it is even less so with their new copy management.
>> Maybe having a clean "Variables" data structure could help improve the
>> situation.
>
> Ok!

Note that there is something for psql (src/bin/psql/variable.c) which may 
or may not be shared. It should be checked before possibly recoding the 
same thing.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 13-07-2017 19:32, Fabien COELHO wrote:
> Hello,

Hi!

>> [...] I didn't make rollbacks to savepoints after the failure because 
>> they cannot help for serialization failures at all: after rollback to 
>> savepoint a new attempt will be always unsuccessful.
> 
> Not necessarily? It depends on where the locks triggering the issue
> are set, if they are all set after the savepoint it could work on a
> second attempt.

Don't you mean the deadlock failures, where a rollback to savepoint can 
really help? And could you please give an example where a rollback to 
savepoint can help to end its subtransaction successfully after a 
serialization failure?

>>> "SimpleStats attempts": I disagree with using this floating poiunt 
>>> oriented structures to count integers. I would suggest "int64 tries" 
>>> instead, which should be enough for the purpose.
>> 
>> I'm not sure that it is enough. Firstly it may be several transactions 
>> in script so to count the average attempts number you should know the 
>> total number of runned transactions. Secondly I think that stddev for 
>> attempts number can be quite interesting and often it is not close to 
>> zero.
> 
> I would prefer to have a real motivation to add this complexity in the
> report and in the code. Without that, a simple int seems better for
> now. It can be improved later if the need really arises.

Ok!

>>> Some variables, such as "int attempt_number", should be in the client
>>> structure, not in the client? Generally, try to use block variables 
>>> if
>>> possible to keep the state clearly disjoints. If there could be NO 
>>> new
>>> variable at the doCustom level that would be great, because that 
>>> would
>>> ensure that there is no machine state mixup hidden in these 
>>> variables.
>> 
>> Do you mean the code cleanup for doCustom function? Because if I do so 
>> there will be two code styles for state blocks and their variables in 
>> this function..
> 
> I think that any variable shared between state is a recipee for bugs
> if it is not reset properly, so they should be avoided. Maybe there
> are already too many of them, then too bad, not a reason to add more.
> The status before the automaton was a nightmare.

Ok!

>>> I would suggest a more compact one-line report about failures:
>>> 
>>>   "number of failures: 12 (0.001%, deadlock: 7, serialization: 5)"
>> 
>> I think, there may be a misunderstanding. Because script can contain 
>> several transactions and get both failures.
> 
> I do not understand. Both failures number are on the compact line I 
> suggested.

I mean that the sum of transactions with serialization failures and 
transactions with deadlock failures can be greater than the total number 
of transactions with failures. But if you think it's ok, I'll change it 
and write the appropriate note in the documentation.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
>>> I was not convinced by the overall memory management around variables
>>> to begin with, and it is even less so with their new copy management.
>>> Maybe having a clean "Variables" data structure could help improve 
>>> the
>>> situation.
> 
> Note that there is something for psql (src/bin/psql/variable.c) which
> may or may not be shared. It should be checked before recoding
> eventually the same thing.

Thank you very much for pointing out this file! As I checked, this is another 
structure: there it is a simple list, while in pgbench we should know 
whether the list is sorted and the number of elements in the list. What do 
you think: is it a good idea to name the variables structure in pgbench in 
the same way (VariableSpace), or should it be different to avoid confusion 
(Variables, for example)?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

>> Not necessarily? It depends on where the locks triggering the issue
>> are set, if they are all set after the savepoint it could work on a
>> second attempt.
>
> Don't you mean the deadlock failures where can really help rollback to

Yes, I mean that on deadlock failures one can roll back to a savepoint and 
work on a second attempt.

> And could you, please, give an example where a rollback to savepoint can 
> help to end its subtransaction successfully after a serialization 
> failure?

I do not know whether this is possible with serialization failures.
It might be, if the stuff before and after the savepoint is somehow 
unrelated...

> [...] I mean that the sum of transactions with serialization failures and 
> transactions with deadlock failures can be greater than the total number 
> of transactions with failures.

Hmmm. Ok.

A "failure" is a transaction (in the sense of pgbench) that could not made 
it to the end, even after retries. If there is a rollback and the a retry 
which works, it is not a failure.

Now deadlock or serialization errors, which trigger retries, are worth 
counting as well, although they are not "failures". So my format proposal 
was over-optimistic, and the number of deadlocks and serializations had 
better be on a retry count line.

Maybe something like:
  ...
  number of failures: 12 (0.004%)
  number of retries: 64 (deadlocks: 29, serialization: 35)

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>> Note that there is something for psql (src/bin/psql/variable.c) which 
>> may or may not be shared. It should be checked before recoding 
>> eventually the same thing.
>
> Thank you very much for pointing this file! As I checked this is another 
> structure: here there's a simple list, while in pgbench we should know 
> if the list is sorted and the number of elements in the list. How do you 
> think, is it a good idea to name a variables structure in pgbench in the 
> same way (VariableSpace) or it should be different not to be confused 
> (Variables, for example)?

Given that the number of variables of a pgbench script is expected to be 
pretty small, I'm not sure that the sorting stuff is worth the effort.

My suggestion is really to look at both implementations and to answer the 
question "should pgbench share its variable implementation with psql?".

If the answer is yes, then the relevant part of the implementation should 
be moved to fe_utils, and that's it.

If the answer is no, then implement something in pgbench directly.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
>>> Not necessarily? It depends on where the locks triggering the issue
>>> are set, if they are all set after the savepoint it could work on a
>>> second attempt.
>> 
>> Don't you mean the deadlock failures where can really help rollback to
> 
> Yes, I mean deadlock failures can rollback to a savepoint and work on
> a second attempt.
> 
>> And could you, please, give an example where a rollback to savepoint 
>> can help to end its subtransaction successfully after a serialization 
>> failure?
> 
> I do not know whether this is possible with serialization 
> failures.
> It might be if the stuff before and after the savepoint are somehow 
> unrelated...

If you mean, for example, updates of different tables - a rollback 
to a savepoint doesn't help.

And I'm not sure that we should do all the stuff for savepoint 
rollbacks because:
- as I see it now, it only makes sense for the deadlock failures;
- if there's a failure, which savepoint should we roll back to before 
starting the execution again? Maybe go to the last one, and if that is 
not successful go to the previous one, etc.
Retrying the entire transaction may take less time..

>> [...] I mean that the sum of transactions with serialization failures 
>> and transactions with deadlock failures can be greater than the total 
>> number of transactions with failures.
> 
> Hmmm. Ok.
> 
> A "failure" is a transaction (in the sense of pgbench) that could not
> made it to the end, even after retries. If there is a rollback and the
> a retry which works, it is not a failure.
> 
> Now deadlock or serialization errors, which trigger retries, are worth
> counting as well, although they are not "failures". So my format
> proposal was over optimistic, and the number of deadlocks and
> serializations should better be on a retry count line.
> 
> Maybe something like:
>   ...
>   number of failures: 12 (0.004%)
>   number of retries: 64 (deadlocks: 29, serialization: 35)

Ok! How do you like the idea of using the same format (the total number of 
transactions with failures and the number of retries for each failure 
type) in other places (log, aggregation log, progress) if the values are 
not "default" (= no failures and no retries)?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Given that the number of variables of a pgbench script is expected to
> be pretty small, I'm not sure that the sorting stuff is worth the
> effort.

I think it is good insurance in case there are many variables..

> My suggestion is really to look at both implementations and to answer
> the question "should pgbench share its variable implementation with
> psql?".
> 
> If the answer is yes, then the relevant part of the implementation
> should be moved to fe_utils, and that's it.
> 
> If the answer is no, then implement something in pgbench directly.

The structure of variables is different, the container structure of the 
variables is different, so I think that the answer is no.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>> If the answer is no, then implement something in pgbench directly.
>
> The structure of variables is different, the container structure of the 
> variables is different, so I think that the answer is no.

Ok, fine. My point was just to check before proceeding.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> And I'm not sure that we should do all the stuff for savepoints rollbacks 
> because:
> - as I see it now it only makes sense for the deadlock failures;
> - if there's a failure what savepoint we should rollback to and start the 
> execution again?

ISTM that this is the point of having a savepoint in the first place: the 
ability to restart the transaction at that point if something failed?

> Maybe to go to the last one, if it is not successful go to the previous 
> one etc. Retrying the entire transaction may take less time..

Well, I do not know about that. My 0.02 € is that if there was a savepoint 
then it is the natural restarting point of a transaction which hits some 
recoverable error.

Well, the short version may be to only do a full transaction retry, to 
document that for now savepoints are not handled, and to leave that for 
future work if the need arises.

>> Maybe something like:
>>   ...
>>   number of failures: 12 (0.004%)
>>   number of retries: 64 (deadlocks: 29, serialization: 35)
>
> Ok! How do you like the idea of using the same format (the total number of 
> transactions with failures and the number of retries for each failure type) 
> in other places (log, aggregation log, progress) if the values are not 
> "default" (= no failures and no retries)?

For progress the output must be short and readable, and probably we do not 
care about whether retries came from this or that, so I would leave that 
out.

For log and aggregated log possibly that would make more sense, but it 
must stay easy to parse.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Ok, fine. My point was just to check before proceeding.

And I'm very grateful for that :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Well, the short version may be to only do a full transaction retry and
> to document that for now savepoints are not handled, and to let that
> for future work if need arises.

I agree with you.

> For progress the output must be short and readable, and probably we do
> not care about whether retries came from this or that, so I would let
> that out.
> 
> For log and aggregated log possibly that would make more sense, but it
> must stay easy to parse.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello again!

Here is the third version of the patch for pgbench, thanks to Fabien 
Coelho's comments. As in the previous one, transactions with serialization 
and deadlock failures are rolled back and retried until they end 
successfully or their number of tries reaches the maximum.

Differences from the previous version:
* Some code cleanup :) In particular, the Variables structure for 
managing client variables and only one new TAP tests file (as 
recommended here [1] and here [2]).
* There's no error if the last transaction in the script is not 
completed. But transactions started in previous scripts and/or 
not ending in the current script are not rolled back and retried after 
the failure. Such a script try is reported as failed because it contains a 
failure that was not rolled back and retried.
* Usually the retries and/or failures are printed if they are not equal 
to zero. In transaction/aggregation logs the failures are always 
printed, and the retries are printed if max_tries is greater than 1. This 
is done to keep the general format of the log consistent during the 
execution of the program.

Patch is attached. Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121338090.12795%40lancre
[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Andres Freund
Дата:
Hi,

On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> Here is the third version of the patch for pgbench thanks to Fabien Coelho
> comments. As in the previous one, transactions with serialization and
> deadlock failures are rolled back and retried until they end successfully or
> their number of tries reaches maximum.

Just had a need for this feature, and took this to a short test
drive. So some comments:
- it'd be useful to display a retry percentage of all transactions,
  similar to what's displayed for failed transactions.
- it appears that we now unconditionally do not disregard a connection
  after a serialization / deadlock failure. Good. But that's useful far
  beyond just deadlocks / serialization errors, and should probably be
  exposed.
- it'd be useful to also conveniently display the number of retried
  transactions, rather than the total number of retries.

Nice feature!

- Andres



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alexander Korotkov
Дата:
On Fri, Aug 11, 2017 at 10:50 PM, Andres Freund <andres@anarazel.de> wrote:
On 2017-07-21 19:32:02 +0300, Marina Polyakova wrote:
> Here is the third version of the patch for pgbench thanks to Fabien Coelho
> comments. As in the previous one, transactions with serialization and
> deadlock failures are rolled back and retried until they end successfully or
> their number of tries reaches maximum.

Just had a need for this feature, and took this to a short test
drive. So some comments:
- it'd be useful to display a retry percentage of all transactions,
  similar to what's displayed for failed transactions.
- it appears that we now unconditionally do not disregard a connection
  after a serialization / deadlock failure. Good. But that's useful far
  beyond just deadlocks / serialization errors, and should probably be exposed.

Yes, it would be nice not to disregard a connection after other errors 
too.  However, I'm not sure if we should retry the *same* transaction on 
errors beyond deadlocks / serialization errors.  For example, in case of a 
division by zero or a unique violation error it would be more natural to 
give up on the current transaction and continue with the next one.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Hi,

Hello!

> Just had a need for this feature, and took this to a short test
> drive. So some comments:
> - it'd be useful to display a retry percentage of all transactions,
>   similar to what's displayed for failed transactions.

> - it'd be useful to also conveniently display the number of retried
>   transactions, rather than the total number of retries.

Ok!

> - it appears that we now unconditionally do not disregard a connection
>   after a serialization / deadlock failure. Good. But that's useful far
>   beyond just deadlocks / serialization errors, and should probably be 
> exposed.

I agree that it would be useful. But how do you propose to print the 
results if there are many types of errors? I'm afraid that the progress 
report could become very long, although it is expected to be rather 
short [1]. The per-statement report can also be very long..

> Nice feature!

Thanks and thank you for your comments :)

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello,

> Here is the third version of the patch for pgbench thanks to Fabien Coelho 
> comments. As in the previous one, transactions with serialization and 
> deadlock failures are rolled back and retried until they end successfully or 
> their number of tries reaches maximum.

Here is some partial review.

Patch applies cleanly.

It compiles with warnings, please fix them:

   pgbench.c:2624:28: warning: ‘failure_status’ may be used uninitialized in this function
   pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in this function

I do not think that the error handling feature needs preeminence in the
final report, compared to scale, number of clients and so on. The number
of tries should be put further down.

I would spell "number of tries" instead of "tries number" which seems to
suggest that each try is attributed a number. "sql" -> "SQL".

For the per-statement latency final report, I do not think it is worth 
distinguishing the kind of retry at this level, because ISTM that 
serialization & deadlocks are unlikely to appear simultaneously. I would 
just report total failures and total tries on this report. We only have 2 
errors now, but if more are added I'm pretty sure that we would not want 
to have more columns... Moreover the 25-character alignment is ugly; 
better to use a much smaller alignment.

I'm okay with having details shown in the "log to file" group report.

The documentation does not seem consistent. It discusses "the very last 
fields" and seems to suggest that there are two, but the example trace 
below just adds one field.

If you want a paragraph you should add <para>, skipping a line does not
work (around "All values are computed for ...").

I do not understand the second note of the --max-tries documentation.
It seems to suggest that some scripts may not end their own transaction...
which should be an error in my opinion? Some explanations would be welcome.

I'm not sure that "Retries" deserves a type of its own for two counters.
The "retries" in RetriesState may be redundant with these.
The failures are counted on simple counters while retries have a type,
this is not consistent. I suggest to just use simple counters everywhere.
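
I.e. something like this (a sketch only; these would be plain fields, e.g.
in the existing stats structures):

   int64   retries;                  /* total number of retries */
   int64   retried;                  /* transactions retried at least once */
   int64   serialization_failures;   /* failed due to serialization errors */
   int64   deadlock_failures;        /* failed due to deadlocks */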

I'm ok with having the detail report tell about failures & retries only
when some occurred.

typo: sucessufully -> successfully

If a native English speaker could provide an opinion on that, and more
generally review the whole documentation, it would be great.

I think that the rand functions should really take a random_state pointer
argument, not a Thread or Client.
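
For instance, something along these lines (a sketch only; the exact names
do not matter, the point is the explicit state argument):

   typedef struct RandomState
   {
       unsigned short xseed[3];
   } RandomState;

   /*
    * sketch: uniform random int64 in [min, max], state passed explicitly;
    * pg_erand48() comes from src/port and is already used by pgbench.
    */
   static int64
   getrand(RandomState *random_state, int64 min, int64 max)
   {
       return min + (int64) ((max - min + 1) * pg_erand48(random_state->xseed));
   }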

I'm at odds that FailureStatus does not have a clean NO_FAILURE state,
and that it is merged with misc failures.

I'm not sure that initRetries, mergeRetries, getAllRetries really
deserve a function.

I do not think that there should be two accum functions. Just extend
the existing one, and adding zero to zero is not a problem.

I guess that in the end pgbench & psql variables will have to be merged
if pgbench expression engine is to be used by psql as well, but this is
not linked to this patch.

The TAP tests seem over-complicated and heavy, with two pgbench runs in
parallel... I'm not sure we really want all that complexity for this
somehow small feature. Moreover pgbench can run several scripts; I'm not
sure why two pgbench instances would need to be invoked. Could something
much simpler and lighter be proposed instead to test the feature?

The added code does not conform to Pg C style. For instance, an if brace
should be aligned with the if. Please conform to the project style.

The is_transaction_block_end seems simplistic. ISTM that it would not
work with compound commands. It should be clearly documented somewhere.

Also find attached two scripts I used for some testing:

   psql < dl_init.sql
   pgbench -f dl_trans.sql -c 8 -T 10 -P 1

-- 
Fabien.

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
> Hello,

Hi! I'm very sorry that I did not answer for so long, I was very busy with 
the release of Postgres Pro 10 :(

>> Here is the third version of the patch for pgbench thanks to Fabien 
>> Coelho comments. As in the previous one, transactions with 
>> serialization and deadlock failures are rolled back and retried until 
>> they end successfully or their number of tries reaches maximum.
> 
> Here is some partial review.

Thank you very much for it!

> It compiles with warnings, please fix them:
> 
>   pgbench.c:2624:28: warning: ‘failure_status’ may be used
> uninitialized in this function
>   pgbench.c:2697:34: warning: ‘command’ may be used uninitialized in
> this function

Ok!

> I do not think that the error handling feature needs preeminence in the
> final report, compare to scale, number of clients and so. The number
> of tries should be put further on.

I added it here only because both this field and the field "transaction 
type" are transaction characteristics. I have some doubts about where to 
add it. On the one hand, the number of clients, the number of transactions 
per client and the number of transactions actually processed form a good 
logical block which I don't want to divide. On the other hand, the 
number of clients and the number of transactions per client are 
parameters, but the number of transactions actually processed is one of 
the program results. Where, in your opinion, would it be better to add 
the maximum number of transaction tries?

> I would spell "number of tries" instead of "tries number" which seems 
> to
> suggest that each try is attributed a number. "sql" -> "SQL".

Ok!

> For the per statement latency final report, I do not think it is worth
> distinguishing the kind of retry at this level, because ISTM that
> serialization & deadlocks are unlikely to appear simultaneously. I
> would just report total failures and total tries on this report. We
> only have 2 errors now, but if more are added I'm pretty sure that we
> would not want to have more columns...

Thanks, I agree with you.

> Moreover the 25 characters
> alignment is ugly, better use a much smaller alignment.

The variables for the numbers of failures and retries are of type int64 
since the variable for the total number of transactions has the same 
type. That's why there is such a large alignment (as I understand it now, 
20 characters would be enough). Do you prefer floating alignments, 
depending on the maximum number of failures/retries for any command in 
any script?

> I'm okay with having details shown in the "log to file" group report.

I think that the output format of the retries statistics should be the 
same everywhere, so I would just like to output the total number of 
retries here.

> The documentation does not seem consistent. It discusses "the very last 
> fields"
> and seem to suggest that there are two, but the example trace below 
> just
> adds one field.

I'm sorry, I do not understand what you are talking about. I used the 
commands and the files from the end of your message ("psql < 
dl_init.sql" and "pgbench -f dl_trans.sql -c 8 -T 10 -P 1"), and I got 
this output from pgbench:

starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
progress: 1.0 s, 14.0 tps, lat 9.094 ms stddev 5.304
progress: 2.0 s, 25.0 tps, lat 284.934 ms stddev 450.692, 1 failed
progress: 3.0 s, 21.0 tps, lat 337.942 ms stddev 473.210, 1 failed
progress: 4.0 s, 11.0 tps, lat 459.041 ms stddev 499.908, 2 failed
progress: 5.0 s, 28.0 tps, lat 220.219 ms stddev 411.390, 2 failed
progress: 6.0 s, 5.0 tps, lat 402.695 ms stddev 492.526, 2 failed
progress: 7.0 s, 24.0 tps, lat 343.249 ms stddev 626.181, 2 failed
progress: 8.0 s, 14.0 tps, lat 505.396 ms stddev 501.836, 1 failed
progress: 9.0 s, 40.0 tps, lat 180.080 ms stddev 381.335, 1 failed
progress: 10.0 s, 1.0 tps, lat 0.000 ms stddev 0.000, 1 failed
transaction type: dl_trans.sql
transaction maximum tries number: 1
scaling factor: 1
query mode: simple
number of clients: 8
number of threads: 1
duration: 10 s
number of transactions actually processed: 191
number of failures: 14 (7.330 %)
latency average = 356.701 ms
latency stddev = 564.942 ms
tps = 18.735807 (including connections establishing)
tps = 18.744898 (excluding connections establishing)

As I understand it, in the documentation "the very last fields" refer to 
the aggregation logging which is not used here. So what's the problem?

> If you want a paragraph you should add <para>, skipping a line does not
> work (around "All values are computed for ...").

Sorry, thanks =[

> I do not understand the second note of the --max-tries documentation.
> It seems to suggest that some script may not end their own 
> transaction...
> which should be an error in my opinion? Some explanations would be 
> welcome.

As you told me here [1], "I disagree about exit in ParseScript if the 
transaction block is not completed <...> and would break an existing 
feature.". Maybe it's be better to say this:

In pgbench you can use scripts in which the transaction blocks do not 
end. Be careful in this case because transactions that span over more 
than one script are not rolled back and will not be retried in case of 
an error. In such cases, the script in which the error occurred is 
reported as failed.

?

> I'm not sure that "Retries" deserves a type of its own for two 
> counters.

Ok!

> The "retries" in RetriesState may be redundant with these.

The "retries" in RetriesState have a different goal: they sum up not all 
the retries during the execution of the current script but the retries 
for the current transaction.

> The failures are counted on simple counters while retries have a type,
> this is not consistent. I suggest to just use simple counters 
> everywhere.

Ok!

> I'm ok with having the detail report tell about failures & retries only
> when some occured.

Ok!

> typo: sucessufully -> successfully

Thanks! =[

> If a native English speaker could provide an opinion on that, and more
> generally review the whole documentation, it would be great.

I agree with you))

> I think that the rand functions should really take a random_state 
> pointer
> argument, not a Thread or Client.

Thanks, I agree.

> I'm at odds that FailureStatus does not have a clean NO_FAILURE state,
> and that it is merged with misc failures.

:) It is funny but for the code it really did not matter)

> I'm not sure that initRetries, mergeRetries, getAllRetries really
> deserve a function.

Ok!

> I do not thing that there should be two accum Functions. Just extend
> the existing one, and adding zero to zero is not a problem.

Ok!

> I guess that in the end pgbench & psql variables will have to be merged
> if pgbench expression engine is to be used by psql as well, but this is
> not linked to this patch.

Ok!

> The tap tests seems over-complicated and heavy with two pgbench run in
> parallel... I'm not sure we really want all that complexity for this
> somehow small feature. Moreover pgbench can run several scripts, I'm 
> not
> sure why two pgbench would need to be invoked. Could something much
> simpler and lighter be proposed instead to test the feature?

Firstly, two pgbench instances need to be invoked because we don't know 
which of them will get a deadlock failure. Secondly, I tried much simpler 
tests but all of them sometimes failed although everything was ok:
- tests in which pgbench runs 5 clients and 10 transactions per client, 
expecting a serialization/deadlock failure on some client (sometimes there 
are no failures when they are expected);
- tests in which pgbench runs 30 clients and 400 transactions per client, 
expecting a serialization/deadlock failure on some client (sometimes there 
are no failures when they are expected);
- tests in which a psql session starts concurrently and sleep commands are 
used to wait for pgbench for 10 seconds (sometimes it does not work).
Only advisory locks help me avoid such errors in the tests :(

> The added code does not conform to Pg C style. For instance, if brace
> should be aligned to the if. Please conform the project style.

I'm sorry, thanks =[

> The is_transaction_block_end seems simplistic. ISTM that it would not
> work with compound commands. It should be clearly documented somewhere.

Thanks, I'll fix it.

> Also find attached two scripts I used for some testing:
> 
>   psql < dl_init.sql
>   pgbench -f dl_trans.sql -c 8 -T 10 -P 1

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1707121142300.12795%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Teodor Sigaev
Дата:
> I suggest a patch where pgbench client sessions are not disconnected because of 
> serialization or deadlock failures and these failures are mentioned in reports. 
> In details:
> - transaction with one of these failures continue run normally, but its result 
> is rolled back;
> - if there were these failures during script execution this "transaction" is marked
> appropriately in logs;
> - numbers of "transactions" with these failures are printed in progress, in 
> aggregation logs and in the end with other results (all and for each script);
Hm, I took a look at both threads about the patch and it seems to me that now 
it's overcomplicated. With the recently committed enhancements of pgbench (\if, 
\when) it becomes close to impossible to retry a transaction in case of failure. 
So, the initial approach of just rolling back such a transaction looks more 
attractive.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> Hm, I took a look at both threads about the patch and it seems to me that now 
> it's overcomplicated. With the recently committed enhancements of pgbench (\if, 
> \when) it becomes close to impossible to retry a transaction in case of 
> failure. So, the initial approach of just rolling back such a transaction 
> looks more attractive.

Yep.

I think that the best approach for now is simply to reset (command zero, 
random generator) and start over the whole script, without attempting to 
be more intelligent. The limitations should be clearly documented (one 
transaction per script), though. That would be a significant enhancement 
already.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 25-03-2018 15:23, Fabien COELHO wrote:
>> Hm, I took a look at both threads about the patch and it seems to me that 
>> now it's overcomplicated. With the recently committed enhancements of pgbench 
>> (\if, \when) it becomes close to impossible to retry a transaction in 
>> case of failure. So, the initial approach of just rolling back such a 
>> transaction looks more attractive.
> 
> Yep.

Many thanks to both of you! I'm working on a patch in this direction..

> I think that the best approach for now is simply to reset (command
> zero, random generator) and start over the whole script, without
> attempting to be more intelligent. The limitations should be clearly
> documented (one transaction per script), though. That would be a
> significant enhancement already.

I'm not sure that we can always do this, because we can get new errors 
until we finish the failed transaction block, and we need to destroy the 
conditional stack..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> Many thanks to both of you! I'm working on a patch in this direction..
>
>> I think that the best approach for now is simply to reset (command
>> zero, random generator) and start over the whole script, without
>> attempting to be more intelligent. The limitations should be clearly
>> documented (one transaction per script), though. That would be a
>> significant enhancement already.
>
> I'm not sure that we can always do this, because we can get new errors until 
> we finish the failed transaction block, and we need to destroy the conditional 
> stack..

Sure. I'm suggesting, in order to simplify things, that on failures the retry 
would always restart from the beginning of the script by resetting everything, 
indeed including the conditional stack, the random generator state, the 
variable values, and so on.
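
In code terms the retry path would be roughly this (a rough sketch with
made-up field and helper names, not actual patch code):

   /* rough sketch: restore what was saved at transaction start, then retry */
   static void
   resetForRetry(CState *st)
   {
       st->random_state = st->retry_state.random_state; /* same random choices */
       copyVariables(&st->variables, &st->retry_state.variables);
       conditional_stack_reset(st->cstack);  /* forget any pending \if state */
       st->command = 0;                      /* restart at the first command */
       st->state = CSTATE_START_TX;
   }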

This means somehow enforcing that one script is one transaction.

If the user does not do that, it would be their decision and the result 
becomes unpredictable on errors (eg some sub-transactions could be 
executed more than once).

Then if more is needed, that could be for another patch.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 26-03-2018 18:53, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> Many thanks to both of you! I'm working on a patch in this direction..
>> 
>>> I think that the best approach for now is simply to reset (command
>>> zero, random generator) and start over the whole script, without
>>> attempting to be more intelligent. The limitations should be clearly
>>> documented (one transaction per script), though. That would be a
>>> significant enhancement already.
>> 
>> I'm not sure that we can always do this, because we can get new errors 
>> until we finish the failed transaction block, and we need to destroy the 
>> conditional stack..
> 
> Sure. I'm suggesting, in order to simplify things, that on failures the retry
> would always restart from the beginning of the script by resetting
> everything, indeed including the conditional stack, the random
> generator state, the variable values, and so on.
> 
> This means somehow enforcing that one script is one transaction.
> 
> If the user does not do that, it would be their decision and the
> result becomes unpredictable on errors (eg some sub-transactions could
> be executed more than once).
> 
> Then if more is needed, that could be for another patch.

Here is the fifth version of the patch for pgbench (based on the commit 
4b9094eb6e14dfdbed61278ea8e51cc846e43579) where I tried to implement 
these ideas, thanks to your comments and those of Teodor Sigaev. Since 
we may need to execute commands to complete a failed transaction block, 
the script is now always executed completely. If there is a 
serialization/deadlock failure which can be retried, the script is 
executed again with the same random state and array of variables as 
before its first run. Meta command errors as well as all SQL errors do 
not cause the client to abort. The first failure in the current 
script execution determines whether the script run will be retried or 
not, so only such failures (they have a retry) or errors (they are not 
retried) are reported.

I tried to make fixes in accordance with your previous reviews ([1], 
[2], [3]):

> I'm unclear about the example added in the documentation. There
> are 71% errors, but 100% of transactions are reported as processed. If
> there were errors, then it is not a success, so the transaction were
> not
> processed? To me it looks inconsistent. Also, while testing, it seems
> that
> failed transactions are counted in tps, which I think is not
> appropriate:
> 
> 
> About the feature:
> 
>  sh> PGOPTIONS='-c default_transaction_isolation=serializable' \
>        ./pgbench -P 1 -T 3 -r -M prepared -j 2 -c 4
>  starting vacuum...end.
>  progress: 1.0 s, 10845.8 tps, lat 0.091 ms stddev 0.491, 10474 failed
>  # NOT 10845.8 TPS...
>  progress: 2.0 s, 10534.6 tps, lat 0.094 ms stddev 0.658, 10203 failed
>  progress: 3.0 s, 10643.4 tps, lat 0.096 ms stddev 0.568, 10290 failed
>  ...
>  number of transactions actually processed: 32028 # NO!
>  number of errors: 30969 (96.694 %)
>  latency average = 2.833 ms
>  latency stddev = 1.508 ms
>  tps = 10666.720870 (including connections establishing) # NO
>  tps = 10683.034369 (excluding connections establishing) # NO
>  ...
> 
> For me this is all wrong. I think that the tps report is about
> transactions
> that succeeded, not mere attempts. I cannot say that a transaction
> which aborted
> was "actually processed"... as it was not.

Fixed

> The order of reported elements is not logical:
> 
>  maximum number of transaction tries: 100
>  scaling factor: 10
>  query mode: prepared
>  number of clients: 4
>  number of threads: 2
>  duration: 3 s
>  number of transactions actually processed: 967
>  number of errors: 152 (15.719 %)
>  latency average = 9.630 ms
>  latency stddev = 13.366 ms
>  number of transactions retried: 623 (64.426 %)
>  number of retries: 32272
> 
> I would suggest to group everything about error handling in one block,
> eg something like:
> 
>  scaling factor: 10
>  query mode: prepared
>  number of clients: 4
>  number of threads: 2
>  duration: 3 s
>  number of transactions actually processed: 967
>  number of errors: 152 (15.719 %)
>  number of transactions retried: 623 (64.426 %)
>  number of retries: 32272
>  maximum number of transaction tries: 100
>  latency average = 9.630 ms
>  latency stddev = 13.366 ms

Fixed

> Also, percent character should be stuck to its number: 15.719% to have
> the style more homogeneous (although there seems to be pre-existing
> inhomogeneities).
> 
> I would replace "transaction tries/retried" by "tries/retried",
> everything
> is about transactions in the report anyway.
> 
> Without reading the documentation, the overall report semantics is
> unclear,
> especially given the absurd tps results I got with the my first
> attempt,
> as failing transactions are counted as "processed".

Fixed

> About the code:
> 
> I'm at a loss with the 7 states added to the automaton, where I would
> have hoped
> that only 2 (eg RETRY & FAIL, or even less) would be enough.

Fixed

> I'm wondering whether the whole feature could be simplified by
> considering that one script is one "transaction" (it is from the
> report point of view at least), and that any retry is for the full
> script only, from its beginning. That would remove the trying to guess
> at transactions begin or end, avoid scanning manually for subcommands,
> and so on.
>  - Would it make sense?
>  - Would it be ok for your use case?

Fixed

> The proposed version of the code looks unmaintainable to me. There are
> 3 levels of nested "switch/case" with state changes at the deepest
> level.
> I cannot even see it on my screen which is not wide enough.

Fixed

> There should be a typedef for "random_state", eg something like:
> 
>   typedef struct { unsigned short data[3]; } RandomState;
> 
> Please keep "const" declarations, eg "commandFailed".
> 
> I think that choosing script should depend on the thread random state,
> not
> the client random state, so that a run would generate the same pattern
> per
> thread, independently of which client finishes first.
> 
> I'm sceptical of the "--debug-fails" options. ISTM that --debug is
> already there
> and should just be reused.

Fixed

> I agree that function naming style is a already a mess, but I think
> that
> new functions you add should use a common style, eg "is_compound" vs
> "canRetry".

Fixed

> Translating error strings to their enum should be put in a function.

Removed

> I'm not sure this whole thing should be done anyway.

The processing of compound commands is removed.

> The "node" is started but never stopped.

Fixed

> For file contents, maybe the << 'EOF' here-document syntax would help
> instead
> of using concatenated backslashed strings everywhere.

I'm sorry, but I could not get it to work with regular expressions :(

> I'd start by stating (i.e. documenting) that the features assumes that 
> one
> script is just *one* transaction.
> 
> Note that pgbench somehow already assumes that one script is one
> transaction when it reports performance anyway.
> 
> If you want 2 transactions, then you have to put them in two scripts,
> which looks fine with me. Different transactions are expected to be
> independent, otherwise they should be merged into one transaction.

Fixed

> Under these restrictions, ISTM that a retry is something like:
> 
>    case ABORTED:
>       if (we want to retry) {
>          // do necessary stats
>          // reset the initial state (random, vars, current command)
>          state = START_TX; // loop
>       }
>       else {
>         // count as failed...
>         state = FINISHED; // or done.
>       }
>       break;
...
> I'm fine with having END_COMMAND skipping to START_TX if it can be done
> easily and cleanly, esp without code duplication.

I did not want to add additional if-expressions to most of 
the code in CSTATE_START_TX/CSTATE_END_TX/CSTATE_END_COMMAND, so 
CSTATE_FAILURE is used instead of CSTATE_END_COMMAND in case of failure,
and CSTATE_RETRY is entered before CSTATE_END_TX if there was a failure 
during the current script execution.

> ISTM that ABORTED & FINISHED are currently exactly the same. That would
> put a particular use to aborted. Also, there are many points where the
> code may go to "aborted" state, so reusing it could help avoid 
> duplicating
> stuff on each abort decision.

To end and roll back the failed transaction block, the script is always 
executed completely, and after the failure the following script commands 
are still executed..

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre
[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121309300.10810%40lancre
[3] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801121607310.13422%40lancre

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Teodor Sigaev
Дата:
The conception of the max-retry option seems strange to me. If the number of 
retries reaches the max-retry option, then we just increment the counter of 
failed transactions and try again (possibly with different random numbers). At 
the end we should distinguish the number of error transactions from the number 
of failed transactions; to find this difference the documentation suggests 
rerunning pgbench with debugging on.

Maybe I didn't catch the idea, but it seems to me that max-tries should be 
removed. On a transaction serialization or deadlock error pgbench should 
increment the counter of failed transactions, reset the conditional stack, 
variables, etc. but not the random generator, and then start a new transaction 
from the first line of the script.


-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
                                                    WWW: http://www.sigaev.ru/


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> The conception of the max-retry option seems strange to me. If the number of 
> retries reaches the max-retry option, then we just increment the counter of 
> failed transactions and try again (possibly with different random numbers). At 
> the end we should distinguish the number of error transactions from the number 
> of failed transactions; to find this difference the documentation suggests 
> rerunning pgbench with debugging on.
>
> Maybe I didn't catch the idea, but it seems to me that max-tries should be 
> removed. On a transaction serialization or deadlock error pgbench should 
> increment the counter of failed transactions, reset the conditional stack, 
> variables, etc. but not the random generator, and then start a new 
> transaction from the first line of the script.

ISTM that the idea is that the client application should give up 
at some point and report an error to the end user, kind of a "timeout" on 
trying, and that max-retry would implement this logic of giving up: the 
transaction which was intended, represented by a given initial random 
generator state, could not be committed even after some iterations.

Maybe the max retry should rather be expressed in time rather than in number 
of attempts, or both approaches could be implemented? But there is a logic 
of retrying the same thing (try again what the client wanted) vs retrying 
something different (another client need is served).

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 29-03-2018 22:39, Fabien COELHO wrote:
>> The conception of the max-retry option seems strange to me. If the number 
>> of retries reaches the max-retry option, then we just increment the counter 
>> of failed transactions and try again (possibly with different random 
>> numbers).

Then the client starts another script, but by chance, or because of the 
number of scripts, it can be the same one.

>> At the end we should distinguish the number of error transactions and 
>> failed transactions; to find this difference the documentation suggests 
>> rerunning pgbench with debugging on.

If I understood you correctly, this difference is the total number of 
retries and this is included in all reports.

>> Maybe I didn't catch the idea, but it seems to me that max-tries should be 
>> removed. On a transaction serialization or deadlock error pgbench 
>> should increment the counter of failed transactions, reset the conditional 
>> stack, variables, etc. but not the random generator, and then start a new 
>> transaction from the first line of the script.

When I sent the first version of the patch there were only rollbacks, 
and the idea of retrying failed transactions was approved (see [1], [2], 
[3], [4]). And thank you: I fixed the patch to reset the client 
variables in case of errors too, and not only in case of retries (see 
the attached version, which is based on the commit 
3da7502cd00ddf8228c9a4a7e4a08725decff99c).

> ISTM that the idea is that the client application should give
> up at some point and report an error to the end user, kind of a
> "timeout" on trying, and that max-retry would implement this logic of
> giving up: the transaction which was intended, represented by a given
> initial random generator state, could not be committed even after
> some iterations.
> 
> Maybe the max retry should rather be expressed in time rather than in
> number of attempts, or both approaches could be implemented? But there
> is a logic of retrying the same thing (try again what the client wanted)
> vs retrying something different (another client need is served).

I'm afraid that we will have a problem in debugging mode: should we 
report a failure (which will be retried) or an error (which will not be 
retried)? Because only after executing the following script commands (to 
roll back this transaction block) will we know how much time we spent on 
the execution of the current script..

[1] 
https://www.postgresql.org/message-id/CACjxUsOfbn72EaH4i_OuzdY-0PUYfg1Y3o8G27tEA8fJOaPQEw%40mail.gmail.com
[2] 
https://www.postgresql.org/message-id/20170615211806.sfkpiy2acoavpovl%40alvherre.pgsql
[3] 
https://www.postgresql.org/message-id/CAEepm%3D3TRTc9Fy%3DfdFThDa4STzPTR6w%3DRGfYEPikEkc-Lcd%2BMw%40mail.gmail.com
[4] 
https://www.postgresql.org/message-id/CACjxUsOQw%3DvYjPWZQ29GmgWU8ZKj336OGiNQX5Z2W-AcV12%2BNw%40mail.gmail.com

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello, hackers!

Here is the seventh version of the patch for error handling and 
retrying of transactions with serialization/deadlock failures in pgbench 
(based on the commit a08dc711952081d63577fc182fcf955958f70add). I added 
the option --max-tries-time, which is an implementation of Fabien Coelho's 
proposal in [1]: a transaction with a serialization or deadlock failure 
can be retried if the total time of all its tries is less than this 
limit (in ms). This option can be combined with the option --max-tries. 
But if neither of them is used, failed transactions are not retried at 
all.
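
The combined check is roughly the following (a simplified sketch; the
variable and field names are only for illustration):

   /* sketch: decide whether a failed transaction may be tried again */
   static bool
   canRetry(CState *st, int64 tries_duration_ms)
   {
       /* if neither option is used, failed transactions are not retried */
       if (max_tries == 0 && max_tries_time == 0)
           return false;
       if (max_tries > 0 && st->tries >= max_tries)
           return false;    /* the maximum number of tries is reached */
       if (max_tries_time > 0 && tries_duration_ms >= max_tries_time)
           return false;    /* the tries already took longer than the limit */
       return true;
   }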

Also:
* Now when the first failure occurs in a transaction it is always 
reported as a failure, since only after the remaining commands of this 
transaction are executed do we find out whether we can try again or not. 
Therefore the messages about retrying or ending the failed transaction 
are added at the "fails" debugging level, so you can distinguish 
failures (which are retried) from errors (which are not retried).
* Fix the report of the latency average, because the total time includes 
time for both errors and successful transactions.
* Code cleanup (including tests).

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1803292134380.16472%40lancre

> Maybe the max retry should rather be expressed in time rather than 
> number
> of attempts, or both approach could be implemented?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Ildus Kurbangaliev
Дата:
On Wed, 04 Apr 2018 16:07:25 +0300
Marina Polyakova <m.polyakova@postgrespro.ru> wrote:

> Hello, hackers!
> 
> Here there's a seventh version of the patch for error handling and 
> retrying of transactions with serialization/deadlock failures in
> pgbench (based on the commit
> a08dc711952081d63577fc182fcf955958f70add). I added the option
> --max-tries-time which is an implementation of Fabien Coelho's
> proposal in [1]: the transaction with serialization or deadlock
> failure can be retried if the total time of all its tries is less
> than this limit (in ms). This option can be combined with the option
> --max-tries. But if none of them are used, failed transactions are
> not retried at all.
> 
> Also:
> * Now when the first failure occurs in the transaction it is always 
> reported as a failure since only after the remaining commands of this 
> transaction are executed we find out whether we can try again or not. 
> Therefore add the messages about retrying or ending the failed 
> transaction to the "fails" debugging level so you can distinguish 
> failures (which are retried) and errors (which are not retried).
> * Fix a report on the latency average because the total time includes 
> time for both errors and successful transactions.
> * Code cleanup (including tests).
> 
> [1] 
> https://www.postgresql.org/message-id/alpine.DEB.2.20.1803292134380.16472%40lancre
> 
> > Maybe the max retry should rather be expressed in time rather than 
> > number
> > of attempts, or both approach could be implemented?  
> 

Hi, I did a little review of your patch. It seems to work as
expected, documentation and tests are there. Still I have a few comments.

There are a lot of checks like "if (debug_level >= DEBUG_FAILS)" with a
corresponding fprintf(stderr, ...). I think it's time to do it like in the
main code: wrap them with some function like log(level, msg).

In the CSTATE_RETRY state, used_time is used only for printing but is
calculated more often than needed.

In my opinion Debuglevel should be renamed to DebugLevel, which looks
nicer; there is also DEBUGLEVEl (where the last letter is in lower case),
which is very confusing.

I have checked the overall functionality of this patch, but haven't checked
any special cases yet.

-- 
---
Ildus Kurbangaliev
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
> Hi, I did a little review of your patch. It seems to work as
> expected, documentation and tests are there. Still I have few comments.

Hello! Thank you very much! I attached the fixed version of the patch 
(based on the commit 94c1f9ba11d1241a2b3b2be7177604b26b08bc3d). Also, 
thanks to Fabien Coelho's comments outside of this thread, I removed the 
option --max-tries-time; the option --latency-limit can now be used to 
limit the time of transaction tries.

> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

I agree, fixed.

> In CSTATE_RETRY state used_time is used only in printing but calculated
> more than needed.

Sorry, fixed.

> In my opinion Debuglevel should be renamed to DebugLevel that looks
> nicer, also there DEBUGLEVEl (where last letter is in lower case) which
> is very confusing.

Sorry for these typos =[ Fixed.

> I have checked overall functionality of this patch, but haven't checked
> any special cases yet.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

FYI the v8 patch does not apply anymore, mostly because of a recent perl 
reindentation.

I think that I'll have time for a round of review in the first half of 
July. Providing a rebased patch before then would be nice.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
Fabien COELHO wrote:

> I think that I'll have time for a round of review in the first half of July.
> Providing a rebased patch before then would be nice.

Note that even in the absence of a rebased patch, you can apply to an
older checkout if you have some limited window of time for a review.

Looking over the diff, I find that this patch tries to do too much and
needs to be split up.  At a minimum there is a preliminary patch that
introduces the error reporting stuff (errstart etc); there are other
thread-related changes (for example to the random generation functions)
that probably belong in a separate one too.  Not sure if there are other
smaller patches hidden inside the rest.

On elog/errstart: we already have a convention for what ereport() calls
look like; I suggest to use that instead of inventing your own.  With
that, is there a need for elog()?  In the backend we have it because
$HISTORY but there's no need for that here -- I propose to lose elog()
and use only ereport everywhere.  Also, I don't see that you need
errmsg_internal() at all; let's lose it too.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Alvaro,

>> I think that I'll have time for a round of review in the first half of July.
>> Providing a rebased patch before then would be nice.

> Note that even in the absence of a rebased patch, you can apply to an
> older checkout if you have some limited window of time for a review.

Yes, sure. I'd like to bring this feature to be committable, so it will 
have to be rebased at some point anyway.

> Looking over the diff, I find that this patch tries to do too much and
> needs to be split up.

Yep, I agree that it would help the reviewing process. On the other hand I 
have bad memories about maintaining dependent patches which interfere 
significantly. Maybe that is not the case with this feature.

Thanks for the advice.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
Hello,

Fabien COELHO wrote:

> > Looking over the diff, I find that this patch tries to do too much and
> > needs to be split up.
> 
> Yep, I agree that it would help the reviewing process. On the other hand I
> have bad memories about maintaining dependent patches which interfere
> significantly.

Sure.  I suggest not posting these patches separately -- instead, post
as a series of commits in a single email, attaching files from "git
format-patch".

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
Hello!

Fabien and Alvaro, thank you very much! And sorry for such a late reply 
(I was a bit busy and implementing ereport took some time..) :-( Below is a 
rebased version of the patch (commit 
9effb63e0dd12b0704cd8e11106fe08ff5c9d685) divided into several smaller 
patches:

v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).

v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch
- a patch for the ereport() macro (this is used to report client 
failures that do not cause an abort, and this depends on the level of 
debugging).
- implementation: if possible, use the local ErrorData structure during 
the errstart()/errmsg()/errfinish() calls. Otherwise use a static 
variable protected by a mutex if necessary. To do all of this, the 
function appendPQExpBufferVA is exported from libpq.

v9-0004-Pgbench-errors-and-serialization-deadlock-retries.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

Any suggestions are welcome!

On 08-05-2018 9:00, Fabien COELHO wrote:
> Hello Marina,
> 
> FYI the v8 patch does not apply anymore, mostly because of a recent
> perl reindentation.
> 
> I think that I'll have time for a round of review in the first half of
> July. Providing a rebased patch before then would be nice.

They are attached, but a little delayed due to testing..

On 08-05-2018 13:58, Alvaro Herrera wrote:
> Looking over the diff, I find that this patch tries to do too much and
> needs to be split up.  At a minimum there is a preliminary patch that
> introduces the error reporting stuff (errstart etc); there are other
> thread-related changes (for example to the random generation functions)
> that probably belong in a separate one too.  Not sure if there are 
> other
> smaller patches hidden inside the rest.

Here is an attempt to do it..

> On elog/errstart: we already have a convention for what ereport() calls
> look like; I suggest to use that instead of inventing your own.  With
> that, is there a need for elog()?  In the backend we have it because
> $HISTORY but there's no need for that here -- I propose to lose elog()
> and use only ereport everywhere.  Also, I don't see that you need
> errmsg_internal() at all; let's lose it too.

I agree, done. But there are some changes needed to make such a design 
thread-safe..

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Attachments

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
> - a patch for the RandomState structure (this is used to reset a client's 
> random seed during the repeating of transactions after serialization/deadlock 
> failures).

A few comments about this first patch.

Patch applies cleanly, compiles, global & pgbench "make check" ok.

I'm mostly ok with the changes, which cleanly separate the different use 
of random between threads (script choice, throttle delay, sampling...) and 
client (random*() calls).

This change is necessary so that a client can restart a transaction 
deterministically (at the client level at least), which is the ultimate 
aim of the patch series.
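
As a minimal sketch of this idea (illustrative names, not the patch's 
exact code), the client's generator state lives in its own small struct, 
a copy is taken when the transaction starts, and the copy is restored 
before a retry:

   /* per-client PRNG state, assuming the pg_erand48() family (src/port/erand48.c) */
   typedef struct
   {
       unsigned short xseed[3];
   } RandomState;

   typedef struct
   {
       RandomState random_state;    /* state consumed by the random*() functions */
       RandomState retry_state;     /* snapshot taken when the transaction starts */
       /* ... other per-client fields ... */
   } Client;

   static void
   saveRandomState(Client *c)
   {
       c->retry_state = c->random_state;    /* plain struct assignment suffices */
   }

   static void
   restoreRandomState(Client *c)
   {
       /* a retried transaction then draws exactly the same random values */
       c->random_state = c->retry_state;
   }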

A few remarks:

The RandomState struct is 6 bytes, which will induce some padding when 
used. This is life and pre-existing. No problem.

ISTM that the struct itself does not need a name, ie. "typedef struct { 
... } RandomState" is enough.

There could be clear comments, say in the TState and CState structs, about 
what randomness is impacted (i.e. script choices, etc.).

getZipfianRand, computeHarmonicZipfian: The "thread" parameter was 
justified because it was used for two fields. As the random state is 
separated, I'd suggest that the other argument should be a zipfcache 
pointer.

While reading your patch, it occurs to me that a run is not deterministic 
at the thread level under throttling and sampling, because the random 
state is solicited differently depending on when a transaction ends. This 
suggests that maybe each use of the thread's random state should have its 
own random state.

In passing, and totally unrelated to this patch:

I've always been a little puzzled about why a quite small 48-bit internal 
state random generator is used. I understand the need for pg to have a 
portable & state-controlled thread-safe random generator, but why this 
particular small one fails me. The source code (src/port/erand48.c, 
copyright in 1993...) looks optimized for 16-bit architectures, which is 
probably pretty inefficient to run on 64-bit architectures. Maybe this 
could be updated with something more consistent with today's processors, 
providing more quality at a lower cost.
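
As one example of what such an update could look like (purely 
illustrative, not something this patch proposes), the public-domain 
splitmix64 generator keeps 64-bit state and needs only a few 64-bit 
multiplications and shifts per value:

   #include <stdint.h>

   static uint64_t
   splitmix64_next(uint64_t *state)
   {
       uint64_t z = (*state += UINT64_C(0x9E3779B97F4A7C15));

       z = (z ^ (z >> 30)) * UINT64_C(0xBF58476D1CE4E5B9);
       z = (z ^ (z >> 27)) * UINT64_C(0x94D049BB133111EB);
       return z ^ (z >> 31);
   }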

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
> - a patch for the Variables structure (this is used to reset client variables 
> during the repeating of transactions after serialization/deadlock failures).

About this second patch:

This extracts the variable-holding structure, so that it is somewhat easier 
to reset the variables to their initial state on transaction failures, the 
management of which is the ultimate aim of this patch series.

It is also cleaner this way.

Patch applies cleanly on top of the previous one (there is no real 
interactions with it). It compiles cleanly. Global & pgbench "make check" 
are both ok.

The structure typedef does not need a name. "typedef struct { } V...".

I tend to disagree with naming things after their type, eg "array". I'd 
suggest "vars" instead. "nvariables" could be "nvars" for consistency with 
that and "vars_sorted", and because "foo.variables->nvariables" starts 
looking heavy.

I'd suggest to put the "Variables" type declaration just after the 
"Variable" type declaration in the file.
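
Putting these naming suggestions together, the holder type could look 
roughly like this (a sketch only; Variable stands for pgbench's existing 
per-variable struct):

   #include <stdbool.h>

   typedef struct Variable Variable;   /* assumed to be declared just above */

   typedef struct
   {
       Variable   *vars;           /* array of variables */
       int         nvars;          /* number of variables */
       bool        vars_sorted;    /* is the array sorted by variable name? */
   } Variables;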

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 09-06-2018 9:55, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's random seed during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> A few comments about this first patch.

Thank you very much!

> Patch applies cleanly, compiles, global & pgbench "make check" ok.
> 
> I'm mostly ok with the changes, which cleanly separate the different
> use of random between threads (script choice, throttle delay,
> sampling...) and client (random*() calls).

Glad to hear it :)

> This change is necessary so that a client can restart a transaction
> deterministically (at the client level at least), which is the
> ultimate aim of the patch series.
> 
> A few remarks:
> 
> The RandomState struct is 6 bytes, which will induce some padding when
> used. This is life and pre-existing. No problem.
> 
> ISTM that the struct itself does not need a name, ie. "typedef struct
> { ... } RandomState" is enough.

Ok!

> There could be clear comments, say in the TState and CState structs,
> about what randomness is impacted (i.e. script choices, etc.).

Thank you, I'll add them.

> getZipfianRand, computeHarmonicZipfian: The "thread" parameter was
> justified because it was used for two fieds. As the random state is
> separated, I'd suggest that the other argument should be a zipfcache
> pointer.

I agree with you and I will change it.

> While reading your patch, it occurs to me that a run is not
> deterministic at the thread level under throttling and sampling,
> because the random state is sollicited differently depending on when
> transaction ends. This suggest that maybe each thread random_state use
> should have its own random state.

Thank you, I'll fix this.

> In passing, and totally unrelated to this patch:
> 
> I've always been a little puzzled about why a quite small 48-bit
> internal state random generator is used. I understand the need for pg
> to have a portable & state-controlled thread-safe random generator,
> but why this particular small one fails me. The source code
> (src/port/erand48.c, copyright in 1993...) looks optimized for 16 bits
> architectures, which is probably pretty inefficent to run on 64 bits
> architectures. Maybe this could be updated with something more
> consistent with today's processors, providing more quality at a lower
> cost.

This sounds interesting, thanks!
*went to look for a multiplier and a summand that are large enough and 
are mutually prime..*

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 09-06-2018 16:31, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
>> - a patch for the Variables structure (this is used to reset client 
>> variables during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> About this second patch:
> 
> This extracts the variable holding structure, so that it is somehow
> easier to reset them to their initial state on transaction failures,
> the management of which is the ultimate aim of this patch series.
> 
> It is also cleaner this way.
> 
> Patch applies cleanly on top of the previous one (there is no real
> interactions with it). It compiles cleanly. Global & pgbench "make
> check" are both ok.

:-)

> The structure typedef does not need a name. "typedef struct { } V...".

Ok!

> I tend to disagree with naming things after their type, eg "array".
> I'd suggest "vars" instead. "nvariables" could be "nvars" for
> consistency with that and "vars_sorted", and because
> "foo.variables->nvariables" starts looking heavy.
> 
> I'd suggest but "Variables" type declaration just after "Variable"
> type declaration in the file.

Thank you, I agree and I'll fix all this.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch
> - a patch for the ereport() macro (this is used to report client failures 
> that do not cause an aborts and this depends on the level of debugging).

ISTM that abort() is called under FATAL.

> - implementation: if possible, use the local ErrorData structure during the 
> errstart()/errmsg()/errfinish() calls. Otherwise use a static variable 
> protected by a mutex if necessary. To do all of this export the function 
> appendPQExpBufferVA from libpq.

This patch applies cleanly on top of the other ones (there are minimal 
interactions), compiles cleanly, global & pgbench "make check" are ok.

IMO this patch is more controversial than the other ones.

It is not really related to the aim of the patch series, which could do 
without, couldn't it? Moreover, it changes pgbench current behavior, which 
might be admissible, but should be discussed clearly.

I'd suggest that it should be an independent submission, unrelated to the 
pgbench error management patch.

The code adapts/duplicates existing server-side "ereport" stuff and brings 
it to the frontend, where the logging needs are somehow quite different.

I'd prefer to avoid duplication and/or have some code sharing. If it 
really needs to be duplicated, I'd suggest to put all this stuff in 
separate files. If we want to do that, I think that it would belong to 
fe_utils, where it could/should be used by all front-end programs.

I do not understand why names are changed, eg ELEVEL_FATAL instead of 
FATAL. ISTM that part of the point of the move would be to be homogeneous, 
which suggests that the same names should be reused.

For logging purposes, ISTM that the "elog" macro interface is nicer, 
closer to the existing "fprintf(stderr", as it would not introduce the 
additional parentheses hack for "rest".

I see no actual value in creating a dynamic buffer on the fly through 
plenty of macros and functions, as the end result is just to print the 
message out to stderr in the end.

   errfinishImpl: fprintf(stderr, "%s", error->message.data);

This looks like overkill. From reading the code, this does not look
like an improvement:

   fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con));

vs

   ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", PQerrorMessage(st->con))));

The whole complexity of the server-side interface only makes sense because 
of the TRY/CATCH stuff and complex logging requirements (eg several 
outputs) in the backend. The patch adds quite some code and complexity 
without clear added value that I can see.

The semantics of the existing code are changed: the FATAL level calls 
abort() and replaces existing exit(1) calls. Maybe you want an ERROR level 
as well.

My 0.02€: maybe you just want to turn

   fprintf(stderr, format, ...);
   // then possibly exit or abort depending...

into

   elog(level, format, ...);

which maybe would exit or abort depending on level, and possibly not 
actually report under some levels and/or some conditions. For that, it 
could be enough to just provide a nice "elog" function.

In conclusion, which you can disagree with because maybe I have missed 
something... anyway I currently think that:

  - it should be an independent submission

  - possibly at "fe_utils" level

  - possibly just a nice "elog" function is enough, if so just do that.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 10-06-2018 10:38, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v9-0003-Pgbench-errors-use-the-ereport-macro-to-report-de.patch
>> - a patch for the ereport() macro (this is used to report client 
>> failures that do not cause an aborts and this depends on the level of 
>> debugging).
> 
> ISTM that abort() is called under FATAL.

If you mean aborting the client, this is not an abort of the main 
program.

>> - implementation: if possible, use the local ErrorData structure 
>> during the errstart()/errmsg()/errfinish() calls. Otherwise use a 
>> static variable protected by a mutex if necessary. To do all of this 
>> export the function appendPQExpBufferVA from libpq.
> 
> This patch applies cleanly on top of the other ones (there are minimal
> interactions), compiles cleanly, global & pgbench "make check" are ok.

:-)

> IMO this patch is more controversial than the other ones.
> 
> It is not really related to the aim of the patch series, which could
> do without, couldn't it?

> I'd suggest that it should be an independent submission, unrelated to
> the pgbench error management patch.

I suppose that this is related; because of my patch there may be a lot 
of such code (see v7 in [1]):

-            fprintf(stderr,
-                    "malformed variable \"%s\" value: \"%s\"\n",
-                    var->name, var->svalue);
+            if (debug_level >= DEBUG_FAILS)
+            {
+                fprintf(stderr,
+                        "malformed variable \"%s\" value: \"%s\"\n",
+                        var->name, var->svalue);
+            }

-        if (debug)
+        if (debug_level >= DEBUG_ALL)
              fprintf(stderr, "client %d sending %s\n", st->id, sql);

That's why it was suggested to make the error function which hides all 
these things (see [2]):

There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with 
corresponding fprintf(stderr..) I think it's time to do it like in the 
main code, wrap with some function like log(level, msg).

> Moreover, it changes pgbench current
> behavior, which might be admissible, but should be discussed clearly.

> The semantics of the existing code is changed, the FATAL levels calls
> abort() and replace existing exit(1) calls. Maybe you want an ERROR
> level as well.

Oh, thanks, I agree with you. And I do not want to change the program 
exit code without good reasons, but I'm sorry I may not know all pros 
and cons in this matter..

Or did you also mean other changes?

> The code adapts/duplicates existing server-side "ereport" stuff and
> brings it to the frontend, where the logging needs are somehow quite
> different.
> 
> I'd prefer to avoid duplication and/or have some code sharing.

I was recommended to use the same interface in [3]:

On elog/errstart: we already have a convention for what ereport() calls 
look like; I suggest to use that instead of inventing your own.

> If it
> really needs to be duplicated, I'd suggest to put all this stuff in
> separated files. If we want to do that, I think that it would belong
> to fe_utils, and where it could/should be used by all front-end
> programs.

I'll try to do it..

> I do not understand why names are changed, eg ELEVEL_FATAL instead of
> FATAL. ISTM that part of the point of the move would be to be
> homogeneous, which suggests that the same names should be reused.

Ok!

> For logging purposes, ISTM that the "elog" macro interface is nicer,
> closer to the existing "fprintf(stderr", as it would not introduce the
> additional parentheses hack for "rest".

I was also recommended to use ereport() instead of elog() in [3]:

With that, is there a need for elog()?  In the backend we have it 
because $HISTORY but there's no need for that here -- I propose to lose 
elog() and use only ereport everywhere.

> I see no actual value in creating on the fly a dynamic buffer through
> plenty macros and functions as the end result is just to print the
> message out to stderr in the end.
> 
>   errfinishImpl: fprintf(stderr, "%s", error->message.data);
> 
> This looks like overkill. From reading the code, this does not look
> like an improvement:
> 
>   fprintf(stderr, "invalid socket: %s", PQerrorMessage(st->con));
> 
> vs
> 
>   ereport(ELEVEL_LOG, (errmsg("invalid socket: %s", 
> PQerrorMessage(st->con))));
> 
> The whole complexity of the server-side interface only make sense
> because TRY/CATCH stuff and complex logging requirements (eg several
> outputs) in the backend. The patch adds quite some code and complexity
> without clear added value that I can see.

> My 0.02€: maybe you just want to turn
> 
>   fprintf(stderr, format, ...);
>   // then possibly exit or abort depending...
> 
> into
> 
>   elog(level, format, ...);
> 
> which maybe would exit or abort depending on level, and possibly not
> actually report under some levels and/or some conditions. For that, it
> could enough to just provide an nice "elog" function.

I agree that elog() can be coded in this way. To use ereport() I need a 
structure to store the error level as a condition to exit.
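
For illustration, here is a stripped-down, single-threaded sketch of that 
idea (the real patch builds the message in a buffer, protects shared state 
with a mutex, and uses its own set of ELEVEL_* names; everything below is 
simplified and the names are placeholders):

   #include <stdarg.h>
   #include <stdbool.h>
   #include <stdio.h>
   #include <stdlib.h>

   typedef enum { ELEVEL_DEBUG, ELEVEL_LOG, ELEVEL_ERROR, ELEVEL_FATAL } ErrorLevel;

   static ErrorLevel cur_elevel;       /* stores the level of the message being built */

   static bool
   errstart(ErrorLevel elevel)
   {
       cur_elevel = elevel;
       return elevel > ELEVEL_DEBUG;   /* here: suppress only debug messages */
   }

   static int
   errmsg(const char *fmt, ...)
   {
       va_list ap;

       /* a real implementation would append to a buffer instead of printing */
       va_start(ap, fmt);
       vfprintf(stderr, fmt, ap);
       va_end(ap);
       fputc('\n', stderr);
       return 0;                       /* return value is ignored, as in the backend */
   }

   static void
   errfinish(void)
   {
       if (cur_elevel >= ELEVEL_FATAL)
           exit(1);                    /* the stored level decides whether to exit */
   }

   /* backend-style call syntax: ereport(LEVEL, (errmsg(...))); */
   #define ereport(elevel, rest) \
       do { if (errstart(elevel)) { (void) (rest); errfinish(); } } while (0)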

> In conclusion, which you can disagree with because maybe I have missed
> something... anyway I currently think that:
> 
>  - it should be an independent submission
> 
>  - possibly at "fe_utils" level
> 
>  - possibly just a nice "elog" function is enough, if so just do that.

I hope I answered all this above..

[1] 
https://www.postgresql.org/message-id/453fa52de88477df2c4a2d82e09e461c%40postgrespro.ru
[2] 
https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain
[3] 
https://www.postgresql.org/message-id/20180508105832.6o3uf3npfpjgk5m7%40alvherre.pgsql

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> I suppose that this is related; because of my patch there may be a lot of 
> such code (see v7 in [1]):
>
> -            fprintf(stderr,
> -                    "malformed variable \"%s\" value: \"%s\"\n",
> -                    var->name, var->svalue);
> +            if (debug_level >= DEBUG_FAILS)
> +            {
> +                fprintf(stderr,
> +                        "malformed variable \"%s\" value: \"%s\"\n",
> +                        var->name, var->svalue);
> +            }
>
> -        if (debug)
> +        if (debug_level >= DEBUG_ALL)
>             fprintf(stderr, "client %d sending %s\n", st->id, sql);

I'm not sure that debug messages need to be kept after debug, if it is 
about debugging pgbench itself. That is debatable.

> That's why it was suggested to make the error function which hides all these 
> things (see [2]):
>
>> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with 
>> corresponding fprintf(stderr..) I think it's time to do it like in the 
>> main code, wrap with some function like log(level, msg).

Yep. I did not write that, but I agree with an "elog" suggestion to switch

   if (...) { fprintf(...); exit/abort/continue/... }

to a simpler:

   elog(level, ...)

>> Moreover, it changes pgbench current behavior, which might be 
>> admissible, but should be discussed clearly.
>
>> The semantics of the existing code is changed, the FATAL levels calls
>> abort() and replace existing exit(1) calls. Maybe you want an ERROR
>> level as well.
>
> Oh, thanks, I agree with you. And I do not want to change the program exit 
> code without good reasons, but I'm sorry I may not know all pros and cons in 
> this matter..
>
> Or did you also mean other changes?

AFAICR I meant switching exit to abort in some cases.

>> The code adapts/duplicates existing server-side "ereport" stuff and
>> brings it to the frontend, where the logging needs are somehow quite
>> different.
>> 
>> I'd prefer to avoid duplication and/or have some code sharing.
>
> I was recommended to use the same interface in [3]:
>
>>> On elog/errstart: we already have a convention for what ereport() 
>>> calls look like; I suggest to use that instead of inventing your own.

The "elog" interface already exists, it is not an invention. "ereport" is 
a hack which is somehow necessary in some cases. I prefer a simple 
function call if possible for the purpose, and ISTM that this is the case.

>> If it really needs to be duplicated, I'd suggest to put all this stuff 
>> in separated files. If we want to do that, I think that it would belong 
>> to fe_utils, and where it could/should be used by all front-end 
>> programs.
>
> I'll try to do it..

Dunno. If you only need one "elog" function which prints a message to 
stderr and decides whether to abort/exit/whatever, maybe it can just be 
kept in pgbench. If there are several complicated functions and 
macros, better with a file. So I'd say it depends.

>> For logging purposes, ISTM that the "elog" macro interface is nicer,
>> closer to the existing "fprintf(stderr", as it would not introduce the
>> additional parentheses hack for "rest".
>
> I was also recommended to use ereport() instead of elog() in [3]:

Probably. Are you hoping that advice from different reviewers should be 
consistent? That seems optimistic:-)

>>> With that, is there a need for elog()?  In the backend we have it 
>>> because $HISTORY but there's no need for that here -- I propose to 
>>> lose elog() and use only ereport everywhere.

See commit 8a07ebb3c172 which turns some ereport into elog...

>> My 0.02€: maybe you just want to turn
>>
>>   fprintf(stderr, format, ...);
>>   // then possibly exit or abort depending...
>> 
>> into
>>
>>   elog(level, format, ...);
>> 
>> which maybe would exit or abort depending on level, and possibly not
>> actually report under some levels and/or some conditions. For that, it
>> could enough to just provide an nice "elog" function.
>
> I agree that elog() can be coded in this way. To use ereport() I need a 
> structure to store the error level as a condition to exit.

Yep. That is a lot of complication which is justified server-side where 
logging requirements are special, but in this case I see it as overkill.

So my current view is that if you only need an "elog" function, it is 
simpler to add it to "pgbench.c".

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Alvaro Herrera
Date:
On 2018-Jun-13, Fabien COELHO wrote:

> > > > With that, is there a need for elog()?  In the backend we have
> > > > it because $HISTORY but there's no need for that here -- I
> > > > propose to lose elog() and use only ereport everywhere.
> 
> See commit 8a07ebb3c172 which turns some ereport into elog...

For context: in the backend, elog() is only used for internal messages
(i.e. "can't-happen" conditions), and ereport() is used for user-facing
messages.  There are many things ereport() has that elog() doesn't, such
as additional message fields (HINT, DETAIL, etc) that I think could have
some use in pgbench as well.  If you use elog() then you can't have that.

Another difference is that in the backend, elog() messages are never
translated, while ereport() messages are translated.  Since pgbench is
translatable I think it would be best to keep those things in sync, to
avoid confusion. (Although of course you could do it differently in
pgbench than backend.)

One thing that just came to mind is that pgbench uses some src/fe_utils
stuff.  I hope having ereport() doesn't cause a conflict with that ...

BTW I think abort() is not the right thing, as it'll cause core dumps if
enabled.  Why not just exit(1)?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 13-06-2018 22:59, Alvaro Herrera wrote:
> For context: in the backend, elog() is only used for internal messages
> (i.e. "can't-happen" conditions), and ereport() is used for user-facing
> messages.  There are many things ereport() has that elog() doesn't, 
> such
> as additional message fields (HINT, DETAIL, etc) that I think could 
> have
> some use in pgbench as well.  If you use elog() then you can't have 
> that.

AFAIU it was not originally intended that the pgbench error messages 
have these fields, so would it be good to change the final output to 
stderr?.. For example:

-        fprintf(stderr, "%s", PQerrorMessage(con));
-        fprintf(stderr, "(ignoring this error and continuing anyway)\n");
+        ereport(LOG,
+                (errmsg("Ignoring the server error and continuing anyway"),
+                 errdetail("%s", PQerrorMessage(con))));

-            fprintf(stderr, "%s", PQerrorMessage(con));
-            if (sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0)
-            {
-                fprintf(stderr, "Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n", PQdb(con));
-            }
-
-            exit(1);
+            ereport(ERROR,
+                    (errmsg("Server error"),
+                     errdetail("%s", PQerrorMessage(con)),
+                     sqlState && strcmp(sqlState, ERRCODE_UNDEFINED_TABLE) == 0 ?
+                     errhint("Perhaps you need to do initialization (\"pgbench -i\") in database \"%s\"\n",
+                             PQdb(con)) : 0));

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 13-06-2018 22:44, Fabien COELHO wrote:
> Hello Marina,
> 
>> I suppose that this is related; because of my patch there may be a lot 
>> of such code (see v7 in [1]):
>> 
>> -            fprintf(stderr,
>> -                    "malformed variable \"%s\" value: \"%s\"\n",
>> -                    var->name, var->svalue);
>> +            if (debug_level >= DEBUG_FAILS)
>> +            {
>> +                fprintf(stderr,
>> +                        "malformed variable \"%s\" value: \"%s\"\n",
>> +                        var->name, var->svalue);
>> +            }
>> 
>> -        if (debug)
>> +        if (debug_level >= DEBUG_ALL)
>>             fprintf(stderr, "client %d sending %s\n", st->id, sql);
> 
> I'm not sure that debug messages needs to be kept after debug, if it
> is about debugging pgbench itself. That is debatable.

AFAICS it is not about debugging pgbench itself, but about more detailed 
information that can be used to understand what exactly happened during 
its launch. In the case of errors this helps to distinguish between 
failures and errors by type (including which limit for retries was 
violated and how far it was exceeded for the serialization/deadlock 
errors).

>>> The code adapts/duplicates existing server-side "ereport" stuff and
>>> brings it to the frontend, where the logging needs are somehow quite
>>> different.
>>> 
>>> I'd prefer to avoid duplication and/or have some code sharing.
>> 
>> I was recommended to use the same interface in [3]:
>> 
>>>> On elog/errstart: we already have a convention for what ereport() 
>>>> calls look like; I suggest to use that instead of inventing your 
>>>> own.
> 
> The "elog" interface already exists, it is not an invention. "ereport"
> is a hack which is somehow necessary in some cases. I prefer a simple
> function call if possible for the purpose, and ISTM that this is the
> case.

> That is a lot of complication which are justified server side
> where logging requirements are special, but in this case I see it as
> overkill.

I think we need ereport() if we want to make detailed error messages 
(see examples in [1])..

>>> If it really needs to be duplicated, I'd suggest to put all this 
>>> stuff in separated files. If we want to do that, I think that it 
>>> would belong to fe_utils, and where it could/should be used by all 
>>> front-end programs.
>> 
>> I'll try to do it..
> 
> Dunno. If you only need one "elog" function which prints a message to
> stderr and decides whether to abort/exit/whatevrer, maybe it can just
> be kept in pgbench. If there are are several complicated functions and
> macros, better with a file. So I'd say it depends.

> So my current view is that if you only need an "elog" function, it is
> simpler to add it to "pgbench.c".

Thank you!

>>> For logging purposes, ISTM that the "elog" macro interface is nicer,
>>> closer to the existing "fprintf(stderr", as it would not introduce 
>>> the
>>> additional parentheses hack for "rest".
>> 
>> I was also recommended to use ereport() instead of elog() in [3]:
> 
> Probably. Are you hoping that advises from different reviewers should
> be consistent? That seems optimistic:-)

To make the patch committable there should be no objection to it..

[1] 
https://www.postgresql.org/message-id/c89fcc380a19380260b5ea463efc1416%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Alvaro,

> For context: in the backend, elog() is only used for internal messages
> (i.e. "can't-happen" conditions), and ereport() is used for user-facing
> messages.  There are many things ereport() has that elog() doesn't, such
> as additional message fields (HINT, DETAIL, etc) that I think could have
> some use in pgbench as well.  If you use elog() then you can't have that.
> [...]

Ok. Then forget elog, but I'm pretty against having a kind of ereport 
which looks greatly overkill to me, because:

  (1) the syntax is pretty heavy, and does not look like a function.

  (2) the implementation allocates a string buffer for the message;
      this is greatly overkill for pgbench, which only needs to print
      to stderr once.

This makes sense server-side because the generated message may be output 
several times (eg stderr, file logging, to the client), and the 
implementation has to work with cpp implementations which do not handle 
varargs (and maybe other reasons).

So I would be in favor of having just a simpler error function. 
Incidentally, one already exists, "pgbench_error", and could be improved, 
extended, or replaced. There is also "syntax_error".

> One thing that just came to mind is that pgbench uses some src/fe_utils
> stuff.  I hope having ereport() doesn't cause a conflict with that ...

Currently ereport does not exist client-side. I do not think that this 
patch is the right moment to decide to do that. Also, there are some 
"elog" calls in libpq, but they are compiled out with a "#ifndef FRONTEND".

> BTW I think abort() is not the right thing, as it'll cause core dumps if
> enabled.  Why not just exit(1)?

Yes, I agree and already reported that.

Conclusion:

My current opinion is that I'm pretty against bringing "ereport" to the 
front-end on this specific pgbench patch. I agree with you that "elog" 
would be misleading there as well, for the arguments you developed above.

I'd suggest to have just one clean and simple pgbench internal function to 
handle errors and possibly exit, debug... Something like

   void pgb_error(FATAL, "error %d raised", 12);

Implemented as

   void pgb_error(int/enum XXX level, const char * format, ...)
   {
      test level and maybe return immediately (eg debug);
      print to stderr;
      exit/abort/return depending;
   }
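
A self-contained sketch of such a function could look as follows (the 
level names and the debug flag are placeholders, and exit(1) is used 
rather than abort(), as suggested elsewhere in this thread):

   #include <stdarg.h>
   #include <stdbool.h>
   #include <stdio.h>
   #include <stdlib.h>

   typedef enum { PGB_DEBUG, PGB_LOG, PGB_FATAL } pgb_level;

   static bool debug = false;          /* stands in for pgbench's -d flag */

   static void
   pgb_error(pgb_level level, const char *fmt, ...)
   {
       va_list ap;

       if (level == PGB_DEBUG && !debug)
           return;                     /* nothing to report at this level */

       va_start(ap, fmt);
       vfprintf(stderr, fmt, ap);
       va_end(ap);

       if (level == PGB_FATAL)
           exit(1);                    /* give up on the whole run */
   }

A call like pgb_error(PGB_FATAL, "error %d raised\n", 12); would then 
print the message and terminate the run.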

Then if some advanced error handling is introduced for front-end programs, 
possibly through some macros, then it would be time to improve upon that.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Fabien COELHO
Date:
Hello Marina,

> v9-0004-Pgbench-errors-and-serialization-deadlock-retries.patch
> - the main patch for handling client errors and repetition of transactions 
> with serialization/deadlock failures (see the detailed description in the 
> file).

Here is a review for the last part of your v9 version.

Patch does not "git apply" (maybe not anymore):
   error: patch failed: doc/src/sgml/ref/pgbench.sgml:513
   error: doc/src/sgml/ref/pgbench.sgml: patch does not apply

However I could get it to apply with the "patch" command.

Then patch compiles, global & pgbench "make check" are ok.

Feature
=======

The patch adds the ability to restart transactions (i.e. the full script)
on some errors, which is a good thing as it allows to exercise postgres
performance in more realistic scenarios.

* -d/--debug: I'm not in favor of requiring a mandatory text argument on this
option. It is not practical, the user has to remember it, and it is a change.
I'm sceptical of the overall debug handling changes. Maybe we could have
multiple -d which lead to a higher debug level, but I'm not sure that it can be
made to work for this case and still be compatible with the previous behavior.
Maybe you need a specific option for your purpose, eg "--debug-retry"?


Code
====

* The implementation is less complex than the previous submission, which 
is a good thing. I'm not sure that all the remaining complexity is still 
fully needed.

* I'm reserved about the whole ereport thing, see comments in other
messages.

Levels ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
In particular, the "CLIENT" part is not very useful. If the
distinction makes sense, I would have kept "LOG" for the initial one and
added other ones for ABORT and PGBENCH, maybe.

* There are no comments about "retries" in StatData, CState and Command
structures.

* Also, for StatData, I would like to understand the logic between cnt,
skipped, retries, retried, errors, ... so a clear information about the
expected invariant if any would be welcome. One has to go in the code to
understand how these fields relate one to the other.

* "errors_in_failed_tx" is some subcounter of "errors", for a special 
case. Why it is there escapes me [I finally understood, and I think it 
should be removed, see end of review]. If we wanted to distinguish, then 
we should distinguish homogeneously: maybe just count the different error 
types, eg have things like "deadlock_errors", "serializable_errors", 
"other_errors", "internal_pgbench_errors" which would be orthogonal one to 
the other, and "errors" could be recomputed from these.

* How "errors" differs from "ecnt" is unclear to me.

* FailureStatus states are not homogeneously named. I'd suggest to use 
*_FAILURE for all cases. The miscellaneous case should probably be the 
last. I do not understand the distinction between ANOTHER_FAILURE & 
IN_FAILED_SQL_TRANSACTION. Why should it be needed? [again, see and of 
review]

* I do not understand the comments on CState enum: "First, remember the failure
in CSTATE_FAILURE. Then process other commands of the failed transaction if any"
Why would other commands be processed at all if the transaction is aborted?
For me any error must lead to the rollback and possible retry of the
transaction. This comment needs to be clarified. It should also say
that on FAILURE, it will go either to RETRY or ABORTED. See below my 
comments about doCustom.

It is unclear to me why there could be several failures within a 
transaction, as I would have thought that it would be aborted on the first 
one.

* I do not understand the purpose of first_failure. The comment should explain
why it would need to be remembered. From my point of view, I'm not fully
convinced that it should.

* commandFailed: I think that it should be kept much simpler. In 
particular, having errors on errors does not help much: on ELEVEL_FATAL, 
it ignores the actual reported error and generates another error of the 
same level, so that the initial issue is hidden. Even if these are can't 
happen cases, hiding the origin if it occurs looks unhelpful. Just print 
it directly, and maybe abort if you think that it is a can't happen case.

* copyRandomState: just use sizeof(RandomState) instead of making assumptions
about the contents of the struct. Also, this function looks pretty useless,
why not just do a plain assignment?

* copyVariables: lacks comments to explain that the destination is cleaned up
and so on. The cleanup phase could probably be in a distinct function, so that
the code would be clearer. Maybe the function variable names are too long.

   if (current_source->svalue)

in the context of a guard for a strdup, maybe:

   if (current_source->svalue != NULL)

* executeCondition: this hides client automaton state changes which were
clearly visible beforehand in the switch, and the different handling of
if & elif is also hidden.

I'm against this unnecessary restructuring and hiding such information;
all state changes should be clearly seen in the state switch so that it is
easier to understand and follow.

I do not see why touching the conditional stack on internal errors
(evaluateExpr failure) brings anything, the whole transaction will be aborted
anyway.

* doCustom changes.

On CSTATE_START_COMMAND, it considers whether to retry at the end.
For me, this cannot happen: if some command failed, then it should have
skipped directly to the RETRY state, so that you cannot get to the end
of the script with an error. Maybe you could assert that the state of the
previous command is NO_FAILURE, though.

On CSTATE_FAILURE, the next command is possibly started. Although there is some
consistency with the previous point, I think that it totally breaks the state
automaton where now a command can start while the whole transaction is
in failing state anyway. There was no point in starting it in the first 
place.

So, for me, the FAILURE state should record/count the failure, then skip
to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
This is much clearer that way.

Then RETRY should reinstate the global state and proceed to start the *first*
command again.

The current RETRY state does memory allocations to generate a message
with buffer allocation and so on. This looks like a costly and useless
operation. If the user required "retries", then this is normal behavior,
the retries are counted and will be printed out in the final report,
and there is no point in printing out every single one of them.
Maybe you want that for debugging, but then costly operations should be guarded.

It is unclear to me why backslash command errors are turned to FAILURE
instead of ABORTED: there is no way they are going to be retried, so
maybe they should/could skip directly to ABORTED?

Function executeCondition is a bad idea, as stated above.

* reporting

The number of transactions above the latency limit report can be simplified.
Remove the if and just use one printf with a %s for the optional comment.
I'm not sure this optional comment is useful there.

Before the patch, ISTM that all lines relied on one printf. You have 
changed to a style where a collection of printfs is used to compose a line. 
I'd suggest to keep to the previous one-printf-prints-one-line style, 
where possible.

You have added 20-columns alignment prints. This looks like too much and
generates much too large lines. Probably 10 (billion) would be enough.

Some people try to parse the output, so it should be deterministic. I'd add
the needed columns always if appropriate (i.e. under retry), even if none
occurred.

* processXactStats: An else is replaced by detailed stats, with the initial
"no detailed stats" comment kept. The function is called both in the then
& else branch. The structure does not make sense anymore. I'm not sure 
this change was needed.

* getLatencyUsed: declared "double" so "return 0.0".

* typo: ruin -> run; probably others, I did not check for them in detail.


TAP Tests
=========

On my laptop, tests last 5.5 seconds before the patch, and about 13 seconds
after. This is much too large. Pgbench TAP tests do not deserve to take over
twice as much time as before just on this patch.

One reason which explains this large time is that there is a new script with 
a newly created instance. I'd suggest to append tests to the existing 2 
scripts, depending on whether they need a running instance or not.

Secondly, I think that the design of the tests is too heavy. For such a 
feature, ISTM enough to check that it works, i.e. one test for deadlocks 
(trigger one or a few deadlocks), idem for serializable, maybe idem for 
other errors if any.

The challenge is to do that reliably and efficiently, i.e. so that the test does
not rely on chance and is still quite efficient.

The trick you use is to run an interactive psql in parallel to pgbench so as to
play with concurrent locks. That is interesting, but deserves more comments
and explanation, eg before the test functions.

Maybe this could be achieved within pgbench by using some wait stuff in 
PL/pgSQL so that concurrent clients can wait for one another based on data 
in an unlogged table updated by a CALL within an "embedded" transaction? Not 
sure. Otherwise, maybe a (simple) pgbench-side thread barrier could help, 
but this would require more thinking.

Anyway, TAP tests should be much lighter (in total time), and if possible 
much simpler.

Trying a latency limit of 900 ms is a bad idea because it takes a lot of time.
I did such tests before and they were removed by Tom Lane because of determinism
and time issues. I would comment this test out for now.

Documentation
=============

Not looked at in much details for now. Just a few comments:

Having the "most important settings" on line 1-6 and 8 (i.e. skipping 7) looks
silly. The important ones should simply be the first ones, and the 8th is not
that important, or it is in 7th position.

I do not understand why there is so much text about in failed sql transaction
stuff, while we are mainly interested in serialization & deadlock errors, and
this only falls in some "other" category. There seem to be more details about
other errors than about deadlocks & serializable errors.

The reporting should focus on what is of interest, either all errors, or some
detailed split of these errors. The documentation should state clearly what
are the counted errors, and then what are their effects on the reported stats.
The "Errors and Serialization/Deadlock Retries" section is a good start in that
direction, but it does not talk about pgbench internal errors (eg "cos(true)").
I think it should more explicit about errors.

Option --max-tries default value should be spelled out in the doc.

"Client's run is aborted", do you mean "Pgbench run is aborted"?

"If a failed transaction block does not terminate in the current script":
this just looks like a very bad idea, and explains my general ranting
above about this error condition. ISTM that the only reasonable option
is that a pgbench script should be enforced as a transaction, or a set of
transactions, but cannot be a "piece" of transaction, i.e. pgbench script
with "BEGIN;" but without a corresponding "COMMIT" is a user error and
warrants an abort, so that there is no need to manage these "in aborted
transaction" errors every where and report about them and document them
extensively.

This means adding a check when a script is finished or starting that
PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if not
with a fatal error. Then we can forget about these "in tx errors" counting,
reporting and so on, and just have to document the restriction.
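
A sketch of that check using libpq (the function name and message wording 
are illustrative):

   #include <stdio.h>
   #include <stdlib.h>
   #include <libpq-fe.h>

   /* called when a client reaches the end of its script */
   static void
   checkScriptEndedTransaction(PGconn *con, int client_id)
   {
       if (PQtransactionStatus(con) != PQTRANS_IDLE)
       {
           fprintf(stderr,
                   "client %d: script did not close its transaction block\n",
                   client_id);
           exit(1);                    /* fatal: the script itself is a user error */
       }
   }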

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

From
Marina Polyakova
Date:
On 09-07-2018 16:05, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

> Here is a review for the last part of your v9 version.

Thank you very much for this!

> Patch does not "git apply" (may anymore):
>   error: patch failed: doc/src/sgml/ref/pgbench.sgml:513
>   error: doc/src/sgml/ref/pgbench.sgml: patch does not apply

Sorry, I'll send a new version soon.

> However I could get it to apply with the "patch" command.
> 
> Then patch compiles, global & pgbench "make check" are ok.

:-)

> Feature
> =======
> 
> The patch adds the ability to restart transactions (i.e. the full 
> script)
> on some errors, which is a good thing as it allows to exercice postgres
> performance in more realistic scenarii.
> 
> * -d/--debug: I'm not in favor in requiring a mandatory text argument 
> on this
> option. It is not pratical, the user has to remember it, and it is a 
> change.
> I'm sceptical of the overall debug handling changes. Maybe we could 
> have
> multiple -d which lead to higher debug level, but I'm not sure that it 
> can be
> made to work for this case and still be compatible with the previous 
> behavior.
> Maybe you need a specific option for your purpose, eg "--debug-retry"?

As you wrote in [1], adding an additional option is also a bad idea:

> I'm sceptical of the "--debug-fails" options. ISTM that --debug is
> already there
> and should just be reused.

Maybe it's better to use an optional argument/arguments for 
compatibility (--debug[=fails] or --debug[=NUM])? But if we use the 
numbers, now I can see only 2 levels, and there's no guarantee that they 
will not change..

> Code
> ====
> 
> * The implementation is less complex that the previous submission,
> which is a good thing. I'm not sure that all the remaining complexity
> is still fully needed.
> 
> * I'm reserved about the whole ereport thing, see comments in other
> messages.

Thank you, I'll try to implement the error reporting in the way you 
suggested.

> Leves ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
> In particular, the "CLIENT" part is not very useful. If the
> distinction makes sense, I would have kept "LOG" for the initial one 
> and
> add other ones for ABORT and PGBENCH, maybe.

Ok!

> * There are no comments about "retries" in StatData, CState and Command
> structures.
> 
> * Also, for StatData, I would like to understand the logic between cnt,
> skipped, retries, retried, errors, ... so a clear information about the
> expected invariant if any would be welcome. One has to go in the code 
> to
> understand how these fields relate one to the other.
> 
> <...>
> 
> * How "errors" differs from "ecnt" is unclear to me.

Thank you, I'll fix this.

> * commandFailed: I think that it should be kept much simpler. In
> particular, having errors on errors does not help much: on
> ELEVEL_FATAL, it ignores the actual reported error and generates
> another error of the same level, so that the initial issue is hidden.
> Even if these are can't happen cases, hidding the origin if it occurs
> looks unhelpful. Just print it directly, and maybe abort if you think
> that it is a can't happen case.

Oh, thanks, my mistake(

> * copyRandomState: just use sizeof(RandomState) instead of making 
> assumptions
> about the contents of the struct. Also, this function looks pretty 
> useless,
> why not just do a plain assignment?
> 
> * copyVariables: lacks comments to explain that the destination is 
> cleaned up
> and so on. The cleanup phase could probaly be in a distinct function, 
> so that
> the code would be clearer. Maybe the function variable names are too 
> long.

Thank you, I'll fix this.

>   if (current_source->svalue)
> 
> in the context of a guard for a strdup, maybe:
> 
>   if (current_source->svalue != NULL)

I'm sorry, I'll fix this.

> * I do not understand the comments on CState enum: "First, remember the 
> failure
> in CSTATE_FAILURE. Then process other commands of the failed 
> transaction if any"
> Why would other commands be processed at all if the transaction is 
> aborted?
> For me any error must leads to the rollback and possible retry of the
> transaction. This comment needs to be clarified. It should also say
> that on FAILURE, it will go either to RETRY or ABORTED. See below my
> comments about doCustom.
> 
> It is unclear to me why their could be several failures within a
> transaction, as I would have stopped that it would be aborted on the
> first one.
> 
> * I do not undestand the purpose of first_failure. The comment should 
> explain
> why it would need to be remembered. From my point of view, I'm not 
> fully
> convinced that it should.
> 
> <...>
> 
> * executeCondition: this hides client automaton state changes which 
> were
> clearly visible beforehand in the switch, and the different handling of
> if & elif is also hidden.
> 
> I'm against this unnecessary restructuring and to hide such an 
> information,
> all state changes should be clearly seen in the state switch so that it 
> is
> easier to understand and follow.
> 
> I do not see why touching the conditional stack on internal errors
> (evaluateExpr failure) brings anything, the whole transaction will be 
> aborted
> anyway.
> 
> * doCustom changes.
> 
> On CSTATE_START_COMMAND, it considers whether to retry at the end.
> For me, this cannot happen: if some command failed, then it should have
> skipped directly to the RETRY state, so that you cannot get to the end
> of the script with an error. Maybe you could assert that the state of 
> the
> previous command is NO_FAILURE, though.
> 
> On CSTATE_FAILURE, the next command is possibly started. Although there 
> is some
> consistency with the previous point, I think that it totally breaks the 
> state
> automaton where now a command can start while the whole transaction is
> in failing state anyway. There was no point in starting it in the first 
> place.
> 
> So, for me, the FAILURE state should record/count the failure, then 
> skip
> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
> This is much clearer that way.
> 
> Then RETRY should reinstate the global state and proceed to start the 
> *first*
> command again.
> 
> <...>
> 
> It is unclear to me why backslash command errors are turned to FAILURE
> instead of ABORTED: there is no way they are going to be retried, so
> maybe they should/could skip directly to ABORTED?
> 
> Function executeCondition is a bad idea, as stated above.

So do you propose to execute the command "ROLLBACK" without calculating 
its latency etc. if we are in a failed transaction and clear the 
conditional stack after each failure?

Also just to be clear: do you want to have the state CSTATE_ABORTED for 
client abortion and another state for interrupting the current 
transaction?
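
For reference, a hand-written sketch (not the patch's code) of the 
control flow suggested above, with simplified state names and 
transitions:

#include <stdbool.h>

typedef enum
{
    CSTATE_START_COMMAND,
    CSTATE_FAILURE,
    CSTATE_RETRY,
    CSTATE_ABORTED
} ClientState;

/* FAILURE only records the failure, then goes to RETRY or ABORTED. */
static ClientState
on_failure(bool retry_decided)
{
    return retry_decided ? CSTATE_RETRY : CSTATE_ABORTED;
}

/* RETRY reinstates the saved state and restarts from the first command. */
static ClientState
on_retry(int *current_command)
{
    *current_command = 0;       /* back to the script's first command */
    return CSTATE_START_COMMAND;
}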

> The current RETRY state does memory allocations to generate a message
> with buffer allocation and so on. This looks like a costly and useless
> operation. If the user required "retries", then this is normal 
> behavior,
> the retries are counted and will be printed out in the final report,
> and there is no point in printing out every single one of them.
> Maybe you want that debugging, but then costly operations should be 
> guarded.

I think we need these debugging messages because, for example, if you 
use the option --latency-limit, we will never know in advance 
whether the serialization/deadlock failure will be retried or not. They 
also help to understand which limit of retries was violated or how close 
we were to these limits during the execution of a specific transaction. 
But I agree with you that they are costly and can be skipped if the 
failure type is never retried. Maybe it is better to split them into 
multiple error function calls?..

> * reporting
> 
> The number of transactions above the latency limit report can be 
> simplified.
> Remove the if and just use one printf with a %s for the optional 
> comment.
> I'm not sure this optional comment is useful there.

Oh, thanks, my mistake(

> Before the patch, ISTM that all lines relied on one printf. You have
> changed to a style where a collection of printf is used to compose a
> line. I'd suggest to keep to the previous one-printf-prints-one-line
> style, where possible.

Ok!

> You have added 20-columns alignment prints. This looks like too much 
> and
> generates much too large lines. Probably 10 (billion) would be enough.

I have already asked you about this in [2]:
> The variables for the numbers of failures and retries are of type int64
> since the variable for the total number of transactions has the same
> type. That's why such a large alignment (as I understand it now, 20
> characters is enough). Do you prefer floating alignments, depending on the
> maximum number of failures/retries for any command in any script?

> Some people try to parse the output, so it should be deterministic. I'd 
> add
> the needed columns always if appropriate (i.e. under retry), even if 
> none
> occurred.

Ok!

> * processXactStats: An else is replaced by detailed stats, with the 
> initial
> "no detailed stats" comment kept. The function is called both in the 
> then
> & else branch. The structure does not make sense anymore. I'm not sure
> this change was needed.
> 
> * getLatencyUsed: declared "double" so "return 0.0".
> 
> * typo: ruin -> run; probably others, I did not check for them in 
> detail.

Oh, thanks, my mistakes(

> TAP Tests
> =========
> 
> On my laptop, tests last 5.5 seconds before the patch, and about 13 
> seconds
> after. This is much too large. Pgbench TAP tests do not deserve to take 
> over
> twice as much time as before just on this patch.
> 
> One reason which explains this large time is that there is a new script
> with a newly created instance. I'd suggest to append tests to the
> existing 2 scripts, depending on whether they need a running instance
> or not.

Ok! All new tests that do not need a running instance are already added 
to the file 002_pgbench_no_server.pl.

> Secondly, I think that the design of the tests is too heavy. For such
> a feature, ISTM enough to check that it works, i.e. one test for
> deadlocks (trigger one or a few deadlocks), idem for serializable,
> maybe idem for other errors if any.
> 
> <...>
> 
> The 900 ms latency limit test is a bad idea because it takes a lot of 
> time.
> I did such tests before and they were removed by Tom Lane because of 
> determinism
> and time issues. I would comment this test out for now.

Ok! If it doesn't bother you - can you tell me more about the causes of 
these determinism issues?.. Tests for some other failures that cannot be 
retried are already added to 001_pgbench_with_server.pl.

> The challenge is to do that reliably and efficiently, i.e. so that the 
> test does
> not rely on chance and is still quite efficient.
> 
> The trick you use is to run an interactive psql in parallel to pgbench 
> so as to
> play with concurrent locks. That is interesting, but deserves more 
> comments
> and explanation, eg before the test functions.
> 
> Maybe this could be achieved within pgbench by using some wait stuff
> in PL/pgSQL so that concurrent clients can wait for one another based on
> data in an unlogged table updated by a CALL within "embedded"
> transactions? Not sure.
> 
> <...>
> 
> Anyway, TAP tests should be much lighter (in total time), and if
> possible much simpler.

I'll try, thank you..

> Otherwise, maybe (simple) pgbench-side thread
> barrier could help, but this would require more thinking.

Tests must pass if we use --disable-thread-safety..

> Documentation
> =============
> 
> Not looked at in much details for now. Just a few comments:
> 
> Having the "most important settings" on line 1-6 and 8 (i.e. skipping 
> 7) looks
> silly. The important ones should simply be the first ones, and the 8th 
> is not
> that important, or it is in 7th position.

Ok!

> I do not understand why there is so much text about in failed sql 
> transaction
> stuff, while we are mainly interested in serialization & deadlock 
> errors, and
> this only falls in some "other" category. There seems to be more 
> details about
> other errors that about deadlocks & serializable errors.
> 
> The reporting should focus on what is of interest, either all errors, 
> or some
> detailed split of these errors.
> 
> <...>
> 
> * "errors_in_failed_tx" is some subcounter of "errors", for a special
> case. Why it is there fails me [I finally understood, and I think it
> should be removed, see end of review]. If we wanted to distinguish,
> then we should distinguish homogeneously: maybe just count the
> different error types, eg have things like "deadlock_errors",
> "serializable_errors", "other_errors", "internal_pgbench_errors" which
> would be orthogonal one to the other, and "errors" could be recomputed
> from these.

Thank you, I agree with you. Unfortunately each new error type adds 1 or 
2 new columns of maximum width 20 to the per-statement report (to 
report errors and possibly retries of this type in this statement) and 
we already have 2 new columns for all errors and retries. So I'm not 
sure that we need to add anything other than statistics only about all the 
errors and all the retries in general.

> The documentation should state clearly what
> are the counted errors, and then what are their effects on the reported 
> stats.
> The "Errors and Serialization/Deadlock Retries" section is a good start 
> in that
> direction, but it does not talk about pgbench internal errors (eg 
> "cos(true)").
> I think it should be more explicit about errors.

Thank you, I'll try to improve it.

> Option --max-tries default value should be spelled out in the doc.

If you mean that it is set to 1 if neither of the options --max-tries or 
--latency-limit is explicitly used, I'll fix this.

> "Client's run is aborted", do you mean "Pgbench run is aborted"?

No, other clients continue their run as usual.

> * FailureStatus states are not homogeneously named. I'd suggest to use
> *_FAILURE for all cases. The miscellaneous case should probably be the
> last. I do not understand the distinction between ANOTHER_FAILURE &
> IN_FAILED_SQL_TRANSACTION. Why should it be needed? [again, see end of
> review]
> 
> <...>
> 
> "If a failed transaction block does not terminate in the current 
> script":
> this just looks like a very bad idea, and explains my general ranting
> above about this error condition. ISTM that the only reasonable option
> is that a pgbench script should be enforced as a transaction, or a set 
> of
> transactions, but cannot be a "piece" of transaction, i.e. pgbench 
> script
> with "BEGIN;" but without a corresponding "COMMIT" is a user error and
> warrants an abort, so that there is no need to manage these "in aborted
> transaction" errors every where and report about them and document them
> extensively.
> 
> This means adding a check when a script is finished or starting that
> PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if 
> not
> with a fatal error. Then we can forget about these "in tx errors" 
> counting,
> reporting and so on, and just have to document the restriction.

Ok!
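
A minimal sketch of such a check; PQtransactionStatus() and PQTRANS_IDLE 
are the standard libpq API, while the helper and field names below are 
just assumptions for illustration:

#include <stdio.h>
#include <stdlib.h>
#include "libpq-fe.h"

/* Called when a script ends: enforce that it closed its own transaction. */
static void
check_script_ended_idle(PGconn *con, int client_id)
{
    if (PQtransactionStatus(con) != PQTRANS_IDLE)
    {
        fprintf(stderr,
                "client %d: script did not close its transaction block\n",
                client_id);
        exit(1);
    }
}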

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.20.1801031720270.20034%40lancre
[2] 
https://www.postgresql.org/message-id/e4c5e8cefa4a8e88f1273b0f1ee29e56@postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

>> * -d/--debug: I'm not in favor of requiring a mandatory text argument on 
>> this option.
>
> As you wrote in [1], adding an additional option is also a bad idea:

Hey, I'm entitled to some internal contradictions:-)

>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
>> already there and should just be reused.

I was thinking that you could just use the existing --debug, not change 
its syntax. My point was that --debug exists, and you could just print
the messages when under --debug.

> Maybe it's better to use an optional argument/arguments for compatibility 
> (--debug[=fails] or --debug[=NUM])? But if we use the numbers, at the moment 
> I can see only 2 levels, and there's no guarantee that they will not change..

Optional arguments to options (!) are not really clean things, so I'd like 
to avoid going down this path, esp. as I cannot see any other instance in 
pgbench or elsewhere in postgres, and I personally consider these as a bad 
idea.

So if absolutely necessary, a new option is still better than changing 
--debug syntax. If not necessary, then it is better:-)

>> * I'm reserved about the whole ereport thing, see comments in other
>> messages.
>
> Thank you, I'll try to implement the error reporting in the way you 
> suggested.

Dunno if it is a good idea either. The committer's word is the good one in 
the end:-)

> Thank you, I'll fix this.
> I'm sorry, I'll fix this.

You do not have to thank me or be sorry about every comment I make; once 
for the former is enough, and there is no need for the latter.

>> * doCustom changes.

>> 
>> On CSTATE_FAILURE, the next command is possibly started. Although there 
>> is some consistency with the previous point, I think that it totally 
>> breaks the state automaton where now a command can start while the 
>> whole transaction is in failing state anyway. There was no point in 
>> starting it in the first place.
>> 
>> So, for me, the FAILURE state should record/count the failure, then skip
>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
>> This is much clearer that way.
>> 
>> Then RETRY should reinstate the global state and proceed to start the 
>> *first* command again.
>> <...>
>> 
>> It is unclear to me why backslash command errors are turned to FAILURE
>> instead of ABORTED: there is no way they are going to be retried, so
>> maybe they should/could skip directly to ABORTED?

> So do you propose to execute the command "ROLLBACK" without calculating its 
> latency etc. if we are in a failed transaction and clear the conditional 
> stack after each failure?

> Also just to be clear: do you want to have the state CSTATE_ABORTED for 
> client abortion and another state for interrupting the current transaction?

I do not understand what "interrupting the current transaction" means. A 
transaction is either committed or rolled back, I do not know about 
"interrupted". When it is rolled back, probably some stats will be 
collected in passing, I'm fine with that.

If there is an error in a pgbench script, the transaction is aborted, 
which means for me that the script execution is stopped where it was, and 
either it is restarted from the beginning (retry) or counted as failure 
(not retry, just aborted, really).

If by interrupted you mean that one script begins a transaction and 
another ends it, as I said in the review I think that this strange case 
should be forbidden, so that all the code and documentation trying to
manage that can be removed.

>> The current RETRY state does memory allocations to generate a message
>> with buffer allocation and so on. This looks like a costly and useless
>> operation. If the user required "retries", then this is normal behavior,
>> the retries are counted and will be printed out in the final report,
>> and there is no point in printing out every single one of them.
>> Maybe you want that debugging, but then costly operations should be 
>> guarded.
>
> I think we need these debugging messages because, for example,

Debugging messages should cost only when under debug. When not under debug, 
there should be no debugging message, and there should be no cost for 
building and discarding such messages in the executed code path beyond
testing whether the program is under debug.

> if you use the option --latency-limit, we will never know in advance 
> whether the serialization/deadlock failure will be retried or not.

ISTM that it will be shown in the final report. If I want debug, I ask for 
--debug, otherwise I think that the command should do what it was asked 
for, i.e. run scripts, collect performance statistics and show them at the 
end.

In particular, when running with retries enabled, the user is expecting 
deadlock/serialization errors, so that they are not "errors" as such for
them.

> They also help to understand which limit of retries was violated or how 
> close we were to these limits during the execution of a specific 
> transaction. But I agree with you that they are costly and can be 
> skipped if the failure type is never retried. Maybe it is better to 
> split them into multiple error function calls?..

Debugging message costs should only be incurred when under --debug, not 
otherwise.

>> You have added 20-columns alignment prints. This looks like too much and
>> generates much too large lines. Probably 10 (billion) would be enough.
>
> I have already asked you about this in [2]:

Probably:-)

>> The variables for the numbers of failures and retries are of type int64
>> since the variable for the total number of transactions has the same
>> type. That's why such a large alignment (as I understand it now, 20
>> characters is enough). Do you prefer floating alignments, depending on the
>> maximum number of failures/retries for any command in any script?

An int64 counter is not likely to reach its limit anytime soon:-) If the 
column display limit is ever reached, ISTM that then the text is just 
misaligned, which is a minor and rare inconvenience. If very wide columns 
are used, then it does not fit my terminal and the report text will always 
be wrapped around, which makes it harder to read, every time.

>> The 900 ms latency limit test is a bad idea because it takes a lot of 
>> time. I did such tests before and they were removed by Tom Lane because 
>> of determinism and time issues. I would comment this test out for now.
>
> Ok! If it doesn't bother you - can you tell me more about the causes of these 
> determinism issues?.. Tests for some other failures that cannot be retried 
> are already added to 001_pgbench_with_server.pl.

Some farm animals are very slow, so you cannot really assume much about 
time one way or another.

>> Otherwise, maybe (simple) pgbench-side thread
>> barrier could help, but this would require more thinking.
>
> Tests must pass if we use --disable-thread-safety..

Sure. My wording was misleading. I just meant a synchronisation barrier 
between concurrent clients, which could be managed with one thread. 
Anyway, it is probably overkill for the problem at hand, so just forget.

>> I do not understand why there is so much text about in failed sql 
>> transaction stuff, while we are mainly interested in serialization & 
>> deadlock errors, and this only falls in some "other" category. There 
>> seem to be more details about other errors than about deadlocks & 
>> serializable errors.
>> 
>> The reporting should focus on what is of interest, either all errors, 
>> or some detailed split of these errors.
>> 
>> <...>
>> 
>> * "errors_in_failed_tx" is some subcounter of "errors", for a special
>> case. Why it is there fails me [I finally understood, and I think it
>> should be removed, see end of review]. If we wanted to distinguish,
>> then we should distinguish homogeneously: maybe just count the
>> different error types, eg have things like "deadlock_errors",
>> "serializable_errors", "other_errors", "internal_pgbench_errors" which
>> would be orthogonal one to the other, and "errors" could be recomputed
>> from these.
>
> Thank you, I agree with you. Unfortunately each new error type adds 1 or 2 
> new columns of maximum width 20 to the per-statement report

The fact that some data are collected does not mean that they should all 
be reported in detail. We can have detailed error counts and report the sum 
of these errors for instance, or have some more verbose/detailed reports
as options (eg --latencies does just that).

>> <...>
>> 
>> "If a failed transaction block does not terminate in the current script":
>> this just looks like a very bad idea, and explains my general ranting
>> above about this error condition. ISTM that the only reasonable option
>> is that a pgbench script should be enforced as a transaction, or a set of
>> transactions, but cannot be a "piece" of transaction, i.e. pgbench script
>> with "BEGIN;" but without a corresponding "COMMIT" is a user error and
>> warrants an abort, so that there is no need to manage these "in aborted
>> transaction" errors every where and report about them and document them
>> extensively.
>> 
>> This means adding a check when a script is finished or starting that
>> PQtransactionStatus(const PGconn *conn) == PQTRANS_IDLE, and abort if not
>> with a fatal error. Then we can forget about these "in tx errors" counting,
>> reporting and so on, and just have to document the restriction.
>
> Ok!

Good:-) ISTM that this would remove a significant amount of complexity 
from the code and documentation.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 16:24, Fabien COELHO wrote:
> Hello Marina,
> 
>>> * -d/--debug: I'm not in favor of requiring a mandatory text argument 
>>> on this option.
>> 
>> As you wrote in [1], adding an additional option is also a bad idea:
> 
> Hey, I'm entitled to some internal contradictions:-)

... and discussions will continue forever %-)

>>> I'm sceptical of the "--debug-fails" options. ISTM that --debug is 
>>> already there and should just be reused.
> 
> I was thinking that you could just use the existing --debug, not
> change its syntax. My point was that --debug exists, and you could
> just print
> the messages when under --debug.

Now I understand you better, thanks. I think it will be useful to 
receive only messages about failures, because they and progress reports 
can be lost in many other debug messages such as "client %d sending ..." 
/ "client %d executing ..." / "client %d receiving".

>> Maybe it's better to use an optional argument/arguments for 
>> compatibility (--debug[=fails] or --debug[=NUM])? But if we use the 
>> numbers, at the moment I can see only 2 levels, and there's no guarantee 
>> that they will not change..
> 
> Optional arguments to options (!) are not really clean things, so I'd
> like to avoid going down this path, esp. as I cannot see any other
> instance in pgbench or elsewhere in postgres,

AFAICS they are used in pg_waldump (option --stats[=record]) and in psql 
(option --help[=topic]).

> and I personally
> consider these as a bad idea.

> So if absolutely necessary, a new option is still better than changing
> --debug syntax. If not necessary, then it is better:-)

Ok!

>>> * I'm reserved about the whole ereport thing, see comments in other
>>> messages.
>> 
>> Thank you, I'll try to implement the error reporting in the way you 
>> suggested.
> 
> Dunno if it is a good idea either. The committer's word is the good one
> in the end:-)

I agree with you that ereport has good reasons to be non-trivial in the 
backend and it does not have the same reasons in pgbench..

>>> * doCustom changes.
> 
>>> 
>>> On CSTATE_FAILURE, the next command is possibly started. Although 
>>> there is some consistency with the previous point, I think that it 
>>> totally breaks the state automaton where now a command can start 
>>> while the whole transaction is in failing state anyway. There was no 
>>> point in starting it in the first place.
>>> 
>>> So, for me, the FAILURE state should record/count the failure, then 
>>> skip
>>> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
>>> This is much clearer that way.
>>> 
>>> Then RETRY should reinstate the global state and proceed to start the 
>>> *first* command again.
>>> <...>
>>> 
>>> It is unclear to me why backslash command errors are turned to 
>>> FAILURE
>>> instead of ABORTED: there is no way they are going to be retried, so
>>> maybe they should/could skip directly to ABORTED?
> 
>> So do you propose to execute the command "ROLLBACK" without 
>> calculating its latency etc. if we are in a failed transaction and 
>> clear the conditional stack after each failure?
> 
>> Also just to be clear: do you want to have the state CSTATE_ABORTED 
>> for client abortion and another state for interrupting the current 
>> transaction?
> 
> I do not understand what "interrupting the current transaction" means.
> A transaction is either committed or rolled back, I do not know about
> "interrupted".

I mean that IIUC the server usually only reports the error and you must 
manually send the command "END" or "ROLLBACK" to rollback a failed 
transaction.
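
A small libpq sketch (hand-written, not patch code) of that behavior: 
after an error inside an explicit transaction block the connection stays 
in the failed-transaction state until the client sends ROLLBACK (or END):

#include <stdio.h>
#include "libpq-fe.h"

static void
demo_failed_transaction(PGconn *con)
{
    PQclear(PQexec(con, "BEGIN"));
    PQclear(PQexec(con, "SELECT 1/0"));     /* fails; block is now aborted */

    /* the server only reports the error, the block is still open */
    if (PQtransactionStatus(con) == PQTRANS_INERROR)
        fprintf(stderr, "transaction failed, sending ROLLBACK\n");

    PQclear(PQexec(con, "ROLLBACK"));       /* client must end the block */
}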

> When it is rolled back, probably some stats will be
> collected in passing, I'm fine with that.
> 
> If there is an error in a pgbench script, the transaction is aborted,
> which means for me that the script execution is stopped where it was,
> and either it is restarted from the beginning (retry) or counted as
> failure (not retry, just aborted, really).
> 
> If by interrupted you mean that one script begins a transaction and
> another ends it, as I said in the review I think that this strange
> case should be forbidden, so that all the code and documentation
> trying to
> manage that can be removed.

Ok!

>>> The current RETRY state does memory allocations to generate a message
>>> with buffer allocation and so on. This looks like a costly and 
>>> useless
>>> operation. If the user required "retries", then this is normal 
>>> behavior,
>>> the retries are counted and will be printed out in the final report,
>>> and there is no point in printing out every single one of them.
>>> Maybe you want that debugging, but then costly operations should be 
>>> guarded.
>> 
>> I think we need these debugging messages because, for example,
> 
> Debugging messages should cost only when under debug. When not under
> debug, there should be no debugging message, and there should be no
> cost for building and discarding such messages in the executed code
> path beyond
> testing whether the program is under debug.
> 
>> if you use the option --latency-limit, we will never know in 
>> advance whether the serialization/deadlock failure will be retried or 
>> not.
> 
> ISTM that it will be shown in the final report. If I want debug, I ask for
> --debug, otherwise I think that the command should do what it was
> asked for, i.e. run scripts, collect performance statistics and show
> them at the end.
> 
> In particular, when running with retries enabled, the user is
> expecting deadlock/serialization errors, so that they are not "errors"
> as such for
> them.
> 
>> They also help to understand which limit of retries was violated or 
>> how close we were to these limits during the execution of a specific 
>> transaction. But I agree with you that they are costly and can be 
>> skipped if the failure type is never retried. Maybe it is better to 
>> split them into multiple error function calls?..
> 
> Debugging message costs should only be incurred when under --debug,
> not otherwise.

Ok! IIUC instead of this part of the code

initPQExpBuffer(&errmsg_buf);
printfPQExpBuffer(&errmsg_buf,
                  "client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
if (max_tries)
    appendPQExpBuffer(&errmsg_buf, "/%d", max_tries);
if (latency_limit)
{
    appendPQExpBuffer(&errmsg_buf,
                      ", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
}
appendPQExpBufferStr(&errmsg_buf, ")\n");
pgbench_error(DEBUG_FAIL, "%s", errmsg_buf.data);
termPQExpBuffer(&errmsg_buf);

can we try something like this?

PGBENCH_ERROR_START(DEBUG_FAIL)
{
    PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
                  st->id, st->retries + 1);
    if (max_tries)
        PGBENCH_ERROR("/%d", max_tries);
    if (latency_limit)
    {
        PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
                      getLatencyUsed(st, &now));
    }
    PGBENCH_ERROR(")\n");
}
PGBENCH_ERROR_END();

>>> You have added 20-columns alignment prints. This looks like too much 
>>> and
>>> generates much too large lines. Probably 10 (billion) would be 
>>> enough.
>> 
>> I have already asked you about this in [2]:
> 
> Probably:-)
> 
>>> The variables for the numbers of failures and retries are of type 
>>> int64
>>> since the variable for the total number of transactions has the same
>>> type. That's why such a large alignment (as I understand it now, 20
>>> characters is enough). Do you prefer floating alignments, depending on the
>>> maximum number of failures/retries for any command in any script?
> 
> An int64 counter is not likely to reach its limit anytime soon:-) If
> the column display limit is ever reached, ISTM that then the text is
> just misaligned, which is a minor and rare inconvenience. If very wide
> columns are used, then it does not fit my terminal and the report text
> will always be wrapped around, which makes it harder to read, every
> time.

Ok!

>>> The 900 ms latency limit test is a bad idea because it takes a lot 
>>> of time. I did such tests before and they were removed by Tom Lane
>>> because of determinism and time issues. I would comment this test out 
>>> for now.
>> 
>> Ok! If it doesn't bother you - can you tell me more about the causes of 
>> these determinism issues?.. Tests for some other failures that cannot 
>> be retried are already added to 001_pgbench_with_server.pl.
> 
> Some farm animals are very slow, so you cannot really assume much
> about time one way or another.

Thanks!

>>> I do not understand why there is so much text about in failed sql 
>>> transaction stuff, while we are mainly interested in serialization & 
>>> deadlock errors, and this only falls in some "other" category. There 
>>> seem to be more details about other errors than about deadlocks & 
>>> serializable errors.
>>> 
>>> The reporting should focus on what is of interest, either all errors, 
>>> or some detailed split of these errors.
>>> 
>>> <...>
>>> 
>>> * "errors_in_failed_tx" is some subcounter of "errors", for a special
>>> case. Why it is there fails me [I finally understood, and I think it
>>> should be removed, see end of review]. If we wanted to distinguish,
>>> then we should distinguish homogeneously: maybe just count the
>>> different error types, eg have things like "deadlock_errors",
>>> "serializable_errors", "other_errors", "internal_pgbench_errors" 
>>> which
>>> would be orthogonal one to the other, and "errors" could be 
>>> recomputed
>>> from these.
>> 
>> Thank you, I agree with you. Unfortunately each new error type adds 1 or 
>> 2 new columns of maximum width 20 to the per-statement report
> 
> The fact that some data are collected does not mean that they should
> all be reported in detail. We can have detailed error counts and report
> the sum of these errors for instance, or have some more
> verbose/detailed reports
> as options (eg --latencies does just that).

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Jul-11, Marina Polyakova wrote:

> can we try something like this?
> 
> PGBENCH_ERROR_START(DEBUG_FAIL)
> {
>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
>                   st->id, st->retries + 1);
>     if (max_tries)
>         PGBENCH_ERROR("/%d", max_tries);
>     if (latency_limit)
>     {
>         PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
>                       getLatencyUsed(st, &now));
>     }
>     PGBENCH_ERROR(")\n");
> }
> PGBENCH_ERROR_END();

I didn't quite understand what these PGBENCH_ERROR() functions/macros
are supposed to do.  Care to explain?

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
Just a quick skim while refreshing what were those error reporting API
changes about ...

On 2018-May-21, Marina Polyakova wrote:

> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
> - a patch for the RandomState structure (this is used to reset a client's
> random seed during the repeating of transactions after
> serialization/deadlock failures).

LGTM, though I'd rename the random_state struct members so that it
wouldn't look as confusing.  Maybe that's just me.

> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
> - a patch for the Variables structure (this is used to reset client
> variables during the repeating of transactions after serialization/deadlock
> failures).

Please don't allocate Variable structs one by one.  First time allocate
some decent number (say 8) and then enlarge by doubling the size.  That
way you save realloc overhead.  We use this technique everywhere else,
no reason to do differently here.  Other than that, LGTM.
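
A minimal sketch of that allocation strategy, with hypothetical names 
(vars / nvars / max_vars) that are not necessarily the patch's:

#include <stdlib.h>

typedef struct Variable { char *name; char *svalue; } Variable;

typedef struct Variables
{
    Variable   *vars;       /* array of variables */
    int         nvars;      /* number of slots in use */
    int         max_vars;   /* allocated size of the array */
} Variables;

/* Grow the array geometrically instead of realloc'ing on every insert. */
static Variable *
new_variable_slot(Variables *v)
{
    if (v->nvars >= v->max_vars)
    {
        v->max_vars = (v->max_vars == 0) ? 8 : v->max_vars * 2;
        v->vars = realloc(v->vars, v->max_vars * sizeof(Variable));
        if (v->vars == NULL)
            abort();        /* real code would report the OOM properly */
    }
    return &v->vars[v->nvars++];
}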

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:

> can we try something like this?
>
> PGBENCH_ERROR_START(DEBUG_FAIL)
> {
>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",

Argh, no? I was thinking of something much more trivial:

    pgbench_error(DEBUG, "message format %d %s...", 12, "hello world");

If you really need some complex dynamic buffer, and I would prefer 
that you avoid that, then the fallback is:

    if (level >= DEBUG)
    {
       initPQstuff(&msg);
       ...
       pgbench_error(DEBUG, "fixed message... %s\n", msg);
       freePQstuff(&msg);
    }

The point is to avoid building the message with dynamic allocation and so
on if in the end it is not used.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 20:49, Alvaro Herrera wrote:
> On 2018-Jul-11, Marina Polyakova wrote:
> 
>> can we try something like this?
>> 
>> PGBENCH_ERROR_START(DEBUG_FAIL)
>> {
>>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
>>                   st->id, st->retries + 1);
>>     if (max_tries)
>>         PGBENCH_ERROR("/%d", max_tries);
>>     if (latency_limit)
>>     {
>>         PGBENCH_ERROR(", %.3f%% of the maximum time of tries was used",
>>                       getLatencyUsed(st, &now));
>>     }
>>     PGBENCH_ERROR(")\n");
>> }
>> PGBENCH_ERROR_END();
> 
> I didn't quite understand what these PGBENCH_ERROR() functions/macros
> are supposed to do.  Care to explain?

It is used only to print a string with the given arguments to stderr. 
Probably it could just be the function pgbench_error and not a macro..

P.S. This is my mistake, I did not take into account that PGBENCH_ERROR_END 
does not know the elevel, so it cannot call exit(1) if the elevel >= ERROR.
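
For what it's worth, one way such macros could be spelled (this is a 
hypothetical sketch, not the submitted code), which also shows the 
problem just mentioned: the END macro has no elevel in scope, so it 
cannot decide whether to call exit(1):

#include <stdio.h>

enum { DEBUG_FAIL, LOG_LEVEL, ERROR_LEVEL };    /* made-up level names */
static int log_min_messages = DEBUG_FAIL;       /* assumed level variable */

#define PGBENCH_ERROR_START(elevel)  if ((elevel) >= log_min_messages)
#define PGBENCH_ERROR(...)           fprintf(stderr, __VA_ARGS__)
#define PGBENCH_ERROR_END()          ((void) 0)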

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 21:04, Alvaro Herrera wrote:
> Just a quick skim while refreshing what were those error reporting API
> changes about ...

Thank you!

> On 2018-May-21, Marina Polyakova wrote:
> 
>> v9-0001-Pgbench-errors-use-the-RandomState-structure-for-.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's
>> random seed during the repeating of transactions after
>> serialization/deadlock failures).
> 
> LGTM, though I'd rename the random_state struct members so that it
> wouldn't look as confusing.  Maybe that's just me.

IIUC, do you like "xseed" instead of "data"?

  typedef struct RandomState
  {
-    unsigned short data[3];
+    unsigned short xseed[3];
  } RandomState;

Or do you want to rename "random_state" in the structures RetryState / 
CState / TState? Thanks to Fabien Coelho's comments in [1], TState can 
contain several RandomStates for different purposes, something like 
this:

/*
  * Thread state
  */
typedef struct
{
...
    /*
     * Separate randomness for each thread. Each thread option uses its own
     * random state to make all of them independent of each other and
     * therefore deterministic at the thread level.
     */
    RandomState choose_script_rs;    /* random state for selecting a script */
    RandomState throttling_rs;    /* random state for transaction throttling */
    RandomState sampling_rs;    /* random state for log sampling */
...
} TState;

>> v9-0002-Pgbench-errors-use-the-Variables-structure-for-cl.patch
>> - a patch for the Variables structure (this is used to reset client
>> variables during the repeating of transactions after 
>> serialization/deadlock
>> failures).
> 
> Please don't allocate Variable structs one by one.  First time allocate
> some decent number (say 8) and then enlarge by duplicating size.  That
> way you save realloc overhead.  We use this technique everywhere else,
> no reason do different here.  Other than that, LGTM.

Ok!

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806090810090.5307%40lancre

> While reading your patch, it occurs to me that a run is not 
> deterministic
> at the thread level under throttling and sampling, because the random
> state is solicited differently depending on when a transaction ends. 
> This
> suggests that maybe each thread random_state use should have its own 
> random
> state.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-07-2018 22:34, Fabien COELHO wrote:
>> can we try something like this?
>> 
>> PGBENCH_ERROR_START(DEBUG_FAIL)
>> {
>>     PGBENCH_ERROR("client %d repeats the failed transaction (try %d",
> 
> Argh, no? I was thinking of something much more trivial:
> 
>    pgbench_error(DEBUG, "message format %d %s...", 12, "hello world");
> 
> If you really need some complex dynamic buffer, and I would prefer
> that you avoid that, then the fallback is:
> 
>    if (level >= DEBUG)
>    {
>       initPQstuff(&msg);
>       ...
>       pgbench_error(DEBUG, "fixed message... %s\n", msg);
>       freePQstuff(&msg);
>    }
> 
> The point is to avoid building the message with dynamic allocation and 
> so
> on if in the end it is not used.

Ok! About avoidance - I'm afraid there's one more piece of debugging 
code with the same problem:

else if (command->type == META_COMMAND)
{
...
    initPQExpBuffer(&errmsg_buf);
    printfPQExpBuffer(&errmsg_buf, "client %d executing \\%s",
                      st->id, argv[0]);
    for (i = 1; i < argc; i++)
        appendPQExpBuffer(&errmsg_buf, " %s", argv[i]);
    appendPQExpBufferChar(&errmsg_buf, '\n');
    ereport(ELEVEL_DEBUG, (errmsg("%s", errmsg_buf.data)));
    termPQExpBuffer(&errmsg_buf);

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>> The point is to avoid building the message with dynamic allocation and so
>> on if in the end it is not used.
>
> Ok! About avoidance - I'm afraid there's one more piece of debugging code 
> with the same problem:

Indeed. I'd like to avoid all instances, so that PQExpBufferData is not 
needed anywhere, if possible. If not possible, then too bad, but I'd 
prefer to make do with formatted prints only, for simplicity.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello, hackers!

Here is the tenth version of the patch for error handling and 
retrying of transactions with serialization/deadlock failures in pgbench 
(based on the commit e0ee93053998b159e395deed7c42e02b1f921552) thanks to 
the comments of Fabien Coelho and Alvaro Herrera in this thread.

v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).

v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch
- a patch for a separate error reporting function (this is used to 
report client failures that do not cause an abort and this depends on 
the level of debugging).

v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

As Fabien wrote in [5], some of the new tests were too slow. Earlier on 
my laptop they increased the testing time of pgbench from 5.5 seconds to 
12.5 seconds. In the new version the testing time of pgbench takes about 
7 seconds. These tests include one test for serialization failure and 
retry, as well as one test for deadlock failure and retry. Both of them 
are in file 001_pgbench_with_server.pl, each test uses only one pgbench 
run, and they use PL/pgSQL scripts instead of a parallel psql session.

Any suggestions are welcome!

All that was fixed from the previous version:

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806090810090.5307%40lancre

> ISTM that the struct itself does not need a name, ie. "typedef struct {
> ... } RandomState" is enough.

> There could be clear comments, say in the TState and CState structs, 
> about
> what randomness is impacted (i.e. script choices, etc.).

> getZipfianRand, computeHarmonicZipfian: The "thread" parameter was
> justified because it was used for two fields. As the random state is
> separated, I'd suggest that the other argument should be a zipfcache
> pointer.

> While reading your patch, it occurs to me that a run is not 
> deterministic
> at the thread level under throttling and sampling, because the random
> state is solicited differently depending on when a transaction ends. 
> This
> suggests that maybe each thread random_state use should have its own 
> random
> state.

[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806091514060.3655%40lancre

> The structure typedef does not need a name. "typedef struct { } V...".

> I tend to disagree with naming things after their type, eg "array". I'd
> suggest "vars" instead. "nvariables" could be "nvars" for consistency 
> with
> that and "vars_sorted", and because "foo.variables->nvariables" starts
> looking heavy.

> I'd suggest but "Variables" type declaration just after "Variable" type
> declaration in the file.

[3] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806100837380.3655%40lancre

> The semantics of the existing code is changed: the FATAL level calls
> abort() and replaces existing exit(1) calls. Maybe you want an ERROR
> level as well.

> I do not understand why names are changed, eg ELEVEL_FATAL instead of
> FATAL. ISTM that part of the point of the move would be to be 
> homogeneous,
> which suggests that the same names should be reused.

[4] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807081014260.17811%40lancre

> I'd suggest to have just one clean and simple pgbench internal function 
> to
> handle errors and possibly exit, debug... Something like
> 
>    void pgb_error(FATAL, "error %d raised", 12);
> 
> Implemented as
> 
>    void pgb_error(int/enum XXX level, const char * format, ...)
>    {
>       test level and maybe return immediately (eg debug);
>       print to stderr;
>       exit/abort/return depending;
>    }

[5] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807091451520.17811%40lancre

> Levels ELEVEL_LOG_CLIENT_{FAIL,ABORTED} & LOG_MAIN look unclear to me.
> In particular, the "CLIENT" part is not very useful. If the
> distinction makes sense, I would have kept "LOG" for the initial one 
> and
> add other ones for ABORT and PGBENCH, maybe.

> * There are no comments about "retries" in StatData, CState and Command
> structures.

> * Also, for StatData, I would like to understand the logic between cnt,
> skipped, retries, retried, errors, ... so clear information about the
> expected invariant if any would be welcome. One has to go in the code 
> to
> understand how these fields relate one to the other.

> * "errors_in_failed_tx" is some subcounter of "errors", for a special
> case. Why it is there fails me [I finally understood, and I think it
> should be removed, see end of review]. If we wanted to distinguish, 
> then
> we should distinguish homogeneously: maybe just count the different 
> error
> types, eg have things like "deadlock_errors", "serializable_errors",
> "other_errors", "internal_pgbench_errors" which would be orthogonal one 
> to
> the other, and "errors" could be recomputed from these.

> * How "errors" differs from "ecnt" is unclear to me.

> * FailureStatus states are not homogeneously named. I'd suggest to use
> *_FAILURE for all cases. The miscellaneous case should probably be the
> last.

> * I do not understand the comments on CState enum: "First, remember the 
> failure
> in CSTATE_FAILURE. Then process other commands of the failed 
> transaction if any"
> Why would other commands be processed at all if the transaction is 
> aborted?
> For me any error must lead to the rollback and possible retry of the
> transaction.
> ...
> So, for me, the FAILURE state should record/count the failure, then 
> skip
> to RETRY if a retry is decided, else proceed to ABORT. Nothing else.
> This is much clearer that way.
> 
> Then RETRY should reinstate the global state and proceed to start the 
> *first*
> command again.

> * commandFailed: I think that it should be kept much simpler. In
> particular, having errors on errors does not help much: on 
> ELEVEL_FATAL,
> it ignores the actual reported error and generates another error of the
> same level, so that the initial issue is hidden. Even if these are 
> can't
> happen cases, hiding the origin if it occurs looks unhelpful. Just 
> print
> it directly, and maybe abort if you think that it is a can't happen 
> case.

> * copyRandomState: just use sizeof(RandomState) instead of making 
> assumptions
> about the contents of the struct. Also, this function looks pretty 
> useless,
> why not just do a plain assignment?

> * copyVariables: lacks comments to explain that the destination is 
> cleaned up
> and so on. The cleanup phase could probably be in a distinct function, 
> so that
> the code would be clearer. Maybe the function variable names are too 
> long.
> 
>    if (current_source->svalue)
> 
> in the context of a guard for a strdup, maybe:
> 
>    if (current_source->svalue != NULL)

> * executeCondition: this hides client automaton state changes which 
> were
> clearly visible beforehand in the switch, and the different handling of
> if & elif is also hidden.
> 
> I'm against this unnecessary restructuring and against hiding such 
> information,
> all state changes should be clearly seen in the state switch so that it 
> is
> easier to understand and follow.
> 
> I do not see why touching the conditional stack on internal errors
> (evaluateExpr failure) brings anything, the whole transaction will be 
> aborted
> anyway.

> The current RETRY state does memory allocations to generate a message
> with buffer allocation and so on. This looks like a costly and useless
> operation. If the user required "retries", then this is normal
> behavior,
> the retries are counted and will be printed out in the final report,
> and there is no point in printing out every single one of them.
> Maybe you want that debugging, but then costly operations should be 
> guarded.

> The number of transactions above the latency limit report can be 
> simplified.
> Remove the if and just use one printf with a %s for the optional 
> comment.
> I'm not sure this optional comment is useful there.

> Before the patch, ISTM that all lines relied on one printf. You have
> changed to a style where a collection of printf is used to compose a 
> line.
> I'd suggest to keep to the previous one-printf-prints-one-line style,
> where possible.

> You have added 20-columns alignment prints. This looks like too much 
> and
> generates much too large lines. Probably 10 (billion) would be enough.
> 
> Some people try to parse the output, so it should be deterministic. I'd 
> add
> the needed columns always if appropriate (i.e. under retry), even if 
> none
> occurred.

> * processXactStats: An else is replaced by detailed stats, with the 
> initial
> "no detailed stats" comment kept. The function is called both in the 
> then
> & else branch. The structure does not make sense anymore. I'm not sure
> this change was needed.

> * getLatencyUsed: declared "double" so "return 0.0".

> * typo: ruin -> run; probably others, I did not check for them in 
> detail.

> On my laptop, tests last 5.5 seconds before the patch, and about 13 
> seconds
> after. This is much too large. Pgbench TAP tests do not deserve to take 
> over
> twice as much time as before just on this patch.
> 
> One reason which explains this large time is that there is a new script with 
> a
> newly created instance. I'd suggest to append tests to the existing 2
> scripts, depending on whether they need a running instance or not.
> 
> Secondly, I think that the design of the tests is too heavy. For such 
> a
> feature, ISTM enough to check that it works, i.e. one test for 
> deadlocks
> (trigger one or a few deadlocks), idem for serializable, maybe idem for
> other errors if any.
> 
> The challenge is to do that reliably and efficiently, i.e. so that the 
> test does
> not rely on chance and is still quite efficient.
> 
> The trick you use is to run an interactive psql in parallel to pgbench 
> so as to
> play with concurrent locks. That is interesting, but deserves more 
> comments
> and explanation, eg before the test functions.
> 
> Maybe this could be achieved within pgbench by using some wait stuff in
> PL/pgSQL so that concurrent clients can wait for one another based on data 
> in
> an unlogged table updated by a CALL within "embedded" transactions? Not
> sure. ...
> 
> Anyway, TAP tests should be much lighter (in total time), and if 
> possible
> much simpler.
> 
> The 900 ms latency limit test is a bad idea because it takes a lot of 
> time.
> I did such tests before and they were removed by Tom Lane because of 
> determinism
> and time issues. I would comment this test out for now.

> Documentation
> ...
> Having the "most important settings" on line 1-6 and 8 (i.e. skipping 
> 7) looks
> silly. The important ones should simply be the first ones, and the 8th 
> is not
> that important, or it is in 7th position.
> 
> I do not understand why there is so much text about in failed sql 
> transaction
> stuff, while we are mainly interested in serialization & deadlock 
> errors, and
> this only falls in some "other" category. There seems to be more 
> details about
> other errors that about deadlocks & serializable errors.
> 
> The reporting should focus on what is of interest, either all errors, 
> or some
> detailed split of these errors. The documentation should state clearly 
> what
> are the counted errors, and then what are their effects on the reported 
> stats.
> The "Errors and Serialization/Deadlock Retries" section is a good start 
> in that
> direction, but it does not talk about pgbench internal errors (eg 
> "cos(true)").
> I think it should be more explicit about errors.
> 
> Option --max-tries default value should be spelled out in the doc.

[6] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807111435250.27883%40lancre

> So if absolutely necessary, a new option is still better than changing
> --debug syntax. If not necessary, then it is better:-)

> The fact that some data are collected does not mean that they should 
> all
> be reported in detail. We can have detailed error counts and report the 
> sum
> of these errors for instance, or have some more verbose/detailed reports
> as options (eg --latencies does just that).

[7] 
https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7%40alvherre.pgsql

> LGTM, though I'd rename the random_state struct members so that it
> wouldn't look as confusing.  Maybe that's just me.

> Please don't allocate Variable structs one by one.  First time allocate
> some decent number (say 8) and then enlarge by doubling the size.  That
> way you save realloc overhead.  We use this technique everywhere else,
> no reason to do differently here.  Other than that, LGTM.

[8] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1807112124210.27883%40lancre

> If you really need some complex dynamic buffer, and I would prefer
> that you avoid that, then the fallback is:
> 
>     if (level >= DEBUG)
>     {
>        initPQstuff(&msg);
>        ...
>        pgbench_error(DEBUG, "fixed message... %s\n", msg);
>        freePQstuff(&msg);
>     }
> 
> The point is to avoid building the message with dynamic allocation and 
> so
> on if in the end it is not used.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> - a patch for the RandomState structure (this is used to reset a client's 
> random seed during the repeating of transactions after serialization/deadlock 
> failures).

About this v10 part 1:

Patch applies cleanly, compile, global & local make check both ok.

The random state is cleanly separated so that it will be easy to reset it 
on client error handling. ISTM that the pgbench side is deterministic with
the separation of the seeds for different uses.

Code is clean, comments are clear.

I'm wondering what is the rationale for the "xseed" field name? In 
particular, what does the "x" stand for?

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 07-08-2018 19:21, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

>> v10-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's random seed during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> About this v10 part 1:
> 
> Patch applies cleanly, compile, global & local make check both ok.
> 
> The random state is cleanly separated so that it will be easy to reset
> it on client error handling. ISTM that the pgbench side is
> deterministic with
> the separation of the seeds for different uses.
> 
> Code is clean, comments are clear.

:-)

> I'm wondering what is the rationale for the "xseed" field name? In
> particular, what does the "x" stand for?

I called it "...seed" instead of "data" because perhaps the "data" is 
too general a name for use here (but I'm not entirely sure what Alvaro 
Herrera meant in [1], see my answer in [2]). I called it "xseed" to 
combine it with the arguments of the functions _dorand48 / pg_erand48 / 
pg_jrand48 in the file erand48.c. IIUC they use a linear congruential 
generator and perhaps "xseed" means the sequence with the name X of 
pseudorandom values of size 48 bits (X_0, X_1, ... X_n) where X_0 is the 
seed / the start value.

[1] 
https://www.postgresql.org/message-id/20180711180417.3ytmmwmonsr5lra7@alvherre.pgsql

> LGTM, though I'd rename the random_state struct members so that it
> wouldn't look as confusing.  Maybe that's just me.

[2] 
https://www.postgresql.org/message-id/cb2cde10e4e7a10a38b48e9cae8fbd28%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch
> - a patch for a separate error reporting function (this is used to report 
> client failures that do not cause an abort and this depends on the level of 
> debugging).

Patch applies cleanly, compiles, global & local make check ok.

This patch improves/homogenizes logging & error reporting in pgbench, in 
preparation for another patch which will manage transaction restarts in 
some cases.

However ISTM that it is not as necessary as the previous one, i.e. we 
could do without it to get the desired feature, so I see it more as a 
refactoring done "in passing", and I'm wondering whether it is 
really worth it because it adds some new complexity, so I'm not sure of 
the net benefit.

Anyway, I still have quite a few comments/suggestions on this version.

* ErrorLevel

If ErrorLevel is used for things which are not errors, should its name not 
include "Error"? Maybe "LogLevel"?

I'm at odds with the proposed levels. ISTM that pgbench internal errors 
which warrant an immediate exit should be dubbed "FATAL", which would 
leave the "ERROR" name for... errors, eg SQL errors. I'd suggest to use an 
INFO level for the PGBENCH_DEBUG function, and to keep LOG for main 
program messages, so that all use cases are separate. Or, maybe the 
distinction between LOG/INFO is unclear so info is not necessary.

I'm unsure about the "log_min_messages" variable name, I'd suggest 
"log_level".

I do not see the asserts on LOG >= log_min_messages as useful, because the 
level can only be LOG or DEBUG anyway.

This point also suggests that maybe "pgbench_error" is misnamed as well 
(ok, I know I suggested it in place of ereport, but e stands for error 
there), as it is called on errors, but also on other things. Maybe 
"pgbench_log"? Or just simply "log" or "report", as it is really an local 
function, which does not need a prefix? That would mean that 
"pgbench_simple_error", which is indeed called on errors, could keep its 
initial name "pgbench_error", and be called on errors.

Alternatively, the debug/logging code could be left as it is (i.e. direct 
print to stderr) and the function only called when there is some kind of 
error, in which case it could be named with "error" in its name (or 
elog/ereport...).

* PQExpBuffer

I still do not see a positive value from importing PQExpBuffer complexity 
and cost into pgbench, as the resulting code is not very readable and it 
adds malloc/free cycles, so I'd try to avoid using PQExpBuf as much as 
possible. ISTM that all usages could be avoided in the patch, and most 
should be avoided even if ExpBuffer is imported because it is really 
useful somewhere.

- to call pgbench_error from pgbench_simple_error, you can do a 
pgbench_log_va(level, format, va_list) version called both from 
pgbench_error & pgbench_simple_error (a minimal sketch follows below).

- for the PGBENCH_DEBUG function, do separate calls per type; the 
very small partial code duplication is worth it to avoid ExpBuf IMO.

- for doCustom debug: I'd just leave the printf as it is, with a comment, as 
it is really very internal stuff for debug. Or I'd just snprintf 
something into a static buffer.

- for syntax_error: it should terminate, so it should call
pgbench_error(FATAL, ...). Idem, I'd either keep the printf then call
pgbench_error(FATAL, "syntax error found\n") for a final message,
or snprintf in a static buffer.

- for listAvailableScript: I'd simply call "pgbench_error(LOG" several 
times, once per line.

I see building a string with a format (printfExpBuf..) and then calling 
the pgbench_error function with just a "%s" format on the result as not 
very elegant, because the second format is somehow hacked around.
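
Here is a minimal sketch of the pgbench_log_va idea mentioned above (the
level names, the "log_level" variable and the exact behaviour are
assumptions for illustration only, not the actual patch):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum LogLevel
{
    LOG_DEBUG,                  /* detailed internal messages */
    LOG_MAIN,                   /* main program messages */
    LOG_FATAL                   /* internal errors: report and exit */
} LogLevel;

static LogLevel log_level = LOG_MAIN;

/* single worker shared by all reporting entry points */
static void
pgbench_log_va(LogLevel level, const char *fmt, va_list args)
{
    if (level >= log_level)
        vfprintf(stderr, fmt, args);
    if (level == LOG_FATAL)
        exit(1);
}

/* varargs wrapper; pgbench_simple_error would forward the same way */
static void
pgbench_error(LogLevel level, const char *fmt, ...)
{
    va_list args;

    va_start(args, fmt);
    pgbench_log_va(level, fmt, args);
    va_end(args);
}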

* bool client

I'm unconvinced by this added boolean just to switch the level on 
encountered errors.

I'd suggest to let lookupCreateVariable, putVariable* as they are, call 
pgbench_error with a level which does not stop the execution, and abort if 
necessary from the callers with an "aborted because of putVariable/eval/... 
error" message, as it was done before.

pgbench_error calls pgbench_error. Hmmm, why not.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 09-08-2018 12:28, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v10-0002-Pgbench-errors-use-a-separate-function-to-report.patch
>> - a patch for a separate error reporting function (this is used to 
>> report client failures that do not cause an abort, and this depends on 
>> the level of debugging).
> 
> Patch applies cleanly, compiles, global & local make check ok.

:-)

> This patch improves/homogenizes logging & error reporting in pgbench,
> in preparation for another patch which will manage transaction
> restarts in some cases.
> 
> However ISTM that it is not as necessary as the previous one, i.e. we
> could do without it to get the desired feature, so I see it more as a
> refactoring done "in passing", and I'm wondering whether it is really
> worth it because it adds some new complexity, so I'm not sure of the
> net benefit.

We discussed this starting with [1]:

>>>> IMO this patch is more controversial than the other ones.
>>>> 
>>>> It is not really related to the aim of the patch series, which could
>>>> do without, couldn't it?
>>> 
>>>> I'd suggest that it should be an independent submission, unrelated 
>>>> to
>>>> the pgbench error management patch.
>>> 
>>> I suppose that this is related; because of my patch there may be a 
>>> lot
>>> of such code (see v7 in [1]):
>>> 
>>> -            fprintf(stderr,
>>> -                    "malformed variable \"%s\" value: \"%s\"\n",
>>> -                    var->name, var->svalue);
>>> +            if (debug_level >= DEBUG_FAILS)
>>> +            {
>>> +                fprintf(stderr,
>>> +                        "malformed variable \"%s\" value: \"%s\"\n",
>>> +                        var->name, var->svalue);
>>> +            }
>>> 
>>> -        if (debug)
>>> +        if (debug_level >= DEBUG_ALL)
>>>              fprintf(stderr, "client %d sending %s\n", st->id, sql);
>> 
>> I'm not sure that debug messages need to be kept after debug, if it is
>> about debugging pgbench itself. That is debatable.
> 
> AFAICS it is not about debugging pgbench itself, but about more 
> detailed
> information that can be used to understand what exactly happened during
> its launch. In the case of errors this helps to distinguish between
> failures or errors by type (including which limit for retries was
> violated and how far it was exceeded for the serialization/deadlock
> errors).
> 
>>> That's why it was suggested to make the error function which hides 
>>> all
>>> these things (see [2]):
>>> 
>>> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
>>> corresponding fprintf(stderr..) I think it's time to do it like in 
>>> the
>>> main code, wrap with some function like log(level, msg).
>> 
>> Yep. I did not write that, but I agree with an "elog" suggestion to 
>> switch
>> 
>>    if (...) { fprintf(...); exit/abort/continue/... }
>> 
>> to a simpler:
>> 
>>    elog(level, ...)

> Anyway, I still have quite a few comments/suggestions on this version.

Thank you very much for them!

> * ErrorLevel
> 
> If ErrorLevel is used for things which are not errors, its name should
> not include "Error"? Maybe "LogLevel"?

On the one hand, this sounds better to me too. On the other hand, won't 
this conflict in some way with the error level codes in elog.h?..

/* Error level codes */
#define DEBUG5        10            /* Debugging messages, in categories of
                                 * decreasing detail. */
#define DEBUG4        11
...

> I'm at odds with the proposed levels. ISTM that pgbench internal
> errors which warrant an immediate exit should be dubbed "FATAL",

Ok!

> which
> would leave the "ERROR" name for... errors, eg SQL errors. I'd suggest
> to use an INFO level for the PGBENCH_DEBUG function, and to keep LOG
> for main program messages, so that all use case are separate. Or,
> maybe the distinction between LOG/INFO is unclear so info is not
> necessary.

The messages of the errors in SQL and meta commands are printed only if 
the option --debug-fails is used, so I'm not sure that they should have a 
higher error level than main program messages (ERROR vs LOG). About an 
INFO level for the PGBENCH_DEBUG function - ISTM that some main program 
messages such as "dropping old tables...\n" or "... tuples (%d%%) done 
(elapsed %.2f s, remaining %.2f s)\n" can also use it.. About keeping all 
use cases separate - in the current version the level LOG also includes 
messages about abortions of the clients.

> I'm unsure about the "log_min_messages" variable name, I'd suggest 
> "log_level".
> 
> I do not see the asserts on LOG >= log_min_messages as useful, because
> the level can only be LOG or DEBUG anyway.

Ok!

> This point also suggest that maybe "pgbench_error" is misnamed as well
> (ok, I know I suggested it in place of ereport, but e stands for error
> there), as it is called on errors, but is also on other things. Maybe
> "pgbench_log"? Or just simply "log" or "report", as it is really an
> local function, which does not need a prefix? That would mean that
> "pgbench_simple_error", which is indeed called on errors, could keep
> its initial name "pgbench_error", and be called on errors.

About the name "log" - we already have the function doLog, so perhaps 
the name "report" will be better.. But like with ErrorLevel will not 
this be in some kind of conflict with ereport which is also used for the 
levels DEBUG... / LOG / INFO?

> Alternatively, the debug/logging code could be let as it is (i.e.
> direct print to stderr) and the function only called when there is
> some kind of error, in which case it could be named with "error" in
> its name (or elog/ereport...).

As I wrote in [2]:

> because of my patch there may be a lot
> of such code (see v7 in [1]):
> 
> -            fprintf(stderr,
> -                    "malformed variable \"%s\" value: \"%s\"\n",
> -                    var->name, var->svalue);
> +            if (debug_level >= DEBUG_FAILS)
> +            {
> +                fprintf(stderr,
> +                        "malformed variable \"%s\" value: \"%s\"\n",
> +                        var->name, var->svalue);
> +            }
> 
> -        if (debug)
> +        if (debug_level >= DEBUG_ALL)
>              fprintf(stderr, "client %d sending %s\n", st->id, sql);
> 
> That's why it was suggested to make the error function which hides all
> these things (see [2]):
> 
> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

And IIUC macros will not help in the absence of __VA_ARGS__.
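
For instance, with C99 variadic macros the wrapping could have been as
simple as (purely illustrative; "log_level" is an assumed variable):

#define pgbench_log(level, ...) \
    do { \
        if ((level) >= log_level) \
            fprintf(stderr, __VA_ARGS__); \
    } while (0)

but without __VA_ARGS__ the wrapper has to be a real variadic function
that forwards a va_list.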

> * PQExpBuffer
> 
> I still do not see a positive value from importing PQExpBuffer
> complexity and cost into pgbench, as the resulting code is not very
> readable and it adds malloc/free cycles, so I'd try to avoid using
> PQExpBuf as much as possible. ISTM that all usages could be avoided in
> the patch, and most should be avoided even if ExpBuffer is imported
> because it is really useful somewhere.
> 
> - to call pgbench_error from pgbench_simple_error, you can do a
> pgbench_log_va(level, format, va_list) version called both from
> pgbench_error & pgbench_simple_error.
> 
> - for PGBENCH_DEBUG function, do separate calls per type, the very
> small partial code duplication is worth avoiding ExpBuf IMO.
> 
> - for doCustom debug: I'd just let the printf as it is, with a
> comment, as it is really very internal stuff for debug. Or I'd just
> snprintf a something in a static buffer.
> 
> - for syntax_error: it should terminate, so it should call
> pgbench_error(FATAL, ...). Idem, I'd either keep the printf then call
> pgbench_error(FATAL, "syntax error found\n") for a final message,
> or snprintf in a static buffer.
> 
> - for listAvailableScript: I'd simply call "pgbench_error(LOG" several
> time, once per line.
> 
> I see building a string with a format (printfExpBuf..) and then
> calling the pgbench_error function with just a "%s" format on the
> result as not very elegant, because the second format is somehow
> hacked around.

Ok! About using a static buffer in doCustom debug or in syntax_error - 
I'm not sure that this is always possible because ISTM that the variable 
name can be quite long.

> * bool client
> 
> I'm unconvince by this added boolean just to switch the level on
> encountered errors.
> 
> I'd suggest to let lookupCreateVariable, putVariable* as they are,
> call pgbench_error with a level which does not stop the execution, and
> abort if necessary from the callers with a "aborted because of
> putVariable/eval/... error" message, as it was done before.

There's one more problem: if this is a client failure, an error message 
inside any of these functions should be printed at the level 
DEBUG_FAILS; otherwise it should be printed at the level LOG. Or do you 
suggest using the error level as an argument for these functions?

> pgbench_error calls pgbench_error. Hmmm, why not.

[1] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1806100837380.3655%40lancre
[2] 
https://www.postgresql.org/message-id/b692de21caaed13c59f31c06d0098488%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

>> I'd suggest to let lookupCreateVariable, putVariable* as they are,
>> call pgbench_error with a level which does not stop the execution, and
>> abort if necessary from the callers with a "aborted because of
>> putVariable/eval/... error" message, as it was done before.
>
> There's one more problem: if this is a client failure, an error message 
> inside any of these functions should be printed at the level DEBUG_FAILS; 
> otherwise it should be printed at the level LOG. Or do you suggest using the 
> error level as an argument for these functions?

No. I suggest that the called function does only one simple thing, 
probably "DEBUG", and that the *caller* prints a message if it is unhappy 
about the failure of the called function, as it is currently done. This 
allows to provide context as well from the caller, eg "setting variable %s 
failed while <some specific context>". The user can rerun under debug for 
precision if they need it.

I'm still not over enthusiastic with these changes, and still think that 
it should be an independent patch, not submitted together with the "retry 
on error" feature.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 10-08-2018 11:33, Fabien COELHO wrote:
> Hello Marina,
> 
>>> I'd suggest to let lookupCreateVariable, putVariable* as they are,
>>> call pgbench_error with a level which does not stop the execution, 
>>> and
>>> abort if necessary from the callers with a "aborted because of
>>> putVariable/eval/... error" message, as it was done before.
>> 
>> There's one more problem: if this is a client failure, an error 
>> message inside any of these functions should be printed at the level 
>> DEBUG_FAILS; otherwise it should be printed at the level LOG. Or do 
>> you suggest using the error level as an argument for these functions?
> 
> No. I suggest that the called function does only one simple thing,
> probably "DEBUG", and that the *caller* prints a message if it is
> unhappy about the failure of the called function, as it is currently
> done. This allows to provide context as well from the caller, eg
> "setting variable %s failed while <some specific context>". The user
> call rerun under debug for precision if they need it.

Ok!

> I'm still not over enthousiastic with these changes, and still think
> that it should be an independent patch, not submitted together with
> the "retry on error" feature.

In the next version I will put the error patch last, so it will be 
possible to compare the "retry on error" feature with and without it, 
and let the committer decide which is better)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Arthur Zakirov
Дата:
On Thu, Aug 09, 2018 at 06:17:22PM +0300, Marina Polyakova wrote:
> > * ErrorLevel
> > 
> > If ErrorLevel is used for things which are not errors, its name should
> > not include "Error"? Maybe "LogLevel"?
> 
> On the one hand, this sounds better for me too. On the other hand, will not
> this be in some kind of conflict with error level codes in elog.h?..

I think it shouldn't, because those error levels are backend levels.
pgbench is a client-side utility with its own code; it shares some code
with libpq and other utilities, but elog.h isn't one of them.

> > This point also suggest that maybe "pgbench_error" is misnamed as well
> > (ok, I know I suggested it in place of ereport, but e stands for error
> > there), as it is called on errors, but is also on other things. Maybe
> > "pgbench_log"? Or just simply "log" or "report", as it is really an
> > local function, which does not need a prefix? That would mean that
> > "pgbench_simple_error", which is indeed called on errors, could keep
> > its initial name "pgbench_error", and be called on errors.
> 
> About the name "log" - we already have the function doLog, so perhaps the
> name "report" will be better.. But like with ErrorLevel will not this be in
> some kind of conflict with ereport which is also used for the levels
> DEBUG... / LOG / INFO?

+1 from me to keep the initial name "pgbench_error". "pgbench_log" for the
new function looks nice to me. I think it is better than just "log",
because "log" may conflict with the natural logarithm function (see "man 3
log").

> > pgbench_error calls pgbench_error. Hmmm, why not.

I agree with Fabien. Calling pgbench_error() inside pgbench_error()
could be dangerous. I think "fmt" checking could be removed, or we may
use Assert() or fprintf()+exit(1) at least.
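
For illustration, a minimal sketch of the fprintf()+exit(1) variant of
that check (the level type and names are placeholders, not the patch
code):

#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>

typedef enum { ELEVEL_LOG, ELEVEL_FATAL } ErrorLevel;   /* placeholders */

static void
pgbench_error(ErrorLevel level, const char *fmt, ...)
{
    va_list args;

    /* guard against an empty format without recursing into pgbench_error */
    if (fmt == NULL || *fmt == '\0')
    {
        fprintf(stderr, "pgbench_error called with an empty format\n");
        exit(1);
    }

    va_start(args, fmt);
    vfprintf(stderr, fmt, args);
    va_end(args);

    if (level == ELEVEL_FATAL)
        exit(1);
}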

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 10-08-2018 15:53, Arthur Zakirov wrote:
> On Thu, Aug 09, 2018 at 06:17:22PM +0300, Marina Polyakova wrote:
>> > * ErrorLevel
>> >
>> > If ErrorLevel is used for things which are not errors, its name should
>> > not include "Error"? Maybe "LogLevel"?
>> 
>> On the one hand, this sounds better for me too. On the other hand, 
>> will not
>> this be in some kind of conflict with error level codes in elog.h?..
> 
> I think it shouldn't because those error levels are backends levels.
> pgbench is a client side utility with its own code, it shares some code
> with libpq and other utilities, but elog.h isn't one of them.

I agree with you on this :) I just meant that maybe it would be better 
to name this group in the same way because in general they are used for 
the same purpose?..

>> > This point also suggest that maybe "pgbench_error" is misnamed as well
>> > (ok, I know I suggested it in place of ereport, but e stands for error
>> > there), as it is called on errors, but is also on other things. Maybe
>> > "pgbench_log"? Or just simply "log" or "report", as it is really an
>> > local function, which does not need a prefix? That would mean that
>> > "pgbench_simple_error", which is indeed called on errors, could keep
>> > its initial name "pgbench_error", and be called on errors.
>> 
>> About the name "log" - we already have the function doLog, so perhaps 
>> the
>> name "report" will be better.. But like with ErrorLevel will not this 
>> be in
>> some kind of conflict with ereport which is also used for the levels
>> DEBUG... / LOG / INFO?
> 
> +1 from me to keep initial name "pgbench_error". "pgbench_log" for new
> function looks nice to me. I think it is better than just "log",
> because "log" may conflict with natural logarithmic function (see "man 
> 3
> log").

Do you think that pgbench_log (or another name that speaks only about 
logging) will look good, for example, with FATAL? Because this means 
that the logging function also processes errors and calls exit(1) if 
necessary..

>> > pgbench_error calls pgbench_error. Hmmm, why not.
> 
> I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> could be dangerous. I think "fmt" checking could be removed, or we may
> use Assert()

I would like not to use Assert in this case because IIUC they are mostly 
used for testing.

> or fprintf()+exit(1) at least.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Arthur Zakirov
Дата:
On Fri, Aug 10, 2018 at 04:46:04PM +0300, Marina Polyakova wrote:
> > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new
> > function looks nice to me. I think it is better than just "log",
> > because "log" may conflict with natural logarithmic function (see "man 3
> > log").
> 
> Do you think that pgbench_log (or another whose name speaks only about
> logging) will look good, for example, with FATAL? Because this means that
> the logging function also processes errors and calls exit(1) if necessary..

Yes, why not. "_log" just means that you want to log some message with
the specified log level. Moreover, those messages sometimes aren't errors:

pgbench_error(LOG, "starting vacuum...");

> > I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> > could be dangerous. I think "fmt" checking could be removed, or we may
> > use Assert()
> 
> I would like not to use Assert in this case because IIUC they are mostly
> used for testing.

I'd vote to remove this check altogether. I don't see any place where it is
possible to call pgbench_error() passing an empty "fmt".

-- 
Arthur Zakirov
Postgres Professional: http://www.postgrespro.com
Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 10-08-2018 17:19, Arthur Zakirov wrote:
> On Fri, Aug 10, 2018 at 04:46:04PM +0300, Marina Polyakova wrote:
>> > +1 from me to keep initial name "pgbench_error". "pgbench_log" for new
>> > function looks nice to me. I think it is better than just "log",
>> > because "log" may conflict with natural logarithmic function (see "man 3
>> > log").
>> 
>> Do you think that pgbench_log (or another whose name speaks only about
>> logging) will look good, for example, with FATAL? Because this means 
>> that
>> the logging function also processes errors and calls exit(1) if 
>> necessary..
> 
> Yes, why not. "_log" just means that you want to log some message with
> the specified log level. Moreover those messages sometimes aren't 
> error:
> 
> pgbench_error(LOG, "starting vacuum...");

"pgbench_log" is already used as the default filename prefix for 
transaction logging.

>> > I agree with Fabien. Calling pgbench_error() inside pgbench_error()
>> > could be dangerous. I think "fmt" checking could be removed, or we may
>> > use Assert()
>> 
>> I would like not to use Assert in this case because IIUC they are 
>> mostly
>> used for testing.
> 
> I'd vote to remove this check at all. I don't see any place where it is
> possible to call pgbench_error() passing empty "fmt".

pgbench_error(..., "%s", PQerrorMessage(con)); ?

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch
> - a patch for the Variables structure (this is used to reset client variables 
> during the repeating of transactions after serialization/deadlock failures).

This patch adds an explicit structure to manage Variables, which is useful 
to reset these on pgbench script retries, which is the purpose of the 
whole patch series.

About part 3:

Patch applies cleanly,

* typo in comments: "varaibles"

* About enlargeVariables:

multiple INT_MAX error handling looks strange, especially as this code can 
never be triggered because pgbench would be dead long before having 
allocated INT_MAX variables. So I would not bother to add such checks.

ISTM that if something is amiss it will fail in pg_realloc anyway. Also I 
do not like the ExpBuf stuff, as usual.

I'm not sure that the size_t casts here and there are useful for any 
practical values likely to be encountered by pgbench.

The exponential allocation seems overkill. I'd simply add a constant 
number of slots, with a simple rule:

   /* reallocated with a margin */
   if (max_vars < needed) max_vars = needed + 8;

So in the end the function should be much simpler.
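
Something along these lines, perhaps (a rough sketch with simplified
types; the struct fields are assumptions, not the actual patch, and
pg_realloc() is the frontend allocator that exits on failure):

typedef struct Variable
{
    char       *name;
    char       *svalue;
    /* other fields omitted in this sketch */
} Variable;

typedef struct Variables
{
    Variable   *vars;       /* array of variables */
    int         nvars;      /* number of variables in use */
    int         max_vars;   /* allocated size of the array */
} Variables;

static void
enlargeVariables(Variables *variables, int needed)
{
    /* reallocate with a constant margin instead of growing exponentially */
    if (variables->max_vars < variables->nvars + needed)
    {
        variables->max_vars = variables->nvars + needed + 8;
        variables->vars = (Variable *)
            pg_realloc(variables->vars, variables->max_vars * sizeof(Variable));
    }
}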

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> About part 3:
>
> Patch applies cleanly,

I forgot: compiles, global & local "make check" are ok.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 12-08-2018 12:14, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

>> v10-0003-Pgbench-errors-use-the-Variables-structure-for-c.patch
>> - a patch for the Variables structure (this is used to reset client 
>> variables during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> This patch adds an explicit structure to manage Variables, which is
> useful to reset these on pgbench script retries, which is the purpose
> of the whole patch series.
> 
> About part 3:
> 
> Patch applies cleanly,

On 12-08-2018 12:17, Fabien COELHO wrote:
>> About part 3:
>> 
>> Patch applies cleanly,
> 
> I forgot: compiles, global & local "make check" are ok.

I'm glad to hear it :-)

> * typo in comments: "varaibles"

I'm sorry, I'll fix it.

> * About enlargeVariables:
> 
> multiple INT_MAX error handling looks strange, especially as this code
> can never be triggered because pgbench would be dead long before
> having allocated INT_MAX variables. So I would not bother to add such
> checks.
> ...
> I'm not sure that the size_t cast here and there are useful for any
> practical values likely to be encountered by pgbench.

Looking at the code of functions such as ParseScript and psql_scan_setup, 
where an integer variable is used for the size of the entire script - 
ISTM that you are right.. Therefore the size_t casts will also be 
removed.

> ISTM that if something is amiss it will fail in pg_realloc anyway.

IIUC, if physical RAM is not enough, this may depend on the size of the 
swap.

> Also I do not like the ExpBuf stuff, as usual.

> The exponential allocation seems overkill. I'd simply add a constant
> number of slots, with a simple rule:
> 
> /* reallocated with a margin */
> if (max_vars < needed) max_vars = needed + 8;
> 
> So in the end the function should be much simpler.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch
> - the main patch for handling client errors and repetition of transactions 
> with serialization/deadlock failures (see the detailed description in the 
> file).

Patch applies cleanly.

It allows retrying a script (considered as a transaction) on serialization 
and deadlock errors, which is a very interesting extension but also 
impacts pgbench significantly.

I'm waiting for the feature to be right before fully checking the 
documentation and tests. There are still some issues to resolve before
checking that.

Anyway, tests look reasonable. Taking advantage of transaction control 
from PL/pgsql is a good use of this new feature.

A few comments about the doc.

According to the documentation, the feature is triggered by --max-tries and
--latency-limit. I disagree with the latter, because it means that having
a latency limit without retrying is not supported anymore.

Maybe you can allow an "unlimited" max-tries, say with special value zero,
and the latency limit does its job if set, over all tries.
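
For example (hypothetical semantics, following this suggestion), a run
such as "pgbench --max-tries=0 --latency-limit=10 ..." would retry
serialization/deadlock failures without limit, while still giving up on a
transaction once the 10 ms latency limit is exceeded over all of its
tries.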

Doc: "error in meta commands" -> "meta command errors", for homogeneity with
other cases?

Detailed -r report. I understand from the doc that the retry number on the
detailed per-statement report is to identify at what point errors occur?
Probably this is more or less always at the same point on a given script,
so that the most interesting feature is to report the number of retries at the
script level.

Doc: "never occur.." -> "never occur", or eventually "...".

Doc: "Directly client errors" -> "Direct client errors".


I'm still in favor of asserting that the SQL connection is idle (no tx in
progress) at the beginning and/or end of a script, and reporting a user error
if not, instead of writing complex caveats.

If someone has a use-case for that, then maybe it can be changed, but I
cannot see any in a benchmarking context, and I can see how easy it is
to have a buggy script with this allowed.

I do not think that the RETRIES_ENABLED macro is a good thing. I'd suggest
to write the condition four times.

ISTM that "skipped" transactions are NOT "successful" so there are a problem
with comments. I believe that your formula are probably right, it has more to do
with what is "success". For cnt decomposition, ISTM that "other transactions"
are really "directly successful transactions".

I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise "another"
does not make sense yet. I'd suggest to name it "OTHER_SQL_FAILURE".

In TState, field "uint32 retries": maybe it would be simpler to count "tries",
which can be compared directly to max tries set in the option?

ErrorLevel: I have already commented about this in the review of 10.2. I'm not sure of
the LOG -> DEBUG_FAIL changes. I do not understand the name "DEBUG_FAIL", as it
is not related to debug; they just seem to be internal errors. META_ERROR maybe?

inTransactionBlock: I disagree with any function other than doCustom changing
the client state, because it makes understanding the state machine harder. There
is already one exception to that (threadRun) that I wish to remove. All state
changes must be performed explicitly in doCustom.

The automaton skips to FAILURE on every possible error. I'm wondering whether
it could do so only on SQL errors, because other fails will lead to ABORTED
anyway? If there is no good reason to skip to FAILURE from some errors, I'd
suggest to keep the previous behavior. Maybe the good reason is to do some
counting, but this means that on eg metacommand errors now the script would
loop over instead of aborting, which does not look like a desirable change
of behavior.

PQexec("ROOLBACK"): you are inserting a synchronous command, for which the
thread will have to wait for the result, in a middle of a framework which
takes great care to use only asynchronous stuff so that one thread can
manage several clients efficiently. You cannot call PQexec there.
From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to
a new state CSTATE_WAIT_ABORT_RESULT which would be similar to
CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead
of proceeding to the next command.

ISTM that it would be more logical to only get into RETRY if there is a retry,
i.e. move the test RETRY/ABORT in FAILURE. For that, instead of "canRetry",
maybe you want "doRetry", which tells that a retry is possible (the error
is serializable or deadlock) and that the current parameters allow it
(timeout, max retries).
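
A rough sketch of such a predicate, with the relevant state passed in
explicitly for illustration (all names, parameters and the enum are
assumptions, not the patch):

#include <stdbool.h>
#include <stdint.h>

typedef enum { NO_FAILURE, SERIALIZATION_FAILURE, DEADLOCK_FAILURE } FailureKind;

/*
 * A retry makes sense only if the error kind is retriable and the run
 * parameters (--max-tries, --latency-limit over all tries) still allow
 * another try; here 0 means "no limit" for both parameters.
 */
static bool
doRetry(FailureKind failure, uint32_t tries, uint32_t max_tries,
        int64_t elapsed_us, int64_t latency_limit_us)
{
    if (failure != SERIALIZATION_FAILURE && failure != DEADLOCK_FAILURE)
        return false;
    if (max_tries != 0 && tries >= max_tries)
        return false;
    if (latency_limit_us != 0 && elapsed_us >= latency_limit_us)
        return false;
    return true;
}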


* Minor C style comments:

if / else if / else if ... on *_FAILURE: I'd suggest a switch.

The following line removal does not seem useful, I'd have kept it:

   stats->cnt++;
  -
   if (skipped)

copyVariables: I'm not convinced that source_vars & nvars variables are that
useful.

   memcpy(&(st->retry_state.random_state), &(st->random_state), sizeof(RandomState));

Is there a problem with "st->retry_state.random_state = st->random_state;"
instead of memcpy? ISTM that simple assignments work in C. Idem in the reverse
copy under RETRY.
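
Indeed, plain struct assignment copies every member, including array
members; a standalone illustration (with a simplified stand-in type):

#include <stdio.h>

typedef struct RandomState
{
    unsigned short xseed[3];
} RandomState;

int
main(void)
{
    RandomState a = {{1, 2, 3}};
    RandomState b;

    b = a;      /* the whole xseed array is copied by the assignment */
    printf("%hu %hu %hu\n", b.xseed[0], b.xseed[1], b.xseed[2]);    /* 1 2 3 */
    return 0;
}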

   if (!copyVariables(&st->retry_state.variables, &st->variables)) {
     pgbench_error(LOG, "client %d aborted when preparing to execute a transaction\n", st->id);

The message could be more precise, eg "client %d failed while copying
variables", unless copyVariables already printed a message. As this is really
an internal error from pgbench, I'd rather do a FATAL (direct exit) there.
ISTM that the only possible failure is OOM here, and pgbench is in a very bad
shape if it gets into that.

commandFailed: I'm not thrilled by the added boolean, which is partially
redundant with the second argument.

          if (per_script_stats)
  -               accumStats(&sql_script[st->use_file].stats, skipped, latency, lag);
  +       {
  +               accumStats(&sql_script[st->use_file].stats, skipped, latency, lag,
  +                                  st->failure_status, st->retries);
  +       }
   }

I do not see the point of changing the style here.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 15-08-2018 11:50, Fabien COELHO wrote:
> Hello Marina,

Hello!

>> v10-0004-Pgbench-errors-and-serialization-deadlock-retrie.patch
>> - the main patch for handling client errors and repetition of 
>> transactions with serialization/deadlock failures (see the detailed 
>> description in the file).
> 
> Patch applies cleanly.
> 
> It allows retrying a script (considered as a transaction) on
> serializable and deadlock errors, which is a very interesting
> extension but also impacts pgbench significantly.
> 
> I'm waiting for the feature to be right before checking in full the
> documentation and tests. There are still some issues to resolve before
> checking that.
> 
> Anyway, tests look reasonable. Taking advantage of of transactions
> control from PL/pgsql is a good use of this new feature.

:-)

> A few comments about the doc.
> 
> According to the documentation, the feature is triggered by --max-tries 
> and
> --latency-limit. I disagree with the later, because it means that 
> having
> latency limit without retrying is not supported anymore.
> 
> Maybe you can allow an "unlimited" max-tries, say with special value 
> zero,
> and the latency limit does its job if set, over all tries.
> 
> Doc: "error in meta commands" -> "meta command errors", for homogeneity 
> with
> other cases?
> ...
> Doc: "never occur.." -> "never occur", or eventually "...".
> 
> Doc: "Directly client errors" -> "Direct client errors".
> ...
> inTransactionBlock: I disagree with any function other than doCustom 
> changing
> the client state, because it makes understanding the state machine 
> harder. There
> is already one exception to that (threadRun) that I wish to remove. All 
> state
> changes must be performed explicitely in doCustom.
> ...
> PQexec("ROOLBACK"): you are inserting a synchronous command, for which 
> the
> thread will have to wait for the result, in a middle of a framework 
> which
> takes great care to use only asynchronous stuff so that one thread can
> manage several clients efficiently. You cannot call PQexec there.
> From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to
> a new state CSTATE_WAIT_ABORT_RESULT which would be similar to
> CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead
> of proceeding to the next command.
> ...
>   memcpy(&(st->retry_state.random_state), &(st->random_state),
> sizeof(RandomState));
> 
> Is there a problem with "st->retry_state.random_state = 
> st->random_state;"
> instead of memcpy? ISTM that simple assignments work in C. Idem in the 
> reverse
> copy under RETRY.

Thank you, I'll fix this.

> Detailed -r report. I understand from the doc that the retry number on 
> the
> detailed per-statement report is to identify at what point errors 
> occur?
> Probably this is more or less always at the same point on a given 
> script,
> so that the most interesting feature is to report the number of retries 
> at the
> script level.

This may depend on various factors.. for example:

transaction type: pgbench_test_serialization.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
duration: 10 s
number of transactions actually processed: 266
number of errors: 10 (3.623%)
number of serialization errors: 10 (3.623%)
number of retried: 75 (27.174%)
number of retries: 75
maximum number of tries: 2
latency average = 72.734 ms (including errors)
tps = 26.501162 (including connections establishing)
tps = 26.515082 (excluding connections establishing)
statement latencies in milliseconds, errors and retries:
          0.012           0           0  \set delta random(-5000, 5000)
          0.001           0           0  \set x1 random(1, 100000)
          0.001           0           0  \set x3 random(1, 2)
          0.001           0           0  \set x2 random(1, 1)
         19.837           0           0  UPDATE xy1 SET y = y + :delta 
WHERE x = :x1;
         21.239           5          36  UPDATE xy3 SET y = y + :delta 
WHERE x = :x3;
         21.360           5          39  UPDATE xy2 SET y = y + :delta 
WHERE x = :x2;

And you can always get the number of retries at the script level from 
the main report (if only one script is used) or from the report for each 
script (if multiple scripts are used).

> I'm still in favor of asserting that the sql connection is idle (no tx 
> in
> progress) at the beginning and/or end of a script, and report a user 
> error
> if not, instead of writing complex caveats.
> 
> If someone has a use-case for that, then maybe it can be changed, but I
> cannot see any in a benchmarking context, and I can see how easy it is
> to have a buggy script with this allowed.
> 
> I do not think that the RETRIES_ENABLED macro is a good thing. I'd 
> suggest
> to write the condition four times.

Ok!

> ISTM that "skipped" transactions are NOT "successful" so there are a 
> problem
> with comments. I believe that your formula are probably right, it has 
> more to do
> with what is "success". For cnt decomposition, ISTM that "other 
> transactions"
> are really "directly successful transactions".

I agree with you, but I also think that skipped transactions should not 
be considered errors. So we can write something like this:

All the transactions are divided into several types depending on their 
execution. Firstly, they can be divided into transactions that we 
started to execute, and transactions which were skipped (it was too late 
to execute them). Secondly, running transactions fall into 2 main types: 
is there any command that got a failure during the last execution of the 
transaction script or not? Thus

the number of all transactions =
   skipped (it was too late to execute them) +
   cnt (the number of successful transactions) +
   ecnt (the number of failed transactions).

A successful transaction can have several unsuccessful tries before a
successful run. Thus

cnt (the number of successful transactions) =
   retried (they got serialization or deadlock failure(s), but were
            successfully retried from the very beginning) +
   directly successful transactions (they were successfully completed on
                                     the first try).
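
Using the sample run above as a worked example: skipped = 0 (no --rate),
cnt = 266 and ecnt = 10, so the number of all transactions is 0 + 266 +
10 = 276, which matches the reported percentages (10/276 = 3.623%, 75/276
= 27.174%); and of the 266 successful transactions, 75 were retried and
191 succeeded directly on the first try.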

> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise 
> "another"
> does not make sense yet.

Maybe first put a general group, and then the special cases?...

> I'd suggest to name it "OTHER_SQL_FAILURE".

Ok!

> In TState, field "uint32 retries": maybe it would be simpler to count 
> "tries",
> which can be compared directly to max tries set in the option?

If you mean retries in CState - on the one hand, yes, on the other hand, 
statistics always use the number of retries...

> ErrorLevel: I have already commented about in review about 10.2. I'm 
> not sure of
> the LOG -> DEBUG_FAIL changes. I do not understand the name 
> "DEBUG_FAIL", has it
> is not related to debug, they just seem to be internal errors. 
> META_ERROR maybe?

As I wrote to you in [1]:

>> I'm at odds with the proposed levels. ISTM that pgbench internal
>> errors which warrant an immediate exit should be dubbed "FATAL",
> 
> Ok!
> 
>> which
>> would leave the "ERROR" name for... errors, eg SQL errors.
>> ...
> 
> The messages of the errors in SQL and meta commands are printed only if
> the option --debug-fails is used so I'm not sure that they should have 
> a
> higher error level than main program messages (ERROR vs LOG).

Perhaps we can rename the levels DEBUG_FAIL and LOG to LOG and 
LOG_PGBENCH respectively. In this case the client error messages do not 
use debug error levels and the term "logging" is already used for 
transaction/aggregation logging... Therefore perhaps we can also combine 
the options --errors-detailed and --debug-fails into the option 
--fails-detailed=none|groups|all_messages. Here --fails-detailed=groups 
can be used to group errors in reports or logs by basic types. 
--fails-detailed=all_messages can add to this all error messages in the
SQL/meta commands, and messages for processing the failed transaction 
(its end/retry).

> The automaton skips to FAILURE on every possible error. I'm wondering 
> whether
> it could do so only on SQL errors, because other fails will lead to 
> ABORTED
> anyway? If there is no good reason to skip to FAILURE from some errors, 
> I'd
> suggest to keep the previous behavior. Maybe the good reason is to do 
> some
> counting, but this means that on eg metacommand errors now the script 
> would
> loop over instead of aborting, which does not look like a desirable 
> change
> of behavior.

Even in the case of meta command errors we must prepare for 
CSTATE_END_TX and the execution of the next script: if necessary, clear 
the conditional stack and roll back the current transaction block.

> ISTM that it would be more logical to only get into RETRY if there is a 
> retry,
> i.e. move the test RETRY/ABORT in FAILURE. For that, instead of 
> "canRetry",
> maybe you want "doRetry", which tells that a retry is possible (the 
> error
> is serializable or deadlock) and that the current parameters allow it
> (timeout, max retries).
> 
> * Minor C style comments:
> 
> if / else if / else if ... on *_FAILURE: I'd suggest a switch.
> 
> The following line removal does not seem useful, I'd have kept it:
> 
>   stats->cnt++;
>  -
>   if (skipped)
> 
> copyVariables: I'm not convinced that source_vars & nvars variables are 
> that
> useful.

>   if (!copyVariables(&st->retry_state.variables, &st->variables)) {
>     pgbench_error(LOG, "client %d aborted when preparing to execute a
> transaction\n", st->id);
> 
> The message could be more precise, eg "client %d failed while copying
> variables", unless copyVariables already printed a message. As this is 
> really
> an internal error from pgbench, I'd rather do a FATAL (direct exit) 
> there.
> ISTM that the only possible failure is OOM here, and pgbench is in a 
> very bad
> shape if it gets into that.

Ok!

> commandFailed: I'm not thrilled by the added boolean, which is 
> partially
> redundant with the second argument.

Do you mean that it is partially redundant with the argument "cmd" and 
that, for example, meta command errors never cause the abortion of the 
client?

>          if (per_script_stats)
>  -               accumStats(&sql_script[st->use_file].stats, skipped,
> latency, lag);
>  +       {
>  +               accumStats(&sql_script[st->use_file].stats, skipped,
> latency, lag,
>  +                                  st->failure_status, st->retries);
>  +       }
>   }
> 
> I do not see the point of changing the style here.

If in such cases one command is placed on several lines, ISTM that the 
code is more understandable if curly brackets are used...

[1] 
https://www.postgresql.org/message-id/fcc2512cdc9e6bc49d3b489181f454da%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

>> Detailed -r report. I understand from the doc that the retry number on 
>> the detailed per-statement report is to identify at what point errors 
>> occur? Probably this is more or less always at the same point on a 
>> given script, so that the most interesting feature is to report the 
>> number of retries at the script level.
>
> This may depend on various factors.. for example:
> [...]
>        21.239           5          36  UPDATE xy3 SET y = y + :delta WHERE x 
> = :x3;
>        21.360           5          39  UPDATE xy2 SET y = y + :delta WHERE x 
> = :x2;

Ok, not always the same point, and you confirm that it identifies where 
the error is raised which leads to a retry.

> And you can always get the number of retries at the script level from the 
> main report (if only one script is used) or from the report for each script 
> (if multiple scripts are used).

Ok.

>> ISTM that "skipped" transactions are NOT "successful" so there are a 
>> problem with comments. I believe that your formula are probably right, 
>> it has more to do with what is "success". For cnt decomposition, ISTM 
>> that "other transactions" are really "directly successful 
>> transactions".
>
> I agree with you, but I also think that skipped transactions should not be 
> considered errors.

I'm ok with having a special category for them in the explanations, which 
is neither success nor error.

> So we can write something like this:

> All the transactions are divided into several types depending on their 
> execution. Firstly, they can be divided into transactions that we started to 
> execute, and transactions which were skipped (it was too late to execute 
> them). Secondly, running transactions fall into 2 main types: is there any 
> command that got a failure during the last execution of the transaction 
> script or not? Thus

Here is an attempt at having a more precise and shorter version, not sure 
it is much better than yours, though:

"""
Transactions are counted depending on their execution and outcome. First
a transaction may have started or not: skipped transactions occur under 
--rate and --latency-limit when the client is too late to execute them. 
Secondly, a started transaction may ultimately succeed or fail on some 
error, possibly after some retries when --max-tries is not one. Thus
"""

> the number of all transactions =
>  skipped (it was too late to execute them)
>  cnt (the number of successful transactions) +
>  ecnt (the number of failed transactions).
>
> A successful transaction can have several unsuccessful tries before a
> successfull run. Thus
>
> cnt (the number of successful transactions) =
>  retried (they got a serialization or a deadlock failure(s), but were
>           successfully retried from the very beginning) +
>  directly successfull transactions (they were successfully completed on
>                                     the first try).

The above description is clearer for me.

>> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise 
>> "another" does not make sense yet.
>
> Maybe firstly put a general group, and then special cases?...

I understand it more as a catch-all default "none of the above" case.

>> In TState, field "uint32 retries": maybe it would be simpler to count 
>> "tries", which can be compared directly to max tries set in the option?
>
> If you mean retries in CState - on the one hand, yes, on the other hand, 
> statistics always use the number of retries...

Ok.


>> The automaton skips to FAILURE on every possible error. I'm wondering 
>> whether it could do so only on SQL errors, because other fails will 
>> lead to ABORTED anyway? If there is no good reason to skip to FAILURE 
>> from some errors, I'd suggest to keep the previous behavior. Maybe the 
>> good reason is to do some counting, but this means that on eg 
>> metacommand errors now the script would loop over instead of aborting, 
>> which does not look like a desirable change of behavior.
>
> Even in the case of meta command errors we must prepare for CSTATE_END_TX and 
> the execution of the next script: if necessary, clear the conditional stack 
> and rollback the current transaction block.

Seems ok.

>> commandFailed: I'm not thrilled by the added boolean, which is partially
>> redundant with the second argument.
>
> Do you mean that it is partially redundant with the argument "cmd" and, for 
> example, the meta commands errors always do not cause the abortions of the 
> client?

Yes. And also I'm not sure we should want this boolean at all.

> [...]
> If in such cases one command is placed on several lines, ISTM that the code 
> is more understandable if curly brackets are used...

Hmmm. Such basic style changes are avoided because they break 
backpatching, so we try to avoid gratuitous changes unless there is a 
strong added value, which does not seem to be the case here.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 17-08-2018 10:49, Fabien COELHO wrote:
> Hello Marina,
> 
>>> Detailed -r report. I understand from the doc that the retry number 
>>> on the detailed per-statement report is to identify at what point 
>>> errors occur? Probably this is more or less always at the same point 
>>> on a given script, so that the most interesting feature is to report 
>>> the number of retries at the script level.
>> 
>> This may depend on various factors.. for example:
>> [...]
>>        21.239           5          36  UPDATE xy3 SET y = y + :delta 
>> WHERE x = :x3;
>>        21.360           5          39  UPDATE xy2 SET y = y + :delta 
>> WHERE x = :x2;
> 
> Ok, not always the same point, and you confirm that it identifies
> where the error is raised which leads to a retry.

Yes, I confirm this. I'll try to write more clearly about this in the 
documentation...

>> So we can write something like this:
> 
>> All the transactions are divided into several types depending on their 
>> execution. Firstly, they can be divided into transactions that we 
>> started to execute, and transactions which were skipped (it was too 
>> late to execute them). Secondly, running transactions fall into 2 main 
>> types: is there any command that got a failure during the last 
>> execution of the transaction script or not? Thus
> 
> Here is an attempt at having a more precise and shorter version, not
> sure it is much better than yours, though:
> 
> """
> Transactions are counted depending on their execution and outcome. 
> First
> a transaction may have started or not: skipped transactions occur
> under --rate and --latency-limit when the client is too late to
> execute them. Secondly, a started transaction may ultimately succeed
> or fail on some error, possibly after some retries when --max-tries is
> not one. Thus
> """

Thank you!

>>> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, 
>>> otherwise "another" does not make sense yet.
>> 
>> Maybe firstly put a general group, and then special cases?...
> 
> I understand it more as a catch all default "none of the above" case.

Ok!

>>> commandFailed: I'm not thrilled by the added boolean, which is 
>>> partially
>>> redundant with the second argument.
>> 
>> Do you mean that it is partially redundant with the argument "cmd" 
>> and, for example, the meta commands errors always do not cause the 
>> abortions of the client?
> 
> Yes. And also I'm not sure we should want this boolean at all.

Perhaps we can use a separate function to print the messages about the 
client's abortion, something like this (it is assumed that all abortions 
happen when processing SQL commands):

static void
clientAborted(CState *st, const char *message)
{
    pgbench_error(...,
                  "client %d aborted in command %d (SQL) of script %d; %s\n",
                  st->id, st->command, st->use_file, message);
}

Or perhaps we can use a more detailed failure status so that for each type 
of failure we always know the command name (argument "cmd") and whether the 
client is aborted. Something like this (but in comparison with the first 
variant ISTM overly complicated):

/*
 * For the failures during script execution.
 */
typedef enum FailureStatus
{
    NO_FAILURE = 0,

    /*
     * Failures in meta commands. In these cases the failed transaction is
     * terminated.
     */
    META_SET_FAILURE,
    META_SETSHELL_FAILURE,
    META_SHELL_FAILURE,
    META_SLEEP_FAILURE,
    META_IF_FAILURE,
    META_ELIF_FAILURE,

    /*
     * Failures in SQL commands. In cases of serialization/deadlock
     * failures a failed transaction is re-executed from the very
     * beginning if possible; otherwise the failed transaction is
     * terminated.
     */
    SERIALIZATION_FAILURE,
    DEADLOCK_FAILURE,
    OTHER_SQL_FAILURE,            /* other failures in SQL commands that are not
                                 * listed by themselves above */

    /*
     * Failures while processing SQL commands. In this case the client is
     * aborted.
     */
    SQL_CONNECTION_FAILURE
} FailureStatus;

>> [...]
>> If in such cases one command is placed on several lines, ISTM that the 
>> code is more understandable if curly brackets are used...
> 
> Hmmm. Such basic style changes are avoided because they break
> backpatching, so we try to avoid gratuitous changes unless there is a
> strong added value, which does not seem to be the case here.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
>>>> commandFailed: I'm not thrilled by the added boolean, which is partially
>>>> redundant with the second argument.
>>> 
>>> Do you mean that it is partially redundant with the argument "cmd" and, 
>>> for example, the meta commands errors always do not cause the abortions of 
>>> the client?
>> 
>> Yes. And also I'm not sure we should want this boolean at all.
>
> Perhaps we can use a separate function to print the messages about client's 
> abortion, something like this (it is assumed that all abortions happen when 
> processing SQL commands):
>
> static void
> clientAborted(CState *st, const char *message)

Possibly.

> Or perhaps we can use a more detailed failure status so for each type of 
> failure we always know the command name (argument "cmd") and whether the 
> client is aborted. Something like this (but in comparison with the first 
> variant ISTM overly complicated):

I agree. I do not think that it would be useful given that the same thing 
is done on all meta-command error cases in the end.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 17-08-2018 14:04, Fabien COELHO wrote:
> ...
>> Or perhaps we can use a more detailed failure status so for each type 
>> of failure we always know the command name (argument "cmd") and 
>> whether the client is aborted. Something like this (but in comparison 
>> with the first variant ISTM overly complicated):
> 
> I agree., I do not think that it would be useful given that the same
> thing is done on all meta-command error cases in the end.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
Hello, hackers!

This is the eleventh version of the patch for error handling and 
retrying of transactions with serialization/deadlock failures in pgbench 
(based on the commit 14e9b2a752efaa427ce1b400b9aaa5a636898a04) thanks to 
the comments of Fabien Coelho and Arthur Zakirov in this thread.

v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
- a patch for the RandomState structure (this is used to reset a 
client's random seed during the repeating of transactions after 
serialization/deadlock failures).

v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

v11-0004-Pgbench-errors-use-a-separate-function-to-report.patch
- a patch for a separate error reporting function (this is used to 
report client failures that do not cause an abort, and this depends on 
the level of debugging). Although this is an attempt to fix duplicated 
code for debug messages (see [1]), it may seem mostly like refactoring and 
therefore may not seem very necessary for this set of patches (see [2], 
[3]), so this patch is now the last one and is optional.

Any suggestions are welcome!

[1] 
https://www.postgresql.org/message-id/20180405180807.0bc1114f%40wp.localdomain

> There is a lot of checks like "if (debug_level >= DEBUG_FAILS)" with
> corresponding fprintf(stderr..) I think it's time to do it like in the
> main code, wrap with some function like log(level, msg).

[2] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre

> However ISTM that it is not as necessary as the previous one, i.e. we
> could do without it to get the desired feature, so I see it more as a
> refactoring done "in passing", and I'm wondering whether it is
> really worth it because it adds some new complexity, so I'm not sure of
> the net benefit.

[3] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre

> I'm still not over enthousiastic with these changes, and still think 
> that
> it should be an independent patch, not submitted together with the 
> "retry
> on error" feature.

All that was fixed from the previous version:

[4] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808071823540.13466%40lancre

> I'm at odds with the proposed levels. ISTM that pgbench internal
> errors which warrant an immediate exit should be dubbed "FATAL",

> I'm unsure about the "log_min_messages" variable name, I'd suggest
> "log_level".
> 
> I do not see the asserts on LOG >= log_min_messages as useful, because
> the level can only be LOG or DEBUG anyway.

> * PQExpBuffer
> 
> I still do not see a positive value from importing PQExpBuffer
> complexity and cost into pgbench, as the resulting code is not very
> readable and it adds malloc/free cycles, so I'd try to avoid using
> PQExpBuf as much as possible. ISTM that all usages could be avoided in
> the patch, and most should be avoided even if ExpBuffer is imported
> because it is really useful somewhere.
> 
> - to call pgbench_error from pgbench_simple_error, you can do a
> pgbench_log_va(level, format, va_list) version called both from
> pgbench_error & pgbench_simple_error.
> 
> - for PGBENCH_DEBUG function, do separate calls per type, the very
> small partial code duplication is worth avoiding ExpBuf IMO.
> 
> - for doCustom debug: I'd just let the printf as it is, with a
> comment, as it is really very internal stuff for debug. Or I'd just
> snprintf a something in a static buffer.
> 
> ...
> 
> - for listAvailableScript: I'd simply call "pgbench_error(LOG" several
> times, once per line.
> 
> I see building a string with a format (printfExpBuf..) and then
> calling the pgbench_error function with just a "%s" format on the
> result as not very elegant, because the second format is somehow
> hacked around.

[5] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808101027390.9120%40lancre

> I suggest that the called function does only one simple thing,
> probably "DEBUG", and that the *caller* prints a message if it is 
> unhappy
> about the failure of the called function, as it is currently done. This
> allows providing context as well from the caller, eg "setting variable 
> %s
> failed while <some specific context>". The user can rerun under debug 
> for
> more detail if they need it.

[6] 
https://www.postgresql.org/message-id/20180810125327.GA2374%40zakirov.localdomain

> I agree with Fabien. Calling pgbench_error() inside pgbench_error()
> could be dangerous. I think "fmt" checking could be removed, or we may
> use Assert() or fprintf()+exit(1) at least.

[7] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808121057540.6189%40lancre

> * typo in comments: "varaibles"
> 
> * About enlargeVariables:
> 
> multiple INT_MAX error handling looks strange, especially as this code 
> can
> never be triggered because pgbench would be dead long before having
> allocated INT_MAX variables. So I would not bother to add such checks.

> I'm not sure that the size_t cast here and there are useful for any
> practical values likely to be encountered by pgbench.
> 
> The exponential allocation seems overkill. I'd simply add a constant
> number of slots, with a simple rule:
> 
>    /* reallocated with a margin */
>    if (max_vars < needed) max_vars = needed + 8;

[8] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808151046090.30050%40lancre

> A few comments about the doc.
> 
> According to the documentation, the feature is triggered by --max-tries 
> and
> --latency-limit. I disagree with the latter, because it means that 
> having
> latency limit without retrying is not supported anymore.
> 
> Maybe you can allow an "unlimited" max-tries, say with special value 
> zero,
> and the latency limit does its job if set, over all tries.
> 
> Doc: "error in meta commands" -> "meta command errors", for homogeneity 
> with
> other cases?

> Doc: "never occur.." -> "never occur", or eventually "...".
> 
> Doc: "Directly client errors" -> "Direct client errors".
> 
> I'm still in favor of asserting that the sql connection is idle (no tx 
> in
> progress) at the beginning and/or end of a script, and report a user 
> error
> if not, instead of writing complex caveats.

> I do not think that the RETRIES_ENABLED macro is a good thing. I'd 
> suggest
> to write the condition four times.
> 
> ISTM that "skipped" transactions are NOT "successful" so there are a 
> problem
> with comments. I believe that your formula are probably right, it has 
> more to do
> with what is "success". For cnt decomposition, ISTM that "other 
> transactions"
> are really "directly successful transactions".
> 
> I'd suggest to put "ANOTHER_SQL_FAILURE" as the last option, otherwise 
> "another"
> does not make sense yet. I'd suggest to name it "OTHER_SQL_FAILURE".

> I'm not sure of
> the LOG -> DEBUG_FAIL changes. I do not understand the name 
> "DEBUG_FAIL", has it
> is not related to debug, they just seem to be internal errors.

> inTransactionBlock: I disagree with any function other than doCustom 
> changing
> the client state, because it makes understanding the state machine 
> harder. There
> is already one exception to that (threadRun) that I wish to remove. All 
> state
> changes must be performed explicitly in doCustom.

> PQexec("ROOLBACK"): you are inserting a synchronous command, for which 
> the
> thread will have to wait for the result, in a middle of a framework 
> which
> takes great care to use only asynchronous stuff so that one thread can
> manage several clients efficiently. You cannot call PQexec there.
> From where I sit, I'd suggest to sendQuery("ROLLBACK"), then switch to
> a new state CSTATE_WAIT_ABORT_RESULT which would be similar to
> CSTATE_WAIT_RESULT, but on success would skip to RETRY or ABORT instead
> of proceeding to the next command.
> 
> ISTM that it would be more logical to only get into RETRY if there is a 
> retry,
> i.e. move the test RETRY/ABORT in FAILURE. For that, instead of 
> "canRetry",
> maybe you want "doRetry", which tells that a retry is possible (the 
> error
> is serializable or deadlock) and that the current parameters allow it
> (timeout, max retries).
> 
> * Minor C style comments:
> 
> if / else if / else if ... on *_FAILURE: I'd suggest a switch.
> 
> The following line removal does not seem useful, I'd have kept it:
> 
>    stats->cnt++;
>   -
>    if (skipped)
> 
> copyVariables: I'm not convinced that source_vars & nvars variables are 
> that
> useful.
> 
>    memcpy(&(st->retry_state.random_state), &(st->random_state), 
> sizeof(RandomState));
> 
> Is there a problem with "st->retry_state.random_state = 
> st->random_state;"
> instead of memcpy? ISTM that simple assignments work in C. Idem in the 
> reverse
> copy under RETRY.

> commandFailed: I'm not thrilled by the added boolean, which is 
> partially
> redundant with the second argument.
> 
>           if (per_script_stats)
>   -               accumStats(&sql_script[st->use_file].stats, skipped, 
> latency, lag);
>   +       {
>   +               accumStats(&sql_script[st->use_file].stats, skipped, 
> latency, lag,
>   +                                  st->failure_status, st->retries);
>   +       }
>    }
> 
> I do not see the point of changing the style here.

[9] 
https://www.postgresql.org/message-id/alpine.DEB.2.21.1808170917510.20841%40lancre

> Here is an attempt at having a more precise and shorter version, not 
> sure
> it is much better than yours, though:
> 
> """
> Transactions are counted depending on their execution and outcome. 
> First
> a transaction may have started or not: skipped transactions occur under
> --rate and --latency-limit when the client is too late to execute them.
> Secondly, a started transaction may ultimately succeed or fail on some
> error, possibly after some retries when --max-tries is not one. Thus
> """

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

About the first two preparatory patches.

> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> - a patch for the RandomState structure (this is used to reset a client's 
> random seed during the repeating of transactions after serialization/deadlock 
> failures).

Same version as the previous one, which was ok. Still applies, compiles, 
passes tests. Fine with me.

> v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
> - a patch for the Variables structure (this is used to reset client variables 
> during the repeating of transactions after serialization/deadlock failures).

Simpler version, applies cleanly on top of previous patch, compiles and 
global & local "make check" are ok. Fine with me as well.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
> - the main patch for handling client errors and repetition of transactions 
> with serialization/deadlock failures (see the detailed description in the 
> file).

About patch v11-3.

Patch applies cleanly on top of the other two. Compiles, global and local
"make check" are ok.

* Features

As far as the actual retry feature is concerned, I'd say we are nearly 
there. However, I have an issue with changing the behavior on meta-command 
and other SQL errors, which I find undesirable.

When a meta-command fails, before the patch the command is aborted and 
there is a convenient error message:

   sh> pgbench -T 10 -f bad-meta.sql
   bad-meta.sql:1: unexpected function name (false) in command "set" [...]
   \set i false + 1 [...]

After the patch it is simply counted, pgbench loops on the same error until 
the run completes, and there is no clue about the actual issue:

   sh> pgbench -T 10 -f bad-meta.sql
   starting vacuum...end.
   transaction type: bad-meta.sql
   duration: 10 s
   number of transactions actually processed: 0
   number of failures: 27993953 (100.000%)
   ...

Same thing about SQL errors, an immediate abort...

   sh> pgbench -T 10 -f bad-sql.sql
   starting vacuum...end.
   client 0 aborted in command 0 of script 0; ERROR:  syntax error at or near ";"
   LINE 1: SELECT 1 + ;

... is turned into counting, without aborting and without error messages, 
so that there is no clue that the user was asking for something bad.

   sh> pgbench -T 10 -f bad-sql.sql
   starting vacuum...end.
   transaction type: bad-sql.sql
   scaling factor: 1
   query mode: simple
   number of clients: 1
   number of threads: 1
   duration: 10 s
   number of transactions actually processed: 0
   number of failures: 274617 (100.000%)
   # no clue that there was a syntax error in the script

I do not think that these changes of behavior are desirable. Meta-command and
miscellaneous SQL errors should result in immediately aborting the whole run,
because the client test code itself could not run correctly or the SQL sent
was somehow wrong, which is also the client's fault, and the server 
performance bench does not make much sense in such conditions.

ISTM that the focus of this patch should only be to handle some server 
runtime errors that can be retried, but not to change pgbench behavior on 
other kinds of errors. If these are to be changed, ISTM that it would be a 
distinct patch and would require some discussion, and possibly an option 
to enable it or not if some use case emerges. As far as this patch is 
concerned, I'd suggest leaving that out.


Doc says "you cannot use an infinite number of retries without latency-limit..."

Why should this be forbidden? At least if the -T timeout takes precedence and
shortens the execution, ISTM that there could be good reason to test that.
Maybe it could be blocked only under -t if this would lead to a non-ending
run.


As "--print-errors" is really for debug, maybe it could be named
"--debug-errors". I'm not sure that having "--debug" implying this option
is useful: As there are two distinct options, the user may be allowed
to trigger one or the other as they wish?


* Code

The following remarks are linked to the change of behavior discussed above:
the makeVariableValue error message is not for debug, but must be kept in all
cases, and a false return must result in an immediate abort. Same thing about
lookupCreateVariable: an invalid name is a user error which warrants an immediate
abort. Same thing again about the coerce* functions or evalStandardFunc...
Basically, most/all added "debug_level >= DEBUG_ERRORS" checks are not desirable.

sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is
really about the transaction script, not dealing with errors, esp as there is no
significant advantage in preparing a "ROLLBACK" statement which is short and has
no parameters. I'd suggest to remove this function and just issue
PQsendQuery("ROLLBACK;") in all cases.
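To sketch the suggested simplification (illustrative only, not the patch's 
actual code): a plain asynchronous ROLLBACK keeps the one-thread-many-clients 
model intact, and the state machine then waits for its result like for any 
other query:

    #include <stdbool.h>
    #include <libpq-fe.h>

    /* dispatch an asynchronous ROLLBACK; returns false if submission failed */
    static bool
    send_rollback(PGconn *con)
    {
        /* PQsendQuery() returns 1 on successful dispatch, 0 on failure */
        return PQsendQuery(con, "ROLLBACK;") == 1;
    }

The caller would then move the client into a CSTATE_WAIT_RESULT-like state 
and decide between retry and abort once the result arrives.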

In copyVariables, I'd simplify

  + if (source_var->svalue == NULL)
  +   dest_var->svalue = NULL;
  + else
  +   dest_var->svalue = pg_strdup(source_var->svalue);

as:

   dest_var->svalue = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue);

  + if (sqlState)   ->   if (sqlState != NULL) ?


Function getTransactionStatus name does not seem to correspond fully to what the
function does. There is a passthru case which should be either avoided or
clearly commented.


About:

  - commandFailed(st, "SQL", "perhaps the backend died while processing");
  + clientAborted(st,
  +              "perhaps the backend died while processing");

keep on one line?


About:

  + if (doRetry(st, &now))
  +   st->state = CSTATE_RETRY;
  + else
  +   st->state = CSTATE_FAILURE;

-> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;


* Comments

"There're different types..." -> "There are different types..."

"after the errors and"... -> "after errors and"...

"the default value of max_tries is set to 1" -> "the default value
of max_tries is 1"

"We cannot retry the transaction" -> "We cannot retry a transaction"

"may ultimately succeed or get a failure," -> "may ultimately succeed or fail,"

Overall, the comment text in StatsData is very clear. However, the comments are not
clearly linked to the struct fields. I'd suggest that each field, when used,
should be quoted, so as to separate English from code, and the struct name
should always be used explicitly when possible.

I'd insist in a comment that "cnt" does not include "skipped" transactions
(anymore).


* Documentation:

Some suggestions which may be improvements, although I'm not a native English
speaker.

ISTM that there are too many "the":
  - "turns on the option ..." -> "turns on option ..."
  - "When the option ..." -> "When option ..."
  - "By default the option ..." -> "By default option ..."
  - "only if the option ..." -> "only if option ..."
  - "combined with the option ..." -> "combined with option ..."
  - "without the option ..." -> "without option ..."
  - "is the sum of all the retries" -> "is the sum of all retries"

"infinite" -> "unlimited"

"not retried at all" -> "not retried" (maybe several times).

"messages of all errors" -> "messages about all errors".

"It is assumed that the scripts used do not contain" ->
"It is assumed that pgbench scripts do not contain"


About v11-4: I do not feel that these changes are very useful/important 
for now. I'd propose that you prioritize updating v11-3 so that we can 
have another round on it as soon as possible, and keep that one for later.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 08-09-2018 10:17, Fabien COELHO wrote:
> Hello Marina,

Hello, Fabien!

> About the two first preparatory patches.
> 
>> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's random seed during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> Same version as the previous one, which was ok. Still applies,
> compiles, passes tests. Fine with me.
> 
>> v11-0002-Pgbench-errors-use-the-Variables-structure-for-c.patch
>> - a patch for the Variables structure (this is used to reset client 
>> variables during the repeating of transactions after 
>> serialization/deadlock failures).
> 
> Simpler version, applies cleanly on top of previous patch, compiles
> and global & local "make check" are ok. Fine with me as well.

Glad to hear it :)

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 08-09-2018 16:03, Fabien COELHO wrote:
> Hello Marina,
> 
>> v11-0003-Pgbench-errors-and-serialization-deadlock-retrie.patch
>> - the main patch for handling client errors and repetition of 
>> transactions with serialization/deadlock failures (see the detailed 
>> description in the file).
> 
> About patch v11-3.
> 
> Patch applies cleanly on top of the other two. Compiles, global and 
> local
> "make check" are ok.

:-)

> * Features
> 
> As far as the actual retry feature is concerned, I'd say we are nearly
> there. However I have an issue with changing the behavior on meta
> command and other sql errors, which I find not desirable.
> 
> When a meta-command fails, before the patch the command is aborted and
> there is a convenient error message:
> 
>   sh> pgbench -T 10 -f bad-meta.sql
>   bad-meta.sql:1: unexpected function name (false) in command "set" 
> [...]
>   \set i false + 1 [...]
> 
> After the patch it is simply counted, pgbench loops on the same error
> till the time is completed, and there are no clue about the actual
> issue:
> 
>   sh> pgbench -T 10 -f bad-meta.sql
>   starting vacuum...end.
>   transaction type: bad-meta.sql
>   duration: 10 s
>   number of transactions actually processed: 0
>   number of failures: 27993953 (100.000%)
>   ...
> 
> Same thing about SQL errors, an immediate abort...
> 
>   sh> pgbench -T 10 -f bad-sql.sql
>   starting vacuum...end.
>   client 0 aborted in command 0 of script 0; ERROR:  syntax error at or 
> near ";"
>   LINE 1: SELECT 1 + ;
> 
> ... is turned into counting without aborting nor error messages, so
> that there is no clue that the user was asking for something bad.
> 
>   sh> pgbench -T 10 -f bad-sql.sql
>   starting vacuum...end.
>   transaction type: bad-sql.sql
>   scaling factor: 1
>   query mode: simple
>   number of clients: 1
>   number of threads: 1
>   duration: 10 s
>   number of transactions actually processed: 0
>   number of failures: 274617 (100.000%)
>   # no clue that there was a syntax error in the script
> 
> I do not think that these changes of behavior are desirable. Meta 
> command and
> miscellaneous SQL errors should result in immediatly aborting the whole 
> run,
> because the client test code itself could not run correctly or the SQL 
> sent
> was somehow wrong, which is also the client's fault, and the server
> performance bench does not make much sense in such conditions.
> 
> ISTM that the focus of this patch should only be to handle some server
> runtime errors that can be retryed, but not to change pgbench behavior
> on other kind of errors. If these are to be changed, ISTM that it
> would be a distinct patch and would require some discussion, and
> possibly an option to enable it or not if some use case emerge. AFA
> this patch is concerned, I'd suggest to let that out.
...
> The following remarks are linked to the change of behavior discussed 
> above:
> makeVariableValue error message is not for debug, but must be kept in 
> all
> cases, and the false returned must result in an immediate abort. Same
> thing about
> lookupCreateVariable, an invalid name is a user error which warrants
> an immediate
> abort. Same thing again about coerce* functions or evalStandardFunc...
> Basically, most/all added "debug_level >= DEBUG_ERRORS" are not 
> desirable.

Hmm, but we can say the same for serialization or deadlock errors that 
were not retried (the client test code itself could not run correctly or 
the SQL sent was somehow wrong, which is also the client's fault), can't 
we? Why not handle client errors that can occur (but may also not 
occur) the same way? (For example, always abort the client, or 
conversely never abort in these cases.) Here's an example of such an 
error:

starting vacuum...end.
transaction type: pgbench_rare_sql_error.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 250
number of transactions actually processed: 2500/2500
maximum number of tries: 1
latency average = 0.375 ms
tps = 26695.292848 (including connections establishing)
tps = 27489.678525 (excluding connections establishing)
statement latencies in milliseconds and failures:
          0.001           0  \set divider random(-1000, 1000)
          0.245           0  SELECT 1 / :divider;

starting vacuum...end.
client 5 got an error in command 1 (SQL) of script 0; ERROR:  division 
by zero

client 0 got an error in command 1 (SQL) of script 0; ERROR:  division 
by zero

client 7 got an error in command 1 (SQL) of script 0; ERROR:  division 
by zero

transaction type: pgbench_rare_sql_error.sql
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 250
number of transactions actually processed: 2497/2500
number of failures: 3 (0.120%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 3 (0.120%)
maximum number of tries: 1
latency average = 0.579 ms (including failures)
tps = 17240.662547 (including connections establishing)
tps = 17862.090137 (excluding connections establishing)
statement latencies in milliseconds and failures:
          0.001           0  \set divider random(-1000, 1000)
          0.338           3  SELECT 1 / :divider;

Maybe we can limit the number of failures in one statement, and abort 
the client if this limit is exceeded?...
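A minimal sketch of that idea (purely hypothetical, not part of the 
submitted patches): the decision could be a per-statement failure counter 
compared against a cap, for example:

    #include <stdbool.h>
    #include <stdio.h>

    /* hypothetical check: abort the client once one statement has failed
       too many times, so a buggy script cannot loop on the same error for
       the whole run */
    static bool
    should_abort_client(int client_id, int statement_failures, int max_failures)
    {
        if (statement_failures >= max_failures)
        {
            fprintf(stderr, "client %d aborted: statement failed %d times\n",
                    client_id, statement_failures);
            return true;
        }
        return false;
    }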

To get a clue about the actual issue you can use the options 
--failures-detailed (to find out whether this is a serialization 
failure / deadlock failure / other SQL failure / meta command failure) 
and/or --print-errors (to get the complete error message).

> Doc says "you cannot use an infinite number of retries without 
> latency-limit..."
> 
> Why should this be forbidden? At least if -T timeout takes precedent 
> and
> shortens the execution, ISTM that there could be good reason to test 
> that.
> Maybe it could be blocked only under -t if this would lead to an 
> non-ending
> run.
...
> * Comments
> 
> "There're different types..." -> "There are different types..."
> 
> "after the errors and"... -> "after errors and"...
> 
> "the default value of max_tries is set to 1" -> "the default value
> of max_tries is 1"
> 
> "We cannot retry the transaction" -> "We cannot retry a transaction"
> 
> "may ultimately succeed or get a failure," -> "may ultimately succeed 
> or fail,"
...
> * Documentation:
> 
> Some suggestions which may be improvements, although I'm not a native 
> English
> speaker.
> 
> ISTM that there are too many "the":
>  - "turns on the option ..." -> "turns on option ..."
>  - "When the option ..." -> "When option ..."
>  - "By default the option ..." -> "By default option ..."
>  - "only if the option ..." -> "only if option ..."
>  - "combined with the option ..." -> "combined with option ..."
>  - "without the option ..." -> "without option ..."
>  - "is the sum of all the retries" -> "is the sum of all retries"
> 
> "infinite" -> "unlimited"
> 
> "not retried at all" -> "not retried" (maybe several times).
> 
> "messages of all errors" -> "messages about all errors".
> 
> "It is assumed that the scripts used do not contain" ->
> "It is assumed that pgbench scripts do not contain"

Thank you, I'll fix this.

If you use the option --latency-limit, the time spent on tries will be limited 
regardless of whether the option -t is used. Therefore ISTM that an unlimited 
number of tries can be used only if the time spent on tries is limited by the 
options -T and/or -L.
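A sketch of the corresponding cross-option check (variable names are 
assumptions for illustration): unlimited tries are rejected unless -T 
and/or -L bound the time spent on tries:

    #include <stdio.h>
    #include <stdlib.h>

    /* max_tries == 0 means "unlimited"; duration is -T, latency_limit is -L */
    static void
    check_retry_options(int max_tries, int duration, double latency_limit)
    {
        if (max_tries == 0 && duration <= 0 && latency_limit <= 0.0)
        {
            fprintf(stderr,
                    "an unlimited number of transaction tries can only be used "
                    "with --latency-limit or a duration (-T)\n");
            exit(1);
        }
    }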

> As "--print-errors" is really for debug, maybe it could be named
> "--debug-errors".

Ok!

> I'm not sure that having "--debug" implying this option
> is useful: As there are two distinct options, the user may be allowed
> to trigger one or the other as they wish?

I'm not sure that the main debugging output will give a good clue about 
what happened without full messages about errors, retries and 
failures...

> * Code
> 
> <...>
>
> sendRollback(): I'd suggest to simplify. The prepare/extended statement 
> stuff is
> really about the transaction script, not dealing with errors, esp as 
> there is no
> significant advantage in preparing a "ROLLBACK" statement which is 
> short and has
> no parameters. I'd suggest to remove this function and just issue
> PQsendQuery("ROLLBACK;") in all cases.

Ok!

> In copyVariables, I'd simplify
> 
>  + if (source_var->svalue == NULL)
>  +   dest_var->svalue = NULL;
>  + else
>  +   dest_var->svalue = pg_strdup(source_var->svalue);
> 
> as:
> 
>   dest_var->value = (source_var->svalue == NULL) ? NULL :
> pg_strdup(source_var->svalue);

> About:
> 
>  + if (doRetry(st, &now))
>  +   st->state = CSTATE_RETRY;
>  + else
>  +   st->state = CSTATE_FAILURE;
> 
> -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;

These lines are quite long - do you suggest wrapping them this way?

+        dest_var->svalue = ((source_var->svalue == NULL) ? NULL :
+                            pg_strdup(source_var->svalue));

+                        st->state = (doRetry(st, &now) ? CSTATE_RETRY :
+                                     CSTATE_FAILURE);

>  + if (sqlState)   ->   if (sqlState != NULL) ?

Ok!

> Function getTransactionStatus name does not seem to correspond fully to 
> what the
> function does. There is a passthru case which should be either avoided 
> or
> clearly commented.

I don't quite understand you - do you mean that in fact this function 
finds out whether we are in a (failed) transaction block or not? Or do 
you mean that the case of PQTRANS_INTRANS is also ok?...

> About:
> 
>  - commandFailed(st, "SQL", "perhaps the backend died while 
> processing");
>  + clientAborted(st,
>  +              "perhaps the backend died while processing");
> 
> keep on one line?

I tried not to break the limit of 80 characters, but if you think that 
this is better, I'll change it.

> Overall, the comment text in StatsData is very clear. However they are 
> not
> clearly linked to the struct fields. I'd suggest that earch field when 
> used
> should be quoted, so as to separate English from code, and the struct 
> name
> should always be used explicitely when possible.

Ok!

> I'd insist in a comment that "cnt" does not include "skipped" 
> transactions
> (anymore).

If you mean CState.cnt I'm not sure if this is practically useful 
because the code uses only the sum of all client transactions including 
skipped and failed... Maybe we can rename this field to nxacts or 
total_cnt?

> About v11-4. I'm do not feel that these changes are very
> useful/important for now. I'd propose that your prioritize on updating
> 11-3 so that we can have another round about it as soon as possible,
> and keep that one later.

Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-09-2018 16:47, Marina Polyakova wrote:
> On 08-09-2018 16:03, Fabien COELHO wrote:
>> Hello Marina,
>> I'd insist in a comment that "cnt" does not include "skipped" 
>> transactions
>> (anymore).
> 
> If you mean CState.cnt I'm not sure if this is practically useful
> because the code uses only the sum of all client transactions
> including skipped and failed... Maybe we can rename this field to
> nxacts or total_cnt?

Sorry, I misread your proposal the first time. Ok!

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> Hmm, but we can say the same for serialization or deadlock errors that were 
> not retried (the client test code itself could not run correctly or the SQL 
> sent was somehow wrong, which is also the client's fault), can't we?

I think not.

If a client asks for something "legal", but some other client in parallel 
happens to make an incompatible change which results in a serialization or 
deadlock error, the clients are not responsible for the raised errors; it 
is just that they happen to ask for something incompatible at the same 
time. So there is no user error per se, but the server is reporting its 
(temporary) inability to process what was asked for. For these errors, 
retrying is fine. If the client were alone, there would be no such errors: 
you cannot deadlock with yourself. This is really an isolation issue 
linked to parallel execution.
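In code terms this distinction boils down to the SQLSTATE of the error; a 
sketch (using libpq's PQresultErrorField(), with the standard codes 40001 
for serialization_failure and 40P01 for deadlock_detected) could be:

    #include <stdbool.h>
    #include <string.h>
    #include <libpq-fe.h>    /* PG_DIAG_SQLSTATE comes via postgres_ext.h */

    /* only errors caused by concurrent activity are worth retrying */
    static bool
    error_is_retryable(const PGresult *res)
    {
        const char *sqlstate = PQresultErrorField(res, PG_DIAG_SQLSTATE);

        if (sqlstate == NULL)
            return false;    /* e.g. connection-level failure */

        return strcmp(sqlstate, "40001") == 0 ||    /* serialization_failure */
               strcmp(sqlstate, "40P01") == 0;      /* deadlock_detected */
    }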

> Why not handle client errors that can occur (but they may also not 
> occur) the same way? (For example, always abort the client, or 
> conversely do not make aborts in these cases.) Here's an example of such 
> error:

> client 5 got an error in command 1 (SQL) of script 0; ERROR:  division by zero

This is an interesting case. For me we must stop the script because the 
client is asking for something "stupid", and retrying the same won't 
change the outcome: the division will still be by zero. It is the client's 
responsibility not to ask for something stupid; the bench script is buggy, 
it should not submit illegal SQL queries. This is quite different from 
submitting something legal which happens to fail.

> Maybe we can limit the number of failures in one statement, and abort the 
> client if this limit is exceeded?...

I think this is quite debatable, and that the best option is to leave 
this point out of the current patch, so that we could have retry on 
serialization/deadlock errors.

Then you can submit another patch for a feature about other errors if you 
feel that there is a use case for going on in some cases. I think that the 
previous behavior made sense, and that changing it should only be 
considered as an option. As it involves discussion and is not obvious, 
later is better.

> To get a clue about the actual issue you can use the options 
> --failures-detailed (to find out out whether this is a serialization failure 
> / deadlock failure / other SQL failure / meta command failure) and/or 
> --print-errors (to get the complete error message).

Yep, but for me it should have stopped immediately, as it did before.

> If you use the option --latency-limit, the time of tries will be limited 
> regardless of the use of the option -t. Therefore ISTM that an unlimited 
> number of tries can be used only if the time of tries is limited by the 
> options -T and/or -L.

Indeed, I'm ok with forbidding unlimited retries when under -t.

>> I'm not sure that having "--debug" implying this option
>> is useful: As there are two distinct options, the user may be allowed
>> to trigger one or the other as they wish?
>
> I'm not sure that the main debugging output will give a good clue of what's 
> happened without full messages about errors, retries and failures...

I'm arguing more for letting the user decide what they want.

> These lines are quite long - do you suggest to wrap them this way?

Sure, if it is too long, then wrap.

>> Function getTransactionStatus name does not seem to correspond fully to 
>> what the function does. There is a passthru case which should be either 
>> avoided or clearly commented.
>
> I don't quite understand you - do you mean that in fact this function finds 
> out whether we are in a (failed) transaction block or not? Or do you mean 
> that the case of PQTRANS_INTRANS is also ok?...

The former: although the function is named "getTransactionStatus", it does 
not really return the "status" of the transaction (aka PQstatus()?).
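For illustration, what such a helper conceptually computes (a sketch built 
on libpq's PQtransactionStatus(), not the patch's actual function):

    #include <stdbool.h>
    #include <libpq-fe.h>

    /* report whether the connection is inside a transaction block,
       whether that block is still valid or already failed */
    static bool
    in_transaction_block(const PGconn *con)
    {
        switch (PQtransactionStatus(con))
        {
            case PQTRANS_INTRANS:    /* inside a valid transaction block */
            case PQTRANS_INERROR:    /* inside a failed transaction block */
                return true;
            case PQTRANS_IDLE:       /* no transaction in progress */
            case PQTRANS_ACTIVE:     /* a command is currently in flight */
            case PQTRANS_UNKNOWN:    /* e.g. bad connection */
            default:
                return false;
        }
    }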

> I tried not to break the limit of 80 characters, but if you think that this 
> is better, I'll change it.

Hmmm. 80 columns, indeed...

>> I'd insist in a comment that "cnt" does not include "skipped" transactions
>> (anymore).
>
> If you mean CState.cnt I'm not sure if this is practically useful because the 
> code uses only the sum of all client transactions including skipped and 
> failed... Maybe we can rename this field to nxacts or total_cnt?

I'm fine with renaming the field if it makes things clearer. They are all 
counters, so naming them "cnt" or "total_cnt" does not help much. Maybe 
"succeeded" or "success" to show what is really counted?

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 11-09-2018 18:29, Fabien COELHO wrote:
> Hello Marina,
> 
>> Hmm, but we can say the same for serialization or deadlock errors that 
>> were not retried (the client test code itself could not run correctly 
>> or the SQL sent was somehow wrong, which is also the client's fault), 
>> can't we?
> 
> I think not.
> 
> If a client asks for something "legal", but some other client in
> parallel happens to make an incompatible change which result in a
> serialization or deadlock error, the clients are not responsible for
> the raised errors, it is just that they happen to ask for something
> incompatible at the same time. So there is no user error per se, but
> the server is reporting its (temporary) inability to process what was
> asked for. For these errors, retrying is fine. If the client was
> alone, there would be no such errors, you cannot deadlock with
> yourself. This is really an isolation issue linked to parallel
> execution.

You can get other errors that cannot happen with only one client if you 
use shell commands in meta commands:

starting vacuum...end.
transaction type: pgbench_meta_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
maximum number of tries: 1
latency average = 6.953 ms
tps = 287.630161 (including connections establishing)
tps = 303.232242 (excluding connections establishing)
statement latencies in milliseconds and failures:
          1.636           0  BEGIN;
          1.497           0  \setshell var mkdir my_directory && echo 1
          0.007           0  \sleep 1 us
          1.465           0  \setshell var rmdir my_directory && echo 1
          1.622           0  END;

starting vacuum...end.
mkdir: cannot create directory ‘my_directory’: File exists
mkdir: could not read result of shell command
client 1 got an error in command 1 (setshell) of script 0; execution of 
meta-command failed
transaction type: pgbench_meta_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 19/20
number of failures: 1 (5.000%)
number of meta-command failures: 1 (5.000%)
maximum number of tries: 1
latency average = 11.782 ms (including failures)
tps = 161.269033 (including connections establishing)
tps = 167.733278 (excluding connections establishing)
statement latencies in milliseconds and failures:
          2.731           0  BEGIN;
          2.909           1  \setshell var mkdir my_directory && echo 1
          0.231           0  \sleep 1 us
          2.366           0  \setshell var rmdir my_directory && echo 1
          2.664           0  END;

Or if you use untrusted procedural languages in SQL expressions (see the 
attached file):

starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
client 1 got an error in command 0 (SQL) of script 0; ERROR:  could not 
create the directory "my_directory": File exists at line 3.
CONTEXT:  PL/Perl anonymous code block

client 1 got an error in command 0 (SQL) of script 0; ERROR:  could not 
create the directory "my_directory": File exists at line 3.
CONTEXT:  PL/Perl anonymous code block

transaction type: pgbench_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 18/20
number of failures: 2 (10.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 2 (10.000%)
maximum number of tries: 1
latency average = 3.282 ms (including failures)
tps = 548.437196 (including connections establishing)
tps = 637.662753 (excluding connections establishing)
statement latencies in milliseconds and failures:
          1.566           2  DO $$

starting vacuum...ERROR:  relation "pgbench_branches" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_tellers" does not exist
(ignoring this error and continuing anyway)
ERROR:  relation "pgbench_history" does not exist
(ignoring this error and continuing anyway)
end.
transaction type: pgbench_concurrent_error.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 20/20
maximum number of tries: 1
latency average = 2.760 ms
tps = 724.746078 (including connections establishing)
tps = 853.131985 (excluding connections establishing)
statement latencies in milliseconds and failures:
          1.893           0  DO $$

Or if you try to create a function and perhaps replace an existing one:

starting vacuum...end.
client 0 got an error in command 0 (SQL) of script 0; ERROR:  duplicate 
key value violates unique constraint "pg_proc_proname_args_nsp_index"
DETAIL:  Key (proname, proargtypes, pronamespace)=(my_function, , 2200) 
already exists.

client 0 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 0 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 1 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

client 0 got an error in command 0 (SQL) of script 0; ERROR:  tuple 
concurrently updated

transaction type: pgbench_create_function.sql
scaling factor: 1
query mode: simple
number of clients: 2
number of threads: 1
number of transactions per client: 10
number of transactions actually processed: 10/20
number of failures: 10 (50.000%)
number of serialization failures: 0 (0.000%)
number of deadlock failures: 0 (0.000%)
number of other SQL failures: 10 (50.000%)
maximum number of tries: 1
latency average = 82.881 ms (including failures)
tps = 12.065492 (including connections establishing)
tps = 12.092216 (excluding connections establishing)
statement latencies in milliseconds and failures:
         82.549          10  CREATE OR REPLACE FUNCTION my_function() 
RETURNS integer AS 'select 1;' LANGUAGE SQL;

>> Why not handle client errors that can occur (but they may also not 
>> occur) the same way? (For example, always abort the client, or 
>> conversely do not make aborts in these cases.) Here's an example of 
>> such error:
> 
>> client 5 got an error in command 1 (SQL) of script 0; ERROR:  division 
>> by zero
> 
> This is an interesting case. For me we must stop the script because
> the client is asking for something "stupid", and retrying the same
> won't change the outcome, the division will still be by zero. It is
> the client responsability not to ask for something stupid, the bench
> script is buggy, it should not submit illegal SQL queries. This is
> quite different from submitting something legal which happens to fail.
> ...
>>> I'm not sure that having "--debug" implying this option
>>> is useful: As there are two distinct options, the user may be allowed
>>> to trigger one or the other as they wish?
>> 
>> I'm not sure that the main debugging output will give a good clue of 
>> what's happened without full messages about errors, retries and 
>> failures...
> 
> I'm more argumenting about letting the user decide what they want.
> 
>> These lines are quite long - do you suggest to wrap them this way?
> 
> Sure, if it is too long, then wrap.

Ok!

>>> Function getTransactionStatus name does not seem to correspond fully 
>>> to what the function does. There is a passthru case which should be 
>>> either avoided or clearly commented.
>> 
>> I don't quite understand you - do you mean that in fact this function 
>> finds out whether we are in a (failed) transaction block or not? Or do 
>> you mean that the case of PQTRANS_INTRANS is also ok?...
> 
> The former: although the function is named "getTransactionStatus", it
> does not really return the "status" of the transaction (aka
> PQstatus()?).

Thank you, I'll think about how to improve it. Perhaps the name 
checkTransactionStatus would be better...

>>> I'd insist in a comment that "cnt" does not include "skipped" 
>>> transactions
>>> (anymore).
>> 
>> If you mean CState.cnt I'm not sure if this is practically useful 
>> because the code uses only the sum of all client transactions 
>> including skipped and failed... Maybe we can rename this field to 
>> nxacts or total_cnt?
> 
> I'm fine with renaming the field if it makes thinks clearer. They are
> all counters, so naming them "cnt" or "total_cnt" does not help much.
> Maybe "succeeded" or "success" to show what is really counted?

Perhaps renaming StatsData.cnt is better than just adding a comment 
to this field. But IMO we have the same problem ("They are all counters, 
so naming them "cnt" or "total_cnt" does not help much.") with CState.cnt, 
which cannot be named in the same way because it also includes skipped 
and failed transactions.

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Marina,

> You can get other errors that cannot happen for only one client if you use 
> shell commands in meta commands:

> Or if you use untrusted procedural languages in SQL expressions (see the used 
> file in the attachments):

> Or if you try to create a function and perhaps replace an existing one:

Sure. Indeed there can be shell errors, perl errors, create function 
conflicts... I do not understand what your point is wrt these.

I'm mostly saying that your patch should focus on implementing the retry 
feature when appropriate, and avoid changing the behavior (error 
displayed, abort or not) on features unrelated to serialization & deadlock 
errors.

Maybe there are inconsistencies, and "bugs"/"features" worth fixing, but if 
so that should be a separate patch, if possible, and if these are bugs 
they could be backpatched.

For now I'm still convinced that pgbench should keep on aborting on "\set" 
or SQL syntax errors, and show clear error messages on these, and your 
examples have not changed my mind on that point.

>> I'm fine with renaming the field if it makes thinks clearer. They are
>> all counters, so naming them "cnt" or "total_cnt" does not help much.
>> Maybe "succeeded" or "success" to show what is really counted?
>
> Perhaps renaming of StatsData.cnt is better than just adding a comment to 
> this field. But IMO we have the same problem (They are all counters, so 
> naming them "cnt" or "total_cnt" does not help much.) for CState.cnt which 
> cannot be named in the same way because it also includes skipped and failed 
> transactions.

Hmmm. CState's cnt seems only used to implement -t anyway? I'm okay if it 
has a different name, esp if it has different semantics. I think I was 
arguing only about cnt in StatsData.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 12-09-2018 17:04, Fabien COELHO wrote:
> Hello Marina,
> 
>> You can get other errors that cannot happen for only one client if you 
>> use shell commands in meta commands:
> 
>> Or if you use untrusted procedural languages in SQL expressions (see 
>> the used file in the attachments):
> 
>> Or if you try to create a function and perhaps replace an existing 
>> one:
> 
> Sure. Indeed there can be shell errors, perl errors, create functions
> conflicts... I do not understand what is your point wrt these.
> 
> I'm mostly saying that your patch should focus on implementing the
> retry feature when appropriate, and avoid changing the behavior (error
> displayed, abort or not) on features unrelated to serialization &
> deadlock errors.
> 
> Maybe there are inconsistencies, and "bug"/"feature" worth fixing, but
> if so that should be a separate patch, if possible, and if these are
> bugs they could be backpatched.
> 
> For now I'm still convinced that pgbench should keep on aborting on
> "\set" or SQL syntax errors, and show clear error messages on these,
> and your examples have not changed my mind on that point.
> 
>>> I'm fine with renaming the field if it makes thinks clearer. They are
>>> all counters, so naming them "cnt" or "total_cnt" does not help much.
>>> Maybe "succeeded" or "success" to show what is really counted?
>> 
>> Perhaps renaming of StatsData.cnt is better than just adding a comment 
>> to this field. But IMO we have the same problem (They are all 
>> counters, so naming them "cnt" or "total_cnt" does not help much.) for 
>> CState.cnt which cannot be named in the same way because it also 
>> includes skipped and failed transactions.
> 
> Hmmm. CState's cnt seems only used to implement -t anyway? I'm okay if
> it has a different name, esp if it has a different semantics.

Ok!

> I think
> I was arguing only about cnt in StatsData.

The discussion about this has become entangled from the beginning, 
because as I wrote in [1] at first I misread your original proposal...

[1] 
https://www.postgresql.org/message-id/d318cdee8f96de6b1caf2ce684ffe4db%40postgrespro.ru

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Michael Paquier
Дата:
On Wed, Sep 12, 2018 at 06:12:29PM +0300, Marina Polyakova wrote:
> The discussion about this has become entangled from the beginning, because
> as I wrote in [1] at first I misread your original proposal...

The last emails are about the last reviews of Fabien, which have remained
unanswered for the last couple of weeks.  I am marking this patch as
returned with feedback for now.
--
Michael

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Sep-05, Marina Polyakova wrote:

> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> - a patch for the RandomState structure (this is used to reset a client's
> random seed during the repeating of transactions after
> serialization/deadlock failures).

Pushed this one with minor stylistic changes (the most notable of which
is the move of initRandomState to where the rest of the random generator
infrastructure is, instead of in a totally random place).  Thanks,

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Marina Polyakova
Дата:
On 2018-11-16 22:59, Alvaro Herrera wrote:
> On 2018-Sep-05, Marina Polyakova wrote:
> 
>> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
>> - a patch for the RandomState structure (this is used to reset a 
>> client's
>> random seed during the repeating of transactions after
>> serialization/deadlock failures).
> 
> Pushed this one with minor stylistic changes (the most notable of which
> is the move of initRandomState to where the rest of the random 
> generator
> infrastructure is, instead of in a totally random place).  Thanks,

Thank you very much! I'm going to send a new patch set by the end of 
this week (I'm sorry, I was very busy with the release of Postgres Pro 
11...).

-- 
Marina Polyakova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Nov-19, Marina Polyakova wrote:

> On 2018-11-16 22:59, Alvaro Herrera wrote:
> > On 2018-Sep-05, Marina Polyakova wrote:
> > 
> > > v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> > > - a patch for the RandomState structure (this is used to reset a
> > > client's
> > > random seed during the repeating of transactions after
> > > serialization/deadlock failures).
> > 
> > Pushed this one with minor stylistic changes (the most notable of which
> > is the move of initRandomState to where the rest of the random generator
> > infrastructure is, instead of in a totally random place).  Thanks,
> 
> Thank you very much! I'm going to send a new patch set until the end of this
> week (I'm sorry I was very busy in the release of Postgres Pro 11...).

Great, thanks.

I also think that the pgbench_error() patch should go in before the main
one.  It seems a bit pointless to introduce code using a bad API only to
fix the API together with all the new callers immediately afterwards.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Alvaro,

> I also think that the pgbench_error() patch should go in before the main
> one.  It seems a bit pointless to introduce code using a bad API only to
> fix the API together with all the new callers immediately afterwards.

I'm not that keen on this part of the patch, because ISTM that it introduces 
significant and possibly costly malloc/free cycles when handling errors, 
which do not currently exist in pgbench.

Previously an error was basically the end of the script, but with the 
feature being introduced by Marina some errors are handled, in which case 
we end up paying these costs in the test loop. Also, refactoring 
error handling is not necessary for the new feature. That is why I advised 
moving it out and possibly keeping it for later.

Related to Marina's patch (triggered by reviewing the patches), I have 
submitted a refactoring patch which aims at cleaning up the internal state 
machine, so that additions, and checking that all is well, are simpler.

     https://commitfest.postgresql.org/20/1754/

It has been reviewed; I think I answered the reviewer's concerns, but the 
reviewer did not update the patch state on the cf app, so I do not know 
whether he is unsatisfied or if it was just forgotten.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
On 2018-Nov-19, Fabien COELHO wrote:

> 
> Hello Alvaro,
> 
> > I also think that the pgbench_error() patch should go in before the main
> > one.  It seems a bit pointless to introduce code using a bad API only to
> > fix the API together with all the new callers immediately afterwards.
> 
> I'm not that keen on this part of the patch, because ISTM that introduces
> significant and possibly costly malloc/free cycles when handling error,
> which do not currently exist in pgbench.

Oh, I wasn't aware of that.

> Related to Marina patch (triggered by reviewing the patches), I have
> submitted a refactoring patch which aims at cleaning up the internal state
> machine, so that additions and checking that all is well is simpler.
> 
>     https://commitfest.postgresql.org/20/1754/

Let me look at this one.

> It has been reviewed, I think I answered to the reviewer concerns, but the
> reviewer did not update the patch state on the cf app, so I do not know
> whether he is unsatisfied or if it was just forgotten.

Feel free to update a patch status to "needs review" yourself after
submitting a new version that in your opinion responds to a reviewer's
comments.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> Feel free to update a patch status to "needs review" yourself after
> submitting a new version that in your opinion respond to a reviewer's
> comments.

Sure, I do that. But I will not switch any of my patches to "Ready". AFAICR 
the concerns were mostly about imprecise comments in the code, and a few 
questions that I answered.

-- 
Fabien.


Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Thomas Munro
Дата:
On Mon, Mar 9, 2020 at 10:00 AM Marina Polyakova
<m.polyakova@postgrespro.ru> wrote:
> On 2018-11-16 22:59, Alvaro Herrera wrote:
> > On 2018-Sep-05, Marina Polyakova wrote:
> >
> >> v11-0001-Pgbench-errors-use-the-RandomState-structure-for.patch
> >> - a patch for the RandomState structure (this is used to reset a
> >> client's
> >> random seed during the repeating of transactions after
> >> serialization/deadlock failures).
> >
> > Pushed this one with minor stylistic changes (the most notable of which
> > is the move of initRandomState to where the rest of the random
> > generator
> > infrastructure is, instead of in a totally random place).  Thanks,
>
> Thank you very much! I'm going to send a new patch set until the end of
> this week (I'm sorry I was very busy in the release of Postgres Pro
> 11...).

Is anyone interested in rebasing this, and summarising what needs to
be done to get it in?  It's arguably a bug or at least quite
unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
that a couple of forks already ship Marina's patch set.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Thomas,

>> Thank you very much! I'm going to send a new patch set until the end of
>> this week (I'm sorry I was very busy in the release of Postgres Pro
>> 11...).
>
> Is anyone interested in rebasing this, and summarising what needs to
> be done to get it in?  It's arguably a bug or at least quite
> unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> that a couple of forks already ship Marina's patch set.

I'm a reviewer on this patch, which I find a good thing (tm) and which was 
converging to a reasonable and simple enough addition, IMHO.

If I proceed in place of Marina, who is going to do the reviews?

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Thomas Munro
Дата:
On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> >> Thank you very much! I'm going to send a new patch set until the end of
> >> this week (I'm sorry I was very busy in the release of Postgres Pro
> >> 11...).
> >
> > Is anyone interested in rebasing this, and summarising what needs to
> > be done to get it in?  It's arguably a bug or at least quite
> > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> > that a couple of forks already ship Marina's patch set.
>
> I'm a reviewer on this patch, that I find a good thing (tm), and which was
> converging to a reasonable and simple enough addition, IMHO.
>
> If I proceed in place of Marina, who is going to do the reviews?

Hi Fabien,

Cool.  I'll definitely take it for a spin if you post a fresh patch
set.  Any place that we arbitrarily don't support SERIALIZABLE, I
consider a bug, so I'd like to commit this if we can agree it's ready.
It sounds like it's actually in pretty good shape.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hi hackers,

On Tue, 10 Mar 2020 09:48:23 +1300
Thomas Munro <thomas.munro@gmail.com> wrote:

> On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > >> Thank you very much! I'm going to send a new patch set until the end of
> > >> this week (I'm sorry I was very busy in the release of Postgres Pro
> > >> 11...).
> > >
> > > Is anyone interested in rebasing this, and summarising what needs to
> > > be done to get it in?  It's arguably a bug or at least quite
> > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> > > that a couple of forks already ship Marina's patch set.

I got interested in this and am now looking into the patch and the past discussion. 
If no one else will do it and there are no objections, I would like to rebase
this. Is that okay?

Regards,
Yugo NAGATA

> >
> > I'm a reviewer on this patch, that I find a good thing (tm), and which was
> > converging to a reasonable and simple enough addition, IMHO.
> >
> > If I proceed in place of Marina, who is going to do the reviews?
> 
> Hi Fabien,
> 
> Cool.  I'll definitely take it for a spin if you post a fresh patch
> set.  Any place that we arbitrarily don't support SERIALIZABLE, I
> consider a bug, so I'd like to commit this if we can agree it's ready.
> It sounds like it's actually in pretty good shape.


-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hi hackers,

On Mon, 24 May 2021 11:29:10 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> Hi hackers,
> 
> On Tue, 10 Mar 2020 09:48:23 +1300
> Thomas Munro <thomas.munro@gmail.com> wrote:
> 
> > On Tue, Mar 10, 2020 at 8:43 AM Fabien COELHO <coelho@cri.ensmp.fr> wrote:
> > > >> Thank you very much! I'm going to send a new patch set until the end of
> > > >> this week (I'm sorry I was very busy in the release of Postgres Pro
> > > >> 11...).
> > > >
> > > > Is anyone interested in rebasing this, and summarising what needs to
> > > > be done to get it in?  It's arguably a bug or at least quite
> > > > unfortunate that pgbench doesn't work with SERIALIZABLE, and I heard
> > > > that a couple of forks already ship Marina's patch set.
> 
> I got interested in this and am now looking into the patch and the past discussion.
> If no one else will do it and there are no objections, I would like to rebase
> this. Is that okay?

I rebased and fixed the previous patches (v11) written by Marina Polyakova,
and attached the revised version (v12).

v12-0001-Pgbench-errors-use-the-Variables-structure-for-c.patch
- a patch for the Variables structure (this is used to reset client 
variables during the repeating of transactions after 
serialization/deadlock failures).

v12-0002-Pgbench-errors-and-serialization-deadlock-retrie.patch
- the main patch for handling client errors and repetition of 
transactions with serialization/deadlock failures (see the detailed 
description in the file).

These are the revised versions of v11-0002 and v11-0003. v11-0001
(for the RandomState structure) is not included because it has already
been committed (40923191944). v11-0004 (for a separate error reporting
function) is not included either because pgbench now uses the common logging
APIs (30a3e772b40).

In addition to rebasing on master, I updated the patch according to the
review from Fabien COELHO [1] and the discussions that followed. I also added
some other fixes from my own review of the previous patch.

[1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809081450100.10506%40lancre

The following are fixes based on Fabien's review.

> * Features

> As far as the actual retry feature is concerned, I'd say we are nearly 
> there. However I have an issue with changing the behavior on meta command 
> and other sql errors, which I find not desirable.
...
> I do not think that these changes of behavior are desirable. Meta command and
> miscellaneous SQL errors should result in immediatly aborting the whole run,
> because the client test code itself could not run correctly or the SQL sent
> was somehow wrong, which is also the client's fault, and the server 
> performance bench does not make much sense in such conditions.
> 
> ISTM that the focus of this patch should only be to handle some server 
> runtime errors that can be retryed, but not to change pgbench behavior on 
> other kind of errors. If these are to be changed, ISTM that it would be a 
> distinct patch and would require some discussion, and possibly an option 
> to enable it or not if some use case emerge. AFA this patch is concerned, 
> I'd suggest to let that out.

Previously, all SQL and meta command errors could be retried, but I changed
this so that only serialization and deadlock errors can be retried.
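To make that scope concrete, here is a minimal sketch (not the patch's actual code) of such a classification. The SQLSTATE values are the standard PostgreSQL codes for serialization_failure and deadlock_detected; the function name and the idea of reading the code from the libpq result (PG_DIAG_SQLSTATE) are only illustrative assumptions:

    #include <stdbool.h>
    #include <string.h>

    /* Sketch only: decide whether an error is worth retrying.  PostgreSQL
     * reports serialization failures as SQLSTATE 40001 and deadlocks as
     * 40P01; everything else aborts the client as before. */
    static bool
    canRetryError(const char *sqlState)
    {
        return sqlState != NULL &&
               (strcmp(sqlState, "40001") == 0 ||   /* serialization_failure */
                strcmp(sqlState, "40P01") == 0);    /* deadlock_detected */
    }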

> Doc says "you cannot use an infinite number of retries without latency-limit..."
> 
> Why should this be forbidden? At least if -T timeout takes precedent and
> shortens the execution, ISTM that there could be good reason to test that.
> Maybe it could be blocked only under -t if this would lead to an non-ending
> run.

I changed this to allow --max-tries to be used with the -T option even if
--latency-limit is not used.

> As "--print-errors" is really for debug, maybe it could be named
> "--debug-errors". I'm not sure that having "--debug" implying this option
> is useful: As there are two distinct options, the user may be allowed
> to trigger one or the other as they wish?

--print-errors was renamed to --debug-errors.

> makeVariableValue error message is not for debug, but must be kept in all
> cases, and the false returned must result in an immediate abort. Same thing about
> lookupCreateVariable, an invalid name is a user error which warrants an immediate
> abort. Same thing again about coerce* functions or evalStandardFunc...
> Basically, most/all added "debug_level >= DEBUG_ERRORS" are not desirable.

"DEBUG_ERRORS" messages unrelated to serialization & deadlock errors were removed.

> sendRollback(): I'd suggest to simplify. The prepare/extended statement stuff is
> really about the transaction script, not dealing with errors, esp as there is no
> significant advantage in preparing a "ROLLBACK" statement which is short and has
> no parameters. I'd suggest to remove this function and just issue
> PQsendQuery("ROLLBACK;") in all cases.

Now, we just issue PQsendQuery("ROLLBACK;").

> In copyVariables, I'd simplify
>
>  + if (source_var->svalue == NULL)
>  +   dest_var->svalue = NULL;
>  + else
>  +   dest_var->svalue = pg_strdup(source_var->svalue);
>
>as:
>   dest_var->value = (source_var->svalue == NULL) ? NULL : pg_strdup(source_var->svalue);

Fixed using a ternary operator.

>  + if (sqlState)   ->   if (sqlState != NULL) ?

Fixed.

> Function getTransactionStatus name does not seem to correspond fully to what the
> function does. There is a passthru case which should be either avoided or
> clearly commented.

This was renamed to checkTransactionStatus according to [2].

[2] https://www.postgresql.org/message-id/c262e889315625e0fc0d77ca78fe2eac%40postgrespro.ru

>  - commandFailed(st, "SQL", "perhaps the backend died while processing");
>  + clientAborted(st,
>  +              "perhaps the backend died while processing");
>
> keep on one line?

The change that replaced commandFailed with clientAborted was removed
(see below).

>  + if (doRetry(st, &now))
>  +   st->state = CSTATE_RETRY;
>  + else
>  +   st->state = CSTATE_FAILURE;
>
> -> st->state = doRetry(st, &now) ? CSTATE_RETRY : CSTATE_FAILURE;

Fixed using a ternary operator.

> * Comments

> "There're different types..." -> "There are different types..."
> "after the errors and"... -> "after errors and"...
> "the default value of max_tries is set to 1" -> "the default value
> of max_tries is 1"
> "We cannot retry the transaction" -> "We cannot retry a transaction"
> "may ultimately succeed or get a failure," -> "may ultimately succeed or fail,"

Fixed.

> Overall, the comment text in StatsData is very clear. However they are not
> clearly linked to the struct fields. I'd suggest that earch field when used
> should be quoted, so as to separate English from code, and the struct name
> should always be used explicitely when possible.

The comment in StatsData was fixed to clarify what each field in this struct
represents.

> I'd insist in a comment that "cnt" does not include "skipped" transactions
> (anymore).

StatsData.cnt has a comment "number of successful transactions, not including
'skipped'", and CState.cnt has a comment "skipped and failed transactions are
also counted here".
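For readers without the patch at hand, a rough sketch of how these counters could fit together; field names beyond "cnt" and "skipped" are assumptions rather than the patch's exact layout:

    #include <stdint.h>

    /* Sketch only: per-interval statistics once failures and retries are
     * tracked.  "cnt" counts successful transactions and does NOT include
     * skipped ones; failed transactions are counted separately. */
    typedef struct StatsDataSketch
    {
        int64_t cnt;                    /* successful transactions, excluding skipped */
        int64_t skipped;                /* skipped under --latency-limit */
        int64_t retries;                /* total retries over all transactions */
        int64_t retried;                /* transactions retried at least once */
        int64_t serialization_failures; /* transactions that finally failed on serialization */
        int64_t deadlock_failures;      /* transactions that finally failed on deadlock */
    } StatsDataSketch;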

> * Documentation:

> ISTM that there are too many "the":
>   - "turns on the option ..." -> "turns on option ..."
>   - "When the option ..." -> "When option ..."
>   - "By default the option ..." -> "By default option ..."
>   - "only if the option ..." -> "only if option ..."
>   - "combined with the option ..." -> "combined with option ..."
>   - "without the option ..." -> "without option ..."

The previous patch used a lot of "the option xxxx", but I changed these
to "the xxxx option" because I found that the documentation
refers to options that way. For example,

- You can (and, for most purposes, probably should) increase the number
  of rows by using the <option>-s</option> (scale factor) option. 
- The prefix can be changed by using the <option>--log-prefix</option> option.
- If the <option>-j</option> option is 2 or higher, so that there are multiple
  worker threads,

>   - "is the sum of all the retries" -> "is the sum of all retries"
> "infinite" -> "unlimited" 
> "not retried at all" -> "not retried" (maybe several times). 
> "messages of all errors" -> "messages about all errors". 
> "It is assumed that the scripts used do not contain" ->
> "It is assumed that pgbench scripts do not contai

Fixed.


The following are additional fixes based on my review of the previous patch.

* About error reporting

In the previous patch, commandFailed() was changed to report an error
that doesn't immediately abort the client, and clientAborted() was
added to report an abort of the client. In the attached patch,
the behaviour around errors other than serialization and deadlock is
not changed and such errors cause the client to abort, so commandFailed()
is used without any changes to report a client abort, and commandError()
is added to report an error that can be retried under --debug-errors.
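A minimal sketch of the split being described; plain fprintf stands in for pgbench's logging calls and the signatures are assumptions, not the actual API:

    #include <stdbool.h>
    #include <stdio.h>

    /* Sketch only: a fatal error; the client is aborted as before. */
    static void
    commandFailedSketch(int client_id, const char *cmd, const char *message)
    {
        fprintf(stderr, "client %d aborted in command \"%s\": %s\n",
                client_id, cmd, message);
    }

    /* Sketch only: a retryable (serialization/deadlock) error; the message
     * is shown only when the error-verbosity option is enabled. */
    static void
    commandErrorSketch(int client_id, const char *cmd, const char *message,
                       bool debug_errors)
    {
        if (debug_errors)
            fprintf(stderr, "client %d got a retryable error in command \"%s\": %s\n",
                    client_id, cmd, message);
    }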

* About progress reporting

In the previous patch, the number of failures was reported only when some
transaction failed, and retry statistics were reported only when some
transaction was retried. This means the number of columns in the report
differed depending on the interval, which was odd and made the output
harder to parse.

In the attached patch, the number of failures is always reported, and
the retry statistics are reported when max-tries is not 1.

* About result outputs

In the previous patch, the number of failed transactions, the number
of retried transactions, and the total number of retries were reported
as:

 number of failures: 324 (3.240%)
 ...
 number of retried: 5629 (56.290%)
 number of retries: 103299

I think this was confusing. In particular, it was unclear to me what
"retried" and "retries" represent respectively. Therefore, in the
attached patch, they are reported as:

 number of transactions failed: 324 (3.240%)
 ...
 number of transactions retried: 5629 (56.290%)
 number of total retries: 103299

which clarifies that the first two are numbers of transactions and the
last one is the number of retries over all transactions.

* About average connection time

In the previous patch, this was calculated as "conn_total_duration / total->cnt",
where conn_total_duration is the cumulative connection time summed over threads and
total->cnt is the number of transactions that were successfully processed.

However, the average connection time could be overestimated because
conn_total_duration includes the connection time of transactions that failed
due to serialization and deadlock errors. So, in the attached patch,
this is calculated as "conn_total_duration / (total->cnt + failures)".
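As a sketch of the corrected formula (the names are assumptions; the point is only that failed transactions are added to the denominator):

    #include <stdint.h>

    /* Sketch only: average connection time per started transaction,
     * counting both successful and failed ones in the denominator. */
    static double
    average_connection_time(int64_t conn_total_duration, int64_t cnt, int64_t failures)
    {
        if (cnt + failures == 0)
            return 0.0;
        return (double) conn_total_duration / (double) (cnt + failures);
    }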


Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

Thanks a lot for continuing this work started by Marina!

I'm planning to review it for the July CF. I've just added an entry there:

     https://commitfest.postgresql.org/33/3194/

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

On Tue, 22 Jun 2021 20:03:58 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san,
> 
> Thanks a lot for continuing this work started by Marina!
> 
> I'm planning to review it for the July CF. I've just added an entry there:
> 
>      https://commitfest.postgresql.org/33/3194/

Thanks!

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san:

# About v12.1

This is a refactoring patch, which creates a separate structure for 
holding variables. This will become handy in the next patch. There is also 
a benefit from a software engineering point of view, so it has merit on 
its own.

## Compilation

Patch applies cleanly, compiles, global & local checks pass.

## About the code

Fine.

I'm wondering whether we could use "vars" instead of "variables" as a 
struct field name and function parameter name, so that it is shorter and 
more distinct from the type name "Variables". What do you think?

## About comments

Remove the comment on enlargeVariables about "It is assumed …" the issue 
of trying MAXINT vars is more than remote and is not worth mentioning. In 
the same function, remove the comments about MARGIN, it is already on the 
macro declaration, once is enough.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 23 Jun 2021 10:38:43 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san:
> 
> # About v12.1
> 
> This is a refactoring patch, which creates a separate structure for 
> holding variables. This will become handy in the next patch. There is also 
> a benefit from a software engineering point of view, so it has merit on 
> its own.

> ## Compilation
> 
> Patch applies cleanly, compiles, global & local checks pass.
> 
> ## About the code
> 
> Fine.
> 
> I'm wondering whether we could use "vars" instead of "variables" as a 
> struct field name and function parameter name, so that it is shorter and 
> more distinct from the type name "Variables". What do you think?

The struct "Variables" has a field named "vars" which is an array of
"Variable" type. I guess this is a reason why "variables" is used instead
of "vars" as a name of "Variables" type variable so that we could know
a variable's type is Variable or Variables.  Also, in order to refer to
the field, we would use

 vars->vars[vars->nvars]

and there are nested "vars". Could this make a codereader confused?
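For reference, a minimal sketch of the two levels being discussed; the real definitions in the patch may differ in detail:

    #include <stdbool.h>

    /* Sketch only: one client variable, a name with a string value. */
    typedef struct Variable
    {
        char *name;     /* variable's name */
        char *svalue;   /* its value in string form, if known */
    } Variable;

    /* Sketch only: the wrapper introduced by the patch, so that the whole
     * set can be saved and restored around a retried transaction. */
    typedef struct Variables
    {
        Variable *vars;         /* array of variable definitions */
        int       nvars;        /* number of variables */
        bool      vars_sorted;  /* are variables sorted by name? */
    } Variables;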


> ## About comments
> 
> Remove the comment on enlargeVariables about "It is assumed …" the issue 
> of trying MAXINT vars is more than remote and is not worth mentioning. In 
> the same function, remove the comments about MARGIN, it is already on the 
> macro declaration, once is enough.

Sure. I'll remove them.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

>> I'm wondering whether we could use "vars" instead of "variables" as a
>> struct field name and function parameter name, so that it is shorter and
>> more distinct from the type name "Variables". What do you think?
>
> The struct "Variables" has a field named "vars" which is an array of
> "Variable" type. I guess this is a reason why "variables" is used instead
> of "vars" as a name of "Variables" type variable so that we could know
> a variable's type is Variable or Variables.  Also, in order to refer to
> the field, we would use
>
> vars->vars[vars->nvars]
>
> and there are nested "vars". Could this make a codereader confused?

Hmmm… Probably. Let's keep "variables" then.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

# About v12.2

## Compilation

Patch seems to apply cleanly with "git apply", but does not compile on my 
host: "undefined reference to `conditional_stack_reset'".

However it works better when using the "patch". I'm wondering why git 
apply fails silently…

When compiling there are warnings about "pg_log_fatal", which does not 
expect a FILE* on pgbench.c:4453. Remove the "stderr" argument.

Global and local checks ok.

> number of transactions failed: 324 (3.240%)
> ...
> number of transactions retried: 5629 (56.290%)
> number of total retries: 103299

I'd suggest: "number of failed transactions". "total number of retries" or 
just "number of retries"?

## Feature

The overall code structure changes to implement the feature seem 
reasonable to me, as we are at the 12th iteration of the patch.

Comments below are somehow about details and asking questions
about choices, and commenting…

## Documentation

There is a lot of documentation, which is good. I'll review these
separately. It looks good, but having a native English speaker/writer
would really help!

Some output examples do not correspond to actual output for
the current version. In particular, there is always one TPS figure
given now, instead of the confusing two shown before.

## Comments

transactinos -> transactions.

## Code

By default max_tries = 0. Should not the initialization be 1,
as the documentation argues that it is the default?

Counter comments, missing + in the formula on the skipped line.

Given that we manage errors, ISTM that we should not necessarily
stop on other not retried errors, but rather count/report them and
possibly proceed.  Eg with something like:

   -- server side random fail
   DO LANGUAGE plpgsql $$
   BEGIN
     IF RANDOM() < 0.1 THEN
       RAISE EXCEPTION 'unlucky!';
     END IF;
   END;
   $$;

Or:

   -- client side random fail
   BEGIN;
   \if random(1, 10) <= 1
   SELECT 1 +;
   \else
   SELECT 2;
   \endif
   COMMIT;

We could count the fail, rollback if necessary, and go on.  What do you think?
Maybe such behavior would deserve an option.

--report-latencies -> --report-per-command: should we keep supporting
the previous option?

--failures-detailed: if we bother to run with handling failures, should
it always be on?

--debug-errors: I'm not sure we should want a special debug mode for that,
I'd consider integrating it with the standard debug, or just for development.
Also, should it use pg_log_debug?

doRetry: I'd separate the 3 no retries options instead of mixing max_tries and
timer_exceeded, for clarity.

Tries vs retries: I'm at odds with having tries & retries and + 1 here
and there to handle that, which is a little bit confusing. I'm wondering whether
we could only count "tries" and adjust to report what we want later?

advanceConnectionState: ISTM that ERROR should logically be before others which
lead to it.

Variables management: it looks expensive, with copying and freeing variable arrays.
I'm wondering whether we should think of something more clever. Well, that would be
for some other patch.

"Accumulate the retries" -> "Count (re)tries"?

Currently, ISTM that the retry on error mode is implicitly always on.
Do we want that? I'd say yes, but maybe people could disagree.

## Tests

There are tests, good!

I'm wondering whether something simpler could be devised to trigger
serialization or deadlock errors, eg with a SEQUENCE and an \if.

See the attached files for generating deadlocks reliably (start with 2 clients).
What do you think? The PL/pgSQL version is minimal; it is really client-code 
oriented.

Given that deadlocks are detected about every second, the test runs
would take some time. Let it be for now.

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

On Sat, 26 Jun 2021 12:15:38 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> 
> Hello Yugo-san,
> 
> # About v12.2
> 
> ## Compilation
> 
> Patch seems to apply cleanly with "git apply", but does not compile on my 
> host: "undefined reference to `conditional_stack_reset'".
> 
> However it works better when using the "patch". I'm wondering why git 
> apply fails silently…

Hmm, I don't know why your compilation fails... I can apply and compile
successfully using git.

> When compiling there are warnings about "pg_log_fatal", which does not 
> expect a FILE* on pgbench.c:4453. Remove the "stderr" argument.

Ok.

> Global and local checks ok.
> 
> > number of transactions failed: 324 (3.240%)
> > ...
> > number of transactions retried: 5629 (56.290%)
> > number of total retries: 103299
> 
> I'd suggest: "number of failed transactions". "total number of retries" or 
> just "number of retries"?

Ok. I fixed to use "number of failed transactions" and "total number of retries".

> ## Feature
> 
> The overall code structure changes to implement the feature seem 
> reasonable to me, as we are at the 12th iteration of the patch.
> 
> Comments below are somehow about details and asking questions
> about choices, and commenting…
> 
> ## Documentation
> 
> There is a lot of documentation, which is good. I'll review these
> separately. It looks good, but having a native English speaker/writer
> would really help!
> 
> Some output examples do not correspond to actual output for
> the current version. In particular, there is always one TPS figure
> given now, instead of the confusing two shown before.

Fixed.

> ## Comments
> 
> transactinos -> transactions.

Fixed.

> ## Code
> 
> By default max_tries = 0. Should not the initialization be 1,
> as the documentation argues that it is the default?

Ok. I fixed the default value to 1.

> Counter comments, missing + in the formula on the skipped line.

Fixed.

> Given that we manage errors, ISTM that we should not necessarily
> stop on other not retried errors, but rather count/report them and
> possibly proceed.  Eg with something like:
> 
>    -- server side random fail
>    DO LANGUAGE plpgsql $$
>    BEGIN
>      IF RANDOM() < 0.1 THEN
>        RAISE EXCEPTION 'unlucky!';
>      END IF;
>    END;
>    $$;
> 
> Or:
> 
>    -- client side random fail
>    BEGIN;
>    \if random(1, 10) <= 1
>    SELECT 1 +;
>    \else
>    SELECT 2;
>    \endif
>    COMMIT;
> 
> We could count the fail, rollback if necessary, and go on.  What do you think?
> Maybe such behavior would deserve an option.

This feature to count failures that could occur at runtime seems nice. However,
as discussed in [1], I think it is better to focus only on failures that can be
retried in this patch, and introduce the feature to handle other failures in a
separate patch.

[1] https://www.postgresql.org/message-id/alpine.DEB.2.21.1809121519590.13887%40lancre

> --report-latencies -> --report-per-command: should we keep supporting
> the previous option?

Ok. Although the option is now not only about latencies, considering users who
are using the existing option, I'm fine with this. I reverted it to the
previous name.

> --failures-detailed: if we bother to run with handling failures, should
> it always be on?

If we print other failures that cannot be retried in the future, it could produce a lot
of lines and might make some users who don't need details of failures annoyed.
Moreover, some users would always need information about detailed failures in the log,
and others would need only the total numbers of failures. 

Currently we handle only serialization and deadlock failures, so the number of
lines printed and the number of columns in the logging are not large even under
failures-detailed, but if we have a chance to handle other failures in the future,
ISTM adding this option makes sense considering users who would like simple
outputs.
 
> --debug-errors: I'm not sure we should want a special debug mode for that,
> I'd consider integrating it with the standard debug, or just for development.

I think --debug is a debug option for telling users about pgbench's internal
behaviour, that is, which client is doing what. On the other hand, --debug-errors
is for telling users what error caused a retry or a failure in detail. For
users who are not interested in pgbench's internal behaviour (sending a command, 
receiving a result, ...) but are interested in the actual errors raised while running 
the script, this option seems useful.

> Also, should it use pg_log_debug?

If we use pg_log_debug, the message is printed only under --debug.
Therefore, I fixed to use pg_log_info instead of pg_log_error or fprintf.
 
> doRetry: I'd separate the 3 no retries options instead of mixing max_tries and
> timer_exceeded, for clarity.

Ok. I fixed to separate them.
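A minimal sketch of the separated conditions; names such as timer_exceeded and the exact latency check are assumptions based on the discussion, not the patch's code:

    #include <stdbool.h>
    #include <stdint.h>

    /* Sketch only: decide whether a failed transaction should be tried again.
     * The three "do not retry" conditions are kept separate for clarity. */
    static bool
    doRetrySketch(uint32_t tries, uint32_t max_tries, bool timer_exceeded,
                  double latency_limit, double elapsed)
    {
        /* the limit on the number of tries is exhausted (0 means unlimited) */
        if (max_tries != 0 && tries >= max_tries)
            return false;
        /* the -T duration has expired */
        if (timer_exceeded)
            return false;
        /* the transaction has already exceeded --latency-limit */
        if (latency_limit > 0 && elapsed >= latency_limit)
            return false;
        return true;
    }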
 
> Tries vs retries: I'm at odds with having tries & retries and + 1 here
> and there to handle that, which is a little bit confusing. I'm wondering whether
> we could only count "tries" and adjust to report what we want later?

I fixed to use "tries" instead of "retries" in CState. However, we still use
"retries" in StatsData and Command because the number of retries is printed
in the final result. Is it less confusing than the previous?
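A small sketch of the "count tries, report retries" bookkeeping being discussed; names are assumptions:

    #include <stdint.h>

    /* Sketch only: when a transaction ends, fold the client's per-transaction
     * try counter into the per-script retry statistics.  The number of
     * retries is the number of tries minus the first attempt. */
    static void
    accountTriesSketch(int64_t *stats_retries, int64_t *stats_retried, uint32_t tries)
    {
        if (tries > 1)
        {
            *stats_retries += tries - 1;  /* total retries over all transactions */
            *stats_retried += 1;          /* this transaction was retried at least once */
        }
    }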

> advanceConnectionState: ISTM that ERROR should logically be before others which
> lead to it.

Sorry, I couldn't understand your suggestion. Is this about the order of case
statements or pg_log_error?
 
> Variables management: it looks expensive, with copying and freeing variable arrays.
> I'm wondering whether we should think of something more clever. Well, that would be
> for some other patch.

Well, indeed there may be a more efficient way. For example, instead of clearing all
vars in dest, it might be possible to copy or clear only the part that differs between
dest and source and leave the unchanged part of dest alone. Anyway, I think this work
should be done in another patch.
 
> "Accumulate the retries" -> "Count (re)tries"?

Fixed.
 
> Currently, ISTM that the retry on error mode is implicitly always on.
> Do we want that? I'd say yes, but maybe people could disagree.

The default value of max-tries is 1, so the retry on error is off.
Failed transactions are retried only when the user wants it and
specifies a valid value for max-tries.
 
> ## Tests
> 
> There are tests, good!
> 
> I'm wondering whether something simpler could be devised to trigger
> serialization or deadlock errors, eg with a SEQUENCE and an \if.
> 
> See the attached files for generating deadlocks reliably (start with 2 clients).
> What do you think? The PL/pgSQL version is minimal; it is really client-code 
> oriented.
> 
> Given that deadlocks are detected about every second, the test runs
> would take some time. Let it be for now.

Sorry, but I cannot find the attached file. I don't have a good idea 
for a simpler test for now, but I can fix the test based on your idea
after getting the file.


I attached the updated patch according to your suggestions.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

Thanks for the update!

>> Patch seems to apply cleanly with "git apply", but does not compile on my
>> host: "undefined reference to `conditional_stack_reset'".
>>
>> However it works better when using the "patch". I'm wondering why git
>> apply fails silently…
>
> Hmm, I don't know why your compilation fails... I can apply and compile
> successfully using git.

Hmmm. Strange!

>> Given that we manage errors, ISTM that we should not necessarily stop 
>> on other not retried errors, but rather count/report them and possibly 
>> proceed.  Eg with something like: [...] We could count the fail, 
>> rollback if necessary, and go on.  What do you think? Maybe such 
>> behavior would deserve an option.
>
> This feature to count failures that could occur at runtime seems nice. However,
> as discussed in [1], I think it is better to focus only on failures that can be
> retried in this patch, and introduce the feature to handle other failures in a
> separate patch.

Ok.

>> --report-latencies -> --report-per-command: should we keep supporting
>> the previous option?
>
> Ok. Although now the option is not only for latencies, considering users who
> are using the existing option, I'm fine with this. I got back this to the
> previous name.

Hmmm. I liked the new name! My point was whether we need to support the 
old one as well for compatibility, or whether we should not bother. I'm 
still wondering. As I think that the new name is better, I'd suggest to 
keep it.

>> --failures-detailed: if we bother to run with handling failures, should
>> it always be on?
>
> If we print other failures that cannot be retried in the future, it could produce a lot
> of lines and might make some users who don't need details of failures annoyed.
> Moreover, some users would always need information of detailed failures in log,
> and others would need only total numbers of failures.

Ok.

> Currently we handle only serialization and deadlock failures, so the number of
> lines printed and the number of columns of logging is not large even under the
> failures-detail, but if we have a chance to handle other failures in future,
> ISTM adding this option makes sense considering users who would like simple
> outputs.

Hmmm. What kind of failures could be managed with retries? I guess that on 
a connection failure we can try to reconnect, but otherwise it is less 
clear that other failures make sense to retry.

>> --debug-errors: I'm not sure we should want a special debug mode for that,
>> I'd consider integrating it with the standard debug, or just for development.
>
> I think --debug is a debug option for telling users the pgbench's internal
> behaviors, that is, which client is doing what. On other hand, --debug-errors
> is for telling users what error caused a retry or a failure in detail. For
> users who are not interested in pgbench's internal behavior (sending a command,
> receiving a result, ... ) but interested in actual errors raised during running
> script, this option seems useful.

Ok. This is not really about debug per se, but a verbosity setting?
Maybe --verbose-errors would make more sense? I'm unsure. I'll think about 
it.

>> Also, should it use pg_log_debug?
>
> If we use pg_log_debug, the message is printed only under --debug.
> Therefore, I fixed to use pg_log_info instead of pg_log_error or fprintf.

Ok, pg_log_info seems right.

>> Tries vs retries: I'm at odds with having tries & retries and + 1 here
>> and there to handle that, which is a little bit confusing. I'm wondering whether
>> we could only count "tries" and adjust to report what we want later?
>
> I fixed to use "tries" instead of "retries" in CState. However, we still use
> "retries" in StatsData and Command because the number of retries is printed
> in the final result. Is it less confusing than the previous?

I'm going to think about it.

>> advanceConnectionState: ISTM that ERROR should logically be before others which
>> lead to it.
>
> Sorry, I couldn't understand your suggestion. Is this about the order of case
> statements or pg_log_error?

My sentence got mixed up. My point was about the case order, so that they 
are put in a more logical order when reading all the cases.

>> Currently, ISTM that the retry on error mode is implicitely always on.
>> Do we want that? I'd say yes, but maybe people could disagree.
>
> The default values of max-tries is 1, so the retry on error is off.

> Failed transactions are retried only when the user wants it and
> specifies a valid value for max-tries.

Ok. My point is that we do not stop on such errors, whereas before ISTM 
that we would have stopped, so somehow the default behavior has changed 
and the previous behavior cannot be reinstated with an option. Maybe that 
is not bad, but this is a behavioral change which needs to be documented 
and justified.

>> See the attached files for generating deadlocks reliably (start with 2 
>> clients). What do you think? The PL/pgSQL version is minimal; it is really 
>> client-code oriented.
>
> Sorry, but I cannot find the attached file.

Sorry. Attached to this mail. The serialization stuff does not seem to 
work as well as the deadlock one. Run with 2 clients.

-- 
Fabien.
Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> I attached the updated patch according to your suggestions.

v13 patches gave a compiler warning...

$ make >/dev/null
pgbench.c: In function ‘commandError’:
pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
  const Command *command = sql_script[st->use_file].commands[st->command];
                 ^~~~~~~

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> v13 patches gave a compiler warning...
> 
> $ make >/dev/null
> pgbench.c: In function ‘commandError’:
> pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
>   const Command *command = sql_script[st->use_file].commands[st->command];
>                  ^~~~~~~

There is a typo in the doc (more over ->  moreover).

>        of all transaction tries; more over, you cannot use an unlimited number

        of all transaction tries; moreover, you cannot use an unlimited number

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
I have found an interesting result from patched pgbench (I have set
the isolation level to REPEATABLE READ):

$ pgbench -p 11000 -c 10  -T 30  --max-tries=0 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 30 s
number of transactions actually processed: 2586
number of failed transactions: 9 (0.347%)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
number of transactions retried: 1892 (72.909%)
total number of retries: 21819
latency average = 115.551 ms (including failures)
initial connection time = 35.268 ms
tps = 86.241799 (without initial connection time)

I ran pgbench with 10 concurrent sessions. In this case pgbench always
reports 9 failed transactions regardless of the setting of the -T
option. This is because at the end of a pgbench session, only 1 out of
10 transactions succeeded while 9 transactions failed due to
serialization errors without any chance to retry because -T expires.

This is a little bit disappointing because I wanted to see a result where
all transactions succeeded with retries.  I tried -t instead of -T but
-t cannot be used with --max-tries=0.

Also I think this behavior is somewhat inconsistent with the existing
behavior of pgbench. When pgbench runs without the --max-tries option,
pgbench continues to run transactions even after -T expires:

$ time pgbench -p 11000 -T 10 -f pgbench.sql test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: pgbench.sql
scaling factor: 1
query mode: simple
number of clients: 1
number of threads: 1
duration: 10 s
number of transactions actually processed: 2
maximum number of tries: 1
latency average = 7009.006 ms
initial connection time = 8.045 ms
tps = 0.142674 (without initial connection time)

real    0m14.067s
user    0m0.010s
sys    0m0.004s

$ cat pgbench.sql
SELECT pg_sleep(7);

So pgbench does not stop transactions after 10 seconds have passed but
waits for the last transaction to complete. To be consistent with this
behavior, shouldn't we retry until the last transaction finishes when
--max-tries=0?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Ishii-san,

On Thu, 01 Jul 2021 09:03:42 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > v13 patches gave a compiler warning...
> > 
> > $ make >/dev/null
> > pgbench.c: In function ‘commandError’:
> > pgbench.c:3071:17: warning: unused variable ‘command’ [-Wunused-variable]
> >   const Command *command = sql_script[st->use_file].commands[st->command];
> >                  ^~~~~~~

Hmm, we'll get the warning when --enable-cassert is not specified.
I'll fix it.

> There is a typo in the doc (more over ->  moreover).
> 
> >        of all transaction tries; more over, you cannot use an unlimited number
> 
>         of all transaction tries; moreover, you cannot use an unlimited number
> 

Thanks. I'll fix.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Ishii-san,

On Fri, 02 Jul 2021 09:25:03 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> I have found an interesting result from patched pgbench (I have set
> the isolation level to REPEATABLE READ):
> 
> $ pgbench -p 11000 -c 10  -T 30  --max-tries=0 test
> pgbench (15devel, server 13.3)
> starting vacuum...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 1
> query mode: simple
> number of clients: 10
> number of threads: 1
> duration: 30 s
> number of transactions actually processed: 2586
> number of failed transactions: 9 (0.347%)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> number of transactions retried: 1892 (72.909%)
> total number of retries: 21819
> latency average = 115.551 ms (including failures)
> initial connection time = 35.268 ms
> tps = 86.241799 (without initial connection time)
> 
> I ran pgbench with 10 concurrent sessions. In this case pgbench always
> reports 9 failed transactions regardless the setting of -T
> option. This is because at the end of a pgbench session, only 1 out of
> 10 transaction succeeded but 9 transactions failed due to
> serialization error without any chance to retry because -T expires.
> 
> This is a little bit disappointing because I wanted to see a result where
> all transactions succeeded with retries.  I tried -t instead of -T but
> -t cannot be used with --max-tries=0.
> 
> Also I think this behavior is somewhat inconsistent with existing
> behavior of pgbench. When pgbench runs without --max-tries option,
> pgbench continues to run transactions even after -T expires:
> 
> $ time pgbench -p 11000 -T 10 -f pgbench.sql test
> pgbench (15devel, server 13.3)
> starting vacuum...end.
> transaction type: pgbench.sql
> scaling factor: 1
> query mode: simple
> number of clients: 1
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 2
> maximum number of tries: 1
> latency average = 7009.006 ms
> initial connection time = 8.045 ms
> tps = 0.142674 (without initial connection time)
> 
> real    0m14.067s
> user    0m0.010s
> sys    0m0.004s
> 
> $ cat pgbench.sql
> SELECT pg_sleep(7);
> 
> So pgbench does not stop transactions after 10 seconds passed but
> waits for the last transaction completes. If we consistent with
> behavior when --max-tries=0, shouldn't we retry until the last
> transaction finishes?

I changed the previous patch so that the -T option can terminate
a retrying transaction and so that --max-tries=0 can be specified without
--latency-limit if -T is given, according to the following comment.

> Doc says "you cannot use an infinite number of retries without latency-limit..."
> 
> Why should this be forbidden? At least if -T timeout takes precedent and
> shortens the execution, ISTM that there could be good reason to test that.
> Maybe it could be blocked only under -t if this would lead to an non-ending
> run.

Indeed, as Ishii-san pointed out, some users might not want to terminate
retrying transactions due to -T. However, the actual negative effect is only
printing the number of failed transactions. The other results that users want to
know, such as tps, are almost unaffected because they are measured over
transactions processed successfully. Actually, the percentage of failed
transactions is very small, only 0.347%.

In the existing behaviour, running transactions are never terminated due to
the -T option. However, ISTM that this is based on the assumption
that the latency of each transaction is small and that the point at which we can
finish the benchmark will come soon.  On the other hand, when transactions can
be retried an unlimited number of times, the run may take much longer than expected,
and we cannot guarantee that it will finish successfully in limited time. Therefore,
terminating the benchmark by giving up retrying the transaction after the time
expires seems reasonable under unlimited retries.  In the sense that we don't
terminate running transactions forcibly, this doesn't change the existing behaviour.

If you don't want to print the number of transactions failed due to -T, we can
forbid using -T without --latency-limit under max-tries=0 to avoid a possible
never-ending benchmark. In this case, users have to limit the number of
transaction retries by specifying --latency-limit or max-tries (>0). However, if some
users would like to benchmark while simply allowing unlimited retries, using -T and
max-tries=0 seems the most straightforward way, so I think it is better that they can be
used together.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Indeed, as Ishii-san pointed out, some users might not want to terminate
> retrying transactions due to -T. However, the actual negative effect is only
> printing the number of failed transactions. The other result that users want to
> know, such as tps, are almost not affected because they are measured for
> transactions processed successfully. Actually, the percentage of failed
> transaction is very little, only 0.347%.

Well, "that's very little, let's ignore it" is not technically a right
direction IMO.

> In the existing behaviour, running transactions are never terminated due to
> the -T option. However, ISTM that this would be based on an assumption
> that a latency of each transaction is small and that a timing when we can
> finish the benchmark would come soon.  On the other hand, when transactions can 
> be retried unlimitedly, it may take a long time more than expected, and we can
> not guarantee that this would finish successfully in limited time.Therefore,  
> terminating the benchmark by giving up to retry the transaction after time
> expiration seems reasonable under unlimited retries.

That's not necessarily true in practice. By the time when -T is about to
expire, transactions are all finished in finite time as you can see
the result I showed. So it's reasonable that the very last cycle of
the benchmark will finish in finite time as well.

Of course if a benchmark cycle takes infinite time, this will be a
problem. However, the same thing can be said of non-retry
benchmarks. Theoretically it is possible that *one* benchmark cycle
takes forever. In this case the only solution will be just hitting ^C
to terminate pgbench. Why can't we make the same assumption in the
--max-tries=0 case?

> In the sense that we don't
> terminate running transactions forcibly, this don't change the existing behaviour. 

This statement seems to depend on your personal assumption.

I still don't understand why you think that the --max-tries non-zero case
will *certainly* finish in finite time whereas the --max-tries=0 case will
not.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 07 Jul 2021 16:11:23 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Indeed, as Ishii-san pointed out, some users might not want to terminate
> > retrying transactions due to -T. However, the actual negative effect is only
> > printing the number of failed transactions. The other result that users want to
> > know, such as tps, are almost not affected because they are measured for
> > transactions processed successfully. Actually, the percentage of failed
> > transaction is very little, only 0.347%.
> 
> Well, "that's very little, let's ignore it" is not technically a right
> direction IMO.

Hmmm, it seems to me these failures are ignorable because, with regard to failures
due to -T, they occur only in the last transaction of each client and do not affect
results such as TPS and the latency of successfully processed transactions
(although I am not sure in what sense you use the word "technically"...).

However, maybe I am missing something. Could you please tell me what you think
the actual harm for users of failures due to -D is?

> > In the existing behaviour, running transactions are never terminated due to
> > the -T option. However, ISTM that this would be based on an assumption
> > that a latency of each transaction is small and that a timing when we can
> > finish the benchmark would come soon.  On the other hand, when transactions can 
> > be retried unlimitedly, it may take a long time more than expected, and we can
> > not guarantee that this would finish successfully in limited time.Therefore,  
> > terminating the benchmark by giving up to retry the transaction after time
> > expiration seems reasonable under unlimited retries.
> 
> That's not necessarily true in practice. By the time when -T is about to
> expire, transactions are all finished in finite time as you can see
> the result I showed. So it's reasonable that the very last cycle of
> the benchmark will finish in finite time as well.

Your script may finish in finite time, but others may not. However,
considering only serialization and deadlock errors, almost all transactions
would finish in finite time eventually. In the previous version of the
patch, errors other than serialization or deadlock could be retried, which
easily caused unlimited retrying. Now, only these two kinds of errors
can be retried; nevertheless, it is unclear to me whether we can assume
that retrying will finish in finite time. If we can assume it, maybe
we can remove the restriction that --max-tries=0 must be used with
--latency-limit or -T.

> Of course if a benchmark cycle takes infinite time, this will be a
> problem. However same thing can be said to non-retry
> benchmarks. Theoretically it is possible that *one* benchmark cycle
> takes forever. In this case the only solution will be just hitting ^C
> to terminate pgbench. Why can't we have same assumption with
> --max-tries=0 case?

Indeed, it is possible that the execution of a query takes a long or infinite
time. However, its cause would be a problematic query in the custom script
or other problems occurring on the server side. These are not problems of
pgbench, and pgbench itself can't control them either. On the other hand, the
unlimited number of tries is a behaviour specified by a pgbench option,
so I think pgbench itself should internally avoid problems caused by its own
behaviour. That is, if max-tries=0 could cause an infinite or much longer
benchmark time than the user expected due to too many retries, I think
pgbench should avoid it.

> > In the sense that we don't
> > terminate running transactions forcibly, this don't change the existing behaviour. 
> 
> This statement seems to depend on your personal assumption.

Ok. If we regard a transaction as still running even while it is being
retried after an error, terminating the retry may imply terminating the running
transaction forcibly.

> I still don't understand why you think that --max-tries non 0 case
> will *certainly* finish in finite time whereas --max-tries=0 case will
> not.

I just mean that --max-tries greater than zero will prevent pgbench from retrying a
transaction forever.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> Well, "that's very little, let's ignore it" is not technically a right
>> direction IMO.
> 
> Hmmm, It seems to me these failures are ignorable because with regard to failures
> due to -T they occur only the last transaction of each client and do not affect
> the result such as TPS and latency of successfully processed transactions.
> (although I am not sure for what sense you use the word "technically"...)

"My application button does not respond once in 100 times. It's just
1% error rate. You should ignore it." I would say this attitude is not
technically correct.

> However, maybe I am missing something. Could you please tell me what do you think
> the actual harm for users about failures due to -D is?

I don't know why you are referring to -D.

>> That's not necessarily true in practice. By the time when -T is about to
>> expire, transactions are all finished in finite time as you can see
>> the result I showed. So it's reasonable that the very last cycle of
>> the benchmark will finish in finite time as well.
> 
> Your script may finish in finite time, but others may not.

That's why I said "practically". In other words "in most cases the
scenario will finish in finite time".

> Indeed, it is possible an execution of a query takes a long or infinite
> time. However, its cause would a problematic query in the custom script
> or other problems occurs on the server side. These are not problem of
> pgbench and, pgbench itself can't control either. On the other hand, the
> unlimited number of tries is a behaviours specified by the pgbench option,
> so I think pgbench itself should internally avoid problems caused from its
> behaviours. That is, if max-tries=0 could cause infinite or much longer
> benchmark time more than user expected due to too many retries, I think
> pgbench should avoid it.

I would say that's the user's responsibility, to avoid infinitely running
benchmarks. Remember, pgbench is a tool for serious users, not for
novice users.

Or, we should terminate the last cycle of the benchmark, regardless of whether
it is retrying or not, if -T expires. This will make pgbench behave much
more consistently.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 07 Jul 2021 21:50:16 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >> Well, "that's very little, let's ignore it" is not technically a right
> >> direction IMO.
> > 
> > Hmmm, It seems to me these failures are ignorable because with regard to failures
> > due to -T they occur only the last transaction of each client and do not affect
> > the result such as TPS and latency of successfully processed transactions.
> > (although I am not sure for what sense you use the word "technically"...)
> 
> "My application button does not respond once in 100 times. It's just
> 1% error rate. You should ignore it." I would say this attitude is not
> technically correct.

I cannot understand what you want to say. Can reporting the number of transactions
that failed intentionally be treated the same as the error rate of your
application's button?

> > However, maybe I am missing something. Could you please tell me what do you think
> > the actual harm for users about failures due to -D is?
> 
> I don't know why you are referring to -D.

Sorry. It's just a typo, as you can imagine.
I am asking what you think the actual harm for users of terminating
retries due to the -T option is.

> >> That's not necessarily true in practice. By the time when -T is about to
> >> expire, transactions are all finished in finite time as you can see
> >> the result I showed. So it's reasonable that the very last cycle of
> >> the benchmark will finish in finite time as well.
> > 
> > Your script may finish in finite time, but others may not.
> 
> That's why I said "practically". In other words "in most cases the
> scenario will finish in finite time".

Sure.

> > Indeed, it is possible an execution of a query takes a long or infinite
> > time. However, its cause would a problematic query in the custom script
> > or other problems occurs on the server side. These are not problem of
> > pgbench and, pgbench itself can't control either. On the other hand, the
> > unlimited number of tries is a behaviours specified by the pgbench option,
> > so I think pgbench itself should internally avoid problems caused from its
> > behaviours. That is, if max-tries=0 could cause infinite or much longer
> > benchmark time more than user expected due to too many retries, I think
> > pgbench should avoid it.
> 
> I would say that's user's responsibility to avoid infinite running
> benchmarking. Remember, pgbench is a tool for serious users, not for
> novice users.

Of course, users themselves should be careful about problematic scripts, but it
would be better for pgbench itself to avoid problems beforehand if it can.
 
> Or, we should terminate the last cycle of benchmark regardless it is
> retrying or not if -T expires. This will make pgbench behaves much
> more consistent.

Hmmm, indeed this might make the behaviour a bit more consistent, but I am not
sure such a behavioural change benefits users.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

I attached the updated patch (v14)!

On Wed, 30 Jun 2021 17:33:24 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> >> --report-latencies -> --report-per-command: should we keep supporting
> >> the previous option?
> >
> > Ok. Although the option is now not only about latencies, considering users who
> > are using the existing option, I'm fine with this. I reverted it to the
> > previous name.
> 
> Hmmm. I liked the new name! My point was whether we need to support the 
> old one as well for compatibility, or whether we should not bother. I'm 
> still wondering. As I think that the new name is better, I'd suggest to 
> keep it.

Ok. I misunderstood it. I returned the option name to report-per-command.

If we keep report-latencies, I can imagine the following choices:
- use report-latencies to print only latency information
- use report-latencies as an alias of report-per-command for compatibility
  and remove it at an appropriate time (that is, treat it as deprecated)

Among these, I prefer the latter because ISTM we would not need many options
for reporting information per command. However, actually, I wonder whether we
have to keep the previous one at all if we plan to remove it eventually.

> >> --failures-detailed: if we bother to run with handling failures, should
> >> it always be on?
> >
> > If we print other failures that cannot be retried in the future, it could produce a lot
> > of lines and might make some users who don't need details of failures annoyed.
> > Moreover, some users would always need information of detailed failures in log,
> > and others would need only total numbers of failures.
> 
> Ok.
> 
> > Currently we handle only serialization and deadlock failures, so the number of
> > lines printed and the number of columns of logging is not large even under the
> > failures-detail, but if we have a chance to handle other failures in future,
> > ISTM adding this option makes sense considering users who would like simple
> > outputs.
> 
> Hmmm. What kind of failures could be managed with retries? I guess that on 
> a connection failure we can try to reconnect, but otherwise it is less 
> clear that other failures make sense to retry.

Indeed, there would be few failures that we should retry, and I cannot imagine
any other than serialization, deadlock, and connection failures for now. However,
considering reporting the number of failed transactions and their causes in the future,
as you said

> Given that we manage errors, ISTM that we should not necessarily
> stop on other not retried errors, but rather count/report them and
> possibly proceed. 

, we could define a few more kinds of failures. At least we can consider
meta-command and other SQL command errors in addition to serialization,
deadlock, and connection failures. So, the total number of kinds of failures would
be at least five, and always reporting all of them would result in a lot of lines
and columns in the logging.

> >> --debug-errors: I'm not sure we should want a special debug mode for that,
> >> I'd consider integrating it with the standard debug, or just for development.
> >
> > I think --debug is a debug option for telling users the pgbench's internal
> > behaviors, that is, which client is doing what. On other hand, --debug-errors
> > is for telling users what error caused a retry or a failure in detail. For
> > users who are not interested in pgbench's internal behavior (sending a command,
> > receiving a result, ... ) but interested in actual errors raised during running
> > script, this option seems useful.
> 
> Ok. This is not really about debug per se, but a verbosity setting?

I think so.

> Maybe --verbose-errors would make more sense? I'm unsure. I'll think about 
> it.

Agreed. This seems more appropriate than the previous one, so I changed the name to
--verbose-errors.

> > Sorry, I couldn't understand your suggestion. Is this about the order of case
> > statements or pg_log_error?
> 
> My sentence got mixed up. My point was about the case order, so that they 
> are put in a more logical order when reading all the cases.

Ok. Considering the logical order, I moved WAIT_ROLLBACK_RESULT between
ERROR and RETRY, because WAIT_ROLLBACK_RESULT comes after the ERROR state,
and RETRY comes after ERROR or WAIT_ROLLBACK_RESULT.
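
For readers following along, a minimal sketch of the resulting logical order of
the error-handling states (the CSTATE_* names are those used in the patch; the
surrounding states of the real enum are omitted and the comments are mine):

/* Illustrative excerpt only: error-handling states in their logical order. */
typedef enum
{
	/* ... the regular per-command states come first ... */
	CSTATE_ERROR,					/* a command failed; decide what to do next */
	CSTATE_WAIT_ROLLBACK_RESULT,	/* a ROLLBACK was sent, wait for its result */
	CSTATE_RETRY,					/* the failed transaction will be run again */
	CSTATE_FAILURE,					/* no more retries: record a failed transaction */
	CSTATE_END_TX					/* end of the transaction */
	/* ... CSTATE_ABORTED and CSTATE_FINISHED follow ... */
} IllustrativeConnectionState;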

> >> Currently, ISTM that the retry on error mode is implicitely always on.
> >> Do we want that? I'd say yes, but maybe people could disagree.
> >
> > The default values of max-tries is 1, so the retry on error is off.
> 
> > Failed transactions are retried only when the user wants it and
> > specifies a valid value to max-treis.
> 
> Ok. My point is that we do not stop on such errors, whereas before ISTM 
> that we would have stopped, so somehow the default behavior has changed 
> and the previous behavior cannot be reinstated with an option. Maybe that 
> is not bad, but this is a behavioral change which needs to be documented 
> and argumented.

Understood. Indeed, there is a behavioural change in whether we abort
the client after some types of errors. Now serialization / deadlock
errors do not abort the client and are recorded as failures, whereas other
errors still cause the client to abort.

If we want to record other errors as failures in the future, we will need
a new option to specify which types of failures (or maybe all types of errors)
should be reported. Until then, ISTM we can treat serialization and
deadlock errors as special cases that are reported as failures.
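
As an illustration of that split, a sketch of the error classification (roughly
following the names discussed in this thread; treat the exact definitions as an
approximation, not the committed code):

#include <stdbool.h>

typedef enum EStatus
{
	ESTATUS_NO_ERROR = 0,
	ESTATUS_META_COMMAND_ERROR,		/* aborts the client */
	/* SQL errors */
	ESTATUS_SERIALIZATION_ERROR,	/* SQLSTATE 40001: may be retried */
	ESTATUS_DEADLOCK_ERROR,			/* SQLSTATE 40P01: may be retried */
	ESTATUS_OTHER_SQL_ERROR			/* aborts the client */
} EStatus;

/* Only serialization and deadlock errors are candidates for a retry. */
static bool
canRetryError(EStatus estatus)
{
	return (estatus == ESTATUS_SERIALIZATION_ERROR ||
			estatus == ESTATUS_DEADLOCK_ERROR);
}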

I rewrote "Failures and Serialization/Deadlock Retries" section a bit to
emphasis that such errors are treated differently than other errors. 

> >> See the attached files for generating deadlocks reliably (start with 2 
> >> clients). What do you think? The PL/pgSQL minimal, it is really 
> >> client-code oriented.
> >
> > Sorry, but I cannot find the attached file.
> 
> Sorry. Attached to this mail. The serialization stuff does not seem to 
> work as well as the deadlock one. Run with 2 clients.

Hmmm, your test didn't work well for me. Both tests got stuck in
pgbench_deadlock_wait() and pgbench didn't finish. 


Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
I have played with the v14 patch. I previously complained that pgbench
always reported 9 errors (actually the number is always the number
specified by "-c" minus 1 in my case).

$ pgbench -p 11000 -c 10  -T 10  --max-tries=0 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 10 s
number of transactions actually processed: 974
number of failed transactions: 9 (0.916%)
number of transactions retried: 651 (66.226%)
total number of retries: 8482
latency average = 101.317 ms (including failures)
initial connection time = 44.440 ms
tps = 97.796487 (without initial connection time)

To reduce the number of errors I provided "--max-tries=9000", since
pgbench reported 8482 retries.

$ pgbench -p 11000 -c 10  -T 10 --max-tries=9000 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
duration: 10 s
number of transactions actually processed: 1133
number of failed transactions: 9 (0.788%)
number of transactions retried: 755 (66.112%)
total number of retries: 9278
maximum number of tries: 9000
latency average = 88.570 ms (including failures)
initial connection time = 23.384 ms
tps = 112.015219 (without initial connection time)

Unfortunately this didn't work. There were still 9 errors because pgbench
terminated the last round of the run.

Then I gave up on -T and switched to -t. The number of transactions for
the -t option was calculated from the total number of transactions actually
processed (1133) divided by the number of clients (10), i.e. about 113.3 per
client, which I rounded up to 120. The result:

$ pgbench -p 11000 -c 10  -t 120 --max-tries=9000 test
pgbench (15devel, server 13.3)
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 1
query mode: simple
number of clients: 10
number of threads: 1
number of transactions per client: 120
number of transactions actually processed: 1200/1200
number of transactions retried: 675 (56.250%)
total number of retries: 8524
maximum number of tries: 9000
latency average = 93.777 ms
initial connection time = 14.120 ms
tps = 106.635908 (without initial connection time)

Finally I was able to get a result without any errors.  This is not a
super simple way to obtain pgbench results without errors, but
probably I can live with it.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello,

> Of course, users themselves should be careful of problematic script, but it
> would be better that pgbench itself avoids problems if pgbench can beforehand.
>
>> Or, we should terminate the last cycle of benchmark regardless it is
>> retrying or not if -T expires. This will make pgbench behaves much
>> more consistent.

I would tend to agree with this behavior, that is not to start any new 
transaction or transaction attempt once -T has expired.

I'm a little hesitant about how to count and report such transactions left
unfinished because of the benchmark timeout, though. Not counting them seems to
be the best option.

> Hmmm, indeed this might make the behaviour a bit consistent, but I am not
> sure such behavioural change benefit users.

The user benefit would be that if they asked for a 100s benchmark, pgbench
makes a reasonable effort not to overshoot that?

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>>> Or, we should terminate the last cycle of benchmark regardless it is
>>> retrying or not if -T expires. This will make pgbench behaves much
>>> more consistent.
> 
> I would tend to agree with this behavior, that is not to start any new
> transaction or transaction attempt once -T has expired.
> 
> I'm a little hesitant about how to count and report such unfinished
> because of bench timeout transactions, though. Not counting them seems
> to be the best option.

I agree.

>> Hmmm, indeed this might make the behaviour a bit consistent, but I am
>> not
>> sure such behavioural change benefit users.
> 
> The user benefit would be that if they asked for a 100s benchmark,
> pgbench does a reasonable effort not to overshot that?

Right.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Tue, 13 Jul 2021 13:00:49 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >>> Or, we should terminate the last cycle of benchmark regardless it is
> >>> retrying or not if -T expires. This will make pgbench behaves much
> >>> more consistent.
> > 
> > I would tend to agree with this behavior, that is not to start any new
> > transaction or transaction attempt once -T has expired.

That is the behavior in the latest patch. Once -T has expired, no new
transaction or retry is started.

IIUC, Ishii-san's proposal was to change pgbench's behavior when -T has
expired so that any running transaction is terminated immediately, regardless
of retrying. I am not sure we should do that in this patch. If we want this
change, it should be done in another patch as an improvement of the -T option.

> > I'm a little hesitant about how to count and report such unfinished
> > because of bench timeout transactions, though. Not counting them seems
> > to be the best option.
> 
> I agree.

I also agree. Although I couldn't get an answer about what he thinks the actual
harm for users is when retrying is terminated by the -T option, I guess the
complaint was just about reporting the termination of retrying as failures.
Therefore, I will change the code to finish the benchmark when the time is over
during retrying, that is, change the state to CSTATE_FINISHED instead of
CSTATE_ERROR in such cases.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> > I would tend to agree with this behavior, that is not to start any new
>> > transaction or transaction attempt once -T has expired.
> 
> That is the behavior in the latest patch. Once -T has expired, any new
> transaction or retry does not start. 

Actually v14 has not changed the behavior in this regard, as explained
in a different email:

> $ pgbench -p 11000 -c 10  -T 10  --max-tries=0 test
> pgbench (15devel, server 13.3)
> starting vacuum...end.
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 1
> query mode: simple
> number of clients: 10
> number of threads: 1
> duration: 10 s
> number of transactions actually processed: 974
> number of failed transactions: 9 (0.916%)
> number of transactions retried: 651 (66.226%)
> total number of retries: 8482
> latency average = 101.317 ms (including failures)
> initial connection time = 44.440 ms
> tps = 97.796487 (without initial connection time)

>> > I'm a little hesitant about how to count and report such unfinished
>> > because of bench timeout transactions, though. Not counting them seems
>> > to be the best option.
>> 
>> I agree.
> 
> I also agree. Although I  couldn't get an answer what does he think the actual
> harm for users due to termination of retrying by the -T option is, I guess it just
> complained about reporting the termination of retrying  as failures. Therefore,
> I will fix to finish the benchmark when the time is over during retrying, that is,
> change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.

I guess Fabien wanted it differently. Suppose "-c 10 and -T 30" and we
have 100 successful transactions by time 25. At time 25 pgbench starts the
next benchmark cycle, and by time 30 there are 10 failing transactions
(because they are retrying). pgbench stops the execution at time
30. According to your proposal (change the state to CSTATE_FINISHED
instead of CSTATE_ERROR) the total number of successful transactions would
be 100 + 10 = 110, right? I guess Fabien wants the number to
be 100 rather than 110.

Fabien,
Please correct me if you think differently.

Also, actually I have explained the harm a number of times but you have
kept ignoring it because "it's subtle". My request has been pretty
simple.

> number of failed transactions: 9 (0.916%)

I don't like this and want the number of failed transactions to be 0.
Who wants a benchmark result with errors?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Tue, 13 Jul 2021 14:35:00 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> >> > I would tend to agree with this behavior, that is not to start any new
> >> > transaction or transaction attempt once -T has expired.
> > 
> > That is the behavior in the latest patch. Once -T has expired, any new
> > transaction or retry does not start. 
> 
> Actually v14 has not changed the behavior in this regard as explained
> in different email:

Right. Neither v13 nor v14 starts any new transaction or retry once
-T has expired.

> >> > I'm a little hesitant about how to count and report such unfinished
> >> > because of bench timeout transactions, though. Not counting them seems
> >> > to be the best option.
> >> 
> >> I agree.
> > 
> > I also agree. Although I  couldn't get an answer what does he think the actual
> > harm for users due to termination of retrying by the -T option is, I guess it just
> > complained about reporting the termination of retrying  as failures. Therefore,
> > I will fix to finish the benchmark when the time is over during retrying, that is,
> > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.
> 
> I guess Fabien wanted it differently. Suppose "-c 10 and -T 30" and we
> have 100 success transactions by time 25. At time 25 pgbench starts
> next benchmark cycle and by time 30 there are 10 failing transactions
> (because they are retrying). pgbench stops the execution at time
> 30. According your proposal (change the state to CSTATE_FINISHED
> instead of CSTATE_ERROR) the total number of success transactions will
> be 100 + 10 = 110, right? 

No. The last failed transaction is not counted because CSTATE_END_TX is
bypassed, so please don't worry.

> Also actually I have explained the harm number of times but you have
> kept on ignoring it because "it's subtle". My request has been pretty
> simple.
> 
> > number of failed transactions: 9 (0.916%)
> 
> I don't like this and want to have the failed transactions to be 0.
> Who wants a benchmark result having errors?

I was asking because I wanted to confirm what you were really complaining
about: whether the problem is that a retrying transaction is terminated by the
-T option, or that pgbench reports it in the number of failed transactions? But
now I understand it is the latter: you don't want to count the termination
of retrying as a failure. Thanks.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello,

I attached the updated patch.

On Tue, 13 Jul 2021 15:50:52 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:
 
> > >> > I'm a little hesitant about how to count and report such unfinished
> > >> > because of bench timeout transactions, though. Not counting them seems
> > >> > to be the best option.

> > > I will fix to finish the benchmark when the time is over during retrying, that is,
> > > change the state to CSTATE_FINISHED instead of CSTATE_ERROR in such cases.

Done.
(I wrote CSTATE_ERROR, but correctly it is CSTATE_FAILURE.)
 
Now, once the timer has expired while retrying a failed transaction, pgbench never
starts a new retry. If the transaction succeeds, it will be counted in the result.
Otherwise, if the transaction fails again, it is not counted.


In addition, I fixed the patch to work well with pipeline mode. Previously, pipeline
mode was not sufficiently considered and ROLLBACK was not sent correctly. I fixed the
error handling in pipeline mode, and now it works.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
> I attached the updated patch.

# About pgbench error handling v15

Patches apply cleanly. Compilation, global and local tests ok.

  - v15.1: refactoring is a definite improvement.
    Good, even if it is not very useful (see below).

    While restructuring, maybe predefined variables could be made read-only
    so that a script which updates them would fail, which would be a
    good thing. This is probably material for an independent patch.

  - v15.2: see detailed comments below

# Doc

Doc build is ok.

ISTM that "number of tries" line would be better placed between the 
#threads and #transactions lines. What do you think?

Aggregate logging description: "{ failures | ... }" seems misleading 
because it suggests we have one or the other, whereas it can also be 
empty. I suggest: "{ | failures | ... }" to show the empty case.

Having a full example with retries in the doc is a good thing, and
illustrates in passing that running with a number of clients on a small
scale does not make much sense because of the contention on
tellers/branches. I'd wonder whether the number of tries is set too high,
though; ISTM that an application should give up before 100? I like that
the feature is also limited by the latency limit.

Minor editing:

"there're" -> "there are".

"the --time" -> "the --time option".

The overall English seems good, but I'm not a native speaker. As I already said,
proofreading by a native speaker would be nice.

From a technical writing point of view, maybe the documentation could be improved a bit,
but I'm not at ease on that subject. Some comments:

"The latency for failed transactions and commands is not computed separately." is unclear,
please use a positive sentence to tell what is true instead of what is not and the reader
has to guess. Maybe: "The latency figures include failed transactions which have reached
the maximum number of tries or the transaction latency limit.".

"The main report contains the number of failed transactions if it is non-zero." ISTM that
this is a pain for scripts which would like to process these reports data, because the data
may or may not be there. I'm sure to write such scripts, which explains my concern:-)

"If the total number of retried transactions is non-zero…" should it rather be "not one",
because zero means unlimited retries?

The section describing the various type of errors that can occur is a good addition.

Option "--report-latencies" changed to "--report-per-commands": I'm fine with this change.

# FEATURES

--failures-detailed: I'm not convinced that this option should not always be on, but
this is not very important, so let it be.

--verbose-errors: I still think this is only for debugging, but let it be.

Copying variables: ISTM that we should not need to save the variable
states… no clearing, no copying should be needed. The restarted
transaction simply overrides the existing variables, which is what the
previous version was doing anyway. The scripts should write their own
variables before using them, and if they don't then it is the user's
problem. This is important for performance, because it means that after a
client has executed all scripts once the variable array is stable and does
not incur significant maintenance costs. The only thing that needs saving
for retry is the pseudo-random generator state. This suggests simplifying
or removing "RetryState".

# CODE

The semantics of "cnt" is changed. Ok, the overall counters and their 
relationships make sense, and it simplifies the reporting code. Good.

In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and
could be dealt with in the default case. We are only interested in
serialization/deadlocks, which are fatal errors?

doRetry: for consistency, given the assert, ISTM that it should return 
false if duration has expired, by testing end_time or timer_exceeded.

checkTransactionStatus: this function does several things at once with 2
booleans, which makes it not very clear to me. Maybe it would be clearer if
it would just return an enum (in trans, not in trans, conn error, other 
error). Another reason to do that is that on connection error pgbench 
could try to reconnect, which would be an interesting later extension, so 
let's pave the way for that.  Also, I do not think that the function 
should print out a message, it should be the caller decision to do that.

verbose_errors: there is more or less repeated code under RETRY and 
FAILURE, which should be factored out in a separate function. The 
advanceConnectionState function is long enough. Once this is done, there is no
need for a getLatencyUsed function.

I'd put cleaning up the pipeline in a function. I do not understand why
the pipeline mode is not exited in all cases; the code checks the
pipeline status twice within a few lines. I'd put this cleanup in the sync
function as well, and report to the caller (advanceConnectionState) if there
was an error, which would be managed there.

WAIT_ROLLBACK_RESULT: consuming results in a while loop could be a function to
avoid code repetition (there and in the "error:" label in 
readCommandResponse). On the other hand, I'm not sure why the loop is 
needed: we can only get there by submitting a "ROLLBACK" command, so there 
should be only one result anyway?

report_per_command: please always count retries and failures of commands
even if they will not be reported in the end; the code will be simpler and
more efficient.

doLog: the format has changed, including a new string on failures which
replaces the time field. Hmmm. Cannot say I like it much, but why not. ISTM
that the string could be shortened to "deadlock" or "serialization". ISTM
that the documentation example should include a line with a failure, to 
make it clear what to expect.

I'm okay with always getting computing thread stats.

# COMMENTS

struct StatsData comment is helpful.
  - "failed transactions" -> "unsuccessfully retried transactions"?
  - 'cnt' decomposition: first term is field 'retried'? if so say it
    explicitly?

"Complete the failed transaction" sounds strange: If it failed, it could 
not complete? I'd suggest "Record a failed transaction".

# TESTS

I suggested to simplify the tests by using conditionals & sequences. You 
reported that you got stuck. Hmmm.

I tried again my tests which worked fine when started with 2 clients, 
otherwise they get stuck because the first client waits for the other one 
which does not exist (the point is to generate deadlocks and other
errors). Maybe this is your issue?

Could you try with:

   psql < deadlock_prep.sql
   pgbench -t 4 -c 2 -f deadlock.sql
   # note: each deadlock detection takes 1 second

   psql < deadlock_prep.sql
   pgbench -t 10 -c 2 -f serializable.sql
   # very quick 50% serialization errors

-- 
Fabien.

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

Thank you so much for your review. 

Sorry for the late reply. I had stopped working on it due to other
jobs, but I have come back to it. I attached the updated patch. I would
appreciate it if you could review it again.

On Mon, 19 Jul 2021 20:04:23 +0200 (CEST)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> # About pgbench error handling v15
> 
> Patches apply cleanly. Compilation, global and local tests ok.
> 
>   - v15.1: refactoring is a definite improvement.
>     Good, even if it is not very useful (see below).

Ok, we don't need to save variables in order to implement the
retry feature in pgbench, as you suggested. Well, should we completely
separate these two patches, and should I fix v15.2 so it does not rely on v15.1?

>     While restructuring, maybe predefined variables could be make readonly
>     so that a script which would update them would fail, which would be a
>     good thing. Maybe this is probably material for an independent patch.

Yes, it should be material for an independent patch.

>   - v15.2: see detailed comments below
> 
> # Doc
> 
> Doc build is ok.
> 
> ISTM that "number of tries" line would be better placed between the 
> #threads and #transactions lines. What do you think?

Agreed. Fixed.

> Aggregate logging description: "{ failures | ... }" seems misleading 
> because it suggests we have one or the other, whereas it can also be 
> empty. I suggest: "{ | failures | ... }" to show the empty case.

The description is correct because either "failures" or "both of
serialization_failures and deadlock_failures" should appear in aggregate
logging. If "failures" were printed only when some transaction failed,
each line in the aggregate logging could have a different number of columns,
which would make it difficult to parse the results.

> I'd wonder whether the number of tries is set too high, 
> though, ISTM that an application should give up before 100? 

Indeed, max-tries=100 seems too high for a practical system.

Also, I noticed that the sum of the latencies of each command (= 15.839 ms)
is significantly larger than the latency average (= 10.870 ms),
because the "per command" results in the documentation were fixed.

So, I retook a measurement on my machine for more accurate documentation. I
used max-tries=10.

> Minor editing:
> 
> "there're" -> "there are".
> 
> "the --time" -> "the --time option".

Fixed.

> "The latency for failed transactions and commands is not computed separately." is unclear,
> please use a positive sentence to tell what is true instead of what is not and the reader
> has to guess. Maybe: "The latency figures include failed transactions which have reached
> the maximum number of tries or the transaction latency limit.".

I'm not the original author of this description, but I guess this means "The latency is
measured only for successful transactions and commands but not for failed transactions
or commands.".

> "The main report contains the number of failed transactions if it is non-zero." ISTM that
> this is a pain for scripts which would like to process these reports data, because the data
> may or may not be there. I'm sure to write such scripts, which explains my concern:-)

I agree with you. I changed the behavior to always report the number of failed
transactions regardless of whether it is non-zero.

> "If the total number of retried transactions is non-zero…" should it rather be "not one",
> because zero means unlimited retries?

I guess that this means the actual number of retried transactions, not max-tries, so
"non-zero" was correct. However, for the same reason as above, I changed the behavior
to always report the retry statistics regardless of the actual retry numbers.

> 
> # FEATURES
 
> Copying variables: ISTM that we should not need to save the variables 
> states… no clearing, no copying should be needed. The restarted 
> transaction simply overrides the existing variables which is what the 
> previous version was doing anyway. The scripts should write their own 
> variables before using them, and if they don't then it is the user 
> problem. This is important for performance, because it means that after a 
> client has executed all scripts once the variable array is stable and does 
> not incur significant maintenance costs. The only thing that needs saving 
> for retry is the speudo-random generator state. This suggest simplifying 
> or removing "RetryState".

Yes. Saving the variable states is not necessary because we retry the
whole script. It was necessary in the initial patch because that patch
planned to retry a single transaction within the script. I removed
RetryState and copyVariables.
 
> # CODE
 
> In readCommandResponse: ISTM that PGRES_NONFATAL_ERROR is not needed and 
> could be dealt with the default case. We are only interested in 
> serialization/deadlocks which are fatal errors?

We need PGRES_NONFATAL_ERROR to save st->estatus. It is used outside
readCommandResponse to determine whether we should abort or not.
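
Roughly, the error status is derived from the SQLSTATE attached to the error
result, so PGRES_NONFATAL_ERROR and PGRES_FATAL_ERROR can go through the same
path. A sketch, reusing the EStatus classification sketched earlier (the helper
name and details are an approximation of the patch, not the committed code):

#include <string.h>
#include "libpq-fe.h"			/* also provides PG_DIAG_SQLSTATE */

/* Map the SQLSTATE of a failed command to an error status. */
static EStatus
getSQLErrorStatus(const char *sqlState)
{
	if (sqlState != NULL)
	{
		if (strcmp(sqlState, "40001") == 0)
			return ESTATUS_SERIALIZATION_ERROR;
		if (strcmp(sqlState, "40P01") == 0)
			return ESTATUS_DEADLOCK_ERROR;
	}
	return ESTATUS_OTHER_SQL_ERROR;
}

/* in readCommandResponse(), on an error result: */
/*     st->estatus = getSQLErrorStatus(PQresultErrorField(res, PG_DIAG_SQLSTATE)); */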

> doRetry: for consistency, given the assert, ISTM that it should return 
> false if duration has expired, by testing end_time or timer_exceeded.

Ok. I fixed doRetry to check timer_exceeded again.
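
For illustration, a sketch of the resulting retry decision; the globals
(timer_exceeded, max_tries, latency_limit) and the CState fields are
assumptions based on the pgbench sources discussed here, and the committed
function may differ in detail:

/*
 * Decide whether the failed transaction should be run once more:
 * every configured limit must still allow it.
 */
static bool
doRetry(CState *st, pg_time_usec_t *now)
{
	/* never start a retry once the -T duration has expired */
	if (timer_exceeded)
		return false;

	/* respect --max-tries (0 means unlimited) */
	if (max_tries && st->tries >= max_tries)
		return false;

	/* respect --latency-limit, if given */
	if (latency_limit)
	{
		pg_time_now_lazy(now);
		if (*now - st->txn_scheduled > latency_limit)
			return false;
	}

	return true;
}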
 
> checkTransactionStatus: this function does several things at once with 2 
> booleans, which make it not very clear to me. Maybe it would be clearer if 
> it would just return an enum (in trans, not in trans, conn error, other 
> error). Another reason to do that is that on connection error pgbench 
> could try to reconnect, which would be an interesting later extension, so 
> let's pave the way for that.  Also, I do not think that the function 
> should print out a message, it should be the caller decision to do that.

OK. I added a new enum type TStatus and I fixed the function to return it.
Also, I changed the function name to getTransactionStatus because the
actual check is done by the caller.
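
A minimal sketch of that interface, built on libpq's PQtransactionStatus (the
enum values are illustrative and may differ slightly from the committed names):

#include "libpq-fe.h"

typedef enum TStatus
{
	TSTATUS_IDLE,			/* not inside a transaction block */
	TSTATUS_IN_BLOCK,		/* inside a (possibly failed) transaction block */
	TSTATUS_CONN_ERROR,		/* the connection is broken */
	TSTATUS_OTHER_ERROR		/* unexpected state */
} TStatus;

/* Report the transaction status; the caller decides how to react and what to log. */
static TStatus
getTransactionStatus(PGconn *con)
{
	switch (PQtransactionStatus(con))
	{
		case PQTRANS_IDLE:
			return TSTATUS_IDLE;
		case PQTRANS_INTRANS:
		case PQTRANS_INERROR:
			return TSTATUS_IN_BLOCK;
		case PQTRANS_UNKNOWN:
			/* PQTRANS_UNKNOWN is expected when the connection is gone */
			if (PQstatus(con) == CONNECTION_BAD)
				return TSTATUS_CONN_ERROR;
			return TSTATUS_OTHER_ERROR;
		default:
			/* PQTRANS_ACTIVE should not happen in a synchronous client */
			return TSTATUS_OTHER_ERROR;
	}
}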

> verbose_errors: there is more or less repeated code under RETRY and 
> FAILURE, which should be factored out in a separate function. The 
> advanceConnectionFunction is long enough. Once this is done, there is no 
> need for a getLatencyUsed function.

OK. I made a function to print verbose error messages and removed the
getLatencyUsed function.
 
> I'd put cleaning up the pipeline in a function. I do not understand why 
> the pipeline mode is not exited in all cases, the code checks for the 
> pipeline status twice in a few lines. I'd put this cleanup in the sync 
> function as well, report to the caller (advanceConnectionState) if there 
> was an error, which would be managed there.

I changed the code to exit the pipeline whenever we hit an error in pipeline mode.
Also, I added a PQpipelineSync call that was missing in the previous patch.
 
> WAIT_ROLLBACK_RESULT: consumming results in a while could be a function to 
> avoid code repetition (there and in the "error:" label in 
> readCommandResponse). On the other hand, I'm not sure why the loop is 
> needed: we can only get there by submitting a "ROLLBACK" command, so there 
> should be only one result anyway?

Right. We should receive just one PGRES_COMMAND_OK and a NULL following it.
I eliminated the loop.
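
For illustration, a sketch of consuming that single expected result;
consumeRollbackResult is a hypothetical helper name, the patch does this
inline in the WAIT_ROLLBACK_RESULT branch:

#include <stdbool.h>
#include "libpq-fe.h"

/* Consume the one PGRES_COMMAND_OK expected from the ROLLBACK that was sent. */
static bool
consumeRollbackResult(PGconn *con)
{
	PGresult   *res = PQgetResult(con);
	bool		ok = (res != NULL && PQresultStatus(res) == PGRES_COMMAND_OK);

	PQclear(res);

	/* only one command was submitted, so the next result must be NULL */
	if (PQgetResult(con) != NULL)
		ok = false;

	return ok;
}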
 
> report_per_command: please always count retries and failures of commands 
> even if they will not be reported in the end, the code will be simpler and 
> more efficient.

Ok. I changed the code to count retries and failures of commands even when
report_per_command is false.
 
> doLog: the format has changed, including a new string on failures which 
> replace the time field. Hmmm. Cannot say I like it much, but why not. ISTM 
> that the string could be shorten to "deadlock" or "serialization". ISTM 
> that the documentation example should include a line with a failure, to 
> make it clear what to expect.

I fixed getResultString to return "deadlock" or "serialization" instead of
"deadlock_failure" or "serialization_failure". Also, I added an output
example to the documentation.
 
> I'm okay with always getting computing thread stats.
> 
> # COMMENTS
> 
> struct StatsData comment is helpful.
>   - "failed transactions" -> "unsuccessfully retried transactions"?

This seems an accurate description. However, "failed transaction" is
short and simple, and it is used in several places, so instead of
replacing them I added the following statement to define it:

"a failed transaction is defined as an unsuccessfully retried transaction."

>   - 'cnt' decomposition: first term is field 'retried'? if so say it
>     explicitely?

No. 'retried' includes unsuccessfully retried transactions, but 'cnt'
includes only successfully retried transactions.

> "Complete the failed transaction" sounds strange: If it failed, it could 
> not complete? I'd suggest "Record a failed transaction".

Sounds good. Fixed.

> # TESTS
> 
> I suggested to simplify the tests by using conditionals & sequences. You 
> reported that you got stuck. Hmmm.
> 
> I tried again my tests which worked fine when started with 2 clients, 
> otherwise they get stuck because the first client waits for the other one 
> which does not exists (the point is to generate deadlocks and other 
> errors). Maybe this is your issue?

That seems to be right. It got stuck when I used the -T option rather than -t;
I guess that was because the number of transactions on each thread was
different.

> Could you try with:
> 
>    psql < deadlock_prep.sql
>    pgbench -t 4 -c 2 -f deadlock.sql
>    # note: each deadlock detection takes 1 second
> 
>    psql < deadlock_prep.sql
>    pgbench -t 10 -c 2 -f serializable.sql
>    # very quick 50% serialization errors

That works. However, it still hangs when --max-tries = 2,
so I don't think we can use it for testing the retry
feature....

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
Hi Yugo and Fabien,

It seems the patch is ready for committer except for the item below. Do you guys
want to do more on it?

>> # TESTS
>> 
>> I suggested to simplify the tests by using conditionals & sequences. You 
>> reported that you got stuck. Hmmm.
>> 
>> I tried again my tests which worked fine when started with 2 clients, 
>> otherwise they get stuck because the first client waits for the other one 
>> which does not exists (the point is to generate deadlocks and other 
>> errors). Maybe this is your issue?
> 
> That seems to be right. It got stuck when I used -T option rather than -t,
> it was because, I guess, the number of transactions on each thread was
> different.
> 
>> Could you try with:
>> 
>>    psql < deadlock_prep.sql
>>    pgbench -t 4 -c 2 -f deadlock.sql
>>    # note: each deadlock detection takes 1 second
>> 
>>    psql < deadlock_prep.sql
>>    pgbench -t 10 -c 2 -f serializable.sql
>>    # very quick 50% serialization errors
> 
> That works. However, it still gets hang when --max-tries = 2,
> so maybe I would not think we can use it for testing the retry
> feature....

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Tatsuo-san,

> It seems the patch is ready for committer except below. Do you guys want 
> to do more on below?

I'm planning a new review of this significant patch, possibly over the 
next week-end, or the next.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Fabien COELHO
Дата:
Hello Yugo-san,

About Pgbench error handling v16:

This patch set needs a minor rebase because of 506035b0. Otherwise, patch 
compiles, global and local "make check" are ok. Doc generation is ok.

This patch is in good shape, the code and comments are clear.
Some minor remarks below, including typos and a few small suggestions.


## About v16-1

This refactoring patch adds a struct for managing pgbench variables, instead of
mixing fields into the client state (CState) struct.

Patch compiles, global and local "make check" are both ok.

Although this patch is not necessary to add the feature, I'm fine with it as
it improves pgbench source code readability.


## About v16-2

This last patch adds handling of serialization and deadlock errors to pgbench
transactions. This feature is desirable because it enlarges the performance testing
options, and makes pgbench behave more like a database client application.

Possible future extensions enabled by this patch include handling disconnection
errors by trying to reconnect, for instance.

The documentation is clear and well written, at least for my non-native speaker
eyes and ears.

English: "he will be aborted" -> "it will be aborted".

I'm fine with renaming --report-latencies to --report-per-command as the later
is clearer about what the options does.

I'm still not sure I like the "failure detailed" option, ISTM that the report
could be always detailed. That would remove some complexity and I do not think
that people executing a bench with error handling would mind having the details.
No big deal.

printVerboseErrorMessages: I'd make the buffer static and initialized only once
so that there is no significant malloc/free cycle involved when calling the function.

advanceConnectionState: I'd really prefer not to add new variables (res, status)
in the loop scope, and only declare them when actually needed in the state branches,
so as to avoid any unwanted interaction between states.

typo: "fullowing" -> "following"

Pipeline cleaning: the advance function is already soooo long, I'd put that in a
separate function and call it.

I think that the report should not remove data when they are 0, otherwise it makes
it harder to script around it (in failures_detailed on line 6284).

The tests cover the different cases. I tried to suggest a simpler approach
in a previous round, but it seems not so simple to do. They could be
simplified later, if possible.

-- 
Fabien.



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hello Fabien,

On Sat, 12 Mar 2022 15:54:54 +0100 (CET)
Fabien COELHO <coelho@cri.ensmp.fr> wrote:

> Hello Yugo-san,
> 
> About Pgbench error handling v16:

Thank you for your review! I attached the updated patches.
 
> This patch set needs a minor rebase because of 506035b0. Otherwise, patch 
> compiles, global and local "make check" are ok. Doc generation is ok.

I rebased it.

> ## About v16-2
 
> English: "he will be aborted" -> "it will be aborted".

Fixed.

> I'm still not sure I like the "failure detailed" option, ISTM that the report
> could be always detailed. That would remove some complexity and I do not think
> that people executing a bench with error handling would mind having the details.
> No big deal.

I didn't change it because I think those who don't expect any failures when using a
well-designed script may not need the details of failures. I think reporting such
details will be required only for benchmarks where failures are expected.

> printVerboseErrorMessages: I'd make the buffer static and initialized only once
> so that there is no significant malloc/free cycle involved when calling the function.

OK. I fixed printVerboseErrorMessages to use a static variable.
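
A sketch of the change (the signature and message text are approximations;
CState and pg_time_usec_t are pgbench's internal types):

#include "pqexpbuffer.h"
#include "common/logging.h"

static void
printVerboseErrorMessages(CState *st, pg_time_usec_t *now, bool is_retry)
{
	static PQExpBuffer buf = NULL;

	if (buf == NULL)
		buf = createPQExpBuffer();	/* allocated once, reused on later calls */

	/* printfPQExpBuffer() resets the buffer before writing into it */
	printfPQExpBuffer(buf, "client %d ", st->id);
	appendPQExpBufferStr(buf, is_retry
						 ? "repeats the transaction after the error"
						 : "ends the failed transaction");
	/* ... the error status and the retry/failure counters are appended here ... */

	pg_log_info("%s", buf->data);
}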

> advanceConnectionState: I'd really prefer not to add new variables (res, status)
> in the loop scope, and only declare them when actually needed in the state branches,
> so as to avoid any unwanted interaction between states.

I changed the code to declare the variables in the case statement blocks.

> typo: "fullowing" -> "following"

fixed.

> Pipeline cleaning: the advance function is already soooo long, I'd put that in a
> separate function and call it.

Ok. I made a new function "discardUntilSync" for the pipeline cleaning.

> I think that the report should not remove data when they are 0, otherwise it makes
> it harder to script around it (in failures_detailed on line 6284).

I changed the report to always include both serialization and deadlock failures,
even when they are 0.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
Hi Yugo,

I have looked into the patch and I noticed that <xref
linkend=... endterm=...> is used in pgbench.sgml. e.g.

<xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>

AFAIK this is the only place where "endterm" is used. In other places
"link" tag is used instead:

<link linkend="failures-and-retries">Failures and Serialization/Deadlock Retries</link>

Note that the rendered result is identical. Do we want to use the link tag as well?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
Hi Yugo,

I tested the serialization error scenario by setting:
default_transaction_isolation = 'repeatable read'
The result was:

$ pgbench -t 10 -c 10 --max-tries=10 test
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 10
query mode: simple
number of clients: 10
number of threads: 1
maximum number of tries: 10
number of transactions per client: 10
number of transactions actually processed: 100/100
number of failed transactions: 0 (0.000%)
number of transactions retried: 35 (35.000%)
total number of retries: 74
latency average = 5.306 ms
initial connection time = 15.575 ms
tps = 1884.516810 (without initial connection time)

I had a hard time understanding what those numbers mean:
number of transactions retried: 35 (35.000%)
total number of retries: 74

It seems "total number of retries" matches with the number of ERRORs
reported in PostgreSQL. Good. What I am not sure is "number of
transactions retried". What does this mean?

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Hi Yugo,
> 
> I tested with serialization error scenario by setting:
> default_transaction_isolation = 'repeatable read'
> The result was:
> 
> $ pgbench -t 10 -c 10 --max-tries=10 test
> transaction type: <builtin: TPC-B (sort of)>
> scaling factor: 10
> query mode: simple
> number of clients: 10
> number of threads: 1
> maximum number of tries: 10
> number of transactions per client: 10
> number of transactions actually processed: 100/100
> number of failed transactions: 0 (0.000%)
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> latency average = 5.306 ms
> initial connection time = 15.575 ms
> tps = 1884.516810 (without initial connection time)
> 
> I had hard time to understand what those numbers mean:
> number of transactions retried: 35 (35.000%)
> total number of retries: 74
> 
> It seems "total number of retries" matches with the number of ERRORs
> reported in PostgreSQL. Good. What I am not sure is "number of
> transactions retried". What does this mean?

Oh, ok. I see it now. It turned out that "number of transactions
retried" does not actually mean the number of transactions
retried. Suppose pgbench executes the following in a session:

BEGIN;    -- transaction A starts
:
(ERROR)
ROLLBACK; -- transaction A aborts

(retry)

BEGIN;    -- transaction B starts
:
(ERROR)
ROLLBACK; -- transaction B aborts

(retry)

BEGIN;    -- transaction C starts
:
END;    -- finally succeeds

In this case "total number of retries:" = 2 and "number of
transactions retried:" = 1. In this patch transactions A, B and C are
regarded as "same" transaction, so the retried transaction count
becomes 1. But it's confusing to use the language "transaction" here
because A, B and C are different transactions. I would think it's
better to use different language instead of "transaction", something
like "cycle"? i.e.

number of cycles retried: 35 (35.000%)

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
Hi Ishii-san,

On Sun, 20 Mar 2022 09:52:06 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> Hi Yugo,
> 
> I have looked into the patch and I noticed that <xref
> linkend=... endterm=...> is used in pgbench.sgml. e.g.
> 
> <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>
> 
> AFAIK this is the only place where "endterm" is used. In other places
> "link" tag is used instead:

Thank you for pointing it out.

I've checked other places using <xref/> referring to <refsect2>, and found
that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it 
in this style.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Sun, 20 Mar 2022 16:11:43 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Hi Yugo,
> > 
> > I tested with serialization error scenario by setting:
> > default_transaction_isolation = 'repeatable read'
> > The result was:
> > 
> > $ pgbench -t 10 -c 10 --max-tries=10 test
> > transaction type: <builtin: TPC-B (sort of)>
> > scaling factor: 10
> > query mode: simple
> > number of clients: 10
> > number of threads: 1
> > maximum number of tries: 10
> > number of transactions per client: 10
> > number of transactions actually processed: 100/100
> > number of failed transactions: 0 (0.000%)
> > number of transactions retried: 35 (35.000%)
> > total number of retries: 74
> > latency average = 5.306 ms
> > initial connection time = 15.575 ms
> > tps = 1884.516810 (without initial connection time)
> > 
> > I had hard time to understand what those numbers mean:
> > number of transactions retried: 35 (35.000%)
> > total number of retries: 74
> > 
> > It seems "total number of retries" matches with the number of ERRORs
> > reported in PostgreSQL. Good. What I am not sure is "number of
> > transactions retried". What does this mean?
> 
> Oh, ok. I see it now. It turned out that "number of transactions
> retried" does not actually means the number of transactions
> rtried. Suppose pgbench exectutes following in a session:
> 
> BEGIN;    -- transaction A starts
> :
> (ERROR)
> ROLLBACK; -- transaction A aborts
> 
> (retry)
> 
> BEGIN;    -- transaction B starts
> :
> (ERROR)
> ROLLBACK; -- transaction B aborts
> 
> (retry)
> 
> BEGIN;    -- transaction C starts
> :
> END;    -- finally succeeds
> 
> In this case "total number of retries:" = 2 and "number of
> transactions retried:" = 1. In this patch transactions A, B and C are
> regarded as "same" transaction, so the retried transaction count
> becomes 1. But it's confusing to use the language "transaction" here
> because A, B and C are different transactions. I would think it's
> better to use different language instead of "transaction", something
> like "cycle"? i.e.
> 
> number of cycles retried: 35 (35.000%)

In the original patch by Marina Polyakova it was "number of retried",
but I changed it to "number of transactions retried" because I felt
it was easily confused with "number of retries". I chose the word "transaction"
because a transaction ends in any one of successful commit, skip, or
failure, after possible retries.

Well, I agree that the wording is somewhat confusing. If we can find
a nicer word to resolve the confusion, I don't mind changing it.
Maybe we can use "executions" as well as "cycles". However, I am not sure
that the situation would be improved by such a word, because what it
exactly means would still be unclear to users.

Another idea is to instead report only "the number of successfully
retried transactions", which does not include "failed transactions",
that is, transactions that failed after retries, like this:

 number of transactions actually processed: 100/100
 number of failed transactions: 0 (0.000%)
 number of successfully retried transactions: 35 (35.000%)
 total number of retries: 74 

The meaning is clear and there seems to be no confusion.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> On Sun, 20 Mar 2022 16:11:43 +0900 (JST)
> Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> 
>> > Hi Yugo,
>> > 
>> > I tested with serialization error scenario by setting:
>> > default_transaction_isolation = 'repeatable read'
>> > The result was:
>> > 
>> > $ pgbench -t 10 -c 10 --max-tries=10 test
>> > transaction type: <builtin: TPC-B (sort of)>
>> > scaling factor: 10
>> > query mode: simple
>> > number of clients: 10
>> > number of threads: 1
>> > maximum number of tries: 10
>> > number of transactions per client: 10
>> > number of transactions actually processed: 100/100
>> > number of failed transactions: 0 (0.000%)
>> > number of transactions retried: 35 (35.000%)
>> > total number of retries: 74
>> > latency average = 5.306 ms
>> > initial connection time = 15.575 ms
>> > tps = 1884.516810 (without initial connection time)
>> > 
>> > I had hard time to understand what those numbers mean:
>> > number of transactions retried: 35 (35.000%)
>> > total number of retries: 74
>> > 
>> > It seems "total number of retries" matches with the number of ERRORs
>> > reported in PostgreSQL. Good. What I am not sure is "number of
>> > transactions retried". What does this mean?
>> 
>> Oh, ok. I see it now. It turned out that "number of transactions
>> retried" does not actually means the number of transactions
>> rtried. Suppose pgbench exectutes following in a session:
>> 
>> BEGIN;    -- transaction A starts
>> :
>> (ERROR)
>> ROLLBACK; -- transaction A aborts
>> 
>> (retry)
>> 
>> BEGIN;    -- transaction B starts
>> :
>> (ERROR)
>> ROLLBACK; -- transaction B aborts
>> 
>> (retry)
>> 
>> BEGIN;    -- transaction C starts
>> :
>> END;    -- finally succeeds
>> 
>> In this case "total number of retries:" = 2 and "number of
>> transactions retried:" = 1. In this patch transactions A, B and C are
>> regarded as "same" transaction, so the retried transaction count
>> becomes 1. But it's confusing to use the language "transaction" here
>> because A, B and C are different transactions. I would think it's
>> better to use different language instead of "transaction", something
>> like "cycle"? i.e.
>> 
>> number of cycles retried: 35 (35.000%)

I realized that the same argument can be applied even to "number of
transactions actually processed", because with the retry feature a
"transaction" can comprise multiple transactions.

But if we go forward and replace those "transactions" with "cycles"
(or whatever) altogether, it could probably bring enough confusion to
users who have been using pgbench. We should probably give up on changing
the terminology and instead redefine "transaction" when the retry feature is
enabled, along the lines of "when the retry feature is enabled, each transaction
can consist of multiple retried transactions."

> In the original patch by Marina Polyakova it was "number of retried", 
> but I changed it to "number of transactions retried" is because I felt
> it was confusing with "number of retries". I chose the word "transaction"
> because a transaction ends in any one of successful commit , skipped, or
> failure, after possible retries. 

Ok.

> Well, I agree with that it is somewhat confusing wording. If we can find
> nice word to resolve the confusion, I don't mind if we change the word. 
> Maybe, we can use "executions" as well as "cycles". However, I am not sure
> that the situation is improved by using such word because what such word
> exactly means seems to be still unclear for users. 
> 
> Another idea is instead reporting only "the number of successfully
> retried transactions" that does not include "failed transactions", 
> that is, transactions failed after retries, like this;
> 
>  number of transactions actually processed: 100/100
>  number of failed transactions: 0 (0.000%)
>  number of successfully retried transactions: 35 (35.000%)
>  total number of retries: 74 
> 
> The meaning is clear and there seems to be no confusion.

Thank you for the suggestion. But I think it would be better to leave it
as it is, for the reason I mentioned above.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Tue, 22 Mar 2022 09:08:15 +0900
Yugo NAGATA <nagata@sraoss.co.jp> wrote:

> Hi Ishii-san,
> 
> On Sun, 20 Mar 2022 09:52:06 +0900 (JST)
> Tatsuo Ishii <ishii@sraoss.co.jp> wrote:
> 
> > Hi Yugo,
> > 
> > I have looked into the patch and I noticed that <xref
> > linkend=... endterm=...> is used in pgbench.sgml. e.g.
> > 
> > <xref linkend="failures-and-retries" endterm="failures-and-retries-title"/>
> > 
> > AFAIK this is the only place where "endterm" is used. In other places
> > "link" tag is used instead:
> 
> Thank you for pointing out it. 
> 
> I've checked other places using <xref/> referring to <refsect2>, and found
> that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it 
> in this style.

I attached the updated patch. I also fixed the following paragraph which I had
forgotten to fix in the previous patch.

 The first seven lines report some of the most important parameter settings.
 The sixth line reports the maximum number of tries for transactions with
 serialization or deadlock errors

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> I've checked other places using <xref/> referring to <refsect2>, and found
>> that "xreflabel"s are used in such <refsect2> tags. So, I'll fix it 
>> in this style.
> 
> I attached the updated patch. I also fixed the following paragraph which I had
> forgotten to fix in the previous patch.
> 
>  The first seven lines report some of the most important parameter settings.
>  The sixth line reports the maximum number of tries for transactions with
>  serialization or deadlock errors

Thank you for the updated patch. I think the patches look good and now
it's ready for commit. If there's no objection, I would like to
commit/push the patches.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> I attached the updated patch. I also fixed the following paragraph which I had
>> forgotten to fix in the previous patch.
>> 
>>  The first seven lines report some of the most important parameter settings.
>>  The sixth line reports the maximum number of tries for transactions with
>>  serialization or deadlock errors
> 
> Thank you for the updated patch. I think the patches look good and now
> it's ready for commit. If there's no objection, I would like to
> commit/push the patches.

The patch has been pushed. Thank you!

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> The patch Pushed. Thank you!

My hoary animal prairiedog doesn't like this [1]:

#   Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)/'
#   at t/001_pgbench_with_server.pl line 1229.
#                   'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
...
# pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR:  could not serialize access due to concurrent update
...
# '
#     doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)'
# Looks like you failed 1 test of 425.

I'm not sure what the "\\b.*\\g1" part of this regex is meant to
accomplish, but it seems to be assuming more than it should
about the output format of TAP messages.

            regards, tom lane

[1] https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=prairiedog&dt=2022-03-23%2013%3A21%3A44



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Wed, 23 Mar 2022 14:26:54 -0400
Tom Lane <tgl@sss.pgh.pa.us> wrote:

> Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> > The patch Pushed. Thank you!
> 
> My hoary animal prairiedog doesn't like this [1]:
> 
> #   Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)/'
> #   at t/001_pgbench_with_server.pl line 1229.
> #                   'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
> ...
> # pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR:  could not serialize access due to concurrent update
> ...
> # '
> #     doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)'
> # Looks like you failed 1 test of 425.
> 
> I'm not sure what the "\\b.*\\g1" part of this regex is meant to
> accomplish, but it seems to be assuming more than it should
> about the output format of TAP messages.

I had edited the test code from the original patch by mistake, but
I did not notice it because the test somehow passes on my machine without any
errors.

I attached a patch to restore the test as it was in the original patch, where
backreferences are used to check that the same query is retried.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> My hoary animal prairiedog doesn't like this [1]:
>> 
>> #   Failed test 'concurrent update with retrying stderr /(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)/'
>> #   at t/001_pgbench_with_server.pl line 1229.
>> #                   'pgbench: pghost: /tmp/nhghgwAoki pgport: 58259 nclients: 2 nxacts: 1 dbName: postgres
>> ...
>> # pgbench: client 0 got an error in command 3 (SQL) of script 0; ERROR:  could not serialize access due to concurrent update
>> ...
>> # '
>> #     doesn't match '(?s-xim:client (0|1) got an error in command 3 \\(SQL\\) of script 0; ERROR:  could not serialize access due to concurrent update\\b.*\\g1)'
>> # Looks like you failed 1 test of 425.
>> 
>> I'm not sure what the "\\b.*\\g1" part of this regex is meant to
>> accomplish, but it seems to be assuming more than it should
>> about the output format of TAP messages.
> 
> I have edited the test code from the original patch by mistake, but
> I did not notice because the test somehow works on my machine without
> any errors.
> 
> I attached a patch to restore the test as it was in the original patch,
> where backreferences are used to check that the same query is retried.

My machine (Ubuntu 20) did not complain either. Maybe a Perl version
difference?  Anyway, the fix has been pushed. Let's see how prairiedog feels.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
>> My hoary animal prairiedog doesn't like this [1]:

> My machine (Ubuntu 20) did not complain either. Maybe a Perl version
> difference?  Anyway, the fix has been pushed. Let's see how prairiedog feels.

Still not happy.  After some digging in man pages, I believe the
problem is that its old version of Perl does not understand "\gN"
backreferences.  Is there a good reason to be using that rather
than the traditional "\N" backref notation?

            regards, tom lane
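
A minimal Perl sketch of the portability point (not from the thread; the
sample string is invented): the \gN spelling of backreferences was only added
in Perl 5.10, so older Perls such as 5.8.x do not treat it as a backreference,
while the traditional \N form works everywhere.

use strict;
use warnings;

my $log = "client 0 sending UPDATE\n"
        . "client 0 got an error\n";

# Traditional backreference: understood by every Perl version.
print "matched client $1\n" if $log =~ /client (0|1) sending.*client \1 got/s;

# The \g1 spelling below means the same thing but needs Perl >= 5.10;
# on older Perls it is not parsed as a backreference, so the pattern
# does not match (the buildfarm failure above).
# print "matched\n" if $log =~ /client (0|1) sending.*client \g1 got/s;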



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> My machine (Ubuntu 20) did not complain either. Maybe a Perl version
>> difference?  Anyway, the fix has been pushed. Let's see how prairiedog feels.
> 
> Still not happy.  After some digging in man pages, I believe the
> problem is that its old version of Perl does not understand "\gN"
> backreferences.  Is there a good reason to be using that rather
> than the traditional "\N" backref notation?

I don't see a reason to use "\gN" either. Actually, after applying the
attached patch, my machine is still happy with the pgbench test.

Yugo?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 60cae1e843..22a23489e8 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1224,7 +1224,7 @@ my $err_pattern =
     "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*"
   . "client \\g2 got an error in command 3 \\(SQL\\) of script 0; "
   . "ERROR:  could not serialize access due to concurrent update\\b.*"
-  . "\\g1";
+  . "\\1";
 
 $node->pgbench(
     "-n -c 2 -t 1 -d --verbose-errors --max-tries 2",

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> I don't see a reason to use "\gN" either. Actually, after applying the
> attached patch, my machine is still happy with the pgbench test.

Note that the \\g2 just above also needs to be changed.

            regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Note that the \\g2 just above also needs to be changed.

Oops. Thanks. New patch attached. Test has passed on my machine.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/src/bin/pgbench/t/001_pgbench_with_server.pl b/src/bin/pgbench/t/001_pgbench_with_server.pl
index 60cae1e843..ca71f968dc 100644
--- a/src/bin/pgbench/t/001_pgbench_with_server.pl
+++ b/src/bin/pgbench/t/001_pgbench_with_server.pl
@@ -1222,9 +1222,9 @@ local $ENV{PGOPTIONS} = "-c default_transaction_isolation=repeatable\\ read";
 # delta variable in the next try
 my $err_pattern =
     "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*"
-  . "client \\g2 got an error in command 3 \\(SQL\\) of script 0; "
+  . "client \\2 got an error in command 3 \\(SQL\\) of script 0; "
   . "ERROR:  could not serialize access due to concurrent update\\b.*"
-  . "\\g1";
+  . "\\1";
 
 $node->pgbench(
     "-n -c 2 -t 1 -d --verbose-errors --max-tries 2",

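A minimal sketch of what this pattern checks (not part of the patch; the
stderr text below is invented, only shaped so the pattern can match): group 2
ties the error message to the client that sent the UPDATE, and the trailing
\1 requires the retried attempt to repeat exactly the same statement, i.e.
the same delta.

use strict;
use warnings;

my $err_pattern =
    "(client (0|1) sending UPDATE xy SET y = y \\+ -?\\d+\\b).*"
  . "client \\2 got an error in command 3 \\(SQL\\) of script 0; "
  . "ERROR:  could not serialize access due to concurrent update\\b.*"
  . "\\1";

# Invented stderr: the failed UPDATE and its retry use the same delta (17).
my $stderr =
    "client 0 sending UPDATE xy SET y = y + 17\n"
  . "client 0 got an error in command 3 (SQL) of script 0; "
  . "ERROR:  could not serialize access due to concurrent update\n"
  . "client 0 sending UPDATE xy SET y = y + 17\n";

print "retry of the same query detected\n" if $stderr =~ /$err_pattern/s;

If the retried UPDATE used a different delta, the trailing \1 would not match
and the check would fail.
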
Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Oops. Thanks. New patch attached. Test has passed on my machine.

I reproduced the failure on another machine with perl 5.8.8,
and I can confirm that this patch fixes it.

            regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Fri, 25 Mar 2022 09:14:00 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Note that the \\g2 just above also needs to be changed.
> 
> Oops. Thanks. New patch attached. Test has passed on my machine.

This patch works for me. I think it is ok to use \N instead of \gN.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> I reproduced the failure on another machine with perl 5.8.8,
> and I can confirm that this patch fixes it.

Thank you for the test. I have pushed the patch.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> Oops. Thanks. New patch attached. Test has passed on my machine.
> 
> This patch works for me. I think it is ok to use \N instead of \gN.

Thanks. Patch pushed.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tom Lane
Дата:
Tatsuo Ishii <ishii@sraoss.co.jp> writes:
> Thanks. Patch pushed.

This patch has caused the PDF documentation to fail to build cleanly:

[WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)

It's complaining about this:

<synopsis>
<replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
</synopsis>

which runs much too wide in HTML format too, even though that toolchain
doesn't tell you so.

We could silence the warning by inserting an arbitrary line break or two,
or refactoring the syntax description into multiple parts.  Either way
seems to create a risk of confusion.

TBH, I think the *real* problem is that the complexity of this log format
has blown past "out of hand".  Can't we simplify it?  Who is really going
to use all these numbers?  I pity the poor sucker who tries to write a
log analysis tool that will handle all the variants.

            regards, tom lane



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> This patch has caused the PDF documentation to fail to build cleanly:
> 
> [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)
> 
> It's complaining about this:
> 
> <synopsis>
> <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
> </synopsis>
> 
> which runs much too wide in HTML format too, even though that toolchain
> doesn't tell you so.

Yeah.

> We could silence the warning by inserting an arbitrary line break or two,
> or refactoring the syntax description into multiple parts.  Either way
> seems to create a risk of confusion.

I think we can fold the line nicely. Here is the rendered image.

Before:
interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

After:
interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

Note that before it was like this:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ]

So the newly added items are "{ failures | serialization_failures deadlock_failures }" and "[ retried retries ]".

> TBH, I think the *real* problem is that the complexity of this log format
> has blown past "out of hand".  Can't we simplify it?  Who is really going
> to use all these numbers?  I pity the poor sucker who tries to write a
> log analysis tool that will handle all the variants.

Well, the extra logging items above only appear when the retry feature
is enabled. For those who do not use the feature, the only new logging
item is "failures". For those who use the feature, the extra logging
items are clearly necessary. For example, if we write an application
using the repeatable read or serializable transaction isolation mode,
retrying transactions that failed due to a serialization error is an
essential technique. Also, the retry rate of transactions will strongly
affect performance, and in such use cases the newly added items will be
precious information. I would suggest leaving the log items as they are.

Patch attached.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp
diff --git a/doc/src/sgml/ref/pgbench.sgml b/doc/src/sgml/ref/pgbench.sgml
index ebdb4b3f46..b65b813ebe 100644
--- a/doc/src/sgml/ref/pgbench.sgml
+++ b/doc/src/sgml/ref/pgbench.sgml
@@ -2398,10 +2398,11 @@ END;
 
   <para>
    With the <option>--aggregate-interval</option> option, a different
-   format is used for the log files:
+   format is used for the log files (note that the actual log line is not folded).
 
 <synopsis>
-<replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
+  <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable>
+  { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
 </synopsis>
 
    where

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Sun, 27 Mar 2022 15:28:41 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > This patch has caused the PDF documentation to fail to build cleanly:
> > 
> > [WARN] FOUserAgent - The contents of fo:block line 1 exceed the available area in the inline-progression direction by more than 50 points. (See position 125066:375)
> > 
> > It's complaining about this:
> > 
> > <synopsis>
> > <replaceable>interval_start</replaceable> <replaceable>num_transactions</replaceable> <replaceable>sum_latency</replaceable> <replaceable>sum_latency_2</replaceable> <replaceable>min_latency</replaceable> <replaceable>max_latency</replaceable> { <replaceable>failures</replaceable> | <replaceable>serialization_failures</replaceable> <replaceable>deadlock_failures</replaceable> } <optional> <replaceable>sum_lag</replaceable> <replaceable>sum_lag_2</replaceable> <replaceable>min_lag</replaceable> <replaceable>max_lag</replaceable> <optional> <replaceable>skipped</replaceable> </optional> </optional> <optional> <replaceable>retried</replaceable> <replaceable>retries</replaceable> </optional>
> > </synopsis>
> > 
> > which runs much too wide in HTML format too, even though that toolchain
> > doesn't tell you so.
> 
> Yeah.
> 
> > We could silence the warning by inserting an arbitrary line break or two,
> > or refactoring the syntax description into multiple parts.  Either way
> > seems to create a risk of confusion.
> 
> I think we can fold the line nicely. Here is the rendered image.
> 
> Before:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]
> 
> After:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
>   { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]
> 
> Note that before it was like this:
> 
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ]
> 
> So the newly added items are "{ failures | serialization_failures deadlock_failures }" and "[ retried retries ]".
> 
> > TBH, I think the *real* problem is that the complexity of this log format
> > has blown past "out of hand".  Can't we simplify it?  Who is really going
> > to use all these numbers?  I pity the poor sucker who tries to write a
> > log analysis tool that will handle all the variants.
> 
> Well, the extra logging items above only appear when the retry feature
> is enabled. For those who do not use the feature, the only new logging
> item is "failures". For those who use the feature, the extra logging
> items are clearly necessary. For example, if we write an application
> using the repeatable read or serializable transaction isolation mode,
> retrying transactions that failed due to a serialization error is an
> essential technique. Also, the retry rate of transactions will strongly
> affect performance, and in such use cases the newly added items will be
> precious information. I would suggest leaving the log items as they are.
> 
> Patch attached.

Even after applying this patch, "make postgres-A4.pdf" still raises the warning
on my machine. After some investigation, I found that the previous document had
a break after 'num_transactions', but it was removed by this commit. So, I
would like to put it back as it was. I attached the patch.

Regards,
Yugo Nagata


-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
> Even after applying this patch, "make postgres-A4.pdf" still raises the warning
> on my machine. After some investigation, I found that the previous document had
> a break after 'num_transactions', but it was removed by this commit.

Yes, your patch removed "&zwsp;".

> So, I would like to put it back as it was. I attached the patch.

This produces errors. Needs ";" postfix?

ref/pgbench.sgml:2404: parser error : EntityRef: expecting ';'
le>interval_start</replaceable> <replaceable>num_transactions</replaceable>&zwsp
                                                                               ^
ref/pgbench.sgml:2781: parser error : chunk is not well balanced

^
reference.sgml:251: parser error : Failure to process entity pgbench
   &pgbench;
            ^
reference.sgml:251: parser error : Entity 'pgbench' not defined
   &pgbench;
            ^
reference.sgml:296: parser error : chunk is not well balanced

^
postgres.sgml:240: parser error : Failure to process entity reference
 &reference;
            ^
postgres.sgml:240: parser error : Entity 'reference' not defined
 &reference;
            ^
make: *** [Makefile:135: html-stamp] エラー 1

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp



Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Yugo NAGATA
Дата:
On Mon, 28 Mar 2022 12:17:13 +0900 (JST)
Tatsuo Ishii <ishii@sraoss.co.jp> wrote:

> > Even after applying this patch, "make postgres-A4.pdf" still raises the warning
> > on my machine. After some investigation, I found that the previous document had
> > a break after 'num_transactions', but it was removed by this commit.
> 
> Yes, your patch removed "&zwsp;".
> 
> > So, I would like to put it back as it was. I attached the patch.
> 
> This produces errors. Needs ";" postfix?

Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
I attached the fixed patch.

Regards,
Yugo Nagata

-- 
Yugo NAGATA <nagata@sraoss.co.jp>

Вложения

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>> > Even after applying this patch, "make postgres-A4.pdf" still raises the warning
>> > on my machine. After some investigation, I found that the previous document had
>> > a break after 'num_transactions', but it was removed by this commit.
>> 
>> Yes, your patch removed "&zwsp;".
>> 
>> > So, I would like to put it back as it was. I attached the patch.
>> 
>> This produces errors. Needs ";" postfix?
> 
> Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
> I attached the fixed patch.

The basic problem with this patch is that it may solve the issue with PDF
generation, but it does not solve the issue with HTML generation. The
PDF manual of pgbench has a ridiculously long line, which Tom Lane also
complained about:

interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

Why can't we just use line feeds instead of &zwsp;? Although it's not
a command usage, the SELECT manual already uses line feeds to nicely
break its synopsis into multiple lines.

Best regards,
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Tatsuo Ishii
Дата:
>>> > Even after applying this patch, "make postgres-A4.pdf" still raises the warning
>>> > on my machine. After some investigation, I found that the previous document had
>>> > a break after 'num_transactions', but it was removed by this commit.
>>> 
>>> Yes, your patch removed "&zwsp;".
>>> 
>>> > So, I would like to put it back as it was. I attached the patch.
>>> 
>>> This produces errors. Needs ";" postfix?
>> 
>> Oops. Yes, it needs ';'. Also, I found another "&zwsp;" dropped.
>> I attached the fixed patch.
> 
> The basic problem with this patch is that it may solve the issue with PDF
> generation, but it does not solve the issue with HTML generation. The
> PDF manual of pgbench has a ridiculously long line, which Tom Lane also
I meant "HTML manual" here.

> complained about:
> 
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]
> 
> Why can't we just use line feeds instead of &zwsp;? Although it's not
> a command usage, the SELECT manual already uses line feeds to nicely
> break its synopsis into multiple lines.
> 
> Best regards,
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese:http://www.sraoss.co.jp

Re: [HACKERS] WIP Patch: Pgbench Serialization and deadlock errors

От
Alvaro Herrera
Дата:
Hello,

On 2022-Mar-27, Tatsuo Ishii wrote:

> After:
> interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
>   { failures | serialization_failures deadlock_failures } [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

You're showing an indentation, but looking at the HTML output there is
no such indentation.  Is the HTML processor eating leading whitespace or
something like that?

I think that the explanatory paragraph is way too long now, particularly
since it explains --failures-detailed starting in the middle.  Also, the
example output doesn't include the failures-detailed mode.  I suggest
that this should be broken down even more; first to explain the output
without failures-detailed, including an example, and then the output
with failures-detailed, and an example of that.  Something like this,
perhaps:

Aggregated Logging
With the --aggregate-interval option, a different format is used for the log files (note that the actual log line is not folded).

  interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  failures [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

where interval_start is the start of the interval (as a Unix epoch time
stamp), num_transactions is the number of transactions within the interval,
sum_latency is the sum of the transaction latencies within the interval,
sum_latency_2 is the sum of squares of the transaction latencies within the
interval, min_latency is the minimum latency within the interval, and
max_latency is the maximum latency within the interval; failures is the
number of transactions that ended with a failed SQL command within the
interval.

The next fields, sum_lag, sum_lag_2, min_lag, and max_lag, are only present
if the --rate option is used. They provide statistics about the time each
transaction had to wait for the previous one to finish, i.e., the difference
between each transaction's scheduled start time and the time it actually
started. The next field, skipped, is only present if the --latency-limit
option is used, too. It counts the number of transactions skipped because
they would have started too late. The retried and retries fields are present
only if the --max-tries option is not equal to 1. They report the number of
retried transactions and the sum of all retries after serialization or
deadlock errors within the interval. Each transaction is counted in the
interval when it was committed.

Notice that while the plain (unaggregated) log file shows which script was
used for each transaction, the aggregated log does not. Therefore if you need
per-script data, you need to aggregate the data on your own.

Here is some example output:

1345828501 5601 1542744 483552416 61 2573 0
1345828503 7884 1979812 565806736 60 1479 0
1345828505 7208 1979422 567277552 59 1391 0
1345828507 7685 1980268 569784714 60 1398 0
1345828509 7073 1979779 573489941 236 1411 0

If you use option --failures-detailed, instead of the sum of all failed
transactions you will get more detailed statistics for the failed
transactions:

  interval_start num_transactions sum_latency sum_latency_2 min_latency max_latency
  serialization_failures deadlock_failures [ sum_lag sum_lag_2 min_lag max_lag [ skipped ] ] [ retried retries ]

This is similar to the above, but here the single 'failures' figure is
replaced by serialization_failures, which is the number of transactions that
got a serialization error and were not retried after this, and
deadlock_failures, which is the number of transactions that got a deadlock
error and were not retried after this. The other fields are as above. Here is
some example output:

[example with detailed failures]

-- 
Álvaro Herrera               48°01'N 7°57'E  —  https://www.EnterpriseDB.com/
"If you have nothing to say, maybe you need just the right tool to help you
not say it."                   (New York Times, about Microsoft PowerPoint)
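
A minimal sketch (not from the thread; it reuses the example line quoted
above and assumes the default aggregated format, i.e. no --rate,
--latency-limit, --failures-detailed, or retries) of how a log analysis
script might read such a line:

use strict;
use warnings;

# One line of the example aggregated output shown above:
# interval_start num_transactions sum_latency sum_latency_2
# min_latency max_latency failures
my $line = "1345828501 5601 1542744 483552416 61 2573 0";

my ($interval_start, $num_transactions, $sum_latency, $sum_latency_2,
    $min_latency, $max_latency, $failures) = split ' ', $line;

my $avg_latency = $num_transactions ? $sum_latency / $num_transactions : 0;
printf "interval %s: %d transactions, average latency %.1f, %d failures\n",
    $interval_start, $num_transactions, $avg_latency, $failures;

Lines produced with --rate, --latency-limit, --failures-detailed, or retries
enabled would carry the additional fields described above, so a real tool
would need to know which options were used.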