Discussion: In progress INSERT wrecks plans on table

From:
Mark Kirkwood
Date:

Recently we encountered the following unhappy sequence of events:

1/ system running happily
2/ batch load into table begins
3/ very quickly (some) preexisting queries on said table go orders of
magnitude slower
4/ database instance becomes unresponsive
5/ application outage

After chasing down a few false leads, we've isolated the cause to the
following:

The accumulating in-progress row changes are such that previously
optimal plans are optimal no longer. Now, this situation will fix itself
when the next autoanalyze happens (and a new plan will be chosen) -
however, that cannot occur until the batch load is completed and
committed (approx 70 seconds). During that time there is enough
of a performance degradation for queries still using the old plan to
cripple the server.

Now that we know what is happening we can work around it. But I'm
wondering - is there any way (or if not should there be one) to let
postgres handle this automatically? I experimented with a quick hack to
src/backend/commands/analyze.c (attached) that lets another session's
ANALYZE see in-progress rows - which works, but a) may cause other
problems and b) does not help autoanalyze, which has to wait for COMMIT
+ stats message.

I've attached a (synthetic) test case that shows the issue; I'll
reproduce the output below to hopefully make the point obvious:


                Table "public.plan"
  Column |            Type             | Modifiers
--------+-----------------------------+-----------
  id     | integer                     | not null
  typ    | integer                     | not null
  dat    | timestamp without time zone |
  val    | text                        | not null
Indexes:
     "plan_id" UNIQUE, btree (id)
     "plan_dat" btree (dat)
     "plan_typ" btree (typ)


[Session 1]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
                                                      QUERY PLAN
---------------------------------------------------------------------------------------------------------------------
  Index Scan using plan_dat on plan  (cost=0.00..265.47 rows=55
width=117) (actual time=0.130..4.409 rows=75 loops=1)
    Index Cond: (dat IS NOT NULL)
    Filter: (typ = 3)
    Rows Removed by Filter: 5960
  Total runtime: 4.440 ms
(5 rows)

[Session 2]

BEGIN;
INSERT INTO plan
SELECT id + 2000001,typ,current_date + id * '1 seconds'::interval ,val
FROM plan
;

[Session 1]
EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;

                                                       QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
  Index Scan using plan_dat on plan  (cost=0.00..551.35 rows=91
width=117) (actual time=0.131..202.699 rows=75 loops=1)
    Index Cond: (dat IS NOT NULL)
    Filter: (typ = 3)
    Rows Removed by Filter: 5960
  Total runtime: 202.729 ms
(5 rows)
[Session 2]
COMMIT;

[Session 1...wait for autoanalyze to finish then]

EXPLAIN ANALYZE
SELECT * FROM plan
WHERE typ = 3 AND dat IS NOT NULL;
QUERY PLAN

---------------------------------------------------------------------------------------------------------------------------
  Bitmap Heap Scan on plan  (cost=407.87..44991.95 rows=10116 width=117)
(actual time=2.692..6.582 rows=75 loops=1)
    Recheck Cond: (typ = 3)
    Filter: (dat IS NOT NULL)
    Rows Removed by Filter: 19925
    ->  Bitmap Index Scan on plan_typ  (cost=0.00..405.34 rows=20346
width=0) (actual time=2.573..2.573 rows=20000 loops=1)
          Index Cond: (typ = 3)
  Total runtime: 6.615 ms


Regards

Mark

From:
Mark Kirkwood
Date:

On 26/04/13 14:56, Gavin Flower wrote:
> On 26/04/13 14:33, Mark Kirkwood wrote:
>> Recently we encountered the following unhappy sequence of events:
>>
>> 1/ system running happily
>> 2/ batch load into table begins
>> 3/ very quickly (some) preexisting queries on said table go orders of
>> magnitude slower
>> 4/ database instance becomes unresponsive
>> 5/ application outage
>>
>> After chasing down a few false leads, we've isolated the cause to the
>> following:
>>
>> The accumulating in-progress row changes are such that previously
>> optimal plans are optimal no longer. Now this situation will fix
>> itself when the next autoanalyze happens (and new plan will be
>> chosen) - however that cannot occur until the batch load is completed
>> and committed (approx 70 seconds). However during that time there is
>> enough of a performance degradation for queries still using the old
>> plan to cripple the server.
>>
>> Now that we know what is happening we can work around it. But I'm
>> wondering - is there any way (or if not should there be one) to let
>> postgres handle this automatically? I experimented with a quick hack
>> to src/backend/commands/analyze.c (attached) that lets another
>> session's ANALYZE see in progress rows - which works but a) may cause
>> other problems and b) does not help autoanalyze which has to wait
>> for COMMIT + stats message.
>>
>> I've attached a (synthetic) test case that shows the issue, I'll
>> reproduce the output below to hopefully make the point obvious:
>>
>>
>>                Table "public.plan"
>>  Column |            Type             | Modifiers
>> --------+-----------------------------+-----------
>>  id     | integer                     | not null
>>  typ    | integer                     | not null
>>  dat    | timestamp without time zone |
>>  val    | text                        | not null
>> Indexes:
>>     "plan_id" UNIQUE, btree (id)
>>     "plan_dat" btree (dat)
>>     "plan_typ" btree (typ)
>>
>>
>> [Session 1]
>> EXPLAIN ANALYZE
>> SELECT * FROM plan
>> WHERE typ = 3 AND dat IS NOT NULL;
>>                                                      QUERY PLAN
>>
>> ---------------------------------------------------------------------------------------------------------------------
>>
>>  Index Scan using plan_dat on plan  (cost=0.00..265.47 rows=55
>> width=117) (actual time=0.130..4.409 rows=75 loops=1)
>>    Index Cond: (dat IS NOT NULL)
>>    Filter: (typ = 3)
>>    Rows Removed by Filter: 5960
>>  Total runtime: 4.440 ms
>> (5 rows)
>>
>> [Session 2]
>>
>> BEGIN;
>> INSERT INTO plan
>> SELECT id + 2000001,typ,current_date + id * '1 seconds'::interval ,val
>> FROM plan
>> ;
>>
>> [Session 1]
>> EXPLAIN ANALYZE
>> SELECT * FROM plan
>> WHERE typ = 3 AND dat IS NOT NULL;
>>
>>                                                       QUERY PLAN
>>
>> -----------------------------------------------------------------------------------------------------------------------
>>
>>  Index Scan using plan_dat on plan  (cost=0.00..551.35 rows=91
>> width=117) (actual time=0.131..202.699 rows=75 loops=1)
>>    Index Cond: (dat IS NOT NULL)
>>    Filter: (typ = 3)
>>    Rows Removed by Filter: 5960
>>  Total runtime: 202.729 ms
>> (5 rows)
>> [Session 2]
>> COMMIT;
>>
>> [Session 1...wait for autoanalyze to finish then]
>>
>> EXPLAIN ANALYZE
>> SELECT * FROM plan
>> WHERE typ = 3 AND dat IS NOT NULL;
>> QUERY PLAN
>>
>> ---------------------------------------------------------------------------------------------------------------------------
>>
>>  Bitmap Heap Scan on plan  (cost=407.87..44991.95 rows=10116
>> width=117) (actual time=2.692..6.582 rows=75 loops=1)
>>    Recheck Cond: (typ = 3)
>>    Filter: (dat IS NOT NULL)
>>    Rows Removed by Filter: 19925
>>    ->  Bitmap Index Scan on plan_typ  (cost=0.00..405.34 rows=20346
>> width=0) (actual time=2.573..2.573 rows=20000 loops=1)
>>          Index Cond: (typ = 3)
>>  Total runtime: 6.615 ms
>>
>>
>> Regards
>>
>> Mark
>>
>>
> Hmm...
>
> You need to specify:
>
> 1. version of Postgres
> 2. Operating system
> 3. changes to postgresql.conf
> 4. CPU/RAM etc
> 5. anything else that might be relevant
>
>


While in general you are quite correct - in the above case (particularly
as I've supplied a test case) it should be pretty obvious that any
moderately modern version of Postgres on any supported platform will
exhibit this.

I produced the above test case on Postgres 9.2.4 on Ubuntu 13.04, with no
changes to the default postgresql.conf.


Now our actual production server is a 32-CPU box with 512GB RAM and 16
SAS SSDs, running Postgres 9.2.4 on Ubuntu 12.04. And yes, there are quite
a few changes from the defaults there - and I wasted quite a lot of time
chasing issues with high CPU and RAM use, and changing various configs to
see if they helped, before identifying that the issue was in-progress
row changes and planner statistics. Also, in the "real" case with much
bigger datasets, the difference between the plan being optimal and it
*not* being optimal is a factor of 2000x in elapsed time instead of a
mere 50x!

regards

Mark



From:
Mark Kirkwood
Date:

On 26/04/13 15:34, Gavin Flower wrote:
> On 26/04/13 15:19, Mark Kirkwood wrote:
>> While in general you are quite correct - in the above case
>> (particularly as I've supplied a test case) it should be pretty
>> obvious that any moderately modern version of postgres on any
>> supported platform will exhibit this.
 >
> While I admit that I did not look closely at your test case - I am aware
> that several times changes to Postgres from one minor version to
> another, can have drastic unintended side effects (which might, or might
> not, be relevant to your situation). Besides, it helps sets the scene,
> and is one less thing that needs to be deduced.
>

Indeed - however, my perhaps slightly grumpy reply to your email was
based on an impression of over-keenness to dismiss my message without
reading it (!) and a - dare I say it - one-size-fits-all presentation of
"here are the hoops to jump through". Now, I spent a reasonable amount of
time preparing the message and its attendant test case - and a comment
such as yours, based on *not reading it*... errrm... well, let's say I
think we can/should do better.

I am concerned that the deafening lack of any replies to my original
message is a result of folk glancing at your original quick reply and
thinking... incomplete problem spec... ignore... when that is not the
case - yes, I should have muttered "9.2" in the original email, but we
have covered that now.

Regards


Mark


From:
Tom Lane
Date:

Mark Kirkwood <> writes:
> I am concerned that the deafening lack of any replies to my original
> message is a result of folk glancing at your original quick reply and
> thinking... incomplete problem spec...ignore... when that is not the
> case - yes I should have muttered "9.2" in the original email, but we
> have covered that now.

No, I think it's more that we're trying to get to beta, and so anything
that looks like new development is getting shuffled to folks' "to
look at later" queues.  The proposed patch is IMO a complete nonstarter
anyway; but I'm not sure what a less bogus solution would look like.

            regards, tom lane


From:
Mark Kirkwood
Date:

On 02/05/13 02:06, Tom Lane wrote:
> Mark Kirkwood <> writes:
>> I am concerned that the deafening lack of any replies to my original
>> message is a result of folk glancing at your original quick reply and
>> thinking... incomplete problem spec...ignore... when that is not the
>> case - yes I should have muttered "9.2" in the original email, but we
>> have covered that now.
> No, I think it's more that we're trying to get to beta, and so anything
> that looks like new development is getting shuffled to folks' "to
> look at later" queues.  The proposed patch is IMO a complete nonstarter
> anyway; but I'm not sure what a less bogus solution would look like.
>

Yeah, I did think that beta might be consuming everyone's attention (of
course immediately *after* sending the email)!

And yes, the patch was merely to illustrate the problem rather than any
serious attempt at a solution.

Regards

Mark



From:
Simon Riggs
Date:

On 2 May 2013 01:49, Mark Kirkwood <> wrote:
> On 02/05/13 02:06, Tom Lane wrote:
>>
>> Mark Kirkwood <> writes:
>>>
>>> I am concerned that the deafening lack of any replies to my original
>>> message is a result of folk glancing at your original quick reply and
>>> thinking... incomplete problem spec...ignore... when that is not the
>>> case - yes I should have muttered "9.2" in the original email, but we
>>> have covered that now.
>>
>> No, I think it's more that we're trying to get to beta, and so anything
>> that looks like new development is getting shuffled to folks' "to
>> look at later" queues.  The proposed patch is IMO a complete nonstarter
>> anyway; but I'm not sure what a less bogus solution would look like.
>>
>
> Yeah, I did think that beta might be consuming everyone's attention (of
> course immediately *after* sending the email)!
>
> And yes, the patch was merely to illustrate the problem rather than any
> serious attempt at a solution.

I think we need a problem statement before we attempt a solution,
which is what Tom is alluding to.

ISTM that you've got a case where the plan is very sensitive to a
table load. Which is a pretty common situation and one that can be
solved in various ways. I don't see much that Postgres can do because
it can't know ahead of time you're about to load rows. We could
imagine an optimizer that set thresholds on plans that caused the
whole plan to be recalculated halfway through a run, but that would be a
lot of work to design and implement and even harder to test. Having
static plans at least allows us to discuss what it does after the fact
with some ease.

The plan is set using stats that are set when there are very few
non-NULL rows, and those increase massively on load. The way to cope
is to run the ANALYZE immediately after the load and then don't allow
auto-ANALYZE to reset them later.
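
A minimal sketch of that approach, using the `plan` table from the test case (note the trade-off: disabling autovacuum for the table also stops dead-tuple cleanup, so it should be re-enabled once normal churn resumes):

```sql
-- Analyze right after the bulk load, then pin the stats by disabling
-- autovacuum (and hence autoanalyze) for just this table:
ANALYZE plan;
ALTER TABLE plan SET (autovacuum_enabled = false);

-- Re-enable later, when the post-load statistics are no longer wanted:
ALTER TABLE plan SET (autovacuum_enabled = true);
```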

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From:
mark.kirkwood@catalyst.net.nz
Date:

> On 2 May 2013 01:49, Mark Kirkwood <> wrote:
>
> I think we need a problem statement before we attempt a solution,
> which is what Tom is alluding to.
>

Actually no - I think Tom (quite correctly) was saying that the patch was
not a viable solution. With which I agree.

I believe the title of this thread is the problem statement.

> ISTM that you've got a case where the plan is very sensitive to a
> table load. Which is a pretty common situation and one that can be
> solved in various ways. I don't see much that Postgres can do because
> it can't know ahead of time you're about to load rows. We could
> imagine an optimizer that set thresholds on plans that caused the
> whole plan to be recalculated halfway through a run, but that would be a
> lot of work to design and implement and even harder to test. Having
> static plans at least allows us to discuss what it does after the fact
> with some ease.
>
> The plan is set using stats that are set when there are very few
> non-NULL rows, and those increase massively on load. The way to cope
> is to run the ANALYZE immediately after the load and then don't allow
> auto-ANALYZE to reset them later.

No. We do run analyze immediately after the load. The surprise was that
this was not sufficient - the (small) amount of time where non optimal
plans were being used due to the in-progress row activity was enough to
cripple the system - that is the problem. The analysis of why not led to
the test case included in the original email. And sure it is deliberately
crafted to display the issue, and is therefore open to criticism for being
artificial. However it was purely meant to make it easy to see what I was
talking about.


Currently we are working around this by coercing one of the predicates in
the query to discourage the attractive looking but dangerous index.
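
To illustrate the kind of coercion I mean, here is a hypothetical sketch against the test table (not our actual production query): wrapping the indexed column in an expression means the predicate no longer matches the column, so the planner cannot use the dangerous index.

```sql
-- Hypothetical example on the test table: "(dat + interval '0')" no
-- longer matches the indexed column "dat", so the plan_dat index scan
-- is unavailable and the planner considers other paths instead.
-- (NULL + interval is still NULL, so the result set is unchanged.)
SELECT * FROM plan
WHERE typ = 3
  AND (dat + interval '0') IS NOT NULL;
```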

I think the idea of telling postgres that we are doing a load is probably
the wrong way to go about this. We have a framework that tries to
automatically figure out the best plans... I think some more thought about
how to make that understand some of the more subtle triggers for a
time-to-do-new-plans moment is the way to go. I understand this is
probably hard - and may imply some radical surgery to how the stats
collector and planner interact.

Regards

Mark




From:
Thomas Kellerer
Date:

 wrote on 03.05.2013 00:19:
> I think the idea of telling postgres that we are doing a load is probably
> the wrong way to go about this. We have a framework that tries to
> automatically figure out the best plans...I think some more thought about
> how to make that understand some of the more subtle triggers for a
> time-to-do-new-plans moment is the way to go. I understand this is
> probably hard - and may imply some radical surgery to how the stats
> collector and planner interact.

I wonder if "freezing" (analyze, then disable autovacuum) the statistics for the large number of rows would work.

From:
mark.kirkwood@catalyst.net.nz
Date:

>  wrote on 03.05.2013 00:19:
>> I think the idea of telling postgres that we are doing a load is
>> probably
>> the wrong way to go about this. We have a framework that tries to
>> automatically figure out the best plans...I think some more thought
>> about
>> how to make that understand some of the more subtle triggers for a
>> time-to-do-new-plans moment is the way to go. I understand this is
>> probably hard - and may imply some radical surgery to how the stats
>> collector and planner interact.
>
> I wonder if "freezing" (analyze, then disable autovacuum) the statistics
> for the large number of rows would work.
>
>
>

I'm thinking that the issue is actually the opposite - it is that a new
plan is needed because the new (uncommitted) rows are changing the data
distribution. So we want more plan instability rather than plan stability
:-)

Cheers

Mark




From:
Simon Riggs
Date:

On 2 May 2013 23:19,  <> wrote:
>> On 2 May 2013 01:49, Mark Kirkwood <> wrote:
>>
>> I think we need a problem statement before we attempt a solution,
>> which is what Tom is alluding to.
>>
>
> Actually no - I think Tom (quite correctly) was saying that the patch was
> not a viable solution. With which I agree.
>
> I believe the title of this thread is the problem statement.
>
>> ISTM that you've got a case where the plan is very sensitive to a
>> table load. Which is a pretty common situation and one that can be
>> solved in various ways. I don't see much that Postgres can do because
>> it can't know ahead of time you're about to load rows. We could
>> imagine an optimizer that set thresholds on plans that caused the
>> whole plan to be recalculated halfway through a run, but that would be a
>> lot of work to design and implement and even harder to test. Having
>> static plans at least allows us to discuss what it does after the fact
>> with some ease.
>>
>> The plan is set using stats that are set when there are very few
>> non-NULL rows, and those increase massively on load. The way to cope
>> is to run the ANALYZE immediately after the load and then don't allow
>> auto-ANALYZE to reset them later.
>
> No. We do run analyze immediately after the load. The surprise was that
> this was not sufficient - the (small) amount of time where non optimal
> plans were being used due to the in progress row activity was enough to
> cripple the system - that is the problem. The analysis of why not led to
> the test case included in the original email. And sure it is deliberately
> crafted to display the issue, and is therefore open to criticism for being
> artificial. However it was purely meant to make it easy to see what I was
> talking about.

I had another look at this and see that I read the second explain incorrectly.

The amount of data examined and returned is identical in both plans.
The only difference is the number of in-progress rows seen by the
second query. Looking at the numbers some more, it looks like 6000
in-progress rows are examined in addition to the data. It might be
worth an EXPLAIN patch to put instrumentation in to show that, but it's
not that interesting.

It would be useful to force the indexscan into a bitmapscan to check
that the cost isn't attributable to the plan but to other overheads.
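
A sketch of one way to run that comparison, session-locally so nothing else is disturbed (enable_indexscan = off still leaves bitmap scans available, since those are governed by enable_bitmapscan):

```sql
BEGIN;
-- SET LOCAL only affects the current transaction.
SET LOCAL enable_indexscan = off;
-- With plain index scans disabled, the planner should fall back to a
-- bitmap scan; compare the timing against the original index scan plan.
EXPLAIN ANALYZE
SELECT * FROM plan WHERE typ = 3 AND dat IS NOT NULL;
ROLLBACK;
```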

What appears to be happening is we're spending a lot of time in
TransactionIdIsInProgress() so we can set hints and then when we find
it is still in progress we then spend more time in XidIsInSnapshot()
while we check that it is still invisible to us. Even if the
transaction we see repeatedly ends, we will still pay the cost in
XidIsInSnapshot repeatedly as we execute.

Given that code path, I would expect it to suck worse on a live system
with many sessions, and even worse with many subtransactions.

(1) A proposed fix is attached, but it's only a partial one and barely tested.

Deeper fixes might be

(2)  to sort the xid array if we call XidIsInSnapshot too many times
in a transaction. I don't think that is worth it, because a long
running snapshot may be examined many times, but is unlikely to see
multiple in-progress xids repeatedly. Whereas your case seems
reasonably common.

(3) to make the check on TransactionIdIsInProgress() into a heuristic,
since we don't *need* to check that, so if we keep checking the same
xid repeatedly we can reduce the number of checks or avoid xids that
seem to be long running. That's slightly more coding than my quick
hack here but seems worth it.

I think we need both (1) and (3) but the attached patch does just (1).

This is a similar optimisation to the one I introduced for
TransactionIdIsKnownCompleted(), except this applies to repeated
checking of as-yet-incomplete xids, and to bulk concurrent
transactions.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From:
Simon Riggs
Date:

On 3 May 2013 13:41, Simon Riggs <> wrote:

> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
> since we don't *need* to check that, so if we keep checking the same
> xid repeatedly we can reduce the number of checks or avoid xids that
> seem to be long running. That's slightly more coding than my quick
> hack here but seems worth it.
>
> I think we need both (1) and (3) but the attached patch does just (1).
>
> This is a similar optimisation to the one I introduced for
> TransactionIdIsKnownCompleted(), except this applies to repeated
> checking of as yet-incomplete xids, and to bulk concurrent
> transactions.

ISTM we can improve performance of TransactionIdIsInProgress() by
caching the procno of our last xid.

Mark, could you retest with both these patches? Thanks.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From:
Mark Kirkwood
Date:

On 05/05/13 00:49, Simon Riggs wrote:
> On 3 May 2013 13:41, Simon Riggs <> wrote:
>
>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>> since we don't *need* to check that, so if we keep checking the same
>> xid repeatedly we can reduce the number of checks or avoid xids that
>> seem to be long running. That's slightly more coding than my quick
>> hack here but seems worth it.
>>
>> I think we need both (1) and (3) but the attached patch does just (1).
>>
>> This is a similar optimisation to the one I introduced for
>> TransactionIdIsKnownCompleted(), except this applies to repeated
>> checking of as yet-incomplete xids, and to bulk concurrent
>> transactions.
>
> ISTM we can improve performance of TransactionIdIsInProgress() by
> caching the procno of our last xid.
>
> Mark, could you retest with both these patches? Thanks.
>

Thanks Simon, will do and report back.




From:
Simon Riggs
Date:

On 6 May 2013 02:51, Mark Kirkwood <> wrote:
> On 05/05/13 00:49, Simon Riggs wrote:
>>
>> On 3 May 2013 13:41, Simon Riggs <> wrote:
>>
>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>> since we don't *need* to check that, so if we keep checking the same
>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>> seem to be long running. That's slightly more coding than my quick
>>> hack here but seems worth it.
>>>
>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>
>>> This is a similar optimisation to the one I introduced for
>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>> checking of as yet-incomplete xids, and to bulk concurrent
>>> transactions.
>>
>>
>> ISTM we can improve performance of TransactionIdIsInProgress() by
>> caching the procno of our last xid.
>>
>> Mark, could you retest with both these patches? Thanks.
>>
>
> Thanks Simon, will do and report back.

OK, here's an easily reproducible test...

Prep:
DROP TABLE IF EXISTS plan;
CREATE TABLE plan
(
  id INTEGER NOT NULL,
  typ INTEGER NOT NULL,
  dat TIMESTAMP,
  val TEXT NOT NULL
);
insert into plan select generate_series(1,100000), 0,
current_timestamp, 'some texts';
CREATE UNIQUE INDEX plan_id ON plan(id);
CREATE INDEX plan_dat ON plan(dat);

testcase.pgb
select count(*) from plan where dat is null and typ = 3;

Session 1:
pgbench -n -f testcase.pgb -t 100

Session 2:
BEGIN; insert into plan select 1000000 + generate_series(1, 100000),
3, NULL, 'b';

Transaction rate in Session 1: (in tps)
(a) before we run Session 2:
Current: 5600tps
Patched: 5600tps

(b) after Session 2 has run, yet before transaction end
Current: 56tps
Patched: 65tps

(c) after Session 2 has aborted
Current/Patched: 836, 1028, 5400tps
VACUUM improves timing again

New version of patch attached which fixes a few bugs.

Patch works and improves things, but we're still swamped by the block
accesses via the index.

Which brings me back to Mark's original point, which is that we are
100x slower in this case and it *is* because the choice of
IndexScan is a bad one for this situation.

After some thought on this, I do think we need to do something about
it directly, rather than by tuning infrastructure (as I just
attempted). The root cause here is that IndexScan plans are sensitive
to mistakes in data distribution, much more so than other plan types.

The two options, broadly, are to either

1. avoid IndexScans in the planner unless they have a *significantly*
better cost. At the moment we use IndexScans if cost is lowest, even
if that is only by a whisker.

2. make IndexScans adaptive so that they switch to other plan types
mid-way through execution.

(2) seems fairly hard generically, since we'd have to keep track of
the tids returned from the IndexScan to allow us to switch to a
different plan and avoid re-issuing rows that we've already returned.
But maybe if we adapted the IndexScan plan type so that it adopted a
more page oriented approach internally, it could act like a
bitmapscan. Anyway, that would need some proof that it would work and
sounds like a fair task.

(1) sounds more easily possible and plausible. At the moment we have
enable_indexscan = off. If we had something like
plan_cost_weight_indexscan = N, we could selectively increase the cost
of index scans so that they would be less likely to be selected. i.e.
plan_cost_weight_indexscan = 2 would mean an indexscan would need to
be half the cost of any other plan before it was selected. (The
parameter name is chosen so the same scheme could apply to other plan
types.) The reason to apply this weighting would be to calculate
"risk-adjusted cost", not just estimated cost.
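
In other words - a hypothetical sketch, since plan_cost_weight_indexscan does not exist today and only the all-or-nothing boolean does:

```sql
-- Today: the only lever is all-or-nothing.
SET enable_indexscan = off;

-- Proposed (hypothetical GUC): an indexscan would be chosen only if
-- its estimated cost * 2 beat the cost of the best alternative plan.
-- SET plan_cost_weight_indexscan = 2;
```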

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From:
Gavin Flower
Date:

On 03/05/13 00:27, Simon Riggs wrote:
> On 2 May 2013 01:49, Mark Kirkwood <> wrote:
>> On 02/05/13 02:06, Tom Lane wrote:
>>> Mark Kirkwood <> writes:
>>>> I am concerned that the deafening lack of any replies to my original
>>>> message is a result of folk glancing at your original quick reply and
>>>> thinking... incomplete problem spec...ignore... when that is not the
>>>> case - yes I should have muttered "9.2" in the original email, but we
>>>> have covered that now.
>>> No, I think it's more that we're trying to get to beta, and so anything
>>> that looks like new development is getting shuffled to folks' "to
>>> look at later" queues.  The proposed patch is IMO a complete nonstarter
>>> anyway; but I'm not sure what a less bogus solution would look like.
>>
>> Yeah, I did think that beta might be consuming everyone's attention (of
>> course immediately *after* sending the email)!
>>
>> And yes, the patch was merely to illustrate the problem rather than any
>> serious attempt at a solution.
>
> I think we need a problem statement before we attempt a solution,
> which is what Tom is alluding to.
>
> ISTM that you've got a case where the plan is very sensitive to a
> table load. Which is a pretty common situation and one that can be
> solved in various ways. I don't see much that Postgres can do because
> it can't know ahead of time you're about to load rows. We could
> imagine an optimizer that set thresholds on plans that caused the
> whole plan to be recalculated halfway through a run, but that would be a
> lot of work to design and implement and even harder to test. Having
> static plans at least allows us to discuss what it does after the fact
> with some ease.
>
> The plan is set using stats that are set when there are very few
> non-NULL rows, and those increase massively on load. The way to cope
> is to run the ANALYZE immediately after the load and then don't allow
> auto-ANALYZE to reset them later.
>
> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services
Would it be practicable to have a facility for telling Postgres in advance what you intend to do, so it can create plans accordingly?

I won't try and invent syntax, but it would be good to tell the system that you intend to:
insert a million rows sequentially
or
insert ten million rows randomly
or
update two million rows in the primary key range 'AA00000' to 'PP88877'
etc.

Though sometimes it may be useful to give a little more detail, especially if you have a good estimate of the distribution of primary keys, e.g.:
AA000000 - AA000456   20%
AA000456 - KE700999    0%
KE700999 - NN400005   30%
NN400005 - PA000001   35%
PA000001 - PP808877   15%
I figure that if the planner had more information about what one intends to do, then it could combine that with the statistics it knows, to come up with a more realistic plan.


Cheers,
Gavin



From:
Matt Clarkson
Date:

>
> (2) seems fairly hard generically, since we'd have to keep track of
> the tids returned from the IndexScan to allow us to switch to a
> different plan and avoid re-issuing rows that we've already returned.
> But maybe if we adapted the IndexScan plan type so that it adopted a
> more page oriented approach internally, it could act like a
> bitmapscan. Anyway, that would need some proof that it would work and
> sounds like a fair task.
>
> (1) sounds more easily possible and plausible. At the moment we have
> enable_indexscan = off. If we had something like
> plan_cost_weight_indexscan = N, we could selectively increase the cost
> of index scans so that they would be less likely to be selected. i.e.
> plan_cost_weight_indexscan = 2 would mean an indexscan would need to
> be half the cost of any other plan before it was selected. (parameter
> name selected so it could apply to all parameter types). The reason to
> apply this weighting would be to calculate "risk adjusted cost" not
> just estimated cost.
>
> --
>  Simon Riggs                   http://www.2ndQuadrant.com/
>  PostgreSQL Development, 24x7 Support, Training & Services


Another option would be for the bulk insert/update/delete to track the
distribution stats as the operation progresses and if it detects that it
is changing the distribution of data beyond a certain threshold it would
update the pg stats accordingly.

--
Matt Clarkson
Catalyst.Net Limited




From:
mark.kirkwood@catalyst.net.nz
Date:

> Simon Riggs wrote:
>
> Patch works and improves things, but we're still swamped by the block
> accesses via the index.

Which *might* be enough to stop it making the server go unresponsive -
we'll look at the effect of this over the next few days. Nice work!

>
> Which brings me back to Mark's original point, which is that we are
> x100 times slower in this case and it *is* because the choice of
> IndexScan is a bad one for this situation.
>
> After some thought on this, I do think we need to do something about
> it directly, rather than by tuning infrastructire (as I just
> attempted). The root cause here is that IndexScan plans are sensitive
> to mistakes in data distribution, much more so than other plan types.
>
> The two options, broadly, are to either
>
> 1. avoid IndexScans in the planner unless they have a *significantly*
> better cost. At the moment we use IndexScans if cost is lowest, even
> if that is only by a whisker.
>
> 2. make IndexScans adaptive so that they switch to other plan types
> mid-way through execution.
>
> (2) seems fairly hard generically, since we'd have to keep track of
> the tids returned from the IndexScan to allow us to switch to a
> different plan and avoid re-issuing rows that we've already returned.
> But maybe if we adapted the IndexScan plan type so that it adopted a
> more page oriented approach internally, it could act like a
> bitmapscan. Anyway, that would need some proof that it would work and
> sounds like a fair task.
>
> (1) sounds more easily possible and plausible. At the moment we have
> enable_indexscan = off. If we had something like
> plan_cost_weight_indexscan = N, we could selectively increase the cost
> of index scans so that they would be less likely to be selected. i.e.
> plan_cost_weight_indexscan = 2 would mean an indexscan would need to
> be half the cost of any other plan before it was selected. (parameter
> name selected so it could apply to all parameter types). The reason to
> apply this weighting would be to calculate "risk adjusted cost" not
> just estimated cost.
>

I'm thinking that a variant of (2) might be simpler to implement:

(I think Matt C essentially beat me to this suggestion - he originally
discovered this issue). It is probably good enough for only *new* plans to
react to the increased/increasing number of in-progress rows. So this
would require backends doing significant numbers of row changes to either
directly update pg_statistic or report their in-progress numbers to the
stats collector. The key change here is that the partial execution numbers
would need to be sent. Clearly one would need to avoid doing this too
often (!) - possibly only when the number of changed rows exceeds the
autovacuum_analyze_scale_factor proportion of the relation concerned, or
similar.
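
As a rough illustration of the cutoff I have in mind - this just mirrors
the formula autovacuum already uses to decide when a table needs
analyzing (the table name "plan" is the one from the test case):

```sql
-- Sketch only: the point at which a backend doing bulk changes might
-- bother reporting its in-progress row count, using the same formula
-- autovacuum applies when deciding whether to analyze a relation.
SELECT c.reltuples,
       current_setting('autovacuum_analyze_threshold')::float8
       + current_setting('autovacuum_analyze_scale_factor')::float8
         * c.reltuples AS report_after_n_changed_rows
FROM pg_class c
WHERE c.relname = 'plan';
```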

regards

Mark



From:
Simon Riggs
Date:

On 7 May 2013 01:23,  <> wrote:

> I'm thinking that a variant of (2) might be simpler to inplement:
>
> (I think Matt C essentially beat me to this suggestion - he originally
> discovered this issue). It is probably good enough for only *new* plans to
> react to the increased/increasing number of in progress rows. So this
> would require backends doing significant numbers of row changes to either
> directly update pg_statistic or report their in progress numbers to the
> stats collector. The key change here is the partial execution numbers
> would need to be sent. Clearly one would need to avoid doing this too
> often (!) - possibly only when number of changed rows >
> autovacuum_analyze_scale_factor proportion of the relation concerned or
> similar.

Are you loading using COPY? Why not break down the load into chunks?

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From:
Mark Kirkwood
Date:

On 07/05/13 18:10, Simon Riggs wrote:
> On 7 May 2013 01:23,  <> wrote:
>
>> I'm thinking that a variant of (2) might be simpler to inplement:
>>
>> (I think Matt C essentially beat me to this suggestion - he originally
>> discovered this issue). It is probably good enough for only *new* plans to
>> react to the increased/increasing number of in progress rows. So this
>> would require backends doing significant numbers of row changes to either
>> directly update pg_statistic or report their in progress numbers to the
>> stats collector. The key change here is the partial execution numbers
>> would need to be sent. Clearly one would need to avoid doing this too
>> often (!) - possibly only when number of changed rows >
>> autovacuum_analyze_scale_factor proportion of the relation concerned or
>> similar.
>
> Are you loading using COPY? Why not break down the load into chunks?
>

INSERT - but we could maybe work around it by chunking the INSERT.
However that *really* breaks the idea that in SQL you just say what you
want, not how the database engine should do it! More practically, it
means that the most obvious and clear way to add your new data has nasty
side effects, and you have to tiptoe around muttering secret
incantations to make things work well :-)
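
For the record, the chunked version would look something like this
(sketch only - the staging table "plan_load" and the chunk size are made
up for illustration):

```sql
-- Each INSERT runs in its own transaction (psql autocommit), so that
-- autoanalyze can run and other sessions can replan between chunks,
-- instead of one monolithic 70 second INSERT holding everything back.
INSERT INTO plan SELECT * FROM plan_load WHERE id BETWEEN 1     AND 10000;
INSERT INTO plan SELECT * FROM plan_load WHERE id BETWEEN 10001 AND 20000;
ANALYZE plan;  -- optionally force fresh stats part-way through
INSERT INTO plan SELECT * FROM plan_load WHERE id BETWEEN 20001 AND 30000;
-- ... and so on until the staging data is drained.
```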

I'm still thinking that making postgres smarter about having current
stats for getting the actual optimal plan is the best solution.

Cheers

Mark



From:
Matt Clarkson
Date:

On Tue, 2013-05-07 at 18:32 +1200, Mark Kirkwood wrote:
> On 07/05/13 18:10, Simon Riggs wrote:
> > On 7 May 2013 01:23,  <> wrote:
> >
> >> I'm thinking that a variant of (2) might be simpler to inplement:
> >>
> >> (I think Matt C essentially beat me to this suggestion - he originally
> >> discovered this issue). It is probably good enough for only *new* plans to
> >> react to the increased/increasing number of in progress rows. So this
> >> would require backends doing significant numbers of row changes to either
> >> directly update pg_statistic or report their in progress numbers to the
> >> stats collector. The key change here is the partial execution numbers
> >> would need to be sent. Clearly one would need to avoid doing this too
> >> often (!) - possibly only when number of changed rows >
> >> autovacuum_analyze_scale_factor proportion of the relation concerned or
> >> similar.
> >
> > Are you loading using COPY? Why not break down the load into chunks?
> >
>
> INSERT - but we could maybe workaround by chunking the INSERT. However
> that *really* breaks the idea that in SQL you just say what you want,
> not how the database engine should do it! And more practically means
> that the most obvious and clear way to add your new data has nasty side
> effects, and you have to tip toe around muttering secret incantations to
> make things work well :-)

We also had the same problem with an UPDATE altering the data
distribution in such a way that trivial but frequently executed queries
caused massive server load until autoanalyze sorted out the stats.

--
Matt Clarkson
Catalyst.Net Limited




From:
Simon Riggs
Date:

On 7 May 2013 07:32, Mark Kirkwood <> wrote:
> On 07/05/13 18:10, Simon Riggs wrote:
>>
>> On 7 May 2013 01:23,  <> wrote:
>>
>>> I'm thinking that a variant of (2) might be simpler to inplement:
>>>
>>> (I think Matt C essentially beat me to this suggestion - he originally
>>> discovered this issue). It is probably good enough for only *new* plans
>>> to
>>> react to the increased/increasing number of in progress rows. So this
>>> would require backends doing significant numbers of row changes to either
>>> directly update pg_statistic or report their in progress numbers to the
>>> stats collector. The key change here is the partial execution numbers
>>> would need to be sent. Clearly one would need to avoid doing this too
>>> often (!) - possibly only when number of changed rows >
>>> autovacuum_analyze_scale_factor proportion of the relation concerned or
>>> similar.
>>
>>
>> Are you loading using COPY? Why not break down the load into chunks?
>>
>
> INSERT - but we could maybe workaround by chunking the INSERT. However that
> *really* breaks the idea that in SQL you just say what you want, not how the
> database engine should do it! And more practically means that the most
> obvious and clear way to add your new data has nasty side effects, and you
> have to tip toe around muttering secret incantations to make things work
> well :-)

Yes, we'd need to break up SQL statements into pieces and use external
transaction snapshots to do that.

> I'm still thinking that making postgres smarter about having current stats
> for getting the actual optimal plan is the best solution.

I agree.

The challenge now is to come up with something that actually works;
most of the ideas have been very vague and ignore the many downsides.
The hard bit is the analysis and balanced thinking, not the
developing.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From:
Mark Kirkwood
Date:

On 07/05/13 19:33, Simon Riggs wrote:
> On 7 May 2013 07:32, Mark Kirkwood <> wrote:
>> On 07/05/13 18:10, Simon Riggs wrote:
>>>
>>> On 7 May 2013 01:23,  <> wrote:
>>>
>>>> I'm thinking that a variant of (2) might be simpler to inplement:
>>>>
>>>> (I think Matt C essentially beat me to this suggestion - he originally
>>>> discovered this issue). It is probably good enough for only *new* plans
>>>> to
>>>> react to the increased/increasing number of in progress rows. So this
>>>> would require backends doing significant numbers of row changes to either
>>>> directly update pg_statistic or report their in progress numbers to the
>>>> stats collector. The key change here is the partial execution numbers
>>>> would need to be sent. Clearly one would need to avoid doing this too
>>>> often (!) - possibly only when number of changed rows >
>>>> autovacuum_analyze_scale_factor proportion of the relation concerned or
>>>> similar.
>>>
>>>
>>> Are you loading using COPY? Why not break down the load into chunks?
>>>
>>
>> INSERT - but we could maybe workaround by chunking the INSERT. However that
>> *really* breaks the idea that in SQL you just say what you want, not how the
>> database engine should do it! And more practically means that the most
>> obvious and clear way to add your new data has nasty side effects, and you
>> have to tip toe around muttering secret incantations to make things work
>> well :-)
>
> Yes, we'd need to break up SQL statements into pieces and use external
> transaction snapshots to do that.
>
>> I'm still thinking that making postgres smarter about having current stats
>> for getting the actual optimal plan is the best solution.
>
> I agree.
>
> The challenge now is to come up with something that actually works;
> most of the ideas have been very vague and ignore the many downsides.
> The hard bit is the analysis and balanced thinking, not the
> developing.
>

Yeah - seeing likely downsides can be a bit tricky too. I'll have a play
with some prototyping ideas, since this is actually an area of postgres
(analyze/stats collector) that I've fiddled with before :-)

Cheers

Mark



From:
Vitalii Tymchyshyn
Date:

Well, could you write a trigger that would do what you need? AFAIR analyze data is stored regardless of transaction boundaries. You could store some counters in session vars and issue an explicit analyze when enough rows have been added.
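
Something along those lines might look like this (untested sketch - the
custom "myapp.rows_changed" variable and the 50000-row cutoff are made
up for illustration):

```sql
-- Row trigger that counts inserts in a session variable and issues an
-- explicit ANALYZE every 50000 rows. Assumes the session first ran
--   SELECT set_config('myapp.rows_changed', '0', false);
-- since current_setting() errors out on an unset variable.
CREATE OR REPLACE FUNCTION plan_analyze_watch() RETURNS trigger
LANGUAGE plpgsql AS $$
DECLARE
    n bigint;
BEGIN
    n := current_setting('myapp.rows_changed')::bigint + 1;
    PERFORM set_config('myapp.rows_changed', n::text, false);
    IF n % 50000 = 0 THEN
        EXECUTE 'ANALYZE plan';   -- ANALYZE is legal inside a transaction
    END IF;
    RETURN NEW;
END;
$$;

CREATE TRIGGER plan_analyze_watch
AFTER INSERT ON plan
FOR EACH ROW EXECUTE PROCEDURE plan_analyze_watch();
```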

On 7 May 2013 08:33, "Mark Kirkwood" <> wrote:
On 07/05/13 18:10, Simon Riggs wrote:
On 7 May 2013 01:23,  <> wrote:

I'm thinking that a variant of (2) might be simpler to inplement:

(I think Matt C essentially beat me to this suggestion - he originally
discovered this issue). It is probably good enough for only *new* plans to
react to the increased/increasing number of in progress rows. So this
would require backends doing significant numbers of row changes to either
directly update pg_statistic or report their in progress numbers to the
stats collector. The key change here is the partial execution numbers
would need to be sent. Clearly one would need to avoid doing this too
often (!) - possibly only when number of changed rows >
autovacuum_analyze_scale_factor proportion of the relation concerned or
similar.

Are you loading using COPY? Why not break down the load into chunks?


INSERT - but we could maybe workaround by chunking the INSERT. However that *really* breaks the idea that in SQL you just say what you want, not how the database engine should do it! And more practically means that the most obvious and clear way to add your new data has nasty side effects, and you have to tip toe around muttering secret incantations to make things work well :-)

I'm still thinking that making postgres smarter about having current stats for getting the actual optimal plan is the best solution.

Cheers

Mark



From:
Mark Kirkwood
Date:

(See below for the reply)

On 10/05/13 22:48, Vitalii Tymchyshyn wrote:
> Well, could you write a trigger that would do what you need? AFAIR
> analyze data is stored no matter transaction boundaries. You could store
> some counters in session vars and issue an explicit analyze when enough
> rows were added.
>
> On 7 May 2013 08:33, "Mark Kirkwood" <
> <mailto:>> wrote:
>
>     On 07/05/13 18:10, Simon Riggs wrote:
>
>         On 7 May 2013 01:23,  <
>         <mailto:>> wrote:
>
>             I'm thinking that a variant of (2) might be simpler to
>             inplement:
>
>             (I think Matt C essentially beat me to this suggestion - he
>             originally
>             discovered this issue). It is probably good enough for only
>             *new* plans to
>             react to the increased/increasing number of in progress
>             rows. So this
>             would require backends doing significant numbers of row
>             changes to either
>             directly update pg_statistic or report their in progress
>             numbers to the
>             stats collector. The key change here is the partial
>             execution numbers
>             would need to be sent. Clearly one would need to avoid doing
>             this too
>             often (!) - possibly only when number of changed rows >
>             autovacuum_analyze_scale_factor proportion of the relation
>             concerned or
>             similar.
>
>
>         Are you loading using COPY? Why not break down the load into chunks?
>
>
>     INSERT - but we could maybe workaround by chunking the INSERT.
>     However that *really* breaks the idea that in SQL you just say what
>     you want, not how the database engine should do it! And more
>     practically means that the most obvious and clear way to add your
>     new data has nasty side effects, and you have to tip toe around
>     muttering secret incantations to make things work well :-)
>
>     I'm still thinking that making postgres smarter about having current
>     stats for getting the actual optimal plan is the best solution.

Unfortunately a trigger will not really do the job - analyze ignores
in-progress rows (unless they were added by the current transaction),
and the changes made by analyze are not seen by any other sessions. So
there are no changes to plans until the entire INSERT is complete and
COMMIT happens (which could be a while - too long in our case).

Figuring out how to improve on this situation is tricky.


Cheers

Mark


From:
Tom Lane
Date:

Mark Kirkwood <> writes:
> Unfortunately a trigger will not really do the job - analyze ignores in
> progress rows (unless they were added by the current transaction), and
> then the changes made by analyze are not seen by any other sessions. So
> no changes to plans until the entire INSERT is complete and COMMIT
> happens (which could be a while - too long in our case).

I'm not sure I believe the thesis that plans won't change at all.
The planner will notice that the physical size of the table is growing.
That may not be enough, if the table-contents statistics are missing
or completely unreflective of reality, but it's something.

It is true that *already cached* plans won't change until after an
ANALYZE is done (the key point there being that ANALYZE sends out a
shared-inval message to force replanning of plans for the table).
Conceivably you could issue concurrent ANALYZEs occasionally while
the INSERT is running, not so much to update the stats --- because
they wouldn't --- as to force cached-plan invalidation.
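
Concretely, from a second session while the load is running, something
as simple as

```sql
-- The ANALYZE won't see the uncommitted rows, but the shared-inval
-- message it broadcasts forces other backends to replan queries on
-- "plan" the next time they execute them.
ANALYZE plan;
```

repeated every so often would do it (newer psql's \watch makes the
repetition trivial).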

            regards, tom lane


From:
Mark Kirkwood
Date:

On 11/05/13 01:30, Tom Lane wrote:
> Mark Kirkwood <> writes:
>> Unfortunately a trigger will not really do the job - analyze ignores in
>> progress rows (unless they were added by the current transaction), and
>> then the changes made by analyze are not seen by any other sessions. So
>> no changes to plans until the entire INSERT is complete and COMMIT
>> happens (which could be a while - too long in our case).
>
> I'm not sure I believe the thesis that plans won't change at all.
> The planner will notice that the physical size of the table is growing.
> That may not be enough, if the table-contents statistics are missing
> or completely unreflective of reality, but it's something.
>
> It is true that *already cached* plans won't change until after an
> ANALYZE is done (the key point there being that ANALYZE sends out a
> shared-inval message to force replanning of plans for the table).
> Conceivably you could issue concurrent ANALYZEs occasionally while
> the INSERT is running, not so much to update the stats --- because
> they wouldn't --- as to force cached-plan invalidation.

Yeah - true. I was focusing on the particular type of query illustrated
in the test case - one that depends almost entirely on updated
selectivity stats for a column, which unfortunately wouldn't change.

Cheers

Mark



From:
Heikki Linnakangas
Date:

On 03.05.2013 15:41, Simon Riggs wrote:
> What appears to be happening is we're spending a lot of time in
> TransactionIdIsInProgress() so we can set hints and then when we find
> it is still in progress we then spend more time in XidIsInSnapshot()
> while we check that it is still invisible to us. Even if the
> transaction we see repeatedly ends, we will still pay the cost in
> XidIsInSnapshot repeatedly as we execute.
>
> Given that code path, I would expect it to suck worse on a live system
> with many sessions, and even worse with many subtransactions.
>
> (1) A proposed fix is attached, but its only a partial one and barely tested.
>
> Deeper fixes might be
>
> (2)  to sort the xid array if we call XidIsInSnapshot too many times
> in a transaction. I don't think that is worth it, because a long
> running snapshot may be examined many times, but is unlikely to see
> multiple in-progress xids repeatedly. Whereas your case seems
> reasonably common.

Yeah, sorting would be a waste of time most of the time.

Instead of adding a new cache field, how about just swapping the matched
XID to the beginning of the array?

Did you have some simple performance test script for this?

- Heikki


From:
Heikki Linnakangas
Date:

On 06.05.2013 04:51, Mark Kirkwood wrote:
> On 05/05/13 00:49, Simon Riggs wrote:
>> On 3 May 2013 13:41, Simon Riggs <> wrote:
>>
>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>> since we don't *need* to check that, so if we keep checking the same
>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>> seem to be long running. That's slightly more coding than my quick
>>> hack here but seems worth it.
>>>
>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>
>>> This is a similar optimisation to the one I introduced for
>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>> checking of as yet-incomplete xids, and to bulk concurrent
>>> transactions.
>>
>> ISTM we can improve performance of TransactionIdIsInProgress() by
>> caching the procno of our last xid.
>>
>> Mark, could you retest with both these patches? Thanks.
>>
>
> Thanks Simon, will do and report back.

Did anyone ever try (3) ?

I'm not sure if this is the same idea as (3) above, but ISTM that
HeapTupleSatisfiesMVCC doesn't actually need to call
TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The
comment at the top of tqual.c says:

>  * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
>  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
>  * pg_clog).  Otherwise we have a race condition: we might decide that a
>  * just-committed transaction crashed, because none of the tests succeed.
>  * xact.c is careful to record commit/abort in pg_clog before it unsets
>  * MyPgXact->xid in PGXACT array.  That fixes that problem, but it also
>  * means there is a window where TransactionIdIsInProgress and
>  * TransactionIdDidCommit will both return true.  If we check only
>  * TransactionIdDidCommit, we could consider a tuple committed when a
>  * later GetSnapshotData call will still think the originating transaction
>  * is in progress, which leads to application-level inconsistency.    The
>  * upshot is that we gotta check TransactionIdIsInProgress first in all
>  * code paths, except for a few cases where we are looking at
>  * subtransactions of our own main transaction and so there can't be any
>  * race condition.

If TransactionIdIsInProgress() returns true for a given XID, then surely
it was also running when the snapshot was taken (or had not even begun
yet). In which case the XidInMVCCSnapshot() call will also return true.
Am I missing something?

There's one little problem: we currently only set the hint bits when
TransactionIdIsInProgress() returns false. If we do that earlier, then
even though HeapTupleSatisfiesMVCC works correctly thanks to the
XidInMVCCSnapshot call, other HeapTupleSatisfies* functions that don't
call XIdInMVCCSnapshot might see the tuple as committed or aborted too
early, if they see the hint bit as set while the transaction is still
in-progress according to the proc array. Would have to check all the
callers of those other HeapTupleSatisfies* functions to verify if that's OK.

- Heikki


From:
Tom Lane
Date:

Heikki Linnakangas <> writes:
> I'm not sure if this the same idea as (3) above, but ISTM that
> HeapTupleSatisfiesMVCC doesn't actually need to call
> TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The
> comment at the top of tqual.c says:

>> * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
>> * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
>> * pg_clog).  Otherwise we have a race condition: we might decide that a
>> * just-committed transaction crashed, because none of the tests succeed.
>> * xact.c is careful to record commit/abort in pg_clog before it unsets
>> * MyPgXact->xid in PGXACT array.  That fixes that problem, but it also
>> * means there is a window where TransactionIdIsInProgress and
>> * TransactionIdDidCommit will both return true.  If we check only
>> * TransactionIdDidCommit, we could consider a tuple committed when a
>> * later GetSnapshotData call will still think the originating transaction
>> * is in progress, which leads to application-level inconsistency.    The
>> * upshot is that we gotta check TransactionIdIsInProgress first in all
>> * code paths, except for a few cases where we are looking at
>> * subtransactions of our own main transaction and so there can't be any
>> * race condition.

> If TransactionIdIsInProgress() returns true for a given XID, then surely
> it was also running when the snapshot was taken (or had not even began
> yet). In which case the XidInMVCCSnapshot() call will also return true.
> Am I missing something?

Yes, you're failing to understand the nature of the race condition.
What we're concerned about is that if tqual says a tuple is committed,
its transaction must be committed (not still in progress) according to
any *subsequently taken* snapshot.  This is not about the contents of
the snapshot we're currently consulting; it's about not wanting a tuple
to be thought committed if anyone could possibly later decide its
transaction is still in progress.

It's possible that this issue would be moot if the only use of
transaction-in-progress data were in tqual.c (so that we could assume
all later tests use the same logic you propose here), but I doubt that
that's true.

            regards, tom lane


From:
Simon Riggs
Date:

On 16 June 2013 16:04, Heikki Linnakangas <> wrote:
> On 03.05.2013 15:41, Simon Riggs wrote:
>>
>> What appears to be happening is we're spending a lot of time in
>> TransactionIdIsInProgress() so we can set hints and then when we find
>> it is still in progress we then spend more time in XidIsInSnapshot()
>> while we check that it is still invisible to us. Even if the
>> transaction we see repeatedly ends, we will still pay the cost in
>> XidIsInSnapshot repeatedly as we execute.
>>
>> Given that code path, I would expect it to suck worse on a live system
>> with many sessions, and even worse with many subtransactions.
>>
>> (1) A proposed fix is attached, but its only a partial one and barely
>> tested.
>>
>> Deeper fixes might be
>>
>> (2)  to sort the xid array if we call XidIsInSnapshot too many times
>> in a transaction. I don't think that is worth it, because a long
>> running snapshot may be examined many times, but is unlikely to see
>> multiple in-progress xids repeatedly. Whereas your case seems
>> reasonably common.
>
>
> Yeah, sorting would be a waste of time most of the time.
>
> Instead of adding a new cache field, how about just swapping the matched XID
> to the beginning of the array?

Do you think that is significantly different from what I've done?

> Did you have some simple performance test script for this?

Files attached to set up and tear down the test. Needs
max_prepared_transactions = 100

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

From:
Simon Riggs
Date:

On 16 June 2013 16:23, Heikki Linnakangas <> wrote:
> On 06.05.2013 04:51, Mark Kirkwood wrote:
>>
>> On 05/05/13 00:49, Simon Riggs wrote:
>>>
>>> On 3 May 2013 13:41, Simon Riggs <> wrote:
>>>
>>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>>> since we don't *need* to check that, so if we keep checking the same
>>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>>> seem to be long running. That's slightly more coding than my quick
>>>> hack here but seems worth it.
>>>>
>>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>>
>>>> This is a similar optimisation to the one I introduced for
>>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>>> checking of as yet-incomplete xids, and to bulk concurrent
>>>> transactions.
>>>
>>>
>>> ISTM we can improve performance of TransactionIdIsInProgress() by
>>> caching the procno of our last xid.
>>>
>>> Mark, could you retest with both these patches? Thanks.
>>>
>>
>> Thanks Simon, will do and report back.
>
>
> Did anyone ever try (3) ?

No, because my other patch meant I didn't need to. In other words, my
other patch sped up repeated access enough that I didn't care about (3)
anymore.


> I'm not sure if this the same idea as (3) above, but ISTM that
> HeapTupleSatisfiesMVCC doesn't actually need to call
> TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The
> comment at the top of tqual.c says:
>
>>  * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT
>> array)
>>  * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
>>  * pg_clog).  Otherwise we have a race condition: we might decide that a
>>  * just-committed transaction crashed, because none of the tests succeed.
>>  * xact.c is careful to record commit/abort in pg_clog before it unsets
>>  * MyPgXact->xid in PGXACT array.  That fixes that problem, but it also
>>  * means there is a window where TransactionIdIsInProgress and
>>  * TransactionIdDidCommit will both return true.  If we check only
>>  * TransactionIdDidCommit, we could consider a tuple committed when a
>>  * later GetSnapshotData call will still think the originating transaction
>>  * is in progress, which leads to application-level inconsistency.
>> The
>>  * upshot is that we gotta check TransactionIdIsInProgress first in all
>>  * code paths, except for a few cases where we are looking at
>>  * subtransactions of our own main transaction and so there can't be any
>>  * race condition.
>
>
> If TransactionIdIsInProgress() returns true for a given XID, then surely it
> was also running when the snapshot was taken (or had not even began yet). In
> which case the XidInMVCCSnapshot() call will also return true. Am I missing
> something?
>
> There's one little problem: we currently only set the hint bits when
> TransactionIdIsInProgress() returns false. If we do that earlier, then even
> though HeapTupleSatisfiesMVCC works correctly thanks to the
> XidInMVCCSnapshot call, other HeapTupleSatisfies* functions that don't call
> XIdInMVCCSnapshot might see the tuple as committed or aborted too early, if
> they see the hint bit as set while the transaction is still in-progress
> according to the proc array. Would have to check all the callers of those
> other HeapTupleSatisfies* functions to verify if that's OK.

Well, I looked at that and it's too complex and fiddly to be worth it, IMHO.

--
 Simon Riggs                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


From:
Jeff Janes
Date:

On Mon, May 6, 2013 at 1:14 AM, Simon Riggs <> wrote:
> On 6 May 2013 02:51, Mark Kirkwood <> wrote:
>> On 05/05/13 00:49, Simon Riggs wrote:
>>>
>>> On 3 May 2013 13:41, Simon Riggs <> wrote:
>>>
>>>> (3) to make the check on TransactionIdIsInProgress() into a heuristic,
>>>> since we don't *need* to check that, so if we keep checking the same
>>>> xid repeatedly we can reduce the number of checks or avoid xids that
>>>> seem to be long running. That's slightly more coding than my quick
>>>> hack here but seems worth it.
>>>>
>>>> I think we need both (1) and (3) but the attached patch does just (1).
>>>>
>>>> This is a similar optimisation to the one I introduced for
>>>> TransactionIdIsKnownCompleted(), except this applies to repeated
>>>> checking of as yet-incomplete xids, and to bulk concurrent
>>>> transactions.
>>>
>>>
>>> ISTM we can improve performance of TransactionIdIsInProgress() by
>>> caching the procno of our last xid.
>>>
>>> Mark, could you retest with both these patches? Thanks.
>>>
>>
>> Thanks Simon, will do and report back.
>
> OK, here's a easily reproducible test...
>
> Prep:
> DROP TABLE IF EXISTS plan;
> CREATE TABLE plan
> (
>   id INTEGER NOT NULL,
>   typ INTEGER NOT NULL,
>   dat TIMESTAMP,
>   val TEXT NOT NULL
> );
> insert into plan select generate_series(1,100000), 0,
> current_timestamp, 'some texts';
> CREATE UNIQUE INDEX plan_id ON plan(id);
> CREATE INDEX plan_dat ON plan(dat);
>
> testcase.pgb
> select count(*) from plan where dat is null and typ = 3;
>
> Session 1:
> pgbench -n -f testcase.pgb -t 100
>
> Session 2:
> BEGIN; insert into plan select 1000000 + generate_series(1, 100000),
> 3, NULL, 'b';
>
> Transaction rate in Session 1: (in tps)
> (a) before we run Session 2:
> Current: 5600tps
> Patched: 5600tps
>
> (b) after Session 2 has run, yet before transaction end
> Current: 56tps
> Patched: 65tps


When I run this test case in single-client mode, I don't see nearly
that much speedup - it just goes from 38.99 TPS to 40.12 TPS. But
still, it is a speedup, and very reproducible (t-test p-val < 1e-40,
n of 21 for both).

But I also tried it with 4 pgbench clients, and ran into a collapse in
performance, with TPS dropping down to ~8. It is too variable to
figure out how reliable the speed-up with this patch is, so far.
Apparently they are all fighting over the spinlock on the
ProcArrayLock.

This is a single quad core, "Intel(R) Xeon(R) CPU X3210 @ 2.13GHz"

So I agree with (3) above, about not checking
TransactionIdIsInProgress repeatedly.  Or could we change the order of
operations so that TransactionIdIsInProgress is checked only after
XidInMVCCSnapshot?
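The reordering idea can be sketched in miniature: the snapshot check is backend-local and lock-free, so doing it first lets the expensive proc-array probe be skipped whenever the snapshot already rules the xid in progress. This is an illustrative toy, not the actual tqual.c or procarray.c code; all names and the stub are invented:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

typedef struct
{
    TransactionId xmin;    /* all xids < xmin were finished at snapshot time */
    TransactionId xmax;    /* all xids >= xmax had not started yet */
    TransactionId *xip;    /* xids in progress at snapshot time */
    int           xcnt;
} SnapshotSketch;

static bool probe_was_called = false;

/* Stub standing in for the expensive shared-memory check that takes
 * ProcArrayLock in the real server. */
static bool
TransactionIdIsInProgressStub(TransactionId xid)
{
    (void) xid;
    probe_was_called = true;
    return false;
}

/* Backend-local snapshot check: touches no shared memory, so no lock. */
static bool
XidInMVCCSnapshotSketch(TransactionId xid, const SnapshotSketch *snap)
{
    if (xid >= snap->xmax)
        return true;
    if (xid < snap->xmin)
        return false;
    for (int i = 0; i < snap->xcnt; i++)
        if (snap->xip[i] == xid)
            return true;
    return false;
}

/* Reordered visibility test: if the snapshot already says "in progress",
 * the tuple is invisible regardless, so the proc-array probe is skipped. */
static bool
tuple_xid_invisible(TransactionId xid, const SnapshotSketch *snap)
{
    if (XidInMVCCSnapshotSketch(xid, snap))
        return true;
    return TransactionIdIsInProgressStub(xid);
}
```

The open question with any such reordering is hint-bit setting, which today happens only after TransactionIdIsInProgress() has returned false.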

Or perhaps the comment "XXX Can we test without the lock first?" could
be implemented and save the day here?

Looking at the code, there is something that bothers me about this part:

        pxid = cxid = InvalidTransactionId;
        return false;

If it is safe to return false at this point (as determined by the
stale values of pxid and cxid) then why would we clear out the stale
values so they can't be used in the future to also short circuit
things?  On the other hand, if the stale values need to be cleared so
they are not misleading to future invocations, why is it safe for this
invocation to have made a decision based on them?  Maybe with more
thought I will see why this is so.

....


>
> Which brings me back to Mark's original point, which is that we are
> x100 times slower in this case and it *is* because the choice of
> IndexScan is a bad one for this situation.
>
> After some thought on this, I do think we need to do something about
> it directly, rather than by tuning infrastructure (as I just
> attempted). The root cause here is that IndexScan plans are sensitive
> to mistakes in data distribution, much more so than other plan types.
>
> The two options, broadly, are to either
>
> 1. avoid IndexScans in the planner unless they have a *significantly*
> better cost. At the moment we use IndexScans if cost is lowest, even
> if that is only by a whisker.

This wouldn't work well in Mark's specific case, because the problem
is that it is using the wrong index, not that it is using an index at
all.  There are two candidate indexes, and one looks slightly better
but then gets ruined by the in-progress insert, while the other looks
slightly worse but would be resistant to the in-progress insert.
Switching from the bad index to the seq scan is not going to fix
things.  I don't think there is any heuristic solution here other than
to keep track of the invisible data distribution as well as the
visible data distribution.

The more I've thought about it, the more I see the charm of Mark's
original proposal.  Why not build the statistics assuming that the
in-progress insert will commit?  It is not a complete solution,
because autoanalyze will not get triggered until the transaction
completes.  But why not let the manual ANALYZE get the benefit of
seeing them?

The statistics serve two masters.  One is to estimate how many rows
will actually be returned.  The other is to estimate how much work it
will take to return them (including the work of groveling through a
list of in-progress tuples).  Right now those are implicitly considered
the same thing--we could change that and keep separate sets of
statistics, but I think we could improve things some without doing
that.  For the first case of estimating actual rows returned, I think
counting in-progress rows is a wash.  It seems just about as likely
that the bulk transaction which was in progress at the time of the
last ANALYZE has already committed at the time of the planning (but
not yet completed the autoanalyze) as it is that the bulk transaction
is still in progress at the time of the planning; which means counting
the rows as if they committed is sometimes right and sometimes wrong
but in about equal measure.  But for the second master, counting the
in-progress rows seems like a win all the time.  Either we actually
see them and do the work, or we refuse to see them but still have to
do the work to figure that out.
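The split described above could, hypothetically, look like two row estimates kept side by side. This is only an illustration of the two-masters idea, not anything the planner currently does; all names are invented:

```c
/* Sketch: separate statistics for the two masters. One estimate drives
 * result-size reasoning, the other drives cost. */
typedef struct
{
    double visible_rows;      /* rows an MVCC scan is expected to return */
    double in_progress_rows;  /* uncommitted rows it must still grovel through */
} SplitStatsSketch;

/* First master: how many rows will actually be returned. */
static double
estimated_rows_returned(const SplitStatsSketch *s)
{
    return s->visible_rows;
}

/* Second master: how much work it takes. Invisible tuples still cost
 * visibility checks, so they belong on the cost side even though they
 * do not appear in the row count. */
static double
estimated_rows_examined(const SplitStatsSketch *s)
{
    return s->visible_rows + s->in_progress_rows;
}
```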

Cheers,

Jeff


From:
Ants Aasma
Date:

On Sat, Jul 13, 2013 at 1:47 AM, Jeff Janes <> wrote:
> But I also tried it with 4 pgbench clients, and ran into a collapse in
> performance, with throughput dropping to ~8 TPS.  It is too variable,
> so far, to figure out how reliable the speed-up from this patch is.
> Apparently they are all fighting over the spinlock on the
> ProcArrayLock.
>
> This is a single quad core, "Intel(R) Xeon(R) CPU           X3210  @ 2.13GHz"
>
> So I agree with (3) above, about not checking
> TransactionIdIsInProgress repeatedly.  Or could we change the order of
> operations so that TransactionIdIsInProgress is checked only after
> XidInMVCCSnapshot?

I haven't checked the patch in detail, but it sounds like my proposal
for CSN-based snapshots[1] could help here. Using it,
TransactionIdIsInProgress can be done completely lock-free. It would
include a direct dense array lookup, read barrier and a check of the
dense/sparse horizon, and if necessary a binary search in the sparse
array and another read barrier and check for sparse array version
counter.
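As a rough illustration of the lookup sequence outlined above (direct lookup in a dense array, falling back to binary search in a sorted sparse array of long-running xids): the barriers and version checks are elided here, and all structure and names are invented rather than taken from the actual patch:

```c
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t TransactionId;

#define DENSE_SIZE 1024

typedef struct
{
    TransactionId dense_horizon;           /* xids >= this use the dense array */
    bool dense_in_progress[DENSE_SIZE];    /* indexed by xid - dense_horizon */
    TransactionId sparse[8];               /* sorted, still-running old xids */
    int  sparse_count;
} CsnLookupSketch;

static bool
xid_is_in_progress(const CsnLookupSketch *s, TransactionId xid)
{
    if (xid >= s->dense_horizon)
    {
        /* Direct array lookup; real code would issue a read barrier here
         * and re-check the dense/sparse horizon afterwards. */
        return s->dense_in_progress[xid - s->dense_horizon];
    }

    /* Old xid: binary search the sparse array of long-running transactions.
     * Real code would also verify the sparse array's version counter. */
    int lo = 0, hi = s->sparse_count - 1;
    while (lo <= hi)
    {
        int mid = (lo + hi) / 2;
        if (s->sparse[mid] == xid)
            return true;
        if (s->sparse[mid] < xid)
            lo = mid + 1;
        else
            hi = mid - 1;
    }
    return false;
}
```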

I plan to start working on the patch next week. I hope to have a first
cut available for CF2.

[1] http://www.postgresql.org/message-id/CA+CSw_tEpJ=md1zgxPkjH6CWDnTDft4gBi=

Regards,
Ants Aasma
--
Cybertec Schönig & Schönig GmbH
Gröhrmühlgasse 26
A-2700 Wiener Neustadt
Web: http://www.postgresql-support.de


From:
Jeff Janes
Date:

On Sunday, June 16, 2013, Heikki Linnakangas wrote:
On 06.05.2013 04:51, Mark Kirkwood wrote:
On 05/05/13 00:49, Simon Riggs wrote:
On 3 May 2013 13:41, Simon Riggs <> wrote:

(3) to make the check on TransactionIdIsInProgress() into a heuristic,
since we don't *need* to check that, so if we keep checking the same
xid repeatedly we can reduce the number of checks or avoid xids that
seem to be long running. That's slightly more coding than my quick
hack here but seems worth it.

I think we need both (1) and (3) but the attached patch does just (1).

This is a similar optimisation to the one I introduced for
TransactionIdIsKnownCompleted(), except this applies to repeated
checking of as-yet-incomplete xids, and to bulk concurrent
transactions.

ISTM we can improve performance of TransactionIdIsInProgress() by
caching the procno of our last xid.

Mark, could you retest with both these patches? Thanks.


Thanks Simon, will do and report back.

Did anyone ever try (3)?

I'm not sure if this is the same idea as (3) above, but ISTM that HeapTupleSatisfiesMVCC doesn't actually need to call TransactionIdIsInProgress(), because it checks XidInMVCCSnapshot(). The comment at the top of tqual.c says:

Or at least, it doesn't need to call TransactionIdIsInProgress() if XidInMVCCSnapshot() returned true. 

 

 * NOTE: must check TransactionIdIsInProgress (which looks in PGXACT array)
 * before TransactionIdDidCommit/TransactionIdDidAbort (which look in
 * pg_clog).  Otherwise we have a race condition: we might decide that a
 * just-committed transaction crashed, because none of the tests succeed.
 * xact.c is careful to record commit/abort in pg_clog before it unsets
 * MyPgXact->xid in PGXACT array.  That fixes that problem, but it also
 * means there is a window where TransactionIdIsInProgress and
 * TransactionIdDidCommit will both return true.  If we check only
 * TransactionIdDidCommit, we could consider a tuple committed when a
 * later GetSnapshotData call will still think the originating transaction
 * is in progress, which leads to application-level inconsistency.      The
 * upshot is that we gotta check TransactionIdIsInProgress first in all
 * code paths, except for a few cases where we are looking at
 * subtransactions of our own main transaction and so there can't be any
 * race condition.

If TransactionIdIsInProgress() returns true for a given XID, then surely it was also running when the snapshot was taken (or had not even begun yet), in which case the XidInMVCCSnapshot() call will also return true. Am I missing something?

There's one little problem: we currently only set the hint bits when TransactionIdIsInProgress() returns false. If we do that earlier,

But why would we do that earlier?  If we never bother to call TransactionIdIsInProgress(), then we just don't set the hint bits, because we don't know what to set them to.  It can't matter what order we call TransactionIdIsInProgress and TransactionIdDidCommit in if we don't call either of them at all during this invocation.

Cheers,

Jeff