Discussion: 27 second plan times

27 second plan times

From:
Gregory Stark
Date:
Following up on some complaints we've had about extremely large plan times for
large partitioned tables I've been doing some profiling. I've constructed a
situation where it takes upwards of 30 seconds to plan a simple query like:

postgres=# explain select * from a where i between 999 and 1001;
                              QUERY PLAN
----------------------------------------------------------------------
 Result  (cost=0.00..8.63 rows=58 width=108)
   ->  Append  (cost=0.00..8.63 rows=58 width=108)
         ->  Seq Scan on a  (cost=0.00..1.01 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a997 a  (cost=0.00..1.27 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a998 a  (cost=0.00..1.27 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a999 a  (cost=0.00..1.27 rows=18 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a1000 a  (cost=0.00..1.27 rows=18 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a1001 a  (cost=0.00..1.27 rows=18 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
         ->  Seq Scan on a1002 a  (cost=0.00..1.27 rows=1 width=108)
               Filter: ((i >= 999) AND (i <= 1001))
(16 rows)

Time: 46324.627 ms

Table "a" is a parent table with 2,000 partitions each of which have 102
columns, two of which are covered by constraints of the form "WHERE i BETWEEN
90 AND 110".

The gprof output is pretty damning:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 62.58    203.76   203.76 21,474,937     0.00     0.00  SearchCatCache
  9.01    233.08    29.32  1,923,789     0.00     0.00  list_nth_cell
  5.34    250.46    17.38    2386465     0.00     0.00  pgstat_initstats
  2.80    259.57     9.11  235691652     0.00     0.00  AllocSetAlloc
  1.95    265.91     6.34  219852840     0.00     0.00  nocachegetattr
  1.19    269.78     3.87  256569078     0.00     0.00  FunctionCall2
  0.74    272.18     2.40  107923848     0.00     0.00  MemoryContextAllocZeroAligned


The SearchCatCache here is the one in get_attavgwidth called to estimate the
relation width. There are 200k attributes being measured here but I'm not
clear why it's causing 21M calls.

The first thing that comes to mind is that we're doing the
constraint_exclusion code *after* estimating the width of the relations we're
going to exclude. If we push the constraint exclusion up a few lines, the
planning time goes down to 1.7s.

I think there's still a problem here with some kind of n^2 behaviour for
appends of very wide tables, but I haven't quite nailed it yet. In any case,
is there any reason not to make the following small change to move the
constraint exclusion ahead of the size estimates and index checks and save
ourselves potentially a lot of work?

Index: allpaths.c
===================================================================
RCS file: /home/stark/src/REPOSITORY/pgsql/src/backend/optimizer/path/allpaths.c,v
retrieving revision 1.161
diff -c -r1.161 allpaths.c
*** allpaths.c    22 Feb 2007 22:00:23 -0000    1.161
--- allpaths.c    20 Apr 2007 18:12:40 -0000
***************
*** 196,215 ****
  static void
  set_plain_rel_pathlist(PlannerInfo *root, RelOptInfo *rel, RangeTblEntry *rte)
  {
-     /* Mark rel with estimated output rows, width, etc */
-     set_baserel_size_estimates(root, rel);
- 
-     /* Test any partial indexes of rel for applicability */
-     check_partial_indexes(root, rel);
- 
-     /*
-      * Check to see if we can extract any restriction conditions from join
-      * quals that are OR-of-AND structures.  If so, add them to the rel's
-      * restriction list, and recompute the size estimates.
-      */
-     if (create_or_index_quals(root, rel))
-         set_baserel_size_estimates(root, rel);
- 
      /*
       * If we can prove we don't need to scan the rel via constraint exclusion,
       * set up a single dummy path for it.  (Rather than inventing a special
--- 196,201 ----
***************
*** 228,233 ****
--- 214,233 ----
          return;
      }
  
+     /* Mark rel with estimated output rows, width, etc */
+     set_baserel_size_estimates(root, rel);
+ 
+     /* Test any partial indexes of rel for applicability */
+     check_partial_indexes(root, rel);
+ 
+     /*
+      * Check to see if we can extract any restriction conditions from join
+      * quals that are OR-of-AND structures.  If so, add them to the rel's
+      * restriction list, and recompute the size estimates.
+      */
+     if (create_or_index_quals(root, rel))
+         set_baserel_size_estimates(root, rel);
+ 
      /*
       * Generate paths and add them to the rel's pathlist.
       *




--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: 27 second plan times

From:
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> The SearchCatCache here is the one in get_attavgwidth called to estimate the
> relation width.

Hmm, that information is supposed to be cached ... could you provide the
test case?

> The first thing that comes to mind is that we're doing the
> constraint_exclusion code *after* estimating the width of the relations we're
> going to exclude. If we push the constraint exclusion up a few lines the
> planning time goes down the 1.7s.

... and probably breaks a few things; are you sure there are no needed
side effects of the skipped code?
        regards, tom lane


Re: 27 second plan times

From:
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> The SearchCatCache here is the one in get_attavgwidth called to estimate the
>> relation width.
>
> Hmm, that information is supposed to be cached ... could you provide the
> test case?

Cached meaning it's stored in the cache once it's looked up, to avoid future
lookups? In that case it will still mean 200k lookups, since there really are
that many attributes.

Or cached meaning we prime the cache in one batch operation? It may well be
avoiding I/O, but there is a suspiciously large number of calls to
SearchCatCache itself.
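
To make the two readings concrete, here is a minimal, hypothetical C sketch
(purely an illustration, nothing like the real catcache code): lazy
per-attribute caching still pays one catalog probe per distinct attribute,
roughly 200k of them here, whereas batch priming pays a single pass up front
and nothing afterwards.

#include <stdio.h>
#include <string.h>

#define NATTS 200000            /* roughly 2,000 children x ~100 columns */

static int  catalog[NATTS];     /* pretend system catalog: width per attribute */
static int  cache[NATTS];       /* -1 = not cached yet */
static long probes = 0;         /* individual catalog probes */
static long batch_scans = 0;    /* whole-catalog passes */

/* Reading 1: cache on first use.  Every distinct attribute still costs one
 * probe the first time it is asked for, so a ~200k-attribute plan still
 * performs ~200k probes even with a perfectly working cache. */
static int
get_width_lazy(int attid)
{
    if (cache[attid] < 0)
    {
        probes++;
        cache[attid] = catalog[attid];
    }
    return cache[attid];
}

/* Reading 2: prime the cache in one batch pass and answer every later
 * request from memory, with no further catalog access at all. */
static void
prime_cache(void)
{
    batch_scans++;
    for (int attid = 0; attid < NATTS; attid++)
        cache[attid] = catalog[attid];
}

int
main(void)
{
    for (int i = 0; i < NATTS; i++)
        catalog[i] = 4;

    memset(cache, -1, sizeof(cache));   /* -1 is all-one bytes, so memset works */
    for (int i = 0; i < NATTS; i++)
        (void) get_width_lazy(i);
    printf("lazy:   %ld probes, %ld batch scans\n", probes, batch_scans);

    probes = 0;
    memset(cache, -1, sizeof(cache));
    prime_cache();
    printf("primed: %ld probes, %ld batch scans\n", probes, batch_scans);
    return 0;
}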

I'll send the pg_dump to pgsql-patches, but it's much larger than really
necessary to reproduce the effect. The planning time was already through the
roof before I added data and extra constraints. The key seems to be the wide
tables, though; planning 2,000 child tables is fast if they only have a couple
of columns.

>> The first thing that comes to mind is that we're doing the
>> constraint_exclusion code *after* estimating the width of the relations we're
>> going to exclude. If we push the constraint exclusion up a few lines the
>> planning time goes down the 1.7s.
>
> ... and probably breaks a few things; are you sure there are no needed
> side effects of the skipped code?

Well, that's why I asked whether there was a reason we had to do it this way.
I haven't checked yet exactly what those other pieces of code are doing, but
the constraint exclusion doesn't depend on much else; I can't see anything
related to stats or indexes affecting it. The regression tests all pass, but
that doesn't prove anything.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: 27 second plan times

From:
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Gregory Stark <stark@enterprisedb.com> writes:
>> The SearchCatCache here is the one in get_attavgwidth called to estimate the
>> relation width.
>
> Hmm, that information is supposed to be cached ... could you provide the
> test case?

This case caused my machine to thrash and swap terribly, with the backend
requiring over 1.3GB of RAM to handle the explain.

What's really strange is that I got warnings about checkpoints happening too
frequently. How could an explain of a select cause checkpoints?! There are no
log entries for autovacuum, but that's the only thing I can think of.

I'll dive back into gprof tomorrow, but for now it's time for sleep.


postgres=# copy (select 'create table t (' union all select 'a'||generate_series(1,1599)||' bool,' union all select 'a1600 bool);') to '/tmp/t';
COPY 1601
postgres=# \i /tmp/t
CREATE TABLE
postgres=# copy (select 'create table t'||generate_series(1,1000)||' () inherits (t);') to '/tmp/u';
COPY 1000
Time: 5.002 ms
postgres=# 
postgres=# \i /tmp/u
postgres=# explain select * from t;
                               QUERY PLAN                               
------------------------------------------------------------------------
 Result  (cost=0.00..10510.50 rows=50050 width=1600)
   ->  Append  (cost=0.00..10510.50 rows=50050 width=1600)
         ->  Seq Scan on t  (cost=0.00..10.50 rows=50 width=1600)
         ->  Seq Scan on t1 t  (cost=0.00..10.50 rows=50 width=1600)
         ->  Seq Scan on t2 t  (cost=0.00..10.50 rows=50 width=1600)
...
         ->  Seq Scan on t999 t  (cost=0.00..10.50 rows=50 width=1600)
         ->  Seq Scan on t1000 t  (cost=0.00..10.50 rows=50 width=1600)
(1003 rows)

Time: 889727.115 ms


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: 27 second plan times

From:
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> Following up on some complaints we've had about extremely large plan times for
> large partitioned tables I've been doing some profiling.

>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls   s/call   s/call  name    
>  62.58    203.76   203.76 21,474,937     0.00     0.00  SearchCatCache
>   9.01    233.08    29.32  1,923,789     0.00     0.00  list_nth_cell
>   5.34    250.46    17.38  2386465     0.00     0.00  pgstat_initstats
>   2.80    259.57     9.11 235691652     0.00     0.00  AllocSetAlloc
>   1.95    265.91     6.34 219852840     0.00     0.00  nocachegetattr
>   1.19    269.78     3.87 256569078     0.00     0.00  FunctionCall2

Is this supposed to be for a single run of the query?  Cause the numbers
I get are a lot different:
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 35.24     15.23    15.23   353301     0.00     0.00  SearchCatCache
 30.08     28.23    13.00   422469     0.00     0.00  list_nth_cell
 21.03     37.32     9.09   642488     0.00     0.00  pgstat_initstats
  1.39     37.92     0.60  4281770     0.00     0.00  hash_search_with_hash_value
  1.23     38.45     0.53  5130941     0.00     0.00  AllocSetAlloc
  1.16     38.95     0.50  4154885     0.00     0.00  _bt_compare

(This is CVS HEAD as of today, which might or might not make a difference.)
        regards, tom lane


Re: 27 second plan times

From:
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> I think there's still a problem here with some kind of n^2 behaviour for
> appends of very wide tables but I haven't quite nailed it yet. In any case is
> there any reason not to make the following small change to move the constraint
> exclusion ahead of the size estimates and index checks and save ourselves
> potentially a lot of work?

Applied along with some other hacking to reduce the costs of the
lower-level functions that this example shows to be inefficient.
They'd still be slow in large queries, whether CE applies or not.
        regards, tom lane


Re: 27 second plan times

From:
Gregory Stark
Date:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> Applied along with some other hacking to reduce the costs of the
> lower-level functions that this example shows to be inefficient.
> They'd still be slow in large queries, whether CE applies or not.

BIG difference. The case that caused swapping and took almost 15 minutes to
plan is now down to 2.5s. The profile still looks a bit odd, but I can't argue
with the results.


stark@oxford:/var/tmp/db$ gprof /usr/local/pgsql/bin/postgres gmon.out
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 24.36      2.46     2.46   418517     0.00     0.00  SearchCatCache
  7.33      3.20     0.74  2564235     0.00     0.00  hash_any
  6.34      3.84     0.64  4283964     0.00     0.00  hash_search_with_hash_value
  4.36      4.28     0.44   216316     0.00     0.00  list_nth_cell
  3.96      4.68     0.40  6535943     0.00     0.00  AllocSetAlloc
  3.37      5.02     0.34  4165664     0.00     0.00  _bt_compare
  2.67      5.29     0.27  2266696     0.00     0.00  MemoryContextAllocZeroAligned

...
                0.01    0.03    2000/424529      get_namespace_name [164]
                0.01    0.03    2001/424529      pg_class_aclmask [167]
                0.01    0.03    2001/424529      get_rel_name [163]
                0.01    0.03    2002/424529      has_subclass [165]
                1.21    2.69  204102/424529      get_attavgwidth [37]
                1.21    2.69  204308/424529      TupleDescInitEntry [36]

[632]    0.0    0.00    0.00  418517         SearchSysCache <cycle 9> [632]
                              418517             SearchCatCache <cycle 9> [15]
 



--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: 27 second plan times

From:
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>
>> Applied along with some other hacking to reduce the costs of the
>> lower-level functions that this example shows to be inefficient.
>> They'd still be slow in large queries, whether CE applies or not.
>
> BIG difference. The case that caused swapping and took almost 15m to plan is
> now down to 2.5s. The profile still looks a bit odd but  I can't argue with
> the results.
>
> stark@oxford:/var/tmp/db$ gprof /usr/local/pgsql/bin/postgres gmon.out
> Flat profile:
>
> Each sample counts as 0.01 seconds.
>   %   cumulative   self              self     total           
>  time   seconds   seconds    calls   s/call   s/call  name    
>  24.36      2.46     2.46   418517     0.00     0.00  SearchCatCache
>   7.33      3.20     0.74  2564235     0.00     0.00  hash_any
>   6.34      3.84     0.64  4283964     0.00     0.00  hash_search_with_hash_value
>   4.36      4.28     0.44   216316     0.00     0.00  list_nth_cell
>   3.96      4.68     0.40  6535943     0.00     0.00  AllocSetAlloc
>   3.37      5.02     0.34  4165664     0.00     0.00  _bt_compare
>   2.67      5.29     0.27  2266696     0.00     0.00  MemoryContextAllocZeroAligned

For what it's worth, if I defeat the "same column position" optimization,
SearchCatCache and list_nth_cell pop back to the top. I had a version that
kept the col_mappings and translated_vars lists in arrays to fix the n^2
list_nth_cell behaviour. But the SearchCatCache is something else.
 53.77     87.36    87.36   4813018     0.00     0.00  SearchCatCache
 11.29    105.71    18.35   1610810     0.00     0.00  list_nth_cell
  3.09    110.72     5.01  17738640     0.00     0.00  hash_any
  2.50    114.78     4.05  30612499     0.00     0.00  hash_search_with_hash_value
  2.16    118.29     3.51  54588745     0.00     0.00  AllocSetAlloc


--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com



Re: 27 second plan times

From:
Tom Lane
Date:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Applied along with some other hacking to reduce the costs of the
>> lower-level functions that this example shows to be inefficient.
>> They'd still be slow in large queries, whether CE applies or not.

> BIG difference. The case that caused swapping and took almost 15m to plan is
> now down to 2.5s. The profile still looks a bit odd but  I can't argue with
> the results.

I'm still feeling a bit annoyed with the behavior of the stats machinery
(pgstat_initstats and related macros).  Yesterday I fixed it so that
pgstat_initstats doesn't perform redundant searches of the tabstat
arrays, but there's still an issue, which is that any rel that's
heap_opened or index_opened within a transaction is going to get a
tabstat entry, whether any events are subsequently counted or not.
In typical scenarios I don't think this is a big deal, but in examples
such as yours we're going to be sending a whole lot of all-zero tabstat
messages; there'll be one for every heap or index that the planner even
momentarily considered.  That means more UDP traffic and more work for
the stats collector.  gprof won't show the resulting overhead since
it doesn't know anything about kernel-level overhead or activity in the
stats collector.  (Hm, might be able to measure it in oprofile
though...)

We could fix this by not doing pgstat_initstats at heap_open time,
but postponing it until something more interesting happens.  The trouble
is that that'd add at least a small amount of overhead at the places
where "something more interesting" is counted, since the pgstat macros
would have to check validity of the tabstat pointer.  The added overhead
should be only about one extra comparison, but maybe that's enough to
make someone object?
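
A rough, hypothetical C sketch of that lazy-initialization idea (names and
structure invented for illustration; this is not the actual pgstat code): the
counting macro pays one extra NULL comparison, and a rel that is merely
opened, never counted, never acquires a tabstat entry at all.

#include <stdio.h>

typedef struct TableStatus
{
    unsigned int relid;
    long         tuples_returned;
} TableStatus;

typedef struct Relation
{
    unsigned int relid;
    TableStatus *pgstat_info;       /* NULL until an event is counted */
} Relation;

static TableStatus tabstat[64];     /* stand-in for the backend's tabstat array */
static int         tabstat_used = 0;

/* Find or create the tabstat entry for a relation. */
static TableStatus *
get_tabstat_entry(unsigned int relid)
{
    for (int i = 0; i < tabstat_used; i++)
        if (tabstat[i].relid == relid)
            return &tabstat[i];
    tabstat[tabstat_used].relid = relid;
    tabstat[tabstat_used].tuples_returned = 0;
    return &tabstat[tabstat_used++];
}

/* Lazy counting macro: one extra comparison per counted event, but a rel
 * that is only heap_opened and closed never gets an entry, so no all-zero
 * message needs to be sent for it at transaction end. */
#define count_tuple_returned(rel) \
    do { \
        if ((rel)->pgstat_info == NULL) \
            (rel)->pgstat_info = get_tabstat_entry((rel)->relid); \
        (rel)->pgstat_info->tuples_returned++; \
    } while (0)

int
main(void)
{
    Relation scanned  = {1001, NULL};   /* actually read by the executor */
    Relation excluded = {1002, NULL};   /* only opened while planning */

    count_tuple_returned(&scanned);
    count_tuple_returned(&scanned);
    (void) excluded;                    /* never counted, so no entry */

    printf("tabstat entries to transmit: %d\n", tabstat_used);  /* prints 1 */
    return 0;
}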
        regards, tom lane


Re: 27 second plan times

From:
Tom Lane
Date:
I wrote:
> I'm still feeling a bit annoyed with the behavior of the stats machinery
> (pgstat_initstats and related macros).
> ... That means more UDP traffic and more work for
> the stats collector.  gprof won't show the resulting overhead since
> it doesn't know anything about kernel-level overhead or activity in the
> stats collector.  (Hm, might be able to measure it in oprofile
> though...)

I spent some time with oprofile and couldn't muster any evidence
suggesting that this was accounting for more than 1% or so of total
runtime.  So for the moment I'll leave it alone.  It might eventually
become worth worrying about, though.

The thing I saw as being more interesting than the tabstat overhead
is that the planner does RelationGetNumberOfBlocks (ie, an lseek kernel
call) on every child rel ... and would do it on every index of every
child rel, too, if the example had any.  It would be nice if we could
postpone all of the work of get_relation_info() until after we've
checked for constraint exclusion.  This looks like it'd require some
nontrivial refactoring though --- in particular, I'm not sure how we'd
handle the total_table_pages calculation.
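
As a rough illustration of why that matters (simplified; not the actual
smgr/md code, which also handles 1GB segments), the per-relation block-count
check amounts to an lseek to the end of the relation's file, i.e. one kernel
call per child rel every time the query is planned.

#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

#define BLCKSZ 8192             /* PostgreSQL's default block size */

/* For a single-segment relation file, the size check boils down to seeking
 * to the end and dividing by the block size. */
static long
relation_blocks(const char *path)
{
    int   fd = open(path, O_RDONLY);
    off_t end;

    if (fd < 0)
        return -1;
    end = lseek(fd, 0, SEEK_END);   /* the kernel call in question */
    close(fd);
    return (end < 0) ? -1 : (long) (end / BLCKSZ);
}

int
main(int argc, char **argv)
{
    /* With 2,000 children, planning issues on the order of 2,000 of these
     * lseeks per query, including for children that constraint exclusion
     * would discard -- hence the interest in postponing get_relation_info(). */
    for (int i = 1; i < argc; i++)
        printf("%s: %ld blocks\n", argv[i], relation_blocks(argv[i]));
    return 0;
}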

If you're satisfied with the performance as it now stands, let's leave
this for the maybe-do-someday list.
        regards, tom lane