Discussion: slower merge join on sorted data chosen over nested loop

slower merge join on sorted data chosen over nested loop

From: "Kevin Grittner"
Date:
With both 8.1beta2 and a build from this morning's
dev snapshot, this query ran slower than expected:

select count(*)
  from "DbTranRepository" AS "dtr"
  inner join "DbTranLogRecord" AS "dtlr"
    on (    "dtlr"."countyNo"       = "dtr"."countyNo"
        and "dtlr"."tranImageSeqNo" = "dtr"."tranImageSeqNo"
        and "dtlr"."tableName"      = 'Cal'
       )
  where "dtr"."countyNo" = 40
    and "dtr"."timestampValue" between '2005-09-30 00:00:00' and '2005-10-01 00:00:00';

Fresh restore from dump.  ANALYZE run.  No other activity.
Autovacuum was disabled prior to the restore and test.
8GB dual Xeon SuSE 9.3 system hitting RAID 5.


With the OS cache flushed by dd:
Aggregate  (cost=176218.64..176218.65 rows=1 width=0) (actual time=32081.623..32081.626 rows=1 loops=1)
  ->  Merge Join  (cost=175627.39..176218.09 rows=218 width=0) (actual time=31774.573..32071.776 rows=3007 loops=1)
        Merge Cond: ("outer"."?column3?" = "inner"."?column3?")
        ->  Sort  (cost=48913.06..48975.72 rows=25067 width=17) (actual time=5828.984..5954.706 rows=39690 loops=1)
              Sort Key: (dtr."tranImageSeqNo")::numeric
              ->  Bitmap Heap Scan on "DbTranRepository" dtr  (cost=297.07..47081.47 rows=25067 width=17) (actual time=69.056..5560.895 rows=39690 loops=1)
                    Recheck Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
                    ->  Bitmap Index Scan on "DbTranRepository_timestamp"  (cost=0.00..297.07 rows=25067 width=0) (actual time=52.675..52.675 rows=39690 loops=1)
                          Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
        ->  Sort  (cost=126714.34..126945.94 rows=92640 width=17) (actual time=25130.457..25551.291 rows=109605 loops=1)
              Sort Key: (dtlr."tranImageSeqNo")::numeric
              ->  Bitmap Heap Scan on "DbTranLogRecord" dtlr  (cost=1037.84..119071.84 rows=92640 width=17) (actual time=203.528..23588.122 rows=114113 loops=1)
                    Recheck Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
                    ->  Bitmap Index Scan on "DbTranLogRecord_tableName"  (cost=0.00..1037.84 rows=92640 width=0) (actual time=146.412..146.412 rows=114113 loops=1)
                          Index Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
Total runtime: 32091.211 ms


With the OS cache populated by previous run of this plan:
Aggregate  (cost=176218.64..176218.65 rows=1 width=0) (actual time=4510.590..4510.593 rows=1 loops=1)
  ->  Merge Join  (cost=175627.39..176218.09 rows=218 width=0) (actual time=4203.451..4500.785 rows=3007 loops=1)
        Merge Cond: ("outer"."?column3?" = "inner"."?column3?")
        ->  Sort  (cost=48913.06..48975.72 rows=25067 width=17) (actual time=686.388..812.160 rows=39690 loops=1)
              Sort Key: (dtr."tranImageSeqNo")::numeric
              ->  Bitmap Heap Scan on "DbTranRepository" dtr  (cost=297.07..47081.47 rows=25067 width=17) (actual time=30.432..427.463 rows=39690 loops=1)
                    Recheck Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
                    ->  Bitmap Index Scan on "DbTranRepository_timestamp"  (cost=0.00..297.07 rows=25067 width=0) (actual time=19.089..19.089 rows=39690 loops=1)
                          Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
        ->  Sort  (cost=126714.34..126945.94 rows=92640 width=17) (actual time=2705.007..3124.966 rows=109605 loops=1)
              Sort Key: (dtlr."tranImageSeqNo")::numeric
              ->  Bitmap Heap Scan on "DbTranLogRecord" dtlr  (cost=1037.84..119071.84 rows=92640 width=17) (actual time=96.046..1195.302 rows=114113 loops=1)
                    Recheck Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
                    ->  Bitmap Index Scan on "DbTranLogRecord_tableName"  (cost=0.00..1037.84 rows=92640 width=0) (actual time=54.917..54.917 rows=114113 loops=1)
                          Index Cond: ((40 = ("countyNo")::smallint) AND (("tableName")::text = 'Cal'::text))
Total runtime: 4520.130 ms


Without the option to use merge join, it picked a plan which ran in
one-third to one-half the time, depending on caching.


dtr=> set enable_mergejoin = off;


With the OS cache flushed by dd:
Aggregate  (cost=298741.48..298741.49 rows=1 width=0) (actual time=14289.207..14289.210 rows=1 loops=1)
  ->  Nested Loop  (cost=0.00..298740.94 rows=218 width=0) (actual time=143.436..14277.124 rows=3007 loops=1)
        ->  Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr  (cost=0.00..49419.45 rows=25067 width=17) (actual time=33.625..11510.723 rows=39690 loops=1)
              Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
        ->  Index Scan using "DbTranLogRecordPK" on "DbTranLogRecord" dtlr  (cost=0.00..9.93 rows=1 width=17) (actual time=0.062..0.062 rows=0 loops=39690)
              Index Cond: ((40 = (dtlr."countyNo")::smallint) AND ((dtlr."tranImageSeqNo")::numeric = ("outer"."tranImageSeqNo")::numeric))
              Filter: (("tableName")::text = 'Cal'::text)
Total runtime: 14289.305 ms


With the OS cache populated by previous run of this plan:
Aggregate  (cost=298741.48..298741.49 rows=1 width=0) (actual time=1476.593..1476.596 rows=1 loops=1)
  ->  Nested Loop  (cost=0.00..298740.94 rows=218 width=0) (actual time=1.373..1464.720 rows=3007 loops=1)
        ->  Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr  (cost=0.00..49419.45 rows=25067 width=17) (actual time=0.083..412.268 rows=39690 loops=1)
              Index Cond: ((("countyNo")::smallint = 40) AND (("timestampValue")::timestamp with time zone >= '2005-09-30 00:00:00-05'::timestamp with time zone) AND (("timestampValue")::timestamp with time zone <= '2005-10-01 00:00:00-05'::timestamp with time zone))
        ->  Index Scan using "DbTranLogRecordPK" on "DbTranLogRecord" dtlr  (cost=0.00..9.93 rows=1 width=17) (actual time=0.019..0.019 rows=0 loops=39690)
              Index Cond: ((40 = (dtlr."countyNo")::smallint) AND ((dtlr."tranImageSeqNo")::numeric = ("outer"."tranImageSeqNo")::numeric))
              Filter: (("tableName")::text = 'Cal'::text)
Total runtime: 1476.681 ms


listen_addresses = '*'          # what IP interface(s) to listen on;
max_connections = 100                   # note: increasing max_connections costs
shared_buffers = 20000                  # min 16 or max_connections*2, 8KB each
work_mem = 10240                        # min 64, size in KB
max_fsm_pages = 100000                  # min max_fsm_relations*16, 6 bytes each
wal_buffers = 20                        # min 4, 8KB each
effective_cache_size = 917504           # typically 8KB each
random_page_cost = 2                    # units are one sequential page fetch
log_line_prefix = '[%m] %p %q<%u %d %r> '                       # Special values:
stats_start_collector = on
stats_row_level = on
autovacuum = false                      # enable autovacuum subprocess?
autovacuum_naptime = 10         # time between autovacuum runs, in secs
autovacuum_vacuum_threshold = 1 # min # of tuple updates before
autovacuum_analyze_threshold = 1        # min # of tuple updates before
autovacuum_vacuum_scale_factor = 0.2    # fraction of rel size before
autovacuum_analyze_scale_factor = 0.1   # fraction of rel size before
lc_messages = 'C'                       # locale for system error message
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting
sql_inheritance = off

If random_page_cost is reduced to 1.2, index scans are used
instead of bitmap index scans.  That makes little difference for
the run time with this query -- at least when its data is cached.

           Table "public.DbTranRepository"     Column      |          Type          | Modifiers
------------------+------------------------+-----------countyNo         | "CountyNoT"            | not
nulltranImageSeqNo  | "TranImageSeqNoT"      | not nulltimestampValue   | "TimestampT"           | not
nulltransactionImage| "ImageT"               |status           | character(1)           | not nullqueryName        |
"QueryNameT"          |runDuration      | numeric(15,0)          |userId           | "UserIdT"
|functionalArea  | character varying(50)  |sourceRef        | character varying(255) |url              | "URLT"
       |tranImageSize    | numeric(15,0)          | 
Indexes:   "DbTranRepositoryPK" PRIMARY KEY, btree ("countyNo", "tranImageSeqNo")   "DbTranRepository_timestamp" btree
("countyNo","timestampValue")   "DbTranRepository_userId" btree ("countyNo", "userId", "timestampValue") 

        Table "public.DbTranLogRecord"    Column     |       Type        | Modifiers
----------------+-------------------+-----------countyNo       | "CountyNoT"       | not nulltranImageSeqNo |
"TranImageSeqNoT"| not nulllogRecordSeqNo | "LogRecordSeqNoT" | not nulloperation      | "OperationT"      | not
nulltableName     | "TableNameT"      | not null 
Indexes:   "DbTranLogRecordPK" PRIMARY KEY, btree ("countyNo", "tranImageSeqNo", "logRecordSeqNo")
"DbTranLogRecord_tableName"btree ("countyNo", "tableName", operation) 

Need any other info?
-Kevin



Re: slower merge join on sorted data chosen over nested loop

From: Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> With both 8.1beta2 and a build from this morning's
> dev snapshot, this query ran slower than expected:

There's a known issue that the planner tends to overestimate the cost of
inner-index-scan nestloops, because it doesn't allow for the strong
caching effects associated with repeated scans of the same index (in
particular, that all the upper index levels tend to stay in cache).
See the archives for past inconclusive discussions about how to fix
this.

However, something else caught my eye:

>              ->  Bitmap Heap Scan on "DbTranRepository" dtr  (cost=297.07..47081.47 rows=25067 width=17) (actual time=69.056..5560.895 rows=39690 loops=1)

>        ->  Index Scan using "DbTranRepository_timestamp" on "DbTranRepository" dtr  (cost=0.00..49419.45 rows=25067 width=17) (actual time=33.625..11510.723 rows=39690 loops=1)

I don't understand why the second case chose a plain index scan when
there was no need for sorted output; the bitmap scan is faster both
per cost estimate and in reality.  Are you sure you turned off only
enable_mergejoin and not enable_bitmapscan?

Also, when you say "slower than expected", what is setting your
expectation?
        regards, tom lane


Re: slower merge join on sorted data chosen over nested loop

From: Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> Tom Lane <tgl@sss.pgh.pa.us> 10/06/05 9:28 PM >>>
>> There's a known issue that the planner tends to overestimate the cost of
>> inner-index-scan nestloops, because it doesn't allow for the strong
>> caching effects associated with repeated scans of the same index (in
>> particular, that all the upper index levels tend to stay in cache).
>> See the archives for past inconclusive discussions about how to fix
>> this.

> I spent a few hours trying different searches in the archives, and
> found three very interesting threads on the topic.  All were from
> 2003.  Should I keep digging for more recent threads, or would
> these probably represent the current state of the issue?

No, I don't recall that anything much has happened with it.

The place where the rubber hits the road is this passage in
genericcostestimate():
    /*
     * Estimate the number of index pages that will be retrieved.
     *
     * For all currently-supported index types, the first page of the index
     * is a metadata page, and we should figure on fetching that plus a
     * pro-rated fraction of the remaining pages.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        numIndexPages += 1;             /* count the metapage too */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;

    /*
     * Compute the index access cost.
     *
     * Disk cost: our generic assumption is that the index pages will be read
     * sequentially, so they have cost 1.0 each, not random_page_cost.
     */
    *indexTotalCost = numIndexPages;


An important point here: in all the currently supported index types,
that last assumption is ridiculously optimistic.  Except maybe in a
freshly-built btree, logically consecutive index pages won't be
physically adjacent, and so an honest cost accounting would require
charging random_page_cost per page.  However, that's not the direction
we want the estimate to go in :-(

We could do something based on effective_cache_size to estimate the
probability that the index page is already in cache and hence need not
be re-read.  The main trouble with this is estimating what fraction of
the effective_cache_size can fairly be assumed to be populated by each
index --- without some global knowledge about what other indexes are in
heavy use, it's difficult to see how to get there from here.  (OTOH,
one could also level the same charge against the existing uses of
effective_cache_size... maybe we could redefine it as cache available
per query, or something like that, to put part of the problem on the
DBA's shoulders.)
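
To make the shape of that idea concrete, here is a minimal sketch (this is
not planner source; the assumption that the whole effective_cache_size is
available to the one index is exactly the unsolved part described above):

    /*
     * Illustrative sketch only: discount index page fetches by an assumed
     * probability that each page is already cached.
     */
    static double
    sketch_index_disk_cost(double pages_to_fetch,
                           double index_pages,
                           double effective_cache_size)
    {
        double  cached_fraction = effective_cache_size / index_pages;

        if (cached_fraction > 1.0)
            cached_fraction = 1.0;

        /*
         * Pages assumed cached cost nothing; the rest keep the existing
         * "sequential read, cost 1.0 each" assumption.
         */
        return pages_to_fetch * (1.0 - cached_fraction);
    }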

One simple tweak we could make is to not count the index metapage in the
pages-to-be-fetched estimate, which could be justified on the grounds
that the metapage is almost certain to be in cache.  (Note that the
existing estimation technique is already logically wrong for btrees,
because it's considering only the metapage and the leaf page(s) that
need to be visited, and not the internal pages that need to be descended
through to get to the right leaf.  However, again we can say that the
upper internal pages are probably in cache and so it's fair not to count
them as needing to be read.)
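
In code terms the tweak would amount to something like this (a sketch
patterned on the fragment quoted above, not a tested patch):

    /*
     * Estimate index pages fetched, assuming the metapage (and the upper
     * internal pages) are already in cache and so are not charged for.
     */
    if (index->pages > 1 && index->tuples > 0)
    {
        numIndexPages = (numIndexTuples / index->tuples) * (index->pages - 1);
        /* no "+ 1" for the metapage here */
        numIndexPages = ceil(numIndexPages);
    }
    else
        numIndexPages = 1.0;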

Whether this tweak would make the overall behavior better or worse is
really hard to say.  It'd be interesting to look at some test cases
though.
        regards, tom lane


Re: slower merge join on sorted data chosen over nested loop

From: Tom Lane
Date:
"Kevin Grittner" <Kevin.Grittner@wicourts.gov> writes:
> A couple questions occur to me, though.  I'm not clear on why
> ceil is called -- do we need to eliminate the fraction here?

Well, you don't get to read part of a page.  In particular, fetching 1.0
index tuples requires fetching 1.0 pages, not (say) 0.01 page.  We
consistently use ceil() in these sorts of estimates, and I think it's
correct.

> Also, to be clear, I thought someone said that only the leaf level
> was counted -- it looks like the other levels are being factored in,
> but not as heavily as they would be accessed with logical reads,
> because we're taking the fraction of tuples for the index multiplied
> by the total number of index pages except for the metadata page.
> This should bias the cost slightly higher for additional index levels,
> shouldn't it?

True, I overstated a little in claiming that upper pages weren't counted
at all.  However, the current formula drastically undercounts them.
Consider a simple int4 index: this will require 16 bytes per entry
(assuming MAXALIGN 4) so there could be up to 500-some index entries per
page.  At the traditional rule-of-thumb load factor of 2/3rds, that's
say 340 entries per page, which will also be the fanout factor for upper
index pages.  Assume a fully populated 3-level index:1 root page        340 entries340 2nd-level pages    340^2 =
115600entries115600 leaf pages    340^3 = 39304000 entries
 
Total index size: 115942 pages, 39304000 leaf entries.  If we try to
calculate the cost to fetch 1000 index tuples, the current numIndexPages
calculation is

    ceil((1000 / 39304000) * (115942 - 1) + 1)

which is ceil(3.95) or 4 --- including the metapage.  So we're basically
just counting the three leaf pages that the tuples occupy; the root and
second level pages are too small a fraction of the index to affect the
result.  (The correct estimate of course would be six or seven pages
visited: metapage, root, one 2nd-level page, and 3 or 4 leaf pages
depending on boundary effects.)

And actually that example still understates the problem.  You have to
touch one page in each of the upper index levels to descend, regardless
of how many leaf pages you are going to touch in the scan.  Do the math
for the estimate of the cost to read just *one* index entry --- in this
example it comes out to 1.003 which ceil()'s up to 2, ie, the metapage
and the single leaf page.
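
For anyone who wants to check the arithmetic, here is a small standalone
program (illustrative only) that plugs these example numbers into the
current formula:

    #include <math.h>
    #include <stdio.h>

    int
    main(void)
    {
        double  index_pages  = 115942;      /* including the metapage */
        double  index_tuples = 39304000;    /* leaf entries */
        double  est;

        /* current formula, fetching 1000 index tuples */
        est = ceil((1000 / index_tuples) * (index_pages - 1) + 1);
        printf("1000 tuples: %.0f pages\n", est);   /* prints 4 */

        /* same formula, fetching a single index tuple */
        est = ceil((1 / index_tuples) * (index_pages - 1) + 1);
        printf("1 tuple:     %.0f pages\n", est);   /* prints 2 */

        /*
         * Versus the pages actually touched for the 1000-tuple case:
         * metapage + root + one 2nd-level page + 3 or 4 leaf pages = 6-7.
         */
        return 0;
    }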

With wider index entries, the fanout of upper pages decreases and so the
upper pages take a larger fraction of the index --- but the tree depth
increases, too, so more upper pages have to be visited to get to the
first leaf page.  You still end up with a big undercount.

I recall thinking about changing the formula to more accurately count
the number of pages touched; but I desisted when I realized that it
would drastically increase the cost estimates for index searches, and
that's surely the wrong direction to be going in.  We really can't do
that until we have some usable infrastructure to allow estimating the
probability that those pages are already in cache.  In the meantime,
the tweaks under discussion here amount to assuming that the metapage
and all upper pages are always in cache.

The current cost estimate to fetch a single tuple via indexscan is
basically random_page_cost + 2, plus some near-negligible cpu costs.
Not counting the metapage would take that down to random_page_cost + 1.
This would definitely move the goalposts, particularly for people
who run with smaller-than-default random_page_cost, but I'm not sure
if it's enough to solve the problem.

Also, all this is really just a sideshow; I think the main problem for
join estimation is that because we cost an inner-indexscan nestloop as
taking N times the cost of one execution of the inner scan, we fail to
account for caching effects in the table itself as well as the index.
That would cut into the random_page_cost part of the cost estimate as
well as the index cost.  For all the reasons I've cited, it's pretty
hard to justify reducing the estimate for an indexscan standing on its
own --- but in the context of a nestloop join, it's easier to make a
case.
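
As a rough illustration of that difference (a sketch only; the repeat-scan
cost here is a hypothetical input, not something the planner currently
computes):

    /* What we effectively do today: charge N full inner scans. */
    static double
    nestloop_inner_cost_today(double inner_scan_cost, double outer_rows)
    {
        return outer_rows * inner_scan_cost;
    }

    /*
     * The kind of adjustment being argued for: pay full price once, then a
     * cheaper, cache-aware price for each repeat of the inner indexscan.
     */
    static double
    nestloop_inner_cost_discounted(double first_scan_cost,
                                   double repeat_scan_cost,
                                   double outer_rows)
    {
        if (outer_rows < 1.0)
            return 0.0;
        return first_scan_cost + (outer_rows - 1.0) * repeat_scan_cost;
    }
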
        regards, tom lane


Re: slower merge join on sorted data chosen over nested loop

From: "Jim C. Nasby"
Date:
On Mon, Oct 10, 2005 at 09:10:38PM -0400, Tom Lane wrote:
> I recall thinking about changing the formula to more accurately count
> the number of pages touched; but I desisted when I realized that it
> would drastically increase the cost estimates for index searches, and
> that's surely the wrong direction to be going in.  We really can't do
> that until we have some usable infrastructure to allow estimating the
> probability that those pages are already in cache.  In the meantime,
> the tweaks under discussion here amount to assuming that the metapage
> and all upper pages are always in cache.
> 
> The current cost estimate to fetch a single tuple via indexscan is
> basically random_page_cost + 2, plus some near-negligible cpu costs.
> Not counting the metapage would take that down to random_page_cost + 1.
> This would definitely move the goalposts, particularly for people
> who run with smaller-than-default random_page_cost, but I'm not sure
> if it's enough to solve the problem.
> 
> Also, all this is really just a sideshow; I think the main problem for
> join estimation is that because we cost an inner-indexscan nestloop as
> taking N times the cost of one execution of the inner scan, we fail to
> account for caching effects in the table itself as well as the index.
> That would cut into the random_page_cost part of the cost estimate as
> well as the index cost.  For all the reasons I've cited, it's pretty
> hard to justify reducing the estimate for an indexscan standing on its
> own --- but in the context of a nestloop join, it's easier to make a
> case.

One thing I noticed the last time I looked at all of this was that index
correlation seems to be severely mis-weighted in scan calculations.
http://archives.postgresql.org/pgsql-performance/2005-04/msg00669.php
has more info on this.

I suspect that until that issue is addressed other changes to the cost
estimates won't make any useful difference.
-- 
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461