Discussion: order of nested loop


order of nested loop

From
Joseph Shraibman
Date:
I have two queries that return identical results.  One is a SELECT DISTINCT and the other
is the same query without the DISTINCT.  The explain for the second one makes it seem as
if it would be faster:

Sort  (cost=73560.75..73560.75 rows=3 width=604)
vs.
Sort  (cost=67246.81..67246.82 rows=3 width=604)

However in reality the first query runs much faster.  The problem is this nested loop:
not distinct:
                      ->  Subquery Scan "*SELECT* 2"  (cost=0.00..30602.38 rows=25 width=604)
                            ->  Limit  (cost=0.00..30602.38 rows=25 width=604)
                                  ->  Nested Loop  (cost=0.00..5499145.64 rows=4492 width=604)
   ================ vs. =================================
distinct:
                                        ->  Sort  (cost=36903.81..36915.04 rows=4492
width=604)
                                              Sort Key: <snip>
                                              ->  Nested Loop  (cost=0.00..36631.27
rows=4492 width=604)

In the query with the DISTINCT one table is scanned first; in the other the order is
reversed.  This makes all the difference in the query, because in my test case there is
only one matching entry in one of the tables, and that is always the table that determines
the number of rows in the result (and except in pathological cases it will always be much
lower than the number returned from the first table).  So how can I tell postgres which
table to scan in the loop first?
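(For reference: in PostgreSQL of this vintage, writing the FROM clause with explicit JOIN syntax constrains the join order the planner will consider, which is one way to pin which table is scanned first.  A sketch using the table and column names visible in the plans below; the real select list is much longer:)

```sql
-- Hypothetical abbreviated form of the query: older PostgreSQL treats
-- the written JOIN order as fixed, so usertable is scanned first here.
SELECT DISTINCT u.userkey, d.username
FROM usertable u
JOIN directory d ON d.userkey = u.userkey
WHERE u.podkey = 20
  AND u.status = 2 AND NOT u.banned
  AND (d.status = 2 OR d.status = 5)
ORDER BY lower(d.username)
LIMIT 25;
```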


Re: order of nested loop

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> I have two queries that return identical results.  One is a SELECT
> DISTINCT and the other is the same query without the DISTINCT.

Could we see the actual queries?  And the complete EXPLAIN ANALYZE
output?

            regards, tom lane

Re: order of nested loop

From
Tom Lane
Date:
Joseph Shraibman <joseph@xtenit.com> writes:
> Tom Lane wrote:
>> Joseph Shraibman <jks@selectacast.net> writes:
>>> I have two queries that return identical results.  One is a SELECT
>>> DISTINCT and the other is the same query without the DISTINCT.
>>
>> Could we see the actual queries?  And the complete EXPLAIN ANALYZE
>> output?

> (private email, because I don't want to sanitize this huge thing for the list)

Okay, but people send bigger things than this to the list all the time...

> [fast case]
> (SELECT distinct
> u.userkey,u.adminper,u.addper,u.approvper,u.ownerper,u.subper,u.banned,u.status,u.config,u.rules, ...
> ORDER BY lower(d.username) LIMIT 25)

> ->  Limit  (cost=36903.81..36916.31 rows=25 width=604) (actual time=10.92..10.93 rows=1 loops=1)
>       ->  Unique  (cost=36903.81..37128.43 rows=449 width=604) (actual time=10.92..10.92 rows=1 loops=1)
>             ->  Sort  (cost=36903.81..36915.04 rows=4492 width=604) (actual time=10.91..10.91 rows=1 loops=1)
>                   Sort Key: lower(d.username), u.userkey, u.adminper, u.addper, u.approvper, u.ownerper, u.subper, u.banned, u.status, u.config, u.rules, (subplan), CASE WHEN ((u.rules IS NULL) OR (u.rules = ''::text)) THEN ''::text ELSE 'r'::text END, d.username, d.realname, d.firstname, (subplan), d.status, 2
>                   ->  Nested Loop  (cost=0.00..36631.27 rows=4492 width=604) (actual time=10.65..10.67 rows=1 loops=1)
>                         ->  Index Scan using usertable_podkey_key on usertable u  (cost=0.00..16003.53 rows=5446 width=555) (actual time=0.04..0.05 rows=1 loops=1)
>                               Index Cond: (podkey = 20)
>                               Filter: ((status = 2) AND (NOT banned))
>                         ->  Index Scan using directory_pkey on directory d  (cost=0.00..3.78 rows=1 width=49) (actual time=0.02..0.03 rows=1 loops=1)
>                               Index Cond: (d.userkey = "outer".userkey)
>                               Filter: ((status = 2) OR (status = 5))

> [slow case]
> (SELECT
> u.userkey,u.adminper,u.addper,u.approvper,u.ownerper,u.subper,u.banned,u.status,u.config,u.rules, ...
> ORDER BY lower(d.username) LIMIT 25)

> ->  Limit  (cost=0.00..30602.38 rows=25 width=604) (actual time=74810.10..102624.62 rows=1 loops=1)
>       ->  Nested Loop  (cost=0.00..5499145.64 rows=4492 width=604) (actual time=74810.10..102624.61 rows=1 loops=1)
>             ->  Index Scan using directory_lower_username_key on directory d  (cost=0.00..2380577.42 rows=525568 width=49) (actual time=15.61..79588.09 rows=589718 loops=1)
>                   Filter: ((status = 2) OR (status = 5))
>             ->  Index Scan using usertable_pkey on usertable u  (cost=0.00..5.92 rows=1 width=555) (actual time=0.04..0.04 rows=0 loops=589718)
>                   Index Cond: (("outer".userkey = u.userkey) AND (u.podkey = 20))
>                   Filter: ((status = 2) AND (NOT banned))

As near as I can tell, the failure of estimation is not in the slow case
--- the planner correctly estimates that this plan is expensive.
Rather, it's in the fast case, because the planner mistakenly thinks
that that is also expensive.  The critical error is right here:

> ->  Index Scan using usertable_podkey_key on usertable u  (cost=0.00..16003.53 rows=5446 width=555) (actual time=0.04..0.05 rows=1 loops=1)
>       Index Cond: (podkey = 20)
>       Filter: ((status = 2) AND (NOT banned))

That scan is estimated to yield 5446 rows and it only yields 1.  Do you
have any idea why the estimate is so far off?  (My guess is that podkey,
status and banned are correlated to a large extent, but you tell us.)

            regards, tom lane

Re: order of nested loop

From
Jean-Luc Lachance
Date:
Tom,


I am curious.  Why perform a sort first?  Would it not be more
efficient to create an in-memory index on the fly?  It seems to me that
it would consume fewer resources, especially if there are many duplicates.
Also, caching the last key lookup would save time if the records are
"bunched".  What am I missing here?

JL


Tom Lane wrote:
>
> [quote of the previous message, including both EXPLAIN ANALYZE plans,
> trimmed]
>
> That scan is estimated to yield 5446 rows and it only yields 1.  Do you
> have any idea why the estimate is so far off?  (My guess is that podkey,
> status and banned are correlated to a large extent, but you tell us.)
>
>                         regards, tom lane

Re: order of nested loop

From
Tom Lane
Date:
Jean-Luc Lachance <jllachan@nsd.ca> writes:
> I am curious.  Why perform a sort first?  Would it not be more
> efficient to create an in-memory index on the fly?

Why would you think that?  Creating an index involves a sort ...
there's no free lunch there that I can see.

            regards, tom lane

Re: order of nested loop

From
Jean-Luc Lachance
Date:
Tom Lane wrote:
>
> Jean-Luc Lachance <jllachan@nsd.ca> writes:
> > I am curious.  Why perform a sort first?  Would it not be more
> > efficient to create an in-memory index on the fly?
>
> Why would you think that?  Creating an index involves a sort ...
> there's no free lunch there that I can see.
>
>                         regards, tom lane


There is only a small difference, but DISTINCT implies a unique index,
which means there is no need to sort duplicate records.

Re: order of nested loop

From
Joseph Shraibman
Date:
Tom Lane wrote:

>
> That scan is estimated to yield 5446 rows and it only yields 1.  Do you
> have any idea why the estimate is so far off?  (My guess is that podkey,
> status and banned are correlated to a large extent, but you tell us.)

The relationship between d and u is like this: there is a row in d for
each user, and for each pod they are a member of there is an entry in u.
So when I'm querying u for members of a particular pod, I'm filtering
by the status of the d entry and the status of the u entry.  There are
a lot of entries in d, but only a few of them will be members of a
particular pod.  Thus it would make sense to first get the entries in u,
filter them, then filter by their status in d.  There will be an entry
in d for each entry in u, but not vice versa.

The planner shows this for the scan on d:
(cost=0.00..2380577.42 rows=525568 width=49)

Maybe it thinks it will reach the limit of 25 before it actually does,
which is why it is willing to try something so expensive?


Re: order of nested loop

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> The planner shows this for the scan on d:
> (cost=0.00..2380577.42 rows=525568 width=49)
> Maybe it thinks it will reach the limit of 25 before it actually does,
> which is why it is willing to try something so expensive?

Yeah, exactly, it's extrapolating that it will only actually have to
process a relatively small part of that scan.  Which would be true if
it were getting 4492 rows out of the join per estimate, and not just 1
per reality.  This is the same estimation failure as in the other plan,
I think, but it's a lot simpler to see in the other plan.

> ... Thus it would make sense to first get the entries in u,
> filter them, then filter by their status in d.

Right, but the problem is to know how many u entries will get through
the filter.  When that estimate is off by a factor of ~5000, it's no
surprise that the planner is likely to choose the wrong plan.  If you
could cut that estimation error by even a factor of 2, it would have
made the right choices here.

So we're back to my previous question: why is that estimate so far off?
You might try comparing
    explain select * from usertable where podkey = 20;
    select count(*) from usertable where podkey = 20;
to see whether the estimate is failing on the basic podkey=20 part.
If that doesn't seem too far off, add in the status = 2 and/or
(NOT banned) parts to see what confuses it.  I'd like to see the
pg_stats rows for these three columns, too.

BTW, you have done an ANALYZE recently on usertable, I hope.

            regards, tom lane

Re: order of nested loop

From
Joseph Shraibman
Date:
Tom Lane wrote:
> Joseph Shraibman <jks@selectacast.net> writes:
>
>>The planner shows this for the scan on d:
>>(cost=0.00..2380577.42 rows=525568 width=49)
>>Maybe it thinks it will reach the limit of 25 before it actually does,
>>which is why it is willing to try something so expensive?
>
>
> Yeah, exactly, it's extrapolating that it will only actually have to
> process a relatively small part of that scan.  Which would be true if
> it were getting 4492 rows out of the join per estimate, and not just 1
> per reality.  This is the same estimation failure as in the other plan,
> I think, but it's a lot simpler to see in the other plan.
>
>
>>... Thus it would make sense to first get the entries in u,
>>filter them, then filter by their status in d.
>
>
> Right, but the problem is to know how many u entries will get through
> the filter.  When that estimate is off by a factor of ~5000, it's no
> surprise that the planner is likely to choose the wrong plan.  If you
> could cut that estimation error by even a factor of 2, it would have
> made the right choices here.
>
> So we're back to my previous question: why is that estimate so far off?
> You might try comparing
>     explain select * from usertable where podkey = 20;
>     select count(*) from usertable where podkey = 20;

=> explain select * from usertable where podkey = 20;
                                           QUERY PLAN
-----------------------------------------------------------------------------------------------
 Index Scan using usertable_podkey_key on usertable  (cost=0.00..16019.99 rows=5923 width=625)
   Index Cond: (podkey = 20)
(2 rows)

=> select count(*) from usertable where podkey = 20;
 count
-------
     3
(1 row)

=> select * from pg_stats where tablename = 'usertable' and attname
in ('podkey','status','banned');

 schemaname | tablename | attname | null_frac | avg_width | n_distinct | most_common_vals                                | most_common_freqs                                                             | histogram_bounds                            | correlation
------------+-----------+---------+-----------+-----------+------------+-------------------------------------------------+-------------------------------------------------------------------------------+---------------------------------------------+-------------
 public     | usertable | podkey  |         0 |         4 |         66 | {<actual values deleted; 20 isn't one of them>} | {0.208,0.156,0.112,0.0696667,0.0306667,0.028,0.0273333,0.025,0.0243333,0.023} | {10,90,137,140,197,207,246,264,267,269,300} |     0.53816
 public     | usertable | status  |         0 |         2 |          4 | {2,4,1,3}                                       | {0.938,0.0496667,0.0103333,0.002}                                             |                                             |    0.840237
 public     | usertable | banned  |         0 |         1 |          2 | {f,t}                                           | {0.982,0.018}                                                                 |                                             |      0.9339
(3 rows)

=> select count(*) from usertable;
  count
---------
 1121190
(1 row)


> to see whether the estimate is failing on the basic podkey=20 part.
> If that doesn't seem too far off, add in the status = 2 and/or
> (NOT banned) parts to see what confuses it.  I'd like to see the
> pg_stats rows for these three columns, too.
>


> BTW, you have done an ANALYZE recently on usertable, I hope.

Yeah, every week by cron job.



Re: order of nested loop

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> => explain select * from usertable where podkey = 20;

>   Index Scan using usertable_podkey_key on usertable  (cost=0.00..16019.99 rows=5923 width=625)

> => select count(*) from usertable where podkey = 20;
>   count
> -------
>       3
> (1 row)

Well, there's our problem :-(

I would suggest bumping up the statistics target for usertable.podkey
(see ALTER TABLE SET STATISTICS).  Since it's only an int4 column you
could make the target 100 (instead of the default 10) without much
cost.  That should give substantially finer-grain detail and hopefully
bring this estimate down out of the stratosphere.  Re-analyze the table
and see what it gets you.
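(Concretely, the suggestion amounts to the following; the table and column names are from this thread, and 100 is the target Tom suggests above:)

```sql
-- Raise the per-column statistics target for podkey, then refresh the
-- statistics so the planner sees the finer-grained histogram.
ALTER TABLE usertable ALTER COLUMN podkey SET STATISTICS 100;
ANALYZE usertable;

-- Re-check whether the row estimate has come down:
EXPLAIN SELECT * FROM usertable WHERE podkey = 20;
```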

            regards, tom lane

Re: order of nested loop

From
"Jim C. Nasby"
Date:
On Tue, Jun 17, 2003 at 11:53:05AM -0400, Jean-Luc Lachance wrote:
> Tom Lane wrote:
> >
> > Jean-Luc Lachance <jllachan@nsd.ca> writes:
> > > I am curious.  Why perform a sort first?  Would it not be more
> > > efficient to create an in-memory index on the fly?
> >
> > Why would you think that?  Creating an index involves a sort ...
> > there's no free lunch there that I can see.
> >
> >                         regards, tom lane
>
>
> There is only a small difference, but DISTINCT implies a unique index,
> which means there is no need to sort duplicate records.

Also (and not specific to this example), an index doesn't need to sort
entire tuples. If the tuples are wide enough, it would be faster to
build an index, then use that index to access the tuples, especially if
you're going to read through the data more than once (the tuples might
have to be hopelessly large to make a single scan effective).

On a related note; since temporary tables are only visible to a single
connection, do they have full MVCC info in them, or can it be bypassed?
If it's not there you'd probably need some other means to allow for
transactions, but the performance gain might be well worth it.
--
Jim C. Nasby (aka Decibel!)                    jim@nasby.net
Member: Triangle Fraternity, Sports Car Club of America
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: order of nested loop

From
Joseph Shraibman
Date:
Tom Lane wrote:

> I would suggest bumping up the statistics target for usertable.podkey
> (see ALTER TABLE SET STATISTICS).  Since it's only an int4 column you
> could make the target 100 (instead of the default 10) without much
> cost.  That should give substantially finer-grain detail and hopefully
> bring this estimate down out of the stratosphere.  Re-analyze the table
> and see what it gets you.
>
Thanks, that seemed to have helped.  Now I have this other query:

                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=46167.67..81194.20 rows=175858 width=13) (actual time=8807.10..16097.85 rows=249551 loops=1)
   Hash Cond: ("outer".userkey = "inner".userkey)
   ->  Seq Scan on u  (cost=0.00..25886.79 rows=175858 width=7) (actual time=0.10..3329.93 rows=249551 loops=1)
         Filter: (podkey = 260)
   ->  Hash  (cost=20167.55..20167.55 rows=637355 width=6) (actual time=8806.36..8806.36 rows=0 loops=1)
         ->  Seq Scan on d  (cost=0.00..20167.55 rows=637355 width=6) (actual time=28.13..7317.19 rows=638045 loops=1)
 Total runtime: 16926.00 msec
(7 rows)


How do I read that?  Is it creating a hash out of the data in d, then going through u
doing a join?


Re: order of nested loop

From
Tom Lane
Date:
Joseph Shraibman <jks@selectacast.net> writes:
> How do I read that?  Is it creating a hash out of the data in d, then
> going through u doing a join?

Yeah.  Given the numbers of rows involved, the plan seems pretty
reasonable --- I doubt you can do a lot better within the context
you're showing here.  To make it faster you'll have to find a way
to not need to look at all the rows.

            regards, tom lane

Re: order of nested loop

From
Tom Lane
Date:
Joseph Shraibman <joseph@xtenit.com> writes:
> Well there is no reason for it to look at all the rows in d, since it
> has a filter on u that should produce many fewer rows than there are
> in d.

What filter?  The scan is producing 250 thousand rows, man!  You expect
it to join 250k rows in no time flat?

You can try forcing merge or nestloop joins with the enable_foo
switches, but I suspect you'll find that for the problem as posed,
this is the best plan available.  To make it go faster you will need a
way to eliminate a large percentage of one or both tables before the
join happens.
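(A sketch of the enable_foo experiment Tom describes, using the hash join from the plan above; these are ordinary per-session settings, and the abbreviated query is a placeholder for the real one:)

```sql
-- Turn off the plan shape the planner chose, re-run EXPLAIN ANALYZE,
-- and compare actual times; then restore the default.
SET enable_hashjoin = off;   -- or enable_nestloop / enable_mergejoin
EXPLAIN ANALYZE
SELECT u.userkey FROM u JOIN d ON d.userkey = u.userkey
WHERE u.podkey = 260;        -- hypothetical abbreviated query
SET enable_hashjoin = on;
```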

            regards, tom lane

Re: order of nested loop

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> On a related note; since temporary tables are only visible to a single
> connection, do they have full MVCC info in them, or can it be bypassed?

You can't really bypass it.  You still need the transaction number so
you can tell if a tuple was created by a committed, aborted, or the
current transaction, and you still need the command number for the same
reasons as usual.

We do (in 7.3) bypass WAL logging of changes in temp tables, and we also
use non-shared buffers for them so as to avoid buffer locking overhead.
Not sure there's much more win to be had there.  (The local buffer
manager code could stand improvement, but that's a different issue.)

            regards, tom lane

Re: order of nested loop

From
"Jim C. Nasby"
Date:
On Tue, Jun 17, 2003 at 07:29:16PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > On a related note; since temporary tables are only visible to a single
> > connection, do they have full MVCC info in them, or can it be bypassed?
>
> You can't really bypass it.  You still need the transaction number so
> you can tell if a tuple was created by a committed, aborted, or the
> current transaction, and you still need the command number for the same
> reasons as usual.

You do need that info, but I think a local-only temp table means you
don't have to use XID (and maybe CID; I don't really know how that's
used) to store it.  This is because only a single transaction at a time
can reference the table.  As part of committing an update, you can either
set a simple flag in the old tuples, or you might even be able to just
delete them (though nested transactions would probably break that).  You
would have to keep a list of which tuples are old and new, but that's
probably more efficient than 16/20 bytes per tuple.

OTOH, I've also been wondering if MVCC info could be stored more
efficiently across the board. One idea is to keep a list of valid MVCC
states, and assign int (or maybe even smaller) IDs to every entry in
that list.  Each tuple would then carry only that entry number.  This would
probably not be good for tables that have a lot of single-row
transactions (though there's another advantage that might offset the
overhead), but for tables that only see fairly large transactions it
could be a tremendous gain.

The other advantage to normalizing the MVCC info is it should then be
reasonable to store it in indexes as well as base tuples. This could
dramatically speed up index work, since you don't have to read the base
tuple to see if it's still valid. It would also allow things like index
covering, and count(*) to do only an index scan.

Another alternative to an MVCC ID would be to store the MVCC info
outside the mainline storage with a reference back to the base tuple,
and use some form of RLE. IE:

xmin, cmin, xmax, cmax, xvac, min_tid, max_tid

where min_tid and max_tid define the range of tuples that MVCC info
applies to. I'm sure there's plenty of other ways MVCC info could be
stored without using 16/20 bytes per tuple.

Re: order of nested loop

From
Tom Lane
Date:
"Jim C. Nasby" <jim@nasby.net> writes:
> ... I'm sure there's plenty of other ways MVCC info could be
> stored without using 16/20 bytes per tuple.

I didn't really see a single workable idea there.  Keep in mind that
storage space is only one consideration (and not a real big one given
modern disk-drive sizes).  Ask yourself about atomicity, failure
recovery, and update costs.  RLE encoding of tuple states?  Get real ---
how many rows could get wiped out by a one-bit lossage?  How extensive
are the on-disk changes needed to encode a one-tuple change in state,
and how do you recover if the machine crashes when only some of those
changes are down to disk?  In my opinion PG's on-disk structures are
barely reliable enough now; we don't want to introduce compression
schemes with the potential for large cross-tuple failure modes.

Storing commit state in index entries has been repeatedly proposed
and repeatedly rejected, too.  It converts an atomic operation
(update one word in one page) into a non-atomic, multi-page operation,
which creates lots of performance and reliability problems.  And the
point of an index is to be smaller than the main table --- the more
stuff you cram into an index tuple header, the less the advantage
of having the index.

Criticism in the form of a patch with experimental evidence is welcome,
but I'm not really interested in debating what-if proposals, especially
not ones that are already discussed in the archives.

            regards, tom lane

Re: order of nested loop

From
"Jim C. Nasby"
Date:
On Wed, Jun 18, 2003 at 01:42:02AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <jim@nasby.net> writes:
> > ... I'm sure there's plenty of other ways MVCC info could be
> > stored without using 16/20 bytes per tuple.
>
> I didn't really see a single workable idea there.  Keep in mind that
> storage space is only one consideration (and not a real big one given
> modern disk-drive sizes).  Ask yourself about atomicity, failure

Disk space is cheap; disk bandwidth is insanely expensive, as is memory
bandwidth.

> recovery, and update costs.  RLE encoding of tuple states?  Get real ---
> how many rows could get wiped out by a one-bit lossage?  How extensive
> are the on-disk changes needed to encode a one-tuple change in state,
> and how do you recover if the machine crashes when only some of those
> changes are down to disk?  In my opinion PG's on-disk structures are
> barely reliable enough now; we don't want to introduce compression
> schemes with the potential for large cross-tuple failure modes.

Well, with more efficient MVCC info storage, adding error correction
capability becomes more practical. On the other hand, is it really
pgsql's responsibility to handle errors caused by bad hardware? I
don't think any other database tries to.

> Storing commit state in index entries has been repeatedly proposed
> and repeatedly rejected, too.  It converts an atomic operation
> (update one word in one page) into a non-atomic, multi-page operation,
> which creates lots of performance and reliability problems.  And the
> point of an index is to be smaller than the main table --- the more
> stuff you cram into an index tuple header, the less the advantage
> of having the index.

Well, it doesn't have to be stored in the index itself.  Moving MVCC
information to a separate set of pages from the base tuples would
provide the same ability.

> Criticism in the form of a patch with experimental evidence is welcome,
> but I'm not really interested in debating what-if proposals, especially
> not ones that are already discussed in the archives.

Fair enough, though I hope some others are interested since my C coding
ability is nil.

Can you point me at the archived discussions? Searching for index and
mvcc reveals nothing (actually, searching for 'index' returns nothing,
which surely must be a bug).

Re: order of nested loop

From
Joseph Shraibman
Date:
Tom Lane wrote:
> Joseph Shraibman <jks@selectacast.net> writes:
>
>>How do I read that?  Is it creating a hash out of the data in d, then
>>going through u doing a join?
>
>
> Yeah.  Given the numbers of rows involved, the plan seems pretty
> reasonable --- I doubt you can do a lot better within the context
> you're showing here.  To make it faster you'll have to find a way
> to not need to look at all the rows.
>
>             regards, tom lane
Well there is no reason for it to look at all the rows in d, since it has a filter on u
that should produce many fewer rows than there are in d.  In both the plan and in
reality, the rows returned from u are fewer than the rows from d.