Обсуждение: Simple join optimized badly?

От:
"Craig A. James"
Дата:

I have two tables, SAMPLE and HITLIST that when joined, generate a monsterous sort.

  HITLIST_ROWS has about 48,000 rows
  SAMPLE has about 16 million rows

  The joined column is indexed in SAMPLE
  HITLIST_ROWS is a scratch table which is used a few times then discarded.
  HITLIST_ROWS has no indexes at all

There are two plans below.  The first is before an ANALYZE HITLIST_ROWS, and it's horrible -- it looks to me like it's
sortingthe 16 million rows of the SEARCH table.  Then I run ANALYZE HITLIST_ROWS, and the plan is pretty decent. 

First question: HITLIST_ROWS so small, I don't understand why the lack of ANALYZE should cause SAMPLE's contents to be
sorted.

Second question: Even though ANALYZE brings it down from 26 minutes to 47 seconds, a huge improvement, it still seems
slowto me.  Its going at roughly 1 row per millisecond -- are my expectations too high?  This is a small-ish Dell
computer(Xeon), 4 GB memory, with a four-disk SATA software RAID0 (bandwidth limited to about 130 MB/sec due to PCI
cards). Other joins of a similar size seem much faster. 

It looks like I'll need to do an ANALYZE every time I modify HITLIST_ROWS, which seems like a waste because
HITLIST_ROWSis rarely used more than once or twice before being truncated and rebuilt with new content.  (HITLIST_ROWS
can'tbe an actual temporary table, though, because it's a web application and each access is from a new connection.) 

This is Postgres 8.0.3.  (We're upgrading soon.)

Thanks,
Craig



explain analyze select t.SAMPLE_ID from SAMPLE t, HITLIST_ROWS ph where t.VERSION_ID = ph.ObjectID);
                                                              QUERY PLAN
               

---------------------------------------------------------------------------------------------------------------------------------------
 Merge Join  (cost=4782.35..1063809.82 rows=613226 width=4) (actual time=174.212..1593886.582 rows=176294 loops=1)
   Merge Cond: ("outer".version_id = "inner".objectid)
   ->  Index Scan using i_sample_version_id on sample t  (cost=0.00..1008713.68 rows=16446157 width=8) (actual
time=0.111..1571911.208rows=16446157 loops=1) 
   ->  Sort  (cost=4782.35..4910.39 rows=51216 width=4) (actual time=173.669..389.496 rows=176329 loops=1)
         Sort Key: ph.objectid
         ->  Seq Scan on hitlist_rows_378593 ph  (cost=0.00..776.16 rows=51216 width=4) (actual time=0.015..90.059
rows=48834loops=1) 
 Total runtime: 1594093.725 ms
(7 rows)

chmoogle2=> analyze HITLIST_ROWS;
ANALYZE
chmoogle2=> explain analyze select t.SAMPLE_ID from SAMPLE t, HITLIST_ROWS ph where t.VERSION_ID = ph.ObjectID;
                                                              QUERY PLAN
               

---------------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=874.43..457976.83 rows=584705 width=4) (actual time=302.792..47796.719 rows=176294 loops=1)
   Hash Cond: ("outer".version_id = "inner".objectid)
   ->  Seq Scan on sample t  (cost=0.00..369024.57 rows=16446157 width=8) (actual time=46.344..26752.343 rows=16446157
loops=1)
   ->  Hash  (cost=752.34..752.34 rows=48834 width=4) (actual time=149.548..149.548 rows=0 loops=1)
         ->  Seq Scan on hitlist_rows_378593 ph  (cost=0.00..752.34 rows=48834 width=4) (actual time=0.048..80.721
rows=48834loops=1) 
 Total runtime: 47988.572 ms
(6 rows)

От:
Tom Lane
Дата:

"Craig A. James" <> writes:
> There are two plans below.  The first is before an ANALYZE HITLIST_ROWS, and it's horrible -- it looks to me like
it'ssorting the 16 million rows of the SEARCH table.  Then I run ANALYZE HITLIST_ROWS, and the plan is pretty decent. 

It would be interesting to look at the before-ANALYZE cost estimate for
the hash join, which you could get by setting enable_mergejoin off (you
might have to turn off enable_nestloop too).  I recall though that
there's a fudge factor in costsize.c that penalizes hashing on a column
that no statistics are available for.  The reason for this is the
possibility that the column has only a small number of distinct values,
which would make a hash join very inefficient (in the worst case all
the values might end up in the same hash bucket, making it no better
than a nestloop).  Once you've done ANALYZE it plugs in a real estimate
instead, and evidently the cost estimate drops enough to make hashjoin
the winner.

You might be able to persuade it to use a hashjoin anyway by increasing
work_mem enough, but on the whole my advice is to do the ANALYZE after
you load up the temp table.  The planner really can't be expected to be
very intelligent when it has no stats.

            regards, tom lane

От:
"Denis Lussier"
Дата:

Wouldn't PG supporting simple optmizer hints get around this kinda
problem?   Seems to me that at least one customer posting per week
would be solved via the use of simple hints.

If the community is interested...  EnterpriseDB has added support for
a few different simple types of hints (optimize for speed, optimize
for first rows, use particular indexes) for our upcoming 8.2 version.
 We are glad to submit them into the community process if there is any
chance they will eventually be accepted for 8.3.

I don't think there is an ANSI standrd for hints, but, that doesn't
mean they are not occosaionally extrenmely useful.  All hints are
effectively harmless/helpful suggestions,  the planner is free to
ignore them if they are not feasible.

--Denis Lussier
  Founder
  http://www.enterprisedb.com

On 10/7/06, Tom Lane <> wrote:
> "Craig A. James" <> writes:
> > There are two plans below.  The first is before an ANALYZE HITLIST_ROWS, and it's horrible -- it looks to me like
it'ssorting the 16 million rows of the SEARCH table.  Then I run ANALYZE HITLIST_ROWS, and the plan is pretty decent. 
>
> It would be interesting to look at the before-ANALYZE cost estimate for
> the hash join, which you could get by setting enable_mergejoin off (you
> might have to turn off enable_nestloop too).  I recall though that
> there's a fudge factor in costsize.c that penalizes hashing on a column
> that no statistics are available for.  The reason for this is the
> possibility that the column has only a small number of distinct values,
> which would make a hash join very inefficient (in the worst case all
> the values might end up in the same hash bucket, making it no better
> than a nestloop).  Once you've done ANALYZE it plugs in a real estimate
> instead, and evidently the cost estimate drops enough to make hashjoin
> the winner.
>
> You might be able to persuade it to use a hashjoin anyway by increasing
> work_mem enough, but on the whole my advice is to do the ANALYZE after
> you load up the temp table.  The planner really can't be expected to be
> very intelligent when it has no stats.
>
>                        regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: Don't 'kill -9' the postmaster
>

От:
Jim Nasby
Дата:

On Oct 7, 2006, at 8:50 PM, Denis Lussier wrote:
> Wouldn't PG supporting simple optmizer hints get around this kinda
> problem?   Seems to me that at least one customer posting per week
> would be solved via the use of simple hints.
>
> If the community is interested...  EnterpriseDB has added support for
> a few different simple types of hints (optimize for speed, optimize
> for first rows, use particular indexes) for our upcoming 8.2 version.
> We are glad to submit them into the community process if there is any
> chance they will eventually be accepted for 8.3.

+1 (and I'd be voting that way regardless of where my paycheck comes
from) While it's important that we continue to improve the planner,
it's simply not possible to build one that's smart enough to handle
every single situation.
--
Jim C. Nasby, Database Architect                   
512.569.9461 (cell)                         http://jim.nasby.net



От:
Josh Berkus
Дата:

Denis,

> Wouldn't PG supporting simple optmizer hints get around this kinda
> problem?   Seems to me that at least one customer posting per week
> would be solved via the use of simple hints.

... and add 100 other problems.  Hints are used because the DBA thinks that
they are smarter than the optimizer; 99% of the time, they are wrong.
Just try manually optimizing a complex query, you'll see -- with three
join types, several scan types, aggregates, bitmaps, internal and external
sorts, and the ability to collapse subqueries it's significantly more than
a human can figure out accurately.

Given the availability of hints, the newbie DBA will attempt to use them
instead of fixing any of the underlying issues.  Craig's post is a classic
example of that: what he really needs to do is ANALYZE HITLIST_ROWS after
populating it.  If he had the option of hints, and was shortsighted (I'm
not assuming that Craig is shortsighted, but just for the sake of
argument) he'd fix this with a hint and move on ... and then add another
hint when he adds a another query which needs HITLIST_ROWS, and another.
And then he'll find out that some change in his data (the sample table
growing, for example) makes his hints obsolete and he has to go back and
re-tune them all.

And then ... it comes time to upgrade PostgreSQL.  The hints which worked
well in version 8.0 won't necessarily work well in 8.2.  In fact, many of
them may make queries disastrously slow.    Ask any Oracle DBA, they'll
tell you that upgrading hint is a major PITA, and why Oracle is getting
away from Hints and has eliminated the rules-based optimizer.

Now, if you were offering us a patch to auto-populate the statistics as a
table is loaded, I'd be all for that.   But I, personally, would need a
lot of convincing to believe that hints don't do more harm than good.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

От:
"Craig A. James"
Дата:

> ... and add 100 other problems.  Hints are used because the DBA thinks that
> they are smarter than the optimizer; 99% of the time, they are wrong.
> Just try manually optimizing a complex query, you'll see -- with three
> join types, several scan types, aggregates, bitmaps, internal and external
> sorts, and the ability to collapse subqueries it's significantly more than
> a human can figure out accurately.

Sorry, this is just wrong, wrong, wrong.

I've heard this from several PG developers every time hints have come up in my roughly eighteen months as a PG
applicationdeveloper.  And in between every assertion that "the application programmers aren't as smart as the
optimizer",there are a dozen or two examples where posters to this list are told to increase this setting, decrease
thatone, adjust these other two, and the end result is to get the plan that the application programmer -- AND the PG
professionals-- knew was the right plan to start with. 

People are smarter than computers.  Period.

Now I'll agree that the majority, perhaps the great majority, of questions to this group should NOT be solved with
hints. You're absolutely right that in most cases hints are a really bad idea.  People will resort to hints when they
shouldbe learning better ways to craft SQL, and when they should have read the configuration guides. 

But that doesn't alter the fact that many, perhaps most, complicated application will, sooner or later, run into a
showstoppercase where PG just optimizes wrong, and there's not a damned thing the app programmer can do about it. 

My example, discussed previously in this forum, is a classic.  I have a VERY expensive function (it's in the class of
NP-completeproblems, so there is no faster way to do it).  There is no circumstance when my function should be used as
afilter, and no circumstance when it should be done before a join.  But PG has no way of knowing the cost of a
function,and so the optimizer assigns the same cost to every function.  Big disaster. 

The result?  I can't use my function in any WHERE clause that involves any other conditions or joins.  Only by itself.
PGwill occasionally decide to use my function as a filter instead of doing the join or the other WHERE conditions
first,and I'm dead. 

The interesting thing is that PG works pretty well for me on big tables -- it does the join first, then applies my
expensivefunctions.  But with a SMALL (like 50K rows) table, it applies my function first, then does the join.  A
searchthat completes in 1 second on a 5,000,000 row database can take a minute or more on a 50,000 row database. 

Instead, I have to separate the WHERE terms into two SQL statements, and do the join myself.  I do the first half of my
query,suck it all into memory, do the second half, suck it into memory, build a hash table and join the two lists in
memory,then take the joined results and apply my function to it. 

This is not how a relational database should work.  It shouldn't fall over dead just when a table's size SHRINKS beyond
somethreshold that causes the planner to switch to a poor plan. 

Since these tables are all in the same database, adjusting configuration parameters doesn't help me.  And I suppose I
coulduse SET to disable various plans, but how is that any different from a HINT feature? 

Now you might argue that function-cost needs to be added to the optimizer's arsenal of tricks.  And I'd agree with you:
ThatWOULD be a better solution than hints.  But I need my problem solved TODAY, not next year.  Hints can help solve
problemsNOW that can be brought to the PG team's attention later, and in the mean time let me get my application to
work.

Sorry if I seem particularly hot under the collar on this one.  I think you PG designers have created a wonderful
product. It's not the lack of hints that bothers me, it's the "You app developers are dumber than we are" attitude.
We'renot.  Some of us know what we're doing, and we need hints. 

If it is just a matter of resources, that's fine.  I understand that these things take time.  But please don't keep
dismissingthe repeated and serious requests for this feature.  It's important. 

Thanks for listening.
Craig

От:
Tom Lane
Дата:

Josh Berkus <> writes:
> Now, if you were offering us a patch to auto-populate the statistics as a
> table is loaded, I'd be all for that.

Curiously enough, I was just thinking about that after reading Craig's
post.  autovacuum will do this, sort of, if it's turned on --- but its
reaction time is measured in minutes typically so that may not be good
enough.

Another thing we've been beat up about in the past is that loading a
pg_dump script doesn't ANALYZE the data afterward...

            regards, tom lane

От:
Bruce Momjian
Дата:

Jonah H. Harris wrote:
> On Oct 08, 2006 07:05 PM, Josh Berkus <> wrote:
> > Hints are used because the DBA thinks that they are smarter than
> > the optimizer; 99% of the time, they are wrong.
>
> That's a figure which I'm 100% sure cannot be backed up by fact.
>
> > Just try manually optimizing a complex query, you'll see -- with three
> > join types, several scan types, aggregates, bitmaps, [blah blah blah]
> > it's significantly more than a human can figure out accurately.
>
> Let me get this right... the optimizer is written by humans who know and
> can calculate the proper query plan and generate code to do the same;
> yet humans aren't smart enough to optimize the queries themselves? A bit
> of circular reasoning here?

I can do 100! on my computer, but can't do it in my head.

--
  Bruce Momjian   
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

От:
Mark Kirkwood
Дата:

Craig A. James wrote:
>
>
> My example, discussed previously in this forum, is a classic.  I have a
> VERY expensive function (it's in the class of NP-complete problems, so
> there is no faster way to do it).  There is no circumstance when my
> function should be used as a filter, and no circumstance when it should
> be done before a join.  But PG has no way of knowing the cost of a
> function, and so the optimizer assigns the same cost to every function.
> Big disaster.
>
> The result?  I can't use my function in any WHERE clause that involves
> any other conditions or joins.  Only by itself.  PG will occasionally
> decide to use my function as a filter instead of doing the join or the
> other WHERE conditions first, and I'm dead.
>

this is an argument for cost-for-functions rather than hints AFAICS.

It seems to me that if (in addition to the function cost) we come up
with some efficient way of recording cross column statistics we would be
well on the way to silencing *most* of the demands for hints.

We would still be left with some of the really difficult problems - a
metric for "locally correlated" column distributions and a reliable
statistical algorithm for most common value sampling (or a different way
of approaching this). These sound like interesting computer science or
mathematics thesis topics, maybe we could try (again?) to get some
interest at that level?

Cheers

Mark



От:
"Craig A. James"
Дата:

Bruce Momjian wrote:
> I can do 100! on my computer, but can't do it in my head.

A poor example.  100! is a simple repetative calculation, something computers are very good at.  Optimizing an SQL
queryis very difficult, and a completely different class of problem. 

The fact is the PG team has done a remarkable job with the optimizer so far.   I'm usually very happy with its plans.
Buthumans still beat computers at many tasks, and there are unquestionably areas where the PG optimizer is not yet
fullydeveloped. 

When the optimizer reaches its limits, and you have to get your web site running, a HINT can be invaluable.

I said something in a previous version of this topic, which I'll repeat here.  The PG documentation for HINTs should be
FILLEDwith STRONG ADMONITIONS to post the problematic queries here before resorting to hints. 

There will always be fools who abuse hints.  Too bad for them, but don't make the rest of us suffer for their folly.

Craig


От:
"Craig A. James"
Дата:

Mark Kirkwood wrote:
>> The result?  I can't use my function in any WHERE clause that involves
>> any other conditions or joins.  Only by itself.  PG will occasionally
>> decide to use my function as a filter instead of doing the join or the
>> other WHERE conditions first, and I'm dead.
>
> this is an argument for cost-for-functions rather than hints AFAICS.

Perhaps you scanned past what I wrote a couple paragraphs farther down.  I'm going to repeat it because it's the KEY
POINTI'm trying to make: 

Craig James wrote:
> Now you might argue that function-cost needs to be added to the
> optimizer's arsenal of tricks.  And I'd agree with you: That WOULD be a
> better solution than hints.  But I need my problem solved TODAY, not
> next year.  Hints can help solve problems NOW that can be brought to the
> PG team's attention later, and in the mean time let me get my
> application to work.

Craig


От:
Chris Browne
Дата:

 (Tom Lane) writes:
> Another thing we've been beat up about in the past is that loading a
> pg_dump script doesn't ANALYZE the data afterward...

Do I misrecall, or were there not plans (circa 7.4...) to for pg_dump
to have an option to do an ANALYZE at the end?

I seem to remember some dispute as to whether the default should be to
include the ANALYZE, with an option to suppress it, or the opposite...
--
(reverse (concatenate 'string "ofni.sesabatadxunil" "@" "enworbbc"))
http://www3.sympatico.ca/cbbrowne/wp.html
"You can measure a programmer's perspective by noting his attitude on
the continuing vitality of FORTRAN." -- Alan J. Perlis

От:
Mark Kirkwood
Дата:

Craig A. James wrote:

>
> Perhaps you scanned past what I wrote a couple paragraphs farther down.
> I'm going to repeat it because it's the KEY POINT I'm trying to make:
>
> Craig James wrote:
>> Now you might argue that function-cost needs to be added to the
>> optimizer's arsenal of tricks.  And I'd agree with you: That WOULD be
>> a better solution than hints.  But I need my problem solved TODAY, not
>> next year.  Hints can help solve problems NOW that can be brought to
>> the PG team's attention later, and in the mean time let me get my
>> application to work.

True enough - but (aside from the fact that hints might take just as
long to get into the development tree as cost-for-functions might take
to write and put in...) there is a nasty side effect to adding hints -
most of the raw material for optimizer improvement disappears (and hence
optimizer improvement stalls)- why? simply that everyone then hints
everything - welcome to the mess that Oracle are in (and seem to be
trying to get out of recently)!

I understand that it is frustrating to not have the feature you need now
  - but you could perhaps view it as a necessary part of the community
development process - your need is the driver for optimizer improvement,
and it can take time.

Now ISTM that hints "solve" the problem by removing the need any further
optimizer improvement at all - by making *you* the optimizer. This is
bad for those of us in the DSS world, where most ad-hoc tools do not
provide the ability to add hints.

Cheers

Mark




От:
Tom Lane
Дата:

Mark Kirkwood <> writes:
> True enough - but (aside from the fact that hints might take just as
> long to get into the development tree as cost-for-functions might take
> to write and put in...) there is a nasty side effect to adding hints -
> most of the raw material for optimizer improvement disappears (and hence
> optimizer improvement stalls)- why? simply that everyone then hints
> everything - welcome to the mess that Oracle are in (and seem to be
> trying to get out of recently)!

And *that* is exactly the key point here.  Sure, if we had unlimited
manpower we could afford to throw some at developing a hint language
that would be usable and not too likely to break at every PG revision.
But we do not have unlimited manpower.  My opinion is that spending
our development effort on hints will have a poor yield on investment
compared to spending similar effort on making the planner smarter.

Josh's post points out some reasons why it's not that easy to get
long-term benefits from hints --- you could possibly address some of
those problems, but a hint language that responds to those criticisms
won't be trivial to design, implement, or maintain.  See (many) past
discussions for reasons why not.

            regards, tom lane

От:
Josh Berkus
Дата:

Tom,

> Josh's post points out some reasons why it's not that easy to get
> long-term benefits from hints --- you could possibly address some of
> those problems, but a hint language that responds to those criticisms
> won't be trivial to design, implement, or maintain.  See (many) past
> discussions for reasons why not.

Well, why don't we see what EDB can come up with?   If it's not "good enough"
we'll just reject it.

Unfortunately, EDB's solution is likely to be Oracle-based, which is liable to
fall into the trap of "not good enough."

--
Josh Berkus
PostgreSQL @ Sun
San Francisco

От:
Tom Lane
Дата:

Josh Berkus <> writes:
> Unfortunately, EDB's solution is likely to be Oracle-based, which is
> liable to fall into the trap of "not good enough."

I'd be a bit worried about Oracle patents as well...

            regards, tom lane

От:
Chris Browne
Дата:

 ("Craig A. James") writes:
> Mark Kirkwood wrote:
>>> The result?  I can't use my function in any WHERE clause that
>>> involves any other conditions or joins.  Only by itself.  PG will
>>> occasionally decide to use my function as a filter instead of doing
>>> the join or the other WHERE conditions first, and I'm dead.
>> this is an argument for cost-for-functions rather than hints AFAICS.
>
> Perhaps you scanned past what I wrote a couple paragraphs farther
> down.  I'm going to repeat it because it's the KEY POINT I'm trying
> to make:
>
> Craig James wrote:
>> Now you might argue that function-cost needs to be added to the
>> optimizer's arsenal of tricks.  And I'd agree with you: That WOULD
>> be a better solution than hints.  But I need my problem solved
>> TODAY, not next year.  Hints can help solve problems NOW that can be
>> brought to the PG team's attention later, and in the mean time let
>> me get my application to work.

Unfortunately, that "hint language" also needs to mandate a temporal
awareness of when hints were introduced so that it doesn't worsen
things down the road.

e.g. - Suppose you upgrade to 8.4, where the query optimizer becomes
smart enough (perhaps combined with entirely new kinds of scan
strategies) to make certain of your hints obsolete and/or downright
wrong.  Those hints (well, *some* of them) ought to be ignored, right?

The trouble is that the "hint language" will be painfully large and
complex.  Its likely-nonstandard interaction with SQL will make query
parsing worse.

All we really have, at this point, is a vague desire for a "hint
language," as opposed to any clear direction as to what it should look
like, and how it needs to interact with other system components.
That's not nearly enough; there needs to be a clear design.
--
(format nil "~S@~S" "cbbrowne" "cbbrowne.com")
http://cbbrowne.com/info/advocacy.html
'Typos in FINNEGANS WAKE? How could you tell?' -- Kim Stanley Robinson

От:
Scott Marlowe
Дата:

On Sun, 2006-10-08 at 18:05, Josh Berkus wrote:

> Now, if you were offering us a patch to auto-populate the statistics as a
> table is loaded, I'd be all for that.   But I, personally, would need a
> lot of convincing to believe that hints don't do more harm than good.

Actually, I'd much rather have a log option, on by default, that spit
out info messages when the planner made a guess that was off by a factor
of 20 or 50 or so or more on a plan.

I can remember to run stats, but finding slow queries that are slow
because the plan was bad, that's the hard part.

От:
"Jim C. Nasby"
Дата:

On Mon, Oct 09, 2006 at 06:07:29PM +0000, Chris Browne wrote:
>  ("Craig A. James") writes:
> > Mark Kirkwood wrote:
> >>> The result?  I can't use my function in any WHERE clause that
> >>> involves any other conditions or joins.  Only by itself.  PG will
> >>> occasionally decide to use my function as a filter instead of doing
> >>> the join or the other WHERE conditions first, and I'm dead.
> >> this is an argument for cost-for-functions rather than hints AFAICS.
> >
> > Perhaps you scanned past what I wrote a couple paragraphs farther
> > down.  I'm going to repeat it because it's the KEY POINT I'm trying
> > to make:
> >
> > Craig James wrote:
> >> Now you might argue that function-cost needs to be added to the
> >> optimizer's arsenal of tricks.  And I'd agree with you: That WOULD
> >> be a better solution than hints.  But I need my problem solved
> >> TODAY, not next year.  Hints can help solve problems NOW that can be
> >> brought to the PG team's attention later, and in the mean time let
> >> me get my application to work.
>
> Unfortunately, that "hint language" also needs to mandate a temporal
> awareness of when hints were introduced so that it doesn't worsen
> things down the road.
>
> e.g. - Suppose you upgrade to 8.4, where the query optimizer becomes
> smart enough (perhaps combined with entirely new kinds of scan
> strategies) to make certain of your hints obsolete and/or downright
> wrong.  Those hints (well, *some* of them) ought to be ignored, right?

Great, then you pull the hints back out of the application. They're a
last resort anyway; if you have more than a handful of them in your code
you really need to look at what you're doing.

> The trouble is that the "hint language" will be painfully large and
> complex.  Its likely-nonstandard interaction with SQL will make query
> parsing worse.
>
> All we really have, at this point, is a vague desire for a "hint
> language," as opposed to any clear direction as to what it should look
> like, and how it needs to interact with other system components.
> That's not nearly enough; there needs to be a clear design.

I can agree to that, but we'll never get any progress so long as every
time hints are brought up the response is that they're evil and should
never be in the database. I'll also say that a very simple hinting
language (ie: allowing you to specify access method for a table, and
join methods) would go a huge way towards enabling app developers to get
stuff done now while waiting for all these magical optimizer
improvements that have been talked about for years.
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
Tobias Brox
Дата:

[Jim C. Nasby - Mon at 04:18:27PM -0500]
> I can agree to that, but we'll never get any progress so long as every
> time hints are brought up the response is that they're evil and should
> never be in the database. I'll also say that a very simple hinting
> language (ie: allowing you to specify access method for a table, and
> join methods) would go a huge way towards enabling app developers to get
> stuff done now while waiting for all these magical optimizer
> improvements that have been talked about for years.

Just a comment from the side line; can't the rough "set
enable_seqscan=off" be considered as sort of a hint anyway?  There have
been situations where we've actually had to resort to such crud.

Beeing able to i.e. force a particular index is something I really
wouldn't put into the application except for as a very last resort,
_but_ beeing able to force i.e. the use of a particular index in an
interactive 'explain analyze'-query would often be ... if not outright
useful, then at least very interessting.


От:
"Jim C. Nasby"
Дата:

On Mon, Oct 09, 2006 at 11:33:03PM +0200, Tobias Brox wrote:
> [Jim C. Nasby - Mon at 04:18:27PM -0500]
> > I can agree to that, but we'll never get any progress so long as every
> > time hints are brought up the response is that they're evil and should
> > never be in the database. I'll also say that a very simple hinting
> > language (ie: allowing you to specify access method for a table, and
> > join methods) would go a huge way towards enabling app developers to get
> > stuff done now while waiting for all these magical optimizer
> > improvements that have been talked about for years.
>
> Just a comment from the side line; can't the rough "set
> enable_seqscan=off" be considered as sort of a hint anyway?  There have
> been situations where we've actually had to resort to such crud.
>
> Beeing able to i.e. force a particular index is something I really
> wouldn't put into the application except for as a very last resort,
> _but_ beeing able to force i.e. the use of a particular index in an
> interactive 'explain analyze'-query would often be ... if not outright
> useful, then at least very interessting.

One of the big problems with doing set enable_...=off is that there's no
way to embed that into something like a view, so you're almost forced
into putting into the application code itself, which makes matters even
worse. If you could hint this within a query (maybe even on a per-table
level), you could at least encapsulate that into a view.
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
Tom Lane
Дата:

"Jim C. Nasby" <> writes:
> I'll also say that a very simple hinting
> language (ie: allowing you to specify access method for a table, and
> join methods) would go a huge way towards enabling app developers to get
> stuff done now while waiting for all these magical optimizer
> improvements that have been talked about for years.

Basically, the claim that it'll be both easy and useful is what I think
is horsepucky ... let's see a detailed design if you think it's easy.

            regards, tom lane

От:
"Joshua D. Drake"
Дата:

>
> One of the big problems with doing set enable_...=off is that there's no
> way to embed that into something like a view, so you're almost forced
> into putting into the application code itself, which makes matters even
> worse. If you could hint this within a query (maybe even on a per-table
> level), you could at least encapsulate that into a view.

You can easily pass multiple statements within a single exec() or push
it into an SPF.

Joshua D. Drake



--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/



От:
Tom Lane
Дата:

"Jim C. Nasby" <> writes:
> One of the big problems with doing set enable_...=off is that there's no
> way to embed that into something like a view, so you're almost forced
> into putting into the application code itself, which makes matters even
> worse. If you could hint this within a query (maybe even on a per-table
> level), you could at least encapsulate that into a view.

You've almost reinvented one of the points that was made in the last
go-round on the subject of hints, which is that keeping them out of the
application code is an important factor in making them manageable by a
DBA.  Hints stored in a system catalog (and probably having the form of
"make this statistical assumption" rather than specifically "use that
plan") would avoid many of the negatives.

            regards, tom lane

От:
Brian Herlihy
Дата:

PG does support hints actually.. and I used them to solve the last performance
problem I had, rather than waiting n years for the query planner to be
improved.  The problem in question (from an automated query planning point of
view) is the lack of multi-column statistics, leading to the wrong index being
used.

The only thing is, the hints are expressed in an obscure, ad-hoc and
implementation dependant language.

For example, the "Don't use index X" hint (the one I used) can be accessed by
replacing your index with an index on values derived from the actual index,
instead of the values themselves.  Then that index is not available during
normal query planning.

Another example is the "Maybe use index on X and also sort by X" hint, which
you access by adding "ORDER BY X" to your query.  That would have solved my
problem for a simple select, but it didn't help for an update.

Then there's the "Don't use seq scan" hint, which is expressed as "set
enable_seqscan=off".  That can help when it mistakenly chooses seq scan.

And there are many more such hints, which are regularly used by PG users to
work around erroneous query plans.

While writing this email, I had an idea for a FAQ, which would tell PG users
how to access this informal hint language:

Q: The query planner keeps choosing the wrong index.  How do I force it to use
the correct index?

A: Have you analyzed your tables, increased statistics, etc etc etc?  If that
doesn't help, you can change the index to use a value derived from the actual
row values.  Then the index will not be available unless you explicitly use the
derived values in your conditions.

With such a FAQ, us people who use PG in the real world can have our queries
running reliably and efficiently, while work to improve the query planner continues.

От:
"Craig A. James"
Дата:

Brian Herlihy wrote:
> PG does support hints actually..
> The only thing is, the hints are expressed in an obscure, ad-hoc and
> implementation dependant language.
>
> For example, the "Don't use index X" hint (the one I used) can be accessed by
> replacing your index with an index on values derived from the actual index...

And then there's

    select ... from (select ... offset 0)

where the "offset 0" prevents any rewriting between the two levels of query.  This replaces joins and AND clauses where
theplanner makes the wrong choice of join order or filtering.  I grepped my code and found four of these (all
workaroundsfor the same underlying problem). 

Imagine I got run over by a train, and someone was reading my code.  Which would be easier for them to maintain: Code
withweird SQL, or code with sensible, well-written SQL and explicit hints?  Luckily for my (hypothetical, I hope)
successor,I put massive comments in my code explaining the strange SQL. 

The bad applications are ALREADY HERE.  And they're WORSE to maintain than if we had a formal hint language.  The
argumentthat hints lead to poor application is true.  But lack of hints leads to worse applications. 

Craig


От:
"Joshua D. Drake"
Дата:

> Imagine I got run over by a train, and someone was reading my code.
> Which would be easier for them to maintain: Code with weird SQL, or code
> with sensible, well-written SQL and explicit hints?

You forgot the most important option:

Code with appropriate documentation about your weird SQL.

If you document your code, your argument is moot.

Joshua D. Drake



--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/



От:
"Jim C. Nasby"
Дата:

On Mon, Oct 09, 2006 at 06:45:16PM -0400, Tom Lane wrote:
> "Jim C. Nasby" <> writes:
> > One of the big problems with doing set enable_...=off is that there's no
> > way to embed that into something like a view, so you're almost forced
> > into putting into the application code itself, which makes matters even
> > worse. If you could hint this within a query (maybe even on a per-table
> > level), you could at least encapsulate that into a view.
>
> You've almost reinvented one of the points that was made in the last
> go-round on the subject of hints, which is that keeping them out of the
> application code is an important factor in making them manageable by a
> DBA.  Hints stored in a system catalog (and probably having the form of
> "make this statistical assumption" rather than specifically "use that
> plan") would avoid many of the negatives.

Sure, but IIRC no one's figured out what that would actually look like,
while it's not hard to come up with a syntax that allows you to tell the
optimizer "scan index XYZ to access this table". (And if there's real
interest in adding that I'll come up with a proposal.)

I'd rather have the ugly solution sooner rather than the elegant one
later (if ever).
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
"Jim C. Nasby"
Дата:

On Mon, Oct 09, 2006 at 03:41:09PM -0700, Joshua D. Drake wrote:
> >
> > One of the big problems with doing set enable_...=off is that there's no
> > way to embed that into something like a view, so you're almost forced
> > into putting into the application code itself, which makes matters even
> > worse. If you could hint this within a query (maybe even on a per-table
> > level), you could at least encapsulate that into a view.
>
> You can easily pass multiple statements within a single exec() or push
> it into an SPF.

Unless I'm missing something, putting multiple statements in a single
exec means you're messing with the application code. And you can't
update a SRF (also means messing with the application code). Though, I
suppose you could update a view that pulled from an SRF...
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
"Jim C. Nasby"
Дата:

On Mon, Oct 09, 2006 at 08:22:39PM -0700, Joshua D. Drake wrote:
>
> > Imagine I got run over by a train, and someone was reading my code.
> > Which would be easier for them to maintain: Code with weird SQL, or code
> > with sensible, well-written SQL and explicit hints?
>
> You forgot the most important option:
>
> Code with appropriate documentation about your weird SQL.
>
> If you document your code, your argument is moot.

You apparently didn't read the whole email. He said he did document his
code. But his point is still valid: obscure code is bad even with
documentation. Would you put something from the obfuscated C contest
into production with comments describing what it does, or would you just
write the code cleanly to begin with?
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
Tom Lane
Дата:

"Jim C. Nasby" <> writes:
> I'd rather have the ugly solution sooner rather than the elegant one
> later (if ever).

The trouble with that is that we couldn't ever get rid of it, and we'd
be stuck with backward-compatibility concerns with the first (over
simplified) design.  It's important to get it right the first time,
at least for stuff that you know perfectly well is going to end up
embedded in application code.

            regards, tom lane

От:
"Steinar H. Gunderson"
Дата:

On Tue, Oct 10, 2006 at 09:07:03AM -0500, Jim C. Nasby wrote:
> Would you put something from the obfuscated C contest
> into production with comments describing what it does,

If nothing else, it would be a nice practical joke =)

/* Steinar */
--
Homepage: http://www.sesse.net/

От:
"Jim C. Nasby"
Дата:

On Tue, Oct 10, 2006 at 10:14:48AM -0400, Tom Lane wrote:
> "Jim C. Nasby" <> writes:
> > I'd rather have the ugly solution sooner rather than the elegant one
> > later (if ever).
>
> The trouble with that is that we couldn't ever get rid of it, and we'd
> be stuck with backward-compatibility concerns with the first (over
> simplified) design.  It's important to get it right the first time,
> at least for stuff that you know perfectly well is going to end up
> embedded in application code.

We've depricated things before, I'm sure we'll do it again. Yes, it's a
pain, but it's better than not having anything release after release.
And having a formal hint language would at least allow us to eventually
clean up some of these oddball cases, like the OFFSET 0 hack.

I'm also not convinced that even supplimental statistics will be enough
to ensure the planner always does the right thing, so query-level hints
may have to stay (though it'd be great if that wasn't the case).
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
"Joshua D. Drake"
Дата:

Jim C. Nasby wrote:
> On Mon, Oct 09, 2006 at 03:41:09PM -0700, Joshua D. Drake wrote:
>>> One of the big problems with doing set enable_...=off is that there's no
>>> way to embed that into something like a view, so you're almost forced
>>> into putting into the application code itself, which makes matters even
>>> worse. If you could hint this within a query (maybe even on a per-table
>>> level), you could at least encapsulate that into a view.
>> You can easily pass multiple statements within a single exec() or push
>> it into an SPF.
>
> Unless I'm missing something, putting multiple statements in a single
> exec means you're messing with the application code. And you can't
> update a SRF (also means messing with the application code). Though, I
> suppose you could update a view that pulled from an SRF...

I always think of application code as outside the db. I was thinking
more in layers.

Joshua D. Drake



--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/



От:
"Joshua D. Drake"
Дата:

Jim C. Nasby wrote:
> On Mon, Oct 09, 2006 at 08:22:39PM -0700, Joshua D. Drake wrote:
>>> Imagine I got run over by a train, and someone was reading my code.
>>> Which would be easier for them to maintain: Code with weird SQL, or code
>>> with sensible, well-written SQL and explicit hints?
>> You forgot the most important option:
>>
>> Code with appropriate documentation about your weird SQL.
>>
>> If you document your code, your argument is moot.
>
> You apparently didn't read the whole email. He said he did document his
> code. But his point is still valid: obscure code is bad even with
> documentation. Would you put something from the obfuscated C contest
> into production with comments describing what it does, or would you just
> write the code cleanly to begin with?

You are comparing apples to oranges. We aren't talking about an
obfuscated piece of code. We are talking about an SQL statement that
solves a particular problem.

That can easily be documented, and documented with enough verbosity that
it is never a question, except to test and see if the problem exists in
current versions.

Sincerely,

Joshua D. Drake



--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/



От:
"Joshua D. Drake"
Дата:

Steinar H. Gunderson wrote:
> On Tue, Oct 10, 2006 at 09:07:03AM -0500, Jim C. Nasby wrote:
>> Would you put something from the obfuscated C contest
>> into production with comments describing what it does,
>
> If nothing else, it would be a nice practical joke =)

nice isn't the word I would use ;)

Joshua D. Drake

>
> /* Steinar */


--

   === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564 || 24x7/Emergency: +1.800.492.2240
   Providing the most comprehensive  PostgreSQL solutions since 1997
             http://www.commandprompt.com/



От:
Bruno Wolff III
Дата:

On Mon, Oct 09, 2006 at 23:33:03 +0200,
  Tobias Brox <> wrote:
>
> Just a comment from the side line; can't the rough "set
> enable_seqscan=off" be considered as sort of a hint anyway?  There have
> been situations where we've actually had to resort to such crud.

That only works for simple queries. To be generally useful, you want to
be able to hint how to handle each join being done in the query. The
current controlls affect all joins.

От:
Josh Berkus
Дата:

Jim,

> We've depricated things before, I'm sure we'll do it again. Yes, it's a
> pain, but it's better than not having anything release after release.
> And having a formal hint language would at least allow us to eventually
> clean up some of these oddball cases, like the OFFSET 0 hack.
>
> I'm also not convinced that even supplimental statistics will be enough
> to ensure the planner always does the right thing, so query-level hints
> may have to stay (though it'd be great if that wasn't the case).

"stay"?   I don't think that the general developers of PostgreSQL are going
to *accept* anything that stands a significant chance of breaking in one
release.   You have you challange for the EDB development team: come up
with a hinting language which is flexible enough not to do more harm than
good (hint: it's not Oracle's hints).

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

От:
"Jim C. Nasby"
Дата:

On Tue, Oct 10, 2006 at 10:28:29AM -0700, Josh Berkus wrote:
> Jim,
>
> > We've depricated things before, I'm sure we'll do it again. Yes, it's a
> > pain, but it's better than not having anything release after release.
> > And having a formal hint language would at least allow us to eventually
> > clean up some of these oddball cases, like the OFFSET 0 hack.
> >
> > I'm also not convinced that even supplimental statistics will be enough
> > to ensure the planner always does the right thing, so query-level hints
> > may have to stay (though it'd be great if that wasn't the case).
>
> "stay"?   I don't think that the general developers of PostgreSQL are going
> to *accept* anything that stands a significant chance of breaking in one
> release.   You have you challange for the EDB development team: come up
> with a hinting language which is flexible enough not to do more harm than
> good (hint: it's not Oracle's hints).

My point was that I think we'll always have a need for fine-grained (ie:
table and join level) hints, even if we do get the ability for users to
over-ride the statistics system. It's just not possible to come up with
automation that will handle every possible query that can be thrown at a
system. I don't see how that means breaking anything in a given release.
Worst-case, the optimizer might be able to do a better job of something
than hints written for an older version of the database, but that's
going to be true of any planner override we come up with.

BTW, I'm not speaking for EnterpriseDB or it's developers here... query
hints are something I feel we've needed for a long time.
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)

От:
Mark Kirkwood
Дата:

Jim C. Nasby wrote:

> (snippage)... but we'll never get any progress so long as every
> time hints are brought up the response is that they're evil and should
> never be in the database. I'll also say that a very simple hinting
> language (ie: allowing you to specify access method for a table, and
> join methods) would go a huge way towards enabling app developers to get
> stuff done now while waiting for all these magical optimizer
> improvements that have been talked about for years.

It is possibly because some of us feel they are evil :-) (can't speak
for the *real* Pg developers, just my 2c here)

As for optimizer improvements well, yeah we all want those - but the
basic problem (as I think Tom stated) is the developer resources to do
them. As an aside this applies to hints as well - even if we have a
patch to start off with - look at how much time bitmap indexes have been
worked  on to get them ready for release....

Personally I don't agree with the oft stated comment along the lines of
"we will never get the optimizer to the point where it does not need
some form of hinting" as:

1/ we don't know that to be a true statement, and
2/ it is kind of admitting defeat on a very interesting problem, when in
fact a great deal of progress has been made to date, obviously by people
who believe it is possible to build a "start enough" optimizer.

best wishes

Mark


От:
Mark Kirkwood
Дата:

Mark Kirkwood wrote:

> who believe it is possible to build a "start enough" optimizer.
>
That's meant to read "smart enough" optimizer .. sorry.

От:
Brian Herlihy
Дата:

-- tom lane wrote ---------------------------------------------------------
"Jim C. Nasby" <> writes:
> I'd rather have the ugly solution sooner rather than the elegant one
> later (if ever).

The trouble with that is that we couldn't ever get rid of it, and we'd
be stuck with backward-compatibility concerns with the first (over
simplified) design.  It's important to get it right the first time,
at least for stuff that you know perfectly well is going to end up
embedded in application code.

            regards, tom lane
---------------------------------------------------------------------------

I agree that it's important to get it right the first time.  It's also
important that my queries use the right index NOW.  It's no use to me if my
queries run efficiently in the next release when I am running those queries
right now.

Hints would allow me to do that.

What would it take for hints to be added to postgres?  If someone designed a
hint system that was powerful and flexible, and offered to implement it
themselves, would this be sufficient?  This would address the concerns of
having a "bad" hint system, and also the concern of time being better spent on
other things.

I want to know if the other objections to hints, such as hints being left
behind after an improvement to the optimizer, would also be an issue.  I don't
see this objection as significant, as people are already using ad hoc hacks
where they would otherwise use hints.  The other reason I don't accept this
objection is that people who care about performance will review their code
after every DBMS upgrade, and they will read the release notes :)

От:
Tom Lane
Дата:

Brian Herlihy <> writes:
> What would it take for hints to be added to postgres?

A *whole lot* more thought and effort than has been expended on the
subject to date.

Personally I have no use for the idea of "force the planner to do
exactly X given a query of exactly Y".  You don't have exactly Y
today, tomorrow, and the day after (if you do, you don't need a
hint mechanism at all, you need a mysql-style query cache).
IMHO most of the planner mistakes we see that could be fixed via
hinting are really statistical estimation errors, and so the right
level to be fixing them at is hints about how to estimate the number
of rows produced for given conditions.  Mind you that's still a plenty
hard problem, but you could at least hope that a hint of that form
would be useful for more than one query.

            regards, tom lane

От:
Brian Herlihy
Дата:

--- Tom Lane <> wrote:
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y".  You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).

I don't agree here.  I have "exactly Y" running millions of times daily.
There's enough data that the statistics on specific values don't help all that
much, even at the maximum statistics collection level.  By "exactly Y" I mean
the form of the query is identical, and the query plan is identical, since only
the general statistics are being used for most executions of the query.  The
specific values vary, so caching is no help.

In summary, I have a need to run "exactly Y" with query plan "exactly X".
(detail in postscript)

> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.

Do you mean something like "The selectivity of these two columns together is
really X"?  That would solve my specific problem.  And the academic part of me
likes the elegance of that solution.

On the negative side, it means people must learn how the optimizer uses
statistics (which I would never have done if I could have said "Use index X").

> Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.

Yes it would be useful for more than one query.  I agree that it's the "right"
level to hint at, in that it is at a higher level.  Maybe the right level is
not the best level though?  In a business environment, you just want things to
work, you don't want to analyze a problem all the way through and find the
best, most general solution.  As a former academic I understand the two points
of view, and I don't think either is correct or wrong.  Each view has its
place.

Since I work for a business now, my focus is on making quick fixes that keep
the system running smoothly.  Solving problems in the "right" way is not
important.  If the query slows down again later, we will examine the query plan
and do whatever we have to do to fix it.  It's not elegant, but it gives fast
response times to the customers, and that's what matters.


PS The case in question is a table with a 3-column primary key on (A, B, C).
It also has an index on (B, C).  Re-ordering the primary key doesn't help as I
do lookups on A only as well.  When I specify A, B and C (the primary key), the
optimizer chooses the (B, C) index, on the assumption that specifying these two
values will return only 1 row.  But high correlation between B and C leads to
100s of rows being returned, and the query gets very slow.  The quick fix is to
say "Use index (A, B, C)".  The statistics level fix would be to say "B and C
really have high correlation".

От:
"Bucky Jordan"
Дата:

> Brian Herlihy <> writes:
> > What would it take for hints to be added to postgres?
>
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
>
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y".  You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.  Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
>

Do I understand correctly that you're suggesting it might not be a bad
idea to allow users to provide statistics?

Is this along the lines of "I'm loading a big table and touching every
row of data, so I may as well collect some stats along the way" and "I
know my data contains these statistical properties, but the analyzer
wasn't able to figure that out (or maybe can't figure it out efficiently
enough)"?

While it seems like this would require more knowledge from the user
(e.g. more about their data, how the planner works, and how it uses
statistics) this would actually be helpful/required for those who really
care about performance. I guess it's the difference between a tool
advanced users can get long term benefit from, or a quick fix that will
probably come back to bite you. I've been pleased with Postgres'
thoughtful design; recently I've been doing some work with MySQL, and
can't say I feel the same way.

Also, I'm guessing this has already come up at some point, but what
about allowing PG to do some stat collection during queries? If you're
touching a lot of data (such as an import process) wouldn't it be more
efficient (and perhaps more accurate) to collect stats then, rather than
having to re-scan? It would be nice to be able to turn this on/off on a
per query basis, seeing as it could have pretty negative impacts on OLTP
performance...

- Bucky

От:
Heikki Linnakangas
Дата:

Bucky Jordan wrote:
>
> Is this along the lines of "I'm loading a big table and touching every
> row of data, so I may as well collect some stats along the way" and "I
> know my data contains these statistical properties, but the analyzer
> wasn't able to figure that out (or maybe can't figure it out efficiently
> enough)"?
>
> While it seems like this would require more knowledge from the user
> (e.g. more about their data, how the planner works, and how it uses
> statistics) this would actually be helpful/required for those who really
> care about performance. ...

The user would have to know his data, but he wouldn't need to know how
the planner works. While with hints like "use index X", he *does* need
to know how the planner works.

Being able to give hints about statistical properties of relations and
their relationships seems like a good idea to me. And we can later
figure out ways to calculate them automatically.

BTW, in DB2 you can declare a table as volatile, which means that the
cardinality of the table varies greatly. The planner favors index scans
on volatile tables.

--
   Heikki Linnakangas
   EnterpriseDB   http://www.enterprisedb.com

От:
Bruce Momjian
Дата:

Heikki Linnakangas wrote:
> BTW, in DB2 you can declare a table as volatile, which means that the
> cardinality of the table varies greatly. The planner favors index scans
> on volatile tables.

Now that seems like a valuable idea.

--
  Bruce Momjian   
  EnterpriseDB    http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

От:
Mark Lewis
Дата:

Tom,

I'm interested in the problem of cross-column statistics from a
theoretical perspective.  It would be interesting to sit down and try to
reason out a useful solution, or at very least to understand the problem
better so I can anticipate when it might come and eat me.

From my understanding, the main problem is that if PG knows the
selectivity of n conditions C1,C2,...,Cn then it doesn't know whether
the combined selectivity will be C1*C2*...*Cn (conditions are
independent) or max(C1,C2,...,Cn) (conditions are strictly dependent),
or somewhere in the middle. Therefore, row estimates could be orders of
magnitude off.

I suppose a common example would be a table with a serial primary key
column and a timestamp value which is always inserted as
CURRENT_TIMESTAMP, so the two columns are strongly correlated.  If the
planner guesses that 1% of the rows of the table will match pk>1000000,
and 1% of the rows of the table will match timestamp > X, then it would
be nice for it to know that if you specify both "pk>1000000 AND
timestamp>X" that the combined selectivity is still only 1% and not 1% *
1% = 0.01%.

As long as I'm sitting down and reasoning about the problem anyway, are
there any other types of cases you're aware of where some form of cross-
column statistics would be useful?  In the unlikely event that I
actually come up with a brilliant and simple solution, I'd at least like
to make sure that I'm solving the right problem :)

Thanks,
Mark Lewis



On Tue, 2006-10-10 at 22:38 -0400, Tom Lane wrote:
> Brian Herlihy <> writes:
> > What would it take for hints to be added to postgres?
>
> A *whole lot* more thought and effort than has been expended on the
> subject to date.
>
> Personally I have no use for the idea of "force the planner to do
> exactly X given a query of exactly Y".  You don't have exactly Y
> today, tomorrow, and the day after (if you do, you don't need a
> hint mechanism at all, you need a mysql-style query cache).
> IMHO most of the planner mistakes we see that could be fixed via
> hinting are really statistical estimation errors, and so the right
> level to be fixing them at is hints about how to estimate the number
> of rows produced for given conditions.  Mind you that's still a plenty
> hard problem, but you could at least hope that a hint of that form
> would be useful for more than one query.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

От:
"Jim C. Nasby"
Дата:

On Wed, Oct 11, 2006 at 10:27:26AM -0400, Bucky Jordan wrote:
> Also, I'm guessing this has already come up at some point, but what
> about allowing PG to do some stat collection during queries? If you're
> touching a lot of data (such as an import process) wouldn't it be more
> efficient (and perhaps more accurate) to collect stats then, rather than
> having to re-scan? It would be nice to be able to turn this on/off on a
> per query basis, seeing as it could have pretty negative impacts on OLTP
> performance...

I suspect that could be highly useful in data warehouse environments
where you're more likely to have to sequential scan a table. It would be
interesting to have it so that a sequential scan that will run to
completion also collects stats along the way.
--
Jim Nasby                                            
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)