Discussion: Performance improvement hints


Performance improvement hints

From: devik@cdi.cz
Date: Tue, 12 Sep 2000 14:30:09 +0200
Hello,
I have encountered problems with a particular query, so I
started to dig into the sources. I have two questions/ideas:

1) When the optimizer computes the size of a join it does it as
card(R1)*card(R2)*selectivity. Suppose two relations (R1 & R2)
of 10000 rows each. If you (inner) join them using the equality
operator, the result is at most 10000 rows (min(card(R1),card(R2))),
but pg estimates 1 000 000 (it uses selectivity 0.01 here).
Then when computing cost this results in a very high cost for
hash and loop joins BUT a low (right) cost for merge join. That is
because for hash and loop joins the cost is estimated from the row
count, while merge join uses another estimation (as it always knows
that a merge join can be done only on an equality op). It then leads
to merge join being used for the majority of joins. Unfortunately I
found that in the majority of such cases the hash join is two times
faster. I tested it using SET ENABLE_MERGEJOIN=OFF ... (see the
sketch below). What about changing the cost estimator to use
min(card(R1),card(R2)) instead of card(R1)*card(R2)*selectivity
in the case where R1 and R2 are joined using equality? It should
lead to much faster plans for the majority of SQLs.
 

2) Suppose we have a relation R1(id,name) and an index ix(id,name)
on it. In a query like select id,name from R1 order by id the
planner will prefer to do seqscan+sort (although R1 is rather big).
And yes, it is really faster than using an indexscan. But an
indexscan always looks up the actual record in the heap even if
all needed attributes are contained in the index. Oracle and even
MSSQL read attributes directly from the index without looking up
the actual tuple in the heap. Is there any need to do it in such
an inefficient way?
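
A minimal SQL sketch of the experiment from point 1 (the tables r1
and r2 and their contents are hypothetical, not from the original
report):

    -- Two relations of 10000 rows each, joined on an equality condition.
    CREATE TABLE r1 (id integer, payload text);
    CREATE TABLE r2 (id integer, payload text);
    -- ... populate r1 and r2 with 10000 rows each ...

    -- Without column statistics the planner estimates the join size as
    -- card(R1) * card(R2) * selectivity = 10000 * 10000 * 0.01 = 1000000 rows.
    EXPLAIN SELECT * FROM r1, r2 WHERE r1.id = r2.id;

    -- Disable merge join to compare the plans and timings, as in the report.
    SET ENABLE_MERGEJOIN = OFF;
    EXPLAIN SELECT * FROM r1, r2 WHERE r1.id = r2.id;
    SET ENABLE_MERGEJOIN = ON;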
 

regards, devik



Re: Performance improvement hints

From: Jules Bean
On Tue, Sep 12, 2000 at 02:30:09PM +0200, devik@cdi.cz wrote:
> Hello,
> I have encountered problems with a particular query, so I
> started to dig into the sources. I have two questions/ideas:
> 
> 1) when optimizer computes size of join it does it as
>    card(R1)*card(R2)*selectivity. Suppose two relations
>    (R1 & R2) each 10000 rows. If you (inner) join them
>    using equality operator, the result is at most 10000
>    rows (min(card(R1),card(R2))). But pg estimates
>    1 000 000 (uses selectivity 0.01 here).

Surely not.  If you inner join, you can get many more than min
(card(R1),card(R2)), if you are joining over non-unique keys (a common
case).  For example:

employee:

name    job

Jon    Programmer
George    Programmer

job_drinks

job        drink

Programmer    Jolt
Programmer    Coffee
Programmer    Beer


The natural (inner) join between these two tables results in 6 rows,
card(R1)*card(R2).  

I think you mean that min(card(R1),card(R2)) is the correct figure
when the join is done over a unique key in both tables.  
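
A runnable version of this toy example (the data is Jules's; the
exact DDL and column types are guessed):

    CREATE TABLE employee (name text, job text);
    CREATE TABLE job_drinks (job text, drink text);

    INSERT INTO employee VALUES ('Jon', 'Programmer');
    INSERT INTO employee VALUES ('George', 'Programmer');
    INSERT INTO job_drinks VALUES ('Programmer', 'Jolt');
    INSERT INTO job_drinks VALUES ('Programmer', 'Coffee');
    INSERT INTO job_drinks VALUES ('Programmer', 'Beer');

    -- Every employee row matches every job_drinks row, so the join
    -- returns 2 * 3 = 6 rows: the card(R1)*card(R2) worst case,
    -- not min(card(R1),card(R2)).
    SELECT e.name, j.drink FROM employee e, job_drinks j WHERE e.job = j.job;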


> 
> 2) suppose we have relation R1(id,name) and index ix(id,name)
>    on it. In query like: select id,name from R1 order by id
>    planner will prefer to do seqscan+sort (although the R1
>    is rather big). And yes it is really faster than using
>    indexscan.
>    But an indexscan always looks up the actual record in the heap even if
>    all needed attributes are contained in the index.
>    Oracle and even MSSQL read attributes directly from the index
>    without looking up the actual tuple in the heap.
>    Is there any need to do it in such an inefficient way?


I believe this is because PgSQL doesn't remove entries from the index
at DELETE time, thus it is always necessary to refer to the main table
in case the entry found in the index has since been deleted.
Presumably this speeds up deletes (but I find this behaviour surprising
too).

Jules


Re: Performance improvement hints

From: Tom Lane
devik@cdi.cz writes:
> 1) when optimizer computes size of join it does it as
>    card(R1)*card(R2)*selectivity. Suppose two relations
>    (R1 & R2) each 10000 rows. If you (inner) join them
>    using equality operator, the result is at most 10000
>    rows (min(card(R1),card(R2))). But pg estimates
>    1 000 000 (uses selectivity 0.01 here).

0.01 is only the default estimate used if you've never done a VACUUM
ANALYZE (hint hint).  After ANALYZE, there are column statistics
available that will give a better estimate.

Note that your claim above is incorrect unless you are joining on unique
columns, anyway.  In the extreme case, if all the entries have the same
value in the column being used, you'd get card(R1)*card(R2) output rows.
I'm unwilling to make the system assume column uniqueness without
evidence to back it up, because the consequences of assuming an overly
small output row count are a lot worse than assuming an overly large
one.

One form of evidence that the planner should take into account here is
the existence of a UNIQUE index on a column --- if one has been created,
we could assume column uniqueness even if no VACUUM ANALYZE has ever
been done on the table.  This is on the to-do list, but I don't feel
it's real high priority.  The planner's results are pretty much going
to suck in the absence of VACUUM ANALYZE stats anyway :-(
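
In SQL, the two kinds of evidence discussed above look like this
(hypothetical table and column names):

    -- Gather column statistics so the planner can replace the 0.01
    -- default with an estimate based on the actual data.
    VACUUM ANALYZE r1;

    -- A unique index is the other possible evidence of column
    -- uniqueness the planner could consult, per the to-do item above.
    CREATE UNIQUE INDEX r1_id_key ON r1 (id);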

>    Then when computing cost it will result in very high
>    cost in case of hash and loop join BUT low (right)
>    cost for merge join. It is because for hash and loop
>    joins the cost is estimated from row count but merge
>    join uses another estimation (as it always knows that
>    merge join can be done only on equality op).
>    It then leads to use of mergejoin for majority of joins.
>    Unfortunately I found that in majority of such cases
>    the hash join is two times faster.

The mergejoin cost calculation may be overly optimistic.  The cost
estimates certainly need further work.

>    But an indexscan always looks up the actual record in the heap even if
>    all needed attributes are contained in the index.
>    Oracle and even MSSQL read attributes directly from the index
>    without looking up the actual tuple in the heap.

Doesn't work in Postgres' storage management scheme --- the heap
tuple must be consulted to see if it's still valid.
        regards, tom lane


Re: Performance improvement hints

From: devik@cdi.cz
> >    using equality operator, the result is at most 10000
> >    rows (min(card(R1),card(R2))). But pg estimates
> >    1 000 000 (uses selectivity 0.01 here).
> 
> Surely not.  If you inner join, you can get many more than min
> (card(R1),card(R2)), if you are joining over non-unique keys (a common
> case).  For example:

Ohh yes, you are right. Also I found that my main problem
was not running VACUUM ANALYZE, so I had an invalid value
for the column's disbursion.
I ran it and now hash join estimates the row count correctly.

> >    But an indexscan always looks up the actual record in the heap even if
> >    all needed attributes are contained in the index.
> >    Oracle and even MSSQL read attributes directly from the index
> >    without looking up the actual tuple in the heap.
> 
> I believe this is because PgSQL doesn't remove entries from the index
> at DELETE time, thus it is always necessary to refer to the main table
> in case the entry found in the index has since been deleted.

Hmm, it looks reasonable. But it still should not prevent us
from retrieving data directly from the index where possible. What
do you think? The only problem I can imagine is if it has
something to do with locking ..

regards, devik



Re: Performance improvement hints + measurement

From: devik@cdi.cz
> >    But an indexscan always looks up the actual record in the heap even if
> >    all needed attributes are contained in the index.
> >    Oracle and even MSSQL read attributes directly from the index
> >    without looking up the actual tuple in the heap.
> 
> Doesn't work in Postgres' storage management scheme --- the heap
> tuple must be consulted to see if it's still valid.

yes, I just spent another day looking into the sources and
it seems that we need the xmin, xmax stuff.
What do you think about this approach:

1) add all validity & tx fields from the heap tuple into the
index tuple too

2) when generating a plan for an index scan, try to determine
whether we can satisfy the target list using only data from index
tuples; if yes, then compute the cost without accounting for
random heap page reads - it will lead to a much lower cost

3) whenever you update/delete a heap tuple's tx fields, update
them also in the indices (you don't have to delete them from the
index)

It will cost more storage space and slightly more work when
updating indices, but should give excellent performance when an
index is used.

Measurements:
I have a table with about 2 mil. rows declared as
bigrel(namex varchar(50), cnt integer, sale datetime).
I regularly need to run this query against it:
select nazev,sum(cnt) from bigrel group by name;
It took (in seconds):

Server\Index   YES      NO
pg7.01 linux   58       264
MSSQL7 winnt   17       22

I compared on the same machine (PII/375, 128MB RAM) using
WINNT under VMWARE and native linux 2.2. pg was
vacuum analyzed.
Why is pgsql so slow? The mssql plan without an index uses
hash aggregation but pg sorts the whole relation.
With an index, in pg there is a big overhead of heap tuple
reading - mssql uses data directly from the scanned index.
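
A hedged reconstruction of the setup (the post mixes namex, nazev
and name for the same column and does not give the index definition,
so this sketch standardizes on nazev and guesses an index on the
grouping column):

    -- "datetime" was a valid type name in PostgreSQL 7.0; later
    -- releases call it timestamp.
    CREATE TABLE bigrel (nazev varchar(50), cnt integer, sale datetime);

    -- Hypothetical index behind the "Index YES" column of the
    -- measurements.
    CREATE INDEX bigrel_nazev_idx ON bigrel (nazev);

    VACUUM ANALYZE bigrel;

    -- The regularly-run query being timed.
    SELECT nazev, sum(cnt) FROM bigrel GROUP BY nazev;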

Also I noticed another problem: when I added
where nazev<'0' it took 110ms on pg when I used
set enable_seqscan=off;.
Without it, the planner still tried to use seqscan+sort,
which took 27s in this case.
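
In other words (same hypothetical schema as the sketch above):

    -- Default plan for the restricted query: seqscan + sort
    -- (about 27 s in the report).
    EXPLAIN SELECT nazev, sum(cnt) FROM bigrel WHERE nazev < '0' GROUP BY nazev;

    -- Disabling seqscan forces the indexscan (about 110 ms in the report).
    SET enable_seqscan = off;
    EXPLAIN SELECT nazev, sum(cnt) FROM bigrel WHERE nazev < '0' GROUP BY nazev;
    SET enable_seqscan = on;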

I'm not sure how complex the proposed changes are. Another
way would be to implement another aggregator like HashAgg
which will use hashing.
But it could be even more complicated, as one has to use a
temp relation to store all the hash buckets ..

Still I think that direct index reads should give us a huge
speed improvement for all indexed queries.
I'm prepared to implement it but I'd like to know your
hints/complaints.

Regards, devik



Re: Performance improvement hints + measurement

From: Tom Lane
devik@cdi.cz writes:
> What do you think about this approach:

> 1) add all validity & tx fields from heap tuple into 
>    index tuple too

Non-starter I'm afraid.  That would mean that whenever we update a
tuple, we'd have to find and update all the index entries that refer to
it.  You'd be taking a tremendous performance hit on all update
operations in the hope of saving time on only a relatively small number
of inquiries.

This has been discussed before (repeatedly, IIRC).  Please peruse the
pghackers archives.

> I regularly need to run this query against it:
> select nazev,sum(cnt) from bigrel group by name;
> With index, in pg there is a big overhead of heap tuple
> reading - mssql uses data directly from scanned index.

How exactly is MSSQL going to do that with only an index on "name"?
You need to have access to the cnt field as well, which wouldn't be
present in an index entry for name.
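
For the index scan alone to answer the query, the index would have
to contain both columns; presumably the MSSQL timing relied on
something like this (hypothetical):

    -- A two-column index carries both the grouping column and the
    -- aggregated column, so a system that can read tuples from the
    -- index alone never needs to visit the heap for this query.
    CREATE INDEX bigrel_nazev_cnt_idx ON bigrel (nazev, cnt);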

> I'm not sure how complex the proposed changes are. Another
> way would be to implement another aggregator like HashAgg
> which will use hashing. 

That would be worth looking at --- we have no such plan type now.

> But it could be even more complicated as one has to use
> temp relation to store all hash buckets ..

You could probably generalize the existing code for hashjoin tables
to support hash aggregation as well.  Now that I think about it, that
sounds like a really cool idea.  Should put it on the TODO list.
        regards, tom lane


Re: Performance improvement hints + measurement

From: Tom Lane
devik@cdi.cz writes:
>> You could probably generalize the existing code for hashjoin tables
>> to support hash aggregation as well.  Now that I think about it, that
>> sounds like a really cool idea.  Should put it on the TODO list.

> Yep. It should be easy. It could be used as part of Hash
> node by extending ExecHash to return all hashed rows and
> adding value{1,2}[nbuckets] to HashJoinTableData.

Actually I think what we want is a hash table indexed by the
grouping-column value(s) and storing the current running aggregate
states for each agg function being computed.  You wouldn't really
need to store any of the original tuples.  You might want to form
the agg states for each entry into a tuple just for convenience of
storage though.

> By the way, what are the "portal" and the "slot" ?

As far as the hash code is concerned, a portal is just a memory
allocation context.  Destroying the portal gets rid of all the
memory allocated therein, without the hassle of finding and freeing
each palloc'd block individually.

As for slots, you are probably thinking of tuple table slots, which
are used to hold the tuples returned by plan nodes.  The input
tuples read by the hash node are stored in a slot that's filled 
by the child Plan node each time it's called.  Similarly, the hash
join node has to return a new tuple in its output slot each time
it's called.  It's a pretty simplistic form of memory management,
but it works fine for plan node output tuples.

If you are interested in working on this idea, you should be looking
at current sources --- both the memory management for hash tables
and the implementation of aggregate state storage have changed
materially since 7.0, so code based on 7.0 would need a lot of work
to be usable.
        regards, tom lane