Re: Hash join on int takes 8..114 seconds

Поиск
Список
Период
Сортировка
От Tomas Vondra
Тема Re: Hash join on int takes 8..114 seconds
Дата
Msg-id 492726AF.8030608@fuzzy.cz
обсуждение исходный текст
Ответ на Re: Hash join on int takes 8..114 seconds  ("Andrus" <kobruleht2@hot.ee>)
Список pgsql-performance
> 2. Run the following commands periodically in this order:
>
> VACUUM FULL;
> vacuum full pg_shdepend;
> CLUSTER rid on (toode);
> CLUSTER dok  on (kuupaev);
> REINDEX DATABASE mydb;
> REINDEX SYSTEM mydb;
> ANALYZE;
>
> Are all those command required or can something leaved out ?

Running CLUSTER after VACUUM FULL is just a waste of time. In my
experience CLUSTER is actually faster in case of such heavily bloated
tables - I think this is caused by the fact that it creates indexes from
the beginning instead of updating them (as VACUUM FULL does).

So CLUSTER actually performs REINDEX, so I'd just run

CLUSTER rid ON rid_pkey;
CLUSTER dok ON dok_pkey;
ANALYZE rid;
ANALYZE dok;

Clustering by other indexes might give better performance, using primary
keys is just a safe guess here. This should improve the performance of
your query and it seems these two tables are the most bloated ones.

I wouldn't do the same maintenance on the other tables now - it's just a
waste of time.

>
>> Several other things to consider:
>>
>> 1) Regarding the toode column - why are you using CHAR(20) when the
>> values
>> are actually shorter? This may significantly increase the amount of space
>> required.
>
> There may be some products whose codes may be up to 20 characters.
> PostgreSQL does not hold trailing spaces in db, so this does *not*
> affect to
> space.

OK, I haven't realized this. You're right.

>> 2) I've noticed the CPU used is Celeron, which may negatively affect the
>> speed of hash computation. I'd try to replace it by something faster -
>> say
>> INTEGER as an artificial primary key of the "toode" table and using it as
>> a FK in other tables.  This might improve the "Bitmap Heap Scan on rid"
>> part, but yes - it's just a minor improvement compared to the "Hash Join"
>> part of the query.
>
> Natural key Toode CHAR(20) is used widely in different queries.
> Replacing it with INT surrogate key requires major application rewrite.
>
> Should I add surrogate index INT columns to toode and rid table and measure
> test query speed in this case?

Test it. Create tables with fake data, and compare the performance with
and without the surrogate keys. Using a simple data type instead of text
  gave me huge performance boost. For example one of my colleagues used
VARCHAR(15) to store IP addresses, and then used them to join tables
(and suffered by the poor perfomance). Replacing them by INET improved
the speed by several orders of magnitude.

>> Materialized views seem like a good idea to me, but maybe I'm not seeing
>> something. What do you mean by "reports are different"? If there is a lot
>> of rows for a given product / day, then creating an aggregated table with
>> (product code / day) as a primary key is quite simple. It may require a
>> lot of disk space, but it'll remove the hash join overhead. But if the
>> queries are very different, then it may be difficult to build such
>> materialized view(s).
>
> log file seems that mostly only those queries are slow:
>
> SELECT ...
>   FROM dok JOIN rid USING (dokumnr)
>   JOIN ProductId USING (ProductId)
>   WHERE rid.ProductId LIKE :p1 || '%' AND dok.SaleDate>=:p2
>
> :p1 and :p2 are parameters different for different queries.
>
> dok contains several years of data. :p2 is usually only few previous months
> or last year ago.
> SELECT column list contains fixed list of known columns from all tables.
>
> How to create index or materialized view to optimize this types of
> queries ?

Well, difficult to answer without detailed information about the queries
you want to run, aggregated values, etc. Materialized views is a world
on it's own, and the best solution depends on (for example):

1) what aggregated values are you interested in (additive values are the
    most primitive ones, while VARIANCE etc. make it difficult)

2) do you need current data, or is it OK that today's data are not
    available till midnight (for example)?

Another thing you have to consider is whether you want to create
materialized view with final or intermediary data and then compute the
final data somehow (for example monthly totals from daily totals).

The most primitive (but often sufficient) solution is recreating the
materialized view periodically (for example every midnight). In your
case it'd mean running something like

CREATE TABLE materialized_view AS SELECT ... your query here ...
GROUP BY productId, saleDate

This gives you daily totals for each product - the clients then can run
another query to compute the final data.

But of course, if you need to maintain 'current' data, you may create a
set of triggers to update the materialized view. Either after each
modification or (more sophisticated) when it's needed.

See for example a great presentation from this year's PGCon:

http://www.pgcon.org/2008/schedule/events/69.en.html


regards
Tomas

В списке pgsql-performance по дате отправления:

Предыдущее
От: Tomas Vondra
Дата:
Сообщение: Re: Hash join on int takes 8..114 seconds
Следующее
От: "Andrus"
Дата:
Сообщение: Re: Hash join on int takes 8..114 seconds