Re: Optimizing queries that use multiple tables and many order by columns

From: Wappler, Robert
Subject: Re: Optimizing queries that use multiple tables and many order by columns
Date:
Msg-id: C8E2DAF0E663A948840B04023E0DE32A02B1F4CC@w2k3server02.de.ophardt.com
In response to: Optimizing queries that use multiple tables and many order by columns  (Joshua Berry <yoberi@gmail.com>)
Responses: Re: Optimizing queries that use multiple tables and many order by columns  (Joshua Berry <yoberi@gmail.com>)
List: pgsql-general
On 2010-08-25, Joshua Berry wrote:

> Hi Group,
>
> I've never really learned how to optimize queries that join
> several tables and have order by clauses that specify columns
> from each table. Is there documentation that could help me
> optimize and have the proper indexes in place? I've read
> through the PG Docs Chapter 11 on Indexes yet still lack the
> needed understanding.
>
> Here's my latest culprit:
>
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode, anl.job, anl.lab
> limit 10;
>

Could you try removing the LIMIT clause? I have seen several times that it
can slow down a query. I haven't tested it here, but if I recall correctly,
an OFFSET 0 clause can also improve the situation.
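For reference, the OFFSET 0 trick is applied by wrapping the join in a subquery: PostgreSQL treats OFFSET 0 as an optimization fence and will not flatten the subquery into the outer plan. A sketch against the tables from the original post (untested, and the resulting plan may or may not be better):

```sql
-- Hypothetical rewrite: OFFSET 0 keeps the planner from pulling the
-- subquery up, which sometimes produces a different plan.
SELECT *
FROM (
    SELECT anl.priority, anl.lab, anl.job, job.datein,
           job.companycode, anl.samplename
    FROM analysis anl JOIN job ON anl.job = job.job
    OFFSET 0
) sub
ORDER BY sub.companycode, sub.job, sub.lab
LIMIT 10;
```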

From an algebraic point of view, I cannot see any obvious inefficiencies.
Others, who know the internals of pg better, might see more.

> Here's the query plan using PG 8.4.4:
> Limit  (cost=21990.24..21990.27 rows=10 width=32)
>   ->  Sort  (cost=21990.24..22437.69 rows=178979 width=32)
>         Sort Key: job.companycode, anl.lab
>         ->  Hash Join  (cost=451.20..18122.57 rows=178979 width=32)
>               Hash Cond: (anl.job = job.job)
>               ->  Seq Scan on analysis anl  (cost=0.00..14091.79 rows=178979 width=23)
>               ->  Hash  (cost=287.20..287.20 rows=13120 width=17)
>                     ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17)
>
>
> If I change the above query to only order by one of the
> tables, I get better results:
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode --, anl.job, anl.lab
> limit 10;
> Limit  (cost=0.00..3.65 rows=10 width=32)
>   ->  Nested Loop  (cost=0.00..65269.13 rows=178979 width=32)
>         ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17)
>         ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23)
>               Index Cond: (anl.job = job.job)
>

Those are only estimated query plans; what does EXPLAIN ANALYZE say? The
plans above do not execute the query, they just make a rough guess about
the costs, and reality might be different. You may also want to run
VACUUM ANALYZE beforehand.
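For example (table and column names taken from the original post; actual timings and row counts will of course differ per installation):

```sql
-- Refresh planner statistics first, then measure the real execution.
VACUUM ANALYZE job;
VACUUM ANALYZE analysis;

EXPLAIN ANALYZE
SELECT anl.priority, anl.lab, anl.job, job.datein,
       job.companycode, anl.samplename
FROM analysis anl JOIN job ON anl.job = job.job
ORDER BY job.companycode, anl.job, anl.lab
LIMIT 10;
```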

> Any idea on how I can improve this? In the past I would tend
> to create a cached copy of the query as a table that would be
> utilized, but I suspect that there is a better way to go
> about this. I'm using a system (Clarion) which heavily uses
> cursors via the ODBC driver (I use the psqlODBC latest
> version) to get a handful of records at a time, so no actual
> LIMITs would be used in the production queries; I've added
> the LIMITs here to try to simulate the performance
> differences that I find when browsing the data while ordering
> by the above columns.
>
>
> Here are the relevant tables and indexes:
>
>
> CREATE TABLE job
> (
>   job bigint NOT NULL, -- Job #
>   companycode character(4), -- Company Code
>   recdbycode character(3), -- Initials of who checked in sample(s)
>   datein date, -- Date sample was received
>   project character varying, -- Project or Site name
>   remarks text, -- Remarks
>   --[CONSTRAINTs etc]
> )
>
> CREATE INDEX job_companycode
>   ON job
>   USING btree
>   (companycode);
> CREATE INDEX job_companycode_job
>   ON samples.job
>   USING btree
>   (companycode, job);
>

Index job_companycode is not used in either plan. It is also redundant:
job_companycode_job has companycode as its leading column, so it can serve
any lookup that job_companycode could.
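If so, the single-column index can simply be dropped, assuming no other query depends on it:

```sql
-- job_companycode_job (companycode, job) already supports any scan
-- that job_companycode (companycode) could serve.
DROP INDEX job_companycode;
```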

> CREATE TABLE analysis
> (
>   lab bigint NOT NULL, -- Lab number
>   job bigint, -- Job number
>   sampletype character varying(5), -- General class of sample
>   priority character(1), -- Priority level
>   samplename character varying, -- Sample or Well name
>   CONSTRAINT rel_joblabk_to_jobk FOREIGN KEY (job)
>       REFERENCES job (job) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE RESTRICT,
>   --[CONSTRAINTs etc]
> )
>
> CREATE INDEX analysis_companycode_job_lab
>   ON analysis
>   USING btree
>   (companycode, job, lab);
>
> CREATE INDEX analysis_job_lab
>   ON analysis
>   USING btree
>   (job, lab);
>

Maybe the planner would choose a Merge Join if there were sorted indexes
for anl.job and job.job. But the speed-up may vary depending on the data.
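One way to test that hypothesis without changing any indexes is to disable hash joins for the current session and compare the plans (a diagnostic experiment only, not something to leave set in production):

```sql
SET enable_hashjoin = off;  -- session-local; steers the planner away from Hash Join
EXPLAIN ANALYZE
SELECT anl.priority, anl.lab, anl.job, job.datein,
       job.companycode, anl.samplename
FROM analysis anl JOIN job ON anl.job = job.job
ORDER BY job.companycode, anl.job, anl.lab
LIMIT 10;
RESET enable_hashjoin;
```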

> Thanks for any insights and tips you can provide!
>
> Kind Regards,
> -Joshua Berry
>
>
>

HTH

--
Robert...


