Discussion: Optimizing queries that use multiple tables and many order by columns


Optimizing queries that use multiple tables and many order by columns

From:
Joshua Berry
Date:
Hi Group,

I've never really learned how to optimize queries that join several tables and have order by clauses that specify columns from each table. Is there documentation that could help me optimize and have the proper indexes in place? I've read through the PG Docs Chapter 11 on Indexes yet still lack the needed understanding.

Here's my latest culprit:

select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn, JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab
limit 10;

Here's the query plan using PG 8.4.4:
Limit  (cost=21990.24..21990.27 rows=10 width=32)
  ->  Sort  (cost=21990.24..22437.69 rows=178979 width=32)
        Sort Key: job.companycode, anl.job, anl.lab
        ->  Hash Join  (cost=451.20..18122.57 rows=178979 width=32)
              Hash Cond: (anl.job = job.job)
              ->  Seq Scan on analysis anl  (cost=0.00..14091.79 rows=178979 width=23)
              ->  Hash  (cost=287.20..287.20 rows=13120 width=17)
                    ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17)


If I change the above query to only order by one of the tables, I get better results:
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn, JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode --, anl.job, anl.lab
limit 10;
Limit  (cost=0.00..3.65 rows=10 width=32)
  ->  Nested Loop  (cost=0.00..65269.13 rows=178979 width=32)
        ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17)
        ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23)
              Index Cond: (anl.job = job.job)


Any idea how I can improve this? In the past I tended to create a cached copy of the query as a table to be used instead, but I suspect that there is a better way to go about this. I'm using a system (Clarion) which heavily uses cursors via the ODBC driver (I use the latest psqlODBC) to fetch a handful of records at a time, so no actual LIMITs are used in the production queries; I've added the LIMITs here to simulate the performance differences I see when browsing the data ordered by the above columns.


Here are the relevant tables and indexes:


CREATE TABLE job
(
  job bigint NOT NULL, -- Job #
  companycode character(4), -- Company Code
  recdbycode character(3), -- Initials of who checked in sample(s)
  datein date, -- Date sample was received
  project character varying, -- Project or Site name
  remarks text, -- Remarks
  --[CONSTRAINTs etc]
)

CREATE INDEX job_companycode
  ON job
  USING btree
  (companycode);

CREATE INDEX job_companycode_job
  ON samples.job
  USING btree
  (companycode, job);


CREATE TABLE analysis
(
  lab bigint NOT NULL, -- Lab number
  job bigint, -- Job number
  sampletype character varying(5), -- General class of sample
  priority character(1), -- Priority level
  samplename character varying, -- Sample or Well name
  CONSTRAINT rel_joblabk_to_jobk FOREIGN KEY (job)
      REFERENCES job (job) MATCH SIMPLE
      ON UPDATE CASCADE ON DELETE RESTRICT,
  --[CONSTRAINTs etc]
)

CREATE INDEX analysis_companycode_job_lab
  ON analysis
  USING btree
  (companycode, job, lab);


CREATE INDEX analysis_job_lab
  ON analysis
  USING btree
  (job, lab);


Thanks for any insights and tips you can provide!

Kind Regards,
-Joshua Berry

Re: Optimizing queries that use multiple tables and many order by columns

From:
"Wappler, Robert"
Date:
On 2010-08-25, Joshua Berry wrote:

> Hi Group,
>
> I've never really learned how to optimize queries that join
> several tables and have order by clauses that specify columns
> from each table. Is there documentation that could help me
> optimize and have the proper indexes in place? I've read
> through the PG Docs Chapter 11 on Indexes yet still lack the
> needed understanding.
>
> Here's my latest culprit:
>
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode, anl.job, anl.lab
> limit 10;
>

Could you try removing the LIMIT clause? I have seen several times that
it can slow down a query. And although I haven't tested it, IIRC an
OFFSET 0 clause can improve the situation.
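Untested, but what I mean is something like the sketch below; OFFSET 0
in a subquery acts as an optimization fence in PostgreSQL, so the inner
query is planned without regard to the outer LIMIT:

select *
from (select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
             JOB.CompanyCode, Anl.SampleName
      from analysis anl join job on anl.job = job.job
      order by job.companycode, anl.job, anl.lab
      offset 0) sub   -- fence: inner plan ignores the outer LIMIT
limit 10;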

From an algebraic point of view, I cannot see obvious inefficiencies.
Others, who know the internals of pg better, might see more.

> Here's the query plan using PG 8.4.4:
> Limit  (cost=21990.24..21990.27 rows=10 width=32)
>   ->  Sort  (cost=21990.24..22437.69 rows=178979 width=32)
>         Sort Key: job.companycode, anl.job, anl.lab
>         ->  Hash Join  (cost=451.20..18122.57 rows=178979 width=32)
>               Hash Cond: (anl.job = job.job)
>               ->  Seq Scan on analysis anl  (cost=0.00..14091.79 rows=178979 width=23)
>               ->  Hash  (cost=287.20..287.20 rows=13120 width=17)
>                     ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17)
>
>
> If I change the above query to only order by one of the
> tables, I get better results:
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode --, anl.job, anl.lab
> limit 10;
> Limit  (cost=0.00..3.65 rows=10 width=32)
>   ->  Nested Loop  (cost=0.00..65269.13 rows=178979 width=32)
>         ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17)
>         ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23)
>               Index Cond: (anl.job = job.job)
>

Those are estimated query plans; what does EXPLAIN ANALYZE say? The
plans above do not execute the query, they just make a rough guess
about the costs. Reality might be different. Also, you may want to run
VACUUM ANALYZE before.

> Any idea on how I can improve this? In the past I would tend
> to create a cached copy of the query as a table that would be
> utilized, but I suspect that there is a better way to go
> about this. I'm using a system (Clarion) which heavily uses
> cursors via the ODBC driver (I use the psqlODBC latest
> version) to get a handful of records at a time, so no actual
> LIMITs would be used in the production queries; I've added
> the LIMITs here to try to simulate the performance
> differences that I find when browsing the data while ordering
> by the above columns.
>
>
> Here are the relevant tables and indexes:
>
>
> CREATE TABLE job
> (
>   job bigint NOT NULL, -- Job #
>   companycode character(4), -- Company Code
>   recdbycode character(3), -- Initials of who checked in sample(s)
>   datein date, -- Date sample was received
>   project character varying, -- Project or Site name
>   remarks text, -- Remarks
>   --[CONSTRAINTs etc]
> )
>
> CREATE INDEX job_companycode
>   ON job
>   USING btree
>   (companycode);
> CREATE INDEX job_companycode_job
>   ON samples.job
>   USING btree
>   (companycode, job);
>

Index job_companycode is redundant: anything it can do is covered by
the second index, as companycode is that index's leading sort key.
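If no other query relies on it, you could simply drop it; a sketch:

-- Assuming nothing else depends on the single-column index: the
-- composite (companycode, job) index also serves plain companycode
-- scans, because companycode is its leading column.
DROP INDEX job_companycode;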

> CREATE TABLE analysis
> (
>   lab bigint NOT NULL, -- Lab number
>   job bigint, -- Job number
>   sampletype character varying(5), -- General class of sample
>   priority character(1), -- Priority level
>   samplename character varying, -- Sample or Well name
>   CONSTRAINT rel_joblabk_to_jobk FOREIGN KEY (job)
>       REFERENCES job (job) MATCH SIMPLE
>       ON UPDATE CASCADE ON DELETE RESTRICT,
>   --[CONSTRAINTs etc]
> )
>
> CREATE INDEX analysis_companycode_job_lab
>   ON analysis
>   USING btree
>   (companycode, job, lab);
>
> CREATE INDEX analysis_job_lab
>   ON analysis
>   USING btree
>   (job, lab);
>

Maybe the planner would choose a Merge Join if there are sorted indexes
for anl.job and job.job. But the speed-up may vary depending on the
data.
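If you want to check whether such a plan wins on your data, you could,
as a diagnostic only (not a production setting), disable the hash join
for the session and compare timings:

-- Diagnostic sketch: nudge the planner away from the hash join, then
-- compare EXPLAIN ANALYZE timings; RESET afterwards.
SET enable_hashjoin = off;
EXPLAIN ANALYZE
select anl.priority, anl.lab, anl.job, job.datein,
       job.companycode, anl.samplename
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
RESET enable_hashjoin;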

> Thanks for any insights and tips you can provide!
>
> Kind Regards,
> -Joshua Berry
>
>
>

HTH

--
Robert...



Re: Optimizing queries that use multiple tables and many order by columns

From:
Joshua Berry
Date:


On Wed, Aug 25, 2010 at 10:40 AM, Wappler, Robert <rwappler@ophardt.com> wrote:
On 2010-08-25, Joshua Berry wrote:

> Here's my latest culprit:
>
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode, anl.job, anl.lab
> limit 10;
>

> Could you try removing the LIMIT clause? I have seen several times
> that it can slow down a query. And although I haven't tested it, IIRC
> an OFFSET 0 clause can improve the situation.

The actual query uses a cursor, which seems to run the query in full; only after the entire result set is ready does it allow fetching. So I'm not using a LIMIT in the actual queries, but something like this:


declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
fetch 10 in "SQL_CUR0453D910";
close "SQL_CUR0453D910";

> From an algebraic point of view, I cannot see obvious inefficiencies.
> Others, who know the internals of pg better, might see more.

> Those are estimated query plans; what does EXPLAIN ANALYZE say? The
> plans above do not execute the query, they just make a rough guess
> about the costs. Reality might be different. Also, you may want to
> run VACUUM ANALYZE before.


The database is VACUUM ANALYZEd and the statistics target is the default of 100. I'm using PG 8.4.4 running on CentOS 5.5 x86_64.

--Here's what explain analyze says for the query
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;

Sort  (cost=38047.92..38495.65 rows=179095 width=32) (actual time=1890.796..2271.248 rows=178979 loops=1)
  Sort Key: job.companycode, anl.job, anl.lab
  Sort Method:  external merge  Disk: 8416kB
  ->  Hash Join  (cost=451.20..18134.05 rows=179095 width=32) (actual time=8.239..260.848 rows=178979 loops=1)
        Hash Cond: (anl.job = job.job)
        ->  Seq Scan on analysis anl  (cost=0.00..14100.95 rows=179095 width=23) (actual time=0.026..91.602 rows=178979 loops=1)
        ->  Hash  (cost=287.20..287.20 rows=13120 width=17) (actual time=8.197..8.197 rows=13120 loops=1)
              ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17) (actual time=0.007..4.166 rows=13120 loops=1)
Total runtime: 2286.224 ms
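
I notice the sort spills to disk ("external merge  Disk: 8416kB").
Perhaps raising work_mem for the session would let the whole sort
happen in memory; something like the sketch below, where the 32MB
figure is just a guess to be tuned:

SET work_mem = '32MB';  -- guess: comfortably above the ~8 MB spill
EXPLAIN ANALYZE
select anl.priority, anl.lab, anl.job, job.datein,
       job.companycode, anl.samplename
from analysis anl join job on anl.job = job.job
order by job.companycode, anl.job, anl.lab;
RESET work_mem;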

 
> Maybe the planner would choose a Merge Join if there are sorted
> indexes for anl.job and job.job. But the speed-up may vary depending
> on the data.

It seems to be reading the entire dataset, then sorting, right? There's not much more that could be done to improve such queries, aside from increasing memory and IO bandwidth.

But now that I've said that, there's the following query that deals with exactly the same set of data, but the ordering involves only one of the two joined tables.

explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by job.companycode; --, anl.job, anl.lab  --Only order by indexed columns from job.

Nested Loop  (cost=0.00..65305.66 rows=179095 width=32) (actual time=0.084..288.976 rows=178979 loops=1)
  ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17) (actual time=0.045..7.328 rows=13120 loops=1)
  ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23) (actual time=0.006..0.015 rows=14 loops=13120)
        Index Cond: (anl.job = job.job)
Total runtime: 303.230 ms

If I order by columns from the other table (analysis) only, I get the following plan and results:
explain analyze
declare "SQL_CUR0453D910" cursor with hold for
select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
JOB.CompanyCode, Anl.SampleName
from analysis anl join job on anl.job = job.job
order by --job.companycode,
anl.job, anl.lab; --Only order by indexed columns from analysis.

Merge Join  (cost=0.56..44872.45 rows=179095 width=32) (actual time=0.078..368.620 rows=178979 loops=1)
  Merge Cond: (anl.job = job.job)
  ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..35245.47 rows=179095 width=23) (actual time=0.035..128.460 rows=178979 loops=1)
  ->  Index Scan using job_job_pk on job  (cost=0.00..508.53 rows=13120 width=17) (actual time=0.039..53.733 rows=179005 loops=1)
Total runtime: 388.884 ms


Notice that in these cases the query completes in <400 ms and the other query that involves ordering on columns from both of the joined tables completes in >2300ms.

In the application here, these queries are used by a client application to fill a window's listbox that can be scrolled up or down. If the user changes direction of the scroll, it initiates a new cursor and query to fetch a page of results. If the scrolling motion is in the same direction, it simply continues to fetch more results from the cursor. But each time the direction of movement changes, there can be a significant lag.

Any suggestions would be helpful! I'll assume for now that the indexes and queries can't be improved, but rather that I should tweak more of the postmaster settings. Please correct me if you know better and have time to reply.

Thanks,
-Joshua

P.S. Is it possible to have indexes that involve several columns from different but related tables? If so, where can I learn about them?





Re: Optimizing queries that use multiple tables and many order by columns

From:
"Wappler, Robert"
Date:
On 2010-08-25, Joshua Berry wrote:

> --Here's what explain analyze says for the query
> explain analyze
> declare "SQL_CUR0453D910" cursor with hold for
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode, anl.job, anl.lab;
>
> Sort  (cost=38047.92..38495.65 rows=179095 width=32) (actual time=1890.796..2271.248 rows=178979 loops=1)
>   Sort Key: job.companycode, anl.job, anl.lab
>   Sort Method:  external merge  Disk: 8416kB
>   ->  Hash Join  (cost=451.20..18134.05 rows=179095 width=32) (actual time=8.239..260.848 rows=178979 loops=1)
>         Hash Cond: (anl.job = job.job)
>         ->  Seq Scan on analysis anl  (cost=0.00..14100.95 rows=179095 width=23) (actual time=0.026..91.602 rows=178979 loops=1)
>         ->  Hash  (cost=287.20..287.20 rows=13120 width=17) (actual time=8.197..8.197 rows=13120 loops=1)
>               ->  Seq Scan on job  (cost=0.00..287.20 rows=13120 width=17) (actual time=0.007..4.166 rows=13120 loops=1)
> Total runtime: 2286.224 ms
>
>
>
>     Maybe the planner would choose a Merge Join if there are sorted
>     indexes for anl.job and job.job. But the speed-up may vary
>     depending on the data.
>
>
>
> It seems to be reading the entire dataset, then sorting,
> right? There's not much more that could be done to improve
> such queries, aside from increasing memory and IO bandwidth.
>

It has to, because it must return the smallest row of the result set
first, and that row may be the last one read if it cannot read the
tuples in sorted order.

> But now that I've said that, there's the following query that
> deals with exactly the same set of data, but the ordering
> involves only one of the two joined tables.
>
> explain analyze
> declare "SQL_CUR0453D910" cursor with hold for
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by job.companycode; --, anl.job, anl.lab  --Only order
> by indexed columns from job.
>
> Nested Loop  (cost=0.00..65305.66 rows=179095 width=32) (actual time=0.084..288.976 rows=178979 loops=1)
>   ->  Index Scan using job_companycode on job  (cost=0.00..972.67 rows=13120 width=17) (actual time=0.045..7.328 rows=13120 loops=1)
>   ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..4.63 rows=22 width=23) (actual time=0.006..0.015 rows=14 loops=13120)
>         Index Cond: (anl.job = job.job)
> Total runtime: 303.230 ms
>
> If I order by columns from the other table, analysis only, I
> get the follow query and results:
> explain analyze
> declare "SQL_CUR0453D910" cursor with hold for
> select Anl.Priority, Anl.Lab, Anl.Job, JOB.DateIn,
> JOB.CompanyCode, Anl.SampleName
> from analysis anl join job on anl.job = job.job
> order by --job.companycode,
> anl.job, anl.lab; --Only order by indexed columns from analysis.
>
> Merge Join  (cost=0.56..44872.45 rows=179095 width=32) (actual time=0.078..368.620 rows=178979 loops=1)
>   Merge Cond: (anl.job = job.job)
>   ->  Index Scan using analysis_job_lab on analysis anl  (cost=0.00..35245.47 rows=179095 width=23) (actual time=0.035..128.460 rows=178979 loops=1)
>   ->  Index Scan using job_job_pk on job  (cost=0.00..508.53 rows=13120 width=17) (actual time=0.039..53.733 rows=179005 loops=1)
> Total runtime: 388.884 ms
>
>
> Notice that in these cases the query completes in <400 ms and
> the other query that involves ordering on columns from both
> of the joined tables completes in >2300ms.
>

Because these queries don't need to sort the result; they can read it
in order.

What I don't really get is that you compare queries with different sort
orders, especially with a different number of sort keys. Of course,
that has a big influence on the time needed for sorting. While your
first query, which sorts on three keys, really does sort the result,
the latter two don't need to, because they can read the tuples in the
correct order from the indexes. If I read the first plan correctly,
that sort costs you about 2 seconds of execution time, because the
Hash Join is done after 260 ms.

Do you really need to sort at all? Or let me ask it the other way
round: assuming you have too much data to sort on the application side,
which user can read all this from one single table in the user
interface?

> In the application here, these queries are used by a client
> application to fill a window's listbox that can be scrolled
> up or down. If the user changes direction of the scroll, it
> initiates a new cursor and query to fetch a page of results.
> If the scrolling motion is in the same direction, it simply
> continues to fetch more results from the cursor. But each
> time the direction of movement changes, there can be a
> significant lag.
>

Then obviously you shouldn't create a new cursor. You can create
backwards-scrollable cursors; see the SCROLL option of the DECLARE
statement.
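A sketch of what that could look like (the cursor name is made up):

BEGIN;
DECLARE list_cur SCROLL CURSOR FOR
  select anl.priority, anl.lab, anl.job, job.datein,
         job.companycode, anl.samplename
  from analysis anl join job on anl.job = job.job
  order by job.companycode, anl.job, anl.lab;
FETCH FORWARD 10 FROM list_cur;   -- user scrolls down
FETCH BACKWARD 10 FROM list_cur;  -- direction change: same cursor, no re-declare
CLOSE list_cur;
COMMIT;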

> Any suggestions would be helpful! I'll assume for now that
> the indexes and queries can't be improved, but rather that I
> should tweak more of the postmaster settings. Please correct
> me if you know better and have time to reply.
>

These options heavily depend on the environment and the data set. I
always see them as a last resort, because they might slow down other
queries if tweaked too much towards a specific workload. I have not yet
played around with this a lot; things simply work fast enough here.
Others can give you better hints on this.


> Thanks,
> -Joshua
>
> P.S. Is it possible to have indexes that involves several
> columns from different but related tables? If so, where can I
> learn about them?
>
>

Nope. An index is tied to one table only. But another option is to
precalculate the join. Depending on your needs (especially
INSERT/UPDATE performance), you could use triggers and/or a regular
batch job which writes the joined results to another table. There you
can index these columns accordingly. In general this is ugly and leads
to redundancy, but it can give a big performance boost and is sometimes
the only option.
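A sketch of the batch-job variant (table and index names are made up):

-- Precalculate the join once (or on a schedule) ...
CREATE TABLE analysis_denorm AS
  select anl.priority, anl.lab, anl.job, job.datein,
         job.companycode, anl.samplename
  from analysis anl join job on anl.job = job.job;

-- ... and index it to match the problematic ORDER BY exactly.
CREATE INDEX analysis_denorm_sort
  ON analysis_denorm (companycode, job, lab);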


--
Robert...



Re: Optimizing queries that use multiple tables and many order by columns

From:
Joshua Berry
Date:


On Thu, Aug 26, 2010 at 2:51 AM, Wappler, Robert <rwappler@ophardt.com> wrote:
> Do you really need to sort at all? Or let me ask it the other way
> round: assuming you have too much data to sort on the application
> side, which user can read all this from one single table in the user
> interface?

The tool that I'm using to pull this information together is really easy to use and maintain when you use its database drivers to generate the queries. The extra sort here is so that I can order the dataset by company, then by job number, then by the specific lab number; jobs are assigned to a single company, and labs are assigned to a given job. The idea is for the application to be a substitute for pulling the dataset into a spreadsheet and perusing it there. I could sort by company alone, or by job and lab alone, but in the first case each company's job numbers would not necessarily be in order, and likewise the labs within the jobs. This could be smoothed over by cutting the dataset down to a subset based on a few criteria, which is the next approach to take.
 

> In the application here, these queries are used by a client
> application to fill a window's listbox that can be scrolled
> up or down. If the user changes direction of the scroll, it
> initiates a new cursor and query to fetch a page of results.
> If the scrolling motion is in the same direction, it simply
> continues to fetch more results from the cursor. But each
> time the direction of movement changes, there can be a
> significant lag.
>

> Then obviously you shouldn't create a new cursor. You can create
> backwards-scrollable cursors; see the SCROLL option of the DECLARE
> statement.

These queries are generated by the database driver and are not easily tweakable. Generally they use a subset of whatever is available via the ODBC interface. So, although not optimal, it's not something that I can improve in the short term.
 
> Any suggestions would be helpful! I'll assume for now that
> the indexes and queries can't be improved, but rather that I
> should tweak more of the postmaster settings. Please correct
> me if you know better and have time to reply.
>

> These options heavily depend on the environment and the data set. I
> always see them as a last resort, because they might slow down other
> queries if tweaked too much towards a specific workload. I have not
> yet played around with this a lot; things simply work fast enough
> here. Others can give you better hints on this.

Thanks for your tips and insight. I'll aim to get this portion of the system "good enough" and look to refactor later when needed.
 
> P.S. Is it possible to have indexes that involves several
> columns from different but related tables? If so, where can I
> learn about them?

> Nope. An index is tied to one table only. But another option is to
> precalculate the join. Depending on your needs (especially
> INSERT/UPDATE performance), you could use triggers and/or a regular
> batch job which writes the joined results to another table. There you
> can index these columns accordingly. In general this is ugly and
> leads to redundancy, but it can give a big performance boost and is
> sometimes the only option.

That's an option. I already use triggers to log user changes to the tables, so this wouldn't be too hard to do, though perhaps a bit hard to maintain down the road. It's great to have a backup plan in case I end up with a backlog of support requests about the UI being too laggy.
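
For the trigger route, a rough sketch of the INSERT side, using the
analysis_denorm table you sketched (all names made up; UPDATE/DELETE
handling omitted, and it assumes plpgsql is installed):

CREATE OR REPLACE FUNCTION analysis_denorm_ins() RETURNS trigger AS $$
BEGIN
  -- Mirror the new analysis row into the precalculated join table,
  -- picking up the company code and date from the parent job.
  INSERT INTO analysis_denorm (priority, lab, job, datein, companycode, samplename)
    SELECT NEW.priority, NEW.lab, NEW.job, j.datein, j.companycode, NEW.samplename
    FROM job j
    WHERE j.job = NEW.job;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER analysis_denorm_ins_trg
  AFTER INSERT ON analysis
  FOR EACH ROW EXECUTE PROCEDURE analysis_denorm_ins();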


Kind Regards,
-Joshua
 
