Re: query optimization

Поиск
Список
Период
Сортировка
От Faheem Mitha
Тема Re: query optimization
Дата
Msg-id alpine.DEB.2.00.0911271508440.21186@orwell.homelinux.org
обсуждение исходный текст
Ответ на Re: query optimization  (Robert Haas <robertmhaas@gmail.com>)
Ответы Re: query optimization  (Robert Haas <robertmhaas@gmail.com>)
Список pgsql-performance

On Wed, 25 Nov 2009, Robert Haas wrote:

> On Wed, Nov 25, 2009 at 5:54 PM, Faheem Mitha <faheem@email.unc.edu> wrote:

> Well, any method of DISTINCT-ifying is likely to be somewhat slow, but
> I've had good luck with SELECT DISTINCT ON (...) in the past, as
> compared with other methods.  YMMV - the only way to find out is to
> benchmark it.  I don't think it's non-deterministic if you order by
> the DISTINCT-ON columns and enough extras to break any ties - you
> should get the first one of each set.

Right, but adding enough extras to break ties is up to the user, and the
language doesn't guarantee anything, so it feels more fragile.

>>> If not, you might want to look at some way of pre-marking the
>>> non-duplicate rows so that you don't have to recompute that each time.
>>
>> What are the options re pre-marking?
>
> Well, what I usually do is - if I'm going to do the same
> distinct-ification frequently, I add an extra column (say, a boolean)
> and set it to true for all and only those rows which will pass the
> distinct-ification filter.  Then I can just say WHERE <that column
> name>.

Yes, I see. The problem with is premarking is that the selection is
somewhat dynamic, in the sense that this depends on the idlink table,
which depends on patient data, which can change.

>>> Then you might be able to use the underlying table directly in the next
>>> CTE, which will usually permit better optimization, more use of indices,
>>> etc.  It seems pretty unfortunate that dedup_patient_anno joins against geno
>>> and then patient_geno does what appears to be the same join again. Is there
>>> some way to eliminate that?  If so it will probably help.
>>
>> You don't say whether you are looking at the PED or TPED query, so I'll
>> assume PED. They are similar anyway.
>>
>> I see your point re the joins. You mean
>>
>> anno INNER JOIN geno
>>
>> followed by
>>
>> geno INNER JOIN dedup_patient_anno
>>
>> ? I think the point of the first join is to reduce the anno table based on
>> information from the geno table. The result is basically a subset of the
>> anno table with some potential duplication removed, which is then re-joined
>> to the geno table. I agree this seems a bit suboptimal, and there might be a
>> better way to do this.
>
> Yeah, I didn't think about it in detail, but it looks like it should
> be possible.  Eliminating joins can sometimes have *dramatic* effects
> on query performance, and it never hurts.

Failing all else, couldn't I smoosh together the two queries and do a
triple join? For reference, the two CTEs in question, from the PED query,
are as follows.

     dedup_patient_anno AS
      ( SELECT *
      FROM
              (SELECT  *,
                       row_number() OVER(PARTITION BY anno.rsid ORDER BY
anno.id)
              FROM     anno
                       INNER JOIN geno
                       ON       anno.id = geno.anno_id
              WHERE    idlink_id        =
                       (SELECT MIN(id)
                       FROM    idlink
                       )
              ) AS s
      WHERE   row_number = '1'
      ),
      patient_geno AS
      ( SELECT geno.idlink_id AS idlink_id,
        geno.anno_id AS anno_id,
        geno.snpval_id AS snpval_id,
        allelea_id, alleleb_id
        FROM    geno
              INNER JOIN dedup_patient_anno
              ON      geno.anno_id = dedup_patient_anno.id
      ),

                                                       Regards, Faheem.

В списке pgsql-performance по дате отправления:

Предыдущее
От: Greg Smith
Дата:
Сообщение: Re: Analyse without locking?
Следующее
От: Reydan Cankur
Дата:
Сообщение: OpenMP in PostgreSQL-8.4.0