Discussion: [GENERAL] improvements/feedback sought for a working query that looks a bit ugly and might be inefficient

Everything here works fine - but after a handful of product iterations & production adjustments, a query that handles a "task queue" across a few tables looks a bit ugly.

I'm wondering if anyone can see obvious improvements.

There are 3 tables:
    upstream_provider
    task
    task_queue

Originally we needed to select 50 items off the top of the queue at a time.
Then we needed to set a max of 5 tasks per upstream provider (it's a cheap way to handle throttling).
The table is quite large, so a limit of the last 1000 items drastically improved performance.
The query got ugly when we needed to add a "priority" toggle to the queue -- basically to mark things as "process
ASAP".  
The only way I could figure out how to do that was to add a sort on "is_priority DESC NULLS LAST".
My concern is that the sort needs to happen 3x --
    in the subselect for 1000 items
    in the partition for row numbering
    in the final sort


If anyone has a moment to look this over and suggest anything, I'd be very thankful.  A working reproduction is below.


SQL---
--------------

CREATE TABLE upstream_provider (id SERIAL PRIMARY KEY,
                                name VARCHAR(32),
                                is_paused BOOLEAN DEFAULT NULL
                                );
CREATE TABLE task (id SERIAL PRIMARY KEY,
                   upstream_provider_id INT NOT NULL REFERENCES upstream_provider(id),
                   name VARCHAR(32)
                   );
CREATE TABLE task_queue (id SERIAL PRIMARY KEY,
                         task_id INT NOT NULL REFERENCES task(id),
                         upstream_provider_id INT NOT NULL REFERENCES upstream_provider(id),  -- only here because it eliminates expensive joins elsewhere
                         processing_status BOOLEAN DEFAULT NULL,
                         is_priority BOOLEAN DEFAULT NULL
                         );
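
-- Minimal seed data so the query below returns rows (purely illustrative:
-- provider names, row counts, and the priority rule are invented here).
-- The paused third provider exercises the is_paused filter.
INSERT INTO upstream_provider (name, is_paused)
VALUES ('provider-a', NULL), ('provider-b', NULL), ('provider-c', TRUE);

INSERT INTO task (upstream_provider_id, name)
SELECT (i % 3) + 1, 'task-' || i
FROM generate_series(1, 200) AS s(i);

INSERT INTO task_queue (task_id, upstream_provider_id, is_priority)
SELECT id, upstream_provider_id,
       CASE WHEN id % 17 = 0 THEN TRUE ELSE NULL END
FROM task;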

SELECT  partition1.*
      , task.*
FROM (SELECT  window1.*
            , row_number() OVER (PARTITION BY window1.upstream_provider_id
                                 ORDER BY window1.is_priority DESC NULLS LAST,
                                          window1.task_queue_id
                                 ) AS rownum
      FROM (SELECT   qu.id AS task_queue_id
                   , qu.upstream_provider_id
                   , qu.task_id
                   , qu.is_priority
            FROM
                task_queue qu
            JOIN
                upstream_provider ON qu.upstream_provider_id = upstream_provider.id
            WHERE (qu.processing_status IS NULL)
              AND (upstream_provider.is_paused IS NOT TRUE)
            ORDER BY is_priority DESC NULLS LAST,
                     qu.id DESC
            LIMIT 1000
            ) window1
      ) partition1
JOIN task ON partition1.task_id = task.id
WHERE partition1.rownum < 5
ORDER BY is_priority DESC NULLS LAST,
         task_queue_id DESC
LIMIT
    50
;
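
If the inner sort ever shows up as a real cost, one standard lever is a partial index matching the subselect's WHERE and ORDER BY, so the LIMIT 1000 scan can come back already ordered instead of being sorted; a sketch (the index name is just a placeholder):

-- hypothetical partial index matching the inner filter + sort order
CREATE INDEX task_queue_pending_priority_idx
    ON task_queue (is_priority DESC NULLS LAST, id DESC)
    WHERE processing_status IS NULL;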

On Tuesday, May 16, 2017, jonathan vanasco <postgres@2xlp.com> wrote:

Everything here works fine - but after a handful of product iterations & production adjustments, a query that handles a "task queue" across a few tables looks a bit ugly.

This is a far cry from ugly.
 

My concern is that the sort needs to happen 3x --
        in the subselect for 1000 items
        in the partition for row numbering
        in the final sort


Unless you can discard the 5 and 1000 limits you are going to be stuck computing rank three times in order to compute and filter them.

Without understanding how the data is ultimately used it's difficult to suggest meaningful alternatives.  Views and/or functions can hide some of the complexity behind meaningful names, but changing away from your choice of "window" and "partition" as relation aliases will give most of the same effect.
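
For example, roughly (the view name and relation aliases here are just placeholders; the body is the same query as in your repro):

-- view name is a placeholder; body is unchanged from the query above
CREATE VIEW next_queued_tasks AS
SELECT  ranked.task_queue_id
      , ranked.upstream_provider_id
      , ranked.is_priority
      , task.id   AS task_id
      , task.name AS task_name
FROM (SELECT  recent.*
            , row_number() OVER (PARTITION BY recent.upstream_provider_id
                                 ORDER BY recent.is_priority DESC NULLS LAST,
                                          recent.task_queue_id
                                 ) AS rownum
      FROM (SELECT   qu.id AS task_queue_id
                   , qu.upstream_provider_id
                   , qu.task_id
                   , qu.is_priority
            FROM task_queue qu
            JOIN upstream_provider ON qu.upstream_provider_id = upstream_provider.id
            WHERE (qu.processing_status IS NULL)
              AND (upstream_provider.is_paused IS NOT TRUE)
            ORDER BY qu.is_priority DESC NULLS LAST,
                     qu.id DESC
            LIMIT 1000
            ) recent
      ) ranked
JOIN task ON ranked.task_id = task.id
WHERE ranked.rownum < 5
ORDER BY ranked.is_priority DESC NULLS LAST,
         ranked.task_queue_id DESC
LIMIT 50;

-- callers then only see the name:
-- SELECT * FROM next_queued_tasks;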

The query is a solid reflection of the data model.  If the only concern is stylistic I'd say you should move on.

Given the first LIMIT 1000, a reasonable work_mem setting should (without any evidence...) suffice to make the actual time spent sorting immaterial.  Lower-level sorting is retained where possible so the upper levels don't have as much to move around.  The lack of the provider id in the final sort was noticed, though...
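
Whether the sorts stay in memory is easy to sanity-check; a minimal sketch, assuming the next_queued_tasks view sketched above (the work_mem value is only an example):

-- "Sort Method: quicksort  Memory: ..." means the sorts fit in work_mem;
-- "external merge  Disk: ..." means they spilled.
SHOW work_mem;
SET work_mem = '16MB';   -- example value only
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM next_queued_tasks;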

David J.


On May 16, 2017, at 10:20 PM, David G. Johnston wrote:

Unless you can discard the 5 and 1000 limits you are going to be stuck computing rank three times in order to compute and filter them.

Thanks a ton for your insight.  I'm stuck using them (5 is required for throttling, 1000 is required for this to run in a reasonable amount of time).

The overhead of computing things is indeed super small.  I'm not really worried much about the performance of this query (it runs around 3ms now, down from 20+s).  I'm more worried about this code being referenced and a (possibly improper) idiom being used on queries where it will have a noticeable effect.