Обсуждение: speeding up big query lookup
This is a question on speeding up some type of queries. I have a very big table that catalogs measurements of some objects over time. Measurements can be of several (~10) types. It keeps the observation date in a field, and indicates the type of measurement in another field. I often need to get the latest measurement of type A for object X. The table is indexed by object_id. The obvoious way to get the latest measurement of type A would be to join the table against SELECT object_id, object_val_type_id, max(observation_date) FROM object_val GROUP BY object_id, object_val_type_id But this can take a long time, and needs to be done very often. Next strategy was to build an auxiliary table that just keeps the last measurement of each type, for each object. I defined triggers to keep this table up to date whenever the object_val table was updated. However I don't trust this as much, and though it helps performance in reads, I'm not sure it's the best option overall. I found that Postgres can use indexes on pairs, so I'm going to play with indexing object_val by (object_id, object_val_type_id), and I'm sure my original JOIN query will be much faster. But I'm sure many people have this type of problem. Is there a smarter way to deal with this? I'd love to be able to choose the row with max value of some of the fields with just one statement, but I think this can't be done? Thanks Jaime *********************************************************************** Bear Stearns is not responsible for any recommendation, solicitation, offer or agreement or any information about any transaction, customer account or account activity contained in this communication. Bear Stearns does not provide tax, legal or accounting advice. You should consult your own tax, legal and accounting advisors before engaging in any transaction. In order for Bear Stearns to comply with Internal Revenue Service Circular 230 (if applicable), you are notified that any discussion of U.S. federal tax issues contained or referred to herein is not intended or written to be used, and cannot be used, for the purpose of: (A) avoiding penalties that may be imposed under the Internal Revenue Code; nor (B) promoting, marketing or recommending to another party any transaction or matter addressed herein. ***********************************************************************
On fös, 2006-08-25 at 18:34 -0400, Silvela, Jaime (Exchange) wrote:
> This is a question on speeding up some type of queries.
>
> I have a very big table that catalogs measurements of some objects over
> time. Measurements can be of several (~10) types. It keeps the
> observation date in a field, and indicates the type of measurement in
> another field.
>
> I often need to get the latest measurement of type A for object X.
> The table is indexed by object_id.
one popular way is to create a composite index:
CREATE INDEX object_val_id_type_date
       ON object_val(object_id,
                     object_val_type_id,
                     observation_date);
then you could
SELECT * FROM object_val
         WHERE object_id=?
               AND object_val_type_id=?
         ORDER BY observation_date DESC
         LIMIT 1;
Hope this helps
gnari
			
		"Silvela, Jaime \(Exchange\)" <JSilvela@Bear.com> writes:
> I have a very big table that catalogs measurements of some objects over
> time. Measurements can be of several (~10) types. It keeps the
> observation date in a field, and indicates the type of measurement in
> another field.
> I often need to get the latest measurement of type A for object X.
This is a pretty common requirement, and since plain SQL doesn't handle
it very well, different DBMSes have invented different extensions to
help.  For instance you can use LIMIT:
  SELECT * from object_val WHERE object_id = X and object_val_type_id = Y
    ORDER BY object_id DESC, object_val_type_id DESC, observation_date DESC
    LIMIT 1;
This will work very very fast if there is an index on (object_id,
object_val_type_id, observation_date) for it to use.  The only problem
with it is that there's no obvious way to extend it to fetch latest
measurements for several objects in one query.
Another way, which AFAIK is Postgres-only, is to use DISTINCT ON:
  SELECT DISTINCT ON (object_id, object_val_type_id) * from object_val
    ORDER BY object_id DESC, object_val_type_id DESC, observation_date DESC
This can give you all the latest measurements at once, or any subset
you need (just add a WHERE clause).  It's not amazingly fast but it
generally beats the bog-standard-SQL alternatives, which as you
mentioned require joining against subselects.
            regards, tom lane
			
		Silvela, Jaime (Exchange) wrote:
> The obvoious way to get the latest measurement of type A would be to
> join the table against
>
> SELECT object_id, object_val_type_id, max(observation_date)
> FROM object_val
> GROUP BY object_id, object_val_type_id
I'm not sure this is actually the result you want; doesn't this give you
all the unique (object_id, object_val_type_id)'s combined with the max
observation_date in the table (as in, not necessarily related to the
records listed)?
I'd think you want this:
  SELECT object_id, object_val_type_id, observation_date
  FROM object_val
  GROUP BY object_id, object_val_type_id, observation_date
  HAVING observation_date = max(observation_date)
Which'd return a single record with the highest observation_date. Though
not strictly necessary, I can imagine you'd want observation_date to be
unique, or you could get grouped observations with the same date.
Although ordering and limiting is probably faster.
I don't think the planner is intelligent enough to know that this would
only return the record with the highest observation_date - it may be
smart enough to reject ("drop from the result set") found records after
finding ones with a higher observation_date (which'd be "interesting"
when using cursors) or something along those lines. Hmm... Now I'm all
curious; an EXPLAIN'd be interesting...
Sorry for the mostly useless post :P
Regards,
--
Alban Hertroys
alban@magproductions.nl
magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
    7500 AK Enschede
// Integrate Your World //
			
		No, you can make this work just fine if you JOIN right.
You're way is a more concise way of expressing it, though.
Tom's trick
  SELECT DISTINCT ON (object_id, object_val_type_id) * from object_val
    ORDER BY object_id DESC, object_val_type_id DESC, observation_date
DESC
Runs about twice as fast as the GROUP BY ... HAVING, but definitely not
as fast as keeping a separate table with only the latest observations,
updated by triggers. I'll be testing out the differences in overall
performance for my applications.
Thanks for the suggestions,
Jaime
-----Original Message-----
From: pgsql-general-owner@postgresql.org
[mailto:pgsql-general-owner@postgresql.org] On Behalf Of Alban Hertroys
Sent: Monday, August 28, 2006 4:57 AM
To: Silvela, Jaime (Exchange)
Cc: pgsql-general@postgresql.org
Subject: Re: [GENERAL] speeding up big query lookup
Silvela, Jaime (Exchange) wrote:
> The obvoious way to get the latest measurement of type A would be to
> join the table against
>
> SELECT object_id, object_val_type_id, max(observation_date)
> FROM object_val
> GROUP BY object_id, object_val_type_id
I'm not sure this is actually the result you want; doesn't this give you
all the unique (object_id, object_val_type_id)'s combined with the max
observation_date in the table (as in, not necessarily related to the
records listed)?
I'd think you want this:
  SELECT object_id, object_val_type_id, observation_date
  FROM object_val
  GROUP BY object_id, object_val_type_id, observation_date
  HAVING observation_date = max(observation_date)
Which'd return a single record with the highest observation_date. Though
not strictly necessary, I can imagine you'd want observation_date to be
unique, or you could get grouped observations with the same date.
Although ordering and limiting is probably faster.
I don't think the planner is intelligent enough to know that this would
only return the record with the highest observation_date - it may be
smart enough to reject ("drop from the result set") found records after
finding ones with a higher observation_date (which'd be "interesting"
when using cursors) or something along those lines. Hmm... Now I'm all
curious; an EXPLAIN'd be interesting...
Sorry for the mostly useless post :P
Regards,
--
Alban Hertroys
alban@magproductions.nl
magproductions b.v.
T: ++31(0)534346874
F: ++31(0)534346876
M:
I: www.magproductions.nl
A: Postbus 416
    7500 AK Enschede
// Integrate Your World //
---------------------------(end of broadcast)---------------------------
TIP 3: Have you checked our extensive FAQ?
               http://www.postgresql.org/docs/faq
***********************************************************************
Bear Stearns is not responsible for any recommendation, solicitation,
offer or agreement or any information about any transaction, customer
account or account activity contained in this communication.
Bear Stearns does not provide tax, legal or accounting advice.  You
should consult your own tax, legal and accounting advisors before
engaging in any transaction. In order for Bear Stearns to comply with
Internal Revenue Service Circular 230 (if applicable), you are notified
that any discussion of U.S. federal tax issues contained or referred to
herein is not intended or written to be used, and cannot be used, for
the purpose of:  (A) avoiding penalties that may be imposed under the
Internal Revenue Code; nor (B) promoting, marketing or recommending to
another party any transaction or matter addressed herein.
***********************************************************************
			
		I just finished reading 'The Art of SQL' by Stephane Faroult who has a
chapter (ch 6) discussing this very topic.  I strongly recommend any
developer dealing with databases take a few days to read this
narrative.
A solution would seem to depend on whether you have many objects which
change in measurement only occasionally or if your have very few
objects whose measurement change very frequently.  If you have a
chance, go to your local big name bookstore who allow you to sit and
take a read of the book, jump to ch6 (page 156 or so) and get a good
understanding of some various techniques to take into consideration.
A good suggestion (imo) is the composite key already mentioned.  If I
can give an example from the book in the case you have many objects
whose measurements change occasionally:
select whatever
from object_val as outer
where outer.item_id = someValueForObjectX
and object_val_type_id = someValueForTypeA
and outer.observation_date = ( select max(inner.observation_date)
                                             from object_val as inner
                                             where inner.item_id =
someValueForObjectX
                                              and
inner.object_val_type_id = someValueForTypeA
                                              and
inner.observation_date <= yourReferenceDate )
Hopefully, I haven't completely mis-understood the author's intentions
and gave you some silly query.  Some may argue that the inner.item_id
could be correlated to the outer.item_id, but then the inner query
would need to be run multiple times.  If used as presented, the inner
query fires only once.
I'd be rather interested in knowing if the above query (or similar)
works in practice as occassionally can sound good on paper, but doesn't
work in the real world - I'd like to know if in your situation it
provides a good execution time.  I would take the step of creating that
composite key as suggested by Ragnar first.
Good luck,
am.
"Silvela, Jaime (Exchange)" wrote:
> This is a question on speeding up some type of queries.
>
> I have a very big table that catalogs measurements of some objects over
> time. Measurements can be of several (~10) types. It keeps the
> observation date in a field, and indicates the type of measurement in
> another field.
>
> I often need to get the latest measurement of type A for object X.
>
			
		Just for fun, another approach since I believe pg supports it: select whateverFields from object_val as outer where (outer.object_id, outer.object_val_type_id,outer.observation_date) IN (select inner.object_id, inner.object_val_type,max(inner.observation_date) from object_val as inner where inner.object_id = somevalueForObjectX and inner.object_val_type = someValueForTypeA and inner.observation_date <= yourReferenceDate group by inner.object_id, inner.object_val_type) The reason these subqueries should run quickly is because the object_id,object_val_type,oberservation_date make up a composite key, so the subquery should execute extremely fast, thus eliminating the majority of the data when you want to display or act on other fields from the object_val (as outer). I suppose if you don't need any further information from object_val, and you are happy with the speeds, Tom's method is smooth. Adding the order by clause will take you out of the 'relational world' and thus slow you down. My fear with the triggers and the separate snapshot is that the delays are spread out and add questionable complexity, and potentially uneccessary overhead to the application. Something to consider (although admittedly it is arguably a weak consideration in some circumstances) is the extra space, indexes, and other factors such as additional time for backup routines (and restoration) the extra table creates. Best of luck, am "Silvela, Jaime (Exchange)" wrote: > No, you can make this work just fine if you JOIN right. > You're way is a more concise way of expressing it, though. > > Tom's trick > > SELECT DISTINCT ON (object_id, object_val_type_id) * from object_val > ORDER BY object_id DESC, object_val_type_id DESC, observation_date > DESC > > Runs about twice as fast as the GROUP BY ... HAVING, but definitely not > as fast as keeping a separate table with only the latest observations, > updated by triggers. I'll be testing out the differences in overall > performance for my applications. > > Thanks for the suggestions, > Jaime >
macgillivary wrote: > I just finished reading 'The Art of SQL' by Stephane Faroult who has a > chapter (ch 6) discussing this very topic. I'd be curious to know any other references, books, folks would recommend when it comes to writing efficient SQL, as well as references on database design. -- Until later, Geoffrey Those who would give up essential Liberty, to purchase a little temporary Safety, deserve neither Liberty nor Safety. - Benjamin Franklin
Thanks for the tips, am Actually, your suggestion is equivalent to JOINing the table with a GROUP BY copy of itself, and EXPLAIN shows both versions using the same index and aggregates. Just a matter of style. Your previous suggestion from the book works well too, but I actually prefer the JOIN method, since that allows me to set the object_id and/or object_val_type values in only one place. Tom's method is faster, but has against it a bit of obscurity - it's very fine tuned to a very specific behavior of DISTINCT ON and is less easy to read than the others. I fully agree that it is annoying to keep another table with triggers. And of course, that table needs to be indexed too, or it's worthless. I'm wondering how much extra time the db spends running all those indexes and triggers, and I'll probably dismantle that in favor of the composite index and the queries suggested so far. I'll definitely check that book, I've been looking for something like that. Thanks Jaime -----Original Message----- From: pgsql-general-owner@postgresql.org [mailto:pgsql-general-owner@postgresql.org] On Behalf Of macgillivary Sent: Monday, August 28, 2006 10:14 PM To: pgsql-general@postgresql.org Subject: Re: [GENERAL] speeding up big query lookup Just for fun, another approach since I believe pg supports it: select whateverFields from object_val as outer where (outer.object_id, outer.object_val_type_id,outer.observation_date) IN (select inner.object_id, inner.object_val_type,max(inner.observation_date) from object_val as inner where inner.object_id = somevalueForObjectX and inner.object_val_type = someValueForTypeA and inner.observation_date <= yourReferenceDate group by inner.object_id, inner.object_val_type) The reason these subqueries should run quickly is because the object_id,object_val_type,oberservation_date make up a composite key, so the subquery should execute extremely fast, thus eliminating the majority of the data when you want to display or act on other fields from the object_val (as outer). I suppose if you don't need any further information from object_val, and you are happy with the speeds, Tom's method is smooth. Adding the order by clause will take you out of the 'relational world' and thus slow you down. My fear with the triggers and the separate snapshot is that the delays are spread out and add questionable complexity, and potentially uneccessary overhead to the application. Something to consider (although admittedly it is arguably a weak consideration in some circumstances) is the extra space, indexes, and other factors such as additional time for backup routines (and restoration) the extra table creates. Best of luck, am "Silvela, Jaime (Exchange)" wrote: > No, you can make this work just fine if you JOIN right. > You're way is a more concise way of expressing it, though. > > Tom's trick > > SELECT DISTINCT ON (object_id, object_val_type_id) * from object_val > ORDER BY object_id DESC, object_val_type_id DESC, observation_date > DESC > > Runs about twice as fast as the GROUP BY ... HAVING, but definitely not > as fast as keeping a separate table with only the latest observations, > updated by triggers. I'll be testing out the differences in overall > performance for my applications. > > Thanks for the suggestions, > Jaime > ---------------------------(end of broadcast)--------------------------- TIP 2: Don't 'kill -9' the postmaster *********************************************************************** Bear Stearns is not responsible for any recommendation, solicitation, offer or agreement or any information about any transaction, customer account or account activity contained in this communication. Bear Stearns does not provide tax, legal or accounting advice. You should consult your own tax, legal and accounting advisors before engaging in any transaction. In order for Bear Stearns to comply with Internal Revenue Service Circular 230 (if applicable), you are notified that any discussion of U.S. federal tax issues contained or referred to herein is not intended or written to be used, and cannot be used, for the purpose of: (A) avoiding penalties that may be imposed under the Internal Revenue Code; nor (B) promoting, marketing or recommending to another party any transaction or matter addressed herein. ***********************************************************************