Обсуждение: (Hopefully stupid) select question.

Поиск
Список
Период
Сортировка

(Hopefully stupid) select question.

От
Fredric Fredricson
Дата:
I have been fighting with a select and can find no satisfactory solution.

Simplified version of the problem:

A table that, in reality, log state changes to an object (represented as a row in another table):

CREATE TABLE t (
    id SERIAL UNIQUE,
    ref INTEGER, -- Reference to a row in another table
    someData TEXT,
    inserted DATE DEFAULT CURRENT_TIMESTAMP
) ;
Then we insert multiple rows for each "ref" with different "someData".


Now I want the latest "someData" for each "ref" like:

ref | someData (only latest inserted)
-------------
 1  | 'data1'
 2  | 'data2'
etc...

The best solution I could find depended on the fact that serial is higher for higher dates. I do not like that because if that is true, it is an indirect way to get the data and could possibly, in the future, yield the wrong result if unrelated changes where made or id's reused.

Here is my solution (that depend on the SERIAL):
SELECT x.ref,x.someData
  FROM t as x
  NATURAL JOIN (SELECT ref,max(id) AS id FROM t GROUP BY ref ORDER BY ref) AS y ;

Can somebody come up with a better solution? (without resorting to stored procedures and other performance killers).

/Fredric
Вложения

Re: (Hopefully stupid) select question.

От
Tom Lane
Дата:
Fredric Fredricson <Fredric.Fredricson@bonetmail.com> writes:
> ... Now I want the latest "someData" for each "ref" like:

> The best solution I could find depended on the fact that serial is
> higher for higher dates. I do not like that because if that is true, it
> is an indirect way to get the data and could possibly, in the future,
> yield the wrong result if unrelated changes where made or id's reused.

> Here is my solution (that depend on the SERIAL):
> SELECT x.ref,x.someData
>    FROM t as x
>    NATURAL JOIN (SELECT ref,max(id) AS id FROM t GROUP BY ref ORDER BY
> ref) AS y ;

Well, you could just substitute max(inserted) for max(id).  But you
should also consider using DISTINCT ON --- look at the "weather reports"
example in the SELECT reference page.

BTW, "inserted DATE DEFAULT CURRENT_TIMESTAMP" looks pretty fishy.
You sure the column type shouldn't be timestamp or timestamptz, to
support multiple updates per day?

            regards, tom lane

Re: (Hopefully stupid) select question.

От
"A.M."
Дата:
On Jan 24, 2011, at 10:50 AM, Fredric Fredricson wrote:

> I have been fighting with a select and can find no satisfactory solution.
>
> Simplified version of the problem:
>
> A table that, in reality, log state changes to an object (represented as a row in another table):
>
> CREATE TABLE t (
>    id SERIAL UNIQUE,
>    ref INTEGER, -- Reference to a row in another table
>    someData TEXT,
>    inserted DATE DEFAULT CURRENT_TIMESTAMP
> ) ;
> Then we insert multiple rows for each "ref" with different "someData".
>
>
> Now I want the latest "someData" for each "ref" like:
>
> ref | someData (only latest inserted)
> -------------
> 1  | 'data1'
> 2  | 'data2'
> etc...
>
> The best solution I could find depended on the fact that serial is higher for higher dates. I do not like that
becauseif that is true, it is an indirect way to get the data and could possibly, in the future, yield the wrong result
ifunrelated changes where made or id's reused. 
>
> Here is my solution (that depend on the SERIAL):
> SELECT x.ref,x.someData
>  FROM t as x
>  NATURAL JOIN (SELECT ref,max(id) AS id FROM t GROUP BY ref ORDER BY ref) AS y ;
>
> Can somebody come up with a better solution? (without resorting to stored procedures and other performance killers).

I would argue that relying on the id is safer than relying on the current timestamp because CURRENT_TIMESTAMP refers to
thetime that the transaction is started, not when the transaction was committed (or the row was "actually" inserted).
Inaddition, it is technically possible for two transactions to get the same CURRENT_TIMESTAMP. SERIAL values are never
reused.You could also create a security view which exposes the historical data but without the primary key in the
actualtable. 

I recommend http://pgfoundry.org/projects/tablelog which uses "performance killers" like stored procedures to handle
thingsproperly- at least take a look to see how things are handled. 

Cheers,
M

Re: (Hopefully stupid) select question.

От
Fredric Fredricson
Дата:
On 01/24/2011 04:56 PM, Tom Lane wrote:
> Fredric Fredricson<Fredric.Fredricson@bonetmail.com>  writes:
>> ... Now I want the latest "someData" for each "ref" like:
>> The best solution I could find depended on the fact that serial is
>> higher for higher dates. I do not like that because if that is true, it
>> is an indirect way to get the data and could possibly, in the future,
>> yield the wrong result if unrelated changes where made or id's reused.
>> Here is my solution (that depend on the SERIAL):
>> SELECT x.ref,x.someData
>>     FROM t as x
>>     NATURAL JOIN (SELECT ref,max(id) AS id FROM t GROUP BY ref ORDER BY
>> ref) AS y ;
> Well, you could just substitute max(inserted) for max(id).  But you
I tried this, but that did not get me "someData" because I need "id" for
that.
> should also consider using DISTINCT ON --- look at the "weather reports"
> example in the SELECT reference page.
DISTINCT ON did the trick. Thank you!
My select is now much simpler:
SELECT DISTINCT ON (ref) ref, someData FROM t ORDER BY red,date DESC;

Also 20-30% faster in my setup.

I tried DISTINCT but I wasn't aware of the "DISTINCT ON" functionality.
You live - you learn.
> BTW, "inserted DATE DEFAULT CURRENT_TIMESTAMP" looks pretty fishy.
> You sure the column type shouldn't be timestamp or timestamptz, to
> support multiple updates per day?
Sorry, my typo, it is really a TIMESTAMP(0), of course.

/Fredric
>             regards, tom lane
>


Вложения

Re: (Hopefully stupid) select question.

От
Fredric Fredricson
Дата:
On 01/24/2011 05:02 PM, A.M. wrote:
> On Jan 24, 2011, at 10:50 AM, Fredric Fredricson wrote:
>
>> I have been fighting with a select and can find no satisfactory solution.
>>
>> Simplified version of the problem:
>>
>> A table that, in reality, log state changes to an object (represented as a row in another table):
>>
>> CREATE TABLE t (
>>     id SERIAL UNIQUE,
>>     ref INTEGER, -- Reference to a row in another table
>>     someData TEXT,
>>     inserted DATE DEFAULT CURRENT_TIMESTAMP
>> ) ;
>> Then we insert multiple rows for each "ref" with different "someData".
>>
>>
>> Now I want the latest "someData" for each "ref" like:
>>
>> ref | someData (only latest inserted)
>> -------------
>> 1  | 'data1'
>> 2  | 'data2'
>> etc...
>>
>> The best solution I could find depended on the fact that serial is higher for higher dates. I do not like that
becauseif that is true, it is an indirect way to get the data and could possibly, in the future, yield the wrong result
ifunrelated changes where made or id's reused. 
>>
>> Here is my solution (that depend on the SERIAL):
>> SELECT x.ref,x.someData
>>   FROM t as x
>>   NATURAL JOIN (SELECT ref,max(id) AS id FROM t GROUP BY ref ORDER BY ref) AS y ;
>>
>> Can somebody come up with a better solution? (without resorting to stored procedures and other performance killers).
> I would argue that relying on the id is safer than relying on the current timestamp because CURRENT_TIMESTAMP refers
tothe time that the transaction is started, not when the transaction was committed (or the row was "actually"
inserted).In addition, it is technically possible for two transactions to get the same CURRENT_TIMESTAMP. SERIAL values
arenever reused. You could also create a security view which exposes the historical data but without the primary key in
theactual table. 
Well, in my case the transaction time is not an issue really. The
database is a backend to a REST Web service and all transactions are
short (as dictated by the web server).
But I see your point.
> I recommend http://pgfoundry.org/projects/tablelog which uses "performance killers" like stored procedures to handle
thingsproperly- at least take a look to see how things are handled. 
I looked at this page and it is not what I need for this particular
problem, since I log only specific changes in state and these changes
are represented as rows in this state-log table (the row in the
referenced table is not changed).

But I do log changes in about 80% of my tables and I use a technique
similar to the one described in the table log. I have a script that
parse my sql-code and auto-generate sql statemens that creates a
"shadow"-table and the triggers required. I also have a mandatory
"header" on all my logged tables and store an entry in a change log
table with information about user name (external user, not SQL ROLE) and
timestamp. This way all changes can be traced in time and I can, in
theory, get a snapshot of my entire data at an arbitrary point in time.
I say "in theory" because I have not implemented it and with a lot of
unions and such I expect the performance to suck. I will however use it
for parts of the data, which is why I implemented it.

And about performance. In my application insert performance is not an
issue, I suspect it rarely is in systems run by human hands. Read
performance on the other hand can definitely be an issue since reads are
much more frequent and contains more data. I use views a lot and at one
point I had nested views that used stored procedures and I started to
get select times in the region of 7-800ms for simple selects with a
couple of hundred rows in the result set. Not funny. I removed the
stored procedures (it was painful!) and the nested views and got select
times down to 20-40ms. Not entirely satisfactory, maybe, but much better
and with some decent hardware I guess it would be even better.
Hence my remark about stored procedures as "performance killers".

Thanks,
Fredric
> Cheers,
> M


Вложения