Finding latest record for a number of groups in an INSERT-only table

Поиск
Список
Период
Сортировка
От Daniel Farina
Тема Finding latest record for a number of groups in an INSERT-only table
Дата
Msg-id CAAZKuFa2dw9TT5yuPyrYQDDxpA8YFLAOVadsafmAxhNDhqQgJQ@mail.gmail.com
обсуждение исходный текст
Ответы Re: Finding latest record for a number of groups in an INSERT-only table  (David Johnston <polobo@yahoo.com>)
Re: Finding latest record for a number of groups in an INSERT-only table  (Simon Riggs <simon@2ndQuadrant.com>)
Re: Finding latest record for a number of groups in an INSERT-only table  (Gavin Flower <GavinFlower@archidevsys.co.nz>)
Список pgsql-general
This is basically exactly the same as
http://archives.postgresql.org/pgsql-sql/2008-10/msg00009.php; I'm
just asking again, to see if thinking on the problem has changed:

The basic problem, restated, is one has a relation with tuples like this:

(key, recency_stamp, other_columns*)

Some example data may be:

'ice cream', 0, 'vanilla'
'ice cream', 1, 'chocolate'
'beer', 0, 'ale'
'beer', 3, 'lager'

The goal is to extract from this relation the following:

'ice cream', '1', 'chocolate'
'beer', 3, 'lager'

In my dataset, the distinct number of keys is << the number of rows;
in fact, that ratio grows ever smaller until the data is truncated,
with a convergence point probably resting at a negligible fraction of
a percent -- definitely less than 0.1% in short order, easily getting
to 0.01% and 0.001% in not a vast amount of time.  There is a standard
btree index on (key, recency_stamp), which ideally is exploitable to
compute the 'most recent' relation very quickly.  Approaches I've
investigated include DISTINCT ON (this may be the cleanest), windowing
functions (rank, and a predicate), and a GROUP BY with a self-join,
but none of them avoid a full relation scan, which is just crippling
performance-wise.

The only way I could find to trick the optimizer into doing something
non-crippling was to trick it with
http://wiki.postgresql.org/wiki/Loose_indexscan to find distinct
values quickly and subsequently using a subquery in the target list
that the optimizer decides to run over every distinct value (it
estimates a very high cost for this, even though it is actually rather
fast, but the ridiculous costing can subsequent joins against the
relation in unfortunate ways).  This mechanism is obscure enough that
I may just write a plpgsql function as a workaround, as that may well
be more lucid.

I am using PostgreSQL 9.

Does anyone have fresh thoughts or suggestions for dealing with
INSERT-mostly tables conceived in this manner?

--
fdr

В списке pgsql-general по дате отправления:

Предыдущее
От: "Jonathan Brinkman"
Дата:
Сообщение: Read MS-SQL data into Postgres via ODBC link?
Следующее
От: "Brent Wood"
Дата:
Сообщение: Re: Read MS-SQL data into Postgres via ODBC link?