[feature] cached index to speed up specific queries on extremely large data sets

From: lkcl
Subject: [feature] cached index to speed up specific queries on extremely large data sets
Msg-id: CAPweEDzyR923NrEedEUKXS=EdZAkL=bEWB6AN+0sVkoK56o4Vg@mail.gmail.com
Replies: Re: [feature] cached index to speed up specific queries on extremely large data sets  (Heikki Linnakangas <hlinnakangas@vmware.com>)
List: pgsql-hackers
hi folks, please cc me directly on responses as i am subscribed via digest.

i've been asked to look at how to deal with around 7 billion records
(approx 30 columns, approx 1k total per record) and this might have to
be in a single system (i will need to Have Words with the client about
that).  the data is read-only and an arbitrary number of additional
tables may be created to "manage" the data.  records come in at a rate
of around 25 million per day; the 7 billion figure is based on the
assumption of keeping one month's worth of data around.
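
back-of-the-envelope, just to make the scale concrete (using the
figures above, and assuming ~1k really does mean ~1 KB per record):

    -- rough sizing arithmetic from the figures above
    select 7e9  * 1024 / 1e12 as raw_data_tb,     -- ~7.2 TB total
           25e6 * 1024 / 1e9  as daily_ingest_gb; -- ~25.6 GB/day

so several terabytes of raw data before any indexes at all.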

analysis of this data needs to be done across the entire set: i.e. it
may not be subdivided into isolated tables (by day, for example).  i am
therefore rather concerned about efficiency, even just from the
perspective of keeping the data in 2nd normalised form and avoiding
JOINs against other tables.

so i had an idea.  there already exists the concept of indexes.  there
already exists the concept of "cached queries".  question: would it be
practical to *merge* those two concepts, such that the results of
specific queries could be *updated* as new records are added, and the
query, when called again, would return pretty much immediately?  let us
assume that performance degradation on insert (given that indexes
already exist and are required to be updated anyway) is acceptable.
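
to illustrate the kind of thing i mean, here it is done by hand with a
trigger, against hypothetical names (a big_table with a timestamp ts
and an amount amt) -- i.e. exactly what i'd like the engine to maintain
for me automatically:

    -- a pre-computed query result, kept up to date as rows arrive
    create table daily_totals (
        day       date primary key,
        n_records bigint  not null,
        total_amt numeric not null
    );

    create function maintain_daily_totals() returns trigger as $$
    begin
        -- try to fold the new row into the existing day's totals
        update daily_totals
           set n_records = n_records + 1,
               total_amt = total_amt + new.amt
         where day = new.ts::date;
        -- first row of a new day: start a fresh totals row
        if not found then
            insert into daily_totals values (new.ts::date, 1, new.amt);
        end if;
        return new;
    end;
    $$ language plpgsql;

    create trigger trg_daily_totals
        after insert on big_table
        for each row execute procedure maintain_daily_totals();

"select * from daily_totals" then answers instantly regardless of how
many billions of rows are behind it.  (the hand-rolled version has a
race on the first insert of a new day, and has to be rewritten for
every query you care about -- which is rather the point of wanting the
engine to do it.)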

the only practical way (without digging into postgresql's c code) to
do this at a higher level would be, in effect, to abandon the
advantages of the postgresql query optimisation engine and
*reimplement* it in a high-level language: subdividing the data into
smaller (more manageable) tables, using yet more tables to store
intermediate results of a previous query, and then somehow stitching
together a new response that takes the newly-arrived records into
account.  it would be a complete nightmare to both implement and
maintain.
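
for the avoidance of doubt, that nightmare would look something like
this (hypothetical names again): intermediate results stored per day,
plus a hand-rolled final pass over all of them for every query:

    -- intermediate results materialised per day...
    create table rollup_2014_01_01 as
        select customer_id, count(*) as n, sum(amt) as total
          from big_table
         where ts >= '2014-01-01' and ts < '2014-01-02'
         group by customer_id;

    -- ...one such table per day, then every query gets re-stitched:
    select customer_id, sum(n) as n, sum(total) as total
      from (select * from rollup_2014_01_01
            union all
            select * from rollup_2014_01_02
            -- ...edited by hand every time a day rolls over
           ) t
     group by customer_id;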

second question, assuming the first is practical: is there anyone who
would be willing (assuming it can be arranged) to engage in a contract
to implement the required functionality?

thanks,

l.


