Обсуждение: [feature] cached index to speed up specific queries on extremely large data sets

Поиск
Список
Период
Сортировка

[feature] cached index to speed up specific queries on extremely large data sets

От
"lkcl ."
Дата:
hi folks, please cc me direct on responses as i am subscribed on digest.

i've been asked to look at how to deal with around 7 billion records
(appx 30 columns, appx data size total 1k) and this might have to be
in a single system (i will need to Have Words with the client about
that).  the data is read-only and an arbitrary number of additional
tables may be created to "manage" the data.  records come in at a rate
of around 25 million per day, the 7 billion records is based on the
assumption of keeping one month's worth of data around.

analysis of this data needs to be done across the entire set: i.e. it
may not be subdivided into isolated tables (by day for example).  i am
therefore um rather concerned about efficiency, even just from the
perspective of using 2nd normalised form and not doing JOINs against
other tables.

so i had an idea.  there already exists the concept of indexes.  there
already exists the concept of "cached queries".  question: would it be
practical to *merge* those two concepts such that specific queries
could be *updated* as new records are added, such that when the query
is called again it answers basically pretty much immediately? let us
assume that performance degradation on "update" (given that indexes
already exist and are required to be updated) is acceptable.

the only practical way (without digging into postgresql's c code) to
do this at a higher level would be in effect to abandon the advantages
of the postgresql query optimisation engine and *reimplement* it in a
high-level language, subdividing the data into smaller (more
manageable) tables, using yet more tables to store intermediate
results of a previous query then somehow managing to stitch together a
new response based on newer packets.  it would be a complete nightmare
to both implement and maintain.

second question then based on whether the first is practical: is there
anyone who would be willing (assuming it can be arranged) to engage in
a contract to implement the required functionality?

thanks,

l.



Re: [feature] cached index to speed up specific queries on extremely large data sets

От
Heikki Linnakangas
Дата:
On 04/11/2014 03:20 PM, lkcl . wrote:
> so i had an idea.  there already exists the concept of indexes.  there
> already exists the concept of "cached queries".  question: would it be
> practical to*merge*  those two concepts such that specific queries
> could be*updated*  as new records are added, such that when the query
> is called again it answers basically pretty much immediately? let us
> assume that performance degradation on "update" (given that indexes
> already exist and are required to be updated) is acceptable.

I think you just described materialized views. The built-in materialized 
views in PostgreSQL are not updated immediately as the tables are 
modified, but it's entirely possible to roll your own using views and 
triggers. There are a few links on the PostgreSQL wiki, in the "Versions 
before 9.3" section: https://wiki.postgresql.org/wiki/Materialized_Views.

- Heikki



Re: [feature] cached index to speed up specific queries on extremely large data sets

От
"lkcl ."
Дата:
On Fri, Apr 11, 2014 at 1:26 PM, Heikki Linnakangas
<hlinnakangas@vmware.com> wrote:
> On 04/11/2014 03:20 PM, lkcl . wrote:
>>
>> so i had an idea.  there already exists the concept of indexes.  there
>> already exists the concept of "cached queries".  question: would it be
>> practical to*merge*  those two concepts such that specific queries
>> could be*updated*  as new records are added, such that when the query
>>
>> is called again it answers basically pretty much immediately? let us
>> assume that performance degradation on "update" (given that indexes
>> already exist and are required to be updated) is acceptable.
>
>
> I think you just described materialized views.
... well... dang :)
http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
ok so definitely not the snapshot materialised views, but yes!  the
eager materialised views, definitely.

> The built-in materialized
> views in PostgreSQL are not updated immediately as the tables are modified,
... but that would probably be enough.

> but it's entirely possible to roll your own using views and triggers. There
> are a few links on the PostgreSQL wiki, in the "Versions before 9.3"
> section: https://wiki.postgresql.org/wiki/Materialized_Views.
awesome.  uhhh, well that was easy *lol*.  once i am paid, whom do i
send the payment to for the fast response and incredibly valuable
information? :)  [this is a serious question!]

l.



Re: [feature] cached index to speed up specific queries on extremely large data sets

От
Michael Paquier
Дата:
On Fri, Apr 11, 2014 at 9:53 PM, lkcl . <luke.leighton@gmail.com> wrote:
> On Fri, Apr 11, 2014 at 1:26 PM, Heikki Linnakangas
> <hlinnakangas@vmware.com> wrote:
>> On 04/11/2014 03:20 PM, lkcl . wrote:
>>>
>>> so i had an idea.  there already exists the concept of indexes.  there
>>> already exists the concept of "cached queries".  question: would it be
>>> practical to*merge*  those two concepts such that specific queries
>>> could be*updated*  as new records are added, such that when the query
>>>
>>> is called again it answers basically pretty much immediately? let us
>>> assume that performance degradation on "update" (given that indexes
>>> already exist and are required to be updated) is acceptable.
>>
>>
>> I think you just described materialized views.
>
>  ... well... dang :)
>
>  http://tech.jonathangardner.net/wiki/PostgreSQL/Materialized_Views
>
>  ok so definitely not the snapshot materialised views, but yes!  the
> eager materialised views, definitely.
>
>> The built-in materialized
>> views in PostgreSQL are not updated immediately as the tables are modified,
>
>  ... but that would probably be enough.
>
>> but it's entirely possible to roll your own using views and triggers. There
>> are a few links on the PostgreSQL wiki, in the "Versions before 9.3"
>> section: https://wiki.postgresql.org/wiki/Materialized_Views.
When updating a materialized view, or refreshing it, you need as well
to be aware that an exclusive lock is taken on it during the refresh
in 9.3, so the materialized view cannot be accessed for read queries.

>  awesome.  uhhh, well that was easy *lol*.  once i am paid, whom do i
> send the payment to for the fast response and incredibly valuable
> information? :)  [this is a serious question!]
This can be helpful:
http://www.postgresql.org/about/donate/
-- 
Michael



Re: [feature] cached index to speed up specific queries on extremely large data sets

От
"lkcl ."
Дата:
On Fri, Apr 11, 2014 at 2:12 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Fri, Apr 11, 2014 at 9:53 PM, lkcl . <luke.leighton@gmail.com> wrote:
>>> section: https://wiki.postgresql.org/wiki/Materialized_Views.
> When updating a materialized view, or refreshing it, you need as well
> to be aware that an exclusive lock is taken on it during the refresh
> in 9.3, so the materialized view cannot be accessed for read queries.
ok.  as long as the storage of data (in the underlying table) is not
adversely affected, that would be fine.  as this is the hackers list
and this turns out to be more a user question i'll leave it for now.

>>  awesome.  uhhh, well that was easy *lol*.  once i am paid, whom do i
>> send the payment to for the fast response and incredibly valuable
>> information? :)  [this is a serious question!]
> This can be helpful:
> http://www.postgresql.org/about/donate/
thank you michael.