Обсуждение: Idea: GSoC - Query Rewrite with Materialized Views

Поиск
Список
Период
Сортировка

Idea: GSoC - Query Rewrite with Materialized Views

От
Eric Grinstein
Дата:
Hello Everyone,

I'm Eric Grinstein, an Comp. Engineering undergrad at PUC-Rio, Brazil.
I'm very excited about contributing to Postgres, and thought GSoC would be 
a very good program to get me started. 

Some commercial RDBMS such as oracle implement a feature called 'Query Rewrite'
using materialized views. When a query is being processed, the system checks whether a materialized view created on a similar query could be useful for the query. If so, it rewrites it so it can take advantage of the MV, usually dispensing a lot of calculation. 

I think the implementation of the feature would be a nice project to work on, and very benefical to Postgres. Many people (including myself) miss the feature on it, and 

So, could you give me your opinion about it? If I made a proposal, would the community's response be positive? If you have things to add, new ideas, I'd be very excited to hear them.

Thank you for your time! Hope to make my first patch soon!

Eric


Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Kevin Grittner
Дата:
Eric Grinstein <eric@aluno.puc-rio.br> wrote:

> I'm very excited about contributing to Postgres, and thought GSoC
> would be a very good program to get me started.

Great!

> Some commercial RDBMS such as oracle implement a feature called
> 'Query Rewrite' using materialized views. When a query is being
> processed, the system checks whether a materialized view created on
> a similar query could be useful for the query. If so, it rewrites
> it so it can take advantage of the MV, usually dispensing a lot of
> calculation.

Oracle, MS SQL Server, Sybase ASE, and IBM DB2 all have this.
(There may be others.)  In essence they treat an MV a bit like an
index, as something you can create to speed up an existing query
without rewriting it.  It would certainly be nice to have this in
PostgreSQL, too, in my opinion.

> So, could you give me your opinion about it? If I made a proposal,
> would the community's response be positive? If you have things to
> add, new ideas, I'd be very excited to hear them.

That seems extraordinarily difficult for a GSoC project.  Unless
you can demonstrate mastery of the concepts involved in such
optimizations, and a familiarity with the PostgreSQL planner, with
a plan to put forward for how you would do this I think you should
set a more modest goal.  Perhaps you could find something to work
on related to the planner that is small enough to be achieved in
the limited time allowed for a GSoC project, that would move you
closer to taking on something this big.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Tomas Vondra
Дата:
On 20.2.2015 22:45, Kevin Grittner wrote:
> 
> Oracle, MS SQL Server, Sybase ASE, and IBM DB2 all have this. (There
> may be others.)  In essence they treat an MV a bit like an index, as
> something you can create to speed up an existing query without
> rewriting it.  It would certainly be nice to have this in PostgreSQL,
> too, in my opinion.

Yeah. The trouble is indexes are up-to-date, but MVs may not be - there
might be changes since the last REFRESH, which makes the rewrite more
complex. We don't want to use stale MVs for the rewrite, so we'd have to
identify the stale MVs somehow - AFAIK we don't have a flag for that.

> That seems extraordinarily difficult for a GSoC project.  Unless you
> can demonstrate mastery of the concepts involved in such 
> optimizations, and a familiarity with the PostgreSQL planner, with a
> plan to put forward for how you would do this I think you should set
> a more modest goal.  Perhaps you could find something to work on
> related to the planner that is small enough to be achieved in the
> limited time allowed for a GSoC project, that would move you closer
> to taking on something this big.

I share the view that this would be very valuable, but the scope far
exceeds what can be done within a single GSoC project. But maybe we
could split that into multiple pieces, and Eric would implement only the
first piece?

For example the 'is_stale' flag for a MV would be really useful, making
it possible to refresh only the MVs that actually need a refresh.

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Kevin Grittner
Дата:
Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:

> I share the view that this would be very valuable, but the scope
> far exceeds what can be done within a single GSoC project. But
> maybe we could split that into multiple pieces, and Eric would
> implement only the first piece?
>
> For example the 'is_stale' flag for a MV would be really useful,
> making it possible to refresh only the MVs that actually need a
> refresh.

You may be on to something there.  Frankly, though, I'm not sure
that we could even reach consensus within the community on a
detailed design for how we intend to track staleness (that will
hold up both now and once we have incremental maintenance of
materialized views working) within the time frame of a GSoC
project.  This would need to be done with an eye toward how it
might be used in direct references (will we allow a "staleness
limit" on a reference from a query?), for use in a rewrite, and how
it will interact with changes to base tables and with both REFRESH
statements and incremental maintenance at various levels of
"eagerness".  I'm not sure that staleness management wouldn't be
better left until we have some of those other parts for it to work
with.

Questions to consider:

Some other products allow materialized views to be partitioned and
staleness to be tracked by partition, and will check which
partitions will be accessed in determining staleness.  Is that
something we want to allow for?

Once we have incremental maintenance, an MV maintained in an
"eager" fashion (changes are visible in the MV as soon as the
transaction modifying the underlying table commit) could be
accessed with a MVCC snapshots, with different snapshots seeing
different versions.  It seems pretty clear that such an MV would
always be considered "fresh", so there would be no need to
constantly flipping to stale and back again as the underlying table
were changed and the changes were reflected in the MV.  How do we
handle that?

If changes to an MV are less eager (they are queued for application
after COMMIT, as time permits) would we want to track the xid of
how far along they are, so that we can tell whether a particular
snapshot is safe to use?  Do we want to allow a non-MVCC snapshot
that shows the latest version of each row? Only if staleness is
minimal?

What about MVs which don't have incremental maintenance?  We can
still determine what xid they are current "as of", from the
creation or the latest refresh.  Do we want to track that instead
of a simple boolean flag?

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Tomas Vondra
Дата:
On 21.2.2015 00:20, Kevin Grittner wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
> 
>> I share the view that this would be very valuable, but the scope
>> far exceeds what can be done within a single GSoC project. But
>> maybe we could split that into multiple pieces, and Eric would
>> implement only the first piece?
>>
>> For example the 'is_stale' flag for a MV would be really useful,
>> making it possible to refresh only the MVs that actually need a
>> refresh.
> 
> You may be on to something there.  Frankly, though, I'm not sure
> that we could even reach consensus within the community on a
> detailed design for how we intend to track staleness (that will
> hold up both now and once we have incremental maintenance of
> materialized views working) within the time frame of a GSoC
> project.  This would need to be done with an eye toward how it
> might be used in direct references (will we allow a "staleness
> limit" on a reference from a query?), for use in a rewrite, and how
> it will interact with changes to base tables and with both REFRESH
> statements and incremental maintenance at various levels of
> "eagerness".  I'm not sure that staleness management wouldn't be
> better left until we have some of those other parts for it to work
> with.

Doing that properly is going to be nontrivial, no doubt about that. I
was thinking about keeping a simple list of updated tables (oids) and
then at commit time, deciding which MVs to depend on that and setting
some sort of flag (or XID) for all those MVs. But maybe there's a better
way.

> Questions to consider:
> 
> Some other products allow materialized views to be partitioned and 
> staleness to be tracked by partition, and will check which partitions
> will be accessed in determining staleness. Is that something we want
> to allow for?

I think we need to get this working for simple MVs, especially because
we don't have partitioned MVs (or the type of declarative partitioning
the other products do have).

> Once we have incremental maintenance, an MV maintained in an "eager"
> fashion (changes are visible in the MV as soon as the transaction
> modifying the underlying table commit) could be accessed with a MVCC
> snapshots, with different snapshots seeing different versions. It
> seems pretty clear that such an MV would always be considered
> "fresh", so there would be no need to constantly flipping to stale
> and back again as the underlying table were changed and the changes
> were reflected in the MV. How do we handle that?

Yes, incrementally updated MVs might be used more easily, without
tracking staleness. But we don't have that now, and it's going to take a
significant amount of time to get there.

Also, not all MVs can be updated incrementally, so either we allow only
simple MVs to be used for rewrites, or we'll have to implement the
'stale' flag anyway.

> If changes to an MV are less eager (they are queued for application 
> after COMMIT, as time permits) would we want to track the xid of how
> far along they are, so that we can tell whether a particular snapshot
> is safe to use? Do we want to allow a non-MVCC snapshot that shows
> the latest version of each row? Only if staleness is minimal?

Maybe. When I talk about 'flag' I actually mean a simple way to
determine whether the MV is up-to-date or not. Snapshots and XIDs are
probably the right way to do that in MVCC-based system.

> What about MVs which don't have incremental maintenance? We can still
> determine what xid they are current "as of", from the creation or the
> latest refresh. Do we want to track that instead of a simple boolean
> flag?

How would we use the 'as of' XID? IMHO it's unacceptable to quietly use
stale data unless the user explicitly references the MV, so we'd have to
assume we can't use that MV.

regards

-- 
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Eric Grinstein
Дата:
Thank you for your answers.
I would be very interested in tracking the staleness of the MV.
You see, I work in a research group in database tuning, and we have
implemented some solutions to take advantage of MV's and speed up queries.
The query rewrite feature would be extremely desirable for us. 
Do you think that implementing the staleness check as suggested by Thomas 
could get us started in the query rewrite business? Do you think I should make a proposal 
or there are more interesting subjects to GSoC? I'd be happy to hear project suggestions, especially
related to the optimizer, tuning, etc.

Eric

2015-02-20 22:35 GMT-02:00 Tomas Vondra <tomas.vondra@2ndquadrant.com>:
On 21.2.2015 00:20, Kevin Grittner wrote:
> Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
>
>> I share the view that this would be very valuable, but the scope
>> far exceeds what can be done within a single GSoC project. But
>> maybe we could split that into multiple pieces, and Eric would
>> implement only the first piece?
>>
>> For example the 'is_stale' flag for a MV would be really useful,
>> making it possible to refresh only the MVs that actually need a
>> refresh.
>
> You may be on to something there.  Frankly, though, I'm not sure
> that we could even reach consensus within the community on a
> detailed design for how we intend to track staleness (that will
> hold up both now and once we have incremental maintenance of
> materialized views working) within the time frame of a GSoC
> project.  This would need to be done with an eye toward how it
> might be used in direct references (will we allow a "staleness
> limit" on a reference from a query?), for use in a rewrite, and how
> it will interact with changes to base tables and with both REFRESH
> statements and incremental maintenance at various levels of
> "eagerness".  I'm not sure that staleness management wouldn't be
> better left until we have some of those other parts for it to work
> with.

Doing that properly is going to be nontrivial, no doubt about that. I
was thinking about keeping a simple list of updated tables (oids) and
then at commit time, deciding which MVs to depend on that and setting
some sort of flag (or XID) for all those MVs. But maybe there's a better
way.

> Questions to consider:
>
> Some other products allow materialized views to be partitioned and
> staleness to be tracked by partition, and will check which partitions
> will be accessed in determining staleness. Is that something we want
> to allow for?

I think we need to get this working for simple MVs, especially because
we don't have partitioned MVs (or the type of declarative partitioning
the other products do have).

> Once we have incremental maintenance, an MV maintained in an "eager"
> fashion (changes are visible in the MV as soon as the transaction
> modifying the underlying table commit) could be accessed with a MVCC
> snapshots, with different snapshots seeing different versions. It
> seems pretty clear that such an MV would always be considered
> "fresh", so there would be no need to constantly flipping to stale
> and back again as the underlying table were changed and the changes
> were reflected in the MV. How do we handle that?

Yes, incrementally updated MVs might be used more easily, without
tracking staleness. But we don't have that now, and it's going to take a
significant amount of time to get there.

Also, not all MVs can be updated incrementally, so either we allow only
simple MVs to be used for rewrites, or we'll have to implement the
'stale' flag anyway.

> If changes to an MV are less eager (they are queued for application
> after COMMIT, as time permits) would we want to track the xid of how
> far along they are, so that we can tell whether a particular snapshot
> is safe to use? Do we want to allow a non-MVCC snapshot that shows
> the latest version of each row? Only if staleness is minimal?

Maybe. When I talk about 'flag' I actually mean a simple way to
determine whether the MV is up-to-date or not. Snapshots and XIDs are
probably the right way to do that in MVCC-based system.

> What about MVs which don't have incremental maintenance? We can still
> determine what xid they are current "as of", from the creation or the
> latest refresh. Do we want to track that instead of a simple boolean
> flag?

How would we use the 'as of' XID? IMHO it's unacceptable to quietly use
stale data unless the user explicitly references the MV, so we'd have to
assume we can't use that MV.

regards

--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Kevin Grittner
Дата:
Eric Grinstein <eric@aluno.puc-rio.br> wrote:

> I would be very interested in tracking the staleness of the MV.
> You see, I work in a research group in database tuning, and we
> have implemented some solutions to take advantage of MV's and
> speed up queries.

Please be sure, right up front, that there are no intellectual
property issues for what you pick.  We're happy to give credit
where credit is due, but we can't accept contributions which have
any license restrictions incompatible with the PostgreSQL License:

http://www.postgresql.org/about/licence/

> The query rewrite feature would be extremely desirable for us.
> Do you think that implementing the staleness check as suggested
> by Thomas could get us started in the query rewrite business?

There are many aspects related to the definition, maintenance, and
use of MVs that need work; it seems to me that many of them can be
pursued in parallel as long as people are communicating.  Staleness
tracking is definitely one aspect that is needed.  If you want to
put forward a proposal for that, which seems to be of a scope that
is possible in the context of GSoC, that would be great.  If there
is any other aspect of the MV "big picture" that you can think of
that you would like to tackle and seems of appropriate scope,
please don't feel constrained to "staleness" as the only possible
project; it was just one suggestion of something that might be of
about the right size.

> Do you think I should make a proposal or there are more
> interesting subjects to GSoC? I'd be happy to hear project
> suggestions, especially related to the optimizer, tuning, etc.

If you can find a project in any of those areas that is of interest
to you and of a scope that allows a deliverable within the time
frame of a GSoC project, go for it!  The optimizer seems like a
pretty challenging place for a first contribution, though; I would
recommend looking back through recent git history at a few
optimizer changes to get a feel for what that looks like.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Jim Nasby
Дата:
On 3/2/15 9:03 AM, Kevin Grittner wrote:
>> The query rewrite feature would be extremely desirable for us.
>> >Do you think that implementing the staleness check as suggested
>> >by Thomas could get us started in the query rewrite business?
> There are many aspects related to the definition, maintenance, and
> use of MVs that need work; it seems to me that many of them can be
> pursued in parallel as long as people are communicating.  Staleness
> tracking is definitely one aspect that is needed.  If you want to
> put forward a proposal for that, which seems to be of a scope that
> is possible in the context of GSoC, that would be great.  If there
> is any other aspect of the MV "big picture" that you can think of
> that you would like to tackle and seems of appropriate scope,
> please don't feel constrained to "staleness" as the only possible
> project; it was just one suggestion of something that might be of
> about the right size.

FWIW, what I would find most useful at this point is a way to get the 
equivalent of an AFTER STATEMENT trigger that provided all changed rows 
in a MV as the result of a statement. That would at least allow people 
do do their own MV refresh work without needing to study the methods for 
identifying how the results of a statement impact what should be in the 
MV. I think even something that just does that in pure SQL/plpgsql would 
be a big step forward, even if we wouldn't want it directly in the codebase.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Eric Grinstein
Дата:
 I think even something that just does that in pure SQL/plpgsql would be a big step forward, even if we wouldn't want it directly in the codebase.

Something like creating a trigger under the hood each time a MV is created, that checks the changed rows on every statement against the query that generated the MV? That also seems feasible, but wouldn't it be rather too small for my three months of GSoC?



2015-03-02 20:22 GMT-03:00 Jim Nasby <Jim.Nasby@bluetreble.com>:
On 3/2/15 9:03 AM, Kevin Grittner wrote:
The query rewrite feature would be extremely desirable for us.
>Do you think that implementing the staleness check as suggested
>by Thomas could get us started in the query rewrite business?
There are many aspects related to the definition, maintenance, and
use of MVs that need work; it seems to me that many of them can be
pursued in parallel as long as people are communicating.  Staleness
tracking is definitely one aspect that is needed.  If you want to
put forward a proposal for that, which seems to be of a scope that
is possible in the context of GSoC, that would be great.  If there
is any other aspect of the MV "big picture" that you can think of
that you would like to tackle and seems of appropriate scope,
please don't feel constrained to "staleness" as the only possible
project; it was just one suggestion of something that might be of
about the right size.

FWIW, what I would find most useful at this point is a way to get the equivalent of an AFTER STATEMENT trigger that provided all changed rows in a MV as the result of a statement. That would at least allow people do do their own MV refresh work without needing to study the methods for identifying how the results of a statement impact what should be in the MV. I think even something that just does that in pure SQL/plpgsql would be a big step forward, even if we wouldn't want it directly in the codebase.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Alvaro Herrera
Дата:
Jim Nasby wrote:

> FWIW, what I would find most useful at this point is a way to get the
> equivalent of an AFTER STATEMENT trigger that provided all changed rows in a
> MV as the result of a statement.

Ah, like
https://www.postgresql.org/message-id/1402790204.65037.YahooMailNeo%40web122301.mail.ne1.yahoo.com

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
David Fetter
Дата:
On Tue, Mar 03, 2015 at 05:49:06PM -0300, Alvaro Herrera wrote:
> Jim Nasby wrote:
> 
> > FWIW, what I would find most useful at this point is a way to get
> > the equivalent of an AFTER STATEMENT trigger that provided all
> > changed rows in a MV as the result of a statement.
> 
> Ah, like
> https://www.postgresql.org/message-id/1402790204.65037.YahooMailNeo%40web122301.mail.ne1.yahoo.com

Yes, very much like that.

Kevin, might you be able to give some guidance on this?

Cheers,
David.
-- 
David Fetter <david@fetter.org> http://fetter.org/
Phone: +1 415 235 3778  AIM: dfetter666  Yahoo!: dfetter
Skype: davidfetter      XMPP: david.fetter@gmail.com

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate



Re: Idea: GSoC - Query Rewrite with Materialized Views

От
Jim Nasby
Дата:
On 3/3/15 3:34 PM, David Fetter wrote:
> On Tue, Mar 03, 2015 at 05:49:06PM -0300, Alvaro Herrera wrote:
>> Jim Nasby wrote:
>>
>>> FWIW, what I would find most useful at this point is a way to get
>>> the equivalent of an AFTER STATEMENT trigger that provided all
>>> changed rows in a MV as the result of a statement.
>>
>> Ah, like
>> https://www.postgresql.org/message-id/1402790204.65037.YahooMailNeo%40web122301.mail.ne1.yahoo.com
>
> Yes, very much like that.

Actually, I was talking about the next step beyond that. I don't want 
what changed in a single table; I want what changed *in the source of 
the entire MV*. Kevin has a whitepaper that describes how to do this in 
set notation; theoretically this is a matter of converting that to SQL. 
IIRC this needs the deltas and current (or maybe NEW and OLD) for every 
table in the MV. So one way you could model this is a function that 
accepts a bunch of NEW and OLD recordsets.

Theoretically you could actually drive that with per-row triggers, but 
the performance would presumably suck. Next best thing would be 
providing NEW and OLD for AFTER STATEMENT triggers (what Kevin was 
talking about in that email). Though, if you're driving this at a 
statement level that means you can't actually reference the MV in a 
statement that's performing DML on any of the dependent tables.

As you can see, this is all pretty involved. Doing just a small part of 
this would make for a good GSoC project. AFTER STATEMENT NEW and OLD 
might be a good project; I don't know how much more work Kevin's stuff 
needs.  But there's much greater value in creating something that would 
take the definition for a MV and turn that into appropriate delta logic. 
That would be the basis for detecting if a MV was stale (beyond just the 
gross level check of were any of the tables involved touched), and is 
what is needed to do *any* kind of incremental update.

That logic doesn't have to be driven by triggers. For example, you could 
have PgQ or similar capture all DML on all tables for a MV and feed that 
data to the delta logic on an async incremental basis. It's pretty easy 
for an end user to setup PgQ or similar but doing the delta logic is 
tightly coupled to the MV definition, which would be very hard for an 
end user to deal with.
-- 
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com