Re: refresh materialized view concurrently

From: Kevin Grittner
Subject: Re: refresh materialized view concurrently
Date:
Msg-id: 1372342390.73497.YahooMailNeo@web162903.mail.bf1.yahoo.com
In response to: Re: refresh materialized view concurrently  (Hitoshi Harada <umi.tanuki@gmail.com>)
List: pgsql-hackers
Hitoshi Harada <umi.tanuki@gmail.com> wrote:
> Kevin Grittner <kgrittn@ymail.com> wrote:
>> Hitoshi Harada <umi.tanuki@gmail.com> wrote:

>>> As far as I can tell, the overall approach is as follows.
>>>
>>> - create a new temp heap as non-concurrent does, but with
>>> ExclusiveLock on the matview, so that readers wouldn't be
>>> blocked
>>
>> Non-concurrent creates the heap in the matview's tablespace and
>> namespace, so the "temp" part is different in concurrent
>> generation.  This difference is why concurrent can be faster
>> when few rows change.
>
> It's still not clear to me why you need temp in concurrent and
> not in non-concurrent.

Well, temp tables can be in an entirely different tablespace, so
you can't just move the heap of a temp table into place as the new
heap for the matview (as we do for a non-concurrent REFRESH) -- at
least not without potentially copying everything an extra time.

For concurrent we are modifying the existing matview heap, and the
freshly generated set of values, as well as the "diff" table, are
just needed, well, temporarily.  That makes the behavior of temp
tables ideal.  Not only are they not logged, they are potentially
placed on faster tablespaces, and the data written to them might
never be written to disk:

http://www.postgresql.org/docs/9.2/interactive/runtime-config-resource.html#GUC-TEMP-BUFFERS
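
To make that flow concrete, here is a rough, hand-written sketch of
the sequence described above, using ordinary tables as stand-ins
(all of the names are invented for illustration; the patch builds
and runs its own statements internally):

-- "mv" stands in for the matview heap; a unique index on id is assumed.
create table mv (id int primary key, val text);
insert into mv values (1, 'a'), (2, 'b');

-- 1. Run the view's defining query into a temp table (not logged, and
--    possibly never written to disk if it fits in temp_buffers).
create temp table newdata as
  select * from (values (1, 'a'), (2, 'bee'), (3, 'c')) v(id, val);

-- 2. The "diff" is also a temp table: rows that differ between the
--    existing heap and the fresh results, matched on the unique key.
create temp table diff as
  select mv.id as old_id, newdata.*
    from mv
    full join newdata on mv.id = newdata.id
   where (mv.*) is distinct from (newdata.*);

-- 3. Only the changed rows are applied to the existing heap, which is
--    why this can beat a full rebuild when few rows change.
delete from mv where id in (select old_id from diff where id is null);
update mv set val = d.val
  from diff d
 where mv.id = d.id and d.old_id is not null;
insert into mv select id, val from diff where old_id is null;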

> If this type of operation always creates a "temp" table and just
> swaps it with the existing one, why can't we just make it temp
> always?

Because of the potentially different tablespaces.

> And if the performance is the only concern, is temp better than
> just turning off WAL for the table or using an UNLOGGED table?

Yes, it is.

>>> - this operation requires a unique index for performance
>>> reasons (or correctness reasons too)
>>
>> It is primarily for correctness in the face of duplicate rows
>> which have no nulls.  Do you think the reasons need to be better
>> documented with comments?
>
> Ah, yes, even after looking at the patch I was confused about
> whether it was for performance or correctness.

This is the part of the function's comment which attempts to
explain the problem.

 * This join cannot work if there are any
 * duplicated rows in either the old or new versions, in the sense that every
 * column would compare as equal between the two rows.  It does work correctly
 * in the face of rows which have at least one NULL value, with all non-NULL
 * columns equal.  The behavior of NULLs on equality tests and on UNIQUE
 * indexes turns out to be quite convenient here; the tests we need to make
 * are consistent with default behavior.  If there is at least one UNIQUE
 * index on the materialized view, we have exactly the guarantee we need.  By
 * joining based on equality on all columns which are part of any unique
 * index, we identify the rows on which we can use UPDATE without any problem.
 * If any column is NULL in either the old or new version of a row (or both),
 * we must use DELETE and INSERT, since there could be multiple rows which are
 * NOT DISTINCT FROM each other, and we could otherwise end up with the wrong
 * number of occurrences in the updated relation.

I'm open to suggestions on better wording.

As an example of the way the full join gets into trouble with
duplicate rows when used for a diff, see this example:

test=# create table old (c1 int, c2 int);
CREATE TABLE
test=# create table new (c1 int, c2 int);
CREATE TABLE
test=# insert into old values
test-#   (1,1),(1,2),(1,2),(1,2),(1,3),(1,null),(1,null);
INSERT 0 7
test=# insert into new values
test-#   (1,1),(1,2),(1,2),(1,2),(1,2),(1,2),(1,4),(1,null),(1,null),(1,null);
INSERT 0 10

At this point it is clear that we need to add two rows with values
(1,2) and we need to wind up with one more row with values (1,null)
than we already have.  We also need to delete (1,3) and add (1,4).
But full join logic fails to get things right for the case of
duplicate rows with no nulls:

test=# select old, new
test-#   from old
test-#   full join new on old.c1 = new.c1 and old.c2 = new.c2
test-#   where (old.*) is distinct from (new.*);
  old  |  new  
-------+-------
 (1,3) |
 (1,)  |
 (1,)  |
       | (1,4)
       | (1,)
       | (1,)
       | (1,)
(7 rows)
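
For contrast, here is the same sort of diff query when the rows can
be matched on a column covered by a unique index (my own
illustration with made-up tables, not output from the patch):

test=# create table old2 (c1 int primary key, c2 int);
CREATE TABLE
test=# create table new2 (c1 int primary key, c2 int);
CREATE TABLE
test=# insert into old2 values (1,1),(2,2),(3,3);
INSERT 0 3
test=# insert into new2 values (1,1),(2,5),(4,4);
INSERT 0 3
test=# select old2, new2
test-#   from old2
test-#   full join new2 on old2.c1 = new2.c1
test-#   where (old2.*) is distinct from (new2.*);
 old2  | new2
-------+-------
 (2,2) | (2,5)
 (3,3) |
       | (4,4)
(3 rows)

Because each c1 value can appear at most once on each side, the join
pairs old and new rows unambiguously: the (2,2)/(2,5) row can be
handled with an UPDATE, while the other two become a DELETE and an
INSERT.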

> It's a shame we cannot refresh it concurrently if we have
> duplicate rows in the matview.  I thought it would make sense to
> allow it without a unique key if it were only a performance
> tradeoff.

Well, we *could* allow it without a unique index, but the code
would be more complex and significantly slower.  I think we would
still want to handle it the way the current patch does when a
unique index is present, even if we have a way to handle cases
where such an index is not present.  Even if I were convinced it
was worthwhile to support the more general case, I would want to
commit this patch first, and add the more complicated code as a
follow-on patch.

At this point I'm not convinced of the value of supporting
concurrent refresh of materialized views which contain duplicate
rows, nor of the benefit of allowing it to work ten times slower on
matviews without duplicate rows but on which no unique index has
been added, rather than requiring the user to add a unique index if
they want the concurrent refresh.  I'm open to arguments as to why
either of those is a good idea and worth doing ahead of incremental
maintenance of matviews.

--
Kevin Grittner
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


