Re: record identical operator

From Kevin Grittner
Subject Re: record identical operator
Date
Msg-id 1380035944.53036.YahooMailNeo@web162905.mail.bf1.yahoo.com
Whole thread Raw message
In reply to Re: record identical operator  (Stephen Frost <sfrost@snowman.net>)
Responses Re: record identical operator  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
Stephen Frost <sfrost@snowman.net> wrote:
> Kevin Grittner (kgrittn@ymail.com) wrote:
>> Stephen Frost <sfrost@snowman.net> wrote:
>> > I worry that adding these will come back to bite us later
>>
>> How?
>
> User misuse is certainly one consideration,

I think that one has been talked to death already, with the
consensus that we should use different operator names and document
them as a result.  Any comparison operator can be misused.  Andres
said he has seen the operators from text_pattern_ops misused in
production, but we have not removed, for example, ~>=~ from our
text operators as a result.
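
To make that concrete (a minimal sketch; the collation name is
platform-dependent and only illustrative), the text_pattern_ops
operators compare byte-wise and can disagree with the
collation-aware ones:

  SELECT 'a' < 'B' COLLATE "en_US";  -- true under a linguistic collation
  SELECT 'a' ~<~ 'B';                -- false: byte-wise, 'a' (0x61) sorts after 'B' (0x42)

Nobody proposes removing those operators just because someone might
reach for them when the collation-aware operator was wanted.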

> but I wonder what's going to
> happen if we change our internal representation of data (eg: numerics
> get changed again), or when incremental matview maintenance happens and
> we start looking at subsets of rows instead of the entire query.  Will
> the first update of a matview after a change to numeric's internal data
> structure cause the entire thing to be rewritten?

Something like that could happen, if the newly generated values,
for example, all took less space than the old.  On the first
REFRESH on the new version of the software it might delete and
insert all rows; then it would stay with the new representation.  I
have trouble seeing that as a problem, since presumably we only
come up with a new representation because it is smaller, faster, or
cures some bug.
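
To sketch why the effect is one-time (table names invented, and I'm
using === as a stand-in for whatever we end up calling the
record-image operator), the concurrent refresh effectively diffs old
contents against new like this:

  -- Pair rows whose whole-row byte images match; anything unmatched
  -- on the matview side is deleted, anything unmatched on the new
  -- side is inserted.
  SELECT mv.ctid, newdata
  FROM mv FULL JOIN newdata ON (mv === newdata)
  WHERE mv IS NULL OR newdata IS NULL;

After a representation change every old row falls out on the delete
side once; the rows written back use the new representation, so the
next refresh matches again.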

>>> and that we're making promises we won't be able to keep.
>>
>> The promise that a concurrent refresh will produce the same set of
>> rows as a non-concurrent one?
>
> The promise that we'll always return the binary representation of the
> data that we saw last.  When greatest(x,y) comes back 'false' for a
> MAX(), we then have to go check "well, does the type consider them
> equal?", because, if the type considers them equal, we then have to
> decide if we should replace x with y anyway, because it's different
> at a binary level.  That's what we're saying we'll always do now.

I'm having a little trouble following that.  The query runs as it
always has, with all the old definitions of comparisons.  After it
is done, we check whether the rows are the same.  The operation of
MAX() will not be affected in any way.  If MAX() returns a value
which is not the same as what the matview has, the matview will be
modified to match what MAX() returned.
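
A concrete case: numeric keeps a display scale in its binary image,
so values the type calls equal can still differ byte-wise:

  SELECT '1.0'::numeric = '1.00'::numeric;  -- true: equal per numeric's = operator
  SELECT '1.0'::numeric::text;              -- '1.0'
  SELECT '1.00'::numeric::text;             -- '1.00': same value, different image

If MAX() starts returning 1.00 where the matview stored 1.0, the
stored row is rewritten to match, even though = never stopped
holding.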

> We're also saying that we'll replace things based on plan differences
> rather than based on if the rows underneath actually changed at all.

Only if the user uses a query which does not produce deterministic
results.

> We could end up with material differences in the result of matviews
> updated through incremental REFRESH and matviews updated through
actual incremental maintenance - and people may *care* about those
> because we've told them (or they discover) they can depend on these
> types of changes to be reflected in the result.

Perhaps we should document the recommendation that people not
create materialized views with non-deterministic results, but I
don't think that should be a hard restriction.  For example, I
could see someone creating a materialized view to pick a random
sample from a jury pool to be on the jury panel for today (from
which juries are selected on that day), and not want that to be
predictable and not want it to change until the next refresh of the
panel matview.  (That would not be my first design choice, but it
would not be a horrid choice if there were sufficient logging of
the REFRESH statements in a place the user could not modify.)
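
As a sketch of that kind of view (all names invented for
illustration):

  CREATE MATERIALIZED VIEW todays_panel AS
    SELECT juror_id, name
    FROM jury_pool
    ORDER BY random()
    LIMIT 40;
  -- The sample stays frozen until the next
  -- REFRESH MATERIALIZED VIEW todays_panel;
  -- re-running the defining query would pick a different set each time.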

>>> Trying to do this incremental-but-not-really maintenance where
>>> the whole query is run but we try to skimp on what's actually
>>> getting updated in the matview is a premature optimization, imv,
>>> and one which may be less performant and more painful, with more
>>> gotchas and challenges for our users, to deal with in the long
>>> run.
>>
>> I have the evidence of a ten-fold performance improvement plus
>> minimized WAL and replication work on my side.  What evidence do
>> you have to back your assertions?  (Don't forget to work in bloat
>> and vacuum truncation issues to the costs of your proposal.)
>
> I don't doubt that there are cases in both directions and I'm not trying
> to argue that it'd always be faster, but I doubt it's always slower.

Sure.  We provide a way to support those cases, although it
involves blocking.  So far, even the tests I expected to be faster
with heap replacement have come out marginally slower that way than
with CONCURRENTLY, due to index activity; but I have no doubt cases
can be constructed that go the other way.
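
For reference, the two forms under comparison:

  REFRESH MATERIALIZED VIEW mv;               -- replaces the heap wholesale; blocks readers
  REFRESH MATERIALIZED VIEW CONCURRENTLY mv;  -- diffs old against new and applies
                                              -- deletes/inserts; needs a unique index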

> I'm surprised that you had a case where the query was apparently quite
> fast yet the data set hardly changed and resulted in a very large result
> but I don't doubt that it happened.

The history of all events in the county (tens of millions of
records in Milwaukee County) needs to be scanned to generate the
status of cases and the date of last activity, and all future
calendar events must be scanned to see what is scheduled.  This is
compared to tables setting standards for how quickly different
types of cases should be handled and how long a case should go
without some activity.  The results are used to provide a dashboard
for judges to see where things are slipping, and drill down to
detail so they can get a stalled case moving again.  On any one day
a relatively small number of cases change status or cross time
boundaries which make them need attention.
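
Roughly the shape of it, with entirely invented table and column
names, just to show the access pattern:

  CREATE MATERIALIZED VIEW case_dashboard AS
    SELECT c.case_no,
           c.case_type,
           max(e.event_date) AS last_activity,
           min(f.event_date) AS next_scheduled
    FROM cases c
    LEFT JOIN case_events e ON e.case_no = c.case_no
    LEFT JOIN calendar_events f ON f.case_no = c.case_no
                                AND f.event_date > current_date
    GROUP BY c.case_no, c.case_type;

Each refresh recomputes all of that, but only a handful of the
resulting rows differ from the previous day's.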

I really look forward to the time when we can support these queries
with incremental maintenance in matviews.  I don't expect that to
be until maybe version 10.3, though.  Not only will it eliminate a
big batch job each night, but judges will be able to see their work
reflected in the dashboard on the same day.

> What I was trying to get at is really that the delete/insert
> approach would be good enough in very many cases and it wouldn't
> have what look, to me anyway, as some pretty ugly warts around
> these cases.

I guess we could add a "DELETE everything and INSERT the new
version of everything" option for REFRESH in addition to what is
there now, but I would be very reluctant to use it as a
replacement.

--
Kevin Grittner
EDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


