Re: SQL:2011 application time

Поиск
Список
Период
Сортировка
От Paul Jungwirth
Тема Re: SQL:2011 application time
Дата
Msg-id f061295a-d972-42b9-bf1b-90a833e3f686@illuminatedcomputing.com
обсуждение исходный текст
Ответ на Re: SQL:2011 application time  (Isaac Morland <isaac.morland@gmail.com>)
Ответы Re: SQL:2011 application time
Список pgsql-hackers
On 5/21/24 11:27, Isaac Morland wrote:
> On Tue, 21 May 2024 at 13:57, Robert Haas <robertmhaas@gmail.com <mailto:robertmhaas@gmail.com>> wrote:
> 
>     What I think is less clear is what that means for temporal primary
>     keys. As Paul pointed out upthread, in every other case, a temporal
>     primary key is at least as unique as a regular primary key, but in
>     this case, it isn't. And someone might reasonably think that a
>     temporal primary key should exclude empty ranges just as all primary
>     keys exclude nulls. Or they might think the opposite.
> 
> 
> Fascinating. I think you're absolutely right that it's clear that two empty intervals don't 
> conflict. If somebody wants to claim two intervals conflict, they need to point to at least one 
> instant in time that is common between them.
> 
> But a major point of a primary key, it seems to me, is that it uniquely identifies a row. If items 
> are identified by a time range, non-overlapping or not, then the empty range can only identify one 
> item (per value of whatever other columns are in the primary key). I think for a unique key the 
> non-overlapping restriction has to be considered an additional restriction on top of the usual 
> uniqueness restriction.
> 
> I suspect in many applications there will be a non-empty constraint; for example, it seems quite 
> reasonable to me for a meeting booking system to forbid empty meetings. But when they are allowed 
> they should behave in the mathematically appropriate way.

Finding a way forward for temporal PKs got a lot of discussion at pgconf.dev (thanks especially to 
Peter Eisentraut and Jeff Davis!), so I wanted to summarize some options and describe what I think 
is the best approach.

First the problem: empty ranges! A temporal PK/UNIQUE constraint is basically an exclusion 
constraint that is `(id WITH =, valid_at WITH &&)`. But the special 'empty' value never overlaps 
anything, *including itself*. (Note it has no "position": [3,3) is the same as [4,4).) Since the 
exclusion constraint forbids overlapping ranges, and empties never overlap, your table can have 
duplicates. (I'm talking about "literal uniqueness" as discussed in [1].) For instance:

     CREATE EXTENSION btree_gist;
     CREATE TABLE t (id int, valid_at daterange, name text);
     ALTER TABLE t ADD CONSTRAINT tpk PRIMARY KEY (id, valid_at WITHOUT OVERLAPS);
     INSERT INTO t VALUES (1, 'empty', 'foo');
     INSERT INTO t VALUES (1, 'empty', 'bar');

Multiranges have the same problem. So what do we do about that?

**Option 0**: Allow it but document it. It shouldn't happen in practice: there is no reason for an 
empty range to get into a temporal table, and it arguably doesn't mean anything. The record is true 
at no time? But of course it will happen anyway. It's a footgun and will break expectations for at 
least some.

It causes problems for us too. If you say `SELECT name FROM t GROUP BY id, valid_at`, we recognize 
that `name` is a functional dependency on the PK, so we allow it and give you the first row matching 
each key. You might get "foo" or you might get "bar". Also the planner uses not-nullable uniqueness 
to take many shortcuts. I couldn't create any concrete breakage there, but I bet someone else could. 
PKs that are not literally unique seems like something that would cause headaches for years.

**Option 1**: Temporal PKs should automatically create a CHECK constraint that forbids empty ranges. 
Should UNIQUE constraints too? I'm tempted to say no, since sometimes users surprise us by coming up 
with new ways to use things. For instance one way to use empty ranges is to reference a temporal 
table from a non-temporal table, since `'empty' <@ anything` is always true (though this has 
questionable meaning or practical use). But probably we should forbid empties for UNIQUE constraints 
too. Forbidding them is more aligned with the SQL standard, which says that when you have a PERIOD, 
startcol < endcol (not <=). And it feels more consistent to treat both constraints the same way. 
Finally, if UNIQUEs do allow empties, we still risk confusing our planner.

My last patch created these CHECK constraints for PKs (but not UNIQUEs) as INTERNAL dependencies. 
It's pretty clunky. There are lots of cases to handle, e.g. `ALTER COLUMN c TYPE` may reuse the PK 
index or may generate a new one. And what if the user already created the same constraint? Seeing 
all the trouble giving PKs automatic (cataloged) NOT NULL constraints makes me wary about this 
approach. It's not as bad, since there is no legacy, but it's still more annoying than I expected.

Finally, hanging the CHECK constraint off the PK sets us up for problems when we add true PERIODs. 
Under 11.27 of SQL/Foundation, General Rules 2b says that defining a PERIOD should automatically add 
a CHECK constraint that startcol < endcol. That is already part of my last patch in this series. But 
that would be redundant with the constraint from the PK. And attaching the constraint to the PERIOD 
is a lot simpler than attaching it to the PK.

**Option 2**: Add a new operator, called &&&, that works like && except an empty range *does* 
overlap another empty range. Empty ranges should still not overlap anything else. This would fix the 
exclusion constraint. You could add `(5, 'empty')` once but not twice. This would allow empties to 
people who want to use them. (We would still forbid them if you define a PERIOD, because those come 
with the CHECK constraint mentioned above.)
And there is almost nothing to code. But it is mathematically suspect to say an empty range overlaps 
something small (something with zero width) but not something big. Surely if a && b and b <@ c, then 
a && c? So this feels like the kind of elegant hack that you eventually regret.

**Option 3**: Forbid empties, not as a reified CHECK constraint, but just with some code in the 
executor. Again we could do just PKs or PKs and UNIQUEs. Let's do both, for all the reasons above. 
Not creating a CHECK constraint is much less clunky. There is no catalog entry to create/drop. Users 
don't wonder where it came from when they say `\d t`. It can't conflict with constraints of their 
own. We would enforce this in ExecConstraints, where we enforce NOT NULL and CHECK constraints, for 
any table with constraints where conperiod is true. We'd also need to do this check on existing rows 
when you create a temporal PK/UQ. This option also requires a new field in pg_class: just as we have 
relchecks, relhasrules, relhastriggers, etc. to let us skip work in the relcache, I assume we'd want 
relperiods.

**Option 4**: Teach GiST indexes to enforce uniqueness. We didn't discuss this at pgconf, at least 
not in reference to the empties problem. But I was thinking about this request from Matthias for 
temporal PKs & UQs to support `USING INDEX idx`.[2] It is confusing that a temporal index has 
indisunique, but if you try to create a unique GiST index directly we say they don't support UNIQUE 
indexes! Similarly `pg_indexam_has_property(783, 'can_unique')` returns false. There is something 
muddled about all that. So how about we give the GiST AM handler amcanunique?

As I understand it, GiST indexes are capable of uniqueness,[3] and indeed today you can create an 
exclusion constraint with the same effect, but in the past the core had no way of asking an opclass 
which operator gave equality. With the stratnum support proc from 6db4598fcb (part of this patch 
series, but reverted from v17), we could get a known operator for "equals". If the index's opclasses 
had that sproc and it gave non-zero for RTEqualStrategyNumber, then CREATE UNIQUE INDEX would 
succeed. We would just ("just") need to make GiST raise an error if it found a duplicate. And if 
*that* was happening, the empty ranges wouldn't cause a problem.

I think Option 3 is good, but I like Option 4 a lot because (1) it doesn't assume ranges & 
multiranges (2) it allows empties if users have some reason for them (3) since the real problem is 
duplicates, forbidding them is a more precise solution, (4) it clears up the confusing situation of
GiST not being canunique, even though you can create an index with indisunique.

OTOH it is probably more work, and it is slower than just forbidding duplicates. (The unique check 
requires a separate index search, according to [3], as an exclusion constraint would do.) Also if we 
do it to make GiST be canunique, that can happen separately from the temporal work.

So I'm proceeding with Option 3, which at worst can eventually become an optimization for Option 4. 
I don't think forbidding empty ranges is a great loss to be honest. But if anyone has any feedback, 
please share: ojections, alternatives, advice---all is welcome.

[1] 
https://www.postgresql.org/message-id/47550967-260b-4180-9791-b224859fe63e%40illuminatedcomputing.com
[2] 
https://www.postgresql.org/message-id/CAEze2Wh21V66udM8cbvBBsAgyQ_5x9nfR0d3sWzbmZk%2B%2Bey7xw%40mail.gmail.com
[3] https://dsf.berkeley.edu/papers/sigmod97-gist.pdf

Yours,

-- 
Paul              ~{:-)
pj@illuminatedcomputing.com



В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: [multithreading] extension compatibility
Следующее
От: Paul Jungwirth
Дата:
Сообщение: Re: SQL:2011 application time