Обсуждение: Duplicate unique key values in inheritance tables
I came across a query that returned incorrect results and I traced it
down to being caused by duplicate unique key values in an inheritance
table. As a simple example, consider
create table p (a int primary key, b int);
create table c () inherits (p);
insert into p select 1, 1;
insert into c select 1, 2;
select a, b from p;
a | b
---+---
1 | 1
1 | 2
(2 rows)
explain (verbose, costs off)
select a, b from p group by a;
QUERY PLAN
--------------------------------------
HashAggregate
Output: p.a, p.b
Group Key: p.a
-> Append
-> Seq Scan on public.p p_1
Output: p_1.a, p_1.b
-> Seq Scan on public.c p_2
Output: p_2.a, p_2.b
(8 rows)
The parser considers 'p.b' functionally dependent on the group by
column 'p.a' because 'p.a' is identified as the primary key for table
'p'. However, this causes confusion for the executor when determining
which 'p.b' value should be returned for each group. In my case, I
observed that sorted and hashed aggregation produce different results
for the same query.
Reading the doc, it seems that this is a documented limitation of the
inheritance feature that we would have duplicate unique key values in
inheritance tables. Even adding a unique constraint to the children
does not prevent duplication compared to the parent.
As a workaround for this issue, I'm considering whether we can skip
checking functional dependency on primary keys for inheritance
parents, given that we cannot guarantee uniqueness on the keys in this
case. Maybe something like below.
@@ -1421,7 +1427,9 @@ check_ungrouped_columns_walker(Node *node,
Assert(var->varno > 0 &&
(int) var->varno <= list_length(context->pstate->p_rtable));
rte = rt_fetch(var->varno, context->pstate->p_rtable);
- if (rte->rtekind == RTE_RELATION)
+ if (rte->rtekind == RTE_RELATION &&
+ !(rte->relkind == RELKIND_RELATION &&
+ rte->inh && has_subclass(rte->relid)))
{
if (check_functional_grouping(rte->relid,
Any thoughts?
Thanks
Richard
On Tue, 16 Jul 2024 at 12:45, Richard Guo <guofenglinux@gmail.com> wrote:
> As a workaround for this issue, I'm considering whether we can skip
> checking functional dependency on primary keys for inheritance
> parents, given that we cannot guarantee uniqueness on the keys in this
> case. Maybe something like below.
>
> @@ -1421,7 +1427,9 @@ check_ungrouped_columns_walker(Node *node,
> Assert(var->varno > 0 &&
> (int) var->varno <= list_length(context->pstate->p_rtable));
> rte = rt_fetch(var->varno, context->pstate->p_rtable);
> - if (rte->rtekind == RTE_RELATION)
> + if (rte->rtekind == RTE_RELATION &&
> + !(rte->relkind == RELKIND_RELATION &&
> + rte->inh && has_subclass(rte->relid)))
> {
> if (check_functional_grouping(rte->relid,
>
> Any thoughts?
The problem with doing that is that it might mean queries that used to
work no longer work. CREATE VIEW could also fail where it used to
work which could render pg_dumps unrestorable.
Because it's a parser issue, I don't think we can fix it the same way
as a5be4062f was fixed.
I don't have any ideas on what we can do about this right now, but
thought it was worth sharing the above.
David
On Monday, July 15, 2024, David Rowley <dgrowleyml@gmail.com> wrote:
On Tue, 16 Jul 2024 at 12:45, Richard Guo <guofenglinux@gmail.com> wrote:
> As a workaround for this issue, I'm considering whether we can skip
> checking functional dependency on primary keys for inheritance
> parents, given that we cannot guarantee uniqueness on the keys in this
> case.
Because it's a parser issue, I don't think we can fix it the same way
as a5be4062f was fixed.
I don't have any ideas on what we can do about this right now, but
thought it was worth sharing the above.
Add another note to caveats in the docs and call it a feature. We produce a valid answer for the data model encountered. The non-determinism isn’t wrong, it’s just a poorly written query/model with non-deterministic results. Since v15 we have an any_value aggregate - we basically are applying this to the dependent columns implicitly. A bit of revisionist history but I’d rather do that than break said queries. Especially at parse time; I’d be a bit more open to execution-time enforcement if functional dependency on the id turns out to have actually been violated. But people want, and in other products have, any_value implicit aggregation in this situation so it’s hard to say it is wrong even if we otherwise take the position that we will not accept it.
David J.
On Tue, 16 Jul 2024 at 13:28, David G. Johnston <david.g.johnston@gmail.com> wrote: > Add another note to caveats in the docs and call it a feature. We produce a valid answer for the data model encountered. The non-determinism isn’t wrong, it’s just a poorly written query/model with non-deterministic results. Sincev15 we have an any_value aggregate - we basically are applying this to the dependent columns implicitly. A bit of revisionisthistory but I’d rather do that than break said queries. Especially at parse time; I’d be a bit more open to execution-timeenforcement if functional dependency on the id turns out to have actually been violated. But people want,and in other products have, any_value implicit aggregation in this situation so it’s hard to say it is wrong even ifwe otherwise take the position that we will not accept it. I think it might be best just to ignore it and do nothing. Maybe it would be worth putting something into the docs about it if people from userland come complaining about a bug as the doc mention might stop them wasting their time reporting something we already know about. Otherwise, I feel the docs would just draw attention to something that I'd personally rather people didn't do. As you say, using any_value() would be the way we'd encourage people to do it if they don't care which value of the ungrouped column they want, so documenting something else doesn't seem quite right to me. David