Re: Checking join outer relation uniqueness to prevent unnecessary memoization
| From | David Rowley |
|---|---|
| Subject | Re: Checking join outer relation uniqueness to prevent unnecessary memoization |
| Date | |
| Msg-id | CAApHDvp7QAzG77CYc4bo-K=xqHvPRkc8RHmh9UANMp8sG0J4Gw@mail.gmail.com |
| In reply to | Re: Checking join outer relation uniqueness to prevent unnecessary memoization (Jacob Jackson <jej.jackson.08@gmail.com>) |
| Responses | Re: Checking join outer relation uniqueness to prevent unnecessary memoization |
| List | pgsql-hackers |
On Sun, 4 Jan 2026 at 18:05, Jacob Jackson <jej.jackson.08@gmail.com> wrote:
> I have seen the issue pop up a few times when the unique constraint is
> across multiple columns and the join is only on one of those columns
> (e.g. https://www.postgresql.org/message-id/CAAiQw3yBPrCw6ZLeTwVS4QhKDWgJkmmp9LnGPdodxeQmn=kqVg@mail.gmail.com),
> and a constant filter is on the other column. I think what happens is
> that this introduces potential for error into the sample because
> Postgres can now come across more duplicates of the join key than
> expected, reducing n_distinct, or could come across more rows with the
> constant filtered value, thus increasing its predicted frequency (in
> the case I linked, the constant's frequency was stored in the column's
> MCV list), leading to a nonzero hit ratio as the cardinality
> estimation/estcalls > ndistinct. However, I am not certain this is the
> case (while there probably wouldn't need to be much of an asymmetry to
> cause memoization given the high cost in the planner for extra index
> scans, it still seems odd that stats could be off enough to enable
> this because of sampling alone). Maybe there is a statistics bug at
> play? I am not certain.

I've managed to reverse engineer some tables for this based on the query you posted in the other thread. I didn't spend enough time figuring out the exact row counts to insert to get the Memoize plan by default, but I can get it by disabling merge and hash joins. That doesn't mean it's not an issue, as ideally the Memoize costing would realise all lookups are unique and opt not to Memoize, since there would be no repeat lookups.

What seems to be happening is that in estimate_num_groups(), we don't set isunique == true because the unique index search in has_unique_index() only considers very simplistic cases where there's a single-column unique index on the Var that's being looked up.
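To make the limitation concrete, here's a minimal Python sketch, not the actual C function, modelling the simplistic check described above. The table and index shapes are hypothetical; the point is that a multi-column unique index such as UNIQUE (user, item) is ignored for a lookup on item alone, even when a qual like user = 0 pins the other column:

```python
# Hypothetical model of the "single-column unique index only" behaviour
# described above. This is NOT PostgreSQL's has_unique_index(); it only
# illustrates the limitation: a unique index counts solely when its one
# column is the Var being looked up.

def has_unique_index_simplified(indexes, var):
    """indexes: list of (is_unique, columns) pairs, columns a tuple of names."""
    for is_unique, columns in indexes:
        if is_unique and len(columns) == 1 and columns[0] == var:
            return True
    return False

# A hypothetical table with a two-column unique constraint:
indexes = [
    (True, ("user", "item")),   # UNIQUE (user, item)
    (False, ("item",)),         # plain (non-unique) index on item
]

# Multi-column unique index is ignored, so no uniqueness is proven:
print(has_unique_index_simplified(indexes, "item"))            # False
# Only a single-column unique index on the Var itself would qualify:
print(has_unique_index_simplified([(True, ("item",))], "item"))  # True
```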
Later in estimate_num_groups(), because we have some other base quals doing some filtering, the "if (reldistinct > 0 && rel->rows < rel->tuples)" path kicks in to attempt to estimate the number of distinct values accounting for the other qual (user = 0, in this case). That's fairly generic code that only does calculations based on the difference between rel->tuples and rel->rows, so it doesn't know about the guarantees of uniqueness.

For a fix, I suppose has_unique_index() could be expanded to find a unique index which mentions the particular Var and ensure that all other columns in the index have EquivalenceClasses with an ec_has_const, but that would make the function much more expensive than it is today, and it still wouldn't cover cases where there is another parameter in the Memoize lookup that matches one of the other unique index columns.

That makes me think that the UniqueKey stuff might be the solution to these problems, as we'd easily be able to determine which unique properties hold true at any level of the join. That's a very complex patch, however.

David
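If I'm reading that path correctly, the scaling step uses the D' = D * (1 - ((N - n) / N) ^ (N / D)) approximation mentioned in the selfuncs.c comments. A small Python sketch (the numbers are made up for illustration) shows how a stats-derived n_distinct that is even modestly below the true value, once scaled for the filter, drops below the filtered row count, so the planner expects repeat lookups that can never happen:

```python
def filtered_ndistinct(reldistinct, tuples, rows):
    """Sketch of the distinct-value scaling in estimate_num_groups() when a
    qual filters rows below tuples:
        D' = D * (1 - ((N - n) / N) ** (N / D))
    where N = tuples in the relation, n = rows surviving the filter, and
    D = the stats-derived number of distinct values.
    """
    if reldistinct > 0 and rows < tuples:
        reldistinct *= 1 - ((tuples - rows) / tuples) ** (tuples / reldistinct)
    return reldistinct

tuples = 1_000_000.0
rows = 1_000.0  # rows surviving a filter such as user = 0

# Truly unique join key (n_distinct == tuples): estimate stays at ~rows,
# i.e. every lookup value is still distinct after filtering.
print(filtered_ndistinct(1_000_000.0, tuples, rows))  # ~1000

# Sampling underestimated n_distinct by 10x: the scaled estimate falls
# below rows, so Memoize predicts duplicate lookups where none exist.
print(filtered_ndistinct(100_000.0, tuples, rows))    # ~995.5
```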