Discussion: BUG #19094: select statement on postgres 17 vs postgres 18 is returning different/duplicate results
BUG #19094: select statement on postgres 17 vs postgres 18 is returning different/duplicate results
From: PG Bug reporting form
The following bug has been logged on the website:
Bug reference:      19094
Logged by:          Lori Corbani
Email address:      lori.corbani@jax.org
PostgreSQL version: 18.0
Operating system:   Rocky Linux 9.6 (Blue Onyx)
Description:
postgres 17:  (A) SELECT 115457, (B) SELECT 115457 : same counts
postgres 18:  (A) SELECT 115444,  (B) SELECT 115436 : different counts
I am running this from /usr/pgsql-18/bin/psql
step 1:  psql -h ${PG_DBSERVER} -U ${PG_DBUSER} -d ${PG_DBNAME} -e
step 2: run (A) version / without the "distinct" clause
step 3: select results
step 4: exit psql
step 5: repeat steps 1-4
each time I run psql and the same SQL, I get a different count;
sometimes the count goes up and sometimes the count goes down.
Running the SQL with "distinct" fixes the problem, but we are not
having this problem on postgres 17.
Postgres 17 consistently returns the same count.
When we migrated from Postgres 17 to Postgres 18, I used the same
"postgresql.conf" and "pg_hba.conf".
Also, we migrated from Postgres 15 to Postgres 17 earlier this year
with no issues, using the same method.
If I remove the "exists" statement, then the counts are fine.
So, it seems that it is the "exists" statement that is causing the issue.
"select s._Strain_key"     VS  "select distinct s._Strain_key"
        from prb_strain s
        where s.private = 0
            and s.strain not ilike '%involves%'
            and s.strain not ilike '%either%'
            and s.strain not ilike '% and %'
            and s.strain not ilike '% or %'
            and exists (select 1 from voc_annot va, voc_term t
                where va._AnnotType_key = 1009
                and va._Term_key = t._Term_key
                and t.term != 'Not Applicable'
                and t.term != 'Not Specified'
                and va._Object_key = s._Strain_key)
			
		PG Bug reporting form <noreply@postgresql.org> writes:
> If I remove the "exists" statement, then the counts are fine.
> So, it seems that it is the "exists" statement that is causing the issue.
> "select s._Strain_key"     VS  "select distinct s._Strain_key"
>         from prb_strain s
>         where s.private = 0
>             and s.strain not ilike '%involves%'
>             and s.strain not ilike '%either%'
>             and s.strain not ilike '% and %'
>             and s.strain not ilike '% or %'
>             and exists (select 1 from voc_annot va, voc_term t
>                 where va._AnnotType_key = 1009
>                 and va._Term_key = t._Term_key
>                 and t.term != 'Not Applicable'
>                 and t.term != 'Not Specified'
>                 and va._Object_key = s._Strain_key)
This report is inadequate to help us identify the issue.
You've not provided the relevant table declarations,
nor any sample data that would reproduce the problem.
Given the squishiness of the described behavior, I realize that
building a self-contained reproducer might be hard.  In the
meantime, could you at least provide EXPLAIN ANALYZE results
from correct and incorrect executions?
            regards, tom lane
			
		
Tom,
Sorry, I haven't had to enter a PG ticket before.  I realize that there are many things I didn't provide.  My
apologies.  I was looking for a link back to my bug report so I could attach more information but didn't see one in the
email that came back.
I am working on getting the file together with the appropriate tables & data this morning.
In the meantime, I did run the EXPLAIN ANALYZE after I sent out the initial email.  See below.
Postgres 17
EXPLAIN ANALYZE
 Merge Semi Join  (cost=1.46..65223.16 rows=89672 width=4) (actual time=1.315..367.474 rows=115457 loops=1)
   Merge Cond: (s._strain_key = va._object_key)
   ->  Index Scan using prb_strain_pkey on prb_strain s  (cost=0.42..6486.88 rows=115741 width=4) (actual time=0.013..99.843 rows=117065 loops=1)
         Filter: ((strain !~~* '%involves%'::text) AND (strain !~~* '%either%'::text) AND (strain !~~* '% and %'::text) AND (strain !~~* '% or %'::text) AND (private = 0))
         Rows Removed by Filter: 16340
   ->  Nested Loop  (cost=0.86..118573.26 rows=322352 width=4) (actual time=1.284..232.161 rows=321574 loops=1)
         ->  Index Only Scan using voc_annot_idx_clustered on voc_annot va  (cost=0.43..95392.15 rows=322480 width=8) (actual time=0.054..139.127 rows=322488 loops=1)
               Index Cond: (_annottype_key = 1009)
               Heap Fetches: 1668
         ->  Memoize  (cost=0.43..0.51 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=322488)
               Cache Key: va._term_key
               Cache Mode: logical
               Hits: 322441  Misses: 47  Evictions: 0  Overflows: 0  Memory Usage: 5kB
                ->  Index Scan using voc_term_pkey on voc_term t  (cost=0.42..0.50 rows=1 width=4) (actual time=0.010..0.010 rows=1 loops=47)
                     Index Cond: (_term_key = va._term_key)
                     Filter: ((term <> 'Not Applicable'::text) AND (term <> 'Not Specified'::text))
                     Rows Removed by Filter: 0
 Planning Time: 4.688 ms
 Execution Time: 371.618 ms
(19 rows)
Postgres 18
EXPLAIN ANALYZE
 Gather  (cost=17394.54..60888.14 rows=88631 width=4) (actual time=112.329..193.108 rows=115447.00 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   Buffers: shared hit=13901
   ->  Parallel Hash Right Semi Join  (cost=16394.54..51025.04 rows=36930 width=4) (actual time=106.616..171.437 rows=38482.33 loops=3)
         Hash Cond: (va._object_key = s._strain_key)
         Buffers: shared hit=13901
         ->  Parallel Hash Join  (cost=12089.56..45217.05 rows=134260 width=4) (actual time=67.781..106.574 rows=107171.33 loops=3)
               Hash Cond: (va._term_key = t._term_key)
               Buffers: shared hit=12109
               ->  Parallel Bitmap Heap Scan on voc_annot va  (cost=3046.66..35821.57 rows=134313 width=8) (actual time=12.580..32.848 rows=107476.00 loops=3)
                     Recheck Cond: (_annottype_key = 1009)
                     Heap Blocks: exact=1802
                     Buffers: shared hit=6277
                     Worker 0:  Heap Blocks: exact=2068
                     Worker 1:  Heap Blocks: exact=2132
                     ->  Bitmap Index Scan on voc_annot_idx_annottype_key  (cost=0.00..2966.07 rows=322352 width=0) (actual time=11.476..11.477 rows=322428.00 loops=1)
                           Index Cond: (_annottype_key = 1009)
                           Index Searches: 1
                           Buffers: shared hit=275
               ->  Parallel Hash  (cost=7583.72..7583.72 rows=116735 width=4) (actual time=53.855..53.856 rows=93389.33 loops=3)
                     Buckets: 524288  Batches: 1  Memory Usage: 15072kB
                     Buffers: shared hit=5832
                     ->  Parallel Seq Scan on voc_term t  (cost=0.00..7583.72 rows=116735 width=4) (actual time=0.022..24.986 rows=93389.33 loops=3)
                           Filter: ((term <> 'Not Applicable'::text) AND (term <> 'Not Specified'::text))
                           Rows Removed by Filter: 36
                           Buffers: shared hit=5832
         ->  Parallel Hash  (cost=3458.96..3458.96 rows=67681 width=4) (actual time=38.276..38.277 rows=39019.33 loops=3)
               Buckets: 131072  Batches: 1  Memory Usage: 5632kB
               Buffers: shared hit=1792
               ->  Parallel Seq Scan on prb_strain s  (cost=0.00..3458.96 rows=67681 width=4) (actual time=0.033..27.676 rows=39019.33 loops=3)
                     Filter: ((strain !~~* '%involves%'::text) AND (strain !~~* '%either%'::text) AND (strain !~~* '% and %'::text) AND (strain !~~* '% or %'::text) AND (private = 0))
                     Rows Removed by Filter: 2963
                     Buffers: shared hit=1792
 Planning:
   Buffers: shared hit=724
 Planning Time: 5.633 ms
 Execution Time: 198.235 ms
(38 rows)
		Tom,
Attached is a file with this info.  Please let me know if this is what you need.
PRB_Strain.bcp.gz
PRB_Strain_create.object : schema
VOC_Annot.bcp.gz
VOC_Annot_create.object : schema
VOC_Term.bcp.gz
VOC_Term_create.object : schema
Thanks.
Lori
Attachments
Tom,
2 things I did this morning:
1. I added an "order by" clause; no change.
2. I added "select distinct", which fixed it, as I expected.  However, when I googled the Postgres "exists" best
practices, it seems to suggest that the "distinct" is unnecessary.
From Google Search:
Therefore, using DISTINCT within a subquery for an EXISTS clause is generally redundant and unnecessary.  The EXISTS
operator only cares if any row satisfies the subquery's condition, not how many or if they are unique.  Adding DISTINCT
would typically add overhead by requiring the database to sort and filter for uniqueness, which is not required for the
EXISTS check itself.
What is your suggestion for best practice when using "exists" clause?
Many thanks.
Lori
			
		Lori Corbani <Lori.Corbani@jax.org> writes:
> Attached is a file with this info.  Please let me know if this is what you need.
Thank you for the test case.  With your smaller data set (for some
reason not the larger one) I can duplicate the misbehavior: v17
consistently returns 115427 rows, v18 often one or two more than
that.  I just loaded up the data, vacuum analyzed, and ran
select count(*) from (
select  s._Strain_key
        from mgd.prb_strain s
        where s.private = 0
            and s.strain not ilike '%involves%'
            and s.strain not ilike '%either%'
            and s.strain not ilike '% and %'
            and s.strain not ilike '% or %'
            and exists (select 1 from mgd.voc_annot va, mgd.voc_term t
                where va._AnnotType_key = 1009
                and va._Term_key = t._Term_key
                and t.term != 'Not Applicable'
                and t.term != 'Not Specified'
                and va._Object_key = s._Strain_key)
) ss;
repeatedly in an out-of-the-box configuration.  (BTW, without the
outer "count(*)" subquery, it tends not to pick a PHJ plan, so the
misbehavior isn't there.  I didn't look into why not.)
After a good deal of time bisecting, I find the first commit that
shows the problem is
Author: Richard Guo <rguo@postgresql.org>
Branch: master Release: REL_18_BR [aa86129e1] 2024-07-05 09:26:48 +0900
    Support "Right Semi Join" plan shapes
which is perhaps unsurprising, because the planner picks a "Parallel
Hash Right Semi Join" plan after that and a non-Right plan before it.
So it looks like there is some race condition in that.
The failure rate at aa86129e1 and following commits is only perhaps
two or three tries out of 100, and I never saw more than 115428
rows produced.  However, after
Author: Thomas Munro <tmunro@postgresql.org>
Branch: master Release: REL_18_BR [4effd0844] 2024-08-31 17:28:02 +1200
Branch: REL_17_STABLE Release: REL_17_0 [3ed368361] 2024-08-31 17:29:30 +1200
    Fix unfairness in all-cached parallel seq scan.
the failure rate gets enormously worse, and there are many runs
producing more than one extra row, for example these stats from
200 test runs at 5668a857d:
$ sort -n ~/corbanicounts | uniq -c
  50 115427
  78 115428
  38 115429
  19 115430
  11 115431
   4 115432
I don't know what to make of that, exactly.  4effd0844 is nowhere near
the PHJ code, and it hasn't been causing any problems in the v17
branch as far as I've heard.  My guess is that it changes the order of
delivery of rows to PHJ in a way that tends to tickle the hypothesized
race condition.
BTW, 5668a857d did not move the needle one way or the other, so it's
not that.
None of this is code that I've been heavily into, so I hope somebody
else will pick it up.  But if we can't solve it in the next 10 days or
so, I think we need to disable parallel right semi joins for 18.1.
            regards, tom lane
			
		I wrote:
> After a good deal of time bisecting, I find the first commit that
> shows the problem is
> Author: Richard Guo <rguo@postgresql.org>
> Branch: master Release: REL_18_BR [aa86129e1] 2024-07-05 09:26:48 +0900
>     Support "Right Semi Join" plan shapes
After looking a little closer, we've got:
                if (node->js.jointype == JOIN_RIGHT_SEMI &&
                    HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
                    continue;
                ...
                if (joinqual == NULL || ExecQual(joinqual, econtext))
                {
                    if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
                        HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
As far as I can see, in a parallel hash join node->hj_CurTuple will
be pointing into a shared hash table, so that this code is fooling
with a shared HasMatch flag bit with no sort of lock whatsoever.
This query doesn't have any joinqual at the PHJ level if I'm reading
EXPLAIN correctly, so the window to get it wrong isn't very wide.
Nonetheless, if two parallel workers inspect the same inner tuple
at about the same time, this code would allow both of them to
return it.
I'm unsure if we've got any infrastructure that'd allow setting the
tuple's match bit in a more atomic fashion.  We might have to revert
aa86129e1.
            regards, tom lane
			
On Tue, Oct 28, 2025 at 5:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> As far as I can see, in a parallel hash join node->hj_CurTuple will
> be pointing into a shared hash table, so that this code is fooling
> with a shared HasMatch flag bit with no sort of lock whatsoever.
> This query doesn't have any joinqual at the PHJ level if I'm reading
> EXPLAIN correctly, so the window to get it wrong isn't very wide.
> Nonetheless, if two parallel workers inspect the same inner tuple
> at about the same time, this code would allow both of them to
> return it.
>
> I'm unsure if we've got any infrastructure that'd allow setting the
> tuple's match bit in a more atomic fashion.  We might have to revert
> aa86129e1.
Thanks Lori for the report and test case.  Thanks Tom for the analysis.
I think you're right.  In a parallel hash join, the inner relation is
stored in a shared global hash table for probing, allowing multiple
workers to access the same tuples concurrently.  If two workers probe
the same inner tuple at the same time, it can lead to duplicate
emissions of that tuple, violating the semantics of a right semi join.
AFAICT, there are 3 possible options for a fix.
1) Revert aa86129e1.
2) Modify the code to perform atomic operations on the matched flag
using a CAS (or a similar) mechanism when running in parallel
execution.
3) Disable parallel right semi joins in the planner.
For option #2, it seems to me that it would require non-trivial
changes, which might not be suitable for back-patching.  We would also
need to evaluate the performance impact of introducing atomic
operations.
I'm somewhat inclined toward option #3 instead.  We can implement the
disablement in hash_inner_and_outer(), which would only require a small
modification (perhaps two lines, plus some comment updates).
If option #2 doesn't work out, I'm open to option #1.
(I'm still trying to understand why concurrent access to the matched
flag in cases other than right semi joins (such as right or full
joins) doesn't lead to concurrency issues.)
- Richard
On Tue, Oct 28, 2025 at 9:58 AM Richard Guo <guofenglinux@gmail.com> wrote:
> If option #2 doesn't work out, I'm open to option #1.
Sorry, I meant if option #3 doesn't work out …
- Richard
Richard Guo <guofenglinux@gmail.com> writes:
> AFAICT, there are 3 possible options for a fix.
> 1) Revert aa86129e1.
> 2) Modify the code to perform atomic operations on the matched flag
> using a CAS (or a similar) mechanism when running in parallel
> execution.
> 3) Disable parallel right semi joins in the planner.
Right.  I agree that #3 is the most attractive stopgap answer.
We can look into #2 later, but it doesn't sound like something
to back-patch.  (The main problem according to my brief look
is that t_infomask2 is uint16, but we haven't built out any
16-bit atomic primitives; perhaps they do not exist everywhere.)
> (I'm still trying to understand why concurrent access to the matched
> flag in cases other than right semi joins (such as right or full
> joins) doesn't lead to concurrency issues.)
I believe PRSJ is the only case where we need to set and concurrently
inspect the HEAP_TUPLE_HAS_MATCH flag in a shared hashtable.
I have a nasty feeling that this was well understood back when
we first did parallel hash join, which is why it wasn't done
already.  Apparently the problem didn't get documented though,
or at least not in any place you chanced to look.
Looking at the code now, ExecParallelScanHashTableForUnmatched
also has unprotected tests (not sets) of the flag, but I think
that may be okay because we shouldn't still be mutating the
flags while that runs.
            regards, tom lane
			
On Tue, Oct 28, 2025 at 11:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > AFAICT, there are 3 possible options for a fix.
> >
> > 1) Revert aa86129e1.
> >
> > 2) Modify the code to perform atomic operations on the matched flag
> > using a CAS (or a similar) mechanism when running in parallel
> > execution.
> >
> > 3) Disable parallel right semi joins in the planner.
> Right.  I agree that #3 is the most attractive stopgap answer.
> We can look into #2 later, but it doesn't sound like something
> to back-patch.  (The main problem according to my brief look
> is that t_infomask2 is uint16, but we haven't built out any
> 16-bit atomic primitives; perhaps they do not exist everywhere.)
Agreed.  Here's a patch that follows along the lines of option #3.
> > (I'm still trying to understand why concurrent access to the matched
> > flag in cases other than right semi joins (such as right or full
> > joins) doesn't lead to concurrency issues.)
>
> I believe PRSJ is the only case where we need to set and concurrently
> inspect the HEAP_TUPLE_HAS_MATCH flag in a shared hashtable.
Right.  In the case of RIGHT or FULL joins, the match flags are used to
emit the unmatched inner tuples, and the scan for unmatched inner
tuples occurs after we have finished a batch.  Therefore, concurrent
inspection and setting of the match flags does not break correctness.
- Richard
Attachments
Richard Guo <guofenglinux@gmail.com> writes:
> On Tue, Oct 28, 2025 at 11:04 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Right.  I agree that #3 is the most attractive stopgap answer.
>> We can look into #2 later, but it doesn't sound like something
>> to back-patch.
> Agreed.  Here's a patch that follows along the lines of option #3.
Code changes look good, and I confirm that I can't reproduce the
failure anymore with this patch.
I'm not convinced that the new regression test case is worth the
cycles, at least not in this form.  The main thing that's annoying me
about it is creating/populating/analyzing its own large one-use table;
that approach soon leads to regression suites that take forever.
You could answer that objection by making use of some existing
regression table, for instance this seems to work as well:
explain select * from tenk1 t1
where exists(select 1 from tenk1 t2 where tenthous = t1.tenthous);
However, I feel like it may still not be a great test, because it only
shows that the planner *didn't* pick PRSJ, not that it *couldn't*.
The cost differential between PRSJ with these settings and the Hash
Semi Join plan that you get after applying the patch is not very
great; so it's easy to imagine future changes that'd mean we'd not
prefer PRSJ here anyway.  But I'm not sure what we could do about
that, so operationally this may be as good a test as we can get
anyway.
Another thought is that rather than having to remember to reset all
those planner options, you could make the test a bit shorter and
more maintainable by writing something like
begin;
set local parallel_setup_cost=0;
set local ...
explain ...
rollback;
            regards, tom lane
			
On Wed, Oct 29, 2025 at 8:07 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Code changes look good, and I confirm that I can't reproduce the
> failure anymore with this patch.
Thanks for the review and confirmation.
> I'm not convinced that the new regression test case is worth the
> cycles, at least not in this form.  The main thing that's annoying me
> about it is creating/populating/analyzing its own large one-use table;
> that approach soon leads to regression suites that take forever.
Fair point.
> You could answer that objection by making use of some existing
> regression table, for instance this seems to work as well:
>
> explain select * from tenk1 t1
> where exists(select 1 from tenk1 t2 where tenthous = t1.tenthous);
>
> However, I feel like it may still not be a great test, because it only
> shows that the planner *didn't* pick PRSJ, not that it *couldn't*.
> The cost differential between PRSJ with these settings and the Hash
> Semi Join plan that you get after applying the patch is not very
> great; so it's easy to imagine future changes that'd mean we'd not
> prefer PRSJ here anyway.  But I'm not sure what we could do about
> that, so operationally this may be as good a test as we can get
> anyway.
To make the right-semi join look more appealing, I wonder if we could
apply a filter to t1 to make its output smaller than t2, so that the
planner is more likely to choose t1 as the inner side for building the
hash table.
explain select * from tenk1 t1
where exists(select 1 from tenk1 t2 where fivethous = t1.fivethous)
and t1.fivethous < 5;
(I'm using fivethous instead of tenthous to avoid interference from
index scan.)
However, this doesn't seem to move the needle any further.  The costs
of PRSJ (unpatched) and PSJ (patched) are 755.67 and 777.54.  The cost
difference is still not very great.
> Another thought is that rather than having to remember to reset all
> those planner options, you could make the test a bit shorter and
> more maintainable by writing something like
>
> begin;
> set local parallel_setup_cost=0;
> set local ...
>
> explain ...
>
> rollback;
Brilliant!  Thanks for the suggestion.
- Richard
On Wed, Oct 29, 2025 at 10:46 AM Richard Guo <guofenglinux@gmail.com> wrote:
> To make the right-semi join look more appealing, I wonder if we could
> apply a filter to t1 to make its output smaller than t2, so that the
> planner is more likely to choose t1 as the inner side for building the
> hash table.
>
> explain select * from tenk1 t1
> where exists(select 1 from tenk1 t2 where fivethous = t1.fivethous)
> and t1.fivethous < 5;
>
> (I'm using fivethous instead of tenthous to avoid interference from
> index scan.)
>
> However, this doesn't seem to move the needle any further.  The costs
> of PRSJ (unpatched) and PSJ (patched) are 755.67 and 777.54.  The cost
> difference is still not very great.
After spending a bit more time experimenting, I couldn't find a better
query that shows a large difference between the costs of the PRSJ plan
and the patched version plan.  So I've updated the test case to use the
aforementioned one.  I also removed the reset statements and switched
to the begin/rollback pattern for the test case.
- Richard
Attachments
Richard Guo <guofenglinux@gmail.com> writes:
> After spending a bit more time experimenting, I couldn't find a better
> query that shows a large difference between the costs of the PRSJ plan
> and the patched version plan.  So I've updated the test case to use
> the aforementioned one.  I also removed the reset statements and
> switched to the begin/rollback pattern for the test case.
v2 patch LGTM.
            regards, tom lane
			
On Thu, Oct 30, 2025 at 1:00 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Richard Guo <guofenglinux@gmail.com> writes:
> > After spending a bit more time experimenting, I couldn't find a better
> > query that shows a large difference between the costs of the PRSJ plan
> > and the patched version plan.  So I've updated the test case to use
> > the aforementioned one.  I also removed the reset statements and
> > switched to the begin/rollback pattern for the test case.
> v2 patch LGTM.
Thanks for the review.  I've pushed v2 and backpatched it to v18.
Thanks to Lori for the report.
- Richard
On Tue, Oct 28, 2025 at 9:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> After looking a little closer, we've got:
>
>                 if (node->js.jointype == JOIN_RIGHT_SEMI &&
>                     HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>                     continue;
>                 ...
>                 if (joinqual == NULL || ExecQual(joinqual, econtext))
>                 {
>                     if (!HeapTupleHeaderHasMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple)))
>                         HeapTupleHeaderSetMatch(HJTUPLE_MINTUPLE(node->hj_CurTuple));
>
> As far as I can see, in a parallel hash join node->hj_CurTuple will
> be pointing into a shared hash table, so that this code is fooling
> with a shared HasMatch flag bit with no sort of lock whatsoever.
> This query doesn't have any joinqual at the PHJ level if I'm reading
> EXPLAIN correctly, so the window to get it wrong isn't very wide.
> Nonetheless, if two parallel workers inspect the same inner tuple
> at about the same time, this code would allow both of them to
> return it.
Yeah, that way of writing to it was good enough for the original
purpose: no backend would read it again until after synchronising on a
barrier, it doesn't matter if two backends set it, and though the
relaxed check if it's already set can be wrong, only in an acceptable
direction.  We only cared about which tuples were never matched, after
the probe phase is finished.
> I'm unsure if we've got any infrastructure that'd allow setting the
> tuple's match bit in a more atomic fashion.
Interesting problem.  My first drive-by thought is that we would need
some new smaller atomics (that's a 16-bit member), and a decision that
it is OK to use the atomics API by casting (should be OK, since we no
longer have emulated atomics in smaller sizes, so now it's just a
fancy way of accessing memory with "boxed" types that have the same
size and alignment as the underlying type, and plain stores should
work for initialisation if careful about synchronisation).  I guess
the op we'd want is atomic_fetch_or with a check of the old value to
see if you lost the race.  I suppose the smallest code change would be
to keep the existing code flow that was apparently almost working here
(the window of optimism being just between the relaxed check at the
top and the set at the bottom).  If you fail to be the first to set
the bit, you abandon the work done, but it should hopefully be rare
enough and cheaper overall than something "pessimistic" that does
something akin to locking the tuple while thinking about it, and the
time it took for this to be discovered seems to support the
"optimistic" idea (I guess it depends how much processing happens to
determine that it's really a match?).  I expect it'd be very important
to skip the atomic op when not needed.  I don't know if it'd be worth
a new specialisation to avoid the branching... IIRC previously we've
chosen to set the flag in cases that don't need it on the basis that
the branch to decide might be worse than just setting it always (which
wasn't very scientific... it's also not nice to dirty memory for no
reason, hence the check to avoid setting when already set), but now
that we're talking about expensive memory synchronising ops we'd be
forced to reconsider that, with a branch and optionally a
specialisation to constant-fold it away (a technique we only use today
to strip out the branches in the serial vs parallel versions of the
function, pending more research on what other combinations are worth
spending code size on, but this one was already on the candidate list
to look into...).
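[Editor's note: the optimistic flow described above might look roughly like the following minimal C sketch.  All names are illustrative, not PostgreSQL's; it uses a full-width atomic flag word via C11 <stdatomic.h> rather than the real 16-bit t_infomask2, purely to show the relaxed-check-then-fetch_or shape.]

```c
/* Hypothetical sketch of the "optimistic" flow: check the match bit
 * with a relaxed load, do the (possibly wasted) joinqual work, then
 * try to claim the bit with atomic_fetch_or and see whether another
 * worker beat us to it. */
#include <stdatomic.h>
#include <stdbool.h>

#define MATCH_BIT 0x1u

/* Returns true iff this caller was the one that set the bit. */
static bool
try_claim_match(_Atomic unsigned int *infomask)
{
    /* Relaxed pre-check: if already matched, skip the tuple early. */
    if (atomic_load_explicit(infomask, memory_order_relaxed) & MATCH_BIT)
        return false;

    /* ... evaluate the joinqual here; this work is wasted if we
     * subsequently lose the race, which should be rare ... */

    /* Atomically set the bit; the returned old value tells us
     * whether some other worker got there first. */
    unsigned int old = atomic_fetch_or(infomask, MATCH_BIT);

    return (old & MATCH_BIT) == 0;
}
```

The window between the relaxed load and the fetch_or is exactly the "window of optimism" mentioned above: a loser of the race abandons its work rather than returning a duplicate.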
			
		Thomas Munro <thomas.munro@gmail.com> writes:
> On Tue, Oct 28, 2025 at 9:12 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I'm unsure if we've got any infrastructure that'd allow setting the
>> tuple's match bit in a more atomic fashion.
> Interesting problem.  My first drive-by thought is that we would need:
> some new smaller atomics (that's a 16-bit member), and a decision that
> it is OK to use the atomics API by casting (should be OK, since we no
> longer have emulated atomics in smaller sizes, so now it's just a
> fancy way of accessing memory with "boxed" types that have the same
> size and alignment as the underlying type, and plain stores should
> work for initialisation if careful about synchronisation).
Right.  I wasn't excited about building out 16-bit atomics, not least
because I'm unsure that those exist on every supported architecture.
Instead I spent a little time thinking about how we could use a 32-bit
atomic op here.  Clearly, that should theoretically work, but you'd
have to identify where the start of the 32-bit word is (t_infomask2
sadly is not at a 4-byte boundary) and which bit within that word is
the target bit (that's gonna vary depending on endianness at least).
Seems like a pain in the rear, but probably still less work than
creating 16-bit atomic ops.
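[Editor's note: the address arithmetic being described could be sketched as follows.  The struct and function names are invented for illustration; the endianness switch assumes the usual autoconf-style WORDS_BIGENDIAN macro, and the 16-bit field is assumed to sit at an even offset.]

```c
/* Given a pointer to a 16-bit field that is not 4-byte aligned, find
 * the enclosing aligned 32-bit word and the shifted mask for one bit
 * of that field.  Which half of the word the field occupies depends
 * on byte order. */
#include <stdint.h>

typedef struct
{
    uint16_t    t_len;          /* at offset 0 */
    uint16_t    t_infomask2;    /* at offset 2: not 4-byte aligned */
} FakeHeader;

static uint32_t *
word_and_mask(uint16_t *field, uint16_t bit, uint32_t *mask)
{
    uintptr_t   addr = (uintptr_t) field;

    /* Round down to the enclosing 4-byte-aligned word. */
    uint32_t   *word = (uint32_t *) (addr & ~(uintptr_t) 3);

    /* The field's byte offset within the word decides the shift;
     * on big-endian the halves are swapped. */
#ifdef WORDS_BIGENDIAN
    unsigned    shift = (unsigned) (2 - (addr & 3)) * 8;
#else
    unsigned    shift = (unsigned) (addr & 3) * 8;
#endif

    *mask = (uint32_t) bit << shift;
    return word;
}
```

A 32-bit atomic_fetch_or on the returned word with the returned mask would then set the target bit without touching the neighboring 16-bit member (the caller must ensure the containing object is at least 4-byte aligned so the word doesn't reach outside it).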
A vaguer thought is that we're not bound to represent the match
bit in exactly the way it's done now, if there's some other choice
that would be easier to fit into these concerns.  The only hard
limit I'd lay down is that making the struct bigger isn't OK.
> I guess
> the op we'd want is atomic_fetch_or with a check of the old value to
> see if you lost the race.
I was thinking about CAS, but at some level these are all equivalent.
> I suppose the smallest code change would be
> to keep the existing code flow that was apparently almost working here
> (the window of optimism being just between the relaxed check at the
> top and the set at the bottom).  If you fail to be the first to set
> the bit, you abandon the work done, but it should hopefully be rare
> enough and cheaper overall than something "pessimistic" that does
> something akin to locking the tuple while thinking about it,
Right.  If we encounter the race condition and lose it, we'd have
wasted the time for an extra evaluation of the joinqual, but that
should be fine given that we know this is rare.  (And we'd not be
using this plan shape if the joinqual were volatile.)
            regards, tom lane
			
Hi All,

Does this mean that a v18 patch will be available that we can download?
https://www.postgresql.org/

Many thanks for addressing this so quickly.

Thank you.
Lori
On Thu, Oct 30, 2025 at 4:57 PM Lori Corbani <Lori.Corbani@jax.org> wrote:
>
> Hi All,
>
> Does this mean that a v18 patch will be available that we can download?

18.1 is scheduled for November 13th.
https://www.postgresql.org/developer/roadmap/

--
John Naylor
Amazon Web Services
Thank you!
Lori
On Thu, Oct 30, 2025 at 5:11 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Right.  I wasn't excited about building out 16-bit atomics, not least
> because I'm unsure that those exist on every supported architecture.
> Instead I spent a little time thinking about how we could use a 32-bit
> atomic op here.  Clearly, that should theoretically work, but you'd
> have to identify where is the start of the 32-bit word (t_infomask2
> sadly is not at a 4-byte boundary) and which bit within that word is
> the target bit (that's gonna vary depending on endianness at least).
> Seems like a pain in the rear, but probably still less work than
> creating 16-bit atomic ops.
I read that GCC and Clang will do exactly that for us automatically if
we use builtins (generic-gcc.h) or eventually <stdatomic.h> on
hardware without small atomics.  AFAIK that's only RISC-V, which I
don't have, so I tried cross-compiling void f(atomic_char *x) {
atomic_fetch_or(x, 1); } and GCC happily generated a lr.w.aqrl,
sc.w.rl sequence with a bunch of bitswizzling.  With the right magic
switches I think it should be able to spit out a single amoor.b
instruction instead (looks like it's the "zaamo" extension that added
sub-word atomics quite recently), but I couldn't immediately figure
out how to turn it on.  I think every other ISA on our list can do it
in hardware except MIPS (RIP?).
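[Editor's note: the cross-compiled snippet referred to above is, in compilable form, just this.  On hardware without byte-wide atomic read-modify-write, the compiler falls back to a word-sized LL/SC loop with bit swizzling, as described.]

```c
/* C11 atomics on a sub-word type: the compiler either emits a native
 * byte-wide atomic OR, or emulates it on the enclosing aligned word. */
#include <stdatomic.h>

void
f(atomic_char *x)
{
    atomic_fetch_or(x, 1);
}
```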
> A vaguer thought is that we're not bound to represent the match
> bit in exactly the way it's done now, if there's some other choice
> that would be easier to fit into these concerns.  The only hard
> limit I'd lay down is that making the struct bigger isn't OK.
Yeah.  Some alternative match flag storage sketches came up in the PHJ
right join thread.  I've wondered a few times about tagged pointers.
What if we stole the lower two bits of the pointers to tuples to mean
"matched" and "every tuple that follows me in the chain is also
matched"?  In right join unmatched scans, and clearly in this case
too, we'd often avoid following pointers to matched tuples we're not
interested in, but we'd have to mask the tagged bits before
dereferencing.  Otherwise I think it should be much like the earlier
description, eg specialisation could remove all the match-tracking,
masking and/or atomic ops depending on the plan.
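[Editor's note: the tagged-pointer idea might be sketched like this.  All names are hypothetical; it assumes tuples are at least 4-byte aligned so the low two bits of any pointer to them are free to carry flags, with the flags describing the pointed-to tuple.]

```c
/* Steal the low two bits of chain pointers: bit 0 says the pointed-to
 * tuple is matched, bit 1 says every tuple that follows it in the
 * chain is also matched.  Tags must be masked off before
 * dereferencing. */
#include <stdint.h>
#include <stddef.h>

#define TUPLE_MATCHED      ((uintptr_t) 1)  /* bit 0 */
#define CHAIN_ALL_MATCHED  ((uintptr_t) 2)  /* bit 1 */
#define TAG_MASK           ((uintptr_t) 3)

typedef struct HashTuple
{
    uintptr_t   next;       /* tagged pointer to next tuple in chain */
    /* ... tuple data ... */
} HashTuple;

/* Strip the tag bits before dereferencing. */
static HashTuple *
chain_next(uintptr_t tagged)
{
    return (HashTuple *) (tagged & ~TAG_MASK);
}

static int
is_matched(uintptr_t tagged)
{
    return (tagged & TUPLE_MATCHED) != 0;
}
```

An unmatched-tuple scan could then stop following a chain as soon as it sees CHAIN_ALL_MATCHED, at the cost of a mask on every pointer chase.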
			
On Thu, 30 Oct 2025 at 22:57, Lori Corbani <Lori.Corbani@jax.org> wrote:
> Does this mean that a v18 patch will be available that we can download?

The commit is in [1] if you're desperate for it sooner and don't mind
patching/building yourself.  The next batch of minor version releases is
due out on the 13th of November [2].

David

[1] https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=257ee78341f2657d5c19cdaf0888f843e9bb0c33
[2] https://www.postgresql.org/developer/roadmap/
David,

Thank you, but we can wait for the version release on Nov 13.  We are fine
running Postgres 17 until then.  No rush.

Many thanks for your quick response and fix.  You can remove me from this
thread, if you like.

Lori