Discussion: Merge joins on index scans

Merge joins on index scans

From: James Parks
Date:
Dear pgsql-performance,

I'm having issues with a certain query, and I was hoping you could help me out.

The schema:

(Start with a new database cluster, with either SQL_ASCII or en_US.UTF-8 encoding, using the default server configuration from the pgdg Jessie packages.)

CREATE TABLE a (id bigint primary key, nonce bigint);
CREATE TABLE b (id bigint primary key, a_id bigint not null);
CREATE INDEX a_idx ON b (a_id);

The query:

SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id ASC;

(skip down to [1] and [2] to see the query performance)

What I know:

If you force the query planner to use a merge join on the above query, it takes 10+ minutes to complete using the data as per below. If you force the query planner to use a hash join on the same data, it takes ~200 milliseconds. This behavior is the same on both PostgreSQL 9.2.15 and PostgreSQL 9.4.6 (at least as provided by the Debian Jessie repository hosted by postgresql.org), and happens on both i386 and x86_64 platforms. Note: the data for these queries is the same as described below. Let me know if you want me to provide the raw CSVs or similar.
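
(To force a particular join method for a session, the standard approach is to disable the alternative plan types with the planner GUCs; something along these lines:)

-- force the merge join by disabling the alternatives for this session
SET enable_hashjoin = off;
SET enable_nestloop = off;
-- or, instead, force the hash join
SET enable_mergejoin = off;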

Creating a new PostgreSQL 9.4 cluster (64-bit), creating tables (a) and (b), importing the data into them, and then running the above query under EXPLAIN results in a merge join plan, as in [1].

Creating a new PostgreSQL 9.2 cluster (32-bit or 64-bit), creating tables (a) and (b), importing the data into them, and then running the above query results in a hash join plan, as in [2].

While running the query in [1], the postgres process consumes 100% of a CPU for a long time (it appears to be CPU-bound).

What I expected:

I expected the hash join and merge join implementations of this query to have comparable query times, perhaps within an order of magnitude. I expected this mostly because the cost estimates for the two plans were very similar. Instead, the "optimal" query plan takes more than 1000x longer.

I also expected the "Rows Removed by Filter" count for the index scan on (a) to be far lower, as the number of rows in table (a) (~500,000) is vastly smaller than the reported count (2,201,063,696).

What I want to know:

- Is this expected behavior? Can you describe how the merge join algorithm achieves these results?
- Can I avoid this issue by disabling merge joins in the server configuration?

Configuration:

The database uses the stock configuration shipped with the Debian Jessie packages of PostgreSQL available at http://apt.postgresql.org/pub/repos/apt/, except that the data directory was explicitly specified.

Information about the data:

Here are some queries that help describe the data I'm working with:

postgres=# select distinct a_id, count(*) from b group by a_id;
  a_id  | count
--------+-------
  49872 |   320
  47994 |     5
  19084 | 82977
  53251 |   100
 109804 |    10
  51738 |     5
  49077 |    10

postgres=# select count(*) from b;
 count
-------
 83427

postgres=# select count(distinct nonce) from a;
 count
-------
   198

postgres=# select count(*) from a;
 count
--------
 490166

postgres=# select count(*) from a where nonce = 64;
 count
-------
   395
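
(The real data isn't included here; for anyone reproducing without the dump, a hypothetical generator that only approximates the counts above could look like the following. Note the real nonce distribution is skewed, whereas this one is uniform:)

INSERT INTO a  -- ~490k rows, ~198 distinct nonce values
SELECT g, (g * 7919) % 198
FROM generate_series(1, 490166) AS g;

INSERT INTO b  -- ~83k rows, heavily skewed toward one a_id
SELECT g,
       CASE WHEN g <= 82977 THEN 19084   -- the heavy key from the output above
            WHEN g <= 83297 THEN 49872   -- 320 rows
            WHEN g <= 83397 THEN 53251   -- 100 rows
            ELSE 109804                  -- remaining 30 rows lumped together
       END
FROM generate_series(1, 83427) AS g;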

Hardware:

- 2015-era Intel Xeon processors
- more than 300 GB of RAM (larger than the database by a wide margin)
- database on a hardware RAID 1 array of two SSDs

Commentary:

This is my first bug report to a major open source project, so I apologize in advance if I messed up this report. Let me know if I have left out key details -- I'm happy to provide them.

Given that there are roughly 500k rows in table (a), the EXPLAIN output's claim that the filter (nonce = 64) caused 2 billion rows to be skipped (refer to [1]) suggests that each row in table (b) is being compared against a non-negligible number of rows in (a).

I can probably make this data available as a pg_dump file. Let me know if you think that's necessary, and where I should upload it.

Regards,
James

[1]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id = a.id where a.nonce = 64 order by b.id asc;
                                                             QUERY PLAN                                                            
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7855.25..7855.61 rows=143 width=16) (actual time=752058.415..752080.731 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2096kB
   Buffers: shared hit=2151025721 read=479, temp read=267 written=267
   I/O Timings: read=2.384
   ->  Merge Join  (cost=869.07..7850.13 rows=143 width=16) (actual time=5.718..751760.637 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=2151025721 read=479
         I/O Timings: read=2.384
         ->  Index Scan using a_idx on b  (cost=0.00..2953.35 rows=83427 width=16) (actual time=0.007..68.165 rows=83427 loops=1)
               Buffers: shared hit=1303 read=139
               I/O Timings: read=1.369
         ->  Index Scan using a_pkey on a  (cost=0.00..26163.20 rows=843 width=8) (actual time=5.706..751385.306 rows=83658 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 2201063696
               Buffers: shared hit=2151024418 read=340
               I/O Timings: read=1.015
 Total runtime: 752092.206 ms

[2]
postgres=# explain (analyze,buffers) select b.* from b join a on b.a_id = a.id where a.nonce = 64 order by b.id asc;
                                                     QUERY PLAN                                                    
---------------------------------------------------------------------------------------------------------------------
 Sort  (cost=10392.28..10392.64 rows=143 width=16) (actual time=164.415..186.297 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2112kB
   Buffers: shared hit=2514 read=587, temp read=267 written=267
   I/O Timings: read=1.199
   ->  Hash Join  (cost=8787.61..10387.16 rows=143 width=16) (actual time=61.836..113.434 rows=83427 loops=1)
         Hash Cond: (b.a_id = a.id)
         Buffers: shared hit=2514 read=587
         I/O Timings: read=1.199
         ->  Seq Scan on b  (cost=0.00..1285.27 rows=83427 width=16) (actual time=0.011..15.826 rows=83427 loops=1)
               Buffers: shared hit=449 read=2
               I/O Timings: read=0.009
         ->  Hash  (cost=8777.08..8777.08 rows=843 width=8) (actual time=61.812..61.812 rows=395 loops=1)
               Buckets: 1024  Batches: 1  Memory Usage: 16kB
               Buffers: shared hit=2065 read=585
               I/O Timings: read=1.190
               ->  Seq Scan on a  (cost=0.00..8777.08 rows=843 width=8) (actual time=0.143..61.609 rows=395 loops=1)
                     Filter: (nonce = 64)
                     Rows Removed by Filter: 489771
                     Buffers: shared hit=2065 read=585
                     I/O Timings: read=1.190
 Total runtime: 198.014 ms

The above numbers were acquired using this version of PostgreSQL:
PostgreSQL 9.2.15 on x86_64-unknown-linux-gnu, compiled by gcc (Debian 4.9.2-10) 4.9.2, 64-bit

Re: Merge joins on index scans

From: David Rowley
Date:
On 27 February 2016 at 11:07, James Parks <james.parks@meraki.net> wrote:
>
> CREATE TABLE a (id bigint primary key, nonce bigint);
> CREATE TABLE b (id bigint primary key, a_id bigint not null);
> CREATE INDEX a_idx ON b (a_id);
>
> The query:
>
> SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
> ASC;
>
> (skip down to [1] and [2] to see the query performance)
>
> What I know:
>
> If you force the query planner to use a merge join on the above query, it
> takes 10+ minutes to complete using the data as per below. If you force the
> query planner to use a hash join on the same data, it takes ~200
> milliseconds.

I believe I know what is going on here, but can you please test:

SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND
a.nonce = ?) ORDER BY b.id ASC;

using the merge join plan.

If this performs much better, then the problem is due to the merge join
mark/restore causing the join to have to transition through many
tuples which don't match the a.nonce = ? predicate. The mark and
restore is not required for the rewritten query, as this uses a semi
join rather than a regular inner join. With the semi join the executor
knows that it's only meant to be matching a single tuple in "a", so
once the first match is found it can move to the next row in the outer
relation without having to restore the scan back to where it started
matching that inner row again.

If I'm right, you could get around the problem by creating an index
on a (nonce, id).

If such an index is out of the question, a patch that should fix this
problem has been submitted for review, hopefully for 9.6 or 9.7:
https://commitfest.postgresql.org/9/129/
If you have a test environment handy, it would be nice if you could
test the patch against current git head to see if it fixes your
problem. The findings would be quite interesting to me. Please note
this patch is for test environments only at this stage, not for
production use.

--
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Merge joins on index scans

From: Tom Lane
Date:
David Rowley <david.rowley@2ndquadrant.com> writes:
> On 27 February 2016 at 11:07, James Parks <james.parks@meraki.net> wrote:
>> If you force the query planner to use a merge join on the above query, it
>> takes 10+ minutes to complete using the data as per below. If you force the
>> query planner to use a hash join on the same data, it takes ~200
>> milliseconds.

> I believe I know what is going on here, but can you please test:
> SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND
> a.nonce = ?) ORDER BY b.id ASC;
> using the merge join plan.

> If this performs much better, then the problem is due to the merge join
> mark/restore causing the join to have to transition through many
> tuples which don't match the a.nonce = ? predicate.

Clearly we are rescanning an awful lot of the "a" table:

         ->  Index Scan using a_pkey on a  (cost=0.00..26163.20 rows=843 width=8) (actual time=5.706..751385.306 rows=83658 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 2201063696
               Buffers: shared hit=2151024418 read=340
               I/O Timings: read=1.015

The other explain shows a scan of "a" reading about 490k rows and
returning 395 of them, so there's a factor of about 200 re-read here.
I wonder if the planner should have inserted a materialize node to
reduce that.

However, I think the real problem is upstream of that: if that indexscan
was estimated at 26163.20 units, how'd the mergejoin above it get costed
at only 7850.13 units?  The answer has to be that the planner thought the
merge would stop before reading most of "a", as a result of limited range
of b.a_id.  It would be interesting to look into what the actual maximum
b.a_id value is.
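
(A quick check along those lines might be:)

SELECT max(a_id) FROM b;
SELECT min(id), max(id) FROM a;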

            regards, tom lane


Re: Merge joins on index scans

From: James Parks
Date:
On Sun, Feb 28, 2016 at 2:06 AM, David Rowley <david.rowley@2ndquadrant.com> wrote:
> On 27 February 2016 at 11:07, James Parks <james.parks@meraki.net> wrote:
>>
>> CREATE TABLE a (id bigint primary key, nonce bigint);
>> CREATE TABLE b (id bigint primary key, a_id bigint not null);
>> CREATE INDEX a_idx ON b (a_id);
>>
>> The query:
>>
>> SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = ? ORDER BY b.id
>> ASC;
>>
>> (skip down to [1] and [2] to see the query performance)
>>
>> What I know:
>>
>> If you force the query planner to use a merge join on the above query, it
>> takes 10+ minutes to complete using the data as per below. If you force the
>> query planner to use a hash join on the same data, it takes ~200
>> milliseconds.
>
> I believe I know what is going on here, but can you please test:
>
> SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND
> a.nonce = ?) ORDER BY b.id ASC;
>
> using the merge join plan.


Here's the query plan for that query:

postgres=# explain (analyze,buffers) SELECT b.* FROM b WHERE EXISTS (SELECT 1 FROM a WHERE b.a_id = a.id AND a.nonce = 64) ORDER BY b.id ASC;
                                                            QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=16639.19..16855.62 rows=86572 width=16) (actual time=145.117..173.298 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2112kB
   Buffers: shared hit=86193 read=881, temp read=269 written=269
   I/O Timings: read=3.199
   ->  Merge Semi Join  (cost=795.82..8059.09 rows=86572 width=16) (actual time=6.680..91.862 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=86193 read=881
         I/O Timings: read=3.199
         ->  Index Scan using a_idx on b  (cost=0.00..3036.70 rows=86572 width=16) (actual time=0.005..25.193 rows=83427 loops=1)
               Buffers: shared hit=1064 read=374
               I/O Timings: read=1.549
         ->  Index Scan using a_pkey on a  (cost=0.00..26259.85 rows=891 width=8) (actual time=6.663..35.177 rows=237 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87939
               Buffers: shared hit=85129 read=507
               I/O Timings: read=1.650
 Total runtime: 186.825 ms

... so, yes, it does indeed get a lot faster.
 
> If this performs much better, then the problem is due to the merge join
> mark/restore causing the join to have to transition through many
> tuples which don't match the a.nonce = ? predicate. The mark and
> restore is not required for the rewritten query, as this uses a semi
> join rather than a regular inner join. With the semi join the executor
> knows that it's only meant to be matching a single tuple in "a", so
> once the first match is found it can move to the next row in the outer
> relation without having to restore the scan back to where it started
> matching that inner row again.
>
> If I'm right, you could get around the problem by creating an index
> on a (nonce, id).


postgres=# CREATE INDEX a_id_nonce_idx ON a (nonce, id);
CREATE INDEX
postgres=# explain (analyze,buffers) SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;
                                                                 QUERY PLAN                                                                 
---------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=3376.57..3376.92 rows=140 width=16) (actual time=144.340..160.875 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   Buffers: shared hit=368 read=522, temp read=266 written=266
   ->  Merge Join  (cost=69.50..3371.58 rows=140 width=16) (actual time=0.056..88.409 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=368 read=522
         ->  Index Scan using a_idx on b  (cost=0.29..2724.58 rows=83427 width=16) (actual time=0.009..23.834 rows=83427 loops=1)
               Buffers: shared hit=171 read=518
         ->  Index Only Scan using a_id_nonce_idx on a  (cost=0.42..2450.58 rows=820 width=8) (actual time=0.041..20.776 rows=83658 loops=1)
               Index Cond: (nonce = 64)
               Heap Fetches: 83658
               Buffers: shared hit=197 read=4
 Planning time: 0.241 ms
 Execution time: 172.346 ms

Looks pretty fast to me. That said, the number of rows returned by the Index Only Scan seems a bit high compared to the results below, so I applied the patch you mention below and got [2].
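
(Side note: a Heap Fetches count equal to the scan's row count means no pages are marked all-visible in the visibility map yet; on a freshly loaded table, a vacuum would normally let the index-only scan skip most heap visits:)

VACUUM a;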
 
> If such an index is out of the question, a patch that should fix this
> problem has been submitted for review, hopefully for 9.6 or 9.7:
> https://commitfest.postgresql.org/9/129/
> If you have a test environment handy, it would be nice if you could
> test the patch against current git head to see if it fixes your
> problem. The findings would be quite interesting to me. Please note
> this patch is for test environments only at this stage, not for
> production use.

I can confirm that your patch seems to improve performance.

HEAD [1], with patch:

postgres=# explain (analyze,buffers) SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;
                                                            QUERY PLAN                                                           
----------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7320.39..7320.75 rows=144 width=16) (actual time=162.199..181.497 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   Buffers: shared hit=85780 read=472, temp read=266 written=266
   ->  Merge Join  (cost=808.17..7315.23 rows=144 width=16) (actual time=9.719..101.864 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=85780 read=472
         ->  Index Scan using a_idx on b  (cost=0.29..2721.49 rows=83427 width=16) (actual time=0.059..26.346 rows=83427 loops=1)
               Buffers: shared hit=460 read=229
         ->  Index Scan using a_pkey on a  (cost=0.42..24394.88 rows=846 width=8) (actual time=9.651..41.231 rows=237 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87939
               Buffers: shared hit=85320 read=243
 Planning time: 0.289 ms
 Execution time: 195.071 ms

HEAD [1], without patch:

postgres=# explain (analyze,buffers) SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;
                                                             QUERY PLAN                                                             
-------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=7285.48..7285.83 rows=140 width=16) (actual time=710073.059..710089.924 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   Buffers: shared hit=2149282200 read=243, temp read=266 written=266
   ->  Merge Join  (cost=685.28..7280.49 rows=140 width=16) (actual time=10.022..709878.946 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=2149282200 read=243
         ->  Index Scan using a_idx on b  (cost=0.29..2724.58 rows=83427 width=16) (actual time=0.013..59.103 rows=83427 loops=1)
               Buffers: shared hit=689
         ->  Index Scan using a_pkey on a  (cost=0.42..24404.69 rows=820 width=8) (actual time=9.998..709595.905 rows=83658 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 2201063696
               Buffers: shared hit=2149281511 read=243
 Planning time: 0.297 ms
 Execution time: 710101.931 ms

Thank you for your response! This helps us understand the situations in which we can run into trouble, and what we can do, in most if not all cases, to resolve them.

Regards,
James

[1] HEAD is "a892234 Change the format of the VM fork to add a second bit per page." as found on the GitHub mirror of postgres

[2]
postgres=# explain (analyze,buffers) SELECT b.* FROM b JOIN a ON b.a_id = a.id WHERE a.nonce = 64 ORDER BY b.id ASC;
                                                                QUERY PLAN                                                                
-------------------------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=11000.03..11001.07 rows=417 width=16) (actual time=109.739..129.031 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   Buffers: shared hit=886, temp read=266 written=266
   ->  Merge Join  (cost=0.71..10981.88 rows=417 width=16) (actual time=0.050..55.701 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         Buffers: shared hit=886
         ->  Index Scan using a_idx on b  (cost=0.29..3975.70 rows=83427 width=16) (actual time=0.017..23.647 rows=83427 loops=1)
               Buffers: shared hit=689
         ->  Index Only Scan using a_id_nonce_idx on a  (cost=0.42..6787.31 rows=2451 width=8) (actual time=0.028..0.172 rows=237 loops=1)
               Index Cond: (nonce = 64)
               Heap Fetches: 237
               Buffers: shared hit=197
 Planning time: 0.153 ms
 Execution time: 142.994 ms

Re: Merge joins on index scans

From: Tom Lane
Date:
James Parks <james.parks@meraki.net> writes:
> On Mon, Feb 29, 2016 at 5:22 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The other explain shows a scan of "a" reading about 490k rows and
>> returning 395 of them, so there's a factor of about 200 re-read here.
>> I wonder if the planner should have inserted a materialize node to
>> reduce that.
>>
>> However, I think the real problem is upstream of that: if that indexscan
>> was estimated at 26163.20 units, how'd the mergejoin above it get costed
>> at only 7850.13 units?  The answer has to be that the planner thought the
>> merge would stop before reading most of "a", as a result of limited range
>> of b.a_id.  It would be interesting to look into what the actual maximum
>> b.a_id value is.

> I've attached a pg_dump of a database that contains all of the data, in the
> event you (or others) would like to look at it. The attachment is ~1.8MB
> (gzipped), and you can replay the pg_dump file on a database that has just
> been created with initdb.

Thanks for sending the test data --- I got a chance to look at this
finally.  It looks like my first suspicion was right and the second one
wrong: the planner is badly underestimating the amount of rescan required
and thus not putting in a Materialize buffer node where needed.

If I force a Materialize node to be put in, without changing any cost
estimates (basically, set path->materialize_inner = 1 at the end of
final_cost_mergejoin), then I get this:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=218.342..232.817 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=18.627..136.549 rows=83427 loops=1)
         Merge Cond: (b.a_id = a.id)
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.048..28.876 rows=83427 loops=1)
         ->  Materialize  (cost=0.42..24368.01 rows=805 width=8) (actual time=18.568..74.447 rows=83658 loops=1)
               ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=18.560..66.186 rows=238 loops=1)
                     Filter: (nonce = 64)
                     Rows Removed by Filter: 87955
 Execution time: 239.195 ms

which is a pretty satisfactory result in terms of runtime, though
still a bit slower than the hash alternative.

Now, there are 490166 "a" rows, but we can see that the inner indexscan
stopped after reading 87955+238 = 88193 of them.  So there really is an
early-stop effect and the planner seems to have gotten that about right.
The problem is the rescan effect, which we can now quantify at 83658/238
or about 350:1.  Despite which, stepping through final_cost_mergejoin
finds that it estimates *zero* rescan effect.  If I force the
rescannedtuples estimate to be the correct value of 83658 - 238 = 83420,
it properly decides that a materialize node ought to be put in; but that
also increases its overall cost estimate for this mergejoin to 7600, which
causes it to prefer doing the join in the other direction:

 Sort  (cost=7343.28..7343.62 rows=137 width=16) (actual time=169.579..184.184 rows=83427 loops=1)
   Sort Key: b.id
   Sort Method: external merge  Disk: 2120kB
   ->  Merge Join  (cost=696.93..7338.41 rows=137 width=16) (actual time=16.460..108.562 rows=83427 loops=1)
         Merge Cond: (a.id = b.a_id)
         ->  Index Scan using a_pkey on a  (cost=0.42..24366.00 rows=805 width=8) (actual time=16.442..63.197 rows=238 loops=1)
               Filter: (nonce = 64)
               Rows Removed by Filter: 87955
         ->  Index Scan using a_idx on b  (cost=0.29..2708.59 rows=83427 width=16) (actual time=0.013..26.811 rows=83427 loops=1)
 Execution time: 190.441 ms

which is an even more satisfactory outcome; the cheaper merge direction
is properly estimated as cheaper, and this is actually a tad faster than
the hash join for me.

So the entire problem here is a bogus rescannedtuples estimate.  The code
comment about that estimate is

     * When there are equal merge keys in the outer relation, the mergejoin
     * must rescan any matching tuples in the inner relation. This means
     * re-fetching inner tuples; we have to estimate how often that happens.
     *
     * For regular inner and outer joins, the number of re-fetches can be
     * estimated approximately as size of merge join output minus size of
     * inner relation. Assume that the distinct key values are 1, 2, ..., and
     * denote the number of values of each key in the outer relation as m1,
     * m2, ...; in the inner relation, n1, n2, ...  Then we have
     *
     * size of join = m1 * n1 + m2 * n2 + ...
     *
     * number of rescanned tuples = (m1 - 1) * n1 + (m2 - 1) * n2 + ... = m1 *
     * n1 + m2 * n2 + ... - (n1 + n2 + ...) = size of join - size of inner
     * relation
     *
     * This equation works correctly for outer tuples having no inner match
     * (nk = 0), but not for inner tuples having no outer match (mk = 0); we
     * are effectively subtracting those from the number of rescanned tuples,
     * when we should not.  Can we do better without expensive selectivity
     * computations?
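
Restating the comment's arithmetic in symbols (a paraphrase, not from the source): with $m_k$ outer tuples and $n_k$ inner tuples for key $k$, the true number of rescanned tuples is

$$ \sum_{k:\,m_k \ge 1} (m_k - 1)\,n_k, $$

while the estimate "size of join minus size of inner relation" is

$$ \sum_k m_k n_k - \sum_k n_k \;=\; \sum_k (m_k - 1)\,n_k. $$

The two agree except for keys that exist only on the inner side: a key with $m_k = 0$ should contribute zero rescans, but the estimate subtracts $n_k$ for it, so it comes out too low by $\sum_{k:\,m_k = 0} n_k$.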

Some investigation of the actual key values shows that indeed there are a
whole lot of a.id values that have no match in b.a_id, so I think the
comment at the end is telling us what the problem is.  Is there a better
way to make this estimate?

            regards, tom lane