Thread: increasing effective_cache_size slows down join queries by a factor of 4000x


increasing effective_cache_size slows down join queries by a factor of 4000x

From: Artyom Shaposhnikov
I recently moved a Postgres DB to a more powerful server with 1TB of
RAM instead of the previous 64GB. To my surprise, after tuning the new
server, join performance deteriorated by 4000x compared to the old
server. I carefully examined all of the changes and found the culprit:

If I set effective_cache_size > 25GB, this query plan is used:

Limit  (cost=124.12..590.33 rows=1000 width=205) (actual time=42326.662..42336.212 rows=1000 loops=1)
  ->  Nested Loop  (cost=124.12..31909018.46 rows=68443040 width=205) (actual time=42326.660..42336.127 rows=1000 loops=1)
        ->  Merge Join  (cost=124.12..30710356.97 rows=68443040 width=169) (actual time=42326.613..42332.764 rows=1000 loops=1)
              Merge Cond: (d.id = dc.data_id)
              ->  Nested Loop  (cost=1.00..31036282.72 rows=58785023 width=165) (actual time=0.042..5.533 rows=854 loops=1)
                    ->  Index Scan using data_pkey on data t  (cost=0.57..4787030.00 rows=58785023 width=131) (actual time=0.023..0.526 rows=854 loops=1)
                          Index Cond: (id > 205284974)
                    ->  Index Scan using user_pkey on data_user u  (cost=0.43..0.45 rows=1 width=42) (actual time=0.005..0.005 rows=1 loops=854)
                          Index Cond: (id = d.user_id)
              ->  Index Only Scan using data_class_pkey on data_class ta  (cost=0.57..4935483.78 rows=216964862 width=8) (actual time=0.018..35022.908 rows=151321889 loops=1)
                    Heap Fetches: 151321889
        ->  Index Scan using class_id_index on class a  (cost=0.00..0.02 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1000)
              Index Cond: (id = dc.class_id)
Planning Time: 4.114 ms
Execution Time: 42336.397 ms

and it is 4000x slower than the plan chosen with a lower
effective_cache_size, which uses index lookups instead of the merge join:

Limit  (cost=1.57..4832.30 rows=1000 width=205) (actual time=0.081..10.457 rows=1000 loops=1)
  ->  Nested Loop  (cost=1.57..330629805.46 rows=68443040 width=205) (actual time=0.080..10.378 rows=1000 loops=1)
        ->  Nested Loop  (cost=1.57..267793481.39 rows=68443040 width=169) (actual time=0.065..7.496 rows=1000 loops=1)
              ->  Nested Loop  (cost=1.00..100917823.18 rows=58785023 width=165) (actual time=0.040..5.424 rows=854 loops=1)
                    ->  Index Scan using data_pkey on data t  (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
                          Index Cond: (id > 205284974)
                    ->  Index Scan using user_pkey on data_user u  (cost=0.43..1.35 rows=1 width=42) (actual time=0.005..0.005 rows=1 loops=854)
                          Index Cond: (id = d.user_id)
              ->  Index Only Scan using data_class_pkey on data_class ta  (cost=0.57..2.80 rows=4 width=8) (actual time=0.002..0.002 rows=1 loops=854)
                    Index Cond: (data_id = d.id)
                    Heap Fetches: 1000
        ->  Index Scan using class_id_index on class a  (cost=0.00..0.92 rows=1 width=44) (actual time=0.002..0.002 rows=1 loops=1000)
              Index Cond: (id = dc.class_id)
Planning Time: 5.074 ms
Execution Time: 10.614 ms

query:

explain analyze select d.time as time,d.id as id, u.username as
username, a.query_symbol as query_symbol from data as d, data_user as
u, class as a, data_class as dc
   where dc.class_id = a.id and dc.data_id = d.id and d.user_id = u.id
and d.id > 205284974
     order by d.id
     limit 1000;

I found a way to fix it by creating ndistinct extended statistics on
the largest join table:

create statistics stat_data_class (ndistinct) on data_id, class_id
from data_class;
analyze data_class;

Question: are there other ways to give PostgreSQL a hint to use the
indexes instead of the merge join plan, without artificially lowering
effective_cache_size below the memory actually available and without
creating the additional statistics on the table?
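
For illustration, one way to avoid touching the global setting is to scope the planner GUC to a single transaction; this is only a sketch (SET LOCAL reverts at commit/rollback, and the value below just needs to be under the ~25GB threshold observed above):

begin;
set local effective_cache_size = '24GB';
explain analyze select d.time as time, d.id as id, u.username as username,
       a.query_symbol as query_symbol
  from data as d, data_user as u, class as a, data_class as dc
 where dc.class_id = a.id and dc.data_id = d.id and d.user_id = u.id
   and d.id > 205284974
 order by d.id
 limit 1000;
commit;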

Thank you,

-Art



Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Michael Lewis
What does the row estimate look like on the scan of the data table with that statistic in place? Anytime the stats give a mis-estimate this far off, I wouldn't expect the plans to be optimal except by luck.

Index Scan using data_pkey on data t (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
Index Cond: (id > 205284974)

Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Artyom Shaposhnikov
The row estimate became ~1000x smaller with the stat in place, so it
looks like the planner grossly misestimates the row counts, and hence
the query plans, without the stats for large tables representing M:M relations.
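
For reference, a sketch of how to inspect the n-distinct estimate that ANALYZE stored for the statistics object created above (the pg_stats_ext view exists in PostgreSQL 12 and later):

select statistics_name, attnames, n_distinct
  from pg_stats_ext
 where statistics_name = 'stat_data_class';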

On Wed, Feb 2, 2022 at 11:47 AM Michael Lewis <mlewis@entrata.com> wrote:
>
> What does the row estimate look like on the scan of data table with that statistic in place? Anytime the stats give a mis-estimate this far off, I wouldn't expect that plans would be optimal except by luck.
>
> Index Scan using data_pkey on data t (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
> Index Cond: (id > 205284974)



Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Imre Samu
> Question: are there other ways to give PostgreSQL a hint 

What is your pg version?

Maybe with pg_hint_plan extension ( https://github.com/ossc-db/pg_hint_plan  )
"pg_hint_plan makes it possible to tweak PostgreSQL execution plans using so-called "hints" in SQL comments, like /*+ SeqScan(a) */."

regards,
 Imre





Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Tomas Vondra
On 2/2/22 22:10, Artyom Shaposhnikov wrote:
> the row estimate became ~1000x smaller with the stat in place, so it
> looks like it grossly miscalculates the query plans without the stats
> for large tables representing M:M relations.
> 

Well, if the estimates are significantly off (and 3 orders of magnitude 
certainly qualifies), then all bets are off. There's no magical option 
that'd fix planning in such conditions.

Ultimately, fixing the estimates (e.g. by creating extended statistics) 
is the right "fix" as it gives the optimizer the information needed to 
pick the right plan.

> On Wed, Feb 2, 2022 at 11:47 AM Michael Lewis <mlewis@entrata.com> wrote:
>>
>> What does the row estimate look like on the scan of data table with that statistic in place? Anytime the stats give a mis-estimate this far off, I wouldn't expect that plans would be optimal except by luck.
>>
>> Index Scan using data_pkey on data t (cost=0.57..21427806.53 rows=58785023 width=131) (actual time=0.024..0.482 rows=854 loops=1)
>> Index Cond: (id > 205284974)
> 

It'd be interesting to see the plans without the LIMIT, as that makes 
the "actual" values low simply by terminating early.


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: A Shaposhnikov
I made a mistake yesterday claiming that the created statistics
changed the row counts in the estimates - it did not  - I looked at
the wrong query yesterday. In the correct query plan the row estimate
still differs from the actual by many orders of magnitude:

Nested Loop  (cost=1.01..27584834.53 rows=59608439 width=164) (actual
time=0.047..2.723 rows=854 loops=1)

It completely ignores the "limit" in SQL statement when doing the row
estimates - the estimates do not change when I change it. The business
logic really needs only 1000 rows in the result, I never need the
whole table in the result. The query always runs fast if the limit is
<= 46, when the limit >=47, it sometimes chooses to do a merge join on
2 tables with hundreds of millions of rows  instead of using the
indexes. The runtime difference is 4000x.

I migrated the data to the latest postgres 14.1. Both versions run
either the slow plan or the fast plan seemingly at random. The only
reliable way to make it choose the fast plan is to decrease the
effective_cache_size to a value 20x lower than the memory available in
the system. Dropping and creating the statistics on the join table
makes no difference in the estimated row counts and query plans.
PostgreSql seems to be caching something internally and choosing the
query plans at random - sometimes it is fast, sometimes 4000x slower
without any changes in configuration or statistics. The runtime
difference is 4000x.  Such randomness is clearly unacceptable. I think
I would have to try to use the "hint" extension suggested by Imre to
make it consistently choose the fast plan.
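
One quick way to confirm that the merge join itself is the slow path (a diagnostic sketch only, not a production setting) is to disable merge joins for a single transaction and compare the plans:

begin;
set local enable_mergejoin = off;
explain analyze select d.time, d.id, u.username, a.query_symbol
  from data as d, data_user as u, class as a, data_class as dc
 where dc.class_id = a.id and dc.data_id = d.id and d.user_id = u.id
   and d.id > 205284974
 order by d.id
 limit 1000;
rollback;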






Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Vijaykumar Jain


There was a similar concern raised on Slack, where the shared plans showed that a lower effective_cache_size resulted in faster execution:

"
Hey all, I have been diagnosing a slow running query and found that it speeds up when the effective_cache_size is reduced.  It's set at ~21.5gb, and lowering to 10gb or under massively improves the query.  Plans are here and here.  You can see the difference is 33 - 42, where the order of execution results in a bunch more rows having to be filtered out later in the first plan.  Everything I've read suggests that a bigger effective_cache_size  should be better (assuming it's within the recommended boundaries of total memory of course).  Can anyone offer any insight into why this might happen, and what a good workaround would be?
"

https://explain.depesz.com/s/VsaY
https://explain.depesz.com/s/nW3d

I am not sure, though, that effective_cache_size is the culprit. If there are bad estimates, even a simple query like the one below produces a poor plan.
I did try to check the code in postgres/costsize.c at master · postgres/postgres (github.com) (I am not a developer, but I try to read the English in the code) and tried to understand how effective_cache_size alters the cost and changes the path, but I could not simulate it on my laptop with small tables and fewer resources.
But yeah, I know 1TB is a huge setup, but can you run vacuumdb and analyze all the tables in question (maybe reindex concurrently if required) and check whether the row stats are still off in the plan and whether reducing effective_cache_size improves the execution time? I am not an expert, but am just trying to figure out whether this is a red herring.

I also play with some of the GUCs described in the PostgreSQL documentation (13: 19.7. Query Planning) to check whether enabling/disabling some of them results in a big difference in execution, and then try to zero in on what needs an immediate vacuum or reindex, or bumping resources if required to use more cache.


postgres=# create table t(id int primary key, col1 int);
CREATE TABLE
postgres=# alter table t set (autovacuum_enabled = false, toast.autovacuum_enabled = false);
ALTER TABLE

postgres=# insert into t select x, 1 from generate_series(1, 1000000) x;
INSERT 0 1000000

postgres=# \x
Expanded display is off.

postgres=# do $$
declare i int;
begin
for i in 1..10 loop
update t set col1 = i::int;
commit;
end loop;
end; $$;
DO
---this is the default plan picked up by optimizer (tables still not analyzed)

postgres=# explain analyze select * from t where id < 100;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..1472242.50 rows=28995800 width=8) (actual time=198.740..4849.556 rows=99 loops=1)
   Filter: (id < 100)
   Rows Removed by Filter: 999901
 Planning Time: 1.296 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 1.194 ms, Inlining 4.383 ms, Optimization 13.051 ms, Emission 6.954 ms, Total 25.581 ms
 Execution Time: 4850.870 ms
(9 rows)

--play smart, disable seqscan
postgres=# set enable_seqscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=1.312..2.313 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=32
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..228440.94 rows=28995800 width=0) (actual time=1.295..1.296 rows=126 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.054 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 7.309 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 7.309 ms
 Execution Time: 9.715 ms
(11 rows)

# rule out cache
postgres=# set enable_seqscan TO 1;
SET
postgres=# explain analyze select * from t where id < 100;
                                                  QUERY PLAN
--------------------------------------------------------------------------------------------------------------
 Seq Scan on t  (cost=0.00..1472242.50 rows=28995800 width=8) (actual time=196.485..4778.942 rows=99 loops=1)
   Filter: (id < 100)
   Rows Removed by Filter: 999901
 Planning Time: 0.052 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.286 ms, Inlining 1.389 ms, Optimization 4.462 ms, Emission 2.765 ms, Total 8.902 ms
 Execution Time: 4779.314 ms
(9 rows)

#again disable seqscan
postgres=# set enable_seqscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=0.028..0.101 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=32
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..228440.94 rows=28995800 width=0) (actual time=0.017..0.017 rows=126 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.047 ms
 JIT:
   Functions: 2
   Options: Inlining true, Optimization true, Expressions true, Deforming true
   Timing: Generation 0.834 ms, Inlining 0.000 ms, Optimization 0.000 ms, Emission 0.000 ms, Total 0.834 ms
 Execution Time: 0.978 ms
(11 rows)

#disable jit (even faster)
postgres=# set jit TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                         QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=235689.89..1646219.33 rows=28995800 width=8) (actual time=0.030..0.115 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=32
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..228440.94 rows=28995800 width=0) (actual time=0.018..0.018 rows=126 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.053 ms
 Execution Time: 0.137 ms
(7 rows)

#disable bitmap scan (faster still)
postgres=# set enable_bitmapscan TO 0;
SET
postgres=# explain analyze select * from t where id < 100;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.44..2057998.94 rows=28995800 width=8) (actual time=0.018..0.147 rows=99 loops=1)
   Index Cond: (id < 100)
 Planning Time: 0.053 ms
 Execution Time: 0.166 ms
(4 rows)

#now cleanup dead tuples, and analyze to update stats and reset all gucs
postgres=# vacuum full t;  -- note full not required, 'vacuum analyze t'  would suffice too.
VACUUM
postgres=#  set enable_bitmapscan TO default;
SET
postgres=#  set enable_seqscan TO default;
SET
postgres=#  set jit to default;
SET
postgres=# explain analyze select * from t where id < 100;
                                                       QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=6243.76..14835.42 rows=333333 width=8) (actual time=0.028..0.042 rows=99 loops=1)
   Recheck Cond: (id < 100)
   Heap Blocks: exact=3
   ->  Bitmap Index Scan on t_pkey  (cost=0.00..6160.42 rows=333333 width=0) (actual time=0.017..0.018 rows=99 loops=1)
         Index Cond: (id < 100)
 Planning Time: 0.819 ms
 Execution Time: 0.061 ms
(7 rows)

postgres=# analyze t;
ANALYZE

#the default plan after vacuum analyze is the fastest.
postgres=# explain analyze select * from t where id < 100;
                                                 QUERY PLAN
-------------------------------------------------------------------------------------------------------------
 Index Scan using t_pkey on t  (cost=0.42..10.62 rows=92 width=8) (actual time=0.005..0.029 rows=99 loops=1)
   Index Cond: (id < 100)
 Planning Time: 0.159 ms
 Execution Time: 0.045 ms
(4 rows)
 

Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: A Shaposhnikov
The random_page_cost = 1.2 and seq_page_cost = 1 in my setup are nearly
the same. I also ran vacuum analyze on all tables. I have no idea why
the optimizer thinks that merging two 200M-row tables is faster
than 50 index lookups, other than that it "thinks" in-memory merges
are very low cost regardless of size. Another guess is that it almost
ignores the "limit" parameter or uses it in some incorrect way.



Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Michael Lewis
I would guess that autovacuum is analyzing the table and causing the stats to change, which results in different estimates and hence different plans. Unless you can get the estimate much more accurate, you won't get far expecting a stable plan that performs well.

How is data_class_pkey defined? If you run a query like this, how far off are the estimates?


explain analyze
select d.time as time, d.id as id, a.query_symbol as query_symbol
from data as d
join data_class as dc ON dc.data_id = d.id
join class as a ON dc.class_id = a.id
where d.id > 205284974
order by d.id
limit 1000;

If you run 'analyze data (id);' and then run the query again, do you get a better estimate? Have you tried adjusting default_statistics_target? Are you running the default value for random_page_cost with SSDs?

I'm seeing Index Only Scan nodes but a high number of heap fetches, so it seems like you would benefit from a vacuum to update the pg_class.relallvisible value.
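
A sketch of that check and fix: pg_class.relallvisible versus relpages shows how much of the table the visibility map covers, and a vacuum refreshes it so index-only scans stop falling back to heap fetches (table name taken from the plans above):

select relname, relpages, relallvisible
  from pg_class
 where relname = 'data_class';

vacuum (verbose, analyze) data_class;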

Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: A Shaposhnikov
Tomas,

thank you! The query:

select 1 from data as d, data_class as dc
    where dc.data_id = d.id and d.id > 205284974
      and dc.data_id > 205284974     -- new condition
   order by d.id
   limit 1000;

totally solved it - it is now fast under all conditions! I thought
that the optimizer would be able to infer it itself.

Thank you

On Thu, Feb 3, 2022 at 4:24 PM Tomas Vondra
<tomas.vondra@enterprisedb.com> wrote:
>
>
>
> On 2/3/22 20:32, A Shaposhnikov wrote:
> > I made a mistake yesterday claiming that the created statistics
> > changed the row counts in the estimates - it did not  - I looked at
> > the wrong query yesterday. In the correct query plan the row estimate
> > still differs from the actual by many orders of magnitude:
> >
> > Nested Loop  (cost=1.01..27584834.53 rows=59608439 width=164) (actual
> > time=0.047..2.723 rows=854 loops=1)
> >
> > It completely ignores the "limit" in SQL statement when doing the row
> > estimates - the estimates do not change when I change it. The business
> > logic really needs only 1000 rows in the result, I never need the
> > whole table in the result. The query always runs fast if the limit is
> > <= 46, when the limit >=47, it sometimes chooses to do a merge join on
> > 2 tables with hundreds of millions of rows  instead of using the
> > indexes. The runtime difference is 4000x.
> >
>
> Yes, it ignores LIMIT while calculating the estimates - the estimates
> below LIMIT are as if the query executed to completion. But the "actual"
> values are affected by LIMIT, because the execution terminates early. So
> you can't compare these two values and conclude the estimates are off.
>
> Let me illustrate this on an example:
>
> create table t (a int, b int);
> insert into t select i, i from generate_series(1,1000000) s(i);
> create index on t (a,b);
> vacuum analyze t;
>
> explain analyze select b from t order by a limit 10;
>                        QUERY PLAN
> -----------------------------------------------------------------------
>   Limit  (cost=0.42..0.79 rows=10 width=8)
>          (actual time=0.018..0.033 rows=10 loops=1)
>     ->  Index Only Scan using t_a_b_idx on t
>         (cost=0.42..36214.93 rows=1000000 width=8)
>         (actual time=0.016..0.021 rows=10 loops=1)
>           Heap Fetches: 10
>   Planning Time: 0.047 ms
>   Execution Time: 0.049 ms
>
> The database has *perfect* stats in this case and the estimates are spot
> on too. But comparing rows=1000000 and rows=10 would lead to a bogus
> conclusion that the estimates are wrong.
>
> Which is why I suggested getting an explain for the query without the
> limit clause.
>
> > I migrated the data to the latest postgres 14.1. Both versions run
> > either the slow plan or the fast plan seemingly at random. The only
> > reliable way to make it choose the fast plan is to decrease the
> > effective_cache_size to a value 20x lower than the memory available in
> > the system. Dropping and creating the statistics on the join table
> > makes no difference in the estimated row counts and query plans.
> > PostgreSql seems to be caching something internally and choosing the
> > query plans at random - sometimes it is fast, sometimes 4000x slower
> > without any changes in configuration or statistics. The runtime
> > difference is 4000x.  Such randomness is clearly unacceptable. I think
> > I would have to try to use the "hint" extension suggested by Imre to
> > make it consistently choose the fast plan.
> >
>
> Does the plan change depending on parameters in the query? Or does it
> change if you run the same query (including the same parameters in all
> the conditions)?
>
> One possible explanation is that there's some sort of correlation
> between parameters. Limit assumes the matching rows are distributed
> uniformly in the input, and if that's not true (due to some sort of
> correlation), it may pick the wrong plan.
>
> In this case there are joins so the correlation may be more complicated
> - e.g. through a join condition, between (dc.class_id, dc.data_id) or
> something like that.
>
> This part of the plan hints this might be the case, because it forces
> scanning a huge part of data_class table just to get to the interesting
> data_id values.
>
>    -> Index Only Scan using data_class_pkey on data_class ta
>       (cost=0.57..4935483.78 rows=216964862 width=8)
>       (actual time=0.018..35022.908 rows=151321889 loops=1)
>       Heap Fetches: 151321889
>
> That's hard to fix, because there's no way to tell the database about
> such dependencies. I'd try two things:
>
> (1) Try putting the same condition on ID on data_class too. I mean, if
> you do
>
>    select 1 from data as d, data_class as dc
>     where dc.data_id = d.id and d.id > 205284974
>    order by d.id
>    limit 1000;
>
> then clearly dc.data_id > 205284974, but the database does not realize
> it and will scan the index from start for mergejoin. Nestedloop does
> direct lookup for individual ID values, so it doesn't have this issue.
>
> So try this:
>
>    select 1 from data as d, data_class as dc
>     where dc.data_id = d.id and d.id > 205284974
>       and dc.data_id > 205284974     -- new condition
>    order by d.id
>    limit 1000;
>
> This is a special case of correlation between the tables, possibly
> confusing the LIMIT estimation. We are smarter about similar cases (with
> conditions on join keys), but only for equalities.
>
> The other thing I'd try is vacuuming the data_class. The high number of
> heap fetches suggests the visibility map is in pretty poor shape, which
> is not great for index-only scans.
>
>
> regards
>
> --
> Tomas Vondra
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company



Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Tomas Vondra
On 2/4/22 05:21, A Shaposhnikov wrote:
> Tomas,
> 
> thank you! The query:
> 
> select 1 from data as d, data_class as dc
>      where dc.data_id = d.id and d.id > 205284974
>        and dc.data_id > 205284974     -- new condition
>     order by d.id
>     limit 1000;
> 
> totally solved it - it is now fast under all conditions! I thought
> that the optimizer would be able to infer it itself.
> 

Unfortunately, the optimizer is not that smart - we can do that for 
equality conditions, but not for other operators. There was actually a 
thread [1] exploring a possibility to extend this to inequalities, but 
it went nowhere so far. It also explains why it's done only for equality 
operators. In short, it's fairly expensive, makes costing of joins more 
difficult, and most queries can't benefit from it (because conditions on 
join keys are not that common).

BTW, what does the final query plan look like? Is it using the merge join
or the nested loop? I wonder if this might be formulated as a costing issue,
pushing the planner to use the nested loop.


[1] 
https://www.postgresql.org/message-id/flat/CAFQUnFhqkWuPCwQ1NmHYrisHJhYx4DoJak-dV%2BFcjyY6scooYA%40mail.gmail.com


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: A Shaposhnikov

I started using the latest postgres 14.2 and the query plans there for a simplified query joining just 2 tables look like:

explain analyze select t.*, ta.* from team as t, team_aliases as ta where ta.team_id = t.id and t.id > 200000000 order by t.id limit 1000;

                                                                     QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=81.33..331.82 rows=1000 width=183) (actual time=31328.561..31329.565 rows=1000 loops=1)
   ->  Merge Join  (cost=81.33..17727020.90 rows=70768783 width=183) (actual time=31328.560..31329.498 rows=1000 loops=1)
         Merge Cond: (t.id = ta.team_id)
         ->  Index Scan using team_pkey on team t  (cost=0.57..11382381.88 rows=78693167 width=175) (actual time=0.016..0.466 rows=854 loops=1)
               Index Cond: (id > 205284974)
         ->  Index Scan using fki_team_fk on team_aliases ta  (cost=0.57..4893726.20 rows=218604096 width=8) (actual time=0.010..22172.405 rows=151321830 loops=1)
 Planning Time: 0.472 ms
 Execution Time: 31329.654 ms
(8 rows)


Now if I add the second condition as Tomas suggested, it speeds it up by a factor of 15,000:

explain analyze select t.*, ta.* from team as t, team_aliases as ta where ta.team_id = t.id and t.id > 200000000 and ta.team_id > 200000000 order by t.id limit 1000;

                                                              QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=81.33..720.48 rows=1000 width=183) (actual time=0.073..1.909 rows=1000 loops=1)
   ->  Merge Join  (cost=81.33..13859983.71 rows=21685030 width=183) (actual time=0.072..1.756 rows=1000 loops=1)
         Merge Cond: (t.id = ta.team_id)
         ->  Index Scan using team_pkey on team t  (cost=0.57..11382381.88 rows=78693167 width=175) (actual time=0.016..0.695 rows=854 loops=1)
               Index Cond: (id > 205284974)
         ->  Index Scan using fki_team_fk on team_aliases ta  (cost=0.57..1896563.38 rows=66984851 width=8) (actual time=0.052..0.356 rows=1000 loops=1)
               Index Cond: (team_id > 205284974)
 Planning Time: 0.503 ms
 Execution Time: 2.056 ms
(9 rows)



Interestingly I have a second PG 14.2 database, with identical table definitions, but about 10% smaller row counts, and the exact same query works fast there without the 2nd condition:

explain analyze select t.*, ta.* from team as t, team_aliases as ta where ta.team_id = t.id and t.id > 200000000 order by t.id limit 1000;

                                                         QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=1.14..1700.75 rows=1000 width=168) (actual time=0.062..4.841 rows=1000 loops=1)
   ->  Nested Loop  (cost=1.14..5685654.91 rows=3345265 width=168) (actual time=0.060..4.700 rows=1000 loops=1)
         ->  Index Scan using team_pkey on team t  (cost=0.57..2366113.83 rows=2807531 width=160) (actual time=0.031..0.801 rows=888 loops=1)
               Index Cond: (id > 205284974)
         ->  Index Scan using fki_team_fk on team_aliases ta  (cost=0.57..1.14 rows=4 width=8) (actual time=0.003..0.004 rows=1 loops=888)
               Index Cond: (team_id = t.id)
 Planning Time: 0.559 ms
 Execution Time: 4.953 ms


In both databases I ran "vacuum full analyze" on all tables. I am not sure why it decides to apply the join condition so much later in the plan when the query runs slow. Basically, it seems to randomly decide to produce two huge multimillion-row inputs and then merge them on the join condition instead of looping 1000 times using the indexes. Obviously, the optimizer grossly miscalculates the costs; how can we help it?
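
One possible workaround, sketched here as an assumption rather than something tested in this thread: apply the LIMIT to the driving table first by materializing it in a CTE, and only then join. Note this assumes every matching team id has at least one alias; otherwise the inner limit needs a margin:

with first_teams as materialized (
    select id
      from team
     where id > 200000000
     order by id
     limit 1000
)
select t.*, ta.*
  from first_teams f
  join team t on t.id = f.id
  join team_aliases ta on ta.team_id = t.id
 order by t.id
 limit 1000;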





Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Michael Lewis
I think that, theoretically, if the planner sees a condition like a.column1 > constant_value, and it also has a condition like a.column1 = b.column2, then it could autogenerate the b.column2 > constant_value condition. And of course > could be <, <=, >= or <>.

But I could be wrong, particularly with respect to nulls and how = is mostly the same as 'is not distinct from', except for 'select null = null' being null and 'select null is not distinct from null' being true. Perhaps if either/both columns have a not null constraint, then this could still be done. Whether or not it should be is a separate question.
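
The NULL distinction described above, shown directly:

select null = null;                     -- yields NULL, not true
select null is not distinct from null;  -- yields true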

Re: increasing effective_cache_size slows down join queries by a factor of 4000x

From: Tomas Vondra

On 2/15/22 01:06, A Shaposhnikov wrote:
> 
> Interestingly I have a second PG 14.2 database, with identical table 
> definitions, but about 10% smaller row counts, and the exact same query 
> works fast there without the 2nd condition:
> 

Are you sure about the 10%? Because in the plans from the first machine 
I see this:

 >
 > -> Index Scan using team_pkey on team t (cost=0.57..11382381.88
 > rows=78693167 width=175) (actual time=0.016..0.695 rows=854 loops=1)
 >

while the second machine does this:

> 
> -> Index Scan using team_pkey on team t (cost=0.57..2366113.83 
> rows=2807531 width=160) (actual time=0.031..0.801 rows=888 loops=1)
> 

That's 2.8M vs. 78M, quite far from a "10% difference". Not sure about
the team_aliases table; that's impossible to say from the plans.

This may matter a lot, because we use effective_cache_size to calculate
the cache hit ratio for the query, with relation sizes as an input. So
smaller relations (or a larger effective_cache_size) mean cheaper random
I/O, hence a preference for the nested loop join.

The other thing is data distribution - that may matter too.


IMO it's pointless to investigate this further - we know what's causing
the issue. The optimizer is oblivious to the fact that the merge join
will have to skip a large part of the second input, due to the implicit
condition. Notice that adding the condition changes the cost from:

  Limit (cost=81.33..331.82 rows=1000 width=183) ...

to

  Limit (cost=81.33..720.48 rows=1000 width=183) ...

So it seems *more* expensive than the first plan. Taken to the extreme 
the planner could theoretically have chosen to use the first plan (and 
delay the condition until after the join).

regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company