Обсуждение: Terribly slow query with very good plan?

Поиск

Список

Период

Сортировка

Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 09:11:31

Hello,

I have a table that contains folders, and another one that contains files.

Here are the table definitions. I have removed most of the columns because they are not important for this question. (There are lots of columns.)

CREATE TABLE media.oo_folder (
id int8 NOT NULL,
is_active bool NOT NULL DEFAULT true,
title text NOT NULL,
relpath text NOT NULL,
CONSTRAINT chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT oo_folder_chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT pk_oo_folder PRIMARY KEY (id),
CONSTRAINT fk_oo_folder_parent_id FOREIGN KEY (parent_id) REFERENCES media.oo_folder(id) ON DELETE CASCADE DEFERRABLE
);
CREATE INDEX oo_folder_idx_parent ON media.oo_folder USING btree (parent_id);
CREATE INDEX oo_folder_idx_relpath ON media.oo_folder USING btree (relpath);
CREATE UNIQUE INDEX uidx_oo_folder_active_title ON media.oo_folder USING btree (parent_id, title) WHERE is_active;

CREATE TABLE media.oo_file (
id int8 NOT NULL,
is_active bool NOT NULL DEFAULT true,
title text NOT NULL,
ext text NULL,
relpath text NOT NULL,
sha1 text NOT NULL,
CONSTRAINT chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT oo_file_chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT pk_oo_file PRIMARY KEY (id),
CONSTRAINT fk_oo_file_oo_folder_id FOREIGN KEY (oo_folder_id) REFERENCES media.oo_folder(id) ON DELETE CASCADE DEFERRABLE,
);
CREATE INDEX oo_file_idx_oo_folder_id ON media.oo_file USING btree (oo_folder_id);
CREATE INDEX oo_file_idx_relpath ON media.oo_file USING btree (relpath);
CREATE INDEX oo_file_idx_sha1 ON media.oo_file USING btree (sha1);
CREATE UNIQUE INDEX uidx_oo_file_active_title ON media.oo_file USING btree (oo_folder_id, title) WHERE is_active;

The "replath" field contains the path of the file/folder. For example: "/folder1/folder2/folder3/filename4.ext5". The replath field is managed by triggers. There are about 1M rows for files and 600K folder rows in the database. The files are well distributed between folders, and there are only 45 root folders ( parent_id is null)

This query runs very fast:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) select id, title from media.oo_folder f where f.parent_id is null

QUERY PLAN |
------------------------------------------------------------------------------------------------------------------------------------------+
Index Scan using oo_folder_idx_parent on media.oo_folder f (cost=0.42..73.70 rows=20 width=25) (actual time=0.030..0.159 rows=45 loops=1)|
Output: id, title |
Index Cond: (f.parent_id IS NULL) |
Buffers: shared hit=40 |
Planning Time: 0.123 ms |
Execution Time: 0.187 ms |

My task is to write a query that tells if a folder has any active file inside it - directly or in subfolders. Here is the query for that:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)

select id, title,
(exists (select f2.id from media.oo_file f2 where f2.relpath like f.relpath || '%')) as has_file
from media.oo_folder f where f.parent_id is null

QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------------+
Index Scan using oo_folder_idx_parent on media.oo_folder f (cost=0.42..488.02 rows=20 width=26) (actual time=713.419..25414.969 rows=45 loops=1) |
Output: f.id, f.title, (SubPlan 1) |
Index Cond: (f.parent_id IS NULL) |
Buffers: shared hit=7014170 |
SubPlan 1 |
-> Index Only Scan using oo_file_idx_relpath on media.oo_file f2 (cost=0.55..108499.27 rows=5381 width=0) (actual time=564.756..564.756 rows=0 loops=45)|
Filter: (f2.relpath ~~ (f.relpath || '%'::text)) |
Rows Removed by Filter: 792025 |
Heap Fetches: 768960 |
Buffers: shared hit=7014130 |
Planning Time: 0.361 ms |
Execution Time: 25415.088 ms |

It also returns 45 rows, but in 25 seconds which is unacceptable.

It I execute the "has_file" subquery for one specific relpath then it speeds up again, to < 1msec:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select exists ( select id from media.oo_file of2 where relpath like 'Felhasználók%')

QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------+
Result (cost=1.66..1.67 rows=1 width=1) (actual time=0.049..0.050 rows=1 loops=1) |
Output: $0 |
Buffers: shared hit=2 |
InitPlan 1 (returns $0) |
-> Seq Scan on media.oo_file of2 (cost=0.00..144714.70 rows=86960 width=0) (actual time=0.044..0.044 rows=1 loops=1)|
Filter: (of2.relpath ~~ 'Felhasználók%'::text) |
Rows Removed by Filter: 15 |
Buffers: shared hit=2 |
Planning Time: 0.290 ms |
Execution Time: 0.076 ms |

In other words, I could write a pl/sql function with a nested loop instead of the problematic query, and it will be 1000 times faster.

What am I missing?

Thanks,

Laszlo

Re: Terribly slow query with very good plan?

От

Laurenz Albe

Дата:

04 февраля 2022 г., 09:18:43

On Fri, 2022-02-04 at 10:11 +0100, Les wrote:
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
> 
> select id, title,
>  (exists (select f2.id from media.oo_file f2 where f2.relpath like f.relpath || '%')) as has_file
> from media.oo_folder f where f.parent_id is null
> 
> QUERY PLAN                                                                                                          
                                        |
 
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------+
> Index Scan using oo_folder_idx_parent on media.oo_folder f  (cost=0.42..488.02 rows=20 width=26) (actual
time=713.419..25414.969rows=45 loops=1)             |
 
>   Output: f.id, f.title, (SubPlan 1)                                                                                
                                        |
 
>   Index Cond: (f.parent_id IS NULL)                                                                                  
                                       |
 
>   Buffers: shared hit=7014170                                                                                        
                                       |
 
>   SubPlan 1                                                                                                          
                                       |
 
>     ->  Index Only Scan using oo_file_idx_relpath on media.oo_file f2  (cost=0.55..108499.27 rows=5381 width=0)
(actualtime=564.756..564.756 rows=0 loops=45)|
 
>           Filter: (f2.relpath ~~ (f.relpath || '%'::text))                                                          
                                        |
 
>           Rows Removed by Filter: 792025                                                                            
                                        |
 
>           Heap Fetches: 768960                                                                                      
                                        |
 
>           Buffers: shared hit=7014130                                                                                
                                       |
 
> Planning Time: 0.361 ms                                                                                              
                                       |
 
> Execution Time: 25415.088 ms                                                                                        
                                        |
 
> 
> It also returns 45 rows, but in 25 seconds which is unacceptable. 

You should create an index that supports LIKE; for example

CREATE INDEX ON media.oo_file (relpath COLLATE "C");

Yours,
Laurenz Albe
-- 
Cybertec | https://www.cybertec-postgresql.com

Re: Terribly slow query with very good plan?

От

Pavel Stehule

Дата:

04 февраля 2022 г., 09:23:03

pá 4. 2. 2022 v 10:11 odesílatel Les <nagylzs@gmail.com> napsal:

Hello,

I have a table that contains folders, and another one that contains files.

Here are the table definitions. I have removed most of the columns because they are not important for this question. (There are lots of columns.)

CREATE TABLE media.oo_folder (
id int8 NOT NULL,
is_active bool NOT NULL DEFAULT true,
title text NOT NULL,
relpath text NOT NULL,
CONSTRAINT chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT oo_folder_chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT pk_oo_folder PRIMARY KEY (id),
CONSTRAINT fk_oo_folder_parent_id FOREIGN KEY (parent_id) REFERENCES media.oo_folder(id) ON DELETE CASCADE DEFERRABLE
);
CREATE INDEX oo_folder_idx_parent ON media.oo_folder USING btree (parent_id);
CREATE INDEX oo_folder_idx_relpath ON media.oo_folder USING btree (relpath);
CREATE UNIQUE INDEX uidx_oo_folder_active_title ON media.oo_folder USING btree (parent_id, title) WHERE is_active;

CREATE TABLE media.oo_file (
id int8 NOT NULL,
is_active bool NOT NULL DEFAULT true,
title text NOT NULL,
ext text NULL,
relpath text NOT NULL,
sha1 text NOT NULL,
CONSTRAINT chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT oo_file_chk_no_slash CHECK (("position"(title, '/'::text) = 0)),
CONSTRAINT pk_oo_file PRIMARY KEY (id),
CONSTRAINT fk_oo_file_oo_folder_id FOREIGN KEY (oo_folder_id) REFERENCES media.oo_folder(id) ON DELETE CASCADE DEFERRABLE,
);
CREATE INDEX oo_file_idx_oo_folder_id ON media.oo_file USING btree (oo_folder_id);
CREATE INDEX oo_file_idx_relpath ON media.oo_file USING btree (relpath);
CREATE INDEX oo_file_idx_sha1 ON media.oo_file USING btree (sha1);
CREATE UNIQUE INDEX uidx_oo_file_active_title ON media.oo_file USING btree (oo_folder_id, title) WHERE is_active;

The "replath" field contains the path of the file/folder. For example: "/folder1/folder2/folder3/filename4.ext5". The replath field is managed by triggers. There are about 1M rows for files and 600K folder rows in the database. The files are well distributed between folders, and there are only 45 root folders ( parent_id is null)

This query runs very fast:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS) select id, title from media.oo_folder f where f.parent_id is null

QUERY PLAN |
------------------------------------------------------------------------------------------------------------------------------------------+
Index Scan using oo_folder_idx_parent on media.oo_folder f (cost=0.42..73.70 rows=20 width=25) (actual time=0.030..0.159 rows=45 loops=1)|
Output: id, title |
Index Cond: (f.parent_id IS NULL) |
Buffers: shared hit=40 |
Planning Time: 0.123 ms |
Execution Time: 0.187 ms |

My task is to write a query that tells if a folder has any active file inside it - directly or in subfolders. Here is the query for that:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)

select id, title,
(exists (select f2.id from media.oo_file f2 where f2.relpath like f.relpath || '%')) as has_file
from media.oo_folder f where f.parent_id is null

QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------------------+
Index Scan using oo_folder_idx_parent on media.oo_folder f (cost=0.42..488.02 rows=20 width=26) (actual time=713.419..25414.969 rows=45 loops=1) |
Output: f.id, f.title, (SubPlan 1) |
Index Cond: (f.parent_id IS NULL) |
Buffers: shared hit=7014170 |
SubPlan 1 |
-> Index Only Scan using oo_file_idx_relpath on media.oo_file f2 (cost=0.55..108499.27 rows=5381 width=0) (actual time=564.756..564.756 rows=0 loops=45)|
Filter: (f2.relpath ~~ (f.relpath || '%'::text)) |
Rows Removed by Filter: 792025 |
Heap Fetches: 768960 |
Buffers: shared hit=7014130 |
Planning Time: 0.361 ms |
Execution Time: 25415.088 ms |

It also returns 45 rows, but in 25 seconds which is unacceptable.

It I execute the "has_file" subquery for one specific relpath then it speeds up again, to < 1msec:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select exists ( select id from media.oo_file of2 where relpath like 'Felhasználók%')
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------+
Result (cost=1.66..1.67 rows=1 width=1) (actual time=0.049..0.050 rows=1 loops=1) |
Output: $0 |
Buffers: shared hit=2 |
InitPlan 1 (returns $0) |
-> Seq Scan on media.oo_file of2 (cost=0.00..144714.70 rows=86960 width=0) (actual time=0.044..0.044 rows=1 loops=1)|
Filter: (of2.relpath ~~ 'Felhasználók%'::text) |
Rows Removed by Filter: 15 |
Buffers: shared hit=2 |
Planning Time: 0.290 ms |
Execution Time: 0.076 ms |

In other words, I could write a pl/sql function with a nested loop instead of the problematic query, and it will be 1000 times faster.

What am I missing?

I don't understand how it is possible in the slow case Rows Removed by Filter: 792025 (returns 0 row) and in the second case Rows Removed by Filter: 15 (returns 1 row).

It is strange.

Thanks,

Laszlo

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 09:32:58

Laurenz Albe <laurenz.albe@cybertec.at> ezt írta (időpont: 2022. febr. 4., P, 10:18):

|
>
> It also returns 45 rows, but in 25 seconds which is unacceptable.

You should create an index that supports LIKE; for example

CREATE INDEX ON media.oo_file (relpath COLLATE "C");

CREATE INDEX test ON media.oo_file (relpath COLLATE "C");

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)

select id, title,
(exists (select f2.id from
media.oo_file f2
where f2.relpath like f.relpath || '%' )) as has_file
from media.oo_folder f where f.parent_id is null;

QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------------------+
Index Scan using oo_folder_idx_parent on media.oo_folder f (cost=0.42..459.38 rows=20 width=26) (actual time=772.566..24081.820 rows=45 loops=1)|
Output: f.id, f.title, (SubPlan 1) |
Index Cond: (f.parent_id IS NULL) |
Buffers: shared hit=6672274 |
SubPlan 1 |
-> Index Only Scan using test on media.oo_file f2 (cost=0.55..100756.64 rows=5379 width=0) (actual time=535.113..535.113 rows=0 loops=45) |
Filter: (f2.relpath ~~ (f.relpath || '%'::text)) |
Rows Removed by Filter: 777428 |
Heap Fetches: 736418 |
Buffers: shared hit=6672234 |
Planning Time: 0.338 ms |
Execution Time: 24082.152 ms |

Not helping :-(

Re: Terribly slow query with very good plan?

От

Nick Cleaton

Дата:

04 февраля 2022 г., 09:59:55

On Fri, 4 Feb 2022 at 09:11, Les <nagylzs@gmail.com> wrote:

                                                                    |
>     ->  Index Only Scan using oo_file_idx_relpath on media.oo_file f2  (cost=0.55..108499.27 rows=5381 width=0)
(actualtime=564.756..564.756 rows=0 loops=45)| 
>           Filter: (f2.relpath ~~ (f.relpath || '%'::text))
                                       | 
>           Rows Removed by Filter: 792025
                                       | 
>           Heap Fetches: 768960
                                       | 
>           Buffers: shared hit=7014130
                                       | 
> Planning Time: 0.361 ms
> Execution Time: 25415.088 ms

>     ->  Seq Scan on media.oo_file of2  (cost=0.00..144714.70 rows=86960 width=0) (actual time=0.044..0.044 rows=1
loops=1)|
>           Filter: (of2.relpath ~~ 'Felhasználók%'::text)
   | 
>           Rows Removed by Filter: 15
   | 
>           Buffers: shared hit=2
   | 
> Planning Time: 0.290 ms
   | 
> Execution Time: 0.076 ms
   | 
>
> In other words, I could write a pl/sql function with a nested loop instead of the problematic query, and it will be
1000times faster. 
>
> What am I missing?

In the fast case the 'Felhasználók%' part is known at query planning
time, so it can be a prefix search.

In the slow case, the planner doesn't know what that value will be, it
could be something that starts with '%' for example.

Also your logic looks a bit unsafe, the query you have would include
files under all top-level folders with names starting with
Felhasználók, so you could accidentally merge in files in folders
called Felhasználókfoo and Felhasználókbar for example.

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 10:00:47

I really think now that the query plan is wrong (or "could be improved" so to say). As far as I understand, the "index only scan" is essentially a sequential scan on the index data. In this specific case, where the filter is a "begins with" condition on a field that is the starting (and only) column of an index, there is a much much better way to find out if there is a row or not: lookup the closest value in the index and see if it begins with the value. The operation of looking up the closest value in an index would be much more efficient.

> I don't understand how it is possible in the slow case Rows Removed by Filter: 792025 (returns 0 row) and in the second case Rows Removed by Filter: 15 (returns 1 row).

Pavel, I think it is because the scan found a suitable row at the beginning of the scan and stopped the scan. If you look at that plan you will see that it uses a seq scan. It was fast by accident. :-)

The plan of that single-row version was changed to a normal index scan, after I added the collation "C" index:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select exists (
select id from media.oo_file of2 where relpath like 'this does not exist%'
);

QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------------+
Result (cost=0.63..0.64 rows=1 width=1) (actual time=0.022..0.023 rows=1 loops=1) |
Output: $0 |
Buffers: shared hit=4 |
InitPlan 1 (returns $0) |
-> Index Only Scan using test on media.oo_file of2 (cost=0.55..8.57 rows=108 width=0) (actual time=0.018..0.018 rows=0 loops=1)|
Index Cond: ((of2.relpath >= 'this does not exist'::text) AND (of2.relpath < 'this does not exisu'::text)) |
Filter: (of2.relpath ~~ 'this does not exist%'::text) |
Heap Fetches: 0 |
Buffers: shared hit=4 |
Planning Time: 0.530 ms |
Execution Time: 0.055 ms |

I would expect for the same originally slow query with the has_file column, but it does not happen. :-(

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 10:05:13

Nick Cleaton <nick@cleaton.net> ezt írta (időpont: 2022. febr. 4., P, 11:00):

In the fast case the 'Felhasználók%' part is known at query planning
time, so it can be a prefix search.

In the slow case, the planner doesn't know what that value will be, it
could be something that starts with '%' for example.

First of all, it CANNOT start with '%'. This is a fact and this fact can be determined by analyzing the query. Something that the query planner should do, right?

Second argument: the same query is also slow with the ^@ operator...

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)

select id, title,
(exists (select f2.id from
media.oo_file f2
where f2.relpath ^@ f.relpath )) as has_file
from media.oo_folder f where f.parent_id is null

QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------------------+
Index Scan using oo_folder_idx_parent on media.oo_folder f (cost=0.42..449.38 rows=20 width=26) (actual time=1652.624..61636.232 rows=45 loops=1)|
Output: f.id, f.title, (SubPlan 1) |
Index Cond: (f.parent_id IS NULL) |
Buffers: shared hit=6672274 |
SubPlan 1 |
-> Index Only Scan using test on media.oo_file f2 (cost=0.55..98067.11 rows=5379 width=0) (actual time=1369.665..1369.665 rows=0 loops=45) |
Filter: (f2.relpath ^@ f.relpath) |
Rows Removed by Filter: 777428 |
Heap Fetches: 736418 |
Buffers: shared hit=6672234 |
Planning Time: 0.346 ms |
Execution Time: 61636.319 ms |

Also your logic looks a bit unsafe, the query you have would include
files under all top-level folders with names starting with
Felhasználók, so you could accidentally merge in files in folders
called Felhasználókfoo and Felhasználókbar for example.

Forgive me, I typed in these examples for demonstration. The actual code uses relpath || '/%' and it avoids those cases.

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 10:09:52

In the fast case the 'Felhasználók%' part is known at query planning
time, so it can be a prefix search.

In the slow case, the planner doesn't know what that value will be, it
could be something that starts with '%' for example.

First of all, it CANNOT start with '%'. This is a fact and this fact can be determined by analyzing the query. Something that the query planner should do, right?

Second argument: the same query is also slow with the ^@ operator...

Oh I see, the query planner does not know that there will be no % characters in file and folder names.

But what is the solution then? It just seems wrong that I can speed up a query 1000 times by replacing it with a nested loop in a pl/sql function :(

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 10:38:53

First of all, it CANNOT start with '%'. This is a fact and this fact can be determined by analyzing the query. Something that the query planner should do, right?

Second argument: the same query is also slow with the ^@ operator...

Oh I see, the query planner does not know that there will be no % characters in file and folder names.

On second thought, it does not explain why it is also slow with the ^@ operator.

Re: Terribly slow query with very good plan?

От

Nick Cleaton

Дата:

04 февраля 2022 г., 10:57:13

On Fri, 4 Feb 2022 at 10:09, Les <nagylzs@gmail.com> wrote:
>
> Oh I see, the query planner does not know that there will be no % characters in file and folder names.
>
> But what is the solution then? It just seems wrong that I can speed up a query 1000 times by replacing it with a
nestedloop in a pl/sql function :(

You don't need a nested loop, doing it in two stages in pl/pgsql would
be enough I think, first get the folder name and then construct a new
query using it as a constant.

I'd use SELECT FOR SHARE when getting the folder name, so that no
other process can change it underneath you before you run your second
query.

With the ^@ operator, my guess is that because the planner knows
nothing about the folder name value it could be the empty string,
which would be a prefix of everything.

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 12:27:28

Nick Cleaton <nick@cleaton.net> ezt írta (időpont: 2022. febr. 4., P, 11:57):

With the ^@ operator, my guess is that because the planner knows
nothing about the folder name value it could be the empty string,
which would be a prefix of everything.

I think I could narrow down the problem to the simplest query possible.

The "title could be empty" does not hold for this:

CREATE index test ON media.oo_file (relpath COLLATE "C");

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath ^@ 'Természettudomány' limit 1

Limit (cost=0.00..2.70 rows=1 width=8) (actual time=14445.559..14445.561 rows=0 loops=1)
Output: id
Buffers: shared hit=22288 read=108975
-> Seq Scan on media.oo_file fi (cost=0.00..144710.65 rows=53574 width=8) (actual time=14445.555..14445.556 rows=0 loops=1)
Output: id
Filter: (fi.is_active AND (fi.relpath ^@ 'Természettudomány'::text))
Rows Removed by Filter: 1075812
Buffers: shared hit=22288 read=108975
Planning Time: 0.398 ms
Execution Time: 14445.593 ms

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'Természettudomány%' limit 1

Limit (cost=0.00..2.70 rows=1 width=8) (actual time=11222.280..11222.282 rows=0 loops=1)
Output: id
Buffers: shared hit=22320 read=108943
-> Seq Scan on media.oo_file fi (cost=0.00..144710.65 rows=53574 width=8) (actual time=11222.278..11222.279 rows=0 loops=1)
Output: id
Filter: (fi.is_active AND (fi.relpath ~~ 'Természettudomány%'::text))
Rows Removed by Filter: 1075812
Buffers: shared hit=22320 read=108943
Planning Time: 0.488 ms
Execution Time: 11222.307 ms

It is using seq scan for both cases. This is definitely wrong!

One of my collaguage has noticed that the LIKE query uses index scan for some of the letters:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'A%' limit 1;

Limit (cost=0.55..60.85 rows=1 width=8) (actual time=6.508..6.509 rows=0 loops=1)
Output: id
Buffers: shared hit=2776
-> Index Scan using test on media.oo_file fi (cost=0.55..4583.29 rows=76 width=8) (actual time=6.506..6.507 rows=0 loops=1)
Output: id
Index Cond: ((fi.relpath >= 'A'::text) AND (fi.relpath < 'B'::text))
Filter: (fi.is_active AND (fi.relpath ~~ 'A%'::text))
Rows Removed by Filter: 3784
Buffers: shared hit=2776
Planning Time: 0.543 ms
Execution Time: 6.560 ms

Actually, the number of files per starting letter is:

select substr(relpath, 0, 2), count(*)
from media.oo_file
group by substr(relpath, 0, 2)
order by count(*) desc

substr|count |
------+------+
O |386087|
F |236752|
N |167171|
Ó |111479|
T |109786|
M | 34348|
P | 19878|
L | 5657|
A | 3784|
I | 869|
C | 1|

PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan for A, I, C letters (with the "like" query).

There might be a problem with the planner here, because I think that using an index scan will always be faster than a seq scan. The number of rows for the prefix should not matter at all, because we are trying to get the first matching row only. For some reason it decides between seq/index scan based on the number of rows stored in some stats. At least it seems that way.

If I could tell the planner to use the index, I think my problem would be solved. Is there a way to put optimizer hints into the query?

There could be some improvement made to the @^ operator too, because it always uses seq scan, no matter what.

What do you think?

Re: Terribly slow query with very good plan?

От

Nick Cleaton

Дата:

04 февраля 2022 г., 12:59:21

On Fri, 4 Feb 2022 at 12:27, Les <nagylzs@gmail.com> wrote:

> PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan for A, I, C letters (with the "like" query).

That's interesting.

Does it help if you create an additional index on relpath with the
text_pattern_ops modifier, e.g.

CREATE INDEX ... USING btree (relpath text_pattern_ops);

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 13:07:59

> PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan for A, I, C letters (with the "like" query).

That's interesting.

Does it help if you create an additional index on relpath with the
text_pattern_ops modifier, e.g.

CREATE INDEX ... USING btree (relpath text_pattern_ops);

It does not help. Details below. (PostgreSQL version 12.8)

CREATE index test ON media.oo_file (relpath COLLATE "C");
CREATE INDEX test2 ON media.oo_file USING btree (relpath text_pattern_ops);
CREATE INDEX test3 ON media.oo_file USING btree (relpath collate "C" text_pattern_ops);

-- letter "A" ^@ operator -> slow seq scan

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath ^@ 'A' limit 1;

QUERY PLAN |
----------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.00..1904.09 rows=1 width=8) (actual time=10779.585..10779.587 rows=0 loops=1) |
Output: id |
Buffers: shared hit=9960 read=121303 |
-> Seq Scan on media.oo_file fi (cost=0.00..144710.65 rows=76 width=8) (actual time=10779.582..10779.583 rows=0 loops=1)|
Output: id |
Filter: (fi.is_active AND (fi.relpath ^@ 'A'::text)) |
Rows Removed by Filter: 1075812 |
Buffers: shared hit=9960 read=121303 |
Planning Time: 0.428 ms |
Execution Time: 10779.613 ms |

-- letter 'A' like expression index scan fast

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'A%' limit 1;

QUERY PLAN |
-------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.55..60.85 rows=1 width=8) (actual time=7.047..7.048 rows=0 loops=1) |
Output: id |
Buffers: shared hit=2776 |
-> Index Scan using test on media.oo_file fi (cost=0.55..4583.29 rows=76 width=8) (actual time=7.045..7.045 rows=0 loops=1)|
Output: id |
Index Cond: ((fi.relpath >= 'A'::text) AND (fi.relpath < 'B'::text)) |
Filter: (fi.is_active AND (fi.relpath ~~ 'A%'::text)) |
Rows Removed by Filter: 3784 |
Buffers: shared hit=2776 |
Planning Time: 0.937 ms |
Execution Time: 7.091 ms |

-- letter 'T' like expression, seq scan slow

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath like 'Természettudomány%' limit 1;

QUERY PLAN |
-----------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.00..2.70 rows=1 width=8) (actual time=9842.935..9842.938 rows=0 loops=1) |
Output: id |
Buffers: shared hit=10024 read=121239 |
-> Seq Scan on media.oo_file fi (cost=0.00..144710.65 rows=53574 width=8) (actual time=9842.933..9842.934 rows=0 loops=1)|
Output: id |
Filter: (fi.is_active AND (fi.relpath ~~ 'Természettudomány%'::text)) |
Rows Removed by Filter: 1075812 |
Buffers: shared hit=10024 read=121239 |
Planning Time: 0.975 ms |
Execution Time: 9842.962 ms |

Re: Terribly slow query with very good plan?

От

Nick Cleaton

Дата:

04 февраля 2022 г., 13:19:04

On Fri, 4 Feb 2022 at 13:07, Les <nagylzs@gmail.com> wrote:
>
>>
>> > PostgreSQL uses seq scan for O, F, N, T letters, but it uses index scan for A, I, C letters (with the "like"
query).
>>
>> That's interesting.
>>
>> Does it help if you create an additional index on relpath with the
>> text_pattern_ops modifier, e.g.
>>
>> CREATE INDEX ... USING btree (relpath text_pattern_ops);
>
>
> It does not help.

What if you try applying the C collation to the values from the table:

where fi.is_active  and fi.relpath collate "C" ^@ 'A'

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 13:21:51

>
> It does not help.

What if you try applying the C collation to the values from the table:

where fi.is_active and fi.relpath collate "C" ^@ 'A'

Slow

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath collate "C" ^@ 'A' limit 1;

QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.00..1904.09 rows=1 width=8) (actual time=3837.338..3837.340 rows=0 loops=1) |
Output: id |
Buffers: shared hit=9355 read=121908 |
-> Seq Scan on media.oo_file fi (cost=0.00..144710.65 rows=76 width=8) (actual time=3837.336..3837.336 rows=0 loops=1)|
Output: id |
Filter: (fi.is_active AND ((fi.relpath)::text ^@ 'A'::text)) |
Rows Removed by Filter: 1075812 |
Buffers: shared hit=9355 read=121908 |
Planning Time: 0.391 ms |
Execution Time: 3837.364 ms |

Re: Terribly slow query with very good plan?

От

Ninad Shah

Дата:

04 февраля 2022 г., 13:33:16

Hi Les,

I have reviewed the whole thread, and I do not see usage of gist or gin indexes. Have you tried using Gist or GIN indexes instead of a normal b-tree?

B-trees are a good option when your search is simple(e.g. =, >, <). The operators you are using are "like" or "^@", which fall into a full-text search category; in such scenarios, b-tree may not be effective every time. Hence, it may not deliver the result in the expected time-frame. I recommend you to try creating a Gist or a GIN index here.

Regards,

Ninad

On Fri, Feb 4, 2022 at 6:52 PM Les <nagylzs@gmail.com> wrote:

>
> It does not help.

What if you try applying the C collation to the values from the table:

where fi.is_active and fi.relpath collate "C" ^@ 'A'

Slow

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active and fi.relpath collate "C" ^@ 'A' limit 1;
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.00..1904.09 rows=1 width=8) (actual time=3837.338..3837.340 rows=0 loops=1) |
Output: id |
Buffers: shared hit=9355 read=121908 |
-> Seq Scan on media.oo_file fi (cost=0.00..144710.65 rows=76 width=8) (actual time=3837.336..3837.336 rows=0 loops=1)|
Output: id |
Filter: (fi.is_active AND ((fi.relpath)::text ^@ 'A'::text)) |
Rows Removed by Filter: 1075812 |
Buffers: shared hit=9355 read=121908 |
Planning Time: 0.391 ms |
Execution Time: 3837.364 ms |

Re: Terribly slow query with very good plan?

От

Nick Cleaton

Дата:

04 февраля 2022 г., 13:57:08

On Fri, 4 Feb 2022 at 13:21, Les <nagylzs@gmail.com> wrote:
>
>> What if you try applying the C collation to the values from the table:
>>
>> where fi.is_active  and fi.relpath collate "C" ^@ 'A'
>
>
> Slow

What about this:

fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 14:07:10

> Slow

What about this:

fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")

It uses index scan.

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active
and fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")
limit 1;

QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.55..6.12 rows=1 width=8) (actual time=1623.069..1623.070 rows=0 loops=1) |
Output: id |
Buffers: shared hit=2439 read=1994 dirtied=1107 |
-> Index Scan using test on media.oo_file fi (cost=0.55..5732.47 rows=1029 width=8) (actual time=1623.067..1623.067 rows=0 loops=1)|
Output: id |
Index Cond: ((fi.relpath >= 'A'::text COLLATE "C") AND (fi.relpath <= 'A '::text COLLATE "C")) |
Filter: fi.is_active |
Rows Removed by Filter: 3784 |
Buffers: shared hit=2439 read=1994 dirtied=1107 |
Planning Time: 18.817 ms |
Execution Time: 1623.104 ms |

Although the same with 'Természettudomány' uses seq scan:

EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
select fi.id from media.oo_file fi
where fi.is_active
and fi.relpath between
('Természettudomány' collate "C")
and ('Természettudomány'||chr(255) collate "C")
limit 1;

QUERY PLAN |
---------------------------------------------------------------------------------------------------------------------------------------------------+
Limit (cost=0.00..2.13 rows=1 width=8) (actual time=7521.531..7521.532 rows=0 loops=1) |
Output: id |
Buffers: shared hit=17018 read=150574 |
-> Seq Scan on media.oo_file fi (cost=0.00..188195.39 rows=88290 width=8) (actual time=7521.528..7521.529 rows=0 loops=1) |
Output: id |
Filter: (fi.is_active AND (fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudomány '::text COLLATE "C"))|
Rows Removed by Filter: 1075812 |
Buffers: shared hit=17018 read=150574 |
Planning Time: 8.918 ms |
Execution Time: 7521.560 ms |

Re: Terribly slow query with very good plan?

От

Nick Cleaton

Дата:

04 февраля 2022 г., 14:34:52

On Fri, 4 Feb 2022 at 14:07, Les <nagylzs@gmail.com> wrote:
>
>
>
>> > Slow
>>
>> What about this:
>>
>> fi.relpath between ('A' collate "C") and ('A'||chr(255) collate "C")
>
>
> It uses index scan.

> Although the same with 'Természettudomány' uses seq scan:
>
>
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
> select fi.id from media.oo_file fi
> where fi.is_active
> and fi.relpath between
> ('Természettudomány' collate "C")
> and ('Természettudomány'||chr(255) collate "C")
> limit 1;
>
> QUERY PLAN
                            | 
>
---------------------------------------------------------------------------------------------------------------------------------------------------+
> Limit  (cost=0.00..2.13 rows=1 width=8) (actual time=7521.531..7521.532 rows=0 loops=1)
                            | 
>   Output: id
                            | 
>   Buffers: shared hit=17018 read=150574
                            | 
>   ->  Seq Scan on media.oo_file fi  (cost=0.00..188195.39 rows=88290 width=8) (actual time=7521.528..7521.529 rows=0
loops=1)                     | 
>         Output: id
                            | 
>         Filter: (fi.is_active AND (fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <=
'Természettudomány'::text COLLATE "C"))| 
>         Rows Removed by Filter: 1075812
                            | 
>         Buffers: shared hit=17018 read=150574
                            | 
> Planning Time: 8.918 ms
                            | 
> Execution Time: 7521.560 ms

That may be because it's expecting to get 88290 rows from the
sequential scan, and the"limit 1" means it expects sequential scan to
be fast because on average it will only need to scan 1/88290 of the
table before it finds a matching row, then it can stop.

Try it without the "limit 1"

Re: Terribly slow query with very good plan?

От

Thomas Kellerer

Дата:

04 февраля 2022 г., 14:42:37

Les schrieb am 04.02.2022 um 10:11:

> My task is to write a query that tells if a folder has any active file inside it - directly or in subfolders. Here is
thequery for that: 
>
> EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)
>
> select id, title,
> (exists (select f2.id <http://f2.id> from media.oo_file f2 where f2.relpath like f.relpath || '%')) as has_file
> from media.oo_folder f where f.parent_id is null
>
> QUERY PLAN
                                       | 
>
--------------------------------------------------------------------------------------------------------------------------------------------------------------+
> Index Scan using oo_folder_idx_parent on media.oo_folder f  (cost=0.42..488.02 rows=20 width=26) (actual
time=713.419..25414.969rows=45 loops=1)             | 
>   Output: f.id <http://f.id>, f.title, (SubPlan 1)
                                                     | 
>   Index Cond: (f.parent_id IS NULL)
                                       | 
>   Buffers: shared hit=7014170
                                       | 
>   SubPlan 1
                                       | 
>     ->  Index Only Scan using oo_file_idx_relpath on media.oo_file f2  (cost=0.55..108499.27 rows=5381 width=0)
(actualtime=564.756..564.756 rows=0 loops=45)| 
>           Filter: (f2.relpath ~~ (f.relpath || '%'::text))
                                       | 
>           Rows Removed by Filter: 792025
                                       | 
>           Heap Fetches: 768960
                                       | 
>           Buffers: shared hit=7014130
                                       | 

In addition to the collation tweaks, I wonder if using a lateral join might result in a more efficient plan:

    select id, title, c.id is not null as has_path
    from media.oo_folder f
      left join lateral (
        select f2.id
        from media.oo_file f2
        where f2.relpath like f.relpath || '%'
        limit 1
      ) c on true
    where f.parent_id is null

Re: Terribly slow query with very good plan?

От

Les

Дата:

04 февраля 2022 г., 14:43:55

That may be because it's expecting to get 88290 rows from the
sequential scan, and the"limit 1" means it expects sequential scan to
be fast because on average it will only need to scan 1/88290 of the
table before it finds a matching row, then it can stop.

We are looking for a single row. With an index scan, it is always much faster to find a single row. No seq scan can be faster "on average", when you are looking for a single row. Am I wrong?

> Try it without the "limit 1"

Without the limit it uses bitmap heap scan. Unbelievable!

QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------------------+
Bitmap Heap Scan on media.oo_file fi (cost=10480.10..140065.96 rows=70010 width=8) (actual time=9757.917..9757.920 rows=0 loops=1) |
Output: id |
Recheck Cond: ((fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudomány '::text COLLATE "C")) |
Filter: fi.is_active |
Rows Removed by Filter: 85207 |
Heap Blocks: exact=24954 |
Buffers: shared hit=197 read=26531 |
-> Bitmap Index Scan on test (cost=0.00..10462.59 rows=99404 width=0) (actual time=425.571..425.572 rows=85207 loops=1) |
Index Cond: ((fi.relpath >= 'Természettudomány'::text COLLATE "C") AND (fi.relpath <= 'Természettudomány '::text COLLATE "C"))|
Buffers: shared hit=6 read=1768 |
Planning Time: 1.145 ms |
JIT: |
Functions: 6 |
Options: Inlining false, Optimization false, Expressions true, Deforming true |
Timing: Generation 2.295 ms, Inlining 0.000 ms, Optimization 1.142 ms, Emission 11.632 ms, Total 15.070 ms |
Execution Time: 9760.361 ms |

Re: Terribly slow query with very good plan?

От

Vijaykumar Jain

Дата:

05 февраля 2022 г., 21:25:02

On Fri, 4 Feb 2022 at 14:41, Les <nagylzs@gmail.com> wrote:

Hello,

The "replath" field contains the path of the file/folder. For example: "/folder1/folder2/folder3/filename4.ext5". The replath field is managed by triggers. There are about 1M rows for files and 600K folder rows in the database. The files are well distributed between folders, and there are only 45 root folders ( parent_id is null)

Replying in a separate thread, just in case this does not help.

It seems you already store relpath but as text via triggers, will the 'ltree' extension be of any help to get your results faster (to help implement path enumeration, but has a limitation of 65K objects)

https://www.postgresql.org/docs/current/ltree.html

https://github.com/postgres/postgres/blob/master/contrib/ltree/sql/ltree.sql (test cases for elaborate examples)

https://patshaughnessy.net/2017/12/14/manipulating-trees-using-sql-and-the-postgres-ltree-extension

also, another pattern i came across was via closure tables

https://www.slideshare.net/billkarwin/models-for-hierarchical-data

https://stackoverflow.com/questions/19834400/what-is-the-simplest-way-to-save-a-file-tree-in-a-postgres-database/19835575

ex. (from the doc)

postgres=# drop table test;

DROP TABLE

postgres=# CREATE TABLE test (path ltree);

INSERT INTO test VALUES ('Top');

INSERT INTO test VALUES ('Top.Science');

INSERT INTO test VALUES ('Top.Science.Astronomy');

INSERT INTO test VALUES ('Top.Science.Astronomy.Astrophysics');

INSERT INTO test VALUES ('Top.Science.Astronomy.Cosmology');

INSERT INTO test VALUES ('Top.Hobbies');

INSERT INTO test VALUES ('Top.Hobbies.Amateurs_Astronomy');

INSERT INTO test VALUES ('Top.Collections');

INSERT INTO test VALUES ('Top.Collections.Pictures');

INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy');

INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Stars');

INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Galaxies');

INSERT INTO test VALUES ('Top.Collections.Pictures.Astronomy.Astronauts');

CREATE INDEX path_gist_idx ON test USING GIST (path);

CREATE INDEX path_idx ON test USING BTREE (path); -- we can even make this unique index as there would only be one path

-- we can also create partial indexes depending on the query pattern

#my focus is no rows filtered (hence less wasted operations)

postgres=# analyze test;

ANALYZE

postgres=# explain analyze select exists (SELECT 1 FROM test WHERE path ~ '*.Stars');

QUERY PLAN

------------------------------------------------------------------------------------------------------

Result (cost=1.16..1.17 rows=1 width=1) (actual time=0.010..0.010 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Seq Scan on test (cost=0.00..1.16 rows=1 width=0) (actual time=0.008..0.009 rows=1 loops=1)

Filter: (path ~ '*.Stars'::lquery)

Rows Removed by Filter: 10

Planning Time: 0.248 ms

Execution Time: 0.023 ms

(7 rows)

postgres=# set enable_seqscan TO 0; -- small table, hence

SET

postgres=# explain analyze select exists (SELECT 1 FROM test WHERE path ~ '*.Stars');

QUERY PLAN

----------------------------------------------------------------------------------------------------------------------------

Result (cost=8.15..8.16 rows=1 width=1) (actual time=0.020..0.021 rows=1 loops=1)

InitPlan 1 (returns $0)

-> Index Scan using path_gist_idx on test (cost=0.13..8.15 rows=1 width=0) (actual time=0.019..0.019 rows=1 loops=1)

Index Cond: (path ~ '*.Stars'::lquery)

Planning Time: 0.079 ms

Execution Time: 0.037 ms

(6 rows)

Please ignore, if not relevant to the discussion.

Thanks,

Vijay

LinkedIn - Vijaykumar Jain

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: Terribly slow query with very good plan?