Thread: [PERFORM] Inappropriate inner table for nested loop join

[PERFORM] Inappropriate inner table for nested loop join

From: Akihiko Odaki
Date:
Hi all,

I am having a problem with nested loop join.

The database has two tables: "posts" and "follows".
Table "posts" has two columns: "timestamp" and "account".
Table "follows" has two columns: "target_account" and "owner_account".
The database also has an index on "posts" ("account", "timestamp"), one
on "posts" ("timestamp"), and one on "follows" ("owner_account",
"target_account").

Table "posts" is so big and have 10 million records.
The number of Records with the same value for "owner_accounts" in table
"follows" is about 100 by average.

I issue the following query:

SELECT "posts".*
   FROM "posts"
   JOIN "follows" ON "follows"."target_account" = "posts"."account"
   WHERE "follows"."owner_account" = $1
   ORDER BY "posts"."timestamp"
   LIMIT 100

That results in a nested loop join with table "posts" as the inner table
and "follows" as the outer, with the inner queried in each loop iteration.
EXPLAIN ANALYZE shows that the actual number of rows read from table
"posts" is 500,000. This behavior is problematic.

For performance, it might be better to retrieve 100 records joined with
one record in table "follows", then, for the next record in "follows",
retrieve only the records whose "posts"."timestamp" is greater than that
of the last record we already have, up to another 100 records, and so on.
That would read at most 10,000 records from table "posts", and the number
could be even smaller in some cases.
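
One way to spell that idea out in SQL is a LATERAL subquery: at most 100
"posts" rows per "follows" row, followed by a global top-100. This is only
a sketch of the intent; whether it is actually faster would depend on the
data and the plan chosen:

SELECT p.*
   FROM "follows" f
   JOIN LATERAL (
        SELECT *
          FROM "posts"
         WHERE "posts"."account" = f."target_account"
         ORDER BY "posts"."timestamp"
         LIMIT 100
   ) p ON true
   WHERE f."owner_account" = $1
   ORDER BY p."timestamp"
   LIMIT 100;

With the index on "posts" ("account", "timestamp"), each inner LIMIT 100
can be served by an index scan, so with about 100 matching "follows" rows
at most roughly 100 x 100 = 10,000 "posts" rows are read.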

Now I have these tough questions:
* Is the "ideal" operation I suggested possible for PostgreSQL?
* If so, I think that could be achieved by letting PostgreSQL use
"follows" as the inner in the loops. How could I achieve that?
* Is there any other way to improve the performance of the query?

Answers are greatly appreciated.

Regards,
Akihiko Odaki


Re: [PERFORM] Inappropriate inner table for nested loop join

From: Albe Laurenz
Date:
Akihiko Odaki wrote:
> I am having a problem with nested loop join.
> 
> The database has two tables: "posts" and "follows".
> Table "posts" has two columns: "timestamp" and "account".
> Table "follows" has two columns: "target_account" and "owner_account".
> The database also has an index on "posts" ("account", "timestamp"), one
> on "posts" ("timestamp"), and one on "follows" ("owner_account",
> "target_account").
> 
> Table "posts" is large, with about 10 million records.
> The number of records with the same value for "owner_account" in table
> "follows" is about 100 on average.
> 
> I issue the following query:
> 
> SELECT "posts".*
>    FROM "posts"
>    JOIN "follows" ON "follows"."target_account" = "posts"."account"
>    WHERE "follows"."owner_account" = $1
>    ORDER BY "posts"."timestamp"
>    LIMIT 100
> 
> That results in a nested loop join with table "posts" as the inner table
> and "follows" as the outer, with the inner queried in each loop iteration.
> EXPLAIN ANALYZE shows that the actual number of rows read from table
> "posts" is 500,000. This behavior is problematic.
> 
> For performance, it might be better to retrieve 100 records joined with
> one record in table "follows", then, for the next record in "follows",
> retrieve only the records whose "posts"."timestamp" is greater than that
> of the last record we already have, up to another 100 records, and so on.
> That would read at most 10,000 records from table "posts", and the number
> could be even smaller in some cases.
> 
> Now I have these tough questions:
> * Is the "ideal" operation I suggested possible for PostgreSQL?
> * If so, I think that could be achieved by letting PostgreSQL use
> "follows" as the inner in the loops. How could I achieve that?
> * Is there any other way to improve the performance of the query?

PostgreSQL's plan is to use the index on "posts"."timestamp" to find the
rows with the lowest "timestamp", match them with rows from "follows" in
a nested loop, and stop as soon as it has found 100 matches.

Now it must be that the rows in "posts" that match with rows in "follows"
have high values of "timestamp".

PostgreSQL doesn't know that, because it has no estimates of how
values correlate across tables, so it has to scan much more of the index
than it had expected to, and the query performs poorly.
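
You can see the mismatch by comparing estimated and actual row counts in
the plan, for example with (the parameter value is just a placeholder):

EXPLAIN (ANALYZE, BUFFERS)
SELECT "posts".*
   FROM "posts"
   JOIN "follows" ON "follows"."target_account" = "posts"."account"
   WHERE "follows"."owner_account" = 12345
   ORDER BY "posts"."timestamp"
   LIMIT 100;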

You could either try to do something like

SELECT *
FROM (SELECT "posts".*
      FROM "posts"
         JOIN "follows" ON "follows"."target_account" = "posts"."account"
      WHERE "follows"."owner_account" = $1
      OFFSET 0) q
ORDER BY "posts"."timestamp"  
LIMIT 100;

Or you could drop the index on "posts"."timestamp" and see if that helps.

Yours,
Laurenz Albe

Re: [PERFORM] Inappropriate inner table for nested loop join

From: Akihiko Odaki
Date:
Thank you for your quick reply. Your solution works for me!

On 2017-06-23 20:20, Albe Laurenz wrote:
 > PostgreSQL's plan is to use the index on "posts"."timestamp" to find the
 > rows with the lowest "timestamp", match them with rows from "follows" in
 > a nested loop, and stop as soon as it has found 100 matches.
 >
 > Now it must be that the rows in "posts" that match with rows in "follows"
 > have high values of "timestamp".

I mistakenly dropped DESC. The actual query should be:

SELECT "posts".*
   FROM "posts"
   JOIN "follows" ON "follows"."target_account" = "posts"."account"
   WHERE "follows"."owner_account" = $1
   ORDER BY "posts"."timestamp" DESC
   LIMIT 100

I note that here because it may otherwise be confusing when reading the
later part of my first post.

 > PostgreSQL doesn't know that, because it has no estimates of how
 > values correlate across tables, so it has to scan much more of the index
 > than it had expected to, and the query performs poorly.

That is exactly the problem I have encountered.

 > You could either try to do something like
 >
 > SELECT *
 > FROM (SELECT "posts".*
 >        FROM "posts"
 >           JOIN "follows" ON "follows"."target_account" = "posts"."account"
 >        WHERE "follows"."owner_account" = $1
 >        OFFSET 0) q
 > ORDER BY "posts"."timestamp"
 > LIMIT 100;

It works. I had to replace "posts"."timestamp" with "timestamp", but
that is trivial. Everything else is fine.

 > Or you could drop the index on "posts"."timestamp" and see if that helps.

That is not a solution for me because that index is used by other queries,
but it may make sense in other cases.


Re: [PERFORM] Inappropriate inner table for nested loop join

From: Akihiko Odaki
Date:
On 2017-06-23 20:20, Albe Laurenz wrote:
> You could either try to do something like
>
> SELECT *
> FROM (SELECT "posts".*
>        FROM "posts"
>           JOIN "follows" ON "follows"."target_account" = "posts"."account"
>        WHERE "follows"."owner_account" = $1
>        OFFSET 0) q
> ORDER BY "posts"."timestamp"
> LIMIT 100;

Now I wonder whether the result is actually sorted or not. As you said, I
want to "find rows with the greatest 'timestamp', match with rows from
'posts' in a nested loop and stop as soon as it has found 100 matches".

However, it seems to fetch the records without any consideration for
"timestamp" and then sort them. That is not what I expected. Here is an
abridged query plan:

  Limit
    ->  Sort
          Sort Key: posts.id DESC
          ->  Nested Loop
                ->  Seq Scan on follows
                      Filter: (owner_account = $1)
                ->  Index Scan using index_posts_on_account on posts
                      Index Cond: (account_id = follows.target_account)

index_posts_on_account is an obsolete index on "posts" covering only
"account", so it does nothing to help sort by "timestamp".

Regards,
Akihiko Odaki


Re: [PERFORM] Inappropriate inner table for nested loop join

From: Albe Laurenz
Date:
Akihiko Odaki wrote:
> On 2017-06-23 20:20, Albe Laurenz wrote:
>> You could either try to do something like
>>
>> SELECT *
>> FROM (SELECT "posts".*
>>        FROM "posts"
>>           JOIN "follows" ON "follows"."target_account" = "posts"."account"
>>        WHERE "follows"."owner_account" = $1
>>        OFFSET 0) q
>> ORDER BY "posts"."timestamp"
>> LIMIT 100;
> 
> Now I wonder whether the result is actually sorted or not. As you said, I
> want to "find rows with the greatest 'timestamp', match with rows from
> 'posts' in a nested loop and stop as soon as it has found 100 matches".
> 
> However, it seems to fetch the records without any consideration for
> "timestamp" and then sort them. That is not what I expected. Here is an
> abridged query plan:
> 
>   Limit
>     ->  Sort
>           Sort Key: posts.id DESC
>           ->  Nested Loop
>                 ->  Seq Scan on follows
>                       Filter: (owner_account = $1)
>                 ->  Index Scan using index_posts_on_account on posts
>                       Index Cond: (account_id = follows.target_account)
> 
> index_posts_on_account is an obsolete index on "posts" covering only
> "account", so it does nothing to help sort by "timestamp".

Yes, if you replace posts.timestamp with q.timestamp, it should
sort by that.
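
For reference, the rewritten query would then look something like this
(using the DESC ordering from your corrected query):

SELECT *
FROM (SELECT "posts".*
      FROM "posts"
         JOIN "follows" ON "follows"."target_account" = "posts"."account"
      WHERE "follows"."owner_account" = $1
      OFFSET 0) q
ORDER BY q."timestamp" DESC
LIMIT 100;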

Could you send CREATE TABLE and CREATE INDEX statements so I can try it?

Yours,
Laurenz Albe

Re: [PERFORM] Inappropriate inner table for nested loop join

From: Albe Laurenz
Date:
Akihiko Odaki wrote:
> On 2017-06-23 20:20, Albe Laurenz wrote:
>> You could either try to do something like
>>
>> SELECT *
>> FROM (SELECT "posts".*
>>        FROM "posts"
>>           JOIN "follows" ON "follows"."target_account" = "posts"."account"
>>        WHERE "follows"."owner_account" = $1
>>        OFFSET 0) q
>> ORDER BY "posts"."timestamp"
>> LIMIT 100;
> 
> Now I wonder whether the result is actually sorted or not. As you said, I
> want to "find rows with the greatest 'timestamp', match with rows from
> 'posts' in a nested loop and stop as soon as it has found 100 matches".
> 
> However, it seems to fetch the records without any consideration for
> "timestamp" and then sort them. That is not what I expected. Here is an
> abridged query plan:
> 
>   Limit
>     ->  Sort
>           Sort Key: posts.id DESC
>           ->  Nested Loop
>                 ->  Seq Scan on follows
>                       Filter: (owner_account = $1)
>                 ->  Index Scan using index_posts_on_account on posts
>                       Index Cond: (account_id = follows.target_account)
> 
> index_posts_on_account is an obsolete index on "posts" covering only
> "account", so it does nothing to help sort by "timestamp".

That should be fine.

It fetches all rows from "follows" that match the condition, then joins
them with all matching rows from "posts", sorts the result in descending
order by "id", and returns the 100 rows with the largest value for "id".

So you will get those 100 rows from "posts" with the largest "id"
that have a match in "follows" where the condition is fulfilled.

It is just a different plan for doing the same thing, and it happens to be
more efficient in your case.

Yours,
Laurenz Albe