Thread: [GENERAL] Why would log_lock_waits affect a query plan?

[GENERAL] Why would log_lock_waits affect a query plan?

From: Evan Martin
Date:
I have an application that imports a lot of data and then does some
queries on it to build some caches in the database, all in one long
transaction. One of those cache updates repeatedly calls a plpgsql
function, which internally does some SQL queries. Sometimes this is
much, much slower than usual: 3-7 hours instead of 12-15 minutes. It was
totally reproducible when it happened, though (running on the same
machine, same input data).
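
In outline, the process looks something like this (the table and
function names here are made up for illustration):

    -- Sketch of the workflow; all names are hypothetical.
    BEGIN;

    -- 1. Bulk data import (many COPY/INSERT statements in reality).
    COPY big_table FROM '/path/to/import.csv' WITH (FORMAT csv);

    -- 2. Analyze everything after the import.
    ANALYZE;

    -- 3. Cache updates: a plpgsql function called once per row.
    SELECT rebuild_cache(i.id) FROM items_to_cache i;

    COMMIT;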

It turns out that the problem only happens when the "log_lock_waits"
setting is OFF! Many machines had it ON (to troubleshoot a different
problem), so they never experienced it.
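
For reference, the setting can be checked and changed like this
(changing it requires superuser rights):

    -- Show the current value.
    SHOW log_lock_waits;

    -- Turn it on server-wide; picked up on the next config reload.
    ALTER SYSTEM SET log_lock_waits = on;
    SELECT pg_reload_conf();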

I eventually tracked it down to the query plan chosen for one particular
query in the plpgsql function: using a Nested Loop makes it fast and
using a Hash Join makes it very slow. Running an ANALYZE on one of the
tables involved fixes the problem - the fast query plan is chosen all
the time. This itself is a bit strange, because I was already running
ANALYZE on all tables after the data import - it seems that I needed to
run it a second time? But what I'd really like to understand is: why did
setting log_lock_waits to ON always change the query plan to use a
Nested Loop? It's just not something I'd ever expect to affect a query plan.
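
The difference shows up if the query inside the function is run
standalone under EXPLAIN (bearing in mind the plan cached inside
plpgsql may differ); the query below is just a hypothetical stand-in:

    -- Hypothetical stand-in for the query inside the plpgsql function.
    EXPLAIN (ANALYZE, BUFFERS)
    SELECT b.payload
    FROM cache_source s
    JOIN big_table b ON b.id = s.ref_id;

    -- To confirm the Nested Loop really is the fast plan, hash joins
    -- can be discouraged for the session and the timing compared:
    SET enable_hashjoin = off;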

By the way, I also found that the problem does not occur if I commit
before the cache updates. This was with PostgreSQL 9.6.3 running on
Windows x64, if that matters.



Re: [GENERAL] Why would log_lock_waits affect a query plan?

From: Tom Lane
Date: 19/07/2017 11:52 PM
Evan Martin <postgresql2@realityexists.net> writes:
> I have an application that imports a lot of data and then does some
> queries on it to build some caches in the database, all in one long
> transaction. One of those cache updates repeatedly calls a plpgsql
> function, which internally does some SQL queries. Sometimes this is
> much, much slower than usual: 3-7 hours instead of 12-15 minutes. It was
> totally reproducible when it happened, though (running on the same
> machine, same input data).

> It turns out that the problem only happens when the "log_lock_waits"
> setting is OFF! Many machines had it ON (to troubleshoot a different
> problem), so they never experienced it.

> I eventually tracked it down to the query plan chosen for one particular
> query in the plpgsql function: using a Nested Loop makes it fast and
> using a Hash Join makes it very slow. Running an ANALYZE on one of the
> tables involved fixes the problem - the fast query plan is chosen all
> the time. This itself is a bit strange, because I was already running
> ANALYZE on all tables after the data import - it seems that I needed to
> run it a second time?

Are you using the problematic function earlier in the process?  I wonder
if it's cached a bad plan that dates from when there was much less data
in the table.  I also wonder if maybe the specific table is being updated
after the ANALYZEs.  In the situation you've got here, you can't expect
any help from auto-analyze; only your own manual ANALYZEs are going to be
able to see the uncommitted data in the tables.
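
A minimal sketch of that point, using a throwaway table: a manual
ANALYZE inside the open transaction sees the uncommitted rows, which
auto-analyze never can:

    -- Throwaway example: manual ANALYZE sees this transaction's own rows.
    BEGIN;
    CREATE TABLE t (id int PRIMARY KEY);
    INSERT INTO t SELECT generate_series(1, 100000);
    ANALYZE t;   -- planner statistics now reflect the uncommitted rows
    SELECT reltuples FROM pg_class WHERE relname = 't';   -- ~100000
    ROLLBACK;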

> But what I'd really like to understand is: why did
> setting log_lock_waits to ON always change the query plan to use a
> Nested Loop? It's just not something I'd ever expect to affect a query plan.

TBH, I don't believe it.  There are a lot of moving parts here, but
I don't see how that could be relevant.

            regards, tom lane


Re: [GENERAL] Why would log_lock_waits affect a query plan?

From: Evan Martin
Date:
On 19/07/2017 11:52 PM, Tom Lane wrote:
> Evan Martin <postgresql2@realityexists.net> writes:
>> I have an application that imports a lot of data and then does some
>> queries on it to build some caches in the database, all in one long
>> transaction. One of those cache updates repeatedly calls a plpgsql
>> function, which internally does some SQL queries. Sometimes this is
>> much, much slower than usual: 3-7 hours instead of 12-15 minutes. It was
>> totally reproducible when it happened, though (running on the same
>> machine, same input data).
>> It turns out that the problem only happens when the "log_lock_waits"
>> setting is OFF! Many machines had it ON (to troubleshoot a different
>> problem), so they never experienced it.
>> I eventually tracked it down to the query plan chosen for one particular
>> query in the plpgsql function: using a Nested Loop makes it fast and
>> using a Hash Join makes it very slow. Running an ANALYZE on one of the
>> tables involved fixes the problem - the fast query plan is chosen all
>> the time. This itself is a bit strange, because I was already running
>> ANALYZE on all tables after the data import - it seems that I needed to
>> run it a second time?
> Are you using the problematic function earlier in the process?  I wonder
> if it's cached a bad plan that dates from when there was much less data
> in the table.  I also wonder if maybe the specific table is being updated
> after the ANALYZEs.  In the situation you've got here, you can't expect
> any help from auto-analyze; only your own manual ANALYZEs are going to be
> able to see the uncommitted data in the tables.
No, that function is not used earlier and the table is not updated after
the ANALYZE. I've also tried running DISCARD PLANS right after the
ANALYZE, which made no difference.
All the data import is done first, then ANALYZE, then the cache updates.
>> But what I'd really like to understand is: why did
>> setting log_lock_waits to ON always change the query plan to use a
>> Nested Loop? It's just not something I'd ever expect to affect a query plan.
> TBH, I don't believe it.  There are a lot of moving parts here, but
> I don't see how that could be relevant.
I agree, it's totally bizarre, which is why I'm posting here, but I've
tried it many times now and the results are very clear (at least on my
machine; yet to be confirmed on others). I change nothing except
log_lock_waits, re-import the same data, and see different query plans
and very different speeds for that function. (For testing purposes I
don't commit the transaction at the end, so each run starts with a
clean DB.)
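
For what it's worth, each test run looks roughly like this, with
log_lock_waits the only variable:

    -- One test iteration; log_lock_waits is the only change between runs.
    ALTER SYSTEM SET log_lock_waits = off;   -- or on, for the comparison run
    SELECT pg_reload_conf();

    BEGIN;
    -- ... same data import, ANALYZE, and cache-update steps every time ...
    ROLLBACK;   -- nothing is committed, so each run starts from a clean DB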