Re: [HACKERS] Two pass CheckDeadlock in contentent case

Поиск
Список
Период
Сортировка
От Sokolov Yura
Тема Re: [HACKERS] Two pass CheckDeadlock in contentent case
Дата
Msg-id 77de6c23263a06093ebb4a590c8bc4ae@postgrespro.ru
обсуждение исходный текст
Ответ на [HACKERS] Two pass CheckDeadlock in contentent case  (Sokolov Yura <y.sokolov@postgrespro.ru>)
Ответы Re: [HACKERS] Two pass CheckDeadlock in contentent case  (Ashutosh Bapat <ashutosh.bapat@enterprisedb.com>)
Re: [HACKERS] Two pass CheckDeadlock in contentent case  (Masahiko Sawada <sawada.mshk@gmail.com>)
Список pgsql-hackers
On 2017-10-03 17:30, Sokolov Yura wrote:
> Good day, hackers.
> 
> During hard workload sometimes process reaches deadlock timeout
> even if no real deadlock occurred. It is easily reproducible with
> pg_xact_advisory_lock on same value + some time consuming
> operation (update) and many clients.
> 
> When backend reaches deadlock timeout, it calls CheckDeadlock,
> which locks all partitions of lock hash in exclusive mode to
> walk through graph and search for deadlock.
> 
> If hundreds of backends reaches this timeout trying to acquire
> advisory lock on a same value, it leads to hard-stuck for many
> seconds, cause they all traverse same huge lock graph under
> exclusive lock.
> During this stuck there is no possibility to do any meaningful
> operations (no new transaction can begin).
> 
> Attached patch makes CheckDeadlock to do two passes:
> - first pass uses LW_SHARED on partitions of lock hash.
>   DeadLockCheck is called with flag "readonly", so it doesn't
>   modify anything.
> - If there is possibility of "soft" or "hard" deadlock detected,
>   ie if there is need to modify lock graph, then partitions
>   relocked with LW_EXCLUSIVE, and DeadLockCheck is called again.
> 
> It fixes hard-stuck, cause backends walk lock graph under shared
> lock, and found that there is no real deadlock.
> 
> Test on 4 socket xeon machine:
> pgbench_zipf -s 10 -c 800  -j 100 -M prepared -T 450 -f
> ./ycsb_read_zipf.sql@50 -f ./ycsb_update_lock2_zipf.sql@50 -P 5
> 
> ycsb_read_zipf.sql:
>     \set i random_zipfian(1, 100000 * :scale, 1.01)
>     SELECT abalance FROM pgbench_accounts WHERE aid = :i;
> ycsb_update_lock2_zipf.sql:
>     \set i random_zipfian(1, 100000 * :scale, 1.01)
>     select lock_and_update( :i );
> 
>     CREATE OR REPLACE FUNCTION public.lock_and_update(i integer)
>      RETURNS void
>      LANGUAGE sql
>     AS $function$
>         SELECT pg_advisory_xact_lock( $1 );
>         UPDATE pgbench_accounts SET abalance = 1 WHERE aid = $1;
>     $function$
> 
> Without attached patch:
> 
> progress: 5.0 s, 45707.0 tps, lat 15.599 ms stddev 83.757
> progress: 10.0 s, 51124.4 tps, lat 15.681 ms stddev 78.353
> progress: 15.0 s, 52293.8 tps, lat 15.327 ms stddev 77.017
> progress: 20.0 s, 51280.4 tps, lat 15.603 ms stddev 78.199
> progress: 25.0 s, 47278.6 tps, lat 16.795 ms stddev 83.570
> progress: 30.0 s, 41792.9 tps, lat 18.535 ms stddev 93.697
> progress: 35.0 s, 12393.7 tps, lat 33.757 ms stddev 169.116
> progress: 40.0 s, 0.0 tps, lat -nan ms stddev -nan
> progress: 45.0 s, 0.0 tps, lat -nan ms stddev -nan
> progress: 50.0 s, 1.2 tps, lat 2497.734 ms stddev 5393.166
> progress: 55.0 s, 0.0 tps, lat -nan ms stddev -nan
> progress: 60.0 s, 27357.9 tps, lat 160.622 ms stddev 1866.625
> progress: 65.0 s, 38770.8 tps, lat 20.829 ms stddev 104.453
> progress: 70.0 s, 40553.2 tps, lat 19.809 ms stddev 99.741
> 
> (autovacuum led to trigger deadlock timeout,
>  and therefore, to stuck)
> 
> Patched:
> 
> progress: 5.0 s, 56264.7 tps, lat 12.847 ms stddev 66.980
> progress: 10.0 s, 55389.3 tps, lat 14.329 ms stddev 71.997
> progress: 15.0 s, 50757.7 tps, lat 15.730 ms stddev 78.222
> progress: 20.0 s, 50797.3 tps, lat 15.736 ms stddev 79.296
> progress: 25.0 s, 48485.3 tps, lat 16.432 ms stddev 82.720
> progress: 30.0 s, 45202.1 tps, lat 17.733 ms stddev 88.554
> progress: 35.0 s, 40035.8 tps, lat 19.343 ms stddev 97.787
> progress: 40.0 s, 14240.1 tps, lat 47.625 ms stddev 265.465
> progress: 45.0 s, 30150.6 tps, lat 31.140 ms stddev 196.114
> progress: 50.0 s, 38975.0 tps, lat 20.543 ms stddev 104.281
> progress: 55.0 s, 40573.9 tps, lat 19.770 ms stddev 99.487
> progress: 60.0 s, 38361.1 tps, lat 20.693 ms stddev 103.141
> progress: 65.0 s, 39793.3 tps, lat 20.216 ms stddev 101.080
> progress: 70.0 s, 38387.9 tps, lat 20.886 ms stddev 104.482
> 
> (autovacuum led to trigger deadlock timeout,
>  but postgresql did not stuck)
> 
> I believe, patch will positively affect other heavy workloads
> as well, although I have not collect benchmarks.
> 
> `make check-world` passes with configured `--enable-tap-tests 
> --enable-casserts`

Excuse me, corrected version is in attach.

-- 
Sokolov Yura
Postgres Professional: https://postgrespro.ru
The Russian Postgres Company
-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Robert Haas
Дата:
Сообщение: Re: [HACKERS] SendRowDescriptionMessage() is slow for queries with alot of columns
Следующее
От: Tom Lane
Дата:
Сообщение: Re: [HACKERS] Postgresql gives error that role goes not exists while it exists