Re: hung backends stuck in spinlock heavy endless loop

From: Merlin Moncure
Subject: Re: hung backends stuck in spinlock heavy endless loop
Date:
Msg-id: CAHyXU0wyuJ7-2ypLZtkhOWZzC34udFXpm6tr+O51aZcCEbobag@mail.gmail.com
In response to: Re: hung backends stuck in spinlock heavy endless loop  (Andres Freund <andres@2ndquadrant.com>)
Responses: Re: hung backends stuck in spinlock heavy endless loop  (Peter Geoghegan <pg@heroku.com>)
List: pgsql-hackers
On Tue, Jan 13, 2015 at 5:42 PM, Andres Freund <andres@2ndquadrant.com> wrote:
> On 2015-01-13 17:39:09 -0600, Merlin Moncure wrote:
>> On Tue, Jan 13, 2015 at 5:21 PM, Andres Freund <andres@2ndquadrant.com> wrote:
>> > On 2015-01-13 15:17:15 -0800, Peter Geoghegan wrote:
>> >> I'm inclined to think that this is a livelock, and so the problem
>> >> isn't evident from the structure of the B-Tree, but it can't hurt to
>> >> check.
>> >
>> > My guess is rather that it's contention on the freelist lock via
>> > StrategyGetBuffer's. I've seen profiles like this due to exactly that
>> > before - and it fits to parallel loading quite well.
>>
>> I think I've got it to pop again. s_lock is only showing 35%
>> (increasing very slowly if at all) but performance is mostly halted.
>> Frame pointer is compiled out.  perf report attached.
>
>>  35.82%  postgres                            [.] s_lock
>>  23.71%  postgres                            [.] tas
>>  14.01%  postgres                            [.] tas
>>   6.82%  postgres                            [.] spin_delay
>>   5.93%  postgres                            [.] LWLockRelease
>>   4.36%  postgres                            [.] LWLockAcquireCommon
>
> Interesting. This profile looks quite different?

yep, it's very stable, and the database is approximately frozen.

> What kind of hardware is this on?

linux on a crap workstation box, except I have a good ssd (intel 3500).
model name : Intel(R) Core(TM) i5-3470 CPU @ 3.20GHz


top - 17:44:00 up 11 days,  6:25, 15 users,  load average: 6.94, 6.97, 5.73
Tasks: 259 total,   8 running, 250 sleeping,   0 stopped,   1 zombie
%Cpu0  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 93.5 us,  2.2 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  : 95.7 us,  0.0 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  : 93.6 us,  2.1 sy,  0.0 ni,  4.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   8131028 total,  7977608 used,   153420 free,    21424 buffers
KiB Swap:  8340476 total,  1884900 used,  6455576 free,  5201188 cached
 PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
32492 mmoncure  20   0  175m  12m 9040 R  64.0  0.2   9:28.42 postgres
32367 mmoncure  20   0  175m  12m 9128 R  57.3  0.2   9:35.87 postgres
31792 mmoncure  20   0  175m  92m  88m R  52.9  1.2   9:57.50 postgres
32725 mmoncure  20   0  175m  13m 9852 R  52.9  0.2   9:08.50 postgres
31899 mmoncure  20   0  179m 104m  97m R  50.7  1.3   9:49.51 postgres
32368 mmoncure  20   0  175m  12m 9200 R  50.7  0.2   9:33.77 postgres
31282 mmoncure  20   0  175m 152m 140m R  46.3  1.9  10:26.29 postgres
  666 mmoncure  20   0 20640 1684 1116 R   2.2  0.0   0:00.01 top

Some more information on what's happening:
This is a ghetto logical replication engine that migrates data from
SQL Server to postgres, consolidating a sharded database into a single
set of tables (of which there are only two). There is only one index
on each destination table, and it's a composite (int, int) in both cases.
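
For reference, the destination tables look roughly like this (table and
column names are invented for illustration; the only details taken from
the description above are that there are two destination tables and a
single composite int,int index on each):

  -- hypothetical shape of one of the two destination tables
  create table dest_table
  (
      shard_id int not null,   -- which source shard the row came from
      row_id   int not null,   -- row id within that shard
      payload  text            -- replicated columns, details omitted
  );

  -- the single index: composite (int, int)
  create index dest_table_idx on dest_table (shard_id, row_id);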

A single source table is replicated in a single transaction, and it's
staged to a set of two temp tables, one for deletes, one for inserts.
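Roughly, the per-table flow looks like this (again a sketch with
invented names; the only details from above are the single transaction
and the two temp tables, one for deletes and one for inserts):

  begin;

  -- staging area for this source table's changes
  create temp table stage_deletes (shard_id int, row_id int) on commit drop;
  create temp table stage_inserts (like dest_table) on commit drop;

  -- ... both temp tables get populated from the SQL Server side here ...

  -- apply deletes against the composite key
  delete from dest_table d
   using stage_deletes s
   where d.shard_id = s.shard_id
     and d.row_id = s.row_id;

  -- then apply the new/changed rows
  insert into dest_table select * from stage_inserts;

  commit;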
I log each step in the database.  Here are the times:

cds2=# select cdsruntableid, finished - started, inserted from
cdsruntable where cdsrunid = 62 order by started;
 cdsruntableid │    ?column?     │ inserted
───────────────┼─────────────────┼──────────
           833 │ 00:00:33.24044  │    11860
           834 │ 00:00:35.44981  │    19099
           835 │ 00:02:01.725186 │   530923
           836 │ 00:01:29.578811 │   211737
           838 │ 00:01:17.393461 │    64258
           837 │ 00:00:56.849106 │   227844
           839 │ 00:02:12.016504 │   630309
           840 │ 00:00:00.00111  │
           841 │ 00:01:09.058806 │   155309
           842 │ 00:01:54.723747 │   704422
           843 │ 00:01:14.965304 │
           844 │ 00:00:45.410953 │    59934
           845 │ 00:00:34.302632 │    14848
           846 │ 00:00:46.913832 │    89893
           848 │ 00:05:22.155189 │  2410622
           847 │ 00:01:25.199141 │   268157
           849 │ 00:01:16.168352 │   117511
           850 │ 00:00:34.809862 │     1175
           851 │ 00:01:12.565397 │    67214
           852 │ 00:01:03.734776 │    20129
           853 │ 00:00:41.038456 │    62765
           854 │ 00:01:03.478917 │    14967
           855 │ 00:00:33.88803  │     6901
           856 │ 00:00:36.381886 │     6670
           857 │ 00:00:36.626623 │     8268
           858 │ 00:01:14.391584 │   363448
           859 │ 00:01:44.619005 │   533395
           860 │ 00:01:02.042255 │   212202
           861 │ 00:00:00.001065 │
           863 │ 00:00:58.265506 │   215234
           862 │ 00:02:45.572635 │   827941
           865 │ 00:01:28.049165 │   241020
           864 │ 00:01:51.465643 │   531012
           866 │ 00:01:20.273391 │   419357
           868 │ 00:01:25.479443 │   294262
           869 │ 00:00:46.400825 │    46136
           870 │ 00:00:42.337286 │    25934
           871 │ 00:00:48.350152 │    91375
           872 │ 00:00:00.001743 │
           873 │ 00:01:04.596749 │   199744
           876 │ 00:01:15.210301 │   276276
           877 │ 00:02:36.447723 │  1693887
           878 │ 00:00:55.925104 │    36938
           879 │ 00:00:00.016874 │
           882 │ 00:00:50.479248 │   171901
           881 │                 │
           884 │                 │
           883 │                 │
           867 │                 │
           874 │                 │
           875 │                 │
           880 │                 │
(52 rows)


There is an 8 thread worker that walks that queue in id order and
migrates one source table of data to the destination postgres server.
It's been stuck for quite some time now, and I'm certain that if I
restart the database and rerun the incremental processing routine, it
would work through those last seven tables with no issues (at least
that's what happened last time).  It's been bogged down with no progress
for a half hour again, so there's some kind of weird bug here.
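
In case it helps picture the workload: each worker basically claims the
next unprocessed row of cdsruntable in id order and then migrates that
source table.  The actual worker code isn't pasted here, so the claim
query below is only my sketch of "walks that queue in id order",
reusing the columns visible in the query above:

  -- hypothetical claim of the next table to migrate (one per worker)
  update cdsruntable
     set started = now()
   where cdsruntableid =
         (select cdsruntableid
            from cdsruntable
           where cdsrunid = 62
             and started is null
           order by cdsruntableid
           limit 1
             for update)
  returning cdsruntableid;

(There's no "skip locked" in any released version yet, so a pattern
exactly like this would have concurrent workers piling up on the same
row; the real engine hands out work a bit differently.  The point is
just that the seven tables with empty times above are the ones still
sitting in the queue.)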

merlin


