BUG #4692: VACUUM: write to WAL gets very slow and seems redundant

Поиск
Список
Период
Сортировка
От Peter Much
Тема BUG #4692: VACUUM: write to WAL gets very slow and seems redundant
Дата
Msg-id 200903041628.n24GSfHL064596@wwwmaster.postgresql.org
обсуждение исходный текст
Список pgsql-bugs
The following bug has been logged online:

Bug reference:      4692
Logged by:          Peter Much
Email address:      pmc@citylink.dinoex.sub.org
PostgreSQL version: 8.2.7
Operating system:   FreeBSD 6.3
Description:        VACUUM: write to WAL gets very slow and seems redundant
Details:

While most of the time the system does run very fine, sometimes (i perceived
it only during VACUUM) the throughput breaks down dramatically: while
normally the VACUUM fills multiple WAL logs per minute, it then takes 10
minutes to fill each WAL log, while the system is rather idle:

Normal:
Mar  4 03:18:29 <local0.info> edge postgres[1781]: [11-1] :[] LOG:  archived
transaction log file "000000020000004200000097"
Mar  4 03:19:09 <local0.info> edge postgres[1781]: [12-1] :[] LOG:  archived
transaction log file "000000020000004200000098"
Mar  4 03:20:26 <local0.info> edge postgres[1781]: [13-1] :[] LOG:  archived
transaction log file "000000020000004200000099"
Mar  4 03:21:11 <local0.info> edge postgres[1781]: [14-1] :[] LOG:  archived
transaction log file "00000002000000420000009A"

Defect:
Mar  4 10:16:16 <local0.info> edge postgres[1781]: [84-1] :[] LOG:  archived
transaction log file "0000000200000042000000E0"
Mar  4 10:25:19 <local0.info> edge postgres[1781]: [85-1] :[] LOG:  archived
transaction log file "0000000200000042000000E1"
Mar  4 10:37:20 <local0.info> edge postgres[1781]: [86-1] :[] LOG:  archived
transaction log file "0000000200000042000000E2"
Mar  4 10:44:04 <local0.info> edge postgres[1781]: [87-1] :[] LOG:  archived
transaction log file "0000000200000042000000E3"

Looking closer at this.
1.) The only process being active on the system is the VACUUM (FULL) task:
 6322  ??  Ds    17:16.52 postgres: pgsql bacula 192.168.98.6(50283) VACUUM
(po

2.) the system load is near idle.

3.) "top" shows the vacuum process waiting in "bo_waa" (or "biowr") nearly
all the time:
  PID USERNAME  THR PRI NICE   SIZE    RES STATE    TIME   WCPU COMMAND
 6322 pgsql       1  -5    0 76596K 44324K bo_wwa  17:22  1.03% postgres

4.) The disk with the WAL logs is 98% busy, and the only file that is read
or written there is the current WAL log:
[root@edge /var/pgsql/data]# find  . -type f  -exec ls -lu {} \;  | grep
"Mar  4 12:4"
[root@edge /var/pgsql/data]# find  . -type f  -exec ls -l {} \;  | grep "Mar
 4 12:4"
-rw-------  1 pgsql  pgsql  16777216 Mar  4 12:41
./pg_xlog/0000000200000042000000F0

5.) lsof shows the file-pointer in the WAL log proceeding:
[root@edge /var/pgsql/data]# lsof -o pg_xlog/0000000200000042000000F0 ;
sleep 10; lsof -o pg_xlog/0000000200000042000000F0
COMMAND   PID  USER   FD   TYPE DEVICE     OFFSET  NODE NAME
postgres 1780 pgsql    9u  VREG  0,167  0t8585216 14344
pg_xlog/0000000200000042000000F0
postgres 6322 pgsql   11u  VREG  0,167 0t14090240 14344
pg_xlog/0000000200000042000000F0
COMMAND   PID  USER   FD   TYPE DEVICE     OFFSET  NODE NAME
postgres 1780 pgsql    9u  VREG  0,167  0t8585216 14344
pg_xlog/0000000200000042000000F0
postgres 6322 pgsql   11u  VREG  0,167 0t14524416 14344
pg_xlog/0000000200000042000000F0

So, during 10 seconds the 6322 process has proceeded 434176 bytes, giving 43
kBps.

6.) iostat shows the WAL disk processing ~60 accesses per second, 16kB each
(while the database disk does nearly nothing):
            WAL
 KB/t tps  MB/s
16.70  69  1.12
16.76  63  1.03
16.00  45  0.70
16.00  58  0.90
16.71  68  1.10

This shows about 1000 kBps.
iostat does not show me which process does this activity, but since the
VACUUM process 6322 is the only active process, and since this process is in
"biowr" state all the time, the activity could hardly come from an other
process.

Evaluating, we have ~1000 kBps thruput but only 43 kBps write advance on the
WAL log, this gives about factor 1:20. And from the logs we see: about
30secs per WAL log in normal operation, and about 10mins in the defect
state, this is also 1:20.

So it seems I am looking at the right thing, and there is no problem on the
system. The only question is: what the **** is this VACUUM process doing?

When killing the 6322 process, the next task behaves normal:
            WAL               DB
 KB/t tps  MB/s   KB/t tps  MB/s
82.67   3  0.24  22.79 197  4.39
31.25   8  0.24  21.16 154  3.19
26.00   8  0.20  21.43 164  3.44
44.80   5  0.22  24.44 143  3.42
22.40   5  0.11  23.85 162  3.78
24.00   4  0.09  22.45 195  4.28


The behaviour occurs only occasionally; as we can see from the postgres log
above, at 3AM the VACUUM was running fine, only later it started with this
behaviour - for no apparent reason.

В списке pgsql-bugs по дате отправления:

Предыдущее
От: Tom Lane
Дата:
Сообщение: Re: BUG #4690: an select query is not using the index
Следующее
От: "Oscar Bejarano"
Дата:
Сообщение: BUG #4693: When I use the order by it gets me an error, but if i use it without order by it's a correct query.