Обсуждение: Autovaccum with cost_delay does not complete on one solaris 5.10 machine
All, We're having a very strange problem where autovacuum does not complete on a Postgres 8.3.8/Solaris 5.10 system. The reason I say strange is: this is one of a twin pair of identical systems,and the other system does not have this issue. Basically, vacuuming of a table which normally takes about 20 minutes interactively with vacuum_cost_delay set to 20 had not completed after 14 hours. When I trussed it, I saw activity which indicated to me that autovacuum was doing a pollsys, presumably for cost_limit, every data page. Autovacuum was running with vacuum_cost_limit = 200 and autovacuum_vacuum_cost_delay = 20, which I believe is the default for 8.3. Truss output: pollsys(0xFFFFFD7FFFDF83E0, 0, 0xFFFFFD7FFFDF8470, 0x00000000) = 0 lseek(4, 0x2AD9C000, SEEK_SET) = 0x2AD9C000 write(4, " L\v\0\010F0F8 a01\0\0\0".., 8192) = 8192 lseek(4, 0x2ADDC000, SEEK_SET) = 0x2ADDC000 read(4, " L\v\0\0108FFD a01\0\0\0".., 8192) = 8192 pollsys(0xFFFFFD7FFFDF83E0, 0, 0xFFFFFD7FFFDF8470, 0x00000000) = 0 lseek(4, 0x2AD9E000, SEEK_SET) = 0x2AD9E000 write(4, " L\v\0\0 X15F9 a01\0\0\0".., 8192) = 8192 lseek(4, 0x2ADDE000, SEEK_SET) = 0x2ADDE000 read(4, " L\v\0\080B0FD a01\0\0\0".., 8192) = 8192 pollsys(0xFFFFFD7FFFDF83E0, 0, 0xFFFFFD7FFFDF8470, 0x00000000) = 0 lseek(4, 0x2ADA0000, SEEK_SET) = 0x2ADA0000 write(4, " L\v\0\0D0 6F9 a01\0\0\0".., 8192) = 8192 lseek(4, 0x2ADE0000, SEEK_SET) = 0x2ADE0000 read(4, " L\v\0\0F8D1FD a01\0\0\0".., 8192) = 8192 Note that this is VERY different from the truss output for a manual vacuum on the same machine (although I think the above is an index and the below is the main table): pollsys(0xFFFFFD7FFFDF88C0, 0, 0xFFFFFD7FFFDF8950, 0x00000000) = 0 read(14, " (\v\0\010\v 19501\001\0".., 8192) = 8192 read(14, " !\v\0\0B8 qFF9701\001\0".., 8192) = 8192 read(14, " -\v\0\08895 WBC01\001\0".., 8192) = 8192 read(14, " (\v\0\0\b I 19501\001\0".., 8192) = 8192 read(14, " :\v\0\0 ( ;BCCD01\001\0".., 8192) = 8192 read(14, " (\v\0\0 @89 19501\001\0".., 8192) = 8192 read(14, " D\v\0\0B0 7 e l01\001\0".., 8192) = 8192 read(14, " (\v\0\0B0C7 19501\001\0".., 8192) = 8192 read(14, " -\v\0\0B8 5 XBC01\001\0".., 8192) = 8192 read(14, " (\v\0\0 ( 3 29501\001\0".., 8192) = 8192 read(14, " (\v\0\0 X R 29501\001\0".., 8192) = 8192 read(14, " :\v\0\0 CEBCCD01\001\0".., 8192) = 8192 read(14, " !\v\0\0D8A0 9801\001\0".., 8192) = 8192 read(14, " (\v\0\0C0C6 29501\001\0".., 8192) = 8192 read(14, "1C\v\0\0D0 u g [01\001\0".., 8192) = 8192 read(14, " !\v\0\0A0 `81 [01\001\0".., 8192) = 8192 read(14, " -\v\0\0 0ED XBC01\001\0".., 8192) = 8192 read(14, " 7\v\0\0C8 UECD901\001\0".., 8192) = 8192 read(14, "1A\v\0\0107F W z01\001\0".., 8192) = 8192 read(14, " !\v\0\0 p ZB5A401\001\0".., 8192) = 8192 read(14, " -\v\0\0A0D5 YBC01\001\0".., 8192) = 8192 read(14, " z\v\0\0 81AFB9A01\001\0".., 8192) = 8192 read(14, "1A\v\0\080 { X z01\001\0".., 8192) = 8192 read(14, " (\v\0\080 ] 39501\001\0".., 8192) = 8192 read(14, " (\v\0\0A8 | 39501\001\0".., 8192) = 8192 read(14, " :\v\0\0\09ABDCD01\001\0".., 8192) = 8192 Ideas on where to look next, anyone? -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: Autovaccum with cost_delay does not complete on one solaris 5.10 machine
От
Alvaro Herrera
Дата:
Josh Berkus wrote: > Basically, vacuuming of a table which normally takes about 20 minutes > interactively with vacuum_cost_delay set to 20 had not completed after > 14 hours. When I trussed it, I saw activity which indicated to me that > autovacuum was doing a pollsys, presumably for cost_limit, every data page. > > Autovacuum was running with vacuum_cost_limit = 200 and > autovacuum_vacuum_cost_delay = 20, which I believe is the default for 8.3. > > Truss output: > > pollsys(0xFFFFFD7FFFDF83E0, 0, 0xFFFFFD7FFFDF8470, 0x00000000) = 0 So what is it polling? Please try "truss -v pollsys"; is there a way in Solaris to report what each file descriptor is pointing to? (In linux I'd look at /proc/<pid>/fd) We don't call pollsys anywhere. Something in Solaris must be doing it under the hood. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
Alvaro Herrera <alvherre@commandprompt.com> writes: > We don't call pollsys anywhere. Something in Solaris must be doing it > under the hood. pg_usleep calls select(), and some googling indicates that select() is implemented as pollsys() on recent Solaris versions. So Josh's assumption that those are delay calls seems plausible. But it shouldn't be sleeping after each page with normal cost_delay parameters, should it? regards, tom lane
Re: Autovaccum with cost_delay does not complete on one solaris 5.10 machine
От
Alvaro Herrera
Дата:
Tom Lane wrote: > Alvaro Herrera <alvherre@commandprompt.com> writes: > > We don't call pollsys anywhere. Something in Solaris must be doing it > > under the hood. > > pg_usleep calls select(), and some googling indicates that select() is > implemented as pollsys() on recent Solaris versions. So Josh's > assumption that those are delay calls seems plausible. But it shouldn't > be sleeping after each page with normal cost_delay parameters, should it? Certainly not ... The only explanation would be that the cost balance gets over the limit very frequently. So one of the params would have to be abnormally high (vacuum_cost_page_hit, vacuum_cost_page_miss, vacuum_cost_page_dirty). -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
>> pg_usleep calls select(), and some googling indicates that select() is >> implemented as pollsys() on recent Solaris versions. So Josh's >> assumption that those are delay calls seems plausible. It's certainly the behavior I'm seeing otherwise. In "normal operation", the number of pages read between pollsys is consistent with the vacuum delay settings. >> But it shouldn't >> be sleeping after each page with normal cost_delay parameters, should it? Right, that's why I find this puzzling. If the problem was easier to reproduce it would be easier to analyze. > Certainly not ... The only explanation would be that the cost balance > gets over the limit very frequently. So one of the params would have to > be abnormally high (vacuum_cost_page_hit, vacuum_cost_page_miss, > vacuum_cost_page_dirty). Nope, all defaults. And, all identical to the other server which is behaving normally. Honestly, I mostly posted this in hopes that some other Solaris user would speak up to having seen the same thing. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Josh Berkus <josh@agliodbs.com> writes: >> But it shouldn't >> be sleeping after each page with normal cost_delay parameters, should it? > Right, that's why I find this puzzling. If the problem was easier to > reproduce it would be easier to analyze. The behavior would be explained if VacuumCostLimit were getting set to zero (or some unreasonably small value) in the autovac worker process. I looked at the autovac code that manages that, and it seems complicated enough that a bug wouldn't surprise me in the least. I especially note that wi_cost_limit is explicitly initialized to zero, rather than something sane; and that table_recheck_autovac falls back to setting vac_cost_limit from the previous value of VacuumCostLimit ... which is NOT constant but in general is left over from the previously processed table. One should also keep in mind that SIGHUP processing might reload VacuumCostLimit from GUC values. So I think that area needs a closer look. Josh, are you sure that both servers are identical in terms of both GUC-related and per-table autovacuum settings? regards, tom lane
> Josh, are you sure that both servers are identical in terms of both > GUC-related and per-table autovacuum settings? I should check per-table. GUC, yes, because the company has source management for config files. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Tom, Neither database has and per-table autovacuum settings. However, since this is a production database, I had to try something, and set vacuum_cost_limit up to 1000. The issue with vacuuming one page at a time went away, or at least I have not seen it repeat in the last 16 hours. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com
Re: Autovaccum with cost_delay does not complete on one solaris 5.10 machine
От
Alvaro Herrera
Дата:
Josh Berkus wrote: > Tom, > > Neither database has and per-table autovacuum settings. > > However, since this is a production database, I had to try > something, and set vacuum_cost_limit up to 1000. The issue with > vacuuming one page at a time went away, or at least I have not seen > it repeat in the last 16 hours. How many autovac workers are there? -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
> How many autovac workers are there? Max_workers is set to 3. However, I've never seen more than one active at a time. -- -- Josh Berkus PostgreSQL Experts Inc. http://www.pgexperts.com