Обсуждение: Hmmm ... isn't count_nondeletable_pages all wet?

Поиск
Список
Период
Сортировка

Hmmm ... isn't count_nondeletable_pages all wet?

От
Tom Lane
Дата:
While looking at the HOT patch I suddenly started to question the sanity
of vacuumlazy.c's count_nondeletable_pages().  It sits there and does a
HeapTupleSatisfiesVacuum on any tuples it finds, and is willing to
truncate away a page that contains only DEAD tuples.  The problem with
this theory is that any index entries linking to those tuples won't have
been cleaned up, and will therefore emerge as index corruption once the
table grows again (because they will link to tuples that in all
probability don't match the index entries).

Now since this test is made using the same OldestXmin threshold that we
used in the vacuuming pass, tuples that were RECENTLY_DEAD will still
be that way.  However, it seems to me that there's a race condition
anyway:

1. VACUUM scans and cleans a page.
2. Some other transaction inserts a tuple into that page.
3. The inserting transaction aborts.
4. VACUUM returns to the page and sees the tuple as HEAPTUPLE_DEAD.

In this scenario we could truncate the page away and not have cleaned
up the index entries linking to it.

I'm thinking that count_nondeletable_pages should not bother itself
with visibility tests, but just forget truncation if it finds any
items whatsoever on the page.

Is this analysis accurate, or am I missing something?  If it is
accurate, do we need to postpone the upcoming releases to fix it?
I am thinking that some previously unexplained reports of index
corruption might now be explained ...
        regards, tom lane


Re: Hmmm ... isn't count_nondeletable_pages all wet?

От
Tom Lane
Дата:
I wrote:
> Is this analysis accurate, or am I missing something?  If it is
> accurate, do we need to postpone the upcoming releases to fix it?
> I am thinking that some previously unexplained reports of index
> corruption might now be explained ...

Yeah, it's broken.  Reproducing the race condition is a bit tricky,
but if you have a debug-enabled build at hand it's simple.
Here's a test case (turn off autovacuum if it's on, to prevent it
from getting in there ahead of the manual vacuum):

In session 1:

regression=# create table foo (f1 int primary key);
NOTICE:  CREATE TABLE / PRIMARY KEY will create implicit index "foo_pkey" for table "foo"
CREATE TABLE
regression=# insert into foo select g from generate_series(1,10) g;
INSERT 0 10
regression=# delete from foo;
DELETE 10

Now start session 2, and attach to it with gdb, and set a breakpoint
at lazy_truncate_heap().  Continue, and in session 2 do

regression=# vacuum verbose foo;
INFO:  vacuuming "public.foo"
INFO:  scanned index "foo_pkey" to remove 10 row versions
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": removed 10 row versions in 1 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  index "foo_pkey" now contains 0 row versions in 2 pages
DETAIL:  10 index row versions were removed.
0 index pages have been deleted, 0 are currently reusable.
CPU 0.00s/0.00u sec elapsed 0.00 sec.
INFO:  "foo": found 10 removable, 0 nonremovable row versions in 1 pages
DETAIL:  0 dead row versions cannot be removed yet.
There were 0 unused item pointers.
1 pages contain useful free space.
0 pages are entirely empty.
CPU 0.00s/0.00u sec elapsed 0.00 sec.

[ hangs here at the breakpoint ]

In session 1 do:

regression=# begin;
BEGIN
regression=# insert into foo select g+100 from generate_series(1,10) g;
INSERT 0 10
regression=# abort;
ROLLBACK

Now let gdb continue from the breakpoint, and notice session 2 thinks
it can truncate away the whole table:

INFO:  "foo": truncated 1 to 0 pages
DETAIL:  CPU 0.00s/0.00u sec elapsed 0.01 sec.
VACUUM
regression=# 

Back to session 1, insert some conflicting data:

regression=# insert into foo select g+1000 from generate_series(1,10) g;
INSERT 0 10

and now we have a corrupt index with multiple pointers to these rows:

regression=# select * from foo where f1 = 105; f1  
------1005
(1 row)

regression=# select * from foo where f1 = 1005; f1  
------1005
(1 row)

I am fairly sure that this bug explains problems previously reported
by Merlin Moncure:
http://archives.postgresql.org/pgsql-general/2006-10/msg01312.php
and Florian Weimer:
http://archives.postgresql.org/pgsql-general/2006-11/msg00305.php
In both those cases, off-list investigation showed that the symptoms
were caused by multiple index entries pointing to the same heap tuples,
where one index entry matched the actual contents of the row and
the other did not.  In both cases this occurred for a fairly small
number of rows that were clumped together into small ranges of blocks.
It looks to me like this is perfectly explained by the theory that
that range of blocks had been truncated away by a VACUUM at some point
in the table's history, and that the non-matching index entries stemmed
from an insert or update that occurred and then aborted after VACUUM had
examined the blocks the first time but before it could return to check
whether the blocks were still empty.

It's a corner case, but I say it's a must-fix.  Those bug reports have
been bothering me for most of a year ...
        regards, tom lane


Re: Hmmm ... isn't count_nondeletable_pages all wet?

От
Gregory Stark
Дата:
"Tom Lane" <tgl@sss.pgh.pa.us> writes:

> It's a corner case, but I say it's a must-fix.  

Of course

> Those bug reports have been bothering me for most of a year ...

Are there any other outstanding reports like that?

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com


Re: Hmmm ... isn't count_nondeletable_pages all wet?

От
Tom Lane
Дата:
Gregory Stark <stark@enterprisedb.com> writes:
> "Tom Lane" <tgl@sss.pgh.pa.us> writes:
>> Those bug reports have been bothering me for most of a year ...

> Are there any other outstanding reports like that?

Couldn't say.  Those two were on my mind because the reporters had
allowed me to troll through their table and index files and verify
that there were indeed multiple index pointers to the same table row,
and yet the index itself did not show any indication of corruption.
I had noticed the "clumping" pattern of the doubly-linked rows in
both cases, but hadn't found an explanation.
        regards, tom lane


Re: Hmmm ... isn't count_nondeletable_pages all wet?

От
"Pavan Deolasee"
Дата:


On 9/16/07, Tom Lane <tgl@sss.pgh.pa.us> wrote:

Couldn't say.  Those two were on my mind because the reporters had
allowed me to troll through their table and index files and verify
that there were indeed multiple index pointers to the same table row,
and yet the index itself did not show any indication of corruption.
I had noticed the "clumping" pattern of the doubly-linked rows in
both cases, but hadn't found an explanation.


This was a nice catch indeed! I remember we fixed a similar issue
with VACUUM FULL few months back. Such issues would become
prominent with HOT because of early vacuuming and reuse of
tuples. Also unlike current VACUUM which runs exclusively on a table
(i.e. no two vacuums are allowed to run on a table simultaneously),
HOT makes concurrent pruning and vacuuming of a table possible.

In fact, during HOT development and testing cycles, I faced
similar race conditions a few times. This is just to keep ourselves aware of
the changes introduced by HOT. Hopefully by now we have fixed most of
the outstanding bugs in the HOT code (and Tom's eyes would catch the
remaining, if any :-)). Fingers crossed.

Thanks,
Pavan


--
Pavan Deolasee
EnterpriseDB     http://www.enterprisedb.com

Re: Hmmm ... isn't count_nondeletable_pages all wet?

От
Florian Weimer
Дата:
* Tom Lane:

> I am fairly sure that this bug explains problems previously reported
> by Merlin Moncure:
> http://archives.postgresql.org/pgsql-general/2006-10/msg01312.php
> and Florian Weimer:
> http://archives.postgresql.org/pgsql-general/2006-11/msg00305.php
> In both those cases, off-list investigation showed that the symptoms
> were caused by multiple index entries pointing to the same heap tuples,
> where one index entry matched the actual contents of the row and
> the other did not.  In both cases this occurred for a fairly small
> number of rows that were clumped together into small ranges of blocks.
> It looks to me like this is perfectly explained by the theory that
> that range of blocks had been truncated away by a VACUUM at some point
> in the table's history, and that the non-matching index entries stemmed
> from an insert or update that occurred and then aborted after VACUUM had
> examined the blocks the first time but before it could return to check
> whether the blocks were still empty.

We did have auto-vacuum running, and while the table in question was
supposedly INSERT-only, some rollback might have occurred before the
corruption hit us, resulting in the dead tuples.  So your explanation
makes sense to me (but I'm not really familiar with PostgreSQL
internals).

Regarding Scott's commment of other reports, I don't think we've
experienced the issue again; we've switched servers since then, and
the usage patterns have changed over time.

--
Florian Weimer                <fweimer@bfk.de>
BFK edv-consulting GmbH       http://www.bfk.de/
Kriegsstraße 100              tel: +49-721-96201-1
D-76133 Karlsruhe             fax: +49-721-96201-99


grep command

От
"Kuriakose, Cinu Cheriyamoozhiyil"
Дата:
Hi All,

I am giving the command

cat config.log|grep -w 'PG_VERSION'

Which gives the following Output:

| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
| #define PG_VERSION "8.3beta2"
#define PG_VERSION "8.3beta2"

But the output that I would require is

PG_VERSION "8.3beta2" that should occur only once, can anyone please
tell me the command to get this output.

Thanks in advance

Regards
Cinu Kuriakose


Re: grep command

От
Andrew Dunstan
Дата:
Leaving aside the question of why one might want to do this, Unix 101
should show you many ways to do it. For example,

sed -n -e 's/.*PG_VERSION /PG_VERSION /p' -e /PG_VERSION/q config.log

Please don't cross-post questions like this, especially when it's not
really a PostgreSQL question at all.

cheers

andrew

Kuriakose, Cinu Cheriyamoozhiyil wrote:
> Hi All,
>
> I am giving the command
>
> cat config.log|grep -w 'PG_VERSION'
>
> Which gives the following Output:
>
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> #define PG_VERSION "8.3beta2"
>
> But the output that I would require is
>
> PG_VERSION "8.3beta2" that should occur only once, can anyone please
> tell me the command to get this output.
>
> Thanks in advance
>
> Regards
> Cinu Kuriakose
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
>
>

Re: [ADMIN] grep command

От
"Medi Montaseri"
Дата:
Or ... ask the application not the OS  ....

psql> select version() ;

Cheers
Medi

On 10/29/07, Andrew Dunstan < andrew@dunslane.net> wrote:

Leaving aside the question of why one might want to do this, Unix 101
should show you many ways to do it. For example,

sed -n -e 's/.*PG_VERSION /PG_VERSION /p' -e /PG_VERSION/q config.log

Please don't cross-post questions like this, especially when it's not
really a PostgreSQL question at all.

cheers

andrew

Kuriakose, Cinu Cheriyamoozhiyil wrote:
> Hi All,
>
> I am giving the command
>
> cat config.log|grep -w 'PG_VERSION'
>
> Which gives the following Output:
>
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION " 8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> | #define PG_VERSION "8.3beta2"
> #define PG_VERSION "8.3beta2"
>
> But the output that I would require is
>
> PG_VERSION "8.3beta2" that should occur only once, can anyone please
> tell me the command to get this output.
>
> Thanks in advance
>
> Regards
> Cinu Kuriakose
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match
>
>

---------------------------(end of broadcast)---------------------------
TIP 2: Don't 'kill -9' the postmaster