Обсуждение: DROP TABLE and concurrent modifications

Поиск
Список
Период
Сортировка

DROP TABLE and concurrent modifications

От
Neil Conway
Дата:
I can reproduce the following behavior with CVS HEAD.
    1. Have a process do INSERTs into a table in a tight loop (I've       attached a trivial libpq app that does this)
    2. In another session, repeatedly drop and re-create the table       that is being modified

You should see a stream of error messages from the INSERT client like:

query failed: ERROR:  relation 29118 deleted while still in use
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation 32430 deleted while still in use
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation "test_tbl" does not exist
query failed: ERROR:  relation 34206 deleted while still in use

The problem is the variant of the error message. When the error
message variant occurs, the INSERT backend is in the following state:

[ ... ]
#2  0x0824ff48 in RelationClearRelation (relation=0x40c92538, rebuild=1 '\001') at relcache.c:1711
#3  0x0825006e in RelationFlushRelation (relation=0x40c92538) at relcache.c:1775
#4  0x082501b5 in RelationCacheInvalidateEntry (relationId=17145, rnode=0x0) at relcache.c:1842
#5  0x0824d153 in LocalExecuteInvalidationMessage (msg=0xbfffeed0) at inval.c:452
#6  0x081c6af5 in ReceiveSharedInvalidMessages (invalFunction=0x824d043 <LocalExecuteInvalidationMessage>,
resetFunction=0x824d213<InvalidateSystemCaches>) at sinval.c:125
 
#7  0x0824d3c6 in AcceptInvalidationMessages () at inval.c:611
#8  0x081c8f99 in LockRelation (relation=0x40c92538, lockmode=3) at lmgr.c:143
#9  0x08089232 in relation_open (relationId=17145, lockmode=3) at heapam.c:462
#10 0x080892c9 in relation_openrv (relation=0x83956e0, lockmode=3) at heapam.c:506
#11 0x08089576 in heap_openrv (relation=0x83956e0, lockmode=3) at heapam.c:610
#12 0x080ee857 in setTargetTable (pstate=0x83955ec, relation=0x83956e0, inh=0 '\0', alsoSource=0 '\0', requiredPerms=1)
atparse_clause.c:142
 
#13 0x080d4390 in transformInsertStmt (pstate=0x83955ec, stmt=0x8395808, extras_before=0xbffff0a0,
extras_after=0xbffff09c)at analyze.c:543
 
[ ... ]

i.e. it is waiting to acquire a lock on the relation it wants to
INSERT into, but before returning from LockRelation() it receives a
shared-cache invalidation message for the relation the other backend
has just dropped. This causes it to error out in the bowels of
RelationClearRelation():
    if (RelationBuildDesc(buildinfo, relation) != relation)    {        /* Should only get here if relation was deleted
*/       FreeTupleDesc(old_att);        if (old_rulescxt)            MemoryContextDelete(old_rulescxt);
pfree(relation);       elog(ERROR, "relation %u deleted while still in use",             buildinfo.i.info_id);    }
 

Assuming my analysis is correct, is this a bug?

AFAICS it should be totally harmless, but at the least it would be
nice to display a more friendly/informative error message. Can anyone
see a way to do this without too much pain?

-Neil



Re: DROP TABLE and concurrent modifications

От
Neil Conway
Дата:
Neil Conway <neilc@samurai.com> writes:
>      1. Have a process do INSERTs into a table in a tight loop (I've
>         attached a trivial libpq app that does this)

Sorry, I was evidently a little too quick off the draw. A simple test
app is /really/ attached this time.

-Neil

#include <stdio.h>
#include <unistd.h>
#include <libpq-fe.h>

int
main(void)
{
    PGconn *conn;

    conn = PQconnectdb("");
    if (PQstatus(conn) == CONNECTION_BAD)
        return 1;

    for (;;)
    {
        PGresult *res;

        res = PQexec(conn, "INSERT INTO test_tbl VALUES (5, 5, 5);");
        if (PQresultStatus(res) != PGRES_COMMAND_OK)
        {
            /* query failed */
            printf("query failed: %s", PQresultErrorMessage(res));
            fflush(stdout);
            sleep(1);
        }

        PQclear(res);
    }

    PQfinish(conn);

    return 0;
}

Re: DROP TABLE and concurrent modifications

От
Tom Lane
Дата:
Neil Conway <neilc@samurai.com> writes:
> Assuming my analysis is correct, is this a bug?

Yes, though a low-priority one in my mind.  There is a TODO item about
it:

* Acquire lock on a relation before building a relcache entry for it

(The TODO item is a bit unspecific though, since the issue here probably
has to do with reusing an existing relcache entry rather than starting
from scratch.)

The difficulty with acquiring lock earlier is that to acquire lock,
you need to know the relation's shared/unshared status as well as its
OID.  We'd need to do something with all the code that assumes that
an OID is sufficient information for opening relations.

For the case of DROP TABLE, we don't really need to solve this problem;
it would be sufficient to make the error message a bit more friendly
(we could possibly save aside the relation name before trying to rebuild
the cache entry).  I think the real reason for the TODO was concerns
about ALTER TABLE RENAME --- if someone else is doing that, you could
end up accessing a table that, by the time you've locked it, has a
different name than you were looking up.  It's not entirely clear to me
what *should* happen in that case, but silently continuing is arguably
not the best idea.
        regards, tom lane


Re: DROP TABLE and concurrent modifications

От
Bruce Momjian
Дата:
Tom Lane wrote:
> Neil Conway <neilc@samurai.com> writes:
> > Assuming my analysis is correct, is this a bug?
> 
> Yes, though a low-priority one in my mind.  There is a TODO item about
> it:
> 
> * Acquire lock on a relation before building a relcache entry for it
> 
> (The TODO item is a bit unspecific though, since the issue here probably
> has to do with reusing an existing relcache entry rather than starting
> from scratch.)
> 
> The difficulty with acquiring lock earlier is that to acquire lock,
> you need to know the relation's shared/unshared status as well as its
> OID.  We'd need to do something with all the code that assumes that
> an OID is sufficient information for opening relations.
> 
> For the case of DROP TABLE, we don't really need to solve this problem;
> it would be sufficient to make the error message a bit more friendly
> (we could possibly save aside the relation name before trying to rebuild
> the cache entry).  I think the real reason for the TODO was concerns
> about ALTER TABLE RENAME --- if someone else is doing that, you could
> end up accessing a table that, by the time you've locked it, has a
> different name than you were looking up.  It's not entirely clear to me
> what *should* happen in that case, but silently continuing is arguably
> not the best idea.

Any TODO's here?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
359-1001+  If your life is a hard drive,     |  13 Roberts Road +  Christ can be your backup.        |  Newtown Square,
Pennsylvania19073
 


Re: DROP TABLE and concurrent modifications

От
Neil Conway
Дата:
Tom Lane <tgl@sss.pgh.pa.us> writes:
> The difficulty with acquiring lock earlier is that to acquire lock,
> you need to know the relation's shared/unshared status as well as
> its OID.  We'd need to do something with all the code that assumes
> that an OID is sufficient information for opening relations.

Yeah, good point. I don't see an easy solution, and I've got bigger
fish to fry at the moment. I'll put it down on my todo list, tho.

> I think the real reason for the TODO was concerns about ALTER TABLE
> RENAME --- if someone else is doing that, you could end up accessing
> a table that, by the time you've locked it, has a different name
> than you were looking up.  It's not entirely clear to me what
> *should* happen in that case, but silently continuing is arguably
> not the best idea.

ISTM we should produce a "table does not exist" error. In effect, the
ALTER TABLE RENAME is executed before the INSERT. Therefore, the state
of the db at the beginning of the execution of the INSERT reflects the
rename, so the table referred to by the INSERT no longer exists.

Blake: thanks for the bug report.

-Neil