Discussion: Tables disappearing
Hi guys! I've been using pgsql for some time already and am happy with it. But of course this post has its big BUT :-)

A few months ago, one of our projects lost one DB table (at that time it was version 8.0 or so). I did some research and found out that vacuuming was set up wrong and a transaction wraparound failure had occurred... OK, I upgraded to 8.2.1 (the latest at that time) and set up autovacuum (as far as I can tell it was configured correctly and working)... but after less than one month the same thing happened... though this time the age of all databases looked OK... So I upgraded to 8.2.3 (the project is still running this version), but the same thing happened again today... This time I didn't rely on autovacuuming and wrote a cron-scheduled job to do VACUUM FULL on all tables during nightly maintenance. I'm now sure the wraparound didn't occur (am I?)... in fact, I'm lost...

One more thing: the project runs a proprietary CMS and there are several instances of it with the same database layout in different databases. Every time the "lost" table is the same one - the busiest one (mostly read)... and every time the "lost" table is gone from several projects at once... today 2 projects out of 7 on the same pgsql server were affected... it also looks like the problem happens every time shortly after midnight - scary - ha ha :-)

I have a backup of the DB files as they were this morning, but it's from after the vacuum script ran...

My system is:
OS: Debian 3.0
Kernel: 2.4.26
PgSQL: 8.2.3

Any advice and/or hint in the right direction is greatly welcome!

Regards,
--
Kamil
Kamil Srot <kamil.srot@nlogy.com> writes:
> One more thing: the project runs a proprietary CMS and there are
> several instances of it with the same database layout in different
> databases. Every time the "lost" table is the same one - the busiest
> one (mostly read)... and every time the "lost" table is gone from
> several projects at once... today 2 projects out of 7 on the same
> pgsql server were affected... it also looks like the problem happens
> every time shortly after midnight - scary - ha ha :-)

Tables don't just "disappear" like that. Check your application code to see if it's not dropping the table someplace.

regards, tom lane
Tom Lane wrote:
> Tables don't just "disappear" like that. Check your application code
> to see if it's not dropping the table someplace.
>
> regards, tom lane

grep -i drop your_application_dir
brian wrote:
> Tom Lane wrote:
>> Tables don't just "disappear" like that. Check your application code
>> to see if it's not dropping the table someplace.
>
> grep -i drop your_application_dir

This was the first thing I checked... if anything, this is the one thing I'm 100% sure is not the case. The application doesn't manipulate the database schema at all... so no CREATE/ALTER/DROP statements anywhere. On top of that, there is the "disappearance" from several databases at once (they are absolutely not interconnected in any way - each lives its own life).

I don't say it's gone by itself; I'm asking for help debugging this situation and hopefully finding a solution. The first time it happened, it had the same symptoms - this specific table was missing and the transaction counter was wrapped around. Now it seems to be the same, but the transaction counters are far below the critical value... It has happened four times already and the customer is starting to go crazy... and so am I...

Thank you, Tom, Brian and all who read it, for your time!

Regards,
--
Kamil
On Mon, Aug 27, 2007 at 06:37:17PM +0200, Kamil Srot wrote:
> I don't say it's gone by itself; I'm asking for help debugging this
> situation and hopefully finding a solution. The first time it
> happened, it had the same symptoms - this specific table was missing
> and the transaction counter was wrapped around. Now it seems to be
> the same, but the transaction counters are far below the critical
> value...

You don't say explicitly, but when you say "vacuum all tables" you didn't specify any table names, right? You just typed "vacuum". On each database...

As for debugging, maybe something like:

select xmin, age(xmin) from pg_class;

Just to check the wraparound issue...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
> You don't say explicitly, but when you say "vacuum all tables" you
> didn't specify any table names, right? You just typed "vacuum". On
> each database...

Correct... the script does echo "vacuum full;" | $PGDIR/bin/psql -U postgres $db for each database...
Hope it's correct?

> As for debugging, maybe something like:
>
> select xmin, age(xmin) from pg_class;
>
> Just to check the wraparound issue...

Maybe that's it... I did check the wraparound with:

SELECT datname, age(datfrozenxid) FROM pg_database;

In your SQL I see several rows with very high numbers!
They are all "internal" ones like pg_toast_618854, views and also some application-level indices etc.
So it seems it's not enough to vacuum all DBs (templateX inclusive).
Can you please advise how to correctly vacuum pgsql manually?

Thank you!
Regards,
--
Kamil
On Mon, Aug 27, 2007 at 06:57:54PM +0200, Kamil Srot wrote:
> Correct... the script does echo "vacuum full;" | $PGDIR/bin/psql -U
> postgres $db for each database...
> Hope it's correct?

Well, I'd drop the "full" part, it tends to bloat indexes. Also, did you check it was actually completing (no errors)?

> Maybe that's it... I did check the wraparound with:
> SELECT datname, age(datfrozenxid) FROM pg_database;
>
> In your SQL I see several rows with very high numbers!
>
> They are all "internal" ones like pg_toast_618854, views and also
> some application-level indices etc.

Depends what you mean by too high. Anything with XID 1 or 2 is not a problem, and age returns a really big number for them. Can you give some examples?

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
> Well, I'd drop the "full" part, it tends to bloat indexes. Also, did
> you check it was actually completing (no errors)?

Yes, it completes w/o errors... just VACUUM as output...
OK, I'll drop the full part and do it less often...

> Depends what you mean by too high. Anything with XID 1 or 2 is not a
> problem, and age returns a really big number for them. Can you give
> some examples?

High numbers means near the 2-billion limit. Here you go:

# select xmin, age(xmin) from pg_class;
   xmin    |    age
-----------+------------
         2 | 2147483647
         2 | 2147483647
         2 | 2147483647
         2 | 2147483647
         2 | 2147483647
         2 | 2147483647
 236838019 |     539453
         2 | 2147483647
         2 | 2147483647
 231899309 |    5478163
etc.

Basically everything has the same age 2147483647 with xmin = 1 or xmin = 2... except the two lines shown...

Actually the same problem happened again several minutes ago :-(

Thank you for any hint!

Regards,
--
Kamil
Kamil Srot wrote:
> Yes, it completes w/o errors... just VACUUM as output...
> OK, I'll drop the full part and do it less often...

What user is doing it? Is it a superuser? If it's not, then there's your problem.

--
Alvaro Herrera        Valdivia, Chile        ICBM: S 39º 49' 18.1", W 73º 13' 56.4"
"If it wasn't for my companion, I believe I'd be having the time of my life" (John Dunbar)
Alvaro Herrera wrote:
> What user is doing it? Is it a superuser? If it's not, then there's
> your problem.

It's user postgres... the vacuum completes correctly...

Thank you,
--
Kamil
On Mon, Aug 27, 2007 at 07:15:44PM +0200, Kamil Srot wrote: > OK, I'll drop the full part and do it less often... This doesn't address your problem, but when you move from VACUUM FULL to VACUUM, you want to do it _more_ often, not less. But given what you've posted, I am not even a little bit convinced your script is working as you think it is. You're clearly not managing to vacuum the entire database sometimes. A -- Andrew Sullivan | ajs@crankycanuck.ca A certain description of men are for getting out of debt, yet are against all taxes for raising money to pay it off. --Alexander Hamilton
Andrew Sullivan wrote:
> This doesn't address your problem, but when you move from VACUUM FULL
> to VACUUM, you want to do it _more_ often, not less.

Sure, I meant that I'll do the FULL vacuum less often than daily, and do the plain vacuum command daily.

> But given what you've posted, I am not even a little bit convinced
> your script is working as you think it is. You're clearly not
> managing to vacuum the entire database sometimes.

Well, I list all databases with the command "psql -U postgres -l -t" and with some shell coding I vacuum all the databases shown... I don't think there is any problem in it... If you do see one, can you give me a hint?

Thank you,
--
Kamil
On Mon, Aug 27, 2007 at 10:03:04PM +0200, Kamil Srot wrote: > > > Sure, I ment it like I'll do the FULL vacuum less often than daily and > do daily the plain vacuum command. If you have your servers set up correctly, you should never need to perform VACUUM FULL. > Well, I do list all databases with the command "psql -U postgres -l -t" > and with some shell coding I do vacuum all shown databases... > I don't think there is some problem in it... > > If you do see some, can you give me a hint? I don't think I've seen the script itself. Without seeing it, I can only guess. But you don't actually need to do that. You should be able to run vacuumdb -a, and get the result you need. A -- Andrew Sullivan | ajs@crankycanuck.ca This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well. --Dennis Ritchie
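For example, a single cron entry for the postgres user could replace the whole script (the path, schedule, and log location here are assumptions; adjust them to your install):

```
# m  h  dom mon dow  command
30   3  *   *   *    /opt/pgsql/bin/vacuumdb -a -z >> /var/log/vacuumdb.log 2>&1
```

The -a flag makes vacuumdb visit every database in the cluster (template1 included), and -z updates planner statistics at the same time; keeping the log gives you a record that each nightly run actually completed.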
Andrew Sullivan wrote:
> If you have your servers set up correctly, you should never need to
> perform VACUUM FULL.

I see. Thank you for the hint...

> I don't think I've seen the script itself. Without seeing it, I can
> only guess. But you don't actually need to do that. You should be
> able to run vacuumdb -a, and get the result you need.

The script is a very simple one:

#!/bin/sh
PGDIR=/opt/pgsql
DATABASES=`$PGDIR/bin/psql -U postgres -l -t | cut -f 1 -d '|'`
for db in $DATABASES
do
    echo "vacuum;" | $PGDIR/bin/psql -U postgres $db
done

I can easily rewrite it to use the vacuumdb command, but I doubt it'll make any difference.

Is there a way to find out whether the problem is here? I scanned the log output of pgsql for wraparound but didn't find any recent relevant entries... I'm pretty sure I'm missing something stupid :-(

Thank you very much for your help,
--
Kamil
On Mon, Aug 27, 2007 at 10:31:11PM +0200, Kamil Srot wrote:
> The script is a very simple one:

Well, I don't see anything obvious, but...

> I can easily rewrite it to use the vacuumdb command, but I doubt
> it'll make any difference.

The point is that you don't have to rewrite it. Just run "vacuumdb -a" and it vacuums _all_ databases.

> Is there a way to find out whether the problem is here? I scanned the
> log output of pgsql for wraparound but didn't find any recent
> relevant entries... I'm pretty sure I'm missing something stupid :-(

Are you quite sure you're getting all your logs where you think you are?

A
--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data. --Roger Brinner
Andrew Sullivan wrote:
> The point is that you don't have to rewrite it. Just run "vacuumdb
> -a" and it vacuums _all_ databases.

Oh, now I get it! It takes some time, but in the end I'll understand :-D Sure, it's much better than coding some scripts etc. I've already used it in my script...

> Are you quite sure you're getting all your logs where you think you
> are?

Yes, the log is correct... I can see the statement errors I made while selecting from system tables a few hours ago... I have the default debug level... should I increase it? I really need to make sure that the next occurrence of the problem (if it has to happen at all) will be the last one... :-(

Thank you,
--
Kamil
Kamil Srot <kamil.srot@nlogy.com> writes:
> # select xmin, age(xmin) from pg_class;
>    xmin    |    age
> -----------+------------
>          2 | 2147483647
>          2 | 2147483647
>          2 | 2147483647
>          2 | 2147483647
>          2 | 2147483647
>          2 | 2147483647
>  236838019 |     539453
>          2 | 2147483647
>          2 | 2147483647
>  231899309 |    5478163
> etc.

I see no indication here that there's anything wrong at all.

The rows with xmin = 2 (ie, FrozenTransactionID) are evidently pg_class rows that have never been modified since the database was created. There isn't anything wrong with them. age() is returning INT_MAX for them to denote that they are "infinitely far in the past", but they are in no danger of being invalidated by wraparound.

The newer rows indicate that your current transaction counter is around 237 million, which is barely a tenth of the way to wraparound. While it's conceivable that it already wrapped around (ie, you've really executed 237 million plus some multiple of 2^32 transactions), I think it's more probable that no wraparound has happened and your problem is entirely unrelated to that. I believe this in part because having just one table "disappear" is not the likely result of an XID wraparound problem, and in part because 8.2 has got very strong defenses against allowing wraparound failure to occur: it will shut down first, and it will do a whole lot of bleating and autovacuuming before that.

You haven't yet showed us any details of what happens. What exactly do you mean by the table "disappearing"? Can you select from it? Do you see a row for it in pg_class?

regards, tom lane
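To make the circular arithmetic concrete, here is a small sketch of how an "age" computed in a 32-bit wraparound space stays small even after the counter passes 2^32. The values are invented, it ignores the FrozenTransactionID special case described above, and it assumes a shell with 64-bit arithmetic:

```shell
#!/bin/sh
# Sketch: XIDs live in a 32-bit circular space, so a simplified age is
# (current_xid - row_xid) mod 2^32.  All values below are made up.
xid_age() {
    echo $(( ($1 - $2) & 0xFFFFFFFF ))
}

# A recently modified row: small age.
xid_age 236838019 231899309    # prints 4938710

# After the counter wraps past 2^32, a row stamped just before the
# wrap still reports a small age rather than a huge negative one:
xid_age 100 4294967290         # prints 106
```

The real age() function is computed against the next XID to be assigned and special-cases frozen rows, but the modular arithmetic is the same idea.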
Tom Lane wrote:
> The rows with xmin = 2 (ie, FrozenTransactionID) are evidently
> pg_class rows that have never been modified since the database was
> created. There isn't anything wrong with them. age() is returning
> INT_MAX for them to denote that they are "infinitely far in the
> past", but they are in no danger of being invalidated by wraparound.

I think the same, but I'm at the end of my ideas...

> You haven't yet showed us any details of what happens. What exactly
> do you mean by the table "disappearing"? Can you select from it? Do
> you see a row for it in pg_class?

It's completely gone... I cannot do anything with this relation... I fixed it by recreating this specific table from a several-hours-old backup and it's fine now (this table is nearly static)...

I didn't check whether the relation was in pg_class... but I have a complete backup of the database files after this issue from today and also from the last time it happened (though they were vacuumed a few hours before the backup by the nightly script, so the data in those snapshots may have changed)... if it can help and you'll navigate me, I can fetch any data you want from it...

Thank you,
--
Kamil
On Aug 27, 2007, at 4:08 PM, Tom Lane wrote:
> You haven't yet showed us any details of what happens. What exactly
> do you mean by the table "disappearing"? Can you select from it? Do
> you see a row for it in pg_class?
>
> regards, tom lane

Also, in your original post you mentioned a "proprietary CMS system". Is this proprietary to your company or one that you've purchased? The fact that the same table goes missing on multiple dbs all run by that CMS system certainly makes it worthy of suspicion.
Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
Erik Jones wrote:
> Also, in your original post you mentioned a "proprietary CMS system".
> Is this proprietary to your company or one that you've purchased? The
> fact that the same table goes missing on multiple dbs all run by that
> CMS system certainly makes it worthy of suspicion.

This is software developed in our company... so I'm sure it's not doing any schema manipulation. I'm actually the senior developer of this project, by accident :-)

The strange thing is, all the projects are completely independent... each has its own DB, folder with scripts, different data... just the DB user is the same... so it's highly improbable that it would make two similar errors in two distinct databases at nearly the same time...

When this problem appeared for the first time, I clearly had the wraparound problem... I vacuumed and partially restored the data... but for some time in between, commands like \dt showed all relations twice... (some system catalog problem)... then I did a full dump and restore along with an upgrade to the newest pgsql server software... this duplication was gone and never appeared again.

From the above-mentioned duplication of relation names and what Tom wrote recently (it doesn't seem like a wraparound problem), it looks like the relation name is/gets corrupted in some way and this corruption is internally carried over to another instance of a relation with the same name in another database... but I know - it's too speculative.

Maybe the binary backups of all the DB files from the time of this problem can give some answers?

Regards,
--
Kamil
Kamil Srot wrote:
> This is software developed in our company... so I'm sure it's not
> doing any schema manipulation. I'm actually the senior developer of
> this project, by accident :-)
>
> The strange thing is, all the projects are completely independent...
> each has its own DB, folder with scripts, different data... just the
> DB user is the same... so it's highly improbable that it would make
> two similar errors in two distinct databases at nearly the same
> time...

IMHO, it's not at all improbable, given that this software is connecting to the same databases you are seeing affected by this phenomenon. Not to mention that it's proprietary, so fewer eyes have gone over it.[1] I suggested earlier grepping for 'drop' in your application. Perhaps you should do the same on the CMS.

[1] Not meant as a dig at non-free software.

brian
On Mon, Aug 27, 2007 at 11:44:38PM +0200, Kamil Srot wrote: > When this problem appeared for the first time, I had clearly the > wraparound problem... I did vacuum it and partially restored the data... I don't think vacuum would "fix" a wrap around problem in the way you describe. I don't think you have a wrap problem. > From above mentioned duplications of relatio names and what Tom wrote > recently (doesn't see like WA problem), it looks like the relation name > is/gets corrupted in some way and this corruption is internally taken > over to another instance of relation named the same but in another > database... but I know - it's too speculative. No, it just doesn't work that way. Are these all shared on the same hardware? Are you sure you're not having other problems? A -- Andrew Sullivan | ajs@crankycanuck.ca "The year's penultimate month" is not in truth a good way of saying November. --H.W. Fowler
Kamil Srot <kamil.srot@nlogy.com> writes: > From above mentioned duplications of relatio names and what Tom wrote > recently (doesn't see like WA problem), it looks like the relation name > is/gets corrupted in some way and this corruption is internally taken > over to another instance of relation named the same but in another > database... but I know - it's too speculative. The fact that the same thing happens concurrently in several independent databases is to me the strongest indication that it's something the client is doing. There isn't *any* mechanism within Postgres that would somehow cause similarly-named tables in different DBs to go south at the same time. I don't say that it couldn't be a PG bug, but I do say that the client is doing something similar in each database to trigger it. My advice is to turn on log_statement and acquire a trace (with timestamps) of all the SQL commands being executed across the whole system. Looking at what happens up to the instant the problem appears should provide some clues. regards, tom lane
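For 8.2, that trace boils down to something like the following in postgresql.conf (the prefix format is just one reasonable choice):

```
log_statement = 'all'               # log every statement, not just DDL
log_line_prefix = '%t [%p] %d %u '  # timestamp, PID, database, user
log_connections = on
log_disconnections = on
```

With that in place, the log shows which session, in which database, issued any given statement, and when, so the moments before the table vanishes can be reconstructed.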
On Aug 27, 2007, at 4:44 PM, Kamil Srot wrote:
> This is software developed in our company... so I'm sure it's not
> doing any schema manipulation. I'm actually the senior developer of
> this project, by accident :-)

That doesn't mean that there's not some whacked out race condition causing corruption. I'm just saying, keep exploring every option. Saying, "I wrote and am in charge of that so I know it didn't do it" is bad. You should at the very least have someone else, as well as you, review any code that touches that table.

> The strange thing is, all the projects are completely independent...
> each has its own DB, folder with scripts, different data... just the
> DB user is the same... so it's highly improbable that it would make
> two similar errors in two distinct databases at nearly the same
> time...

Just the DB and your CMS. Given that setup, if your CMS is causing this to happen on one DB, it actually is not unlikely that it would do it to the others.

> When this problem appeared for the first time, I clearly had the
> wraparound problem... [...] it looks like the relation name is/gets
> corrupted in some way and this corruption is internally carried over
> to another instance of a relation with the same name in another
> database... but I know - it's too speculative.
Is it the same instance of your CMS managing each of the databases or a separate instance per DB? Also, how often has this happened? Since wraparound has been ruled out, it's hard to say what else could be the culprit or what to look at and do next without any more specific details about the system at the time(s) this has happened. What kind of monitoring do you have set up on your DBs? Have you verified that the table's files are still on disk after it's "disappeared"? Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
On 8/27/07, Kamil Srot <kamil.srot@nlogy.com> wrote: > I really need to make sure, the next problem (if it's necessary to let > it happen) will be the last one... :-( Have you set your log_statement to 'ddl' in postgresql.conf? That way you'll have a log of every single ddl that goes through the database. At least then when the table disappears again you can look in your logs to see if it was in fact a client app that did it. Also, look at taking away permission from anyone BUT the postgres superuser to drop objects.
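On the permissions point: in Postgres only a table's owner or a superuser can drop it, so "taking away permission" amounts to transferring ownership and granting the application only the DML rights it needs. A sketch, with hypothetical table and role names:

```sql
ALTER TABLE busy_table OWNER TO postgres;
GRANT SELECT, INSERT, UPDATE, DELETE ON busy_table TO cms_user;
```

If the CMS connects as cms_user, it can still read and write the table, but any DROP TABLE it issued would fail with a permission error that would then show up in the logs.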
brian wrote:
> IMHO, it's not at all improbable, given that this software is
> connecting to the same databases you are seeing affected by this
> phenomenon. Not to mention that it's proprietary, so fewer eyes have
> gone over it.[1] I suggested earlier grepping for 'drop' in your
> application. Perhaps you should do the same on the CMS.
>
> [1] Not meant as a dig at non-free software.

Understood, and I agree. I would accept the possibility of having made the most stupid coding error in the world, but there is simply no statement manipulating the schema in the whole project. To be sure, I searched through the sources for /drop/i and didn't find it anywhere but in the PEAR framework of PHP (some constants and test patterns of user statements)... OK, I agree this could be triggered by some coincidence, and so I have started to log all statements to the logfile... but I still doubt that an executed DROP statement is the problem...

Regards,
--
Kamil
Andrew Sullivan wrote:
> I don't think vacuum would "fix" a wraparound problem in the way you
> describe. I don't think you have a wrap problem.

I don't know much about this part of pgsql... so I followed some instructions on the list and restored the data... in the end, I did a dump/restore... I just wanted to point to the fact that the wraparound was there in the beginning (entries in the log, etc.)

> No, it just doesn't work that way. Are these all shared on the same
> hardware? Are you sure you're not having other problems?

Yes, all affected databases run under one pgsql server on the same hardware... I don't see any other problems so far. It's quite a busy server running apache, pgsql and php along with a few scheduled jobs in the night...

Regards,
--
Kamil
Kamil Srot wrote:
> brian wrote:
>> IMHO, it's not at all improbable, given that this software is
>> connecting to the same databases you are seeing affected by this
>> phenomenon. Not to mention that it's proprietary, so fewer eyes have
>> gone over it.[1] I suggested earlier grepping for 'drop' in your
>> application. Perhaps you should do the same on the CMS.
>>
>> [1] Not meant as a dig at non-free software.
>
> Understood, and I agree. I would accept the possibility of having made
> the most stupid coding error in the world, but there is simply no
> statement manipulating the schema in the whole project. To be sure, I
> searched through the sources for /drop/i and didn't find it anywhere
> but in the PEAR framework of PHP (some constants and test patterns of
> user statements)... OK, I agree this could be run by some coincidence,
> so I have started to log all statements to the logfile... but I still
> doubt that an executed DROP statement is the problem...

Just wondering: what's the name of this table? What's its function? I
think you mentioned that it's pretty well static. But what is it
holding? Maybe that'll provide a clue as to where to look.

brian
Erik Jones wrote:
> On Aug 27, 2007, at 4:44 PM, Kamil Srot wrote:
>>> Also, in your original post you mentioned a "proprietal CMS
>>> system". Is this proprietary to your company or one that you've
>>> purchased? The fact that the same table going on multiple dbs all
>>> being run by that CMS system certainly makes it worthy of suspicion.
>>
>> This is software developed in our company... so I'm sure it's not
>> doing any schema manipulation. I'm actually senior developer of this
>> project by accident :-)
>
> That doesn't mean that there's not some whacked out race condition
> causing corruption. I'm just saying, keep exploring every option.
> Saying, "I wrote and am in charge of that so I know it didn't do it"
> is bad. You should at the very least have someone else, as well as
> you, review any code that touches that table.
>
>> The strange thing is, all the projects are completely independent...
>> each has its own DB, folder with scripts, different data... just the
>> DB user is the same... so it's highly improbable that it would make
>> two similar errors in two distinct databases at nearly the same
>> time...
>
> Just the DB and your CMS. Given that setup, if your CMS is causing
> this to happen on one DB, it actually is not unlikely that it would do
> it to the others.

OK, sure, I cannot put my hand in the fire for it, but after searching
the sources for DDL commands, and also because of the manner of the
application, I simply do not believe this theory. Still, I have logging
on, and I do log everything that goes to the server in a timestamped
logfile...

>> When this problem appeared for the first time, I had clearly the
>> wraparound problem... I did vacuum it and partially restored the
>> data... but for some time, I had commands like \dt showing all
>> relations twice (some system catalog problem)... then I did a full
>> dump and restore along with an upgrade to the newest pgsql server
>> software... this duplicity was gone and never appeared again.
>>
>> From the above-mentioned duplications of relation names and what Tom
>> wrote recently (doesn't seem like a WA problem), it looks like the
>> relation name is/gets corrupted in some way and this corruption is
>> internally carried over to another instance of a relation with the
>> same name in another database... but I know - it's too speculative.
>
> Is it the same instance of your CMS managing each of the databases or
> a separate instance per DB?

Every DB has its own web domain and folder structure with its own copy
of the script files... nothing is shared.

> Also, how often has this happened?

It has happened, I think, four times... the first time about 4 months
ago, then 3 more times at intervals of about 20 to 30 days... the last
time yesterday. Then I noticed this problem one more time today, in
just one database (one of the two affected yesterday).

> Since wraparound has been ruled out, it's hard to say what else could
> be the culprit or what to look at and do next without any more
> specific details about the system at the time(s) this has happened.
> What kind of monitoring do you have set up on your DBs?

Starting 2 hours ago, I have complete logs of all commands along with
the user and database they are applied to. In the past I had just the
general logs, without timestamps and DB names... so hard to read...

> Have you verified that the table's files are still on disk after
> it's "disappeared"?

I don't have any idea how to do that... I wasn't able to access the
table using any DML/DDL commands... I can try it on a binary backup of
the damaged DB if you'll guide me...

Thank you,
--
Kamil
brian wrote:
> Just wondering: what's the name of this table? What's its function? I
> think you mentioned that it's pretty well static. But what is it
> holding? Maybe that'll provide a clue as to where to look.

The system is actually management of a website & e-shop with a
web-based UI... The table is called section_mapping and contains the
web addresses of all elements of the website in a hierarchical
structure... similar to a directory structure with parents/children. It
has 5 integer attributes and one varchar attribute (the part of the
URL)... The data are changed only during administration of the system
(adding/deleting products in the e-shop, publishing new documents,
manipulating existing ones)... but it's accessed (read) each time
anything is shown through the web... so several times per displayed web
page...

Also, if it had happened during administration of the content, the
customer would have screamed at us right away, but we found it out
several hours after it occurred, as our monitoring didn't expect to
lose one single table :-(

--
Kamil
Tom Lane wrote:
> Kamil Srot <kamil.srot@nlogy.com> writes:
>> From the above-mentioned duplications of relation names and what Tom
>> wrote recently (doesn't seem like a WA problem), it looks like the
>> relation name is/gets corrupted in some way and this corruption is
>> internally carried over to another instance of a relation with the
>> same name in another database... but I know - it's too speculative.
>
> The fact that the same thing happens concurrently in several
> independent databases is to me the strongest indication that it's
> something the client is doing. There isn't *any* mechanism within
> Postgres that would somehow cause similarly-named tables in different
> DBs to go south at the same time. I don't say that it couldn't be a PG
> bug, but I do say that the client is doing something similar in each
> database to trigger it.
>
> My advice is to turn on log_statement and acquire a trace (with
> timestamps) of all the SQL commands being executed across the whole
> system. Looking at what happens up to the instant the problem appears
> should provide some clues.

OK, I did it... I have good logging now, so if it happens again, I'll
post the relevant part of it to the list.

Thank you for your help!
--
Kamil
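[Editor's note: Tom's log_statement advice amounts to something like the
following postgresql.conf fragment. The prefix escapes shown are an
assumption for the 8.2 series; check the Runtime Configuration chapter
for your exact build.]

```
# postgresql.conf -- log every SQL statement with enough context
log_statement = 'all'
log_line_prefix = '%t [%p] %u@%d '   # timestamp, backend PID, user@database
```

After a reload (pg_ctl reload), every statement should appear in the
server log tagged with the database and user that issued it.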
Kamil Srot <kamil.srot@nlogy.com> writes:
> Erik Jones wrote:
>> Have you verified that the table's files are still on disk after
>> it's "disappeared"?
> Do not have any idea how to do it... I wasn't able to access it using
> any DML/DDL commands... can try it on a binary backup of the damaged
> DB if you'll guide me...

Make a note now of the table's "relfilenode" value (it'll be different
in each database), and confirm that you see it in the filesystem. After
the next disappearance, see if anything's still there. For background
read http://www.postgresql.org/docs/8.2/static/storage.html

Note that certain operations like TRUNCATE and CLUSTER change the
relfilenode, so if you're using any of those then it might get harder
to track where the file is.

regards, tom lane
Tom Lane wrote:
> Kamil Srot <kamil.srot@nlogy.com> writes:
>> Erik Jones wrote:
>>> Have you verified that the table's files are still on disk after
>>> it's "disappeared"?
>> Do not have any idea how to do it... I wasn't able to access it using
>> any DML/DDL commands... can try it on a binary backup of the damaged
>> DB if you'll guide me...
>
> Make a note now of the table's "relfilenode" value (it'll be different
> in each database), and confirm that you see it in the filesystem.
> After the next disappearance, see if anything's still there. For
> background read
> http://www.postgresql.org/docs/8.2/static/storage.html

OK, I have the filenames noted, and I confirm they all exist now under
base/ in the pgsql tree...

> Note that certain operations like TRUNCATE and CLUSTER change the
> relfilenode, so if you're using any of those then it might get harder
> to track where the file is.

There is no manipulation of the structure of the DB, so it'll stay the
same...

Thank you!
--
Kamil
On Monday 27 August 2007 5:57 pm, Kamil Srot wrote:
> Tom Lane wrote:
>> Make a note now of the table's "relfilenode" value (it'll be
>> different in each database), and confirm that you see it in the
>> filesystem. After the next disappearance, see if anything's still
>> there. For background read
>> http://www.postgresql.org/docs/8.2/static/storage.html
>
> OK, I have the filenames noted, and I confirm they all exist now under
> base/ in the pgsql tree...
>
>> Note that certain operations like TRUNCATE and CLUSTER change the
>> relfilenode, so if you're using any of those then it might get harder
>> to track where the file is.
>
> There is no manipulation of the structure of the DB, so it'll stay the
> same...
>
> Thank you!

I have a question. First a little history. Right now, the people who
know better than I are fairly certain Postgres is not changing things
on its own, and the developer is certain the CMS software is not doing
schema changes. As I understand it, logging has been cranked up to test
both those assumptions. My question is, how are legitimate schema
changes done? Just wondering if there is a third party involved.

--
Adrian Klaver
aklaver@comcast.net
Adrian Klaver wrote: > I have a question. First a little history. Right now, the people who know > better than I are fairly certain Postgres is not changing things on its own > and the developer is certain the CMS software is not doing schema changes. As > I understand it logging has been cranked up to test both those assumptions. > My question is, how are legitimate schema changes done? Just wondering if > there is a third party involved. Maybe someone discovered an SQL injection in your apps and is making fun of you by dropping tables. -- Alvaro Herrera http://www.CommandPrompt.com/ The PostgreSQL Company - Command Prompt, Inc.
On Aug 27, 2007, at 7:57 PM, Kamil Srot wrote:
> Tom Lane wrote:
>> Make a note now of the table's "relfilenode" value (it'll be
>> different in each database), and confirm that you see it in the
>> filesystem. After the next disappearance, see if anything's still
>> there. For background read
>> http://www.postgresql.org/docs/8.2/static/storage.html
>
> OK, I have the filenames noted, and I confirm they all exist now under
> base/ in the pgsql tree...
>
>> Note that certain operations like TRUNCATE and CLUSTER change the
>> relfilenode, so if you're using any of those then it might get harder
>> to track where the file is.
>
> There is no manipulation of the structure of the DB, so it'll stay the
> same...
>
> Thank you!

Also, I'd write a simple "ping" script to check for the table that runs
every 5 seconds or so. Have it, at the very least, write out the
timestamp of when the table disappears into a file. Better yet would be
for it to send you an alert so you can check it out right away. With
that *when*, you'll know *where* in your logs to look to see what
happened at that time.

Erik Jones
Software Developer | Emma®
erik@myemma.com
800.595.4401 or 615.292.5888
615.292.0777 (fax)

Emma helps organizations everywhere communicate & market in style.
Visit us online at http://www.myemma.com
Erik Jones <erik@myemma.com> writes: > Also, I'd write a simple "ping" script to check for the table that > runs every 5 seconds or so. I had gathered that the table was being touched constantly by his app, so that it'd be immediately obvious when it had gone away. If that's not so, then definitely do the above. regards, tom lane
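[Editor's note: a minimal sketch of Erik's "ping" script. The database
name, table name, and log path are assumptions taken from this thread;
CHECK_CMD can be overridden for testing or different auth setups.]

```shell
#!/bin/sh
# watch_table.sh -- log the moment the table stops being visible.
# CHECK_CMD must print "1" while the table exists; the default asks
# PostgreSQL directly (adjust connection options to your setup).
: "${CHECK_CMD:=psql -d projectdb -tAc \"SELECT 1 FROM pg_class WHERE relname = 'section_mapping'\"}"
LOGFILE="${LOGFILE:-/tmp/table_watch.log}"

while true; do
    if [ "$(eval "$CHECK_CMD" 2>/dev/null)" != "1" ]; then
        echo "$(date '+%Y-%m-%d %H:%M:%S') table missing" >> "$LOGFILE"
        break               # or send an alert mail here and keep looping
    fi
    sleep 5
done
```

The recorded timestamp then pinpoints the window to inspect in the
statement logs.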
On Tue, Aug 28, 2007 at 02:03:29AM +0200, Kamil Srot wrote: > The system is actually management of website&eshop with webbased UI ... Um, are you sure you don't have a SQL-injection problem, and someone is doing Something Bad to you? A -- Andrew Sullivan | ajs@crankycanuck.ca The fact that technology doesn't work is no bar to success in the marketplace. --Philip Greenspun
Adrian Klaver wrote:
> I have a question. First a little history. Right now, the people who
> know better than I are fairly certain Postgres is not changing things
> on its own, and the developer is certain the CMS software is not doing
> schema changes. As I understand it, logging has been cranked up to
> test both those assumptions. My question is, how are legitimate schema
> changes done? Just wondering if there is a third party involved.

In the version used on this particular server, there is no
automatic/programmatic way of changing the schema. Upgrades are done
manually, and the application itself doesn't need schema changes for
routine operations... It's nothing more, nothing less than a set of PHP
scripts, procedural-language functions, and a few utility scripts doing
quite simple DML statements...

Regards,
--
Kamil
Tom Lane wrote:
> Erik Jones <erik@myemma.com> writes:
>> Also, I'd write a simple "ping" script to check for the table that
>> runs every 5 seconds or so.
>
> I had gathered that the table was being touched constantly by his app,
> so that it'd be immediately obvious when it had gone away. If that's
> not so, then definitely do the above.

You're right, Tom... the table is used intensively by the application,
and the application itself is under not high, but big enough load to
take a hit at least every few seconds... If anything happens to the
table, entries referring to a non-existent relation will appear in the
log right away.

Regards,
--
Kamil
Alvaro Herrera wrote:
> Adrian Klaver wrote:
>> I have a question. First a little history. Right now, the people who
>> know better than I are fairly certain Postgres is not changing things
>> on its own, and the developer is certain the CMS software is not
>> doing schema changes. As I understand it, logging has been cranked up
>> to test both those assumptions. My question is, how are legitimate
>> schema changes done? Just wondering if there is a third party
>> involved.
>
> Maybe someone discovered an SQL injection in your apps and is making
> fun of you by dropping tables.

Andrew, Alvaro... well, sure, SQL injection is a possibility I cannot
ignore... (and sure, as the "dad" of this application, I think it's not
the case :-) ... just kidding...) Even injected SQL will show up in the
logs, so we'll know more after some time. There have been too many
issues of the same type to think it won't appear again.

I'm really interested in what the problem is, and not only from a
professional point of view...

Regards,
--
Kamil
On Mon, Aug 27, 2007 at 05:58:26PM -0400, Andrew Sullivan wrote:
> On Mon, Aug 27, 2007 at 11:44:38PM +0200, Kamil Srot wrote:
>> When this problem appeared for the first time, I had clearly the
>> wraparound problem... I did vacuum it and partially restored the
>> data...
>
> I don't think vacuum would "fix" a wrap around problem in the way you
> describe. I don't think you have a wrap problem.

Actually, it can: if you have just experienced wraparound (within 1
billion transactions or so), then vacuum will magically cause the
wrapped rows to reappear. Of course, your indexes may be invalid and
your constraints may be violated, but the data will still be there...

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability
> to litigate.
Kamil Srot wrote: > In the version used on this particular server, there is no > automatic/programing way of changing > the schema. Upgrades are done manually and application itself doesn't > need schema changes > for routine operations... In that case, you can settle the matter by making sure your application connects as a user that doesn't own any of the tables. That way they can't be dropped. -- Richard Huxton Archonet Ltd
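[Editor's note: Richard's lockdown can be sketched in SQL. The role
name and password are hypothetical; repeat the GRANT for each table the
CMS touches.]

```sql
-- A non-owner, non-superuser role cannot DROP or TRUNCATE the table,
-- so the application physically loses the ability to remove it.
CREATE ROLE cms_app LOGIN PASSWORD 'change_me';
GRANT SELECT, INSERT, UPDATE, DELETE ON section_mapping TO cms_app;
```

The application's connection string would then use cms_app instead of
the table-owning user.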
Richard Huxton wrote:
> Kamil Srot wrote:
>> In the version used on this particular server, there is no
>> automatic/programmatic way of changing the schema. Upgrades are done
>> manually, and the application itself doesn't need schema changes for
>> routine operations...
>
> In that case, you can settle the matter by making sure your
> application connects as a user that doesn't own any of the tables.
> That way they can't be dropped.

Yes, I'll do that in the future, but right now I want to know what the
problem is and solve it, rather than apply workarounds (even though
they are also good practice in principle).

Regards,
--
Kamil
My entire shop has set up a betting pool on the outcome of this...so I hope you post results regardless of the outcome, Kamil. (in all seriousness, we hope you find/fix the problem before things get really ugly) Kamil Srot wrote: > Richard Huxton wrote: >> Kamil Srot wrote: >>> In the version used on this particular server, there is no >>> automatic/programing way of changing >>> the schema. Upgrades are done manually and application itself >>> doesn't need schema changes >>> for routine operations... >> >> In that case, you can settle the matter by making sure your >> application connects as a user that doesn't own any of the tables. >> That way they can't be dropped. >> > Yes, I'll do it in the future, but now, I want to know what is the > problem and solve it rather than doing some workarounds (even if they > are good also from the principle indeed). > > Regards,
Jeff Amiel wrote:
> My entire shop has set up a betting pool on the outcome of this... so
> I hope you post results regardless of the outcome, Kamil.

Heh, is the pool still open? Maybe I can make at least something from
it :-D

> (in all seriousness, we hope you find/fix the problem before things
> get really ugly)

I will post every relevant development on this issue...

Thanks!
--
Kamil
Kamil Srot wrote:
> Heh, is the pool still open? Maybe I can make at least something from
> it :-D

Current odds....

Application bug: even money
Application configuration issue: 2-1
Rogue cron job or other maintenance process: 4-1
Somebody messing with you (or SQL injection): 8-1
XID wraparound issue: 10-1
Alien or supernatural intervention: 18-1
Obscure postgresql bug nobody else has ever seen: 25-1
On Aug 28, 2007, at 9:41 AM, Kamil Srot wrote: > Jeff Amiel wrote: >> My entire shop has set up a betting pool on the outcome of >> this...so I hope you post results regardless of the outcome, Kamil. >> > Heh, is the pool still open? Maybe I can make at least something > from it :-D >> (in all seriousness, we hope you find/fix the problem before >> things get really ugly) >> > Will post about every relevant doings to this issue... > > Thanks! > -- /me remembers Pete Rose. Erik Jones Software Developer | Emma® erik@myemma.com 800.595.4401 or 615.292.5888 615.292.0777 (fax) Emma helps organizations everywhere communicate & market in style. Visit us online at http://www.myemma.com
At 03:15 PM 8/28/2007, Kamil Srot wrote:
> Andrew, Alvaro... well, sure, SQL injection is a possibility I cannot
> ignore... (and sure, as the "dad" of this application, I think it's
> not the case :-) ... just kidding...)
> Even injected SQL will show up in the logs, so we'll know more after
> some time. There have been too many issues of the same type to think
> it won't appear again.
>
> I'm really interested in what the problem is, and not only from a
> professional point of view...

If it's SQL injection via a webserver app, and you have _already_
logged HTTP requests in sufficient detail, you could try to look for a
"drop" in them, or for variations of escaped versions of it, e.g.
..%44%52%4f%50... or %64%52o%70, at about the time you think the
incident happened.

Of course, it could just be that someone had db access and sent the
drop command.

Regards,
Link.
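[Editor's note: percent-encoded payloads like those can be found by
decoding the access log before grepping. A sketch; the log path is
hypothetical, and it assumes perl is available.]

```shell
#!/bin/sh
# Usage: sh scan_log.sh /path/to/access.log
# Decode %XX escapes so obfuscated payloads like %44%52%4f%50 become
# DROP, then search case-insensitively, printing line numbers.
perl -pe 's/%([0-9A-Fa-f]{2})/chr(hex($1))/ge' "$1" | grep -in drop
```

Run against the webserver access log for the window identified in the
statement logs.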
On 28/08/2007 15:48, Jeff Amiel wrote: > Alien or supernatural intervention: 18-1 > Obscure postgresql bug nobody else has ever seen: 25-1 That's the sort of confidence in the DBMS we all like to see! :-) Ray. --------------------------------------------------------------- Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland rod@iol.ie ---------------------------------------------------------------