Can slock_t ever be unaligned?
From | Tom Lane |
---|---|
Subject | Can slock_t ever be unaligned? |
Date | |
Msg-id | 422.906661567@sss.pgh.pa.us |
List | pgsql-hackers |
I'm about halfway convinced that the database corruption problem I reported yesterday is a result of interlock failure among multiple backends. I have a trace of the frontend/backend interactions that were happening at the time the table got corrupted, and let me tell you they are peculiar.

Four clients were simultaneously trying to access two tables, using BEGIN TRANSACTION / LOCK / END TRANSACTION to ensure consistency. Works fine 99% of the time. This particular time, not only was the table corrupted but the clients got logically inconsistent results: one transaction saw some but not all of the updates committed by a previous transaction. Moreover, the timestamps show that one client successfully executed several begin/lock/update/end transaction cycles on one of the tables *while another client believed it was holding a lock on that table*. The timestamps also indicate that the bogus transactions took about ten times longer to execute than they normally would've.

Given this evidence, I am strongly inclined to think that spinlocking (S_LOCK and friends) is not working right on my platform ... which is HPUX 9. I've eyeballed the HP-PA assembly implementation of tas(), and the only thing potentially wrong with it that I can see is that the 16-byte slock_t object had better be aligned at least on a 4-byte boundary. If it happened to be placed at an odd byte address, the tas() code would overwrite one to three bytes beyond the end of the slock_t object. Can anyone say whether that's possible? Is slock_t ever part of a tuple that might be packed to strange boundaries?

Another thing that would kill this implementation is if someone tried to copy an slock_t around while it is in the locked state --- the assembly code is actually using whichever word of the 16-byte object is aligned on a 16-byte boundary, because that's what HP-PA's semaphore lock instruction requires. Move the slock_t to a different address, and the active word within it probably changes.
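To make the alignment arithmetic concrete, here is a minimal C sketch of which word inside a 16-byte object ends up being the active one. It assumes that tas() rounds the lock address up to the next 16-byte boundary and touches the 4-byte word found there; that rounding rule is an assumption for illustration, and every name in the block (`active_word_offset`, `overrun_bytes`, `copy_moves_active_word`, `demo_self_check`, the `DEMO_` constants) is hypothetical rather than taken from the PostgreSQL sources.

```c
#include <assert.h>
#include <stdint.h>

/* Sizes as described in the post; the names are hypothetical. */
#define DEMO_SLOCK_SIZE 16u   /* size of the slock_t object in bytes */
#define DEMO_WORD_SIZE   4u   /* width of the word the lock instruction touches */

/*
 * Offset, from the start of the object, of the word that sits on a
 * 16-byte boundary. Assumption: the address is rounded UP to that boundary.
 */
static unsigned active_word_offset(uintptr_t lock_addr)
{
    uintptr_t misalign = lock_addr % DEMO_SLOCK_SIZE;
    return (unsigned) ((DEMO_SLOCK_SIZE - misalign) % DEMO_SLOCK_SIZE);
}

/* Number of bytes by which the active word extends past the end of the object. */
static unsigned overrun_bytes(uintptr_t lock_addr)
{
    unsigned end = active_word_offset(lock_addr) + DEMO_WORD_SIZE;
    return end > DEMO_SLOCK_SIZE ? end - DEMO_SLOCK_SIZE : 0;
}

/* Nonzero if moving the object from one address to another changes the active word. */
static int copy_moves_active_word(uintptr_t from, uintptr_t to)
{
    return active_word_offset(from) != active_word_offset(to);
}

/* Usage example: checks the cases discussed in the text. */
static void demo_self_check(void)
{
    assert(overrun_bytes(0x1000) == 0);   /* 16-byte aligned: word at offset 0 */
    assert(overrun_bytes(0x1004) == 0);   /* 4-byte aligned: word at offset 12 */
    assert(overrun_bytes(0x1001) == 3);   /* misaligned: 3 bytes past the end */
    assert(overrun_bytes(0x1003) == 1);   /* misaligned: 1 byte past the end */
    assert(copy_moves_active_word(0x1000, 0x1008));  /* offset 0 becomes offset 8 */
}
```

Under this assumption, any 4-byte-aligned address keeps the active word inside the object, while addresses that are 1, 2, or 3 bytes past a 16-byte boundary spill 3, 2, or 1 bytes beyond it. The last check shows why copying a locked slock_t is dangerous: the lock state stays in the old word position, but the code now looks at a different word.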
So is there any place in the system where structures containing slock_t's might be shifted around? I think I will try modding tas.s to arrange a coredump if the passed address isn't adequately aligned, and then start testing things... but if anyone can tell me exactly where slock_t's usually live, it might save me some time.

The next possibility is that it's not spin-locking but a higher level of lock code that is broken. If anyone can give me an idea where to look, I'd appreciate it.

BTW, I have only seen these failures with a 6.3.2 server, not with current sources ... but I haven't stressed my development server very much with multiple clients. The bug could still be in 6.4beta.

regards,
tom lane
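The alignment check proposed above for tas.s could also be prototyped in C before touching the assembly. The following is only a sketch of that idea: `slock_addr_is_aligned` and `check_slock_alignment` are hypothetical helpers, not existing PostgreSQL functions, and the 4-byte requirement is the one stated in the message.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: true if the lock address is on a 4-byte boundary. */
static int slock_addr_is_aligned(const volatile void *lock)
{
    return ((uintptr_t) lock % 4u) == 0;
}

/*
 * Hypothetical guard to call before tas(): report the offending address
 * and abort(), which normally produces a core dump for inspection.
 */
static void check_slock_alignment(const volatile void *lock)
{
    if (!slock_addr_is_aligned(lock))
    {
        fprintf(stderr, "misaligned slock_t at %p\n", (void *) lock);
        abort();
    }
}
```

Calling such a guard at every lock acquisition would turn a silent overwrite into an immediate, debuggable failure, at the cost of one extra test per call.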