Discussion: VACUUM FULL results in deadlock
Hi everyone,

When executing multiple threads that execute VACUUM FULL on distinct databases, a deadlock like the following can occur:

ERROR: deadlock detected
Detail: Process 16407 waits for AccessShareLock on relation 1260 of database 0; blocked by process 16404.
Process 16404 waits for RowExclusiveLock on relation 1214 of database 0; blocked by process 16407.
Hint: See server log for query details.

This is unexpected, because the documentation does not mention that VACUUM FULL can result in a deadlock. Also, VACUUM without a table argument should affect only the current database [1]:

"Without a table_and_columns list, VACUUM processes every table and materialized view in the current database that the current user has permission to vacuum."

To reproduce such a deadlock, I've attached a Java program that first creates 32 databases (test0 to test31), and then starts 32 threads, each one connecting to one of the databases (with superuser privileges). Every thread then repeatedly executes "VACUUM FULL". Within a few seconds, deadlock error messages should pop up.

I'm using the following Postgres version: psql (11.4 (Ubuntu 11.4-1.pgdg19.04+1)).

Is this a bug?

Best,
Manuel

[1] https://www.postgresql.org/docs/11/sql-vacuum.html
Attachments
Hi everyone,

Did anyone try to verify this? I found a number of other race conditions where VACUUM causes errors such as "ERROR: found unexpected null value in index "i0"", "invalid input syntax for type boolean", or "missing chunk number 0 for toast value 13171 in pg_toast_2619". Fixing this would help me to narrow down these issues.

Best,
Manuel

On Sat, Jun 29, 2019 at 5:51 PM Manuel Rigger <rigger.manuel@gmail.com> wrote:
>
> Hi everyone,
>
> When executing multiple threads that execute VACUUM FULL on distinct
> databases, a deadlock like the following can occur:
>
> ERROR: deadlock detected
> Detail: Process 16407 waits for AccessShareLock on relation 1260 of
> database 0; blocked by process 16404.
> Process 16404 waits for RowExclusiveLock on relation 1214 of database
> 0; blocked by process 16407.
> Hint: See server log for query details.
>
> This is unexpected, because the documentation does not mention that
> VACUUM FULL can result in a deadlock. Also, VACUUM without a table
> argument should affect only the current database [1]:
>
> "Without a table_and_columns list, VACUUM processes every table and
> materialized view in the current database that the current user has
> permission to vacuum."
>
> To reproduce such a deadlock, I've attached a Java program that first
> creates 32 databases (test0 to test31), and then starts 32 threads,
> each one connecting to one of the databases (with superuser
> privileges). Every thread then repeatedly executes "VACUUM FULL".
> Within a few seconds, deadlock error messages should pop up.
>
> I'm using the following Postgres version: psql (11.4 (Ubuntu
> 11.4-1.pgdg19.04+1)).
>
> Is this a bug?
>
> Best,
> Manuel
>
> [1] https://www.postgresql.org/docs/11/sql-vacuum.html
On Thu, Jul 4, 2019 at 9:12 AM Manuel Rigger <rigger.manuel@gmail.com> wrote:
> Did anyone try to verify this? I found a number of other race
> conditions where VACUUM causes errors such as "ERROR: found unexpected
> null value in index "i0"", "invalid input syntax for type boolean", or
> "missing chunk number 0 for toast value 13171 in pg_toast_2619".
> Fixing this would help me to narrow down these issues.

I don't know whether or not this is a bug, but my guess is that it isn't. pg_database and several other system catalogs are shared relations, which means that they are shared by all databases. So, it's not too surprising that you could get some interaction between VACUUM FULL commands in different databases. Routine use of VACUUM FULL is something that should be avoided, so it doesn't seem like a huge problem to me that you can't run 32 of them at the same time.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Thanks for your comment! I can also reproduce deadlocks for other statements that operate on distinct databases, at least also for ANALYZE. The documentation states that such statements operate on their current database. If this is not a bug, should the documentation perhaps be updated to mention this caveat?

Best,
Manuel

On Fri, Jul 5, 2019 at 3:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Jul 4, 2019 at 9:12 AM Manuel Rigger <rigger.manuel@gmail.com> wrote:
> > Did anyone try to verify this? I found a number of other race
> > conditions where VACUUM causes errors such as "ERROR: found unexpected
> > null value in index "i0"", "invalid input syntax for type boolean", or
> > "missing chunk number 0 for toast value 13171 in pg_toast_2619".
> > Fixing this would help me to narrow down these issues.
>
> I don't know whether or not this is a bug, but my guess is that it
> isn't. pg_database and several other system catalogs are shared
> relations, which means that they are shared by all databases. So, it's
> not too surprising that you could get some interaction between VACUUM
> FULL command in different databases. Routine use of VACUUM FULL is
> something that should be avoided, so it doesn't seem like a huge
> problem to me that you can't run 32 of them at the same time.
>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company
On Fri, Jul 5, 2019 at 9:27 AM Manuel Rigger <rigger.manuel@gmail.com> wrote:
> Thanks for your comment! I can also reproduce deadlocks for other
> statements that operate on distinct databases, at least also for
> ANALYZE. The documentation states that such statements operate on
> their current databases. If this is not a bug, should maybe the
> documentation be updated to mention this caveat?

Possibly. The trick is always to find a good place to put things like this -- they need to not consume a disproportionate amount of space relative to the importance of an issue, and they need to be inserted into a place in the documentation where they make logical sense and are likely to be seen by users. If you have a good idea, feel free to submit a patch, perhaps on pgsql-docs.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2019-Jul-05, Robert Haas wrote:

> On Thu, Jul 4, 2019 at 9:12 AM Manuel Rigger <rigger.manuel@gmail.com> wrote:
> > Did anyone try to verify this? I found a number of other race
> > conditions where VACUUM causes errors such as "ERROR: found unexpected
> > null value in index "i0"", "invalid input syntax for type boolean", or
> > "missing chunk number 0 for toast value 13171 in pg_toast_2619".
> > Fixing this would help me to narrow down these issues.
>
> I don't know whether or not this is a bug, but my guess is that it
> isn't. pg_database and several other system catalogs are shared
> relations, which means that they are shared by all databases. So, it's
> not too surprising that you could get some interaction between VACUUM
> FULL command in different databases. Routine use of VACUUM FULL is
> something that should be avoided, so it doesn't seem like a huge
> problem to me that you can't run 32 of them at the same time.

Actually, in my reading of the code, VACUUM FULL is supposed to obtain an InvalidOid-database lock on the relation being vacuumed if it's a shared one, so it seems to me that it should work -- namely, that when vacuuming a global table, any other vacuum of that table should be blocked regardless of which database it occurs in.

On the other hand, not all the errors that Manuel reports are obviously related to global catalogs. For example, why is it complaining about a missing toast value in pg_statistic's toast table? Do global tables have statistics in specific databases' pg_statistic? How does ANALYZE work sensibly in that case? Maybe that's okay, but still the whole thing should be blocked by the InvalidOid-database-level lock.

I agree that you should not be running 32 full vacuums at once, much less in a loop, but if you do, they shouldn't result in weird corruption such as the ones reported.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
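[Editor's note: the InvalidOid-database lock mentioned above can be sketched as a toy model. This is illustrative Python, not PostgreSQL source; the function and constants are hypothetical. The point is that a relation lock tag includes a database OID, and shared catalogs use InvalidOid (0) in that field, so backends in *different* databases compute the *same* lock key for a shared catalog and therefore conflict:]

```python
# Toy model (not PostgreSQL code): a relation lock tag is the pair
# (database OID, relation OID). Shared catalogs use database OID 0
# (InvalidOid), so every database maps them to one common lock key.

INVALID_OID = 0
PG_AUTHID = 1260       # shared catalog; OID from the original report
PG_SHDEPEND = 1214     # shared catalog; OID from the original report
SHARED_RELATIONS = {PG_AUTHID, PG_SHDEPEND}

def relation_lock_tag(db_oid: int, rel_oid: int) -> tuple:
    """Return the lock key a backend in database db_oid would use."""
    if rel_oid in SHARED_RELATIONS:
        # Shared relation: the database field is InvalidOid, so all
        # databases contend on the same lock.
        return (INVALID_OID, rel_oid)
    # Ordinary relation: scoped to the backend's own database.
    return (db_oid, rel_oid)

# Backends in two different databases (hypothetical database OIDs):
tag_a = relation_lock_tag(16384, PG_AUTHID)   # backend in test0
tag_b = relation_lock_tag(16385, PG_AUTHID)   # backend in test1
print(tag_a == tag_b)  # True: same lock key, so their locks conflict
```

Under this model, a VACUUM FULL of a shared catalog in one database should indeed block any other vacuum of that catalog in every other database, which is Álvaro's point.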
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I agree that you should not be running 32 full vacuums at once, much
> less in a loop, but if you do, they shouldn't result in weird corruption
> such as the ones reported.

I agree that we probably should look closer rather than just writing this off. However, without a more concrete report there's no way to look closer.

			regards, tom lane
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> I agree that you should not be running 32 full vacuums at once, much
> less in a loop, but if you do, they shouldn't result in weird corruption
> such as the ones reported.

BTW, looking at the OIDs in the original problem report, they are for pg_authid and pg_shdepend, both of which are shared catalogs. It does not seem terribly surprising to me that sets of operations that take out exclusive locks on such catalogs are subject to deadlocks, especially not when you consider that yet other sessions also have need to read those catalogs. So I'm more or less in agreement with Robert that the deadlock errors aren't very interesting. But I still concur with Alvaro that those other error messages probably shouldn't be happening.

			regards, tom lane
On Fri, Jul 5, 2019 at 3:17 PM Robert Haas <robertmhaas@gmail.com> wrote:
> I don't know whether or not this is a bug, but my guess is that it
> isn't. pg_database and several other system catalogs are shared
> relations, which means that they are shared by all databases. So, it's
> not too surprising that you could get some interaction between VACUUM
> FULL command in different databases. Routine use of VACUUM FULL is
> something that should be avoided, so it doesn't seem like a huge
> problem to me that you can't run 32 of them at the same time.

To clarify (since it came up a couple of times in this thread): the deadlocks can also be observed with significantly fewer threads and with (many) other statements in between. My initial bug report used 32 threads and only VACUUM to make the deadlocks quick and easy to reproduce.
I assume that you are talking about the corruptions. I'm still trying to make them reproducible. So far, I have not been able to reproduce them by repeatedly replaying the recorded statements.

Best,
Manuel

On Fri, Jul 5, 2019 at 7:08 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > I agree that you should not be running 32 full vacuums at once, much
> > less in a loop, but if you do, they shouldn't result in weird corruption
> > such as the ones reported.
>
> I agree that we probably should look closer rather than just writing this
> off. However, without a more concrete report there's no way to look
> closer.
>
> regards, tom lane