Обсуждение: Deadlock between VACUUM and ALTER TABLE commands

Поиск
Список
Период
Сортировка

Deadlock between VACUUM and ALTER TABLE commands

От
Kevin Burke
Дата:
Hello,
Occasionally our test suite gets into a deadlocked state. I was curious why this happens, and what our options are for dealing with it.

We run ~2900 tests on an Ubuntu machine in sequence against a Postgres 9.4.6 database. There are about 60 tables; each test runs ~3 queries, and the largest test probably inserts ~1000 rows. To ensure each test starts with a clean slate, we run the following query in between each test run:

ALTER TABLE table1 DISABLE TRIGGER ALL; DELETE FROM table1; ALTER TABLE table1 ENABLE TRIGGER ALL; 
ALTER TABLE table2 DISABLE TRIGGER ALL; DELETE FROM table2; ALTER TABLE table2 ENABLE TRIGGER ALL;
...
ALTER TABLE table60 DISABLE TRIGGER ALL; DELETE FROM table60; ALTER TABLE table60 ENABLE TRIGGER ALL;
ALTER SEQUENCE <sequence> RESTART WITH 1;

We also have a check at the end of every test run that the test has returned all database connections to the pool, so I don't believe there are any other database queries in flight at this time.

Why mess around with DISABLE/ENABLE TRIGGER instead of TRUNCATE? We observed that TRUNCATE took about 200ms, but this method takes about 13ms. Over a our test suite TRUNCATE is going to more or less double the length of the suite.

We could resolve some kind of foreign key dependency tree, issue the DELETE FROM's in the right order, and skip DISABLE TRIGGER etc., but haven't invested the time to do this or figured out how to maintain it/update it as we add new tables.

We could also run each test in a transaction and issue a ROLLBACK at the end, unfortunately our ORM does not support transactions.

We observed that when we pushed tests to a third-party CI service, maybe 1 in 100 test runs fails with a mysterious timeout (set to 18 seconds). It doesn't seem to correlate with any one test.

I finally figured out how to enable the right logging and export our database logs from the remote machine. We saw the following results: https://gist.github.com/kevinburkeshyp/4f5eef58411b1c3c1160a6f6008135b8. I've also pasted relevant parts of the Postgres configuration at this URL.

Specifically we kick off the ALTER TABLE ... DELETE, a second later (our deadlock_timeout) a log message appears that the ALTER TABLE is waiting for an AccessExclusiveLock held by process 16936... which is later revealed to be an autovacuum process of the "events" table.

(The events table also gets cleared between every test. It has 5 columns that store less-than-60-byte integers, and one created_at timestamp. There are indexes on the primary key and on LOWER(object_id), another column. We need to put indexes on the LOWER() fields because of our ORM.)

Finally, the test timeout causes the application layer to start closing connections, at which point the ALTER TABLE gains the exclusive lock.

Here are my questions:

- An AccessExclusiveLock seems to indicate that the vacuum is running a VACUUM FULL, if I am reading the documentation correctly. Yet the documentation for autovacuum states that "The autovacuum daemon attempts to work this way, and in fact will never issue VACUUM FULL." The statements from the logs and from the documentation seem to be in conflict; am I missing something?

- Why does vacuum take so long/hang? I would expect it to be a fast operation.

- Should I just turn off autovacuum altogether? If not, how should I tune it? Our test run is 10k queries, 6 minutes, each of the ~3k tests runs that DELETE FROM query listed above, and maybe an average of 2 other queries.

- What other information can I/should I log here?

- Why don't we run into this more often? 

--
kevin

Re: Deadlock between VACUUM and ALTER TABLE commands

От
Alexey Bashtanov
Дата:
Hello Kevin,

On 06/04/16 23:22, Kevin Burke wrote:
Why mess around with DISABLE/ENABLE TRIGGER instead of TRUNCATE? We observed that TRUNCATE took about 200ms, but this method takes about 13ms. Over a our test suite TRUNCATE is going to more or less double the length of the suite.

We could resolve some kind of foreign key dependency tree, issue the DELETE FROM's in the right order, and skip DISABLE TRIGGER etc., but haven't invested the time to do this or figured out how to maintain it/update it as we add new tables.
I'd suggest one more method: SET CONSTRAINTS ALL DEFERRED and DELETE FROM all tables in arbitrary order (will require making constraints deferrable)
Specifically we kick off the ALTER TABLE ... DELETE, a second later (our deadlock_timeout) a log message appears that the ALTER TABLE is waiting for an AccessExclusiveLock held by process 16936... which is later revealed to be an autovacuum process of the "events" table.
this only indicates that 16936 does not allow ALTER TABLE to acquire AccessExclusiveLock
this does not necessarily mean that
16936 has acquired AccessExclusiveLock, it means that 16936 has acquired some lock conflicting with AccessExclusiveLock you can use this table to determine what locks are conflicting one with another Regards, Alexey Bashtanov

Re: Deadlock between VACUUM and ALTER TABLE commands

От
Kevin Burke
Дата:
Hi Alexey,
Thank you for your suggestion. We had to rewrite our constraints to be DEFERRABLE INITIALLY IMMEDIATE, but after we did that we saw a nice speedup in our clear-all-tables query.

Unfortunately I'm still seeing a very slow query which is affecting our tests. It's happening with roughly the same frequency as the previous error. 


In the log, it looks like an INSERT on the users table takes about 23 seconds. This query is connecting from Javascript to a Postgres server on the same machine. After 18 seconds, the test performing the INSERT times out, we try to clear the database, and then the next test runs. The next test also attempts to INSERT a record, but is blocked on the first INSERT. After 5 more seconds, the first INSERT completes, and the second INSERT is allowed to run.

2016-04-14 00:46:54.649 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-42] 00000 ubuntu@circle_test LOG:  process 14308 still waiting for ExclusiveLock on extension of relation 17215 of database 16387 after 1000.104 ms
2016-04-14 00:46:54.649 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-43] 00000 ubuntu@circle_test DETAIL:  Process holding the lock: 14260. Wait queue: 14308.
2016-04-14 00:46:58.979 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-45] 00000 ubuntu@circle_test LOG:  process 14308 acquired ExclusiveLock on extension of relation 17215 of database 16387 after 5330.117 ms

Between the last time I emailed and now, I also disabled autovacuum on the server; it seemed unnecessary if we are issuing ~10k queries tops, and generating ~100k records tops, in the course of a test run. However, I didn't see any other queries or log messages (e.g. "kicking off autovacuum") that would have indicated a problem.

Here's the schema on the table that generated the 23-second long INSERT: https://gist.github.com/kevinburkeshyp/5a40d372b6aa1e1426801d8195cb0b61

Any ideas why an INSERT on an empty table, to a low-load server running on the same machine, with no other queries being generated during that time, could take 23 seconds to complete? The server is running Postgres 9.4.6.

What other information do I/you need to figure out what's going on?

Thanks,
Kevin

On Thu, Apr 7, 2016 at 1:26 AM, Alexey Bashtanov <alexey@brandwatch.com> wrote:
Hello Kevin,

On 06/04/16 23:22, Kevin Burke wrote:
Why mess around with DISABLE/ENABLE TRIGGER instead of TRUNCATE? We observed that TRUNCATE took about 200ms, but this method takes about 13ms. Over a our test suite TRUNCATE is going to more or less double the length of the suite.

We could resolve some kind of foreign key dependency tree, issue the DELETE FROM's in the right order, and skip DISABLE TRIGGER etc., but haven't invested the time to do this or figured out how to maintain it/update it as we add new tables.
I'd suggest one more method: SET CONSTRAINTS ALL DEFERRED and DELETE FROM all tables in arbitrary order (will require making constraints deferrable)
Specifically we kick off the ALTER TABLE ... DELETE, a second later (our deadlock_timeout) a log message appears that the ALTER TABLE is waiting for an AccessExclusiveLock held by process 16936... which is later revealed to be an autovacuum process of the "events" table.
this only indicates that 16936 does not allow ALTER TABLE to acquire AccessExclusiveLock
this does not necessarily mean that
16936 has acquired AccessExclusiveLock, it means that 16936 has acquired some lock conflicting with AccessExclusiveLock you can use this table to determine what locks are conflicting one with another Regards, Alexey Bashtanov



--
kevin

Re: Deadlock between VACUUM and ALTER TABLE commands

От
Alexey Bashtanov
Дата:

On 14/04/16 18:34, Kevin Burke wrote:
Unfortunately I'm still seeing a very slow query which is affecting our tests. It's happening with roughly the same frequency as the previous error. 


In the log, it looks like an INSERT on the users table takes about 23 seconds. This query is connecting from Javascript to a Postgres server on the same machine. After 18 seconds, the test performing the INSERT times out, we try to clear the database, and then the next test runs. The next test also attempts to INSERT a record, but is blocked on the first INSERT. After 5 more seconds, the first INSERT completes, and the second INSERT is allowed to run.

2016-04-14 00:46:54.649 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-42] 00000 ubuntu@circle_test LOG:  process 14308 still waiting for ExclusiveLock on extension of relation 17215 of database 16387 after 1000.104 ms
2016-04-14 00:46:54.649 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-43] 00000 ubuntu@circle_test DETAIL:  Process holding the lock: 14260. Wait queue: 14308.
2016-04-14 00:46:58.979 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-45] 00000 ubuntu@circle_test LOG:  process 14308 acquired ExclusiveLock on extension of relation 17215 of database 16387 after 5330.117 ms

Between the last time I emailed and now, I also disabled autovacuum on the server; it seemed unnecessary if we are issuing ~10k queries tops, and generating ~100k records tops, in the course of a test run. However, I didn't see any other queries or log messages (e.g. "kicking off autovacuum") that would have indicated a problem.

Here's the schema on the table that generated the 23-second long INSERT: https://gist.github.com/kevinburkeshyp/5a40d372b6aa1e1426801d8195cb0b61

Any ideas why an INSERT on an empty table, to a low-load server running on the same machine, with no other queries being generated during that time, could take 23 seconds to complete? The server is running Postgres 9.4.6.

What other information do I/you need to figure out what's going on?

Hello Kevin,

It would be nice if you could determine the names of the relations and queries for the processes mentioned in the log message quoted.
For relations simply cast the oid (numbers) to regclass, for processes please query pg_stat_activity.

Have you tried to turn autovacuum on?

Regards,
  Alexey

Re: Deadlock between VACUUM and ALTER TABLE commands

От
Alexey Bashtanov
Дата:

On 14/04/16 18:34, Kevin Burke wrote:
Unfortunately I'm still seeing a very slow query which is affecting our tests. It's happening with roughly the same frequency as the previous error. 


In the log, it looks like an INSERT on the users table takes about 23 seconds. This query is connecting from Javascript to a Postgres server on the same machine. After 18 seconds, the test performing the INSERT times out, we try to clear the database, and then the next test runs. The next test also attempts to INSERT a record, but is blocked on the first INSERT. After 5 more seconds, the first INSERT completes, and the second INSERT is allowed to run.

2016-04-14 00:46:54.649 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-42] 00000 ubuntu@circle_test LOG:  process 14308 still waiting for ExclusiveLock on extension of relation 17215 of database 16387 after 1000.104 ms
2016-04-14 00:46:54.649 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-43] 00000 ubuntu@circle_test DETAIL:  Process holding the lock: 14260. Wait queue: 14308.
2016-04-14 00:46:58.979 UTC 2016-04-14 00:46:34 UTC [14308-570ee86a.37e4-45] 00000 ubuntu@circle_test LOG:  process 14308 acquired ExclusiveLock on extension of relation 17215 of database 16387 after 5330.117 ms

Between the last time I emailed and now, I also disabled autovacuum on the server; it seemed unnecessary if we are issuing ~10k queries tops, and generating ~100k records tops, in the course of a test run. However, I didn't see any other queries or log messages (e.g. "kicking off autovacuum") that would have indicated a problem.

Here's the schema on the table that generated the 23-second long INSERT: https://gist.github.com/kevinburkeshyp/5a40d372b6aa1e1426801d8195cb0b61

Any ideas why an INSERT on an empty table, to a low-load server running on the same machine, with no other queries being generated during that time, could take 23 seconds to complete? The server is running Postgres 9.4.6.

What other information do I/you need to figure out what's going on?

Hello Kevin,

It would be nice if you could determine the names of the relations and queries for the processes mentioned in the log message quoted.
For relations simply cast the oid (numbers) to regclass, for processes please query pg_stat_activity.

Have you tried to turn autovacuum on?

Regards,
  Alexey