Discussion: pg_dump does way too much before acquiring table locks
While poking at pg_dump for some work I'll show later, I grew quite unhappy with the extent to which people have ignored this advice given near the head of getTables():

 * Note: in this phase we should collect only a minimal amount of
 * information about each table, basically just enough to decide if it is
 * interesting. We must fetch all tables in this phase because otherwise
 * we cannot correctly identify inherited columns, owned sequences, etc.

Far from collecting "a minimal amount of information", we have repeatedly stuffed extra joins and expensive sub-selects into this query. That's fairly inefficient if we aren't interested in dumping every table, but it's much worse than that: we are collecting all this info *before we have lock* on the tables. That means we are at serious risk of data skew in any place where we consult server-side functions, eg pg_get_partkeydef(). For example:

regression=# create table foo(f1 int) partition by range(f1);
CREATE TABLE
regression=# begin; drop table foo;
BEGIN
DROP TABLE

Now start a pg_dump in another shell session, wait a second or two, and

regression=*# commit;
COMMIT

and the pg_dump blows up:

$ pg_dump -s regression
pg_dump: error: query failed: ERROR:  could not open relation with OID 37796

(Note: it's not so easy to provoke this failure manually before e2ff7d9a8 guaranteed that pg_dump would wait around for you, but it's certainly possible.)

To add insult to injury, all that work being done in this query makes the time window for trouble wider. Testing on the regression database, which isn't all that big, I observe getTables' query taking 140 ms, substantially longer than the next slowest thing done by "pg_dump -s regression". It seems that the largest chunk of this can be blamed on the sub-selects added by the pg_init_privs patch: EXPLAIN ANALYZE puts their total runtime at ~100 ms.

I believe that the pg_init_privs sub-selects can probably be nuked, or at least their cost can be paid later. The other work I mentioned has proven that we do not actually need to know the ACL situation to determine whether a table is "interesting", so we don't have to do that work before acquiring lock.

However, that still leaves us doing several inessential joins, not to mention those unsafe partition-examination functions, before acquiring lock. I am thinking that the best fix is to make getTables() perform two queries (or, possibly, split it into two functions). The first pass would acquire only the absolutely minimal amount of data needed to decide whether a table is "interesting", and then lock such tables. Then we'd go back to fill in the remaining data. While it's fairly annoying to read pg_class twice, it looks like the extra query might only take a handful of msec.

Also, if we don't do it like that, it seems that we'd have to add entirely new queries to call pg_get_partkeydef() and pg_get_expr(relpartbound) in. So I'm not really seeing a better-performing alternative.

Thoughts?

			regards, tom lane
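For concreteness, here is a rough SQL sketch of the two-pass flow Tom describes above: a minimal first query, explicit locks on the tables selected for dumping, then a second query that calls the partition-examination functions only for relations already locked. The column list, the relkind filter, and the table name public.foo are illustrative assumptions, not the queries pg_dump actually issues:

BEGIN;

-- Pass 1: fetch just enough to decide which relations are "interesting".
SELECT c.oid, c.relname, c.relnamespace, c.relkind
FROM pg_catalog.pg_class c
WHERE c.relkind IN ('r', 'p', 'v', 'm', 'S', 'f');

-- Lock each table chosen for dumping, inside the dump's transaction.
LOCK TABLE public.foo IN ACCESS SHARE MODE;

-- Pass 2: with locks held, it is safe to call the server-side
-- decompilation functions, restricted to the relations we locked.
SELECT c.oid,
       pg_catalog.pg_get_partkeydef(c.oid) AS partkeydef,
       pg_catalog.pg_get_expr(c.relpartbound, c.oid) AS partbound
FROM pg_catalog.pg_class c
WHERE c.oid = 'public.foo'::pg_catalog.regclass;

COMMIT;

The point of restricting pass 2 to locked relations is exactly the hazard noted later in the thread: applying these functions to every row of pg_class would still risk failure against any relation that isn't locked.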
On Wed, Oct 20, 2021 at 05:14:45PM -0400, Tom Lane wrote:
> Lastly, patch 0003 addresses the concern I raised at [3] that it's
> unsafe to call pg_get_partkeydef() and pg_get_expr(relpartbound)
> in getTables(). Looking closer I realized that we can't cast
> pg_class.reloftype to regtype at that point either, since regtypeout
> is going to notice if the type has been concurrently dropped.
>
> In [3] I'd imagined that we could just delay those calls to a second
> query in getTables(), but that doesn't work at all: if we apply
> these functions to every row of pg_class, we still risk failure
> against any relation that we didn't lock. So there seems little
> alternative but to push these functions out to secondary queries
> executed later.
>
> Arguably, 0003 is a bug fix that we should consider back-patching.
> However, I've not heard field reports of the problems it fixes,
> so maybe there's no need to bother.
>
> [3] https://www.postgresql.org/message-id/1462940.1634496313%40sss.pgh.pa.us

FYI, I see this issue happen in a production environment. Grepping logfiles on ~40 servers, I see it happened at least 3 times since October 1. Our backup script is probably particularly sensitive to this: it separately dumps each "old" partition once and for all. We're more likely to hit this since pg_dump is called in a loop.

I never reported it, since I think it's a documented issue, and it's no great problem so long as it runs the next day. But it'd be a pain to find that the backup was incomplete when we needed it. Also, I use the backups to migrate to new servers, and it would be a pain to start the job at a calculated time expecting it to finish at the beginning of a coordinated maintenance window, only to discover that it had failed and needs to be rerun with fingers crossed.

On Sun, Oct 17, 2021 at 02:45:13PM -0400, Tom Lane wrote:
> While poking at pg_dump for some work I'll show later, I grew quite
> unhappy with the extent to which people have ignored this advice
> given near the head of getTables():
> ...
Justin Pryzby <pryzby@telsasoft.com> writes:
> On Wed, Oct 20, 2021 at 05:14:45PM -0400, Tom Lane wrote:
>> Arguably, 0003 is a bug fix that we should consider back-patching.
>> However, I've not heard field reports of the problems it fixes,
>> so maybe there's no need to bother.

> FYI, I see this issue happen in a production environment.

> I never reported it, since I think it's a documented issue, and it's no
> great problem so long as it runs the next day. But it'd be a pain to find
> that the backup was incomplete when we needed it. Also, I use the backups
> to migrate to new servers, and it would be a pain to start the job at a
> calculated time expecting it to finish at the beginning of a coordinated
> maintenance window, only to discover that it had failed and needs to be
> rerun with fingers crossed.

Yeah, if you're dropping tables all the time, pg_dump is going to have a problem with that. The fix I'm suggesting here would only ensure that you get a "clean" failure at the LOCK TABLE command --- but from an operational standpoint, that's little improvement.

The natural response is to consider retrying the whole dump after a lock failure. I'm not sure if it'd be practical to do that within pg_dump itself, as opposed to putting a loop in whatever script calls it.

			regards, tom lane
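A minimal sketch of the "loop in whatever script calls it" idea from Tom's reply, as a shell wrapper around the same pg_dump invocation used earlier in the thread; the retry count, sleep interval, and output file name are arbitrary placeholders:

#!/bin/sh
# Retry the whole dump a few times, on the theory that a failure caused by
# a concurrently dropped table is transient and may not repeat next time.
for attempt in 1 2 3; do
    if pg_dump -s regression > regression.sql; then
        exit 0
    fi
    echo "pg_dump failed (attempt $attempt), retrying..." >&2
    sleep 5
done
echo "pg_dump failed after 3 attempts" >&2
exit 1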