Обсуждение: pg_dump --load-via-partition-root vs. parallel restore

Поиск

Список

Период

Сортировка

pg_dump --load-via-partition-root vs. parallel restore

От

Tom Lane

Дата:

29 августа 2018 г., 01:53:39

Parallel pg_restore generally assumes that the archive file is telling it
the truth about data dependencies; in particular, that a TABLE DATA item
naming a particular target table is loading data into exactly that table.
--load-via-partition-root creates a significant probability that that
assumption is wrong, at least in scenarios where the data really does get
redirected into other partitions than the original one.  This can result
in inefficiencies (e.g., index rebuild started before a table's data is
really all loaded) or outright failures (foreign keys or RLS policies
applied before the data is all loaded).  I suspect that deadlock failures
during restore are also possible, since identify_locking_dependencies
is not going to be nearly close to the truth about which operations
might hold which locks.

This could possibly be fixed by changing around the dependencies shown
in the archive file so that POST_DATA objects that're nominally dependent
on any one of a partitioned table's members are shown as dependent on all
of them.  I'm not particularly eager to write that patch though.

For the moment I'm inclined to just document the problem, e.g. "It's
recommended that parallel restore not be used with archives generated
with this option."

            regards, tom lane

Re: pg_dump --load-via-partition-root vs. parallel restore

От

Robert Haas

Дата:

03 сентября 2018 г., 04:31:01

On Tue, Aug 28, 2018 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Parallel pg_restore generally assumes that the archive file is telling it
> the truth about data dependencies; in particular, that a TABLE DATA item
> naming a particular target table is loading data into exactly that table.
> --load-via-partition-root creates a significant probability that that
> assumption is wrong, at least in scenarios where the data really does get
> redirected into other partitions than the original one.  This can result
> in inefficiencies (e.g., index rebuild started before a table's data is
> really all loaded) or outright failures (foreign keys or RLS policies
> applied before the data is all loaded).  I suspect that deadlock failures
> during restore are also possible, since identify_locking_dependencies
> is not going to be nearly close to the truth about which operations
> might hold which locks.

Hmm.  I had the idea that this wasn't a problem because I thought we
had all of pg_dump strictly separated into pre-data, data, and
post-data; therefore, I thought it would be the case that none of that
other stuff would happen until all table data was loaded.  From what
you are saying here, I guess that's not the case?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: pg_dump --load-via-partition-root vs. parallel restore

От

Tom Lane

Дата:

03 сентября 2018 г., 04:42:12

Robert Haas <robertmhaas@gmail.com> writes:
> On Tue, Aug 28, 2018 at 3:53 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Parallel pg_restore generally assumes that the archive file is telling it
>> the truth about data dependencies; in particular, that a TABLE DATA item
>> naming a particular target table is loading data into exactly that table.
>> --load-via-partition-root creates a significant probability that that
>> assumption is wrong, at least in scenarios where the data really does get
>> redirected into other partitions than the original one.  This can result
>> in inefficiencies (e.g., index rebuild started before a table's data is
>> really all loaded) or outright failures (foreign keys or RLS policies
>> applied before the data is all loaded).  I suspect that deadlock failures
>> during restore are also possible, since identify_locking_dependencies
>> is not going to be nearly close to the truth about which operations
>> might hold which locks.

> Hmm.  I had the idea that this wasn't a problem because I thought we
> had all of pg_dump strictly separated into pre-data, data, and
> post-data; therefore, I thought it would be the case that none of that
> other stuff would happen until all table data was loaded.  From what
> you are saying here, I guess that's not the case?

We don't run restore operations in parallel during the pre-data phase,
mainly because we lack faith that the dependencies are really completely
represented for those objects (particularly in older archives).  Once we
get to parallel operation, though, everything is schedulable as soon as
its dependencies are satisfied; so it's possible for post-data objects
to run before supposedly-unrelated data objects are done.  There's no
explicit representation of the data/post-data boundary in the archive's
dependencies.

            regards, tom lane

Вход в личный кабинет

Восстановление пароля

Подтверждение аккаунта

Изменение пароля

Обсуждение: pg_dump --load-via-partition-root vs. parallel restore

pg_dump --load-via-partition-root vs. parallel restore

Re: pg_dump --load-via-partition-root vs. parallel restore

Re: pg_dump --load-via-partition-root vs. parallel restore