RE: Data is copied twice when specifying both child and parent table in publication

Поиск
Список
Период
Сортировка
От houzj.fnst@fujitsu.com
Тема RE: Data is copied twice when specifying both child and parent table in publication
Дата
Msg-id OS0PR01MB5716C756312959F293A822C794869@OS0PR01MB5716.jpnprd01.prod.outlook.com
обсуждение исходный текст
Ответ на Re: Data is copied twice when specifying both child and parent table in publication  (Dilip Kumar <dilipbalaut@gmail.com>)
Ответы Re: Data is copied twice when specifying both child and parent table in publication  (Amit Langote <amitlangote09@gmail.com>)
Список pgsql-hackers
Hi,

As there are basically two separate issues mentioned in the thread, I tried to
summarize the discussion so far which might be helpful to others.

* The first issue[1]:

If we include both the partitioned table and (explicitly) its child partitions
in the publication when set publish_via_partition_root=true, like:
---
CREATE PUBLICATION pub FOR TABLE parent_table, child_table with (publish_via_partition_root=on);
---
It could execute initial sync for both the partitioned(parent_table) table and
(explicitly) its child partitions(child_table) which cause duplication of
data in partition(child_table) in subscriber side.

The reasons I considered this behavior a bug are:

a) In this case, the behavior of initial sync is inconsistent with the behavior
of transaction streaming. All changes in the leaf the partition will be applied
using the identity and schema of the partitioned(root) table. But for the
initial sync, it will execute sync for both the partitioned(root) table and
(explicitly) its child partitions which cause duplication of data.

b) The behavior of FOR TABLE is inconsistent with the behavior of FOR ALL TABLE.
If user create a FOR ALL TABLE publication and set publish_via_partition_root=true,
then only the top most partitioned(root) table will execute initial sync.

IIRC, most people in this thread agreed that the current behavior is not
expected. So, maybe it's time to try to fix it.

Attach my fix patch here. The patch try to fix this issue by making the
pg_publication_tables view only show partitioned table when
publish_via_partition_root is true.


* The second issue[2]:
-----
CREATE TABLE sale (sale_date date not null,country_code text, product_sku text,
units integer) PARTITION BY RANGE (sale_date);
CREATE TABLE sale_201901 PARTITION OF sale FOR VALUES FROM ('2019-01-01') TO
('2019-02-01');
CREATE TABLE sale_201902 PARTITION OF sale FOR VALUES FROM ('2019-02-01') TO
('2019-03-01');

(1) PUB:  CREATE PUBLICATION pub FOR TABLE sale_201901,
sale_201902 WITH (publish_via_partition_root=true);
(2) SUB:  CREATE SUBSCRIPTION sub CONNECTION 'dbname=postgres host=localhost port=5432' PUBLICATION pub;
(3) PUB:  INSERT INTO sale VALUES('2019-01-01', 'AU', 'cpu', 5), ('2019-01-02', 'AU', 'disk', 8);
(4) SUB:  SELECT * FROM sale;
(5) PUB:  ALTER PUBLICATION pub ADD TABLE sale;
(6) SUB:  ALTER SUBSCRIPTION sub REFRESH PUBLICATION;
(7) SUB:  SELECT * FROM sale;
-----

In step (7), we can see duplication of data.

The reason is that the INSERTed data is first published though the partitions,
since initially there is no partitioned table in the publication (so
publish_via_partition_root=true doesn't have any effect). But then adding the
partitioned table to the publication and refreshing the publication in the
subscriber, the data is then published "using the identity and schema of the
partitioned table" due to publish_via_partition_root=true.
(Copied from Greg's analysis).

Whether this behavior is correct is still under debate.


Overall, I think the second issue still needs further discussion while the
first issue seems clear that most people think it's unexpected. So, I think it
might be better to fix the first issue.

[1]
https://www.postgresql.org/message-id/OS0PR01MB57167F45D481F78CDC5986F794B99%40OS0PR01MB5716.jpnprd01.prod.outlook.com
[2]
https://www.postgresql.org/message-id/flat/CAJcOf-d8SWk3z3fJaLW9yuVux%3D2ESTsXOSdKzCq1O3AWBpgnMQ%40mail.gmail.com#fc96a42158b5e98ace26d077a6f7eac5

Best regards,
Hou zj

Вложения

В списке pgsql-hackers по дате отправления:

Предыдущее
От: Erik Rijkers
Дата:
Сообщение: Re: [PATCH] Proposal for HIDDEN/INVISIBLE column
Следующее
От: Michael Paquier
Дата:
Сообщение: Re: inefficient loop in StandbyReleaseLockList()