Thread: Improve pg_dump dumping publication tables

Improve pg_dump dumping publication tables

From
"Hsu, John"
Date:
Hi hackers,

I was wondering if there's a good reason in pg_dump getPublicationTables() 
to iterate through all tables one by one, querying each to see if it has a 
corresponding publication, other than memory concerns?

Attached is a patch that constructs the publication tables by querying for all publications 
at once and then finding the corresponding tableInfo entries in tblinfoindex. 

It seems more likely that users will have many more tables than publications, so 
this should be a lot faster, especially when there are millions of tables.
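
For reference, the batched approach boils down to a single catalog query along 
these lines (the column list and aliases here are illustrative, not the exact 
text of the patch):

    SELECT p.pubname, pr.prpubid, pr.prrelid
    FROM pg_catalog.pg_publication_rel pr
         JOIN pg_catalog.pg_publication p ON p.oid = pr.prpubid;

pg_dump can then match each prrelid against the table information it has already 
collected, instead of issuing one query per table.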

Setup: 
time pg_dump --schema-only -d postgres -f dump.sql -v 2> dump.log
10k tables, each table having its own publication.
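
The test schema can be generated with something like the following (the table and 
publication names are made up for illustration; for the no-publication run, drop 
the CREATE PUBLICATION line):

    DO $$
    BEGIN
        FOR i IN 1..10000 LOOP
            EXECUTE format('CREATE TABLE t%s (id int)', i);
            EXECUTE format('CREATE PUBLICATION pub%s FOR TABLE t%s', i, i);
        END LOOP;
    END
    $$;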

Before patch:
real    0m6.409s
user    0m0.234s
sys    0m0.584s

With patch:
real    0m5.483s
user    0m0.399s
sys    0m0.443s

10k tables, no publications.

Before patch:
real    0m6.193s
user    0m0.276s
sys    0m0.531s

With patch:
real    0m5.261s
user    0m0.254s
sys    0m0.377s

Thanks,

John H


Attachments

Re: Improve pg_dump dumping publication tables

From
Cary Huang
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           tested, passed
Documentation:            not tested

Hi

I applied the patch to the PG master branch successfully and the patch seems to do as described. If there are a lot of
tables created in Postgres and only a few of them have publications, the getPublicationTables() function will loop
through all the tables and check each one to see if it is the desired relation type and whether it contains one or more
publications. So potentially a lot of looping and checking may happen. In John's patch, all this looping and checking
is replaced with one query that returns all the tables that have publications. One problem, though, is that if there
are a million tables created in the server, like you say, and all of them have publications, the query can return a
million rows, which will require a lot of memory on the client side, and the pg_dump client may need to handle this
extreme case with proper pagination, which adds complexity to the function's logic. The existing logic does not have
this problem, because it simply queries a table's publications one at a time and does so a million times.
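
For what it's worth, if the result set ever did become a problem, the client 
could bound its memory by fetching the rows in batches with a cursor instead of 
one big result; a rough illustration (cursor name and batch size are arbitrary):

    BEGIN;
    DECLARE pubrel_cur CURSOR FOR
        SELECT pr.prpubid, pr.prrelid
        FROM pg_catalog.pg_publication_rel pr;
    FETCH FORWARD 1000 FROM pubrel_cur;  -- repeat until no rows are returned
    CLOSE pubrel_cur;
    COMMIT;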
 

Thank you,
Cary Huang
HighGo Software

Re: Improve pg_dump dumping publication tables

From
John Hsu
Date:
Hi Cary,

Thanks for taking a look. I agree there's a risk, since there's more memory usage on the client side, and if you're
dumping, say, millions of publication tables then that could be problematic.

I'd like to think this isn't any riskier than existing pg_dump code such as in getTables(...), where presumably we
would run into similar problems.
 

Cheers,
John H

Re: Improve pg_dump dumping publication tables

From
Tom Lane
Date:
"Hsu, John" <hsuchen@amazon.com> writes:
> I was wondering if there's a good reason in pg_dump getPublicationTables() 
> to iterate through all tables one by one, querying each to see if it has a 
> corresponding publication, other than memory concerns?

I just came across this entry in the CommitFest, and I see that it's
practically the same as something I did in passing in 8e396a773.
The main difference is that I got rid of the server-side join, too,
in favor of having getPublicationTables locate the PublicationInfo
that should have been created already by getPublications.  (The
immediate rationale for that was to get the publication owner name
from the PublicationInfo; but we'd have had to do that eventually
anyway if we ever want to allow selective dumping of publications.)

Anyway, I apologize for treading on your toes.  If I'd noticed this
thread earlier I would certainly have given you credit for being the
first to have the idea.

As far as the memory argument goes, I'm not too concerned about it
because both the PublicationRelInfo structs and the tuples of the
transient PGresult are pretty small.  In principle if you had very
many entries in pg_publication_rel, but a selective dump was only
interested in a few of them, there might be an interesting amount of
space wasted here.  But that argument fails because even a selective
dump collects data about all tables, for reasons that are hard to get
around.  The incremental space usage for PublicationRelInfos seems
unlikely to be significant compared to the per-table data we'd have
anyway.

I'll mark this CF entry "withdrawn", since it wasn't rejected
exactly.  Too bad we don't have a classification of "superseded
by events", or something like that.

            regards, tom lane