Discussion: Improve pg_dump dumping publication tables
Hi hackers,

I was wondering: is there a good reason for pg_dump's getPublicationTables() to iterate through all tables one by one, querying each to see whether it has a corresponding publication, other than memory concerns? Attached is a patch that instead fetches all publication tables in a single query and then looks up the corresponding tableInfo entries in tblinfoindex. Users seem far more likely to have many more tables than publications, so this should be a lot faster, especially when there are millions of tables.

Setup:

    time pg_dump --schema-only -d postgres -f dump.sql -v 2> dump.log

10k tables, each table having its own publication:

    Before patch:          With patch:
    real  0m6.409s         real  0m5.483s
    user  0m0.234s         user  0m0.399s
    sys   0m0.584s         sys   0m0.443s

10k tables, no publications:

    Before patch:          With patch:
    real  0m6.193s         real  0m5.261s
    user  0m0.276s         user  0m0.254s
    sys   0m0.531s         sys   0m0.377s

Thanks,
John H
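For context, a minimal sketch of the kind of single-pass catalog query the patch describes; the exact column list and join in the attached patch may differ, so treat this as illustrative only:

    SELECT pr.tableoid, pr.oid, p.pubname, pr.prrelid
    FROM pg_catalog.pg_publication_rel pr
         JOIN pg_catalog.pg_publication p ON p.oid = pr.prpubid;

pg_dump can then map each prrelid back to its TableInfo through the tblinfoindex lookup mentioned above, instead of issuing one catalog query per table.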
The following review has been posted through the commitfest application:

    make installcheck-world:  tested, passed
    Implements feature:       tested, passed
    Spec compliant:           tested, passed
    Documentation:            not tested

Hi,

I applied the patch to the PG master branch successfully, and it seems to do as described. If there are a lot of tables created in Postgres and only a few of them have publications, getPublicationTables() will loop through all the tables and check each one to see whether it is the desired relation type and whether it contains one or more publications, so potentially a lot of looping and checking happens. In John's patch, all of this looping and checking is replaced with one query that returns every table that has a publication.

One problem, though: if there are a million tables in the server, as you say, and all of them have publications, the query can return a million rows, which requires a lot of memory on the client side, and pg_dump may need to handle this extreme case with proper pagination, which adds complexity to the function's logic. The existing logic does not have this problem, because it simply queries one table's publications at a time and does so a million times.

thank you

Cary Huang
HighGo Software
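For what it's worth, if the row count ever became a real concern, the client could bound its memory with a server-side cursor rather than ad-hoc pagination. A hedged sketch (the cursor name and batch size are made up, and nothing here implies pg_dump actually does this):

    BEGIN;
    DECLARE pub_rel_cur CURSOR FOR
        SELECT pr.prpubid, pr.prrelid
        FROM pg_catalog.pg_publication_rel pr;
    FETCH FORWARD 1000 FROM pub_rel_cur;   -- repeat until no rows come back
    CLOSE pub_rel_cur;
    COMMIT;

Each FETCH materializes only one batch on the client, so peak memory stays proportional to the batch size rather than to the total number of publication tables.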
Hi Cary,

Thanks for taking a look. I agree there's a risk, since there's more memory usage on the client side, and if you're dumping, say, millions of publication tables, that could be problematic.

I'd like to think this isn't any riskier than existing pg_dump code such as getTables(...), where presumably we would run into similar problems.

Cheers,
John H
"Hsu, John" <hsuchen@amazon.com> writes: > I was wondering if there's a good reason in pg_dump getPublicationTables() > to iterate through all tables one by one and querying to see if it has a > corresponding publication other than memory concerns? I just came across this entry in the CommitFest, and I see that it's practically the same as something I did in passing in 8e396a773. The main difference is that I got rid of the server-side join, too, in favor of having getPublicationTables locate the PublicationInfo that should have been created already by getPublications. (The immediate rationale for that was to get the publication owner name from the PublicationInfo; but we'd have had to do that eventually anyway if we ever want to allow selective dumping of publications.) Anyway, I apologize for treading on your toes. If I'd noticed this thread earlier I would certainly have given you credit for being the first to have the idea. As far as the memory argument goes, I'm not too concerned about it because both the PublicationRelInfo structs and the tuples of the transient PGresult are pretty small. In principle if you had very many entries in pg_publication_rel, but a selective dump was only interested in a few of them, there might be an interesting amount of space wasted here. But that argument fails because even a selective dump collects data about all tables, for reasons that are hard to get around. The incremental space usage for PublicationRelInfos seems unlikely to be significant compared to the per-table data we'd have anyway. I'll mark this CF entry "withdrawn", since it wasn't rejected exactly. Too bad we don't have a classification of "superseded by events", or something like that. regards, tom lane