Discussion: loading data, creating indexes, clustering, vacuum...
Hi everyone,

Looking for a small bit of advice... I have a script that updates several tables with large amounts of data. Before running the updates, it drops all indexes for optimal performance. When the updates have finished, I run the following procedure:

- recreate the indexes
- cluster the tables
- vacuum full analyze on the tables

I was hoping an expert could comment on the optimal way to order these three commands. For instance, I have a hunch that creating the indexes first (as I do now) could slow down the clustering - perhaps the row locations in the indexes all have to be updated as the cluster command shifts their locations? And perhaps vacuuming should be done before clustering so that dead tuples aren't "in the way"?

Of course I could just test every combination until I get it right, but I'd like to have a good understanding as well. Any insight would be much appreciated.

Thank you,
Mark
On Thursday 07 December 2006 08:38, "Angva" <angvaw@gmail.com> wrote:
> three commands. For instance I have a hunch that creating the indexes
> first (as I do now) could slow down the clustering - perhaps the row
> locations in the indexes all have to be updated as the cluster command
> shifts their locations? And perhaps vacuuming should be done before
> clustering so that dead tuples aren't "in the way"?

Clustering also removes the dead tuples. I would just:

- create one index, the one to be clustered
- cluster the table
- create the remaining indexes

--
Eat right. Exercise regularly. Die anyway.
On Thu, Dec 07, 2006 at 02:52:11PM -0800, Alan Hodgson wrote:
> clustering also removes the dead tuples.
>
> I would just:
>
> - create one index, the one to be clustered
> - cluster the table
> - create the remaining indexes

And then run ANALYSE. No need to vacuum because the cluster did that already.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
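[Put together as SQL, the sequence suggested above might look like the sketch below. Table and index names are hypothetical; the CLUSTER syntax shown is the pre-8.3 form used at the time, while newer releases spell it `CLUSTER table USING index`.]

```sql
-- Hypothetical table "orders", to be clustered on "order_date".

-- 1. Create only the index that CLUSTER will use.
CREATE INDEX orders_order_date_idx ON orders (order_date);

-- 2. Rewrite the table in index order; this also discards dead tuples,
--    so no separate VACUUM FULL is needed.
CLUSTER orders_order_date_idx ON orders;

-- 3. Build the remaining indexes against the already-ordered heap.
CREATE INDEX orders_customer_idx ON orders (customer_id);

-- 4. Refresh planner statistics (CLUSTER does not do this for you).
ANALYZE orders;
```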
Angva wrote:
> Looking for a small bit of advice...
>
> I have a script that updates several tables with large amounts of data.
> Before running the updates, it drops all indexes for optimal
> performance. When the updates have finished, I run the following
> procedure:
>
> recreate the indexes
> cluster the tables
> vacuum full analyze on the tables

Hi all,

I'd like to see a general way to take indexes offline without actually losing their definitions. For example, something like "ALTER TABLE [EN|DIS]ABLE INDEXES", "ALTER INDEX [EN|DIS]ABLE", etc. This could also be used internally when a backend encounters an error reading/writing an index. Rather than refusing to execute queries, it could just ignore indexes it knows are disabled or bad in some way and re-plan as needed.

This would have two benefits. First, the above scenario would be much simpler. Rather than dropping and re-creating new indexes, you could just disable and then re-enable them without having any knowledge of their structure. Secondly, it would allow us to put indexes in an alternate tablespace on a non-redundant volume and, in the case of a drive failure, be able to limp along, then get the system back to normal simply by replacing the disk and issuing a REINDEX command.

I realize there are a couple of gotchas with this. For example, what to do with unique indexes? Perhaps a backend would still need to refuse to do updates/inserts on a table with degraded unique indexes, unless the index was disabled explicitly? And then refuse to rebuild/re-enable the index as normal if non-unique values are found?

Thx for considering :-)
-Glen Parker
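[Until a disable/enable feature exists, the usual workaround is to save the definitions before dropping. A sketch, with a hypothetical table name; pg_indexes.indexdef holds the full reconstruction DDL:]

```sql
-- Capture CREATE INDEX statements for a table before dropping its indexes.
SELECT indexdef || ';'
FROM pg_indexes
WHERE schemaname = 'public'
  AND tablename  = 'orders';

-- Spool the captured statements to a script file and run it after the
-- bulk load (modern psql can also replay them directly with \gexec).
```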
Hi all,

Since PITR works well, my use of pg_dump has shifted. Rather than using it as a backup tool, I now use it as a snapshotting tool. At the end of each month we do an ASCII dump to keep around, so if we ever need to, we can see the data as it was any number of months or years ago. Not a backup at all, just a raw data archive.

These archives do not need to hold all our data; for example, system logs would be useless later. There also is no reason to include indexes. Ignoring extraneous tables and indexes is a great way to keep the archive small and keep the time to restore as low as possible.

So, it would be great if pg_dump could accept some sort of argument to make it simply not dump certain types of objects. Indexes, views, functions, etc.

Thx for considering :-)
-Glen Parker
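[For what it's worth, later pg_dump releases (8.2 onward) did grow table and schema exclusion switches, and a data-only dump already omits index DDL. A sketch, with hypothetical database and table names:]

```shell
# Monthly snapshot dump, skipping tables not worth archiving.
# --exclude-table accepts a pattern; --data-only leaves out all
# index, view, and function definitions.
pg_dump \
  --data-only \
  --exclude-table='system_logs' \
  --exclude-table='audit_*' \
  --file=archive_snapshot.sql \
  mydb
```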
Re: Marking indexes out of date (WAS: loading data, creating indexes, clustering, vacuum) feature request?
From: Martijn van Oosterhout
On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
> I'd like to see a general way to take indexes off line without actually
> losing their definitions. For example, something like "ALTER TABLE [EN
> | DIS] ABLE INDEXES", "ALTER INDEX [EN | DIS] ABLE", etc. This could
> also be used internally when a backend encounters an error
> reading/writing an index. Rather than refusing to execute queries, it
> could just ignore indexes it knows are disabled or bad in some way and
> re-plan as needed.

One issue would be that even disabled indexes would need to be updated when there are new rows. If you don't update the index when it's disabled, then re-enabling will essentially need to rebuild the index.

Have a nice day,
--
Martijn van Oosterhout <kleptog@svana.org> http://svana.org/kleptog/
> From each according to his ability. To each according to his ability to litigate.
Martijn van Oosterhout wrote:
> On Fri, Dec 08, 2006 at 12:49:30PM -0800, Glen Parker wrote:
>> I'd like to see a general way to take indexes off line without actually
>> losing their definitions.
>
> One issue would be that even disabled indexes would need to be updated
> when there are new rows. If you don't update the index when it's
> disabled, then re-enabling will essentially need to rebuild the index.

That's what I had in mind. You could just as easily blow away the index file(s). It's just that I don't want it to toss the index *definition*. To continue to update such an index would be to completely negate the benefit of disabling it!

-Glen
Martijn van Oosterhout <kleptog@svana.org> writes:
> One issue would be that even disabled indexes would need to be updated
> when there are new rows. If you don't update the index when it's
> disabled, then re-enabling will essentially need to rebuild the index.

I assume that's what he wants. However, it's not immediately clear that there's a sufficient use case for this to justify the extra apparatus compared to just DROP INDEX (and recreate it later).

regards, tom lane
> clustering also removes the dead tuples.

I have a followup question. What if the set of dead tuples is too big and I need to VACUUM FULL, as opposed to VACUUM? (The update size varies greatly from day to day.) Will the clustering effectively do a VACUUM FULL, or just a VACUUM?

Thanks again for your help,
Mark
Thank you very much, Alan and Martijn, for the advice!

Mark

Martijn van Oosterhout wrote:
> I would just:
>
> - create one index, the one to be clustered
> - cluster the table
> - create the remaining indexes
>
> And then run ANALYSE. No need to vacuum because the cluster did that
> already.
"Angva" <angvaw@gmail.com> wrote: > > > clustering also removes the dead tuples. > > I have a followup question. What if the set of dead tuples is too big > and I need to VACUUM FULL, as opposed to VACUUM. (The update size > varies greatly from day to day.) Will the clustering effectively do a > VACUUM FULL, or just a VACUUM? CLUSTER is the equivalent of VACUUM FULL, with the addition that it orders the data in the table in the order of the index you specify. VACUUM FULL doesn't follow any particular order for the data. -Bill
If you have max_fsm_pages set correctly, plain VACUUM stays fast and removes the need for VACUUM FULL or REINDEX. It should be slightly more than the total number of data pages that will be touched by updates and deletes between vacuums.
Thanks,
--------
Shoaib Mir
EnterpriseDB (www.enterprisedb.com)
On 8 Dec 2006 09:33:31 -0800, Angva <angvaw@gmail.com> wrote:
> > clustering also removes the dead tuples.
>
> I have a followup question. What if the set of dead tuples is too big
> and I need to VACUUM FULL, as opposed to VACUUM? (The update size
> varies greatly from day to day.) Will the clustering effectively do a
> VACUUM FULL, or just a VACUUM?
>
> Thanks again for your help,
> Mark
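[The max_fsm_pages advice above translates into a postgresql.conf fragment along these lines; the values are purely illustrative, not a recommendation:]

```
# Free space map sizing (postgresql.conf, pre-8.4 releases).
# max_fsm_pages bounds how many pages with reusable free space the FSM
# can track; size it slightly above the number of data pages touched by
# updates/deletes between vacuums (e.g. ~500k touched pages here).
max_fsm_pages = 600000

# max_fsm_relations should exceed the total count of tables and indexes.
max_fsm_relations = 1000
```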
Gurus,

I hope I can make this clear somehow... Anyway... This all involves PG 8.1.4 on a 64-bit FC5 box. Select version() says "PostgreSQL 8.1.4 on x86_64-redhat-linux-gnu, compiled by GCC x86_64-redhat-linux-gcc (GCC) 4.1.0 20060304 (Red Hat 4.1.0-3)".

I guess the best question I can ask is: under what circumstances is the directory name in pg_tablespace actually used?

I have a scenario where I want to restore from a PITR backup, into an alternate location on the same machine it came from, while the original database is still up and running. I have one alternate tablespace. It goes like this. First I expand the base archive into an alternate location, then expand the tablespace archive(s) into alternate location(s). Then I recreate the links under pg_tblspc. I then fiddle a little bit with config files and run postgres on the new alternate location. Everything goes fine, the database rolls forward, and then postgres quits (because I give it an SQL file for stdin). Great.

But now I have a problem. What if I move objects from the main tablespace to the alternate one, such as indexes, between the time of the backup and the restore? During the restore/recovery, the pg_tablespace table is out of date. If the tablespace directory listed there were used in copying files, I'd have a big fat mess involving a badly broken production database.

Hopefully that all makes sense...
-Glen
Glen Parker <glenebob@nwlink.com> writes:
> I guess the best question I can see is, under what circumstances is the
> directory name in pg_tablespace actually used?

It isn't used (except by pg_dumpall) ... what counts is where the symlink in $PGDATA/pg_tblspc points.

> I have a scenario where I want to restore from a PITR backup, into an
> alternate location on the same machine it came from, while the original
> database is still up and running. I have one alternate table space.

You can probably make this work if you don't issue any CREATE TABLESPACE commands while PITR logging is active, but you'll want to test your procedures pretty carefully.

regards, tom lane
> You can probably make this work if you don't issue any CREATE TABLESPACE
> commands while PITR logging is active, but you'll want to test your
> procedures pretty carefully.

That's what I thought, and after your message, I went ahead with it and had no problems. Thx, Tom.

-Glen
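[The relinking step in this procedure can be sketched in shell. All paths, archive names, and the tablespace OID below are hypothetical; the point is that recovery follows the symlink under pg_tblspc, not the directory recorded in pg_tablespace:]

```shell
# Restore a PITR base backup into an alternate location while the
# original cluster keeps running.
RESTORE=/srv/pg_restore
mkdir -p "$RESTORE" /srv/pg_restore_tblspc
tar -xzf /backups/base.tar.gz -C "$RESTORE"
tar -xzf /backups/tblspc_16385.tar.gz -C /srv/pg_restore_tblspc

# Repoint the tablespace symlink at the alternate copy; the production
# tablespace directory is never touched.
rm "$RESTORE/pg_tblspc/16385"
ln -s /srv/pg_restore_tblspc "$RESTORE/pg_tblspc/16385"

# Then adjust the port and paths in postgresql.conf and start recovery
# against the alternate data directory.
```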
Glen Parker wrote:
> Martijn van Oosterhout wrote:
> > One issue would be that even disabled indexes would need to be updated
> > when there are new rows. If you don't update the index when it's
> > disabled, then re-enabling will essentially need to rebuild the index.
>
> That's what I had in mind. You could just as easily blow away the index
> file(s). It's just that I don't want it to toss the index *definition*.

Well, you can just pg_dump the index definition to a file. What more would someone want?

--
Bruce Momjian bruce@momjian.us
EnterpriseDB http://www.enterprisedb.com

+ If your life is a hard drive, Christ can be your backup. +