Discussion: Tuning massive UPDATEs and GROUP BYs?
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS) is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it is currently using only a little bit of memory and CPU (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier, it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled back, that wouldn't be the end of the world, so I don't know if I can use "dangerous" settings for WAL checkpoints etc. There also aren't a lot of concurrent hits on the DB, though a few.

I am loath to create a new table from a SELECT, since the indexes themselves take a really long time to build.

As the title alludes, I will also be doing GROUP BYs on the data, and would love to speed these up, mostly just for my own impatience...
On Thu, Mar 10, 2011 at 9:40 AM, fork <forkandwait@gmail.com> wrote:
> Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
> row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
> is never going to be that fast, what should one do to make it faster?
>
> I set work_mem to 2048MB, but it is currently using only a little bit of
> memory and CPU (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT
> earlier, it was using 70% of the memory).
>
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldn't be the end of the world, so I don't know if I can use
> "dangerous" settings for WAL checkpoints etc. There also aren't a lot of
> concurrent hits on the DB, though a few.
>
> I am loath to create a new table from a SELECT, since the indexes themselves
> take a really long time to build.

You are aware that updating the field for the entire table, especially if there is an index on it (or on any field being updated), will cause all your indexes to be rebuilt anyway? When you update a record, it gets a new position in the table, and a new index entry pointing at that position. INSERT/SELECT to a temp table, then TRUNCATE and INSERT/SELECT back, is usually going to be faster and will save you the reindex/cluster. On the other hand, if you have foreign keys it can be a headache.

> As the title alludes, I will also be doing GROUP BYs on the data, and would
> love to speed these up, mostly just for my own impatience...

Need to see the query here to see if you can make them go faster.

merlin
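A sketch of the temp-table shuffle Merlin describes, with hypothetical table and column names, assuming the table has exactly the columns (bar, baz, foo) in that order and no foreign keys or triggers:

```sql
BEGIN;
-- Build the new rows once, in an unindexed scratch table
CREATE TEMP TABLE scratch AS
    SELECT bar, baz, bar || ' ' || baz AS foo
    FROM bigtable;

TRUNCATE bigtable;                           -- discards old rows without leaving dead tuples
INSERT INTO bigtable SELECT * FROM scratch;  -- indexes get one clean pass of inserts
DROP TABLE scratch;
COMMIT;
```

Unlike the big UPDATE, this leaves no 12 million dead tuples behind, so no VACUUM FULL, REINDEX, or CLUSTER is needed afterwards.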
Merlin Moncure <mmoncure <at> gmail.com> writes:

>> I am loath to create a new table from a SELECT, since the indexes themselves
>> take a really long time to build.
>
> You are aware that updating the field for the entire table, especially
> if there is an index on it (or on any field being updated), will cause
> all your indexes to be rebuilt anyway? When you update a record, it
> gets a new position in the table, and a new index entry pointing at that
> position. INSERT/SELECT to a temp table, then TRUNCATE and INSERT/SELECT
> back, is usually going to be faster and will save you the reindex/cluster.
> On the other hand, if you have foreign keys it can be a headache.

Hmph. I guess I will have to find a way to automate it, since there will be a lot of times I want to do this.

>> As the title alludes, I will also be doing GROUP BYs on the data, and would
>> love to speed these up, mostly just for my own impatience...
>
> Need to see the query here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is really only for batch analysis versus transaction processing. Like "put all your temp tables on a different disk" or whatever. I will post specifics later.
On Thu, Mar 10, 2011 at 17:40, fork <forkandwait@gmail.com> wrote:
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldn't be the end of the world, so I don't know if I can use
> "dangerous" settings for WAL checkpoints etc. There also aren't a lot of
> concurrent hits on the DB, though a few.

If you don't mind long recovery times in case of a crash, set checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this will improve write throughput significantly.

Also, if you don't mind CORRUPTing your database after a crash, setting fsync=off and full_page_writes=off gives another significant boost.

> I am loath to create a new table from a SELECT, since the indexes themselves
> take a really long time to build.

UPDATE on a table with many indexes will probably be slower. If you want to speed up this part, use INSERT INTO x SELECT and take this chance to partition your table, such that each individual partition and most indexes will fit in your cache. Index builds from a warm cache are very fast in PostgreSQL. You can create several indexes at once in separate sessions, and the table will only be scanned once.

Don't forget to bump up maintenance_work_mem for index builds; 256MB might be a reasonable arbitrary value.

The downside is that partitioning can interfere with your read queries if they expect the data in sorted order. But then, HashAggregate tends to be faster than GroupAggregate in many cases, so this might not matter for your queries. Alternatively you can experiment with the PostgreSQL 9.1 alpha, which has mostly fixed this shortcoming with the "merge append" plan node.

> As the title alludes, I will also be doing GROUP BYs on the data, and would
> love to speed these up, mostly just for my own impatience...

I think regular tuning is the best you can do here.

Regards,
Marti
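Collected as a postgresql.conf fragment, the settings above would look roughly like this (pre-9.0-era parameter names; the corruption-risking pair is left commented out deliberately):

```
checkpoint_segments = 100            # fewer, larger checkpoints -> better write throughput
checkpoint_completion_target = 0.9   # spread checkpoint I/O across the interval
maintenance_work_mem = 256MB         # speeds up CREATE INDEX and similar maintenance

# DANGEROUS: can CORRUPT the database after a crash
#fsync = off
#full_page_writes = off
```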
Marti Raudsepp <marti <at> juffo.org> writes:

> If you don't mind long recovery times in case of a crash, set
> checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
> will improve write throughput significantly.

Sounds good.

> Also, if you don't mind CORRUPTing your database after a crash,
> setting fsync=off and full_page_writes=off gives another significant
> boost.

I probably won't do this... ;)

> UPDATE on a table with many indexes will probably be slower. If you
> want to speed up this part, use INSERT INTO x SELECT and take this
> chance to partition your table,

Like the following? Will it rebuild the indexes in a sensical way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM foo;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

> such that each individual partition
> and most indexes will fit in your cache.

Is there a rule of thumb on tradeoffs in a partitioned table? About half the time, I will want to do GROUP BYs that use the partition column, but about half the time I won't. (I would use as the partition column whatever I am most likely to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00, tab05, tab10, etc). Often a query would be "SELECT ... FROM PARENTTABLE GROUP BY age5, race, etc", but often it would be "GROUP BY state" or whatever with no age5 component.

I know I can experiment ;), but it takes a while to load anything, and I would rather stand on the shoulders. Thanks so much for all your help!
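For reference, partitioning by age5 in pre-9.1 PostgreSQL means inheritance plus CHECK constraints; a minimal sketch with hypothetical table and column names:

```sql
-- Parent holds no rows itself; children carry the data
CREATE TABLE pums (age5 int, race text, state text /* ...~100 more columns... */);

-- One child per 5-year age band, each with a CHECK the planner can use
CREATE TABLE pums_00 (CHECK (age5 >= 0 AND age5 < 5))  INHERITS (pums);
CREATE TABLE pums_05 (CHECK (age5 >= 5 AND age5 < 10)) INHERITS (pums);
-- ... and so on up to pums_95 ...

-- With constraint exclusion enabled, a query filtered on age5 scans only
-- the matching children; a bare GROUP BY state still scans every child.
SET constraint_exclusion = on;
SELECT race, count(*) FROM pums WHERE age5 < 10 GROUP BY race;
```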
On Fri, Mar 11, 2011 at 21:06, fork <forkandwait@gmail.com> wrote:
> Like the following? Will it rebuild the indexes in a sensical way?

Don't insert data into an indexed table. A very important point with bulk-loading is that you should load all the data first, then create the indexes. Running multiple (different) CREATE INDEX queries in parallel can additionally save a lot of time. Also don't move data back and forth between the tables; just drop the original when you're done.

Doing this should give a significant performance win. Partitioning them to fit in cache should improve it further, but I'm not sure anymore that it's worthwhile considering the costs and extra maintenance.

> Is there a rule of thumb on tradeoffs in a partitioned table?

The only certain thing is that you'll lose GroupAggregate and "merge join" query plans. If you only see HashAggregate plans when you EXPLAIN your GROUP BY queries, then it probably won't make much of a difference.

> I would use as the partition column whatever I am most likely
> to cluster by in a single big table, right?

Yes.

Regards,
Marti
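The load-first, index-later recipe above can be sketched as follows; the index and column names are hypothetical, and each block is issued from its own psql session so the builds run concurrently:

```sql
-- session 1
SET maintenance_work_mem = '256MB';  -- per-session bump for the build
CREATE INDEX pums_age5_idx ON pums (age5);

-- session 2, started at the same time; concurrent builds share the
-- table scan, so the heap is effectively read only once
SET maintenance_work_mem = '256MB';
CREATE INDEX pums_state_idx ON pums (state);
```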
>Don't insert data into an indexed table. A very important point with
>bulk-loading is that you should load all the data first, then create
>the indexes. Running multiple (different) CREATE INDEX queries in
>parallel can additionally save a lot of time. Also don't move data
>back and forth between the tables, just drop the original when you're
>done.
I just saw your post and it looks similar to what I'm doing.
We're going to be loading 12G of data from a MySQL dump into our
pg 9.0.3 database next weekend. I've been testing this for the last
two weeks. Tried removing the indexes and other constraints just for
the import but for a noob like me, this was too much to ask. Maybe
when I get more experience. So I *WILL* be importing all of my data
into indexed tables. I timed it and it will take eight hours.
I'm sure I could get it down to two or three hours for the import
if I really knew more about postgres but that's the price you pay when
you "slam dunk" a project and your staff isn't familiar with the
database back-end. Other than being very inefficient, and consuming
more time than necessary, is there any other down side to importing
into an indexed table? In the four test imports I've done,
everything seems to work fine, just takes a long time.
Sorry for hijacking your thread here!
On Sun, Mar 13, 2011 at 18:36, runner <runner@winning.com> wrote:
> Tried removing the indexes and other constraints just for
> the import but for a noob like me, this was too much to ask. Maybe
> when I get more experience.

pgAdmin should make it pretty easy. Choose each index and constraint, save the code from the "SQL pane" for when you need to restore it, and do a right click -> Drop.

> Other than being very inefficient, and consuming
> more time than necessary, is there any other down side to importing
> into an indexed table?

Doing so will result in somewhat larger (more bloated) indexes, but generally the performance impact of this is minimal.

Regards,
Marti
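The drop-and-restore cycle amounts to something like the following; the index definition is a hypothetical stand-in for whatever you save from pgAdmin's SQL pane:

```sql
-- Saved definition (from pgAdmin's SQL pane, or pg_indexes):
--   CREATE INDEX accounts_user_idx ON accounts (user_id);

DROP INDEX accounts_user_idx;

-- ... run the bulk import here ...

CREATE INDEX accounts_user_idx ON accounts (user_id);
```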
On Mon, Mar 14, 2011 at 4:17 AM, Marti Raudsepp <marti@juffo.org> wrote:
> On Sun, Mar 13, 2011 at 18:36, runner <runner@winning.com> wrote:
>> Other than being very inefficient, and consuming
>> more time than necessary, is there any other down side to importing
>> into an indexed table?
>
> Doing so will result in somewhat larger (more bloated) indexes, but
> generally the performance impact of this is minimal.
Bulk data imports of this size I've done with minimal pain by simply breaking the raw data into chunks (10M records becomes 10 files of 1M records), on a separate spindle from the database, and performing multiple COPY commands, but no more than one COPY per server core. I tested this a while back on a 4-core server, and when I attempted 5 COPYs at a time, the time to complete went up almost 30%. I don't recall any benefit from having fewer than 4 in this case, but the server was only processing my data at the time. Indexes were on the target table; however, I dropped all constraints. The UNIX split command is handy for breaking the data up into individual files.
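A sketch of Greg's chunk-and-parallel-COPY approach; the paths, database, and table names are illustrative, this assumes a CSV dump, and of course it needs a running server for psql to connect to:

```
# 1M records per chunk, written to a spindle separate from the database
split -l 1000000 /mnt/raw/pums.csv /mnt/raw/chunks/pums_

# No more than one COPY per core (4 here); xargs -P caps the concurrency
ls /mnt/raw/chunks/pums_* | xargs -n1 -P4 -I{} \
    psql -d census -c "\copy pums FROM '{}' WITH CSV"
```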
Greg
> Bulk data imports of this size I've done with minimal pain by simply
> breaking the raw data into chunks (10M records becomes 10 files of
> 1M records), on a separate spindle from the database, and performing
> multiple COPY commands but no more than 1 COPY per server core.
> I tested this a while back on a 4 core server and when I attempted 5
> COPY's at a time the time to complete went up almost 30%. I don't
> recall any benefit having fewer than 4 in this case but the server was
> only processing my data at the time. Indexes were on the target table
> however I dropped all constraints. The UNIX split command is handy
> for breaking the data up into individual files.
I'm not using COPY. My dump file is a bunch of INSERT INTO statements. I know it would be faster to use COPY. If I can figure out how to do this in one hour, I will try it. I did two mysqldumps, one with INSERT INTO and one as CSV, so I can try COPY at a later time. I'm running five parallel psql processes to import the data, which has been broken out by table.