Discussion: Tuning massive UPDATES and GROUP BY's?

Tuning massive UPDATES and GROUP BY's?

From: fork
Date:
Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
is never going to be that fast, what should one do to make it faster?

I set work_mem to 2048MB, but it currently is only using a little bit of memory
and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
it was using 70% of the memory).

The data is not particularly sensitive; if something happened and it rolled
back, that wouldn't be the end of the world.  So I don't know if I can use
"dangerous" settings for WAL checkpoints etc.  There also aren't a lot of
concurrent hits on the DB, though there are a few.

I am loath to create a new table from a select, since the indexes themselves
take a really long time to build.

As the title alludes, I will also be doing GROUP BY's on the data, and would
love to speed these up, mostly just for my own impatience...



Re: Tuning massive UPDATES and GROUP BY's?

From: Merlin Moncure
Date:
On Thu, Mar 10, 2011 at 9:40 AM, fork <forkandwait@gmail.com> wrote:
> Given that doing a massive UPDATE SET foo = bar || ' ' || baz; on a 12 million
> row table (with about 100 columns -- the US Census PUMS for the 2005-2009 ACS)
> is never going to be that fast, what should one do to make it faster?
>
> I set work_mem to 2048MB, but it currently is only using a little bit of memory
> and CPU. (3% and 15% according to top; on a SELECT DISTINCT ... LIMIT earlier,
> it was using 70% of the memory).
>
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldn't be the end of the world.  So I don't know if I can use
> "dangerous" settings for WAL checkpoints etc.  There also aren't a lot of
> concurrent hits on the DB, though there are a few.
>
> I am loath to create a new table from a select, since the indexes themselves
> take a really long time to build.

You are aware that updating the field for the entire table, especially
if there is an index on it (or any field being updated), will cause
all your indexes to be rebuilt anyway?  When you update a record, it
gets a new position in the table, and a new index entry pointing at that
position.  INSERT/SELECT to a temp table, plus TRUNCATE, plus INSERT/SELECT
back is usually going to be faster and will save you the reindex/cluster.
OTOH, if you have foreign keys it can be a headache.
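
Something like the following, with made-up table and column names standing
in for yours:

BEGIN;
-- copy the rows out once, computing the new value of foo on the way
-- (only three columns shown; in practice you'd list all ~100)
CREATE TEMP TABLE pums_tmp AS
    SELECT bar || ' ' || baz AS foo, bar, baz
    FROM pums;
-- TRUNCATE is transactional, so the whole thing still rolls back cleanly
TRUNCATE pums;
INSERT INTO pums (foo, bar, baz)
    SELECT foo, bar, baz FROM pums_tmp;
DROP TABLE pums_tmp;
COMMIT;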

> As the title alludes, I will also be doing GROUP BY's on the data, and would
> love to speed these up, mostly just for my own impatience...

Need to see the queries here to see if you can make them go faster.

merlin

Re: Tuning massive UPDATES and GROUP BY's?

From: fork
Date:
Merlin Moncure <mmoncure <at> gmail.com> writes:

> > I am loath to create a new table from a select, since the indexes themselves
> > take a really long time to build.
>
> You are aware that updating the field for the entire table, especially
> if there is an index on it (or any field being updated), will cause
> all your indexes to be rebuilt anyway?  When you update a record, it
> gets a new position in the table, and a new index entry pointing at that
> position.  INSERT/SELECT to a temp table, plus TRUNCATE, plus INSERT/SELECT
> back is usually going to be faster and will save you the reindex/cluster.
> OTOH, if you have foreign keys it can be a headache.

Hmph.  I guess I will have to find a way to automate it, since there will be a
lot of times I want to do this.

> > As the title alludes, I will also be doing GROUP BY's on the data, and would
> > love to speed these up, mostly just for my own impatience...
>
> Need to see the queries here to see if you can make them go faster.

I guess I was hoping for a blog entry on general guidelines given a DB that is
really only for batch analysis versus transaction processing.  Like "put all
your temp tables on a different disk" or whatever.  I will post specifics later.

Re: Tuning massive UPDATES and GROUP BY's?

From: Marti Raudsepp
Date:
On Thu, Mar 10, 2011 at 17:40, fork <forkandwait@gmail.com> wrote:
> The data is not particularly sensitive; if something happened and it rolled
> back, that wouldn't be the end of the world.  So I don't know if I can use
> "dangerous" settings for WAL checkpoints etc.  There also aren't a lot of
> concurrent hits on the DB, though there are a few.

If you don't mind long recovery times in case of a crash, set
checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
will improve write throughput significantly.

Also, if you don't mind CORRUPTing your database after a crash,
setting fsync=off and full_page_writes=off gives another significant
boost.
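
For reference, in postgresql.conf that would look something like this (the
numbers are just starting points, and the last two are the
corrupt-your-database ones, left commented out here):

checkpoint_segments = 100
checkpoint_completion_target = 0.9
# only if losing the whole database after a crash is acceptable:
#fsync = off
#full_page_writes = off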

> I am loath to create a new table from a select, since the indexes themselves
> take a really long time to build.

UPDATE on a table with many indexes will probably be slower. If you
want to speed up this part, use INSERT INTO x SELECT and take this
chance to partition your table, such that each individual partition
and most indexes will fit in your cache. Index builds from a warm
cache are very fast in PostgreSQL. You can create several indexes at
once in separate sessions, and the table will only be scanned once.

Don't forget to bump up maintenance_work_mem for index builds, 256MB
might be a reasonable arbitrary value.
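
To sketch the shape of it (9.0-style inheritance partitioning; the table
names and the part_key partition column are placeholders, not a recipe):

SET maintenance_work_mem = '256MB';

-- parent with the full column list; each child carries a CHECK constraint
-- on the partition key so the planner can exclude it
CREATE TABLE pums_parent (LIKE pums_old);
CREATE TABLE pums_p1 (CHECK (part_key BETWEEN 1 AND 10)) INHERITS (pums_parent);

-- load each child in one pass...
INSERT INTO pums_p1 SELECT * FROM pums_old WHERE part_key BETWEEN 1 AND 10;

-- ...and only then build its indexes; separate CREATE INDEX statements can
-- run at the same time from different sessions
CREATE INDEX pums_p1_part_key_idx ON pums_p1 (part_key);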

The downside is that partitioning can interfere with your read queries
if they expect the data in a sorted order. But then, HashAggregate
tends to be faster than GroupAggregate in many cases, so this might
not matter for your queries. Alternatively you can experiment with
PostgreSQL 9.1 alpha, which has mostly fixed this shortcoming with the
"merge append" plan node.

> As the title alludes, I will also be doing GROUP BY's on the data, and would
> love to speed these up, mostly just for my own impatience...

I think regular tuning is the best you can do here.

Regards,
Marti

Re: Tuning massive UPDATES and GROUP BY's?

From: fork
Date:
Marti Raudsepp <marti <at> juffo.org> writes:

> If you don't mind long recovery times in case of a crash, set
> checkpoint_segments to ~100 and checkpoint_completion_target=0.9; this
> will improve write throughput significantly.

Sounds good.

> Also, if you don't mind CORRUPTing your database after a crash,
> setting fsync=off and full_page_writes=off gives another significant
> boost.

I probably won't do this... ;)

> UPDATE on a table with many indexes will probably be slower. If you
> want to speed up this part, use INSERT INTO x SELECT and take this
> chance to partition your table,

Like the following?  Will it rebuild the indexes in a sensible way?

BEGIN;
CREATE TABLE tempfoo AS SELECT *, foo + bar AS newcol FROM foo;
TRUNCATE foo;
ALTER TABLE foo ADD COLUMN newcol numeric;  -- needs an explicit type; numeric is just a placeholder
INSERT INTO foo SELECT * FROM tempfoo;
DROP TABLE tempfoo;
END;

> such that each individual partition
> and most indexes will fit in your cache.

Is there a rule of thumb on tradeoffs in a partitioned table?  About half the
time, I will want to do GROUP BY's that use the partition column, but about half
the time I won't.  (I would use as the partition column whatever I am most likely
to cluster by in a single big table, right?)

For example, I might intuitively partition by age5 (into 20 tables like tab00,
tab05, tab10, etc). Often a query would be "SELECT ... FROM PARENTTABLE GROUP BY
age5, race, etc", but often it would be "GROUP BY state" or whatever with no
age5 component.

I know I can experiment ;), but it takes a while to load anything, and I would
rather stand on the shoulders of giants.

Thanks so much for all your help!



Re: Tuning massive UPDATES and GROUP BY's?

From: Marti Raudsepp
Date:
On Fri, Mar 11, 2011 at 21:06, fork <forkandwait@gmail.com> wrote:
> Like the following?  Will it rebuild the indexes in a sensible way?

Don't insert data into an indexed table. A very important point with
bulk-loading is that you should load all the data first, then create
the indexes. Running multiple (different) CREATE INDEX queries in
parallel can additionally save a lot of time. Also don't move data
back and forth between the tables, just drop the original when you're
done.
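
In outline, with placeholder names:

-- build the new table in one pass, with the computed column already in place
CREATE TABLE pums_new AS
    SELECT *, bar || ' ' || baz AS foo_new FROM pums;

-- only now create the indexes; run the different CREATE INDEX statements
-- from separate sessions if you want them built concurrently
CREATE INDEX pums_new_bar_idx ON pums_new (bar);
CREATE INDEX pums_new_baz_idx ON pums_new (baz);

-- and drop the original instead of copying the data back
DROP TABLE pums;
ALTER TABLE pums_new RENAME TO pums;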

Doing this should give a significant performance win. Partitioning
them to fit in cache should improve it further, but I'm not sure
anymore that it's worthwhile considering the costs and extra
maintenance.

> Is there a rule of thumb on tradeoffs in a partitioned table?

The only certain thing is that you'll lose "group" aggregate and
"merge join" query plans. If you only see "HashAggregate" plans when
you EXPLAIN your GROUP BY queries then it probably won't make much of
a difference.
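
For example (the query itself is just a stand-in):

EXPLAIN
SELECT age5, race, count(*)
FROM pums
GROUP BY age5, race;

If the top node is HashAggregate rather than GroupAggregate, partitioning is
unlikely to hurt that query.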

> I would use as the partition column whatever I am most likely
> to cluster by in a single big table, right?

Yes.

Regards,
Marti

Re: Tuning massive UPDATES and GROUP BY's?

From: runner
Date:
>Don't insert data into an indexed table. A very important point with
>bulk-loading is that you should load all the data first, then create
>the indexes. Running multiple (different) CREATE INDEX queries in
>parallel can additionally save a lot of time. Also don't move data
>back and forth between the tables, just drop the original when you're
>done.

I just saw your post and it looks similar to what I'm doing.
We're going to be loading 12G of data from a MySQL dump into our
pg 9.0.3 database next weekend. I've been testing this for the last
two weeks. Tried removing the indexes and other constraints just for
the import but for a noob like me, this was too much to ask. Maybe
when I get more experience. So I *WILL* be importing all of my data
into indexed tables. I timed it and it will take eight hours.

I'm sure I could get it down to two or three hours for the import
if I really knew more about postgres but that's the price you pay when
you "slam dunk" a project and your staff isn't familiar with the
database back-end. Other than being very inefficient, and consuming
more time than necessary, is there any other down side to importing
into an indexed table? In the four test imports I've done,
everything seems to work fine, just takes a long time.

Sorry for hijacking your thread here!

Re: Tuning massive UPDATES and GROUP BY's?

From: Marti Raudsepp
Date:
On Sun, Mar 13, 2011 at 18:36, runner <runner@winning.com> wrote:
> Tried removing the indexes and other constraints just for
> the import but for a noob like me, this was too much to ask.  Maybe
> when I get more experience.

pgAdmin should make it pretty easy. Choose each index and constraint,
save the code from the "SQL pane" for when you need to restore it, and
do a right click -> Drop.
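
In plain SQL it comes down to something like this (object names are invented;
the ADD/CREATE statements at the end are the ones you save from the SQL pane
beforehand):

-- before the import
ALTER TABLE mytable DROP CONSTRAINT mytable_pkey;
DROP INDEX mytable_col_idx;

-- ... run the import ...

-- afterwards, replay the definitions you saved
ALTER TABLE mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX mytable_col_idx ON mytable (col);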

> Other than being very inefficient, and consuming
> more time than necessary, is there any other down side to importing
> into an indexed table?

Doing so will result in somewhat larger (more bloated) indexes, but
generally the performance impact of this is minimal.

Regards,
Marti

Re: Tuning massive UPDATES and GROUP BY's?

From: Greg Spiegelberg
Date:
On Mon, Mar 14, 2011 at 4:17 AM, Marti Raudsepp <marti@juffo.org> wrote:
> On Sun, Mar 13, 2011 at 18:36, runner <runner@winning.com> wrote:
> > Other than being very inefficient, and consuming
> > more time than necessary, is there any other down side to importing
> > into an indexed table?
>
> Doing so will result in somewhat larger (more bloated) indexes, but
> generally the performance impact of this is minimal.


Bulk data imports of this size I've done with minimal pain by simply breaking the raw data into chunks (10M records becomes 10 files of 1M records), on a separate spindle from the database, and performing multiple COPY commands, but no more than one COPY per server core.  I tested this a while back on a 4-core server, and when I attempted 5 COPYs at a time the time to complete went up almost 30%.  I don't recall any benefit to having fewer than 4 in this case, but the server was only processing my data at the time.  Indexes were on the target table; however, I dropped all constraints.  The UNIX split command is handy for breaking the data up into individual files.
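
Roughly (paths and table name made up):

-- after splitting the raw file on the shell side, e.g.
--   split -l 1000000 pums.csv /staging/pums_chunk_
-- run one COPY per chunk, each from its own connection, up to one per core:
COPY pums FROM '/staging/pums_chunk_aa' WITH CSV;
COPY pums FROM '/staging/pums_chunk_ab' WITH CSV;
-- ...and so on (server-side COPY; use psql's \copy if the files live on a client)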

Greg

Re: Tuning massive UPDATES and GROUP BY's?

From: runner
Date:
> Bulk data imports of this size I've done with minimal pain by simply
> breaking the raw data into chunks (10M records becomes 10 files of
> 1M records), on a separate spindle from the database, and performing
> multiple COPY commands, but no more than one COPY per server core.
> I tested this a while back on a 4-core server, and when I attempted 5
> COPYs at a time the time to complete went up almost 30%.  I don't
> recall any benefit to having fewer than 4 in this case, but the server was
> only processing my data at the time.  Indexes were on the target table;
> however, I dropped all constraints.  The UNIX split command is handy
> for breaking the data up into individual files.

I'm not using COPY.  My dump file is a bunch of INSERT INTO statements.  I know it would be faster to use COPY.  If I can figure out how to do this in one hour I will try it.  I did two mysqldumps, one with INSERT INTO and one as CSV so I can try COPY at a later time.  I'm running five parallel psql processes to import the data, which has been broken out by table.