Discussion: best practice for moving millions of rows to child table when setting up partitioning?

best practice for moving millions of rows to child table when setting up partitioning?

From
Mark Stosberg
Date:
Hello,

I'm working on moving a table with over 30 million rows to be
partitioned. The table sees several inserts per second. It's
essentially an activity log that only sees insert activity and is
lightly used for reporting, such that queries against it can safely be
disabled during a transition.

I'm looking for recommendations for a way to do this that will be least
disruptive to the flow of insert statements that will continue to
stream in.

Here's the plan which seems best to me at the moment. Is it
reasonable?

1. Creating the empty/future partitions is easy. I have the
code for this done already, and will make several partitions in advance
of needing them.

2. To create the partitions that should have data moved from the parent,
I'm thinking of creating them, and then before they are "live",
using INSERT ... SELECT to fill them with data from the parent table.
I'll run the INSERT first, and then add their indexes.

3. I will then install the trigger to redirect the inserts to the child
table.

4. There will still be a relatively small number of new rows in the
parent table to deal with that came in after the INSERT from #2 was
started, so a final INSERT ... SELECT statement will be made to copy the
remaining rows.

5. Finally, I'll drop the indexes on the parent table and truncate it.
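Steps 2 and 3 above can be sketched roughly as follows; all table, column, and trigger names here are hypothetical, since the schema isn't given:

```sql
-- Step 2 (sketch): create the child bare, backfill it, then add indexes.
CREATE TABLE activity_log_2011_03 (LIKE activity_log INCLUDING DEFAULTS);

INSERT INTO activity_log_2011_03
SELECT * FROM ONLY activity_log
WHERE created_at >= DATE '2011-03-01'
  AND created_at <  DATE '2011-04-01';

CREATE INDEX activity_log_2011_03_created_at_idx
    ON activity_log_2011_03 (created_at);

-- Step 3 (sketch): redirect new inserts on the parent to the current child.
CREATE OR REPLACE FUNCTION activity_log_insert_redirect() RETURNS trigger AS $$
BEGIN
    INSERT INTO activity_log_2011_04 VALUES (NEW.*);
    RETURN NULL;  -- suppress the insert into the parent itself
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER activity_log_partition_insert
    BEFORE INSERT ON activity_log
    FOR EACH ROW EXECUTE PROCEDURE activity_log_insert_redirect();
```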

Thanks for any advice here. If there's a tutorial out there about this
that I've missed, I'm happy to review it instead of having it rehashed
here.

Thanks for the help!

    Mark

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Raghavendra
Date:
Hi Mark,

There is a similar posting on partitioned tables; take this input before going forward with your partition table:

http://archives.postgresql.org/pgsql-general/2011-04/msg00808.php

The best solutions there were given by Greg Smith as well as Vick.

Best Regards,
Raghavendra
EnterpriseDB Corporation
The Enterprise Postgres Company





Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Mark Stosberg
Date:
> Similar posting on partition table, take this inputs before going
> forward with partition table.
>
> http://archives.postgresql.org/pgsql-general/2011-04/msg00808.php
>
> Best solution given by Greg Smith as well Vick.

Thanks for the replies.

Today I reviewed the section on partitioning from Greg's High
Performance book.

I tried out the method of using an "update trigger" on the parent table
to move rows to child tables.

In particular, I wanted to check whether the UPDATE statement would
alter all the rows at once, or if the underlying trigger would cause
all the rows to be processed one at a time.

It appears from my test that the result of the UPDATE was going to
appear all at once. I'm worried about the resource implications of
inserting millions of rows all at once.

Someone else suggested writing a small program to insert the rows in
smaller chunks. I think that's the approach I'm going to consider taking.
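For what it's worth, each chunk can be a plain INSERT ... SELECT over an ID range, one small transaction per batch (the names, ranges, and batch size below are made up):

```sql
-- One transaction per batch keeps the lock and WAL footprint small.
BEGIN;
INSERT INTO activity_log_2011_03
SELECT * FROM ONLY activity_log
WHERE id BETWEEN 1 AND 100000;
COMMIT;

BEGIN;
INSERT INTO activity_log_2011_03
SELECT * FROM ONLY activity_log
WHERE id BETWEEN 100001 AND 200000;
COMMIT;
-- ...and so on; the small program just advances the ID range.
```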

Thanks for the input!

   Mark

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Bob Lunney
Date:
Mark,

Comments inline below for items 2 and 3.  What you are planning should work like a charm.

Bob Lunney

--- On Wed, 4/27/11, Mark Stosberg <mark@summersault.com> wrote:

> From: Mark Stosberg <mark@summersault.com>
> Subject: [ADMIN] best practice for moving millions of rows to child table when setting up partitioning?
> To: pgsql-admin@postgresql.org
> Date: Wednesday, April 27, 2011, 10:48 AM
>
> Hello,
>
> I'm working on moving a table with over 30 million rows to be
> partitioned. The table sees several inserts per second. It's
> essentially an activity log that only sees insert activity and is
> lightly used for reporting, such that queries against it can safely be
> disabled during a transition.
>
> I'm looking for recommendations for a way to do this that will be least
> disruptive to the flow of insert statements that will continue to
> stream in.
>
> Here's the plan which seems best to me at the moment. Is it
> reasonable?
>
> 1. Creating the empty/future partitions is easy. I have the
> code for this done already, and will make several partitions in advance
> of needing them.
>
> 2. To create the partitions that should have data moved from the parent,
> I'm thinking of creating them, and then before they are "live",
> using INSERT ... SELECT to fill them with data from the parent table.
> I'll run the INSERT first, and then add their indexes.
>

Use "CREATE TABLE ... AS SELECT" (CTAS) instead of creating the table and
then inserting.  Since the table is created and populated atomically,
there is no need to log the inserts in WAL, and the operation is much faster.
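Bob's CTAS suggestion might look like this; the names are hypothetical, and the WAL-skipping benefit only applies when WAL archiving or streaming replication is not in use:

```sql
-- Create and populate the child atomically, then index it afterwards.
CREATE TABLE activity_log_2011_03 AS
SELECT * FROM ONLY activity_log
WHERE created_at >= DATE '2011-03-01'
  AND created_at <  DATE '2011-04-01';

CREATE INDEX activity_log_2011_03_created_at_idx
    ON activity_log_2011_03 (created_at);
```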

> 3. I will then install the trigger to redirect the inserts to the child
> table.
>

If possible, it's better to have the code simply do inserts directly into
the child table - after all, if the partitioning is based on date, both
the code and the database know the date, so the code knows to which child
table it should write at any given moment.
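For instance, an application that knows today's date can address the current month's child directly, with no trigger needed (table and column names are hypothetical):

```sql
-- Bypass the parent entirely: the app computes the child's name from the date.
INSERT INTO activity_log_2011_04 (user_id, action, created_at)
VALUES (42, 'login', now());
```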

> 4. There will still be a relatively small number of new rows in the
> parent table to deal with that came in after the INSERT from #2 was
> started, so a final INSERT ... SELECT statement will be made to copy
> the remaining rows.
>
> 5. Finally, I'll drop the indexes on the parent table and truncate it.
>
> Thanks for any advice here. If there's a tutorial out there about this
> that I've missed, I'm happy to review it instead of having it rehashed
> here.
>
> Thanks for the help!
>
>     Mark
>
>
> --
> Sent via pgsql-admin mailing list (pgsql-admin@postgresql.org)
> To make changes to your subscription:
> http://www.postgresql.org/mailpref/pgsql-admin
>

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Mark Stosberg
Date:
On 04/27/2011 10:48 AM, Mark Stosberg wrote:
>
> Hello,
>
> I'm working on moving a table with over 30 million to rows to be
> partitioned. The table seeing several inserts per second. It's
> essentially an activity log that only sees insert activity and is
> lightly used for reporting, such that queries against it can safely be
> disabled during a transition.
>
> I'm looking for recommendations for a way to do this that will be least
> disruptive to the flow of inserts statements that will continue to
> stream in.
>
> Here's the plan which seems best to me at the moment. Is it is
> reasonable?

I revised my plan based on the feedback and the resources mentioned here,
and also due to my own concerns about the resource and performance impact
of using the INSERT ... SELECT pattern on millions of rows at a time.

Here's my revised plan for the transition, which avoids large
INSERT ... SELECT statements and only requires a minimal amount of the
transition work to happen once the application has started to insert
data into the child tables. Does it seem reasonable?

New Plan
---------

1. Create the child tables targeted to contain data from the parent as
   standalone tables (including archive tables and the current month).

2. Take a full database backup from this morning and extract the COPY
   statement for the parent table. Manually split it up by date to create
   multiple copy statements, one for each partition.

3. Run the COPY statements to load the data into each child table.  So
   far, everything has happened outside of application access.

4. Find the max ID that has been inserted in the current child table.

5. INSERT ... SELECT the missing rows from the last backup from the parent
   table to the current child table. Again, note the max ID.

6. Now, during a maintenance window:
   - alter the child tables to inherit the parent
   - Set up the trigger which starts redirecting inserts
     from the parent table to the child table.
   - INSERT ... SELECT the final few missing rows from
     the parent to the current child
   - TRUNCATE then CLUSTER the parent table.
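Step 6 might look like the following single transaction (all names and the max ID are hypothetical). The TRUNCATE is deliberately run before the INHERIT steps, since a TRUNCATE of the parent recurses into attached children, a pitfall that comes up later in this thread:

```sql
BEGIN;
-- Copy the final stragglers while the children are still standalone.
INSERT INTO activity_log_2011_04
SELECT * FROM ONLY activity_log
WHERE id > 12345678;  -- the max ID noted in step 5 (made-up value)

-- Truncate BEFORE the children inherit, so it cannot cascade into them.
TRUNCATE activity_log;

ALTER TABLE activity_log_2011_03 INHERIT activity_log;
ALTER TABLE activity_log_2011_04 INHERIT activity_log;

-- Install the BEFORE INSERT redirect trigger on the parent here.
COMMIT;

-- Per step 6; assumes an index to cluster on (name hypothetical).
CLUSTER activity_log USING activity_log_pkey;
```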

Re: Re: best practice for moving millions of rows to child table when setting up partitioning?

From
"ktm@rice.edu"
Date:
On Wed, Apr 27, 2011 at 04:17:16PM -0400, Mark Stosberg wrote:
> On 04/27/2011 10:48 AM, Mark Stosberg wrote:
> >
> > Hello,
> >
> > I'm working on moving a table with over 30 million rows to be
> > partitioned. The table sees several inserts per second. It's
> > essentially an activity log that only sees insert activity and is
> > lightly used for reporting, such that queries against it can safely be
> > disabled during a transition.
> >
> > I'm looking for recommendations for a way to do this that will be least
> > disruptive to the flow of inserts statements that will continue to
> > stream in.
> >
> > Here's the plan which seems best to me at the moment. Is it
> > reasonable?
>
> I revised my plan based on feedback and mentioned resources here, and
> also due to my own concerns about the resource and performance impact of
> using the INSERT .. SELECT pattern on millions of rows at a time.
>
> Here's my revised plan for the transition, which avoids using large
> INSERT..SELECT statements, and only requires a minimal amount of the
> transition to work to happen once the application has started to insert
> data into the child tables. Seem reasonable?
>
> New Plan
> ---------
>
> 1. Create the child tables targeted to contain data from the parent as
>    standalone tables (including archive tables and the current month).
>
> 2. Take a full database backup from this morning and extract the COPY
>    statement for the parent table. Manually split it up by date to create
>    multiple copy statements, one for each partition.
>
> 3. Run the COPY statements to load the data into each child table.  So
>    far, everything has happened outside of application access.
>
> 4. Find the max ID that has been inserted in the current child table.
>
> 5. INSERT ... SELECT the missing rows from the last backup from the parent
>    table to the current child table. Again, note the max ID.
>
> 6. Now, during a maintenance window:
>    - alter the child tables to inherit the parent
>    - Set up the trigger which starts redirecting inserts
>      from the parent table to the child table.
>    - INSERT ... SELECT the final few missing rows from
>      the parent to the current child
>    - TRUNCATE then CLUSTER the parent table.
>

Hi Mark,

I used a similar process to migrate to a partitioned table from a
non-partitioned table. However, I created the future partitions first and
put them into place. Then I updated the trigger to push to the
child tables. Then, once the in-use "daily" or "weekly" table rolled
over, I back-filled the existing tables from the big table. Anyway, my
two cents.

Regards,
Ken

Re: Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Scott Marlowe
Date:
I had a similar problem about a year ago.  The parent table had about
1.5B rows, each with a unique ID from a bigserial.  My approach was to
create all the child tables needed for the past and the next month or
so.  Then I simply did something like:

begin;
-- the insert trigger on the parent sends these rows to the right child
insert into parent_table select * from only parent_table where id between 1 and 10000000;
delete from only parent_table where id between 1 and 10000000;
-- first few times, check that it's working, of course
commit;
begin;
insert into parent_table select * from only parent_table where id between 10000001 and 20000000;
delete from only parent_table where id between 10000001 and 20000000;
commit;

and so on.  New entries were already going into the child tables as
they showed up; old entries were migrating 10M rows at a time.  This
kept the moves small enough so as not to run the machine out of any
resource involved in moving 1.5B rows at once.

Re: Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Scott Marlowe
Date:
On Wed, Apr 27, 2011 at 6:26 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> I had a similar problem about a year ago,  The parent table had about
> 1.5B rows each with a unique ID from a bigserial.  My approach was to
> create all the child tables needed for the past and the next month or
> so.  Then, I simple did something like:

Note that I also created all the triggers to put the rows into the
right tables as well.

> begin;
> insert into table select * from only table where id between 1 and 10000000;
> delete from only table where id between 1 and 10000000;
> -- first few times check to make sure it's working of course
> commit;
> begin;
> insert into table select * from only table where id between 10000001
> and 20000000;
> delete from only table where id between 10000001 and 20000000;
> commit;
>
> and so on.  New entries were already going into the child tables as
> they showed up, old entries were migrating 10M rows at a time.  This
> kept the moves small enough so as not to run the machine out of any
> resource involved in moving 1.5B rows at once.
>



--
To understand recursion, one must first understand recursion.

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Greg Smith
Date:
On 04/27/2011 03:35 PM, Mark Stosberg wrote:
> In particular, I wanted to check whether the UPDATE statement would
> alter all the rows automatically, or if the underlying trigger would
> cause all the rows processed a row at a time.
>
> It appears from my test that the result of the UPDATE was going to
> appear all at once. I'm worried about the resource implications of
> inserting millions of rows all at once.
>

You can limit how many rows an UPDATE touches, same as with any other
type of query.  No reason that style of migration must happen all at
once; you can just target a smaller number of rows at a time and slowly
siphon rows over to the children by iteration.
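One wrinkle: in PostgreSQL, UPDATE has no LIMIT clause of its own, so the cap goes in a subquery. A sketch, assuming the row-moving update trigger discussed above is already installed on the parent (table name hypothetical):

```sql
-- Touch at most 10000 parent rows per pass; the trigger moves each one.
-- Loop this, with a delay between passes, until it affects 0 rows.
UPDATE ONLY activity_log SET id = id
WHERE ctid IN (SELECT ctid FROM ONLY activity_log LIMIT 10000);
```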

I don't see anything inherently wrong with the approach you're
proposing.  CREATE INDEX CONCURRENTLY does something similar to the
logic you've outlined--mark where data was inserted, move over all the
older data, then just copy over the new rows at the end.  The main
downsides of that are complexity and the need for downtime to get an
atomic swap over to the child tables.  You can't add them to the parent
until the original data is gone, if you've inserted duplicate data into
them.

I'm not sure what all that complexity buys you, compared to just adding
all the children, putting a limit on the UPDATE, and looping over that
until it's finished, with some delay after each iteration if you want to
further control the rate.  There's usually no reason you have to be in a
rush to move data over.  Relying on the database's transactional scheme
to avoid making any mistakes here--making it so a goof will
ROLLBACK--and avoiding any need for downtime are normally higher
priorities in a partition migration than making the move happen as fast
as possible.

--
Greg Smith   2ndQuadrant US    greg@2ndQuadrant.com   Baltimore, MD
PostgreSQL Training, Services, and 24x7 Support  www.2ndQuadrant.us
"PostgreSQL 9.0 High Performance": http://www.2ndQuadrant.com/books


Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Mark Stosberg
Date:
>> 5. Finally, I'll drop the indexes on the parent table and
>> truncate it.

Luckily I noticed the problem with TRUNCATE and partitioning before my
work got to production.

TRUNCATE cascades automatically and silently to child tables, which was
not my intent.

This is mentioned here:
http://wiki.postgresql.org/wiki/Table_partitioning

But is not mentioned in the official documentation for TRUNCATE:

http://www.postgresql.org/docs/9.0/static/sql-truncate.html

The work-around we used was to put the TRUNCATE statement ahead of the
ALTER TABLE .. INHERIT statements in our final transaction.

   Mark


Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Scott Marlowe
Date:
On Wed, May 4, 2011 at 10:48 AM, Mark Stosberg <mark@summersault.com> wrote:
>
>>> 5. Finally, I'll drop the indexes on the parent table and
>>> truncate it.
>
> Luckily I noticed the problem with TRUNCATE and partitioning before my
> work got to production.
>
> TRUNCATE cascades automatically and silently to child tables, which was
> not my intent.
>
> This is mentioned here:
> http://wiki.postgresql.org/wiki/Table_partitioning
>
> But is not mentioned in the official documentation for TRUNCATE:
>
> http://www.postgresql.org/docs/9.0/static/sql-truncate.html

Surely it is.  Quoting:

"If ONLY is specified, only that table is truncated. If ONLY is not
specified, the table and all its descendant tables (if any) are
truncated. "

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Mark Stosberg
Date:
On 05/04/2011 12:54 PM, Scott Marlowe wrote:
> On Wed, May 4, 2011 at 10:48 AM, Mark Stosberg <mark@summersault.com> wrote:
>>
>>>> 5. Finally, I'll drop the indexes on the parent table and
>>>> truncate it.
>>
>> Luckily I noticed the problem with TRUNCATE and partitioning before my
>> work got to production.
>>
>> TRUNCATE cascades automatically and silently to child tables, which was
>> not my intent.
>>
>> This is mentioned here:
>> http://wiki.postgresql.org/wiki/Table_partitioning
>>
>> But is not mentioned in the official documentation for TRUNCATE:
>>
>> http://www.postgresql.org/docs/9.0/static/sql-truncate.html
>
> Surely it is.  Quoting:
>
> "If ONLY is specified, only that table is truncated. If ONLY is not
> specified, the table and all its descendant tables (if any) are
> truncated. "

Thanks for the reference, Scott.

It is not as findable as it could be, then. Besides scanning the page, I
also searched for "child", "parent" and "partition", and none of those
words is mentioned. Neither is "inherit". Pulling out "ONLY" to have
its own "Parameters" sub-heading would also help, instead of bundling
that documentation under the "name" sub-heading.

I suggest that at least one of the above search terms be added to better
relate the documentation to the partitioning documentation.

Further, since TRUNCATE permanently and instantly deletes mass amounts
of data, I would hope that it would provide "safety" by default, by
truncating only one table unless I specify otherwise.

Then again, it would be nice if an UPDATE with no WHERE clause did
little or nothing by default, instead of mangling an entire table, but
SQL doesn't seem to be designed with safe-by-default in mind.

   Mark

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Scott Marlowe
Date:
On Wed, May 4, 2011 at 11:04 AM, Mark Stosberg <mark@summersault.com> wrote:
> It is not as findable as it could be then. Besides scanning the page, I
> also searched for "child", "parent" and "partition", and none of those
> words are mentioned. Neither is "inherit". Pulling out "ONLY" to have
> it's own "Parameter" sub-heading also help, instead of bundling that
> documentation under the "name" sub-heading.
>
> I suggest that at least one of the above search terms be added to better
> relate the documentation to the partitioning documentation.

Agreed.

> Further, since TRUNCATE permanently and instantly deletes mass amounts
> of data, I would hope that it would provide "safety" by default, but
> only truncating one table unless I specify otherwise.

Keep in mind you're using postgres, the only thing you can't wrap in a
transaction is create / drop tablespace or database.  So you can test
your truncate within a transaction to be sure it's doing what you
want.
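For example (table names hypothetical):

```sql
BEGIN;
TRUNCATE activity_log;
SELECT count(*) FROM activity_log_2011_04;  -- did the child get emptied too?
ROLLBACK;  -- either way, nothing is kept
```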

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
Tom Lane
Date:
Scott Marlowe <scott.marlowe@gmail.com> writes:
> On Wed, May 4, 2011 at 11:04 AM, Mark Stosberg <mark@summersault.com> wrote:
>> Further, since TRUNCATE permanently and instantly deletes mass amounts
>> of data, I would hope that it would provide "safety" by default, but
>> only truncating one table unless I specify otherwise.

The reason it works like that is that the SQL standard says so :-(
There was considerable angst about this when we made TRUNCATE recurse
to children, IIRC, but we decided we had to make it work that way.

            regards, tom lane

Re: best practice for moving millions of rows to child table when setting up partitioning?

From
"ktm@rice.edu"
Date:
Quoting Mark Stosberg <mark@summersault.com>:

>
>>> 5. Finally, I'll drop the indexes on the parent table and
>>> truncate it.
>
> Luckily I noticed the problem with TRUNCATE and partitioning before my
> work got to production.
>
> TRUNCATE cascades automatically and silently to child tables, which was
> not my intent.
>
> This is mentioned here:
> http://wiki.postgresql.org/wiki/Table_partitioning
>
> But is not mentioned in the official documentation for TRUNCATE:
>
> http://www.postgresql.org/docs/9.0/static/sql-truncate.html
>
> The work-around we used was to put the TRUNCATE statement ahead of the
> ALTER TABLE .. INHERIT statements in our final transaction.
>
>    Mark

You need to use the "ONLY" option to TRUNCATE to truncate just the
parent and not cascade to the child tables.
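That is (parent name hypothetical):

```sql
-- Removes only the parent's own rows; child tables are left intact.
TRUNCATE ONLY activity_log;
```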

Regards,
Ken