Thread: [GENERAL] efficiently migrating 'old' data from one table to another

[GENERAL] efficiently migrating 'old' data from one table to another

From: Jonathan Vanasco
Date:
I'm just wondering if there's a more efficient way of handling a certain periodic data migration.

We have a pair of tables with this structure:

    table_a__live
        column_1 INT
        column_2 INT
        record_timestamp TIMESTAMP

    table_a__archive
        column_1 INT
        column_2 INT
        record_timestamp TIMESTAMP

Periodically, we must migrate items that are 'stale' from `table_a__live` to `table_a__archive`.  The entries are copied over to the archive, then deleted.

The staleness is calculated based on age -- so we need to use INTERVAL.  The "live" table can have anywhere from 100k to 20MM records.

The primary key on `table_a__live` is a composite of column_1 & column_2.

In order to minimize scanning the table, we opted to hint migrations with a dedicated column:

    ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
    CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;

So our migration is then based on that `is_migrate` column:

    BEGIN;
    UPDATE table_a__live SET is_migrate = TRUE
        WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
    INSERT INTO table_a__archive (column_1, column_2, record_timestamp)
        SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
    DELETE FROM table_a__live WHERE is_migrate IS TRUE;
    COMMIT;

The inserts & deletes are blazing fast, but the UPDATE is a bit slow because Postgres has to rewrite all the affected rows.

Can anyone suggest a better approach?

I considered copying everything to a tmp table then inserting/deleting based on that table -- but there's a lot of disk-io on that approach too.


FWIW we're on postgres 9.6.1.

Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: "btober@computer.org"
Date:

----- Original Message -----
> From: "Jonathan Vanasco" <postgres@2xlp.com>
> To: "pgsql-general general" <pgsql-general@postgresql.org>
> Sent: Thursday, January 12, 2017 3:06:14 PM
> Subject: [GENERAL] efficiently migrating 'old' data from one table to another
>
> I'm just wondering if there's a more efficient way of handling a certain
> periodic data migration.
>
> We have a pair of tables with this structure:
>
>     table_a__live
>         column_1 INT
>         column_2 INT
>         record_timestamp TIMESTAMP
>
>     table_a__archive
>         column_1 INT
>         column_2 INT
>         record_timestamp TIMESTAMP
>
> periodically, we must migrate items that are 'stale' from `table_a__live ` to
> `table_a__archive`.  The entries are copied over to the archive, then
> deleted.
>
> The staleness is calculated based on age--  so we need to use INTERVAL.  the
> "live" table can have anywhere from 100k to 20MM records.
>
> the primary key on `table_a__live` is a composite of column_1 & column_2,
>
> In order to minimize scanning the table, we opted to hint migrations with a
> dedicated column:
>
>     ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
>     CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE
>     is_migrate IS NOT NULL;
>
> so our migration is then based on that `is_migrate` column:
>
>     BEGIN;
>     UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp <
>     transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
>     INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT
>     column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS
>     TRUE;
>     DELETE FROM table_a__live WHERE is_migrate IS TRUE;
>     COMMIT;
>
> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from
> postgres re-writing all the rows.
>
> can anyone suggest a better approach?
>
> I considered copying everything to a tmp table then inserting/deleting based
> on that table -- but there's a lot of disk-io on that approach too.


Review manual section 7.8.2. Data-Modifying Statements in WITH


https://www.postgresql.org/docs/9.6/static/queries-with.html


-- B


Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: John R Pierce
Date:
On 1/12/2017 12:06 PM, Jonathan Vanasco wrote:
> I'm just wondering if there's a more efficient way of handling a certain periodic data migration.

partition the tables by some date interval, such as week (if you do this
archiving weekly).  Each week, disconnect the oldest partition from the
'active' partition set and add it to the 'archive' partition set.
Voila -- no inserts, deletes, or updates are done at all, just some
metadata operations.
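
A rough sketch of that with inheritance partitions (9.6 has no declarative
partitioning; the partition name and week bounds below are just examples):

    -- a weekly child of the live table, created ahead of time
    CREATE TABLE table_a__live_2016_w50 (
        CHECK (record_timestamp >= '2016-12-12' AND record_timestamp < '2016-12-19')
    ) INHERITS (table_a__live);

    -- archiving is then just re-parenting the oldest child
    ALTER TABLE table_a__live_2016_w50 NO INHERIT table_a__live;
    ALTER TABLE table_a__live_2016_w50 INHERIT table_a__archive;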


--
john r pierce, recycling bits in santa cruz



Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: Adrian Klaver
Date:
On 01/12/2017 12:06 PM, Jonathan Vanasco wrote:
> I'm just wondering if there's a more efficient way of handling a certain periodic data migration.
>
> We have a pair of tables with this structure:
>
>     table_a__live
>         column_1 INT
>         column_2 INT
>         record_timestamp TIMESTAMP
>
>     table_a__archive
>         column_1 INT
>         column_2 INT
>         record_timestamp TIMESTAMP
>
> periodically, we must migrate items that are 'stale' from `table_a__live` to `table_a__archive`.  The entries are copied over to the archive, then deleted.
>
> The staleness is calculated based on age -- so we need to use INTERVAL.  the "live" table can have anywhere from 100k to 20MM records.
>
> the primary key on `table_a__live` is a composite of column_1 & column_2,
>
> In order to minimize scanning the table, we opted to hint migrations with a dedicated column:
>
>     ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
>     CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE is_migrate IS NOT NULL;
>
> so our migration is then based on that `is_migrate` column:
>
>     BEGIN;
>     UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
>     INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
>     DELETE FROM table_a__live WHERE is_migrate IS TRUE;
>     COMMIT;
>
> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows.

Maybe I am missing something, but why do the UPDATE?

Why not?:

BEGIN;

    INSERT INTO
        table_a__archive (column_1, column_2, record_timestamp)
    SELECT
        column_1, column_2, record_timestamp
    FROM
        table_a__live
    WHERE
        record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';

    DELETE FROM
       table_a__live
    WHERE
       record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
COMMIT;

With an index on record_timestamp.
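
That index might be nothing more than this (the index name is just an example):

    CREATE INDEX idx_table_a__live_record_timestamp
        ON table_a__live (record_timestamp);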

>
> can anyone suggest a better approach?
>
> I considered copying everything to a tmp table then inserting/deleting based on that table -- but there's a lot of disk-io on that approach too.
>
>
> fwiw we're on postgres9.6.1
>


--
Adrian Klaver
adrian.klaver@aklaver.com


Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: "David G. Johnston"
Date:
On Thu, Jan 12, 2017 at 2:45 PM, Adrian Klaver <adrian.klaver@aklaver.com> wrote:

> so our migration is then based on that `is_migrate` column:
>
>       BEGIN;
>       UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp < transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
>       INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS TRUE;
>       DELETE FROM table_a__live WHERE is_migrate IS TRUE;
>       COMMIT;
>
> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from postgres re-writing all the rows.

Maybe I am missing something, but why do the UPDATE?

Not to mention doubling the amount of I/O vacuum is going to chew up.

David J.

Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: Merlin Moncure
Date:
On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
<btober@broadstripe.net> wrote:
>
>
> ----- Original Message -----
>> From: "Jonathan Vanasco" <postgres@2xlp.com>
>> To: "pgsql-general general" <pgsql-general@postgresql.org>
>> Sent: Thursday, January 12, 2017 3:06:14 PM
>> Subject: [GENERAL] efficiently migrating 'old' data from one table to another
>>
>> I'm just wondering if there's a more efficient way of handling a certain
>> periodic data migration.
>>
>> We have a pair of tables with this structure:
>>
>>       table_a__live
>>               column_1 INT
>>               column_2 INT
>>               record_timestamp TIMESTAMP
>>
>>       table_a__archive
>>               column_1 INT
>>               column_2 INT
>>               record_timestamp TIMESTAMP
>>
>> periodically, we must migrate items that are 'stale' from `table_a__live ` to
>> `table_a__archive`.  The entries are copied over to the archive, then
>> deleted.
>>
>> The staleness is calculated based on age--  so we need to use INTERVAL.  the
>> "live" table can have anywhere from 100k to 20MM records.
>>
>> the primary key on `table_a__live` is a composite of column_1 & column_2,
>>
>> In order to minimize scanning the table, we opted to hint migrations with a
>> dedicated column:
>>
>>       ALTER TABLE table_a__live ADD is_migrate BOOLEAN DEFAULT NULL;
>>       CREATE INDEX idx_table_a__live_migrate ON table_a__live(is_migrate) WHERE
>>       is_migrate IS NOT NULL;
>>
>> so our migration is then based on that `is_migrate` column:
>>
>>       BEGIN;
>>       UPDATE table_a__live SET is_migrate = TRUE WHERE record_timestamp <
>>       transaction_timestamp() AT TIME ZONE 'UTC' - INTERVAL '1 month';
>>       INSERT INTO table_a__archive (column_1, column_2, record_timestamp) SELECT
>>       column_1, column_2, record_timestamp FROM table_a__live WHERE is_migrate IS
>>       TRUE;
>>       DELETE FROM table_a__live WHERE is_migrate IS TRUE;
>>       COMMIT;
>>
>> The inserts & deletes are blazing fast, but the UPDATE is a bit slow from
>> postgres re-writing all the rows.
>>
>> can anyone suggest a better approach?
>>
>> I considered copying everything to a tmp table then inserting/deleting based
>> on that table -- but there's a lot of disk-io on that approach too.
>
>
> Review manual section 7.8.2. Data-Modifying Statements in WITH
>
>
> https://www.postgresql.org/docs/9.6/static/queries-with.html

this.

with data as (delete from foo where ... returning * ) insert into
foo_backup select * from data;


Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: Jonathan Vanasco
Date:
On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:

> On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
> <btober@broadstripe.net> wrote:
>>
>> Review manual section 7.8.2. Data-Modifying Statements in WITH
>>
>>
>> https://www.postgresql.org/docs/9.6/static/queries-with.html
>
> this.
>
> with data as (delete from foo where ... returning * ) insert into
> foo_backup select * from data;

Thanks, btober and merlin.  That's exactly what I want.


On Jan 12, 2017, at 4:45 PM, Adrian Klaver wrote:
> Maybe I am missing something, but why do the UPDATE?
> Why not?:
> ...
> With an index on record_timestamp.

That's actually the production deployment that we're trying to optimize.  Depending on the size of the table (rows, width) it performs "less than great", even with the index on record_timestamp.

The UPDATE actually worked faster in most situations.  I honestly don't know why (the only thing that makes sense to me is server load)... but the update + bool test ended up being (much) faster than the timestamp comparison.

Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: kbrannen@pwhome.com
Date:
On Jan 12, 2017, Jonathan Vanasco <postgres@2xlp.com> wrote:
>On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:
>
>> On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
>> <btober@broadstripe.net> wrote:
>>>
>>> Review manual section 7.8.2. Data-Modifying Statements in WITH
>>>
>>>
>>> https://www.postgresql.org/docs/9.6/static/queries-with.html
>>
>> this.
>>
>> with data as (delete from foo where ... returning * ) insert into
>> foo_backup select * from data;
>
>Thanks, btober and merlin.  that's exactly what i want.

To help you a little more, I just did this for a set of tables within the
last week. :) The heart of the program is this sql:

    my $Chunk_size = 10000;
    my $Interval = 24;
    my $sql = "
        WITH
            keys AS (
                SELECT $pk_column
                FROM $table
                WHERE $time_column < NOW() - '$Interval MONTHS'::INTERVAL
                ORDER BY $pk_column
                LIMIT $Chunk_size ),
            data AS (
                DELETE FROM $table
                WHERE $pk_column <= (SELECT MAX($pk_column) FROM keys)
                RETURNING * )
        INSERT INTO archive_$table SELECT * FROM data;";

That's from Perl, but I suspect you can guess as to what each var should be for
your application. You can set $Chunk_size to whatever you want. There is
obviously a loop around that which executes until we get 0 rows, then we move
on to the next table.

The point of the chunks was to limit the impact on the production tables
as we move data out of them. If you don't have that concern and want to do all
rows at once then remove the LIMIT and ORDER BY.
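
If you'd rather drive that loop from inside the database, a rough plpgsql
equivalent might look like the following (table and column names are
placeholders, and note that one DO block is a single transaction, so you
lose the per-chunk commits the external loop gives you):

    DO $$
    DECLARE
        moved bigint;
    BEGIN
        LOOP
            -- same chunked delete-and-insert as the SQL above
            WITH keys AS (
                SELECT id
                FROM events
                WHERE created_at < now() - INTERVAL '24 months'
                ORDER BY id
                LIMIT 10000
            ), data AS (
                DELETE FROM events
                WHERE id <= (SELECT max(id) FROM keys)
                RETURNING *
            )
            INSERT INTO archive_events SELECT * FROM data;

            GET DIAGNOSTICS moved = ROW_COUNT;
            EXIT WHEN moved = 0;  -- nothing left to move
        END LOOP;
    END
    $$;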

HTH,
Kevin


Re: [GENERAL] efficiently migrating 'old' data from one table to another

From: Merlin Moncure
Date:
On Fri, Jan 13, 2017 at 12:03 PM,  <kbrannen@pwhome.com> wrote:
> On Jan 12, 2017, Jonathan Vanasco <postgres@2xlp.com> wrote:
>>On Jan 12, 2017, at 5:52 PM, Merlin Moncure wrote:
>>
>>> On Thu, Jan 12, 2017 at 2:19 PM, btober@computer.org
>>> <btober@broadstripe.net> wrote:
>>>>
>>>> Review manual section 7.8.2. Data-Modifying Statements in WITH
>>>>
>>>>
>>>> https://www.postgresql.org/docs/9.6/static/queries-with.html
>>>
>>> this.
>>>
>>> with data as (delete from foo where ... returning * ) insert into
>>> foo_backup select * from data;
>>
>>Thanks, btober and merlin.  that's exactly what i want.
>
> To help you a little more, I just did this for a set of tables within the
> last week. :) The heart of the program is this sql:
>
>     my $Chunk_size = 10000;
>     my $Interval = 24;
>     my $sql = "
>         WITH
>             keys AS (
>                 SELECT $pk_column
>                 FROM $table
>                 WHERE $time_column < NOW() - '$Interval MONTHS'::INTERVAL
>                 ORDER BY $pk_column
>                 LIMIT $Chunk_size ),
>             data AS (
>                 DELETE FROM $table
>                 WHERE $pk_column <= (SELECT MAX($pk_column) FROM keys)
>                 RETURNING * )
>         INSERT INTO archive_$table SELECT * FROM data;";
>
> That's from Perl, but I suspect you can guess as to what each var should be for
> your application. You can set $Chunk_size to whatever you want. There is
> obviously a loop around that which executes until we get 0 rows, then we move
> on to the next table.
>
> The point of the chunks was to limit the impact on the production tables
> as we move data out of them. If you don't have that concern and want to do all
> rows at once then remove the LIMIT and ORDER BY.

FYI, although it's likely OK in this instance, directly inserting
table names without precaution is considered dubious and should be
avoided as a practice.  SQL injection is a risk, and your code will fail
in the presence of unusual but legal table names containing spaces.

For posterity: handling this kind of action inside the database (via
plpgsql/EXECUTE), in order to leverage some internal routines,
especially quote_ident(), is generally a good idea.
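
A rough sketch of that (the function name and the archive_ prefix are
just conventions made up for the example):

    CREATE OR REPLACE FUNCTION archive_stale(p_table text, p_months int)
    RETURNS bigint LANGUAGE plpgsql AS $$
    DECLARE
        moved bigint;
    BEGIN
        -- format()'s %I quotes identifiers (quote_ident-style) and %L quotes
        -- literals, so odd-but-legal table names are handled safely
        EXECUTE format(
            'WITH data AS (
                 DELETE FROM %I
                 WHERE record_timestamp < now() - (%L || '' months'')::interval
                 RETURNING * )
             INSERT INTO %I SELECT * FROM data',
            p_table, p_months, 'archive_' || p_table);
        GET DIAGNOSTICS moved = ROW_COUNT;
        RETURN moved;
    END
    $$;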

merlin