Discussion: Long running INSERT+SELECT query


Long running INSERT+SELECT query

From
Vitaliy Garnashevich
Date:
Hi,

We're trying to populate a table with aggregated data from other tables. 
For that we're running a huge INSERT+SELECT query which joins several 
tables, aggregates values, and then inserts the results into another 
table. The problem we're facing is that while the query is running, 
some records in tables referenced by the results table may be deleted, 
which causes the following error:

ERROR:  insert or update on table "..." violates foreign key constraint 
"..."
DETAIL:  Key (...)=(...) is not present in table "...".

How do we make sure that such an aggregating query would not fail?

Regards,
Vitaliy




Re: Long running INSERT+SELECT query

From
Adrian Klaver
Date:
On 04/26/2018 09:13 AM, Vitaliy Garnashevich wrote:
> Hi,
> 
> We're trying to populate a table with aggregated data from other tables. 
> For that we're running a huge INSERT+SELECT query which joins several 
> tables, aggregates values, and then inserts the results into another 
> table. The problem we're facing is that while the query is running, 
> some records in tables referenced by the results table may be deleted, 
> which causes the following error:
> 
> ERROR:  insert or update on table "..." violates foreign key constraint 
> "..."
> DETAIL:  Key (...)=(...) is not present in table "...".
> 

SELECT .. FOR UPDATE?:

https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE

https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS
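
A minimal sketch of what that looks like (table and column names below are 
just placeholders, not from your schema):

    SELECT d.ref_id, d.val
    FROM detail_table d
    JOIN referenced_table r ON r.id = d.ref_id
    FOR UPDATE;  -- locks the selected rows in both tables until the transaction ends

The row locks keep concurrent transactions from deleting the rows the query 
read, at the cost of blocking those DELETEs until you commit.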

> How do we make sure that such an aggregating query would not fail?
> 
> Regards,
> Vitaliy
> 
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Long running INSERT+SELECT query

From
Vitaliy Garnashevich
Date:
> SELECT .. FOR UPDATE?:
>
> https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE 
>
>
> https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS 
>

It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR 
KEY SHARE cannot be specified with GROUP BY."

Regards,
Vitaliy


Re: Long running INSERT+SELECT query

From
Adrian Klaver
Date:
On 04/26/2018 10:45 AM, Vitaliy Garnashevich wrote:
> 
>> SELECT .. FOR UPDATE?:
>>
>> https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE 
>>
>>
>> https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS 
>>
> 
> It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR 
> KEY SHARE cannot be specified with GROUP BY."

Without the query we are flying blind, so suggestions will come with a ?

> 
> Regards,
> Vitaliy


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Long running INSERT+SELECT query

From
Vitaliy Garnashevich
Date:
> Without the query we are flying blind, so suggestions will come with a ? 

Here is one such query:

     INSERT INTO cmdb_sp_usage_history
       (created_by, updated_by, created_on, updated_on, mod_count,
       summary_on, quarter, product, used_from, "user",
       keystrokes, minutes_in_use, times_started, avg_keystrokes, 
max_keystrokes, spkg_operational)
     SELECT
        2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
        CURRENT_TIMESTAMP, quarter.id, "spv"."product", 
"usage"."used_from", "usage"."user",
        coalesce(sum("usage"."keystrokes"), 0),
        coalesce(sum("usage"."minutes_in_use"), 0),
        coalesce(sum("usage"."times_started"), 0),
        coalesce(avg("usage"."keystrokes"), 0),
        coalesce(max("usage"."keystrokes"), 0),
        bool_or("cmdb_ci"."operational")
     FROM
       "cmdb_program_daily_usage" "usage"
     LEFT OUTER JOIN
       "cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
     LEFT OUTER JOIN
       "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
     LEFT OUTER JOIN
       "cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
     LEFT OUTER JOIN
       "cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
     WHERE ("usage"."minutes_in_use" > 0)
       AND ((NOT ("s"."software" IS NULL))
            AND ((NOT ("s"."os" = TRUE))
                 OR ("s"."os" IS NULL)))
       AND ("usage"."usage_date" >= quarter.start_date)
       AND ("usage"."usage_date" < quarter.end_date)
     GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
     HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR 
(coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR 
(coalesce(sum("usage"."times_started"), 0) > 0)
     ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

Regards,
Vitaliy



Re: Long running INSERT+SELECT query

From
Adrian Klaver
Date:
On 04/26/2018 11:08 AM, Vitaliy Garnashevich wrote:
> 
>> Without the query we are flying blind, so suggestions will come with a ? 
> 
> Here is one such query:
> 
>      INSERT INTO cmdb_sp_usage_history
>        (created_by, updated_by, created_on, updated_on, mod_count,
>        summary_on, quarter, product, used_from, "user",
>        keystrokes, minutes_in_use, times_started, avg_keystrokes, 
> max_keystrokes, spkg_operational)
>      SELECT
>         2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
>         CURRENT_TIMESTAMP, quarter.id, "spv"."product", 
> "usage"."used_from", "usage"."user",
>         coalesce(sum("usage"."keystrokes"), 0),
>         coalesce(sum("usage"."minutes_in_use"), 0),
>         coalesce(sum("usage"."times_started"), 0),
>         coalesce(avg("usage"."keystrokes"), 0),
>         coalesce(max("usage"."keystrokes"), 0),
>         bool_or("cmdb_ci"."operational")
>      FROM
>        "cmdb_program_daily_usage" "usage"
>      LEFT OUTER JOIN
>        "cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
>      LEFT OUTER JOIN
>        "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
>      LEFT OUTER JOIN
>        "cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
>      LEFT OUTER JOIN
>        "cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
>      WHERE ("usage"."minutes_in_use" > 0)
>        AND ((NOT ("s"."software" IS NULL))
>             AND ((NOT ("s"."os" = TRUE))
>                  OR ("s"."os" IS NULL)))
>        AND ("usage"."usage_date" >= quarter.start_date)
>        AND ("usage"."usage_date" < quarter.end_date)
>      GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
>      HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR 
> (coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR 
> (coalesce(sum("usage"."times_started"), 0) > 0)
>      ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

Have not worked through all of the above, but a first draft suggestion:

Move the SELECT minus the aggregation functions into a sub-query that 
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.
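
Very roughly, the shape would be something like this (a sketch only: most 
columns and filters are omitted, and the outer joins are tightened to inner 
joins because FOR UPDATE cannot be applied to the nullable side of an outer 
join):

     INSERT INTO cmdb_sp_usage_history (product, used_from, "user", keystrokes)
     SELECT t.product, t.used_from, t."user", coalesce(sum(t.keystrokes), 0)
     FROM (
       SELECT "spv"."product", "usage"."used_from", "usage"."user",
              "usage"."keystrokes"
       FROM "cmdb_program_daily_usage" "usage"
       JOIN "cmdb_program_instance" "p" ON "p"."id" = "usage"."program_instance"
       JOIN "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
       JOIN "cmdb_software_product_version" "spv" ON "spv"."id" = "s"."software"
       WHERE "usage"."minutes_in_use" > 0
       FOR UPDATE                       -- row locks taken here, before any GROUP BY
     ) t
     GROUP BY t.product, t.used_from, t."user";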

> 
> Regards,
> Vitaliy
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Long running INSERT+SELECT query

From
Vitaliy Garnashevich
Date:
> Have not worked through all of the above, but a first draft suggestion:
>
> Move the SELECT minus the aggregation functions into a sub-query that 
> uses FOR UPDATE. Then do the aggregation on the results of the sub-query. 

The aggregated table has hundreds of millions of rows, and the query 
runs for many hours (which is one of the reasons why it's better not to 
fail). I really doubt that row level locking would work. That would be a 
lot of RAM just to hold all the locks.

On the other hand, I don't see anything like a FOR KEY SHARE kind of 
lock at the table level, which would let the query avoid blocking most 
other existing activity (e.g. SELECTs, UPDATEs).

Maybe this could be solved by calculating results into a temporary 
table, which would not check foreign key constraints, and then copying the 
data into the actual results table, while checking each row for FK 
consistency and skipping if necessary. But then I don't think it would 
be possible for my transaction to see row deletions which other 
transactions have done, and to check row existence (the transaction is 
there, because the whole thing is implemented as a DO statement with 
some local variables).
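
The copy step I have in mind would look roughly like this (a sketch, with 
the temp table name and the column list shortened; the open question is 
whether the EXISTS check would actually see deletes committed by other 
transactions after my snapshot was taken):

     INSERT INTO cmdb_sp_usage_history (product, used_from, "user", keystrokes)
     SELECT t.product, t.used_from, t."user", t.keystrokes
     FROM tmp_sp_usage_history t       -- temp table filled by the big query, no FK checks
     WHERE EXISTS (SELECT 1 FROM cmdb_software_product_version spv
                   WHERE spv.id = t.product);  -- skip rows whose parent row is gone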

Thoughts?

Regards,
Vitaliy



Re: Long running INSERT+SELECT query

From
"David G. Johnston"
Date:
On Thu, Apr 26, 2018 at 1:32 PM, Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
The aggregated table has hundreds of millions of rows, and the query runs for many hours (which is one of the reasons why it's better not to fail)
​[...]​
 
Maybe this could be solved by calculating results into a temporary table, which would not check foreign key constraints, and then copy the data into the actual results table, while checking each row for FK consistency and skipping if necessary.

Me, I'd try very hard to design things so the final calculation goes into an actual results table that omits FK constraints, and I wouldn't try to "skip if necessary".  You are already running an hours-long query - the users of said information need to understand that what they are seeing does not reflect changes in the subsequent hour(s) since it started, and that certain related records being no longer present doesn't detract from the fact that they were present "back then" and thus represent valid data at that point in time.
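
In practice that can be as simple as dropping the constraint on the history 
table (the constraint name below is made up - look up the real one first):

    ALTER TABLE cmdb_sp_usage_history
      DROP CONSTRAINT cmdb_sp_usage_history_product_fkey;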

David J.

Re: Long running INSERT+SELECT query

From
Alvaro Herrera
Date:
Vitaliy Garnashevich wrote:
> 
> > Have not worked through all of the above, but a first draft suggestion:
> > 
> > Move the SELECT minus the aggregation functions into a sub-query that
> > uses FOR UPDATE. Then do the aggregation on the results of the
> > sub-query.
> 
> The aggregated table has hundreds of millions of rows, and the query runs
> for many hours (which is one of the reasons why it's better not to fail). I
> really doubt that row level locking would work. That would be a lot of RAM
> just to hold all the locks.

Row locks are not stored in memory.

Of course, a FOR KEY SHARE lock would block DELETEs that try to remove
the locked row.
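
For example (a sketch; the id value is arbitrary):

    -- session 1
    BEGIN;
    SELECT 1 FROM cmdb_software_product_version WHERE id = 42 FOR KEY SHARE;

    -- session 2: this DELETE now waits until session 1 commits or rolls back,
    -- while UPDATEs that don't touch key columns are still allowed
    DELETE FROM cmdb_software_product_version WHERE id = 42;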

I think your proposed strategy of trying to merge what other processes
did while you were copying is very problematic.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: Long running INSERT+SELECT query

From
Adrian Klaver
Date:
On 04/26/2018 01:32 PM, Vitaliy Garnashevich wrote:
> 
>> Have not worked through all of the above, but a first draft suggestion:
>>
>> Move the SELECT minus the aggregation functions into a sub-query that 
>> uses FOR UPDATE. Then do the aggregation on the results of the sub-query. 
> 
> The aggregated table has hundreds of millions of rows, and the query 
> runs for many hours (which is one of the reasons why it's better not to 
> fail). I really doubt that row level locking would work. That would be a 
> lot of RAM just to hold all the locks.
> 
> On the other hand, I don't see anything like a FOR KEY SHARE kind of 
> lock at the table level, which would let the query avoid blocking most 
> other existing activity (e.g. SELECTs, UPDATEs).
> 
> Maybe this could be solved by calculating results into a temporary 
> table, which would not check foreign key constraints, and then copying the 
> data into the actual results table, while checking each row for FK 
> consistency and skipping if necessary. But then I don't think it would 
> be possible for my transaction to see row deletions which other 
> transactions have done, and to check row existence (the transaction is 
> there, because the whole thing is implemented as a DO statement with 
> some local variables).
> 
> Thoughts?

The procedure seems to be fighting itself. There is an inherent conflict 
between trying to keep up with data changes and presenting a consistent 
result. Keeping up means constantly updating the aggregation 
calculations, which in turn means the result will be continually changing. 
As David and Alvaro have also suggested, your best bet is to pick a point 
in time and work off that.

> 
> Regards,
> Vitaliy
> 
> 


-- 
Adrian Klaver
adrian.klaver@aklaver.com


Re: Long running INSERT+SELECT query

From
Steven Lembark
Date:
On Thu, 26 Apr 2018 19:13:17 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> We're trying to populate a table with aggregated data from other
> tables. For that we're running a huge INSERT+SELECT query which joins
> several tables, aggregates values, and then inserts the results into
> another table. The problem we're facing is that while the query is
> running, some records in tables referenced by the results table may
> be deleted, which causes the following error:
>
> ERROR:  insert or update on table "..." violates foreign key
> constraint "..."
> DETAIL:  Key (...)=(...) is not present in table "...".
>
> How do we make sure that such an aggregating query would not fail?

Create a temporary table with a useful subset of the data.

You can select the minimum number of columns joined and release
the locks. This can also help large queries by giving you a stable
snapshot of the data for repeated queries.

I usually find that pre-joining the tables is easier because
temp tables have restrictions on re-use w/in the query, and also
usually don't have indexes to speed up the joins.

If you are going to run this, say, daily, it's easy enough to create
a view and just "create temporary table foo as select * from bar"
for some collection of views and go from there. This makes it easier
to tune the queries on the back end without having to hack the front
end code.
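
For instance (names are placeholders, and the join here is only a cut-down
version of yours):

    create view usage_join_v as
    select u.used_from, u."user", u.keystrokes, spv.product
    from cmdb_program_daily_usage u
    join cmdb_program_instance p on p.id = u.program_instance
    join cmdb_ci_spkg s on s.id = p.spkg
    join cmdb_software_product_version spv on spv.id = s.software;

    -- the front end only ever runs this; the view can be re-tuned on its own
    create temporary table usage_snapshot as
    select * from usage_join_v;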

--
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508


Re: Long running INSERT+SELECT query

From
Steven Lembark
Date:
On Thu, 26 Apr 2018 21:08:01 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

>      INSERT INTO cmdb_sp_usage_history
>        (created_by, updated_by, created_on, updated_on, mod_count,
>        summary_on, quarter, product, used_from, "user",
>        keystrokes, minutes_in_use, times_started, avg_keystrokes,
> max_keystrokes, spkg_operational)
>      SELECT
>         2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
>         CURRENT_TIMESTAMP, quarter.id, "spv"."product",
> "usage"."used_from", "usage"."user",
>         coalesce(sum("usage"."keystrokes"), 0),
>         coalesce(sum("usage"."minutes_in_use"), 0),
>         coalesce(sum("usage"."times_started"), 0),
>         coalesce(avg("usage"."keystrokes"), 0),
>         coalesce(max("usage"."keystrokes"), 0),
>         bool_or("cmdb_ci"."operational")
>      FROM
>        "cmdb_program_daily_usage" "usage"
>      LEFT OUTER JOIN
>        "cmdb_program_instance" "p" ON "p"."id" =
> "usage"."program_instance" LEFT OUTER JOIN
>        "cmdb_ci_spkg" "s" ON "s"."id" = "p"."spkg"
>      LEFT OUTER JOIN
>        "cmdb_ci" "cmdb_ci" ON "s"."id" = "cmdb_ci"."id"
>      LEFT OUTER JOIN
>        "cmdb_software_product_version" "spv" ON "spv"."id" =
> "s"."software" WHERE ("usage"."minutes_in_use" > 0)
>        AND ((NOT ("s"."software" IS NULL))
>             AND ((NOT ("s"."os" = TRUE))
>                  OR ("s"."os" IS NULL)))
>        AND ("usage"."usage_date" >= quarter.start_date)
>        AND ("usage"."usage_date" < quarter.end_date)
>      GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
>      HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
> (coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
> (coalesce(sum("usage"."times_started"), 0) > 0)
>      ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

create temporary table foobar as
select
    <unaggregated, un-coalesced data>
from
    <join from hell, above, sans group by>
;

This isolates the lock time to performing the bare select, after
which you can coalesce and sum to your heart's content without
locking any of it.

The point is performing the absolute minimum of processing to
generate the temp table so as to release any locks quickly and
avoid "group by" in the main join.

Yes, this might end up creating a large-ish temp table :-)

One other approach would be selecting only incremental data
(e.g., daily) which locks a much smaller subset of the rows
and aggregating the daily totals into quarterly, whatever.
Call it daily usage, select where usage_date = today's or
timestamp && a tstzrange of ( 0000, 2400, [) ). That might
also simplify your query logic: all the coalesce op's end up
in your daily/weekly/monthly/whatever summary, the quarterly
values in the reporting are just sum X group by.
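
Something along these lines (a sketch - the daily table and the quarter
bounds are made up):

    -- daily step: small slice of rows, short-lived locks
    insert into cmdb_sp_usage_daily
        (usage_date, product, used_from, "user", keystrokes)
    select u.usage_date, spv.product, u.used_from, u."user",
           coalesce(sum(u.keystrokes), 0)
    from cmdb_program_daily_usage u
    join cmdb_program_instance p on p.id = u.program_instance
    join cmdb_ci_spkg s on s.id = p.spkg
    join cmdb_software_product_version spv on spv.id = s.software
    where u.usage_date = current_date
    group by 1, 2, 3, 4;

    -- quarterly report: just re-sum the daily rows
    select product, used_from, "user", sum(keystrokes)
    from cmdb_sp_usage_daily
    where usage_date >= date '2018-01-01' and usage_date < date '2018-04-01'
    group by 1, 2, 3;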

--
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508


Re: Long running INSERT+SELECT query

From
Steven Lembark
Date:
On Thu, 26 Apr 2018 23:32:33 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> The aggregated table has hundreds of millions of rows, and the query 
> runs for many hours (which is one of the reasons why it's better not
> to fail). I really doubt that row level locking would work. That
> would be a lot of RAM just to hold all the locks.

All the more reason to use temp tables.

-- 
Steven Lembark                                       1505 National Ave
Workhorse Computing                                 Rockford, IL 61103
lembark@wrkhors.com                                    +1 888 359 3508


Re: Long running INSERT+SELECT query

From
Vitaliy Garnashevich
Date:
Thanks everybody for the suggestions!

We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of 
the tables which are referenced by the results, before running the big 
query. That should be up to a million rows in total. It will probably 
not cover the case when a record is INSERT'ed and then DELETE'd after 
the calculation has begun, but such cases should be even rarer than 
the DELETEs we're currently facing.
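
Roughly like this, in the same transaction as the big INSERT (a sketch; the
real list of tables is whatever the FK constraints on cmdb_sp_usage_history
point to):

     BEGIN;
     SELECT 1 FROM cmdb_software_product_version FOR KEY SHARE;
     SELECT 1 FROM cmdb_ci_spkg FOR KEY SHARE;
     -- ... one such statement per referenced table ...
     -- ... then the big INSERT ... SELECT runs here ...
     COMMIT;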

If this will not help, we'll also try to use triggers, as mentioned 
here: https://stackoverflow.com/a/39828119

Regards,
Vitaliy


Re: Long running INSERT+SELECT query

From
Steven Lembark
Date:
On Fri, 27 Apr 2018 19:38:15 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of 
> the tables which are referenced by the results, before running the big 
> query. That should be up to a million rows in total. It will probably 
> not cover the case when a record is INSERT'ed and then DELETE'd after 
> the calculation has begun, but such cases should be even rarer than 
> the DELETEs we're currently facing.

Think about using a couple of Materialized Views for the worst 
part of it.


-- 
Steven Lembark                                         1505 National Ave
Workhorse Computing                                   Rockford, IL 61103
lembark@wrkhors.com                                      +1 888 359 3508


Re: Long running INSERT+SELECT query

From
Tim Cross
Date:
Steven Lembark <lembark@wrkhors.com> writes:

> On Fri, 27 Apr 2018 19:38:15 +0300
> Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
>
>> We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of 
>> the tables which are referenced by the results, before running the big 
>> query. That should be up to a million rows in total. It will probably 
>> not cover the case when a record is INSERT'ed and then DELETE'd after 
>> the calculation has begun, but such cases should be even rarer than 
>> the DELETEs we're currently facing.
>
> Think about using a couple of Materialized Views for the worst 
> part of it.

+1 re: materialised views - I have found them to be extremely useful for
situations where you want a snapshot of data and need to present it in a
way which is easier to process, especially when the underlying data is
changing faster than your reporting process can generate the report.  
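
A minimal sketch (names are placeholders; the view body would be the
expensive join, without the GROUP BY):

    create materialized view usage_snapshot_mv as
    select d.grp_col, d.val
    from detail_table d
    join referenced_table r on r.id = d.ref_id;

    -- take a fresh snapshot before each reporting run
    refresh materialized view usage_snapshot_mv;

    -- report from the snapshot, which no longer changes underneath you
    select grp_col, sum(val) from usage_snapshot_mv group by grp_col;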

-- 
Tim Cross