Discussion: Long-running INSERT+SELECT query
Hi,

We're trying to populate a table with aggregated data from other tables.
For that we're running a huge INSERT+SELECT query which joins several
tables, aggregates values, and then inserts the results into another
table. The problem we're facing is that while the query is running,
some records in tables referenced by the results table may be deleted,
which causes the following error:

ERROR: insert or update on table "..." violates foreign key constraint "..."
DETAIL: Key (...)=(...) is not present in table "...".

How do we make sure that such an aggregating query does not fail?

Regards,
Vitaliy
On 04/26/2018 09:13 AM, Vitaliy Garnashevich wrote:
> Hi,
>
> We're trying to populate a table with aggregated data from other tables.
> For that we're running a huge INSERT+SELECT query which joins several
> tables, aggregates values, and then inserts the results into another
> table. The problem we're facing is that while the query is running,
> some records in tables referenced by the results table may be deleted,
> which causes the following error:
>
> ERROR: insert or update on table "..." violates foreign key constraint "..."
> DETAIL: Key (...)=(...) is not present in table "...".

SELECT .. FOR UPDATE?:

https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE
https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS

> How do we make sure that such an aggregating query does not fail?
>
> Regards,
> Vitaliy

--
Adrian Klaver
adrian.klaver@aklaver.com
> SELECT .. FOR UPDATE?:
>
> https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE
>
> https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS

It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR KEY
SHARE cannot be specified with GROUP BY."

Regards,
Vitaliy
On 04/26/2018 10:45 AM, Vitaliy Garnashevich wrote:
>> SELECT .. FOR UPDATE?:
>>
>> https://www.postgresql.org/docs/10/static/sql-select.html#SQL-FOR-UPDATE-SHARE
>>
>> https://www.postgresql.org/docs/10/static/explicit-locking.html#LOCKING-ROWS
>
> It says: "Currently, FOR NO KEY UPDATE, FOR UPDATE, FOR SHARE and FOR
> KEY SHARE cannot be specified with GROUP BY."

Without the query we are flying blind, so suggestions will have a ?

> Regards,
> Vitaliy

--
Adrian Klaver
adrian.klaver@aklaver.com
> Without the query we are flying blind, so suggestions will have a ?

Here is one such query:

INSERT INTO cmdb_sp_usage_history
    (created_by, updated_by, created_on, updated_on, mod_count,
     summary_on, quarter, product, used_from, "user",
     keystrokes, minutes_in_use, times_started, avg_keystrokes,
     max_keystrokes, spkg_operational)
SELECT
    2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
    CURRENT_TIMESTAMP, quarter.id, "spv"."product",
    "usage"."used_from", "usage"."user",
    coalesce(sum("usage"."keystrokes"), 0),
    coalesce(sum("usage"."minutes_in_use"), 0),
    coalesce(sum("usage"."times_started"), 0),
    coalesce(avg("usage"."keystrokes"), 0),
    coalesce(max("usage"."keystrokes"), 0),
    bool_or("cmdb_ci"."operational")
FROM
    "cmdb_program_daily_usage" "usage"
    LEFT OUTER JOIN "cmdb_program_instance" "p"
        ON "p"."id" = "usage"."program_instance"
    LEFT OUTER JOIN "cmdb_ci_spkg" "s"
        ON "s"."id" = "p"."spkg"
    LEFT OUTER JOIN "cmdb_ci" "cmdb_ci"
        ON "s"."id" = "cmdb_ci"."id"
    LEFT OUTER JOIN "cmdb_software_product_version" "spv"
        ON "spv"."id" = "s"."software"
WHERE ("usage"."minutes_in_use" > 0)
    AND ((NOT ("s"."software" IS NULL))
         AND ((NOT ("s"."os" = TRUE)) OR ("s"."os" IS NULL)))
    AND ("usage"."usage_date" >= quarter.start_date)
    AND ("usage"."usage_date" < quarter.end_date)
GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0)
    OR (coalesce(sum("usage"."minutes_in_use"), 0) > 0)
    OR (coalesce(sum("usage"."times_started"), 0) > 0)
ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

Regards,
Vitaliy
On 04/26/2018 11:08 AM, Vitaliy Garnashevich wrote:
>> Without the query we are flying blind, so suggestions will have a ?
>
> Here is one such query:
>
> INSERT INTO cmdb_sp_usage_history
>     (created_by, updated_by, created_on, updated_on, mod_count,
>      summary_on, quarter, product, used_from, "user",
>      keystrokes, minutes_in_use, times_started, avg_keystrokes,
>      max_keystrokes, spkg_operational)
> SELECT
>     2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
>     CURRENT_TIMESTAMP, quarter.id, "spv"."product",
>     "usage"."used_from", "usage"."user",
>     coalesce(sum("usage"."keystrokes"), 0),
>     coalesce(sum("usage"."minutes_in_use"), 0),
>     coalesce(sum("usage"."times_started"), 0),
>     coalesce(avg("usage"."keystrokes"), 0),
>     coalesce(max("usage"."keystrokes"), 0),
>     bool_or("cmdb_ci"."operational")
> FROM
>     "cmdb_program_daily_usage" "usage"
>     LEFT OUTER JOIN "cmdb_program_instance" "p"
>         ON "p"."id" = "usage"."program_instance"
>     LEFT OUTER JOIN "cmdb_ci_spkg" "s"
>         ON "s"."id" = "p"."spkg"
>     LEFT OUTER JOIN "cmdb_ci" "cmdb_ci"
>         ON "s"."id" = "cmdb_ci"."id"
>     LEFT OUTER JOIN "cmdb_software_product_version" "spv"
>         ON "spv"."id" = "s"."software"
> WHERE ("usage"."minutes_in_use" > 0)
>     AND ((NOT ("s"."software" IS NULL))
>          AND ((NOT ("s"."os" = TRUE)) OR ("s"."os" IS NULL)))
>     AND ("usage"."usage_date" >= quarter.start_date)
>     AND ("usage"."usage_date" < quarter.end_date)
> GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
> HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0)
>     OR (coalesce(sum("usage"."minutes_in_use"), 0) > 0)
>     OR (coalesce(sum("usage"."times_started"), 0) > 0)
> ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

Have not worked through all of the above, but a first draft suggestion:

Move the SELECT minus the aggregation functions into a sub-query that
uses FOR UPDATE. Then do the aggregation on the results of the sub-query.

> Regards,
> Vitaliy

--
Adrian Klaver
adrian.klaver@aklaver.com
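[Editor's note: the sub-query workaround suggested above could be sketched roughly like this. This is a minimal sketch with simplified, hypothetical table and column names, not the real query; note also that in the real query the row locks could only be taken on the non-nullable side of the outer joins, since PostgreSQL does not allow locking clauses on the nullable side of an outer join.]

```sql
-- Sketch: take row locks in a sub-query (where FOR ... SHARE is allowed),
-- then aggregate over the locked rows in the outer query, since
-- FOR KEY SHARE cannot be combined with GROUP BY directly.
INSERT INTO usage_summary (product, total_minutes)
SELECT product, sum(minutes_in_use)
FROM (
    SELECT spv.product, u.minutes_in_use
    FROM daily_usage u
    JOIN software_product_version spv ON spv.id = u.product_version
    FOR KEY SHARE OF spv   -- blocks DELETEs of the referenced rows
) AS locked_rows
GROUP BY product;
```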
> Have not worked through all of the above, but a first draft suggestion:
>
> Move the SELECT minus the aggregation functions into a sub-query that
> uses FOR UPDATE. Then do the aggregation on the results of the sub-query.

The aggregated table has hundreds of millions of rows, and the query runs
for many hours (which is one of the reasons why it's better not to fail).
I really doubt that row level locking would work. That would be a lot of
RAM just to hold all the locks.

On the other hand, I don't see a FOR KEY SHARE kind of lock at the table
level, so that the query would try not to block most other existing
activity (e.g. SELECTs, UPDATEs).

Maybe this could be solved by calculating results into a temporary table,
which would not check foreign key constraints, and then copying the data
into the actual results table, while checking each row for FK consistency
and skipping it if necessary. But then I don't think it would be possible
for my transaction to see row deletions which other transactions have
done, and to check row existence (the transaction is there because the
whole thing is implemented as a DO statement with some local variables).

Thoughts?

Regards,
Vitaliy
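[Editor's note: the temp-table-plus-skip idea above would look roughly like this. A sketch only, with simplified hypothetical columns; whether the EXISTS re-check is sound is exactly the concern raised in this message, since a referenced row could still be deleted between the check and the insert.]

```sql
-- Sketch: compute the aggregation into a temp table (no FK checks apply),
-- then copy across, skipping rows whose referenced key has since vanished.
CREATE TEMPORARY TABLE tmp_sp_usage AS
SELECT s.software AS product, sum(u.minutes_in_use) AS minutes_in_use
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s ON s.id = p.spkg
GROUP BY s.software;

INSERT INTO cmdb_sp_usage_history (product, minutes_in_use)
SELECT t.product, t.minutes_in_use
FROM tmp_sp_usage t
WHERE EXISTS (SELECT 1
              FROM cmdb_software_product_version spv
              WHERE spv.id = t.product);  -- skip rows whose FK target is gone
```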
> The aggregated table has hundreds of millions of rows, and the query
> runs for many hours (which is one of the reasons why it's better not
> to fail)

[...]

> Maybe this could be solved by calculating results into a temporary
> table, which would not check foreign key constraints, and then copy the
> data into the actual results table, while checking each row for FK
> consistency and skipping if necessary.

Me, I'd try very hard to design things so the final calculation goes into
an actual results table that omits FK constraints, and wouldn't try to
"skip if necessary". You are already running an hours-long query; the
users of said information need to understand that what they are seeing
does not reflect changes in the subsequent hour(s) since it started, and
that certain related records no longer being present doesn't detract from
the fact that they were present "back then" and thus represent valid data
at that point in time.

David J.
Vitaliy Garnashevich wrote:
>> Have not worked through all of the above, but a first draft suggestion:
>>
>> Move the SELECT minus the aggregation functions into a sub-query that
>> uses FOR UPDATE. Then do the aggregation on the results of the
>> sub-query.
>
> The aggregated table has hundreds of millions of rows, and the query runs
> for many hours (which is one of the reasons why it's better not to fail). I
> really doubt that row level locking would work. That would be a lot of RAM
> just to hold all the locks.

Row locks are not stored in memory. Of course, a FOR KEY SHARE lock
would block DELETEs that try to remove the locked row.

I think your proposed strategy of trying to merge what other processes
did while you were copying is very problematic.

--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
On 04/26/2018 01:32 PM, Vitaliy Garnashevich wrote:
>> Have not worked through all of the above, but a first draft suggestion:
>>
>> Move the SELECT minus the aggregation functions into a sub-query that
>> uses FOR UPDATE. Then do the aggregation on the results of the sub-query.
>
> The aggregated table has hundreds of millions of rows, and the query
> runs for many hours (which is one of the reasons why it's better not to
> fail). I really doubt that row level locking would work. That would be a
> lot of RAM just to hold all the locks.
>
> On the other hand, I don't see something like FOR KEY SHARE kind of
> locks at table level, so that the query would try not to block most of
> other existing activity (e.g. SELECTs, UPDATEs).
>
> Maybe this could be solved by calculating results into a temporary
> table, which would not check foreign key constraints, and then copy the
> data into the actual results table, while checking each row for FK
> consistency and skipping if necessary. But then I don't think it would
> be possible for my transaction to see row deletions which other
> transactions have done, and to check row existence (the transaction is
> there, because the whole thing is implemented as a DO statement with
> some local variables).
>
> Thoughts?

The procedure seems to be fighting itself. There is an inherent conflict
between trying to keep up with data changes and presenting a consistent
result. Keeping up means constantly updating the aggregation
calculations, which in turn means the result will be continually
changing.

As David and Alvaro have also suggested, your best bet is to pick a
point in time and work off that.

> Regards,
> Vitaliy

--
Adrian Klaver
adrian.klaver@aklaver.com
On Thu, 26 Apr 2018 19:13:17 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> We're trying to populate a table with aggregated data from other
> tables. For that we're running a huge INSERT+SELECT query which joins
> several tables, aggregates values, and then inserts the results into
> another table. The problem we're facing is that while the query is
> running, some records in tables referenced by the results table may
> be deleted, which causes the following error:
>
> ERROR: insert or update on table "..." violates foreign key
> constraint "..."
> DETAIL: Key (...)=(...) is not present in table "...".
>
> How do we make sure that such an aggregating query does not fail?

Create a temporary table with a useful subset of the data. You can
select the minimum number of columns joined and release the locks. This
can also help large queries by giving you a stable snapshot of the data
for repeated queries.

I usually find that pre-joining the tables is easier because temp
tables have restrictions on re-use within the query, and also usually
don't have indexes to speed up the joins.

If you are going to run this, say, daily it's easy enough to create a
view and just "create temporary table foo as select * from bar" for
some collection of views and go from there. This makes it easier to
tune the queries on the back end without having to hack the front end
code.

--
Steven Lembark                                     1505 National Ave
Workhorse Computing                               Rockford, IL 61103
lembark@wrkhors.com                                 +1 888 359 3508
On Thu, 26 Apr 2018 21:08:01 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> INSERT INTO cmdb_sp_usage_history
>     (created_by, updated_by, created_on, updated_on, mod_count,
>      summary_on, quarter, product, used_from, "user",
>      keystrokes, minutes_in_use, times_started, avg_keystrokes,
>      max_keystrokes, spkg_operational)
> SELECT
>     2, 2, CURRENT_TIMESTAMP, CURRENT_TIMESTAMP, 1,
>     CURRENT_TIMESTAMP, quarter.id, "spv"."product",
>     "usage"."used_from", "usage"."user",
>     coalesce(sum("usage"."keystrokes"), 0),
>     coalesce(sum("usage"."minutes_in_use"), 0),
>     coalesce(sum("usage"."times_started"), 0),
>     coalesce(avg("usage"."keystrokes"), 0),
>     coalesce(max("usage"."keystrokes"), 0),
>     bool_or("cmdb_ci"."operational")
> FROM
>     "cmdb_program_daily_usage" "usage"
>     LEFT OUTER JOIN "cmdb_program_instance" "p"
>         ON "p"."id" = "usage"."program_instance"
>     LEFT OUTER JOIN "cmdb_ci_spkg" "s"
>         ON "s"."id" = "p"."spkg"
>     LEFT OUTER JOIN "cmdb_ci" "cmdb_ci"
>         ON "s"."id" = "cmdb_ci"."id"
>     LEFT OUTER JOIN "cmdb_software_product_version" "spv"
>         ON "spv"."id" = "s"."software"
> WHERE ("usage"."minutes_in_use" > 0)
>     AND ((NOT ("s"."software" IS NULL))
>          AND ((NOT ("s"."os" = TRUE)) OR ("s"."os" IS NULL)))
>     AND ("usage"."usage_date" >= quarter.start_date)
>     AND ("usage"."usage_date" < quarter.end_date)
> GROUP BY "spv"."product", "usage"."used_from", "usage"."user"
> HAVING (coalesce(sum("usage"."keystrokes"), 0) > 0) OR
>     (coalesce(sum("usage"."minutes_in_use"), 0) > 0) OR
>     (coalesce(sum("usage"."times_started"), 0) > 0)
> ORDER BY "spv"."product", "usage"."used_from", "usage"."user";

create temporary table foobar as
select <unaggregated, un-coalesced data>
from <join from hell, above, sans group by>
;

This isolates the lock time to performing the bare select, after which
you can coalesce and sum to your heart's content without locking any
of it.
The point is performing the absolute minimum of processing to generate
the temp table, so as to release any locks quickly and avoid "group by"
in the main join. Yes, this might end up creating a large-ish temp
table :-)

One other approach would be selecting only incremental data (e.g.,
daily), which locks a much smaller subset of the rows, and aggregating
the daily totals into quarterly, whatever. Call it daily usage, select
where usage_date = today's, or timestamp && a tstzrange of
( 0000, 2400, [) ).

That might also simplify your query logic: all the coalesce op's end up
in your daily/weekly/monthly/whatever summary, and the quarterly values
in the reporting are just sum X group by.

--
Steven Lembark                                     1505 National Ave
Workhorse Computing                               Rockford, IL 61103
lembark@wrkhors.com                                 +1 888 359 3508
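[Editor's note: the incremental approach above could be sketched like this, assuming a hypothetical daily_usage_summary table and simplified columns.]

```sql
-- Sketch: summarize one day at a time into a small daily table; each run
-- touches (and implicitly snapshots) only that day's rows.
INSERT INTO daily_usage_summary (usage_date, product, minutes_in_use)
SELECT u.usage_date, s.software, coalesce(sum(u.minutes_in_use), 0)
FROM cmdb_program_daily_usage u
JOIN cmdb_program_instance p ON p.id = u.program_instance
JOIN cmdb_ci_spkg s ON s.id = p.spkg
WHERE u.usage_date = current_date - 1   -- yesterday's increment only
GROUP BY u.usage_date, s.software;

-- Quarterly reporting then becomes a cheap rollup of the daily rows:
SELECT product, sum(minutes_in_use)
FROM daily_usage_summary
WHERE usage_date >= date '2018-01-01' AND usage_date < date '2018-04-01'
GROUP BY product;
```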
On Thu, 26 Apr 2018 23:32:33 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> The aggregated table has hundreds of millions of rows, and the query
> runs for many hours (which is one of the reasons why it's better not
> to fail). I really doubt that row level locking would work. That
> would be a lot of RAM just to hold all the locks.

All the more reason to use temp tables.

--
Steven Lembark                                     1505 National Ave
Workhorse Computing                               Rockford, IL 61103
lembark@wrkhors.com                                 +1 888 359 3508
Thanks everybody for the suggestions!

We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
the tables which are referenced by the results, before running the big
query. That should be up to a million rows in total. It will probably
not cover the case when a record is INSERT'ed and then DELETE'd after
the calculation has begun, but such cases should be even more rare than
the DELETEs we're currently facing.

If this does not help, we'll also try to use triggers, as mentioned here:
https://stackoverflow.com/a/39828119

Regards,
Vitaliy
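[Editor's note: the plan above would look roughly like this; the specific table names are assumptions for illustration, not the full list from the real schema.]

```sql
BEGIN;

-- Take lightweight KEY SHARE row locks on the referenced tables up
-- front. FOR KEY SHARE blocks DELETEs and key-changing UPDATEs of the
-- locked rows, but not ordinary non-key UPDATEs or SELECTs.
SELECT 1 FROM cmdb_software_product_version FOR KEY SHARE;
SELECT 1 FROM cmdb_ci FOR KEY SHARE;

-- ...then run the big aggregating INSERT+SELECT here, inside the same
-- transaction, so the locks are held until COMMIT...

COMMIT;
```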
On Fri, 27 Apr 2018 19:38:15 +0300
Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:

> We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
> the tables, which are referenced by results, before running the big
> query. That should be up to a million of rows in total. It will probably
> not cover the case when a record is INSERT'ed and then DELETE'd after
> the calculation has begun, but such cases should be even more rare than
> the DELETE's we're currently facing.

Think about using a couple of Materialized Views for the worst part of
it.

--
Steven Lembark                                     1505 National Ave
Workhorse Computing                               Rockford, IL 61103
lembark@wrkhors.com                                 +1 888 359 3508
Steven Lembark <lembark@wrkhors.com> writes:

> On Fri, 27 Apr 2018 19:38:15 +0300
> Vitaliy Garnashevich <vgarnashevich@gmail.com> wrote:
>
>> We're going to try using "SELECT 1 FROM table FOR KEY SHARE" for each of
>> the tables, which are referenced by results, before running the big
>> query. That should be up to a million of rows in total. It will probably
>> not cover the case when a record is INSERT'ed and then DELETE'd after
>> the calculation has begun, but such cases should be even more rare than
>> the DELETE's we're currently facing.
>
> Think about using a couple of Materialized Views for the worst part
> of it.

+1 re: materialised views - I have found them to be extremely useful for
situations where you want a snapshot of data and need to present it in a
way which is easier to process, especially when the underlying data is
changing faster than your reporting process can generate the report.

--
Tim Cross
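[Editor's note: the materialized-view approach could be sketched like this, reusing the join from the thread's query; the view name and the unique index needed for CONCURRENTLY are assumptions.]

```sql
-- Sketch: snapshot the expensive join once into a materialized view,
-- report from the snapshot, and refresh on whatever schedule suits.
CREATE MATERIALIZED VIEW mv_daily_usage AS
SELECT spv.product, u.used_from, u."user", u.usage_date,
       u.keystrokes, u.minutes_in_use, u.times_started
FROM cmdb_program_daily_usage u
LEFT JOIN cmdb_program_instance p ON p.id = u.program_instance
LEFT JOIN cmdb_ci_spkg s ON s.id = p.spkg
LEFT JOIN cmdb_software_product_version spv ON spv.id = s.software
WHERE u.minutes_in_use > 0;

-- CONCURRENTLY lets readers keep querying the old snapshot while the
-- refresh runs; it requires a unique index on the view (assumed here).
REFRESH MATERIALIZED VIEW CONCURRENTLY mv_daily_usage;
```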