Discussion: Out Of Memory 8.1
Hi All,

I am having problems with a query on 8.1 running on RHEL 5.4.

16GB RAM
Linux pgsql3 2.6.18-164.el5PAE #1 SMP Tue Aug 18 15:59:11 EDT 2009 i686 i686 i386 GNU/Linux
2 x Xeon X5650 (2.67GHz 6 Cores)
Disks are on a PERC 6 controller in RAID 10

postgresql.conf:

    # - Memory -
    shared_buffers = 327680
    work_mem = 98394
    maintenance_work_mem = 983040

The query:

    2011-04-06 23:23:03 BST <app.charisma><26083>LOG:  statement:
    CREATE TABLE medusa.cmi_group_free AS
    SELECT stkl_stockno,
           (SUM(stkphys) - (SUM(stkalloc) + SUM(stkbacko))) AS group_free
    FROM charisma.sk_stklfl
    WHERE stkl_stockno NOT LIKE 'ZZ%'
      AND stkl_stockno ~ '^([A-z&]{2,3})([0-9][0-9][0-9])([0-9][0-9][0-9][0-9][A-z0-9]{1,3})$'
    GROUP BY stkl_stockno;

The explain (cannot explain analyze, or Postgres runs out of memory again):

    HashAggregate  (cost=2731947.55..2731947.57 rows=1 width=38)
      ->  Seq Scan on stkl_rec  (cost=0.00..2731947.54 rows=1 width=38)
            Filter: (((stkl_comp)::text = 'A'::text) AND
              ((stkl_stockno)::text ~ '^(TK[A-Za-z0-9][0-9]{3}(?:[0-9]{5}(?:[0-9]{3,4})?)?|NSP[0-9]{3}([0-9]{4})?|Z[MZ][0-9]{8,9}|LSP[0-9]{7}[A-Za-z0-9]|[A-Z][A-Z&][A-Z][0-9]{7}[A-Z0-9])$'::text) AND
              ((stkl_stockno)::text !~~ 'ZZ%'::text) AND
              ((stkl_stockno)::text ~ '^([A-z&]{2,3})([0-9][0-9][0-9])([0-9][0-9][0-9][0-9][A-z0-9]{1,3})$'::text))

The table has approximately 6.9 million rows.

It's not the OOM killer that's doing this, as vm.overcommit_memory = 2 and vm.overcommit_ratio = 90, nor is postgres swapping to disk.

Can anyone help me here? The same query on exactly the same data on an older machine running the same setup, except that it's RHEL 5.0, runs fine!

Thanks.

Martin
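For reference on the settings above: 8.1 takes these values as raw counts (the unit suffixes only arrived in 8.2), with shared_buffers in 8kB pages and the work_mem settings in kB. A quick worked sketch of what the posted configuration amounts to:

    -- What the postgresql.conf values above mean in bytes on 8.1:
    SELECT 327680 * 8192 AS shared_buffers_bytes,         -- 2,684,354,560 (2.5GB)
           98394  * 1024 AS work_mem_bytes,               -- ~96MB per sort/hash
           983040 * 1024 AS maintenance_work_mem_bytes;   -- 1,006,632,960 (960MB)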
"French, Martin" <frenchm@cromwell.co.uk> writes: > I am having problems with a query on 8.1 running on > RHEL 5.4 > work_mem = 98394 > The explain (cannot explain analyze, or Postgres runs out of memory > again) > 'HashAggregate (cost=2731947.55..2731947.57 rows=1 width=38)' > ' -> Seq Scan on stkl_rec (cost=0.00..2731947.54 rows=1 width=38)' > ' Filter: (((stkl_comp)::text = 'A'::text) AND > ((stkl_stockno)::text ~ > '^(TK[A-Za-z0-9][0-9]{3}(?:[0-9]{5}(?:[0-9]{3,4})?)?|NSP[0-9]{3}([0-9]{4 > })?|Z[MZ][0-9]{8,9}|LSP[0-9]{7}[A-Za-z0-9]|[A-Z][A-Z&][A-Z][0-9]{7}[A-Z0 > -9])$'::text) AND ((stkl_stockno)::text !~~ 'ZZ%'::text) AND > ((stkl_stockno)::text ~ > '^([A-z&]{2,3})([0-9][0-9][0-9])([0-9][0-9][0-9][0-9][A-z0-9]{1,3})$'::t > ext))' Apparently the number of groups is way more than the planner expects, and so the hash table grows to exceed available memory. Kluge fixes: try reducing work_mem to discourage it from using HashAggregate. Or you could temporarily turn off enable_hashagg. A non-kluge fix would involve getting the planner to realize there are a lot of groups needed. Have you analyzed the table lately? Maybe you need to increase the statistics target for it. regards, tom lane
Thanks for the info Tom.

The table has been analyzed (somewhat repeatedly...), with the stats target set at various limits. At the moment default_statistics_target = 50.

I've had work_mem as low as 1MB and as high as 128MB, to little or no avail.

Setting enable_hashagg = off does allow the query to complete, but mashes the next query in the batch, which has to be killed at 8 hours. Setting it off for this query alone causes the next query to do exactly the same.

The thing that perplexes me is that on RHEL 5.0, with all the same settings and data, the query works fine... so it's confusing as to what can have changed so much to cause this. I know that we had to move to RHEL 5.4 because of the PERC RAID drivers, but I wouldn't have thought that would've made too much of a difference.

It might also be of interest that "VACUUM ANALYZE VERBOSE" immediately falls over with out of memory on a request of 960MB (what it's set to use...) but that "VACUUM FULL ANALYZE VERBOSE" doesn't...

I have a sneaking suspicion that this is to do with the width of the table (271 columns, mainly text), and the fact that its designer (not me...) didn't take into consideration how difficult it would be to process this amount of data on a row-by-row basis.

cheers

-----Original Message-----
From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
Sent: 07 April 2011 15:26
To: French, Martin
Cc: pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Out Of Memory 8.1

[Tom's reply quoted in full above]
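Since the failing request (960MB) is exactly what maintenance_work_mem is set to, one workaround worth a try, offered here as a guess rather than anything established in the thread, is to shrink that allocation just for the vacuum session (again a raw kB count on 8.1):

    SET maintenance_work_mem = 262144;  -- 256MB instead of the configured 960MB
    VACUUM VERBOSE ANALYZE;             -- 8.1 keyword order: VERBOSE before ANALYZE
    RESET maintenance_work_mem;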
On Thu, Apr 7, 2011 at 8:56 AM, French, Martin <frenchm@cromwell.co.uk> wrote:
> The thing that perplexes me is that on RHEL 5.0, with all the same
> settings and data, the query works fine... so it's confusing as to what
> can have changed so much to cause this. I know that we had to move to
> RHEL 5.4 because of the PERC RAID drivers, but I wouldn't have thought
> that would've made too much of a difference.

So what's the difference between the query plans on the two machines? Any difference in non-default settings in postgresql.conf between the two machines?
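One way to answer the second question without eyeballing two config files is to dump the non-default settings from each machine and diff the output; a small sketch using the pg_settings view, which is present on 8.1:

    -- Run on each machine, save the output, and diff:
    SELECT name, setting, source
    FROM pg_settings
    WHERE source <> 'default'
    ORDER BY name;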
OK - MORE development here. If I bring shared_buffers down to the equivalent of 1.7GB, it'll vacuum without a problem. Does anyone know why this might be?

Thanks in advance.

Cheers

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
Sent: 08 April 2011 08:47
To: French, Martin
Cc: Tom Lane; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Out Of Memory 8.1

[Scott's reply quoted in full above]
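A possible reason, stated here as an assumption and not something confirmed in the thread: the kernel is i686 PAE, so each backend process gets roughly 3GB of user address space no matter how much of the 16GB of RAM is free, and the shared_buffers segment is mapped into every backend alongside its private allocations. Rough arithmetic under that assumed ~3GB ceiling:

    -- Hypothetical per-process address-space budget on a 32-bit PAE kernel:
    SELECT 2.5 * 1024 + 960 AS mb_needed_before,  -- 3520MB: over a ~3GB ceiling
           1.7 * 1024 + 960 AS mb_needed_after;   -- ~2701MB: fits, so VACUUM succeeds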
Ok, so this has moved on a little from the query now...

As it stands now, without ANY data in the DB at all, taking a schema pg_dump from the old machine, porting it to the new one, and then running vacuum:

    [root@pgsql3 build]# psql -U postgres -d hermes
    Welcome to psql 8.1.9, the PostgreSQL interactive terminal.

    Type:  \copyright for distribution terms
           \h for help with SQL commands
           \? for help with psql commands
           \g or terminate with semicolon to execute query
           \q to quit

    hermes=# vacuum;
    ERROR:  out of memory
    DETAIL:  Failed on request of size 1006632960.
    hermes=#

And from the server log:

    2011-04-08 11:33:36 BST <postgres><10937>STATEMENT:  VACUUM VERBOSE
    TopMemoryContext: 55064 total in 5 blocks; 6560 free (10 chunks); 48504 used
    Operator class cache: 8192 total in 1 blocks; 4936 free (0 chunks); 3256 used
    TopTransactionContext: 8192 total in 1 blocks; 7712 free (9 chunks); 480 used
    MessageContext: 8192 total in 1 blocks; 7368 free (1 chunks); 824 used
    smgr relation table: 8192 total in 1 blocks; 3904 free (0 chunks); 4288 used
    Portal hash: 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
    PortalMemory: 8192 total in 1 blocks; 8040 free (0 chunks); 152 used
    PortalHeapMemory: 1024 total in 1 blocks; 968 free (3 chunks); 56 used
    Vacuum: 57344 total in 3 blocks; 28184 free (0 chunks); 29160 used
    Relcache by OID: 8192 total in 1 blocks; 3896 free (0 chunks); 4296 used
    CacheMemoryContext: 253952 total in 5 blocks; 13520 free (2 chunks); 240432 used
    pg_index_indrelid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_type_typname_nsp_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_type_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_statistic_relid_att_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_auth_members_member_role_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_auth_members_role_member_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_proc_proname_args_nsp_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
    pg_proc_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_operator_oprname_l_r_n_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
    pg_operator_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_opclass_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_opclass_am_name_nsp_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
    pg_namespace_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_namespace_nspname_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_language_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_language_name_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_inherits_relid_seqno_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_index_indexrelid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_authid_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_authid_rolname_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_database_datname_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_conversion_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_conversion_name_nsp_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_conversion_default_index: 1024 total in 1 blocks; 192 free (0 chunks); 832 used
    pg_class_relname_nsp_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_class_oid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    pg_cast_source_target_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_amproc_opc_proc_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
    pg_amop_opr_opc_index: 1024 total in 1 blocks; 328 free (0 chunks); 696 used
    pg_amop_opc_strat_index: 1024 total in 1 blocks; 256 free (0 chunks); 768 used
    pg_aggregate_fnoid_index: 1024 total in 1 blocks; 392 free (0 chunks); 632 used
    MdSmgr: 8192 total in 1 blocks; 8080 free (0 chunks); 112 used
    LockTable (locallock hash): 8192 total in 1 blocks; 3912 free (0 chunks); 4280 used
    Timezones: 47592 total in 2 blocks; 5968 free (0 chunks); 41624 used
    ErrorContext: 8192 total in 1 blocks; 8176 free (0 chunks); 16 used
    2011-04-08 11:39:48 BST <postgres><11119>ERROR:  out of memory
    2011-04-08 11:39:48 BST <postgres><11119>DETAIL:  Failed on request of size 1006632960.
    2011-04-08 11:39:48 BST <postgres><11119>STATEMENT:  vacuum;

I'm totally lost with this now, as I assumed it was a data corruption issue within the catalogs (hence rebuilding the DB), but obviously that's had no effect at all. Even an 8.1.23 build on this machine does exactly the same thing...

Cheers

-----Original Message-----
From: Scott Marlowe [mailto:scott.marlowe@gmail.com]
Sent: 08 April 2011 08:47
To: French, Martin
Cc: Tom Lane; pgsql-admin@postgresql.org
Subject: Re: [ADMIN] Out Of Memory 8.1

[Scott's reply quoted in full above]
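One number in that dump is worth pinning down: the failed request size is exactly the configured maintenance_work_mem, which suggests it is VACUUM's up-front allocation that won't fit, rather than anything about the (empty) catalog data. This is a reading of the numbers, not a confirmed diagnosis:

    -- The failing request equals maintenance_work_mem (983040 kB) in bytes:
    SELECT 983040 * 1024 AS maintenance_work_mem_bytes;  -- 1006632960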