Discussion: Question about memory usage

Question about memory usage

From:
Preston Hagar
Date:
tl;dr: Moved from 8.3 to 9.3 and are now getting out of memory errors despite the server now having 32 GB instead of 4 GB of RAM and the workload and number of clients remaining the same.


Details:

We have been using Postgresql for some time internally with much success.  Recently, we completed a migration off of an older server running 8.3 to a new server running 9.3.  The older server had 4GB of RAM, the new server has 32 GB. 

For some reason, since migrating we are getting lots of "out of memory" and "cannot allocate memory" errors on the new server when the server gets under a decent load.  We have upped shmmax to 17179869184 and shmall to 4194304.  
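
(For reference, a sketch of how those kernel settings were bumped via sysctl; the commands below are illustrative rather than our exact shell history:)

sysctl kernel.shmmax kernel.shmall                        # check the current values
sysctl -w kernel.shmmax=17179869184                       # raise them on the running system...
sysctl -w kernel.shmall=4194304
echo 'kernel.shmmax = 17179869184' >> /etc/sysctl.conf    # ...and persist them across reboots
echo 'kernel.shmall = 4194304' >> /etc/sysctl.conf
sysctl -p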

We had originally copied our shared_buffers, work_mem, wal_buffers and other similar settings from our old config, but after getting the memory errors have tweaked them to the following:

shared_buffers            = 7680MB          
temp_buffers              = 12MB
max_prepared_transactions = 0
work_mem                  = 80MB
maintenance_work_mem      = 1GB 
wal_buffers = 8MB
max_connections = 350

The current settings seem to have helped, but we are still occasionally getting the errors.

The weird thing is that our old server had 1/8th the RAM, was set to max_connections = 600 and had the same clients connecting in the same way to the same databases and we never saw any errors like this in the several years we have been using it.

One issue I could see is that one of our main applications that connects to the database, opens a connection on startup, holds it open the entire time it is running, and doesn't close it until the app is closed.  In daily usage, for much of our staff it is opened first thing in the morning and left open all day (meaning the connection is held open for 8+ hours).  This was never an issue with 8.3, but I know it isn't a "best practice" in general.

We are working to update our application to be able to use pgbouncer with transaction connections to try to alleviate the long held connections, but it will take some time.
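
(For what it's worth, the transaction-pooling setup we're experimenting with looks roughly like the sketch below; the host, port, auth file and pool sizes are placeholders rather than our final values:)

[databases]
; "production" is the database name that appears in the log lines later in this thread
production = host=127.0.0.1 port=5432 dbname=production

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling: a server connection is only tied up for the duration of a transaction,
; so clients can stay connected all day without each one pinning a backend
pool_mode = transaction
max_client_conn = 600
default_pool_size = 50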

In the meantime, is there some other major difference or setting in 9.3 that we should look out for that could be causing this?  Like I said, the same database with the same load and number of clients has been running on an 8.3 install for years (pretty much since 2008 when 8.3 was released) with lesser hardware with no issues.

Let me know if any other information would help out or if anyone has suggestions of things to check.

Thanks,

Preston

Re: Question about memory usage

From:
Steve Atkins
Date:
On Jan 10, 2014, at 8:35 AM, Preston Hagar <prestonh@gmail.com> wrote:

> tl;dr: Moved from 8.3 to 9.3 and are now getting out of memory errors despite the server now having 32 GB instead of
4GB of RAM and the workload and number of clients remaining the same. 
>
>
> Details:
>
> We have been using Postgresql for some time internally with much success.  Recently, we completed a migration off of
an older server running 8.3 to a new server running 9.3.  The older server had 4GB of RAM, the new server has 32 GB.
>
> For some reason, since migrating we are getting lots of "out of memory" and "cannot allocate memory" errors on the
new server when the server gets under a decent load.  We have upped shmmax to 17179869184 and shmall to 4194304.

What are the exact error messages you’re getting, and where are you seeing them?

>
> We had originally copied our shared_buffers, work_mem, wal_buffers and other similar settings from our old config,
but after getting the memory errors have tweaked them to the following:
>
> shared_buffers            = 7680MB
> temp_buffers              = 12MB
> max_prepared_transactions = 0
> work_mem                  = 80MB
> maintenance_work_mem      = 1GB
> wal_buffers = 8MB
> max_connections = 350
>
> The current settings seem to have helped, but we are still occasionally getting the errors.
>
> The weird thing is that our old server had 1/8th the RAM, was set to max_connections = 600 and had the same clients
connecting in the same way to the same databases and we never saw any errors like this in the several years we have been
using it.
>
> One issue I could see is that one of our main applications that connects to the database, opens a connection on
startup, holds it open the entire time it is running, and doesn't close it until the app is closed.  In daily usage, for
much of our staff it is opened first thing in the morning and left open all day (meaning the connection is held open for
8+ hours).  This was never an issue with 8.3, but I know it isn't a "best practice" in general.

That’s probably not related to the problems you’re seeing - I have apps that hold a connection to the database open for
years. As long as it doesn’t keep a transaction open for a long time, you’re fine.

>
> We are working to update our application to be able to use pgbouncer with transaction connections to try to alleviate
the long held connections, but it will take some time.

Using pgbouncer is probably a good idea - to reduce the number of concurrent connections, rather than the length of
connections, though.

>
> In the meantime, is there some other major difference or setting in 9.3 that we should look out for that could be
causing this?  Like I said, the same database with the same load and number of clients has been running on an 8.3 install
for years (pretty much since 2008 when 8.3 was released) with lesser hardware with no issues.
>
> Let me know if any other information would help out or if anyone has suggestions of things to check.

Cheers,
  Steve



Re: Question about memory usage

From:
Preston Hagar
Date:
On Fri, Jan 10, 2014 at 11:09 AM, Steve Atkins <steve@blighty.com> wrote:

On Jan 10, 2014, at 8:35 AM, Preston Hagar <prestonh@gmail.com> wrote:

> tl;dr: Moved from 8.3 to 9.3 and are now getting out of memory errors despite the server now having 32 GB instead of 4 GB of RAM and the workload and number of clients remaining the same.
>
>
> Details:
>
> We have been using Postgresql for some time internally with much success.  Recently, we completed a migration off of an older server running 8.3 to a new server running 9.3.  The older server had 4GB of RAM, the new server has 32 GB.
>
> For some reason, since migrating we are getting lots of "out of memory" and "cannot allocate memory" errors on the new server when the server gets under a decent load.  We have upped shmmax to 17179869184 and shmall to 4194304.

What are the exact error messages you’re getting, and where are you seeing them?


Thanks for the reply.  I'm seeing them in the main postgresql log (/var/log/postgresql/postgresql-9.3-main.log)

Here are a couple of examples from the incident we had this morning:

2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for connection: Cannot allocate memory
2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for connection: Cannot allocate memory


TopMemoryContext: 160408 total in 19 blocks; 12984 free (41 chunks); 147424 used
  TopTransactionContext: 8192 total in 1 blocks; 7392 free (0 chunks); 800 used
  Btree proof lookup cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  TableSpace cache: 8192 total in 1 blocks; 3216 free (0 chunks); 4976 used
  Type information cache: 24240 total in 2 blocks; 3744 free (0 chunks); 20496 used
  Operator lookup cache: 24576 total in 2 blocks; 11888 free (5 chunks); 12688 used
  MessageContext: 524288 total in 7 blocks; 225240 free (3 chunks); 299048 used
  Operator class cache: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  smgr relation table: 57344 total in 3 blocks; 34320 free (10 chunks); 23024 used
  TransactionAbortContext: 32768 total in 1 blocks; 32736 free (0 chunks); 32 used
  Portal hash: 8192 total in 1 blocks; 1680 free (0 chunks); 6512 used
  PortalMemory: 8192 total in 1 blocks; 7888 free (0 chunks); 304 used
    PortalHeapMemory: 1024 total in 1 blocks; 768 free (0 chunks); 256 used
      ExecutorState: 329080 total in 8 blocks; 105944 free (4 chunks); 223136 used
        TupleSort: 32816 total in 2 blocks; 176 free (2 chunks); 32640 used
        printtup: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 8192 total in 1 blocks; 8128 free (3 chunks); 64 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
        ExprContext: 8192 total in 1 blocks; 7896 free (0 chunks); 296 used
        ExprContext: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  Relcache by OID: 24576 total in 2 blocks; 11792 free (3 chunks); 12784 used
  CacheMemoryContext: 9815680 total in 25 blocks; 8143416 free (1 chunks); 1672264 used
    pg_toast_3598032_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    mls_office_uid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    staff_primary: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    staff_desk_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_primary: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_person_workphone_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_person_mobilephone_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_person_lastname_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_person_homephone_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_person_homeofficephone_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    person_person_firstname_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_primary: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_membertype_id: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_floydtraining_id: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_agent_sync_realtorid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_agent_sync_oagentid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_agent_personid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    agent_agent_license_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_toast_2619_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    mls_agent_uid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_uidprp_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_uid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_streetnum_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_streetname: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_sqfttotal_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_ownername_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_officelist_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_modified: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_mlsnumber_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_liststatus_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_listprice_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_city_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_beds_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_res_bathsfull_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_subdivision: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_zipcode: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_uidprp: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_uid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_priority: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_officelist: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_modified: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_mlsnum: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_longitude_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_liststatus: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_listprice: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_latitude_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_ishud_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_city_upcase_idx: 3072 total in 2 blocks; 1848 free (0 chunks); 1224 used
    mls_prp_city: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_agentlist2: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_prp_agentlist: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_media_uid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_media_tableuid: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_media_mediasource_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    mls_media_custompic_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_primary: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_sync_ilresid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_streetnumber_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_streetname_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_mlsnumber_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_longitude_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_listingid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_latitude_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_brokerageid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listing_agentid_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    listing_listdate_idx: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    code_primary: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    class_primary: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_index_indrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_attrdef_adrelid_adnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_db_role_setting_databaseid_rol_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_opclass_am_name_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_foreign_data_wrapper_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_enum_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_class_relname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_foreign_server_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_statistic_relid_att_inh_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_cast_source_target_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_language_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_collation_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_amop_fam_strat_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_index_indexrelid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_ts_template_tmplname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_ts_config_map_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
    pg_opclass_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_foreign_data_wrapper_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_event_trigger_evtname_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_ts_dict_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_event_trigger_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_conversion_default_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
    pg_operator_oprname_l_r_n_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_trigger_tgrelid_tgname_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_enum_typid_label_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_ts_config_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_user_mapping_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_opfamily_am_name_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
    pg_foreign_table_relid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_type_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_aggregate_fnoid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_constraint_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_rewrite_rel_rulename_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_ts_parser_prsname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_ts_config_cfgname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_ts_parser_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_operator_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_namespace_nspname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_ts_template_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_amop_opr_fam_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_default_acl_role_nsp_obj_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
    pg_collation_name_enc_nsp_index: 3072 total in 2 blocks; 1784 free (2 chunks); 1288 used
    pg_range_rngtypid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_ts_dict_dictname_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_type_typname_nsp_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_opfamily_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_class_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_proc_proname_args_nsp_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_attribute_relid_attnum_index: 1024 total in 1 blocks; 16 free (0 chunks); 1008 used
    pg_proc_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_language_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_namespace_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_amproc_fam_proc_index: 3072 total in 2 blocks; 1736 free (2 chunks); 1336 used
    pg_foreign_server_name_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_attribute_relid_attnam_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_conversion_oid_index: 1024 total in 1 blocks; 200 free (0 chunks); 824 used
    pg_user_mapping_user_server_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_conversion_name_nsp_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_authid_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_auth_members_member_role_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_tablespace_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_database_datname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_auth_members_role_member_index: 1024 total in 1 blocks; 64 free (0 chunks); 960 used
    pg_database_oid_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
    pg_authid_rolname_index: 1024 total in 1 blocks; 152 free (0 chunks); 872 used
  MdSmgr: 8192 total in 1 blocks; 5440 free (0 chunks); 2752 used
  ident parser context: 0 total in 0 blocks; 0 free (0 chunks); 0 used
  hba parser context: 15360 total in 4 blocks; 6544 free (2 chunks); 8816 used
  LOCALLOCK hash: 24576 total in 2 blocks; 13920 free (4 chunks); 10656 used
  Timezones: 83472 total in 2 blocks; 3744 free (0 chunks); 79728 used
  ErrorContext: 8192 total in 1 blocks; 8160 free (0 chunks); 32 used
2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production  10.1.1.6(36680)ERROR:  out of memory
2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production  10.1.1.6(36680)DETAIL:  Failed on request of size 500.



After some of the connections cleared up (we load a lot of data around that time), things got "back to normal".  Our Nagios monitoring showed a window of 30 minutes or so during which no new connections could be made, and the log for that period is filled with more or less the same lines.


 
>
> We had originally copied our shared_buffers, work_mem, wal_buffers and other similar settings from our old config, but after getting the memory errors have tweaked them to the following:
>
> shared_buffers            = 7680MB
> temp_buffers              = 12MB
> max_prepared_transactions = 0
> work_mem                  = 80MB
> maintenance_work_mem      = 1GB
> wal_buffers = 8MB
> max_connections = 350
>
> The current settings seem to have helped, but we are still occasionally getting the errors.
>
> The weird thing is that our old server had 1/8th the RAM, was set to max_connections = 600 and had the same clients connecting in the same way to the same databases and we never saw any errors like this in the several years we have been using it.
>
> One issue I could see is that one of our main applications that connects to the database, opens a connection on startup, holds it open the entire time it is running, and doesn't close it until the app is closed.  In daily usage, for much of our staff it is opened first thing in the morning and left open all day (meaning the connection is held open for 8+ hours).  This was never an issue with 8.3, but I know it isn't a "best practice" in general.

That’s probably not related to the problems you’re seeing - I have apps that hold a connection to the database open for years. As long as it doesn’t keep a transaction open for a long time, you’re fine.


Good to know, thanks.

 
>
> We are working to update our application to be able to use pgbouncer with transaction connections to try to alleviate the long held connections, but it will take some time.

Using pgbouncer is probably a good idea - to reduce the number of concurrent connections, rather than the length of connections, though.


I'm testing an updated version of our application against pgbouncer now, so hopefully that will help some.

 
>
> In the meantime, is there some other major difference or setting in 9.3 that we should look out for that could be causing this?  Like I said, the same database with the same load and number of clients has been running on an 8.3 install for years (pretty much since 2008 when 8.3 was released) with lesser hardware with no issues.
>
> Let me know if any other information would help out or if anyone has suggestions of things to check.

Cheers,
  Steve



Thanks for your help.

Preston

 

Re: Question about memory usage

From:
Tom Lane
Date:
Preston Hagar <prestonh@gmail.com> writes:
>>> tl;dr: Moved from 8.3 to 9.3 and are now getting out of memory errors
>>> despite the server now having 32 GB instead of 4 GB of RAM and the workload
>>> and number of clients remaining the same.

> Here are a couple of examples from the incident we had this morning:
> 2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for
> connection: Cannot allocate memory
> 2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for
> connection: Cannot allocate memory

That's odd ... ENOMEM from fork() suggests that you're under system-wide
memory pressure.

> [ memory map dump showing no remarkable use of memory at all ]
> 2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production
>  10.1.1.6(36680)ERROR:  out of memory
> 2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production
>  10.1.1.6(36680)DETAIL:  Failed on request of size 500.

I think that what you've got here isn't really a Postgres issue, but
a system-level configuration issue: the kernel is being unreasonably
stingy about giving out memory, and it's not clear why.

It might be worth double-checking that the postmaster is not being
started under restrictive ulimit settings; though offhand I don't
see how that theory could account for fork-time failures, since
the ulimit memory limits are per-process.
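
For example, on Linux the limits the running postmaster actually inherited can be read straight out of /proc (a sketch; adjust the data-directory path to your install):

# on Debian/Ubuntu packages the 9.3 data directory is typically /var/lib/postgresql/9.3/main
PGDATA=/var/lib/postgresql/9.3/main
PGPID=$(head -1 "$PGDATA/postmaster.pid")   # first line of postmaster.pid is the postmaster PID
cat /proc/"$PGPID"/limits                   # "Max address space" and "Max data size" are the interesting rows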

Other than that, you need to burrow around in the kernel settings
and see if you can find something there that's limiting how much
memory it will give to Postgres.  It might also be worth watching
the kernel log when one of these problems starts.  Plain old "top"
might also be informative as to how much memory is being used.
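
Concretely, something along these lines, using nothing but stock Linux tools:

dmesg | tail -n 50            # any OOM-killer or allocation-failure messages around the incident?
free -m                       # overall RAM / swap / cache picture
ps aux --sort=-rss | head     # which processes are actually holding the most memory
                              # (or run top and press M to sort by resident memory)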

>> We had originally copied our shared_buffers, work_mem, wal_buffers and
>> other similar settings from our old config, but after getting the memory
>> errors have tweaked them to the following:
>
> shared_buffers            = 7680MB
> temp_buffers              = 12MB
> max_prepared_transactions = 0
> work_mem                  = 80MB
> maintenance_work_mem      = 1GB
> wal_buffers = 8MB
> max_connections = 350

That seems like a dangerously large work_mem for so many connections;
but unless all the connections were executing complex queries, which
doesn't sound to be the case, that isn't the immediate problem.
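
Back-of-the-envelope with the settings shown (and a single backend can use several work_mem allocations at once, one per sort/hash step, so this is if anything optimistic):

echo $(( 350 * 80 ))          # 28000 MB of work_mem if every connection ran one big sort at once
echo $(( 350 * 80 + 7680 ))   # 35680 MB, roughly 35 GB once shared_buffers is added, on a 32 GB box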

>> The weird thing is that our old server had 1/8th the RAM, was set to
>> max_connections = 600 and had the same clients connecting in the same way
>> to the same databases and we never saw any errors like this in the several
>> years we have been using it.

This reinforces the impression that something's misconfigured at the
kernel level on the new server.

            regards, tom lane


Re: Question about memory usage

From:
"Tomas Vondra"
Date:
On 10 January 2014, 19:19, Tom Lane wrote:
> Preston Hagar <prestonh@gmail.com> writes:
>>>> tl;dr: Moved from 8.3 to 9.3 and are now getting out of memory errors
>>>> despite the server now having 32 GB instead of 4 GB of RAM and the
>>>> workload
>>>> and number of clients remaining the same.
>
>> Here are a couple of examples from the incident we had this morning:
>> 2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for
>> connection: Cannot allocate memory
>> 2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for
>> connection: Cannot allocate memory
>
> That's odd ... ENOMEM from fork() suggests that you're under system-wide
> memory pressure.
>
>> [ memory map dump showing no remarkable use of memory at all ]
>> 2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production
>>  10.1.1.6(36680)ERROR:  out of memory
>> 2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production
>>  10.1.1.6(36680)DETAIL:  Failed on request of size 500.
>
> I think that what you've got here isn't really a Postgres issue, but
> a system-level configuration issue: the kernel is being unreasonably
> stingy about giving out memory, and it's not clear why.
>
> It might be worth double-checking that the postmaster is not being
> started under restrictive ulimit settings; though offhand I don't
> see how that theory could account for fork-time failures, since
> the ulimit memory limits are per-process.
>
> Other than that, you need to burrow around in the kernel settings
> and see if you can find something there that's limiting how much
> memory it will give to Postgres.  It might also be worth watching
> the kernel log when one of these problems starts.  Plain old "top"
> might also be informative as to how much memory is being used.

My bet is on overcommit - what are

    vm.overcommit_memory
    vm.overcommit_ratio

set to? Do you have swap or not? I've repeatedly run into very similar
OOM issues on machines with overcommit disabled (overcommit_memory=2) and
no swap. There was plenty of RAM available (either free or in the page
cache), but during a sudden peak the allocations failed. Also,
vm.swappiness seems to play a role in this.
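
For example (with overcommit_memory=2 the kernel caps allocations at swap plus overcommit_ratio percent of RAM, and /proc/meminfo shows how close you are to that cap):

sysctl vm.overcommit_memory vm.overcommit_ratio vm.swappiness
swapon -s                          # is any swap configured at all?
grep -i commit /proc/meminfo       # CommitLimit (the ceiling) vs Committed_AS (what is already promised)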

>>> The weird thing is that our old server had 1/8th the RAM, was set to
>>> max_connections = 600 and had the same clients connecting in the same
>>> way
>>> to the same databases and we never saw any errors like this in the
>>> several
>>> years we have been using it.

Chances are the old machine had swap, overcommit enabled, and/or a higher
swappiness, so it was not running into these overcommit issues.

Anyway, I see you've mentioned shmmax/shmall in one of your previous
messages. I'm pretty sure that's irrelevant to the problem, because those
settings only affect the allocation of shared buffers (i.e. shared memory)
at startup. If the database starts OK, the cause is somewhere else.

kind regards
Tomas Vondra



Re: Question about memory usage

From:
Preston Hagar
Date:


On Fri, Jan 10, 2014 at 12:19 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Preston Hagar <prestonh@gmail.com> writes:
>>> tl;dr: Moved from 8.3 to 9.3 and are now getting out of memory errors
>>> despite the server now having 32 GB instead of 4 GB of RAM and the workload
>>> and number of clients remaining the same.

> Here are a couple of examples from the incident we had this morning:
> 2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for
> connection: Cannot allocate memory
> 2014-01-10 06:14:40 CST  30176    LOG:  could not fork new process for
> connection: Cannot allocate memory

That's odd ... ENOMEM from fork() suggests that you're under system-wide
memory pressure.

> [ memory map dump showing no remarkable use of memory at all ]
> 2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production
>  10.1.1.6(36680)ERROR:  out of memory
> 2014-01-10 06:18:46 CST 10.1.1.6 16669 [unknown] production
>  10.1.1.6(36680)DETAIL:  Failed on request of size 500.

I think that what you've got here isn't really a Postgres issue, but
a system-level configuration issue: the kernel is being unreasonably
stingy about giving out memory, and it's not clear why.

It might be worth double-checking that the postmaster is not being
started under restrictive ulimit settings; though offhand I don't
see how that theory could account for fork-time failures, since
the ulimit memory limits are per-process.

Other than that, you need to burrow around in the kernel settings
and see if you can find something there that's limiting how much
memory it will give to Postgres.  It might also be worth watching
the kernel log when one of these problems starts.  Plain old "top"
might also be informative as to how much memory is being used.
 
Thanks for the response.  I think it might have been the lack of a swapfile (I replied as such in another response).
 
That said, we have been using this site as a guide to try to figure things out about postgres and memory:


We came up with the following for all of our current processes (we aren't out of memory and new connections are being accepted right now, but memory seems low):

1. List of RSS usage for all postgres processes:


2. List of all memory segments for postgres checkpoint process (pid 30178)

grep -B1 -E '^Size: *[0-9]{6}' /proc/30178/smaps
7f208acec000-7f2277328000 rw-s 00000000 00:04 31371473                   /dev/zero (deleted)
Size:            8067312 kB

3. Info on largest memory allocation for postgres checkpoint process. It is using 5GB of RAM privately.

cat /proc/30178/smaps | grep 7f208acec000 -B 0 -A 20

Total RSS: 11481148
7f208acec000-7f2277328000 rw-s 00000000 00:04 31371473                   /dev/zero (deleted)
Size:            8067312 kB
Rss:             5565828 kB
Pss:             5284432 kB
Shared_Clean:          0 kB
Shared_Dirty:     428840 kB
Private_Clean:         0 kB
Private_Dirty:   5136988 kB
Referenced:      5559624 kB
Anonymous:             0 kB
AnonHugePages:         0 kB
Swap:                  0 kB
KernelPageSize:        4 kB
MMUPageSize:           4 kB
Locked:                0 kB
7f2277328000-7f22775f1000 r--p 00000000 09:00 2889301                    /usr/lib/locale/locale-archive
Size:               2852 kB
Rss:                   8 kB
Pss:                   0 kB
Shared_Clean:          8 kB
Shared_Dirty:          0 kB

If I am understanding all this correctly, the postgres checkpoint process has around 5GB of RAM "Private_Dirty" allocated (not shared buffers).  Is this normal?  Any thoughts as to why this would get so high?

I'm still trying to dig in further to figure out exactly what is going on.  We are running on Ubuntu 12.04.3 (kernel 3.5.0-44).  We had set vm.overcommit_memory = 2 but didn't have a swap partition; we have since added one and are seeing if that helps.
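
In case it helps anyone else who hits this, the swapfile was added along these lines (the size and path here are just placeholders):

fallocate -l 8G /swapfile                           # or: dd if=/dev/zero of=/swapfile bs=1M count=8192
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile
echo '/swapfile none swap sw 0 0' >> /etc/fstab     # keep it across reboots
sysctl vm.overcommit_memory vm.overcommit_ratio     # re-check the overcommit settings afterwards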

 
 
>> We had originally copied our shared_buffers, work_mem, wal_buffers and
>> other similar settings from our old config, but after getting the memory
>> errors have tweaked them to the following:
>
> shared_buffers            = 7680MB
> temp_buffers              = 12MB
> max_prepared_transactions = 0
> work_mem                  = 80MB
> maintenance_work_mem      = 1GB
> wal_buffers = 8MB
> max_connections = 350

That seems like a dangerously large work_mem for so many connections;
but unless all the connections were executing complex queries, which
doesn't sound to be the case, that isn't the immediate problem.


Thanks for the heads up.  We originally arrived at that value using pgtune with (I think) 250 connections, and I forgot to lower work_mem when I upped max_connections.  I now have it set to 45 MB; does that seem more reasonable?
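
Redoing the rough worst-case math with the new value:

echo $(( 350 * 45 ))          # 15750 MB of potential work_mem
echo $(( 350 * 45 + 7680 ))   # 23430 MB, roughly 23 GB with shared_buffers, which at least fits in 32 GB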


 
>> The weird thing is that our old server had 1/8th the RAM, was set to
>> max_connections = 600 and had the same clients connecting in the same way
>> to the same databases and we never saw any errors like this in the several
>> years we have been using it.

This reinforces the impression that something's misconfigured at the
kernel level on the new server.

                        regards, tom lane


Forgot to copy the list on my reply, so I am resending it here.

 
Thanks for your help and time.

Preston