[GENERAL] Logical decoding CPU-bound w/ large number of tables

Поиск
Список
Период
Сортировка
От Mathieu Fenniak
Тема [GENERAL] Logical decoding CPU-bound w/ large number of tables
Дата
Msg-id CAHoiPjzea6N0zuCi=+f9v_j94nfsy6y8SU7-=bp4=7qw6_i=Rg@mail.gmail.com
обсуждение исходный текст
Ответы Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables
Re: [GENERAL] Logical decoding CPU-bound w/ large number of tables
Список pgsql-general
Heyo,

I'm attempting to use logical decoding with the streaming replication protocol to perform change-data-capture on PostgreSQL 9.5.4.  I'm seeing the replication stream "stall" for long periods of time where the walsender process will be pinned at 100% CPU utilization, but no data is being sent to my client.

The stalls occur unpredictably on my production system, but generally seem to be correlated with schema operations.  My source database has about 100,000 tables; it's a one-schema-per-tenant multi-tenant SaaS system.

I've reproduced the same symptoms with two different approaches on my local machine.  With both, I have a replication client connected via the streaming protocol.

In reproduction approach 1, I've created a thread that inserts small sets of data, and a thread that creates a schema w/ 500 tables and then drops it.  This approach has pinned CPU usage, but data does come out of it -- just excruciatingly slow when compared to the same test without the schema create & drop.

In reproduction approach 2, I've created a database w/ 100,000 tables on it and performed a "vacuum ful".  The walsender goes to 100% CPU and no data comes out of the replication stream for hours.

I've performed a CPU sampling with the OSX `sample` tool based upon reproduction approach #1: https://gist.github.com/mfenniak/366d7ed19b2d804f41180572dc1600d8  It appears that most of the time is spent in the RelfilenodeMapInvalidateCallback and CatalogCacheIdInvalidate cache invalidation callbacks, both of which appear to be invalidating caches based upon the cache value.

Has anyone else run into this kind of performance problem?  Any thoughts on how it might be resolved?  I don't mind putting in the work if someone could describe what is happening here, and have a discussion with me about what kind of changes might be necessary to improve the performance.

Thanks all,


Mathieu Fenniak | Senior Software Architect | Phone 1-587-315-1185

Replicon | The leader in cloud time tracking applications - 7,800+ Customers - 70+ Countries - 1.5 Million Users
www.replicon.com | facebook | linkedin | twitter | blog | contact us

We are hiring! | search jobs



В списке pgsql-general по дате отправления:

Предыдущее
От: Bill Moran
Дата:
Сообщение: Re: [GENERAL] Can PG replace redis, amqp, s3 in the future?
Следующее
От: Justin Pryzby
Дата:
Сообщение: [GENERAL] PG96 pg_restore connecting to PG95 causes ERROR: unrecognizedconfiguration parameter "idle_in_transaction_session_timeout"