Thread: CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

From
Dennis Brouwer
Date:
Dear mailing list,

I am currently benchmarking postgresql-9.2 on Debian squeeze (Linux 2.6.32-5-amd64 x86_64 GNU/Linux).

The server used for benchmarking has a quad-core E5-1620 and 32 GB RAM; for storage we use an LSI-9265 with 8 SSDs. The freshly restored database is about 90 GB in size and doesn't fit in RAM, so that the tests also exercise the IO system.

The database mainly consists of a partitioned table with 6 partitions. To test the performance I run 32 queries in parallel, each doing some grouping on the partitioned table. Every query runs in its own transaction. While the number of concurrent queries may be higher than recommended, we consider this a stress test as well.
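A test setup like the one described (many queries in parallel, each in its own transaction) could be driven and timed roughly as follows. This is only a sketch, not the harness actually used here: `run_query` is a hypothetical callable that opens its own connection, runs one query in its own transaction and returns when it completes, and the factor of 100 simply mirrors the slowdown reported in this thread.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_parallel(run_query, queries, workers=32):
    """Run every query concurrently; return (query, seconds) pairs in input order."""
    def timed(query):
        start = time.monotonic()
        run_query(query)  # hypothetical: own connection, own transaction
        return query, time.monotonic() - start

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(timed, queries))


def outliers(timings, factor=100.0):
    """Return the queries that took more than `factor` times the (upper) median runtime."""
    durations = sorted(seconds for _, seconds in timings)
    median = durations[len(durations) // 2]
    return [query for query, seconds in timings if seconds > factor * median]
```

Logging the outliers per run makes it easier to see whether the same queries are affected every time or whether the slowdown really strikes at random.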

Last week I was repeatedly able to run all these tests on the database without any issue, but recently, all of a sudden and at random, some of the queries run a factor of 100 slower. It may take hours to complete the transaction. At the same moment we see a dramatic decrease in IO and the CPU is nearly 100% busy in user space.

After days of testing I may have found the cause: the ntp client. If I stop the ntp client the problem vanishes.

I have started reading about spinlocks and other related material, but this is all rather complicated stuff, so I kindly ask in what direction I should search. The issue can be reproduced with both postgresql-9.1 and postgresql-9.2 and perhaps can be rephrased as: very high CPU load in user space (at random) with ntp enabled and (long?) running transactions.

Perhaps somebody from the mailing list has sufficient experience debugging this kind of behaviour to exclude a bug in postgresql. Much appreciated!


Very kind regards,

Dennis Brouwer
M4N

P.S. If required I can provide more details, such as the queries, auto_explain output, iostat, top, iotop, postgresql.conf, etc.
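The observation that the CPU is nearly 100% busy in user space while IO drops can be put into numbers by comparing two samples of the aggregate `cpu` line in /proc/stat. The following is a minimal sketch assuming a Linux system; the function names are made up for illustration, and the field order follows the documented layout of that line (user, nice, system, idle, iowait, irq, softirq, steal).

```python
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq", "steal")


def read_cpu_line(path="/proc/stat"):
    """Return the first line of /proc/stat, i.e. the aggregate 'cpu' counters."""
    with open(path) as stat:
        return stat.readline()


def cpu_shares(before, after):
    """Percentage of CPU time spent in each state between two 'cpu' lines."""
    old = [int(value) for value in before.split()[1:1 + len(FIELDS)]]
    new = [int(value) for value in after.split()[1:1 + len(FIELDS)]]
    deltas = [b - a for a, b in zip(old, new)]
    total = sum(deltas)
    if total <= 0:
        raise ValueError("samples are identical or out of order")
    return {name: 100.0 * delta / total for name, delta in zip(FIELDS, deltas)}
```

Taking one sample with `read_cpu_line()`, waiting a few seconds and taking a second one gives the shares for that interval; a high `user` share combined with a low `iowait` share would match the symptom described above.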

Re: CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

From
Tom Lane
Date:
Dennis Brouwer <dennis.brouwer@m4n.nl> writes:
> Last week I was repeatedly able to run all these tests on the database
> without any issue but recently, all of a sudden at random, some of the
> queries performed a factor 100 less. It may take hours to complete the
> transaction. At the same moment we see a dramatic decrease in IO and the
> CPU is nearly 100% busy in user space.

> After days of testing I may have found the cause: the ntp client. If I stop
> the ntp client the problem vanishes.

> I have started reading on spinlocks and other related material but this all
> is rather complicated stuff and kindly ask in what direction I should
> search. The issue can be reproduced for both postgresql-9.1 and
> postgresql-9.2 and perhaps can be rephrased as: Very high CPU load in user
> space (at random) with ntp enabled and (long?) running transactions.

That's really bizarre.  What "ntp client" are you using exactly?  Is it
configured to adjust the system clock by slewing, or by stepping?  Can
you identify what part of the code is eating CPU (try perf or oprofile)?

            regards, tom lane
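Whether the system clock is being stepped rather than slewed can be checked empirically by comparing the wall clock with the monotonic clock: on Linux a step moves only the wall clock, while slewing adjusts the rate of both. The sketch below is an illustration under that assumption, not something taken from this thread; the function names are hypothetical, and the 0.128 s threshold is ntpd's default step threshold.

```python
import time

STEP_THRESHOLD = 0.128  # seconds; ntpd's default step threshold


def clock_jump(wall_before, mono_before, wall_after, mono_after):
    """Seconds the wall clock moved beyond what the monotonic clock measured."""
    return (wall_after - wall_before) - (mono_after - mono_before)


def watch_for_steps(samples=10, interval=1.0, threshold=STEP_THRESHOLD):
    """Sample both clocks repeatedly; return (wall_time, jump) for every suspected step."""
    jumps = []
    wall, mono = time.time(), time.monotonic()
    for _ in range(samples):
        time.sleep(interval)
        new_wall, new_mono = time.time(), time.monotonic()
        jump = clock_jump(wall, mono, new_wall, new_mono)
        if abs(jump) > threshold:
            jumps.append((new_wall, jump))
        wall, mono = new_wall, new_mono
    return jumps
```

An empty result over a period in which the slowdown occurs would suggest that the clock was not stepped during that time.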


Re: CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

From
Dennis Brouwer
Date:
Dear Tom Lane,

Thanks for the tip about using perf or oprofile, but ntp might not be the problem at all. During testing with ntp off the problem was still reproducible, albeit less frequently. It might have something to do with excessive row locking. We are currently looking through the explain results in the postgresql log to see whether there is a pattern to be observed, and reading the pg_locks chapters from the books ;-). It will take some time to understand what's going on.

I still might need to use the tools to identify where in the code the CPU user load comes from.

I'll keep you posted.

Kind regards,

Dennis Brouwer
M4n

On Mon, Sep 24, 2012 at 6:30 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Dennis Brouwer <dennis.brouwer@m4n.nl> writes:
> Last week I was repeatedly able to run all these tests on the database
> without any issue but recently, all of a sudden at random, some of the
> queries performed a factor 100 less. It may take hours to complete the
> transaction. At the same moment we see a dramatic decrease in IO and the
> CPU is nearly 100% busy in user space.

> After days of testing I may have found the cause: the ntp client. If I stop
> the ntp client the problem vanishes.

> I have started reading on spinlocks and other related material but this all
> is rather complicated stuff and kindly ask in what direction I should
> search. The issue can be reproduced for both postgresql-9.1 and
> postgresql-9.2 and perhaps can be rephrased as: Very high CPU load in user
> space (at random) with ntp enabled and (long?) running transactions.

That's really bizarre.  What "ntp client" are you using exactly?  Is it
configured to adjust the system clock by slewing, or by stepping?  Can
you identify what part of the code is eating CPU (try perf or oprofile)?

                        regards, tom lane
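The suspicion of excessive row locking mentioned in this message can be checked against the `pg_locks` view while the slowdown is happening. The following is a rough sketch only: it assumes a DB-API cursor from whichever PostgreSQL driver is in use (the driver itself is not shown), and the function name is hypothetical. The columns `pid`, `locktype`, `mode` and `granted` are part of `pg_locks`.

```python
from collections import Counter

# Lock requests that are still waiting to be granted.
WAITING_LOCKS_SQL = """
SELECT pid, locktype, mode
FROM pg_locks
WHERE NOT granted
"""


def waiting_lock_summary(cursor):
    """Count ungranted lock requests per (locktype, mode) using a DB-API cursor."""
    cursor.execute(WAITING_LOCKS_SQL)
    return Counter((locktype, mode) for _pid, locktype, mode in cursor.fetchall())
```

Sessions blocked on a row lock usually show up here as an ungranted `ShareLock` on a `transactionid`. If the summary stays empty while the CPU is saturated, lock waits are probably not the explanation.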

Re: CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

From
Marcello Perathoner
Date:
On 09/24/2012 03:53 PM, Dennis Brouwer wrote:

> Last week I was repeatedly able to run all these tests on the database
> without any issue but recently, all of a sudden at random, some of the
> queries performed a factor 100 less. It may take hours to complete the
> transaction. At the same moment we see a dramatic decrease in IO and the
> CPU is nearly 100% busy in user space.
>
> After days of testing I may have found the cause: the ntp client. If I stop
> the ntp client the problem vanishes.

Any chance you are hitting this known Linux bug in conjunction with a
misconfigured ntp server? I.e., does a

  # date -s now

fix the cpu load?



http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix/comment-page-1/#comment-1471



http://serverfault.com/questions/403732/anyone-else-experiencing-high-rates-of-linux-server-crashes-during-a-leap-second



--
Marcello Perathoner
webmaster@gutenberg.org


Re: CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

From
Dennis Brouwer
Date:
Hi Tom,

I have now excluded ntp as the root cause of the CPU cycles being wasted in user space.

I installed perf and monitored two servers (with different postgresql versions and hardware specifications) which are "hanging", and I have some output. Since I'm no die-hard at interpreting the output of perf top, what would be the next step?

Would it be a good idea to a) read the perf manual and/or b) provide the output of perf top as a first step to see what is going on?

What I think I see is a lot of spin_lock_irq and scheduler activity.

Any guidance much appreciated.

Most Regards,

Dennis Brouwer
M4N






Re: CPU causes 100% load in user space when ntp client runs and postgresql is under heavy load

From
Tom Lane
Date:
Dennis Brouwer <dennis.brouwer@m4n.nl> writes:
> I now have excluded ntp as root cause for the CPU cycles being wasted in
> user space.

Good, cause that wasn't making any sense at all.

> I installed perf and monitored two servers (with different postgresql
> versions and hardware specification) which are "hanging" and have some
> output. Since I'm no die-hard at interpreting the output of perf top what
> would be the next step to do?

I'd suggest asking for help in pgsql-performance.  I don't know much
about perf either (still an oprofile guy), but the people who do know
it hang out there.

            regards, tom lane