Thread: Parallel Select query performance and shared buffers


Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
We have several independent tables on a multi-core machine serving Select queries. These tables fit in memory, and each Select query goes over one table's pages sequentially. In this experiment, there are no indexes or table joins.

When we send concurrent Select queries to these tables, query performance doesn't scale out with the number of CPU cores. We find that complex Select queries scale out better than simpler ones. We also find that increasing the block size from 8 KB to 32 KB, or increasing shared_buffers to include the working set mitigates the problem to some extent.

For our experiments, we chose an 8-core machine with 68 GB of memory from Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and set shared_buffers to 4 GB.

We then generated 1, 2, 4, and 8 separate tables using the data generator from the industry-standard TPC-H benchmark. Each table we generated, called lineitem-1, lineitem-2, etc., had about 750 MB of data. Next, we sent 1, 2, 4, and 8 concurrent Select queries to these tables to observe the scale-out behavior. Our expectation was that since this machine had 8 cores, our run times would stay constant throughout. We also expected the machine's CPU utilization to go up to 100% at 8 concurrent queries. Neither of those assumptions held true.

We found that query run times degraded as we increased the number of concurrent Select queries. Also, CPU utilization flattened out at less than 50% for the simpler queries. Full results with a block size of 8 KB are below:

                         Table select count(*)    TPC-H Simple (#6)[2]    TPC-H Complex (#1)[1]
1 Table  / 1 query               1.5 s                    2.5 s                  8.4 s
2 Tables / 2 queries             1.5 s                    2.5 s                  8.4 s
4 Tables / 4 queries             2.0 s                    2.9 s                  8.8 s
8 Tables / 8 queries             3.3 s                    4.0 s                  9.6 s

We then increased the block size (BLCKSZ) from 8 KB to 32 KB and recompiled PostgreSQL. This change had a positive impact on query completion times. Here are the new results with block size of 32 KB:

                         Table select count(*)    TPC-H Simple (#6)[2]    TPC-H Complex (#1)[1]
1 Table  / 1 query               1.5 s                    2.3 s                  8.0 s
2 Tables / 2 queries             1.5 s                    2.3 s                  8.0 s
4 Tables / 4 queries             1.6 s                    2.4 s                  8.1 s
8 Tables / 8 queries             1.8 s                    2.7 s                  8.3 s

As a quick aside, we also repeated the same experiment on an EC2 instance with 16 CPU cores, and found that the scale-out behavior became worse there. (We also tried increasing shared_buffers to 30 GB. This change completely solved the scaling problem on this instance type, but hurt our performance on the hi1.4xlarge instances.)

Unfortunately, increasing the block size from 8 to 32 KB has other implications for some of our customers. Could you help us out with the problem here?

What can we do to identify the problem's root cause? Can we work around it?

Thank you,
Metin

Re: Parallel Select query performance and shared buffers

From:
Alvaro Herrera
Date:
Metin Doslu wrote:

> When we send concurrent Select queries to these tables, query performance
> doesn't scale out with the number of CPU cores. We find that complex Select
> queries scale out better than simpler ones. We also find that increasing
> the block size from 8 KB to 32 KB, or increasing shared_buffers to include
> the working set mitigates the problem to some extent.

Maybe you could help test this patch:
http://www.postgresql.org/message-id/20131115194725.GG5489@awork2.anarazel.de

--
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From:
Claudio Freire
Date:
On Tue, Dec 3, 2013 at 10:49 AM, Metin Doslu <metin@citusdata.com> wrote:
> We have several independent tables on a multi-core machine serving Select
> queries. These tables fit in memory, and each Select query goes over one
> table's pages sequentially. In this experiment, there are no indexes or
> table joins.
>
> When we send concurrent Select queries to these tables, query performance
> doesn't scale out with the number of CPU cores. We find that complex Select
> queries scale out better than simpler ones. We also find that increasing the
> block size from 8 KB to 32 KB, or increasing shared_buffers to include the
> working set mitigates the problem to some extent.
>
> For our experiments, we chose an 8-core machine with 68 GB of memory from
> Amazon's EC2 service. We installed PostgreSQL 9.3.1 on the instance, and set
> shared_buffers to 4 GB.


If you are certain your tables fit in RAM, you may want to disable
synchronized sequential scans, as they will create contention between
the threads.
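
For reference, synchronized scans are controlled by the `synchronize_seqscans` setting, so they can be turned off without recompiling; a sketch (double-check the setting name against your version's documentation):

```sql
-- per session:
SET synchronize_seqscans = off;

-- or globally, in postgresql.conf:
-- synchronize_seqscans = off
```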


Re: Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
Looking into syncscan.c, the comments say:

"When multiple backends run a sequential scan on the same table, we try to keep them synchronized to reduce the overall I/O needed."

But in my workload, every process was running on a different table.





Re: Parallel Select query performance and shared buffers

From:
Claudio Freire
Date:
On Tue, Dec 3, 2013 at 1:24 PM, Metin Doslu <metin@citusdata.com> wrote:
> Looking into syncscan.c, it says in comments:
>
> "When multiple backends run a sequential scan on the same table, we try to
> keep them synchronized to reduce the overall I/O needed."
>
> But in my workload, every process was running on a different table.

Ah, ok, so that's what you meant by "independent tables".


Re: Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:

Which repository should I apply these patches to? I tried the main repository, the 9.3 stable branch, and the source code of 9.3.1, and in each trial at least one of the patches failed to apply. What patch command should I use?

Re: Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
Here is some extra information:

- When we increased NUM_BUFFER_PARTITIONS to 1024, the problem disappeared on 8-core machines but came back on 16-core machines on Amazon EC2. Could it be related to PostgreSQL's locking mechanism?

- I tried this test on 4-core machines, including my personal computer and some other instances on Amazon EC2, and I didn't see the problem there. I started to see this problem in PostgreSQL when the core count is 8 or more.

Here are the results of "vmstat 1" while running 8 parallel select count(*) queries. Normally I would expect zero idle time.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 29838640  94000 38954740    0    0     0     0   22   21  0  0 100  0  0
 7  2      0 29788416  94000 38954740    0    0     0     0 53922 108490 14 24 60  1  1
 5  0      0 29747248  94000 38954740    0    0     0     0 68008 164571 22 48 27  2  1
 8  0      0 29725796  94000 38954740    0    0     0     0 43587 150574 28 54 16  1  1
 0  0      0 29838328  94000 38954740    0    0     0     0 15584 100459 26 55 18  1  0
 0  0      0 29838328  94000 38954740    0    0     0     0   42   15  0  0 100  0  0

- When I run 8 parallel wc commands or other scripts, they scale out as expected and utilize all CPUs. This leads me to think that the problem is related to PostgreSQL rather than the OS.

Re: [HACKERS] Parallel Select query performance and shared buffers

From:
Andres Freund
Date:
On 2013-12-04 20:19:55 +0200, Metin Doslu wrote:
> - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem
> disappeared on 8-core machines but came back on 16-core machines on
> Amazon EC2. Could it be related to PostgreSQL's locking mechanism?

You could try my lwlock-scalability improvement patches - for some
workloads here, the improvements have been rather noticeable. Which
version are you testing?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?

I'm testing with PostgreSQL 9.3.1.

Re: Parallel Select query performance and shared buffers

From:
Amit Kapila
Date:
On Wed, Dec 4, 2013 at 11:49 PM, Metin Doslu <metin@citusdata.com> wrote:
> Here are some extra information:
>
> - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem
> disappeared on 8-core machines but came back on 16-core machines on
> Amazon EC2. Could it be related to PostgreSQL's locking mechanism?

  I think there is a good chance of improvement with the patch Andres
suggested in this thread, but I still think it might not completely
resolve the current problem, as there will be the overhead of
associating data with shared buffers.

  Currently NUM_BUFFER_PARTITIONS is fixed, so auto-tuning it based on
some parameters could help in such situations.
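
To see why more partitions help, here is a hypothetical back-of-the-envelope simulation (not PostgreSQL code): it treats each backend's buffer-mapping lookup as picking one of NUM_BUFFER_PARTITIONS partition locks at random, and estimates how often two of eight concurrent backends land on the same lock.

```python
import random

NUM_BACKENDS = 8

def collision_rate(num_partitions, trials=10000, seed=42):
    """Estimate how often at least two of the 8 backends would need
    the same buffer-mapping partition lock at the same instant."""
    rng = random.Random(seed)
    clashes = 0
    for _ in range(trials):
        picks = [rng.randrange(num_partitions) for _ in range(NUM_BACKENDS)]
        if len(set(picks)) < NUM_BACKENDS:  # some partition picked twice
            clashes += 1
    return clashes / trials

print(collision_rate(16))    # roughly 0.88: collisions almost every time
print(collision_rate(1024))  # roughly 0.03: collisions rare
```

With 16 partitions, eight backends collide on a partition lock almost every time; with 1024, collisions are rare. This is only a birthday-problem analogy, not a model of the real access pattern, but it matches the observed improvement on the 8-core machine.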

With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
> - When we increased NUM_BUFFER_PARTITIONS to 1024, the problem
> disappeared on 8-core machines but came back on 16-core machines on
> Amazon EC2. Could it be related to PostgreSQL's locking mechanism?

If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we see tons of contention with the default value of NUM_BUFFER_PARTITIONS, which is 16:

$ tail -f /tmp/logfile | grep lwlock | egrep -v "blk 0"
...
PID 15965 lwlock 0: shacq 0 exacq 33 blk 2
PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192
PID 15965 lwlock 35: shacq 14159 exacq 27397 blk 5426
PID 15965 lwlock 36: shacq 14111 exacq 27322 blk 4959
PID 15965 lwlock 37: shacq 14211 exacq 27507 blk 4370
PID 15965 lwlock 38: shacq 14110 exacq 27294 blk 3980
PID 15965 lwlock 39: shacq 13962 exacq 27027 blk 3719
PID 15965 lwlock 40: shacq 14023 exacq 27156 blk 3273
PID 15965 lwlock 41: shacq 14107 exacq 27309 blk 3201
PID 15965 lwlock 42: shacq 14120 exacq 27304 blk 2904
PID 15965 lwlock 43: shacq 14007 exacq 27129 blk 2740
PID 15965 lwlock 44: shacq 13948 exacq 27027 blk 2616
PID 15965 lwlock 45: shacq 14041 exacq 27198 blk 2431
PID 15965 lwlock 46: shacq 14067 exacq 27277 blk 2345
PID 15965 lwlock 47: shacq 14050 exacq 27203 blk 2106
PID 15965 lwlock 48: shacq 13910 exacq 26910 blk 2155
PID 15965 lwlock 49: shacq 14170 exacq 27360 blk 1989

After we increased NUM_BUFFER_PARTITIONS to 1024, lock contention decreased:
...
PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1
PID 25220 lwlock 1001: shacq 198 exacq 394 blk 1
PID 25220 lwlock 1002: shacq 203 exacq 404 blk 1
PID 25220 lwlock 1003: shacq 226 exacq 452 blk 1
PID 25220 lwlock 1004: shacq 235 exacq 470 blk 1
PID 25220 lwlock 1006: shacq 226 exacq 452 blk 2
PID 25220 lwlock 1007: shacq 214 exacq 428 blk 1
PID 25220 lwlock 1008: shacq 225 exacq 448 blk 1
PID 25220 lwlock 1010: shacq 209 exacq 418 blk 1
PID 25220 lwlock 1015: shacq 199 exacq 398 blk 1
PID 25220 lwlock 1016: shacq 214 exacq 426 blk 1
PID 25220 lwlock 1018: shacq 230 exacq 456 blk 1
PID 25220 lwlock 1019: shacq 222 exacq 444 blk 3
PID 25220 lwlock 1023: shacq 262 exacq 524 blk 1
PID 25220 lwlock 1027: shacq 213 exacq 426 blk 1
PID 25220 lwlock 1028: shacq 246 exacq 491 blk 1
PID 25220 lwlock 1029: shacq 226 exacq 452 blk 1
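
As a side note, a quick way to rank these locks by contention is to compute blk over total acquisitions; a minimal sketch, assuming the LWLOCK_STATS line format shown above:

```python
import re

# Assumed format: "PID <pid> lwlock <id>: shacq <n> exacq <n> blk <n>"
STATS_LINE = re.compile(
    r"PID (\d+) lwlock (\d+): shacq (\d+) exacq (\d+) blk (\d+)")

def block_rate(lines):
    """Return {lwlock_id: blocked / total acquisitions} per lock seen."""
    rates = {}
    for line in lines:
        m = STATS_LINE.match(line)
        if not m:
            continue
        _pid, lock, shacq, exacq, blk = map(int, m.groups())
        total = shacq + exacq
        if total:
            rates[lock] = blk / total
    return rates

# Two lines taken from the listings above:
sample = [
    "PID 15965 lwlock 34: shacq 14010 exacq 27134 blk 6192",
    "PID 25220 lwlock 1000: shacq 247 exacq 494 blk 1",
]
rates = block_rate(sample)
# lock 34 (16-partition run) blocks on ~15% of acquisitions,
# lock 1000 (1024-partition run) on well under 1%
```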

Re: Parallel Select query performance and shared buffers

From:
Andres Freund
Date:
On 2013-12-05 11:15:20 +0200, Metin Doslu wrote:
> > - When we increased NUM_BUFFER_PARTITIONS to 1024, this problem is
> > disappeared for 8 core machines and come back with 16 core machines on
> > Amazon EC2. Would it be related with PostgreSQL locking mechanism?
>
> If we build with -DLWLOCK_STATS to print locking stats from PostgreSQL, we
> see tons of contention with default value of NUM_BUFFER_PARTITIONS which is
> 16:

Is your workload bigger than RAM? I think a good bit of the contention
you're seeing in that listing is populating shared_buffers - and might
actually vanish once you're halfway cached.
From what I've seen so far, the bigger problem than contention in the
lwlocks themselves is the spinlock protecting the lwlocks...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
> Is your workload bigger than RAM?

RAM is bigger than the workload (by more than a factor of two).

> I think a good bit of the contention
> you're seeing in that listing is populating shared_buffers - and might
> actually vanish once you're halfway cached.
> From what I've seen so far, the bigger problem than contention in the
> lwlocks themselves is the spinlock protecting the lwlocks...

Could you clarify a bit what you mean by "halfway cached" and "spinlock protecting the lwlocks"?

Re: Parallel Select query performance and shared buffers

From:
Andres Freund
Date:
On 2013-12-05 11:33:29 +0200, Metin Doslu wrote:
> > Is your workload bigger than RAM?
>
> RAM is bigger than workload (more than a couple of times).

> > I think a good bit of the contention
> > you're seeing in that listing is populating shared_buffers - and might
> > actually vanish once you're halfway cached.
> > From what I've seen so far, the bigger problem than contention in the
> > lwlocks themselves is the spinlock protecting the lwlocks...
>
> Could you clarify a bit what do you mean by "halfway cached"

Well, your stats showed a) fairly low lock counts overall b) a high
percentage of exclusive locks.
a) indicates the system wasn't running long.
b) tells me there were lots of changes to the buffer mapping - which
   basically only happens if a buffer is placed in or removed from
   shared_buffers.

If your shared_buffers is big enough to contain most of the data you
shouldn't see many exclusive locks in comparison to the number of shared
locks.

> and "spinlock protecting the lwlocks".

Every LWLock has an internal spinlock to protect its state. So whenever
somebody does a LWLockAcquire()/Release(), even if only in shared mode,
we currently acquire that spinlock, manipulate the LWLock's state, and
release the spinlock again. In lots of workloads that internal spinlock
is the contention point, not the length of time the lwlock is held -
especially when the locks are mostly held in shared mode.
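
[A toy model of what Andres describes, as hypothetical Python rather than the real C implementation: a mutex stands in for the spinlock, and even a purely shared acquisition must pass through that one internal lock, so readers serialize on it.]

```python
import threading

class ToyLWLock:
    """Toy model of a pre-9.5 LWLock: all state changes, even for
    shared acquisition, funnel through one internal lock (standing in
    for the spinlock), so every reader serializes on that lock."""
    def __init__(self):
        self._spin = threading.Lock()  # the internal "spinlock"
        self.shared = 0                # count of shared holders
        self.exclusive = False

    def acquire_shared(self):
        with self._spin:               # every reader contends here
            if self.exclusive:
                return False           # the real lock would queue/sleep
            self.shared += 1
            return True

    def release_shared(self):
        with self._spin:
            self.shared -= 1

    def acquire_exclusive(self):
        with self._spin:
            if self.exclusive or self.shared:
                return False
            self.exclusive = True
            return True
```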

Makes sense?

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
> You could try my lwlock-scalability improvement patches - for some
> workloads here, the improvements have been rather noticeable. Which
> version are you testing?

I tried your patches from the link. As you suspected, I didn't see any improvements. I tested them on PostgreSQL 9.2 stable.



Re: [HACKERS] Parallel Select query performance and shared buffers

From:
Andres Freund
Date:
On 2013-12-05 17:46:44 +0200, Metin Doslu wrote:
> I tried your patches on next link. As you suspect I didn't see any
> improvements. I tested it on PostgreSQL 9.2 Stable.

You tested the correct branch, right? Which commit does "git rev-parse
HEAD" show?

But generally, as long as your profile hides all the important
information behind the hypervisor's cost, you're going to have a hard
time analyzing the problems. You really should try to reproduce the
problems on native hardware (as similar to the host hardware as
possible), to get accurate data. On CPU bound workloads that information
is often transportable to the virtual world.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
> You tested the correct branch, right? Which commit does "git rev-parse
> HEAD" show?

I applied the last two patches manually on PostgreSQL 9.2 stable.

Re: Parallel Select query performance and shared buffers

From:
Metin Doslu
Date:
> From what I've seen so far the bigger problem than contention in the
> lwlocks itself, is the spinlock protecting the lwlocks...

Postgres 9.3.1 also reports spindelay; it seems that there is no contention on spinlocks.

PID 21121 lwlock 0: shacq 0 exacq 33 blk 1 spindelay 0
PID 21121 lwlock 33: shacq 7602 exacq 14688 blk 4381 spindelay 0
PID 21121 lwlock 34: shacq 7826 exacq 15113 blk 3786 spindelay 0
PID 21121 lwlock 35: shacq 7792 exacq 15110 blk 3356 spindelay 0
PID 21121 lwlock 36: shacq 7803 exacq 15125 blk 3075 spindelay 0
PID 21121 lwlock 37: shacq 7822 exacq 15177 blk 2756 spindelay 0
PID 21121 lwlock 38: shacq 7694 exacq 14863 blk 2513 spindelay 0
PID 21121 lwlock 39: shacq 7914 exacq 15320 blk 2400 spindelay 0
PID 21121 lwlock 40: shacq 7855 exacq 15203 blk 2220 spindelay 0
PID 21121 lwlock 41: shacq 7942 exacq 15363 blk 1996 spindelay 0
PID 21121 lwlock 42: shacq 7828 exacq 15115 blk 1872 spindelay 0
PID 21121 lwlock 43: shacq 7820 exacq 15159 blk 1833 spindelay 0
PID 21121 lwlock 44: shacq 7709 exacq 14916 blk 1590 spindelay 0
PID 21121 lwlock 45: shacq 7831 exacq 15134 blk 1619 spindelay 0
PID 21121 lwlock 46: shacq 7744 exacq 14989 blk 1559 spindelay 0
PID 21121 lwlock 47: shacq 7808 exacq 15111 blk 1473 spindelay 0
PID 21121 lwlock 48: shacq 7729 exacq 14929 blk 1381 spindelay 0

Re: Parallel Select query performance and shared buffers

From:
Claudio Freire
Date:
On Thu, Dec 5, 2013 at 1:03 PM, Metin Doslu <metin@citusdata.com> wrote:
>> From what I've seen so far, the bigger problem than contention in the
>> lwlocks themselves is the spinlock protecting the lwlocks...
>
> Postgres 9.3.1 also reports spindelay; it seems that there is no contention
> on spinlocks.


Did you check hugepages?
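
[If this refers to the Linux kernel's transparent huge pages, a quick way to inspect them on the instance - paths as on stock Linux, adjust per distro:]

```shell
# current THP policy: [always], [madvise], or [never]
cat /sys/kernel/mm/transparent_hugepage/enabled

# huge-page counters
grep -i huge /proc/meminfo
```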