Обсуждение: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

Поиск
Список
Период
Сортировка

Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:

This question was posted originally on http://dba.stackexchange.com/questions/96444/cant-get-dell-pe-t420-perc-h710-perform-better-than-a-macmini-with-postgresql and they suggested to post it on this mailing list.

It's months that I'm trying to solve a performance issue with PostgreSQL. I’m able to give you all the technical details needed.

SYSTEM CONFIGURATION

Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured in this way:

  • VD0: two 15k SAS disks (ext4, OS partition, WAL partition, RAID1)
  • VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)

This system has the following configuration:

  • Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)
  • 128GB RAM (DDR3, 8x16GB @1600Mhz)
  • two Intel Xeon E5-2640 v2 @2Ghz
  • Dell Perc H710 with 512MB RAM (Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "disabled"):
    • VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)
    • VD1 (Postgres data partition): ten 10k SAS disks (XFS, RAID5)
  • PostgreSQL 9.4 (updated to the latest available version)
  • moved pg_stat_tmp to RAM disk

My personal low cost and low profile development machine is a MacMini configured in this way:

  • OS X Server 10.7.5
  • 8GB RAM (DDR3, 2x4GB @1333Mhz)
  • one Intel i7 @2.2Ghz
  • two Internal 500GB 7.2k SAS HDD (non RAID) for OS partition
  • external Promise Pegasus R1 connected with Thunderbolt v1 (512MB RAM, four 1TB 7.2k SAS HDD 32MB cache, RAID5, Write cache: "WriteBack", Read cache: "ReadAhead", Disk cache: "enabled", NCQ: "enabled")
  • PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)
  • moved pg_stat_tmp to RAM disk

So far I've made a lot of tuning adjustments to both machines, including kernel reccomended ones on the official Postgres doc site.

APPLICATION

The deployment machine runs a web platform which instructs Postgres to make big transactions over billion of records. It's a platform designed for one user because system resources have to be dedicated as much as possible to one single job due to data size (I don't like to call it big data because big data are in the order ob ten of billion).

ISSUEs

I've found the deployment machine to be a lot slower than the development machine. This is paradoxal because the two machine really differs in many aspects. I've run many queries to investigate this strange behaviour and have done a lot of tuning adjustments.

During the last two months I've prepared and executed two type of query sets:

  • A: these sets make use of SELECT ... INTO, CREATE INDEX, CLUSTER and VACUUM ANALYZE.
  • B: these sets are from our application generated transactions and make use of SELECT over the tables created with set A.

A and B were always slower on T420. The only type of operation that was faster is the VACUUM ANALYZE.

RESULTS

A type set:

  • T420: went from 311seconds (default postgresql.conf) to 195seconds doing tuning adjustments over RAID, kernel and postgresql.conf;
  • MacMini: 40seconds.

B type set:

  • T420: 141seconds;
  • MacMini: 101seconds.

I've to mention that we have also adjusted the BIOS on T420 setting all possible parameters to "performance" and disabling low energy profiles. This lowered time execution over a type A set from 240seconds to 211seconds.

We have also upgrade all firmware and BIOS to the latest available versions.


Here are two benchmarks generated using pg_test_fsync:

T420 pg_test_fsync

60 seconds per test
O_DIRECT supported on this platform for open_datasync and open_sync.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                   23358.758 ops/sec      43 usecs/op       fdatasync                       21417.018 ops/sec      47 usecs/op       fsync                           21112.662 ops/sec      47 usecs/op       fsync_writethrough                            n/a       open_sync                       23082.764 ops/sec      43 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                   11737.746 ops/sec      85 usecs/op       fdatasync                       19222.074 ops/sec      52 usecs/op       fsync                           18608.405 ops/sec      54 usecs/op       fsync_writethrough                            n/a       open_sync                       11510.074 ops/sec      87 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)        1 * 16kB open_sync write       21484.546 ops/sec      47 usecs/op        2 *  8kB open_sync writes      11478.119 ops/sec      87 usecs/op        4 *  4kB open_sync writes       5885.149 ops/sec     170 usecs/op        8 *  2kB open_sync writes       3027.676 ops/sec     330 usecs/op       16 *  1kB open_sync writes       1512.922 ops/sec     661 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       write, fsync, close             17946.690 ops/sec      56 usecs/op       write, close, fsync             17976.202 ops/sec      56 usecs/op

Non-Sync'ed 8kB writes:       write                           343202.937 ops/sec       3 usecs/op

MacMini pg_test_fsync

60 seconds per test
Direct I/O is not supported on this platform.

Compare file sync methods using one 8kB write:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                      3780.341 ops/sec     265 usecs/op       fdatasync                          3117.094 ops/sec     321 usecs/op       fsync                              3156.298 ops/sec     317 usecs/op       fsync_writethrough                  110.300 ops/sec    9066 usecs/op       open_sync                          3077.932 ops/sec     325 usecs/op

Compare file sync methods using two 8kB writes:
(in wal_sync_method preference order, except fdatasync
is Linux's default)       open_datasync                      1522.400 ops/sec     657 usecs/op       fdatasync                          2700.055 ops/sec     370 usecs/op       fsync                              2670.652 ops/sec     374 usecs/op       fsync_writethrough                   98.462 ops/sec   10156 usecs/op       open_sync                          1532.235 ops/sec     653 usecs/op

Compare open_sync with different write sizes:
(This is designed to compare the cost of writing 16kB
in different write open_sync sizes.)        1 * 16kB open_sync write          2634.754 ops/sec     380 usecs/op        2 *  8kB open_sync writes         1547.801 ops/sec     646 usecs/op        4 *  4kB open_sync writes          801.542 ops/sec    1248 usecs/op        8 *  2kB open_sync writes          405.515 ops/sec    2466 usecs/op       16 *  1kB open_sync writes          204.095 ops/sec    4900 usecs/op

Test if fsync on non-write file descriptor is honored:
(If the times are similar, fsync() can sync data written
on a different descriptor.)       write, fsync, close                2747.345 ops/sec     364 usecs/op       write, close, fsync                3070.877 ops/sec     326 usecs/op

Non-Sync'ed 8kB writes:       write                              3275.716 ops/sec     305 usecs/op

This confirms the hardware IO capabilities of T420 but doesn't explain why MacMini is MUCH MORE FAST.



Now let’s propose some query profiling times.

B type set are transactions, so it's impossible for me to post EXPLAIN ANALYZE results. I've extracted two querys from a single transactions and executed the twos on both system. Here are the results:

T420

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


COMPILING PGSQL

I compiled and tested all the latest pgsql versions (9.0.19, 9.1.15, 9.2.10, 9.3.6 and 9.4.1) using different combinations of parameters for gcc-4.9.1 (gcc 4.7 for pgsql 9.0.19) and Postgres (I’ve tried also clang compiler with different optimization flags with no benefits). I followed this article but I was unable to test the -flto option due to several errors returned by make.  After two days of testing I went down from 195 to 189 seconds on T420 where MacMini still is 40 seconds (A set); and from 141 to 129 seconds where MacMini is 101 seconds (B set). On MacMini I’ve used the built-in pgsql 9.0.13 version while on T420 I've used the following optimal compiling options:

./configure CFLAGS="-O3 -fno-inline-functions -march=native" --with-openssl --with-libxml --with-libxslt --with-wal-blocksize=64 --with-blocksize=32 --with-wal-segsize=64 --with-segsize=1

I've also tried to disable Hyper-Threading with echo 0 > /sys/devices/system/cpu/cpuN/online where cpuN is the N-th logical CPU but nothing changed over B set queries. We have 2 CPU with 8 cores for a total of 16 physical cores and 16 logical cores.


It seems like T420 doesn’t push on single transaction while is probably able to manage multiple connections much better than MacMini. I can’t figure out why it’s much much much slower than MacMini on any kind of query (from data loading to da selection).

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Ilya Kosmodemiansky
Дата:
Hi Pietro,

On Wed, Apr 1, 2015 at 3:56 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
> T420: went from 311seconds (default postgresql.conf) to 195seconds doing
> tuning adjustments over RAID, kernel and postgresql.conf;
> MacMini: 40seconds.

I'am afraid, the matter is, that PostgreSQL is not configured properly
(and so do operating system and probably controller, however
pg_test_fsync shows that things are not so bad there as with
postgresql.conf).

It is pretty useless to benchmark a database using out-of-the-box
configuration. You need at least configure shared memory related,
checkpoints-related and autovacuum-related settings. And as a first
step, please compare postgresql.conf on Mac and on the server:
sometimes (with some mac installers) default postgresql.conf can be
not the same as on server.

Best regards,
Ilya


--
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com


Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Aidan Van Dyk
Дата:
On Wed, Apr 1, 2015 at 9:56 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
 

Now let’s propose some query profiling times.

B type set are transactions, so it's impossible for me to post EXPLAIN ANALYZE results. I've extracted two querys from a single transactions and executed the twos on both system. Here are the results:

T420

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


Looking at the 2 B_2 queries (since they are so drastically different), the in-memory quicksorts stand out on the Dell as being *drastically* slower than the disk-based sorts on your mac-mini....

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Wei Shan
Дата:
Just looking at the 2 B_2 queries, I'm curious as to why is the execution plan different between the 2 machines. Is the optimiser stats updated on both databases?

Regards,
Wei Shan

On 1 April 2015 at 22:32, Aidan Van Dyk <aidan@highrise.ca> wrote:
On Wed, Apr 1, 2015 at 9:56 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
 

Now let’s propose some query profiling times.

B type set are transactions, so it's impossible for me to post EXPLAIN ANALYZE results. I've extracted two querys from a single transactions and executed the twos on both system. Here are the results:

T420

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


Looking at the 2 B_2 queries (since they are so drastically different), the in-memory quicksorts stand out on the Dell as being *drastically* slower than the disk-based sorts on your mac-mini....




--
Regards,
Ang Wei Shan

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Jeff Janes
Дата:
On Wed, Apr 1, 2015 at 6:56 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:

This question was posted originally on http://dba.stackexchange.com/questions/96444/cant-get-dell-pe-t420-perc-h710-perform-better-than-a-macmini-with-postgresql and they suggested to post it on this mailing list.

It's months that I'm trying to solve a performance issue with PostgreSQL. I’m able to give you all the technical details needed.

SYSTEM CONFIGURATION

Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured in this way:

  • two Intel Xeon E5-2640 v2 @2Ghz
  • PostgreSQL 9.4 (updated to the latest available version)

My personal low cost and low profile development machine is a MacMini configured in this way:

  • one Intel i7 @2.2Ghz
  • PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)

Using such different versions of PostgreSQL seems like a recipe for frustration.


Here are two benchmarks generated using pg_test_fsync:


This is unlikely to be important for the type of workload you describe.  Fsyncs are the bottleneck for many short transactions, but not often the bottleneck for very large transactions.


T420

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk



What collation is used for both databases?  Perhaps the T420 is using a much slower collation.

How can you sort 2,951,191 but then materialize 4,458,971 rows out of that?  I've never seen that before.  (Or, in the other plan, put 2,951,191 rows into the sort from the CTE but get 4,458,971 out of the sort?

Cheers,

Jeff

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Gerardo Herzig
Дата:
Ok, a quick view on the system, and some things that may be important to note:

> Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID
> controller configured in this way:
>
>     * VD0: two 15k SAS disks (ext4, OS partition, WAL partition,
>     RAID1)
>     * VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)
>

Well...usually RAID5 have the worst performance in writing...EVER!!! Have you tested this in another raid
configuration?RAID10 is usually the best bet. 

>
>
> This system has the following configuration:
>
>     * Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)
>     * 128GB RAM (DDR3, 8x16GB @1600Mhz)
>     * two Intel Xeon E5-2640 v2 @2Ghz
>     * Dell Perc H710 with 512MB RAM (Write cache: "WriteBack", Read
>     cache: "ReadAhead", Disk cache: "disabled"):
>         * VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)
>         * VD1 (Postgres data partition): ten 10k SAS disks (XFS,
>         RAID5)
>     * PostgreSQL 9.4 (updated to the latest available version)
>     * moved pg_stat_tmp to RAM disk
>
>
[...]> versions.
>
You did not mention any "postgres" configuration at all. If you let the default checkpoint_segments=3, that would be an
IOhell for your disk controler...and the RAID5 making things worst...Can you show us the values of: 

checkpoint_segments
shared_buffers
work_mem
maintenance_work_mem
effective_io_concurrency

I would start from there, few changes, and check again. I would change the RAID first of all things, and try those
testsagain. 

Cheers.
Gerardo


Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Gerardo,
thank you for your response.
At the moment I can’t switch to RAID10. I know it has best performance, but both systems have RAID5 and MacMini has a consumer desktop RAID solution while T420 has a server-grade one.
Anyway, I used two configurations for each system: one for data loading operations and the other one for any other kind of operation (SELECT etc.). These configurations were made studying different combinations. I’ve changed kernel parameters as stated in the official Postgres documentation ( www.postgresql.org/docs/9.4/static/kernel-resources.html ).
I copy and paste here the various postgresql.conf involved:

T420
Normal operations
autovacuum = on
maintenance_work_mem = 512MB
work_mem = 512MB
wal_buffers = 64MB
effective_cache_size = 64GB # this helps A LOT in disk write speed when creating indexes
shared_buffers = 32GB
checkpoint_segments = 2000
checkpoint_completion_target = 1.0
effective_io_concurrency = 0 # 1 doesn’t make any substantial difference
max_connections = 10 # 20 doesn’t make any difference

Data loading (same as above with the following changes):
autovacuum = off
maintenance_work_mem = 64GB


MacMini
Normal operations
autovacuum = on
maintenance_work_mem = 128MB
work_mem = 32MB
wal_buffers = 32MB
effective_cache_size = 800MB
shared_buffers = 512MB
checkpoint_segments = 32
checkpoint_completion_target = 1.0
effective_io_concurrency = 1
max_connections = 20

Data loading (same as above with the following changes):
autovacuum = off
maintenance_work_mem = 6GB


Best regards,
 Pietro


Il giorno 02/apr/2015, alle ore 04:19, Gerardo Herzig <gherzig@fmed.uba.ar> ha scritto:

Ok, a quick view on the system, and some things that may be important to note:

Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID
controller configured in this way:

   * VD0: two 15k SAS disks (ext4, OS partition, WAL partition,
   RAID1)
   * VD1: ten 10k SAS disks (XFS, Postgres data partition, RAID5)


Well...usually RAID5 have the worst performance in writing...EVER!!! Have you tested this in another raid configuration? RAID10 is usually the best bet.



This system has the following configuration:

   * Ubuntu 14.04.2 LTS (GNU/Linux 3.13.0-48-generic x86_64)
   * 128GB RAM (DDR3, 8x16GB @1600Mhz)
   * two Intel Xeon E5-2640 v2 @2Ghz
   * Dell Perc H710 with 512MB RAM (Write cache: "WriteBack", Read
   cache: "ReadAhead", Disk cache: "disabled"):
       * VD0 (OS and WAL partition): two 15k SAS disks (ext4, RAID1)
       * VD1 (Postgres data partition): ten 10k SAS disks (XFS,
       RAID5)
   * PostgreSQL 9.4 (updated to the latest available version)
   * moved pg_stat_tmp to RAM disk


[...]> versions.

You did not mention any "postgres" configuration at all. If you let the default checkpoint_segments=3, that would be an IO hell for your disk controler...and the RAID5 making things worst...Can you show us the values of:

checkpoint_segments
shared_buffers
work_mem
maintenance_work_mem
effective_io_concurrency

I would start from there, few changes, and check again. I would change the RAID first of all things, and try those tests again.

Cheers.
Gerardo

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Jeff,
thank you for your response.
I’m using Postgres 9.0 on MacMini because I’ve noticed that it’s quite fast compared to different Ubuntu machines on which I’ve worked with different (and more performant) hardware.
The built-in Postgres version on OS X Server is impossible to update. I should stop it and install a parallel and independent distribution which has not been optimized by Apple. On opensource.appel.com they have different Postgres versions but the latest one is 9.2.x. They stopped updating it in 2012.
pg_test_fsync tells me that T420 disk iops are ~7 times faster than MacMini, which is ok, but queries run ~2-5 times slower (for brevity I didn’t report all test results in my first mail).

I’ve searched just now what a collation is because I’ve never explicitly used one before, so I think it uses the default one.

B_2 query is of the form:
WITH soggetti AS (
 SELECT ... FROM ... GROUP BY ...)
SELECT ... INTO ... FROM soggetti, ... WHERE ... 

(I omit the … part because they’re not relevant)

Best regards,
 Pietro

PS it’s the first time for me on this list so I don’t know if you read the other answers. I reported the postgresql.conf for both systems




Il giorno 01/apr/2015, alle ore 18:38, Jeff Janes <jeff.janes@gmail.com> ha scritto:

On Wed, Apr 1, 2015 at 6:56 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:

This question was posted originally on http://dba.stackexchange.com/questions/96444/cant-get-dell-pe-t420-perc-h710-perform-better-than-a-macmini-with-postgresql and they suggested to post it on this mailing list.

It's months that I'm trying to solve a performance issue with PostgreSQL. I’m able to give you all the technical details needed.

SYSTEM CONFIGURATION

Our deployment machine is a Dell PowerEdge T420 with a Perc H710 RAID controller configured in this way:

  • two Intel Xeon E5-2640 v2 @2Ghz
  • PostgreSQL 9.4 (updated to the latest available version)

My personal low cost and low profile development machine is a MacMini configured in this way:

  • one Intel i7 @2.2Ghz
  • PostgreSQL 9.0.13 (the original built-in shipped with OS X Server)

Using such different versions of PostgreSQL seems like a recipe for frustration.


Here are two benchmarks generated using pg_test_fsync:


This is unlikely to be important for the type of workload you describe.  Fsyncs are the bottleneck for many short transactions, but not often the bottleneck for very large transactions.


T420

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk



What collation is used for both databases?  Perhaps the T420 is using a much slower collation.

How can you sort 2,951,191 but then materialize 4,458,971 rows out of that?  I've never seen that before.  (Or, in the other plan, put 2,951,191 rows into the sort from the CTE but get 4,458,971 out of the sort?

Cheers,

Jeff

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Wei Shan,
Thank you for your response.
Query B was run after initializing the DB ex-novo doing VACUUM ANALYZE before and after creating and clustering indexes.
By the way, these results are consistent through time and are reproducible, so it’s not a metter of statistic collector (I guess).
Your observation is the same done at dba.stackexchange.com and this make me think that the built-in Postgres of OS X Server is truly optimized. 

Best regards,
 Pietro

PS on the other response I reported both postgresql.conf 


Il giorno 01/apr/2015, alle ore 16:44, Wei Shan <weishan.ang@gmail.com> ha scritto:

Just looking at the 2 B_2 queries, I'm curious as to why is the execution plan different between the 2 machines. Is the optimiser stats updated on both databases?

Regards,
Wei Shan

On 1 April 2015 at 22:32, Aidan Van Dyk <aidan@highrise.ca> wrote:
On Wed, Apr 1, 2015 at 9:56 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
 

Now let’s propose some query profiling times.

B type set are transactions, so it's impossible for me to post EXPLAIN ANALYZE results. I've extracted two querys from a single transactions and executed the twos on both system. Here are the results:

T420

Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM

Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06

MacMini

Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx

Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


Looking at the 2 B_2 queries (since they are so drastically different), the in-memory quicksorts stand out on the Dell as being *drastically* slower than the disk-based sorts on your mac-mini....




--
Regards,
Ang Wei Shan

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Ilya,
thank your for your response.
Both system were configured for each test I’ve done. On T420 I’ve optimized the kernel following the official Postgres documentation ( http://www.postgresql.org/docs/9.4/static/kernel-resources.html ):
kernel.shmmax=68719476736
kernel.shmall=16777216
vm.overcommit_memory=2
vm.overcommit_ratio=90


RAID controllers were configured as following:
- Write cache: WriteBack
- Read cache: ReadAhead
- Disk cache (only T420): disabled to take full advantage of WriteBack cache (BBU is charged and working)
- NCQ (only MacMini because it’s a SATA option): enabled (this affects a lot the overall performance)

For postgresql.conf:

T420
Normal operations
autovacuum = on
maintenance_work_mem = 512MB
work_mem = 512MB
wal_buffers = 64MB
effective_cache_size = 64GB # this helps A LOT in disk write speed when creating indexes
shared_buffers = 32GB
checkpoint_segments = 2000
checkpoint_completion_target = 1.0
effective_io_concurrency = 0 # 1 doesn’t make any substantial difference
max_connections = 10 # 20 doesn’t make any difference

Data loading (same as above with the following changes):
autovacuum = off
maintenance_work_mem = 64GB


MacMini
Normal operations
autovacuum = on
maintenance_work_mem = 128MB
work_mem = 32MB
wal_buffers = 32MB
effective_cache_size = 800MB
shared_buffers = 512MB
checkpoint_segments = 32
checkpoint_completion_target = 1.0
effective_io_concurrency = 1
max_connections = 20

Data loading (same as above with the following changes):
autovacuum = off
maintenance_work_mem = 6GB


Best regards,
 Pietro



Il giorno 01/apr/2015, alle ore 16:27, Ilya Kosmodemiansky <ilya.kosmodemiansky@postgresql-consulting.com> ha scritto:

Hi Pietro,

On Wed, Apr 1, 2015 at 3:56 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
T420: went from 311seconds (default postgresql.conf) to 195seconds doing
tuning adjustments over RAID, kernel and postgresql.conf;
MacMini: 40seconds.

I'am afraid, the matter is, that PostgreSQL is not configured properly
(and so do operating system and probably controller, however
pg_test_fsync shows that things are not so bad there as with
postgresql.conf).

It is pretty useless to benchmark a database using out-of-the-box
configuration. You need at least configure shared memory related,
checkpoints-related and autovacuum-related settings. And as a first
step, please compare postgresql.conf on Mac and on the server:
sometimes (with some mac installers) default postgresql.conf can be
not the same as on server.

Best regards,
Ilya


--
Ilya Kosmodemiansky,

PostgreSQL-Consulting.com
tel. +14084142500
cell. +4915144336040
ik@postgresql-consulting.com

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
"Mkrtchyan, Tigran"
Дата:
Hi Pietro,

The modern CPUs trying to be too smart.

try to run this code to disable CPUs c-states:

----> setcpulatency.c

#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>

int main(int argc, char **argv) {
   int32_t l;
   int fd;

   if (argc != 2) {
      fprintf(stderr, "Usage: %s <latency in us>\n", argv[0]);
      return 2;
   }

   l = atoi(argv[1]);
   printf("setting latency to %d us\n", l);
   fd = open("/dev/cpu_dma_latency", O_WRONLY);
   if (fd < 0) {
      perror("open /dev/cpu_dma_latency");
      return 1;
   }

   if (write(fd, &l, sizeof(l)) != sizeof(l)) {
      perror("write to /dev/cpu_dma_latency");
      return 1;
   }

   while (1) pause();
}


---->

you can use i7z (https://code.google.com/p/i7z/) to see the percentage of CPU power to be used.
Changing CPU from C1 to C0 takes quite some time and for DB workload not optimal (if you need a
high throughout and any given moment).

I see ~65% boost when run './setcpulatency 0'.

Tigran.

----- Original Message -----
> From: "Pietro Pugni" <pietro.pugni@gmail.com>
> To: ik@postgresql-consulting.com
> Cc: "pgsql-performance" <pgsql-performance@postgresql.org>
> Sent: Thursday, April 2, 2015 12:57:22 PM
> Subject: Re: [PERFORM] Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

> Hi Ilya,
> thank your for your response.
> Both system were configured for each test I’ve done. On T420 I’ve optimized the
> kernel following the official Postgres documentation (
> http://www.postgresql.org/docs/9.4/static/kernel-resources.html ):
> kernel.shmmax=68719476736
> kernel.shmall=16777216
> vm.overcommit_memory=2
> vm.overcommit_ratio=90
>
>
> RAID controllers were configured as following:
> - Write cache: WriteBack
> - Read cache: ReadAhead
> - Disk cache (only T420): disabled to take full advantage of WriteBack cache
> (BBU is charged and working)
> - NCQ (only MacMini because it’s a SATA option): enabled (this affects a lot the
> overall performance)
>
> For postgresql.conf:
>
> T420
> Normal operations
> autovacuum = on
> maintenance_work_mem = 512MB
> work_mem = 512MB
> wal_buffers = 64MB
> effective_cache_size = 64GB # this helps A LOT in disk write speed when creating
> indexes
> shared_buffers = 32GB
> checkpoint_segments = 2000
> checkpoint_completion_target = 1.0
> effective_io_concurrency = 0 # 1 doesn’t make any substantial difference
> max_connections = 10 # 20 doesn’t make any difference
>
> Data loading (same as above with the following changes):
> autovacuum = off
> maintenance_work_mem = 64GB
>
>
> MacMini
> Normal operations
> autovacuum = on
> maintenance_work_mem = 128MB
> work_mem = 32MB
> wal_buffers = 32MB
> effective_cache_size = 800MB
> shared_buffers = 512MB
> checkpoint_segments = 32
> checkpoint_completion_target = 1.0
> effective_io_concurrency = 1
> max_connections = 20
>
> Data loading (same as above with the following changes):
> autovacuum = off
> maintenance_work_mem = 6GB
>
>
> Best regards,
> Pietro
>
>
>
> Il giorno 01/apr/2015, alle ore 16:27, Ilya Kosmodemiansky
> <ilya.kosmodemiansky@postgresql-consulting.com> ha scritto:
>
>> Hi Pietro,
>>
>> On Wed, Apr 1, 2015 at 3:56 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
>>> T420: went from 311seconds (default postgresql.conf) to 195seconds doing
>>> tuning adjustments over RAID, kernel and postgresql.conf;
>>> MacMini: 40seconds.
>>
>> I'am afraid, the matter is, that PostgreSQL is not configured properly
>> (and so do operating system and probably controller, however
>> pg_test_fsync shows that things are not so bad there as with
>> postgresql.conf).
>>
>> It is pretty useless to benchmark a database using out-of-the-box
>> configuration. You need at least configure shared memory related,
>> checkpoints-related and autovacuum-related settings. And as a first
>> step, please compare postgresql.conf on Mac and on the server:
>> sometimes (with some mac installers) default postgresql.conf can be
>> not the same as on server.
>>
>> Best regards,
>> Ilya
>>
>>
>> --
>> Ilya Kosmodemiansky,
>>
>> PostgreSQL-Consulting.com
>> tel. +14084142500
>> cell. +4915144336040
> > ik@postgresql-consulting.com


Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
didier
Дата:
Hi,

On Thu, Apr 2, 2015 at 12:47 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
> Hi Jeff,
> thank you for your response.
> I’m using Postgres 9.0 on MacMini because I’ve noticed that it’s quite fast
> compared to different Ubuntu machines on which I’ve worked with different
> (and more performant) hardware.
> The built-in Postgres version on OS X Server is impossible to update. I
> should stop it and install a parallel and independent distribution which has
> not been optimized by Apple. On opensource.appel.com they have different
> Postgres versions but the latest one is 9.2.x. They stopped updating it in
> 2012.
If you want you can compile 9.0 on OSX and double check.
I don't remember well but ITSM that a fsync used by psql was a noop on  OSX.

> pg_test_fsync tells me that T420 disk iops are ~7 times faster than MacMini,
> which is ok, but queries run ~2-5 times slower (for brevity I didn’t report
> all test results in my first mail).

>
> I’ve searched just now what a collation is because I’ve never explicitly
> used one before, so I think it uses the default one.

What's the output of free and sysctl -a | grep vm.zone_reclaim_mode

Search the mailing list for zone_reclaim_mode there's some tips.


For testing you can also use the mac mini config with the dell, at
least it should give you the same plan.
With your example disks don't seem to matter, it's all in memory.

Keep in mind that a psql query is still single thread so the mac and
the dell should get more or less the same speed for in memory queries.

>
> B_2 query is of the form:
> WITH soggetti AS (
>  SELECT ... FROM ... GROUP BY ...)
> SELECT ... INTO ... FROM soggetti, ... WHERE ...
>
> (I omit the … part because they’re not relevant)
>
> Best regards,
>  Pietro


Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Aidan Van Dyk
Дата:
On Thu, Apr 2, 2015 at 6:33 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
 
T420
work_mem = 512MB
 
MacMini
work_mem = 32MB

So that is why the T420 does memory sorts and the mini does disk sorts.

I'd start looking at why memory sorts on the T420 is so slow.   Check your numa settings, etc (as already mentioned).

For a drastic test, disable the 2nd socket on the dell, and just use one (eliminate any numa/QPI costs) and see how it compares to the no-numa MacMini.

If you want to see how bad the NUMA/QPI is, play with stream to benchmark memory performance.

a.

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Aidan,

 
T420
work_mem = 512MB
 
MacMini
work_mem = 32MB

So that is why the T420 does memory sorts and the mini does disk sorts.

I'd start looking at why memory sorts on the T420 is so slow.   Check your numa settings, etc (as already mentioned).

For a drastic test, disable the 2nd socket on the dell, and just use one (eliminate any numa/QPI costs) and see how it compares to the no-numa MacMini.


the command 
dmesg | grep -i numa
doesn’t display me anything. I think T420 hasn’t NUMA on it. Is there a way to enable it from Ubuntu? I don’t have immediate access to BIOS (server is in another location).
For QPI I don’t know what to do. Please, can you give me more details?

If you want to see how bad the NUMA/QPI is, play with stream to benchmark memory performance.

With stream you refer to this: https://sites.utexas.edu/jdm4372/tag/stream-benchmark/ ? Do you suggest me some way to do this kind of tests?

Thank you very much 
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:

Il giorno 02/apr/2015, alle ore 14:29, didier <did447@gmail.com> ha scritto:

Hi,

On Thu, Apr 2, 2015 at 12:47 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
Hi Jeff,
thank you for your response.
I’m using Postgres 9.0 on MacMini because I’ve noticed that it’s quite fast
compared to different Ubuntu machines on which I’ve worked with different
(and more performant) hardware.
The built-in Postgres version on OS X Server is impossible to update. I
should stop it and install a parallel and independent distribution which has
not been optimized by Apple. On opensource.appel.com they have different
Postgres versions but the latest one is 9.2.x. They stopped updating it in
2012.
If you want you can compile 9.0 on OSX and double check.
I don't remember well but ITSM that a fsync used by psql was a noop on  OSX.

You’re referring to disk scheduler? I’ve tried to change it on T420 with no significant variations over performance.
I’ve also tried different fsync options with no improvements.

pg_test_fsync tells me that T420 disk iops are ~7 times faster than MacMini,
which is ok, but queries run ~2-5 times slower (for brevity I didn’t report
all test results in my first mail).


I’ve searched just now what a collation is because I’ve never explicitly
used one before, so I think it uses the default one.

What's the output of free and sysctl -a | grep vm.zone_reclaim_mode

Search the mailing list for zone_reclaim_mode there's some tips.

vm.zone_reclaim_mode = 0

I’ve also set these options in /etc/sysctl.conf:
kernel.shmmax=68719476736
kernel.shmall=16777216
vm.overcommit_memory=2
vm.overcommit_ratio=90

I’ll search the mailing list.

For testing you can also use the mac mini config with the dell, at
least it should give you the same plan.
With your example disks don't seem to matter, it's all in memory.
The same transaction took 106s on MacMini; 129s on T420 with my optimized configuration; 180s on T420 using MacMini configuration.
Query plans for B_1 and B_2 queries with the two configurations on T420:

T420 with optimal postgresql.conf
Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM
Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06


T420 with MacMini postgresql.conf
Query B_1 [51280.208ms + 0.699ms] http://explain.depesz.com/s/wlb
Query B_2 [177278.205ms + 0.428ms] http://explain.depesz.com/s/rzr

MacMini
Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx
Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk


Keep in mind that a psql query is still single thread so the mac and
the dell should get more or less the same speed for in memory queries.
Yes I know ;) With 128GB I try to maximize RAM usage, but it’s difficult to fully understand how to achieve this.

Thank you again,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Aidan Van Dyk
Дата:
On Thu, Apr 2, 2015 at 9:23 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
 
the command 
dmesg | grep -i numa
doesn’t display me anything. I think T420 hasn’t NUMA on it. Is there a way to enable it from Ubuntu? I don’t have immediate access to BIOS (server is in another location).

NUMA stands for "Non-Uniform-Memory-Access" .  It's basically the "label" for systems which have memory attached to different cpu sockets, such that accessing all of the memory from a paritciular cpu thread has different costs based on where the actual memory is located (i.e. on some other socket, or the local socket).
 
For QPI I don’t know what to do. Please, can you give me more details?

QPI is the the intel "QuickPath Interconnect". It's the communication path between CPU sockets.   Memory ready by one cpu thread that has to come from another cpu socket's memory controller goes through QPI.

Google has lots of info on these, and how they impact performance, etc.
If you want to see how bad the NUMA/QPI is, play with stream to benchmark memory performance.

With stream you refer to this: https://sites.utexas.edu/jdm4372/tag/stream-benchmark/ ? Do you suggest me some way to do this kind of tests?

Ya, that's the one.  I don't have specific tests in mind.

A more simple "overview" might be "numactl --hardware"

a.

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Tigran,

The modern CPUs trying to be too smart.

try to run this code to disable CPUs c-states:

----> setcpulatency.c

#include <stdio.h>
#include <fcntl.h>
#include <stdint.h>

int main(int argc, char **argv) {
  int32_t l;
  int fd;

  if (argc != 2) {
     fprintf(stderr, "Usage: %s <latency in us>\n", argv[0]);
     return 2;
  }

  l = atoi(argv[1]);
  printf("setting latency to %d us\n", l);
  fd = open("/dev/cpu_dma_latency", O_WRONLY);
  if (fd < 0) {
     perror("open /dev/cpu_dma_latency");
     return 1;
  }

  if (write(fd, &l, sizeof(l)) != sizeof(l)) {
     perror("write to /dev/cpu_dma_latency");
     return 1;
  }

  while (1) pause();
}


——>

your C code should be equivalent to the following:
echo 0 > /dev/cpu_dma_latency
Right?
I executed the above command but time execution increase of about 2 seconds over 129seconds (I’ve executed the transaction several times repeating the procedure of restarting db and redoing transaction). With setting echo 1 > /dev/cpu_dma_latency it returns to 129seconds.

you can use i7z (https://code.google.com/p/i7z/) to see the percentage of CPU power to be used.

I’ve installed i7z-GUI but it reports the following and crashes with segmentation fault (T420 has Intel Xeon, not i-series):
i7z DEBUG: i7z version: svn-r77-(20-Nov-2011)
i7z DEBUG: Found Intel Processor
i7z DEBUG:    Stepping 4
i7z DEBUG:    Model e
i7z DEBUG:    Family 6
i7z DEBUG:    Processor Type 0
i7z DEBUG:    Extended Model 3
i7z DEBUG: msr = Model Specific Register
i7z DEBUG: detected a newer model of ivy bridge processor
i7z DEBUG: my coder doesn't know about it, can you send the following info to him?
i7z DEBUG: model e, extended model 3, proc_family 6
i7z DEBUG: msr device files DONOT exist, trying out a makedev script
i7z DEBUG: modprobbing for msr
[1]+  Segmentation fault      (core dumped) i7z_GUI


Changing CPU from C1 to C0 takes quite some time and for DB workload not optimal (if you need a
high throughout and any given moment).

I see ~65% boost when run './setcpulatency 0'.

Tigran.

With “takes quite some time” you mean that it will take some time to take effect?

Thank you a lot for your help.
Best regards,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
didier
Дата:
Hi

On Thu, Apr 2, 2015 at 3:52 PM, Pietro Pugni <pietro.pugni@gmail.com> wrote:

>
> I’ve searched just now what a collation is because I’ve never explicitly
> used one before, so I think it uses the default one.
>
>
> What's the output of free and sysctl -a | grep vm.zone_reclaim_mode
>
> Search the mailing list for zone_reclaim_mode there's some tips.
>
> vm.zone_reclaim_mode = 0
In my understanding it's the rigth value
there's also huge page
/sys/kernel/mm/transparent_hugepage/enabled
can you try to disable it?

Also test on the dell:
select tmp.cf, tmp.dt from grep_studi.tmp;
and
select tmp.cf, tmp.dt from grep_studi.tmp order by tmp.cf;
in Query B_2
the sort is 9 time slower on the dell, you have to find why...

>
> For testing you can also use the mac mini config with the dell, at
> least it should give you the same plan.
> With your example disks don't seem to matter, it's all in memory.

> T420 with optimal postgresql.conf
> Query B_1 [55999.649 ms + 0.639 ms] http://explain.depesz.com/s/LbM
> Query B_2 [95664.832 ms + 0.523 ms] http://explain.depesz.com/s/v06
>
>
> T420 with MacMini postgresql.conf
> Query B_1 [51280.208ms + 0.699ms] http://explain.depesz.com/s/wlb
> Query B_2 [177278.205ms + 0.428ms] http://explain.depesz.com/s/rzr
>
32 GB for buffers is too high for the queries in your test but it
doesn't matter.

> MacMini
> Query B_1 [56315.614 ms] http://explain.depesz.com/s/uZTx
> Query B_2 [44890.813 ms] http://explain.depesz.com/s/y7Dk
>


Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi didier,
thank you for your time.
I forgot to display before the output of free. I’ve looked into it before and I found difficult to fully understand if there was something wrong.

Before starting Postgres:
             total       used       free     shared    buffers     cached
Mem:          125G         9G       115G        15M       362M       8.1G
-/+ buffers/cache:       1.5G       124G
Swap:         127G         0B       127G

Here’s an example of free output when queries B_1 and B_2 are running (they’re part of the same transaction). Generally values remains the same. For what I can understand, RAM isn’t used at all (there’s a lot of unused RAM).

             total       used       free     shared    buffers     cached
Mem:          125G        13G       112G       3.1G       362M        11G
-/+ buffers/cache:       1.9G       123G
Swap:         127G         0B       127G

With Postgres running after transaction has been executed:
             total       used       free     shared    buffers     cached
Mem:          125G        13G       112G       3.1G       362M        11G
-/+ buffers/cache:       1.5G       124G
Swap:         127G         0B       127G


there's also huge page
/sys/kernel/mm/transparent_hugepage/enabled
can you try to disable it?
It was enabled and after disabling it nothing changed: time execution is practically the same (131s for the same transaction tested in previous emails, which is composed by queries B_1 and B_2).


Also test on the dell:
select tmp.cf, tmp.dt from grep_studi.tmp;
and
select tmp.cf, tmp.dt from grep_studi.tmp order by tmp.cf;
in Query B_2
the sort is 9 time slower on the dell, you have to find why…
Here’s the output for the two queries:

select tmp.cf, tmp.dt from grep_studi.tmp;
"Seq Scan on grep_studi.tmp  (cost=0.00..11007.74 rows=1346868 width=72) (actual time=0.082..618.709 rows=2951191 loops=1)"
"  Output: cf, dt"
"  Buffers: shared hit=512 read=7802 dirtied=8314"
"Planning time: 0.087 ms"
"Execution time: 745.505 ms"

select tmp.cf, tmp.dt from grep_studi.tmp;
"Sort  (cost=38431.55..39104.99 rows=1346868 width=72) (actual time=3146.548..3306.179 rows=2951191 loops=1)"
"  Output: cf, dt"
"  Sort Key: tmp.cf"
"  Sort Method: quicksort  Memory: 328866kB"
"  Buffers: shared hit=8317"
"  ->  Seq Scan on grep_studi.tmp  (cost=0.00..11007.74 rows=1346868 width=72) (actual time=0.012..373.346 rows=2951191 loops=1)"
"        Output: cf, dt"
"        Buffers: shared hit=8314"
"Planning time: 0.034 ms"
"Execution time: 3459.065 ms"


32 GB for buffers is too high for the queries in your test but it
doesn't matter.
I’ve set shared_buffers to be 1/4 of the total RAM. I’ve changed kernel values to accomodate this value. Lowering to smaller values doesn’t improve the transaction results. Here’s the results with 1 run for each level of shared_buffers:

32GB: 131s
16GB: 132s
8GB: 133s
4GB: 132s
2GB: 143s
1GB: 148s
512MB: 183s
256MB: 192s

Probably I can keep 4GB but I make use of several partitions with tens of millions of records each. This is why I keep shared_buffers high. My applications is also similar to a DWH solution with one user. Like you said, big values of shared_buffers shouldn’t be a issue.

I’ve done some tests with sysbench on Dell T420 and MacMini.

T420 - RAM READ - 16GB / 1MB
sh-4.3# sysbench --test=memory --memory-oper=read --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 16384M

Memory operations type: read
Memory scope type: global
Threads started!
Done.

Operations performed: 16384 (3643025.32 ops/sec)

16384.00 MB transferred (3643025.32 MB/sec)


Test execution summary:
    total time:                          0.0045s
    total number of events:              16384
    total time taken by event execution: 0.0031
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.02ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   0.0031/0.00

MacMini - RAM READ - 16GB / 1MB
server:sysbench Pietro$ ./sysbench --test=memory --memory-oper=read --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored


Threads started!

Operations performed: 16384 ( 5484.50 ops/sec)

16384.00 MB transferred (5484.50 MB/sec)


General statistics:
    total time:                          2.9873s
    total number of events:              16384
    total time taken by event execution: 2.9836s
    response time:
         min:                                  0.18ms
         avg:                                  0.18ms
         max:                                  0.24ms
         approx.  95 percentile:               0.19ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   2.9836/0.00

T420 - RAM WRITE - 16GB / 1MB
sh-4.3# sysbench --test=memory --memory-oper=write --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 16384M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 16384 ( 8298.97 ops/sec)

16384.00 MB transferred (8298.97 MB/sec)


Test execution summary:
    total time:                          1.9742s
    total number of events:              16384
    total time taken by event execution: 1.9723
    per-request statistics:
         min:                                  0.12ms
         avg:                                  0.12ms
         max:                                  0.25ms
         approx.  95 percentile:               0.12ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   1.9723/0.00



MacMini - RAM WRITE - 16GB / 1MB
server:sysbench Pietro$ ./sysbench --test=memory --memory-oper=write --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored


Threads started!

Operations performed: 16384 ( 5472.90 ops/sec)

16384.00 MB transferred (5472.90 MB/sec)


General statistics:
    total time:                          2.9937s
    total number of events:              16384
    total time taken by event execution: 2.9890s
    response time:
         min:                                  0.18ms
         avg:                                  0.18ms
         max:                                  0.32ms
         approx.  95 percentile:               0.19ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   2.9890/0.00


T420 - CPU
sh-4.3# sysbench --test=cpu run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 10000


Test execution summary:
    total time:                          13.0683s
    total number of events:              10000
    total time taken by event execution: 13.0674
    per-request statistics:
         min:                                  1.30ms
         avg:                                  1.31ms
         max:                                  1.44ms
         approx.  95 percentile:               1.35ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   13.0674/0.00


MacMini - CPU
server:sysbench Pietro$ ./sysbench --test=cpu run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored


Primer numbers limit: 10000

Threads started!


General statistics:
    total time:                          11.5728s
    total number of events:              10000
    total time taken by event execution: 11.5703s
    response time:
         min:                                  1.15ms
         avg:                                  1.16ms
         max:                                  2.17ms
         approx.  95 percentile:               1.17ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   11.5703/0.00





I’ve done these tests because someone else on this discussion asked me to investigate on memory bandwidth and because I found this interesting article about Intel Xeon vs Intel i5 with different Postgres versions: http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-pgbench
Hope this helps to better understand the problem.

Thank you very much.
Best regards,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Aidan,
thank you again for your support.
I found an interesting article showing better performance from a Intel i5 vs a Intel Xeon on different Postgres versions: http://blog.pgaddict.com/posts/performance-since-postgresql-7-4-to-9-4-pgbench

NUMA stands for "Non-Uniform-Memory-Access" .  It's basically the "label" for systems which have memory attached to different cpu sockets, such that accessing all of the memory from a paritciular cpu thread has different costs based on where the actual memory is located (i.e. on some other socket, or the local socket).
Thanks, good to know.

QPI is the the intel "QuickPath Interconnect". It's the communication path between CPU sockets.   Memory ready by one cpu thread that has to come from another cpu socket's memory controller goes through QPI.
Google has lots of info on these, and how they impact performance, etc.
When I’ll get access to BIOS (probably next week or later) I’ll try to disable QPI (if possible). Meanwhile I’ll document on Internet about QPI vs performance.

If you want to see how bad the NUMA/QPI is, play with stream to benchmark memory performance.

With stream you refer to this: https://sites.utexas.edu/jdm4372/tag/stream-benchmark/ ? Do you suggest me some way to do this kind of tests?

Ya, that's the one.  I don't have specific tests in mind.
I’ve done some tests with sysbench on Dell T420 (via apt-get install) and MacMini (I’ve compiled the latest available sources at https://github.com/akopytov/sysbench ).
Here are some results with 16GB RAM read and written at 1MB block size (I don’t know if this makes sense, but I’ve no problem in changing these parameters).

T420 - RAM READ - 16GB / 1MB
sh-4.3# sysbench --test=memory --memory-oper=read --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 16384M

Memory operations type: read
Memory scope type: global
Threads started!
Done.

Operations performed: 16384 (3643025.32 ops/sec)

16384.00 MB transferred (3643025.32 MB/sec)


Test execution summary:
    total time:                          0.0045s
    total number of events:              16384
    total time taken by event execution: 0.0031
    per-request statistics:
         min:                                  0.00ms
         avg:                                  0.00ms
         max:                                  0.02ms
         approx.  95 percentile:               0.00ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   0.0031/0.00

MacMini - RAM READ - 16GB / 1MB
server:sysbench Pietro$ ./sysbench --test=memory --memory-oper=read --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored


Threads started!

Operations performed: 16384 ( 5484.50 ops/sec)

16384.00 MB transferred (5484.50 MB/sec)


General statistics:
    total time:                          2.9873s
    total number of events:              16384
    total time taken by event execution: 2.9836s
    response time:
         min:                                  0.18ms
         avg:                                  0.18ms
         max:                                  0.24ms
         approx.  95 percentile:               0.19ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   2.9836/0.00

T420 - RAM WRITE - 16GB / 1MB
sh-4.3# sysbench --test=memory --memory-oper=write --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing memory operations speed test
Memory block size: 1024K

Memory transfer size: 16384M

Memory operations type: write
Memory scope type: global
Threads started!
Done.

Operations performed: 16384 ( 8298.97 ops/sec)

16384.00 MB transferred (8298.97 MB/sec)


Test execution summary:
    total time:                          1.9742s
    total number of events:              16384
    total time taken by event execution: 1.9723
    per-request statistics:
         min:                                  0.12ms
         avg:                                  0.12ms
         max:                                  0.25ms
         approx.  95 percentile:               0.12ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   1.9723/0.00



MacMini - RAM WRITE - 16GB / 1MB
server:sysbench Pietro$ ./sysbench --test=memory --memory-oper=write --memory-block-size=1MB --memory-total-size=16GB run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored


Threads started!

Operations performed: 16384 ( 5472.90 ops/sec)

16384.00 MB transferred (5472.90 MB/sec)


General statistics:
    total time:                          2.9937s
    total number of events:              16384
    total time taken by event execution: 2.9890s
    response time:
         min:                                  0.18ms
         avg:                                  0.18ms
         max:                                  0.32ms
         approx.  95 percentile:               0.19ms

Threads fairness:
    events (avg/stddev):           16384.0000/0.00
    execution time (avg/stddev):   2.9890/0.00


T420 - CPU
sh-4.3# sysbench --test=cpu run
sysbench 0.4.12:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1

Doing CPU performance benchmark

Threads started!
Done.

Maximum prime number checked in CPU test: 10000


Test execution summary:
    total time:                          13.0683s
    total number of events:              10000
    total time taken by event execution: 13.0674
    per-request statistics:
         min:                                  1.30ms
         avg:                                  1.31ms
         max:                                  1.44ms
         approx.  95 percentile:               1.35ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   13.0674/0.00


MacMini - CPU
server:sysbench Pietro$ ./sysbench --test=cpu run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 1
Random number generator seed is 0 and will be ignored


Primer numbers limit: 10000

Threads started!


General statistics:
    total time:                          11.5728s
    total number of events:              10000
    total time taken by event execution: 11.5703s
    response time:
         min:                                  1.15ms
         avg:                                  1.16ms
         max:                                  2.17ms
         approx.  95 percentile:               1.17ms

Threads fairness:
    events (avg/stddev):           10000.0000/0.00
    execution time (avg/stddev):   11.5703/0.00



A more simple "overview" might be "numactl —hardware”
It returns the following output:

sh-4.3# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64385 MB
node 0 free: 56487 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64508 MB
node 1 free: 62201 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

Thank you so much for your help, really appreciate it.
Best regards,
 Pietro



Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Josh Krupka
Дата:

A more simple "overview" might be "numactl —hardware”
It returns the following output:

sh-4.3# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 64385 MB
node 0 free: 56487 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 64508 MB
node 1 free: 62201 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 

Did you already post the results of:
cat /proc/sys/vm/zone_reclaim_mode

Also, how big did you say your dataset is? Based on the output of free, you're certainly not using all the memory you have.  That could be just because you haven't accessed that much of your dataset, or it could be because zone reclaim is preventing you from using your entire amount of RAM as file system cache.

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Josh,

Did you already post the results of:
cat /proc/sys/vm/zone_reclaim_mode

zone_reclaim_mode was set on 0 for all my tests. I’ve also set it to the other values (1, 2, 4) but there was no improvement. Tests results are the following (1 run for each test):
echo 0 > /proc/sys/vm/zone_reclaim_mode
130s
echo 1 > /proc/sys/vm/zone_reclaim_mode
129s
echo 2 > /proc/sys/vm/zone_reclaim_mode
134s
echo 4 > /proc/sys/vm/zone_reclaim_mode
131s


Also, how big did you say your dataset is? Based on the output of free, you're certainly not using all the memory you have.  That could be just because you haven't accessed that much of your dataset, or it could be because zone reclaim is preventing you from using your entire amount of RAM as file system cache.
The table I use for this test is about 20milion row, has less than 10 columns (a small table) and has 4 indexes. Other tables I use are partitioned and consists of a total of 1.6bilion rows, 757 million rows ad so on descending.

Thank you for your help.
Best regards,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Josh Krupka
Дата:
Sorry, how much disk space is actually used by the tables, indexes, etc involved in your queries? Or it that's a bit much to get, how much disk space is occupied by your database in total?

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Josh,
at the moment the server is unreachable so I can’t calculate sizes. I run all of my test both with all data loaded into
Postgresand with no data loaded (except from the single 20mln rows table with relative indexes). 
To give you an idea, with all data loaded into Postgres with indexes the space occupied is approximately 1.2-1.5TB and
freespace on is about 800GB. 

Many thanks.
Best regads,
 Pietro

Il giorno 03/apr/2015, alle ore 17:21, Josh Krupka <jkrupka@gmail.com> ha scritto:

> Sorry, how much disk space is actually used by the tables, indexes, etc involved in your queries? Or it that's a bit
muchto get, how much disk space is occupied by your database in total? 



Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:

On Tue, Apr 7, 2015 at 6:27 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
Hi Jeff,
sorry for the latency but server was down due to a error I made in the sysctl.conf file.

Yes, but are the defaults for those two systems?  on psql, use \l to see.

\l returns the following:

T420 (Postgres 9.4.1)
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 grep      | grep     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)



MacMini (Postgres 9.0.13)
                                    List of databases
       Name        |   Owner    | Encoding | Collation | Ctype |    Access privileges    
-------------------+------------+----------+-----------+-------+-------------------------
 caldav            | caldav     | UTF8     | C         | C     | 
 collab            | collab     | UTF8     | C         | C     | 
 device_management | _devicemgr | UTF8     | C         | C     | 
 pen               | pen        | UTF8     | C         | C     | 
 postgres          | _postgres  | UTF8     | C         | C     | 
 roundcubemail     | roundcube  | UTF8     | C         | C     | 
 template0         | _postgres  | UTF8     | C         | C     | =c/_postgres           +
                   |            |          |           |       | _postgres=CTc/_postgres
 template1         | _postgres  | UTF8     | C         | C     | =c/_postgres           +
                   |            |          |           |       | _postgres=CTc/_postgres
(8 rows)


The difference between the "C" and the "en_US" collation is entirely sufficient to explain the difference in performance.  "C" is the fastest possible collation as it never needs to look ahead or consult tables, it just compares raw bytes.

Cheers,

Jeff

Hi Jeff,
is there a way to set a default collection during compiling or in the configuration file? I have never specified one, so I suppose that somewhere on MacMini “C” collation type is set as the default one.

Thank you a lot.
Best regards,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
I meant “collation”, not “collection”.

 Pietro

Il giorno 07/apr/2015, alle ore 18:49, Pietro Pugni <pietro.pugni@gmail.com> ha scritto:


On Tue, Apr 7, 2015 at 6:27 AM, Pietro Pugni <pietro.pugni@gmail.com> wrote:
Hi Jeff,
sorry for the latency but server was down due to a error I made in the sysctl.conf file.

Yes, but are the defaults for those two systems?  on psql, use \l to see.

\l returns the following:

T420 (Postgres 9.4.1)
                                  List of databases
   Name    |  Owner   | Encoding |   Collate   |    Ctype    |   Access privileges   
-----------+----------+----------+-------------+-------------+-----------------------
 grep      | grep     | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 postgres  | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | 
 template0 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.UTF-8 | en_US.UTF-8 | =c/postgres          +
           |          |          |             |             | postgres=CTc/postgres
(4 rows)



MacMini (Postgres 9.0.13)
                                    List of databases
       Name        |   Owner    | Encoding | Collation | Ctype |    Access privileges    
-------------------+------------+----------+-----------+-------+-------------------------
 caldav            | caldav     | UTF8     | C         | C     | 
 collab            | collab     | UTF8     | C         | C     | 
 device_management | _devicemgr | UTF8     | C         | C     | 
 pen               | pen        | UTF8     | C         | C     | 
 postgres          | _postgres  | UTF8     | C         | C     | 
 roundcubemail     | roundcube  | UTF8     | C         | C     | 
 template0         | _postgres  | UTF8     | C         | C     | =c/_postgres           +
                   |            |          |           |       | _postgres=CTc/_postgres
 template1         | _postgres  | UTF8     | C         | C     | =c/_postgres           +
                   |            |          |           |       | _postgres=CTc/_postgres
(8 rows)


The difference between the "C" and the "en_US" collation is entirely sufficient to explain the difference in performance.  "C" is the fastest possible collation as it never needs to look ahead or consult tables, it just compares raw bytes.

Cheers,

Jeff

Hi Jeff,
is there a way to set a default collection during compiling or in the configuration file? I have never specified one, so I suppose that somewhere on MacMini “C” collation type is set as the default one.

Thank you a lot.
Best regards,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Hi Jeff


The default collation for the database cluster is set when you create the cluster with initdb (the package you used to install postgresql might provide scripts that wrap initdb and call it something else, sorry I can't be much use with those).  
You can set it with --lc-collate flag to initdb, otherwise it is set based on the environment variables (LANG or LC_* variables) set in the shell you use to run initdb.

Note that you can create a new database in the cluster which has its own default which is different from the cluster's default.

guess what? it worked! Now I run the following command:
/usr/local/pgsql/bin/initdb -D /mnt/raid5/pg_data --no-locale --encoding=UTF8

Time execution for my reference transaction went from 2m9s to 1m18s where Mac Mini is 1m43s.
This is the best improvement after modifying BIOS settings.
I’ll do some testing. In the meanwhile I’ve made some kernel changes which may help in heavy workloads (at the moment they don't affect performance).

I should offer you a dinner…
Thank you a lot. Now I’m looking to push it faster (I think it can goes faster than this!).

Best regards,
 Pietro

Re: Can't get Dell PE T420 (Perc H710) perform better than a MacMini with PostgreSQL

От
Pietro Pugni
Дата:
Ciao Pietro,
stavo seguendo thread sulla mailing list Postgresql.

Puoi farmi un piccolo riassunto delle conclusioni perchè non sono sicuro di aver capito tutto?

Ciao Domenico,
sì effettivamente la mailing list è un po’ dispersiva.
Utilizzando il collation di tipo “C” il Dell T420 impiega meno tempo rispetto al Mac Mini nell’esecuzione di qualsiasi query. La differenza finora è del 20%-30% circa. Adesso sto facendo un caricamento dati massiccio che sul Dell impiegava 3 giorni circa mentre sul Mac Mini impiega 1 giorno e mezzo circa e vi farò sapere il tempo di esecuzione.

Alla fine la differenza di performance tra il DELL R420 e il Mac MINI è dovuta al tipo di "collate" utilizzato nell'inizializzazione del DB?

Possono essere adoperati diversi collation all’interno dello stesso DB. Ogni tabella può averne uno diverso, persino ogni attributo può avere un collation diverso dagli altri attributi della stessa tabella o della stessa SELECT.

Il comando:
initdb —no-locale

imposta il collate “C” come predefinito per il DB creato.
Tuttavia nel file di postgresql.conf sono presenti le seguenti variabili di configurazione che definiscono il collate predefinito per l’istanza di Postgres avviata:
# These settings are initialized by initdb, but they can be changed.
lc_messages = 'C'                       # locale for system error message
                                        # strings
lc_monetary = 'C'                       # locale for monetary formatting
lc_numeric = 'C'                        # locale for number formatting
lc_time = 'C'                           # locale for time formatting

In ogni caso mi aspettavo performance migliori da questo Dell, soprattutto considerando il livello della macchina (controller RAID di un certo livello, 10 dischi rigidi SAS da 10k rpm per i dati e 2 dischi SAS da 15k rpm per sistema operativo e WAL, 128GB di RAM e 2 CPU Xeon per un totale di 16 core fisici e 16 logici).
Oltre al collation è bene modificare le impostazioni del BIOS e del controller RAID; il primo perché su questi sistemi sono impostate opzioni di risparmio energetico che tagliano la potenza, mentre per il RAID è bene scegliere una cache in scrittura del tipo WriteBack e in lettura del tipo ReadAhead e poi controllare che la batteria della cache di scrittura sia carica, altrimenti non viene adoperata nessuna cache.
Ovviamente poi la configurazione RAID fa tanto; per ora utilizziamo RAID5 che non è affatto performante in scrittura ma lo è in lettura.
Il tutto dipende dall’applicazione specifica. Nel nostro caso è una sorta di data warehouse ibrido, quindi una soluzione mono utenza che fa uso di grandi moli di dati (siamo nell’ordine dei 1.8TB di database): poche transazioni ma massicce. Le impostazioni sul kernel agiscono soprattutto sulle performance in ambienti multi utenza (migliaia di connessioni) e con le ultime versioni di Postgres e del kernel linux il sistema è già abbastanza auto bilanciato.

Spero di essere stato sufficientemente chiaro ed esaustivo.
Cordiali saluti,
 Pietro