Discussion: BUG #16662: pgbench: error: client 418 script 0 aborted in command 5 query 0: ERROR: invalid page in block 4830
BUG #16662: pgbench: error: client 418 script 0 aborted in command 5 query 0: ERROR: invalid page in block 4830
From: PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16662
Logged by:          强 魏
Email address:      1726002692@qq.com
PostgreSQL version: 13.0
Operating system:   CentOS 7 X86_64
Description:

During the pressure test using pgbench, the following error occurred, but
the object with oid=16396 was queried through pg_class, and it did not
exist. Is this a bug?

pgbench -c 512 -j 8 -M simple -P 2 -T 600 -n
......
WARNING: page verification failed, calculated checksum 22845 but expected 3622
pgbench: error: client 418 script 0 aborted in command 5 query 0: ERROR: invalid page in block 48309 of relation base/14103/16396
progress: 74.0 s, 3145.2 tps, lat 159.570 ms stddev 91.381
progress: 76.0 s, 3247.4 tps, lat 154.893 ms stddev 87.175
progress: 78.0 s, 3327.4 tps, lat 151.347 ms stddev 90.210
progress: 80.0 s, 3461.3 tps, lat 146.127 ms stddev 80.133
progress: 82.0 s, 3375.2 tps, lat 149.019 ms stddev 84.425
WARNING: page verification failed, calculated checksum 22570 but expected 38935
pgbench: error: client 359 script 0 aborted in command 5 query 0: ERROR: invalid page in block 48183 of relation base/14103/16396
WARNING: page verification failed, calculated checksum 22845 but expected 3622
pgbench: error: client 66 script 0 aborted in command 5 query 0: ERROR: invalid page in block 48309 of relation base/14103/16396
progress: 84.0 s, 3379.2 tps, lat 148.318 ms stddev 94.854
progress: 86.0 s, 3541.5 tps, lat 131.812 ms stddev 82.052
.......

Object information in the database:

postgres=# select * from pg_class where oid=16396;
(0 rows)

postgres=# select * from pg_database;
-[ RECORD 1 ]-+------------------------------------
oid           | 14103
datname       | postgres
datdba        | 10
encoding      | 6
datcollate    | zh_CN.UTF-8
datctype      | zh_CN.UTF-8
datistemplate | f
datallowconn  | t
datconnlimit  | -1
datlastsysoid | 14102
datfrozenxid  | 478
datminmxid    | 1
dattablespace | 1663
datacl        |
On Thu, Oct 08, 2020 at 03:49:58PM +0000, PG Bug reporting form wrote:
> During the pressure test using pgbench, the following error occurred, but
> the object with oid=16396 was queried through pg_class, and it did not
> exist. Is this a bug?

Unlikely one in Postgres itself, I would recommend to be very careful
with this instance :(

A data checksum failure, as the one you are seeing here, means that an
8k page of a relation file that Postgres has flushed out to disk in
the past has been loaded back with some unexpected data. This means
that a source external to Postgres has changed this data. I have seen
this class of failures with problems involving either the kernel, the
file system, the hardware, or even some layer in charge of the host
virtualization, if your host is a VM of course. So something has
likely managed the flush request thought as completed by Postgres in a
non-durable way, or something could have directly changed the on-disk
data with the flush request actually done correctly, which is even a
worse problem. What this error tells is that the problem does not
come from Postgres itself.
--
Michael
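To look at a reported page directly, the block number in the error message can be converted into a segment file and a byte offset. The following is a minimal sketch, assuming a default build: 8 kB pages (as mentioned above) and 1 GB relation segments, i.e. 131072 blocks per segment file. The function name `locate_block` and the constants are illustrative and not part of PostgreSQL.

```python
import re

BLOCK_SIZE = 8192            # assumption: default 8 kB page size
BLOCKS_PER_SEGMENT = 131072  # assumption: default 1 GB relation segments

# Matches the server error text quoted in the report.
ERROR_RE = re.compile(r"invalid page in block (\d+) of relation (\S+)")


def locate_block(message):
    """Return (segment file path, byte offset) for an 'invalid page' error."""
    match = ERROR_RE.search(message)
    if match is None:
        raise ValueError("not an 'invalid page' error message")
    block = int(match.group(1))
    relpath = match.group(2)
    segment, block_in_segment = divmod(block, BLOCKS_PER_SEGMENT)
    # Segment 0 has no suffix; later segments are named <file>.1, <file>.2, ...
    path = relpath if segment == 0 else f"{relpath}.{segment}"
    return path, block_in_segment * BLOCK_SIZE


if __name__ == "__main__":
    msg = "ERROR: invalid page in block 48309 of relation base/14103/16396"
    print(locate_block(msg))  # ('base/14103/16396', 395747328)
```

The returned path is relative to the data directory. Note also that the last path component is the relation's file node, which is not necessarily equal to its OID in pg_class.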
When huge_pages is set to off or try, there is no block damage problem during the pgbench pressure test. When it is set to on, the problem occurs.
sysctl -p
vm.nr_hugepages = 9000
vm.swappiness = 0
vm.overcommit_memory = 2
vm.overcommit_ratio = 98
vm.min_free_kbytes = 1024000
vm.dirty_background_ratio = 10
vm.dirty_ratio = 95
vm.vfs_cache_pressure = 150
fs.file-max = 6815744
fs.aio-max-nr = 1048576
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.core.wmem_default = 8388608
net.core.rmem_default = 8388608
net.ipv4.tcp_syncookies = 1
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.ip_local_port_range = 20000 60999
net.ipv4.tcp_max_syn_backlog = 8192
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.conf.all.rp_filter = 0
net.ipv4.conf.all.arp_filter = 0
net.ipv4.conf.default.rp_filter = 0
net.ipv4.conf.default.arp_filter = 0
net.ipv4.conf.lo.rp_filter = 0
net.ipv4.conf.lo.arp_filter = 0
kernel.sem = 250 32000 100 128
kernel.shmmni = 4906
kernel.shmall = 41231686041
kernel.shmmax = 25769803776
kernel.sysrq = 0
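For orientation, the size of the huge page pool reserved by the settings above can be estimated with a few lines of arithmetic. This is a rough sketch that assumes the common x86_64 huge page size of 2 MiB (the actual value is shown as `Hugepagesize` in /proc/meminfo); the helper names are hypothetical.

```python
HUGE_PAGE_BYTES = 2 * 1024 * 1024  # assumption: 2 MiB huge pages


def parse_sysctl(text):
    """Parse 'key = value' lines, as printed by sysctl -p, into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    return settings


def huge_page_pool_bytes(settings, page_bytes=HUGE_PAGE_BYTES):
    """Total memory reserved for huge pages, in bytes."""
    return int(settings["vm.nr_hugepages"]) * page_bytes


if __name__ == "__main__":
    sample = "vm.nr_hugepages = 9000\nkernel.shmmax = 25769803776"
    settings = parse_sysctl(sample)
    print(huge_page_pool_bytes(settings) / 2**30)  # pool size in GiB
    print(int(settings["kernel.shmmax"]) / 2**30)  # shmmax in GiB
```

Under that assumption, 9000 huge pages reserve about 17.6 GiB, while kernel.shmmax allows segments of up to 24 GiB. The thread does not state the shared_buffers value, so whether the pool covers the whole shared memory request cannot be determined from the output above.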
------------------ Original ------------------
From: "Michael Paquier" <michael@paquier.xyz>;
Date: Fri, Oct 9, 2020 08:32 AM
To: "两个孩子的爹"<1726002692@qq.com>;"pgsql-bugs"<pgsql-bugs@lists.postgresql.org>;
Subject: Re: BUG #16662: pgbench: error: client 418 script 0 aborted in command 5 query 0: ERROR: invalid page in block 4830
> During the pressure test using pgbench, the following error occurred, but
> the object with oid=16396 was queried through pg_class, and it did not
> exist. Is this a bug?
Unlikely one in Postgres itself, I would recommend to be very careful
with this instance :(
A data checksum failure, as the one you are seeing here, means that an
8k page of a relation file that Postgres has flushed out to disk in
the past has been loaded back with some unexpected data. This means
that a source external to Postgres has changed this data. I have seen
this class of failures with problems involving either the kernel, the
file system, the hardware, or even some layer in charge of the host
virtualization, if your host is a VM of course. So something has
likely managed the flush request thought as completed by Postgres in a
non-durable way, or something could have directly changed the on-disk
data with the flush request actually done correctly, which is even a
worse problem. What this error tells is that the problem does not
come from Postgres itself.
--
Michael