Обсуждение: [HACKERS] [TRAP: FailedAssertion] causing server to crash

Поиск
Список
Период
Сортировка

[HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Neha Sharma
Дата:
Hi,

I am getting FailedAssertion while executing the attached script.However,I am not able to produce the core dump for the same,the script runs in background and takes around a day time to produce the mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact, ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"

Log file attached for reference.

Thanks.
--
Regards,
Neha Sharma

Вложения

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Robert Haas
Дата:
On Wed, Jul 19, 2017 at 8:26 AM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
> I am getting FailedAssertion while executing the attached script.However,I
> am not able to produce the core dump for the same,the script runs in
> background and takes around a day time to produce the mentioned error.

Maybe you should run it for another day with core dumps enabled (try
the "-c" option to pg_ctl) and then pull a backtrace if you get one.
This is a pretty generic script so it's going to be hard to guess
where the problem is otherwise.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Alvaro Herrera
Дата:
Robert Haas wrote:
> On Wed, Jul 19, 2017 at 8:26 AM, Neha Sharma
> <neha.sharma@enterprisedb.com> wrote:
> > I am getting FailedAssertion while executing the attached script.However,I
> > am not able to produce the core dump for the same,the script runs in
> > background and takes around a day time to produce the mentioned error.
> 
> Maybe you should run it for another day with core dumps enabled (try
> the "-c" option to pg_ctl) and then pull a backtrace if you get one.
> This is a pretty generic script so it's going to be hard to guess
> where the problem is otherwise.

Also, please state server version used.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Craig Ringer
Дата:
On 19 July 2017 at 20:26, Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi,

I am getting FailedAssertion while executing the attached script.However,I am not able to produce the core dump for the same,the script runs in background and takes around a day time to produce the mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact, ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"

It's not much of a stretch to assume this was likely introduced by ea42cc18c:


so I'll have a look to see if I can spot how it could happen.

Please show your "SELECT version()" and if using a build from git, the "git rev-parse --short HEAD" for the build you're running.

It's interesting and probably significant that your Pg was nearing wraparound. Just a tip for that - if you want to create a situation where you near wraparound quickly and easily, for testing, you can, on a test instance you don't care about:

- select txid_current();
- stop pg
- pg_resetxlog -n $(( $the_xid + (1<<31) - 1001000 ))
- try to start pg, note the missing clog segment it complains about
- dd if=/dev/zero bs=1 count=262144 of=datadir/pg_clog/$MISSINGSEGNAME
- start Pg


That should put you about 1000 txn's from the 1 million xid limit, assuming I got my maths right (don't assume that), and assuming your starting xid is well short of the 2-million mark.


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Craig Ringer
Дата:
On 19 July 2017 at 20:26, Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi,

I am getting FailedAssertion while executing the attached script.However,I am not able to produce the core dump for the same,the script runs in background and takes around a day time to produce the mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact, ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"


What are the starting conditions of your postgres instance? Does your script assume a newly initdb'd instance with no custom configuration? If not, what setup steps/configuration precede your script run?



 
well short of the 2-million mark.

Er, billion. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Neha Sharma
Дата:
Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below. Instead of starting it with postgres binary i have set it in conf file and starting the server with pg_ctl. I am waiting for the results,once the core dump is generated will share the details.

Test code Snippet :

#Pre condition:
#Create and initialize the database and set the export PGDATA
export PGDATA=/home/centos/PG_sources/postgresql/inst/bin/data
export PGPORT=5432
export LD_LIBRARY_PATH=/home/centos/PG_sources/postgresql/inst/lib:$LD_LIBRARY_PATH
#for i in "scale_factor shared_buffers time_for_readings no_of_readings orig_or_patch"
for i in "300 8GB 1800 3"
do
scale_factor=`echo $i | cut -d" " -f1`
shared_bufs=`echo $i | cut -d" " -f2`
time_for_reading=`echo $i | cut -d" " -f3`
no_of_readings=`echo $i | cut -d" " -f4`

# -----------------------------------------------

echo "Start of script for $scale_factor $shared_bufs " >> /home/centos/test_results.txt


echo "============== $run_bin =============" >> /home/centos/test_results.txt

for threads in 1 8 16 24 32 40 48 56 64 72 80 88 96 104 112 120 128
#for threads in 8 16
do      
        #Start taking reading
    for ((readcnt = 1 ; readcnt <= $no_of_readings ; readcnt++))
    do
        echo "================================================"  >> /home/centos/test_results.txt
        echo $scale_factor, $shared_bufs, $threads, $threads, $time_for_reading Reading - ${readcnt}  >> /home/centos/test_results.txt
        #start server
                ./pg_ctl -D data -c -l logfile start
        #./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &
        sleep 5
        #drop and recreate database
        ./dropdb test
        ./createdb test
        #initialize database
        ./pgbench -i -s $scale_factor test
        sleep 5
        # Run pgbench   
        ./pgbench -c $threads -j $threads -T $time_for_reading -M prepared test  >> /home/centos/test_results.txt
        sleep 10
        ./psql -d test -c "checkpoint" >> /home/centos/test_results.txt
        ./pg_ctl stop
    done;
done;

sleep 1


mv /home/centos/test_results.txt /home/centos/test_results_list_${scale_factor}_${shared_bufs}_rw.txt
done;



Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 11:23 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 19 July 2017 at 20:26, Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi,

I am getting FailedAssertion while executing the attached script.However,I am not able to produce the core dump for the same,the script runs in background and takes around a day time to produce the mentioned error.

"TRAP: FailedAssertion("!(TransactionIdPrecedesOrEquals(oldestXact, ShmemVariableCache->oldestXid))", File: "clog.c", Line: 683)
2017-07-19 01:16:51.973 GMT [27873] LOG:  server process (PID 28084) was terminated by signal 6: Aborted
2017-07-19 01:16:51.973 GMT [27873] DETAIL:  Failed process was running: autovacuum: VACUUM pg_toast.pg_toast_13029 (to prevent wraparound)"


What are the starting conditions of your postgres instance? Does your script assume a newly initdb'd instance with no custom configuration? If not, what setup steps/configuration precede your script run?



 
well short of the 2-million mark.

Er, billion. 

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Craig Ringer
Дата:
On 20 July 2017 at 15:00, Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below. Instead of starting it with postgres binary i have set it in conf file and starting the server with pg_ctl. I am waiting for the results,once the core dump is generated will share the details. 

Thanks.

To verify that you do get a coredump, you might want to consider sending a kill -SEGV to a backend and make sure that it actually dumps core and you can find the core.

Ideally you'd actually set the coredumps to include shmem (see coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but with 8GB shared_buffers that may not be practical. It'd be very useful if possible.

If this is wraparound-related, as it appears to be, you might get faster results by using a custom pgbench script for one or more workers that just runs txid_current() a whole lot. Or jump the server's xid space forward.

I've got a few other things on right now but I'll keep an eye out and hope for a core dump.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Neha Sharma
Дата:


Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 20 July 2017 at 15:00, Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below. Instead of starting it with postgres binary i have set it in conf file and starting the server with pg_ctl. I am waiting for the results,once the core dump is generated will share the details. 

Thanks.

To verify that you do get a coredump, you might want to consider sending a kill -SEGV to a backend and make sure that it actually dumps core and you can find the core.

Ideally you'd actually set the coredumps to include shmem (see coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but with 8GB shared_buffers that may not be practical. It'd be very useful if possible.

If this is wraparound-related, as it appears to be, you might get faster results by using a custom pgbench script for one or more workers that just runs txid_current() a whole lot. Or jump the server's xid space forward.
Thanks. Will put together suggestions to get the result.

I've got a few other things on right now but I'll keep an eye out and hope for a core dump.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Neha Sharma
Дата:
Hi,

Attached is the core dump file received on PG 10beta2 version.


Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 2:45 PM, Neha Sharma <neha.sharma@enterprisedb.com> wrote:


Regards,
Neha Sharma

On Thu, Jul 20, 2017 at 1:28 PM, Craig Ringer <craig@2ndquadrant.com> wrote:
On 20 July 2017 at 15:00, Neha Sharma <neha.sharma@enterprisedb.com> wrote:
Hi Craig,

I had done a fresh initdb,the default parameter configuration was used. I was setting few set of parameters while startup by the below command.

./postgres -d postgres -c shared_buffers=$shared_bufs -N 200 -c min_wal_size=15GB -c max_wal_size=20GB -c checkpoint_timeout=900 -c maintenance_work_mem=1GB -c checkpoint_completion_target=0.9 &

Now I have modified the script a bit with Robert's suggestion as below. Instead of starting it with postgres binary i have set it in conf file and starting the server with pg_ctl. I am waiting for the results,once the core dump is generated will share the details. 

Thanks.

To verify that you do get a coredump, you might want to consider sending a kill -SEGV to a backend and make sure that it actually dumps core and you can find the core.

Ideally you'd actually set the coredumps to include shmem (see coredump_filter in http://man7.org/linux/man-pages/man5/core.5.html), but with 8GB shared_buffers that may not be practical. It'd be very useful if possible.

If this is wraparound-related, as it appears to be, you might get faster results by using a custom pgbench script for one or more workers that just runs txid_current() a whole lot. Or jump the server's xid space forward.
Thanks. Will put together suggestions to get the result.

I've got a few other things on right now but I'll keep an eye out and hope for a core dump.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Thomas Munro
Дата:
On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
>
> Attached is the core dump file received on PG 10beta2 version.

Thanks Neha.  It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG().  What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway?  Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently.  They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid.  What am I missing here?

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Neha Sharma
Дата:
Here is the back trace from the core dump attached.

(gdb) bt
#0  0x00007f4a71424495 in raise () from /lib64/libc.so.6
#1  0x00007f4a71425c75 in abort () from /lib64/libc.so.6
#2  0x00000000009dc18a in ExceptionalCondition (conditionName=0xa905d0 "!(TransactionIdPrecedesOrEquals(oldestXact, ShmemVariableCache->oldestXid))",
    errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c", lineNumber=683) at assert.c:54
#3  0x0000000000524215 in TruncateCLOG (oldestXact=150036635, oldestxid_datoid=13164) at clog.c:682
#4  0x00000000006a6be8 in vac_truncate_clog (frozenXID=150036635, minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at vacuum.c:1197
#5  0x00000000006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
#6  0x00000000007ce0a2 in do_autovacuum () at autovacuum.c:2625
#7  0x00000000007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at autovacuum.c:1715
#8  0x00000000007cc562 in StartAutoVacWorker () at autovacuum.c:1512
#9  0x00000000007e2acd in StartAutovacuumWorker () at postmaster.c:5414
#10 0x00000000007e257e in sigusr1_handler (postgres_signal_arg=10) at postmaster.c:5111
#11 <signal handler called>
#12 0x00007f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
#13 0x00000000007dde88 in ServerLoop () at postmaster.c:1717
#14 0x00000000007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at postmaster.c:1361
#15 0x000000000071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
(gdb) print ShmemVariableCache->oldestXid
$3 = 548


Regards,
Neha Sharma

On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
<neha.sharma@enterprisedb.com> wrote:
>
> Attached is the core dump file received on PG 10beta2 version.

Thanks Neha.  It's be best to post the back trace and if possible
print oldestXact and ShmemVariableCache->oldestXid from the stack
frame for TruncateCLOG.

The failing assertion in TruncateCLOG() has a comment that says
"vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
*after* it calls TruncateCLOG().  What am I missing here?

What actually prevents ShmemVariableCache->oldestXid from going
backwards anyway?  Suppose there are two or more autovacuum processes
that reach vac_truncate_clog() concurrently.  They do a scan of
pg_database whose tuples they access without locking through a
pointer-to-volatile because they expect concurrent in-place writers,
come up with a value for frozenXID, and then arrive at
SetTransactionIdLimit() in whatever order and clobber
ShmemVariableCache->oldestXid.  What am I missing here?

--
Thomas Munro
http://www.enterprisedb.com

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Kyotaro HORIGUCHI
Дата:
At Fri, 21 Jul 2017 11:39:38 +0530, Neha Sharma <neha.sharma@enterprisedb.com> wrote in
<CANiYTQuZm+hDvuHB14d65SkL2ko98ESR3Jf2kUiX=m1haL=xrg@mail.gmail.com>
> Here is the back trace from the core dump attached.
> 
> (gdb) bt
> #0  0x00007f4a71424495 in raise () from /lib64/libc.so.6
> #1  0x00007f4a71425c75 in abort () from /lib64/libc.so.6
> #2  0x00000000009dc18a in ExceptionalCondition (conditionName=0xa905d0
> "!(TransactionIdPrecedesOrEquals(oldestXact,
> ShmemVariableCache->oldestXid))",
>     errorType=0xa9044f "FailedAssertion", fileName=0xa90448 "clog.c",
> lineNumber=683) at assert.c:54
> #3  0x0000000000524215 in TruncateCLOG (oldestXact=150036635,
> oldestxid_datoid=13164) at clog.c:682

In vac_truncate_clog, TruncateCLOG is called before
SetTransactionIdLimit, which advances
ShmemVariableCache->oldestXid. Given that the assertion in
TruncateCLOG is valid, they should be called in reverse order. I
suppose that CLOG files can be safely truncated after advancing
XID limits.

By the way, the attached patch is made by "git diff --patience".

filterdiff converts it into somewhat wrong shape. Specifically,
the result is missing the addition part of the difference, as the
second attached patch. I'm not sure which of git(2.9.2) or
filterdiff (0.3.3), (or me?) is doing wrong..


> #4  0x00000000006a6be8 in vac_truncate_clog (frozenXID=150036635,
> minMulti=1, lastSaneFrozenXid=200562449, lastSaneMinMulti=1) at
> vacuum.c:1197
> #5  0x00000000006a6948 in vac_update_datfrozenxid () at vacuum.c:1063
> #6  0x00000000007ce0a2 in do_autovacuum () at autovacuum.c:2625
> #7  0x00000000007cc987 in AutoVacWorkerMain (argc=0, argv=0x0) at
> autovacuum.c:1715
> #8  0x00000000007cc562 in StartAutoVacWorker () at autovacuum.c:1512
> #9  0x00000000007e2acd in StartAutovacuumWorker () at postmaster.c:5414
> #10 0x00000000007e257e in sigusr1_handler (postgres_signal_arg=10) at
> postmaster.c:5111
> #11 <signal handler called>
> #12 0x00007f4a714d3603 in __select_nocancel () from /lib64/libc.so.6
> #13 0x00000000007dde88 in ServerLoop () at postmaster.c:1717
> #14 0x00000000007dd67d in PostmasterMain (argc=3, argv=0x2eb8b00) at
> postmaster.c:1361
> #15 0x000000000071a218 in main (argc=3, argv=0x2eb8b00) at main.c:228
> (gdb) print ShmemVariableCache->oldestXid
> $3 = 548
> 
> 
> Regards,
> Neha Sharma
> 
> On Fri, Jul 21, 2017 at 11:01 AM, Thomas Munro <
> thomas.munro@enterprisedb.com> wrote:
> 
> > On Fri, Jul 21, 2017 at 4:16 PM, Neha Sharma
> > <neha.sharma@enterprisedb.com> wrote:
> > >
> > > Attached is the core dump file received on PG 10beta2 version.
> >
> > Thanks Neha.  It's be best to post the back trace and if possible
> > print oldestXact and ShmemVariableCache->oldestXid from the stack
> > frame for TruncateCLOG.
> >
> > The failing assertion in TruncateCLOG() has a comment that says
> > "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> > calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> > *after* it calls TruncateCLOG().  What am I missing here?
> >
> > What actually prevents ShmemVariableCache->oldestXid from going
> > backwards anyway?  Suppose there are two or more autovacuum processes
> > that reach vac_truncate_clog() concurrently.  They do a scan of
> > pg_database whose tuples they access without locking through a
> > pointer-to-volatile because they expect concurrent in-place writers,
> > come up with a value for frozenXID, and then arrive at
> > SetTransactionIdLimit() in whatever order and clobber
> > ShmemVariableCache->oldestXid.  What am I missing here?
> >
> > --
> > Thomas Munro
> > http://www.enterprisedb.com
> >

-- 
堀口恭太郎

日本電信電話株式会社 NTTオープンソースソフトウェアセンタ
Phone: 03-5860-5115 / Fax: 03-5463-5490
diff --git a/src/backend/commands/vacuum.c b/src/backend/commands/vacuum.c
index faa1812..cd8be92 100644
--- a/src/backend/commands/vacuum.c
+++ b/src/backend/commands/vacuum.c
@@ -1192,13 +1192,6 @@ vac_truncate_clog(TransactionId frozenXID,    AdvanceOldestCommitTsXid(frozenXID);    /*
-     * Truncate CLOG, multixact and CommitTs to the oldest computed value.
-     */
-    TruncateCLOG(frozenXID, oldestxid_datoid);
-    TruncateCommitTs(frozenXID);
-    TruncateMultiXact(minMulti, minmulti_datoid);
-
-    /*     * Update the wrap limit for GetNewTransactionId and creation of new     * MultiXactIds.  Note: these
functionswill also signal the postmaster     * for an(other) autovac cycle if needed.   XXX should we avoid possibly
 
@@ -1206,6 +1199,14 @@ vac_truncate_clog(TransactionId frozenXID,     */    SetTransactionIdLimit(frozenXID,
oldestxid_datoid);   SetMultiXactIdLimit(minMulti, minmulti_datoid, false);
 
+
+    /*
+     * Truncate CLOG, multixact and CommitTs to the oldest computed value
+     * after advancing xid limits.
+     */
+    TruncateCLOG(frozenXID, oldestxid_datoid);
+    TruncateCommitTs(frozenXID);
+    TruncateMultiXact(minMulti, minmulti_datoid);}
*** a/src/backend/commands/vacuum.c
--- b/src/backend/commands/vacuum.c
***************
*** 1192,1204 **** vac_truncate_clog(TransactionId frozenXID,     AdvanceOldestCommitTsXid(frozenXID);      /*
-      * Truncate CLOG, multixact and CommitTs to the oldest computed value.
-      */
-     TruncateCLOG(frozenXID, oldestxid_datoid);
-     TruncateCommitTs(frozenXID);
-     TruncateMultiXact(minMulti, minmulti_datoid);
- 
-     /*      * Update the wrap limit for GetNewTransactionId and creation of new      * MultiXactIds.  Note: these
functionswill also signal the postmaster      * for an(other) autovac cycle if needed.   XXX should we avoid possibly
 
--- 1192,1197 ----

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Robert Haas
Дата:
On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
<thomas.munro@enterprisedb.com> wrote:
> Thanks Neha.  It's be best to post the back trace and if possible
> print oldestXact and ShmemVariableCache->oldestXid from the stack
> frame for TruncateCLOG.
>
> The failing assertion in TruncateCLOG() has a comment that says
> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> *after* it calls TruncateCLOG().  What am I missing here?

This problem was introduced by commit
ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
the PostgreSQL 10 open items list. That commit intended to introduce a
distinction between (1) the oldest XID that can be safely examined and
(2) the oldest XID that can't yet be safely reused.  These are the
same except when we're in the middle of truncating CLOG: (1) advances
before the truncation, and (2) advances afterwards. That's why
AdvanceOldestClogXid() happens before truncation proper and
SetTransactionIdLimit() happens afterwards, and changing the order
would, I think, be quite wrong.

AFAICS, that assertion is simply a holdover from an earlier version of
the patch that escaped review.  There's just no reason to suppose that
it's true.

> What actually prevents ShmemVariableCache->oldestXid from going
> backwards anyway?  Suppose there are two or more autovacuum processes
> that reach vac_truncate_clog() concurrently.  They do a scan of
> pg_database whose tuples they access without locking through a
> pointer-to-volatile because they expect concurrent in-place writers,
> come up with a value for frozenXID, and then arrive at
> SetTransactionIdLimit() in whatever order and clobber
> ShmemVariableCache->oldestXid.  What am I missing here?

Hmm, there could be a bug there, but I don't think it's *this* bug.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Thomas Munro
Дата:
On Thu, Aug 3, 2017 at 3:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
> <thomas.munro@enterprisedb.com> wrote:
>> Thanks Neha.  It's be best to post the back trace and if possible
>> print oldestXact and ShmemVariableCache->oldestXid from the stack
>> frame for TruncateCLOG.
>>
>> The failing assertion in TruncateCLOG() has a comment that says
>> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
>> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
>> *after* it calls TruncateCLOG().  What am I missing here?
>
> This problem was introduced by commit
> ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
> the PostgreSQL 10 open items list. That commit intended to introduce a
> distinction between (1) the oldest XID that can be safely examined and
> (2) the oldest XID that can't yet be safely reused.  These are the
> same except when we're in the middle of truncating CLOG: (1) advances
> before the truncation, and (2) advances afterwards. That's why
> AdvanceOldestClogXid() happens before truncation proper and
> SetTransactionIdLimit() happens afterwards, and changing the order
> would, I think, be quite wrong.

Added to open items.

> AFAICS, that assertion is simply a holdover from an earlier version of
> the patch that escaped review.  There's just no reason to suppose that
> it's true.

Craig, are you planning to post a patch to remove the assertion, or
make it hold?

>> What actually prevents ShmemVariableCache->oldestXid from going
>> backwards anyway?  Suppose there are two or more autovacuum processes
>> that reach vac_truncate_clog() concurrently.  They do a scan of
>> pg_database whose tuples they access without locking through a
>> pointer-to-volatile because they expect concurrent in-place writers,
>> come up with a value for frozenXID, and then arrive at
>> SetTransactionIdLimit() in whatever order and clobber
>> ShmemVariableCache->oldestXid.  What am I missing here?
>
> Hmm, there could be a bug there, but I don't think it's *this* bug.

I'm not sure that it wrong per se, as long as no one asserts that the
number can't go backwards...

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Thomas Munro
Дата:
On Fri, Jul 21, 2017 at 7:17 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> In vac_truncate_clog, TruncateCLOG is called before
> SetTransactionIdLimit, which advances
> ShmemVariableCache->oldestXid. Given that the assertion in
> TruncateCLOG is valid, they should be called in reverse order. I
> suppose that CLOG files can be safely truncated after advancing
> XID limits.

If we keep the assertion by changing the order of changes to match the
comment like this, then don't we still have a problem if another
backend moves it backwards because of the data race I mentioned?  That
too could be fixed (perhaps by teaching SetTransactionIdLimit not to
overwrite higher values), but it sounds like the assertion might be a
mistake.

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Craig Ringer
Дата:
On 7 August 2017 at 14:04, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Jul 21, 2017 at 7:17 PM, Kyotaro HORIGUCHI
<horiguchi.kyotaro@lab.ntt.co.jp> wrote:
> In vac_truncate_clog, TruncateCLOG is called before
> SetTransactionIdLimit, which advances
> ShmemVariableCache->oldestXid. Given that the assertion in
> TruncateCLOG is valid, they should be called in reverse order. I
> suppose that CLOG files can be safely truncated after advancing
> XID limits.

If we keep the assertion by changing the order of changes to match the
comment like this, then don't we still have a problem if another
backend moves it backwards because of the data race I mentioned?  That
too could be fixed (perhaps by teaching SetTransactionIdLimit not to
overwrite higher values), but it sounds like the assertion might be a
mistake.

I think so - specifically, that it's a leftover from a revision where the xid limit was advanced before clog truncation.

I'll be finding time in the next couple of days to look more closely and ensure that's all it is.

--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Noah Misch
Дата:
On Mon, Aug 07, 2017 at 05:29:34PM +1200, Thomas Munro wrote:
> On Thu, Aug 3, 2017 at 3:03 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> > On Fri, Jul 21, 2017 at 1:31 AM, Thomas Munro
> > <thomas.munro@enterprisedb.com> wrote:
> >> Thanks Neha.  It's be best to post the back trace and if possible
> >> print oldestXact and ShmemVariableCache->oldestXid from the stack
> >> frame for TruncateCLOG.
> >>
> >> The failing assertion in TruncateCLOG() has a comment that says
> >> "vac_truncate_clog already advanced oldestXid", but vac_truncate_clog
> >> calls SetTransactionIdLimit() to write ShmemVariableCache->oldestXid
> >> *after* it calls TruncateCLOG().  What am I missing here?
> >
> > This problem was introduced by commit
> > ea42cc18c35381f639d45628d792e790ff39e271, so this should be added to
> > the PostgreSQL 10 open items list. That commit intended to introduce a
> > distinction between (1) the oldest XID that can be safely examined and
> > (2) the oldest XID that can't yet be safely reused.  These are the
> > same except when we're in the middle of truncating CLOG: (1) advances
> > before the truncation, and (2) advances afterwards. That's why
> > AdvanceOldestClogXid() happens before truncation proper and
> > SetTransactionIdLimit() happens afterwards, and changing the order
> > would, I think, be quite wrong.
> 
> Added to open items.

[Action required within three days.  This is a generic notification.]

The above-described topic is currently a PostgreSQL 10 open item.  Robert,
since you committed the patch believed to have created it, you own this open
item.  If some other commit is more relevant or if this does not belong as a
v10 open item, please let us know.  Otherwise, please observe the policy on
open item ownership[1] and send a status update within three calendar days of
this message.  Include a date for your subsequent status update.  Testers may
discover new open items at any time, and I want to plan to get them all fixed
well in advance of shipping v10.  Consequently, I will appreciate your efforts
toward speedy resolution.  Thanks.

[1] https://www.postgresql.org/message-id/20170404140717.GA2675809%40tornado.leadboat.com



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Robert Haas
Дата:
On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> I think so - specifically, that it's a leftover from a revision where the
> xid limit was advanced before clog truncation.
>
> I'll be finding time in the next couple of days to look more closely and
> ensure that's all it is.

A couple of days having gone by without hearing further on this, I'm
going to assume for the moment that my analysis is correct and just
push a commit to remove this assertion.  That's not intended to
discourage you from spending further time on this - it would be bad to
be wrong about this - but doing something we all agree is wrong can't
be better than doing something that I, at least, think is correct.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] [TRAP: FailedAssertion] causing server to crash

От
Craig Ringer
Дата:
On 10 August 2017 at 23:25, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Aug 7, 2017 at 2:06 AM, Craig Ringer <craig@2ndquadrant.com> wrote:
> I think so - specifically, that it's a leftover from a revision where the
> xid limit was advanced before clog truncation.
>
> I'll be finding time in the next couple of days to look more closely and
> ensure that's all it is.

A couple of days having gone by without hearing further on this, I'm
going to assume for the moment that my analysis is correct and just
push a commit to remove this assertion.  That's not intended to
discourage you from spending further time on this - it would be bad to
be wrong about this - but doing something we all agree is wrong can't
be better than doing something that I, at least, think is correct.

 
Thanks. 

Keeps getting bumped down by other things, but it's on the stack to not get lost


--
 Craig Ringer                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services