Thread: BUG #15114: logical decoding Segmentation fault

BUG #15114: logical decoding Segmentation fault

From: PG Bug reporting form
The following bug has been logged on the website:

Bug reference:      15114
Logged by:          Peter Slavov
Email address:      pet.slavov@gmail.com
PostgreSQL version: 10.3
Operating system:   Ubuntu xenial
Description:

Hi,
I have a Segmentation fault problem when trying to start logical decoding on
the latest Postgres version - 10.3 (package version 10.3-1.pgdg16.04+1).
Here is what happens:

--- Query on master db server ---
create publication big_tables FOR TABLE table1, table2, table3, table4;

--- Query on the logical replica db server ---
create subscription sub_name CONNECTION 'host=master.server dbname=db_name
user=db_user password=db_password' PUBLICATION big_tables ;

--- Code dump: ---
------------------
Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols
from
/usr/lib/debug/.build-id/a3/2aff963eb198ec48d1946523ef49379fa49c5e.debug...done.
done.
[New LWP 26112]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by 'postgres: 10/main: wal sender process lg_replica
ip-10-4-1-11.eu-west-1.compute'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetActiveSnapshot () at
/build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:843
843    /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:
No such file or directory.
(gdb)

--- Logs ---
------------
# Master db server:
2018-03-14 23:30:35 UTC [23352]: [7-1] user=,db= host=,app=[] LOG:  server
process (PID 26112) was terminated by signal 11: Segmentation fault
2018-03-14 23:30:35 UTC [23352]: [8-1] user=,db= host=,app=[] LOG:
terminating any other active server processes

# logical Replica db server:
2018-03-14 23:29:22 UTC [6777]: [3-1] user=postgres,db=db_replica_name
host=[local],app=[psql] LOG:  duration: 73.901 ms  statement: create
subscription sub_name CONNECTION 'host=master.server dbname=db_name
user=db_user password=db_password' PUBLICATION big_tables ;
2018-03-14 23:29:22 UTC [6779]: [1-1] user=,db= host=,app=[] LOG:  logical
replication apply worker for subscription "sub_name" has started
2018-03-14 23:29:22 UTC [6780]: [1-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table1" has started
2018-03-14 23:29:22 UTC [6781]: [1-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table2" has started
2018-03-14 23:29:23 UTC [6780]: [2-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table1" has finished
2018-03-14 23:29:23 UTC [6782]: [1-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table3" has started
2018-03-14 23:29:24 UTC [6781]: [2-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table2" has finished
2018-03-14 23:29:24 UTC [6783]: [1-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table4" has started
2018-03-14 23:29:26 UTC [6783]: [2-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table4" has finished
2018-03-14 23:29:26 UTC [6782]: [2-1] user=,db= host=,app=[] LOG:  logical
replication table synchronization worker for subscription "sub_name", table
"table3" has finished
2018-03-14 23:30:35 UTC [6779]: [2-1] user=,db= host=,app=[] ERROR:  could
not receive data from WAL stream: SSL SYSCALL error: EOF detected
2018-03-14 23:30:35 UTC [6739]: [7-1] user=,db= host=,app=[] LOG:  worker
process: logical replication worker for subscription 17178 (PID 6779) exited
with exit code 1
2018-03-14 23:30:35 UTC [6798]: [1-1] user=,db= host=,app=[] LOG:  logical
replication apply worker for subscription "sub_name" has started
2018-03-14 23:30:35 UTC [6798]: [2-1] user=,db= host=,app=[] ERROR:  could
not connect to the publisher: FATAL:  the database system is in recovery
mode


--- Configurations ---
----------------------
Default Debian postgresql.conf with these changes:

listen_addresses = '*'

max_connections = 500

shared_buffers = 1GB
work_mem = 4MB
maintenance_work_mem = 128MB
effective_cache_size = 4GB
effective_io_concurrency = 5

wal_level = logical
max_wal_senders = 10
max_replication_slots = 10
max_worker_processes = 20
wal_log_hints = off
hot_standby = on

shared_preload_libraries = 'pg_stat_statements'

tcp_keepalives_idle = 300
tcp_keepalives_interval = 10
tcp_keepalives_count = 6

log_min_duration_statement = 5
log_lock_waits = on
log_checkpoints = on

log_destination = 'stderr'
logging_collector = on
log_directory = 'pg_log'
log_filename = '%A.log'
log_truncate_on_rotation = on
log_rotation_age = 1d
log_rotation_size = 0
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d host=%h,app=[%a] '
log_timezone = 'UTC'
log_statement = 'none'
log_duration = off
log_min_duration_statement = 0
log_hostname = on
log_checkpoints = on
log_connections = on
log_disconnections = on

wal_compression = on

ssl_cert_file = '/etc/postgresql/ssl/database_public.crt'
ssl_key_file = '/etc/postgresql/ssl/database_private.key'


Does it sound familiar? What am I doing wrong?




Re: BUG #15114: logical decoding Segmentation fault

From: Alvaro Herrera
PG Bug reporting form wrote:

> --- Code dump: ---
> ------------------
> Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols
> from
> /usr/lib/debug/.build-id/a3/2aff963eb198ec48d1946523ef49379fa49c5e.debug...done.
> done.
> [New LWP 26112]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by 'postgres: 10/main: wal sender process lg_replica
> ip-10-4-1-11.eu-west-1.compute'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  GetActiveSnapshot () at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:843
> 843    /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:
> No such file or directory.
> (gdb)

Can you do a "bt" in that gdb and paste the output, please?



-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15114: logical decoding Segmentation fault

From: Петър Славов
Here is the code dump with bt:

Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols from /usr/lib/debug/.build-id/a3/2aff963eb198ec48d1946523ef49379fa49c5e.debug...done.
done.
[New LWP 26112]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `postgres: 10/main: wal sender process lg_replica ip-10-4-1-11.eu-west-1.compute'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  GetActiveSnapshot () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:843
843    /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c: No such file or directory.
(gdb) bt
#0  GetActiveSnapshot () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:843
#1  0x00005649b753e928 in postquel_start (fcache=0x5649b89b7968, es=0x5649b89c1288) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/executor/functions.c:809
#2  fmgr_sql (fcinfo=0x5649b89b5c48) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/executor/functions.c:1150
#3  0x00005649b7530ae1 in ExecInterpExpr (state=0x5649b89b5b68, econtext=0x5649b89b6468, isnull=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/executor/execExprInterp.c:650
#4  0x00005649b75d9bc7 in ExecEvalExprSwitchContext (isNull=0x7fff8d0a8794 "", econtext=<optimized out>, state=0x5649b89b5b68) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/include/executor/executor.h:308
#5  evaluate_expr (expr=<optimized out>, result_type=result_type@entry=1114, result_typmod=result_typmod@entry=-1, result_collation=result_collation@entry=0) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:4739
#6  0x00005649b75dba31 in evaluate_function (context=0x7fff8d0a8b50, func_tuple=0x7f76e7843548, funcvariadic=0 '\000', args=0x5649b89b2730, input_collid=0, result_collid=0, result_typmod=-1, result_type=1114, funcid=16506)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:4296
#7  simplify_function (funcid=16506, result_type=1114, result_typmod=-1, result_collid=result_collid@entry=0, input_collid=input_collid@entry=0, args_p=args_p@entry=0x7fff8d0a8930, funcvariadic=0 '\000', process_args=1 '\001', allow_non_const=1 '\001', context=0x7fff8d0a8b50)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:3936
#8  0x00005649b75dc449 in eval_const_expressions_mutator (node=0x5649b89b2530, context=0x7fff8d0a8b50) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:2595
#9  0x00005649b757bc8b in expression_tree_mutator (node=<optimized out>, mutator=mutator@entry=0x5649b75dc210 <eval_const_expressions_mutator>, context=context@entry=0x7fff8d0a8b50) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/nodes/nodeFuncs.c:2903
#10 0x00005649b75dbfce in simplify_function (funcid=2057, result_type=16, result_typmod=result_typmod@entry=-1, result_collid=0, input_collid=0, args_p=args_p@entry=0x7fff8d0a8b00, funcvariadic=0 '\000', process_args=1 '\001', allow_non_const=1 '\001', context=0x7fff8d0a8b50)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:3927
#11 0x00005649b75dc38b in eval_const_expressions_mutator (node=node@entry=0x5649b89b0888, context=context@entry=0x7fff8d0a8b50) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:2643
#12 0x00005649b75ddabf in eval_const_expressions (root=root@entry=0x0, node=node@entry=0x5649b89b0888) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:2437
#13 0x00005649b777181a in RelationGetIndexPredicate (relation=relation@entry=0x7f76e7841568) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/cache/relcache.c:4813
#14 0x00005649b7443d60 in BuildIndexInfo (index=index@entry=0x7f76e7841568) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/catalog/index.c:1660
#15 0x00005649b7771b90 in RelationGetIndexAttrBitmap (relation=relation@entry=0x7f76e7836638, attrKind=attrKind@entry=INDEX_ATTR_BITMAP_IDENTITY_KEY) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/cache/relcache.c:4938
#16 0x00005649b7614368 in logicalrep_write_attrs (rel=0x7f76e7836638, out=0x5649b88dee38) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/proto.c:530
#17 logicalrep_write_rel (out=0x5649b88dee38, rel=0x7f76e7836638) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/proto.c:317
#18 0x00007f76998b5e6f in pgoutput_change (ctx=<optimized out>, txn=<optimized out>, relation=0x7f76e7836638, change=0x5649b899b1f8) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/pgoutput/pgoutput.c:324
#19 0x00005649b760fbc9 in change_cb_wrapper (cache=<optimized out>, txn=<optimized out>, relation=<optimized out>, change=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/logical.c:711
#20 0x00005649b761844b in ReorderBufferCommit (rb=0x5649b89848a8, xid=xid@entry=71743261, commit_lsn=commit_lsn@entry=463103922872, end_lsn=end_lsn@entry=463103922928, commit_time=commit_time@entry=574385433206439, origin_id=origin_id@entry=0, origin_lsn=0)
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/reorderbuffer.c:1445
#21 0x00005649b760e012 in DecodeCommit (xid=71743261, parsed=0x7fff8d0a91e0, buf=<synthetic pointer>, ctx=0x5649b88b0d48) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/decode.c:611
#22 DecodeXactOp (buf=<synthetic pointer>, ctx=0x5649b88b0d48) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/decode.c:241
#23 LogicalDecodingProcessRecord (ctx=0x5649b88b0d48, record=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/decode.c:113
#24 0x00005649b761ff19 in XLogSendLogical () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/walsender.c:2781
#25 0x00005649b7621883 in WalSndLoop (send_data=send_data@entry=0x5649b761feb0 <XLogSendLogical>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/walsender.c:2149
#26 0x00005649b76224c3 in StartLogicalReplication (cmd=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/walsender.c:1102
#27 exec_replication_command (cmd_string=cmd_string@entry=0x5649b8966768 "START_REPLICATION SLOT \"rbank_staging\" LOGICAL 0/0 (proto_version '1', publication_names '\"big_tables\"')")
    at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/walsender.c:1539
#28 0x00005649b766ca72 in PostgresMain (argc=<optimized out>, argv=argv@entry=0x5649b88d9990, dbname=0x5649b88d9838 "receiptbank", username=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/tcop/postgres.c:4084
#29 0x00005649b73a60d9 in BackendRun (port=0x5649b88d7000) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:4405
#30 BackendStartup (port=0x5649b88d7000) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:4077
#31 ServerLoop () at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:1755
#32 0x00005649b75fc78b in PostmasterMain (argc=5, argv=<optimized out>) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/postmaster/postmaster.c:1363
#33 0x00005649b73a74d5 in main (argc=5, argv=0x5649b8891850) at /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/main/main.c:228
(gdb)


2018-03-15 15:56 GMT+01:00 Alvaro Herrera <alvherre@alvh.no-ip.org>:
PG Bug reporting form wrote:

> --- Code dump: ---
> ------------------
> Reading symbols from /usr/lib/postgresql/10/bin/postgres...Reading symbols
> from
> /usr/lib/debug/.build-id/a3/2aff963eb198ec48d1946523ef49379fa49c5e.debug...done.
> done.
> [New LWP 26112]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> Core was generated by 'postgres: 10/main: wal sender process lg_replica
> ip-10-4-1-11.eu-west-1.compute'.
> Program terminated with signal SIGSEGV, Segmentation fault.
> #0  GetActiveSnapshot () at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:843
> 843   /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:
> No such file or directory.
> (gdb)

Can you do a "bt" in that gdb and paste the output, please?



--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15114: logical decoding Segmentation fault

From: Andres Freund
Hi,

On 2018-03-15 23:55:32 +0100, Петър Славов wrote:
> Here is the code dump with bt:

Thanks, this is interesting.  Could you show the definition of the
tables involved in this?

For some reason I don't understand, you appear to have managed to get a
predicated index used as the identity column.


> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:2437
> #13 0x00005649b777181a in RelationGetIndexPredicate
> (relation=relation@entry=0x7f76e7841568)
> at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/cache/relcache.c:4813
> #14 0x00005649b7443d60 in BuildIndexInfo (index=index@entry=0x7f76e7841568)
> at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/catalog/index.c:1660
> #15 0x00005649b7771b90 in RelationGetIndexAttrBitmap
> (relation=relation@entry=0x7f76e7836638,
> attrKind=attrKind@entry=INDEX_ATTR_BITMAP_IDENTITY_KEY)
> at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/cache/relcache.c:4938
> #16 0x00005649b7614368 in logicalrep_write_attrs (rel=0x7f76e7836638,
> out=0x5649b88dee38) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/proto.c:530
> #17 logicalrep_write_rel (out=0x5649b88dee38, rel=0x7f76e7836638) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/proto.c:317
> #18 0x00007f76998b5e6f in pgoutput_change (ctx=<optimized out>,
> txn=<optimized out>, relation=0x7f76e7836638, change=0x5649b899b1f8) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/pgoutput/pgoutput.c:324

We should never end up with an identity index that has a predicate. So
something's wrong here...

Greetings,

Andres Freund


Re: BUG #15114: logical decoding Segmentation fault

From: Andres Freund
Hi,

On 2018-03-15 23:55:32 +0100, Петър Славов wrote:
> #0  GetActiveSnapshot () at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/time/snapmgr.c:843
> #1  0x00005649b753e928 in postquel_start (fcache=0x5649b89b7968,
> [...]
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/optimizer/util/clauses.c:2437
> #13 0x00005649b777181a in RelationGetIndexPredicate
> (relation=relation@entry=0x7f76e7841568)
> at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/cache/relcache.c:4813
> #14 0x00005649b7443d60 in BuildIndexInfo (index=index@entry=0x7f76e7841568)
> at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/catalog/index.c:1660
> #15 0x00005649b7771b90 in RelationGetIndexAttrBitmap
> (relation=relation@entry=0x7f76e7836638,
> attrKind=attrKind@entry=INDEX_ATTR_BITMAP_IDENTITY_KEY)
> at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/utils/cache/relcache.c:4938
> #16 0x00005649b7614368 in logicalrep_write_attrs (rel=0x7f76e7836638,
> out=0x5649b88dee38) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/proto.c:530
> #17 logicalrep_write_rel (out=0x5649b88dee38, rel=0x7f76e7836638) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/proto.c:317
> #18 0x00007f76998b5e6f in pgoutput_change (ctx=<optimized out>,
> txn=<optimized out>, relation=0x7f76e7836638, change=0x5649b899b1f8) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/pgoutput/pgoutput.c:324
> #19 0x00005649b760fbc9 in change_cb_wrapper (cache=<optimized out>,
> txn=<optimized out>, relation=<optimized out>, change=<optimized out>) at
> /build/postgresql-10-drhiey/postgresql-10-10.3/build/../src/backend/replication/logical/logical.c:711

Alvaro, Petr: If I understand correctly what happens here is that some
*other* column than the replica identity has a predicate with a full
blown SQL function.  We shouldn't evaluate arbitrary code like that
inside the output plugin!

So RelationGetIndexAttrBitmap() builds a bitmap of *all* indexed
columns. Besides the code exec problem above, this is also just
inefficient.  I think this needs to be replaced by some replident
bespoke code?

Greetings,

Andres Freund


Re: BUG #15114: logical decoding Segmentation fault

From: Alvaro Herrera
Andres Freund wrote:

> Alvaro, Petr: If I understand correctly what happens here is that some
> *other* column than the replica identity has a predicate with a full
> blown SQL function.  We shouldn't evaluate arbitrary code like that
> inside the output plugin!
> 
> So RelationGetIndexAttrBitmap() builds a bitmap of *all* indexed
> columns. Besides the code exec problem above, this is also just
> inefficient.  I think this needs to be replaced by some replident
> bespoke code?

I agree -- I think logicalrep_write_attrs() should not use
RelationGetIndexAttrBitmap at all but instead use some lower-level way
to obtain columns of the replica identity.  I don't know whether there
are historical-snapshot-related considerations to care about, in this
code, though.

This is neither my area nor my code, but if Petr and Peter decline
producing a patch for this quickly, I can look at it in a few days.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15114: logical decoding Segmentation fault

From: Andres Freund
Hi,

On 2018-03-26 13:08:01 -0300, Alvaro Herrera wrote:
> I agree -- I think logicalrep_write_attrs() should not use
> RelationGetIndexAttrBitmap at all but instead use some lower-level way
> to obtain columns of the replica identity.

Indeed.


> I don't know whether there are historical-snapshot-related
> considerations to care about, in this code, though.

There shouldn't be much of an issue if all that's accessed is system
tables/cols.


> This is neither my area nor my code, but if Petr and Peter decline
> producing a patch for this quickly, I can look at it in a few days.

Cool!

Greetings,

Andres Freund


Re: BUG #15114: logical decoding Segmentation fault

From: Петър Славов

Hi Andres,

Alvaro, Petr: If I understand correctly what happens here is that some
*other* column than the replica identity has a predicate with a full
blown SQL function.  We shouldn't evaluate arbitrary code like that
inside the output plugin!
 
Not sure what you mean here, but in my case the replica identity is a regular integer primary key, without any functions as predicates in the index or anything else.

... or is my schema not the problem at all in this case?

Peter Slavov

Re: BUG #15114: logical decoding Segmentation fault

From: Alvaro Herrera
Петър Славов wrote:
> Hi Andres,
> 
> > Alvaro, Petr: If I understand correctly what happens here is that some
> > *other* column than the replica identity has a predicate with a full
> > blown SQL function.  We shouldn't evaluate arbitrary code like that
> > inside the otuput plugin!
> 
> Not sure what you mean here, but in my case the replica identity is a
> regular integer primary key, without any functions as predicates in the
> index or anything else.
> 
> ... or my schema is not the problem at all in this case ?

Andres is saying that you have some other index that contains a
predicate.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15114: logical decoding Segmentation fault

From: Andres Freund
On 2018-03-26 22:59:12 +0200, Петър Славов wrote:
> Hi Andres,
> 
> Alvaro, Petr: If I understand correctly what happens here is that some
> > *other* column than the replica identity has a predicate with a full
> > blown SQL function.  We shouldn't evaluate arbitrary code like that
> > inside the output plugin!
> >
> 
> Not sure what you mean here, but in my case the replica identity is a
> regular integer primary key, without any functions as predicates in the
> index or anything else.

Turns out it's not the replica identity index that's the problem,
it's *another* index with a WHERE clause that evaluates an SQL
function. That's a bug that needs to be fixed. But you might be able to
work around it for now by simplifying your WHERE clauses for the indexes
to not use SQL functions.
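
A hypothetical sketch of that workaround (table, index and function names are
made up here; the reporter's actual schema appears later in the thread):
rebuild the partial index so its predicate uses only built-in operators
instead of a SQL function call, for example:

-- before: partial index whose predicate calls a user-defined SQL function
--   CREATE INDEX some_idx ON some_table (some_col)
--       WHERE created_at > some_sql_func(2);

-- workaround: same index with an inline predicate
DROP INDEX some_idx;
CREATE INDEX some_idx ON some_table (some_col)
    WHERE created_at > '2018-01-01 00:00:00';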

Greetings,

Andres Freund


Re: BUG #15114: logical decoding Segmentation fault

From: Петър Славов
Hi,

Now I understand - I have a partial index with this predicate - "WHERE created_at > months_ago(2)",
where months_ago(int) is an immutable function that returns the current timestamp minus the given number of months.

Just to confirm - removing this index on the testing server stopped this Segmentation fault from happening.
Of course I will have to wait for a fix to be able to use logical decoding in this case.

One more thing - I am almost certain that the same thing has to be fixed in the pglogical extension...

Peter Slavov 

2018-03-26 23:06 GMT+02:00 Andres Freund <andres@anarazel.de>:
On 2018-03-26 22:59:12 +0200, Петър Славов wrote:
> Hi Andres,
>
> Alvaro, Petr: If I understand correctly what happens here is that some
> > *other* column than the replica identity has a predicate with a full
> > blown SQL function.  We shouldn't evaluate arbitrary code like that
> > inside the output plugin!
> >
>
> Not sure what you mean here, but in my case the replica identity is a
> regular integer primary key, without any functions as predicates in the
> index or anything else.

Turns out it's not the replica identity index that's the problem,
it's *another* index with a WHERE clause that evaluates an SQL
function. That's a bug that needs to be fixed. But you might be able to
work around it for now by simplifying your WHERE clauses for the indexes
to not use SQL functions.

Greetings,

Andres Freund

Re: BUG #15114: logical decoding Segmentation fault

From: Alvaro Herrera
Alvaro Herrera wrote:

> I agree -- I think logicalrep_write_attrs() should not use
> RelationGetIndexAttrBitmap at all but instead use some lower-level way
> to obtain columns of the replica identity.  I don't know whether there
> are historical-snapshot-related considerations to care about, in this
> code, though.

Here's a patch.  There is a disadvantage: one additional syscache lookup
each time logicalrep_write_attrs() is called.  I think that's all right,
since that's done once for each relation being replicated (not, say,
once per tuple or once per transaction).  Even so, I tried to move this
code as a special "quick-path" case in RelationGetIndexAttrBitmap, so
that it could be cached separately without processing the other indexes.
That turned out a bit too messy for my taste.
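
A minimal sketch of what such a lower-level lookup could look like (an
illustration only, not the attached patch; the helper name and surrounding
details are assumptions):

/*
 * Hypothetical sketch: fetch the replica identity index's pg_index tuple from
 * the syscache and collect its key columns directly, instead of calling
 * RelationGetIndexAttrBitmap().  Assumes the usual backend headers
 * (utils/rel.h, utils/syscache.h, access/sysattr.h, catalog/pg_index.h).
 */
static Bitmapset *
get_replica_identity_attrs(Relation rel)
{
    Bitmapset  *idattrs = NULL;
    Oid         replidindex = RelationGetReplicaIndex(rel);
    HeapTuple   tuple;
    Form_pg_index indexForm;
    int         i;

    if (!OidIsValid(replidindex))
        return NULL;            /* no replica identity index */

    /* the one extra syscache lookup per relation mentioned above */
    tuple = SearchSysCache1(INDEXRELID, ObjectIdGetDatum(replidindex));
    if (!HeapTupleIsValid(tuple))
        elog(ERROR, "cache lookup failed for index %u", replidindex);
    indexForm = (Form_pg_index) GETSTRUCT(tuple);

    /* collect the key columns of the replica identity index */
    for (i = 0; i < indexForm->indnkeyatts; i++)
        idattrs = bms_add_member(idattrs,
                                 indexForm->indkey.values[i] -
                                 FirstLowInvalidHeapAttributeNumber);

    ReleaseSysCache(tuple);
    return idattrs;
}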


-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Петър Славов
Hi,

I tried to compile this and I got this error:

gcc -Wall -Wmissing-prototypes -Wpointer-arith -Wdeclaration-after-statement -Wendif-labels -Wmissing-format-attribute -Wformat-security -fno-strict-aliasing -fwrapv -fexcess-precision=standard -O2 -I. -I../../../../src/include  -D_GNU_SOURCE   -c -o proto.o proto.c
proto.c: In function ‘logicalrep_write_attrs’:
proto.c:565:17: error: ‘FormData_pg_index {aka struct FormData_pg_index}’ has no member named ‘indnkeyatts’
    natts = index->indnkeyatts;
                 ^
<builtin>: recipe for target 'proto.o' failed
make[4]: *** [proto.o] Error 1
make[4]: Leaving directory '/usr/src/postgresql-10.3/src/backend/replication/logical'
../../../src/backend/common.mk:41: recipe for target 'logical-recursive' failed
make[3]: *** [logical-recursive] Error 2
make[3]: Leaving directory '/usr/src/postgresql-10.3/src/backend/replication'
common.mk:41: recipe for target 'replication-recursive' failed
make[2]: *** [replication-recursive] Error 2
make[2]: Leaving directory '/usr/src/postgresql-10.3/src/backend'
Makefile:38: recipe for target 'all-backend-recurse' failed
make[1]: *** [all-backend-recurse] Error 2
make[1]: Leaving directory '/usr/src/postgresql-10.3/src'
GNUmakefile:11: recipe for target 'all-src-recurse' failed
make: *** [all-src-recurse] Error 2

-----
maybe - index->indnatts; ??

Peter Slavov

2018-04-16 19:59 GMT+02:00 Alvaro Herrera <alvherre@alvh.no-ip.org>:
Alvaro Herrera wrote:

> I agree -- I think logicalrep_write_attrs() should not use
> RelationGetIndexAttrBitmap at all but instead use some lower-level way
> to obtain columns of the replica identity.  I don't know whether there
> are historical-snapshot-related considerations to care about, in this
> code, though.

Here's a patch.  There is a disadvantage: one additional syscache lookup
each time logicalrep_write_attrs() is called.  I think that's all right,
since that's done once for each relation being replicated (not, say,
once per tuple or once per transaction).  Even so, I tried to move this
code as a special "quick-path" case in RelationGetIndexAttrBitmap, so
that it could be cached separately without processing the other indexes.
That turned out a bit too messy for my taste.


--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Fri, Apr 27, 2018 at 12:26:52AM +0200, Петър Славов wrote:
> I tried to compile this and i had this error:
>
> proto.c: In function ‘logicalrep_write_attrs’:
> proto.c:565:17: error: ‘FormData_pg_index {aka struct FormData_pg_index}’
> has no member named ‘indnkeyatts’
>     natts = index->indnkeyatts;

The patch Alvaro has sent is made for the master branch, and you are
visibly using a build of REL_10_STABLE.  The conflict you are seeing
comes from 8224de4 which introduced Covering indexes, and renamed
indnatts to indnkeyatts in Form_pg_index.
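
For reference, a hypothetical way such a line could be written so it builds on
both branches (the in-tree backport would simply use indnatts on
REL_10_STABLE; the version guard below is only illustrative, in the style an
out-of-tree extension would use):

#if PG_VERSION_NUM >= 110000
    natts = index->indnkeyatts;    /* key columns only, after commit 8224de4 */
#else
    natts = index->indnatts;       /* PG10: every index column is a key column */
#endif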
--
Michael

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Петър Славов
Hi,
What happened here? I cannot see a fix for this bug.
I also cannot see the workaround patch in the code.
Do I have to do something to move this fix forward?

I think we agreed that this bug breaks logical decoding, or am I wrong?

Peter Slavov

On Fri, 27 Apr 2018 at 3:22, Michael Paquier <michael@paquier.xyz> wrote:
On Fri, Apr 27, 2018 at 12:26:52AM +0200, Петър Славов wrote:
> I tried to compile this and i had this error:
>
> proto.c: In function ‘logicalrep_write_attrs’:
> proto.c:565:17: error: ‘FormData_pg_index {aka struct FormData_pg_index}’
> has no member named ‘indnkeyatts’
>     natts = index->indnkeyatts;

The patch Alvaro has sent is made for the master branch, and you are
visibly using a build of REL_10_STABLE.  The conflict you are seeing
comes from 8224de4 which introduced Covering indexes, and renamed
indnatts to indnkeyatts in Form_pg_index.
--
Michael

Re: BUG #15114: logical decoding Segmentation fault

From: Petr Jelinek
Hi,

This thread was brought to my attention during pgconf.eu (as I sadly
missed it).

On 04/10/2018 10:03, Петър Славов wrote:
> Hi,
> What happened here? I cannot see a fix for this bug.
> I also cannot see the workaround patch in the code.
> Do I have to do something to move this fix forward?
> 
> I think we agreed that this bug breaks logical decoding, or am I wrong?
>

Yes it's definitely a bug.

Also please note that we bottom-post in PostgreSQL mailing lists.

> 
> На пт, 27.04.2018 г. в 3:22 ч. Michael Paquier <michael@paquier.xyz
> <mailto:michael@paquier.xyz>> написа:
> 
>     On Fri, Apr 27, 2018 at 12:26:52AM +0200, Петър Славов wrote:
>     > I tried to compile this and i had this error:
>     >
>     > proto.c: In function ‘logicalrep_write_attrs’:
>     > proto.c:565:17: error: ‘FormData_pg_index {aka struct
>     FormData_pg_index}’
>     > has no member named ‘indnkeyatts’
>     >     natts = index->indnkeyatts;
> 
>     The patch Alvaro has sent is made for the master branch, and you are
>     visibly using a build of REL_10_STABLE.  The conflict you are seeing
>     comes from 8224de4 which introduced Covering indexes, and renamed
>     indnatts to indnkeyatts in Form_pg_index.
>     --
>     Michael
> 

I wrote my own patch to solve this, which is quite similar to Alvaro's
except that it does not do the manual work of building the index list, since
we already have RelationGetReplicaIndex, which does all the necessary work.
It also uses the same rd_idattr cache that RelationGetIndexAttrBitmap does,
so it's better for performance and looks safe to me.

I attached both the patch for PG11+ (i.e. master) and the backport patch for
PG10; they only differ in the line identified above as problematic.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Sun, Oct 28, 2018 at 11:59:49PM +0100, Petr Jelinek wrote:
> I wrote my own patch to solve this which is quite similar to Alvaro's
> except that it does not do the manual work to build indexes list as we
> already have RelationGetReplicaIndex which does all the necessary work
> and also uses the same rd_idattr for cache that
> RelationGetIndexAttrBitmap does, it's better for performance and looks
> safe to me.
>
> I attached both patch for PG11+ (ie master) and the backport patch for
> PG10, they only differ in the line identified above as problematic.

Hmm.  I am studying this patch, and this visibly cannot be easily
reproduced?  I would have imagined that something like this would be
enough, with a predicate index on the publisher side which is in charge
of loading the :
-- connect to publisher
\c postgres postgres /tmp 5432
create table aa (a int not null, b int);
create function copy_int(a int) returns int as $$ select $1 $$
  language sql immutable;
create index aai on aa (b) where copy_int(b) > 0;
create unique index aai2 on aa (a);
alter table aa replica identity using index aai2;
create publication big_tables for table aa;
insert into aa values (generate_series(1,100), 1);

-- connect to subscriber
\c postgres postgres /tmp 5433
create table aa (a int not null, b int);
create function copy_int(a int) returns int as $$ select $1 $$
  language sql immutable;
create index aai on aa (b) where copy_int(b) > 0;
create unique index aai2 on aa (a);
alter table aa replica identity using index aai2;
create subscription sub_name CONNECTION 'host=/tmp dbname=postgres
user=postgres' PUBLICATION big_tables;

Is there something I am missing?

If possible, could you add a test case in
001_rep_changes.pl with a more complete schema?
--
Michael

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Петър Славов
Hi Michael,

My latest tests were with an immutable function that has now() inside, which is a little different. Here is the schema that reproduces the problem:

On the producer:

CREATE FUNCTION public.months_ago(integer) RETURNS timestamp without time zone
    LANGUAGE sql IMMUTABLE AS $_$
    SELECT (NOW() - '1 month'::interval * $1)::timestamp;
$_$;

CREATE TABLE public.replicated_table (
    id serial PRIMARY KEY NOT NULL,
    a text,
    b bigint,
    created_at timestamp without time zone DEFAULT now()
);

CREATE INDEX ON public.replicated_table (a) WHERE (created_at > public.months_ago(4));

On the subscriber:

CREATE TABLE public.replicated_table (
    id integer NOT NULL,
    a text,
    b bigint,
    created_at timestamp without time zone DEFAULT now()
);
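
The remaining steps are presumably the ones from the original report; a sketch
with the same object names (the exact commands and the INSERT used to trigger
decoding are assumptions):

-- on the producer:
CREATE PUBLICATION big_tables FOR TABLE public.replicated_table;

-- on the subscriber:
CREATE SUBSCRIPTION sub_name
    CONNECTION 'host=master.server dbname=db_name user=db_user password=db_password'
    PUBLICATION big_tables;

-- back on the producer, any change on the table makes the walsender decode it:
INSERT INTO public.replicated_table (a, b) VALUES ('x', 1);

-- Per the backtrace earlier in the thread, the walsender then crashes while
-- sending the relation description (logicalrep_write_rel), because building
-- the index info pre-evaluates months_ago(4) from the partial index predicate
-- without an active snapshot.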

Peter

On Mon, 29 Oct 2018 at 2:54, Michael Paquier <michael@paquier.xyz> wrote:
On Sun, Oct 28, 2018 at 11:59:49PM +0100, Petr Jelinek wrote:
> I wrote my own patch to solve this which is quite similar to Alvaro's
> except that it does not do the manual work to build indexes list as we
> already have RelationGetReplicaIndex which does all the necessary work
> and also uses the same rd_idattr for cache that
> RelationGetIndexAttrBitmap does, it's better for performance and looks
> safe to me.
>
> I attached both patch for PG11+ (ie master) and the backport patch for
> PG10, they only differ in the line identified above as problematic.

Hmm.  I am studying this patch, and this visibly cannot be easily
reproduced?  I would have imagined that something like this would be
enough, with a predicate index on the publisher side which is in charge
of loading the :
-- connect to publisher
\c postgres postgres /tmp 5432
create table aa (a int not null, b int);
create function copy_int(a int) returns int as $$ select $1 $$
  language sql immutable;
create index aai on aa (b) where copy_int(b) > 0;
create unique index aai2 on aa (a);
alter table aa replica identity using index aai2;
create publication big_tables for table aa;
insert into aa values (generate_series(1,100), 1);

-- connect to subscriber
\c postgres postgres /tmp 5433
create table aa (a int not null, b int);
create function copy_int(a int) returns int as $$ select $1 $$
  language sql immutable;
create index aai on aa (b) where copy_int(b) > 0;
create unique index aai2 on aa (a);
alter table aa replica identity using index aai2;
create subscription sub_name CONNECTION 'host=/tmp dbname=postgres
user=postgres' PUBLICATION big_tables;

Is there something I am missing?

If possible, could it be possible to add a test case in
001_rep_changes.pl with a more complete schema?
--
Michael

Re: BUG #15114: logical decoding Segmentation fault

From: Petr Jelinek
Hi Michael,

On 29/10/2018 02:54, Michael Paquier wrote:
> On Sun, Oct 28, 2018 at 11:59:49PM +0100, Petr Jelinek wrote:
>> I wrote my own patch to solve this which is quite similar to Alvaro's
>> except that it does not do the manual work to build indexes list as we
>> already have RelationGetReplicaIndex which does all the necessary work
>> and also uses the same rd_idattr for cache that
>> RelationGetIndexAttrBitmap does, it's better for performance and looks
>> safe to me.
>>
>> I attached both patch for PG11+ (ie master) and the backport patch for
>> PG10, they only differ in the line identified above as problematic.
> 
> Hmm.  I am studying this patch, and this visibly cannot be easily
> reproduced?  I would have imagined that something like this would be
> enough, with a predicate index on the publisher side which is in charge
> of loading the :
> -- connect to publisher
> \c postgres postgres /tmp 5432
> create table aa (a int not null, b int);
> create function copy_int(a int) returns int as $$ select $1 $$
>   language sql immutable;
> create index aai on aa (b) where copy_int(b) > 0;
> create unique index aai2 on aa (a);
> alter table aa replica identity using index aai2;
> create publication big_tables for table aa;
> insert into aa values (generate_series(1,100), 1);
> 
> -- connect to subscriber
> \c postgres postgres /tmp 5433
> create table aa (a int not null, b int);
> create function copy_int(a int) returns int as $$ select $1 $$
>   language sql immutable;
> create index aai on aa (b) where copy_int(b) > 0;
> create unique index aai2 on aa (a);
> alter table aa replica identity using index aai2;
> create subscription sub_name CONNECTION 'host=/tmp dbname=postgres
> user=postgres' PUBLICATION big_tables;
> 
> Is there something I am missing?

It needs to be a function that requires a snapshot to be built, which yours
does not. This all happens because we try to inline the SQL function
into the expression, so all the expression machinery for running the function
gets involved, and if the function needs a snapshot, one will be
created (which is where the crash comes from).

The reason why it breaks for the OP is that his IMMUTABLE function is not
actually IMMUTABLE; that's also why we don't see it reported much.

Given this I am not sure if we want to add a test for this, as it's
something one should not do because it has unpredictable side effects,
but if you think it will be useful anyway I can whip something up.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Mon, Oct 29, 2018 at 04:03:31PM +0100, Petr Jelinek wrote:
> The reason why it breaks for OP is that his IMMUTABLE function is not
> actually IMMUTABLE, that's also why we don't see it reported much.

Thanks, Petr.  I have missed that part.  Now I can see the failure
easily.  Please note that on the receiver side you have the same
problem, RelationGetIndexAttrBitmap() is still being used in
logicalrep_rel_open() so that crashes the receiver:
#18 0x0000563cc278baa4 in RelationGetIndexPredicate
 (relation=0x7f9764cdc050) at relcache.c:4720
#19 0x0000563cc22ccdd6 in BuildIndexInfo (index=0x7f9764cdc050) at
 index.c:1789
#20 0x0000563cc278be4f in RelationGetIndexAttrBitmap
 (relation=0x7f9764de52e8, attrKind=INDEX_ATTR_BITMAP_IDENTITY_KEY) at
 relcache.c:4921
#21 0x0000563cc25a6b0c in logicalrep_rel_open (remoteid=16388,
 lockmode=3) at relation.c:313

> Given this I am not sure if we want to add test for this as it's
> something one should not do because it has unpredictable side effects,
> but if you think it will be useful anyway I can whip something out.

Hmm.  People should not do that but they can do it, so this thread is
telling us to be careful.  Adding new objects in the schema of the
publisher and the receiver would also be cheap.  A test like that is not
something we would see but I would suggest stressing the new routine on
both sides.  In the patch, something which makes me uncomfortable is
that get_rel_idattr() is a duplicate of the code paths taken by
RelationGetIndexAttrBitmap() when INDEX_ATTR_BITMAP_IDENTITY_KEY gets
used to fill in the relation cache.  In this case, wouldn't we want to
just fetch the replica index instead of the full index list?  What is
proposed here needs more thought, I think.  Something that we could
consider on HEAD is to remove INDEX_ATTR_BITMAP_IDENTITY_KEY as it is
visibly unsafe.  rd_idattr is also an unsafe concept.

Putting a safeguard in RelationGetIndexAttrBitmap() so that it generates
an ERROR if called in an output plugin, or just an assert, would be nice;
however, such a check would need to use MyReplicationSlot (I don't recall
anything else), and the limitations could just be documented as well.
--
Michael

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Petr Jelinek

On 30/10/2018 07:28, Michael Paquier wrote:
> On Mon, Oct 29, 2018 at 04:03:31PM +0100, Petr Jelinek wrote:
>> The reason why it breaks for OP is that his IMMUTABLE function is not
>> actually IMMUTABLE, that's also why we don't see it reported much.
> 
> Thanks, Petr.  I have missed that part.  Now I can see the failure
> easily.  Please note that on the receiver side you have the same
> problem, RelationGetIndexAttrBitmap() is still being used in
> logicalrep_rel_open() so that crashes the receiver:
> #18 0x0000563cc278baa4 in RelationGetIndexPredicate
>  (relation=0x7f9764cdc050) at relcache.c:4720
> #19 0x0000563cc22ccdd6 in BuildIndexInfo (index=0x7f9764cdc050) at
>  index.c:1789
> #20 0x0000563cc278be4f in RelationGetIndexAttrBitmap
>  (relation=0x7f9764de52e8, attrKind=INDEX_ATTR_BITMAP_IDENTITY_KEY) at
>  relcache.c:4921
> #21 0x0000563cc25a6b0c in logicalrep_rel_open (remoteid=16388,
>  lockmode=3) at relation.c:313

The receiver side *should* actually use RelationGetIndexAttrBitmap; the
problem there seems rather to be that we set up the snapshot a few lines of
code too late in worker.c.

> 
>> Given this I am not sure if we want to add test for this as it's
>> something one should not do because it has unpredictable side effects,
>> but if you think it will be useful anyway I can whip something out.
> 
> Hmm.  People should not do that but they can do it, so this thread is
> telling is to be careful.  Adding new objects in the schema of the
> publisher and the receiver would be also cheap.  A test like that is not
> something we would see but I would suggest to stress the new routine on

Well, considering we don't test for this misuse of IMMUTABLE anywhere
else either, meh, but okay.

> both sides.  In the patch, something which makes me uncomfortable is
> that get_rel_idattr() is a duplicate of the code paths taken by
> RelationGetIndexAttrBitmap() when INDEX_ATTR_BITMAP_IDENTITY_KEY gets
> used to fill in the relation cache.  In this case, wouldn't we want to
> just fetch the replica index instead of the full index list?

I don't follow; we *do* only fetch the replica index in the new function.

> What is
> proposed here needs more thoughts I think.  Something that we could
> consider on HEAD is to remove INDEX_ATTR_BITMAP_IDENTITY_KEY as it is
> visibly unsafe.  rd_idattr is also an unsafe concept.
> 

I don't agree about that; INDEX_ATTR_BITMAP_IDENTITY_KEY is perfectly
safe when used correctly. Not sure what you mean regarding the unsafety of
rd_idattr, but I don't see why it should not be safe.

> Putting a safeguard in RelationGetIndexAttrBitmap() so as it generates
> an ERROR if called in an output plugin, or just assert would be nice,
> however such a check would need to use MyReplicationSlot (I don't recall
> something else), and the limitations could just be documented as well.
If we want to put in additional Asserts, they should be about having a
snapshot, not about being in an output plugin, IMHO.

-- 
  Petr Jelinek                  http://www.2ndQuadrant.com/
  PostgreSQL Development, 24x7 Support, Training & Services


Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Tue, Oct 30, 2018 at 08:45:57AM +0100, Petr Jelinek wrote:
> The receiver side *should* actually use RelationGetIndexAttrBitmap, the
> problem there seems to be rather than we setup snapshot few lines of
> code too late in worker.c.

Does the worker need this much information though?  It just needs to
know about the replication index itself and its columns, not about all
the other indexes.  It may make sense to unify both code paths to behave
similarly.

>> both sides.  In the patch, something which makes me uncomfortable is
>> that get_rel_idattr() is a duplicate of the code paths taken by
>> RelationGetIndexAttrBitmap() when INDEX_ATTR_BITMAP_IDENTITY_KEY gets
>> used to fill in the relation cache.  In this case, wouldn't we want to
>> just fetch the replica index instead of the full index list?
>
> I don't follow, we *do* only fetch replica index in the new function.

I am wondering if both things should be unified, or simply if we should
just completely remove the dependency on the cached Relation in
get_rel_idattr().  One function being able to overwrite what the other does
is weird.

> If we want to put additional Asserts it should be about having snapshot
> not about output plugin IMHO.

That looks like a good idea to me in any case, and that's a good thing
for HEAD.  Something like Assert(ActiveSnapshotSet()) perhaps?  I have
not thought much about it to be honest.  Let's tackle that with a
HEAD-only, separate patch.
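
A minimal sketch of what that safeguard could look like (placement and exact
condition are assumptions, not a committed change):

    /*
     * Hypothetical: at the top of RelationGetIndexAttrBitmap() in relcache.c,
     * catch callers that would reach index expression/predicate processing
     * without a snapshot, instead of crashing later in GetActiveSnapshot().
     */
    Assert(ActiveSnapshotSet());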
--
Michael

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Andres Freund
On 2018-10-31 08:07:36 +0900, Michael Paquier wrote:
> On Tue, Oct 30, 2018 at 08:45:57AM +0100, Petr Jelinek wrote:
> > The receiver side *should* actually use RelationGetIndexAttrBitmap, the
> > problem there seems to be rather than we setup snapshot few lines of
> > code too late in worker.c.
> 
> Does the worker need this much information though?  It just needs to
> know about the replication index itself and its columns, not about all
> the other indexes.  It may make sense to unify both code paths to behave
> similarly.

Replication needs to maintain the other indexes, and for that it needs
to evaluate predicates.  So I don't understand how you could say that it
doesn't need to know about other indexes?


Greetings,

Andres Freund


Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Tue, Oct 30, 2018 at 04:09:31PM -0700, Andres Freund wrote:
> Replication needs to maintain the other indexes, and for that it needs
> to evaluate predicates.  So I don't understand how you could say that
> it doesn't need to know about other indexes?

You are right, sorry for the noise.  The worker applies the changes so
it needs to know about all the indexes and their information.  So we need
to tackle three separate issues here then:
1) On the sender side, make sure that only the replica index information
is fetched.  I actually would prefer something like what Alvaro proposed
upthread, to not save the fetched information into the cached Relation,
where RelationGetIndexAttrBitmap() could potentially overwrite it.
2) On the apply side, move the snapshot build a bit earlier so that the
index information can be built properly.
3) Make sure that RelationGetIndexAttrBitmap() never gets called if
there is no snapshot available.

1) and 2) are the actual bug fixes to back-patch.  3) is optionally
something for HEAD, which Petr proposed.
--
Michael

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Kyotaro HORIGUCHI
At Wed, 31 Oct 2018 16:29:38 +0900, Michael Paquier <michael@paquier.xyz> wrote in <20181031072938.GC7862@paquier.xyz>
> On Tue, Oct 30, 2018 at 04:09:31PM -0700, Andres Freund wrote:
> > Replication needs to maintain the other indexes, and for that it needs
> > to evaluate predicates.  So I don't understand how you could say that
> > it doesn't need to know about other indexes?
> 
> You are right, sorry for the noise.  The worker applies the changes so
> it needs to know about all the indexes and its information.  So we need
> to tackle three separate issues here then:

> 1) On the sender side, make sure that only the replica index information
> is fetched.  I actually would prefer something like what Alvaro proposed
> upthread, to not save the saved information into the cached Relation,
> and potentially have RelationGetIndexAttrBitmap() overwrite it.

+1. We won't reuse it elsewhere, and we don't build other
bitmaps there.

> 2) On the apply side, move the snapshot build a bit earlier so as the
> index information can be built properly.

The subscriber crashes without it and is fixed with it.

> 3) Make sure that RelationGetIndexAttrBitmap() never gets called if
> there is no snapshot available.

I'm not sure putting an assertion into only that place is
reasonable. BuildIndexInfo has the same hazard and would catch
similar cases more widely.

> 1) and 2) are the actual bug fixes to back-patch.  3) is optionally
> something for HEAD, which Petr proposed.

Agreed.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: BUG #15114: logical decoding Segmentation fault

From: Peter Eisentraut
I have another approach for this.

Recall that the issue is ultimately that RelationGetIndexAttrBitmap()
might need a snapshot under certain circumstances, and it will crash if
there isn't one.  There are actually two separate bugs, in the publisher
and in the subscriber.  Some patches have been proposed in this thread
that either bypass RelationGetIndexAttrBitmap() or obtain a snapshot at
different times in the publisher or subscriber code.

My approach is to make RelationGetIndexAttrBitmap() not need a snapshot.
 The whole code was overly complicated anyway, calling BuildIndexInfo()
and then throwing the result away.  We can do it directly more
efficiently and avoid all the business about eval_const_expressions().
Moreover, this fixes the problem in a central place and does not require
bespoke separate fixes in the publisher and subscriber code.  External
logical decoding or logical replication implementations could also be
affected and would benefit from this fix.
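
A minimal sketch of the central part of that idea (variable names are assumed
to follow the existing loop in RelationGetIndexAttrBitmap(); the committed
patch, attached, may differ in detail):

    /*
     * Collect simple column references straight from the cached pg_index
     * tuple instead of building and discarding an IndexInfo, so no
     * expression simplification (and hence no snapshot) is needed here.
     * Expression and predicate columns would still be gathered, but from
     * the raw node trees with pull_varattnos(), not eval_const_expressions().
     */
    indexDesc = index_open(indexOid, AccessShareLock);

    for (i = 0; i < indexDesc->rd_index->indnatts; i++)
    {
        int         attrnum = indexDesc->rd_index->indkey.values[i];

        if (attrnum != 0)
            indexattrs = bms_add_member(indexattrs,
                                        attrnum - FirstLowInvalidHeapAttributeNumber);
        /* attrnum == 0 means an expression column */
    }

    index_close(indexDesc, AccessShareLock);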

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Noah Misch
Looks good.  I found no functional problems.

On Mon, Jan 28, 2019 at 10:32:16PM +0100, Peter Eisentraut wrote:
> The fix is to avoid the constant expressions simplification in
> RelationGetIndexAttrBitmap(), so that it becomes safe to call in these
> contexts.  The constant expressions simplification comes from the
> calls to RelationGetIndexExpressions()/RelationGetIndexPredicate() via
> BuildIndexInfo().  But RelationGetIndexAttrBitmap() calling
> BuildIndexInfo() is overkill.  The latter just takes pg_index catalog
> information, packs it into the IndexInfo structure, which former than
> just unpacks again and throws away.

s/former than/the former then/


Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Mon, Jan 28, 2019 at 10:32:16PM +0100, Peter Eisentraut wrote:
> My approach is to make RelationGetIndexAttrBitmap() not need a snapshot.
>  The whole code was overly complicated anyway, calling BuildIndexInfo()
> and then throwing the result away.  We can do it directly more
> efficiently and avoid all the business about eval_const_expressions().
> Moreover, this fixes the problem in a central place and does not require
> bespoke separate fixes in the publisher and subscriber code.  External
> logical decoding or logical replication implementations could also be
> affected and would benefit from this fix.

I have spent some time over the last couple of days looking at this
patch, and it's a neat approach.

+        * those might run constants evaluation, which needs a snapshot, which
"constants evaluation" sounds like weird English.  "constant
evaluation" or "evaluation of constants" would be better?

Thanks for adding a test case as well, Peter.
--
Michael

Attachments

Re: BUG #15114: logical decoding Segmentation fault

From: Peter Eisentraut
On 29/01/2019 08:59, Michael Paquier wrote:
> On Mon, Jan 28, 2019 at 10:32:16PM +0100, Peter Eisentraut wrote:
>> My approach is to make RelationGetIndexAttrBitmap() not need a snapshot.
>>  The whole code was overly complicated anyway, calling BuildIndexInfo()
>> and then throwing the result away.  We can do it directly more
>> efficiently and avoid all the business about eval_const_expressions().
>> Moreover, this fixes the problem in a central place and does not require
>> bespoke separate fixes in the publisher and subscriber code.  External
>> logical decoding or logical replication implementations could also be
>> affected and would benefit from this fix.
> 
> I have spent some time for the last couple of days looking at this
> patch, and that's a neat approach.
> 
> +        * those might run constants evaluation, which needs a snapshot, which
> "constants evaluation" sounds like weird English.  "constant
> evaluation" or "evaluation of constants would be better"?

Committed and backpatched with your and Noah's changes.

-- 
Peter Eisentraut              http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: BUG #15114: logical decoding Segmentation fault

From: Michael Paquier
On Wed, Jan 30, 2019 at 12:22:15PM +0100, Peter Eisentraut wrote:
> Committed and backpatched with your and Noah's changes.

Thanks, Peter.
--
Michael

Вложения