> We have 2 PostgreSQL servers with logical replication between Postgres 11.6
> (Primary) and 12.1 (Logical). Some times ago, we changed column type in a 2
> big tables from integer to text:
> ...
> , this of course led to a full rewrite both tables. We repated this
> operation on both servers. And after that we started to get error like
> "background worker "logical replication worker" (PID <pid>) was terminated
> by signal 11: Segmentation fault" and server goes to recovery mode.
Not sure, but this seems like it might be explained by this recent bug fix:
Fix bogus tuple-slot management in logical replication UPDATE handling.
slot_modify_cstrings seriously abused the TupleTableSlot API by relying on a slot's underlying data to stay valid across ExecClearTuple. Since this abuse was also quite undocumented, it's little surprise that the case got broken during the v12 slot rewrites. As reported in bug #16129 from Ondřej Jirman, this could lead to crashes or data corruption when a logical replication subscriber processes a row update. Problems would only arise if the subscriber's table contained columns of pass-by-ref types that were not being copied from the publisher.
Fix by explicitly copying the datum/isnull arrays from the source slot that the old row was in already. This ends up being about the same thing that happened pre-v12, but hopefully in a less opaque and fragile way.
We might've caught the problem sooner if there were any test cases dealing with updates involving non-replicated or dropped columns. Now there are.
Back-patch to v10 where this code came in. Even though the failure does not manifest before v12, IMO this code is too fragile to leave as-is. In any case we certainly want the additional test coverage.
Patch by me; thanks to Tomas Vondra for initial investigation.
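To make the failure mode in that commit message a bit more concrete, here is a toy C sketch. It is not the PostgreSQL source: ToySlot and the toy_slot_* functions are made up for illustration and only loosely stand in for TupleTableSlot and ExecClearTuple. The point is the shape of the fix: the updated row is built in a separate destination slot, and the columns the publisher did not send are copied explicitly from the slot holding the old row, instead of trusting pointers into a slot that has already been cleared. The sketch deep-copies strings to keep ownership simple, which is an assumption of the example and not a claim about what the real patch does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

#define NATTS 3

/* Toy stand-in for a tuple slot; NOT PostgreSQL's TupleTableSlot. */
typedef struct ToySlot
{
    char *values[NATTS];   /* pass-by-reference column values, owned by the slot */
    bool  isnull[NATTS];
} ToySlot;

/* Duplicate a string on the heap. */
static char *
copy_string(const char *s)
{
    size_t len = strlen(s) + 1;
    char  *p = malloc(len);

    assert(p != NULL);
    memcpy(p, s, len);
    return p;
}

/* Put a slot into a known empty state before first use. */
void
toy_slot_init(ToySlot *slot)
{
    for (int i = 0; i < NATTS; i++)
    {
        slot->values[i] = NULL;
        slot->isnull[i] = true;
    }
}

/* Release everything the slot owns; old pointers into it become invalid. */
void
toy_slot_clear(ToySlot *slot)
{
    for (int i = 0; i < NATTS; i++)
    {
        free(slot->values[i]);
        slot->values[i] = NULL;
        slot->isnull[i] = true;
    }
}

/* Store a private copy of value (or NULL) in column att. */
void
toy_slot_set(ToySlot *slot, int att, const char *value)
{
    free(slot->values[att]);
    slot->values[att] = value ? copy_string(value) : NULL;
    slot->isnull[att] = (value == NULL);
}

/*
 * Build the updated row in dst from the old row in src.
 * Columns with replaced[i] set take new_values[i] (sent by the publisher);
 * all other columns are copied from src, which must still be valid.
 */
void
toy_slot_modify(ToySlot *dst, const ToySlot *src,
                const bool replaced[NATTS],
                const char *const new_values[NATTS])
{
    assert(dst != src);     /* the broken pattern reused one slot for both */
    toy_slot_clear(dst);
    for (int i = 0; i < NATTS; i++)
        toy_slot_set(dst, i, replaced[i] ? new_values[i] : src->values[i]);
}
```

With this arrangement the third column in a call such as toy_slot_modify(&new_row, &old_row, replaced, new_values), where replaced is {true, true, false}, keeps the subscriber's local value even after old_row is cleared, because new_row holds its own copy.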