Proposal: Support Logical replication of large objects
Problem Statement
===============
Logical replication currently does not support large objects, which is
a significant barrier to adopting it for applications that rely
heavily on them.
Background
==========
The primary challenge is that large object data is stored in a system
catalog (pg_largeobject). Catalog decoding is intentionally
unsupported because replicating a catalog entry, such as a pg_class
row, is meaningless without the underlying storage and the related
catalog entries (e.g., pg_attribute, pg_constraint, etc.).
Nevertheless, the data within pg_largeobject is different; it behaves
more like regular user table data than typical catalog data.
While it is technically possible to enable logical logging and
replication of pg_largeobject rows by modifying a few macros, that
approach has two key drawbacks: 1) The decoded data needs a logical
representation, because a raw pg_largeobject row is meaningless
outside of PostgreSQL. 2) The apply worker may not have the
permissions required to insert directly into the pg_largeobject
catalog if the subscription was not created by a superuser.
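To make the "modifying a few macros" point concrete: logical decoding
currently skips all catalog relations via a relation-level check. A
purely illustrative carve-out for pg_largeobject could look roughly
like the following; the macro name and exact conditions here are
assumptions, not the actual definition in the tree.

/* Illustrative only: let pg_largeobject pass the check that otherwise
 * excludes catalog relations from logical decoding. */
#define RelationIsLogicallyLoggedWithLO(relation) \
	(XLogLogicalInfoActive() && \
	 RelationNeedsWAL(relation) && \
	 (RelationGetRelid(relation) == LargeObjectRelationId || \
	  !IsCatalogRelation(relation)))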
Proposal
=======
The proposed solution introduces a configuration parameter that allows
a large object to be logically logged, similar to a standard user
table. The decoder then translates the standard single-row
insert/update WAL records generated for the internal pg_largeobject
table rows into logical operations formatted as LOWRITE: OID, OFFSET,
DATA, LENGTH, where the OFFSET is calculated as pageno (from the
corresponding pg_largeobject row) * LOBLKSIZE. Subsequently, the apply
worker on the subscriber side converts each such logical operation
into lo_open(), lo_seek() and lowrite() calls. While there is
potential for further optimization by generating the LOWRITE operation
only for the modified data instead of the entire LOBLKSIZE bytes, this
behavior is consistent with how updates are currently logged for
standard user tables.
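To make the mapping concrete, here is a rough sketch of the decoding
step; the callback shape and the helper name decode_lo_write() are
illustrative only and not taken from the patch.

#include "postgres.h"
#include "access/htup_details.h"
#include "lib/stringinfo.h"
#include "storage/large_object.h"	/* LOBLKSIZE */
#include "utils/rel.h"

/* Illustrative sketch: turn a decoded pg_largeobject insert
 * (columns: 1 = loid, 2 = pageno, 3 = data) into the proposed
 * LOWRITE representation inside an output plugin's change callback. */
static void
decode_lo_write(StringInfo out, Relation relation, HeapTuple newtuple)
{
	TupleDesc	tupdesc = RelationGetDescr(relation);
	bool		isnull;
	Oid			loid;
	int32		pageno;
	Datum		data;

	loid = DatumGetObjectId(heap_getattr(newtuple, 1, tupdesc, &isnull));
	pageno = DatumGetInt32(heap_getattr(newtuple, 2, tupdesc, &isnull));
	data = heap_getattr(newtuple, 3, tupdesc, &isnull);

	/* The logical offset is simply pageno * LOBLKSIZE. */
	appendStringInfo(out, "LO_WRITE: loid: %u offset: " INT64_FORMAT,
					 loid, (int64) pageno * LOBLKSIZE);

	/* The bytea payload in "data" would follow (detoasted), as in the
	 * test_decoding output shown later in this mail. */
	(void) data;
}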
I have a POC patch for the decoding part of this implementation. I
would like feedback on the overall strategy before putting effort into
cleanup, testing, and further development. The patch still needs the
following:
a) Substantial cleanup and testing (it is only a POC), plus support
for large object operations other than lowrite.
b) Development of the 'apply' side of the implementation.
c) A configuration parameter to conditionally enable logical logging
of large objects (currently they are always logged).
Open points to be discussed
======================
1. I propose that logically logging the pg_largeobject tuple data
should be controlled by a configuration parameter, rather than being
implemented unconditionally. Making it configurable allows users to
opt in to this behavior, avoiding unnecessary logging overhead for
users who do not need large object replication. Thoughts?
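For illustration only, if this ends up as a server GUC (rather than,
say, a publication option), the guc_tables.c entry might look roughly
like the following; the parameter name and its placement are
assumptions made for the sake of the example.

	/* Illustrative sketch of a boolean GUC entry; the name
	 * "logical_replicate_large_objects" is made up, and the backing
	 * bool variable would be declared elsewhere. */
	{
		{"logical_replicate_large_objects", PGC_SIGHUP, REPLICATION_SENDING,
			gettext_noop("Logically logs pg_largeobject changes so that they can be decoded."),
			NULL
		},
		&logical_replicate_large_objects,
		false,
		NULL, NULL, NULL
	},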
2. Supporting lo_create() operations: We should consider extending
this proposal to support the lo_create() operation as well. If the
large object replication configuration is enabled, we could generate a
new WAL record directly from lo_create(). Alternatively, we could
leverage the existing WAL record for the insertion into
pg_largeobject_metadata and decode it into a logical lo_create
operation.
While some may categorize lo_create() as a DDL operation, it behaves
practically like a DML operation. When a table contains a large object
column, new large objects are created frequently, often for every row
insertion, making it a commonplace runtime event rather than a schema
design activity.
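If we decode the pg_largeobject_metadata insert, the decoding side
would be quite small. A purely illustrative sketch (helper name
assumed, same headers as the earlier sketch):

/* Illustrative sketch: a pg_largeobject_metadata insert carries the new
 * large object's OID as its first column, which is all that a logical
 * lo_create operation needs. */
static void
decode_lo_create(StringInfo out, Relation relation, HeapTuple newtuple)
{
	bool		isnull;
	Oid			loid;

	loid = DatumGetObjectId(heap_getattr(newtuple, 1,
										 RelationGetDescr(relation),
										 &isnull));
	appendStringInfo(out, "LO_CREATE: loid: %u", loid);
}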
Acknowledgements:
================
I would like to express my sincere thanks to Amit Kapila, Michael
Bautin, Hannu Krosing, Noah Misch, and Joe Conway for their valuable
input, including discussing various alternatives and suggesting
different approaches to this problem.
I have also added some tests using test_decoding to show how this
works; here is one example.
postgres[1000776]=# select lo_create(1000);
lo_create
-----------
1000
(1 row)
postgres[1000776]=# SELECT lowrite(lo_open(1000, CAST(x'20000' |
x'40000' AS integer)), 'try decoding test data');
lowrite
---------
22
(1 row)
postgres[1000776]=# SELECT data FROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1');
data
-------------------------------------------------------------------------
BEGIN
LO_WRITE: loid: 1000 offset: 0 datalen: 22 data: try decoding test data
COMMIT
(3 rows)
--
Regards,
Dilip Kumar
Google
On Monday, 01.12.2025 at 09:27 +0530, Dilip Kumar wrote:
> The decoder will then translate the standard single-row
> insert/update WAL records generated for the internal pg_largeobject
> table rows into logical operations formatted as LOWRITE: OID, OFFSET,
> DATA, LENGTH, where the OFFSET is calculated as pageno (corresponding
> to pg_largeobject row) * LOBLKSIZE. Subsequently, the apply worker on
> the subscriber side converts this logical operation into lo_open(),
> lo_seek() and lowrite() operation. While there is potential for
> further optimization by generating the LOWRITE operation only for the
> modified data instead of for the entire LOBLKSIZE bytes this behavior
> is consistent with how updates are currently logged for standard user
> tables.
Thanks for this, I think this is a long-awaited feature, at least for
those workloads that can't easily get rid of LOs...
I didn't look into your POC (yet), but what happens if the subscriber
database concurrently does lo_create()? Would that cause conflicting
OIDs and prevent applying the records decoded from the publisher?
Bernd
On Tue, Dec 2, 2025 at 6:41 PM Bernd Helmle <mailings@oopsware.de> wrote:
>
> Thanks for this, I think this is a long-awaited feature, at least for
> those workloads that can't easily get rid of LOs...

Right.

> I didn't look into your POC (yet), but what happens if the subscriber
> database concurrently does lo_create()? Would that cause conflicting
> OIDs and prevent applying the records decoded from the publisher?

I have implemented decoding for lowrite operations, specifically by
translating INSERTs into pg_largeobject into the corresponding lowrite
operations. However, lo_create is currently not decoded; the patch
expects the user to execute lo_create on the subscriber explicitly,
similar to how DDL for tables is handled. That said, lo_create can be
a very frequent operation, so instead of treating it as DDL we should
probably decode and replicate it as well, which is one of the open
points in my initial email. If we do replicate lo_create, any
pre-existing conflicting OID on the subscriber will produce a
conflict, and I believe that can be handled as part of conflict
detection.

--
Regards,
Dilip Kumar
Google
Hi Dilip,

Thanks for initiating this work. I have created a POC for the apply
worker on top of your patch. I refactored lo_put to provide a function
that can be called without going through fmgr, and used that to apply
the lo_write operations coming from the publisher. I tested this
manually along with your patch, and it works as long as lo_create is
called beforehand on the subscriber with the same OID.

This patch does not yet handle the tablesync worker, so it can only
apply the lo_write operations done after the replication slot is
created. I am looking into how to support large objects in tablesync.

PFA the patch; let me know what you think.

Thanks,
Nitin Motiani
Google
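For illustration, a minimal sketch of what such a non-fmgr entry point
for lo_put could look like; the helper name and signature are
assumptions, not necessarily what the attached patch does.

#include "postgres.h"
#include "libpq/libpq-fs.h"		/* INV_WRITE */
#include "storage/large_object.h"	/* inv_open() and friends */

/* Illustrative sketch of a non-fmgr variant of lo_put(), callable
 * directly by the apply worker via the existing inv_* API. */
void
lo_put_internal(Oid loid, int64 offset, const char *buf, int len)
{
	LargeObjectDesc *lobj;

	/* The large object must already exist on the subscriber. */
	lobj = inv_open(loid, INV_WRITE, CurrentMemoryContext);

	/* Position at the offset decoded from the publisher's LOWRITE... */
	inv_seek(lobj, offset, SEEK_SET);

	/* ...write the replicated chunk, and close. */
	inv_write(lobj, buf, len);
	inv_close(lobj);
}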
On Tue, Jan 6, 2026 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> Hi Dilip,
>
> Thanks for initiating this work. I have created a POC for the apply
> worker on top of your patch. I refactored lo_put to provide a function
> that can be called without going through fmgr, and used that to apply
> the lo_write operations coming from the publisher. I tested this
> manually along with your patch, and it works as long as lo_create is
> called beforehand on the subscriber with the same OID.

Thanks Nitin for working on this, I will look into it.

> This patch does not yet handle the tablesync worker, so it can only
> apply the lo_write operations done after the replication slot is
> created. I am looking into how to support large objects in tablesync.
> PFA the patch; let me know what you think.

Makes sense, thanks.

--
Regards,
Dilip Kumar
Google
Hi Dilip,

I have created a couple of POC patches for the tablesync side of this.
The main challenge with tablesync is permissions, because
pg_largeobject is owned by the superuser. I considered a few solutions
and have patches for a couple of them.

1. One simple solution is to allow only the superuser to set up large
object replication. This could be done with a command that enables
large object replication for a publication, or perhaps with a GUC
flag. Ideally we do not want every publication to publish large
objects, so a publication-level command might be preferred. While the
enablement mechanism is still an open question, I have implemented the
POC by adding the pg_largeobject table to the result of
pg_get_publication_tables. In the final implementation, we might also
add the pg_largeobject_metadata table. The patch is attached as
POC-v3-0003-Tablesync-for-large-objects-for-superuser.patch. An
alternative would be to handle this special case on the subscriber
side, but it is simpler to do it on the publisher side; we might also
end up needing setup on both sides. For full enablement we will need
this patch, the existing apply worker patch for lo_write, and a patch
supporting lo_create on the apply worker side.

2. An alternative solution would be to allow a subscription owner to
copy only the large object entries owned by its counterpart on the
publisher. This allows all users to replicate large objects, and it
also enables multiple subscriptions to set up large object
replication. The enablement question from the first approach still
remains open, and we can assume that a patch supporting lo_create on
the apply worker will be created along with this change. For the
implementation, we can have two tablesync workers: one for
pg_largeobject_metadata and one for pg_largeobject. The
pg_largeobject_metadata worker can use lo_create to create the large
object entries, or it can simply copy the rows, since permissions are
not an issue for that table. The pg_largeobject worker will use
lo_write/lo_put to insert the entries owned by it. The major challenge
with this approach is the synchronization required between the
pg_largeobject and pg_largeobject_metadata tablesync workers. Usually
for a table copy (as suggested for superusers in the first approach),
integrity checks are turned off, and the two tablesync workers should
have no issue running in parallel. However, explicitly using lo_write
will check for the large object's existence, which can lead to
failures.

3. To avoid the synchronization problem above, I experimented with a
modified approach. Instead of having a separate tablesync worker for
pg_largeobject_metadata, I created only one tablesync worker, for
pg_largeobject. It differs from a regular tablesync worker in the
following ways:
a. Instead of a COPY command, it runs "select m.oid, lo_get(m.oid)
from pg_largeobject_metadata as m join pg_user as u on m.lomowner =
u.usesysid where u.usename = CURRENT_USER;", which returns all large
objects owned by the user.
b. It then uses lo_from_bytea (a refactored version that does not need
fmgr) to both create and write each large object, which removes the
need for a separate lo_create. A rough sketch of such a refactor is at
the end of this mail.

POC-v3-0004-Enable-tablesync-for-large-objects-for-all-users.patch
contains this implementation. I tested it successfully on a small set
of large objects. This patch, along with support for lo_create in the
apply worker, should provide full support for large object
replication.
Note: We should also modify the apply worker code to replicate only
the lo_writes on objects owned by the subscription user. I have not
made that change in the POC but can do it in the next version.

One major concern here is the performance of tablesync. I think most
users will set up replication at the start, or convert a physical
replica to a logical replica, so this cost might not be borne in many
cases.

4. If we want a more performant version, one idea is to support bulk
writes for large objects, which would speed up the approach above. I
have not analyzed the work required; suggestions on this would be
welcome.

Thanks,
Nitin Motiani
Google
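As mentioned in point 3b above, here is a minimal, purely illustrative
sketch of a non-fmgr lo_from_bytea variant; the helper name and
signature are assumptions, not necessarily what the POC patch does.

#include "postgres.h"
#include "libpq/libpq-fs.h"		/* INV_WRITE */
#include "storage/large_object.h"	/* inv_create(), inv_open(), ... */

/* Illustrative sketch: create a large object with a given OID and write
 * its initial contents, without going through fmgr; mirrors what the
 * SQL-callable lo_from_bytea() does via the inv_* API. */
Oid
lo_from_bytea_internal(Oid loid, const char *buf, int len)
{
	LargeObjectDesc *lobj;
	Oid			newoid;

	/* Create the large object with the OID taken from the publisher. */
	newoid = inv_create(loid);

	/* Then open it and write the payload. */
	lobj = inv_open(newoid, INV_WRITE, CurrentMemoryContext);
	inv_write(lobj, buf, len);
	inv_close(lobj);

	return newoid;
}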