Discussion: Proposal: Support Logical replication of large objects

Proposal: Support Logical replication of large objects

From:
Dilip Kumar
Date:
Problem Statement
===============
Large objects are currently not replicated by logical replication.
This is a significant barrier for applications that rely heavily on
large objects and limits the usability of logical replication for
such workloads.

Background
==========
The primary challenge preventing support for this feature stems from
large objects being stored within the catalog. Catalog decoding is
intentionally unsupported because replicating a catalog entry, such as
a pg_class row, is meaningless without the underlying storage and
related catalog entries (e.g., pg_attribute, pg_constraint, etc.).
Nevertheless, the data within pg_largeobject is different; it behaves
more like regular user table data than typical catalog data.

While it is technically possible to enable logical logging and
replication for rows in pg_largeobject by modifying certain macros,
this approach has several key drawbacks:  1) The decoded data
needs a logical representation, as a single pg_largeobject row is
meaningless outside of PostgreSQL.  2) The apply worker may not have
the necessary permissions to directly insert into the pg_largeobject
catalog if the subscription was not created by a superuser.

Proposal
=======
The proposed solution involves introducing a configuration parameter
that allows large objects to be logically logged, similar to a
standard user table.  The decoder will then translate the standard
single-row insert/update WAL records generated for the internal
pg_largeobject table rows into logical operations formatted as
LOWRITE: OID, OFFSET, DATA, LENGTH, where the OFFSET is calculated as
pageno (corresponding to the pg_largeobject row) * LOBLKSIZE.
Subsequently, the apply worker on the subscriber side converts this
logical operation into lo_open(), lo_seek() and lowrite() calls.
While there is potential for further optimization by generating the
LOWRITE operation only for the modified data instead of for the
entire LOBLKSIZE bytes, this behavior is consistent with how updates
are currently logged for standard user tables.
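
To make the mapping concrete, here is a rough SQL-level sketch of
applying a single LOWRITE operation on the subscriber (the OID, page
number and data below are made up for illustration; the real apply
worker would call the corresponding internal functions rather than
going through SQL):

BEGIN;
-- open the existing large object for writing (0x20000 = INV_WRITE);
-- the descriptor returned by the first lo_open in a transaction is 0
SELECT lo_open(1000, CAST(x'20000' AS integer));
-- seek to OFFSET = pageno * LOBLKSIZE (pageno 2, LOBLKSIZE 2048 by default)
SELECT lo_lseek64(0, 2 * 2048, 0);
-- write the decoded DATA of the given LENGTH at that offset
SELECT lowrite(0, 'replicated chunk of large object data');
SELECT lo_close(0);
COMMIT;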

I have a POC patch for the decoding part of this implementation. I
need feedback on the overall strategy before I put effort into
cleanup, testing, and further development. The patch still requires
the following to be completed:
a) As this is just a POC, it needs substantial cleanup and testing,
as well as implementation of the large object operations other than
lowrite.
b) Development of the 'apply' side of the implementation.
c) Implementation of a configuration parameter to conditionally enable
logical logging of large objects (currently, they are always logged).

Open points to be discussed
======================
1. I propose that logically logging the pg_largeobject tuple data
should be controlled by a configuration parameter, rather than being
enabled unconditionally.  Making it configurable allows users to opt
in to this behavior, avoiding unnecessary logging overhead for users
who do not require large object replication. Thoughts?
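
Just to illustrate the kind of opt-in I have in mind (the names below
are placeholders, not proposed syntax; whether this ends up as a GUC
or a publication option is exactly what I would like feedback on):

-- hypothetical publication-level option
CREATE PUBLICATION mypub FOR ALL TABLES WITH (publish_large_objects = true);

-- or a hypothetical GUC controlling logical logging of pg_largeobject
ALTER SYSTEM SET logical_log_large_objects = on;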

2. Supporting lo_create() operations: We should consider extending
this proposal to support the lo_create() operation as well.  If the
large object replication configuration is enabled, we could generate a
new WAL record directly from lo_create().  Alternatively, we could
leverage the existing WAL record for the insertion into
pg_largeobject_metadata and decode it into a logical lo_create
operation.

While some may categorize lo_create() as a DDL operation, it behaves
practically like a DML operation. When a table contains a large object
column, new large objects are created frequently, often for every row
insertion, making it a commonplace runtime event rather than a schema
design activity.
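
For instance, if we decode it, an lo_create(1000) on the publisher
could show up in the change stream roughly like this (the exact
format is of course not settled):

 BEGIN
 LO_CREATE: loid: 1000
 COMMIT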

Acknowledgements:
================
I would like to express my sincere thanks to Amit Kapila, Michael
Bautin, Hannu Krosing, Noah Misch, and Joe Conway for their valuable
input, including discussing various alternatives and suggesting
different approaches.

I have also added some tests using test_decoding to show how it
works; here is one of the examples.

postgres[1000776]=# select lo_create(1000);
 lo_create
-----------
      1000
(1 row)

postgres[1000776]=# SELECT lowrite(lo_open(1000, CAST(x'20000' |
x'40000' AS integer)), 'try decoding test data');
 lowrite
---------
      22
(1 row)

postgres[1000776]=# SELECT data FROM
pg_logical_slot_get_changes('regression_slot', NULL, NULL,
'include-xids', '0', 'skip-empty-xacts', '1');
                                  data
-------------------------------------------------------------------------
 BEGIN
 LO_WRITE: loid: 1000 offset: 0 datalen: 22 data: try decoding test data
 COMMIT
(3 rows)

-- 
Regards,
Dilip Kumar
Google

Attachments

Re: Proposal: Support Logical replication of large objects

From:
Bernd Helmle
Date:
On Monday, 01.12.2025 at 09:27 +0530, Dilip Kumar wrote:
> The decoder will then translate the standard single-row
> insert/update WAL records generated for the internal pg_largeobject
> table rows into logical operations formatted as LOWRITE: OID, OFFSET,
> DATA, LENGTH, where the OFFSET is calculated as pageno (corresponding
> to pg_largeobject row) * LOBLKSIZE. Subsequently, the apply worker on
> the subscriber side converts this logical operation into lo_open(),
> lo_seek() and lowrite() operation.  While there is potential for
> further optimization by generating the LOWRITE operation only for the
> modified data instead of for the entire LOBLKSIZE bytes this behavior
> is consistent with how updates are currently logged for standard user
> tables.


Thanks for this, I think this is a long-awaited feature, at least for
those workloads that can't easily get rid of LOs...

I didn't look into your POC (yet), but what happens if the subscriber
database concurrently does lo_create()? Would that cause conflicting
OIDs, preventing the records decoded from the publisher from being
applied?

    Bernd



Re: Proposal: Support Logical replication of large objects

From:
Dilip Kumar
Date:
On Tue, Dec 2, 2025 at 6:41 PM Bernd Helmle <mailings@oopsware.de> wrote:
>
> On Monday, 01.12.2025 at 09:27 +0530, Dilip Kumar wrote:
> > The decoder will then translate the standard single-row
> > insert/update WAL records generated for the internal pg_largeobject
> > table rows into logical operations formatted as LOWRITE: OID, OFFSET,
> > DATA, LENGTH, where the OFFSET is calculated as pageno (corresponding
> > to pg_largeobject row) * LOBLKSIZE. Subsequently, the apply worker on
> > the subscriber side converts this logical operation into lo_open(),
> > lo_seek() and lowrite() operation.  While there is potential for
> > further optimization by generating the LOWRITE operation only for the
> > modified data instead of for the entire LOBLKSIZE bytes this behavior
> > is consistent with how updates are currently logged for standard user
> > tables.
>
>
> Thanks for this, i think this is a long awaited feature, at least for
> those workloads that can't easily get rid of LOs...

Right

> I didn't look into your POC (yet), but what happens if the subscriber
> database concurrently does lo_create()? Would that cause conflicting
> OIDs, preventing applying the records decoded from the publisher?

I've implemented decoding for lowrite operations, specifically
handling INSERT statements into pg_largeobject by translating them
into corresponding lowrite operations. However, lo_create is currently
not decoded. The patch expects the user to execute lo_create on the
subscriber explicitly, similar to how DDL for tables is handled.
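
So with the current POC, the user would pre-create the object on the
subscriber with the matching OID before the changes stream in, for
example:

-- on the subscriber, using the same OID as on the publisher
SELECT lo_create(1000);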

But I understand that lo_create can be a very frequent activity, so
instead of treating it as DDL we should decode and replicate
lo_create as well; I have asked for suggestions on this in my
original email. And if we replicate lo_create too, any pre-existing
conflicting OID on the subscriber will produce a conflict, which I
believe we can handle as part of conflict detection.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Support Logical replication of large objects

From:
Nitin Motiani
Date:
Hi Dilip,

Thanks for initiating this work. I have created a POC for the apply
worker on top of your patch. I refactored lo_put to provide a
function that can be called without going through fmgr, and used that
to apply the lo_write operations coming from the publisher. I tested
this manually along with your patch, and it works as long as
lo_create is called beforehand on the subscriber with the same OID.
This patch currently doesn't handle the tablesync worker, so it can
only apply the lo_write ops done after the replication slot is
created. I'm looking into how to support large objects in tablesync.
PFA the patch and let me know what you think.
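
At the SQL level, applying one LO_WRITE this way is roughly
equivalent to the following (OID, offset and data are just example
values; the apply worker calls the refactored internal function
directly instead of going through fmgr/SQL):

-- lo_put(loid, offset, data) writes the data at the given offset
SELECT lo_put(1000, 0, 'try decoding test data');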

Thanks,
Nitin Motiani
Google

Attachments

Re: Proposal: Support Logical replication of large objects

From:
Dilip Kumar
Date:
On Tue, Jan 6, 2026 at 3:05 PM Nitin Motiani <nitinmotiani@google.com> wrote:
>
> Hi Dilip,
>
> Thanks for initiating this work. I have created a POC for the apply
> worker on top of your patch. I refactored lo_put to provide a function
> which can be called without an fmgr and used that to apply the
> lo_write operation coming from the publisher. I tested this manually
> along with your patch and it works as long as lo_create is called
> beforehand on the subscriber with the same oid.

Thanks Nitin for working on this. I will look into it.

 This patch currently
> doesn't handle the tablesync worker so it can only apply the lo_write
> ops done after the replication slot is created. I'm looking into how
> to support large objects in tablesync. PFA the patch and let me know
> what you think.

Makes sense, thanks.

--
Regards,
Dilip Kumar
Google



Re: Proposal: Support Logical replication of large objects

От
Nitin Motiani
Дата:
Hi Dilip,

I have created a couple of POC patches for the tablesync side of
this. The main challenge with tablesync is permissions, because
pg_largeobject is owned by the superuser. I considered a few
solutions and have patches for a couple of them.

1. One simple solution is to allow only the superuser to set up large
object replication. This could be achieved with a command that
enables large object replication for a publication, or perhaps a GUC
flag could be used. Ideally we don't want all publications to publish
large objects, so a publication-level command might be preferred.

While the enablement mechanism is in question, I have implemented the
POC by adding the pg_largeobject table to the result of
pg_get_publication_tables. In the final implementation, we might also
add the pg_largeobject_metadata table.
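
For example, with this POC something like the following should list
pg_largeobject for the publication (the publication name is just an
example; the pg_publication_tables view is built on top of
pg_get_publication_tables):

SELECT * FROM pg_publication_tables WHERE pubname = 'mypub';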

The patch is attached as
POC-v3-0003-Tablesync-for-large-objects-for-superuser.patch. An
alternative would be to handle this special case on the subscriber
side, but it is simpler to do it on the publisher side; or we might
need the setup on both the publisher and subscriber sides.

For full enablement we'll need this patch, the existing apply worker
patch for lo_write, and a patch supporting lo_create on the apply
worker side.

2. An alternative solution would be to allow a subscription owner to
copy only the large object entries owned by its counterpart on the
publisher. This allows all users to replicate large objects. It also
enables multiple subscriptions to set up large object replication.
The enablement question from the first approach still remains open.
We can also assume that a patch supporting lo_create on the apply
worker will be created along with this change.

For implementation, we can have two tablesync workers - one for
pg_largeobject_metadata and one for pg_largeobject. The
pg_largeobject_metadata worker can use lo_create to create the large
object entries. Alternatively, it can just copy the rows, since
permissions are not an issue for this table. The pg_largeobject
worker will use lo_write/lo_put to insert the entries owned by it.

The major challenge with this approach is the synchronization required
between the pg_largeobject and pg_largeobject_metadata tablesync
workers. Usually for a table copy (as suggested for superusers in the
first approach), integrity checks are turned off and we should have no
issue with the two tablesync workers running in parallel. However,
explicitly using lo_write will check for the large object's existence,
which can lead to failures.

3. To avoid the synchronization problem above, I experimented with a
modified approach. Instead of having a separate tablesync worker for
pg_largeobject_metadata, I created only one tablesync worker for
pg_largeobject. This differs from a regular tablesync worker in the
following ways:

a. Instead of a copy command, it runs "select m.oid, lo_get(m.oid)
from pg_largeobject_metadata as m join pg_user as u on m.lomowner =
u.usesysid where u.usename = CURRENT_USER;"

This returns all large objects owned by the current user.

b. Then I use lo_from_bytea (a refactored version that doesn't need
fmgr) to both create and write the large object, which removes the
need for a separate lo_create.

POC-v3-0004-Enable-tablesync-for-large-objects-for-all-users.patch
contains this implementation. I tested it successfully for a small
set of large objects. This patch, along with support for lo_create in
the apply worker, should provide full support for large object
replication.
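
At the SQL level, that create-plus-write step is equivalent to
something like this (OID and contents are just example values):

-- creates the large object with the given OID and writes the data in one call
SELECT lo_from_bytea(1000, 'contents copied during tablesync');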

Note: We should also modify the apply worker code to apply only the
lo_writes on objects owned by the subscription user. I have not made
that change in the POC but can do it in the next version.

One major concern here is the performance of tablesync. I think most
users will set up replication at the start, or convert a physical
replica to a logical replica, so this cost might not be incurred in
many cases.

4. If we want better performance, one idea is to support bulk writes
for large objects; the above solution could then be made faster. I
have not analyzed the work required. Suggestions on this would be
welcome.

Thanks.


Nitin Motiani
Google

Attachments