Re: Transactions involving multiple postgres foreign servers

From: Ashutosh Bapat
Subject: Re: Transactions involving multiple postgres foreign servers
Date:
Msg-id CAFjFpRf3uEnnCJkk689MhqH7XnkRSpdM7eogVcahFE8L77T6ag@mail.gmail.com
In reply to: Re: Transactions involving multiple postgres foreign servers  (Heikki Linnakangas <hlinnaka@iki.fi>)
Responses: Re: Transactions involving multiple postgres foreign servers  (Robert Haas <robertmhaas@gmail.com>)
Re: Transactions involving multiple postgres foreign servers  (Amit Kapila <amit.kapila16@gmail.com>)
List: pgsql-hackers
Hi All,
I have been working on improving the previous implementation and addressing the TODOs from my previous mail. Let me explain the approach first; I will get to Heikki's comments later in this mail.

The patch provides support for atomic commit for transactions involving foreign servers. When a transaction makes changes to foreign servers, either all the changes to all the foreign servers commit or rollback. We should not see some changes committed and others rolled back.

Hooks and GUCs
==============
The patch introduces a GUC atomic_foreign_transaction, which, when ON, ensures atomic commit for foreign transactions; when OFF, atomicity is not guaranteed. The value of this GUC at the time of committing or preparing a local transaction is used. This gives applications the flexibility to choose the behaviour as late in the transaction as possible. This GUC has no effect if there are no foreign servers involved in the transaction.

Another GUC max_fdw_transactions sets the maximum number of transactions that can be simultaneously prepared on all the foreign servers. This limits the memory required for remembering the prepared foreign transactions.

Two new FDW hooks are introduced for transaction management.
1. GetPrepareId: to get the prepared transaction identifier for a given foreign server connection. An FDW which doesn't want to support this feature can keep this hook undefined (NULL). When defined, the hook should return a unique identifier for the transaction prepared on the foreign server. The identifier should be unique enough not to conflict with currently prepared or future transactions. This point will become clear in the discussion of phase 2 of 2PC.

2. HandleForeignTransaction: to end a transaction in a specified way. The hook should be able to prepare/commit/rollback the currently running transaction on a given connection, or commit/rollback a previously prepared transaction. This is described in detail in the description of phase 2 of two-phase commit below. The hook must return a boolean indicating whether the requested operation succeeded. Neither the hook nor the functions it calls should raise an error on failure, so as not to interfere with distributed transaction processing. This point will be clarified further in the description below. (A sketch of both hook signatures follows this list.)
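
To make the interface concrete, here is a minimal sketch of what the two hook signatures could look like. The enum values, the exact argument lists and the FdwXactCallbacks container are my illustration of the description above (so that the later sketches in this mail have something to hang the hooks on), not necessarily the patch's actual definitions.

    /* Minimal sketch of the proposed hooks; names and signatures are assumed. */
    #include "postgres.h"

    /* Operations HandleForeignTransaction may be asked to perform. */
    typedef enum FdwXactOperation
    {
        FDW_XACT_PREPARE,           /* PREPARE the running transaction */
        FDW_XACT_COMMIT,            /* COMMIT the running transaction */
        FDW_XACT_ABORT,             /* ROLLBACK the running transaction */
        FDW_XACT_COMMIT_PREPARED,   /* COMMIT PREPARED a prepared transaction */
        FDW_XACT_ABORT_PREPARED     /* ROLLBACK PREPARED a prepared transaction */
    } FdwXactOperation;

    /*
     * Return a palloc'd prepared-transaction identifier for the transaction
     * running on the connection identified by (serverid, userid).  The
     * identifier must not collide with any currently prepared or future
     * transaction on that server.
     */
    typedef char *(*GetPrepareId_function) (Oid serverid, Oid userid);

    /*
     * Perform the requested operation on the connection identified by
     * (serverid, userid); prep_id is used only for the *_PREPARED operations.
     * Returns true on success and must not ereport() on failure.
     */
    typedef bool (*HandleForeignTransaction_function) (FdwXactOperation op,
                                                       Oid serverid, Oid userid,
                                                       const char *prep_id);

    /* Hypothetical container used by the later sketches in this mail. */
    typedef struct FdwXactCallbacks
    {
        GetPrepareId_function             GetPrepareId;
        HandleForeignTransaction_function HandleForeignTransaction;
    } FdwXactCallbacks;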

Achieving atomic commit
===================
If atomic_foreign_transaction is enabled, the two-phase commit protocol is used to achieve atomic commit for transactions involving foreign servers. All the foreign servers participating in such a transaction should be capable of participating in the two-phase commit protocol. If not, the local and foreign transactions are aborted, as atomic commit cannot be guaranteed.

Phase 1
-----------
Every FDW needs to register the connection while starting a new transaction on a foreign connection (RegisterXactForeignServer()). A foreign server connection is identified by the foreign server oid and the local user oid (similar to the entry cached by postgres_fdw). While registering, the FDW also tells whether the foreign server is capable of participating in the two-phase commit protocol. How to decide that is left entirely to the FDW. An FDW like file_fdw may not have 2PC support at all, so none of its foreign servers comply with 2PC. An FDW might have all its servers 2PC compliant. An FDW like postgres_fdw can have some of its servers compliant and some not, depending upon server version, configuration (max_prepared_transactions = 0) etc. An FDW can also decide not to register its connections at all, and the foreign servers belonging to that FDW will then not be considered by the core.
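
As an illustration, a 2PC-capable FDW such as postgres_fdw could register its connection from the place where it starts a remote transaction, roughly as below. remote_max_prepared_xacts() is a made-up helper, and the signature of RegisterXactForeignServer() is assumed from the description above.

    /* Sketch: registering a connection when a remote transaction is started. */
    #include "postgres.h"
    #include "libpq-fe.h"

    /* Assumed signature of the registration function named in this mail. */
    extern void RegisterXactForeignServer(Oid serverid, Oid userid,
                                          bool two_phase_compliant);

    /* Hypothetical helper: fetch max_prepared_transactions from the remote. */
    extern int remote_max_prepared_xacts(PGconn *conn);

    static void
    begin_remote_xact_sketch(PGconn *conn, Oid serverid, Oid userid)
    {
        bool    two_phase_ok;

        /*
         * 2PC is possible only if the remote server accepts PREPARE TRANSACTION,
         * i.e. it is new enough and has max_prepared_transactions > 0.
         */
        two_phase_ok = (PQserverVersion(conn) >= 80100) &&
                       remote_max_prepared_xacts(conn) > 0;

        /* Tell the core that this (server, user) pair joined the transaction. */
        RegisterXactForeignServer(serverid, userid, two_phase_ok);
    }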

During pre-commit processing, the following steps are executed (a rough per-connection sketch in C follows the list):
1. GetPrepareId hook is called on each of the connections registered to get the identifier that will be used to prepare the transaction.
2. For each connection, the prepared transaction id, along with the connection information, database id and local transaction id (xid), is recorded in memory.
3. This is logged in XLOG. If a standby is configured, the record is replayed there, so in case of a master failover the standby is able to resolve in-doubt prepared transactions created by the master.
4. The information is written to an on-disk file in the pg_fdw_xact/ directory. This directory contains one file per prepared transaction on a foreign connection. The file is fsynced during checkpoints, and files in this directory are managed the same way as pg_twophase files.
5. HandleForeignTransaction is called to prepare the transaction on the given connection with the identifier provided by GetPrepareId().
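
To tie these steps together, here is a rough per-connection sketch. All helper names (insert_fdw_xact, log_fdw_xact, write_fdw_xact_file) are invented for illustration, and FdwXactCallbacks is the hypothetical struct from the hook sketch above.

    /* Sketch of phase 1 for one registered connection; helpers are hypothetical. */
    #include "postgres.h"
    #include "access/xact.h"        /* GetCurrentTransactionId() */
    #include "miscadmin.h"          /* MyDatabaseId */

    /* Hypothetical helpers implementing steps 2-4 (not real functions). */
    extern void insert_fdw_xact(TransactionId xid, Oid dbid, Oid serverid,
                                Oid userid, const char *prep_id);
    extern void log_fdw_xact(TransactionId xid, Oid serverid, Oid userid,
                             const char *prep_id);
    extern void write_fdw_xact_file(TransactionId xid, Oid serverid, Oid userid,
                                    const char *prep_id);

    static void
    prepare_one_foreign_xact_sketch(const FdwXactCallbacks *cb,
                                    Oid serverid, Oid userid)
    {
        char   *prep_id;

        /* 1. Ask the FDW for the identifier to prepare the transaction under. */
        prep_id = cb->GetPrepareId(serverid, userid);

        /* 2. Record the entry (connection info, database id, xid) in memory. */
        insert_fdw_xact(GetCurrentTransactionId(), MyDatabaseId,
                        serverid, userid, prep_id);

        /* 3. WAL-log the entry so a standby or crash recovery can see it. */
        log_fdw_xact(GetCurrentTransactionId(), serverid, userid, prep_id);

        /* 4. Write the state file under pg_fdw_xact/ (fsynced at checkpoints). */
        write_fdw_xact_file(GetCurrentTransactionId(), serverid, userid, prep_id);

        /* 5. Only now PREPARE the transaction on the foreign server. */
        if (!cb->HandleForeignTransaction(FDW_XACT_PREPARE, serverid, userid,
                                          prep_id))
            ereport(ERROR,
                    (errmsg("could not prepare transaction on foreign server %u",
                            serverid)));
    }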

If the server crashes after step 5, we will remember the transaction prepared on the foreign server and will try to abort it after recovery. If it crashes after step 3 but before the completion of step 5, we will remember a transaction that was never prepared and will try to resolve it later. This scenario is described under phase 2.

If any of these steps fails, including the PREPARE on the foreign server itself, the local transaction is aborted. All the already-prepared transactions on foreign servers are aborted as described in the phase 2 discussion below. Transactions that have not yet been prepared are rolled back using the same hook. If step 5 fails, the prepared foreign transaction entry is removed from memory and disk following steps 2, 3 and 4 of phase 2. An error thrown by HandleForeignTransaction would interfere with this, which is why the hook is not expected to throw one.

If the transactions are prepared on all the foreign servers successfully, we enter phase 2 of 2PC.

The local transaction is not required to be prepared per se.

Phase 2
-----------
After the local transaction has committed or aborted, the foreign transactions prepared as part of it are committed or aborted respectively. Committing or aborting a prepared foreign transaction is henceforth collectively termed "resolving" it, for simplicity. The following steps are executed while resolving a foreign prepared transaction (a sketch follows the list):

1. Resolve the foreign prepared transaction on the corresponding foreign server, using the user mapping of the local user that was used at the time of preparing the transaction. This is done through the hook HandleForeignTransaction().
2. If the resolution is successful, remove the prepared foreign transaction entry from memory.
3. Log the removal of the entry in XLOG. When this record is replayed during recovery or in standby mode, it executes step 4 below.
4. Remove the corresponding file from pg_fdw_xact directory.
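
A per-entry sketch of these four steps, again with invented helper names and the FDW_XACT_* values and FdwXactCallbacks struct from the earlier hook sketch:

    /* Sketch of resolving one prepared foreign transaction (phase 2). */
    #include "postgres.h"

    /* Hypothetical helpers implementing steps 2-4 (not real functions). */
    extern void remove_fdw_xact(TransactionId xid, Oid serverid, Oid userid);
    extern void log_fdw_xact_removal(TransactionId xid, Oid serverid, Oid userid);
    extern void unlink_fdw_xact_file(TransactionId xid, Oid serverid, Oid userid);

    static void
    resolve_fdw_xact_sketch(const FdwXactCallbacks *cb,
                            TransactionId xid, Oid serverid, Oid userid,
                            const char *prep_id, bool commit)
    {
        /* 1. Ask the FDW to finish the prepared transaction on the remote side. */
        if (!cb->HandleForeignTransaction(commit ? FDW_XACT_COMMIT_PREPARED
                                                 : FDW_XACT_ABORT_PREPARED,
                                          serverid, userid, prep_id))
            return;             /* leave the entry untouched; retry later */

        /* 2. Drop the entry from shared memory. */
        remove_fdw_xact(xid, serverid, userid);

        /* 3. WAL-log the removal; replaying this record performs step 4. */
        log_fdw_xact_removal(xid, serverid, userid);

        /* 4. Unlink the state file from pg_fdw_xact/. */
        unlink_fdw_xact_file(xid, serverid, userid);
    }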

If the resolution is unsuccessful, leave the entry untouched. Since this phase is carried out when no transaction exists, HandleForeignTransaction should not throw an error and should be designed not to access the database while performing this operation.

If the server crashes after step 1 and before step 3, an already-resolved foreign transaction will be considered unresolved when the local server recovers or a standby takes over as master. It will try to resolve the prepared transaction again and should get an error from the foreign server. The HandleForeignTransaction hook should treat this as normal and return true, since the prepared transaction is resolved (or rather there is nothing more that can be done). For such cases it is important that GetPrepareId returns a transaction identifier which does not conflict with a future transaction id, lest we resolve (possibly with the wrong outcome) a prepared transaction which shouldn't be resolved.

Any crash or connection failure in phase 2 leaves the prepared transaction in an unresolved state.

Resolving unresolved foreign transactions
================================
A local/foreign server crash or connection failure after a transaction is prepared on the foreign server leaves that transaction in an unresolved state. The patch provides a built-in function pg_fdw_resolve() to resolve those after recovering from the failure. This built-in scans all the prepared transactions in memory and decides the fate (commit/rollback) of each based on the fate of the local transaction that prepared it on the foreign server. It does not touch entries corresponding to in-progress local transactions. It then executes the same steps as phase 2 to resolve the prepared foreign transactions. Since foreign server information is contained within a database, the function only touches the entries corresponding to the database from which it is invoked. A user can configure a daemon or cron job to execute this function frequently from various databases. Alternatively, a user can use the contrib module pg_fdw_xact_resolver, which does the same using the background worker mechanism. This module needs to be installed and listed in shared_preload_libraries to start the daemon automatically at startup.
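
The decision the resolver makes for each entry can be sketched as below, reusing resolve_fdw_xact_sketch() from the phase 2 sketch above. TransactionIdIsInProgress() and TransactionIdDidCommit() are existing backend functions; everything else is illustrative.

    /* Sketch: decide the fate of one unresolved entry from the local xid. */
    #include "postgres.h"
    #include "access/transam.h"      /* TransactionIdDidCommit() */
    #include "storage/procarray.h"   /* TransactionIdIsInProgress() */

    static void
    pg_fdw_resolve_one_sketch(const FdwXactCallbacks *cb,
                              TransactionId xid, Oid serverid, Oid userid,
                              const char *prep_id)
    {
        /* Entries belonging to still in-progress local transactions are skipped. */
        if (TransactionIdIsInProgress(xid))
            return;

        /* The foreign branch simply follows the fate of the local transaction. */
        resolve_fdw_xact_sketch(cb, xid, serverid, userid, prep_id,
                                TransactionIdDidCommit(xid));
    }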

A foreign server or user mapping corresponding to an unresolved foreign transaction is not allowed to be dropped or altered until the foreign transaction is resolved. This is required to retain the connection properties needed to resolve the prepared transaction on the foreign server.

Crash recovery
============
During crash recovery, the files in pg_fdw_xact/ are created or removed when the corresponding WAL records are replayed. After redo is done, the pg_fdw_xact directory is scanned for unresolved foreign prepared transactions. The files in this directory are named after the triplet (xid, foreign server oid, user oid), which creates a unique name for each file. This scan also emits the oldest transaction id with an unresolved prepared foreign transaction. This affects the oldest active transaction id, since the status of that transaction id is required to decide the fate of the unresolved prepared foreign transaction.
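
For illustration, the file name could be built from that triplet roughly as below; the exact format string is an assumption, only the (xid, server oid, user oid) components come from the description above.

    /* Sketch: build the pg_fdw_xact/ state file path from the naming triplet. */
    #include "postgres.h"
    #include <stdio.h>

    static void
    fdw_xact_file_path_sketch(char *path, size_t len,
                              TransactionId xid, Oid serverid, Oid userid)
    {
        snprintf(path, len, "pg_fdw_xact/%08X_%u_%u",
                 (unsigned int) xid, (unsigned int) serverid,
                 (unsigned int) userid);
    }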

On a standby, files are just created or removed during WAL replay. If the standby is required to finish recovery and take over as master, pg_fdw_xact is scanned to read the unresolved foreign prepared transactions into shared memory.

Preparing a transaction involving foreign server(s) on the local server
=================================================
While PREPARing a local transaction that involves foreign servers, the transactions are prepared on the foreign servers (as described in phase 1 above), if atomic_foreign_transaction is enabled. If the GUC is disabled, such local transactions cannot be prepared (as of this patch at least). This also means that all the foreign servers participating in the transaction to be prepared are required to support 2PC. While committing/rolling back the prepared transaction, the corresponding foreign prepared transactions are committed or rolled back (as described in phase 2) respectively. Any unresolved foreign transactions are resolved the same way as above.

View for checking the current foreign prepared transactions
=============================================
A built-in function pg_fdw_xact() lists all the currently prepared foreign transactions. This function does not list anything on a standby while it is replaying WAL, since there are no in-memory entries there. A convenient view pg_fdw_xacts lists the same with the oids converted to names.

Handling non-atomic foreign transactions
===============================
When atomic_foreign_transaction is disabled, a one-phase commit protocol is used to commit/rollback the foreign transactions. After the local transaction has committed/aborted, all the foreign transactions on the registered foreign connections are committed or aborted respectively, using the hook HandleForeignTransaction (see the sketch below). Failing to commit a foreign transaction does not affect the other foreign transactions; an attempt is still made to commit them (if the local transaction commits).
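
A sketch of the one-phase path, using the same hypothetical callback struct as the earlier sketches; note that a false return is deliberately ignored here.

    /* Sketch: one-phase commit/abort of one registered foreign connection. */
    #include "postgres.h"

    static void
    end_foreign_xact_one_phase_sketch(const FdwXactCallbacks *cb,
                                      Oid serverid, Oid userid,
                                      bool local_committed)
    {
        /*
         * Failure is ignored: the remaining registered connections are still
         * committed or aborted, so atomicity is not guaranteed in this mode.
         */
        (void) cb->HandleForeignTransaction(local_committed ? FDW_XACT_COMMIT
                                                            : FDW_XACT_ABORT,
                                            serverid, userid, NULL);
    }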

PITR
====
PITR may rewind the database to a point before an xid associated with an unresolved foreign transaction. There are two approaches to deal with the situation.
1. Just forget about the unresolved foreign transaction and remove the file just like we do for a prepared local transaction. But then the prepared transaction on the foreign server might be left unresolved forever and will keep holding the resources.
2. Do not allow PITR to such a point. We cannot get rid of the transaction id without getting rid of the prepared foreign transaction; if we did, we might create conflicting files in the future and might resolve the transaction with the wrong outcome.

Rest of the mail contains replies to Heikki's comments.

On Tue, Jul 7, 2015 at 2:55 PM, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 02/17/2015 11:26 AM, Ashutosh Bapat wrote:
Hi All,

Here are the steps and infrastructure for achieving atomic commits across
multiple foreign servers. I have tried to address most of the concerns
raised in this mail thread before. Let me know, if I have left something.
Attached is a WIP patch implementing the same for postgres_fdw. I have
tried to make it FDW-independent.

Wow, this is going to be a lot of new infrastructure. This is going to need good documentation, explaining how two-phase commit works in general, how it's implemented, how to monitor it etc. It's important to explain all the possible failure scenarios where you're left with in-doubt transactions, and how the DBA can resolve them.

I have included some documentation in the patch. Once we agree on the functionality and design, I will improve the documentation further.
 

Since we're building a Transaction Manager into PostgreSQL, please put a lot of thought on what kind of APIs it provides to the rest of the system. APIs for monitoring it, configuring it, etc. And how an extension could participate in a transaction, without necessarily being an FDW.


The patch has added all of it except the extension part. Let me know if anything is missing.

Even today an extension can participate in a transaction by registering transaction and subtransaction callbacks. So, as long as an extension (and hence some FDW) does things such that failures in them do not affect atomicity, it can use these callbacks. However, these callbacks are not enough to handle unresolved prepared transactions or connectivity failures in phase 2. The patch adds infrastructure to do that.

dblink might be something on your mind, but to support dblink here, we would need a very liberal format for storing information about the prepared transactions on other servers. This format would vary from extension to extension, and may not be very useful as described above. What we might be able to do is expose the functions for creating files for prepared transactions and logging them, and let extensions use those. BTW, dblink_plus already supports 2PC for dblink.
 
Regarding the configuration, there are many different behaviours that an FDW could implement:

1. The FDW is read-only. Commit/abort behaviour is moot.

I can think of two flavours of read-only FDW: 1. the underlying data is read-only; 2. the FDW is read-only but the underlying data is not.
In the first case, the FDW may choose not to participate in transaction management at all, and so does not register its foreign connections. The rest of the transaction is still atomic.

In the second case, however, the writes to other foreign servers may depend upon what has been read from the read-only FDW, especially at repeatable read and higher isolation levels. So it's important that the data once read remains intact till the transaction commits or at least is prepared, implying we have to start a transaction on the read-only foreign server. Once the other foreign transactions are prepared, we might be able to commit the transaction on the read-only foreign server. That optimization is not yet implemented by my patch, but it should be possible in the approach taken by the patch. Can we leave that as a future enhancement?

Does that solve your concern?
 
2. Transactions are not supported. All updates happen immediately regardless of the local transaction.

An FDW can choose not to register its server and local PostgreSQL won't know about it. Is that acceptable behaviour?
 
3. Transactions are supported, but two-phase commit is not. There are three different ways we can use the remote transactions in that case:
 
This case is supported using the GUC atomic_foreign_transaction. The patch implements approach 3.2.
 
3.1. Commit the remote transaction before local transaction.
3.2. Commit the remote transaction after local transaction.
3.3. As long as there is only one such FDW involved, we can still do safe two-phase commit using so-called Last Resource Optimization.

IIUC LRO, the patch uses the local transaction as the last resource, which is always present. The fate of the foreign transactions is decided by the fate of the local transaction, which is not required to be prepared per se. There is a more relevant note later.
 
4. Full two-phase commit support

We don't necessarily have to support all of that, but let's keep all these cases in mind when we design the how to configure FDWs. There's more to it than "does it support 2PC".

A. Steps during transaction processing
------------------------------------------------

1. When an FDW connects to a foreign server and starts a transaction, it
registers that server with a boolean flag indicating whether that server is
capable of participating in a two phase commit. In the patch this is
implemented using function RegisterXactForeignServer(), which raises an
error, thus aborting the transaction, if there is at least one foreign
server incapable of 2PC in a multiserver transaction. This error is thrown as
early as possible. If all the foreign servers involved in the transaction
are capable of 2PC, the function just updates the information. As of now,
in the patch the function is in the form of a stub.

Whether a foreign server is capable of 2PC, can be
a. FDW level decision e.g. file_fdw as of now, is incapable of 2PC but it
can build the capabilities which can be used for all the servers using
file_fdw
b. a decision based on server version type etc. thus FDW can decide that by
looking at the server properties for each server
c. a user decision where the FDW can allow a user to specify it in the form
of CREATE/ALTER SERVER option. Implemented in the patch.

For a transaction involving only a single foreign server, the current code
remains unaltered as two phase commit is not needed.

Just to be clear: you also need two-phase commit if the transaction updated anything in the local server and in even one foreign server.

Any local transaction involving a foreign server transaction uses two-phase commit for the foreign transaction. The local transaction is not prepared per se. However, we should be able to optimize the case when there are no local changes. I am not able to find a way to deduce that there was no local change, so I have left that case as-is in this patch. Is there a way to know whether a local transaction changed something locally or not?
 

D. Persistent and in-memory storage considerations
--------------------------------------------------------------------
I considered following options for persistent storage
1. in-memory table and file(s) - The foreign transaction entries are saved
and manipulated in shared memory. They are written to file whenever
persistence is necessary e.g. while registering the foreign transaction in
step A.2. Requirements C.1, C.2 need some SQL interface in the form of
built-in functions or SQL commands.

The patch implements the in-memory foreign transaction table as a fixed
size array of foreign transaction entries (similar to prepared transaction
entries in twophase.c). This puts a restriction on number of foreign
prepared transactions that need to be maintained at a time. We need
separate locks to synchronize the access to the shared memory; the patch
uses only a single LW lock. There is restriction on the length of prepared
transaction id (or prepared transaction information saved by FDW to be
general), since everything is being saved in fixed size memory. We may be
able to overcome that restriction by writing this information to separate
files (one file per foreign prepared transaction). We need to take the same
route as 2PC for C.5.

Your current approach with a file that's flushed to disk on every update has a few problems. Firstly, it's not crash safe. Secondly, if you make it crash-safe with fsync(), performance will suffer. You're going to need to need several fsyncs per commit with 2PC anyway, there's no way around that, but the scalable way to do that is to use the WAL so that one fsync() can flush more than one update in one operation.

So I think you'll need to do something similar to the pg_twophase files. WAL-log each update, and only flush the file/files to disk on a checkpoint. Perhaps you could use the pg_twophase infrastructure for this directly, by essentially treating every local transaction as a two-phase transaction, with some extra flag to indicate that it's an internally-created one.

I have used an approach similar to pg_twophase, but implemented it as separate code, as the requirements differ. However, I would like to minimize code by unifying both if we finalise this design. Suggestions in this regard will be very helpful.
 

2. New catalog - This method takes out the need to have separate method for
C1, C5 and even C2, also the synchronization will be taken care of by row
locks, there will be no limit on the number of foreign transactions as well
as the size of foreign prepared transaction information. But big problem
with this approach is that, the changes to the catalogs are atomic with the
local transaction. If a foreign prepared transaction can not be aborted
while the local transaction is rolled back, that entry needs to retained.
But since the local transaction is aborting the corresponding catalog entry
would become invisible and thus unavailable to the resolver (alas! we do
not have autonomous transaction support). We may be able to overcome this,
by simulating autonomous transaction through a background worker (which can
also act as a resolver). But the amount of communication and
synchronization, might affect the performance.

Or you could insert/update the rows in the catalog with xmin=FrozenXid, ignoring MVCC. Not sure how well that would work.

I am not aware how to do that. Do we have any precedent in the code, something like a reference implementation which I can follow? It would help to lift two restrictions:
1. Restriction on the number of simultaneously prepared foreign transactions.
2. Restriction on the prepared transaction identifier length.

Obviously we may be able to shed a lot of code related to file management, lookup etc.
 

3. WAL records - Since the algorithm follows "write ahead of action", WAL
seems to be a possible way to persist the foreign transaction entries. But
WAL records can not be used for repeated scan as is required by the foreign
transaction resolver. Also, replaying WALs is controlled by checkpoint, so
not all WALs are replayed. If a checkpoint happens while a foreign prepared
transaction remains unresolved, the corresponding WALs will never be replayed,
thus causing the foreign prepared transaction to remain unresolved forever
without a clue. So, WALs alone don't seem to be a fit here.

Right. The pg_twophase files solve that exact same issue.

There is clearly a lot of work to do here.
I'm marking this as Returned with Feedback in the commitfest, I don't think more review is going to be helpful at this point.

That's sad. I hope people will review the patch and help it improve, even if it's out of the commitfest.
 

- Heikki




--
Best Wishes,
Ashutosh Bapat
EnterpriseDB Corporation
The Postgres Database Company