Обсуждение: SQL/MED with simple wrappers

Поиск
Список
Период
Сортировка

SQL/MED with simple wrappers

От
Shigeru HANADA
Дата:
Here is a revised version of SQL/MED patch which is based on
the discussion in CommitFest 2010-09.  The original post in the
CommitFest is:
http://archives.postgresql.org/pgsql-hackers/2010-09/msg00779.php

In this message, I'll describe major changes from previous version of
patch.  Changes from HEAD are described in the wiki.
http://wiki.postgresql.org/wiki/SQL/MED

=== Changes from last patch ===
1) Per-column generic options
In addition to per-table options, now per-column options are supported.
Per-column options are stored in pg_attribute.attgenoptions column in
same form as other FDW objects.  New information_schema view columns
and column_options are also added.

2) FDW Interface
Per previous discussion, a function EstimateCosts() was added to
FdwRoutine.  It's called from cost_foreignscan() to estimate
FDW-specific costs of a foreign scan in their own ways.

Calling sequence is:

    1) EstimateCosts(), from cost_foreignscan(), from
    2) ConnectServer(), from ExecInitForeignScan()
    3) Open(), from ExecInitForeignScan()
    4) Iterate() for each result tuple, from ForeignNext()
    5) ReOpen() for each outer loop, if necessary, from ExecForeignReScan()
    6) Close(), from ExecEndForeignScan()
    -) FreeFSConnection() is not called from backend

3) Simplified FDWs(WIP)
I also have revised two FDWs, file_fdw and postgresql_fdw, to more
simple implementation to focus on discussion about the interface.

3-1) file_fdw

This FDW handles flat text files on the server side, like COPY FROM.

The available format is TEXT and CSV of COPY command.  The codes are
almost copied from COPY FROM and have many duplication.  It would be
better to integrate them, maybe COPY FROM will use file_fdw instead of
heap-am.

For this FDW,  ConnectServer() and FreeFSConnection() have nothing to
do because the flat files need no connection.

EstimateCosts() should consider costs of:
  - reading records from the file
  - parsing records
  - creating tuples

3-2) postgresql_fdw(in contrib)

This FDW handles data in external PostgreSQL servers, like
contrib/dblink.  Some part of the implementation would be able to be
integrated with contrib/dblink.

ConnectServer() manages connection cache with identifying connection
by the server name, and all cached connections are disposed when the
backend dies.

EstimateCosts() should consider costs of:
  - executing query on the remote PostgreSQL
  - transferring result tuples via libpq
  - creating tuples from string
  - storing tuples in tuplestore
  - retrieve tuples from tuplestore

The first one of above would be able to estimated by executing EXPLAIN
ANALYZE on the remote side.  EstimateCosts() would need to call
ConnectServer() by itself, though.

=== Known issues and feature improvement ===

1) Optimize foreign query
In current implementation, each foreign table are scanned separately.
It would be useful to push ORDER BY, LIMIT and OFFSET down to remote
side for applications which uses paging.  To implement such kind of
optimization, optimizer would need to generate practical numbers of
sortkeys of a foreign table, not on a brute force method.

Some foreign scans using same foreign server would be able to united
into one foreign query.  For instance, JOIN, UNION and subqueries.

2) Integrate FDW with COPY FROM
COPY FROM can handle CSV, TEXT and BINARY format stream (file or
stdin) as data source.  It would be worth to change COPY FROM to
accept various data source using FDW infrastructure. The syntax would be:

    COPY table_name FROM SERVER server_name WITH OPTIONS (on-the-fly options);

3) Integrate file_fdw with dblink
Currently dblink and postgresql_fdw have connection cache mechanism
for each.  Can they share connection management?

3) Connection invalidation on catalog change
Should the foreign connection be invalid if the generic options of
FDW-related objects, such as user mapping or foreign server, have been
changed after establishment of the connection?  Or ALTER statement
should be blocked if the object was used by active connection?

4) List of foreign connections
Users (especially DBAs?) might want to see list of foreign connections.
Currently postgresql_fdw provides its own connection list via
postgresql_fdw_connections view.  Common view such as
pg_foreign_connections would be needed?  If so, function which returns
list of active connections would be necessary in FDW API.

5) Routine mapping
If a function in local query can be evaluated on the remote side in
same semantics, it seems efficient to push the function down to the
remote side.  But how can we know whether the function can be pushed
down or not?  For such purpose, SQL/MED standard defines "routine
mapping".  Should we implement routine mapping?

6) MERGE join support
ExecForeignMarkPos() and ExecForeignRestrPos() has been defined, but
not implemented.  They are used to mark/restore the position of
scanning for merge join.  New APIs, MarkPos() and RestorePos() should
be added to FdwRoutine?  It's not clear to me how the planner choose
merge join.

7) Using cursor in postgresql_fdw
postgresql_fdw fetches all of the result tuples from the foreign
server in the first pgIterate() call.  It could cause out-of-memory if
the result set was huge.  If libpq supports protocol-level cursor,
postgresql_fdw will be able to divide result set into some sets and
lower the usage of memory.  Or should we use declare implicit cursor
with DECLARE statement?  One connection can be used by multiple
ForeignScan nodes in a local query alternately.  An issue is that
cursor requires implicit transaction block.  Is it OK to start
transaction automatically?

8) Query-time constraints
Currently foreign tables can have NOT NULL and/or CHECK constraints.
It helps table partitioning with constraint exclusion.  It also helps
data validation on query-time.  But applying constraints at query-time
is has not been implemented.

Regards,
--
Shigeru Hanada

Вложения

Re: SQL/MED with simple wrappers

От
Pavel Stehule
Дата:
Hello

> 4) List of foreign connections
> Users (especially DBAs?) might want to see list of foreign connections.
> Currently postgresql_fdw provides its own connection list via
> postgresql_fdw_connections view.  Common view such as
> pg_foreign_connections would be needed?  If so, function which returns
> list of active connections would be necessary in FDW API.
>

+ list of foreign tables?

> 5) Routine mapping
> If a function in local query can be evaluated on the remote side in
> same semantics, it seems efficient to push the function down to the
> remote side.  But how can we know whether the function can be pushed
> down or not?  For such purpose, SQL/MED standard defines "routine
> mapping".  Should we implement routine mapping?
>

is it related to aggregate functions? If yes, this it can be really
significant help

>
> 7) Using cursor in postgresql_fdw
> postgresql_fdw fetches all of the result tuples from the foreign
> server in the first pgIterate() call.  It could cause out-of-memory if
> the result set was huge.  If libpq supports protocol-level cursor,
> postgresql_fdw will be able to divide result set into some sets and
> lower the usage of memory.  Or should we use declare implicit cursor
> with DECLARE statement?  One connection can be used by multiple
> ForeignScan nodes in a local query alternately.  An issue is that
> cursor requires implicit transaction block.  Is it OK to start
> transaction automatically?

I don't know why DECLARE statement is problem? Can you explain it, please.

regards

Pavel Stehule


Re: SQL/MED with simple wrappers

От
Shigeru HANADA
Дата:
Thanks for your comments.

On Mon, 25 Oct 2010 15:05:51 +0200
Pavel Stehule <pavel.stehule@gmail.com> wrote:
> > 4) List of foreign connections
> > Users (especially DBAs?) might want to see list of foreign connections.
> > Currently postgresql_fdw provides its own connection list via
> > postgresql_fdw_connections view.  Common view such as
> > pg_foreign_connections would be needed?  If so, function which returns
> > list of active connections would be necessary in FDW API.
> >
> 
> + list of foreign tables?

I've implemented that functionality in some places.

1) \det psql command shows list of foreign table in the format like  \dew and \des.
2) pg_foreign_tables catalog shows pair of OIDs (relation oid and  server oid) and options in raw format.
3) views in information_schema, foreign_tables and foreign_table_options  show information about foreign tables in SQL
standardformat.
 

Here the detail of \det psql command is.

\det psql command (followed naming of \des/\dew) shows list of
foreign tables, and \det <pattern> shows the list of foreign tables
whose name match the pattern.  For example of file_fdw:

postgres=# \det  List of foreign tables   Table     |   Server
--------------+-------------csv_accounts | file_servercsv_branches | file_servercsv_history  | file_servercsv_tellers
|file_server
 
(4 rows)

Adding postfix "+" shows per-table generic options too.

postgres=# \det+                                  List of foreign tables   Table     |   Server    |
       Options
 
--------------+-------------+----------------------------------------------------------------csv_accounts | file_server
|{format=csv,filename=/home/hanada/DB/CSV/pgbench_accounts.csv}csv_branches | file_server |
{format=csv,filename=/home/hanada/DB/CSV/pgbench_branches.csv}csv_history | file_server |
{format=csv,filename=/home/hanada/DB/CSV/pgbench_history.csv}csv_tellers | file_server |
{format=csv,filename=/home/hanada/DB/CSV/pgbench_tellers.csv}
(4 rows)

I have chosen \det+ command to show per-table options because \d+
command has already many columns and seems difficult to add long
values.

In addition to \det, \dec command would be necessary to show per-column
options with columns.  It hasn't been implemented yet, though.

> > 5) Routine mapping
> > If a function in local query can be evaluated on the remote side in
> > same semantics, it seems efficient to push the function down to the
> > remote side.  But how can we know whether the function can be pushed
> > down or not?  For such purpose, SQL/MED standard defines "routine
> > mapping".  Should we implement routine mapping?
> >
> 
> is it related to aggregate functions? If yes, this it can be really
> significant help

Yes.  I was thinking about only normal functions at original post,
though.  To push down aggregate function to remote side, FDW would
need additional planner hook to merge Aggregate node in to ForeignScan
node.  Such planner hook might be able to handle Order or Limit node
too.

> > 7) Using cursor in postgresql_fdw
> > postgresql_fdw fetches all of the result tuples from the foreign
> > server in the first pgIterate() call.  It could cause out-of-memory if
> > the result set was huge.  If libpq supports protocol-level cursor,
> > postgresql_fdw will be able to divide result set into some sets and
> > lower the usage of memory.  Or should we use declare implicit cursor
> > with DECLARE statement?  One connection can be used by multiple
> > ForeignScan nodes in a local query alternately.  An issue is that
> > cursor requires implicit transaction block.  Is it OK to start
> > transaction automatically?
> 
> I don't know why DECLARE statement is problem? Can you explain it, please.

The most serious issue would be that SQL-level cursors require
explicit transaction block.  To use SQL-level cursors with shared
connections between multiple ForeignScan, postgresql_fdw need to
manage transaction state per connection.  Especially, recovering
error would make codes complex.  Or, we would be able to take the
easiest way, discarding connection at the error.

I'll try to implement cursor-version of postgresql_fdw experimentally
to make this issue clearer.

Regards,
--
Shigeru Hanada