Discussion: Pluggable storage


Pluggable storage

From
Alvaro Herrera
Date:
Many have expressed their interest in this topic, but I haven't seen any
design of how it should work.  Here's my attempt; I've been playing with
this for some time now and I think what I propose here is a good initial
plan.  This will allow us to write permanent table storage that works
differently than heapam.c.  At this stage, I haven't thought through
whether this is going to allow extensions to define new storage modules;
I am focusing on AMs that can coexist with heapam in core.

The design starts with a new row type in pg_am, of type "s" (for "storage").
The handler function returns a struct of node type StorageAmRoutine.  This
contains functions for 1) scans (beginscan, getnext, endscan), 2) tuples
(tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the
like), and 3) operations on tuples that are part of slots (tuple_deform,
materialize).
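
For concreteness, here is a minimal sketch of what such a struct could
look like; the StorageTuple and StorageScanDesc types are introduced just
below, and all field names and signatures here are illustrative
assumptions, not taken from an actual patch:

typedef void *StorageTuple;     /* opaque; for heapam.c this is a HeapTuple */
typedef struct StorageScanDescData *StorageScanDesc;

typedef struct StorageAmRoutine
{
    NodeTag     type;

    /* scan support */
    StorageScanDesc (*scan_begin) (Relation rel, Snapshot snapshot,
                                   int nkeys, ScanKey keys);
    StorageTuple (*scan_getnext) (StorageScanDesc scan,
                                  ScanDirection direction);
    void        (*scan_end) (StorageScanDesc scan);

    /* tuple manipulation; resulting TIDs are returned via output args */
    void        (*tuple_insert) (Relation rel, TupleTableSlot *slot,
                                 CommandId cid, ItemPointer otid);
    void        (*tuple_update) (Relation rel, ItemPointer tid,
                                 TupleTableSlot *slot, CommandId cid,
                                 ItemPointer otid);
    void        (*tuple_delete) (Relation rel, ItemPointer tid,
                                 CommandId cid);

    /* operations on tuples that are part of slots */
    void        (*slot_deform_tuple) (TupleTableSlot *slot, int natts);
    void        (*slot_materialize) (TupleTableSlot *slot);
    void        (*slot_clear_tuple) (TupleTableSlot *slot);
} StorageAmRoutine;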

To support this, we introduce StorageTuple and StorageScanDesc.
StorageTuples represent a physical tuple coming from some storage AM.
It is necessary to have a pointer to a StorageAmRoutine in order to
manipulate the tuple.  For heapam.c, a StorageTuple is just a HeapTuple.

RelationData gains ->rd_stamroutine which is a pointer to the
StorageAmRoutine for the relation in question.  Similarly,
TupleTableSlot is augmented with a link to the StorageAmRoutine to
handle the StorageTuple it contains (probably in most cases it's set at
the same time as the tupdesc).  This implies that routines such as
ExecAssignScanType need to pass down the StorageAmRoutine from the
relation to the slot.
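
As an illustration of the dispatch this enables (ExecSimpleInsert is a
hypothetical helper, shown only to make the calling convention concrete;
the next paragraph describes the real scope of the executor changes):

static void
ExecSimpleInsert(Relation rel, TupleTableSlot *slot, CommandId cid)
{
    ItemPointerData tid;

    /* go through the relation's storage AM rather than heap_insert() */
    rel->rd_stamroutine->tuple_insert(rel, slot, cid, &tid);

    /* indexes, constraints etc. remain the executor's responsibility */
}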

The executor is modified so that instead of calling heap_insert etc
directly, it uses rel->rd_stamroutine to call these methods.  The
executor is still in charge of dealing with indexes, constraints, and
any other thing that's not the tuple storage itself (this is one major
point in which this differs from FDWs).  This all looks simple enough,
with one exception and a few notes:

exception a) ExecMaterializeSlot needs special consideration.  This is
used in two different ways: a1) is the stated "make tuple independent
from any underlying storage" point, which is handled by
ExecMaterializeSlot itself and calling a method from the storage AM to
do any byte copying as needed.  ExecMaterializeSlot no longer returns a
HeapTuple, because there might not be any.  The second usage pattern a2)
is to create a HeapTuple that's passed to other modules which only deal
with HeapTuples and not slots (triggers are the main case I noticed, but I think
there are others such as the executor itself wanting tuples as Datum for
some reason).  For the moment I'm handling this by having a new
ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of
the original tuple format.
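
Assuming the slot's values/isnull machinery stays roughly as it is today,
ExecHeapifyTuple could amount to little more than this sketch:

HeapTuple
ExecHeapifyTuple(TupleTableSlot *slot)
{
    /* extract all attributes from whatever StorageTuple the slot holds */
    slot_getallattrs(slot);

    /* build an ordinary HeapTuple from the deformed values/isnull arrays */
    return heap_form_tuple(slot->tts_tupleDescriptor,
                           slot->tts_values,
                           slot->tts_isnull);
}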

note b) EvalPlanQual currently maintains an array of HeapTuple in
EState->es_epqTuple.  I think it works to replace that with an array of
StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine
methods in order to interact with it.  Other than those changes, it
seems okay.

note c) nodeSubplan has curTuple as a HeapTuple.  It seems simple
to replace this with an independent slot-based tuple.

note d) grp_firstTuple in nodeAgg / nodeSetOp.  These are less
simple than the above, but replacing the HeapTuple with a slot-based
tuple seems doable too.

note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual.
TupleTableSlot also seems a good replacement.  This has fallout in other
users of EvalPlanQual, too.

note f) More widespread, MinimalTuples currently use a tweaked HeapTuple
format.  In the long run, it may be possible to replace them with a
separate storage module that's specifically designed to handle tuples
meant for tuplestores etc.  That may simplify TupleTableSlot and
execTuples.  For the moment we keep the tts_mintuple as it is.  Whenever
a tuple is not already in heap format, we heapify it in order to put it in
the store.


The current heapam.c routines need some changes.  Currently, practice is
that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on
their input tuples to set the resulting ItemPointer in tuple->t_self.
This is messy if we want StorageTuples to be abstract.  I'm changing
this so that the resulting ItemPointer is returned in a separate output
argument; the tuple itself is left alone.  This is somewhat messy in the
case of heap_multi_insert because it returns several items; I think it's
acceptable to return an array of ItemPointers in the same order as the
input tuples.  This works fine for the only caller, which is COPY in
batch mode.  For the other routines, they don't really care where the
TID is returned AFAICS.
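
Expressed as signatures, the change would look roughly like this (the
otid/otids output arguments are the sketched additions, not the actual
patched signatures):

/* today: the TID ends up in tup->t_self */
Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
                int options, BulkInsertState bistate);

/* proposed: the input tuple is left alone, the TID comes back separately */
Oid heap_insert(Relation relation, HeapTuple tup, CommandId cid,
                int options, BulkInsertState bistate,
                ItemPointer otid);

/* proposed multi-insert: one output TID per input tuple, in input order */
void heap_multi_insert(Relation relation, HeapTuple *tuples, int ntuples,
                       CommandId cid, int options, BulkInsertState bistate,
                       ItemPointerData *otids);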


Additional noteworthy items:

i) Speculative insertion: the speculative insertion token is no longer
installed directly in the heap tuple by the executor (of course).
Instead, the token becomes part of the slot.  When the tuple_insert
method is called, the insertion routine is in charge of setting the
token from the slot into the storage tuple.  Executor is in charge of
calling method->speculative_finish() / abort() once the insertion has
been confirmed by the indexes.
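
The resulting control flow, sketched with assumed names (tts_spectoken is
a hypothetical slot field, and the speculative_* method signatures are
guesses):

static void
ExecInsertSpeculative(Relation rel, TupleTableSlot *slot, CommandId cid)
{
    ItemPointerData tid;
    bool        specConflict = false;

    /* the token now travels in the slot, not in a HeapTuple header */
    slot->tts_spectoken =
        SpeculativeInsertionLockAcquire(GetCurrentTransactionId());

    /* the AM's insert routine copies the token into its storage tuple */
    rel->rd_stamroutine->tuple_insert(rel, slot, cid, &tid);

    /* ... insert index entries, setting specConflict on conflict ... */

    if (specConflict)
        rel->rd_stamroutine->speculative_abort(rel, slot);
    else
        rel->rd_stamroutine->speculative_finish(rel, slot);
}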

ii) execTuples has additional accessors for tuples-in-slot, such as
ExecFetchSlotTuple and friends.  I expect to have some of them to return
abstract StorageTuples, others HeapTuple or MinimalTuples (possibly
wrapped in Datum), depending on callers.  We might be able to cut down
on these later; my first cut will try to avoid API changes to keep
fallout to a minimum.

iii) All tuples need to be identifiable by ItemPointers.  Storages that
have different requirements will need careful additional thought across
the board.

iv) System catalogs cannot use pluggable storage.  We continue to use
heap_open etc in the DDL code, in order not to make this more invasive
than it already is.  We may lift this restriction later for specific
catalogs, as needed.

v) Currently, one Buffer may be associated with one HeapTuple living in a
slot; when the slot is cleared, the buffer pin is released.  My current
patch moves the buffer pin to inside the heapam-based storage AM and the
buffer is released by the ->slot_clear_tuple method.  The rationale for
doing this is that some storage AMs might want to keep several buffers
pinned at once, for example, and must not release those pins
individually but in batches as the scan moves forwards (say a batch of
tuples in a columnar storage AM has column values spread across many
buffers; they must all be kept pinned until the scan has moved past the
whole set of tuples).  But I'm not really sure that this is a great
design.
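
For heapam itself the sketch would be simple (HeapamSlotState and
tts_storage are assumed names for AM-private slot state, invented here
for illustration):

typedef struct HeapamSlotState
{
    HeapTuple   tuple;      /* the physical tuple held by the slot */
    Buffer      buffer;     /* pin held while the tuple is in the slot */
} HeapamSlotState;

static void
heapam_slot_clear_tuple(TupleTableSlot *slot)
{
    HeapamSlotState *state = (HeapamSlotState *) slot->tts_storage;

    /* the pin now lives in the AM: release it here, not in execTuples.c */
    if (BufferIsValid(state->buffer))
    {
        ReleaseBuffer(state->buffer);
        state->buffer = InvalidBuffer;
    }
    state->tuple = NULL;
}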


I welcome comments on these ideas.  My patch for this is nowhere near
completion yet; expect things to change for items that I've overlooked,
but I hope I didn't overlook any major ones.  If things are handwavy, it is
probably because I haven't fully figured them out yet.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable storage

From
Alexander Korotkov
Date:
On Sat, Aug 13, 2016 at 2:15 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
Many have expressed their interest in this topic, but I haven't seen any
design of how it should work.

That's true.  My design presented at PGCon was very sketchy, and I haven't delivered any prototype yet.
 
Here's my attempt; I've been playing with
this for some time now and I think what I propose here is a good initial
plan.

Great!  It's nice to see you're working in this direction!
 
This will allow us to write permanent table storage that works
differently than heapam.c.  At this stage, I haven't thought through
whether this is going to allow extensions to define new storage modules;
I am focusing on AMs that can coexist with heapam in core.

So, as I understand it, you're proposing extensibility to replace the heap with something different
but compatible with the heap (with respect to executor nodes, index access methods and so on).
That's good, but what could the alternative storage access methods be?
AFAICS, we can't fit in here, for instance, another MVCC implementation (undo log),
in-memory storage, or columnar storage.  However, it seems that we would be
able to make a compressed heap or an alternative HOT/WARM tuple mechanism.
Correct me if I'm wrong.

ISTM you're proposing something quite orthogonal to my view
of pluggable storage engines.  My idea, in general, was to extend the FDW mechanism
to let it be something manageable inside the database (support for VACUUM, defining
indexes and so on).  But the imperative was that it should have its own executor nodes,
and it doesn't have to be compatible with existing index access methods.

Therefore, I think my design presented at PGCon and your current proposal are
about orthogonal features which could coexist.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Pluggable storage

From
Robert Haas
Date:
On Fri, Aug 12, 2016 at 7:15 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Many have expressed their interest in this topic, but I haven't seen any
> design of how it should work.  Here's my attempt; I've been playing with
> this for some time now and I think what I propose here is a good initial
> plan.  This will allow us to write permanent table storage that works
> differently than heapam.c.  At this stage, I haven't thought through
> whether this is going to allow extensions to define new storage modules;
> I am focusing on AMs that can coexist with heapam in core.

Thanks for taking a stab at this.  I'd like to throw out a few concerns.

One, I'm worried that adding an additional layer of pointer-jumping is
going to slow things down and make Andres' work to speed up the
executor more difficult.  I don't know that there is a problem there,
and if there is a problem I don't know what to do about it, but I
think it's something we need to consider.  I am somewhat inclined to
believe that we need to restructure the executor in a bigger way so
that it passes around datums instead of tuples; I'm inclined to
believe that the current tuple-centric model is probably not optimal
even for the existing storage format.  It seems even less likely to be
right for a data format in which fetching columns is more expensive
than currently, such as a columnar store.

Two, I think that we really need to think very hard about how the
query planner will interact with new heap storage formats.  For
example, suppose cstore_fdw were rewritten as a new heap storage
format.   Because ORC contains internal indexing structures with
characteristics somewhat similar to BRIN, many scans can be executed
much more efficiently than for our current heap storage format.  If it
can be seen that an entire chunk will fail to match the quals, we can
skip the whole chunk.  Some operations may permit additional
optimizations: for example, given SELECT count(*) FROM thing WHERE
quals, we may be able to push the COUNT(*) down into the heap access
layer.  If it can be verified that EVERY tuple in a chunk will match
the quals, we can just increment the count by that number without
visiting each tuple individually.  This could be really fast.  These
kinds of query planner issues are generally why I have favored trying
to do something like this through the FDW interface, which already has
the right APIs for this kind of thing, even if we're not using them
all yet.  I don't say that's the only way to crack this problem, but I
think we're going to find that a heap storage API that doesn't include
adequate query planner integration is not a very exciting thing.

Three, with respect to this limitation:

> iii) All tuples need to be identifiable by ItemPointers.  Storages that
> have different requirements will need careful additional thought across
> the board.

I think it's a good idea for a first patch in this area to ignore (or
mostly ignore) this problem - e.g. maybe allow such storage formats
but refuse to create indexes on them.  But eventually I think we're
going to want/need to do something about it.  There are an awful lot
of interesting ideas that we can't pursue without addressing this.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable storage

From
Anastasia Lubennikova
Date:
13.08.2016 02:15, Alvaro Herrera:
> Many have expressed their interest in this topic, but I haven't seen any
> design of how it should work.  Here's my attempt; I've been playing with
> this for some time now and I think what I propose here is a good initial
> plan.  This will allow us to write permanent table storage that works
> differently than heapam.c.  At this stage, I haven't thought through
> whether this is going to allow extensions to define new storage modules;
> I am focusing on AMs that can coexist with heapam in core.
Hello,
thank you for starting this topic.

I am working on a very similar proposal in another thread.
https://commitfest.postgresql.org/10/700/

At first glance our drafts are very similar.  I hope it means that we are
moving in the right direction.  It's great that your proposal is focused on
interactions with the executor, while mine is mostly about the internals of the new
StorageAm interface and its interactions with buffer and storage management.

Please read and comment https://wiki.postgresql.org/wiki/HeapamRefactoring

I'll leave more concrete comments tomorrow.
Thank you for the explicit description of issues.



-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Pluggable storage

From
Andres Freund
Date:
On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
> Thanks for taking a stab at this.  I'd like to throw out a few concerns.
> 
> One, I'm worried that adding an additional layer of pointer-jumping is
> going to slow things down and make Andres' work to speed up the
> executor more difficult.  I don't know that there is a problem there,
> and if there is a problem I don't know what to do about it, but I
> think it's something we need to consider.

I'm quite concerned about that as well.


> I am somewhat inclined to
> believe that we need to restructure the executor in a bigger way so
> that it passes around datums instead of tuples; I'm inclined to
> believe that the current tuple-centric model is probably not optimal
> even for the existing storage format.

I actually prototyped that, and it's not an easy win so far. Column
extraction cost, even after significant optimization, is still often a
significant portion of the runtime. And e.g. having projection extract all
columns only after evaluating a restrictive qual referring to an "early"
column can be a significant win.  We'd definitely have to give up on
extracting columns 0..n when accessing later columns... Hm.

Greetings,

Andres Freund



Re: Pluggable storage

From
Anastasia Lubennikova
Date:
13.08.2016 02:15, Alvaro Herrera:
> Many have expressed their interest in this topic, but I haven't seen any
> design of how it should work.  Here's my attempt; I've been playing with
> this for some time now and I think what I propose here is a good initial
> plan.  This will allow us to write permanent table storage that works
> differently than heapam.c.  At this stage, I haven't thought through
> whether this is going to allow extensions to define new storage modules;
> I am focusing on AMs that can coexist with heapam in core.
>
> The design starts with a new row type in pg_am, of type "s" (for "storage").
> The handler function returns a struct of node type StorageAmRoutine.  This
> contains functions for 1) scans (beginscan, getnext, endscan), 2) tuples
> (tuple_insert/update/delete/lock, as well as set_oid, get_xmin and the
> like), and 3) operations on tuples that are part of slots (tuple_deform,
> materialize).
>
> To support this, we introduce StorageTuple and StorageScanDesc.
> StorageTuples represent a physical tuple coming from some storage AM.
> It is necessary to have a pointer to a StorageAmRoutine in order to
> manipulate the tuple.  For heapam.c, a StorageTuple is just a HeapTuple.

The StorageTuples concept looks really cool.  I've got some questions on
the details of the implementation.

Do StorageTuples have fields common to all implementations?
Or is StorageTuple a totally abstract structure that has nothing to do
with the data, except pointing to it?

I mean, we already have the HeapTupleData structure, which is a pretty
good candidate to base StorageTuple on.
It's already widely used in the executor and, moreover, it's the only structure
(except MinimalTuples and all those crazy optimizations) that works with
tuples, whether extracted from a page or created on the fly in an executor node.

typedef struct HeapTupleData
{
    uint32          t_len;          /* length of *t_data */
    ItemPointerData t_self;         /* SelfItemPointer */
    Oid             t_tableOid;     /* table the tuple came from */
    HeapTupleHeader t_data;         /* -> tuple header and data */
} HeapTupleData;

We can simply change the t_data type from HeapTupleHeader to Pointer.
And maybe add a "t_handler" field that points to the handler functions.
I'm not sure whether it should be the name of the StorageAm, its OID, or maybe
the handler function itself.  Although, if I'm not mistaken, we always have
RelationData when we want to operate on the tuple, so having t_handler
in the StorageTuple is excessive.


typedef struct StorageTupleData
{
    uint32          t_len;          /* length of *t_data */
    ItemPointerData t_self;         /* SelfItemPointer */
    Oid             t_tableOid;     /* table the tuple came from */
    Pointer         t_data;         /* -> tuple header and data.
                                     * This field should never be accessed
                                     * directly, only via StorageAm handler
                                     * functions, because we don't know the
                                     * underlying data structure. */
    ???             t_handler;      /* StorageAm that knows what to do
                                     * with the tuple */
} StorageTupleData;

This approach allows us to minimize code changes and ensures that we
won't miss any function that handles tuples.

Do you see any weak points of the suggestion?
What design do you use in your prototype?

> RelationData gains ->rd_stamroutine which is a pointer to the
> StorageAmRoutine for the relation in question.  Similarly,
> TupleTableSlot is augmented with a link to the StorageAmRoutine to
> handle the StorageTuple it contains (probably in most cases it's set at
> the same time as the tupdesc).  This implies that routines such as
> ExecAssignScanType need to pass down the StorageAmRoutine from the
> relation to the slot.

If we already have this pointer in t_handler as described above,
we don't need to pass it between functions and slots.
> The executor is modified so that instead of calling heap_insert etc
> directly, it uses rel->rd_stamroutine to call these methods.  The
> executor is still in charge of dealing with indexes, constraints, and
> any other thing that's not the tuple storage itself (this is one major
> point in which this differs from FDWs).  This all looks simple enough,
> with one exception and a few notes:

That is exactly what I tried to describe in my proposal,
in the chapter "Relation management".  I'm sure you've already noticed
that it will require a huge source code cleanup.  I've carefully read
the sources and found "violators" of the abstraction in src/backend/commands.
The list is attached to the wiki page
https://wiki.postgresql.org/wiki/HeapamRefactoring.

Except these, there are some pretty strange and unrelated functions in 
src/backend/catalog.
I'm willing to fix them, but I'd like to synchronize our efforts.

> exception a) ExecMaterializeSlot needs special consideration.  This is
> used in two different ways: a1) is the stated "make tuple independent
> from any underlying storage" point, which is handled by
> ExecMaterializeSlot itself and calling a method from the storage AM to
> do any byte copying as needed.  ExecMaterializeSlot no longer returns a
> HeapTuple, because there might not be any.  The second usage pattern a2)
> is to create a HeapTuple that's passed to other modules which only deal
> with HeapTuples and not slots (triggers are the main case I noticed, but I think
> there are others such as the executor itself wanting tuples as Datum for
> some reason).  For the moment I'm handling this by having a new
> ExecHeapifyTuple which creates a HeapTuple from a slot, regardless of
> the original tuple format.

Yes, triggers are a very special case. Thank you for the explanation.

That still goes well with my suggested format.
Nothing to do, just substitute t_data with a proper HeapTupleHeader
representation.  I think it's a job for the StorageAm.  Let's say each StorageAm
must have a stam_to_heaptuple() function and an opposite function,
stam_from_heaptuple().
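
In other words, something like the following pair in every StorageAm
(signatures assumed here purely for illustration):

HeapTuple    stam_to_heaptuple(Relation rel, StorageTuple stuple);
StorageTuple stam_from_heaptuple(Relation rel, HeapTuple htuple);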

> note b) EvalPlanQual currently maintains an array of HeapTuple in
> EState->es_epqTuple.  I think it works to replace that with an array of
> StorageTuples; EvalPlanQualFetch needs to call the StorageAmRoutine
> methods in order to interact with it.  Other than those changes, it
> seems okay.
>
> note c) nodeSubplan has curTuple as a HeapTuple.  It seems simple
> to replace this with an independent slot-based tuple.
>
> note d) grp_firstTuple in nodeAgg / nodeSetOp.  These are less
> simple than the above, but replacing the HeapTuple with a slot-based
> tuple seems doable too.
>
> note e) nodeLockRows uses lr_curtuples to feed EvalPlanQual.
> TupleTableSlot also seems a good replacement.  This has fallout in other
> users of EvalPlanQual, too.
>
> note f) More widespread, MinimalTuples currently use a tweaked HeapTuple
> format.  In the long run, it may be possible to replace them with a
> separate storage module that's specifically designed to handle tuples
> meant for tuplestores etc.  That may simplify TupleTableSlot and
> execTuples.  For the moment we keep the tts_mintuple as it is.  Whenever
> a tuple is not already in heap format, we heapify it in order to put it in
> the store.
I wonder, do we really need MinimalTuples to support all formats?

> The current heapam.c routines need some changes.  Currently, practice is
> that heap_insert, heap_multi_insert, heap_fetch, heap_update scribble on
> their input tuples to set the resulting ItemPointer in tuple->t_self.
> This is messy if we want StorageTuples to be abstract.  I'm changing
> this so that the resulting ItemPointer is returned in a separate output
> argument; the tuple itself is left alone.  This is somewhat messy in the
> case of heap_multi_insert because it returns several items; I think it's
> acceptable to return an array of ItemPointers in the same order as the
> input tuples.  This works fine for the only caller, which is COPY in
> batch mode.  For the other routines, they don't really care where the
> TID is returned AFAICS.
>
>
> Additional noteworthy items:
>
> i) Speculative insertion: the speculative insertion token is no longer
> installed directly in the heap tuple by the executor (of course).
> Instead, the token becomes part of the slot.  When the tuple_insert
> method is called, the insertion routine is in charge of setting the
> token from the slot into the storage tuple.  Executor is in charge of
> calling method->speculative_finish() / abort() once the insertion has
> been confirmed by the indexes.
>
> ii) execTuples has additional accessors for tuples-in-slot, such as
> ExecFetchSlotTuple and friends.  I expect to have some of them to return
> abstract StorageTuples, others HeapTuple or MinimalTuples (possibly
> wrapped in Datum), depending on callers.  We might be able to cut down
> on these later; my first cut will try to avoid API changes to keep
> fallout to a minimum.
I'd suggest replacing all occurrences of HeapTuple with StorageTuple.
Do you see any problems with it?

> iii) All tuples need to be identifiable by ItemPointers.  Storages that
> have different requirements will need careful additional thought across
> the board.

For a start, we can simply deny secondary indexes for these storages
or require a function that converts tuple identifier inside the storage to
ItemPointer suitable for an index.

> iv) System catalogs cannot use pluggable storage.  We continue to use
> heap_open etc in the DDL code, in order not to make this more invasive
> than it already is.  We may lift this restriction later for specific
> catalogs, as needed.
+1
>
> v) Currently, one Buffer may be associated with one HeapTuple living in a
> slot; when the slot is cleared, the buffer pin is released.  My current
> patch moves the buffer pin to inside the heapam-based storage AM and the
> buffer is released by the ->slot_clear_tuple method.  The rationale for
> doing this is that some storage AMs might want to keep several buffers
> pinned at once, for example, and must not release those pins
> individually but in batches as the scan moves forwards (say a batch of
> tuples in a columnar storage AM has column values spread across many
> buffers; they must all be kept pinned until the scan has moved past the
> whole set of tuples).  But I'm not really sure that this is a great
> design.

Frankly, I doubt that it's realistic to implement columnar storage just as
a variant of pluggable storage.  It requires a lot of changes in the executor
and optimizer and so on, which are hardly compatible with the existing
tuple-oriented model.  However, I'm not so strong in this area, so if you
feel that it's possible, go ahead.

> I welcome comments on these ideas.  My patch for this is nowhere near
> completion yet; expect things to change for items that I've overlooked,
> but I hope I didn't overlook any major ones.  If things are handwavy, it is
> probably because I haven't fully figured them out yet.

Thank you again for beginning the big project.
Looking forward to the prototype. I think it will make the discussion
more concrete and useful.

-- 
Anastasia Lubennikova
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company




Re: Pluggable storage

From
Alvaro Herrera
Date:
Anastasia Lubennikova wrote:
> 13.08.2016 02:15, Alvaro Herrera:

> >To support this, we introduce StorageTuple and StorageScanDesc.
> >StorageTuples represent a physical tuple coming from some storage AM.
> >It is necessary to have a pointer to a StorageAmRoutine in order to
> >manipulate the tuple.  For heapam.c, a StorageTuple is just a HeapTuple.
> 
> StorageTuples concept looks really cool. I've got some questions on
> details of implementation.
> 
> Do StorageTuples have fields common to all implementations?
> Or StorageTuple is totally abstract structure that has nothing to do
> with data, except pointing to it?
> 
> I mean, now we already have HeapTupleData structure, which is a pretty
> good candidate to replace with StorageTuple.

I was planning to replace all uses of HeapTuple in the executor with
StorageTuple, actually.  But the main reason I would like to avoid
HeapTupleData itself is that it contains an assumption that there is a
single palloc chunk that contains the tuple (t_len and t_data).  This
might not be true in representations that split the tuple, for example
in columnar storage where you have one column in page A and another
column in page B, for the same tuple.  I suppose there might be some
point to keeping t_tableOid and t_self, though.

> And maybe add a "t_handler" field that points to the handler functions.
> I'm not sure whether it should be the name of the StorageAm, its OID, or maybe
> the handler function itself.  Although, if I'm not mistaken, we always have
> RelationData when we want to operate on the tuple, so having t_handler
> in the StorageTuple is excessive.

Yeah, I think the RelationData (or more precisely the StorageAmRoutine)
is going to be available always, so I don't think we need a pointer in
the tuple itself.

> This approach allows to minimize code changes and ensure that we
> won't miss any function that handles tuples.
> 
> Do you see any weak points of the suggestion?
> What design do you use in your prototype?

It's currently a "void *" pointer in my prototype.

> >RelationData gains ->rd_stamroutine which is a pointer to the
> >StorageAmRoutine for the relation in question.  Similarly,
> >TupleTableSlot is augmented with a link to the StorageAmRoutine to
> >handle the StorageTuple it contains (probably in most cases it's set at
> >the same time as the tupdesc).  This implies that routines such as
> >ExecAssignScanType need to pass down the StorageAmRoutine from the
> >relation to the slot.
> 
> If we already have this pointer in t_handler as described below,
> we don't need to pass it between functions and slots.

I think it's better to have it in slots, so you can install multiple
tuples in the slot without having to change the routine pointers each
time.

> >The executor is modified so that instead of calling heap_insert etc
> >directly, it uses rel->rd_stamroutine to call these methods.  The
> >executor is still in charge of dealing with indexes, constraints, and
> >any other thing that's not the tuple storage itself (this is one major
> >point in which this differs from FDWs).  This all looks simple enough,
> >with one exception and a few notes:
> 
> That is exactly what I tried to describe in my proposal.
> Chapter "Relation management". I'm sure, you've already noticed
> that it will require huge source code cleaning. I've carefully read
> the sources and found "violators" of abstraction in src/backend/commands.
> The list is attached to the wiki page
> https://wiki.postgresql.org/wiki/HeapamRefactoring.
> 
> Except these, there are some pretty strange and unrelated functions in
> src/backend/catalog.
> I'm willing to fix them, but I'd like to synchronize our efforts.

I very much would like to stay away from touching src/backend/catalog,
which are the functions that deal with system catalogs.  We can simply
say that system catalogs are hardcoded to use heapam.c storage for now.
If we later see a need to enable some particular catalog using a
different storage implementation, we can change the code for that
specific catalog in src/backend/catalog and everywhere else, to use the
abstract API instead of hardcoding heap_insert etc.  But that can be
left for a second pass.  (This is my point "iv" further below, to which
you said "+1").


> Nothing to do, just substitute t_data with a proper HeapTupleHeader
> representation.  I think it's a job for the StorageAm.  Let's say each StorageAm
> must have a stam_to_heaptuple() function and an opposite function,
> stam_from_heaptuple().

Hmm, yeah, that also works.  We'd have to check again whether it's more
convenient to start as a slot rather than a StorageTuple.  AFAICS the
trigger.c code is all starting from a slot, so it makes sense to have
the conversion use the slot code -- that way, there's no need for each
storageAM to re-implement conversion to HeapTuple.

> >note f) More widespread, MinimalTuples currently use a tweaked HeapTuple
> >format.  In the long run, it may be possible to replace them with a
> >separate storage module that's specifically designed to handle tuples
> >meant for tuplestores etc.  That may simplify TupleTableSlot and
> >execTuples.  For the moment we keep the tts_mintuple as it is.  Whenever
> >a tuple is not already in heap format, we heapify it in order to put it in
> >the store.
> I wonder, do we really need MinimalTuples to support all formats?

Sure.  I wouldn't want to say "you can create a table in columnar storage
format, but if you do, these tables cannot use hash join".

> >ii) execTuples has additional accessors for tuples-in-slot, such as
> >ExecFetchSlotTuple and friends.  I expect to have some of them to return
> >abstract StorageTuples, others HeapTuple or MinimalTuples (possibly
> >wrapped in Datum), depending on callers.  We might be able to cut down
> >on these later; my first cut will try to avoid API changes to keep
> >fallout to a minimum.
>
> I'd suggest replacing all occurrences of HeapTuple with StorageTuple.
> Do you see any problems with it?

The HeapTuple-in-datum representation, as I recall, is used in the SQL
function manager; maybe other places too.  Maybe there's a way to fix
that layer so that it uses StorageTuple instead, but I prefer not to
touch it in the first phase.  We can fix it later.  This is already a
big enough patch ...

> >iii) All tuples need to be identifiable by ItemPointers.  Storages that
> >have different requirements will need careful additional thought across
> >the board.
> 
> For a start, we can simply deny secondary indexes for these storages
> or require a function that converts tuple identifier inside the storage to
> ItemPointer suitable for an index.

Umm.  I don't think rejecting secondary indexes would work very well.  I
think we can lift this limitation later; we just need to change the
IndexTuple abstraction so that it doesn't rely on ItemPointer as
it currently does.

> >v) Currently, one Buffer may be associated with one HeapTuple living in a
> >slot; when the slot is cleared, the buffer pin is released.  My current
> >patch moves the buffer pin to inside the heapam-based storage AM and the
> >buffer is released by the ->slot_clear_tuple method.  The rationale for
> >doing this is that some storage AMs might want to keep several buffers
> >pinned at once, for example, and must not release those pins
> >individually but in batches as the scan moves forwards (say a batch of
> >tuples in a columnar storage AM has column values spread across many
> >buffers; they must all be kept pinned until the scan has moved past the
> >whole set of tuples).  But I'm not really sure that this is a great
> >design.
> 
> Frankly, I doubt that it's realistic to implement columnar storage just as
> a variant of pluggable storage.  It requires a lot of changes in the executor
> and optimizer and so on, which are hardly compatible with the existing
> tuple-oriented model.  However, I'm not so strong in this area, so if you
> feel that it's possible, go ahead.

Well, not *just* as a variant of pluggable storage.  This thread is just
one sub-project inside the greater project to enable column-oriented
storage; that includes further changes to executor, too, but I haven't
discussed those in this proposal.  I mentioned all this in Brussels'
developer meeting earlier this year.  (There I mostly talked about
vertical partitioning, which is a different subproject that I've put
aside for the moment, but really it's all part of the same thing.)
https://wiki.postgresql.org/wiki/Future_of_storage

Thanks for reading!

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable storage

From
Simon Riggs
Date:
On 16 August 2016 at 19:46, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
>> Thanks for taking a stab at this.  I'd like to throw out a few concerns.
>>
>> One, I'm worried that adding an additional layer of pointer-jumping is
>> going to slow things down and make Andres' work to speed up the
>> executor more difficult.  I don't know that there is a problem there,
>> and if there is a problem I don't know what to do about it, but I
>> think it's something we need to consider.
>
> I'm quite concerned about that as well.

This objection would apply to all other proposals as well, FDW etc.

Do you see some way to add flexibility yet without adding a branch
point in the code?

-- 
Simon Riggs                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable storage

From
Alexander Korotkov
Date:
On Thu, Aug 18, 2016 at 10:58 AM, Simon Riggs <simon@2ndquadrant.com> wrote:
On 16 August 2016 at 19:46, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
>> Thanks for taking a stab at this.  I'd like to throw out a few concerns.
>>
>> One, I'm worried that adding an additional layer of pointer-jumping is
>> going to slow things down and make Andres' work to speed up the
>> executor more difficult.  I don't know that there is a problem there,
>> and if there is a problem I don't know what to do about it, but I
>> think it's something we need to consider.
>
> I'm quite concerned about that as well.

This objection would apply to all other proposals as well, FDW etc.

Do you see some way to add flexibility yet without adding a branch
point in the code?

It's impossible without a branch point in the code.  The question is where this branch should be located.
In particular, we can put this branch point into the planner by defining distinct executor nodes for each pluggable storage.  In this case, each storage would have its own optimized executor nodes.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: Pluggable storage

From
Amit Kapila
Date:
On Wed, Aug 17, 2016 at 10:33 PM, Alvaro Herrera
<alvherre@2ndquadrant.com> wrote:
> Anastasia Lubennikova wrote:
>>
>> Except these, there are some pretty strange and unrelated functions in
>> src/backend/catalog.
>> I'm willing to fix them, but I'd like to synchronize our efforts.
>
> I very much would like to stay away from touching src/backend/catalog,
> which are the functions that deal with system catalogs.  We can simply
> say that system catalogs are hardcoded to use heapam.c storage for now.
>

Does this mean that if any storage needs to access system catalog
information, it needs to be aware of HeapTuple and other required
stuff like syscache?  Again, if it needs to update some stats or
something like that, it needs to be aware of the heap tuple format.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: Pluggable storage

From
Ants Aasma
Date:
On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund <andres@anarazel.de> wrote:
> On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
>> I am somewhat inclined to
>> believe that we need to restructure the executor in a bigger way so
>> that it passes around datums instead of tuples; I'm inclined to
>> believe that the current tuple-centric model is probably not optimal
>> even for the existing storage format.
>
> I actually prototyped that, and it's not an easy win so far. Column
> extraction cost, even after significant optimization, is still often a
> significant portion of the runtime. And e.g. having projection extract all
> columns only after evaluating a restrictive qual referring to an "early"
> column can be a significant win.  We'd definitely have to give up on
> extracting columns 0..n when accessing later columns... Hm.

What about going even further than [1] in converting the executor to
being opcode based and merging projection and qual evaluation into a
single pass?  The optimizer would then have some leeway about how to
order column extraction and qual evaluation.  It might even be worth it
to special-case some functions as separate opcodes (e.g. int4eq,
timestamp_lt).

Regards,
Ants Aasma

[1] https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c@alap3.anarazel.de



Re: Pluggable storage

From
Andres Freund
Date:

On August 18, 2016 7:44:50 AM PDT, Ants Aasma <ants.aasma@eesti.ee> wrote:
> On Tue, Aug 16, 2016 at 9:46 PM, Andres Freund <andres@anarazel.de> wrote:
>> On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
>>> I am somewhat inclined to believe that we need to restructure the
>>> executor in a bigger way so that it passes around datums instead of
>>> tuples; I'm inclined to believe that the current tuple-centric model
>>> is probably not optimal even for the existing storage format.
>>
>> I actually prototyped that, and it's not an easy win so far. Column
>> extraction cost, even after significant optimization, is still often a
>> significant portion of the runtime. And e.g. having projection extract
>> all columns only after evaluating a restrictive qual referring to an
>> "early" column can be a significant win.  We'd definitely have to give
>> up on extracting columns 0..n when accessing later columns... Hm.
>
> What about going even further than [1] in converting the executor to
> being opcode based and merging projection and qual evaluation into a
> single pass?  The optimizer would then have some leeway about how to
> order column extraction and qual evaluation.  It might even be worth it
> to special-case some functions as separate opcodes (e.g. int4eq,
> timestamp_lt).
>
> Regards,
> Ants Aasma
>
> [1] https://www.postgresql.org/message-id/20160714011850.bd5zhu35szle3n3c@alap3.anarazel.de

Good question.  I think I have a reasonable answer, but let's discuss that in the other thread.

Andres
-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Pluggable storage

From
Andres Freund
Date:
On 2016-08-18 08:58:11 +0100, Simon Riggs wrote:
> On 16 August 2016 at 19:46, Andres Freund <andres@anarazel.de> wrote:
> > On 2016-08-15 12:02:18 -0400, Robert Haas wrote:
> >> Thanks for taking a stab at this.  I'd like to throw out a few concerns.
> >>
> >> One, I'm worried that adding an additional layer of pointer-jumping is
> >> going to slow things down and make Andres' work to speed up the
> >> executor more difficult.  I don't know that there is a problem there,
> >> and if there is a problem I don't know what to do about it, but I
> >> think it's something we need to consider.
> >
> > I'm quite concerned about that as well.
> 
> This objection would apply to all other proposals as well, FDW etc.

Not really. The place you draw your boundary significantly influences
where and how much of a price you pay.  Having another indirection
inside HeapTuple - which is accessed in many, many places - is something
different from having a seqscan equivalent which returns you a batch of
already deformed tuples in array form.  In the latter case there's one
additional indirection per batch of tuples; in the former, there are many
for each tuple.
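
To illustrate the difference, a batched interface could look roughly like
this (the names here are entirely hypothetical):

typedef struct TupleBatch
{
    int         ntuples;    /* number of tuples in this batch */
    Datum     **values;     /* values[attno][tupno], already deformed */
    bool      **isnull;     /* isnull[attno][tupno] */
} TupleBatch;

/* one indirect call amortized over a whole batch ... */
typedef int (*scan_getbatch_fn) (StorageScanDesc scan, TupleBatch *batch);

/* ... versus one indirect call per tuple */
typedef StorageTuple (*scan_getnext_fn) (StorageScanDesc scan);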


> Do you see some way to add flexibility yet without adding a branch
> point in the code?

I'm not even saying that the approach of doing the indirection inside
the HeapTuple replacement is a no-go, just that it concerns me.  I do
think that working on only lugging around values/isnull arrays is
something that I could see working better, if some problems are
addressed beforehand.

Greetings,

Andres Freund



Re: Pluggable storage

From
Alvaro Herrera
Date:
Alvaro Herrera wrote:
> Many have expressed their interest in this topic, but I haven't seen any
> design of how it should work.  Here's my attempt; I've been playing with
> this for some time now and I think what I propose here is a good initial
> plan.

I regret to announce that I'll have to stay away from this topic for a
little while, as I have another project taking priority.  I expect to
return to this shortly thereafter, hopefully in time to get it done for
pg10.

If anyone is interested in helping with the (currently not compilable)
patch I have, please mail me privately and we can discuss.

-- 
Álvaro Herrera                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: Pluggable storage

From
Alvaro Herrera
Date:
I have sent the partial patch I have to Hari Babu Kommi.  We expect that
he will be able to further this goal some more.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I have sent the partial patch I have to Hari Babu Kommi.  We expect that
he will be able to further this goal some more.

Thanks Alvaro for sharing your development patch.

Most of the patch design is the same as described by Alvaro in the first mail [1].
I will detail the modifications, pending items and open items (which need discussion)
to implement proper pluggable storage.

Here I have attached WIP patches to support pluggable storage.  The patches in
the series may not work individually; many things are still under development.
These patches are just to share the approach of the current development.

Some notable changes that I did to make the patch work:

1. Added the storageam handler to the slot, because the relation is not
readily available in all the places where it is needed.
2. Retained the MinimalTuple in the slot, as this is used in hash join.
As in the first version, I feel it is fine to allow creating HeapTuple-format
data.

Thanks everyone for sharing their ideas at the developer unconference at
PGCon Ottawa.

Pending items:

1. Replacement of Tuple with Slot in trigger functionality.
2. Replacement of Tuple with Slot in storage handler functions.
3. Remove/minimize the use of HeapTuple as a Datum.
4. Replace all references to HeapScanDesc with StorageScanDesc.
5. Planner changes to consider the relation storage during planning.
6. Any planner changes based on the discussion of open items?
7. Some executor changes to take advantage of the storage?

Open Items:

1. The BitmapHeapScan and TableSampleScan are tightly coupled with
HeapTuple and HeapScanDesc, so these scans operate directly
on those structures and provide the result.

These scan types may not be applicable to different storage formats.
So how to handle them?

Currently my goal is to provide the basic infrastructure for pluggable storage as
a first step, and later to improve performance by
taking advantage of the storage.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Open Items:
>
> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
> HeapTuple and HeapScanDesc, So these scans are directly operating
> on those structures and providing the result.
>
> These scan types may not be applicable to different storage formats.
> So how to handle them?

I think that BitmapHeapScan, at least, is applicable to any table AM
that has TIDs.   It seems to me that in general we can imagine three
kinds of table AMs:

1. Table AMs where a tuple can be efficiently located by a real TID.
By a real TID, I mean that the block number part is really a block
number and the item ID is really a location within the block.  These
are necessarily quite similar to our current heap, but they can change
the tuple format and page format to some degree, and it seems like in
many cases it should be possible to plug them into our existing index
AMs without too much heartache.  Both index scans and bitmap index
scans ought to work.

2. Table AMs where a tuple has some other kind of locator.  For
example, imagine an index-organized table where the locator is the
primary key, which is a bit like what Alvaro had in mind for indirect
indexes.  If the locator is 6 bytes or less, it could potentially be
jammed into a TID, but I don't think that's a great idea.  For things
like int8 or numeric, it won't work at all.  Even for other things,
it's going to cause problems because the bit patterns won't be what
the code is expecting; e.g. bitmap scans care about the structure of
the TID, not just how many bits it is.  (Due credit: Somebody, maybe
Alvaro, pointed out this problem before, at PGCon.)  For these kinds
of tables, larger modifications to the index AMs are likely to be
necessary, at least if we want a really general solution, or maybe we
should have separate index AMs - e.g. btree for traditional TID-based
heaps, and generic_btree or indirect_btree or key_btree or whatever
for heaps with some other kind of locator.  It's not too hard to see
how to make index scans work with this sort of structure but it's very
unclear to me whether, or how, bitmap scans can be made to work.

3. Table AMs where a tuple doesn't really have a locator at all.  In
these cases, we can't support any sort of index AM at all.  When the
table is queried, there's really nothing the core system can do except
ask the table AM for a full scan, supply the quals, and hope the table
AM has some sort of smarts that enable it to optimize somehow.  For
example, you can imagine converting cstore_fdw into a table AM of this
sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole
chunks to be proven uninteresting and skipped.  (You could use chunk
number + offset to turn this into a table AM of the previous type if
you wanted to support secondary indexes; not sure if that'd be useful,
but it'd certainly be harder.)

I'm more interested in #1 than in #3, and more interested in #3 than
#2, but other people may have different priorities.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Michael Paquier
Date:
On Thu, Jun 22, 2017 at 4:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> I think that BitmapHeapScan, at least, is applicable to any table AM
> that has TIDs.   It seems to me that in general we can imagine three
> kinds of table AMs:
>
> 1. Table AMs where a tuple can be efficiently located by a real TID.
> By a real TID, I mean that the block number part is really a block
> number and the item ID is really a location within the block.  These
> are necessarily quite similar to our current heap, but they can change
> the tuple format and page format to some degree, and it seems like in
> many cases it should be possible to plug them into our existing index
> AMs without too much heartache.  Both index scans and bitmap index
> scans ought to work.
>
> 2. Table AMs where a tuple has some other kind of locator.  For
> example, imagine an index-organized table where the locator is the
> primary key, which is a bit like what Alvaro had in mind for indirect
> indexes.  If the locator is 6 bytes or less, it could potentially be
> jammed into a TID, but I don't think that's a great idea.  For things
> like int8 or numeric, it won't work at all.  Even for other things,
> it's going to cause problems because the bit patterns won't be what
> the code is expecting; e.g. bitmap scans care about the structure of
> the TID, not just how many bits it is.  (Due credit: Somebody, maybe
> Alvaro, pointed out this problem before, at PGCon.)  For these kinds
> of tables, larger modifications to the index AMs are likely to be
> necessary, at least if we want a really general solution, or maybe we
> should have separate index AMs - e.g. btree for traditional TID-based
> heaps, and generic_btree or indirect_btree or key_btree or whatever
> for heaps with some other kind of locator.  It's not too hard to see
> how to make index scans work with this sort of structure but it's very
> unclear to me whether, or how, bitmap scans can be made to work.
>
> 3. Table AMs where a tuple doesn't really have a locator at all.  In
> these cases, we can't support any sort of index AM at all.  When the
> table is queried, there's really nothing the core system can do except
> ask the table AM for a full scan, supply the quals, and hope the table
> AM has some sort of smarts that enable it to optimize somehow.  For
> example, you can imagine converting cstore_fdw into a table AM of this
> sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole
> chunks to be proven uninteresting and skipped.  (You could use chunk
> number + offset to turn this into a table AM of the previous type if
> you wanted to support secondary indexes; not sure if that'd be useful,
> but it'd certainly be harder.)
>
> I'm more interested in #1 than in #3, and more interested in #3 than
> #2, but other people may have different priorities.

Putting that in a couple of words.
1. Table AM with a 6-byte TID.
2. Table AM with a custom locator format, which could be TID-like.
3. Table AM with no locators.

Getting #1 to work first would already be really
useful for users. My take on the matter is that being able to plug
in-core index AMs directly into a #1-style table AM is more useful in
the long term, as multiple table AMs can use the same kind of index AM
if it is designed nicely enough, so the index AM logic basically does
not need to be duplicated across multiple table AMs. #3 implies that
the index AM logic is implemented inside the table AM. Not saying that
it is not useful, but it does not feel natural to have the planner
request a sequential scan, only to have the table AM secretly do some
kind of index/skipping scan.
-- 
Michael



Re: [HACKERS] Pluggable storage

From
Amit Langote
Date:
On 2017/06/22 10:01, Michael Paquier wrote:
> #3 implies that the index AM logic is implemented inside the table
> AM. Not saying that it is not useful, but it does not feel natural to
> have the planner request a sequential scan, only to have the table
> AM secretly do some kind of index/skipping scan.

I read a relevant comment on a pluggable storage thread a while back
[1].  In short, the comment was that the planner should be able to get
some intelligence, via some API, from the heap storage implementation
about the latter's access cost characteristics.  The storage should
provide accurate-enough cost information to the planner when such a
request is made by, say, cost_seqscan(), so that the planner can make
an appropriate choice.  If two tables containing the same number of rows
(and the same size in bytes, perhaps) use different storage
implementations, then, with the planner's cost parameters remaining the
same, cost_seqscan() will end up calculating different costs for the two
tables.  Perhaps SeqScan would be chosen for one table but not the other
based on that.
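
To make that concrete, such an API could be a cost-reporting callback in
the StorageAmRoutine, along the lines of the sketch below.  This is a
purely hypothetical sketch; the callback name and signature are invented
and appear in no posted patch:

    /*
     * Hypothetical sketch: let the storage AM report its own scan cost
     * characteristics instead of cost_seqscan() hard-coding heapam
     * behavior.  All names here are invented for illustration.
     */
    typedef struct StorageAmRoutine
    {
        /* ... scan, tuple and slot callbacks as described upthread ... */

        /*
         * Fill in the startup and total cost of a full scan of "rel",
         * taking the AM's own I/O and per-tuple CPU behavior into
         * account; cost_seqscan() would consult this.
         */
        void        (*estimate_scan_cost) (Relation rel,
                                           Cost *startup_cost,
                                           Cost *total_cost);
    } StorageAmRoutine;

With something like that, two equally sized tables could legitimately
come out with different sequential scan costs, as described above.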

Thanks,
Amit

[1]
https://www.postgresql.org/message-id/CA%2BTgmoY3LXVUPQVdZW70XKp5PsXffO82pXXt%3DbeegcV%2B%3DRsQgg%40mail.gmail.com




Re: [HACKERS] Pluggable storage

From
Michael Paquier
Date:
On Thu, Jun 22, 2017 at 11:12 AM, Amit Langote
<Langote_Amit_f8@lab.ntt.co.jp> wrote:
> On 2017/06/22 10:01, Michael Paquier wrote:
>> #3 implies that the index AM logic is implemented inside the table
>> AM. Not saying that it is not useful, but it does not feel natural to
>> have the planner request a sequential scan, only to have the table
>> AM secretly do some kind of index/skipping scan.
>
> I read a relevant comment on a pluggable storage thread a while back
> [1].  In short, the comment was that the planner should be able to get
> some intelligence, via some API, from the heap storage implementation
> about the latter's access cost characteristics.  The storage should
> provide accurate-enough cost information to the planner when such a
> request is made by, say, cost_seqscan(), so that the planner can make
> an appropriate choice.  If two tables containing the same number of rows
> (and the same size in bytes, perhaps) use different storage
> implementations, then, with the planner's cost parameters remaining the
> same, cost_seqscan() will end up calculating different costs for the two
> tables.  Perhaps SeqScan would be chosen for one table but not the other
> based on that.

Yeah, I agree that the costing part needs some clear attention and
thought, and the gains are absolutely huge with the correct
interface. That could be done in a later step, though.
-- 
Michael



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
I have sent the partial patch I have to Hari Babu Kommi.  We expect that
he will be able to further this goal some more.

Thanks Alvaro for sharing your development patch.

Most of the patch design is the same as described by Alvaro in the first mail [1].
I will detail the modifications, pending items and open items (needing discussion)
to implement proper pluggable storage.

Here I have attached WIP patches to support pluggable storage. The patches in the
series may not work individually; many things are still under development.
These patches are just to share the approach of the current development.

Some notable changes that I did to make the patch work:

1. Added a storageam handler to the slot, because the relation is not
readily available in all the places where it is needed.
2. Retained the minimal tuple in the slot, as this is used in hash join.
As in the first version, I feel it is fine to allow creating HeapTuple
format data.

Thanks everyone for sharing their ideas in the developer's unconference at
PGCon Ottawa.

Pending items:

1. Replacement of Tuple with slot in Trigger functionality
2. Replacement of Tuple with Slot from storage handler functions.
3. Remove/minimize the use of HeapTuple as a Datum.
4. Replace all references of HeapScanDesc with StorageScanDesc
5. Planner changes to consider the relation storage during the planning.
6. Any planner changes based on the discussion of open items?
7. some Executor changes to consider the storage advantages?

Open Items:

1. The BitmapHeapScan and TableSampleScan are tightly coupled with
HeapTuple and HeapScanDesc, so these scans operate directly
on those structures and provide the result.

What about vacuum?  I see vacuum is untouched in the patchset and it is not mentioned in this discussion.
Do you plan to override low-level functions like heap_page_prune(), lazy_vacuum_page(), etc., but preserve the high-level logic of vacuum?
Or do you plan to let pluggable storage implement its own high-level vacuum algorithm?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier <michael.paquier@gmail.com> wrote:
Putting that in a couple of words.
1. Table AM with a 6-byte TID.
2. Table AM with a custom locator format, which could be TID-like.
3. Table AM with no locators.

Getting #1 to work first would already be really
useful for users.

What exactly would be useful for *users*?  Any kind of API itself is completely useless for users, because they are users, not developers.  A storage API could be useful for developers to implement storage AMs, which in turn could be useful for users.  Then while saying that #1 is useful for users, it would be nice to keep in mind particular storage AMs which can be implemented using #1.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> Putting that in a couple of words.
>> 1. Table AM with a 6-byte TID.
>> 2. Table AM with a custom locator format, which could be TID-like.
>> 3. Table AM with no locators.
>>
>> Getting #1 to work first would already be really
>> useful for users.
>
> What exactly would be useful for *users*?  Any kind of API itself is
> completely useless for users, because they are users, not developers.
> A storage API could be useful for developers to implement storage AMs,
> which in turn could be useful for users.

What's your point?  I assume that is what Michael meant.

> Then while saying that #1 is useful for
> users, it would be nice to keep in mind particular storage AMs which can be
> implemented using #1.

I don't think anybody's arguing with that.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Open Items:
>
> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
> HeapTuple and HeapScanDesc, so these scans operate directly
> on those structures and provide the result.
>
> These scan types may not be applicable to different storage formats.
> So how to handle them?

I think that BitmapHeapScan, at least, is applicable to any table AM
that has TIDs.   It seems to me that in general we can imagine three
kinds of table AMs:

1. Table AMs where a tuple can be efficiently located by a real TID.
By a real TID, I mean that the block number part is really a block
number and the item ID is really a location within the block.  These
are necessarily quite similar to our current heap, but they can change
the tuple format and page format to some degree, and it seems like in
many cases it should be possible to plug them into our existing index
AMs without too much heartache.  Both index scans and bitmap index
scans ought to work.

If #1 is only about changing tuple and page formats, then could it be much simpler than the patch upthread?  We can implement "page format access methods" with routines for insertion, update, pruning and deletion of tuples *in a particular page*.  There is no need to redefine the high-level logic for scanning the heap, doing updates and so on...
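
A minimal sketch of what such a page-level API could look like; all the
names here are hypothetical, for illustration only, and none of this is
in the patch upthread:

    /*
     * Hypothetical "page format access methods": only the layout of
     * tuples within a particular page is pluggable; scans, updates and
     * vacuum keep their current high-level logic in core.
     */
    typedef struct PageFormatRoutine
    {
        /* place a tuple on the page, returning its offset number,
         * or InvalidOffsetNumber if it does not fit */
        OffsetNumber (*pageam_insert) (Page page, void *tuple, Size len);

        /* replace an existing tuple in place, if the format permits */
        bool        (*pageam_update) (Page page, OffsetNumber offnum,
                                      void *newtuple, Size newlen);

        /* reclaim space held by dead tuples on this page */
        void        (*pageam_prune) (Page page);

        /* remove one tuple from the page */
        void        (*pageam_delete) (Page page, OffsetNumber offnum);
    } PageFormatRoutine;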

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier <michael.paquier@gmail.com>
> wrote:
>> Putting that in a couple of words.
>> 1. Table AM with a 6-byte TID.
>> 2. Table AM with a custom locator format, which could be TID-like.
>> 3. Table AM with no locators.
>>
>> Getting #1 to work first would already be really
>> useful for users.
>
> What exactly would be useful for *users*?  Any kind of API itself is
> completely useless for users, because they are users, not developers.
> A storage API could be useful for developers to implement storage AMs,
> which in turn could be useful for users.

What's your point?  I assume that is what Michael meant.

TBH, I don't understand what particular enhancements we expect from having #1.
This is why it's hard for me to say if #1 is a good idea.  It's also hard for me to say if the patch upthread is the right way of implementing #1.
But I have a gut feeling that even if #1 is a good idea in itself, it's definitely not what users expect from "pluggable storages".
From the user side, it would be normal to expect that a "pluggable storage" could be an index-organized table where the index could be, say, an LSM tree...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Tomas Vondra
Date:
Hi,

On 6/21/17 9:47 PM, Robert Haas wrote:
...>
> like int8 or numeric, it won't work at all.  Even for other things,
> it's going to cause problems because the bit patterns won't be what
> the code is expecting; e.g. bitmap scans care about the structure of
> the TID, not just how many bits it is.  (Due credit: Somebody, maybe
> Alvaro, pointed out this problem before, at PGCon.) 

Can you elaborate a bit more about these TID bit pattern issues? I do
remember that not all TIDs are valid due to safeguards on individual
fields, like for example
    Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits))

But perhaps there are some other issues?
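
For reference, the structure in question, abridged from storage/block.h
and storage/itemptr.h:

    typedef uint16 OffsetNumber;

    typedef struct BlockIdData
    {
        uint16      bi_hi;          /* high bits of the block number */
        uint16      bi_lo;          /* low bits of the block number */
    } BlockIdData;

    /*
     * A TID is 48 bits in total, but consumers may assume that
     * ip_posid never exceeds MaxHeapTuplesPerPage (around 291 with the
     * default 8kB block size), so most of the 16-bit offset space is
     * never a valid bit pattern.
     */
    typedef struct ItemPointerData
    {
        BlockIdData ip_blkid;       /* physical block number */
        OffsetNumber ip_posid;      /* line-pointer index within block */
    } ItemPointerData;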

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Pluggable storage

From
Tomas Vondra
Date:
Hi,

On 6/22/17 4:36 PM, Alexander Korotkov wrote:
> On Thu, Jun 22, 2017 at 5:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> 
>     On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov
>     <a.korotkov@postgrespro.ru> wrote:
>     > On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier
>     > <michael.paquier@gmail.com> wrote:
>     >> Putting that in a couple of words.
>     >> 1. Table AM with a 6-byte TID.
>     >> 2. Table AM with a custom locator format, which could be TID-like.
>     >> 3. Table AM with no locators.
>     >>
>     >> Getting #1 to work first would already be really
>     >> useful for users.
>     >
>     > What exactly would be useful for *users*?  Any kind of API itself is
>     > completely useless for users, because they are users, not developers.
>     > A storage API could be useful for developers to implement storage AMs,
>     > which in turn could be useful for users.
> 
>     What's your point?  I assume that is what Michael meant.
> 
> 
> TBH, I don't understand what particular enhancements we expect from
> having #1.

I'd say that's one of the things we're trying to figure out in this 
thread. Does it make sense to go with #1 in v1 of the patch, or do we 
have to implement #2 or #3 to get some real benefit for the users?
>
> This is why it's hard for me to say if #1 is a good idea.  It's also hard
> for me to say if the patch upthread is the right way of implementing #1.
> But I have a gut feeling that even if #1 is a good idea in itself, it's
> definitely not what users expect from "pluggable storages".

The question is *why* you think that. What features do you expect
from a pluggable storage API that would be impossible to implement with
option #1?

I'm not trying to be annoying or anything like that - I don't know the 
answer and discussing those things is exactly why this thread exists.

I do agree #1 has limitations, and that it'd be great to get an API that
supports all kinds of pluggable storage implementations. But I guess
that'll take some time.


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Pluggable storage

From
Michael Paquier
Date:
On Thu, Jun 22, 2017 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Jun 22, 2017 at 8:32 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> On Thu, Jun 22, 2017 at 4:01 AM, Michael Paquier <michael.paquier@gmail.com>
>> wrote:
>>> Putting that in a couple of words.
>>> 1. Table AM with a 6-byte TID.
>>> 2. Table AM with a custom locator format, which could be TID-like.
>>> 3. Table AM with no locators.
>>>
>>> Getting #1 to work first would already be really
>>> useful for users.
>>
>> What exactly would be useful for *users*?  Any kind of API itself is
>> completely useless for users, because they are users, not developers.
>> A storage API could be useful for developers to implement storage AMs,
>> which in turn could be useful for users.
>
> What's your point?  I assume that is what Michael meant.

Sorry for the confusion. I meant here that users are the ones
developing modules.
-- 
Michael



Re: [HACKERS] Pluggable storage

From
Teodor Sigaev
Date:
> 1. Table AM with a 6-byte TID.
> 2. Table AM with a custom locator format, which could be TID-like.
> 3. Table AM with no locators.

Currently TID has its own type in the system catalog. It seems possible for
a storage to declare the type of TID it uses. Then an AM could declare it
too, so the core, based on that information, could resolve the question of
AM-storage compatibility. A storage could also declare that it has no TID
type at all, so it couldn't be used with any access method; use case: a
compressed column-oriented storage.
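
A hypothetical sketch of such a declaration (nothing like this exists
today; the field names are invented):

    /*
     * Hypothetical: the storage declares the type of the locator it
     * produces, and the index AM declares the type it consumes; core
     * matches the two, e.g. at CREATE INDEX time.  InvalidOid on the
     * storage side means "no locator at all", making the table
     * unindexable -- the compressed column-oriented storage case.
     */
    typedef struct StorageAmRoutine
    {
        /* ... */
        Oid         tid_typeid;     /* e.g. TIDOID, or InvalidOid */
    } StorageAmRoutine;

    typedef struct IndexAmRoutine
    {
        /* ... */
        Oid         am_tid_typeid;  /* locator type this index AM accepts */
    } IndexAmRoutine;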

As I remember, only GIN depends on the TID format; other indexes use it as
an opaque type. Except, at least, btree and GiST - they believe that
internal pointers are the same as the outer ones (to heap).

Another dubious part - BitmapScan.

-- 
Teodor Sigaev                                   E-mail: teodor@sigaev.ru
  WWW: http://www.sigaev.ru/
 



Re: [HACKERS] Pluggable storage

From
Tomas Vondra
Date:
Hi,

On 6/23/17 10:38 AM, Teodor Sigaev wrote:
>> 1. Table AM with a 6-byte TID.
>> 2. Table AM with a custom locator format, which could be TID-like.
>> 3. Table AM with no locators.
> 
> Currently TID has its own type in the system catalog. It seems possible
> for a storage to declare the type of TID it uses. Then an AM could declare
> it too, so the core, based on that information, could resolve the question
> of AM-storage compatibility. A storage could also declare that it has no
> TID type at all, so it couldn't be used with any access method; use case:
> a compressed column-oriented storage.
> 

Isn't the fact that TID is an existing type defined in the system catalogs
a fairly insignificant detail? I mean, we could just as easily define
a new 64-bit locator data type and use that instead, for example.

The main issue here is that we assume things about the TID contents, 
i.e. that it contains page/offset etc. And Bitmap nodes rely on that, to 
some extent - e.g. when prefetching data.
>
> As I remember, only GIN depends on the TID format; other indexes use it as
> an opaque type. Except, at least, btree and GiST - they believe that
> internal pointers are the same as the outer ones (to heap).
> 

I think you're probably right - GIN does compress the posting lists by
exploiting the TID redundancy (its page/offset structure), and I
don't think there are other index AMs doing that.

But I'm not sure we can simply rely on that - it's possible people will 
try to improve other index types (e.g. by adding similar compression to 
other index types). Moreover we now support extensions defining custom 
index AMs, and we have no clue what people may do in those.

So this would clearly require some sort of flag for each index AM.
>
> Another dubious part - BitmapScan.
> 

It would be really great if you could explain why BitmapScans are 
dubious, instead of just labeling them as dubious. (I guess you mean 
Bitmap Heap Scans, right?)

I see no conceptual issue with bitmap scans on arbitrary locator types, 
as long as there's sufficient locality information encoded in the value. 
What I mean by that is that for any two locator values A and B:
   (1) if (A < B) then (A is stored before B)
   (2) if (A is close to B) then (A is stored close to B)

Without these properties it's probably futile to try to do bitmap scans,
because the bitmap would not result in a mostly sequential access pattern
and things like prefetching would not be very efficient, I think.
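
As a toy illustration, a 64-bit locator built as chunk-number-plus-offset
would have both properties by construction (purely hypothetical, not from
any patch):

    #include <stdint.h>

    /*
     * Toy locator: high 48 bits = chunk number, low 16 bits = offset
     * within the chunk.  Plain integer comparison then gives property
     * (1), and values that are numerically close live in the same or
     * adjacent chunks, giving property (2) -- so sorting a bitmap's
     * entries still yields a mostly sequential access pattern.
     */
    typedef uint64_t TupleLocator;

    static inline TupleLocator
    make_locator(uint64_t chunkno, uint16_t offset)
    {
        return (chunkno << 16) | offset;
    }

    static inline uint64_t
    locator_chunk(TupleLocator loc)
    {
        return loc >> 16;
    }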


regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Pluggable storage

From
Tom Lane
Date:
Tomas Vondra <tomas.vondra@2ndquadrant.com> writes:
> It would be really great if you could explain why BitmapScans are 
> dubious, instead of just labeling them as dubious. (I guess you mean 
> Bitmap Heap Scans, right?)

The two things I'm aware of are (a) the assumption you noted, that
fetching tuples in TID sort order is a reasonably efficient thing,
and (b) the "offset" part of a TID can't exceed MaxHeapTuplesPerPage
--- see data structure in tidbitmap.c.  The latter issue means that
you don't really have a full six bytes to play with in a TID, only
about five.
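
The structure in question, abridged from src/backend/nodes/tidbitmap.c
(roughly; some fields omitted):

    /*
     * The per-page bit array is sized by MaxHeapTuplesPerPage, so a
     * TID whose offset exceeds that simply has no bit to set.
     */
    #define WORDS_PER_PAGE  ((MaxHeapTuplesPerPage - 1) / BITS_PER_BITMAPWORD + 1)

    typedef struct PagetableEntry
    {
        BlockNumber blockno;        /* page number (hashtable key) */
        bool        ischunk;        /* T = lossy storage, F = exact */
        bool        recheck;        /* should the tuples be rechecked? */
        bitmapword  words[Max(WORDS_PER_PAGE, WORDS_PER_CHUNK)];
    } PagetableEntry;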

I don't think (b) would be terribly hard to fix if we had a motivation to,
but I wonder whether there aren't other places that also know this about
TIDs.
        regards, tom lane



Re: [HACKERS] Pluggable storage

From
Julien Rouhaud
Date:
On 23/06/2017 16:07, Tomas Vondra wrote:
> 
> I think you're probably right - GIN does compress the posting lists by
> exploiting the TID redundancy (its page/offset structure), and I
> don't think there are other index AMs doing that.
> 
> But I'm not sure we can simply rely on that - it's possible people will
> try to improve other index types (e.g. by adding similar compression to
> other index types).
That reminds me
https://www.postgresql.org/message-id/55E4051B.7020209@postgrespro.ru
where Anastasia proposed something similar.

-- 
Julien Rouhaud
http://dalibo.com - http://dalibo.org



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
Hackers,

I see that the design question for PostgreSQL pluggable storages is very hard.  BTW, I think it is worth analyzing existing use cases of pluggable storages.  I think that the most famous DBMS with a pluggable storage API is MySQL.  This is why I decided to start with it.  I've added a MySQL/MariaDB section to the wiki page.
https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB
It appears that a significant part of MySQL storage engines are misuses.  MySQL lacks various features like FDWs, writable views, and so on.  This is why developers created a lot of pluggable storages for those purposes.  We definitely don't want something like this in PostgreSQL now.  I created a special summary column where I noted whether it would be nice to have something like each table engine in PostgreSQL.

Any comments and additions are welcome.  I'm planning to write a similar table for MongoDB.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Tomas Vondra
Date:
Hi,

On 06/26/2017 05:18 PM, Alexander Korotkov wrote:
> Hackers,
> 
> I see that the design question for PostgreSQL pluggable storages is very
> hard.

IMHO it's mostly expected to be hard.

Firstly, PostgreSQL is a mature product with many advanced features, and 
reworking a low-level feature without breaking something on top of it is 
hard by definition.

Secondly, project policies and code quality requirements set the bar 
very high too, I think.
> BTW, I think it is worth analyzing existing use cases of pluggable
> storages.  I think that the most famous DBMS with a pluggable storage API
> is MySQL. This is why I decided to start with it. I've added a
> MySQL/MariaDB section to the wiki page.
> https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB
> It appears that a significant part of MySQL storage engines are misuses.
> MySQL lacks various features like FDWs, writable views, and so on.
> This is why developers created a lot of pluggable storages for those
> purposes.  We definitely don't want something like this in PostgreSQL
> now.  I created a special summary column where I noted whether it
> would be nice to have something like each table engine in PostgreSQL.
> 

I don't want to discourage you, but I'm not sure how valuable this is.

I agree it's valuable to have an overview of use cases for pluggable 
storage, but I don't think we'll get that from looking at MySQL. As you 
noticed, most of the storage engines are misuses, so it's difficult to 
learn anything valuable from them. You can argue that using FDWs to 
implement alternative storage engines is a misuse too, but at least that 
gives us a valid use case (columnar storage implemented using FDW).

If anything, the MySQL storage engines should serve as a cautionary tale 
about how not to do things - there are also plenty of references in the 
MySQL "Restrictions and Limitations" section of the manual:
  https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

regards

-- 
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Mon, Jun 26, 2017 at 10:55 PM, Tomas Vondra <tomas.vondra@2ndquadrant.com> wrote:
On 06/26/2017 05:18 PM, Alexander Korotkov wrote:
I see that the design question for PostgreSQL pluggable storages is very hard.

IMHO it's mostly expected to be hard.

Firstly, PostgreSQL is a mature product with many advanced features, and reworking a low-level feature without breaking something on top of it is hard by definition.

Secondly, project policies and code quality requirements set the bar very high too, I think.
 
Sure.

> BTW, I think it is worth analyzing existing use cases of pluggable
storages.  I think that the most famous DBMS with a pluggable storage API
is MySQL. This is why I decided to start with it. I've added a
MySQL/MariaDB section to the wiki page.
https://wiki.postgresql.org/wiki/Future_of_storage#MySQL.2FMariaDB
It appears that a significant part of MySQL storage engines are misuses.  MySQL lacks various features like FDWs, writable views, and so on.  This is why developers created a lot of pluggable storages for those purposes.  We definitely don't want something like this in PostgreSQL now.  I created a special summary column where I noted whether it
would be nice to have something like each table engine in PostgreSQL.

I don't want to discourage you, but I'm not sure how valuable this is.

I agree it's valuable to have an overview of use cases for pluggable storage, but I don't think we'll get that from looking at MySQL. As you noticed, most of the storage engines are misuses, so it's difficult to learn anything valuable from them. You can argue that using FDWs to implement alternative storage engines is a misuse too, but at least that gives us a valid use case (columnar storage implemented using FDW).

If anything, the MySQL storage engines should serve as a cautionary tale about how not to do things - there are also plenty of references in the MySQL "Restrictions and Limitations" section of the manual:

  https://downloads.mysql.com/docs/mysql-reslimits-excerpt-5.7-en.pdf

Just to clarify: I don't propose any adoption of the MySQL pluggable storage API in PostgreSQL, or anything like that.  I just wrote this table for completeness of vision.  Maybe somebody will make some valuable notes out of it, maybe not.  "Yes" in the third column means only that there are positive user-visible effects which would be *nice to have* in PostgreSQL.

Also, I remember there was a table comparing different designs of pluggable storages and their use cases at the PGCon 2017 unconference.  Could somebody reproduce it?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> > Open Items:
>> >
>> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> > HeapTuple and HeapScanDesc, so these scans operate directly
>> > on those structures and provide the result.
>> >
>> > These scan types may not be applicable to different storage formats.
>> > So how to handle them?
>>
>> I think that BitmapHeapScan, at least, is applicable to any table AM
>> that has TIDs.   It seems to me that in general we can imagine three
>> kinds of table AMs:
>>
>> 1. Table AMs where a tuple can be efficiently located by a real TID.
>> By a real TID, I mean that the block number part is really a block
>> number and the item ID is really a location within the block.  These
>> are necessarily quite similar to our current heap, but they can change
>> the tuple format and page format to some degree, and it seems like in
>> many cases it should be possible to plug them into our existing index
>> AMs without too much heartache.  Both index scans and bitmap index
>> scans ought to work.
>
>
> If #1 is only about changing tuple and page formats, then could it be much
> simpler than the patch upthread?  We can implement "page format access
> methods" with routines for insertion, update, pruning and deletion of tuples
> *in a particular page*.  There is no need to redefine the high-level logic
> for scanning the heap, doing updates and so on...
>

If you are changing the tuple format then you do need to worry about
every place where we are using HeapTuple, like TupleTableSlots,
visibility routines, etc. (just think: if somebody changes the tuple
header, then all such places are susceptible to change).  Similarly,
if the page format is changed, you need to consider all the page scan
APIs like heapgettup_pagemode/heapgetpage.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
>> wrote:
>>>
>>> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
>>> he will be able to further this goal some more.
>>
>>
>> Thanks Alvaro for sharing your development patch.
>>
>> Most of the patch design is the same as described by Alvaro in the first
>> mail [1].
>> I will detail the modifications, pending items and open items (needing
>> discussion)
>> to implement proper pluggable storage.
>>
>> Here I have attached WIP patches to support pluggable storage. The patches
>> in the series may not work individually; many things are still under
>> development.
>> These patches are just to share the approach of the current development.
>>
>> Some notable changes that I did to make the patch work:
>>
>> 1. Added a storageam handler to the slot, because the relation is not
>> readily available in all the places where it is needed.
>> 2. Retained the minimal tuple in the slot, as this is used in hash join.
>> As in the first version, I feel it is fine to allow creating HeapTuple
>> format data.
>>
>> Thanks everyone for sharing their ideas in the developer's unconference at
>> PGCon Ottawa.
>>
>> Pending items:
>>
>> 1. Replacement of Tuple with slot in Trigger functionality
>> 2. Replacement of Tuple with Slot from storage handler functions.
>> 3. Remove/minimize the use of HeapTuple as a Datum.
>> 4. Replace all references of HeapScanDesc with StorageScanDesc
>> 5. Planner changes to consider the relation storage during the planning.
>> 6. Any planner changes based on the discussion of open items?
>> 7. some Executor changes to consider the storage advantages?
>>
>> Open Items:
>>
>> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> HeapTuple and HeapScanDesc, so these scans operate directly
>> on those structures and provide the result.
>
>
> What about vacuum?  I see vacuum is untouched in the patchset and it is not
> mentioned in this discussion.
> Do you plan to override low-level functions like heap_page_prune(),
> lazy_vacuum_page(), etc., but preserve the high-level logic of vacuum?
> Or do you plan to let pluggable storage implement its own high-level vacuum
> algorithm?
>

Probably some other algorithm for vacuum.  I am not sure the current
vacuum with its parameters can be used so easily.  One thing that
might need some thought is whether it is sufficient to tell a developer
implementing some different strategy for vacuum to keep autovacuum off
and call some different API in the places where vacuum can be invoked
manually (like the VACUUM command), or whether we need something more
as well.



-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> > Open Items:
>> >
>> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> > HeapTuple and HeapScanDesc, so these scans operate directly
>> > on those structures and provide the result.
>> >
>> > These scan types may not be applicable to different storage formats.
>> > So how to handle them?
>>
>> I think that BitmapHeapScan, at least, is applicable to any table AM
>> that has TIDs.   It seems to me that in general we can imagine three
>> kinds of table AMs:
>>
>> 1. Table AMs where a tuple can be efficiently located by a real TID.
>> By a real TID, I mean that the block number part is really a block
>> number and the item ID is really a location within the block.  These
>> are necessarily quite similar to our current heap, but they can change
>> the tuple format and page format to some degree, and it seems like in
>> many cases it should be possible to plug them into our existing index
>> AMs without too much heartache.  Both index scans and bitmap index
>> scans ought to work.
>
>
> If #1 is only about changing tuple and page formats, then could it be much
> simpler than the patch upthread?  We can implement "page format access
> methods" with routines for insertion, update, pruning and deletion of tuples
> *in a particular page*.  There is no need to redefine the high-level logic
> for scanning the heap, doing updates and so on...

If you are changing the tuple format then you do need to worry about
every place where we are using HeapTuple, like TupleTableSlots,
visibility routines, etc. (just think: if somebody changes the tuple
header, then all such places are susceptible to change).

Agreed.  I think that we can consider a pluggable tuple format as an independent feature which is desirable to have before pluggable storages.  For example, I believe some FDWs could already benefit from a pluggable tuple format.
 
Similarly,
if the page format is changed, you need to consider all the page scan
APIs like heapgettup_pagemode/heapgetpage.

If the page format is changed, then heapgettup_pagemode/heapgetpage should use appropriate API functions for manipulating page items.  I'm very afraid of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous functions like heap_update without understanding a clear use case.  It's definitely not needed for changing the heap page format.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
>> wrote:
>>>
>>> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
>>> he will be able to further this goal some more.
>>
>>
>> Thanks Alvaro for sharing your development patch.
>>
>> Most of the patch design is the same as described by Alvaro in the first
>> mail [1].
>> I will detail the modifications, pending items and open items (needing
>> discussion)
>> to implement proper pluggable storage.
>>
>> Here I have attached WIP patches to support pluggable storage. The patches
>> in the series may not work individually; many things are still under
>> development.
>> These patches are just to share the approach of the current development.
>>
>> Some notable changes that I did to make the patch work:
>>
>> 1. Added a storageam handler to the slot, because the relation is not
>> readily available in all the places where it is needed.
>> 2. Retained the minimal tuple in the slot, as this is used in hash join.
>> As in the first version, I feel it is fine to allow creating HeapTuple
>> format data.
>>
>> Thanks everyone for sharing their ideas in the developer's unconference at
>> PGCon Ottawa.
>>
>> Pending items:
>>
>> 1. Replacement of Tuple with slot in Trigger functionality
>> 2. Replacement of Tuple with Slot from storage handler functions.
>> 3. Remove/minimize the use of HeapTuple as a Datum.
>> 4. Replace all references of HeapScanDesc with StorageScanDesc
>> 5. Planner changes to consider the relation storage during the planning.
>> 6. Any planner changes based on the discussion of open items?
>> 7. some Executor changes to consider the storage advantages?
>>
>> Open Items:
>>
>> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> HeapTuple and HeapScanDesc, so these scans operate directly
>> on those structures and provide the result.
>
>
> What about vacuum?  I see vacuum is untouched in the patchset and it is not
> mentioned in this discussion.
> Do you plan to override low-level functions like heap_page_prune(),
> lazy_vacuum_page(), etc., but preserve the high-level logic of vacuum?
> Or do you plan to let pluggable storage implement its own high-level vacuum
> algorithm?
>

Probably some other algorithm for vacuum.  I am not sure the current
vacuum with its parameters can be used so easily.  One thing that
might need some thought is whether it is sufficient to tell a developer
implementing some different strategy for vacuum to keep autovacuum off
and call some different API in the places where vacuum can be invoked
manually (like the VACUUM command), or whether we need something more
as well.

What kind of other vacuum algorithm do you expect?  It would be much easier to understand if we had examples.

For me, changing the vacuum algorithm is not needed for just a heap page format change.  The existing vacuum algorithm could just call page format API functions for manipulating individual pages.

Changing the vacuum algorithm might be needed for changes more invasive than just the heap page format.  However, we should first understand what these changes could be and how they fit with the rest of the API design.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Thu, Jun 22, 2017 at 5:47 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Open Items:
>
> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
> HeapTuple and HeapScanDesc, so these scans operate directly
> on those structures and provide the result.
>
> These scan types may not be applicable to different storage formats.
> So how to handle them?

I think that BitmapHeapScan, at least, is applicable to any table AM
that has TIDs.   It seems to me that in general we can imagine three
kinds of table AMs:

1. Table AMs where a tuple can be efficiently located by a real TID.
By a real TID, I mean that the block number part is really a block
number and the item ID is really a location within the block.  These
are necessarily quite similar to our current heap, but they can change
the tuple format and page format to some degree, and it seems like in
many cases it should be possible to plug them into our existing index
AMs without too much heartache.  Both index scans and bitmap index
scans ought to work.

2. Table AMs where a tuple has some other kind of locator.  For
example, imagine an index-organized table where the locator is the
primary key, which is a bit like what Alvaro had in mind for indirect
indexes.  If the locator is 6 bytes or less, it could potentially be
jammed into a TID, but I don't think that's a great idea.  For things
like int8 or numeric, it won't work at all.  Even for other things,
it's going to cause problems because the bit patterns won't be what
the code is expecting; e.g. bitmap scans care about the structure of
the TID, not just how many bits it is.  (Due credit: Somebody, maybe
Alvaro, pointed out this problem before, at PGCon.)  For these kinds
of tables, larger modifications to the index AMs are likely to be
necessary, at least if we want a really general solution, or maybe we
should have separate index AMs - e.g. btree for traditional TID-based
heaps, and generic_btree or indirect_btree or key_btree or whatever
for heaps with some other kind of locator.  It's not too hard to see
how to make index scans work with this sort of structure but it's very
unclear to me whether, or how, bitmap scans can be made to work.

3. Table AMs where a tuple doesn't really have a locator at all.  In
these cases, we can't support any sort of index AM at all.  When the
table is queried, there's really nothing the core system can do except
ask the table AM for a full scan, supply the quals, and hope the table
AM has some sort of smarts that enable it to optimize somehow.  For
example, you can imagine converting cstore_fdw into a table AM of this
sort - ORC has a sort of inbuilt BRIN-like indexing that allows whole
chunks to be proven uninteresting and skipped.  (You could use chunk
number + offset to turn this into a table AM of the previous type if
you wanted to support secondary indexes; not sure if that'd be useful,
but it'd certainly be harder.)

I'm more interested in #1 than in #3, and more interested in #3 than
#2, but other people may have different priorities.

Hi Robert,

Thanks for the details and your opinion.
I also agree that option #1 is better to do first.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Jun 13, 2017 at 4:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>>
>> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
>> wrote:
>>>
>>> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
>>> he will be able to further this goal some more.
>>
>>
>> Thanks Alvaro for sharing your development patch.
>>
>> Most of the patch design is the same as described by Alvaro in the first
>> mail [1].
>> I will detail the modifications, pending items and open items (needing
>> discussion)
>> to implement proper pluggable storage.
>>
>> Here I have attached WIP patches to support pluggable storage. The patches
>> in the series may not work individually; many things are still under
>> development.
>> These patches are just to share the approach of the current development.
>>
>> Some notable changes that I did to make the patch work:
>>
>> 1. Added a storageam handler to the slot, because the relation is not
>> readily available in all the places where it is needed.
>> 2. Retained the minimal tuple in the slot, as this is used in hash join.
>> As in the first version, I feel it is fine to allow creating HeapTuple
>> format data.
>>
>> Thanks everyone for sharing their ideas in the developer's unconference at
>> PGCon Ottawa.
>>
>> Pending items:
>>
>> 1. Replacement of Tuple with slot in Trigger functionality
>> 2. Replacement of Tuple with Slot from storage handler functions.
>> 3. Remove/minimize the use of HeapTuple as a Datum.
>> 4. Replace all references of HeapScanDesc with StorageScanDesc
>> 5. Planner changes to consider the relation storage during the planning.
>> 6. Any planner changes based on the discussion of open items?
>> 7. some Executor changes to consider the storage advantages?
>>
>> Open Items:
>>
>> 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> HeapTuple and HeapScanDesc, so these scans operate directly
>> on those structures and provide the result.
>
>
> What about vacuum?  I see vacuum is untouched in the patchset and it is not
> mentioned in this discussion.
> Do you plan to override low-level functions like heap_page_prune(),
> lazy_vacuum_page(), etc., but preserve the high-level logic of vacuum?
> Or do you plan to let pluggable storage implement its own high-level vacuum
> algorithm?
>

Probably some other algorithm for vacuum.  I am not sure the current
vacuum with its parameters can be used so easily.  One thing that
might need some thought is whether it is sufficient to tell a developer
implementing some different strategy for vacuum to keep autovacuum off
and call some different API in the places where vacuum can be invoked
manually (like the VACUUM command), or whether we need something more
as well.

What kind of other vacuum algorithm do you expect?  It would be much easier to understand if we had examples.

For me, changing the vacuum algorithm is not needed for just a heap page format change.  The existing vacuum algorithm could just call page format API functions for manipulating individual pages.

Changing the vacuum algorithm might be needed for changes more invasive than just the heap page format.  However, we should first understand what these changes could be and how they fit with the rest of the API design.

Yes, I agree that we need some changes in vacuum to handle pluggable storage.
Currently I haven't fully checked the changes that are needed in vacuum, but
I feel that low-level changes to the functions are enough. Also, there should
be some option from the storage handler to decide whether it needs a vacuum
or not; based on this flag, vacuum may be skipped on those tables, so such
handlers would not need to register those APIs.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Tue, Jun 27, 2017 at 11:53 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Thu, Jun 22, 2017 at 8:00 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Wed, Jun 21, 2017 at 10:47 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Mon, Jun 12, 2017 at 9:50 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> > Open Items:
>> >
>> > 1. The BitmapHeapScan and TableSampleScan are tightly coupled with
>> > HeapTuple and HeapScanDesc, so these scans operate directly
>> > on those structures and provide the result.
>> >
>> > These scan types may not be applicable to different storage formats.
>> > So how to handle them?
>>
>> I think that BitmapHeapScan, at least, is applicable to any table AM
>> that has TIDs.   It seems to me that in general we can imagine three
>> kinds of table AMs:
>>
>> 1. Table AMs where a tuple can be efficiently located by a real TID.
>> By a real TID, I mean that the block number part is really a block
>> number and the item ID is really a location within the block.  These
>> are necessarily quite similar to our current heap, but they can change
>> the tuple format and page format to some degree, and it seems like in
>> many cases it should be possible to plug them into our existing index
>> AMs without too much heartache.  Both index scans and bitmap index
>> scans ought to work.
>
>
> If #1 is only about changing tuple and page formats, then could it be much
> simpler than the patch upthread?  We can implement "page format access
> methods" with routines for insertion, update, pruning and deletion of tuples
> *in a particular page*.  There is no need to redefine the high-level logic
> for scanning the heap, doing updates and so on...

If you are changing the tuple format then you do need to worry about
every place where we are using HeapTuple, like TupleTableSlots,
visibility routines, etc. (just think: if somebody changes the tuple
header, then all such places are susceptible to change).

Agreed.  I think that we can consider a pluggable tuple format as an independent feature which is desirable to have before pluggable storages.  For example, I believe some FDWs could already benefit from a pluggable tuple format.

Accepting multiple tuple formats is possible with a complete replacement of
HeapTuple with TupleTableSlot, or with something like a values/nulls array
instead of a single memory chunk of tuple data.

Currently I am working on replacing HeapTuple with TupleTableSlot
in the upper layers once the tuple is returned from the scan. In most of the
places where a HeapTuple is present, either replace it with a TupleTableSlot
or change it to a StorageTuple (void *).
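
In code terms, the direction is roughly the following sketch
(slot_getattr() is the existing executor routine; the rest is
illustrative):

    /*
     * Sketch: a tuple taken from a slot becomes an opaque pointer whose
     * layout only the originating storage AM understands.
     */
    typedef void *StorageTuple;

    /*
     * Callers that today use heap_getattr() on a HeapTuple would
     * instead deform through the slot, which can consult its storage
     * AM's routines for the actual tuple format.
     */
    static Datum
    fetch_column(TupleTableSlot *slot, int attnum, bool *isnull)
    {
        return slot_getattr(slot, attnum, isnull);
    }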

I have yet to evaluate whether it is possible to support this as an
independent feature without needing some heap-format function to understand
the tuple format in every place.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Tue, Jun 27, 2017 at 7:23 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Tue, Jun 27, 2017 at 4:08 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> Similarly,
>> if the page format is changed, you need to consider all the page scan
>> APIs like heapgettup_pagemode/heapgetpage.
>
>
> If the page format is changed, then heapgettup_pagemode/heapgetpage should
> use appropriate API functions for manipulating page items.  I'm very afraid
> of overriding the whole of heapgettup_pagemode/heapgetpage and monstrous
> functions like heap_update without understanding a clear use case.  It's
> definitely not needed for changing the heap page format.
>

Yeah, we might not change them completely; there will always be some
common parts.  But as of now, this API assumes that we can return a
pointer to a tuple on the page while holding just a pin on the buffer.
This is based on the assumption that nobody can update the current
tuple in place; only a new tuple will be created on update.  For some
use cases, like in-place updates, we might want to do that differently.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Wed, Jun 28, 2017 at 7:43 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Wed, Jun 28, 2017 at 12:00 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>>
>> On Tue, Jun 27, 2017 at 4:19 PM, Amit Kapila <amit.kapila16@gmail.com>
>> wrote:
>>>
>>> On Thu, Jun 22, 2017 at 5:46 PM, Alexander Korotkov
>>> <a.korotkov@postgrespro.ru> wrote:
>>>
>>> Probably some other algorithm for vacuum.  I am not sure the current
>>> vacuum with its parameters can be used so easily.  One thing that
>>> might need some thought is whether it is sufficient to tell a developer
>>> implementing some different strategy for vacuum to keep autovacuum off
>>> and call some different API in the places where vacuum can be invoked
>>> manually (like the VACUUM command), or whether we need something more
>>> as well.
>>
>>
>> What kind of other vacuum algorithm do you expect?  It would be much
>> easier to understand if we had examples.
>>
>> For me, changing the vacuum algorithm is not needed for just a heap page
>> format change.  The existing vacuum algorithm could just call page format
>> API functions for manipulating individual pages.
>>
>> Changing the vacuum algorithm might be needed for changes more invasive
>> than just the heap page format.  However, we should first understand what
>> these changes could be and how they fit with the rest of the API design.
>
>
> Yes, I agree that we need some changes in vacuum to handle pluggable
> storage.
> Currently I haven't fully checked the changes that are needed in vacuum,
> but I feel that low-level changes to the functions are enough. Also, there
> should be some option from the storage handler to decide whether it needs
> a vacuum or not; based on this flag, vacuum may be skipped on those tables,
> so such handlers would not need to register those APIs.
>

Something in that direction sounds reasonable to me.  I am also not
very clear on what kind of pluggability will be required for vacuum.  I
think for now we can park this problem and try to tackle the tuple
format and page format related stuff.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Wed, Jun 28, 2017 at 1:16 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
Accepting multiple tuple formats is possible with a complete replacement of
HeapTuple with TupleTableSlot, or with something like a values/nulls array
instead of a single memory chunk of tuple data.

Currently I am working on replacing HeapTuple with TupleTableSlot
in the upper layers once the tuple is returned from the scan. In most of the
places where a HeapTuple is present, either replace it with a TupleTableSlot
or change it to a StorageTuple (void *).

I have yet to evaluate whether it is possible to support this as an
independent feature without needing some heap-format function to understand
the tuple format in every place.

To replace tuples with slots, I took the trigger and SPI calls as the first
step in the move from tuples to slots; here I have attached a WIP patch. The
notable changes are:

1. Replaced most of the HeapTuple usage with slots in the SPI interface functions.
2. In SPITupleTable, HeapTuple is changed to TupleTableSlot.
But this change may not be the proper approach, because a duplicate copy of
the TupleTableSlot is generated and stored.
3. Changed all trigger interfaces to accept TupleTableSlot instead of HeapTuple.
4. ItemPointerData is added as a member to the TupleTableSlot structure
(a rough sketch of the resulting slot follows below).
5. Modified ExecInsert and others to work directly on TupleTableSlot instead
of tuples (not completely).
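
A rough sketch of the resulting slot (abridged; the names of the added
fields are guesses for illustration, not necessarily what the attached
patch uses):

    /*
     * Sketch of the modified TupleTableSlot (abridged).  The first
     * group of fields exists today; the additions carry the tuple
     * identity (change 4 above) plus the slot-level storage AM link
     * described earlier in the thread.
     */
    typedef struct TupleTableSlot
    {
        NodeTag     type;
        bool        tts_isempty;
        TupleDesc   tts_tupleDescriptor;
        Datum      *tts_values;
        bool       *tts_isnull;
        /* ... other existing members ... */

        void       *tts_storagetuple;    /* AM-specific tuple bytes */
        StorageAmRoutine *tts_storageam; /* how to interpret them */
        ItemPointerData tts_tid;         /* tuple identity (change 4) */
    } TupleTableSlot;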

In many places, while accessing system tables, the tuple is directly mapped to
a catalog table structure and accessed. Currently I am not modifying any of
those, leaving them as they are until we solve all the other HeapTuple
replacement problems.

To continue the replacement of HeapTuple with TupleTableSlot, I am thinking
of changing trigger functions like plpgsql_exec_trigger to return a
TupleTableSlot instead of a HeapTuple. These changes may improve performance,
as they avoid deforming and re-forming the tuple. I am assuming that the same
TupleTableSlot memory remains valid across these function calls. Am I correct?

It may not be possible to replace every use of HeapTuple with TupleTableSlot,
but it is possible with a StorageTuple (which is a void *). Wherever a tuple
is present, a tupledesc exists in most cases. How about adding some kind of
information to the tupledesc to identify the tuple format, calling the
necessary functions to generate a TupleTableSlot from it, and using that from
there onward? This approach may be useful for storing a StorageTuple in
SPITupleTable instead of a TupleTableSlot.

Please let me know your comments/suggestions.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Thu, Jun 22, 2017 at 9:30 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> If #1 is only about changing tuple and page formats, then could it be much
> simpler than the patch upthread?  We can implement "page format access
> methods" with routines for insertion, update, pruning and deletion of tuples
> *in a particular page*.  There is no need to redefine the high-level logic
> for scanning the heap, doing updates and so on...

That assumes that every tuple format does those things in the same
way, which I suspect is not likely to be the case.  I think that
pruning and vacuum are artifacts of the current heap format, and they
may be nonexistent or take some altogether different form in some
other storage engine.  InnoDB isn't much like the PostgreSQL heap, and
neither is SQL Server, IIUC.  If we're creating a heap format that can
only be different in trivial ways from what we have now, and anything
that really changes the paradigm is off-limits, well, that's not
really interesting enough to justify the work of creating a heap
storage API.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Thu, Jun 22, 2017 at 4:27 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> Can you elaborate a bit more about this TID bit pattern issues? I do
> remember that not all TIDs are valid due to safeguards on individual fields,
> like for example
>
>     Assert(iptr->ip_posid < (1 << MaxHeapTuplesPerPageBits))
>
> But perhaps there are some other issues?

I think the other issues that I know about have largely already been
mentioned by others, but since this question was addressed to me:

1. TID bitmaps assume that the page and offset parts of the TID
contain just those things.  As Tom pointed out, tidbitmap.c isn't cool
with TIDs that have ip_posid out of range.  More broadly, we assume
lossification is a sensible way of keeping a TID bitmap down to a
reasonable size without losing too much efficiency, and that sorting
tuple references by the block ID is likely to produce a sequential I/O
pattern.  Those things won't necessarily be true if TIDs are treated
as opaque tuple identifiers.

2. Apparently, GIN uses the structure of TIDs to compress posting
lists.  (I'm not personally familiar with this code.)
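
For reference, both issues rest on the internal structure of the 6-byte
TID itself, which is (paraphrasing src/include/storage/itemptr.h):

    typedef struct ItemPointerData
    {
        BlockIdData  ip_blkid;  /* physical block number */
        OffsetNumber ip_posid;  /* line pointer (item) number in block */
    } ItemPointerData;

tidbitmap.c lossifies by discarding ip_posid and remembering only the
block, which is only sensible if these fields really mean "page" and
"offset within page".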

In general, it's a fairly dangerous thing to suppose that you can
repurpose a value as widely used as a TID and not break anything.  I'm
not saying it can't be done, but we use TIDs in an awful lot of places
and rooting out all of the places somebody may have made an assumption
about the structure of them may not be trivial.  I tend to think it's
an ugly kludge to shove some other kind of value into a TID, anyway.
If we need to store something that's not a TID, I think we should have
a purpose-built mechanism for that, not just hammer on the existing
system until it sorta works.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Fri, Jul 14, 2017 at 8:35 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> To replace tuple with slot, I took trigger and SPI calls as the first step
> in modifying
> from tuple to slot, Here I attached a WIP patch. The notable changes are,
>
> 1. Replace most of the HeapTuple with Slot in SPI interface functions.
> 2. In SPITupleTable, Instead of HeapTuple, it is changed to TupleTableSlot.
> But this change may not be a proper approach, because a duplicate copy of
> TupleTableSlot is generated and stored.
> 3. Changed all trigger interfaces to accept TupleTableSlot Instead of
> HeapTuple.
> 4. ItemPointerData is added as a member to the TupleTableSlot structure.
> 5. Modified the ExecInsert and others to work directly on TupleTableSlot
> instead
> of tuple(not completely).

What problem are you trying to solve with these changes?  I'm not
saying that it's a bad idea, but I think you should spell out the
motivation a little more explicitly.

I think performance is likely to be a key concern here.  Maybe these
interfaces are Just Better with TupleTableSlots and the patch is a win
regardless of pluggable storage -- but if the reverse turns out to be
true, and this slows things down in cases that anyone cares about, I
think that's going to be a problem.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Sat, Jul 15, 2017 at 5:14 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Jun 22, 2017 at 9:30 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> If #1 is only about changing tuple and page formats, then could be much
> simpler than the patch upthread?  We can implement "page format access
> methods" with routines for insertion, update, pruning and deletion of tuples
> *in particular page*.  There is no need to redefine high-level logic for
> scanning heap, doing updates and so on...

That assumes that every tuple format does those things in the same
way, which I suspect is not likely to be the case.  I think that
pruning and vacuum are artifacts of the current heap format, and they
may be nonexistent or take some altogether different form in some
other storage engine.

I think that pruning and vacuum are artifacts not only of the current heap format, but also of the current index AM API.  And that is the more significant circumstance, given that we're going to preserve compatibility of new storage engines with the current index AMs.  Our current index AM API assumes that we can delete from an index only in a bulk manner.  The payload attached to an index key is a TID, not an arbitrary piece of data.  And that payload can't be updated.

InnoDB isn't much like the PostgreSQL heap, and
neither is SQL Server, IIUC.  If we're creating a heap format that can
only be different in trivial ways from what we have now, and anything
that really changes the paradigm is off-limits, well, that's not
really interesting enough to justify the work of creating a heap
storage API.

My concern is that we probably can't do anything that really changes the paradigm while preserving compatibility with the index AM API.  If you don't agree with that, it would be good to provide some examples.  It seems unlikely to me that we're going to get something like an InnoDB or SQL Server table with our current index AM API.  InnoDB uses index-organized tables where primary and secondary indexes are versioned independently.  SQL Server uses a flat data structure similar to our heap, but its MVCC implementation also seems very different.

I think in general there are two ways of dealing with our index AM API limitations.  One of them is to extend the index AM API.  At least, we would need a method for deleting an individual index tuple (we already have kill_prior_tuple, but it's just a hint for now).  Also, it would be nice to allow an arbitrary payload in index tuples instead of a TID, and a method to update that payload.  But that would be quite a big amount of work.  Alternatively, we could allow pluggable storages to have their own index AMs, which would move that work to the pluggable storage side.
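
A minimal sketch of what those two extensions might look like as index AM
methods; all names and signatures here are hypothetical:

    /* Hypothetical: delete one index entry identified by key + payload */
    typedef bool (*amdeletetuple_function) (Relation indexRelation,
                                            Datum *values, bool *isnull,
                                            ItemPointer payload);

    /* Hypothetical: replace the payload of an existing index entry
     * (an arbitrary-payload variant would take void * and a length) */
    typedef bool (*amupdatepayload_function) (Relation indexRelation,
                                              Datum *values, bool *isnull,
                                              ItemPointer old_payload,
                                              ItemPointer new_payload);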

The thing we should avoid is a storage API that is as invasive as something paradigm-changing, yet actually allows only trivial changes to the heap format.  A mechanical replacement of heap methods with storage API methods could lead us there.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Peter Geoghegan
Date:
On Sat, Jul 15, 2017 at 3:36 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I think that pruning and vacuum are artifacts of not only current heap
> formats, but they are also artifacts of current index AM API.  And this is
> more significant circumstance given that we're going to preserve
> compatibility of new storage engines with current index AMs.  Our current
> index AM API assumes that we can delete from index only in bulk manner.  Our
> payload to index key is TID, not arbitrary piece of data.  And that payload
> can't be updated.

I agree that this is a big set of problems. This is where I think we
can get the most benefit.

One nice thing about having a fully logical TID is that you don't have
to bloat indexes just because a HOT update was not possible. You bloat
something else instead, certainly, but that can be optimized for
garbage collection. Not all bloat is equal. MVCC based on UNDO can be
very fast because UNDO is very well optimized for garbage collection,
and so can be bloated with no long term consequences, and minor short
term consequences. Still, it isn't fair to blame the Postgres VACUUM
design for the fact that Postgres bloats indexes so easily. Nothing
stops us from adding a new layer of indirection, so that bloating an
index degrades into something that is not even as bad as bloating the
heap [1]. We may just have a data structure problem, which is not
nearly the same thing as a fundamental design problem in the storage
layer.

This approach could pay off soon if we start with unique indexes,
where there is a "queue" of row versions that can be pruned with simple
logic, temporal locality helps a lot, only zero or one versions can be
visible to your MVCC snapshot, etc. This might require only minimal
revisions to the index AM API, to help nbtree. We could later improve
this so that you bloat UNDO instead of bloating a heap-like structure,
both for indexes and for the actual heap. That seems less urgent.

To repeat myself, for emphasis: *Not all bloat is equal*. Index bloat
makes the way a B-Tree's keyspace is split up far too granular, making
pages sparsely packed, a problem that is more or less *irreversible* by
VACUUM or any garbage collection process [2]. That's how B-Trees work
-- they're optimized for concurrency, not for garbage collection.

>> InnoDB isn't much like the PostgreSQL heap, and
>> neither is SQL Server, IIUC.  If we're creating a heap format that can
>> only be different in trivial ways from what we have now, and anything
>> that really changes the paradigm is off-limits, well, that's not
>> really interesting enough to justify the work of creating a heap
>> storage API.
>
>
> My concern is that we probably can't do anything that really changes
> paradigm while preserving compatibility with index AM API.  If you don't
> agree with that, it would be good to provide some examples.  It seems
> unlikely for me that we're going to have something like InnoDB or SQL Server
> table with our current index AM API.  InnoDB utilizes index-organized tables
> where primary and secondary indexes are versioned independently.  SQL Server
> utilizes flat data structure similar to our heap, but MVCC implementation
> also seems very different.

I strongly agree. I simply don't understand how you can adopt UNDO for
MVCC, and yet expect to get a benefit commensurate with the effort
without also implementing "retail index tuple deletion" first.
Pursuing UNDO this way has the same problem that WARM likely has -- it
doesn't really help with the worst case, where users get big,
unpleasant surprises. Postgres is probably the only major database
system that doesn't support retail index tuple deletion. It's a basic
thing, that has nothing to do with MVCC. Really, what do we have to
lose?

The biggest weakness of the current design is IMV how it fails to
prevent index bloat in the first place, but avoiding bloating index
leaf pages in the first place doesn't seem incompatible with how
VACUUM works. Or at least, let's not assume that it is. We should
avoid throwing the baby out with the bathwater.

> I think in general there are two ways dealing with out index AM API
> limitation.  One of them is to extend index AM API.  At least, we would need
> a method for deletion of individual index tuple (for sure, we already have
> kill_prior_tuple but it's just a hint for now).

kill_prior_tuple can work well, but, like HOT, it works
inconsistently, in a way that is hard to predict.

> Also, it would be nice to
> have arbitrary payload to index tuples instead of TID, and method to update
> that payload.  But that would be quite big amount of work.  Alternatively,
> we could allow pluggable storages to have their own index AMs, and that will
> move this amount of work to the pluggable storage side.

I agree with Robert that being able to store an arbitrary payload as a
TID is probably not going to ever work very well. However, I don't
think that that's a reason to give up on the underlying idea: creating
a new layer of indirection for secondary indexes, that allows updates
that now have to create new index tuples to instead just update the
indirection layer metadata.

You can create a mapping of a TID-like 6 byte integer to a primary key
value. Techniques exist. That seems a lot more practical. Of course,
TID is what is sometimes called a "physiological" identifier -- it has
a "physical" component (block number) and "logical" component (item
offset). Nothing I can think of prevents us from creating an
alternative, entirely logical identifier that fits in the same 6
bytes. It can map to a versioning indirection layer, for unique
indexes, or to a primary key value, for secondary indirect indexes.
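
A minimal sketch of such an identifier; the type and field names are
hypothetical, and only the 6-byte size constraint comes from the above:

    /* Hypothetical: a purely logical identifier occupying the same
     * 6 bytes as ItemPointerData, with no block/offset meaning */
    typedef struct LogicalTupleId
    {
        uint16 lt_hi;    /* bits 47..32 of a 48-bit logical identifier */
        uint16 lt_mid;   /* bits 31..16 */
        uint16 lt_lo;    /* bits 15..0 */
    } LogicalTupleId;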

[1] postgr.es/m/CAH2-Wzmf6intNY1ggiNzOziiO5Eq=DsXfeptODGxO=2j-i1NGQ@mail.gmail.com
[2] https://wiki.postgresql.org/wiki/Key_normalization#VACUUM_and_nbtree_page_deletion
-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Sun, Jul 16, 2017 at 3:58 AM, Peter Geoghegan <pg@bowt.ie> wrote:
I strongly agree. I simply don't understand how you can adopt UNDO for
MVCC, and yet expect to get a benefit commensurate with the effort
without also implementing "retail index tuple deletion" first.
Pursuing UNDO this way has the same problem that WARM likely has -- it
doesn't really help with the worst case, where users get big,
unpleasant surprises. Postgres is probably the only major database
system that doesn't support retail index tuple deletion. It's a basic
thing, that has nothing to do with MVCC. Really, what do we have to
lose?
 
I think that "retail index tuple deletion" is the feature which could give us some advantages even independently from pluggable storages.  For example, imagine very large table with only small amount of dead tuples.  In this case, it would be cheaper to delete index links to those dead tuples one by one using "retail index tuple deletion", rather than do full scan of every index to perform "bulk delete" of index tuples.  One may argue that you shouldn't do vacuum of large table when only small amount of tuples are dead.  But in terms of index bloat mitigation, very aggressive vacuum strategy could be justified.

I agree with Robert that being able to store an arbitrary payload as a
TID is probably not going to ever work very well.

Support for an arbitrary payload as a TID doesn't sound easy.  However, that doesn't mean it's unachievable.  For me, it's more like a long way which could be traveled step by step.  Some of our existing index access methods (B-tree, hash, GiST, SP-GiST) could support an arbitrary payload relatively easily, because they do not rely on its internal structure.  For others (GIN, BRIN) an arbitrary payload is much harder to support, but I wouldn't say it's impossible.  However, if we make arbitrary payload support an option of the index AM and implement this support for the first group of index AMs, it would already be a great step forward.  So, for example, it would be possible to use indirect indexes when the primary key is not 6 bytes, if the index AM supports an arbitrary payload.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Peter Geoghegan
Date:
On Mon, Jul 17, 2017 at 3:22 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I think that "retail index tuple deletion" is the feature which could give
> us some advantages even independently from pluggable storages.  For example,
> imagine very large table with only small amount of dead tuples.  In this
> case, it would be cheaper to delete index links to those dead tuples one by
> one using "retail index tuple deletion", rather than do full scan of every
> index to perform "bulk delete" of index tuples.  One may argue that you
> shouldn't do vacuum of large table when only small amount of tuples are
> dead.  But in terms of index bloat mitigation, very aggressive vacuum
> strategy could be justified.

Yes, definitely. Especially with the visibility map. Even still, I
tend to think that for unique indexes, true duplicates should be
disallowed and handled with an additional layer of indirection. So
this would be for secondary indexes.

>> I agree with Robert that being able to store an arbitrary payload as a
>> TID is probably not going to ever work very well.
>
>
> Support of arbitrary payload as a TID doesn't sound easy.  However, that
> doesn't mean it's unachievable. For me, it's more like long way which could
> be traveled step by step.

To be fair, it probably is achievable. Where there is a will, there is
a way. I just think that it will be easier to find a different way of
realizing similar benefits. I'm mostly talking about benefits around
making it cheap to have many secondary indexes by having logical
indirection instead of physical pointers (doesn't *have* to be
user-visible primary key values). HOT simply isn't effective enough at
preventing UPDATE index tuple insertions for indexes on unchanged
attributes, often just because pruning can fail to happen in time,
which WARM will not fix.

-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Mon, Jul 17, 2017 at 7:51 PM, Peter Geoghegan <pg@bowt.ie> wrote:
On Mon, Jul 17, 2017 at 3:22 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I think that "retail index tuple deletion" is the feature which could give
> us some advantages even independently from pluggable storages.  For example,
> imagine very large table with only small amount of dead tuples.  In this
> case, it would be cheaper to delete index links to those dead tuples one by
> one using "retail index tuple deletion", rather than do full scan of every
> index to perform "bulk delete" of index tuples.  One may argue that you
> shouldn't do vacuum of large table when only small amount of tuples are
> dead.  But in terms of index bloat mitigation, very aggressive vacuum
> strategy could be justified.

Yes, definitely. Especially with the visibility map. Even still, I
tend to think that for unique indexes, true duplicates should be
disallowed, and dealt with with an additional layer of indirection. So
this would be for secondary indexes.
 
It probably depends on the particular storage (once we have pluggable storages).  Some storages would have an additional level of indirection while others wouldn't.  But even if a unique index contains no true duplicates, it's still possible that a true delete happens.  Then we still have to delete the tuple even from a unique index.

>> I agree with Robert that being able to store an arbitrary payload as a
>> TID is probably not going to ever work very well.
>
>
> Support of arbitrary payload as a TID doesn't sound easy.  However, that
> doesn't mean it's unachievable. For me, it's more like long way which could
> be traveled step by step.

To be fair, it probably is achievable. Where there is a will, there is
a way. I just think that it will be easier to find a different way of
realizing similar benefits. I'm mostly talking about benefits around
making it cheap to have many secondary indexes by having logical
indirection instead of physical pointers (doesn't *have* to be
user-visible primary key values).

It's possible to add the indirection layer "on demand".  Thus, initially index tuples point directly to the heap tuple.  If the tuple gets updated and doesn't fit on the page anymore, then it's moved to another place with a redirect left in the old place.  I think that if this is carefully designed, it's possible to guarantee there is at most one redirect.

But I still think that avoiding arbitrary payloads for indexes is delaying the inevitable, if we want pluggable storages and want them to reuse existing index AMs.  So, for example, an arbitrary payload together with the ability to update it would allow us to make indexes separately versioned (with a separate garbage collection process more or less unrelated to the heap).  Despite the overhead caused by MVCC attributes, I think such indexes could give significant advantages in various workloads.
 
HOT simply isn't effective enough at
preventing UPDATE index tuple insertions for indexes on unchanged
attributes, often just because pruning can fail to happen in time,
which WARM will not fix.

Right.  I think HOT and WARM depend on factors which are hard to control: the distribution of UPDATEs between heap pages, the oldest snapshot, and so on.  It's quite hard for a DBA to understand why a table starts bloating when it didn't before.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company
 

Re: [HACKERS] Pluggable storage

From
Peter Geoghegan
Date:
On Mon, Jul 17, 2017 at 1:24 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> It's probably depends on particular storage (once we have pluggable
> storages).  Some storages would have additional level of indirection while
> others wouldn't.

Agreed. Like kill_prior_tuple, it's an optional capability, and where
it is implemented, it is implemented in a fairly consistent way.

> But even if unique index contain no true duplicates, it's
> still possible that true delete happen.  Then we still have to delete tuple
> even from unique index.

I think I agree. I've been looking over the ARIES paper [1] again
today. They say this:

"For index updates, in the interest of increasing concurrency, we do
not want to prevent the space released by one transaction from being
consumed by another before the commit of the first transaction."

You can literally reclaim space from an index tuple deletion
*immediately* with their design, which matters because you want to
reclaim space as early as possible, before a page split is needed.
Obviously they understand how important this is.

This might not work so well with an MVCC system, where there are no
2PL predicate locks. You need to keep a "ghost record", even for
non-unique indexes, and the deletion can only happen when the xact
commits. Remember, the block number cannot be used to see if there were
changes against the page, unlike the heap, because you have to worry
about page splits and page merges/deletion. UNDO is entirely logical
for indexes for this reason. (This is why UNDO does not actually undo
page splits, relation extension, etc. Only REDO/WAL always works at
the level of individual pages in all cases. UNDO for MVCC is not as
different from our design as I once thought.)

The reason I want to at least start with unique indexes is because you
need a TID to make non-unique/secondary indexes have unique keys
(unique keys are always needed if retail index tuple insertion is
always supported). For unique indexes, you really can do an update in
the index (see my design below for one example of how that can work),
but I think you need something more like a deletion followed by an
insertion for non-unique indexes, because there the physical/heap TID
changed, and that's part of the key, and that might belong on a
different page. You therefore haven't really fixed the problem with
secondary indexes sometimes needing new index tuples even though user
visible attributes weren't updated.

You haven't fixed the problem with secondary indexes unless, of course,
all secondary indexes have logical pointers to begin with, such as the
PK value. Then you only need to "insert and delete, not update" when
the PK value is updated or when a secondary index needs a new index
tuple with distinct user visible attribute values to the previous
version's -- you fix the secondary index problem. And, while your
"version chain overflow indirection" structure is basically something
that lives outside the heap, it is still only needed for one index,
and not all of them.

This new indirection structure is a really nice target for pruning,
because you can prune physical TIDs that no possible snapshot could
use, unlike with the heap, where EvalPlanQual() could make any heap
tuple visible to snapshots at or after the minimal snapshot horizon
implied by RecentGlobalXmin. And index scans on any index can
prune for everyone.

You could also do "true index deletes", as you suggest, but you'd need
to have ghost records there too, and you'd need an asynchronous
cleanup process to do the cleanup when the deleting xact committed.
I'm not sure if it's worth doing that eagerly. It may or may not be
better to hope for kill_prior_tuple to do the job for us. Not sure
where this leaves index-only scans on secondary indexes..."true index
deletes" might be justified by making index only scans work more often
in general, especially for secondary indexes with logical pointers.

I'm starting to think that you were right all along about indirect
indexes needing to store PK values. Perhaps we should just bite the
bullet...it's not like places like the bufpage.c index routines
actually know or care about whether or not the index tuple has a TID,
what a TID is, etc. They care about stuff like the header values of
index tuples, and the page ItemId array, but TID is, as you put it,
merely payload.

> It's possible to add indirection layer "on demand".  Thus, initially index
> tuples point directly to the heap tuple.  If tuple gets updates and doesn't
> fit to the page anymore, then it's moved to another place with redirect in
> the old place.  I think that if carefully designed, it's possible to
> guarantee there is at most one redirect.

This is actually what I was thinking. Here is a sketch:

When you start out, index tuples in nbtree are the same as today --
one physical pointer (TID). But, on the first update to a PK index,
they grow a new pointer, but this is not a physical/heap TID. It's a
pointer to some kind of indirection structure that manages version
chains. You end up with an index with almost exactly the same external
interface as today, with one difference: you tell nbtree if something
is an insert or update, at least for unique indexes. Of course, you
need to have something to update in the index if it's an update, and
nbtree needs to be informed what that is.

My first guess is that we should limit the number of TIDs to two in
all cases, and start with only one physical TID, because:

* The first TID can always be the latest version, which in practice is
all most snapshots care about.

* We want to sharply limit the worst case page bloat, because
otherwise you have the same basic problem. Some queries might be a bit
slower, but it's worth it to be confident that bloat can only get so
bad.

* Simpler "1/3 of a page" enforcement. We simply add
"sizeof(ItemPointerData)" to the calculation.

* Gray says that split points are sometimes better if they're the
average of the min and max keys on the page, rather than the point at
which each half gets the most even share of space. Big index tuples
are basically bad for this.
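
To make the sketch concrete, today's fixed index tuple header is (see
src/include/access/itup.h):

    typedef struct IndexTupleData
    {
        ItemPointerData t_tid;   /* points to the latest heap version */
        unsigned short  t_info;  /* size and flag bits */
    } IndexTupleData;

One could imagine a hypothetical flag bit in t_info meaning "a second
ItemPointerData follows the header", pointing into the version-chain
indirection structure instead of the heap; the "1/3 of a page" check then
simply adds sizeof(ItemPointerData), as noted above.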

> But I sill think that evading arbitrary payload for indexes is delaying of
> inevitable, if only we want pluggable storages and want them to reuse
> existing index AMs.  So, for example, arbitrary payload together with
> ability to update this payload allows us to make indexes separately
> versioned (have separate garbage collection process more or less unrelated
> to heap).  Despite overhead caused by MVCC attributes, I think such indexes
> could give significant advantages in various workloads.

Yeah. Technically you could have some indirection to keep under 6
bytes when that isn't assured by the PK index tuple width, but it
probably wouldn't be worth it. TID is almost like just another
attribute. The more I look, the less I think that TID is this thing
that a bunch of code makes many assumptions about that we will never
find all of. *Plenty* of TIDs today do not point to the heap at all.
For example, internal pages in nbtree use TIDs that point to the
level below.

You would break some code within indextuple.c, but that doesn't seem
so bad. IndexInfoFindDataOffset() already has to deal with
variable-width NULL bitmaps. Why not a variable-length pointer, too?

[1] https://pdfs.semanticscholar.org/39e3/d058a5987cb643e000bce555676d71be1c80.pdf

-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Sat, Jul 15, 2017 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Jul 14, 2017 at 8:35 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> To replace tuple with slot, I took trigger and SPI calls as the first step
> in modifying
> from tuple to slot, Here I attached a WIP patch. The notable changes are,
>
> 1. Replace most of the HeapTuple with Slot in SPI interface functions.
> 2. In SPITupleTable, Instead of HeapTuple, it is changed to TupleTableSlot.
> But this change may not be a proper approach, because a duplicate copy of
> TupleTableSlot is generated and stored.
> 3. Changed all trigger interfaces to accept TupleTableSlot Instead of
> HeapTuple.
> 4. ItemPointerData is added as a member to the TupleTableSlot structure.
> 5. Modified the ExecInsert and others to work directly on TupleTableSlot
> instead
> of tuple(not completely).

What problem are you trying to solve with these changes?  I'm not
saying that it's a bad idea, but I think you should spell out the
motivation a little more explicitly.

Sorry for not providing complete details. I am trying these experiments
to find out the best way to return tuples from the storage methods while
designing a proper API.

The changes I am making are meant to reduce the dependency on the
HeapTuple format by using values/nulls arrays. If there is no dependency
on the HeapTuple format in the upper layers above the PostgreSQL storage,
then we can define the storage API to return the values/nulls arrays directly,
instead of one big chunk of tuple data like a HeapTuple or StorageTuple (void *).

I am finding that eliminating HeapTuple usage in the upper layers
needs some major changes. How about not changing anything in the upper
layers for now and just supporting pluggable tuples with one of
the following approaches for the first version?

1. Design an API that returns a values/nulls array and converts it into a
HeapTuple whenever one is required in the upper layers (sketched below). All
the existing heap tuple form/deform routines are reused, with some adjustments.

2. Design an API that returns a StorageTuple (void *) whose first member
identifies the TYPE of the storage, so that the corresponding registered
functions can be called to deform/form the tuple whenever a tuple is
needed.

3. Design an API that returns a StorageTuple (void *) whose format
information can be obtained from the tupledesc. Wherever the tuple is
present, a tupledesc exists in most cases. How about adding some kind of
information to the tupledesc to identify the tuple format and call the
necessary functions?

4. Design an API that always returns a StorageTuple and converts it to a
HeapTuple via a function hook if one is registered (for heap storage
no hook needs to be registered, because the tuple is already in HeapTuple
format). This function hook would be placed in the heap form/deform functions.
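
To make approach 1 concrete, a minimal sketch under the assumption that
the scan method hands back values/nulls arrays; the callback name is
hypothetical:

    /* Hypothetical storage AM scan method for approach 1 */
    typedef bool (*scan_getnext_function) (StorageScanDesc scan,
                                           ScanDirection direction,
                                           Datum *values, bool *isnull);

    /* In the upper layers, only where a HeapTuple is truly required: */
    /*     tuple = heap_form_tuple(tupdesc, values, isnull);          */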

Are there any better ideas for a first version?

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Sat, Jul 15, 2017 at 6:36 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> I think in general there are two ways dealing with out index AM API
> limitation.  One of them is to extend index AM API.

That's pretty much what I have in mind.  I think it's likely that if
we end up with, say, 3 kinds of heap and 12 kinds of index, there will
be some compatibility matrix.  Some index AMs will be compatible with
some heap AMs, and others won't be.  For example, if somebody makes an
IOT-type heap using the API proposed here or some other one, BRIN
probably won't work at all.  btree, on the other hand, could probably
be made to work, perhaps with some greater or lesser degree of
modification.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Sat, Jul 15, 2017 at 8:58 PM, Peter Geoghegan <pg@bowt.ie> wrote:
> To repeat myself, for emphasis: *Not all bloat is equal*.

+1.

> I strongly agree. I simply don't understand how you can adopt UNDO for
> MVCC, and yet expect to get a benefit commensurate with the effort
> without also implementing "retail index tuple deletion" first.

I agree that we need retail index tuple deletion.  I liked Claudio's
idea at http://postgr.es/m/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
-- that seems indispensible to making retail index tuple deletion
reasonably efficient.  Is anybody going to work on getting that
committed?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Peter Geoghegan
Date:
On Wed, Jul 19, 2017 at 10:56 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> I strongly agree. I simply don't understand how you can adopt UNDO for
>> MVCC, and yet expect to get a benefit commensurate with the effort
>> without also implementing "retail index tuple deletion" first.
>
> I agree that we need retail index tuple deletion.  I liked Claudio's
> idea at http://postgr.es/m/CAGTBQpZ-kTRQiAa13xG1GNe461YOwrA-s-ycCQPtyFrpKTaDBQ@mail.gmail.com
> -- that seems indispensible to making retail index tuple deletion
> reasonably efficient.  Is anybody going to work on getting that
> committed?

I will do review work on it.

IMV the main problems are:

* The way a "header" is added at the PageAddItemExtended() level,
rather than making heap TID something much closer to a conventional
attribute that perhaps only nbtree and indextuple.c have special
knowledge of, strikes me as the wrong way to go.

* It's simply not acceptable to add overhead to *all* internal items.
That kills fan-in. We're going to need suffix truncation for the
common case where the user-visible attributes for a split point/new
high key at the leaf level sufficiently distinguish what belongs on
either side. IOW, you should only see internal items with a heap TID
in the uncommon case where you have so many duplicates at the leaf
level that you have no choice but to use a split point that's right in
the middle of many duplicates.

Fortunately, if we confine ourselves to making heap TID part of the
keyspace, the code can be far simpler than what would be needed to get
my preferred, all-encompassing design for suffix truncation [1] to
work. I think we could just stash the number of attributes
participating in a comparison within internal pages' unused item
pointer offset (sketched below). I've talked about this before, in the context of
Anastasia's INCLUDED columns patch. If we can have a variable number
of attributes for heap tuples, we can do so for index tuples, too.

* We might also have problems with changing the performance
characteristics for the worse in some cases by some measures. This
will probably technically increase the amount of bloat for some
indexes with sparse deletion patterns. I think that that will be well
worth it, but I don't expect a slam dunk.

A nice benefit of this work is that it lets us kill the hack that adds
randomness to the search for free space among duplicates, and may let
us follow the Lehman & Yao algorithm more closely.
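
A minimal sketch of that idea, with hypothetical macro names (the item
pointer accessors are the existing ones from storage/itemptr.h):

    /* Hypothetical: in internal pages, reuse the unused item pointer
     * offset of an index tuple to record how many key attributes
     * survived suffix truncation */
    #define BTreeTupleSetNAtts(itup, n) \
        ItemPointerSetOffsetNumber(&(itup)->t_tid, (n))
    #define BTreeTupleGetNAtts(itup) \
        ItemPointerGetOffsetNumber(&(itup)->t_tid)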

[1] https://wiki.postgresql.org/wiki/Key_normalization#Suffix_truncation_of_normalized_keys
-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Wed, Jul 19, 2017 at 11:33 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Sat, Jul 15, 2017 at 12:30 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>
> I am finding out that eliminating the HeapTuple usage in the upper layers
> needs some major changes, How about not changing anything in the upper
> layers of storage currently and just support the pluggable tuple with one of
> the following approach for first version?
>

It is not very clear to me how any of the below alternatives are
better or worse compared to the approach you are already working on.
Do you mean to say that we will get away without changing all the
places which take a HeapTuple as input or return it as output with some
of the below approaches?

> 1. Design an API that returns values/nulls array and convert that into a
> HeapTuple whenever it is required in the upper layers. All the existing
> heap form/deform tuples are used for every tuple with some adjustments.
>

So, this would have the additional cost of form/deform.  Also, how
would it require fewer changes compared to what you have described
earlier?

> 2. Design an API that returns StorageTuple(void *) with first member
> represents the TYPE of the storage, so that corresponding registered
> function calls can be called to deform/form the tuple whenever there is
> a need of tuple.
>

Do you intend to say that we store such information in the disk tuple or
only in the in-memory version of the same?  Also, what makes you think
that we would need hooks only for form and deform?  Right now, in many
cases the tuple will point directly to a disk page, and we deal with that by
retaining the pin on the corresponding buffer; what if some kinds of
tuples don't follow that rule?  For example, to support in-place updates, we
might always need a separate copy of the tuple rather than one
pointing to a disk page.

> 3. Design an API that returns StorageTuple(void *) but the necessary
> format information of that tuple can be get from the tupledesc. wherever
> the tuple is present, there exists a tupledesc in most of the cases. How
> about adding some kind of information in tupledesc to find out the tuple
> format and call the necessary functions
>
> 4. Design an API to return always the StorageTuple and it converts to
> HeapTuple with a function hook if it gets registered (for heap storages
> this is not required to register the hook, because it is already a HeapTuple
> format). This function hook should be placed in the heap form/deform
> functions.
>

I think some more information is required to comment on any of the
approaches or suggest a new one.  You might want to try by quoting
some specific examples from code so that it is easy to understand what
your proposal will change in that case.  One idea could be that we
start with some structures like TupleTableSlot, EState,
HeapScanDescData,  IndexScanDescData, etc. and interfaces like
heap_insert, heap_update, heap_lock_tuple,
SnapshotSatisfiesFunc, EvalPlanQualFetch, etc.  Now, it is quite
possible that we don't want to change some of these interfaces, but it
can help to see how such a usage can be replaced with new kind of
Tuple structure.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Jul 19, 2017 at 11:33 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> I am finding out that eliminating the HeapTuple usage in the upper layers
> needs some major changes, How about not changing anything in the upper
> layers of storage currently and just support the pluggable tuple with one of
> the following approach for first version?
>

It is not very clear to me how any of the below alternatives are
better or worse as compare to the approach you are already working.
Do you mean to say that we will get away without changing all the
places which take HeapTuple as input or return it as output with some
of the below approaches?

Yes. With the following approaches, my intention is to reduce the changes to
HeapTuple usage everywhere, supporting the pluggable storage API with minimal
changes and improving it further later. But unless we change the HeapTuple
usage everywhere properly, we may not see the benefits of pluggable storage.

 
> 1. Design an API that returns values/nulls array and convert that into a
> HeapTuple whenever it is required in the upper layers. All the existing
> heap form/deform tuples are used for every tuple with some adjustments.
>

So, this would have the additional cost of form/deform.  Also, how
would it have lesser changes as compare to what you have described
earlier?

Yes, it has the additional cost of form/deform. It is the same approach that
was described earlier, but with the intention of eventually modifying every
place where the HeapTuple is accessed. While prototyping the removal of
HeapTuple usage from triggers, I realized that this needs a clear design
to proceed further, instead of combining those changes with the pluggable
storage API.

- The heap_getnext function is kept as it is, and it is used only for system
  table access.
- A heap_getnext_slot function is introduced to return a slot whenever data
  is found, otherwise NULL. This function is used in all the places called
  from the executor and so on.

- The TupleTableSlot structure is modified to contain a void * tuple instead of
a HeapTuple, and it also carries the storage handler functions.
- heap_insert and similar functions can take a slot as an argument and perform the
insert operation.

In cases where a TupleTableSlot cannot be passed, form a HeapTuple from
the data and pass that, noting that it is HeapTuple data rather than
a tuple from the storage.
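
A minimal sketch of this split; the heap_getnext_slot signature is a
guess at what the WIP code might use:

    /* unchanged, now used only for system table access */
    HeapTuple heap_getnext(HeapScanDesc scan, ScanDirection direction);

    /* hypothetical: fills and returns the given slot, or returns NULL
     * when the scan is exhausted */
    TupleTableSlot *heap_getnext_slot(HeapScanDesc scan,
                                      ScanDirection direction,
                                      TupleTableSlot *slot);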

> 2. Design an API that returns StorageTuple(void *) with first member
> represents the TYPE of the storage, so that corresponding registered
> function calls can be called to deform/form the tuple whenever there is
> a need of tuple.
>

Do you intend to say that we store such information in disk tuple or
only in the in-memory version of same?

Only in the in-memory version.
 
  Also, what makes you think
that we would need hooks only for form and deform?  Right now, in many
cases tuple will directly point to disk page and we deal with it by
retaining the pin on the corresponding buffer, what if some kinds of
tuple don't follow that rule?  For ex. to support in-place updates, we
might always need a separate copy of tuple rather than the one
pointing to disk page.

In all of the approaches, except for system tables, we are going to remove
direct disk access to the tuple. Whether we replace the tuple with a slot or
something else, no direct disk access will remain. Otherwise it will be
difficult to support, and it doesn't provide much performance benefit either.

All the storage handler functions need to be maintained separately
and accessed using the handler ID.

The heap_getnext function returns a StorageTuple rather than a HeapTuple.
The StorageTuple can be mapped to a HeapTuple or another tuple type
based on its first member.

heap_insert and similar functions take a StorageTuple as input and internally
decide, based on the type of the tuple, how to perform the insert operation.

Instead of storing the handler functions inside the relation/slot, the
function pointers can be accessed directly based on the storage
type data.

This works in every case where the tuple is accessed, but the problem
is that it may need changes wherever the tuple is accessed.
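
A minimal sketch of the dispatch this approach implies; the header
layout, table name, and function names are all hypothetical:

    /* Hypothetical: opaque tuple whose first member names its format */
    typedef struct StorageTupleHeader
    {
        uint8 st_handlerid;     /* index into a global handler table */
        /* AM-specific tuple body follows */
    } StorageTupleHeader;

    /* Hypothetical dispatch, with no pointers kept in relation/slot */
    static void
    storage_deform_tuple(void *tuple, TupleDesc tupdesc,
                         Datum *values, bool *isnull)
    {
        StorageTupleHeader *hdr = (StorageTupleHeader *) tuple;

        StorageHandlers[hdr->st_handlerid]->deform_tuple(tuple, tupdesc,
                                                         values, isnull);
    }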

 
> 3. Design an API that returns StorageTuple(void *) but the necessary
> format information of that tuple can be get from the tupledesc. wherever
> the tuple is present, there exists a tupledesc in most of the cases. How
> about adding some kind of information in tupledesc to find out the tuple
> format and call the necessary functions
>

The heap_getnext function returns a StorageTuple instead of a HeapTuple. The tuple
type information is available in the TupleDesc structure.

heap_insert and similar functions accept a TupleTableSlot as input and perform
the insert operation. This approach is almost the same as the first approach,
except that the storage handler functions are stored in the TupleDesc.

In case the tuple is formed internally from the values/nulls array, the formed
tuple is always in HeapTuple format, and the same is recorded in the TupleDesc.

I have a doubt that passing the StorageTuple everywhere may cause problems in
places that don't have a TupleDesc and access the tuple directly.

 
> 4. Design an API to return always the StorageTuple and it converts to
> HeapTuple with a function hook if it gets registered (for heap storages
> this is not required to register the hook, because it is already a HeapTuple
> format). This function hook should be placed in the heap form/deform
> functions.
>

The heap_getnext function always returns a HeapTuple irrespective of the storage.
That means all other storage modules have to construct a HeapTuple from their
data and send it back to the server.

heap_insert and similar functions can keep the same interfaces, convert
the HeapTuple internally, and perform the insert operation according to their
storage requirements.

There are not many changes to the structures, just the addition of the storage
handler functions.

This approach is simple, but in my opinion it doesn't provide much benefit in
having a different storage.


I prefer to go with the original (first) approach and generate the HeapTuple
wherever it is necessary in the upper layers, with the API returning a
TupleTableSlot containing either a void pointer to the tuple or the
values/nulls arrays.

Please provide your views.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>
> On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>
>>
>> > 1. Design an API that returns values/nulls array and convert that into a
>> > HeapTuple whenever it is required in the upper layers. All the existing
>> > heap form/deform tuples are used for every tuple with some adjustments.
>> >
>>
>> So, this would have the additional cost of form/deform.  Also, how
>> would it have lesser changes as compare to what you have described
>> earlier?
>
>
> Yes, It have the additional cost of form/deform. It is the same approach
> that
> is described earlier. But I have an intention of modifying everywhere the
> HeapTuple is accessed. But with the other prototype changes of removing
> HeapTuple usage from triggers, I realized that it needs some clear design
> to proceed further, instead of combining those changes with pluggable
> Storage API.
>
> - heap_getnext function is kept as it as and it is used only for system
> table
>   access.
> - heap_getnext_slot function is introduced to return the slot whenever the
>   data is found, otherwise NULL, This function is used in all the places
> from
>   Executor and etc.
>
> - The TupleTableSlot structure is modified to contain a void* tuple instead
> of
> HeapTuple. And also it contains the storagehanlder functions.
> - heap_insert and etc function can take Slot as an argument and perform the
> insert operation.
>
> The cases where the TupleTableSlot is not possible to sent, form a HeapTuple
> from the data and sent it and also note down that it is a HeapTuple data,
> not
> the tuple from the storage.
>
..
>
>
>>
>> > 3. Design an API that returns StorageTuple(void *) but the necessary
>> > format information of that tuple can be get from the tupledesc. wherever
>> > the tuple is present, there exists a tupledesc in most of the cases. How
>> > about adding some kind of information in tupledesc to find out the tuple
>> > format and call the necessary functions
>> >
>
>
> heap_getnext function returns StorageTuple instead of HeapTuple. The tuple
> type information is available in the TupleDesc structure.
>
> All heap_insert and etc function accepts TupleTableSlot as input and perform
> the insert operation. This approach is almost same as first approach except
> the
> storage handler functions are stored in TupleDesc.
>

Why do we need to store the handler function in TupleDesc?  As of now, the
above patch series has it available in RelationData and
TupleTableSlot; I am not sure that keeping it in
TupleDesc instead is a good idea.  Which kinds of places require the TupleDesc
to contain the handler?  If those are few, can we think of passing
it as a parameter?


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>>
>> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
>> he will be able to further this goal some more.
>
>
> Thanks Alvaro for sharing your development patch.
>
> Most of the patch design is same as described by Alvaro in the first mail
> [1].
> I will detail the modifications, pending items and open items (needs
> discussion)
> to implement proper pluggable storage.
>
> Here I attached WIP patches to support pluggable storage. The patch series
> are may not work individually. Still so many things are under development.
> These patches are just to share the approach of the current development.
>

+typedef struct StorageAmRoutine
+{

In this structure, you have already covered most of the API's that a
new storage module needs to provide, but I think there could be more.
One such API could be heap_hot_search.  This seems specific to the current
heap, where we have the provision of HOT.  I think we can provide a new
API, tuple_search or something like that.
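
For comparison, the current heap-specific entry point is:

    bool heap_hot_search(ItemPointer tid, Relation relation,
                         Snapshot snapshot, bool *all_dead);

and a generalized member of StorageAmRoutine might look like this
(hypothetical name and signature):

    bool (*tuple_search) (Relation relation, ItemPointer tid,
                          Snapshot snapshot, bool *all_dead);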


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Mon, Aug 7, 2017 at 11:12 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Aug 1, 2017 at 1:56 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>
> On Sun, Jul 23, 2017 at 4:10 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>
>>
>> > 1. Design an API that returns values/nulls array and convert that into a
>> > HeapTuple whenever it is required in the upper layers. All the existing
>> > heap form/deform tuples are used for every tuple with some adjustments.
>> >
>>
>> So, this would have the additional cost of form/deform.  Also, how
>> would it have lesser changes as compare to what you have described
>> earlier?
>
>
> Yes, It have the additional cost of form/deform. It is the same approach
> that
> is described earlier. But I have an intention of modifying everywhere the
> HeapTuple is accessed. But with the other prototype changes of removing
> HeapTuple usage from triggers, I realized that it needs some clear design
> to proceed further, instead of combining those changes with pluggable
> Storage API.
>
> - heap_getnext function is kept as it as and it is used only for system
> table
>   access.
> - heap_getnext_slot function is introduced to return the slot whenever the
>   data is found, otherwise NULL, This function is used in all the places
> from
>   Executor and etc.
>
> - The TupleTableSlot structure is modified to contain a void* tuple instead
> of
> HeapTuple. And also it contains the storagehanlder functions.
> - heap_insert and etc function can take Slot as an argument and perform the
> insert operation.
>
> The cases where the TupleTableSlot is not possible to sent, form a HeapTuple
> from the data and sent it and also note down that it is a HeapTuple data,
> not
> the tuple from the storage.
>
..
>
>
>>
>> > 3. Design an API that returns StorageTuple(void *) but the necessary
>> > format information of that tuple can be get from the tupledesc. wherever
>> > the tuple is present, there exists a tupledesc in most of the cases. How
>> > about adding some kind of information in tupledesc to find out the tuple
>> > format and call the necessary functions
>> >
>
>
> heap_getnext function returns StorageTuple instead of HeapTuple. The tuple
> type information is available in the TupleDesc structure.
>
> All heap_insert and etc function accepts TupleTableSlot as input and perform
> the insert operation. This approach is almost same as first approach except
> the
> storage handler functions are stored in TupleDesc.
>

Why do we need to store handler function in TupleDesc?  As of now, the
above patch series has it available in RelationData and
TupleTableSlot, I am not sure if instead of that keeping it in
TupleDesc is a good idea.  Which all kind of places require TupleDesc
to contain handler?  If those are few places, can we think of passing
it as a parameter?

So far I have been able to proceed without adding any storage handler functions
to the TupleDesc structure. Sure, I will try passing it as a parameter when
there is a need for it.

While progressing with the patch, I am facing problems in designing the storage API
with respect to Buffers. For example, to replace HeapTupleSatisfiesMVCC and
related functions with function pointers: in the HeapTuple format, the tuple may
belong to one buffer, so the buffer is passed to the HeapTupleSatisfies* functions
along with the tuple. But for other storage formats, a single buffer may not
contain the actual data. This buffer is used to set the hint bits and to mark the
buffer as dirty. If no buffer is available, the performance of subsequent
queries may suffer because the hint bits are not set.

The Buffer is also obtained from heap_fetch, heap_lock_tuple and related
functions to check tuple visibility, but returning a buffer from those
heap_* functions is not possible for other formats. Also, for HeapTuple data,
the tuple data is copied into a palloc'd buffer instead of pointing directly
at the page. So, is returning a Buffer valid here or not?

Currently I am proceeding by removing the Buffer as a parameter in the API. In
case this affects performance, we will need to find a different approach
to handling the hint bits.

Comments?

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera <alvherre@2ndquadrant.com>
> wrote:
>>
>> I have sent the partial patch I have to Hari Babu Kommi.  We expect that
>> he will be able to further this goal some more.
>
>
> Thanks Alvaro for sharing your development patch.
>
> Most of the patch design is same as described by Alvaro in the first mail
> [1].
> I will detail the modifications, pending items and open items (needs
> discussion)
> to implement proper pluggable storage.
>
> Here I attached WIP patches to support pluggable storage. The patch series
> are may not work individually. Still so many things are under development.
> These patches are just to share the approach of the current development.
>

+typedef struct StorageAmRoutine
+{

In this structure, you have already covered most of the API's that a
new storage module needs to provide, but I think there could be more.
One such API could be heap_hot_search.  This seems specific to current
heap where we have the provision of HOT.  I think we can provide a new
API tuple_search or something like that.

Thanks for the review.

Yes, the StorageAmRoutine needs more function pointers. Currently I am
adding all the functions that are present in heapam.h and some slot-related
functions from tuptable.h. Once I stabilize the code and the APIs that are
currently added, I will further enhance it with the remaining functions
necessary to support the pluggable storage API.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Sat, Aug 12, 2017 at 10:31 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>>
>> Why do we need to store handler function in TupleDesc?  As of now, the
>> above patch series has it available in RelationData and
>> TupleTableSlot, I am not sure if instead of that keeping it in
>> TupleDesc is a good idea.  Which all kind of places require TupleDesc
>> to contain handler?  If those are few places, can we think of passing
>> it as a parameter?
>
>
> Till now I am to able to proceed without adding any storage handler
> functions to
> TupleDesc structure. Sure, I will try the way of passing as a parameter when
> there is a need of it.
>

Okay, I think it is better if you discuss such locations before
directly modifying them.

> During the progress of the patch, I am facing problems in designing the
> storage API
> regarding the Buffer. For example To replace all the HeapTupleSatisfiesMVCC
> and
> related functions with function pointers, In HeapTuple format, the tuple may
> belongs
> to one buffer, so the buffer is passed to the HeapTupleSatifisifes***
> functions along
> with buffer, But in case of other storage formats, the single buffer may not
> contains
> the actual data.
>

Also, it is quite possible that some storage AMs won't even
want to return a bool from the HeapTupleSatisfies* APIs.  I
guess what we need here is to provide a way for different storage
AMs to register their own function pointer for an equivalent of the
satisfies function.  So, we need to change
SnapshotData.SnapshotSatisfiesFunc in some way so that different
handlers can register their function instead of using it directly.
I think that should address the problem you are planning to solve by
omitting the buffer parameter.
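
For reference, the current typedef is heap-specific:

    typedef bool (*SnapshotSatisfiesFunc) (HeapTuple htup,
                                           Snapshot snapshot,
                                           Buffer buffer);

and one hypothetical generalization would let each storage AM register a
test over an opaque tuple, deciding internally whether the buffer (used
today for hint bits) is meaningful for its format:

    typedef bool (*StorageSnapshotSatisfiesFunc) (void *storagetuple,
                                                  Snapshot snapshot,
                                                  Buffer buffer);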

> This buffer is used to set the Hint bits and mark the
> buffer as dirty.
> In case if the buffer is not available, the performance may affect for the
> following
> queries if the hint bits are not set.
>

I don't think it is advisable to change that for the current heap.

> Also, a Buffer is obtained from heap_fetch, heap_lock_tuple and related
> functions in order to check tuple visibility, but returning a buffer from
> those heap_* functions is not currently possible for other formats.
>

Why not?  I mean if we consider that all the formats we are worried
about at this stage have a TID (block number, tuple location), then we can
get the buffer.  We might want to consider passing the TID as a parameter
to these APIs if required to make that possible.  You also agreed above
[1] that we can first design the API considering storage formats
having TIDs.

> Also, for HeapTuple data, the tuple data is copied into a palloc'd
> buffer instead of pointing directly to the page. So is returning a
> Buffer valid here or not?
>

Yeah, but I think for the sake of compatibility and not changing too
much in the current API's signature, we should try to avoid it.

> Currently I am proceeding to remove the Buffer as a parameter in the API.
> In case it affects performance, we will need to find a different approach
> to handling the hint bits.
>

Leaving aside the performance concern, I am not convinced that it is a
good idea to remove Buffer as a parameter from the API's you have
mentioned above.  Would you mind thinking once again keeping the
suggestions provided above in this email to see if we can avoid
removing Buffer as a parameter?


[1] - https://www.postgresql.org/message-id/CAJrrPGd8%2Bi8sqZCdhfvBhs2d1akEb_kEuBvgRHSPJ9z2Z7VBJw%40mail.gmail.com

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Sat, Aug 12, 2017 at 10:34 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> On Tue, Jun 13, 2017 at 7:20 AM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> >
>> >
>> > On Fri, Oct 14, 2016 at 7:26 AM, Alvaro Herrera
>> > <alvherre@2ndquadrant.com>
>> > wrote:
>> >>
>> >> I have sent the partial patch I have to Hari Babu Kommi.  We expect
>> >> that
>> >> he will be able to further this goal some more.
>> >
>> >
>> > Thanks Alvaro for sharing your development patch.
>> >
>> > Most of the patch design is same as described by Alvaro in the first
>> > mail
>> > [1].
>> > I will detail the modifications, pending items and open items (needs
>> > discussion)
>> > to implement proper pluggable storage.
>> >
>> > Here I attached WIP patches to support pluggable storage. The patch
>> > series
>> > are may not work individually. Still so many things are under
>> > development.
>> > These patches are just to share the approach of the current development.
>> >
>>
>> +typedef struct StorageAmRoutine
>> +{
>>
>> In this structure, you have already covered most of the APIs that a
>> new storage module needs to provide, but I think there could be more.
>> One such API could be heap_hot_search.  This seems specific to the
>> current heap, where we have the provision of HOT.  I think we can
>> provide a new API tuple_search or something like that.
>
>
> Thanks for the review.
>
> Yes, the StorageAmRoutine needs more function pointers. Currently I am
> adding all the functions that are present in heapam.h and some slot-related
> functions from tuptable.h.
>

Hmm, this API is exposed via heapam.h.  Am I missing something?

> Once I stabilize the code and the APIs that are currently added, I will
> further extend it with the remaining functions that are necessary to
> support the pluggable storage API.
>

Sure, but I think if we find anything during development/review,
then we should either add it immediately or at the very least add a
FIXME to the patch to avoid forgetting the finding.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Sun, Aug 13, 2017 at 5:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Aug 12, 2017 at 10:34 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Tue, Aug 8, 2017 at 2:21 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>>
>> +typedef struct StorageAmRoutine
>> +{
>>
>> In this structure, you have already covered most of the APIs that a
>> new storage module needs to provide, but I think there could be more.
>> One such API could be heap_hot_search.  This seems specific to the
>> current heap, where we have the provision of HOT.  I think we can
>> provide a new API tuple_search or something like that.
>
>
> Thanks for the review.
>
> Yes, the StorageAmRoutine needs more function pointers. Currently I am
> adding all the functions that are present in heapam.h and some slot-related
> functions from tuptable.h.
>

Hmm, this API is exposed via heapam.h.  Am I missing something?

Sorry, I was not explaining it clearly in my earlier mail. Yes, you are
right that the heap_hot_search function exists in heapam.h, but I am yet to
add all the exposed functions that are present in the heapam.h file to
the StorageAmRoutine structure.

> Once I stabilize the code and the APIs that are currently added, I will
> further extend it with the remaining functions that are necessary to
> support the pluggable storage API.
>

Sure, but I think if we find anything during development/review,
then we should either add it immediately or at the very least add a
FIXME to the patch to avoid forgetting the finding.

OK. I will add all the functions that have been identified so far.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Andres Freund
Date:
Hi,


On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote:
> Here I have attached WIP patches to support pluggable storage. The patches
> in the series may not work individually; many things are still under
> development. These patches are just to share the approach of the current
> development.

Making a pass through the patchset to get a feel for where this is at, and
where it is headed.  I previously skimmed the thread to get a rough
sense of what's been discussed, but not in a very detailed manner.


General:

- I think one important discussion we need to have is what kind of
  performance impact we're going to accept introducing this. It seems
  very likely that this'll cause some slowdown.  We can kind of
  alleviate that by doing some optimizations at the same time, but
  nevertheless, this abstraction is going to cost.

- I don't think we should introduce this without a user besides
  heapam. The likelihood that API will be usable by anything else
  without a testcase seems fairly remote.  I think some src/test/modules
  type implementation of a per-session, in-memory storage - relatively
  easy to implement - or such is necessary.

- I think, and detailed some of that, we're going to need some cleanups
  that go in before this, to decrease the size / increase the quality of
  the new APIs.  It's going to get more painful to change APIs
  subsequently.

- We'll have to document clearly that these APIs are going to change for
  a while, even after the release introducing them.


StorageAm - Scan stuff:

- I think API isn't quite right. There's a lot of granular callback
  functionality like scan_begin_function / scan_begin_catalog /
  scan_begin_bm - these largely are convenience wrappers around the same
  function, and I don't think that would, or rather should, change in
  any abstracted storage layer.  So I think there needs to be some
  unification first (pretty close w/ beginscan_internal already, but
  perhaps we should get rid of a few of these wrappers).

- Some of the exposed functionality, e.g. scan_getpage,
  scan_update_snapshot, scan_rescan_set_params looks like it should just
  be excised, i.e. there's no reason for it to exist.

- Minor: don't think the _function suffix for StorageAmRoutine members
  is necessary, just makes things long, and every member has it. Besides
  that, it's also easy to misunderstand - for a second I understood
  scan_getnext_function to be about getting the next function...

- Scans are still represented as HeapScanDesc - I don't think that's
  going to fly. Either this needs to be an opaque type (i.e. a struct
  that's not defined, just forward declared), or it needs to be a base
  struct that individual AMs embed in their own structs.  Individual AMs
  definitely are going to need different pieces of data.


Storage AM - tuple stuff:

- tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are
  each individual functions, that seems pretty painful to maintain, and
  very likely to just grow and grow. Not sure what the solution is, but
  this seems like a hard sell.

- The three *speculative* functions don't quite seem right to me, nor do
  I understand:
+     *
+     * Setting a tuple's speculative token is a slot-only operation, so no need
+     * for a storage AM method, but after inserting a tuple containing a
+     * speculative token, the insertion must be completed by these routines:
+     */
  I don't see anything related to slots, here?


Storage AM - slot stuff:


- I think there's a number of wrapper functions (slot_getattr,
  slot_getallattrs, getsomeattrs, attisnull) around the same
  functionality - that bloats the API and causes slowdowns. Instead we
  need something like slot_virtualize_tuple(int16 upto), and the rest
  should just be wrappers.

- I think it's wrong to have the slot functionality defined on the
  StorageAm level.  That'll cause even more indirect function calls (=>
  slowness), and besides that the TupleTableSlot struct will need
  members for each and every Am.

  I think the solution is to instead have an Am level "create slot"
  function, and the returned slot is allocated by the Am, with a base
  member of TupleTableSlot with basically just tts_nvalid, tts_values,
  tts_isnull as members.  Those are the only members that can be
  accessed without functions.

  Access to the individual functions (say store_tuple) would then be
  direct members of the TupleTableSlot interface. While that costs a bit
  of memory, it removes one indirection from an already performance
  critical path.

- MinimalTuples should be one type of slot for the above, except it's
  not created by an StorageAm but by a function returning a
  TupleTableSlot.

  This should remove the need for the slot_copy_min_tuple,
  slot_is_physical_tuple functions.

- Right now TupleTableSlots are an executor datastructure, but these
  patches (rightly!) make it much more widely used. So I think it needs
  to be moved outside of executor/, and probably renamed to something
  like TupleHolder or something.

- The oid integration seems wrong - without an accessor oids won't be
  queryable with this unless you break through the API.  But from a
  higher level view I do wonder if it's not time to remove "special" oid
  columns and replace them with a normal column.  We should be hesitant
  enshrining crusty old concepts in new APIs.


Executor integration:

- I'm quite fearful that this'll cause slowdowns in a few tight paths.
  The most likely cases here seem to be a) bitmap indexscans b)
  indexscans c) highly selective sequential scans.   I do wonder if
  that can be partially addressed by switching out the individual
  executor routines in the relevant scan nodes by something using or
  similar to the infrastructure in cc9f08b6b8


Regards,

Andres



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Sat, Aug 12, 2017 at 10:31 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>>
>> Why do we need to store the handler function in TupleDesc?  As of now,
>> the above patch series has it available in RelationData and
>> TupleTableSlot; I am not sure that keeping it in TupleDesc instead is a
>> good idea.  Which places require TupleDesc to contain the handler?  If
>> those are few, can we think of passing it as a parameter?
>
>
> So far I have been able to proceed without adding any storage handler
> functions to the TupleDesc structure. Sure, I will try passing it as a
> parameter when there is a need for it.
>

Okay, I think it is better if you discuss such locations before
directly modifying those.

Sure. I will check with the community before making any such changes.

 
> During the progress of the patch, I am facing problems in designing the
> storage API regarding the Buffer. For example, to replace all the
> HeapTupleSatisfiesMVCC and related functions with function pointers: in
> the HeapTuple format, the tuple may belong to one buffer, so the buffer
> is passed to the HeapTupleSatisfies* functions along with the tuple. But
> in the case of other storage formats, a single buffer may not contain
> the actual data.
>

Also, it is quite possible that some of the storage AMs don't even
want to return a bool from the HeapTupleSatisfies* APIs.  I
guess what we need here is to provide a way so that different storage
AMs can register their function pointer for an equivalent of the
satisfies function.  So, we need to change
SnapshotData.SnapshotSatisfiesFunc in some way so that different
handlers can register their function instead of using it directly.
I think that should address the problem you are planning to solve by
omitting the buffer parameter.

Thanks for your suggestion. Yes, it is better to go in the direction of
SnapshotSatisfiesFunc.

I verified the above idea of implementing the tuple visibility functions
and assigning them into the SnapshotData structure based on the snapshot.

The tuple visibility functions that are specific to the relation are
available in the RelationData structure, and that structure may not be
available everywhere, so I changed the SnapshotData structure to hold an
enum representing what type of snapshot it is, instead of storing a pointer
to the tuple visibility function. Whenever there is a need to check tuple
visibility, the storage AM handler pointer corresponding to the snapshot
type is called and the result is obtained as before.
 

> This buffer is used to set the hint bits and mark the buffer as dirty.
> If the buffer is not available, performance may suffer for subsequent
> queries because the hint bits are not set.
>

I don't think it is advisable to change that for the current heap.

I didn't change the prototypes of the existing functions. Currently the
tuple visibility functions assume that the Buffer is always valid, but
that may not be correct depending on the storage.
 

> Also, a Buffer is obtained from heap_fetch, heap_lock_tuple and related
> functions in order to check tuple visibility, but returning a buffer from
> those heap_* functions is not currently possible for other formats.
>

Why not?  I mean if we consider that all the formats we are worried
about at this stage have a TID (block number, tuple location), then we can
get the buffer.  We might want to consider passing the TID as a parameter
to these APIs if required to make that possible.  You also agreed above
[1] that we can first design the API considering storage formats
having TIDs.

The current approach is to support storages that have TIDs.
But what I mean here is that in some storage methods (for example, column
storage), the tuple is not present in a single buffer; the tuple data may
be assembled from many buffers before returning the slot/StorageTuple
(unless we change everything over to slots).

If any code that runs after the storage methods expects a valid Buffer, it
may need some changes to first check whether the buffer is valid and
perform its operations based on that.
 
> Also, for HeapTuple data, the tuple data is copied into a palloc'd
> buffer instead of pointing directly to the page. So is returning a
> Buffer valid here or not?
>

Yeah, but I think for the sake of compatibility and not changing too
much in the current API's signature, we should try to avoid it.

Currently I am trying to avoid changing the current APIs' signatures.
Most of the signature changes are something like HeapTuple -> StorageTuple,
and so on.

 
> Currently I am proceeding to remove the Buffer as a parameter in the API.
> In case it affects performance, we will need to find a different approach
> to handling the hint bits.
>

Leaving aside the performance concern, I am not convinced that it is a
good idea to remove Buffer as a parameter from the API's you have
mentioned above.  Would you mind thinking once again keeping the
suggestions provided above in this email to see if we can avoid
removing Buffer as a parameter?

Thanks for your suggestions.
Yes, I am able to proceed without removing the Buffer parameter.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Also, it is quite possible that some of the storage Am's don't even
>> want to return bool as a parameter from HeapTupleSatisfies* API's.  I
>> guess what we need here is to provide a way so that different storage
>> am's can register their function pointer for an equivalent to
>> satisfies function.  So, we need to change
>> SnapshotData.SnapshotSatisfiesFunc in some way so that different
>> handlers can register their function instead of using that directly.
>> I think that should address the problem you are planning to solve by
>> omitting buffer parameter.
>
>
> Thanks for your suggestion. Yes, it is better to go in the direction of
> SnapshotSatisfiesFunc.
>
> I verified the above idea of implementing the tuple visibility functions
> and assigning them into the SnapshotData structure based on the snapshot.
>
> The tuple visibility functions that are specific to the relation are
> available in the RelationData structure, and that structure may not be
> available,
>

Which functions are you referring to here?  I don't see anything in
tqual.h that uses RelationData.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Tue, Aug 15, 2017 at 4:53 PM, Andres Freund <andres@anarazel.de> wrote:
Hi,


On 2017-06-13 11:50:27 +1000, Haribabu Kommi wrote:
> Here I have attached WIP patches to support pluggable storage. The patches
> in the series may not work individually; many things are still under
> development. These patches are just to share the approach of the current
> development.

Making a pass through the patchset to get a feel for where this is at, and
where it is headed.  I previously skimmed the thread to get a rough
sense of what's been discussed, but not in a very detailed manner.

Thanks for the review.
 

General:

- I think one important discussion we need to have is what kind of
  performance impact we're going to accept introducing this. It seems
  very likely that this'll cause some slowdown.  We can kind of
  alleviate that by doing some optimizations at the same time, but
  nevertheless, this abstraction is going to cost.
 
OK. Maybe in order to make that decision we need some performance
figures; I will measure the performance once the APIs have stabilized.

- I don't think we should introduce this without a user besides
  heapam. The likelihood that API will be usable by anything else
  without a testcase seems fairly remote.  I think some src/test/modules
  type implementation of a per-session, in-memory storage - relatively
  easy to implement - or such is necessary.

Sure, I will add a test module once the APIs are stabilized.

 
- I think, and detailed some of that, we're going to need some cleanups
  that go in before this, to decrease the size / increase the quality of
  the new APIs.  It's going to get more painful to change APIs
  subsequently.

- We'll have to document clearly that these APIs are going to change for
  a while, even after the release introducing them.

Yes, that's correct. Because this is the first time we are developing
storage APIs to support pluggable storage, they may need some refinements
based on usage to support different storage methods.



StorageAm - Scan stuff:

- I think API isn't quite right. There's a lot of granular callback
  functionality like scan_begin_function / scan_begin_catalog /
  scan_begin_bm - these largely are convenience wrappers around the same
  function, and I don't think that would, or rather should, change in
  any abstracted storage layer.  So I think there needs to be some
  unification first (pretty close w/ beginscan_internal already, but
  perhaps we should get rid of a few of these wrappers).

OK. I will change the API to add a function for beginscan_internal and
replace the usage of the rest of the functions with beginscan_internal.
There are also many bool flags that are passed to beginscan_internal;
I will try to optimize those as well.
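As a sketch of what that unification might look like, the wrapper entry
points would collapse into one callback taking a bitmask; the flag names
below are illustrative only, loosely mirroring the existing bool arguments
of beginscan_internal:

/*
 * Hypothetical unified scan-begin callback; wrappers such as
 * scan_begin_catalog / scan_begin_bm become thin conveniences on top.
 */
#define SO_ALLOW_STRAT      0x01    /* allow use of a buffer access strategy */
#define SO_ALLOW_SYNC       0x02    /* report location to syncscan logic */
#define SO_ALLOW_PAGEMODE   0x04    /* allow page-at-a-time visibility */
#define SO_TYPE_BITMAPSCAN  0x08
#define SO_TYPE_SAMPLESCAN  0x10

typedef HeapScanDesc (*ScanBegin_function) (Relation relation,
                                            Snapshot snapshot,
                                            int nkeys, ScanKey key,
                                            ParallelHeapScanDesc parallel_scan,
                                            uint32 flags);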

 
- Some of the exposed functionality, e.g. scan_getpage,
  scan_update_snapshot, scan_rescan_set_params looks like it should just
  be excised, i.e. there's no reason for it to exist.

Currently these APIs are used only in Bitmap and Sample scans.
These scan methods fully depend on the heap format. I will
check how to remove these APIs.

- Minor: don't think the _function suffix for StorageAmRoutine members
  is necessary, just makes things long, and every member has it. Besides
  that, it's also easy to misunderstand - for a second I understood
  scan_getnext_function to be about getting the next function...

OK. How about adding _hook?

 
- Scans are still represented as HeapScanDesc - I don't think that's
  going to fly. Either this needs to be an opaque type (i.e. a struct
  that's not defined, just forward declared), or it needs to be a base
  struct that individual AMs embed in their own structs.  Individual AMs
  definitely are going to need different pieces of data.

Currently the internal members of HeapScanDesc are used directly
in many places, especially in Bitmap and Sample scans. I am yet to work
out the best way to handle these scan methods; after that, removing
their usage will be easy.
 
Storage AM - tuple stuff:

- tuple_get_{xmin, updated_xid, cmin, itempointer, ctid, heaponly} are
  each individual functions, that seems pretty painful to maintain, and
  very likely to just grow and grow. Not sure what the solution is, but
  this seems like a hard sell.
 
OK. How about adding one API that takes some flags to represent
what type of data is needed from the tuple and returns the corresponding
data as a void *? The caller must cast the data to the corresponding
type before using it.
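A minimal sketch of that flag-based accessor, with invented names (not
from the posted patches); the caller picks the field via an enum and casts
the returned pointer:

typedef enum TupleDataFlag
{
    TUPLE_DATA_XMIN,
    TUPLE_DATA_UPDATED_XID,
    TUPLE_DATA_CMIN,
    TUPLE_DATA_ITEMPOINTER,
    TUPLE_DATA_CTID
} TupleDataFlag;

typedef void *(*TupleGetData_function) (StorageTuple tuple,
                                        TupleDataFlag flag);

/* e.g., somewhere in the executor; the caller casts to the proper type */
TransactionId
storage_tuple_get_xmin(Relation rel, StorageTuple tuple)
{
    return *(TransactionId *)
        rel->rd_stamroutine->tuple_get_data(tuple, TUPLE_DATA_XMIN);
}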

- The three *speculative* functions don't quite seem right to me, nor do
  I understand:
+        *
+        * Setting a tuple's speculative token is a slot-only operation, so no need
+        * for a storage AM method, but after inserting a tuple containing a
+        * speculative token, the insertion must be completed by these routines:
+        */
  I don't see anything related to slots, here?

The tuple_set_speculative_token API is not required. Just updating the slot
member directly with the speculative token is fine, and this value is used
in the tuple_insert API to form the tuple with the speculative token.
Later, with the other two APIs, the insertion is either finished or aborted.
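Sketched as code, the flow being described is roughly the following; the
slot member and AM callback names here are assumptions for illustration:

/*
 * Illustrative only: speculative insertion through the storage AM.
 */
static void
ExecInsertSpeculative(Relation rel, TupleTableSlot *slot, EState *estate,
                      uint32 specToken, bool conflict)
{
    slot->tts_speculativeToken = specToken;     /* slot-only operation */

    /* the AM forms and inserts the tuple carrying the token */
    rel->rd_stamroutine->tuple_insert(rel, slot, estate);

    if (conflict)                   /* e.g. ON CONFLICT lost the race */
        rel->rd_stamroutine->speculative_abort(rel, slot);
    else
        rel->rd_stamroutine->speculative_finish(rel, slot);
}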

 
Storage AM - slot stuff:


- I think there's a number of wrapper functions (slot_getattr,
  slot_getallattrs, getsomeattrs, attisnull) around the same
  functionality - that bloats the API and causes slowdowns. Instead we
  need something like slot_virtualize_tuple(int16 upto), and the rest
  should just be wrappers.


OK. I will change accordingly.
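For illustration, a sketch of how the existing entry points could become
thin wrappers, assuming a hypothetical slot_virtualize_tuple() that
extracts columns up to a given attribute number into tts_values/tts_isnull:

/* hypothetical core routine, provided per slot/storage implementation */
static void slot_virtualize_tuple(TupleTableSlot *slot, int16 upto);

Datum
slot_getattr(TupleTableSlot *slot, int attnum, bool *isnull)
{
    slot_virtualize_tuple(slot, attnum);    /* fill arrays up to attnum */
    *isnull = slot->tts_isnull[attnum - 1];
    return slot->tts_values[attnum - 1];
}

void
slot_getallattrs(TupleTableSlot *slot)
{
    slot_virtualize_tuple(slot, slot->tts_tupleDescriptor->natts);
}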
 
- I think it's wrong to have the slot functionality defined on the
  StorageAm level.  That'll cause even more indirect function calls (=>
  slowness), and besides that the TupleTableSlot struct will need
  members for each and every Am.

  I think the solution is to instead have an Am level "create slot"
  function, and the returned slot is allocated by the Am, with a base
  member of TupleTableSlot with basically just tts_nvalid, tts_values,
  tts_isnull as members.  Those are the only members that can be
  accessed without functions.

  Access to the individual functions (say store_tuple) would then be
  direct members of the TupleTableSlot interface. While that costs a bit
  of memory, it removes one indirection from an already performance
  critical path.

OK. I will change the structure to hold the minimal members that are
accessed irrespective of the storage AM, and the rest of the data will be
behind a void pointer that can be accessed only by the storage AM.
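A minimal sketch of the resulting slot layout, with assumed member names:
only the virtualized fields are directly accessible, the AM-specific state
hangs off an opaque pointer, and the accessors live in the slot itself to
save one indirection:

typedef struct TupleTableSlot
{
    NodeTag     type;
    int         tts_nvalid;     /* # of valid values/isnull entries */
    Datum      *tts_values;
    bool       *tts_isnull;

    /* direct function members, filled in by the AM's "create slot" call */
    void      (*store_tuple) (struct TupleTableSlot *slot,
                              StorageTuple tuple, bool shouldFree);
    void      (*virtualize_tuple) (struct TupleTableSlot *slot, int16 upto);

    void       *tts_storage;    /* AM-private state, e.g. a HeapamSlot */
} TupleTableSlot;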


- MinimalTuples should be one type of slot for the above, except it's
  not created by an StorageAm but by a function returning a
  TupleTableSlot.

  This should remove the need for the slot_copy_min_tuple,
  slot_is_physical_tuple functions.

OK.
 
- Right now TupleTableSlots are an executor datastructure, but
  these patches (rightly!) make it much more widely used. So I think it
  needs to be moved outside of executor/, and probably renamed to
  something like TupleHolder or something.

OK.
 
- The oid integration seems wrong - without an accessor oids won't be
  queryable with this unless you break through the API.  But from a
  higher level view I do wonder if it's not time to remove "special" oid
  columns and replace them with a normal column.  We should be hesitant
  enshrining crusty old concepts in new APIs.

OK.
 
Executor integration:

- I'm quite fearful that this'll cause slowdowns in a few tight paths.
  The most likely cases here seem to be a) bitmap indexscans b)
  indexscans c) highly selective sequential scans.   I do wonder if
  that can be partially addressed by switching out the individual
  executor routines in the relevant scan nodes by something using or
  similar to the infrastructure in cc9f08b6b8


Sorry, I didn't understand this point clearly. Can you provide some more
details?

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

От
Haribabu Kommi
Дата:


On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>>
>> Also, it is quite possible that some of the storage Am's don't even
>> want to return bool as a parameter from HeapTupleSatisfies* API's.  I
>> guess what we need here is to provide a way so that different storage
>> am's can register their function pointer for an equivalent to
>> satisfies function.  So, we need to change
>> SnapshotData.SnapshotSatisfiesFunc in some way so that different
>> handlers can register their function instead of using that directly.
>> I think that should address the problem you are planning to solve by
>> omitting buffer parameter.
>
>
> Thanks for your suggestion. Yes, it is better to go in the direction of
> SnapshotSatisfiesFunc.
>
> I verified the above idea of implementing the tuple visibility functions
> and assigning them into the SnapshotData structure based on the snapshot.
>
> The tuple visibility functions that are specific to the relation are
> available in the RelationData structure, and that structure may not be
> available,
>

Which functions are you referring to here?  I don't see anything in
tqual.h that uses RelationData.


With the storage APIs, the tuple visibility functions are available from
RelationData, and those need to be used to update the SnapshotData
structure's SnapshotSatisfiesFunc member.

But RelationData is not available everywhere a snapshot is created; it is,
however, available every place where tuple visibility is checked. So I
changed the way tuple visibility is checked: the information in the
snapshot is used to call the corresponding tuple visibility function from
RelationData.

If the snapshot specifies MVCC, then the MVCC-specific tuple visibility
function from RelationData is called. The SnapshotSatisfiesFunc member is
changed to an enum that holds the tuple visibility type, such as MVCC,
DIRTY, SELF and so on. Whenever a visibility check is needed, the
corresponding function is called.
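As an illustrative sketch of that dispatch (all identifiers here are
assumptions, and only the MVCC/DIRTY/SELF values come from the text above):

typedef enum SnapshotType
{
    SNAPSHOT_MVCC,
    SNAPSHOT_DIRTY,
    SNAPSHOT_SELF,
    NUM_SNAPSHOT_TYPES
} SnapshotType;

typedef bool (*SnapshotSatisfies_function) (StorageTuple tuple,
                                            Snapshot snapshot,
                                            Buffer buffer);

/*
 * Assumed StorageAmRoutine member: one satisfies callback per snapshot
 * type, e.g.  SnapshotSatisfies_function snapshot_satisfies[NUM_SNAPSHOT_TYPES];
 */
static inline bool
StorageTupleSatisfiesVisibility(Relation rel, StorageTuple tuple,
                                Snapshot snapshot, Buffer buffer)
{
    /* route the check through the relation's storage AM routine */
    return rel->rd_stamroutine->snapshot_satisfies[snapshot->visibility_type]
        (tuple, snapshot, buffer);
}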


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Wed, Aug 23, 2017 at 11:05 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> >
>> > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> Also, it is quite possible that some of the storage Am's don't even
>> >> want to return bool as a parameter from HeapTupleSatisfies* API's.  I
>> >> guess what we need here is to provide a way so that different storage
>> >> am's can register their function pointer for an equivalent to
>> >> satisfies function.  So, we need to change
>> >> SnapshotData.SnapshotSatisfiesFunc in some way so that different
>> >> handlers can register their function instead of using that directly.
>> >> I think that should address the problem you are planning to solve by
>> >> omitting buffer parameter.
>> >
>> >
>> > Thanks for your suggestion. Yes, it is better to go in the direction of
>> > SnapshotSatisfiesFunc.
>> >
>> > I verified the above idea of implementing the Tuple visibility functions
>> > and assign them into the snapshotData structure based on the snapshot.
>> >
>> > The Tuple visibility functions that are specific to the relation are
>> > available
>> > with the RelationData structure and this structure may not be available,
>> >
>>
>> Which functions are you referring here?  I don't see anything in
>> tqual.h that uses RelationData.
>
>
>
> With storage API's, the tuple visibility functions are available with
> RelationData
> and those are needs used to update the SnapshotData structure
> SnapshotSatisfiesFunc member.
>
> But the RelationData is not available everywhere, where the snapshot is
> created,
> but it is available every place where the tuple visibility is checked. So I
> just changed
> the way of checking the tuple visibility with the information of snapshot by
> calling
> the corresponding tuple visibility function from RelationData.
>
> If SnapshotData provides MVCC, then the MVCC specific tuple visibility
> function from
> RelationData is called. The SnapshotSatisfiesFunc member is changed to a
> enum
> that holds the tuple visibility type such as MVCC, DIRTY, SELF and etc.
> Whenever
> the visibility check is needed, the corresponding function is called.
>

It will be easier to understand, and to see if there is some better
alternative, once you have something in the form of a patch.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Wed, Aug 23, 2017 at 11:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
On Wed, Aug 23, 2017 at 11:05 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Mon, Aug 21, 2017 at 7:25 PM, Amit Kapila <amit.kapila16@gmail.com>
> wrote:
>>
>> On Mon, Aug 21, 2017 at 12:58 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> >
>> > On Sun, Aug 13, 2017 at 5:17 PM, Amit Kapila <amit.kapila16@gmail.com>
>> > wrote:
>> >>
>> >>
>> >> Also, it is quite possible that some of the storage Am's don't even
>> >> want to return bool as a parameter from HeapTupleSatisfies* API's.  I
>> >> guess what we need here is to provide a way so that different storage
>> >> am's can register their function pointer for an equivalent to
>> >> satisfies function.  So, we need to change
>> >> SnapshotData.SnapshotSatisfiesFunc in some way so that different
>> >> handlers can register their function instead of using that directly.
>> >> I think that should address the problem you are planning to solve by
>> >> omitting buffer parameter.
>> >
>> >
>> > Thanks for your suggestion. Yes, it is better to go in the direction of
>> > SnapshotSatisfiesFunc.
>> >
>> > I verified the above idea of implementing the Tuple visibility functions
>> > and assign them into the snapshotData structure based on the snapshot.
>> >
>> > The Tuple visibility functions that are specific to the relation are
>> > available
>> > with the RelationData structure and this structure may not be available,
>> >
>>
>> Which functions are you referring here?  I don't see anything in
>> tqual.h that uses RelationData.
>
>
>
> With storage API's, the tuple visibility functions are available with
> RelationData
> and those are needs used to update the SnapshotData structure
> SnapshotSatisfiesFunc member.
>
> But the RelationData is not available everywhere, where the snapshot is
> created,
> but it is available every place where the tuple visibility is checked. So I
> just changed
> the way of checking the tuple visibility with the information of snapshot by
> calling
> the corresponding tuple visibility function from RelationData.
>
> If SnapshotData provides MVCC, then the MVCC specific tuple visibility
> function from
> RelationData is called. The SnapshotSatisfiesFunc member is changed to a
> enum
> that holds the tuple visibility type such as MVCC, DIRTY, SELF and etc.
> Whenever
> the visibility check is needed, the corresponding function is called.
>

It will be easier to understand, and to see if there is some better
alternative, once you have something in the form of a patch.

Sorry for the delay.

I will submit the new patch series, with all the comments given upthread
addressed, to the upcoming commitfest.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Sat, Aug 26, 2017 at 1:34 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I will submit the new patch series, with all the comments given upthread
addressed, to the upcoming commitfest.

Here I have attached a new set of patches rebased to the latest master.

0001-Change-Create-Access-method-to-include-storage-handl:

Add support for the storage method as part of the CREATE ACCESS
METHOD syntax.

0002-Storage-AM-API-hooks-and-related-functions:

The storage AM API hooks that are required to support the pluggable
storage API, and the supporting functions to build the storage
routine.

0003-Adding-storageam-hanlder-to-relation-structure:

Add the storage AM routine pointer to the relation structure and the
functions necessary to initialize the storage AM routine whenever the
relation is built.

0004-Adding-tuple-visibility-function-to-storage-AM:

The tuple visibility functions are moved into the heap storage AM,
and the visibility function in the snapshot structure is changed from
a function pointer to an enum indicating what type of snapshot
it is. Based on that enum, the corresponding visibility function
is executed from the relation's storage AM routine.

0005-slot-hooks-are-added-to-storage-AM:

The slot-specific storage AM routine pointer is added to the slot
structure; some of the members were removed, and a new HeapamSlot
structure was created to hold the necessary tuple information.
Currently the slot also supports the minimal tuple.

This patch may need further changes, as it assumes the tuple is in
HeapTuple format in some APIs. This slot storage AM routine may be
common to all the pluggable storage modules.

0006-Tuple-Insert-API-is-added-to-Storage-AM:

The write support functionality is added to the storage AM.
All the storage AM functions are also extracted into a separate
file to make the code easier to understand while writing and
changing it to support the new APIs.

0007-Scan-functions-are-added-to-storage-AM:

All the scan support functions are added to the storage AM.
These functions are also extracted into the storageam.c file.


Pending comments from Andres:
1. Remove the usage of HeapScanDesc.
2. Remove the usage of the scan_getpage, scan_update_snapshot
and scan_rescan_set_params hooks.
3. In many places the relation is not available while creating the slot,
and the slot shouldn't depend on the relation anyway, because the slot's
values may sometimes come from two different storage relations.
So I modified the slot code to use the same storage mechanism.
4. Move the tuple functionality into a separate folder.


Other pending activities are:
1. Handle Bitmap and Sample scans, as they depend on the heap format.
2. Add some configuration flags to the storage AM; based on these flags
it will be decided whether vacuum can run on these relations or not. This
may be further enhanced to also provide cost parameters that can be
used for planning.


I have attached the individual patches to this mail; in case that
increases the mail size or creates problems for someone, I will attach a
zip archive from next time onward.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Thomas Munro
Date:
On Fri, Sep 1, 2017 at 1:51 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Here I have attached a new set of patches rebased to the latest master.

Hi Haribabu,

Could you please post a new rebased version?
0007-Scan-functions-are-added-to-storage-AM.patch conflicts with
recent changes.

Thanks!

-- 
Thomas Munro
http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Thu, Sep 7, 2017 at 11:53 AM, Thomas Munro <thomas.munro@enterprisedb.com> wrote:
On Fri, Sep 1, 2017 at 1:51 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> Here I have attached a new set of patches rebased to the latest master.

Hi Haribabu,

Could you please post a new rebased version?
0007-Scan-functions-are-added-to-storage-AM.patch conflicts with
recent changes.

Thanks for checking the patch.

I rebased the patch to the latest master and also fixed a duplicate OID
and some slot issues. Updated patches are attached.


Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Sat, Sep 9, 2017 at 1:23 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I rebased the patch to the latest master and also fixed a duplicate OID
and some slot issues. Updated patches are attached.


While analyzing the removal of HeapScanDesc usage outside the heap
modules, I found that the most heavily used member is "rs_cbuf", for the
purposes of locking the buffer, visibility checks, and marking the buffer
dirty. The buffer is tightly integrated with visibility. A buffer may not
be required for some storage routines where the data is always in memory,
and so on.

I found references to its structure members in the following places. I
feel that, other than the following four parameters, the rest need to be
moved into their corresponding storage routines.

Relation rs_rd; /* heap relation descriptor */
Snapshot rs_snapshot; /* snapshot to see */
int rs_nkeys; /* number of scan keys */
ScanKey rs_key; /* array of scan key descriptors */

But currently I am treating "rs_cbuf" as a needed member too, and
expecting all storage routines to provide it. Otherwise we may need
another approach to marking the buffer as dirty.

Suggestions?


Following are the rest of the parameters that are used
outside the heap.

BlockNumber rs_nblocks; /* total number of blocks in rel */

pgstattuple.c, tsm_system_rows.c, tsm_system_time.c, system.c
nodeBitmapheapscan.c nodesamplescan.c, 
Mostly for the purpose of checking the number of blocks in a rel.

BufferAccessStrategy rs_strategy; /* access strategy for reads */

pgstattuple.c

bool rs_pageatatime; /* verify visibility page-at-a-time? */
BlockNumber rs_startblock; /* block # to start at */
bool rs_syncscan; /* report location to syncscan logic? */
bool rs_inited; /* false = scan not init'd yet */

nodesamplescan.c

HeapTupleData rs_ctup; /* current tuple in scan, if any */

genam.c, nodeBitmapHeapscan.c, nodesamplescan.c
Used for retrieving the last scanned tuple.

BlockNumber rs_cblock; /* current block # in scan, if any */

index.c, nodesamplescan.c

Buffer rs_cbuf; /* current buffer in scan, if any */

pgrowlocks.c, pgstattuple.c, genam.c, index.c, cluster.c,
tablecmds.c, nodeBitmapHeapscan.c, nodesamplescan.c
Mostly used for locking the buffer.


ParallelHeapScanDesc rs_parallel; /* parallel scan information */

nodeseqscan.c

int rs_cindex; /* current tuple's index in vistuples */

nodeBitmapHeapScan.c

int rs_ntuples; /* number of visible tuples on page */
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */

tsm_system_rows.c, nodeBitmapHeapscan.c, nodesamplescan.c
Used to retrieve the offsets, mainly in Bitmap and Sample scans.

I think the usage of the rest of the above parameters outside the heap can
be changed once the Bitmap and Sample scans are modified to use the
storage routines for returning tuples instead of their own
implementations. I feel these scans are the major users of the rest of the
parameters. This approach may need some more APIs to get rid of the Bitmap
and Sample scans' own implementations.

Suggestions?

 
Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Tue, Sep 12, 2017 at 3:52 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

int rs_ntuples; /* number of visible tuples on page */
OffsetNumber rs_vistuples[MaxHeapTuplesPerPage]; /* their offsets */

tsm_system_rows.c, nodeBitmapHeapscan.c, nodesamplescan.c
Used for retrieve the offsets mainly Bitmap and sample scans.

I think the usage of the rest of the above parameters outside the heap can
be changed once the Bitmap and Sample scans are modified to use the
storage routines for returning tuples instead of their own
implementations. I feel these scans are the major users of the rest of the
parameters. This approach may need some more APIs to get rid of the Bitmap
and Sample scans' own implementations.

Instead of modifying the Bitmap Heap and Sample scans to avoid referring
to the internal members of the HeapScanDesc, I divided the HeapScanDesc
into two parts:

1. StorageScanDesc
2. HeapPageScanDesc

The StorageScanDesc contains the minimal information that is required
outside the storage routine, and it must be provided by all storage
routines. This structure contains minimal information such as the
relation, snapshot, buffer, and so on.

The HeapPageScanDesc contains the extra information that is required for
Bitmap Heap and Sample scans to work. This structure contains information
about blocks, visible offsets, and so on. Currently this structure is used
only in Bitmap Heap and Sample scans and their supporting contrib modules,
except the pgstattuple module; pgstattuple needs some additional changes.

An additional storage API returns the HeapPageScanDesc required by the
Bitmap Heap and Sample scans, and this API is called only in these two
scans. These scan methods are also chosen by the planner only when the
storage routine supports returning a HeapPageScanDesc. Currently I have
implemented the planner support only for Bitmap scans; it is yet to be
done for Sample scans.

With the above approach, I removed all the references to HeapScanDesc
outside the heap. The changes for this approach are available in
0008-Remove-HeapScanDesc-usage-outside-heap.patch.

Suggestions/comments on the above approach?
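For illustration, the split might look roughly like this in C, using the
members listed in the previous mail; the exact layout is an assumption:

typedef struct StorageScanDesc
{
    Relation    rs_rd;          /* heap relation descriptor */
    Snapshot    rs_snapshot;    /* snapshot to see */
    int         rs_nkeys;       /* number of scan keys */
    ScanKey     rs_key;         /* array of scan key descriptors */
    Buffer      rs_cbuf;        /* current buffer in scan, if any */
} StorageScanDesc;

typedef struct HeapPageScanDesc
{
    BlockNumber rs_nblocks;     /* total number of blocks in rel */
    BlockNumber rs_startblock;  /* block # to start at */
    BlockNumber rs_cblock;      /* current block # in scan, if any */
    HeapTupleData rs_ctup;      /* current tuple in scan, if any */
    int         rs_ntuples;     /* number of visible tuples on page */
    OffsetNumber rs_vistuples[MaxHeapTuplesPerPage];    /* their offsets */
} HeapPageScanDesc;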

Because of a recent commit on master, there was an OID conflict with
the other patches. Rebased patches are attached.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Thu, Sep 14, 2017 at 8:17 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
Instead of modifying the Bitmap Heap and Sample scans to avoid referring
to the internal members of the HeapScanDesc, I divided the HeapScanDesc
into two parts:

1. StorageScanDesc
2. HeapPageScanDesc

The StorageScanDesc contains the minimal information that is required
outside the storage routine, and it must be provided by all storage
routines. This structure contains minimal information such as the
relation, snapshot, buffer, and so on.

The HeapPageScanDesc contains the extra information that is required for
Bitmap Heap and Sample scans to work. This structure contains information
about blocks, visible offsets, and so on. Currently this structure is used
only in Bitmap Heap and Sample scans and their supporting contrib modules,
except the pgstattuple module; pgstattuple needs some additional changes.

An additional storage API returns the HeapPageScanDesc required by the
Bitmap Heap and Sample scans, and this API is called only in these two
scans. These scan methods are also chosen by the planner only when the
storage routine supports returning a HeapPageScanDesc. Currently I have
implemented the planner support only for Bitmap scans; it is yet to be
done for Sample scans.

With the above approach, I removed all the references to HeapScanDesc
outside the heap. The changes for this approach are available in
0008-Remove-HeapScanDesc-usage-outside-heap.patch.

Suggestions/comments on the above approach?

For me, that's an interesting idea.  Naturally, the way BitmapHeapScan and SampleScan work, even at a very high level, is applicable only to some storage AMs (i.e. heap-like storage AMs).  For example, an index-organized table wouldn't ever support BitmapHeapScan, because it refers to tuples by PK values, not TIDs.  However, in this case, the storage AM might have some alternative to our BitmapHeapScan.  So, an index-organized table might have some compressed representation of an ordered set of PK values and use it for bulk fetch from the PK index.

Therefore, I think it would be nice to make BitmapHeapScan a heap-storage-AM-specific scan method while other storage AMs could provide other storage-AM-specific scan methods.  Probably that would be too much for this patchset and should be done during one of the next work cycles on storage AMs (I'm sure that such a huge project as pluggable storage AMs will have multiple iterations).

Similarly, SampleScans contain storage-AM-specific logic.  For instance, our SYSTEM sampling method fetches random blocks from the heap, providing a high-performance way to sample the heap.  Coming back to the example of an index-organized table, it could provide its own storage-AM-specific table sampling methods, including sophisticated PK tree traversal fetching random small ranges of the PK.  Given that tablesample methods are already pluggable, making them storage-AM-specific would lead to user-visible changes; i.e. a tablesample method would be created for a particular storage AM or set of storage AMs.  However, I haven't yet figured out what exactly the API should look like...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Fri, Sep 15, 2017 at 5:10 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Thu, Sep 14, 2017 at 8:17 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
Instead of modifying the Bitmap Heap and Sample scans to avoid referring
to the internal members of the HeapScanDesc, I divided the HeapScanDesc
into two parts:

1. StorageScanDesc
2. HeapPageScanDesc

The StorageScanDesc contains the minimal information that is required
outside the storage routine, and it must be provided by all storage
routines. This structure contains minimal information such as the
relation, snapshot, buffer, and so on.

The HeapPageScanDesc contains the extra information that is required for
Bitmap Heap and Sample scans to work. This structure contains information
about blocks, visible offsets, and so on. Currently this structure is used
only in Bitmap Heap and Sample scans and their supporting contrib modules,
except the pgstattuple module; pgstattuple needs some additional changes.

An additional storage API returns the HeapPageScanDesc required by the
Bitmap Heap and Sample scans, and this API is called only in these two
scans. These scan methods are also chosen by the planner only when the
storage routine supports returning a HeapPageScanDesc. Currently I have
implemented the planner support only for Bitmap scans; it is yet to be
done for Sample scans.

With the above approach, I removed all the references to HeapScanDesc
outside the heap. The changes for this approach are available in
0008-Remove-HeapScanDesc-usage-outside-heap.patch.

Suggestions/comments on the above approach?

For me, that's an interesting idea.  Naturally, the way BitmapHeapScan and SampleScan work, even at a very high level, is applicable only to some storage AMs (i.e. heap-like storage AMs).  For example, an index-organized table wouldn't ever support BitmapHeapScan, because it refers to tuples by PK values, not TIDs.  However, in this case, the storage AM might have some alternative to our BitmapHeapScan.  So, an index-organized table might have some compressed representation of an ordered set of PK values and use it for bulk fetch from the PK index.

Therefore, I think it would be nice to make BitmapHeapScan a heap-storage-AM-specific scan method while other storage AMs could provide other storage-AM-specific scan methods.  Probably that would be too much for this patchset and should be done during one of the next work cycles on storage AMs (I'm sure that such a huge project as pluggable storage AMs will have multiple iterations).

Thanks for your opinion. Yes, my first thought was also to make these
two scan methods part of the storage AMs. I feel the approach of just
exposing some additional hooks doesn't look good; this may need some
better infrastructure for storage AMs to provide their own scan methods.

For this reason, I have currently developed the temporary approach of
separating HeapScanDesc into two structures.
 
Similarly, SampleScans contain storage-AM-specific logic.  For instance, our SYSTEM sampling method fetches random blocks from the heap, providing a high-performance way to sample the heap.  Coming back to the example of an index-organized table, it could provide its own storage-AM-specific table sampling methods, including sophisticated PK tree traversal fetching random small ranges of the PK.  Given that tablesample methods are already pluggable, making them storage-AM-specific would lead to user-visible changes; i.e. a tablesample method would be created for a particular storage AM or set of storage AMs.  However, I haven't yet figured out what exactly the API should look like...

Regarding SampleScans, I feel we can follow the same approach of supporting
particular sample methods with particular storage AMs, similar to the
Bitmap scans. I have not checked this completely yet.

Rebased patches are attached.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
Hi,

Thank you for the review on this subject.  I think it's extremely important for PostgreSQL to eventually get a pluggable storage API.
In general I agree with all your points.  But I'd like to make a couple of comments.

On Tue, Aug 15, 2017 at 9:53 AM, Andres Freund <andres@anarazel.de> wrote:
- I don't think we should introduce this without a user besides
  heapam. The likelihood that API will be usable by anything else
  without a testcase seems fairly remote.  I think some src/test/modules
  type implementation of a per-session, in-memory storage - relatively
  easy to implement - or such is necessary.

+1 for having a user before committing the API.  However, I'd like to note that the sample storage implementation should do something really different from our current heap.  In particular, if the per-session, in-memory storage were just another way to keep the heap in local buffers, it wouldn't be OK for me, because that kind of storage could be achieved far more easily without so complex an API.  But if the per-session, in-memory storage would, for instance, utilize a different MVCC implementation, that would be a very good sample of storage API usage.
 
- Minor: don't think the _function suffix for StorageAmRoutine members
  is necessary, just makes things long, and every member has it. Besides
  that, it's also easy to misunderstand - for a second I understood
  scan_getnext_function to be about getting the next function...

The _function suffix looks long to me too.  But we should look at this question from a uniformity point of view.  FdwRoutine, TsmRoutine, IndexAmRoutine use the _function suffix.  This is why I think we should use the _function suffix for StorageAmRoutine unless we're going to change that for the other *Routines too.
 
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Wed, Aug 23, 2017 at 8:26 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
- Minor: don't think the _function suffix for StorageAmRoutine members
  is necessary, just makes things long, and every member has it. Besides
  that, it's also easy to misunderstand - for a second I understood
  scan_getnext_function to be about getting the next function...

OK. How about adding _hook?

I've answered Andres as to why I think the _function suffix is OK for now.
And I don't particularly like the _hook suffix for this purpose, because those functions are part of the API implementation, not hooks.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
Hi!

I took a look at this patch.  I have the following notes for now.

tuple_insert_hook tuple_insert; /* heap_insert */
tuple_update_hook tuple_update; /* heap_update */
tuple_delete_hook tuple_delete; /* heap_delete */

I don't think this set of functions provides a good enough level of abstraction for a storage AM.  These functions encapsulate only the low-level work of inserting/updating/deleting tuples in the heap itself.  However, it is still assumed that indexes are managed outside of the storage AM.  I don't think this is right, assuming that one of the most demanded usages of the storage API would be a different MVCC implementation (for instance, UNDO-log based as discussed upthread).  A different MVCC implementation is likely to manage indexes in a different way.  For example, a storage AM utilizing UNDO would implement in-place update even when indexed columns are modified.  Therefore this piece of code in ExecUpdate() wouldn't be relevant anymore.

/*
* insert index entries for tuple
*
* Note: heap_update returns the tid (location) of the new tuple in
* the t_self field.
*
* If it's a HOT update, we mustn't insert new index entries.
*/
if ((resultRelInfo->ri_NumIndices > 0) && !storage_tuple_is_heaponly(resultRelationDesc, tuple))
recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid),
  estate, false, NULL, NIL);

I'm firmly convinced that this logic should be encapsulated into the storage AM, together with inserting new index tuples on storage insert.  Also, HOT should be completely encapsulated into heapam.  It's quite evident to me that a storage utilizing UNDO wouldn't have anything like our current HOT.  Therefore, I think there shouldn't be a hot_search_buffer() API function.  tuple_fetch() may cover hot_search_buffer(); that might require some signature change of tuple_fetch() (probably, extra arguments).
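To make the contrast concrete, here is a very rough sketch of an update
path with index maintenance pulled inside the AM; every identifier and
signature detail below is invented for illustration, not a proposal from
the patches:

/* the executor calls one AM method and never sees the HOT decision */
HTSU_Result
storage_tuple_update(Relation rel, ItemPointer otid, TupleTableSlot *slot,
                     CommandId cid, Snapshot crosscheck, bool wait,
                     StorageUpdateFailureData *sufd, EState *estate,
                     List **recheckIndexes)
{
    /*
     * A heapam implementation would do the HOT test internally and fill
     * recheckIndexes only for non-HOT updates; an UNDO-based AM could
     * update in place and touch only the indexes it actually must.
     */
    return rel->rd_stamroutine->tuple_update(rel, otid, slot, cid,
                                             crosscheck, wait, sufd,
                                             estate, recheckIndexes);
}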

LockTupleMode and HeapUpdateFailureData shouldn't be private to heapam.  Any full-weight OLTP storage AM should support our tuple lock modes and should be able to report update failures.  HeapUpdateFailureData should be renamed to something like StorageUpdateFailureData.  The contents of HeapUpdateFailureData seem to me general enough to be supported by any storage with an ItemPointer tuple locator.

storage_setscanlimits() is used only during index build.  I think that since a storage AM may have a different MVCC implementation, the storage AM should decide how to communicate with indexes, including index build.  Therefore, instead of exposing storage_setscanlimits(), the whole IndexBuildHeapScan() should be encapsulated into the storage AM.

Also, BulkInsertState should be private to heapam.  Other storages may have a different state for bulk insert.  At the API level we might have some abstract pointer instead of BulkInsertState, while having GetBulkInsertState and the others as API methods.

storage_freeze_tuple() is called only once, from rewrite_heap_tuple().  That makes me think that a tuple_freeze API method is the wrong abstraction.  We probably should make rewrite_heap_tuple() or even the whole rebuild_relation() an API method...

Heap reloptions are untouched for now.  A storage AM should be able to provide its own specific options, just like index AMs do.

That's all I have for now.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Wed, Sep 27, 2017 at 7:51 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
I took a look at this patch.  I have the following notes for now.

tuple_insert_hook tuple_insert; /* heap_insert */
tuple_update_hook tuple_update; /* heap_update */
tuple_delete_hook tuple_delete; /* heap_delete */

I don't think this set of functions provides a good enough level of abstraction for a storage AM.  These functions encapsulate only the low-level work of inserting/updating/deleting tuples in the heap itself.  However, it is still assumed that indexes are managed outside of the storage AM.  I don't think this is right, assuming that one of the most demanded usages of the storage API would be a different MVCC implementation (for instance, UNDO-log based as discussed upthread).  A different MVCC implementation is likely to manage indexes in a different way.  For example, a storage AM utilizing UNDO would implement in-place update even when indexed columns are modified.  Therefore this piece of code in ExecUpdate() wouldn't be relevant anymore.

    /*
     * insert index entries for tuple
     *
     * Note: heap_update returns the tid (location) of the new tuple in
     * the t_self field.
     *
     * If it's a HOT update, we mustn't insert new index entries.
     */
    if ((resultRelInfo->ri_NumIndices > 0) &&
        !storage_tuple_is_heaponly(resultRelationDesc, tuple))
        recheckIndexes = ExecInsertIndexTuples(slot, &(slot->tts_tid),
                                               estate, false, NULL, NIL);

I'm firmly convinced that this logic should be encapsulated into the storage AM, together with inserting new index tuples on storage insert.  Also, HOT should be completely encapsulated into heapam.  It's quite evident to me that a storage utilizing UNDO wouldn't have anything like our current HOT.  Therefore, I think there shouldn't be a hot_search_buffer() API function.  tuple_fetch() may cover hot_search_buffer(); that might require some signature change of tuple_fetch() (probably extra arguments).

For me, it's a crucial point that pluggable storages should be able to have different MVCC implementations, and correspondingly have full control over their interactions with indexes.
Thus, it would be good if we could get consensus on that point.  I'd like other discussion participants to comment on whether they agree/disagree and why.
Any comments?

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Tom Lane
Date:
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> For me, it's a crucial point that pluggable storages should be able to have
> different MVCC implementations, and correspondingly have full control over
> their interactions with indexes.
> Thus, it would be good if we could get consensus on that point.  I'd like
> other discussion participants to comment on whether they agree/disagree and
> why.
> Any comments?

TBH, I think that's a good way of ensuring that nothing will ever get
committed.  You're trying to draw the storage layer boundary at a point
that will take in most of the system.  If we did build it like that,
what we'd end up with would be very reminiscent of mysql's storage
engines, complete with inconsistent behaviors and varying feature sets
across engines.  I don't much want to go there.
        regards, tom lane



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Mon, Oct 9, 2017 at 5:32 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Alexander Korotkov <a.korotkov@postgrespro.ru> writes:
> For me, it's a crucial point that pluggable storages should be able to have
> different MVCC implementations, and correspondingly have full control over
> their interactions with indexes.
> Thus, it would be good if we could get consensus on that point.  I'd like
> other discussion participants to comment on whether they agree/disagree and
> why.
> Any comments?

TBH, I think that's a good way of ensuring that nothing will ever get
committed.  You're trying to draw the storage layer boundary at a point
that will take in most of the system.  If we did build it like that,
what we'd end up with would be very reminiscent of mysql's storage
engines, complete with inconsistent behaviors and varying feature sets
across engines.  I don't much want to go there.

However, if we insist that pluggable storages should have the same MVCC implementation, interact with indexes the same way, and also use TIDs as tuple identifiers, then what useful implementations might we have?  Per-page heap compression and encryption?  A different heap page layout or tuple format?  OK, but that doesn't justify as wide an API as is implemented in the current version of the patch in this thread.  If we really want to restrict the applicability of pluggable storages that way, then we should probably give up on "pluggable storages" and make it "pluggable heap page format", as I proposed upthread.

Implementation of an alternative storage would be a hard and challenging task.  Yes, it would involve reimplementing a significant part of the system.  But that seems inevitable if we're going to implement really alternative storages (not just hacks over the existing storage).  And I don't think that our pluggable storages would be reminiscent of mysql's storage engines as long as we keep two properties:
1) All the storages use the same WAL stream,
2) All the storages use the same transactions and snapshots.
If we keep these two properties, we would need neither 2PC to run transactions across different storages nor a separate log for replication.  Those two are major drawbacks of the MySQL model.
Varying feature sets across engines seem inevitable and natural.  We have to invent alternative storages to get features that are hard to provide in our current storage.  So, no wonder that feature sets would vary...

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> For me, it's a crucial point that pluggable storages should be able to have
> different MVCC implementations, and correspondingly have full control over
> their interactions with indexes.
> Thus, it would be good if we could get consensus on that point.  I'd like
> other discussion participants to comment on whether they agree/disagree and
> why.
> Any comments?

I think it's good for new storage managers to have full control over
interactions with indexes.  I'm not sure about the MVCC part.  I think
it would be legitimate to want a storage manager to ignore MVCC
altogether - e.g. to build a non-transactional table.  I don't know
that it would be a very good idea to have two different full-fledged
MVCC implementations, though.  Like Tom says, that would be
replicating a lot of the awfulness of the MySQL model.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Peter Geoghegan
Date:
On Wed, Oct 11, 2017 at 1:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov
> <a.korotkov@postgrespro.ru> wrote:
>> For me, it's a crucial point that pluggable storages should be able to have
>> different MVCC implementations, and correspondingly have full control over
>> their interactions with indexes.
>> Thus, it would be good if we could get consensus on that point.  I'd like
>> other discussion participants to comment on whether they agree/disagree and
>> why.
>> Any comments?
>
> I think it's good for new storage managers to have full control over
> interactions with indexes.  I'm not sure about the MVCC part.  I think
> it would be legitimate to want a storage manager to ignore MVCC
> altogether - e.g. to build a non-transactional table.

I agree with Alexander -- if you're going to have a new MVCC
implementation, you have to do significant work within index access
methods. Adding "retail index tuple deletion" is probably just the
beginning. ISTM that you need something like InnoDB's purge thread
when index values change, since two versions of the same index tuple
(each with distinct attribute values) have to physically co-exist for
a time.

> I don't know
> that it would be a very good idea to have two different full-fledged
> MVCC implementations, though.  Like Tom says, that would be
> replicating a lot of the awfulness of the MySQL model.

It's not just the MySQL model, FWIW. SQL-on-Hadoop systems like
Impala, certain NoSQL systems, and AFAIK any database system that
claims to have pluggable storage all do it this way. That is, core
transaction management functions (e.g. MVCC snapshot acquisition) are
outsourced to the storage engine. It *is* very cumbersome, but that's
what they do.

-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Wed, Oct 11, 2017 at 11:08 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Mon, Oct 9, 2017 at 10:22 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> For me, it's a crucial point that pluggable storages should be able to have
> different MVCC implementations, and correspondingly have full control over
> their interactions with indexes.
> Thus, it would be good if we could get consensus on that point.  I'd like
> other discussion participants to comment on whether they agree/disagree and
> why.
> Any comments?

I think it's good for new storage managers to have full control over
interactions with indexes.  I'm not sure about the MVCC part.  I think
it would be legitimate to want a storage manager to ignore MVCC
altogether - e.g. to build a non-transactional table.  I don't know
that it would be a very good idea to have two different full-fledged
MVCC implementations, though.  Like Tom says, that would be
replicating a lot of the awfulness of the MySQL model.

We probably attach different meanings to "MVCC implementation".
While writing "MVCC implementation" I meant that, for instance, an alternative storage
may implement UNDO chains to store versions of the same row.  Correspondingly,
it may not have any analogue of our HOT.

However, I assume that an alternative storage would share our "MVCC model".  So, it
should share our transactional model, including transactions, subtransactions, snapshots etc.
Therefore, if an alternative storage is transactional, then in particular it should be able to fetch the tuple with
a given TID according to a given snapshot.  However, how that is implemented internally is
a black box for us.  Thus, we don't insist that a tuple should have a different TID after an update;
we don't insist there is any analogue of HOT; we don't insist an alternative storage needs vacuum
(or, even if it needs vacuum, it might be performed in a completely different way) and so on.

During conversations with you at PGCon and other conferences I had the impression
that you share this view on pluggable storages and MVCC.  Probably we just express
this view in different words.  Or, alternatively, I might be understanding you terribly wrong.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> It's probably that we imply different meaning to "MVCC implementation".
> While writing "MVCC implementation" I meant that, for instance, alternative
> storage
> may implement UNDO chains to store versions of same row.  Correspondingly,
> it may not have any analogue of our HOT.

Yes, the zheap project on which EnterpriseDB is working has precisely
this characteristic.

> However I imply that alternative storage would share our "MVCC model".  So,
> it
> should share our transactional model including transactions,
> subtransactions, snapshots etc.
> Therefore, if alternative storage is transactional, then in particular it
> should be able to fetch tuple with
> given TID according to given snapshot.  However, how it's implemented
> internally is
> a black box for us.  Thus, we don't insist that tuple should have different
> TID after update;
> we don't insist there is any analogue of HOT; we don't insist alternative
> storage needs vacuum
> (or if even it needs vacuum, it might be performed in completely different
> way) and so on.

Fully agreed.

> During conversations with you at PGCon and other conferences I had
> impression
> that you share this view on pluggable storages and MVCC.  Probably, we just
> express
> this view in different words.  Or alternatively I might understand you
> terribly wrong.

No, it sounds like we are on the same page.  I'm only hoping that we
don't end up with a bunch of storage engines that each use a different
XID space or something icky like that.  I don't think the API should
try to cater to that sort of development.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Fri, Oct 13, 2017 at 8:23 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> It's probably that we imply different meaning to "MVCC implementation".
> While writing "MVCC implementation" I meant that, for instance, alternative
> storage
> may implement UNDO chains to store versions of same row.  Correspondingly,
> it may not have any analogue of our HOT.

Yes, the zheap project on which EnterpriseDB is working has precisely
this characteristic.

> However I imply that alternative storage would share our "MVCC model".  So,
> it
> should share our transactional model including transactions,
> subtransactions, snapshots etc.
> Therefore, if alternative storage is transactional, then in particular it
> should be able to fetch tuple with
> given TID according to given snapshot.  However, how it's implemented
> internally is
> a black box for us.  Thus, we don't insist that tuple should have different
> TID after update;
> we don't insist there is any analogue of HOT; we don't insist alternative
> storage needs vacuum
> (or if even it needs vacuum, it might be performed in completely different
> way) and so on.

Fully agreed.

Currently I added a snapshot_satisfies API to find out whether the tuple
satisfies the visibility or not, with different types of visibility routines.  I feel these
are somehow enough to develop different storage methods like UNDO.
The storage methods can decide internally how to provide the visibility.

+ amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
+ amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
+ amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
+ amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
+ amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
+ amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] = HeapTupleSatisfiesHistoricMVCC;
+ amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] = HeapTupleSatisfiesNonVacuumable;
+
+ amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
+ amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;

Currently no changes are carried out in the snapshot logic, as that is kept separate
from the storage API.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Currently I added a snapshot_satisfies API to find out whether the tuple
> satisfies the visibility or not with different types of visibility routines.
> I feel these
> are somehow enough to develop different storage methods like UNDO.
> The storage methods can decide internally how to provide the visibility.
>
> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] =
> HeapTupleSatisfiesHistoricMVCC;
> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] =
> HeapTupleSatisfiesNonVacuumable;
> +
> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>
> Currently no changes are carried out in snapshot logic as that is kept
> separate
> from storage API.

That seems like a strange choice of API.  I think it should be more
integrated with the scan logic.  For example, if I'm doing an index
scan, and I get a TID, then I should be able to just say "here's a
TID, give me any tuples associated with that TID that are visible to
the scan snapshot".  Then for the current heap it will do
heap_hot_search_buffer, and for zheap it will walk the undo chain and
return the relevant tuple from the chain.
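
For concreteness, a sketch of how an index scan might consume such an interface; storage_fetch_visible() and its signature are made up here for illustration:

/* Sketch: after amgettuple has returned a TID, ask the storage AM for
 * the version visible under the scan snapshot.  heapam would implement
 * this with heap_hot_search_buffer(); zheap would walk the undo chain. */
static StorageTuple
index_fetch_visible(IndexScanDesc scan)
{
    StorageTuple stuple;
    bool         all_dead;

    if (storage_fetch_visible(scan->heapRelation,
                              &scan->xs_ctup.t_self,    /* TID from the index */
                              scan->xs_snapshot,
                              &stuple,
                              &all_dead))
        return stuple;
    return NULL;                /* no visible version for this TID */
}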

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
> Currently I added a snapshot_satisfies API to find out whether the tuple
> satisfies the visibility or not with different types of visibility routines.
> I feel these
> are somehow enough to develop different storage methods like UNDO.
> The storage methods can decide internally how to provide the visibility.
>
> + amroutine->snapshot_satisfies[MVCC_VISIBILITY] = HeapTupleSatisfiesMVCC;
> + amroutine->snapshot_satisfies[SELF_VISIBILITY] = HeapTupleSatisfiesSelf;
> + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
> + amroutine->snapshot_satisfies[TOAST_VISIBILITY] = HeapTupleSatisfiesToast;
> + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] = HeapTupleSatisfiesDirty;
> + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] =
> HeapTupleSatisfiesHistoricMVCC;
> + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] =
> HeapTupleSatisfiesNonVacuumable;
> +
> + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
> + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>
> Currently no changes are carried out in snapshot logic as that is kept
> separate
> from storage API.

That seems like a strange choice of API.  I think it should be more
integrated with the scan logic.  For example, if I'm doing an index
scan, and I get a TID, then I should be able to just say "here's a
TID, give me any tuples associated with that TID that are visible to
the scan snapshot".  Then for the current heap it will do
heap_hot_search_buffer, and for zheap it will walk the undo chain and
return the relevant tuple from the chain.

OK, understood.
I will check along these lines and come up with storage APIs.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Kuntal Ghosh
Date:
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
>
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
>
> OK, understood.
> I will check along these lines and come up with storage APIs.
>
I have some doubts regarding the following function hook:

+typedef bool (*hot_search_buffer_hook) (ItemPointer tid, Relation relation,
+    Buffer buffer, Snapshot snapshot, HeapTuple heapTuple,
+    bool *all_dead, bool first_call);

As per my understanding, with the HOT feature, a new tuple placed on the
same page with all indexed columns the same as its parent row
version does not get new index entries (README.HOT).  For some other
storage engine, if we maintain the older version in different storage
(undo, for example) and don't require a new index entry, should we still
call it a HOT chain?  If not, IMHO, we may have something like
*search_buffer_hook(tid,....,storageTuple,...).  Depending on the
underlying storage, one can traverse the HOT chain, the undo chain, or some
other multi-version strategy for non-key updates.

After a successful index search, most of the index AMs set
(HeapTupleData)xs_ctup->t_self of IndexScanDescData to the tupleid in
the storage. IMHO, some changes are needed here to make it generic.

@@ -328,47 +376,27 @@ ExecStoreTuple(HeapTuple tuple,
 	Assert(tuple != NULL);
 	Assert(slot != NULL);
 	Assert(slot->tts_tupleDescriptor != NULL);
+	Assert(slot->tts_storageslotam != NULL);
 
 	/* passing shouldFree=true for a tuple on a disk page is not sane */
 	Assert(BufferIsValid(buffer) ? (!shouldFree) : true);
 
For some storage engines, isn't it possible that the buffer is valid
and the tuple to be stored is formed in memory (for example, a tuple
formed from UNDO, an in-memory decrypted tuple, etc.)?

-- 
Thanks & Regards,
Kuntal Ghosh
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Peter Geoghegan
Date:
On Thu, Oct 12, 2017 at 2:23 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Thu, Oct 12, 2017 at 4:38 PM, Alexander Korotkov
>> However I imply that alternative storage would share our "MVCC model".  So,
>> it
>> should share our transactional model including transactions,
>> subtransactions, snapshots etc.
>> Therefore, if alternative storage is transactional, then in particular it
>> should be able to fetch tuple with
>> given TID according to given snapshot.  However, how it's implemented
>> internally is
>> a black box for us.  Thus, we don't insist that tuple should have different
>> TID after update;
>> we don't insist there is any analogue of HOT; we don't insist alternative
>> storage needs vacuum
>> (or if even it needs vacuum, it might be performed in completely different
>> way) and so on.
>
> Fully agreed.

If we implement that interface, where does that leave EvalPlanQual()?
Do those semantics have to be preserved?

-- 
Peter Geoghegan



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Fri, Oct 13, 2017 at 5:25 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> For some other
> storage engine, if we maintain the older version in different storage,
> undo for example, and don't require a new index entry, should we still
> call it a HOT chain?

I would say, emphatically, no.  HOT is a creature of the existing
heap.  If it's creeping into storage APIs they are not really
abstracted from what we have currently.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Fully agreed.
>
> If we implement that interface, where does that leave EvalPlanQual()?
> Do those semantics have to be preserved?

For a general-purpose heap storage format, I would say yes.

I mean, we don't really have control over how people use the API.  If
somebody decides to implement a storage API that breaks EvalPlanQual
semantics horribly, I can't stop them, and I don't want to stop them.
Open source FTW.

But I don't really want that code in our tree, either.  I think a
storage engine is and should be about the format in which data gets
stored on disk, and that it should only affect the performance of
queries, not the answers that they give.  I am sure there will be cases
where, for reasons of implementation complexity, that turns out not to
be true, but I think in general we should try to avoid it as much as
we can.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Fri, Oct 13, 2017 at 9:37 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 13, 2017 at 5:25 AM, Kuntal Ghosh
<kuntalghosh.2007@gmail.com> wrote:
> For some other
> storage engine, if we maintain the older version in different storage,
> undo for example, and don't require a new index entry, should we still
> call it a HOT chain?

I would say, emphatically, no.  HOT is a creature of the existing
heap.  If it's creeping into storage APIs they are not really
abstracted from what we have currently.

+1,
a different storage may need to insert entries into only *some* of the indexes,
and these new index entries may have either the same or new TIDs.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> Fully agreed.
>
> If we implement that interface, where does that leave EvalPlanQual()?

At first glance, it seems that pluggable storage should override EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.
 
> Do those semantics have to be preserved?

For a general-purpose heap storage format, I would say yes. 

+1

I mean, we don't really have control over how people use the API.  If
somebody decides to implement a storage API that breaks EvalPlanQual
semantics horribly, I can't stop them, and I don't want to stop them.
Open source FTW.
 
Yeah.  We don't have any kind of "safe extensions".  Any extension can break things really horribly.
For me that means the user should absolutely trust the extension developer.

But I don't really want that code in our tree, either.

We keep things in our tree as correct as we can.  And for sure, we should
follow this policy for pluggable storages too.
 
I think a
storage engine is and should be about the format in which data gets
stored on disk, and that it should only affect the performance of
queries, not the answers that they give.

Pretty much the same idea as index access methods.  They also affect
performance, but not query answers.  When that's not true, the situation
is considered a bug, and it needs to be fixed.
 
I am sure there will be cases
where, for reasons of implementation complexity, that turns out not to
be true, but I think in general we should try to avoid it as much as
we can.

I think in some cases we can tolerate missing features (and document them),
but we can't tolerate wrong features.  For instance, we may have some pluggable
storage which doesn't support transactions at all (and that should be
documented for sure), but we shouldn't have a pluggable storage whose
transaction support is wrong.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Fri, Oct 13, 2017 at 1:58 PM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Fri, Oct 13, 2017 at 11:55 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Thu, Oct 12, 2017 at 8:00 PM, Haribabu Kommi
>> <kommi.haribabu@gmail.com> wrote:
>> > Currently I added a snapshot_satisfies API to find out whether the tuple
>> > satisfies the visibility or not with different types of visibility
>> > routines.
>> > I feel these
>> > are somehow enough to develop different storage methods like UNDO.
>> > The storage methods can decide internally how to provide the visibility.
>> >
>> > + amroutine->snapshot_satisfies[MVCC_VISIBILITY] =
>> > HeapTupleSatisfiesMVCC;
>> > + amroutine->snapshot_satisfies[SELF_VISIBILITY] =
>> > HeapTupleSatisfiesSelf;
>> > + amroutine->snapshot_satisfies[ANY_VISIBILITY] = HeapTupleSatisfiesAny;
>> > + amroutine->snapshot_satisfies[TOAST_VISIBILITY] =
>> > HeapTupleSatisfiesToast;
>> > + amroutine->snapshot_satisfies[DIRTY_VISIBILITY] =
>> > HeapTupleSatisfiesDirty;
>> > + amroutine->snapshot_satisfies[HISTORIC_MVCC_VISIBILITY] =
>> > HeapTupleSatisfiesHistoricMVCC;
>> > + amroutine->snapshot_satisfies[NON_VACUUMABLE_VISIBILTY] =
>> > HeapTupleSatisfiesNonVacuumable;
>> > +
>> > + amroutine->snapshot_satisfiesUpdate = HeapTupleSatisfiesUpdate;
>> > + amroutine->snapshot_satisfiesVacuum = HeapTupleSatisfiesVacuum;
>> >
>> > Currently no changes are carried out in snapshot logic as that is kept
>> > separate
>> > from storage API.
>>
>> That seems like a strange choice of API.  I think it should be more
>> integrated with the scan logic.  For example, if I'm doing an index
>> scan, and I get a TID, then I should be able to just say "here's a
>> TID, give me any tuples associated with that TID that are visible to
>> the scan snapshot".  Then for the current heap it will do
>> heap_hot_search_buffer, and for zheap it will walk the undo chain and
>> return the relevant tuple from the chain.
>
>
> OK, Understood.
> I will check along these lines and come up with storage API's.
>

I think what we need here is a way to register a satisfies function
(SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
That is the core API for deciding visibility with respect to different
storage engines.
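
For reference, the visibility test is already a function pointer carried in SnapshotData today; the sketch below shows what installing an engine-specific callback could look like (zheap_adapt_snapshot and ZHeapTupleSatisfiesMVCC are made-up names for illustration):

/* Already in tqual.h: the per-snapshot visibility test */
typedef bool (*SnapshotSatisfiesFunc) (HeapTuple htup,
                                       Snapshot snapshot, Buffer buffer);

/* Sketch only: an engine installing its own satisfies routine */
static void
zheap_adapt_snapshot(Snapshot snapshot)
{
    snapshot->satisfies = ZHeapTupleSatisfiesMVCC;
}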


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Sat, Oct 14, 2017 at 1:09 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> On Fri, Oct 13, 2017 at 9:41 PM, Robert Haas <robertmhaas@gmail.com> wrote:
>>
>> On Fri, Oct 13, 2017 at 1:59 PM, Peter Geoghegan <pg@bowt.ie> wrote:
>> >> Fully agreed.
>> >
>> > If we implement that interface, where does that leave EvalPlanQual()?
>
>
> At first glance, it seems that pluggable storage should override
> EvalPlanQualFetch(); the rest of EvalPlanQual() looks quite generic.
>

I think there is more to it.  Currently, EState->es_epqTuple is a
HeapTuple which is filled as part of EvalPlanQual mechanism and then
later used during the scan.  We need to make it pluggable in some way
so that other heaps can work.  We also need some work for
EvalPlanQualFetchRowMarks as that also seems to be tightly coupled
with HeapTuple.


-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> I think what we need here is a way to register a satisfies function
> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.

I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
HeapTuple as an argument, and it cares in detail about that tuple's
xmin, xmax, and infomask, and it sets hint bits.  All of that is bad,
because an alternative storage engine is likely to use a different
format than HeapTuple and to not have hint bits (or at least not in
the same form we have them now).  Also, it doesn't necessarily have a
Boolean answer to the question "can this snapshot see this tuple?".
It may be more like "given this TID, what tuple if any can I see
there?" or "given this tuple, what version of it would I see with this
snapshot?".

Another thing to consider is that, if we could replace satisfiesfunc,
it would probably break some existing code.  There are multiple places
in the code that compare snapshot->satisfies to
HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.

I think the storage API should just leave snapshots alone.  If a
storage engine wants to call HeapTupleSatisfiesVisibility() with that
snapshot, it can do so.  Otherwise it can switch on
snapshot->satisfies and handle each case however it likes.  I don't
see how generalizing a Snapshot for other storage engines really buys
us anything except complexity and the danger of reducing performance
for the existing heap.
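
A sketch of what that looks like from the AM's side (all myam_* names are invented for illustration):

/*
 * Sketch: an AM dispatching on the snapshot itself, rather than having
 * the snapshot logic know about the AM's tuple format.
 */
static bool
myam_tuple_visible(void *am_tuple, Snapshot snapshot, Buffer buffer)
{
    if (snapshot->satisfies == HeapTupleSatisfiesMVCC)
        return myam_satisfies_mvcc(am_tuple, snapshot);
    else if (snapshot->satisfies == HeapTupleSatisfiesDirty)
        return myam_satisfies_dirty(am_tuple, snapshot);
    /* ... remaining snapshot types ... */
    elog(ERROR, "snapshot type not supported by this AM");
    return false;               /* keep the compiler quiet */
}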

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Wed, Oct 25, 2017 at 11:37 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Oct 20, 2017 at 5:47 AM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> I think what we need here is a way to register a satisfies function
>> (SnapshotSatisfiesFunc) in SnapshotData for different storage engines.
>
> I don't see how that helps very much.  SnapshotSatisfiesFunc takes a
> HeapTuple as an argument, and it cares in detail about that tuple's
> xmin, xmax, and infomask, and it sets hint bits.  All of that is bad,
> because an alternative storage engine is likely to use a different
> format than HeapTuple and to not have hint bits (or at least not in
> the same form we have them now).  Also, it doesn't necessarily have a
> Boolean answer to the question "can this snapshot see this tuple?".
> It may be more like "given this TID, what tuple if any can I see
> there?" or "given this tuple, what version of it would I see with this
> snapshot?".
>
> Another thing to consider is that, if we could replace satisfiesfunc,
> it would probably break some existing code.  There are multiple places
> in the code that compare snapshot->satisfies to
> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>
> I think the storage API should just leave snapshots alone.  If a
> storage engine wants to call HeapTupleSatisfiesVisibility() with that
> snapshot, it can do so.  Otherwise it can switch on
> snapshot->satisfies and handle each case however it likes.
>

How will it switch satisfies at runtime?  There are places where we
might know which visibility function (*MVCC, *Dirty, etc.) needs to be
called, but I think there are other places (like heap_fetch) where it
is not clear, and we decide based on what is stored in
snapshot->satisfies.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com



Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Wed, Oct 25, 2017 at 1:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Another thing to consider is that, if we could replace satisfiesfunc,
>> it would probably break some existing code.  There are multiple places
>> in the code that compare snapshot->satisfies to
>> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>>
>> I think the storage API should just leave snapshots alone.  If a
>> storage engine wants to call HeapTupleSatisfiesVisibility() with that
>> snapshot, it can do so.  Otherwise it can switch on
>> snapshot->satisfies and handle each case however it likes.
>>
>
> How will it switch satisfies at runtime?  There are places where we
> might know which visibility function (*MVCC, *Dirty, etc.) needs to be
> called, but I think there are other places (like heap_fetch) where it
> is not clear, and we decide based on what is stored in
> snapshot->satisfies.

An alternative storage engine needs to provide its own implementation
of heap_fetch, and that replacement implementation can implement MVCC
and other snapshot behavior in any way it likes.

My point here is that I think it's better if the table access method
stuff doesn't end up modifying snapshots.  I think it's fine for a
table access method to get passed a standard snapshot.  Some code may
be needed to cater to the access method's specific needs, but that
code can live inside the table access method, without contaminating
the snapshot stuff.  We have to try to draw some boundary around table
access methods -- we don't want to end up teaching everything in the
system about them.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Fri, Oct 27, 2017 at 4:06 PM, Robert Haas <robertmhaas@gmail.com> wrote:
On Wed, Oct 25, 2017 at 1:59 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
>> Another thing to consider is that, if we could replace satisfiesfunc,
>> it would probably break some existing code.  There are multiple places
>> in the code that compare snapshot->satisfies to
>> HeapTupleSatisfiesHistoricMVCC and HeapTupleSatisfiesMVCC.
>>
>> I think the storage API should just leave snapshots alone.  If a
>> storage engine wants to call HeapTupleSatisfiesVisibility() with that
>> snapshot, it can do so.  Otherwise it can switch on
>> snapshot->satisfies and handle each case however it likes.
>>
>
> How will it switch satisfies at runtime?  There are places where we
> might know which visibility function (*MVCC, *Dirty, etc.) needs to be
> called, but I think there are other places (like heap_fetch) where it
> is not clear, and we decide based on what is stored in
> snapshot->satisfies.

An alternative storage engine needs to provide its own implementation
of heap_fetch, and that replacement implementation can implement MVCC
and other snapshot behavior in any way it likes.


In the current set of patches, I changed the snapshot->satisfies function
pointer to an enum type.  Based on the snapshot visibility type, internally
the storage AM will call the corresponding visibility function.

Additional changes that are done in the patches compared to the earlier
patches, apart from the rebase:

0004-Adding tuple visibility:

Tuple visibility API functions are reduced to 3.  This still needs further
optimization.

A tuple-satisfies visibility check is added to the heap functions that don't
currently have one.

0006-tuple-insert

Moved the index tuple insertion logic inside the storage AM with a function
pointer, applicable to inserts and updates.  Yet to handle insertion of index
tuples for the multi-insert scenario.

Removed the speculative finish API.  Yet to remove the abort API.


Known pending items:

1. Provide a generic new API like heap_fetch to remove the usage of
heap_hot_search and the visibility functions.
2. Move toast table details into the storage AM; the toast method depends on
the storage.  The toast flattening function also needs to be replaced with
some generic functions.
3. Provide a new API to get the heap tuple from an index or while building
an index; maybe the same API as heap_fetch can satisfy this requirement too.
4. Bulk insert functionality needs a separate API to deal with all storage AMs.
5. Provide a framework to add reloptions based on the storage.
6. Need a generic API to support rewriting the heap (cluster command).

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Michael Paquier
Date:
On Tue, Nov 7, 2017 at 6:34 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Tue, Oct 31, 2017 at 8:59 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>> Additional changes that are done in the patches compared to earlier
>> patches apart from rebase.
>
> Rebased patches are attached.

This set of patches needs again a... Rebase.
-- 
Michael


Re: [HACKERS] Pluggable storage

From
Alvaro Herrera
Date:
Hmm.  Am I reading it right that this discussion led to moving
essentially all code from tqual.c to heapam?  Given the hard time we've
had to get tqual.c right, it seems fundamentally misguided to me to
require that every single storage AM reimplement all the visibility
routines.

I think that changing tqual's API (such as not passing HeapTuples
anymore but some other more general representation) would be okay and
should be sufficient, but this wholesale movement of code seems
dangerous and wasteful in terms of future reimplementations that will be
necessary.
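
One possible shape for such an API change (a sketch with invented names): the shared routines could take a small AM-neutral struct instead of a HeapTuple, which each AM fills in from its own tuple format:

/*
 * Sketch: AM-neutral input for the shared visibility logic, so tqual.c
 * keeps the hard-won snapshot rules in one place.
 */
typedef struct GenericTupleVisibility
{
    TransactionId xmin;     /* inserting transaction */
    TransactionId xmax;     /* deleting/locking transaction, if any */
    CommandId     cid;      /* command id, when relevant */
    uint16        flags;    /* AM-supplied hints, e.g. "xmin committed" */
} GenericTupleVisibility;

extern bool GenericSatisfiesMVCC(GenericTupleVisibility *vis, Snapshot snapshot);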

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Tue, Nov 14, 2017 at 4:12 PM, Alvaro Herrera <alvherre@alvh.no-ip.org> wrote:
> Hmm.  Am I reading it right that this discussion led to moving
> essentially all code from tqual.c to heapam?  Given the hard time we've
> had to get tqual.c right, it seems fundamentally misguided to me to
> require that every single storage AM reimplement all the visibility
> routines.
>
> I think that changing tqual's API (such as not passing HeapTuples
> anymore but some other more general representation) would be okay and
> should be sufficient, but this wholesale movement of code seems
> dangerous and wasteful in terms of future reimplementations that will be
> necessary.
>

I don't think the idea is to touch the existing tqual.c in any significant
way.  However, some other storage engine might need a different way to
check the visibility of tuples, so we need provision for that.  I think
for a storage engine where tuple headers no longer contain transaction
information and/or the old versions of tuples are chained in separate
storage (say undo storage), the current visibility routines can't be used.
I think the current tqual.c is quite tightly coupled with the
HeapTuple representation, so changing that or adding more code to it
for another storage engine with a different tuple format won't be of
much use.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Pluggable storage

From
Michael Paquier
Date:
On Tue, Nov 14, 2017 at 5:09 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Nov 7, 2017 at 6:34 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> On Tue, Oct 31, 2017 at 8:59 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
>> wrote:
>>> Additional changes that are done in the patches compared to earlier
>>> patches apart from rebase.
>>
>> Rebased patches are attached.
>
> This set of patches needs again a... Rebase.

No rebased versions have shown up for two weeks. For now I am marking
this patch as returned with feedback.
-- 
Michael


Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Wed, Nov 29, 2017 at 3:50 PM, Michael Paquier <michael.paquier@gmail.com> wrote:
On Tue, Nov 14, 2017 at 5:09 PM, Michael Paquier
<michael.paquier@gmail.com> wrote:
> On Tue, Nov 7, 2017 at 6:34 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>> On Tue, Oct 31, 2017 at 8:59 PM, Haribabu Kommi <kommi.haribabu@gmail.com>
>> wrote:
>>> Additional changes that are done in the patches compared to earlier
>>> patches apart from rebase.
>>
>> Rebased patches are attached.
>
> This set of patches needs again a... Rebase.

No rebased versions have shown up for two weeks. For now I am marking
this patch as returned with feedback.

I restructured the patch files to avoid showing unnecessary modifications,
and also to make it easier to add new APIs based on all the functions
that are exposed by the heapam module, compared to the earlier structure.

Attached are the latest set of patches. I will work on the remaining pending
items.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Tue, Dec 12, 2017 at 3:06 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I restructured the patch files to avoid showing unnecessary modifications,
and also to make it easier to add new APIs based on all the functions
that are exposed by the heapam module, compared to the earlier structure.

Attached are the latest set of patches. I will work on the remaining pending
items.

Apart from a rebase onto the latest master code, the following are the additional changes:

1. Added an API for bulk insert and rewrite functionality (logical rewrite is not touched yet).
2. Redesigned the tuple lock API interface to remove the traversal logic from the executor module.

The tuple lock API interface changes are from Alexander Korotkov of PostgresPro.
Thanks, Alexander.  We are both currently doing joint development for faster closure of
the open items that are pending to bring the pluggable storage API into good shape.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
Hi!

On Wed, Dec 27, 2017 at 6:54 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Dec 12, 2017 at 3:06 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I restructured the patch files to avoid showing unnecessary modifications,
and also to make it easier to add new APIs based on all the functions
that are exposed by the heapam module, compared to the earlier structure.

Attached are the latest set of patches. I will work on the remaining pending
items.

Apart from a rebase onto the latest master code, the following are the additional changes:

1. Added an API for bulk insert and rewrite functionality (logical rewrite is not touched yet).
2. Redesigned the tuple lock API interface to remove the traversal logic from the executor module.

Great, thank you!
 
The tuple lock API interface changes are from Alexander Korotkov of PostgresPro.
Thanks, Alexander.  We are both currently doing joint development for faster closure of
the open items that are pending to bring the pluggable storage API into good shape.

Thank you for announcing this.  Yes, the pluggable storage API requires a lot of work to get into committable shape.  This is why I've decided to join the development.

Let me explain the idea behind the new tuple lock API and the further patches I plan to send.  As I noted upthread, I consider the possibility of alternative MVCC implementations a vital property of the pluggable storage API.  These include an undo log option, where a tuple is updated in place while the old version is displaced to some special area.  In this case, the new version of the tuple resides at the same TID as the old version.  This is an important point, because TID is not really a tuple identifier anymore.  Naturally, TID becomes a row identifier, while a tuple may be identified by the pair (tid, snapshot).  For the current heap, the snapshot is redundant and can be used just for assert checking (that the tuple on the given tid is really visible using the given snapshot).  For a heap with undo log, the appropriate tuple can be found by snapshot in the undo chain associated with the given tid.

One consequence of the above is that we cannot use the fact that the tid is unchanged after an update as a sign that the tuple was deleted.  This is why I've introduced the HTSU_Result HeapTupleDeleted.  Another consequence is that our tid traversal logic in the executor layer is not valid anymore.  For instance, the traversal from an older tuple to a newer tuple doesn't make any sense for a heap with undo log, where the newer tuple is more easily accessible than the older one.  This is why I decided to hide this logic in the storage layer and provide a TUPLE_LOCK_FLAG_FIND_LAST_VERSION flag which indicates that lock_tuple() has to find the latest updated version and lock it.  I've also changed the follow_updates bool to a more explicit TUPLE_LOCK_FLAG_LOCK_UPDATE_IN_PROGRESS flag, in order not to confuse it with the previous flag, which also kind of follows updates.  A third consequence is that we have to pass a snapshot to the tuple_update() and tuple_delete() methods to let them check whether the row was concurrently updated while residing on the same TID.  I'm going to provide this change as a separate patch.
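
A rough sketch of the reworked lock_tuple() method described above (the argument list is abridged and partly hypothetical; StorageUpdateFailureData is the renamed failure-report struct discussed upthread):

typedef HTSU_Result (*tuple_lock_hook) (Relation relation,
                                        ItemPointer tid,
                                        Snapshot snapshot,    /* identifies the version at this tid */
                                        TupleTableSlot *slot, /* out: the locked tuple */
                                        CommandId cid,
                                        LockTupleMode mode,
                                        LockWaitPolicy wait_policy,
                                        uint8 flags,          /* TUPLE_LOCK_FLAG_... bits */
                                        StorageUpdateFailureData *sufd);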

Also, I appreciate that the tuple_insert() and tuple_update() methods are now responsible for inserting index tuples.  This frees pluggable storages to implement another way of interacting with indexes.  However, I didn't get the point of passing an InsertIndexTuples IndexFunc to them.  Now, we're always passing ExecInsertIndexTuples() in this argument.  As I understood it, a storage is free to either call ExecInsertIndexTuples() or implement its own logic of interaction with indexes.  But I don't understand why we need a callback when tuple_insert() and tuple_update() can call ExecInsertIndexTuples() directly if needed.  Another thing is that tuple_delete() could also interact with indexes (especially once we enhance the index access method API), and we need to pass meta-information about indexes to tuple_delete() too.
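
For reference, a simplified sketch of the insert method with the callback being discussed (the argument lists here are approximate, not the patch's exact signatures):

/* Executor-supplied index-insertion routine, normally ExecInsertIndexTuples() */
typedef List *(*InsertIndexTuples) (TupleTableSlot *slot, EState *estate,
                                    bool noDupErr, bool *specConflict,
                                    List *arbiterIndexes);

/* The AM may invoke IndexFunc - or not - as its MVCC design requires */
typedef Oid (*tuple_insert_hook) (Relation relation, TupleTableSlot *slot,
                                  CommandId cid, int options,
                                  BulkInsertState bistate,
                                  InsertIndexTuples IndexFunc,
                                  EState *estate, List *arbiterIndexes,
                                  List **recheckIndexes);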

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Wed, Dec 27, 2017 at 11:33 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Also, I appreciate that the tuple_insert() and tuple_update() methods are now responsible for inserting index tuples.  This frees pluggable storages to implement another way of interacting with indexes.  However, I didn't get the point of passing an InsertIndexTuples IndexFunc to them.  Now, we're always passing ExecInsertIndexTuples() in this argument.  As I understood it, a storage is free to either call ExecInsertIndexTuples() or implement its own logic of interaction with indexes.  But I don't understand why we need a callback when tuple_insert() and tuple_update() can call ExecInsertIndexTuples() directly if needed.  Another thing is that tuple_delete() could also interact with indexes (especially once we enhance the index access method API), and we need to pass meta-information about indexes to tuple_delete() too.

The main reason I added the callback function is to avoid introducing a
dependency of the storage on executor functions.  This way the storage can call the
function that is passed to it without any knowledge of it.  I added the function pointer
for tuple_delete also in the new patches; currently it is passed as NULL for heap.
These APIs can be enhanced later.

Apart from the rebase, I added a storage shared memory API; currently this API is used
only by syncscan.  Also, all the exposed syncscan functions are no longer
used outside the heap.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Wed, Jan 3, 2018 at 10:08 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Wed, Dec 27, 2017 at 11:33 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Also, I appreciate that the tuple_insert() and tuple_update() methods are now responsible for inserting index tuples.  This frees pluggable storages to implement another way of interacting with indexes.  However, I didn't get the point of passing an InsertIndexTuples IndexFunc to them.  Now, we're always passing ExecInsertIndexTuples() in this argument.  As I understood it, a storage is free to either call ExecInsertIndexTuples() or implement its own logic of interaction with indexes.  But I don't understand why we need a callback when tuple_insert() and tuple_update() can call ExecInsertIndexTuples() directly if needed.  Another thing is that tuple_delete() could also interact with indexes (especially once we enhance the index access method API), and we need to pass meta-information about indexes to tuple_delete() too.

The main reason I added the callback function is to avoid introducing a
dependency of the storage on executor functions.  This way the storage can call the
function that is passed to it without any knowledge of it.  I added the function pointer
for tuple_delete also in the new patches; currently it is passed as NULL for heap.
These APIs can be enhanced later.

Understood, but in order to implement alternative behavior with indexes (for example,
inserting index tuples into only some of the indexes), the storage AM will still have to call executor
functions.  So, yes, this needs to be enhanced.  Probably we just need to implement
a nicer executor API for storage AMs.
 
Apart from the rebase, I added a storage shared memory API; currently this API is used
only by syncscan.  Also, all the exposed syncscan functions are no longer
used outside the heap.

This makes me uneasy.  You introduce two new hooks for size estimation and initialization
of shared memory needed by storage AMs.  But if a storage AM is implemented in a shared library,
then this shared library can use our generic method for allocation of shared memory
(including the memory needed by the storage AM).  If the storage AM is built in, then hooks are also not
needed, because we know all our built-in storage AMs in advance.  For me, it would be
nice to encapsulate the heap AM's shared memory requirements in functions like
HeapAmShmemSize() and HeapAmShmemInit(), and not explicitly show outside that
this memory is needed for synchronized scans.  But separate hooks don't look justified to me.
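
For what it's worth, a sketch of that encapsulation is a two-liner today, since the only shared memory heapam needs is for synchronized scans (SyncScanShmemSize()/SyncScanShmemInit() are the existing routines):

/* Sketch: heapam's shared memory needs, hidden behind AM-level functions */
Size
HeapAmShmemSize(void)
{
    return SyncScanShmemSize();
}

void
HeapAmShmemInit(void)
{
    SyncScanShmemInit();
}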

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Thu, Jan 4, 2018 at 10:00 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Wed, Jan 3, 2018 at 10:08 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Wed, Dec 27, 2017 at 11:33 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:

Also, I appreciate that the tuple_insert() and tuple_update() methods are now responsible for inserting index tuples.  This frees pluggable storages to implement another way of interacting with indexes.  However, I didn't get the point of passing an InsertIndexTuples IndexFunc to them.  Now, we're always passing ExecInsertIndexTuples() in this argument.  As I understood it, a storage is free to either call ExecInsertIndexTuples() or implement its own logic of interaction with indexes.  But I don't understand why we need a callback when tuple_insert() and tuple_update() can call ExecInsertIndexTuples() directly if needed.  Another thing is that tuple_delete() could also interact with indexes (especially once we enhance the index access method API), and we need to pass meta-information about indexes to tuple_delete() too.

The main reason I added the callback function is to avoid introducing a
dependency of the storage on executor functions.  This way the storage can call the
function that is passed to it without any knowledge of it.  I added the function pointer
for tuple_delete also in the new patches; currently it is passed as NULL for heap.
These APIs can be enhanced later.

Understood, but in order to implement alternative behavior with indexes (for example,
inserting index tuples into only some of the indexes), the storage AM will still have to call executor
functions.  So, yes, this needs to be enhanced.  Probably we just need to implement
a nicer executor API for storage AMs.

OK.  
 
Apart from the rebase, I added a storage shared memory API; currently this API is used
only by syncscan.  Also, all the exposed syncscan functions are no longer
used outside the heap.

This makes me uneasy.  You introduce two new hooks for size estimation and initialization
of shared memory needed by storage AMs.  But if a storage AM is implemented in a shared library,
then this shared library can use our generic method for allocation of shared memory
(including the memory needed by the storage AM).  If the storage AM is built in, then hooks are also not
needed, because we know all our built-in storage AMs in advance.  For me, it would be
nice to encapsulate the heap AM's shared memory requirements in functions like
HeapAmShmemSize() and HeapAmShmemInit(), and not explicitly show outside that
this memory is needed for synchronized scans.  But separate hooks don't look justified to me.

Yes, I agree that for the built-in storages there is no need for hooks.  But in the future,
if we want to support multiple storages in an instance, we may need hooks for shared memory
registration.  I am fine with changing it.

Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Thu, Jan 4, 2018 at 8:03 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Thu, Jan 4, 2018 at 10:00 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Wed, Jan 3, 2018 at 10:08 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
Apart from the rebase, I added a storage shared memory API; currently this API is used
only by syncscan.  Also, all the exposed syncscan functions are no longer
used outside the heap.

This makes me uneasy.  You introduce two new hooks for size estimation and initialization
of shared memory needed by storage AMs.  But if a storage AM is implemented in a shared library,
then this shared library can use our generic method for allocation of shared memory
(including the memory needed by the storage AM).  If the storage AM is built in, then hooks are also not
needed, because we know all our built-in storage AMs in advance.  For me, it would be
nice to encapsulate the heap AM's shared memory requirements in functions like
HeapAmShmemSize() and HeapAmShmemInit(), and not explicitly show outside that
this memory is needed for synchronized scans.  But separate hooks don't look justified to me.

Yes, I agree that for the builtin storage's there is no need of hooks. But in future,
if we want to support multiple storage's in an instance, we may need hooks for shared memory
registration. I am fine to change it.

Yes, but we already have hooks for shared memory registration in shared modules.  I don't see the point of another hook for the same purpose.
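For reference, a shared library storage AM could use that existing mechanism along these
lines (a minimal sketch following the usual contrib-module pattern; the MyAmSharedState
struct and the "my_table_am" name are illustrative):

    #include "postgres.h"
    #include "fmgr.h"
    #include "miscadmin.h"
    #include "storage/ipc.h"
    #include "storage/lwlock.h"
    #include "storage/shmem.h"

    PG_MODULE_MAGIC;

    typedef struct MyAmSharedState
    {
        int         something;      /* whatever state the AM needs to share */
    } MyAmSharedState;

    static MyAmSharedState *myam_state = NULL;
    static shmem_startup_hook_type prev_shmem_startup_hook = NULL;

    static void
    myam_shmem_startup(void)
    {
        bool        found;

        if (prev_shmem_startup_hook)
            prev_shmem_startup_hook();

        LWLockAcquire(AddinShmemInitLock, LW_EXCLUSIVE);
        myam_state = ShmemInitStruct("my_table_am",
                                     sizeof(MyAmSharedState), &found);
        if (!found)
            memset(myam_state, 0, sizeof(MyAmSharedState));
        LWLockRelease(AddinShmemInitLock);
    }

    void
    _PG_init(void)
    {
        /* must be loaded via shared_preload_libraries to reserve shmem */
        if (!process_shared_preload_libraries_in_progress)
            return;

        RequestAddinShmemSpace(MAXALIGN(sizeof(MyAmSharedState)));
        prev_shmem_startup_hook = shmem_startup_hook;
        shmem_startup_hook = myam_shmem_startup;
    }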

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Fri, Jan 5, 2018 at 9:55 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Yes, but we already have hooks for shared memory registration in shared modules.
> I don't see the point of another hook for the same purpose.

Oh, yes, I missed it. I will update the patch and share it later.


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
I do not like the way that this patch set uses the word "storage".  In
current usage, storage is a thing that certain kinds of relations
have.  Plain relations (a.k.a. heap tables) have storage, indexes have
storage, materialized views have storage, TOAST tables have storage,
and sequences have storage.  This patch set wants to use "storage AM"
to mean a replacement for a plain heap table, but I think that's going
to create a lot of confusion, because it overlaps heavily with the
existing meaning yet is different.  My suggestion is to call these
"table access methods" rather than "storage access methods".  Then,
the default table AM can be heap.  This has the nice property that you
create an index using CREATE INDEX and the support functions arrive
via an IndexAmRoutine, so correspondingly you would create a table
using CREATE TABLE and the support functions would arrive via a
TableAmRoutine -- so all the names match.

An alternative would be to call the new thing a "heap AM" with
HeapAmRoutine as the support function structure.  That's also not
unreasonable.  In that case, we're deciding that "heap" is not just
the name of the current implementation, but the name of the kind of
thing that backs a table at the storage level.  I don't like that
quite as well because then we've got a class of things called a heap
of which the current and only implementation is called heap, which is
a bit confusing in my view.  But we could probably make it work.

If we adopt the first proposal, it leads to another nice parallel: we
can have src/backend/access/table for those things which are generic
to table AMs, just as we have src/backend/access/index for things
which are generic to index AMs, and then src/backend/access/<am-name>
for things which are specific to a particular AM.
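To illustrate the parallel, a table AM handler might end up looking roughly like this
(a sketch only: the actual contents of TableAmRoutine were still being designed in this
thread, and every field and function name below is illustrative; includes omitted):

    /* Opaque scan state, by analogy with IndexScanDesc. */
    typedef struct TableScanDescData *TableScanDesc;

    typedef struct TableAmRoutine
    {
        NodeTag     type;

        /* scan methods */
        TableScanDesc (*scan_begin) (Relation rel, Snapshot snapshot);
        bool        (*scan_getnext) (TableScanDesc scan, TupleTableSlot *slot);
        void        (*scan_end) (TableScanDesc scan);

        /* tuple methods */
        Oid         (*tuple_insert) (Relation rel, TupleTableSlot *slot,
                                     CommandId cid, int options);
        /* ... update, delete, lock, and slot callbacks ... */
    } TableAmRoutine;

    /* Parallel to an index AM handler returning an IndexAmRoutine: */
    Datum
    heap_tableam_handler(PG_FUNCTION_ARGS)
    {
        TableAmRoutine *routine = palloc0(sizeof(TableAmRoutine));

        /* ... assign the heapam implementations to routine's fields ... */
        PG_RETURN_POINTER(routine);
    }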

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Fri, Jan 5, 2018 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> I do not like the way that this patch set uses the word "storage".  In
> current usage, storage is a thing that certain kinds of relations
> have.  Plain relations (a.k.a. heap tables) have storage, indexes have
> storage, materialized views have storage, TOAST tables have storage,
> and sequences have storage.  This patch set wants to use "storage AM"
> to mean a replacement for a plain heap table, but I think that's going
> to create a lot of confusion, because it overlaps heavily with the
> existing meaning yet is different.

Good point, thank you for noticing that.  The name "storage" is really confusing
for this purpose.
 
> My suggestion is to call these "table access methods" rather than "storage access
> methods".  Then, the default table AM can be heap. [...]

I would vote for the first proposal, table AM, because we might eventually get
index-organized tables that don't have anything like a heap.  So it would be nice
to avoid hardcoding the name "heap".

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Sat, Jan 6, 2018 at 6:31 AM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> On Fri, Jan 5, 2018 at 7:20 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> > I do not like the way that this patch set uses the word "storage". [...]
>
> Good point, thank you for noticing that.  The name "storage" is really confusing
> for this purpose.

Thanks for the review and suggestion.
  
> > My suggestion is to call these "table access methods" rather than "storage access
> > methods".  Then, the default table AM can be heap. [...]

I changed the patches to use "table" instead of "storage", and also renamed all the
structures and exposed functions accordingly for better readability.

Updated patches are attached.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Tue, Jan 9, 2018 at 11:42 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

> Updated patches are attached.

To integrate the columnar store with the pluggable storage API, I found that there are a
couple of other things that also need to be supported.

1. Choosing the right table access method for a particular table

I am thinking of adding a new table option to let the user select the table access method
they want for the table, with HEAP as the default access method.  This approach is
simple and doesn't need any syntax changes.

Or is it fine to add a "USING method" clause to CREATE TABLE, similar to CREATE INDEX?

Comments?

2. The columnar storage needs many other relations to be created along with the main
relation.

These extra relations are internal to the storage and shouldn't be visible directly in
pg_class; they will be stored in storage-specific catalog tables.  A dependency is created
on the original table, as these storage-specific tables must be created/dropped/altered
whenever there is a change to the original table.

Is it fine to add a new API to get control while creating/altering/dropping the table, or
should we use only the existing ProcessUtility hook?


Regards,
Hari Babu
Fujitsu Australia

Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
Hi, Haribabu!

On Mon, Feb 5, 2018 at 2:22 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

> On Tue, Jan 9, 2018 at 11:42 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> > Updated patches are attached.
>
> To integrate the columnar store with the pluggable storage API, I found that there are a
> couple of other things that also need to be supported.
>
> 1. Choosing the right table access method for a particular table
>
> I am thinking of adding a new table option to let the user select the table access method
> they want for the table, with HEAP as the default access method.  This approach is
> simple and doesn't need any syntax changes.
>
> Or is it fine to add a "USING method" clause to CREATE TABLE, similar to CREATE INDEX?

Sure, the user needs to specify a particular table access method when creating a table.  "USING method" is good for me.  I think it's better to reuse already existing syntax rather than invent something new.
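For example, with a hypothetical columnar AM installed, that syntax would read
CREATE TABLE measurements (ts timestamptz, value float8) USING columnar;
while a plain CREATE TABLE would keep using the default heap AM.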
 
> 2. The columnar storage needs many other relations to be created along with the main
> relation.

That's an interesting point.  While experimenting with some new index access methods, I also had to define some kinds of "internal" relations, invisible to the user but essential for the main relation's functionality.  I made them in a hackish manner, and I think a legal mechanism for them would be very good.
 
> These extra relations are internal to the storage and shouldn't be visible directly in
> pg_class; they will be stored in storage-specific catalog tables.  A dependency is created
> on the original table, as these storage-specific tables must be created/dropped/altered
> whenever there is a change to the original table.
 
I think internal relations should be visible in pg_class, otherwise it wouldn't be possible to define dependencies on them.  But they should have a different relkind, so that they don't get mixed up with regular relations.

> Is it fine to add a new API to get control while creating/altering/dropping the table, or
> should we use only the existing ProcessUtility hook?

I believe we need some generic way of defining internal relations, not hooks.  That seems like a useful feature for further development of both index access methods and table access methods.

During the developer meeting [1], I proposed reordering the patches so that the refactoring patches go first and the API introduction patches come afterwards.  That would make it possible to commit some of the refactoring patches earlier without necessarily committing the API in the same release cycle.  If there are no objections, I will do that reordering.

BTW, EnterpriseDB has announced the zheap table access method (heap with undo log) [2].  I think this is great, and I'm looking forward to zheap being published on the mailing lists.  But I'm concerned about its compatibility with the pluggable table access method API.  Does zheap use the table AM API from this thread?  Does it just override the current heap and need to be adapted to use the table AM API?  Or does it implement its own API?

Links


------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Fri, Feb 16, 2018 at 9:56 PM, Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
> Hi, Haribabu!
>
> Sure, the user needs to specify a particular table access method when creating a table.
> "USING method" is good for me.  I think it's better to reuse already existing syntax rather
> than invent something new.

Yes, I also felt the same.  The updated patches include this implementation.
 
 
> I think internal relations should be visible in pg_class, otherwise it wouldn't be possible
> to define dependencies on them.  But they should have a different relkind, so that they
> don't get mixed up with regular relations.

OK.
 
> I believe we need some generic way of defining internal relations, not hooks.  That
> seems like a useful feature for further development of both index access methods and
> table access methods.

How about a new relkind of RELKIND_EXTERNAL ('e') or something similar?  These relations could be created internally by extensions that need extra relations apart from the main relation.  They can have storage, but they cannot be selected directly by the user using SQL commands.

I will try the above approach.
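A minimal sketch of that idea, with an illustrative relkind letter and error message
(not from any posted patch):

    /*
     * Hypothetical relkind for AM-internal relations ('e', as suggested
     * above).  The real constant would live in pg_class.h next to the
     * other relkind letters.
     */
    #define RELKIND_EXTERNAL    'e'

    /*
     * Direct SQL access to such relations could then be rejected where the
     * relation is opened for a query:
     */
    static void
    check_not_internal_relation(Relation rel)
    {
        if (rel->rd_rel->relkind == RELKIND_EXTERNAL)
            ereport(ERROR,
                    (errcode(ERRCODE_WRONG_OBJECT_TYPE),
                     errmsg("cannot access internal relation \"%s\"",
                            RelationGetRelationName(rel))));
    }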
 
> During the developer meeting [1], I proposed reordering the patches so that the
> refactoring patches go first and the API introduction patches come afterwards. [...]

Yes, that will make it easy to review the API patches once the refactoring patches have gone in.
 
> BTW, EnterpriseDB has announced the zheap table access method (heap with undo
> log) [2]. [...] Does zheap use the table AM API from this thread?

I am also not sure which API zheap is implemented on.  But it will be good if we can all come up with table AM APIs that work for all the external storage methods.

The latest rebased patch series is attached.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Robert Haas
Date:
On Fri, Feb 16, 2018 at 5:56 AM, Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> BTW, EnterpriseDB announces zheap table access method (heap with undo log)
> [2].  I think this is great, and I'm looking forward for publishing zheap in
> mailing lists.  But I'm concerning about its compatibility with pluggable
> table access methods API.  Does zheap use table AM API from this thread?  Or
> does it just override current heap and needs to be adopted to use table AM
> API?  Or does it implements own API?

Right now it just hacks the code.  The plan is to adapt it to whatever
API we settle on in this thread.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: [HACKERS] Pluggable storage

From
Alexander Korotkov
Date:
On Fri, Feb 23, 2018 at 2:20 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> Right now it just hacks the code.  The plan is to adapt it to whatever
> API we settle on in this thread.

Great, thank you for the clarification.  I'm looking forward to reviewing zheap :)

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company 

Re: [HACKERS] Pluggable storage

From
David Steele
Date:
On 2/26/18 3:19 AM, Alexander Korotkov wrote:
> On Fri, Feb 23, 2018 at 2:20 AM, Robert Haas <robertmhaas@gmail.com
> <mailto:robertmhaas@gmail.com>> wrote:
> 
>     On Fri, Feb 16, 2018 at 5:56 AM, Alexander Korotkov
>     <a.korotkov@postgrespro.ru <mailto:a.korotkov@postgrespro.ru>> wrote:
>     > BTW, EnterpriseDB announces zheap table access method (heap with undo log)
>     > [2].  I think this is great, and I'm looking forward for publishing zheap in
>     > mailing lists.  But I'm concerning about its compatibility with pluggable
>     > table access methods API.  Does zheap use table AM API from this thread?  Or
>     > does it just override current heap and needs to be adopted to use table AM
>     > API?  Or does it implements own API?
> 
>     Right now it just hacks the code.  The plan is to adapt it to whatever
>     API we settle on in this thread.
> 
> 
> Great, thank you for clarification.  I'm looking forward reviewing zheap :)

I think this entry should be moved to the next CF.  I'll do that
tomorrow unless there are objections.

Regards,
-- 
-David
david@pgmasters.net


Re: [HACKERS] Pluggable storage

From
Michael Paquier
Date:
On Wed, Mar 28, 2018 at 12:23:56PM -0400, David Steele wrote:
> I think this entry should be moved the the next CF.  I'll do that
> tomorrow unless there are objections.

Instead of moving things to the next CF by default, perhaps it would
make more sense to mark things as returned with feedback, as this is the
last CF?  There is a 5-month gap between this commitfest and the next
one, and I am getting afraid of flooding the beginning of the v12
development cycle with entries which keep rotting over time.  If the
author(s) claim that they will be able to work on it, then that's of
course fine.

Sorry for the digression; patches ignored across CFs contribute to the
bloat we see, and they eat the CFM's time.
--
Michael

Attachments

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Thu, Mar 29, 2018 at 11:39 AM, Michael Paquier <michael@paquier.xyz> wrote:
On Wed, Mar 28, 2018 at 12:23:56PM -0400, David Steele wrote:
> I think this entry should be moved the the next CF.  I'll do that
> tomorrow unless there are objections.

> Instead of moving things to the next CF by default, perhaps it would
> make more sense to mark things as returned with feedback, as this is the
> last CF?  There is a 5-month gap between this commitfest and the next
> one, and I am getting afraid of flooding the beginning of the v12
> development cycle with entries which keep rotting over time.

Yes; in particular, I have observed that many of the "ready for committer" patches just
move on from past commitfests without a review from a committer.

 
> If the author(s) claim that they will be able to work on it, then that's of course fine.

In this case, I am planning to work on it.  Here I attach patches rebased onto the latest
master for review.  I will move the patch to the next commitfest in the commitfest app.

The attached patches don't work with the recent JIT changes that went into master:
many members were removed from the TupleTableSlot structure, which affects JIT
tuple deforming.  This is yet to be fixed.

There is another thread, started by Andres, about abstracting the TupleTableSlot
dependency on HeapTuple [1].  Based on the outcome of that thread, these patches
will need an update.


Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
David Steele
Date:
On 3/28/18 8:39 PM, Michael Paquier wrote:
> On Wed, Mar 28, 2018 at 12:23:56PM -0400, David Steele wrote:
>> I think this entry should be moved the the next CF.  I'll do that
>> tomorrow unless there are objections.
>
> Instead of moving things to the next CF by default, perhaps it would
> make more sense to mark things as returned with feedback, as this is the
> last CF?  There is a 5-month gap between this commit fest and the next
> one, I am getting afraid of flooding the beginning of v12 development
> cycle with entries which keep rotting over time.  If the author(s) claim
> that they will be able to work on it, then that's of course fine.

I agree, and I do my best to return patches that have stalled, but I don't
think this patch is in that category.  It has gotten review and has been
kept up to date.  I don't think it's a good fit for v11, but I'd like to
see it in the first CF for v12.

> Sorry for the digression, patches ignored across CFs contribute to the
> bloat we see, and those eat the time of the CFM.

There's no question that bloat has become a problem.  I don't have all
the answers, but vigilance by the CFMs is certainly a good start.

Regards,
--
-David
david@pgmasters.net


Attachments

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Thu, Mar 29, 2018 at 4:54 PM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

> The attached patches don't work with the recent JIT changes that went into master:
> many members were removed from the TupleTableSlot structure, which affects JIT
> tuple deforming.  This is yet to be fixed.
>
> There is another thread, started by Andres, about abstracting the TupleTableSlot
> dependency on HeapTuple [1].  Based on the outcome of that thread, these patches
> will need an update.

Here I attach patches rebased onto the latest master.

Apart from the rebase, I have added support for storing external relations in pg_class.
These are additional relations that may be used by extensions.  Currently these relations
cannot be queried from SQL statements, and they cannot be dumped using pg_dump.
I have yet to check and confirm pg_upgrade handling of these relations.

JIT doesn't work with these patches yet.

Regards,
Hari Babu
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:

On Fri, Apr 20, 2018 at 4:44 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

> Apart from the rebase, I have added support for storing external relations in pg_class.
> These are additional relations that may be used by extensions.  Currently these
> relations cannot be queried from SQL statements, and they cannot be dumped using
> pg_dump.  I have yet to check and confirm pg_upgrade handling of these relations.

Here I attach the patchset rebased onto the latest HEAD.  Apart from the rebase, I tried
to fix JIT support with pluggable storage, but it doesn't work yet.

Thanks, Alexander, for conducting the pluggable table access method discussion in the
unconference session at PGCon; I was not able to attend.  One of my colleagues who
attended the session told me that there was a major discussion around support for the
TOAST and VACUUM features.  I would like to share the state of those two features in
the current patch set.

TOAST:
I have already moved some of the TOAST capabilities into the storage, but the majority
of it still remains to be handled.  The decision of whether to TOAST should be part of the
access method that handles the data.  I will work on it further to provide a better API.

VACUUM:
Not many changes have been made here, apart from moving the vacuum visibility
functions into the storage.  The idea for vacuum was that each access method can
define how it should be performed.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: [HACKERS] Pluggable storage

From
Amit Kapila
Date:
On Thu, Jun 14, 2018 at 1:50 AM, Haribabu Kommi
<kommi.haribabu@gmail.com> wrote:
>
> On Fri, Apr 20, 2018 at 4:44 PM Haribabu Kommi <kommi.haribabu@gmail.com>
> wrote:
>
> VACUUM:
> Not much changes are done in this apart moving the Vacuum visibility
> functions as part of the
> storage. But idea for the Vacuum was with each access method can define how
> it should perform.
>

We are planning to have a somewhat different mechanism for vacuum (for
non-delete-marked indexes), so if you can provide some details or discuss
what you have in mind before implementing it, that would be great.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Pluggable storage

From
AJG
Date:
@Amit

Re: Vacuum etc.

Chrome V8 just released this blog post around concurrent marking, which may
be of interest considering how cpu limited the browser is. Contains
benchmark numbers etc in post as well.

https://v8project.blogspot.com/2018/06/concurrent-marking.html

"This post describes the garbage collection technique called concurrent
marking. The optimization allows a JavaScript application to continue
execution while the garbage collector scans the heap to find and mark live
objects. Our benchmarks show that concurrent marking reduces the time spent
marking on the main thread by 60%–70%"



--
Sent from: http://www.postgresql-archive.org/PostgreSQL-hackers-f1928748.html


Re: Pluggable storage

From
Andres Freund
Date:
On 2018-06-18 12:43:57 -0700, AJG wrote:
> @Amit
> 
> Re: Vacuum etc.
> 
> Chrome V8 just released this blog post around concurrent marking, which may
> be of interest considering how cpu limited the browser is. Contains
> benchmark numbers etc in post as well.
> 
> https://v8project.blogspot.com/2018/06/concurrent-marking.html
> 
> "This post describes the garbage collection technique called concurrent
> marking. The optimization allows a JavaScript application to continue
> execution while the garbage collector scans the heap to find and mark live
> objects. Our benchmarks show that concurrent marking reduces the time spent
> marking on the main thread by 60%–70%"

I don't see how in-memory GC techniques have much bearing on the
discussion here?

Greetings,

Andres Freund


Re: Pluggable storage

From
Amit Kapila
Date:
On Tue, Jun 19, 2018 at 1:13 AM, AJG <ayden@gera.co.nz> wrote:
> @Amit
>
> Re: Vacuum etc.
>
> Chrome V8 just released this blog post around concurrent marking, which may
> be of interest considering how cpu limited the browser is. Contains
> benchmark numbers etc in post as well.
>
> https://v8project.blogspot.com/2018/06/concurrent-marking.html
>
> "This post describes the garbage collection technique called concurrent
> marking. The optimization allows a JavaScript application to continue
> execution while the garbage collector scans the heap to find and mark live
> objects. Our benchmarks show that concurrent marking reduces the time spent
> marking on the main thread by 60%–70%"
>

Thanks for sharing the link, but I don't think it is directly related to the
work we are doing.  Feel free to discuss it on the zheap thread [1], or you
can even start a new thread, because it appears to be a general technique
for improving GC (garbage collection) rather than something directly
related to the zheap (or undo) technology.

[1] - https://www.postgresql.org/message-id/CAA4eK1%2BYtM5vxzSM2NZm%2BpC37MCwyvtkmJrO_yRBQeZDp9Wa2w%40mail.gmail.com

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: [HACKERS] Pluggable storage

From
Haribabu Kommi
Date:


On Thu, Jun 14, 2018 at 12:25 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Thu, Jun 14, 2018 at 1:50 AM, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> >
> > VACUUM:
> > Not many changes have been made here, apart from moving the vacuum visibility
> > functions into the storage.  The idea for vacuum was that each access method can
> > define how it should be performed.
>
> We are planning to have a somewhat different mechanism for vacuum (for
> non-delete-marked indexes), so if you can provide some details or discuss
> what you have in mind before implementing it, that would be great.

OK, thanks for your input.  We will discuss the changes before proceeding to code.

Apart from this, the pluggable storage API patch set contains some refactoring changes
along with the API itself.  Some of the refactoring changes are:

1. Change the snapshot satisfies type from a function pointer to an enum (see the
   sketch after this list).
2. Try to always return a palloc'd tuple instead of a pointer into the buffer
   (this change may have a performance impact, so it can be done later).
3. Perform the tuple visibility check inside the heap for the page-mode scenario as well.
4. Add a new function ExecSlotCompare to compare two slots or tuples by storing them
   in a temporary slot.
5. heap_fetch and heap_lock_tuple return a palloc'd tuple, not a pointer into the buffer.
6. Move the index insertion decision logic into the heap itself (insert, update), not the
   executor.
7. Split HeapScanDesc into two, and remove usage of the second split outside the heap.
8. Move tuple traversal and returning of the updated record into the heap.
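For item 1, the shape of the change might be something like this minimal sketch
(the type and value names are illustrative, not from the patches):

    /*
     * Hypothetical enum replacing the per-snapshot "satisfies" function
     * pointer, so that a storage AM can dispatch on it internally instead
     * of being handed a heap-specific function.
     */
    typedef enum SnapshotSatisfiesKind
    {
        SNAPSHOT_SATISFIES_MVCC,
        SNAPSHOT_SATISFIES_SELF,
        SNAPSHOT_SATISFIES_ANY,
        SNAPSHOT_SATISFIES_DIRTY,
        SNAPSHOT_SATISFIES_NON_VACUUMABLE
    } SnapshotSatisfiesKind;

    /*
     * SnapshotData would then carry the enum instead of a function pointer
     * (sketch; the remaining fields stay as they are today):
     */
    typedef struct SnapshotData
    {
        SnapshotSatisfiesKind satisfies;   /* instead of SnapshotSatisfiesFunc */
        /* ... */
    } SnapshotData;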

Is it fine to create these changes as separate patches, which can go in if the changes
are fine and don't have any impact?

Any comments or additions or deletions to the above list?

Regards,
Haribabu Kommi
Fujitsu Australia