Discussion: Pluggable Storage - Andres's take


Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

As I've previously mentioned I had planned to spend some time to polish
Haribabu's version of the pluggable storage patch and rebase it on the
vtable based slot approach from [1]. While doing so I found more and
more things that I previously hadn't noticed. I started rewriting things
into something closer to what I think we want architecturally.

The current state of my version of the patch is *NOT* ready for proper
review (it doesn't even pass all tests, there's FIXME / elog()s).  But I
think it's getting close enough to its eventual shape that more eyes,
and potentially more hands on keyboards, can be useful.

The most fundamental issues I had with Haribabu's last version from [2]
are the following:

- The use of TableTuple, a typedef from void *, is bad from multiple
  fronts. For one it reduces just about all type safety. There were
  numerous bugs in the patch where things were just cast from HeapTuple
  to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
  a really, really bad idea to introduce a vague type like this for
  development purposes alone, it makes it way too hard to refactor -
  essentially throwing the biggest benefit of type safe languages out of
  the window.

  Additionally I think it's also the wrong approach architecturally. We
  shouldn't assume that a tuple can efficiently be represented as a
  single palloc'ed chunk. In fact, we should move *away* from relying on
  that so much.

  I've thus removed the TableTuple type entirely.


- Previous versions of the patchset exposed Buffers in the tableam.h API,
  performed buffer locking / pinning / ExecStoreTuple() calls outside of
  it.  That is wrong in my opinion, as various AMs will deal very
  differently with buffer pinning & locking. The relevant logic is
  largely moved within the AM.  Bringing me to the next point:


- tableam exposed various operations based on HeapTuple/TableTuple's
  (and their Buffers). This all needs to be slot based, as we can't
  represent the way each AM will deal with this.  I've largely converted
  the API to be slot based.  That has some fallout, but I think largely
  works.  Lots of outdated comments.


- I think the move of the indexing from outside the table layer into the
  storage layer isn't a good idea. It led to having to pass EState into
  the tableam, a callback API to perform index updates, etc.  This seems
  to have at least partially been triggered by the speculative insertion
  codepaths.  I've reverted this part of the changes.  The speculative
  insertion / confirm codepaths are now exposed to tableam.h - I think
  that's the right thing because we'll likely want to have that
  functionality across more than a single tuple in the future.


- The visibility functions relied on the *caller* performing buffer
  locking. That's not a great idea, because generic code shouldn't know
  about the locking scheme a particular AM needs.  I've changed the
  external visibility functions to instead take a slot, and perform the
  necessary locking inside.


- There were numerous tableam callback uses inside heapam.c - that makes
  no sense, we know what the storage is therein.  The relevant


- The integration between index lookups and heap lookups based on the
  results on a index lookup was IMO too tight.  The index code dealt
  with heap tuples, which isn't great.  I've introduced a new concept, a
  'IndexFetchTableData' scan. It's initialized when building an index
  scan, and provides the necessary state (say current heap buffer), to
  do table lookups from within a heap.


- The am of relations required for bootstrapping was set to 0 - I don't
  think that's a good idea. I changed it so it's set to the heap AM as
  well.


- HOT was encoded in the API in a bunch of places. That doesn't look
  right to me. I tried to improve a bit on that, but I'm not yet quite
  sure I like it. Needs written explanation & arguments...


- the heap tableam did a heap_copytuple() nearly everywhere. Leading to
  a higher memory usage, because the resulting tuples weren't freed or
  anything. There might be a reason for doing such a change - we've
  certainly discussed that before - but I'm *vehemently* against doing
  that at the same time we introduce pluggable storage. Analyzing the
  performance effects will be hard enough without changes like this.


- I've for now backed out the heap rewrite changes, partially.  Mostly
  because I didn't like the way the abstraction looks, but haven't quite
  figured out what it should look like.


- I did not like that speculative tokens were moved to slots. There's
  really no reason for them to live outside parameters to tableam.h
  functions.


- lotsa additional smaller changes.


- lotsa new bugs


My current working state is at [3] (urls to clone repo are at [4]).
This is *HEAVILY WIP*. I plan to continue working on it over the next
days, but I'll temporarily focus on v11 work.  If others want I could
move the repo to GitHub and grant others write access.


I think the patch series should eventually look like:

- move vacuumlazy.c (and other similar files) into access/heap, there's
  really nothing generic here. This is a fairly independent task.
- slot'ify FDW RefetchForeignRow_function
- vtable based slot API, based on [1]
- slot'ify trigger API
- redo EPQ based on slots (prototyped in my git tree)
- redo trigger API to be slot based
- tuple traversal API changes
- tableam infrastructure, with error if a non-builtin AM is chosen
- move heap and calling code to be tableam based
- make vacuum callback based (not vacuum.c, just vacuumlazy.c)
- [other patches]
- allow other AMs
- introduce test AM


Tasks / Questions:

- split up patch
- Change heap table AM to not allocate handler function for each table,
  instead allocate it statically. Avoids a significant amount of data
  duplication, and allows for a few more compiler optimizations.
- Merge tableam.h and tableamapi.h and make most tableam.c functions
  small inline functions. Having one-line tableam.c wrappers makes this
  more expensive than necessary. We'll have enough trouble not
  regressing performance-wise.
- change scan level slot creation to use tableam function for doing so
- get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
- COPY's multi_insert path should probably deal with a bunch of slots,
  rather than forming HeapTuples
- bitmap index scans probably need a new tableam.h callback, abstracting
  bitgetpage()
- suspect IndexBuildHeapScan might need to move into the tableam.h API -
  it's not clear to me that it's realistically possible to do this in a
  generic manner.

Greetings,

Andres Freund

[1] http://archives.postgresql.org/message-id/20180220224318.gw4oe5jadhpmcdnm%40alap3.anarazel.de
[2] http://archives.postgresql.org/message-id/CAJrrPGcN5A4jH0PJ-s=6k3+SLA4pozC4HHRdmvU1ZBuA20TE-A@mail.gmail.com
[3] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/pluggable-storage
[4] https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=summary


Re: Pluggable Storage - Andres's take

From:
Haribabu Kommi
Date:

On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

As I've previously mentioned I had planned to spend some time to polish
Haribabu's version of the pluggable storage patch and rebase it on the
vtable based slot approach from [1]. While doing so I found more and
more things that I previously hadn't noticed. I started rewriting things
into something closer to what I think we want architecturally.

Thanks for the deep review and changes.
 
The current state of my version of the patch is *NOT* ready for proper
review (it doesn't even pass all tests, there's FIXME / elog()s).  But I
think it's getting close enough to its eventual shape that more eyes,
and potentially more hands on keyboards, can be useful.

I will try to update it to make sure that it passes all the tests, and also
try to reduce the FIXMEs.
 
The most fundamental issues I had with Haribabu's last version from [2]
are the following:

- The use of TableTuple, a typedef from void *, is bad from multiple
  fronts. For one it reduces just about all type safety. There were
  numerous bugs in the patch where things were just cast from HeapTuple
  to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
  a really, really bad idea to introduce a vague type like this for
  development purposes alone, it makes it way too hard to refactor -
  essentially throwing the biggest benefit of type safe languages out of
  the window.

My earlier intention was to remove the HeapTuple usage entirely and replace
it with slots everywhere outside the tableam. But it ended up as TableTuple
before reaching that stage because of the heavy use of HeapTuple.
 
  Additionally I think it's also the wrong approach architecturally. We
  shouldn't assume that a tuple can efficiently be represented as a
  single palloc'ed chunk. In fact, we should move *away* from relying on
  that so much.

  I've thus removed the TableTuple type entirely.

Thanks for the changes. I didn't check the code yet, so for now whenever a
HeapTuple is required it will be generated from the slot?
 
- Previous versions of the patchset exposed Buffers in the tableam.h API,
  performed buffer locking / pinning / ExecStoreTuple() calls outside of
  it.  That is wrong in my opinion, as various AMs will deal very
  differently with buffer pinning & locking. The relevant logic is
  largely moved within the AM.  Bringing me to the next point:


- tableam exposed various operations based on HeapTuple/TableTuple's
  (and their Buffers). This all needs to be slot based, as we can't
  represent the way each AM will deal with this.  I've largely converted
  the API to be slot based.  That has some fallout, but I think largely
  works.  Lots of outdated comments.

Yes, I agree with you.
 
- I think the move of the indexing from outside the table layer into the
  storage layer isn't a good idea. It led to having to pass EState into
  the tableam, a callback API to perform index updates, etc.  This seems
  to have at least partially been triggered by the speculative insertion
  codepaths.  I've reverted this part of the changes.  The speculative
  insertion / confirm codepaths are now exposed to tableam.h - I think
  that's the right thing because we'll likely want to have that
  functionality across more than a single tuple in the future.


- The visibility functions relied on the *caller* performing buffer
  locking. That's not a great idea, because generic code shouldn't know
  about the locking scheme a particular AM needs.  I've changed the
  external visibility functions to instead take a slot, and perform the
  necessary locking inside. 

When I first moved all the visibility functions into the tableam, I found this
problem, and it will be good if the API takes care of buffer locking and so on.


- There were numerous tableam callback uses inside heapam.c - that makes
  no sense, we know what the storage is therein.  The relevant


- The integration between index lookups and heap lookups based on the
  results on a index lookup was IMO too tight.  The index code dealt
  with heap tuples, which isn't great.  I've introduced a new concept, a
  'IndexFetchTableData' scan. It's initialized when building an index
  scan, and provides the necessary state (say current heap buffer), to
  do table lookups from within a heap.

I agree that the new concept for accessing the heap from the index will be
good.
 
- The am of relations required for bootstrapping was set to 0 - I don't
  think that's a good idea. I changed it so it's set to the heap AM as
  well.


- HOT was encoded in the API in a bunch of places. That doesn't look
  right to me. I tried to improve a bit on that, but I'm not yet quite
  sure I like it. Needs written explanation & arguments...


- the heap tableam did a heap_copytuple() nearly everywhere. Leading to
  a higher memory usage, because the resulting tuples weren't freed or
  anything. There might be a reason for doing such a change - we've
  certainly discussed that before - but I'm *vehemently* against doing
  that at the same time we introduce pluggable storage. Analyzing the
  performance effects will be hard enough without changes like this.

How about using a slot instead of a tuple and reusing it? I don't know
yet whether that is possible everywhere.
 

- I've for now backed out the heap rewrite changes, partially.  Mostly
  because I didn't like the way the abstraction looks, but haven't quite
  figured out how it should look like.


- I did not like that speculative tokens were moved to slots. There's
  really no reason for them to live outside parameters to tableam.h
  functions.


- lotsa additional smaller changes.


- lotsa new bugs

Thanks for all the changes.
 

My current working state is at [3] (urls to clone repo are at [4]).
This is *HEAVILY WIP*. I plan to continue working on it over the next
days, but I'll temporarily focus onto v11 work.  If others want I could
move repo to github and grant others write access.

Yes, I want to access the code and do further development on it.

 

Tasks / Questions:

- split up patch

How about generating refactoring changes as patches first based on
the code in your repo as discussed here[1]?
 
- Change heap table AM to not allocate handler function for each table,
  instead allocate it statically. Avoids a significant amount of data
  duplication, and allows for a few more compiler optimizations.

Some kind of static variable handler for each tableam, but we need to check
how we can access that static handler from the relation.
 
- Merge tableam.h and tableamapi.h and make most tableam.c functions
  small inline functions. Having one-line tableam.c wrappers makes this
  more expensive than necessary. We'll have a big enough trouble not
  regressing performancewise.

OK.
 
- change scan level slot creation to use tableam function for doing so
- get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid

So with this there shouldn't be a slot-to-TID mapping, or there
should be some other way?
 
- COPY's multi_insert path should probably deal with a bunch of slots,
  rather than forming HeapTuples

OK. 

- bitmap index scans probably need a new tableam.h callback, abstracting
  bitgetpage()

OK.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From:
Alexander Korotkov
Date:
Hi!

On Tue, Jul 3, 2018 at 10:06 AM Andres Freund <andres@anarazel.de> wrote:
> As I've previously mentioned I had planned to spend some time to polish
> Haribabu's version of the pluggable storage patch and rebase it on the
> vtable based slot approach from [1]. While doing so I found more and
> more things that I previously hadn't noticed. I started rewriting things
> into something closer to what I think we want architecturally.

Great, thank you for working on this patchset!

> The current state of my version of the patch is *NOT* ready for proper
> review (it doesn't even pass all tests, there's FIXME / elog()s).  But I
> think it's getting close enough to its eventual shape that more eyes,
> and potentially more hands on keyboards, can be useful.
>
> The most fundamental issues I had with Haribabu's last version from [2]
> are the following:
>
> - The use of TableTuple, a typedef from void *, is bad from multiple
>   fronts. For one it reduces just about all type safety. There were
>   numerous bugs in the patch where things were just cast from HeapTuple
>   to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
>   a really, really bad idea to introduce a vague type like this for
>   development purposes alone, it makes it way too hard to refactor -
>   essentially throwing the biggest benefit of type safe languages out of
>   the window.
>
>   Additionally I think it's also the wrong approach architecturally. We
>   shouldn't assume that a tuple can efficiently be represented as a
>   single palloc'ed chunk. In fact, we should move *away* from relying on
>   that so much.
>
>   I've thus removed the TableTuple type entirely.

+1, TableTuple was a vague concept.

> - Previous versions of the patchset exposed Buffers in the tableam.h API,
>   performed buffer locking / pinning / ExecStoreTuple() calls outside of
>   it.  That is wrong in my opinion, as various AMs will deal very
>   differently with buffer pinning & locking. The relevant logic is
>   largely moved within the AM.  Bringing me to the next point:
>
>
> - tableam exposed various operations based on HeapTuple/TableTuple's
>   (and their Buffers). This all needs to be slot based, as we can't
>   represent the way each AM will deal with this.  I've largely converted
>   the API to be slot based.  That has some fallout, but I think largely
>   works.  Lots of outdated comments.

Makes sense to me.  I much prefer passing a TupleTableSlot to the tableam
API functions.

> - I think the move of the indexing from outside the table layer into the
>   storage layer isn't a good idea. It led to having to pass EState into
>   the tableam, a callback API to perform index updates, etc.  This seems
>   to have at least partially been triggered by the speculative insertion
>   codepaths.  I've reverted this part of the changes.  The speculative
>   insertion / confirm codepaths are now exposed to tableam.h - I think
>   that's the right thing because we'll likely want to have that
>   functionality across more than a single tuple in the future.

I agree that passing EState into the tableam doesn't look good.  But I
believe that the tableam needs way more control over indexes than it has
in your version of the patch.  Even if tableam-independent insertion into
indexes on tuple insert is more or less OK, on update we need
something smarter than just inserting index tuples depending on the
"update_indexes" flag.  A tableam-specific update method may decide to
update only some of the indexes.  For example, when zheap performs an
update in-place, it inserts only into the indexes whose fields were
updated.  And I think any undo-log based storage would have similar
behavior.  Moreover, it might be required to do something with existing
index tuples (for instance, as I know, zheap sets a "deleted" flag on
index tuples related to previous values of updated fields).

If we would like to move indexing outside of the tableam, then we might
turn "update_indexes" from a bool into an enum with values like: "don't
insert index tuples", "insert all index tuples", "insert index tuples
only for updated fields" and so on.  But that looks more like a set of
hardcoded cases for particular implementations than a proper API.  So,
probably we shouldn't move indexing outside of the tableam, but rather
provide better wrappers for doing that in the tableam?

> - The visibility functions relied on the *caller* performing buffer
>   locking. That's not a great idea, because generic code shouldn't know
>   about the locking scheme a particular AM needs.  I've changed the
>   external visibility functions to instead take a slot, and perform the
>   necessary locking inside.

Makes sense to me.  But would it cause extra locking/unlocking and in
turn a performance impact?

> - There were numerous tableam callback uses inside heapam.c - that makes
>   no sense, we know what the storage is therein.  The relevant

Ok.

> - The integration between index lookups and heap lookups based on the
>   results on a index lookup was IMO too tight.  The index code dealt
>   with heap tuples, which isn't great.  I've introduced a new concept, a
>   'IndexFetchTableData' scan. It's initialized when building an index
>   scan, and provides the necessary state (say current heap buffer), to
>   do table lookups from within a heap.

+1

> - The am of relations required for bootstrapping was set to 0 - I don't
>   think that's a good idea. I changed it so it's set to the heap AM as
>   well.

+1

> - HOT was encoded in the API in a bunch of places. That doesn't look
>   right to me. I tried to improve a bit on that, but I'm not yet quite
>   sure I like it. Needs written explanation & arguments...

Yes, HOT is a heapam-specific feature.  Other tableams might not have
HOT.  But it appears that we still expose the hot_search_buffer() function
in the tableam API.  That function has no usage, so it's just
redundant and can be removed.

> - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
>   a higher memory usage, because the resulting tuples weren't freed or
>   anything. There might be a reason for doing such a change - we've
>   certainly discussed that before - but I'm *vehemently* against doing
>   that at the same time we introduce pluggable storage. Analyzing the
>   performance effects will be hard enough without changes like this.

I think once we've switched to slots, doing heap_copytuple() so
frequently is not required anymore.

> - I've for now backed out the heap rewrite changes, partially.  Mostly
>   because I didn't like the way the abstraction looks, but haven't quite
>   figured out what it should look like.

Yeah, it's hard part, but we need to invent something in this area...

> - I did not like that speculative tokens were moved to slots. There's
>   really no reason for them to live outside parameters to tableam.h
>   functions.

Good.

> My current working state is at [3] (urls to clone repo are at [4]).
> This is *HEAVILY WIP*. I plan to continue working on it over the next
> days, but I'll temporarily focus onto v11 work.  If others want I could
> move repo to github and grant others write access.

GitHub would be more convenient for me.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Pluggable Storage - Andres's take

From:
Alexander Korotkov
Date:
On Thu, Jul 5, 2018 at 3:25 PM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
> > My current working state is at [3] (urls to clone repo are at [4]).
> > This is *HEAVILY WIP*. I plan to continue working on it over the next
> > days, but I'll temporarily focus onto v11 work.  If others want I could
> > move repo to github and grant others write access.
>
> GitHub would be more convenient for me.

I have another note.  It appears that you left my patch for locking the
last version of a tuple in one call (the heapam_lock_tuple() function)
almost without changes.  During the PGCon 2018 Developer Meeting I
remember you were somewhat unhappy with this approach.  So, do you have
any notes about that for now?
------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

I've pushed up a new version to
https://github.com/anarazel/postgres-pluggable-storage which now passes
all the tests.

Besides a lot of bugfixes, I've rebased the tree, moved TriggerData to
be primarily slot based (with a conversion roundtrip when calling
trigger functions), and a lot of other small things.


On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:
> On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:
> > The current state of my version of the patch is *NOT* ready for proper
> > review (it doesn't even pass all tests, there's FIXME / elog()s).  But I
> > think it's getting close enough to its eventual shape that more eyes,
> > and potentially more hands on keyboards, can be useful.
> >
> 
> I will try to update it to make sure that it passes all the tests and also
> try to
> reduce the FIXME's.

Cool.

Alexander, Haribabu, if you give me (privately) your GitHub accounts,
I'll give you write access to that repository.


> > The most fundamental issues I had with Haribabu's last version from [2]
> > are the following:
> >
> > - The use of TableTuple, a typedef from void *, is bad from multiple
> >   fronts. For one it reduces just about all type safety. There were
> >   numerous bugs in the patch where things were just cast from HeapTuple
> >   to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
> >   a really, really bad idea to introduce a vague type like this for
> >   development purposes alone, it makes it way too hard to refactor -
> >   essentially throwing the biggest benefit of type safe languages out of
> >   the window.
> >
> 
> My earlier intention was to remove the HeapTuple usage entirely and replace
> it with slots everywhere outside the tableam. But it ended up as TableTuple
> before reaching that stage because of the heavy use of HeapTuple.

I don't think that's necessary - a lot of the system catalog accesses
are going to continue to be heap tuple accesses. And the conversions you
did still largely access TableTuples as heap tuples - it was
just that the compiler didn't warn about it anymore.

A prime example of that is the way the rewriteheap / cluster
integration was done. Cluster continued to sort tuples as heap tuples -
even though that's likely incompatible with other tuple formats which
need different state.


> >   Additionally I think it's also the wrong approach architecturally. We
> >   shouldn't assume that a tuple can efficiently be represented as a
> >   single palloc'ed chunk. In fact, we should move *away* from relying on
> >   that so much.
> >
> >   I've thus removed the TableTuple type entirely.
> >
> 
> Thanks for the changes, I didn't check the code yet, so for now whenever the
> HeapTuple is required it will be generated from slot?

Pretty much.


> > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
> >   a higher memory usage, because the resulting tuples weren't freed or
> >   anything. There might be a reason for doing such a change - we've
> >   certainly discussed that before - but I'm *vehemently* against doing
> >   that at the same time we introduce pluggable storage. Analyzing the
> >   performance effects will be hard enough without changes like this.
> >
> 
> How about using of slot instead of tuple and reusing of it? I don't know
> yet whether it is possible everywhere.

Not quite sure what you mean?


> > Tasks / Questions:
> >
> > - split up patch
> >
> 
> How about generating refactoring changes as patches first based on
> the code in your repo as discussed here[1]?

Sure - it was just too much work at the moment ;)


> > - Change heap table AM to not allocate handler function for each table,
> >   instead allocate it statically. Avoids a significant amount of data
> >   duplication, and allows for a few more compiler optimizations.
> >
> 
> Some kind of static variable handler for each tableam, but we need to check
> how we can access that static handler from the relation.

I'm not sure what you mean by "how can we access"? We can just return a
pointer to the constant data from the current handler? Other than
adding a bunch of consts, which would be good, the external interface
wouldn't need to change?


> > - change scan level slot creation to use tableam function for doing so
> > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
> >
> 
> So with this there shouldn't be a slot-to-TID mapping, or there
> should be some other way?

I'm not sure I follow?


> > - bitmap index scans probably need a new tableam.h callback, abstracting
> >   bitgetpage()
> >
> 
> OK.

Any chance you could try to tackle this?  I'm going to be mostly out
this week, so we'd probably not run across each other's feet...

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2018-07-05 15:25:25 +0300, Alexander Korotkov wrote:
> > - I think the move of the indexing from outside the table layer into the
> >   storage layer isn't a good idea. It led to having to pass EState into
> >   the tableam, a callback API to perform index updates, etc.  This seems
> >   to have at least partially been triggered by the speculative insertion
> >   codepaths.  I've reverted this part of the changes.  The speculative
> >   insertion / confirm codepaths are now exposed to tableam.h - I think
> >   that's the right thing because we'll likely want to have that
> >   functionality across more than a single tuple in the future.
> 
> I agree that passing EState into tableam doesn't look good.  But I
> believe that tableam needs way more control over indexes than it has
> in your version patch.  If even tableam-independent insertion into
> indexes on tuple insert is more or less ok, but on update we need
> something smarter rather than just insert index tuples depending
> "update_indexes" flag.  Tableam specific update method may decide to
> update only some of indexes.  For example, when zheap performs update
> in-place then it inserts to only indexes, whose fields were updated.
> And I think any undo-log based storage would have similar behavior.
> Moreover, it might be required to do something with existing index
> tuples (for instance, as I know, zheap sets "deleted" flag to index
> tuples related to previous values of updated fields).

I agree that we probably need more - I'm just inclined to think that we
need a more concrete target to work against.  Currently zheap's indexing
logic is still fairly naive; I don't think we'll get the interface right
without having worked further on the zheap side of things.


> > - The visibility functions relied on the *caller* performing buffer
> >   locking. That's not a great idea, because generic code shouldn't know
> >   about the locking scheme a particular AM needs.  I've changed the
> >   external visibility functions to instead take a slot, and perform the
> >   necessary locking inside.
> 
> Makes sense to me.  But would it cause extra locking/unlocking and in
> turn a performance impact?

I don't think so - nearly all the performance-relevant cases do all the
visibility logic inside the AM. Where possible, the underlying functions
that do not do the locking can be used.  Pretty much all the converted
places just had manual LockBuffer calls.


> > - HOT was encoded in the API in a bunch of places. That doesn't look
> >   right to me. I tried to improve a bit on that, but I'm not yet quite
> >   sure I like it. Needs written explanation & arguments...
> 
> Yes, HOT is a heapam-specific feature.  Other tableams might not have
> HOT.  But it appears that we still expose the hot_search_buffer() function
> in the tableam API.  That function has no usage, so it's just
> redundant and can be removed.

Yea, that was a leftover.


> > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
> >   a higher memory usage, because the resulting tuples weren't freed or
> >   anything. There might be a reason for doing such a change - we've
> >   certainly discussed that before - but I'm *vehemently* against doing
> >   that at the same time we introduce pluggable storage. Analyzing the
> >   performance effects will be hard enough without changes like this.
> 
> I think once we've switched to slots, doing heap_copytuple() so
> frequently is not required anymore.

It's mostly gone now.


> > - I've for now backed out the heap rewrite changes, partially.  Mostly
> >   because I didn't like the way the abstraction looks, but haven't quite
> >   figured out what it should look like.
> 
> Yeah, it's hard part, but we need to invent something in this area...

I agree. But I really don't yet quite know what. I somewhat wonder if we
should just add a cluster_rel() callback to the tableam and let it deal
with everything :(.   As previously proposed the interface wouldn't have
worked with anything not losslessly encodable into a heaptuple, which is
unlikely to be sufficient.


FWIW, I plan to be mostly out until Thursday this week, and then I'll
rebase onto the new version of the abstract slot patch and then try to
split up the patchset. Once that's done, I'll do a prototype conversion
of zheap, which I'm sure will show up a lot of weaknesses in the current
abstraction.  Once that's done I hope we can collaborate / divide &
conquer to get the individual pieces into commit shape.

If either of you wants to get a head start separating something out,
let's try to organize who would do what? The EPQ and trigger
slotification are probably good candidates.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Mon, Jul 16, 2018 at 11:35 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:
> On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:

> > The most fundamental issues I had with Haribabu's last version from [2]
> > are the following:
> >
> > - The use of TableTuple, a typedef from void *, is bad from multiple
> >   fronts. For one it reduces just about all type safety. There were
> >   numerous bugs in the patch where things were just cast from HeapTuple
> >   to TableTuple to HeapTuple (and even to TupleTableSlot).  I think it's
> >   a really, really bad idea to introduce a vague type like this for
> >   development purposes alone, it makes it way too hard to refactor -
> >   essentially throwing the biggest benefit of type safe languages out of
> >   the window.
> >
>
> My earlier intention was to remove HeapTuple usage entirely and replace
> it with slots everywhere outside the tableam. But it ended up as TableTuple
> before reaching that stage, because of the heavy use of HeapTuple.

I don't think that's necessary - a lot of the system catalog accesses
are going to continue to be heap tuple accesses. And the conversions you
did significantly continue to access TableTuples as heap tuples - it was
just that the compiler didn't warn about it anymore.

A prime example of that is the way the rewriteheap / cluster
integration was done. Cluster continued to sort tuples as heap tuples -
even though that's likely incompatible with other tuple formats which
need different state.

OK. Understood.
 
> > - the heap tableam did a heap_copytuple() nearly everywhere. Leading to
> >   a higher memory usage, because the resulting tuples weren't freed or
> >   anything. There might be a reason for doing such a change - we've
> >   certainly discussed that before - but I'm *vehemently* against doing
> >   that at the same time we introduce pluggable storage. Analyzing the
> >   performance effects will be hard enough without changes like this.
> >
>
> How about using of slot instead of tuple and reusing of it? I don't know
> yet whether it is possible everywhere.

Not quite sure what you mean?

I thought that using slots everywhere could reduce the use of heap_copytuple;
I understand from your reply in the other mail that you have already made
those changes.
 

> > Tasks / Questions:
> >
> > - split up patch
> >
>
> How about generating refactoring changes as patches first based on
> the code in your repo as discussed here[1]?

Sure - it was just at the moment too much work ;)

Yes, it is too much work. How about doing this once most of the
open items are finished?
 

> > - Change heap table AM to not allocate handler function for each table,
> >   instead allocate it statically. Avoids a significant amount of data
> >   duplication, and allows for a few more compiler optimizations.
> >
>
> Some kind of static handler variable for each tableam, but we need to check
> how we can access that static handler from the relation.

I'm not sure what you mean by "how can we access"? We can just return a
pointer from the constant data from the current handler? Except that
adding a bunch of consts would be good, the external interface wouldn't
need to change?

I mean that we may need to store some tableam ID in each table, so that based
on that ID we can get the static tableam handler, because at any given time
there may be tables from different tableam methods.
 

> > - change scan level slot creation to use tableam function for doing so
> > - get rid of slot->tts_tid, tts_tupleOid and potentially tts_tableOid
> >
>
> so with this there shouldn't be a way from slot to tid mapping or there
> should be some other way.

I'm not sure I follow?

Regarding replacing heaptuple with slot: currently only the tid is passed to
the tableam methods via the slot. To get rid of the tid from the slot, we may
need some other way of passing it?
 
> > - bitmap index scans probably need a new tableam.h callback, abstracting
> >   bitgetpage()
> >
>
> OK.

Any chance you could try to tackle this?  I'm going to be mostly out
this week, so we'd probably not run across each other's feet...


OK, I will take care of the above point.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Tue, Jul 17, 2018 at 11:01 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Mon, Jul 16, 2018 at 11:35 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-07-04 20:11:21 +1000, Haribabu Kommi wrote:
> On Tue, Jul 3, 2018 at 5:06 PM Andres Freund <andres@anarazel.de> wrote:

> > - bitmap index scans probably need a new tableam.h callback, abstracting
> >   bitgetpage()
> >
>
> OK.

Any chance you could try to tackle this?  I'm going to be mostly out
this week, so we'd probably not run across each other's feet...

OK, I will take care of the above point.

I added a new API in tableam.h to get all the visible tuples on a page, to
abstract the bitgetpage() function.

>>- Merge tableam.h and tableamapi.h and make most tableam.c functions
>> small inline functions. Having one-line tableam.c wrappers makes this
>> more expensive than necessary. We'll have a big enough trouble not
>> regressing performancewise.

I merged tableam.h and tableamapi.h into tableam.h and changed all the
functions to inline. This change may have added some additional header
dependencies; I will check whether I can remove them.

Attached are the updated patches on top of your github tree.

Currently I am working on the following.
- I observed that there is a crash when running isolation tests.
- COPY's multi_insert path should probably deal with a bunch of slots,
  rather than forming HeapTuples

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Tue, Jul 24, 2018 at 11:31 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Tue, Jul 17, 2018 at 11:01 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I added new API in the tableam.h to get all the page visible tuples to
abstract the bitgetpage() function.

>>- Merge tableam.h and tableamapi.h and make most tableam.c functions
>> small inline functions. Having one-line tableam.c wrappers makes this
>> more expensive than necessary. We'll have a big enough trouble not
>> regressing performancewise.

I merged tableam.h and tableamapi.h into tableam.h and changed all the
functions as inline. This change may have added some additional headers,
will check them if I can remove their need.

Attached are the updated patches on top your github tree.

Currently I am working on the following.
- I observed that there is a crash when running isolation tests.

While investigating the crash, I observed that it is due to the many FIXMEs in
the code. So I made only minimal changes and am looking into correcting
the FIXMEs first.

One thing I observed is that a missing relation pointer leads to a crash in
the flow of the EvalPlan* functions, because not all ROW_MARK types contain a
relation pointer.

I will continue to work through the FIXME fixes.
 
- COPY's multi_insert path should probably deal with a bunch of slots,
  rather than forming HeapTuples

Implemented support for slots in the COPY multi-insert path.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

I'm currently in the process of rebasing zheap onto the pluggable
storage work. The goal, which seems to work surprisingly well, is to
find issues that the current pluggable storage patch doesn't yet deal
with.  I plan to push a tree including a lot of fixes and improvements
soon.

On 2018-08-03 12:35:50 +1000, Haribabu Kommi wrote:
> while investing the crash, I observed that it is due to the lot of FIXME's
> in
> the code. So I just fixed minimal changes and looking into correcting
> the FIXME's first.
> 
> One thing I observed is lack relation pointer is leading to crash in the
> flow of EvalPlan* functions, because all ROW_MARK types doesn't
> contains relation pointer.
> 
> will continue to check all FIXME fixes.

Thanks.


> > - COPY's multi_insert path should probably deal with a bunch of slots,
> >   rather than forming HeapTuples
> >
> 
> Implemented supporting of slots in the copy multi insert path.

Cool. I've not yet looked at it, but I plan to do so soon.  Will have to
rebase over the other copy changes first :(


- Andres


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

I'm currently in the process of rebasing zheap onto the pluggable
storage work. The goal, which seems to work surprisingly well, is to
find issues that the current pluggable storage patch doesn't yet deal
with.  I plan to push a tree including a lot of fixes and improvements
soon.

Sorry for coming late to this thread.

That's good. Did you find any problems in porting zheap onto pluggable
storage? Does it need any API changes, or require new APIs?
 
On 2018-08-03 12:35:50 +1000, Haribabu Kommi wrote:
> while investing the crash, I observed that it is due to the lot of FIXME's
> in
> the code. So I just fixed minimal changes and looking into correcting
> the FIXME's first.
>
> One thing I observed is lack relation pointer is leading to crash in the
> flow of EvalPlan* functions, because all ROW_MARK types doesn't
> contains relation pointer.
>
> will continue to check all FIXME fixes.

Thanks.

I fixed some of the isolation test problems. All the issues are related to
EPQ slot handling. Still more needs to be fixed.

Do the new TupleTableSlot abstraction patches in the recent thread [1] fix any
of these issues? Then I can look into changing the FDW API to return a slot
instead of a tuple.


> > - COPY's multi_insert path should probably deal with a bunch of slots,
> >   rather than forming HeapTuples
> >
>
> Implemented supporting of slots in the copy multi insert path.

Cool. I've not yet looked at it, but I plan to do so soon.  Will have to
rebase over the other copy changes first :(

OK. Understood. There are many changes in the COPY flow that conflict
with my changes. Please let me know once you have done the rebase; I can
fix those conflicts and regenerate the patch.

Attached is the patch with further fixes.

[1] - https://www.postgresql.org/message-id/CAFjFpRcNPQ1oOL41-HQYaEF%3DNq6Vbg0eHeFgopJhHw_X2usA5w%40mail.gmail.com

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm currently in the process of rebasing zheap onto the pluggable
> > storage work. The goal, which seems to work surprisingly well, is to
> > find issues that the current pluggable storage patch doesn't yet deal
> > with.  I plan to push a tree including a lot of fixes and improvements
> > soon.
> >
> 
> Sorry for coming late to this thread.

No worries.


> That's good. Did you find any problems in porting zheap into pluggable
> storage? Does it needs any API changes or new API requirement?

A lot, yes. The big changes are:
- removal of HeapPageScanDesc
- introduction of explicit support functions for tablesample & bitmap scans
- introduction of callbacks for vacuum_rel, cluster

And quite a bit more along those lines.

> Does the new TupleTableSlot abstraction patches has fixed any of these
> issues in the recent thread [1]? so that I can look into the change of
> FDW API to return slot instead of tuple.

Yea, that'd be a good thing to start with.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > I'm currently in the process of rebasing zheap onto the pluggable
> > storage work. The goal, which seems to work surprisingly well, is to
> > find issues that the current pluggable storage patch doesn't yet deal
> > with.  I plan to push a tree including a lot of fixes and improvements
> > soon.
> >
> That's good. Did you find any problems in porting zheap into pluggable
> storage? Does it needs any API changes or new API requirement?

A lot, yes. The big changes are:
- removal of HeapPageScanDesc
- introduction of explicit support functions for tablesample & bitmap scans
- introduction of callbacks for vacuum_rel, cluster

And quite a bit more along those lines.

OK. Those are quite a bit of changes.
 
> Does the new TupleTableSlot abstraction patches has fixed any of these
> issues in the recent thread [1]? so that I can look into the change of
> FDW API to return slot instead of tuple.

Yea, that'd be a good thing to start with.

I found that only the RefetchForeignRow API needs the change, and have done so.
Along with that, I fixed all the issues found by running make check-world.
Patches for these are attached.

Now I will look into the remaining FIXMEs that don't conflict with your further changes.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2018-08-24 11:55:41 +1000, Haribabu Kommi wrote:
> On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> > > On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I'm currently in the process of rebasing zheap onto the pluggable
> > > > storage work. The goal, which seems to work surprisingly well, is to
> > > > find issues that the current pluggable storage patch doesn't yet deal
> > > > with.  I plan to push a tree including a lot of fixes and improvements
> > > > soon.
> > > >
> > > That's good. Did you find any problems in porting zheap into pluggable
> > > storage? Does it needs any API changes or new API requirement?
> >
> > A lot, yes. The big changes are:
> > - removal of HeapPageScanDesc
> > - introduction of explicit support functions for tablesample & bitmap scans
> > - introduction of callbacks for vacuum_rel, cluster
> >
> > And quite a bit more along those lines.
> >
> 
> OK. Those are quite a bit of changes.

I've pushed a current version of that to my git tree to the
pluggable-storage branch. It's not really a version that I think makes
sense to review or such, but it's probably more useful if you work based
on that.  There's also the pluggable-zheap branch, which I found
extremely useful to develop against.

There's a few further changes since last time:
- Pluggable handlers are now stored in static global variables, and thus
  do not need to be copied anymore
- VACUUM FULL / CLUSTER is moved into one callback that does the actual
  copying. The various previous rewrite callbacks imo integrated at the
  wrong level.
- there's a GUC that allows to change the default table AM
- moving COPY to use slots (roughly based on your / Haribabu's patch)
- removed the AM specific shmem initialization callbacks; various AMs
  are going to need the syncscan APIs, so moving that into AM callbacks
  doesn't make sense.

Missing:
- callback for the second scan of CREATE INDEX CONCURRENTLY
- commands/analyze.c integration (Working on it)
- fixing your (Haribabu's) slotification of copy patch to compute memory
  usage somehow
- table creation callback, currently the pluggable-zheap patch has a few
  conditionals outside of access/zheap for that purpose (see RelationTruncate)
- header structure cleanup

And then:
- lotsa cleanups
- rebasing onto a newer version of the abstract slot patchset
- splitting out smaller patches


You'd moved the bulk insert into tableam callbacks - I don't quite get
why? There's not really anything AM specific in that code?


> > > Does the new TupleTableSlot abstraction patches has fixed any of these
> > > issues in the recent thread [1]? so that I can look into the change of
> > > FDW API to return slot instead of tuple.
> >
> > Yea, that'd be a good thing to start with.
> >
> 
> I found out only the RefetchForeignRow API needs the change and done the
> same.
> Along with that, I fixed all the issues of  running make check-world.
> Attached patches
> for the same.

Thanks, that's really helpful!  I'll try to merge these soon.


I'm starting to think that we're getting closer to something that
looks right from a high level, even though there's a lot of details to
clean up.


Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Fri, Aug 24, 2018 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-08-24 11:55:41 +1000, Haribabu Kommi wrote:
> On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
>
> > On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> > > On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I'm currently in the process of rebasing zheap onto the pluggable
> > > > storage work. The goal, which seems to work surprisingly well, is to
> > > > find issues that the current pluggable storage patch doesn't yet deal
> > > > with.  I plan to push a tree including a lot of fixes and improvements
> > > > soon.
> > > >
> > > That's good. Did you find any problems in porting zheap into pluggable
> > > storage? Does it needs any API changes or new API requirement?
> >
> > A lot, yes. The big changes are:
> > - removal of HeapPageScanDesc
> > - introduction of explicit support functions for tablesample & bitmap scans
> > - introduction of callbacks for vacuum_rel, cluster
> >
> > And quite a bit more along those lines.
> >
>
> OK. Those are quite a bit of changes.

I've pushed a current version of that to my git tree to the
pluggable-storage branch. It's not really a version that I think makese
sense to review or such, but it's probably more useful if you work based
on that.  There's also the pluggable-zheap branch, which I found
extremely useful to develop against.

OK. Thanks, will check that also.
 
There's a few further changes since last time:
- Pluggable handlers are now stored in static global variables, and thus
  do not need to be copied anymore
- VACUUM FULL / CLUSTER is moved into one callback that does the actual
  copying. The various previous rewrite callbacks imo integrated at the
  wrong level.
- there's a GUC that allows to change the default table AM
- moving COPY to use slots (roughly based on your / Haribabu's patch)
- removed the AM specific shmem initialization callbacks; various AMs
  are going to need the syncscan APIs, so moving that into AM callbacks
  doesn't make sense.

OK.
 
Missing:
- callback for the second scan of CREATE INDEX CONCURRENTLY
- commands/analyze.c integration (Working on it)
- fixing your (Haribabu's) slotification of copy patch to compute memory
  usage somehow

I will check it.
 
- table creation callback, currently the pluggable-zheap patch has a few
  conditionals outside of access/zheap for that purpose (see RelationTruncate)

I will check it.
 
And then:
- lotsa cleanups
- rebasing onto a newer version of the abstract slot patchset
- splitting out smaller patches


You'd moved the bulk insert into tableam callbacks - I don't quite get
why? There's not really anything AM specific in that code?

The main reason for adding them to the AM is just to give the specific AM
control over deciding whether it can support bulk insert or not.

The current framework doesn't support AM-specific bulk insert state being
passed from one function to another, and its structure is fixed. This needs
to be enhanced to allow AM-specific private members as well.
 
> > > Does the new TupleTableSlot abstraction patches has fixed any of these
> > > issues in the recent thread [1]? so that I can look into the change of
> > > FDW API to return slot instead of tuple.
> >
> > Yea, that'd be a good thing to start with.
> >
>
> I found out only the RefetchForeignRow API needs the change and done the
> same.
> Along with that, I fixed all the issues of  running make check-world.
> Attached patches
> for the same.

Thanks, that's really helpful!  I'll try to merge these soon.

I can share the rebased patches for the fixes, so that it will be easy to merge. 

I'm starting to think that we're getting closer to something that
looks right from a high level, even though there's a lot of details to
clean up.

That's good.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Tue, Aug 28, 2018 at 1:48 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Fri, Aug 24, 2018 at 12:50 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-08-24 11:55:41 +1000, Haribabu Kommi wrote:
> On Tue, Aug 21, 2018 at 6:59 PM Andres Freund <andres@anarazel.de> wrote:
>
> > On 2018-08-21 16:55:47 +1000, Haribabu Kommi wrote:
> > > On Sun, Aug 5, 2018 at 7:48 PM Andres Freund <andres@anarazel.de> wrote:
> > > > I'm currently in the process of rebasing zheap onto the pluggable
> > > > storage work. The goal, which seems to work surprisingly well, is to
> > > > find issues that the current pluggable storage patch doesn't yet deal
> > > > with.  I plan to push a tree including a lot of fixes and improvements
> > > > soon.
> > > >
> > > That's good. Did you find any problems in porting zheap into pluggable
> > > storage? Does it needs any API changes or new API requirement?
> >
> > A lot, yes. The big changes are:
> > - removal of HeapPageScanDesc
> > - introduction of explicit support functions for tablesample & bitmap scans
> > - introduction of callbacks for vacuum_rel, cluster
> >
> > And quite a bit more along those lines.
> >
>
> OK. Those are quite a bit of changes.

I've pushed a current version of that to my git tree to the
pluggable-storage branch. It's not really a version that I think makese
sense to review or such, but it's probably more useful if you work based
on that.  There's also the pluggable-zheap branch, which I found
extremely useful to develop against.

OK. Thanks, will check that also.
 
- fixing your (Haribabu's) slotification of copy patch to compute memory
  usage somehow

I will check it.

Attached is the COPY patch that brings back the size validation.
It computes the tuple size from the first tuple in the batch and uses that for
the rest of the tuples in the batch. This way the calculation overhead is
reduced, though there is a chance that the first tuple is very small and the
rest are very large; I feel such cases are rare.
 
 
- table creation callback, currently the pluggable-zheap patch has a few
  conditionals outside of access/zheap for that purpose (see RelationTruncate

I will check it.

I found a couple of places where zheap uses some extra logic to verify
whether the AM is zheap or not, and takes some extra decisions based on that.
I am analyzing all of that extra code to see whether any callbacks can handle
it, and how. I can come back with more details later.
 
 
And then:
- lotsa cleanups
- rebasing onto a newer version of the abstract slot patchset
- splitting out smaller patches


You'd moved the bulk insert into tableam callbacks - I don't quite get
why? There's not really anything AM specific in that code?

The main reason of adding them to AM is just to provide a control to
the specific AM to decide whether they can support the bulk insert or
not?

Current framework doesn't support AM specific bulk insert state to be
passed from one function to another and it's structure is fixed. This needs
to be enhanced to add AM specific private members also.

Do you want me to work on making this generic, so that AM methods can extend
the structure?
 
 
> > > Does the new TupleTableSlot abstraction patches has fixed any of these
> > > issues in the recent thread [1]? so that I can look into the change of
> > > FDW API to return slot instead of tuple.
> >
> > Yea, that'd be a good thing to start with.
> >
>
> I found out only the RefetchForeignRow API needs the change and done the
> same.
> Along with that, I fixed all the issues of  running make check-world.
> Attached patches
> for the same.

Thanks, that's really helpful!  I'll try to merge these soon.

I can share the rebased patches for the fixes, so that it will be easy to merge. 

Rebased FDW and check-world fixes patch is attached.
I will continue working on the rest of the miss items.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

Thanks for the patches!

On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found couple of places where the zheap is using some extra logic in
> verifying
> whether it is zheap AM or not, based on that it used to took some extra
> decisions.
> I am analyzing all the extra code that is done, whether any callbacks can
> handle it
> or not? and how? I can come back with more details later.

Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.


> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >>
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > Current framework doesn't support AM specific bulk insert state to be
> > passed from one function to another and it's structure is fixed. This needs
> > to be enhanced to add AM specific private members also.
> >
> 
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?

I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actually needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...


> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
>                      CommandId mycid, int hi_options,
>                      ResultRelInfo *resultRelInfo,
>                      BulkInsertState bistate,
> -                    int nBufferedTuples, TupleTableSlot **bufferedSlots,
> +                    int nBufferedSlots, TupleTableSlot **bufferedSlots,
>                      uint64 firstBufferedLineNo);
>  static bool CopyReadLine(CopyState cstate);
>  static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
>      void       *bistate;
>      uint64        processed = 0;
>      bool        useHeapMultiInsert;
> -    int            nBufferedTuples = 0;
> +    int            nBufferedSlots = 0;
>      int            prev_leaf_part_index = -1;

> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000

What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.


>                  if (useHeapMultiInsert)
>                  {
> +                    int tup_size;
> +
>                      /* Add this tuple to the tuple buffer */
> -                    if (nBufferedTuples == 0)
> +                    if (nBufferedSlots == 0)
> +                    {
>                          firstBufferedLineNo = cstate->cur_lineno;
> -                    Assert(bufferedSlots[nBufferedTuples] == myslot);
> -                    nBufferedTuples++;
> +
> +                        /*
> +                         * Find out the Tuple size of the first tuple in a batch and
> +                         * use it for the rest tuples in a batch. There may be scenarios
> +                         * where the first tuple is very small and rest can be large, but
> +                         * that's rare and this should work for majority of the scenarios.
> +                         */
> +                        tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> +                                                          myslot->tts_values,
> +                                                          myslot->tts_isnull);
> +                    }

This seems too expensive to me.  I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does that
sound reasonable?

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:


On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks for the patches!

On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found couple of places where the zheap is using some extra logic in
> verifying
> whether it is zheap AM or not, based on that it used to took some extra
> decisions.
> I am analyzing all the extra code that is done, whether any callbacks can
> handle it
> or not? and how? I can come back with more details later.

Yea, I think some of them will need to stay (particularly around
integrating undo) and som other ones we'll need to abstract.
 
OK. I will list all the areas that I found, along with my observations on
whether each can be abstracted or should be left alone, and then implement
accordingly.


> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >>
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > Current framework doesn't support AM specific bulk insert state to be
> > passed from one function to another and it's structure is fixed. This needs
> > to be enhanced to add AM specific private members also.
> >
>
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?

I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actuallly needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...

OK. Will remove them and share the patch.
 

> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
>                                       CommandId mycid, int hi_options,
>                                       ResultRelInfo *resultRelInfo,
>                                       BulkInsertState bistate,
> -                                     int nBufferedTuples, TupleTableSlot **bufferedSlots,
> +                                     int nBufferedSlots, TupleTableSlot **bufferedSlots,
>                                       uint64 firstBufferedLineNo);
>  static bool CopyReadLine(CopyState cstate);
>  static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
>       void       *bistate;
>       uint64          processed = 0;
>       bool            useHeapMultiInsert;
> -     int                     nBufferedTuples = 0;
> +     int                     nBufferedSlots = 0;
>       int                     prev_leaf_part_index = -1;

> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000

What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.

OK. I will correct it.
 

>                               if (useHeapMultiInsert)
>                               {
> +                                     int tup_size;
> +
>                                       /* Add this tuple to the tuple buffer */
> -                                     if (nBufferedTuples == 0)
> +                                     if (nBufferedSlots == 0)
> +                                     {
>                                               firstBufferedLineNo = cstate->cur_lineno;
> -                                     Assert(bufferedSlots[nBufferedTuples] == myslot);
> -                                     nBufferedTuples++;
> +
> +                                             /*
> +                                              * Find out the Tuple size of the first tuple in a batch and
> +                                              * use it for the rest tuples in a batch. There may be scenarios
> +                                              * where the first tuple is very small and rest can be large, but
> +                                              * that's rare and this should work for majority of the scenarios.
> +                                              */
> +                                             tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> +                                                                                                               myslot->tts_values,
> +                                                                                                               myslot->tts_isnull);
> +                                     }

This seems too expensive to me.  I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does that
sound reasonable?

Yes, the cstate structure contains the line_buf member that holds the
line length of the row; this can be used as the tuple length to limit the memory usage.
Comments?
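The proxy idea can be sketched as a self-contained toy, outside of copy.c (BatchState, batch_add, and the byte limit are made up for illustration; only MAX_BUFFERED_TUPLES echoes the real constant): the raw input line length stands in for the tuple size, and the batch is flushed whenever adding another tuple would exceed either the tuple count or the accumulated byte estimate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative limits; copy.c only has the tuple-count cap today. */
#define MAX_BUFFERED_TUPLES 1000
#define MAX_BUFFERED_BYTES  65535

typedef struct BatchState
{
    int    ntuples;     /* tuples currently buffered */
    size_t nbytes;      /* approximate buffered size (sum of line lengths) */
    int    nflushes;    /* how many times the batch was flushed */
} BatchState;

/*
 * Add one tuple to the batch, using the consumed input line length
 * (cstate->line_buf.len in the real code) as a cheap proxy for the
 * tuple's size.  Returns true if the batch had to be flushed first.
 */
static bool
batch_add(BatchState *bs, size_t line_len)
{
    bool flushed = false;

    if (bs->ntuples >= MAX_BUFFERED_TUPLES ||
        bs->nbytes + line_len > MAX_BUFFERED_BYTES)
    {
        /* in copy.c this would be the CopyFromInsertBatch() call */
        bs->ntuples = 0;
        bs->nbytes = 0;
        bs->nflushes++;
        flushed = true;
    }
    bs->ntuples++;
    bs->nbytes += line_len;
    return flushed;
}
```

The point of the proxy is that it costs nothing extra per row, whereas heap_compute_data_size() walks every attribute of every tuple.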

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks for the patches!

On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found a couple of places where zheap is using some extra logic to
> verify whether it is the zheap AM or not, and based on that it takes
> some extra decisions.
> I am analyzing all of that extra code, whether any callbacks can handle
> it or not, and how. I can come back with more details later.

Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.
 
OK. I will list all the areas that I found, with my observations on whether
to abstract them or leave them, and then implement around that.

The following are the changes where the code specifically checks whether
it is a zheap relation or not.

Overall I found that it needs 3 new APIs at the following locations, plus one extra facility:
1. RelationSetNewRelfilenode
2. heap_create_init_fork
3. estimate_rel_size
4. A facility to provide handler options (like skip WAL, etc.).
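As a rough sketch of how the four items above could look as table-AM callbacks — all names, the struct shape, and the option flag here are hypothetical, not the actual tableam.h interface:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical handler option flag (item 4 above) */
#define TABLE_AM_OPT_SKIP_WAL_ALLOWED  (1u << 0)

/*
 * A toy model of a table-AM routine extended with the proposed
 * callbacks.  The real TableAmRoutine looks quite different; this
 * only shows the shape of the idea.
 */
typedef struct TableAmRoutineSketch
{
    void     (*set_new_filenode) (void *rel);   /* RelationSetNewRelfilenode hook */
    void     (*create_init_fork) (void *rel);   /* heap_create_init_fork hook */
    void     (*estimate_rel_size) (void *rel, double *pages, double *tuples);
    uint32_t (*get_handler_options) (void);     /* option flags, e.g. skip WAL */
} TableAmRoutineSketch;

/* A "heap-like" AM: no metapage, allows the skip-WAL optimization. */
static void heapish_set_new_filenode(void *rel) { (void) rel; }
static void heapish_create_init_fork(void *rel) { (void) rel; }

static void
heapish_estimate_rel_size(void *rel, double *pages, double *tuples)
{
    (void) rel;
    *pages = 0;     /* a zheap-like AM would subtract its metapage here */
    *tuples = 0;
}

static uint32_t
heapish_get_handler_options(void)
{
    return TABLE_AM_OPT_SKIP_WAL_ALLOWED;
}

static const TableAmRoutineSketch heapish_routine = {
    .set_new_filenode = heapish_set_new_filenode,
    .create_init_fork = heapish_create_init_fork,
    .estimate_rel_size = heapish_estimate_rel_size,
    .get_handler_options = heapish_get_handler_options,
};

/* Caller side, replacing "if (!RelationStorageIsZHeap(rel) && ...)" */
static bool
can_skip_wal(const TableAmRoutineSketch *am)
{
    return (am->get_handler_options() & TABLE_AM_OPT_SKIP_WAL_ALLOWED) != 0;
}
```

The appeal of the options callback is that call sites stop naming specific AMs: a zheap-like AM would simply not set the flag, and copy.c / tablecmds.c would query the routine instead of testing RelationStorageIsZHeap().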
_hash_vacuum_one_page:
xlrec.flags = RelationStorageIsZHeap(heapRel) ?
XLOG_HASH_VACUUM_RELATION_STORAGE_ZHEAP : 0;

_bt_delitems_delete:
xlrec_delete.flags = RelationStorageIsZHeap(heapRel) ?
XLOG_BTREE_DELETE_RELATION_STORAGE_ZHEAP : 0;


Storing the type of the handler, and adding a new API for special handling
when checking for these new types, can remove the need for the above code.
RelationAddExtraBlocks:
if (RelationStorageIsZHeap(relation))
{
ZheapInitPage(page, BufferGetPageSize(buffer));
freespace = PageGetZHeapFreeSpace(page);
}

Adding new APIs for PageInit and PageGetHeapFreeSpace to redirect the calls to the
specific table AM handlers.

visibilitymap_set:
if (RelationStorageIsZHeap(rel))
{
recptr = log_zheap_visible(rel->rd_node, heapBuf, vmBuf,
   cutoff_xid, flags);
/*
* We do not have a page wise visibility flag in zheap.
* So no need to set LSN on zheap page.
*/
}

Handler options may remove the need for the above code.

validate_index_heapscan:
/* Set up for predicate or expression evaluation */
/* For zheap relations, the tuple is locally allocated, so free it. */
ExecStoreHeapTuple(heapTuple, slot, RelationStorageIsZHeap(heapRelation));


This will be solved after slotifying the validate_index_heapscan function.

RelationTruncate:
/* Create the meta page for zheap */
if (RelationStorageIsZHeap(rel))
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
  InvalidTransactionId,
  InvalidMultiXactId);
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED &&
rel->rd_rel->relkind != 'p')
{
heap_create_init_fork(rel);
if (RelationStorageIsZHeap(rel))
ZheapInitMetaPage(rel, INIT_FORKNUM);
}

A new API in RelationSetNewRelfilenode and heap_create_init_fork can solve it.
cluster:
if (RelationStorageIsZHeap(rel))
ereport(ERROR,
(errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
errmsg("cannot cluster a zheap table")));
 
No change required.
 
copyFrom:
/*
* In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
* See zheap_prepare_insert for details.
* PBORKED / ZBORKED: abstract
*/
if (!RelationStorageIsZHeap(cstate->rel) && !XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;
How about requesting the table AM handler to provide options and using them here?
ExecuteTruncateGuts:

// PBORKED: Need to abstract this
minmulti = GetOldestMultiXactId();

/*
* Need the full transaction-safe pushups.
*
* Create a new empty storage file for the relation, and assign it
* as the relfilenode value. The old storage file is scheduled for
* deletion at commit.
*
* PBORKED: needs to be a callback
*/
if (RelationStorageIsZHeap(rel))
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
  InvalidTransactionId, InvalidMultiXactId);
else
RelationSetNewRelfilenode(rel, rel->rd_rel->relpersistence,
  RecentXmin, minmulti);
if (rel->rd_rel->relpersistence == RELPERSISTENCE_UNLOGGED)
{
heap_create_init_fork(rel);
if (RelationStorageIsZHeap(rel))
ZheapInitMetaPage(rel, INIT_FORKNUM);
}

New API inside RelationSetNewRelfilenode can handle it.
ATRewriteCatalogs:
/* Inherit the storage_engine reloption from the parent table. */
if (RelationStorageIsZHeap(rel))
{
static char *validnsps[] = HEAP_RELOPT_NAMESPACES;
DefElem *storage_engine;

storage_engine = makeDefElemExtended("toast", "storage_engine",
(Node *) makeString("zheap"),
DEFELEM_UNSPEC, -1);
reloptions = transformRelOptions((Datum) 0,
list_make1(storage_engine),
"toast",
validnsps, true, false);
}

I don't think anything can be done via an API here.

ATRewriteTable:
/*
* In zheap, we don't support the optimization for HEAP_INSERT_SKIP_WAL.
* See zheap_prepare_insert for details.
*
* ZFIXME / PFIXME: We probably need a different abstraction for this.
*/
if (!RelationStorageIsZHeap(newrel) && !XLogIsNeeded())
hi_options |= HEAP_INSERT_SKIP_WAL;

Options can solve this also.

estimate_rel_size:
if (curpages < 10 &&
(rel->rd_rel->relpages == 0 ||
(RelationStorageIsZHeap(rel) &&
rel->rd_rel->relpages == ZHEAP_METAPAGE + 1)) &&
!rel->rd_rel->relhassubclass &&
rel->rd_rel->relkind != RELKIND_INDEX)
curpages = 10;

/* report estimated # pages */
*pages = curpages;
/* quick exit if rel is clearly empty */
if (curpages == 0 || (RelationStorageIsZHeap(rel) &&
curpages == ZHEAP_METAPAGE + 1))
{
*tuples = 0;
*allvisfrac = 0;
break;
}
/* coerce values in pg_class to more desirable types */
relpages = (BlockNumber) rel->rd_rel->relpages;
reltuples = (double) rel->rd_rel->reltuples;
relallvisible = (BlockNumber) rel->rd_rel->relallvisible;

/*
* If it's a zheap relation, then subtract the pages
* to account for the metapage.
*/
if (relpages > 0 && RelationStorageIsZHeap(rel))
{
curpages--;
relpages--;
}

An API may be needed to estimate the relation size based on the handler type.
pg_stat_get_tuples_hot_updated and others:
/*
* Counter tuples_hot_updated stores number of hot updates for heap table
* and the number of inplace updates for zheap table.
*/
if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
RelationStorageIsZHeap(rel))
result = 0;
else
result = (int64) (tabentry->tuples_hot_updated);


Is the special condition needed? The values should be 0 because of zheap, right?

RelationSetNewRelfilenode:
/* Initialize the metapage for zheap relation. */
if (RelationStorageIsZHeap(relation))
ZheapInitMetaPage(relation, MAIN_FORKNUM);
A new API in RelationSetNewRelfilenode can solve this problem.
 

> >> And then:
> >> - lotsa cleanups
> >> - rebasing onto a newer version of the abstract slot patchset
> >> - splitting out smaller patches
> >>
> >>
> >> You'd moved the bulk insert into tableam callbacks - I don't quite get
> >> why? There's not really anything AM specific in that code?
> >>
> >
> > The main reason of adding them to AM is just to provide a control to
> > the specific AM to decide whether they can support the bulk insert or
> > not?
> >
> > Current framework doesn't support AM specific bulk insert state to be
> > passed from one function to another and it's structure is fixed. This needs
> > to be enhanced to add AM specific private members also.
> >
>
> Do you want me to work on it to make it generic to AM methods to extend
> the structure?

I think the best thing here would be to *remove* all AM abstraction for
bulk insert, until it's actually needed. The likelihood of us getting
the interface right and useful without an actual user seems low. Also,
this already is a huge patch...

OK. Will remove them and share the patch.

Bulk insert API changes are removed.
 
 

> @@ -308,7 +308,7 @@ static void CopyFromInsertBatch(CopyState cstate, EState *estate,
>                                       CommandId mycid, int hi_options,
>                                       ResultRelInfo *resultRelInfo,
>                                       BulkInsertState bistate,
> -                                     int nBufferedTuples, TupleTableSlot **bufferedSlots,
> +                                     int nBufferedSlots, TupleTableSlot **bufferedSlots,
>                                       uint64 firstBufferedLineNo);
>  static bool CopyReadLine(CopyState cstate);
>  static bool CopyReadLineText(CopyState cstate);
> @@ -2309,11 +2309,12 @@ CopyFrom(CopyState cstate)
>       void       *bistate;
>       uint64          processed = 0;
>       bool            useHeapMultiInsert;
> -     int                     nBufferedTuples = 0;
> +     int                     nBufferedSlots = 0;
>       int                     prev_leaf_part_index = -1;

> -#define MAX_BUFFERED_TUPLES 1000
> +#define MAX_BUFFERED_SLOTS 1000

What's the point of these renames? We're still dealing in tuples. Just
seems to make the patch larger.

OK. I will correct it.
 

>                               if (useHeapMultiInsert)
>                               {
> +                                     int tup_size;
> +
>                                       /* Add this tuple to the tuple buffer */
> -                                     if (nBufferedTuples == 0)
> +                                     if (nBufferedSlots == 0)
> +                                     {
>                                               firstBufferedLineNo = cstate->cur_lineno;
> -                                     Assert(bufferedSlots[nBufferedTuples] == myslot);
> -                                     nBufferedTuples++;
> +
> +                                             /*
> +                                              * Find out the Tuple size of the first tuple in a batch and
> +                                              * use it for the rest tuples in a batch. There may be scenarios
> +                                              * where the first tuple is very small and rest can be large, but
> +                                              * that's rare and this should work for majority of the scenarios.
> +                                              */
> +                                             tup_size = heap_compute_data_size(myslot->tts_tupleDescriptor,
> +                                                                                                               myslot->tts_values,
> +                                                                                                               myslot->tts_isnull);
> +                                     }

This seems too expensive to me.  I think it'd be better if we instead
used the amount of input data consumed for the tuple as a proxy. Does that
sound reasonable?

Yes, the cstate structure contains the line_buf member that holds the
line length of the row; this can be used as the tuple length to limit the memory usage.
Comments?

Attached are the COPY FROM batch insert memory usage limit fix, and grammar
support for the USING method in CREATE TABLE AS as well.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Amit Kapila
Date:
On Mon, Sep 10, 2018 at 1:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
> pg_stat_get_tuples_hot_updated and others:
> /*
> * Counter tuples_hot_updated stores number of hot updates for heap table
> * and the number of inplace updates for zheap table.
> */
> if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
> RelationStorageIsZHeap(rel))
> result = 0;
> else
> result = (int64) (tabentry->tuples_hot_updated);
>
>
> Is the special condition needed? The values should be 0 because of zheap, right?
>

I also think so.  Beena/Mithun has worked on this part of the code, so
it is better if they also confirm once.

-- 
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Pluggable Storage - Andres's take

From
Mithun Cy
Date:
On Mon, Sep 10, 2018 at 7:33 PM, Amit Kapila <amit.kapila16@gmail.com> wrote:
> On Mon, Sep 10, 2018 at 1:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
>> On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>>
>> pg_stat_get_tuples_hot_updated and others:
>> /*
>> * Counter tuples_hot_updated stores number of hot updates for heap table
>> * and the number of inplace updates for zheap table.
>> */
>> if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
>> RelationStorageIsZHeap(rel))
>> result = 0;
>> else
>> result = (int64) (tabentry->tuples_hot_updated);
>>
>>
>> Is the special condition needed? The values should be 0 because of zheap, right?
>>
>
> I also think so.  Beena/Mithun has worked on this part of the code, so
> it is better if they also confirm once.

Yes, pg_stat_get_tuples_hot_updated should return 0 for zheap.


-- 
Thanks and Regards
Mithun C Y
EnterpriseDB: http://www.enterprisedb.com


Re: Pluggable Storage - Andres's take

From
Beena Emerson
Date:
Hello,

On Mon, 10 Sep 2018, 19:33 Amit Kapila, <amit.kapila16@gmail.com> wrote:
On Mon, Sep 10, 2018 at 1:12 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>>
> pg_stat_get_tuples_hot_updated and others:
> /*
> * Counter tuples_hot_updated stores number of hot updates for heap table
> * and the number of inplace updates for zheap table.
> */
> if ((tabentry = pgstat_fetch_stat_tabentry(relid)) == NULL ||
> RelationStorageIsZHeap(rel))
> result = 0;
> else
> result = (int64) (tabentry->tuples_hot_updated);
>
>
> Is the special condition needed? The values should be 0 because of zheap, right?
>

I also think so.  Beena/Mithun has worked on this part of the code, so
it is better if they also confirm once.

We have used the hot_updated counter to count the number of inplace updates for zheap, to avoid introducing a new counter. Though, technically, hot updates are 0 for zheap, the counter could hold a non-zero value indicating the inplace updates.
 
Thank you

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Mon, Sep 10, 2018 at 5:42 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Sep 5, 2018 at 2:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Sep 4, 2018 at 10:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks for the patches!

On 2018-09-03 19:06:27 +1000, Haribabu Kommi wrote:
> I found a couple of places where zheap is using some extra logic to
> verify whether it is the zheap AM or not, and based on that it takes
> some extra decisions.
> I am analyzing all of that extra code, whether any callbacks can handle
> it or not, and how. I can come back with more details later.

Yea, I think some of them will need to stay (particularly around
integrating undo) and some other ones we'll need to abstract.
 
OK. I will list all the areas that I found, with my observations on whether
to abstract them or leave them, and then implement around that.

The following are the changes where the code specifically checks whether
it is a zheap relation or not.

Overall I found that it needs 3 new APIs at the following locations, plus one extra facility:
1. RelationSetNewRelfilenode
2. heap_create_init_fork
3. estimate_rel_size
4. A facility to provide handler options (like skip WAL, etc.).

During the porting of the Fujitsu in-memory columnar store on top of pluggable
storage, I found that the callers of "heap_beginscan" expect that
the returned data always contains all the records.

For example, in a sequential scan, the heap returns the slot with
the tuple or with a value array of all the columns; the data then gets
filtered, and the unnecessary columns are later removed by projection.
This works fine for row-based storage. For columnar storage, if
the storage knows that the upper layers need only particular columns,
it can directly return the specified columns and there is no
need for a projection step. This will also help the columnar storage
return the proper columns in a faster way.

Is it good to pass the plan to the storage, so that it can find out
the columns that need to be returned? And if the projection
can be handled in the storage itself for some scenarios, the callers
need to be informed that there is no need to perform the
projection again.

Comments?

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2018-09-21 16:57:43 +1000, Haribabu Kommi wrote:
> During the porting of Fujitsu in-memory columnar store on top of pluggable
> storage, I found that the callers of the "heap_beginscan" are expecting
> the returned data always contains all the records.

Right.


> For example, in the sequential scan, the heap returns the slot with
> the tuple or with value array of all the columns and then the data gets
> filtered and later removed the unnecessary columns with projection.
> This works fine for the row based storage. For columnar storage, if
> the storage knows that upper layers needs only particular columns,
> then they can directly return the specified columns and there is no
> need of projection step. This will help the columnar storage also
> to return proper columns in a faster way.

I think this is an important feature, but I feel fairly strongly that we
should only tackle it in a second version. This patchset is already
pretty darn large.  It's imo not just helpful for columnar, but even for
heap - we e.g. spend a lot of time deforming columns that are never
accessed. That's particularly harmful when the leading columns are all
NOT NULL and fixed width, but even if not, it's painful.


> Is it good to pass the plan to the storage, so that they can find out
> the columns that needs to be returned?

I don't think that's the right approach - this should be a level *below*
plan nodes, not reference them. I suspect we're going to have to have a
new table_scan_set_columnlist() option or such.
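A toy model of what such a table_scan_set_columnlist() could look like — the function name is only a suggestion from this mail, and ScanDescSketch plus everything else here is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical scan descriptor: the executor tells the scan which
 * attributes it will actually read, and a column store (or a
 * deforming heap scan) can skip the rest.  Nothing like this exists
 * in tableam.h; it only illustrates the proposed interface.
 */
typedef struct ScanDescSketch
{
    uint64_t needed_cols;   /* bitmap of 1-based attribute numbers */
} ScanDescSketch;

static void
table_scan_set_columnlist(ScanDescSketch *scan,
                          const int *attnums, int natts)
{
    scan->needed_cols = 0;
    for (int i = 0; i < natts; i++)
        scan->needed_cols |= UINT64_C(1) << (attnums[i] - 1);
}

/* AM side: decide whether a column has to be fetched/deformed at all */
static bool
scan_needs_column(const ScanDescSketch *scan, int attnum)
{
    /* an empty list means "all columns", matching today's behaviour */
    if (scan->needed_cols == 0)
        return true;
    return (scan->needed_cols & (UINT64_C(1) << (attnum - 1))) != 0;
}
```

Keeping the "no list set" case equivalent to "return everything" is what makes this sit below the plan-node level: an AM that ignores the hint stays correct, it just does redundant work.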


> And also if the projection can handle in the storage itself for some
> scenarios, need to be informed the callers that there is no need to
> perform the projection extra.

I don't think that should be done in the storage layer - that's probably
better done introducing custom scan nodes and such.  This has costing
implications etc, so this needs to happen *before* planning is finished.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Fri, Sep 21, 2018 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-09-21 16:57:43 +1000, Haribabu Kommi wrote:

> For example, in the sequential scan, the heap returns the slot with
> the tuple or with value array of all the columns and then the data gets
> filtered and later removed the unnecessary columns with projection.
> This works fine for the row based storage. For columnar storage, if
> the storage knows that upper layers needs only particular columns,
> then they can directly return the specified columns and there is no
> need of projection step. This will help the columnar storage also
> to return proper columns in a faster way.

I think this is an important feature, but I feel fairly strongly that we
should only tackle it in a second version. This patchset is already
pretty darn large.  It's imo not just helpful for columnar, but even for
heap - we e.g. spend a lot of time deforming columns that are never
accessed. That's particularly harmful when the leading columns are all
NOT NULL and fixed width, but even if not, it's painful.

OK. Thanks for your opinion.
Then I will first try to clean up the open items of the existing patch.
 

> Is it good to pass the plan to the storage, so that they can find out
> the columns that needs to be returned?

I don't think that's the right approach - this should be a level *below*
plan nodes, not reference them. I suspect we're going to have to have a
new table_scan_set_columnlist() option or such.

The table_scan_set_columnlist() API can be a good solution to share
the columns that are expected.

 
> And also if the projection can handle in the storage itself for some
> scenarios, need to be informed the callers that there is no need to
> perform the projection extra.

I don't think that should be done in the storage layer - that's probably
better done introducing custom scan nodes and such.  This has costing
implications etc, so this needs to happen *before* planning is finished.

Sorry, my explanation was wrong. Assume a scenario where the target list
contains only plain columns of a table, these columns are already passed
to the storage using the above proposed new API, and there is a one-to-one
mapping between them. Based on that info, deciding whether the projection
is required or not would be good.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Alexander Korotkov
Date:
On Fri, Aug 24, 2018 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed a current version of that to my git tree to the
> pluggable-storage branch. It's not really a version that I think makes
> sense to review or such, but it's probably more useful if you work based
> on that.  There's also the pluggable-zheap branch, which I found
> extremely useful to develop against.

BTW, I'm going to take a look at the current shape of this patch and share
my thoughts.  But where are the branches you're referring to?  On your
postgres.org git repository the pluggable-storage branch was last updated
on June 7.  And on github the branches were updated on August 5
and 14, which is still much older than your email (August 24)...

1. https://git.postgresql.org/gitweb/?p=users/andresfreund/postgres.git;a=shortlog;h=refs/heads/pluggable-storage
2. https://github.com/anarazel/postgres-pluggable-storage
3. https://github.com/anarazel/postgres-pluggable-storage/tree/pluggable-zheap

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Mon, Sep 24, 2018 at 5:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
On Fri, Aug 24, 2018 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed a current version of that to my git tree to the
> pluggable-storage branch. It's not really a version that I think makes
> sense to review or such, but it's probably more useful if you work based
> on that.  There's also the pluggable-zheap branch, which I found
> extremely useful to develop against.

BTW, I'm going to take a look at current shape of this patch and share
my thoughts.  But where are the branches you're referring?  On your
postgres.org git repository pluggable-storage branch was updated last
time at June 7.  And on the github branches are updated at August 5
and 14, and that is still much older than your email (August 24)...

The code is the latest, but the commit time is older; I think that is because
of a commit squash.

pluggable-storage is the branch where the pluggable storage code is present
and pluggable-zheap branch where zheap is rebased on top of pluggable
storage.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Alexander Korotkov
Date:
On Mon, Sep 24, 2018 at 8:04 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Mon, Sep 24, 2018 at 5:02 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>
>> On Fri, Aug 24, 2018 at 5:50 AM Andres Freund <andres@anarazel.de> wrote:
>> > I've pushed a current version of that to my git tree to the
>> pluggable-storage branch. It's not really a version that I think makes
>> > sense to review or such, but it's probably more useful if you work based
>> > on that.  There's also the pluggable-zheap branch, which I found
>> > extremely useful to develop against.
>>
>> BTW, I'm going to take a look at current shape of this patch and share
>> my thoughts.  But where are the branches you're referring?  On your
>> postgres.org git repository pluggable-storage branch was updated last
>> time at June 7.  And on the github branches are updated at August 5
>> and 14, and that is still much older than your email (August 24)...
>
>
> The code is the latest, but the commit time is older; I think that is because
> of a commit squash.
>
> pluggable-storage is the branch where the pluggable storage code is present
> and pluggable-zheap branch where zheap is rebased on top of pluggable
> storage.

Got it, thanks!

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Fri, Sep 21, 2018 at 5:40 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Fri, Sep 21, 2018 at 5:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-09-21 16:57:43 +1000, Haribabu Kommi wrote:

> For example, in the sequential scan, the heap returns the slot with
> the tuple or with value array of all the columns and then the data gets
> filtered and later removed the unnecessary columns with projection.
> This works fine for the row based storage. For columnar storage, if
> the storage knows that upper layers needs only particular columns,
> then they can directly return the specified columns and there is no
> need of projection step. This will help the columnar storage also
> to return proper columns in a faster way.

I think this is an important feature, but I feel fairly strongly that we
should only tackle it in a second version. This patchset is already
pretty darn large.  It's imo not just helpful for columnar, but even for
heap - we e.g. spend a lot of time deforming columns that are never
accessed. That's particularly harmful when the leading columns are all
NOT NULL and fixed width, but even if not, it's painful.

OK. Thanks for your opinion.
Then I will first try to clean up the open items of the existing patch.

Here I attached further cleanup patches.
1. Re-arrange the GUC variable
2. Added a check function hook for default_table_access_method GUC
3. Added a new hook, validate_index. I tried to slotify the
validate_index_heapscan function, but that has many problems, as it
accesses some internals of the HeapScanDesc structure, the buffer, etc.

So I added a new hook and provided a callback to handle the index insert.
Please check and let me know your comments.

I will further add the new APIs that are discussed for the zheap storage and
share the patch.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> Here I attached further cleanup patches.
> 1. Re-arrange the GUC variable
> 2. Added a check function hook for default_table_access_method GUC

Cool.


> 3. Added a new hook validate_index. I tried to change the function
> validate_index_heapscan to slotify, but that have many problems as it
> is accessing some internals of the heapscandesc structure and accessing
> the buffer and etc.

Oops, I also did that locally, in a way. I also made validate a
callback, as the validation logic is going to be specific to the AMs.
Sorry for not pushing that up earlier.  I'll try to do that soon,
there's a fair amount of change.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
> 
> Cool.
> 
> 
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
> 
> Oops, I also did that locally, in a way. I also made validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier.  I'll try to do that soon,
> there's a fair amount of change.

I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending patches (those not made redundant by our
concurrent development) are merged.

There's currently 3 regression test failures, that I'll look into
tomorrow:
- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
  a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
  I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
  what's going on there.

Amit Khandekar said he'll publish a new version of the slot-abstraction
patch tomorrow, so I'll rebase it onto that ASAP.

My next planned steps are a) to try to commit parts of the
slot-abstraction work b) to try to break out a few more pieces out of
the large pluggable storage patch.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Wed, Oct 3, 2018 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier.  I'll try to do that soon,
> there's a fair amount of change.

I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending patches (those not made redundant by our
concurrent development) are merged.
 
Yes, all the patches are merged.

There's currently 3 regression test failures, that I'll look into
tomorrow:
- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
  a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
  I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
  what's going on there.

I also observed the failure of aggregates.sql, will look into it.
 
Amit Khandekar said he'll publish a new version of the slot-abstraction
patch tomorrow, so I'll rebase it onto that ASAP.

OK.
Here I attached two new API patches.
1. Set New Rel File node
2. Create Init fork
 
There is an another patch of "External Relations" in the older patch set,
which is not included in the current git. That patch is of creating 
external relations by the extensions for their internal purpose. (Columnar
relations for the columnar storage). This new relkind can be used for
those relations, this way it provides the difference between normal and
columnar relations. Do you have any other idea of supporting those type
of relations?

And also I want to create a new API for heap_create_with_catalog
to let the pluggable storage engine to create additional relations.
This API is not required for every storage engine, so instead of moving
the entire function as API, how about adding an API at the end of the
function and calls only when it is set like hook functions? In case if the
storage engine doesn't need any of the heap_create_with_catalog 
functionality then creating a full API is better.

Comments?

My next planned steps are a) to try to commit parts of the
slot-abstraction work b) to try to break out a few more pieces out of
the large pluggable storage patch.

OK. Let me know your views on the part of the pieces that are stable,
so that I can separate them from larger patch.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Wed, Oct 3, 2018 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made a validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier.  I'll try to do that soon,
> there's a fair amount of change.

I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending (and not redundant, by our concurrent
development), patches merged.
 
Yes, all the patches are merged.

There's currently 3 regression test failures, that I'll look into
tomorrow:
- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
  a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
  I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
  what's going on there.

I also observed the failure of aggregates.sql; I will look into it.
 
Amit Khandekar said he'll publish a new version of the slot-abstraction
patch tomorrow, so I'll rebase it onto that ASAP.

OK.
Here I attached two new API patches.
1. Set New Rel File node
2. Create Init fork
 
There is another patch, "External Relations", in the older patch set
that is not included in the current git. That patch lets extensions create
external relations for their internal purposes (e.g. columnar
relations for columnar storage). This new relkind can be used for
those relations; this way it distinguishes normal relations from
columnar ones. Do you have any other idea for supporting those types
of relations?

I also want to create a new API around heap_create_with_catalog
to let a pluggable storage engine create additional relations.
This API is not required for every storage engine, so instead of turning
the entire function into an API, how about adding a callback at the end of the
function that is invoked only when it is set, like a hook function? If the
storage engine doesn't need any of the heap_create_with_catalog
functionality, then creating a full API would be better.

Comments?
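The hook-style idea above can be sketched roughly as follows. This is a hypothetical illustration, not the actual PostgreSQL tableam API: the struct, function, and hook names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical sketch: instead of turning all of heap_create_with_catalog()
 * into a per-AM API, the AM routine carries an optional callback that the
 * common catalog-creation path invokes at the end, only when it is set. */
typedef struct DemoAmRoutine
{
    /* NULL for engines satisfied with the common catalog work */
    void (*post_create_hook)(int relid, void *arg);
} DemoAmRoutine;

static int hook_calls = 0;

static void
columnar_post_create(int relid, void *arg)
{
    (void) relid;
    (void) arg;
    hook_calls++;            /* e.g. create auxiliary columnar relations */
}

/* Common catalog-creation path shared by all AMs */
static int
demo_create_with_catalog(const DemoAmRoutine *am, int relid)
{
    /* ... common catalog work would happen here ... */
    if (am->post_create_hook != NULL)
        am->post_create_hook(relid, NULL);
    return relid;
}
```

A heap-like engine would leave the callback NULL and pay nothing; only engines needing extra relations would set it.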

My next planned steps are a) to try to commit parts of the
slot-abstraction work b) to try to break out a few more pieces out of
the large pluggable storage patch.

OK. Let me know your views on which of the pieces are stable,
so that I can separate them from the larger patch.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Alexander Korotkov
Date:
Hi!

On Wed, Oct 3, 2018 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed an updated version, with a fair amount of pending changes,
> and I hope all your pending (and not redundant, by our concurrent
> development), patches merged.

I'd like to also share some patches.  I've used the current state of
pluggable-zheap as the base of my patches.

 * 0001-remove-extra-snapshot-functions.patch – removes the
snapshot_satisfiesUpdate() and snapshot_satisfiesVacuum() functions
from the tableam API.  snapshot_satisfiesUpdate() was completely unused.
snapshot_satisfiesVacuum() was used only in heap_copy_for_cluster(),
so I've replaced it with a direct heapam_satisfies_vacuum() call.

 * 0002-add-costing-function-to-API.patch – adds functions for costing
sequential and table sample scans to the tableam API.  The zheap costing
functions are currently copies of the heap costing functions.  This should be
adjusted in the future.  Estimation of heap lookups during index scans
should also be pluggable, but that is not yet implemented (TODO).
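The shape of such a costing callback can be sketched as below. This is a hedged illustration: the struct and function names are invented, but the default formula mirrors the shape of PostgreSQL's cost_seqscan() (disk cost per page plus CPU cost per tuple, using the default seq_page_cost = 1.0 and cpu_tuple_cost = 0.01 GUC values).

```c
#include <assert.h>

/* Hypothetical sketch: the AM supplies a function estimating
 * sequential-scan cost; each AM can substitute its own model. */
typedef struct DemoCostAm
{
    double (*seqscan_cost)(double pages, double tuples);
} DemoCostAm;

/* Heap-style default, following the shape of cost_seqscan() */
static double
heap_seqscan_cost(double pages, double tuples)
{
    const double seq_page_cost = 1.0;    /* default GUC value */
    const double cpu_tuple_cost = 0.01;  /* default GUC value */
    return seq_page_cost * pages + cpu_tuple_cost * tuples;
}
```

A zheap-like AM would initially point at the same function, as the patch description says, and adjust the model later.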

I've examined the code in the pluggable-zheap branch and on EDB's github [1], and I
didn't find anything related to "delete-marking" indexes as stated on
slide #25 of the presentation [2].  So, basically, the contract between heap
and indexes remains unchanged: once you update one indexed field,
you have to update all the others.  Did I understand correctly that
this is postponed?

And couple more notes from me:
* Right now table_fetch_row_version() is called in most places with
SnapshotAny.  That might work in the majority of cases, because in
heap there can't be multiple tuples residing at the same TID, while
zheap always returns the most recent tuple residing at that TID.  But I think
it would be better to provide some meaningful snapshot instead of
SnapshotAny.  If the best we can do is to ask for the most
recent tuple at some TID, we need a more consistent way of asking the table
AM for it.  I'm going to elaborate more on this.
* I'm not really sure we need the ability to iterate over multiple tuples
referenced by an index entry.  It seems that the only place which really needs
this is heap_copy_for_cluster(), which is itself table AM specific.
Also, zheap doesn't seem to be able to return more than one tuple from
zheapam_fetch_follow().  So, I'm going to investigate this further.
And if this iteration is really unneeded, I'll propose a patch to
delete it.

1. https://github.com/EnterpriseDB/zheap
2. http://www.pgcon.org/2018/schedule/attachments/501_zheap-a-new-storage-format-postgresql-5.pdf

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:


On Tue, Oct 9, 2018 at 1:46 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Oct 3, 2018 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made a validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier.  I'll try to do that soon,
> there's a fair amount of change.

I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending (and not redundant, by our concurrent
development), patches merged.
 
Yes, All the patches are merged.

There's currently 3 regression test failures, that I'll look into
tomorrow:
- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
  a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
  I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
  what's going on there.

I also observed the failure of aggregates.sql, will look into it.
 
Amit Khandekar said he'll publish a new version of the slot-abstraction
patch tomorrow, so I'll rebase it onto that ASAP.

OK.
Here I attached two new API patches.
1. Set New Rel File node
2. Create Init fork

The above patches have a problem: while testing, they lead to a crash.
Sorry for not testing earlier. The index relations also create a NewRelFileNode;
because that function was moved into the pluggable table access method, and
index relations have no tableam routine, it leads to a crash.

So moving the storage creation methods into the table access methods doesn't
work. We may need common access methods that are shared across both
tables and indexes.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Amit Kapila
Date:
On Tue, Oct 16, 2018 at 12:37 AM Alexander Korotkov
<a.korotkov@postgrespro.ru> wrote:
>
> I've examined code in pluggable-zheap branch and EDB github [1] and I
> didn't found anything related to "delete-marking" indexes as stated on
> slide #25 of presentation [2].  So, basically contract between heap
> and indexes is remain unchanged: once you update one indexed field,
> you have to update all the others.
>

Yes, this will be the behavior for the first version.

--
With Regards,
Amit Kapila.
EnterpriseDB: http://www.enterprisedb.com


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Tue, Oct 9, 2018 at 1:46 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Oct 3, 2018 at 3:16 PM Andres Freund <andres@anarazel.de> wrote:
On 2018-09-27 20:03:58 -0700, Andres Freund wrote:
> On 2018-09-28 12:21:08 +1000, Haribabu Kommi wrote:
> > Here I attached further cleanup patches.
> > 1. Re-arrange the GUC variable
> > 2. Added a check function hook for default_table_access_method GUC
>
> Cool.
>
>
> > 3. Added a new hook validate_index. I tried to change the function
> > validate_index_heapscan to slotify, but that have many problems as it
> > is accessing some internals of the heapscandesc structure and accessing
> > the buffer and etc.
>
> Oops, I also did that locally, in a way. I also made a validate a
> callback, as the validation logic is going to be specific to the AMs.
> Sorry for not pushing that up earlier.  I'll try to do that soon,
> there's a fair amount of change.

I've pushed an updated version, with a fair amount of pending changes,
and I hope all your pending (and not redundant, by our concurrent
development), patches merged.
 
Yes, All the patches are merged.

There's currently 3 regression test failures, that I'll look into
tomorrow:
- partition_prune shows a few additional Heap Blocks: exact=1 lines. I'm
  a bit confused as to why, but haven't really investigated yet.
- fast_default fails, because I've undone most of 7636e5c60fea83a9f3c,
  I'll have to redo that in a different way.
- I occasionally see failures in aggregates.sql - I've not figured out
  what's going on there.

I also observed the failure of aggregates.sql, will look into it.

The random failure of aggregates.sql is as follows

  SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
!        avg_32        
! ---------------------
!  32.6666666666666667
  (1 row)

  -- In 7.1, avg(float4) is computed using float8 arithmetic.
--- 8,16 ----
  (1 row)

  SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
!  avg_32 
! --------
!        
  (1 row)

Same NULL result for another aggregate query on column b.

The aggtest table is accessed by two tests that run in parallel,
i.e. aggregates.sql and transactions.sql. In transactions.sql, all the records in
the aggtest table are deleted inside a transaction, and the transaction is then aborted.
I suspect that some visibility checks have a race condition that leads
to no visible records in the aggtest table, thus returning a NULL result.

If I try the scenario manually, by opening a transaction and deleting the records, the
issue does not occur.

I am yet to find the cause of this problem.

 Regards,
Haribabu Kommi

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Oct 16, 2018 at 6:06 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
Hi!

On Wed, Oct 3, 2018 at 8:16 AM Andres Freund <andres@anarazel.de> wrote:
> I've pushed an updated version, with a fair amount of pending changes,
> and I hope all your pending (and not redundant, by our concurrent
> development), patches merged.

I'd like to also share some patches.  I've used current state of
pluggable-zheap for the base of my patches.

Thanks for the review and patches.
 
 * 0001-remove-extra-snapshot-functions.patch – removes
snapshot_satisfiesUpdate() and snapshot_satisfiesVacuum() functions
from tableam API.  snapshot_satisfiesUpdate() was unused completely.
snapshot_satisfiesVacuum()  was used only in heap_copy_for_cluster().
So, I've replaced it with direct heapam_satisfies_vacuum().

Thanks for the correction.
 
 * 0002-add-costing-function-to-API.patch – adds function for costing
sequential and table sample scan to tableam API.  zheap costing
function are now copies of heap costing function.  This should be
adjusted in future.

This patch is missing the new *_cost.c files that add the specific cost
functions.
 
  Estimation for heap lookup during index scans
should be also pluggable, but not yet implemented (TODO).

Yes. Is it possible to use the same API that is added by the above
patch?

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Alexander Korotkov
Date:
On Thu, Oct 18, 2018 at 6:28 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
> On Tue, Oct 16, 2018 at 6:06 AM Alexander Korotkov <a.korotkov@postgrespro.ru> wrote:
>>  * 0002-add-costing-function-to-API.patch – adds function for costing
>> sequential and table sample scan to tableam API.  zheap costing
>> function are now copies of heap costing function.  This should be
>> adjusted in future.
>
> This patch misses the new *_cost.c files that are added specific cost
> functions.

Thank you for noticing.  Revised patchset is attached.

>>   Estimation for heap lookup during index scans
>> should be also pluggable, but not yet implemented (TODO).
>
> Yes, Is it possible to use the same API that is added by above
> patch?

I'm not yet sure.  I'll elaborate more on that.  I'd like to keep the
number of costing functions small.  Handling the costing of index scan
heap fetches will probably require a function signature change.

------
Alexander Korotkov
Postgres Professional: http://www.postgrespro.com
The Russian Postgres Company

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Thu, Oct 18, 2018 at 1:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Tue, Oct 9, 2018 at 1:46 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I also observed the failure of aggregates.sql, will look into it.

The random failure of aggregates.sql is as follows

  SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
!        avg_32        
! ---------------------
!  32.6666666666666667
  (1 row)

  -- In 7.1, avg(float4) is computed using float8 arithmetic.
--- 8,16 ----
  (1 row)

  SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
!  avg_32 
! --------
!        
  (1 row)

Same NULL result for another aggregate query on column b.

The aggtest table is accessed by two tests that are running in parallel.
i.e aggregates.sql and transactions.sql, In transactions.sql, inside a transaction
all the records in the aggtest table are deleted and aborted the transaction,
I suspect that some visibility checks are having some race conditions that leads
to no records on the table aggtest, thus it returns NULL result.

If I try the scenario manually by opening a transaction and deleting the records, the
issue is not occurring.

I am yet to find the cause for this problem.

I am not yet able to create a test case where the above issue occurs reliably for
debugging; it happens randomly. I will try to add some logs to find the problem.

While checking the above problem, I found some corrections:
1. Remove the tableam_common.c file, as it is not used.
2. Remove the extra heap tuple visibility check in the heapgettup_pagemode function.
3. New API for the init fork.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Mon, Oct 22, 2018 at 6:16 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Thu, Oct 18, 2018 at 1:04 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Tue, Oct 9, 2018 at 1:46 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

I also observed the failure of aggregates.sql, will look into it.

The random failure of aggregates.sql is as follows

  SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
!        avg_32        
! ---------------------
!  32.6666666666666667
  (1 row)

  -- In 7.1, avg(float4) is computed using float8 arithmetic.
--- 8,16 ----
  (1 row)

  SELECT avg(a) AS avg_32 FROM aggtest WHERE a < 100;
!  avg_32 
! --------
!        
  (1 row)

Same NULL result for another aggregate query on column b.

The aggtest table is accessed by two tests that are running in parallel.
i.e aggregates.sql and transactions.sql, In transactions.sql, inside a transaction
all the records in the aggtest table are deleted and aborted the transaction,
I suspect that some visibility checks are having some race conditions that leads
to no records on the table aggtest, thus it returns NULL result.

If I try the scenario manually by opening a transaction and deleting the records, the
issue is not occurring.

I am yet to find the cause for this problem.

I am not yet able to generate a test case where the above issue can occur easily for
debugging, it is happening randomly. I will try to add some logs to find out the problem. 

I was able to create a simple test and found the problem. The issue is with the following
SQL.

SELECT *
   INTO TABLE xacttest
   FROM aggtest;

During the processing of the above query, the tuple selected from aggtest is
sent to the intorel_receive() function, and the same tuple is used for the insert. Because
of this, the tuple's xmin is updated, which leads to a failure to select the data from
another query. I fixed this issue by materializing the slot.

During the above test run, I found another issue during analyze: it tries to access
data at an invalid offset. Attached a fix patch.
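The aliasing bug described above can be modeled in miniature. This is a toy sketch, not the real TupleTableSlot code: the slot holds a bare pointer into the scanned tuple, the insert path stamps a new xmin into that shared memory, and materializing (copying into slot-owned memory) is what isolates the two. All names here are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-ins for a heap tuple and a tuple slot */
typedef struct DemoTuple { unsigned xmin; int value; } DemoTuple;
typedef struct DemoSlot  { DemoTuple *tuple; bool owned; } DemoSlot;

/* Copy the referenced tuple into slot-local memory, like
 * materializing a slot, so later writes cannot clobber the source */
static void
demo_materialize(DemoSlot *slot)
{
    if (!slot->owned)
    {
        DemoTuple *copy = malloc(sizeof(DemoTuple));
        memcpy(copy, slot->tuple, sizeof(DemoTuple));
        slot->tuple = copy;
        slot->owned = true;
    }
}

/* The insert path stamps the inserting transaction's xmin */
static void
demo_insert(DemoTuple *tup, unsigned xid)
{
    tup->xmin = xid;
}
```

Without the materialize step, demo_insert() would overwrite the xmin of the tuple the scan still points at, which is the shape of the failure seen in the SELECT INTO case.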

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Oct 23, 2018 at 5:49 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
I am able to generate the simple test and found the problem. The issue with the following
SQL.

SELECT *
   INTO TABLE xacttest
   FROM aggtest;

During the processing of the above query, the tuple that is selected from the aggtest is 
sent to the intorel_receive() function, and the same tuple is used for the insert, because
of this reason, the tuple xmin is updated and it leads to failure of selecting the data from
another query. I fixed this issue by materializing the slot.

The wrong patch was attached in the earlier mail; sorry for the inconvenience.
The proper fix patch is attached.

I will look into isolation test failures.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Oct 23, 2018 at 6:11 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Oct 23, 2018 at 5:49 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
I am able to generate the simple test and found the problem. The issue with the following
SQL.

SELECT *
   INTO TABLE xacttest
   FROM aggtest;

During the processing of the above query, the tuple that is selected from the aggtest is 
sent to the intorel_receive() function, and the same tuple is used for the insert, because
of this reason, the tuple xmin is updated and it leads to failure of selecting the data from
another query. I fixed this issue by materializing the slot.

Wrong patch attached in the earlier mail, sorry for the inconvenience.
Attached proper fix patch.

I will look into isolation test failures.

Here I attached the cumulative patch with all the fixes that I shared in earlier mails.
Except for the fast_default test, the rest of the test failures are fixed.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Fri, 26 Oct 2018 at 13:25, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> Here I attached the cumulative patch with all fixes that are shared in earlier mails by me.
> Except fast_default test, rest of test failures are fixed.

Hi,

If I understand correctly, these patches are for the branch "pluggable-storage"
in [1] (at least I couldn't apply them cleanly to the "pluggable-zheap" branch),
right? I've tried to experiment a bit with the current state of the patch, and
accidentally stumbled upon what seems to be an issue - when I run pgbench
against it with a significant number of clients and script [2]:

    $ pgbench -T 60 -c 128 -j 64 -f zipfian.sql

I've got for some client an error:

    client 117 aborted in command 5 (SQL) of script 0; ERROR:
unrecognized heap_update status: 1

This problem couldn't be reproduced on the master branch, so I've tried to
investigate it. It comes from nodeModifyTable.c:1267, when we've got
HeapTupleInvisible as a result, and this value in turn comes from
table_lock_tuple. Everything points to the new way of handling the HeapTupleUpdated
result from heap_update, where the table_lock_tuple call was introduced. Since I
don't see anything similar in the master branch, can anyone clarify why this
lock is necessary here? Out of curiosity I've rearranged the code that handles
HeapTupleUpdated back to a switch and removed table_lock_tuple (see the attached
patch; it can be applied on top of the latest two patches posted by Haribabu),
and it seems to solve the issue.

[1]: https://github.com/anarazel/postgres-pluggable-storage
[2]: https://gist.github.com/erthalion/c85ba0e12146596d24c572234501e756

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Mon, Oct 29, 2018 at 7:40 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Fri, 26 Oct 2018 at 13:25, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> Here I attached the cumulative patch with all fixes that are shared in earlier mails by me.
> Except fast_default test, rest of test failures are fixed.

Hi,

If I understand correctly, these patches are for the branch "pluggable-storage"
in [1] (at least I couldn't apply them cleanly to "pluggable-zheap" branch),
right?

Yes, the patches attached are for the pluggable-storage branch.
 
I've tried to experiment a bit with the current status of the patch, and
accidentally stumbled upon what seems to be an issue - when I run pgbench
agains it with some significant number of clients and script [2]:

    $ pgbench -T 60 -c 128 -j 64 -f zipfian.sql

Thanks for testing the patches.
 
I've got for some client an error:

    client 117 aborted in command 5 (SQL) of script 0; ERROR:
unrecognized heap_update status: 1

This error is for the tuple state HeapTupleInvisible. As per the comments
in heap_lock_tuple, this is possible in ON CONFLICT UPDATE, but because
of the reorganization of table_lock_tuple out of EvalPlanQual(), the invisible
result is returned in other cases also. This case was missed in the new code.
 
This problem couldn't be reproduced on the master branch, so I've tried to
investigate it. It comes from nodeModifyTable.c:1267, when we've got
HeapTupleInvisible as a result, and this value in turn comes from
table_lock_tuple. Everything points to the new way of handling HeapTupleUpdated
result from heap_update, when table_lock_tuple call was introduced. Since I
don't see anything similar in the master branch, can anyone clarify why is this
lock necessary here?

In the master branch code there is also a tuple lock, which happens in the
EvalPlanQual() function; in the pluggable-storage code the lock is kept outside,
and the function calls are rearranged, to make it easier for table access
methods to provide their own MVCC implementation.
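The rearranged control flow can be sketched roughly as follows. This is a simplified model, not the actual executor code: the enum values and helpers are illustrative stand-ins for the real HTSU_Result / table_lock_tuple / EvalPlanQual machinery.

```c
#include <assert.h>

/* Simplified outcomes of an update attempt */
typedef enum { DEMO_OK, DEMO_UPDATED, DEMO_INVISIBLE } DemoResult;

typedef DemoResult (*demo_update_fn)(int attempt);

static int locks_taken = 0;

/* Stand-in for table_lock_tuple(): lock the newest tuple version */
static DemoResult
demo_lock_latest(void)
{
    locks_taken++;
    return DEMO_OK;
}

/* Executor-side loop: on a concurrent update, the executor itself
 * locks the latest version, rechecks quals, then retries - rather
 * than having the lock taken inside EvalPlanQual() */
static DemoResult
demo_exec_update(demo_update_fn update)
{
    for (int attempt = 0;; attempt++)
    {
        DemoResult r = update(attempt);
        if (r != DEMO_UPDATED)
            return r;
        if (demo_lock_latest() != DEMO_OK)
            return DEMO_INVISIBLE;
        /* EvalPlanQual()-style recheck would run here before retrying */
    }
}

/* Simulated tuple that is concurrently updated exactly once */
static DemoResult
contended_once(int attempt)
{
    return attempt == 0 ? DEMO_UPDATED : DEMO_OK;
}
```

Keeping the lock outside the recheck is what lets each table AM supply its own versioning behavior behind the lock call.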
 
Out of curiosity I've rearranged the code, that handles
HeapTupleUpdated, back to switch and removed table_lock_tuple (see the attached
patch, it can be applied on top of the lastest two patches posted by Haribabu)
and it seems to solve the issue.

Thanks for the patch. I didn't reproduce the problem, but based on the error from
your mail, the attached draft patch handling invisible tuples in the update and
delete cases should also fix it.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Mon, 29 Oct 2018 at 05:56, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>> This problem couldn't be reproduced on the master branch, so I've tried to
>> investigate it. It comes from nodeModifyTable.c:1267, when we've got
>> HeapTupleInvisible as a result, and this value in turn comes from
>> table_lock_tuple. Everything points to the new way of handling HeapTupleUpdated
>> result from heap_update, when table_lock_tuple call was introduced. Since I
>> don't see anything similar in the master branch, can anyone clarify why is this
>> lock necessary here?
>
>
> In the master branch code also, there is a tuple lock that is happening in
> EvalPlanQual() function, but pluggable-storage code, the lock is kept outside
> and also function call rearrangements, to make it easier for the table access
> methods to provide their own MVCC implementation.

Yes, now I see it, thanks. Also I can confirm that the attached patch solves
this issue.

FYI, alongside reviewing the code changes I've run a few performance tests
(that's why I hit this issue with pgbench in the first place). In the case of high
concurrency, so far I see a small performance degradation in comparison with the
master branch (about 2-5% of average latency, depending on the level of
concurrency), but I can't really say why exactly (perf just shows barely
noticeable overhead here and there; maybe what I see is actually a cumulative
impact).


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Wed, Oct 31, 2018 at 9:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> On Mon, 29 Oct 2018 at 05:56, Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
>> This problem couldn't be reproduced on the master branch, so I've tried to
>> investigate it. It comes from nodeModifyTable.c:1267, when we've got
>> HeapTupleInvisible as a result, and this value in turn comes from
>> table_lock_tuple. Everything points to the new way of handling HeapTupleUpdated
>> result from heap_update, when table_lock_tuple call was introduced. Since I
>> don't see anything similar in the master branch, can anyone clarify why is this
>> lock necessary here?
>
>
> In the master branch code also, there is a tuple lock that is happening in
> EvalPlanQual() function, but pluggable-storage code, the lock is kept outside
> and also function call rearrangements, to make it easier for the table access
> methods to provide their own MVCC implementation.

Yes, now I see it, thanks. Also I can confirm that the attached patch solves
this issue.

Thanks for the testing and confirmation.
 
FYI, alongside with reviewing the code changes I've ran few performance tests
(that's why I hit this issue with pgbench in the first place). In case of high
concurrecy so far I see small performance degradation in comparison with the
master branch (about 2-5% of average latency, depending on the level of
concurrency), but can't really say why exactly (perf just shows barely
noticeable overhead there and there, maybe what I see is actually a cumulative
impact).

Thanks for sharing your observation. I will also analyze it and try to find the performance
bottlenecks that are causing the overhead.

Here I attached the cumulative fixes for the patches, the new API additions for zheap, and
a basic outline of the documentation.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Fri, Nov 2, 2018 at 11:17 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
On Wed, Oct 31, 2018 at 9:34 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote: 
FYI, alongside with reviewing the code changes I've ran few performance tests
(that's why I hit this issue with pgbench in the first place). In case of high
concurrecy so far I see small performance degradation in comparison with the
master branch (about 2-5% of average latency, depending on the level of
concurrency), but can't really say why exactly (perf just shows barely
noticeable overhead there and there, maybe what I see is actually a cumulative
impact).

Thanks for sharing your observation, I will also analyze and try to find out performance
bottlenecks that are causing the overhead.

I tried running the pgbench performance tests with a minimal number of clients on my laptop
and I didn't find any performance issues; maybe the issue is visible only with more clients.
Even with the perf tool, I am not able to identify a clear problem function. As you said,
the combination of all the changes leads to some overhead.

Here I attached the cumulative patches with further fixes and also basic syntax regression tests.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Asim R P
Date:
Ashwin (copied) and I got a chance to go through the latest code from
Andres' github repository.  We would like to share some
comments/questions:

The TupleTableSlot argument is well suited for row-oriented storage.
For a column-oriented storage engine, a projection list indicating the
columns to be scanned may be necessary.  Is it possible to share this
information with the current interface?
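One way the question above could be answered is a scan-begin callback that takes a projection set, so a columnar AM fetches only the columns the executor needs. The sketch below is entirely hypothetical: the names and the bitmap representation are invented for illustration, not part of the patch's interface.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical scan state carrying a projection bitmap:
 * bit i set means column i must be returned by the scan */
typedef struct DemoScan
{
    uint64_t projected;
} DemoScan;

/* Begin a scan over the given list of needed column numbers
 * (assumed < 64 for this toy example) */
static DemoScan
demo_scan_begin(const int *cols, int ncols)
{
    DemoScan scan = { 0 };
    for (int i = 0; i < ncols; i++)
        scan.projected |= UINT64_C(1) << cols[i];
    return scan;
}

/* The AM consults this to skip reading unneeded column files */
static int
demo_column_needed(const DemoScan *scan, int col)
{
    return (int) ((scan->projected >> col) & 1);
}
```

A row-oriented AM could simply ignore the projection set, so the extra argument costs heap nothing.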

We realized that DDLs such as heap_create_with_catalog() are not
generalized.  Haribabu's latest patch that adds
SetNewFileNode_function() and CreateInitFork_function() is a step
towards this end.  However, the current API assumes that the storage
engine uses relation forks.  Isn't that too restrictive?

TupleDelete_function() accepts changingPart as a parameter to indicate
if this deletion is part of a movement from one partition to another.
Partitioning is a higher-level abstraction compared to storage.
Ideally, the storage layer should have no knowledge of partitioning. The
tuple delete API should not accept any parameter related to
partitioning.

The API needs to be more accommodating of the block sizes used by
storage engines.  Currently, the same block size as heap seems to be
assumed, as evident from the type of some members of generic scan
object:

typedef struct TableScanDescData
{
  /* state set up at initscan time */
  BlockNumber rs_nblocks;     /* total number of blocks in rel */
  BlockNumber rs_startblock;  /* block # to start at */
  BlockNumber rs_numblocks;   /* max number of blocks to scan */
  /* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
  bool        rs_syncscan;    /* report location to syncscan logic? */
}           TableScanDescData;

Using bytes to represent this information would be more generic. E.g.
rs_startlocation as bytes/offset instead of rs_startblock and so on.

Asim


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Thu, Nov 22, 2018 at 1:12 PM Asim R P <apraveen@pivotal.io> wrote:
Ashwin (copied) and I got a chance to go through the latest code from
Andres' github repository.  We would like to share some
comments/questions:

Thanks for the review.
 
The TupleTableSlot argument is well suited for row-oriented storage.
For a column-oriented storage engine, a projection list indicating the
columns to be scanned may be necessary.  Is it possible to share this
information with the current interface?

Currently all the interfaces are designed for row-oriented storage; as you
said, we need a new API for a projection list. The current patch set itself
is big and needs to be stabilized first; the new APIs that will be useful
for columnar storage will be added in the next set of patches.
 
We realized that DDLs such as heap_create_with_catalog() are not
generalized.  Haribabu's latest patch that adds
SetNewFileNode_function() and CreateInitFork_function() is a step
towards this end.  However, the current API assumes that the storage
engine uses relation forks.  Isn't that too restrictive?

The current set of APIs makes many assumptions and builds on the existing
framework. Thanks for your point; I will check how to enhance it.
 
TupleDelete_function() accepts changingPart as a parameter to indicate
if this deletion is part of a movement from one partition to another.
Partitioning is a higher level abstraction as compared to storage.
Ideally, storage layer should have no knowledge of partitioning. The
tuple delete API should not accept any parameter related to
partitioning.

Thanks for your point; I will look into how to extract that.
 
The API needs to be more accommodating towards block sizes used in
storage engines.  Currently, the same block size as heap seems to be
assumed, as evident from the type of some members of generic scan
object:

typedef struct TableScanDescData
{
  /* state set up at initscan time */
  BlockNumber rs_nblocks;     /* total number of blocks in rel */
  BlockNumber rs_startblock;  /* block # to start at */
  BlockNumber rs_numblocks;   /* max number of blocks to scan */
  /* rs_numblocks is usually InvalidBlockNumber, meaning "scan whole rel" */
  bool        rs_syncscan;    /* report location to syncscan logic? */
}           TableScanDescData;

Using bytes to represent this information would be more generic. E.g.
rs_startlocation as bytes/offset instead of rs_startblock and so on.

I suspect this is not the only place that needs a change to support
different block sizes for different storage interfaces. Thanks for your
point; this can definitely be taken care of in the next set of patches.

Andres, now that the TupleTableSlot changes are committed, do you want me to
share the rebased pluggable storage patch, or are you already working on it?

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

FWIW, now that oids are removed, and the tuple table slot abstraction
got in, I'm working on rebasing the pluggable storage patchset on top of
that.


On 2018-11-27 12:48:36 +1100, Haribabu Kommi wrote:
> On Thu, Nov 22, 2018 at 1:12 PM Asim R P <apraveen@pivotal.io> wrote:
> 
> > Ashwin (copied) and I got a chance to go through the latest code from
> > Andres' github repository.  We would like to share some
> > comments/questions:
> >
> 
> Thanks for the review.
> 
> 
> > The TupleTableSlot argument is well suited for row-oriented storage.
> > For a column-oriented storage engine, a projection list indicating the
> > columns to be scanned may be necessary.  Is it possible to share this
> > information with the current interface?
> >
> 
> Currently all the interfaces are designed for row-oriented storage; as you
> said, we need a new API for a projection list. The current patch set itself
> is big and needs to be stabilized first; the new APIs that will be useful
> for columnar storage will be added in the next set of patches.

Precisely.


> > TupleDelete_function() accepts changingPart as a parameter to indicate
> > if this deletion is part of a movement from one partition to another.
> > Partitioning is a higher-level abstraction than storage.
> > Ideally, the storage layer should have no knowledge of partitioning. The
> > tuple delete API should not accept any parameter related to
> > partitioning.
> >
> 
> Thanks for your point; I will look into how to extract that.

I don't think that's actually a problem. The changingPart parameter is
just a marker that the deletion is part of moving a tuple across
partitions. For heap, and everything compatible with it, it is used to
mark the tuple so that concurrent modifications error out when reaching
such a tuple via EPQ.
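Roughly, with stub types rather than the actual heap code, the behaviour is: the marker is recorded only when changingPart is true, and a concurrent modification that follows the update chain to such a tuple reports a distinct result so the caller can raise an error:

```c
#include <assert.h>
#include <stdbool.h>

/* Stub result codes; the heap AM signals this state differently. */
typedef enum
{
    TM_Ok,
    TM_Deleted,
    TM_MovedAcrossPartitions
} TM_Result;

typedef struct StubTuple
{
    bool        deleted;
    bool        moved_across_partitions;    /* set only when changingPart */
} StubTuple;

static void
stub_tuple_delete(StubTuple *tup, bool changingPart)
{
    tup->deleted = true;
    tup->moved_across_partitions = changingPart;
}

/* What a concurrent modification sees when it reaches the tuple. */
static TM_Result
stub_follow_update_chain(const StubTuple *tup)
{
    if (!tup->deleted)
        return TM_Ok;
    return tup->moved_across_partitions ? TM_MovedAcrossPartitions : TM_Deleted;
}
```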


> Andres, now that the TupleTableSlot changes are committed, do you want me to
> share the rebased pluggable storage patch, or are you already working on it?

Working on it.


Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Amit Langote
Date:
Hi,

On 2018/11/02 9:17, Haribabu Kommi wrote:
> Here I attached the cumulative fixes of the patches, new API additions for
> zheap and
> basic outline of the documentation.

I've read the documentation patch while also looking at the code and here
are some comments.

+   Each table is stored as its own physical
<firstterm>relation</firstterm> and so
+   is described by an entry in the <structname>pg_class</structname> catalog.

I think the "so" in "and so is described by an entry in..." is not necessary.

+   The contents of an table are entirely under the control of its access
method.

"a" table

+   (All the access methods furthermore use the standard page layout
described in
+   <xref linkend="storage-page-layout"/>.)

Maybe write the two sentences above as:

A table's content is entirely controlled by its access method, although
all access methods use the same standard page layout described in <xref
linkend="storage-page-layout"/>.

+    SlotCallbacks_function slot_callbacks;
+
+    SnapshotSatisfies_function snapshot_satisfies;
+    SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate;
+    SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum;

Like other functions, how about a one sentence comment for these, like:

/*
 * Function to get an AM-specific set of functions for manipulating
 * TupleTableSlots
 */
SlotCallbacks_function slot_callbacks;

/* AM-specific snapshot visibility determination functions */
SnapshotSatisfies_function snapshot_satisfies;
SnapshotSatisfiesUpdate_function snapshot_satisfiesUpdate;
SnapshotSatisfiesVacuum_function snapshot_satisfiesVacuum;

+    TupleFetchFollow_function tuple_fetch_follow;
+
+    GetTupleData_function get_tuple_data;

How about removing the empty line so that get_tuple_data can be seen as
part of the group /* Operations on physical tuples */

+    RelationVacuum_function relation_vacuum;
+    RelationScanAnalyzeNextBlock_function scan_analyze_next_block;
+    RelationScanAnalyzeNextTuple_function scan_analyze_next_tuple;
+    RelationCopyForCluster_function relation_copy_for_cluster;
+    RelationSync_function relation_sync;

Add /* Operations to support VACUUM/ANALYZE */ as a description for this
group?

+    BitmapPagescan_function scan_bitmap_pagescan;
+    BitmapPagescanNext_function scan_bitmap_pagescan_next;

Add /* Operations to support bitmap scans */ as a description for this group?

+    SampleScanNextBlock_function scan_sample_next_block;
+    SampleScanNextTuple_function scan_sample_next_tuple;

Add /* Operations to support sampling scans */ as a description for this
group?

+    ScanEnd_function scan_end;
+    ScanRescan_function scan_rescan;
+    ScanUpdateSnapshot_function scan_update_snapshot;

Move these two to be in the /* Operations on relation scans */ group?

+    BeginIndexFetchTable_function begin_index_fetch;
+    EndIndexFetchTable_function reset_index_fetch;
+    EndIndexFetchTable_function end_index_fetch;

Add /* Operations to support index scans */ as a description for this group?

+    IndexBuildRangeScan_function index_build_range_scan;
+    IndexValidateScan_function index_validate_scan;

Add /* Operations to support index build */ as a description for this group?

+    CreateInitFork_function CreateInitFork;

Add /* Function to create an init fork for unlogged tables */?

By the way, I can see the following two in the source code, but not in the
documentation.

    EstimateRelSize_function EstimateRelSize;
    SetNewFileNode_function SetNewFileNode;


+   The table construction and maintenance functions that an table access
+   method must provide in <structname>TableAmRoutine</structname> are:

"a" table access method

+  <para>
+<programlisting>
+TupleTableSlotOps *
+slot_callbacks (Relation relation);
+</programlisting>
+   API to access the slot specific methods;
+   Following methods are available;
+   <structname>TTSOpsVirtual</structname>,
+   <structname>TTSOpsHeapTuple</structname>,
+   <structname>TTSOpsMinimalTuple</structname>,
+   <structname>TTSOpsBufferTuple</structname>,
+  </para>

Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
its relation to the TableAmRoutine abstraction, I think the text
description could better be written as:

"API to get the slot operations struct for a given table access method"

It's not clear to me why the various TTSOps* structs are listed here.  Is the
point that different AMs may choose one of the listed alternatives?  For
example, I see that the heap AM implementation returns TTSOpsBufferTuple, so it
manipulates slots containing buffered tuples, right?  Other AMs are free
to return any one of these?  For example, some AMs may never use the buffer
manager and hence not use TTSOpsBufferTuple.  Is that understanding correct?
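To illustrate that reading with stub types (the real TupleTableSlotOps structs and AM callbacks look different), each AM would simply return whichever ops struct matches its tuple representation:

```c
#include <assert.h>

/* Stub stand-ins for TupleTableSlotOps and the per-AM slot_callbacks. */
typedef struct StubSlotOps
{
    const char *name;
} StubSlotOps;

static const StubSlotOps StubOpsBufferTuple = {"buffer"};
static const StubSlotOps StubOpsVirtual = {"virtual"};

/* An AM built on the buffer manager returns buffered-tuple ops... */
static const StubSlotOps *
heap_slot_callbacks(void)
{
    return &StubOpsBufferTuple;
}

/* ...while an AM that never pins buffers could return virtual-slot ops. */
static const StubSlotOps *
inmemory_slot_callbacks(void)
{
    return &StubOpsVirtual;
}
```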

+  <para>
+<programlisting>
+bool
+snapshot_satisfies (TupleTableSlot *slot, Snapshot snapshot);
+</programlisting>
+   API to check whether the provided slot is visible to the current
+   transaction according the snapshot.
+  </para>

Do you mean:

"API to check whether the tuple contained in the provided slot is
visible...."?

+  <para>
+<programlisting>
+Oid
+tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid,
+              int options, BulkInsertState bistate);
+</programlisting>
+   API to insert the tuple and provide the <literal>ItemPointerData</literal>
+   where the tuple is successfully inserted.
+  </para>

It's not clear from the signature where you get the ItemPointerData.
Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think
this should mention it a bit differently, like:

API to insert the tuple contained in the provided slot and return its TID,
that is, the location where the tuple is successfully inserted

+   API to insert the tuple with a speculative token. This API is similar
+   like <literal>tuple_insert</literal>, with additional speculative
+   information.

How about:

This API is similar to <literal>tuple_insert</literal>, although with
additional information necessary for speculative insertion

+  <para>
+<programlisting>
+void
+tuple_complete_speculative (Relation rel,
+                          TupleTableSlot *slot,
+                          uint32 specToken,
+                          bool succeeded);
+</programlisting>
+   API to complete the state of the tuple inserted by the API
<literal>tuple_insert_speculative</literal>
+   with the successful completion of the index insert.
+  </para>

How about:

API to complete the speculative insertion of a tuple started by
<literal>tuple_insert_speculative</literal>, invoked after finishing the
index insert

+  <para>
+<programlisting>
+bool
+tuple_fetch_row_version (Relation relation,
+                       ItemPointer tid,
+                       Snapshot snapshot,
+                       TupleTableSlot *slot,
+                       Relation stats_relation);
+</programlisting>
+   API to fetch and store the Buffered Heap tuple in the provided slot
+   based on the ItemPointer.
+  </para>

It seems that this description is based on what heapam_fetch_row_version()
does, but it should be more generic, maybe like:

API to fetch a buffered tuple given its TID and store it in the provided slot

+  <para>
+<programlisting>
+HTSU_Result
+TupleLock_function (Relation relation,
+                   ItemPointer tid,
+                   Snapshot snapshot,
+                   TupleTableSlot *slot,
+                   CommandId cid,
+                   LockTupleMode mode,
+                   LockWaitPolicy wait_policy,
+                   uint8 flags,
+                   HeapUpdateFailureData *hufd);

I guess you meant to write "tuple_lock" here, not "TupleLock_function".

+</programlisting>
+   API to lock the specified the ItemPointer tuple and fetches the newest
version of
+   its tuple and TID.
+  </para>

How about:

API to lock the specified tuple and return the TID of its newest version

+  <para>
+<programlisting>
+void
+tuple_get_latest_tid (Relation relation,
+                    Snapshot snapshot,
+                    ItemPointer tid);
+</programlisting>
+   API to get the the latest TID of the tuple with the given itempointer.
+  </para>

How about:

API to get TID of the latest version of the specified tuple

+  <para>
+<programlisting>
+bool
+tuple_fetch_follow (struct IndexFetchTableData *scan,
+                  ItemPointer tid,
+                  Snapshot snapshot,
+                  TupleTableSlot *slot,
+                  bool *call_again, bool *all_dead);
+</programlisting>
+   API to get the all the tuples of the page that satisfies itempointer.
+  </para>

IIUC, "all the tuples of the page" in the above sentence means all the
tuples in the HOT chain of a given heap tuple, making this description of
the API slightly specific to the heap AM.  Can we make the description
more generic or is the API itself very specific that it cannot be
expressed in generic terms?  Ignoring that for a moment, I think the
sentence contains more "the"s than there need to be, so maybe write as:

API to get all tuples on a given page that are linked to the tuple of the
given TID

+  <para>
+<programlisting>
+tuple_data
+get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags);
+</programlisting>
+   API to return the internal structure members of the HeapTuple.
+  </para>

I think this description doesn't mention enough details of both the
information that needs to be specified when calling the function (what's
in flags) and the information that's returned.

+  <para>
+<programlisting>
+bool
+scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin,
+                      double *liverows, double *deadrows, TupleTableSlot
*slot));
+</programlisting>
+   API to analyze the block and fill the buffered heap tuple in the slot
and also
+   provide the live and dead rows.
+  </para>

How about:

API to get the next tuple from the block being scanned, which also updates
the number of live and dead rows encountered

+  <para>
+<programlisting>
+void
+relation_copy_for_cluster (Relation NewHeap, Relation OldHeap, Relation
OldIndex,
+                       bool use_sort,
+                       TransactionId OldestXmin, TransactionId FreezeXid,
MultiXactId MultiXactCutoff,
+                       double *num_tuples, double *tups_vacuumed, double
*tups_recently_dead);
+</programlisting>
+   API to copy one relation to another relation eith using the Index or
table scan.
+  </para>

Typo: eith -> either

But maybe, rewrite this as:

API to make a copy of the content of a relation, optionally sorted using
either the specified index or by sorting explicitly

+  <para>
+<programlisting>
+TableScanDesc
+scan_begin (Relation relation,
+            Snapshot snapshot,
+            int nkeys, ScanKey key,
+            ParallelTableScanDesc parallel_scan,
+            bool allow_strat,
+            bool allow_sync,
+            bool allow_pagemode,
+            bool is_bitmapscan,
+            bool is_samplescan,
+            bool temp_snap);
+</programlisting>
+   API to start the relation scan for the provided relation and returns the
+   <structname>TableScanDesc</structname> structure.
+  </para>

How about:

API to start a scan of a relation using specified options, which returns
the <structname>TableScanDesc</structname> structure to be used for
subsequent scan operations

+    <para>
+<programlisting>
+void
+scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber
numBlks);
+</programlisting>
+   API to fix the relation scan range limits.
+  </para>


How about:

API to set scan range endpoints

+    <para>
+<programlisting>
+bool
+scan_bitmap_pagescan (TableScanDesc scan,
+                    TBMIterateResult *tbmres);
+</programlisting>
+   API to scan the relation and fill the scan description bitmap with
valid item pointers
+   for the specified block.
+  </para>

This says "to scan the relation", but seems to be concerned with only a
page worth of data as the name also says.  Also, it's not clear what "scan
description bitmap" means.  Maybe write as:

API to scan the relation block specified in the scan descriptor to collect
and return the tuples requested by the given bitmap

+    <para>
+<programlisting>
+bool
+scan_bitmap_pagescan_next (TableScanDesc scan,
+                        TupleTableSlot *slot);
+</programlisting>
+   API to fill the buffered heap tuple data from the bitmap scanned item
pointers and store
+   it in the provided slot.
+  </para>

How about:

API to select the next tuple from the set of tuples of a given page
specified in the scan descriptor and return in the provided slot; returns
false if no more tuples to return on the given page

+    <para>
+<programlisting>
+bool
+scan_sample_next_block (TableScanDesc scan, struct SampleScanState
*scanstate);
+</programlisting>
+   API to scan the relation and fill the scan description bitmap with
valid item pointers
+   for the specified block provided by the sample method.
+  </para>

Looking at the code, this API selects the next block using the sampling
method and nothing more, although I see that the heap AM implementation
also does heapgetpage thus collecting live tuples in the array known only
to heap AM.  So, how about:

API to select the next block of the relation using the given sampling
method and set its information in the scan descriptor

+    <para>
+<programlisting>
+bool
+scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState
*scanstate, TupleTableSlot *slot);
+</programlisting>
+   API to fill the buffered heap tuple data from the bitmap scanned item
pointers based on the sample
+   method and store it in the provided slot.
+  </para>

How about:

API to select the next tuple using the given sampling method from the set
of tuples collected from the block previously selected by the sampling method

+    <para>
+<programlisting>
+void
+scan_rescan (TableScanDesc scan, ScanKey key, bool set_params,
+             bool allow_strat, bool allow_sync, bool allow_pagemode);
+</programlisting>
+   API to restart the relation scan with provided data.
+  </para>

How about:

API to restart the given scan using provided options, releasing any
resources (such as buffer pins) already held by the scan

+  <para>
+<programlisting>
+void
+scan_update_snapshot (TableScanDesc scan, Snapshot snapshot);
+</programlisting>
+   API to update the relation scan with the new snapshot.
+  </para>

How about:

API to set the visibility snapshot to be used by a given scan

+  <para>
+<programlisting>
+IndexFetchTableData *
+begin_index_fetch (Relation relation);
+</programlisting>
+   API to prepare the <structname>IndexFetchTableData</structname> for
the relation.
+  </para>

This API is a bit vague.  As in, it's not clear from the name when it's to
be called and what's to be done with the returned struct.  How about at
least adding more details about what the returned struct is for, like:
    
API to get the <structname>IndexFetchTableData</structname> to be assigned
to an index scan on the specified relation

+  <para>
+<programlisting>
+void
+reset_index_fetch (struct IndexFetchTableData* data);
+</programlisting>
+   API to reset the prepared internal members of the
<structname>IndexFetchTableData</structname>.
+  </para>

This description seems wrong if I look at the code.  Its purpose seems to
be to reset the AM-specific members, such as releasing the buffer pin held in
xs_cbuf in the heap AM's case.

How about:

API to release AM-specific resources held by the
<structname>IndexFetchTableData</structname> of a given index scan

+  <para>
+<programlisting>
+void
+end_index_fetch (struct IndexFetchTableData* data);
+</programlisting>
+   API to clear and free the <structname>IndexFetchTableData</structname>.
+  </para>

Given above, how about:

API to release AM-specific resources held by the
<structname>IndexFetchTableData</structname> of a given index scan and
free the memory of <structname>IndexFetchTableData</structname> itself

+    <para>
+<programlisting>
+double
+index_build_range_scan (Relation heapRelation,
+                       Relation indexRelation,
+                       IndexInfo *indexInfo,
+                       bool allow_sync,
+                       bool anyvisible,
+                       BlockNumber start_blockno,
+                       BlockNumber end_blockno,
+                       IndexBuildCallback callback,
+                       void *callback_state,
+                       TableScanDesc scan);
+</programlisting>
+   API to perform the table scan with bounded range specified by the caller
+   and insert the satisfied records into the index using the provided
callback
+   function pointer.
+  </para>

This is a bit heavy API and the above description lacks some details.
Also, isn't it a bit misleading to use the name end_blockno if it is
interpreted as num_blocks by the internal APIs?

How about:

API to scan the specified blocks of the given table and insert them into
the specified index using the provided callback function
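To spell out the end_blockno/num_blocks ambiguity with plain arithmetic (not the actual API): the same pair of arguments describes different ranges depending on the interpretation:

```c
#include <assert.h>
#include <stdint.h>

typedef uint32_t BlockNumber;

/* Blocks visited if the last argument is an exclusive end block number. */
static BlockNumber
blocks_if_end_blockno(BlockNumber start, BlockNumber end)
{
    return end - start;
}

/* Blocks visited if the last argument is a count, as the internals assume. */
static BlockNumber
blocks_if_num_blocks(BlockNumber start, BlockNumber num)
{
    (void) start;               /* count is independent of the start block */
    return num;
}
```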

+    <para>
+<programlisting>
+void
+index_validate_scan (Relation heapRelation,
+                   Relation indexRelation,
+                   IndexInfo *indexInfo,
+                   Snapshot snapshot,
+                   struct ValidateIndexState *state);
+</programlisting>
+   API to perform the table scan and insert the satisfied records into
the index.
+   This API is similar like <function>index_build_range_scan</function>.
This
+   is used in the scenario of concurrent index build.
+  </para>

This one's a complicated API too.  How about:

API to scan the table according to the given snapshot and insert tuples
satisfying the snapshot into the specified index, provided their TIDs are
also present in the <structname>ValidateIndexState</structname> struct;
this API is used as the last phase of a concurrent index build

+ <sect2>
+  <title>Table scanning</title>
+
+  <para>
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>Table insert/update/delete</title>
+
+  <para>
+  </para>
+  </sect2>
+
+ <sect2>
+  <title>Table locking</title>
+
+  <para>
+  </para>
+  </sect2>
+
+ <sect2>
+  <title>Table vacuum</title>
+
+  <para>
+  </para>
+ </sect2>
+
+ <sect2>
+  <title>Table fetch</title>
+
+  <para>
+  </para>
+ </sect2>


Seems like you forgot to put the individual API descriptions under these
sub-headers.  Actually, I think it'd be better to try to format this page
to looks more like the following:

https://www.postgresql.org/docs/devel/fdw-callbacks.html


-   Currently, only indexes have access methods.  The requirements for index
-   access methods are discussed in detail in <xref linkend="indexam"/>.
+   Currently, only <literal>INDEX</literal> and <literal>TABLE</literal> have
+   access methods.  The requirements for access methods are discussed in
detail
+   in <xref linkend="am"/>.

Hmm, I don't see why you decided to add literal tags to INDEX and TABLE.
Couldn't this have been written as:

Currently, only tables and indexes have access methods.  The requirements
for access methods are discussed in detail in <xref linkend="am"/>.

+        This variable specifies the default table access method using
which to create
+        objects (tables and materialized views) when a
<command>CREATE</command> command does
+        not explicitly specify a access method.

"variable" is not wrong, but "parameter" is used more often for GUCs.  "a
access method" should be "an access method".

Maybe you could write this as:

This variable specifies the default table access method to use when
creating tables or materialized views if the <command>CREATE</command>
does not explicitly specify an access method.

+        If the value does not match the name of any existing table access
methods,
+        <productname>PostgreSQL</productname> will automatically use the
default
+        table access method of the current database.

any existing table access methods -> any existing table access method

Although, shouldn't that cause an error instead of silently falling back to
the database's default table access method?
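The alternative behaviour, sketched with stub lookup logic (hypothetical, not the patch's code): returning failure for an unknown name lets the caller raise an error rather than silently substituting the default:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stub registry of known table AM names. */
static const char *known_ams[] = {"heap", "zheap", NULL};

/*
 * Returns the canonical AM name, or NULL for an unknown name, so the
 * caller can report an error instead of using the database default.
 */
static const char *
lookup_table_am(const char *name)
{
    for (int i = 0; known_ams[i] != NULL; i++)
    {
        if (strcmp(known_ams[i], name) == 0)
            return known_ams[i];
    }
    return NULL;
}
```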


Thank you for working on this.  Really looking forward to how this shapes
up. :)

Thanks,
Amit



Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Fri, Nov 16, 2018 at 2:05 AM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:
>
> I tried running the pgbench performance tests with a minimal number of clients on my
> laptop and didn't find any performance issues; maybe the issue is visible only with
> more clients. Even with the perf tool, I am not able to pinpoint a clearly problematic
> function. As you said, combining all the changes leads to some overhead.

Just out of curiosity I've also tried tpc-c from oltpbench (in the very same
simple environment); it doesn't show any significant difference from master
either.

> Here I have attached the cumulative patches with further fixes, along with basic syntax regression tests.

While testing the latest version I noticed that you didn't include the fix
for HeapTupleInvisible (so I see the error again). Was that intentional?

> On Tue, Nov 27, 2018 at 2:55 AM Andres Freund <andres@anarazel.de> wrote:
>
> FWIW, now that oids are removed, and the tuple table slot abstraction
> got in, I'm working on rebasing the pluggable storage patchset on top of
> that.

Yes, please. I've tried it myself for reviewing purposes, but the rebasing
went slowly. I also want to suggest moving it off github and posting a
regular patchset, since it's already a bit confusing what goes where and
which patch to apply on top of which branch.


Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

Thanks for these changes. I've merged a good chunk of them.

On 2018-11-16 12:05:26 +1100, Haribabu Kommi wrote:
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index c3960dc91f..3254e30a45 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1741,7 +1741,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>  {
>      HeapScanDesc scan = (HeapScanDesc) sscan;
>      Page        targpage;
> -    OffsetNumber targoffset = scan->rs_cindex;
> +    OffsetNumber targoffset;
>      OffsetNumber maxoffset;
>      BufferHeapTupleTableSlot *hslot;
>  
> @@ -1751,7 +1751,9 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>      maxoffset = PageGetMaxOffsetNumber(targpage);
>  
>      /* Inner loop over all tuples on the selected page */
> -    for (targoffset = scan->rs_cindex; targoffset <= maxoffset; targoffset++)
> +    for (targoffset = scan->rs_cindex ? scan->rs_cindex : FirstOffsetNumber;
> +            targoffset <= maxoffset;
> +            targoffset++)
>      {
>          ItemId        itemid;
>          HeapTuple    targtuple = &hslot->base.tupdata;

I thought it was better to fix the initialization for rs_cindex - any
reason you didn't go for that?


> diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
> index 8233475aa0..7bad246f55 100644
> --- a/src/backend/access/heap/heapam_visibility.c
> +++ b/src/backend/access/heap/heapam_visibility.c
> @@ -1838,8 +1838,10 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer)
>          case NON_VACUUMABLE_VISIBILTY:
>              return HeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer);
>              break;
> -        default:
> +        case END_OF_VISIBILITY:
>              Assert(0);
>              break;
>      }
> +
> +    return false; /* keep compiler quiet */

I don't understand why END_OF_VISIBILITY is a good idea.  I've now removed
END_OF_VISIBILITY, and the default case.



> @@ -593,6 +594,10 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
>      if (myState->rel->rd_rel->relhasoids)
>          slot->tts_tupleOid = InvalidOid;
>  
> +    /* Materialize the slot */
> +    if (!TTS_IS_VIRTUAL(slot))
> +        ExecMaterializeSlot(slot);
> +
>      table_insert(myState->rel,
>                   slot,
>                   myState->output_cid,

What's the point of adding materialization here?

> @@ -570,6 +563,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
>                  Assert(TTS_IS_HEAPTUPLE(scanslot) ||
>                         TTS_IS_BUFFERTUPLE(scanslot));
>  
> +                if (hslot->tuple == NULL)
> +                    ExecMaterializeSlot(scanslot);
> +
>                  d = heap_getsysattr(hslot->tuple, attnum,
>                                      scanslot->tts_tupleDescriptor,
>                                      op->resnull);

Same?


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index e055c0a7c6..34ef86a5bd 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -2594,7 +2594,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
>       * datums that may be present in copyTuple).  As with the next step, this
>       * is to guard against early re-use of the EPQ query.
>       */
> -    if (!TupIsNull(slot))
> +    if (!TupIsNull(slot) && !TTS_IS_VIRTUAL(slot))
>          ExecMaterializeSlot(slot);


Same?


>  #if FIXME
> @@ -2787,16 +2787,7 @@ EvalPlanQualFetchRowMarks(EPQState *epqstate)
>              if (isNull)
>                  continue;
>  
> -            elog(ERROR, "frak, need to implement ROW_MARK_COPY");
> -#ifdef FIXME
> -            // FIXME: this should just deform the tuple and store it as a
> -            // virtual one.
> -            tuple = table_tuple_by_datum(erm->relation, datum, erm->relid);
> -
> -            /* store tuple */
> -            EvalPlanQualSetTuple(epqstate, erm->rti, tuple);
> -#endif
> -
> +            ExecForceStoreHeapTupleDatum(datum, slot);
>          }
>      }
>  }

Cool.


> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 56880e3d16..36ca07beb2 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -224,6 +224,18 @@ BitmapHeapNext(BitmapHeapScanState *node)
>  
>              BitmapAdjustPrefetchIterator(node, tbmres);
>  
> +            /*
> +             * Ignore any claimed entries past what we think is the end of the
> +             * relation.  (This is probably not necessary given that we got at
> +             * least AccessShareLock on the table before performing any of the
> +             * indexscans, but let's be safe.)
> +             */
> +            if (tbmres->blockno >= scan->rs_nblocks)
> +            {
> +                node->tbmres = tbmres = NULL;
> +                continue;
> +            }
> +

I moved this into the storage engine; there was just a minor bug
preventing the already existing check from taking effect. I don't think
we should expose this kind of thing to the outside of the storage
engine.


> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 54382aba88..ea48e1d6e8 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -4037,7 +4037,6 @@ CreateStatsStmt:
>   *
>   *****************************************************************************/
>  
> -// PBORKED: storage option
>  CreateAsStmt:
>          CREATE OptTemp TABLE create_as_target AS SelectStmt opt_with_data
>                  {
> @@ -4068,14 +4067,16 @@ CreateAsStmt:
>          ;
>  
>  create_as_target:
> -            qualified_name opt_column_list OptWith OnCommitOption OptTableSpace
> +            qualified_name opt_column_list table_access_method_clause
> +            OptWith OnCommitOption OptTableSpace
>                  {
>                      $$ = makeNode(IntoClause);
>                      $$->rel = $1;
>                      $$->colNames = $2;
> -                    $$->options = $3;
> -                    $$->onCommit = $4;
> -                    $$->tableSpaceName = $5;
> +                    $$->accessMethod = $3;
> +                    $$->options = $4;
> +                    $$->onCommit = $5;
> +                    $$->tableSpaceName = $6;
>                      $$->viewQuery = NULL;
>                      $$->skipData = false;        /* might get changed later */
>                  }
> @@ -4125,14 +4126,15 @@ CreateMatViewStmt:
>          ;
>  
>  create_mv_target:
> -            qualified_name opt_column_list opt_reloptions OptTableSpace
> +            qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
>                  {
>                      $$ = makeNode(IntoClause);
>                      $$->rel = $1;
>                      $$->colNames = $2;
> -                    $$->options = $3;
> +                    $$->accessMethod = $3;
> +                    $$->options = $4;
>                      $$->onCommit = ONCOMMIT_NOOP;
> -                    $$->tableSpaceName = $4;
> +                    $$->tableSpaceName = $5;
>                      $$->viewQuery = NULL;        /* filled at analysis time */
>                      $$->skipData = false;        /* might get changed later */
>                  }

Cool. I wonder if we should also somehow support SELECT INTO w/ USING?
You've apparently started to do so?


> diff --git a/src/test/regress/expected/create_am.out b/src/test/regress/expected/create_am.out
> index 47dd885c4e..a4094ca3f1 100644
> --- a/src/test/regress/expected/create_am.out
> +++ b/src/test/regress/expected/create_am.out
> @@ -99,3 +99,81 @@ HINT:  Use DROP ... CASCADE to drop the dependent objects too.
>  -- Drop access method cascade
>  DROP ACCESS METHOD gist2 CASCADE;
>  NOTICE:  drop cascades to index grect2ind2
> +-- Create a heap2 table am handler with heapam handler
> +CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
> +SELECT * FROM pg_am where amtype = 't';
> + amname |      amhandler       | amtype 
> +--------+----------------------+--------
> + heap   | heap_tableam_handler | t
> + heap2  | heap_tableam_handler | t
> +(2 rows)
> +
> +CREATE TABLE tbl_heap2(f1 int, f2 char(100)) using heap2;
> +INSERT INTO tbl_heap2 VALUES(generate_series(1,10), 'Test series');
> +SELECT count(*) FROM tbl_heap2;
> + count 
> +-------
> +    10
> +(1 row)
> +
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'tbl_heap2';
> +  relname  | relkind | amname 
> +-----------+---------+--------
> + tbl_heap2 | r       | heap2
> +(1 row)
> +
> +-- create table as using heap2
> +CREATE TABLE tblas_heap2 using heap2 AS select * from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'tblas_heap2';
> +   relname   | relkind | amname 
> +-------------+---------+--------
> + tblas_heap2 | r       | heap2
> +(1 row)
> +
> +--
> +-- select into doesn't support new syntax, so it should be
> +-- default access method.
> +--
> +SELECT INTO tblselectinto_heap from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'tblselectinto_heap';
> +      relname       | relkind | amname 
> +--------------------+---------+--------
> + tblselectinto_heap | r       | heap
> +(1 row)
> +
> +DROP TABLE tblselectinto_heap;
> +-- create materialized view using heap2
> +CREATE MATERIALIZED VIEW mv_heap2 USING heap2 AS
> +        SELECT * FROM tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'mv_heap2';
> + relname  | relkind | amname 
> +----------+---------+--------
> + mv_heap2 | m       | heap2
> +(1 row)
> +
> +-- Try creating the unsupported relation kinds with using syntax
> +CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2;
> +ERROR:  syntax error at or near "USING"
> +LINE 1: CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2...
> +                              ^
> +CREATE SEQUENCE test_seq USING heap2;
> +ERROR:  syntax error at or near "USING"
> +LINE 1: CREATE SEQUENCE test_seq USING heap2;
> +                                 ^
> +-- Drop table access method, but fails as objects depends on it
> +DROP ACCESS METHOD heap2;
> +ERROR:  cannot drop access method heap2 because other objects depend on it
> +DETAIL:  table tbl_heap2 depends on access method heap2
> +table tblas_heap2 depends on access method heap2
> +materialized view mv_heap2 depends on access method heap2
> +HINT:  Use DROP ... CASCADE to drop the dependent objects too.
> +-- Drop table access method with cascade
> +DROP ACCESS METHOD heap2 CASCADE;
> +NOTICE:  drop cascades to 3 other objects
> +DETAIL:  drop cascades to table tbl_heap2
> +drop cascades to table tblas_heap2
> +drop cascades to materialized view mv_heap2
> diff --git a/src/test/regress/sql/create_am.sql b/src/test/regress/sql/create_am.sql
> index 3e0ac104f3..0472a60f20 100644
> --- a/src/test/regress/sql/create_am.sql
> +++ b/src/test/regress/sql/create_am.sql
> @@ -66,3 +66,49 @@ DROP ACCESS METHOD gist2;
>  
>  -- Drop access method cascade
>  DROP ACCESS METHOD gist2 CASCADE;
> +
> +-- Create a heap2 table am handler with heapam handler
> +CREATE ACCESS METHOD heap2 TYPE TABLE HANDLER heap_tableam_handler;
> +
> +SELECT * FROM pg_am where amtype = 't';
> +
> +CREATE TABLE tbl_heap2(f1 int, f2 char(100)) using heap2;
> +INSERT INTO tbl_heap2 VALUES(generate_series(1,10), 'Test series');
> +SELECT count(*) FROM tbl_heap2;
> +
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'tbl_heap2';
> +
> +-- create table as using heap2
> +CREATE TABLE tblas_heap2 using heap2 AS select * from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'tblas_heap2';
> +
> +--
> +-- select into doesn't support new syntax, so it should be
> +-- default access method.
> +--
> +SELECT INTO tblselectinto_heap from tbl_heap2;
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'tblselectinto_heap';
> +
> +DROP TABLE tblselectinto_heap;
> +
> +-- create materialized view using heap2
> +CREATE MATERIALIZED VIEW mv_heap2 USING heap2 AS
> +        SELECT * FROM tbl_heap2;
> +
> +SELECT r.relname, r.relkind, a.amname from pg_class as r, pg_am as a
> +        where a.oid = r.relam AND r.relname = 'mv_heap2';
> +
> +-- Try creating the unsupported relation kinds with using syntax
> +CREATE VIEW test_view USING heap2 AS SELECT * FROM tbl_heap2;
> +
> +CREATE SEQUENCE test_seq USING heap2;
> +
> +
> +-- Drop table access method, but fails as objects depends on it
> +DROP ACCESS METHOD heap2;
> +
> +-- Drop table access method with cascade
> +DROP ACCESS METHOD heap2 CASCADE;
> -- 
> 2.18.0.windows.1

Nice!

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> FWIW, now that oids are removed, and the tuple table slot abstraction
> got in, I'm working on rebasing the pluggable storage patchset on top of
> that.

I've pushed a version of that to the git tree, including a rebased
version of zheap:
https://github.com/anarazel/postgres-pluggable-storage
https://github.com/anarazel/postgres-pluggable-zheap

I'm still working on moving some of the out-of-access/zheap
modifications into pluggable storage (see e.g. the first commit of the
pluggable-zheap series). But this should allow others to start on a more
recent codebase.

My next steps are:
- make relation creation properly pluggable
- remove the typedefs from tableam.h, instead move them into the
  TableAmRoutine struct.
- Move rs_{nblocks, startblock, numblocks} out of TableScanDescData
- Move HeapScanDesc and IndexFetchHeapData out of relscan.h
- See if the slot in SysScanDescData can be avoided, it's not exactly
  free of overhead.
- remove ExecSlotCompare(), it's entirely unrelated to these changes imo
  (and in the wrong place)
- rename HeapUpdateFailureData et al to not reference Heap
- split pluggable storage patchset, to commit earlier:
  - EvalPlanQual slotification
  - trigger slotification
  - split of IndexBuildHeapScan out of index.c

I'm wondering whether we should add
table_beginscan/table_getnextslot/index_getnext_slot using the old API
in an earlier commit that then could be committed separately, allowing
the tablecmd.c changes to be committed soon.

I'm wondering whether we should change the table_beginscan* API so it
provides a slot - pretty much every caller has to do so, and it seems
just as easy to create/dispose via table_beginscan/endscan.
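To make the trade-off concrete, here is a purely illustrative Python mock of the slot-providing variant; every name below is an invented stand-in for the proposed table_beginscan/table_getnextslot/table_endscan shape, not code from the patch:

```python
# Hypothetical sketch of the API shape under discussion: the scan itself
# creates and owns the slot, so callers never pair a table_beginscan()
# with their own slot creation/disposal.

class MockSlot:
    """Stand-in for a TupleTableSlot holding one deformed tuple."""
    def __init__(self):
        self.values = None

class MockScan:
    """Stand-in for a TableScanDesc that owns its output slot."""
    def __init__(self, rows):
        self._rows = iter(rows)
        self.slot = MockSlot()        # created together with the scan

def table_beginscan(rows):
    return MockScan(rows)

def table_getnextslot(scan):
    """Store the next tuple in the scan's own slot; None at end of scan."""
    try:
        scan.slot.values = next(scan._rows)
    except StopIteration:
        return None
    return scan.slot

def table_endscan(scan):
    scan.slot = None                  # slot disposed with the scan

# The caller never allocates a slot itself:
scan = table_beginscan([(1, 'a'), (2, 'b')])
seen = []
while (slot := table_getnextslot(scan)) is not None:
    seen.append(slot.values)
table_endscan(scan)
print(seen)   # [(1, 'a'), (2, 'b')]
```

The caller-side loop is the point of the comparison: today each call site must create a slot with the right ops and remember to drop it.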

Further tasks I'm not yet planning to tackle, that I'd welcome help on:
- pg_dump support
- pg_upgrade testing
- I think we should consider removing HeapTuple->t_tableOid, it should
  imo live entirely in the slot

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Dmitry Dolgov
Date:
> On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > FWIW, now that oids are removed, and the tuple table slot abstraction
> > got in, I'm working on rebasing the pluggable storage patchset on top of
> > that.
>
> I've pushed a version of that to the git tree, including a rebased
> version of zheap:
> https://github.com/anarazel/postgres-pluggable-storage
> https://github.com/anarazel/postgres-pluggable-zheap

Great, thanks!

As a side note, I assume the last reference should be this, right?

https://github.com/anarazel/postgres-pluggable-storage/tree/pluggable-zheap

> Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> - pg_dump support
> - pg_upgrade testing
> - I think we should consider removing HeapTuple->t_tableOid, it should
>   imo live entirely in the slot

I would love to try help with pg_dump support.


Re: Pluggable Storage - Andres's take

From:
Kyotaro HORIGUCHI
Date:
Hello.

At Tue, 27 Nov 2018 14:58:35 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<080ce65e-7b96-adbf-1c8c-7c88d87eaeda@lab.ntt.co.jp>
> +  <para>
> +<programlisting>
> +TupleTableSlotOps *
> +slot_callbacks (Relation relation);
> +</programlisting>
> +   API to access the slot specific methods;
> +   Following methods are available;
> +   <structname>TTSOpsVirtual</structname>,
> +   <structname>TTSOpsHeapTuple</structname>,
> +   <structname>TTSOpsMinimalTuple</structname>,
> +   <structname>TTSOpsBufferTuple</structname>,
> +  </para>
> 
> Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
> its relations to the TableAmRoutine abstraction, I think the text
> description could better be written as:
> 
> "API to get the slot operations struct for a given table access method"
> 
> It's not clear to me why various TTSOps* structs are listed here?  Is the
> point that different AMs may choose one of the listed alternatives?  For
> example, I see that heap AM implementation returns TTOpsBufferTuple, so it
> manipulates slots containing buffered tuples, right?  Other AMs are free
> to return any one of these?  For example, some AMs may never use buffer
> manager and hence not use TTOpsBufferTuple.  Is that understanding correct?

Yeah, I'm not sure why it should be a function rather than a
pointer to the struct itself. And the four structs don't seem
relevant to table AMs. Perhaps clear, getsomeattrs and so on
should be listed instead.
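For what it's worth, per-relation flexibility may be why it is a function; here is an illustrative Python sketch (the TTSOps* names mimic the real structs, but the in-memory AM and all functions are invented for this example):

```python
# Illustrative only: slot_callbacks as a function lets an AM choose its
# slot-ops "vtable" per relation, even though heap always answers the
# same thing.  Dicts stand in for the C ops structs.

TTSOpsVirtual = {"name": "virtual"}
TTSOpsBufferTuple = {"name": "buffer"}

def heap_slot_callbacks(relation):
    # The heap AM always manipulates buffered tuples ...
    return TTSOpsBufferTuple

def inmem_slot_callbacks(relation):
    # ... while a hypothetical AM that never touches the buffer
    # manager could hand back virtual-tuple ops instead.
    return TTSOpsVirtual

print(heap_slot_callbacks("tbl")["name"])    # buffer
print(inmem_slot_callbacks("tbl")["name"])   # virtual
```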

> +  <para>
> +<programlisting>
> +Oid
> +tuple_insert (Relation rel, TupleTableSlot *slot, CommandId cid,
> +              int options, BulkInsertState bistate);
> +</programlisting>
> +   API to insert the tuple and provide the <literal>ItemPointerData</literal>
> +   where the tuple is successfully inserted.
> +  </para>
> 
> It's not clear from the signature where you get the ItemPointerData.
> Looking at heapam_tuple_insert which puts it in slot->tts_tid, I think
> this should mention it a bit differently, like:
> 
> API to insert the tuple contained in the provided slot and return its TID,
> that is, the location where the tuple is successfully inserted

It is actually an OID, not a TID, in the current code. The TID is
handled internally.

> +  <para>
> +<programlisting>
> +bool
> +tuple_fetch_follow (struct IndexFetchTableData *scan,
> +                  ItemPointer tid,
> +                  Snapshot snapshot,
> +                  TupleTableSlot *slot,
> +                  bool *call_again, bool *all_dead);
> +</programlisting>
> +   API to get the all the tuples of the page that satisfies itempointer.
> +  </para>
> 
> IIUC, "all the tuples of of the page" in the above sentence means all the
> tuples in the HOT chain of a given heap tuple, making this description of
> the API slightly specific to the heap AM.  Can we make the description
> more generic or is the API itself very specific that it cannot be
> expressed in generic terms?  Ignoring that for a moment, I think the
> sentence contains more "the"s than there need to be, so maybe write as:
> 
> API to get all tuples on a given page that are linked to the tuple of the
> given TID

Mmm. This is exposing MVCC matters to indexam. I suppose we
should refactor this API.

> +  <para>
> +<programlisting>
> +tuple_data
> +get_tuple_data (TupleTableSlot *slot, tuple_data_flags flags);
> +</programlisting>
> +   API to return the internal structure members of the HeapTuple.
> +  </para>
> 
> I think this description doesn't mention enough details of both the
> information that needs to be specified when calling the function (what's
> in flags) and the information that's returned.

(I suppose it will be described in later sections.)

> +  <para>
> +<programlisting>
> +bool
> +scan_analyze_next_tuple (TableScanDesc scan, TransactionId OldestXmin,
> +                      double *liverows, double *deadrows, TupleTableSlot
> *slot));
> +</programlisting>
> +   API to analyze the block and fill the buffered heap tuple in the slot
> and also
> +   provide the live and dead rows.
> +  </para>
> 
> How about:
> 
> API to get the next tuple from the block being scanned, which also updates
> the number of live and dead rows encountered

"live" and "dead" are MVCC terms. I suppose that we should stash
the deadrows somewhere else. (But the analyze code would need to
be modified if we do so.)

> +void
> +scansetlimits (TableScanDesc sscan, BlockNumber startBlk, BlockNumber
> numBlks);
> +</programlisting>
> +   API to fix the relation scan range limits.
> +  </para>
> 
> 
> How about:
> 
> API to set scan range endpoints

This sets the start point and the number of blocks. Just "API to set
scan range" would be sufficient, referring to the parameter list.

> +    <para>
> +<programlisting>
> +bool
> +scan_bitmap_pagescan (TableScanDesc scan,
> +                    TBMIterateResult *tbmres);
> +</programlisting>
> +   API to scan the relation and fill the scan description bitmap with
> valid item pointers
> +   for the specified block.
> +  </para>
> 
> This says "to scan the relation", but seems to be concerned with only a
> page worth of data as the name also says.  Also, it's not clear what "scan
> description bitmap" means.  Maybe write as:
> 
> API to scan the relation block specified in the scan descriptor to collect
> and return the tuples requested by the given bitmap

"API to collect the tuples in a page requested by the given
bitmap scan result", or something like that? I think a detailed
explanation would be required apart from the one-line description.
Anyway, the name TBMIterateResult doesn't seem proper to expose.

> +    <para>
> +<programlisting>
> +bool
> +scan_sample_next_block (TableScanDesc scan, struct SampleScanState
> *scanstate);
> +</programlisting>
> +   API to scan the relation and fill the scan description bitmap with
> valid item pointers
> +   for the specified block provided by the sample method.
> +  </para>
> 
> Looking at the code, this API selects the next block using the sampling
> method and nothing more, although I see that the heap AM implementation
> also does heapgetpage thus collecting live tuples in the array known only
> to heap AM.  So, how about:
> 
> API to select the next block of the relation using the given sampling
> method and set its information in the scan descriptor

"block" and "page" seem randomly chosen here and there. I don't
mind that in the core, but...

> +    <para>
> +<programlisting>
> +bool
> +scan_sample_next_tuple (TableScanDesc scan, struct SampleScanState
> *scanstate, TupleTableSlot *slot);
> +</programlisting>
> +   API to fill the buffered heap tuple data from the bitmap scanned item
> pointers based on the sample
> +   method and store it in the provided slot.
> +  </para>
> 
> How about:
> 
> API to select the next tuple using the given sampling method from the set
> of tuples collected from the block previously selected by the sampling method

I'm not sure "from the set of tuples collected" is true. Wouldn't
just "the state of the sample scan" or something be fine?

> +    <para>
> +<programlisting>
> +void
> +scan_rescan (TableScanDesc scan, ScanKey key, bool set_params,
> +             bool allow_strat, bool allow_sync, bool allow_pagemode);
> +</programlisting>
> +   API to restart the relation scan with provided data.
> +  </para>
> 
> How about:
> 
> API to restart the given scan using provided options, releasing any
> resources (such as buffer pins) already held by the scan

It looks too detailed to me, but "with provided data" looks too
coarse...

> +  <para>
> +<programlisting>
> +void
> +scan_update_snapshot (TableScanDesc scan, Snapshot snapshot);
> +</programlisting>
> +   API to update the relation scan with the new snapshot.
> +  </para>
> 
> How about:
> 
> API to set the visibility snapshot to be used by a given scan

If so, the function name should be "scan_set_snapshot". Anyway,
the name looks like "the function to update a snapshot (itself)".

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Pluggable Storage - Andres's take

From:
Kyotaro HORIGUCHI
Date:
Hello.

(in the next branch:)
At Tue, 27 Nov 2018 14:58:35 +0900, Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote in
<080ce65e-7b96-adbf-1c8c-7c88d87eaeda@lab.ntt.co.jp>
> Thank you for working on this.  Really looking forward to how this shapes
> up. :)

+1.

I looked through the documentation part, as where I can do something.

am.html:
> 61.1. Overview of Index access methods
>  61.1.1. Basic API Structure for Indexes
>  61.1.2. Index Access Method Functions
>  61.1.3. Index Scanning
> 61.2. Overview of Table access methods
>  61.2.1. Table access method API
>  61.2.2. Table Access Method Functions
>  61.2.3. Table scanning

Wouldn't 61.1 and 61.2 be better in the reverse order?

Is there a reason for the difference in the titles between 61.1.1
and 61.2.1? The contents are quite similar.


+ <sect2 id="table-api">
+  <title>Table access method API</title>

The member names of the index AM struct begin with "am", but they
don't have a unified prefix in the table AM. That seems a bit
inconsistent. Perhaps we should rename some long and
internal names.


+ <sect2 id="table-functions">
+  <title>Table Access Method Functions</title>

Table AM functions are far finer-grained than index AM ones. I think
AM developers need a more concrete description of what
every API function does, and an explanation of the various
previously-internal structs.

I suppose that how the functions are used in core code paths will
be described in the following sections.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Pluggable Storage - Andres's take

From:
Haribabu Kommi
Date:

On Tue, Dec 11, 2018 at 12:47 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks for these changes. I've merged a good chunk of them.

Thanks.
 
On 2018-11-16 12:05:26 +1100, Haribabu Kommi wrote:
> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index c3960dc91f..3254e30a45 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -1741,7 +1741,7 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>  {
>       HeapScanDesc scan = (HeapScanDesc) sscan;
>       Page            targpage;
> -     OffsetNumber targoffset = scan->rs_cindex;
> +     OffsetNumber targoffset;
>       OffsetNumber maxoffset;
>       BufferHeapTupleTableSlot *hslot;

> @@ -1751,7 +1751,9 @@ heapam_scan_analyze_next_tuple(TableScanDesc sscan, TransactionId OldestXmin, do
>       maxoffset = PageGetMaxOffsetNumber(targpage);

>       /* Inner loop over all tuples on the selected page */
> -     for (targoffset = scan->rs_cindex; targoffset <= maxoffset; targoffset++)
> +     for (targoffset = scan->rs_cindex ? scan->rs_cindex : FirstOffsetNumber;
> +                     targoffset <= maxoffset;
> +                     targoffset++)
>       {
>               ItemId          itemid;
>               HeapTuple       targtuple = &hslot->base.tupdata;

I thought it was better to fix the initialization for rs_cindex - any
reason you didn't go for that?

No specific reason. Thanks for the correction.
 

> diff --git a/src/backend/access/heap/heapam_visibility.c b/src/backend/access/heap/heapam_visibility.c
> index 8233475aa0..7bad246f55 100644
> --- a/src/backend/access/heap/heapam_visibility.c
> +++ b/src/backend/access/heap/heapam_visibility.c
> @@ -1838,8 +1838,10 @@ HeapTupleSatisfies(HeapTuple stup, Snapshot snapshot, Buffer buffer)
>               case NON_VACUUMABLE_VISIBILTY:
>                       return HeapTupleSatisfiesNonVacuumable(stup, snapshot, buffer);
>                       break;
> -             default:
> +             case END_OF_VISIBILITY:
>                       Assert(0);
>                       break;
>       }
> +
> +     return false; /* keep compiler quiet */

I don't understand why END_OF_VISIBILITY is a good idea.  I've now removed
END_OF_VISIBILITY, and the default case.
 
OK.


> @@ -593,6 +594,10 @@ intorel_receive(TupleTableSlot *slot, DestReceiver *self)
>       if (myState->rel->rd_rel->relhasoids)
>               slot->tts_tupleOid = InvalidOid;

> +     /* Materialize the slot */
> +     if (!TTS_IS_VIRTUAL(slot))
> +             ExecMaterializeSlot(slot);
> +
>       table_insert(myState->rel,
>                                slot,
>                                myState->output_cid,

What's the point of adding materialization here?

In earlier testing I observed that the received slot is a buffered slot
pointing at the original tuple; when that tuple is inserted into the new table,
the transaction id changes, which leads to an invisible tuple. For that
reason I added the materialize call here.
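A loose conceptual mock of that hazard (invented names, nothing from heapam): a buffered slot only references the shared buffer page, so if the page changes underneath, the slot's view changes with it, while a materialized slot keeps a private copy:

```python
# Conceptual sketch: why a slot that merely points into a shared buffer
# may need to be materialized (copied into slot-local memory) before the
# page it references can change.

buffer_page = {"xmin": 100, "data": "row-1"}   # stand-in for a buffer page

class BufferedSlot:
    def __init__(self, page):
        self._page = page       # the slot only references the page
        self._own = None
    def materialize(self):
        self._own = dict(self._page)   # private copy, detached from page
    def get(self):
        return self._own if self._own is not None else self._page

referencing = BufferedSlot(buffer_page)
copied = BufferedSlot(buffer_page)
copied.materialize()

buffer_page["xmin"] = 200       # page is rewritten under the slots

print(referencing.get()["xmin"])   # 200 -- sees the mutated page
print(copied.get()["xmin"])        # 100 -- safe private copy
```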


 
> @@ -570,6 +563,9 @@ ExecInterpExpr(ExprState *state, ExprContext *econtext, bool *isnull)
>                               Assert(TTS_IS_HEAPTUPLE(scanslot) ||
>                                          TTS_IS_BUFFERTUPLE(scanslot));

> +                             if (hslot->tuple == NULL)
> +                                     ExecMaterializeSlot(scanslot);
> +
>                               d = heap_getsysattr(hslot->tuple, attnum,
>                                                                       scanslot->tts_tupleDescriptor,
>                                                                       op->resnull);

Same?


> diff --git a/src/backend/executor/execMain.c b/src/backend/executor/execMain.c
> index e055c0a7c6..34ef86a5bd 100644
> --- a/src/backend/executor/execMain.c
> +++ b/src/backend/executor/execMain.c
> @@ -2594,7 +2594,7 @@ EvalPlanQual(EState *estate, EPQState *epqstate,
>        * datums that may be present in copyTuple).  As with the next step, this
>        * is to guard against early re-use of the EPQ query.
>        */
> -     if (!TupIsNull(slot))
> +     if (!TupIsNull(slot) && !TTS_IS_VIRTUAL(slot))
>               ExecMaterializeSlot(slot);


Same?

Earlier, materializing a virtual tuple was throwing an error; for that
reason I added that check.
 
> index 56880e3d16..36ca07beb2 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -224,6 +224,18 @@ BitmapHeapNext(BitmapHeapScanState *node)

>                       BitmapAdjustPrefetchIterator(node, tbmres);

> +                     /*
> +                      * Ignore any claimed entries past what we think is the end of the
> +                      * relation.  (This is probably not necessary given that we got at
> +                      * least AccessShareLock on the table before performing any of the
> +                      * indexscans, but let's be safe.)
> +                      */
> +                     if (tbmres->blockno >= scan->rs_nblocks)
> +                     {
> +                             node->tbmres = tbmres = NULL;
> +                             continue;
> +                     }
> +

I moved this into the storage engine; there was just a minor bug
preventing the already existing check from taking effect. I don't think
we should expose this kind of thing to the outside of the storage
engine.

OK.
 

> diff --git a/src/backend/parser/gram.y b/src/backend/parser/gram.y
> index 54382aba88..ea48e1d6e8 100644
> --- a/src/backend/parser/gram.y
> +++ b/src/backend/parser/gram.y
> @@ -4037,7 +4037,6 @@ CreateStatsStmt:
>   *
>   *****************************************************************************/

> -// PBORKED: storage option
>  CreateAsStmt:
>               CREATE OptTemp TABLE create_as_target AS SelectStmt opt_with_data
>                               {
> @@ -4068,14 +4067,16 @@ CreateAsStmt:
>               ;

>  create_as_target:
> -                     qualified_name opt_column_list OptWith OnCommitOption OptTableSpace
> +                     qualified_name opt_column_list table_access_method_clause
> +                     OptWith OnCommitOption OptTableSpace
>                               {
>                                       $$ = makeNode(IntoClause);
>                                       $$->rel = $1;
>                                       $$->colNames = $2;
> -                                     $$->options = $3;
> -                                     $$->onCommit = $4;
> -                                     $$->tableSpaceName = $5;
> +                                     $$->accessMethod = $3;
> +                                     $$->options = $4;
> +                                     $$->onCommit = $5;
> +                                     $$->tableSpaceName = $6;
>                                       $$->viewQuery = NULL;
>                                       $$->skipData = false;           /* might get changed later */
>                               }
> @@ -4125,14 +4126,15 @@ CreateMatViewStmt:
>               ;

>  create_mv_target:
> -                     qualified_name opt_column_list opt_reloptions OptTableSpace
> +                     qualified_name opt_column_list table_access_method_clause opt_reloptions OptTableSpace
>                               {
>                                       $$ = makeNode(IntoClause);
>                                       $$->rel = $1;
>                                       $$->colNames = $2;
> -                                     $$->options = $3;
> +                                     $$->accessMethod = $3;
> +                                     $$->options = $4;
>                                       $$->onCommit = ONCOMMIT_NOOP;
> -                                     $$->tableSpaceName = $4;
> +                                     $$->tableSpaceName = $5;
>                                       $$->viewQuery = NULL;           /* filled at analysis time */
>                                       $$->skipData = false;           /* might get changed later */
>                               }

Cool. I wonder if we should also somehow support SELECT INTO w/ USING?
You've apparently started to do so?

I thought the same, but SELECT INTO is deprecated syntax; is it fine to
extend it with the new syntax?
 
Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From:
Dmitry Dolgov
Date:
> On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> - pg_dump support
> - pg_upgrade testing
> - I think we should consider removing HeapTuple->t_tableOid, it should
>   imo live entirely in the slot

I'm a bit confused, but what kind of pg_dump support are you talking about?
After a quick glance I don't see any table-access-specific logic there so far.
To check it I've created a test access method (which is a copy of heap, but
with some small differences) and pg_dump worked as expected.

As a side note, in a table description I haven't found any mention of which
access method is used for the table; it's probably useful to show that with \d+
(see the attached patch).

Attachments

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2018-12-15 20:15:12 +0100, Dmitry Dolgov wrote:
> > On Tue, Dec 11, 2018 at 3:13 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_dump support
> > - pg_upgrade testing
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
> 
> I'm a bit confused, but what kind of pg_dump support are you talking about?
> After a quick glance I don't see any table-access-specific logic there so far.
> To check it I've created a test access method (which is a copy of heap, but
> with some small differences) and pg_dump worked as expected.

We need to dump the table access method at dump time; otherwise we lose
that information.

> As a side note, in a table description I haven't found any mention of which
> access method is used for the table; it's probably useful to show that with \d+
> (see the attached patch).

I'm not convinced that's really worth the cost of including it in \d
(rather than \d+ or such). When developing an alternative access method
it's extremely useful to be able to just change the default access
method, and run the existing tests, which this makes harder. It's also a
lot of churn.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Dmitry Dolgov
Date:
> On Sat, Dec 15, 2018 at 8:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> We need to dump the table access method at dump time, otherwise we lose
> that information.

Oh, right. So, something like in the attached patch?

> > As a side note, in a table description I haven't found any mention of which
> > access method is used for this table, probably it's useful to show that with \d+
> > (see the attached patch).
>
> I'm not convinced that's really worth the cost of including it in \d
> (rather than \d+ or such).

Maybe I'm missing the point, but I meant exactly the same thing, and the patch
suggested in the previous email adds this info to \d+.

Attachments

Re: Pluggable Storage - Andres's take

From
Peter Geoghegan
Date:
On Mon, Dec 10, 2018 at 8:13 AM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> Just out of curiosity I've also tried tpc-c from oltpbench (in the very same
> simple environment), it doesn't show any significant difference from master as
> well.

FWIW, I have found BenchmarkSQL to be significantly better than
oltpbench, having used both quite a bit now:

https://bitbucket.org/openscg/benchmarksql

For example, oltpbench requires a max_connections setting that far
exceeds the number of terminals/clients used by the benchmark, because
the number of connections used during bulk loading far exceeds what is
truly required. BenchmarkSQL also makes it easy to generate useful
html reports, complete with graphs.

-- 
Peter Geoghegan


Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Sat, Dec 15, 2018 at 8:37 PM Andres Freund <andres@anarazel.de> wrote:
>
> We need to dump the table access method at dump time, otherwise we lose
> that information.

As a result of the discussion in [1] (btw, thanks for starting it), here is a
proposed solution with tracking of the current default_table_access_method. Next
I'll tackle the similar issue for psql and probably add some tests for both patches.

[1]: https://www.postgresql.org/message-id/flat/20190107235616.6lur25ph22u5u5av%40alap3.anarazel.de

Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-01-12 01:35:06 +0100, Dmitry Dolgov wrote:
> > On Sat, Dec 15, 2018 at 8:37 PM Andres Freund <andres@anarazel.de> wrote:
> >
> > We need to dump the table access method at dump time, otherwise we lose
> > that information.
> 
> As a result of the discussion in [1] (btw, thanks for starting it), here is
> proposed solution with tracking current default_table_access_method. Next I'll
> tackle similar issue for psql and probably add some tests for both patches.

Thanks!

> +/*
> + * Set the proper default_table_access_method value for the table.
> + */
> +static void
> +_selectTableAccessMethod(ArchiveHandle *AH, const char *tableam)
> +{
> +    PQExpBuffer cmd = createPQExpBuffer();
> +    const char *want, *have;
> +
> +    have = AH->currTableAm;
> +    want = tableam;
> +
> +    if (!want)
> +        return;
> +
> +    if (have && strcmp(want, have) == 0)
> +        return;
> +
> +
> +    appendPQExpBuffer(cmd, "SET default_table_access_method = %s;", tableam);

This needs escaping, at the very least with "", but better with proper
routines for dealing with identifiers.
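The kind of escaping meant here can be illustrated with a self-contained sketch. pg_dump's actual code would use its existing fmtId() helper rather than this toy function; this only shows why interpolating the name with plain %s is unsafe:

```c
#include <stddef.h>

/* Minimal illustration of identifier quoting: wrap the name in double
 * quotes and double any embedded quote character. Real pg_dump code
 * should call fmtId() instead of rolling its own. */
static void
quote_identifier_into(char *dst, size_t dstlen, const char *ident)
{
    size_t n = 0;

    dst[n++] = '"';
    for (const char *p = ident; *p != '\0' && n + 3 < dstlen; p++)
    {
        if (*p == '"')
            dst[n++] = '"';     /* " becomes "" inside a quoted identifier */
        dst[n++] = *p;
    }
    dst[n++] = '"';
    dst[n] = '\0';
}
```

With this, a hypothetical access method named `my"am` renders as `SET default_table_access_method = "my""am";` instead of producing broken or injected SQL.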



> @@ -5914,7 +5922,7 @@ getTables(Archive *fout, int *numTables)
>                            "tc.relfrozenxid AS tfrozenxid, "
>                            "tc.relminmxid AS tminmxid, "
>                            "c.relpersistence, c.relispopulated, "
> -                          "c.relreplident, c.relpages, "
> +                          "c.relreplident, c.relpages, am.amname AS amname, "

That AS doesn't do anything, does it?


>          /* other fields were zeroed above */
>  
> @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const char *name,
>           * post-data.
>           */
>          ArchiveEntry(fout, nilCatalogId, createDumpId(),
> -                     tag->data, namespace, NULL, owner,
> +                     tag->data, namespace, NULL, owner, NULL,
>                       "COMMENT", SECTION_NONE,
>                       query->data, "", NULL,
>                       &(dumpId), 1,

We really ought to move the arguments to a struct, so we don't generate
quite as many useless diffs whenever we do a change around one of
these...

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Sat, Jan 12, 2019 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
>
> > +     appendPQExpBuffer(cmd, "SET default_table_access_method = %s;", tableam);
>
> This needs escaping, at the very least with "", but better with proper
> routines for dealing with identifiers.

Thanks for noticing, fixed.

> > @@ -5914,7 +5922,7 @@ getTables(Archive *fout, int *numTables)
> >                                                 "tc.relfrozenxid AS tfrozenxid, "
> >                                                 "tc.relminmxid AS tminmxid, "
> >                                                 "c.relpersistence, c.relispopulated, "
> > -                                               "c.relreplident, c.relpages, "
> > +                                               "c.relreplident, c.relpages, am.amname AS amname, "
>
> That AS doesn't do anything, does it?

Right, I've renamed it a few times and forgot to get rid of it. Removed.

>
> >               /* other fields were zeroed above */
> >
> > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const char *name,
> >                * post-data.
> >                */
> >               ArchiveEntry(fout, nilCatalogId, createDumpId(),
> > -                                      tag->data, namespace, NULL, owner,
> > +                                      tag->data, namespace, NULL, owner, NULL,
> >                                        "COMMENT", SECTION_NONE,
> >                                        query->data, "", NULL,
> >                                        &(dumpId), 1,
>
> We really ought to move the arguments to a struct, so we don't generate
> quite as much useless diffs whenever we do a change around one of
> these...

That's what I thought too. Maybe then I'll suggest a mini-patch to master to
refactor these arguments out into a separate struct, so we can leverage it here.

Attachments

Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
Thanks for the patch updates.

A few comments so far from me :

+static void _selectTableAccessMethod(ArchiveHandle *AH, const char
*tablespace);
tablespace => tableam

+_selectTableAccessMethod(ArchiveHandle *AH, const char *tableam)
+{
+       PQExpBuffer cmd = createPQExpBuffer();
createPQExpBuffer() should be moved after the below statement, so that
it does not leak memory :
if (have && strcmp(want, have) == 0)
return;

char    *tableam; /* table access method, onlyt for TABLE tags */
Indentation is a bit misaligned. onlyt=> only



@@ -2696,6 +2701,7 @@ ReadToc(ArchiveHandle *AH)
                        te->tablespace = ReadStr(AH);

te->owner = ReadStr(AH);
+  te->tableam = ReadStr(AH);

Above, I am not sure about this, but possibly we may need to
have an archive-version check like how it is done for tablespace:
if (AH->version >= K_VERS_1_10)
   te->tablespace = ReadStr(AH);

So how about bumping up the archive version and doing these checks?
Otherwise, if we run pg_restore using an old version, we may read some
junk into te->tableam, or possibly crash. As I said, I am not sure
about this due to a lack of clear understanding of archive versioning,
but let me know if you indeed find this issue to be true.
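The version gate being suggested can be sketched with stand-in types. The struct and the constant K_VERS_TABLEAM below are made up for illustration; the point is only that a field added to the archive format must be read behind a version check, like the existing K_VERS_1_10 gate on te->tablespace:

```c
#include <stddef.h>

#define K_VERS_TABLEAM 114      /* hypothetical version that adds tableam */

typedef struct StubTocEntry { const char *tableam; } StubTocEntry;

static void
read_tableam(StubTocEntry *te, int archive_version, const char *next_field)
{
    if (archive_version >= K_VERS_TABLEAM)
        te->tableam = next_field;   /* new archive: field is present */
    else
        te->tableam = NULL;         /* old archive: don't read junk */
}
```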


Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> createPQExpBuffer() should be moved after the below statement, so that
> it does not leak memory

Thanks for noticing, fixed.

> So how about bumping up the archive version and doing these checks ?

Yeah, you're right, I've added this check.

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Dec 11, 2018 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
Further tasks I'm not yet planning to tackle, that I'd welcome help on:
- pg_upgrade testing

I did the pg_upgrade testing from an older version with some tables and views
existing, and all of them were properly transferred into the new server with heap
as the default access method.

I will add Dmitry's pg_dump patch and test pg_upgrade to confirm
the proper access method is retained on the upgraded database.

 
- I think we should consider removing HeapTuple->t_tableOid, it should
  imo live entirely in the slot

I removed the t_tableOid from HeapTuple and during testing I found some
problems with triggers, will post the patch once it is fixed.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-01-15 18:02:38 +1100, Haribabu Kommi wrote:
> On Tue, Dec 11, 2018 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > Hi,
> >
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_upgrade testing
> >
> 
> I did the pg_upgrade testing from an older version with some tables and views
> existing, and all of them were properly transferred into the new server with heap
> as the default access method.
> 
> I will add Dmitry's pg_dump patch and test pg_upgrade to confirm
> the proper access method is retained on the upgraded database.
> 
> 
> 
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
> >
> 
> I removed the t_tableOid from HeapTuple and during testing I found some
> problems with triggers, will post the patch once it is fixed.


Please note that I'm working on a heavily revised version of the patch
right now, trying to clean up a lot of things (you might have seen some
of the threads I started). I hope to post it ~Thursday.  Local-ish
patches shouldn't be a problem though.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Sat, 12 Jan 2019 at 18:11, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Sat, Jan 12, 2019 at 1:44 AM Andres Freund <andres@anarazel.de> wrote:
> > >               /* other fields were zeroed above */
> > >
> > > @@ -9355,7 +9370,7 @@ dumpComment(Archive *fout, const char *type, const char *name,
> > >                * post-data.
> > >                */
> > >               ArchiveEntry(fout, nilCatalogId, createDumpId(),
> > > -                                      tag->data, namespace, NULL, owner,
> > > +                                      tag->data, namespace, NULL, owner, NULL,
> > >                                        "COMMENT", SECTION_NONE,
> > >                                        query->data, "", NULL,
> > >                                        &(dumpId), 1,
> >
> > We really ought to move the arguments to a struct, so we don't generate
> > quite as much useless diffs whenever we do a change around one of
> > these...
>
> That's what I thought too. Maybe then I'll suggest a mini-patch to the master to
> refactor these arguments out into a separate struct, so we can leverage it here.

Then for each of the calls, we would need to declare that structure
variable (with = {0}) and assign the required fields in that structure
before passing it to ArchiveEntry(). But a major reason for
ArchiveEntry() is to avoid doing this and instead conveniently pass
those fields as parameters. This would add unnecessary extra lines of
code. I think a better way is to have an ArchiveEntry() function with a
limited number of parameters, and an ArchiveEntryEx() with those
extra parameters which are not needed in usual cases. E.g. we can have
tablespace, tableam, dumpFn and dumpArg as the extra arguments of
ArchiveEntryEx(), because in most places these are passed as NULL.
All future arguments would go in ArchiveEntryEx().
Comments?



-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Tue, 15 Jan 2019 at 12:27, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Mon, Jan 14, 2019 at 2:07 PM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > createPQExpBuffer() should be moved after the below statement, so that
> > it does not leak memory
>
> Thanks for noticing, fixed.

Looks good.

>
> > So how about bumping up the archive version and doing these checks ?
>
> Yeah, you're right, I've added this check.

Need to bump K_VERS_MINOR as well.

On Mon, 14 Jan 2019 at 18:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> +static void _selectTableAccessMethod(ArchiveHandle *AH, const char
> *tablespace);
> tablespace => tableam

This is yet to be addressed.


--
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Need to bump K_VERS_MINOR as well.

I've bumped it up, but somehow this change escaped the previous version. Now it
should be there, thanks!

> On Mon, 14 Jan 2019 at 18:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char
> > *tablespace);
> > tablespace => tableam
>
> This is yet to be addressed.

Fixed.

Also I guess another attached patch should address the psql part, namely
displaying a table access method with \d+ and the possibility to hide it with a
psql variable (HIDE_TABLEAM, but I'm open to suggestions about the name).

Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-01-15 14:37:36 +0530, Amit Khandekar wrote:
> Then for each of the calls, we would need to declare that structure
> variable (with = {0}) and assign required fields in that structure
> before passing it to ArchiveEntry(). But a major reason of
> ArchiveEntry() is to avoid doing this and instead conveniently pass
> those fields as parameters. This will cause unnecessary more lines of
> code. I think better way is to have an ArchiveEntry() function with
> limited number of parameters, and have an ArchiveEntryEx() with those
> extra parameters which are not needed in usual cases.

I don't think that'll really solve the problem. I think it might be more
reasonable to rely on structs. Now that we can rely on designated
initializers for structs we can do something like

    ArchiveEntry((ArchiveArgs){.tablespace = 3,
                               .dumpFn = somefunc,
                               ...});

and unused arguments will automatically initialized to zero.  Or we
could pass the struct as a pointer, might be more efficient (although I
doubt it matters here):

    ArchiveEntry(&(ArchiveArgs){.tablespace = 3,
                                .dumpFn = somefunc,
                                ...});

What do others think?  It'd probably be a good idea to start a new
thread about this.
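The pattern Andres sketches relies on C99 compound literals with designated initializers; here is a self-contained toy version (ArchiveArgs and its members are illustrative, not pg_dump's actual struct):

```c
#include <stddef.h>

/* Illustrative argument struct; the real one would carry all of
 * ArchiveEntry()'s current parameters. */
typedef struct ArchiveArgs
{
    const char *tag;
    const char *tablespace;
    const char *tableam;
    const char *owner;
} ArchiveArgs;

/* Any member not named in a designated initializer is
 * zero-initialized, so unused arguments arrive as NULL. */
static const char *
entry_tableam(const ArchiveArgs *args)
{
    return args->tableam ? args->tableam : "(default)";
}
```

A call site then reads `entry_tableam(&(ArchiveArgs){.tag = "TABLE t1", .tableam = "heap"})`, and adding a new member later does not require touching existing call sites, which is exactly what avoids the "useless diffs" problem.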

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Tue, 15 Jan 2019 at 17:58, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Tue, Jan 15, 2019 at 10:52 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > Need to bump K_VERS_MINOR as well.
>
> I've bumped it up, but somehow this change escaped the previous version. Now
> should be there, thanks!
>
> > On Mon, 14 Jan 2019 at 18:36, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > > +static void _selectTableAccessMethod(ArchiveHandle *AH, const char
> > > *tablespace);
> > > tablespace => tableam
> >
> > This is yet to be addressed.
>
> Fixed.

Thanks, the patch looks good to me. Of course there's the other thread
about ArchiveEntry arguments which may alter this patch, but
otherwise, I have no more comments on this patch.

>
> Also I guess another attached patch should address the psql part, namely
> displaying a table access method with \d+ and possibility to hide it with a
> psql variable (HIDE_TABLEAM, but I'm open for suggestion about the name).

Will have a look at this one.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Fri, 18 Jan 2019 at 10:13, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> On Tue, 15 Jan 2019 at 17:58, Dmitry Dolgov <9erthalion6@gmail.com> wrote:

> > Also I guess another attached patch should address the psql part, namely
> > displaying a table access method with \d+ and possibility to hide it with a
> > psql variable (HIDE_TABLEAM, but I'm open for suggestion about the name).

I am ok with the name.

>
> Will have a look at this one.

--- a/src/test/regress/expected/copy2.out
+++ b/src/test/regress/expected/copy2.out
@@ -1,3 +1,4 @@
+\set HIDE_TABLEAM on
 CREATE TEMP TABLE x (

I thought we wanted to avoid having to add this setting in individual
regression tests. Can't we do this in pg_regress as a common setting ?

+ /* Access method info */
+ if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
+    !(pset.hide_tableam && tableinfo.relam_is_default))
+ {
+         printfPQExpBuffer(&buf, _("Access method: %s"),
fmtId(tableinfo.relam));

So this will make psql hide the access method if it's the same as the
default. I understand that this was kind of concluded in the other
thread "Displaying and dumping of table access methods". But IMHO, if
hide_tableam is false, we should *always* show the access method,
regardless of the default value. I mean, we can make it simple: off
means never show the table access method, on means always show it,
regardless of the default access method. And this will also work with
regression tests. If some regression test specifically wants to output
the access method, it can have a "\set HIDE_TABLEAM off" command.

If we hide the method when it's the default, then for a regression test that
wants to forcibly show the table access method of all tables, it won't
show up for tables that have the default access method.

------------

+ if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&

If the server does not support relam, tableinfo.relam will be NULL
anyways. So I think sversion check is not needed.

------------

+ printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
fmtId is not required. In fact, we should display the access method
name as-is. fmtId is required only for identifiers present in SQL
queries.

-----------

+      printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
+      printTableAddFooter(&cont, buf.data);
+   }
+
+
 }

Last two blank lines are not needed.

-----------

+ bool            hide_tableam;
 } PsqlSettings;

These variables, it seems, are supposed to be grouped together by type.

-----------

I believe you are going to add a new regression testcase for the change ?


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> --- a/src/test/regress/expected/copy2.out
> +++ b/src/test/regress/expected/copy2.out
> @@ -1,3 +1,4 @@
> +\set HIDE_TABLEAM on
>
> I thought we wanted to avoid having to add this setting in individual
> regression tests. Can't we do this in pg_regress as a common setting ?

Yeah, you're probably right. Actually, I couldn't find anything that looks like
"common settings", and so far I've placed it into psql_start_test as a psql
argument. But not sure, maybe there is a better place.

> + /* Access method info */
> + if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
> +    !(pset.hide_tableam && tableinfo.relam_is_default))
> + {
> +         printfPQExpBuffer(&buf, _("Access method: %s"),
> fmtId(tableinfo.relam));
>
> So this will make psql hide the access method if it's same as the
> default. I understand that this was kind of concluded in the other
> thread "Displaying and dumping of table access methods". But IMHO, if
> the hide_tableam is false, we should *always* show the access method,
> regardless of the default value. I mean, we can make it simple : off
> means never show table-access, on means always show table-access,
> regardless of the default access method. And this also will work with
> regression tests. If some regression test wants specifically to output
> the access method, it can have a "\SET HIDE_TABLEAM off" command.
>
> If we hide the method if it's default, then for a regression test that
> wants to forcibly show the table access method of all tables, it won't
> show up for tables that have default access method.

I can't imagine what kind of test would need to forcibly show the table access
method of all the tables. Even if you need to verify the tableam for something,
maybe it's even easier just to select it from pg_am?

> + if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
>
> If the server does not support relam, tableinfo.relam will be NULL
> anyways. So I think sversion check is not needed.
> ------------
>
> + printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
> fmtId is not required.
> -----------
>
> +      printfPQExpBuffer(&buf, _("Access method: %s"), fmtId(tableinfo.relam));
> +      printTableAddFooter(&cont, buf.data);
> +   }
> +
> +
>  }
>
> Last two blank lines are not needed.

Right, fixed.

> + bool            hide_tableam;
>  } PsqlSettings;
>
> These variables, it seems, are supposed to be grouped together by type.

Well, this grouping looks strange to me. But since I don't have a strong
opinion, I moved the variable.

> I believe you are going to add a new regression testcase for the change ?

Yep.

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:
On Tue, Jan 15, 2019 at 6:05 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-01-15 18:02:38 +1100, Haribabu Kommi wrote:
> On Tue, Dec 11, 2018 at 1:13 PM Andres Freund <andres@anarazel.de> wrote:
>
> > Hi,
> >
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > Further tasks I'm not yet planning to tackle, that I'd welcome help on:
> > - pg_upgrade testing
> >
>
> I did the pg_upgrade testing from an older version with some tables and views
> existing, and all of them were properly transferred into the new server with heap
> as the default access method.
>
> I will add Dmitry's pg_dump patch and test pg_upgrade to confirm
> the proper access method is retained on the upgraded database.
>
>
>
> > - I think we should consider removing HeapTuple->t_tableOid, it should
> >   imo live entirely in the slot
> >
>
> I removed the t_tableOid from HeapTuple and during testing I found some
> problems with triggers, will post the patch once it is fixed.


Please note that I'm working on a heavily revised version of the patch
right now, trying to clean up a lot of things (you might have seen some
of the threads I started). I hope to post it ~Thursday.  Local-ish
patches shouldn't be a problem though.

Yes, I am checking your other threads of refactoring and cleanups.
I will rebase this patch once the revised code is available.

I am not able to remove t_tableOid from HeapTuple completely,
because of its use in triggers: the slot is not available in triggers,
so I need to store the tableOid as part of the tuple as well.

Currently the setting of t_tableOid is done only when the tuple is formed
from the slot, and its use is replaced with the slot member.

Comments?

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

(resending with compressed attachments, perhaps that'll go through)

On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > FWIW, now that oids are removed, and the tuple table slot abstraction
> > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > that.
> 
> I've pushed a version to that to the git tree, including a rebased
> version of zheap:
> https://github.com/anarazel/postgres-pluggable-storage
> https://github.com/anarazel/postgres-pluggable-zheap

I've pushed the newest, substantially revised, version to the same
repository. Note, that while the newest pluggable-zheap version is newer
than my last email, it's not based on the latest version, and the
pluggable-zheap development is now happening in the main zheap
repository.


> My next steps are:
> - make relation creation properly pluggable
> - remove the typedefs from tableam.h, instead move them into the
>   TableAmRoutine struct.
> - Move rs_{nblocks, startblock, numblocks} out of TableScanDescData
> - Move HeapScanDesc and IndexFetchHeapData out of relscan.h
> - remove ExecSlotCompare(), it's entirely unrelated to these changes imo
>   (and in the wrong place)

These are done.


> - split pluggable storage patchset, to commit earlier:
>   - EvalPlanQual slotification
>   - trigger slotification
>   - split of IndexBuildHeapScan out of index.c

The patchset is now pretty granularly split into individual pieces.
There's two commits that might be worthwhile to split up further:

1) The commit introducing table_beginscan et al, currently also
   introduces indexscans through tableam.
2) The commit introducing table_(insert|delete|update) also includes
   table_lock_tuple(), which in turn changes a bunch of EPQ related
   code. It's probably worthwhile to break that out.

I tried to make each individual commit make some sense, and pass all
tests on its own. That requires some changes that are then obsoleted
in a later commit, but it's not as much as I feared.


> - rename HeapUpdateFailureData et al to not reference Heap

I've not done that, I decided it's best to do that after all the work
has gone in.


> - See if the slot in SysScanDescData can be avoided, it's not exactly
>   free of overhead.

After reconsidering, I don't think it's worth doing so.


There's pretty substantial changes in this series, besides the things
mentioned above:

- I re-introduced parallel scan into pluggable storage, but added a set
  of helper functions to avoid having to duplicate the current block
  based logic from heap. That way it can be shared between most/all
  block based AMs
- latestRemovedXid handling is moved into the table-AM, that's required
  for correct replay on Hot-Standby, where we do not know the AM of the
  current relation
- the whole truncation and relation creation code has been overhauled
- the order of functions in tableam.h, heapam_handler.c etc has been
  made more sensible
- a number of callbacks have been obsoleted (relation_sync,
  relation_create_init_fork, scansetlimits)
- A bunch of prerequisite work has been merged
- (heap|relation)_(open|openrv|close) have been split into their own
  files
- To avoid having to care about the bulk-insert flags, code that uses a
  bulk-insert now unconditionally calls table_finish_bulk_insert(). The
  AM then internally can decide what it needs to do in case of
  e.g. HEAP_INSERT_SKIP_WAL.  Zheap currently for example doesn't
  implement that (because UNDO handling is complicated), and this way it
  can just ignore the option, without needing call-site code for that.
- A *lot* of cleanups

Todo:
- merge psql / pg_dump support by Dmitry
- consider removing scan_update_snapshot
- consider removing table_gimmegimmeslot()
- add substantial docs for every callback
- consider revising the current table_lock_tuple() API, I'm not quite
  convinced that's right
- reconsider heap_fetch() API changes, causes unnecessary pain
- polish the split out trigger and EPQ changes, so they can be merged
  soon-ish


I plan to merge the first few commits pretty soon (as largely announced
in related threads).


While I saw an initial attempt at writing sgml docs for the table AM
API, I'm not convinced that's the best approach.  I think it might make
more sense to have high-level docs in sgml, but then do all the
per-callback docs in tableam.h.

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Sun, 20 Jan 2019 at 22:46, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > --- a/src/test/regress/expected/copy2.out
> > +++ b/src/test/regress/expected/copy2.out
> > @@ -1,3 +1,4 @@
> > +\set HIDE_TABLEAM on
> >
> > I thought we wanted to avoid having to add this setting in individual
> > regression tests. Can't we do this in pg_regress as a common setting ?
>
> Yeah, you're probably right. Actually, I couldn't find anything that looks like
> "common settings", and so far I've placed it into psql_start_test as a psql
> argument. But not sure, maybe there is a better place.

Yeah, psql_start_test() looks good to me. pg_regress does not seem to
have its own psqlrc file where we could have put this variable. Maybe
later on, if we want to have more such variables, we could devise
this infrastructure.

>
> > + /* Access method info */
> > + if (pset.sversion >= 120000 && verbose && tableinfo.relam != NULL &&
> > +    !(pset.hide_tableam && tableinfo.relam_is_default))
> > + {
> > +         printfPQExpBuffer(&buf, _("Access method: %s"),
> > fmtId(tableinfo.relam));
> >
> > So this will make psql hide the access method if it's same as the
> > default. I understand that this was kind of concluded in the other
> > thread "Displaying and dumping of table access methods". But IMHO, if
> > the hide_tableam is false, we should *always* show the access method,
> > regardless of the default value. I mean, we can make it simple : off
> > means never show table-access, on means always show table-access,
> > regardless of the default access method. And this also will work with
> > regression tests. If some regression test wants specifically to output
> > the access method, it can have a "\SET HIDE_TABLEAM off" command.
> >
> > If we hide the method if it's default, then for a regression test that
> > wants to forcibly show the table access method of all tables, it won't
> > show up for tables that have default access method.
>
> I can't imagine, what kind of test would need to forcibly show the table access
> method of all the tables? Even if you need to verify tableam for something,
> maybe it's even easier just to select it from pg_am?

Actually my statement is wrong, sorry. For a regression test that
wants to forcibly show table access for all tables, it just needs to
SET HIDE_TABLEAM to OFF. With your patch, if we set HIDE_TABLEAM to
OFF, it will *always* show the table access regardless of default
access method.

It is with HIDE_TABLEAM=ON that your patch hides the table access
conditionally (i.e. it shows when default value does not match). It's
in this case, that I feel we should *unconditionally* hide the table
access. Regression tests that use \d+ to show the table details might
not be interested specifically in table access method. But these will
fail if run with a modified default access method.

Besides, my general inclination is to keep the GUC behaviour simple;
and also, it looks like we can keep the regression test output
consistent without having to have this conditional behaviour.
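The two semantics under discussion can be written down as predicates; this is a toy illustration of the trade-off, not psql's actual code:

```c
#include <stdbool.h>

/* Dmitry's patch: hide the access method only when hiding is on AND
 * the table uses the default access method. */
static bool
show_tableam_conditional(bool hide_tableam, bool am_is_default)
{
    return !(hide_tableam && am_is_default);
}

/* Amit's proposal: the setting alone decides, regardless of whether
 * the table's AM matches the default. */
static bool
show_tableam_simple(bool hide_tableam)
{
    return !hide_tableam;
}
```

Under the simple rule, a regression test's output never depends on default_table_access_method, which is what makes running the suite under a different default AM stable.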

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Mon, Jan 21, 2019 at 1:01 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > FWIW, now that oids are removed, and the tuple table slot abstraction
> > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > that.
>
> I've pushed a version to that to the git tree, including a rebased
> version of zheap:
> https://github.com/anarazel/postgres-pluggable-storage
> https://github.com/anarazel/postgres-pluggable-zheap

I've pushed the newest, substantially revised, version to the same
repository. Note, that while the newest pluggable-zheap version is newer
than my last email, it's not based on the latest version, and the
pluggable-zheap development is now happening in the main zheap
repository.

Thanks for the new version of patches and changes.
 
Todo:
- consider removing scan_update_snapshot

Attached the patch for removal of scan_update_snapshot
and also the rebased patch of reduction in use of t_tableOid.
 
- consider removing table_gimmegimmeslot()
- add substantial docs for every callback

Will work on the above two.

While I saw an initial attempt at writing sgml docs for the table AM
API, I'm not convinced that's the best approach.  I think it might make
more sense to have high-level docs in sgml, but then do all the
per-callback docs in tableam.h.

OK, I will update the sgml docs accordingly. 
Index AM has per callback docs in the sgml, refactor them also?

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

Thanks!

On 2019-01-22 11:51:57 +1100, Haribabu Kommi wrote:
> Attached the patch for removal of scan_update_snapshot
> and also the rebased patch of reduction in use of t_tableOid.

I'll soon look at the latter.


> > - consider removing table_gimmegimmeslot()
> > - add substantial docs for every callback
> >
> 
> Will work on the above two.

I think it's easier if I do the first, because I can just do it while
rebasing, reducing unnecessary conflicts.


> > While I saw an initial attempt at writing sgml docs for the table AM
> > API, I'm not convinced that's the best approach.  I think it might make
> > more sense to have high-level docs in sgml, but then do all the
> > per-callback docs in tableam.h.
> >
> 
> OK, I will update the sgml docs accordingly.
> Index AM has per callback docs in the sgml, refactor them also?

I don't think it's a good idea to tackle the index docs at the same time
- this patchset is already humongously large...


> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index 62c5f9fa9f..3dc1444739 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2308,7 +2308,6 @@ static const TableAmRoutine heapam_methods = {
>      .scan_begin = heap_beginscan,
>      .scan_end = heap_endscan,
>      .scan_rescan = heap_rescan,
> -    .scan_update_snapshot = heap_update_snapshot,
>      .scan_getnextslot = heap_getnextslot,
>  
>      .parallelscan_estimate = table_block_parallelscan_estimate,
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 59061c746b..b48ab5036c 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -954,5 +954,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
>      node->pstate = pstate;
>  
>      snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
> -    table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
> +    Assert(IsMVCCSnapshot(snapshot));
> +
> +    RegisterSnapshot(snapshot);
> +    node->ss.ss_currentScanDesc->rs_snapshot = snapshot;
> +    node->ss.ss_currentScanDesc->rs_temp_snap = true;
>  }

I was rather thinking that we'd just move this logic into
table_scan_update_snapshot(), without it invoking a callback.


Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Jan 22, 2019 at 12:15 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

Thanks!

On 2019-01-22 11:51:57 +1100, Haribabu Kommi wrote:
> Attached the patch for removal of scan_update_snapshot
> and also the rebased patch of reduction in use of t_tableOid.

I'll soon look at the latter.

Thanks.
 

> > - consider removing table_gimmegimmeslot()
> > - add substantial docs for every callback
> >
>
> Will work on the above two.

I think it's easier if I do the first, because I can just do it while
rebasing, reducing unnecessary conflicts.


OK. I will work on the doc changes.
 
> > While I saw an initial attempt at writing sgml docs for the table AM
> > API, I'm not convinced that's the best approach.  I think it might make
> > more sense to have high-level docs in sgml, but then do all the
> > per-callback docs in tableam.h.
> >
>
> OK, I will update the sgml docs accordingly.
> Index AM has per callback docs in the sgml, refactor them also?

I don't think it's a good idea to tackle the index docs at the same time
- this patchset is already humongously large...

OK.
 

> diff --git a/src/backend/access/heap/heapam_handler.c b/src/backend/access/heap/heapam_handler.c
> index 62c5f9fa9f..3dc1444739 100644
> --- a/src/backend/access/heap/heapam_handler.c
> +++ b/src/backend/access/heap/heapam_handler.c
> @@ -2308,7 +2308,6 @@ static const TableAmRoutine heapam_methods = {
>       .scan_begin = heap_beginscan,
>       .scan_end = heap_endscan,
>       .scan_rescan = heap_rescan,
> -     .scan_update_snapshot = heap_update_snapshot,
>       .scan_getnextslot = heap_getnextslot,

>       .parallelscan_estimate = table_block_parallelscan_estimate,
> diff --git a/src/backend/executor/nodeBitmapHeapscan.c b/src/backend/executor/nodeBitmapHeapscan.c
> index 59061c746b..b48ab5036c 100644
> --- a/src/backend/executor/nodeBitmapHeapscan.c
> +++ b/src/backend/executor/nodeBitmapHeapscan.c
> @@ -954,5 +954,9 @@ ExecBitmapHeapInitializeWorker(BitmapHeapScanState *node,
>       node->pstate = pstate;

>       snapshot = RestoreSnapshot(pstate->phs_snapshot_data);
> -     table_scan_update_snapshot(node->ss.ss_currentScanDesc, snapshot);
> +     Assert(IsMVCCSnapshot(snapshot));
> +
> +     RegisterSnapshot(snapshot);
> +     node->ss.ss_currentScanDesc->rs_snapshot = snapshot;
> +     node->ss.ss_currentScanDesc->rs_temp_snap = true;
>  }

I was rather thinking that we'd just move this logic into
table_scan_update_snapshot(), without it invoking a callback.

OK. Changed accordingly.
But I moved the table_scan_update_snapshot() function into tableam.c,
to avoid including the additional header file snapmgr.h in tableam.h.
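For reference, the resulting helper is essentially the four lines from the diff above, hoisted into a plain function. Below is a compilable sketch using stub stand-ins for the PostgreSQL types (the real function in tableam.c uses the actual Snapshot, TableScanDesc, and snapmgr.h's RegisterSnapshot, so this is an illustration only):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stub stand-ins for the PostgreSQL types -- illustration only. */
typedef struct SnapshotData
{
    bool mvcc;          /* stands in for IsMVCCSnapshot() */
    int  regd_count;    /* stands in for snapmgr's registration count */
} SnapshotData;
typedef SnapshotData *Snapshot;

typedef struct TableScanDescData
{
    Snapshot rs_snapshot;
    bool     rs_temp_snap;
} TableScanDescData;
typedef TableScanDescData *TableScanDesc;

static void
RegisterSnapshot(Snapshot snapshot)
{
    snapshot->regd_count++;
}

/*
 * Sketch of table_scan_update_snapshot() as a plain function: the snapshot
 * bookkeeping happens in one place, with no per-AM callback involved.
 */
void
table_scan_update_snapshot(TableScanDesc scan, Snapshot snapshot)
{
    assert(snapshot->mvcc);     /* Assert(IsMVCCSnapshot(snapshot)) */
    RegisterSnapshot(snapshot);
    scan->rs_snapshot = snapshot;
    scan->rs_temp_snap = true;  /* tells scan teardown to unregister it */
}
```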

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Mon, Jan 21, 2019 at 3:01 AM Andres Freund <andres@anarazel.de> wrote:
>
> The patchset is now pretty granularly split into individual pieces.

Wow, thanks!

> On Mon, Jan 21, 2019 at 9:33 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> Regression tests that use \d+ to show the table details might
> not be interested specifically in table access method. But these will
> fail if run with a modified default access method.

I see your point, but if a test is not interested specifically in a table am,
then I guess it wouldn't use a custom table am in the first place, right?
Anyway, I don't have a strong opinion here, so if everyone agrees that
HIDE_TABLEAM will show/hide the access method unconditionally, I'm fine with that.


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Tue, 22 Jan 2019 at 15:29, Dmitry Dolgov <9erthalion6@gmail.com> wrote:
> > On Mon, Jan 21, 2019 at 9:33 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > Regression tests that use \d+ to show the table details might
> > not be interested specifically in table access method. But these will
> > fail if run with a modified default access method.
>
> I see your point, but if a test is not interested specifically in a table am,
> then I guess it wouldn't use a custom table am in the first place, right?

Right. It wouldn't use a custom table am. But I mean that, despite not
using a custom table am, the test would fail if the regression suite runs
with a changed default access method, because the expected output file
contains only one particular AM value.

> Anyway, I don't have strong opinion here, so if everyone agrees that HIDE_TABLEAM
> will show/hide access method unconditionally, I'm fine with that.

Yeah, I agree it's subjective.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Dmitry Dolgov
Date:
> On Sun, Jan 20, 2019 at 6:17 PM Dmitry Dolgov <9erthalion6@gmail.com> wrote:
>
> > On Fri, Jan 18, 2019 at 11:22 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> >
> > I believe you are going to add a new regression testcase for the change ?
>
> Yep.

So, here are these two patches for pg_dump/psql with a few regression tests.

Attachments

Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
Hi,

Attached is a patch that adds some test scenarios for testing the
dependency of various object types on the table AM. Besides simple tables,
it considers materialized views, partitioned tables, foreign tables, and
composite types, and verifies that the dependency is created only for
those object types that support a table access method.

This patch is based on commit 1bc7e6a4838 in
https://github.com/anarazel/postgres-pluggable-storage

Thanks
-Amit Khandekar

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:


OK. I will work on the doc changes.

Sorry for the delay.

Attached is a draft patch of the doc and comment changes that I worked on.
Currently I have added comments to the callbacks present in the TableAmRoutine
structure and copied them into the docs; I am not sure whether that is a good approach or not.
I have yet to add a description of each parameter of the callbacks for easier understanding.

Or should each callback be described in the docs, grouped the way the callbacks
are divided in the TableAmRoutine structure? Currently the following divisions
are available:
1. Table scan
2. Parallel table scan
3. Index scan
4. Manipulation of physical tuples
5. Non-modifying operations on individual tuples
6. DDL 
7. Planner
8. Executor

Suggestions?

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Mon, 21 Jan 2019 at 08:31, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> (resending with compressed attachements, perhaps that'll go through)
>
> On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > > that.
> >
> > I've pushed a version to that to the git tree, including a rebased
> > version of zheap:
> > https://github.com/anarazel/postgres-pluggable-storage

I worked on a slight improvement to the
0040-WIP-Move-xid-horizon-computation-for-page-level patch. Instead
of pre-fetching all the required buffers beforehand, the attached WIP
patch pre-fetches the buffers keeping a constant distance ahead of the
buffer reads. It's a WIP patch because right now it just uses a
hard-coded 5 buffers ahead. I haven't used effective_io_concurrency the
way it is done in nodeBitmapHeapscan.c; I will do that next. But before
that, any comments on the way I did the improvements would be nice.

Note that for now, the patch is based on the pluggable-storage latest
commit; it does not replace the 0040 patch in the patch series.

-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments

Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Wed, 6 Feb 2019 at 18:30, Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
>
> On Mon, 21 Jan 2019 at 08:31, Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > (resending with compressed attachements, perhaps that'll go through)
> >
> > On 2018-12-10 18:13:40 -0800, Andres Freund wrote:
> > > On 2018-11-26 17:55:57 -0800, Andres Freund wrote:
> > > > FWIW, now that oids are removed, and the tuple table slot abstraction
> > > > got in, I'm working on rebasing the pluggable storage patchset ontop of
> > > > that.
> > >
> > > I've pushed a version to that to the git tree, including a rebased
> > > version of zheap:
> > > https://github.com/anarazel/postgres-pluggable-storage
>
> I worked on a slight improvement on the
> 0040-WIP-Move-xid-horizon-computation-for-page-level patch . Instead
> of pre-fetching all the required buffers beforehand, the attached WIP
> patch pre-fetches the buffers keeping a constant distance ahead of the
> buffer reads. It's a WIP patch because right now it just uses a
> hard-coded 5 buffers ahead. Haven't used effective_io_concurrency like
> how it is done in nodeBitmapHeapscan.c. Will do that next. But before
> that, any comments on the way I did the improvements would be nice.
>
> Note that for now, the patch is based on the pluggable-storage latest
> commit; it does not replace the 0040 patch in the patch series.

In the attached v1 patch, the prefetch_distance is calculated as
effective_io_concurrency + 10. Also it has some cosmetic changes.
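The sliding-window idea can be sketched in a self-contained way. The names here (prefetch_block, read_with_prefetch) and the +10 constant are illustrative stand-ins for the discussion above; the actual patch issues PrefetchBuffer() calls against shared buffers:

```c
#include <assert.h>

#define PREFETCH_EXTRA 10       /* constant added to effective_io_concurrency */

/* Records which block numbers were "prefetched", in order (illustration only). */
static int prefetched[64];
static int nprefetched = 0;

static void
prefetch_block(int blkno)
{
    prefetched[nprefetched++] = blkno;  /* real code: PrefetchBuffer() */
}

/*
 * Read nblocks sequentially, keeping the prefetch position a constant
 * distance ahead of the read position instead of prefetching everything
 * up front.
 */
static void
read_with_prefetch(const int *blocks, int nblocks, int effective_io_concurrency)
{
    int prefetch_distance = effective_io_concurrency + PREFETCH_EXTRA;
    int next_prefetch = 0;

    for (int i = 0; i < nblocks; i++)
    {
        /* top up the window so it stays prefetch_distance ahead of i */
        while (next_prefetch < nblocks &&
               next_prefetch < i + prefetch_distance)
            prefetch_block(blocks[next_prefetch++]);

        /* ... ReadBuffer(blocks[i]) and process the page here ... */
    }
}
```

With effective_io_concurrency = 0 this still prefetches 10 blocks ahead, which matches the intent that some prefetching happens even at the default setting.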


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:


On Mon, Feb 4, 2019 at 2:31 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:

On Tue, Jan 22, 2019 at 1:43 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:


OK. I will work on the doc changes.

Sorry for the delay.

Attached is a draft patch of the doc and comment changes that I worked on.
Currently I have added comments to the callbacks present in the TableAmRoutine
structure and copied them into the docs; I am not sure whether that is a good approach or not.
I have yet to add a description of each parameter of the callbacks for easier understanding.

Or should each callback be described in the docs, grouped the way the callbacks
are divided in the TableAmRoutine structure? Currently the following divisions
are available:
1. Table scan
2. Parallel table scan
3. Index scan
4. Manipulation of physical tuples
5. Non-modifying operations on individual tuples
6. DDL 
7. Planner
8. Executor

Suggestions?

Here I have attached the doc patches for pluggable storage. I divided the APIs into
the groups specified above and explained them in the docs. I can add further details
if the approach seems fine.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Tue, Nov 27, 2018 at 4:59 PM Amit Langote <Langote_Amit_f8@lab.ntt.co.jp> wrote:
Hi,

On 2018/11/02 9:17, Haribabu Kommi wrote:
> Here I attached the cumulative fixes of the patches, new API additions for
> zheap and
> basic outline of the documentation.

I've read the documentation patch while also looking at the code and here
are some comments.

Thanks for the review, and apologies for the delay.
I have taken care of most of your comments in the latest version of the
doc patches.
 

+  <para>
+<programlisting>
+TupleTableSlotOps *
+slot_callbacks (Relation relation);
+</programlisting>
+   API to access the slot specific methods;
+   Following methods are available;
+   <structname>TTSOpsVirtual</structname>,
+   <structname>TTSOpsHeapTuple</structname>,
+   <structname>TTSOpsMinimalTuple</structname>,
+   <structname>TTSOpsBufferTuple</structname>,
+  </para>

Unless I'm misunderstanding what the TupleTableSlotOps abstraction is or
its relation to the TableAmRoutine abstraction, I think the text
description could better be written as:

"API to get the slot operations struct for a given table access method"

It's not clear to me why the various TTSOps* structs are listed here.  Is the
point that different AMs may choose one of the listed alternatives?  For
example, I see that the heap AM implementation returns TTSOpsBufferTuple, so it
manipulates slots containing buffered tuples, right?  Other AMs are free
to return any one of these?  For example, some AMs may never use the buffer
manager and hence not use TTSOpsBufferTuple.  Is that understanding correct?

Yes, an AM can decide what type of slot method it wants to use.
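As an illustration of that choice: each AM's slot_callbacks implementation simply returns a pointer to a static ops struct. Below is a minimal compilable sketch with empty stand-in structs (these are not PostgreSQL's actual TTSOps* definitions, and inmem_slot_callbacks is a hypothetical AM made up for the example):

```c
#include <stddef.h>

/* Empty stand-ins for PostgreSQL's slot ops structs -- illustration only. */
typedef struct TupleTableSlotOps { int dummy; } TupleTableSlotOps;

static const TupleTableSlotOps TTSOpsVirtual     = {0};
static const TupleTableSlotOps TTSOpsBufferTuple = {0};

typedef struct Relation { int dummy; } Relation;

/* The heap AM works on buffered tuples, so it returns the buffer-tuple ops. */
const TupleTableSlotOps *
heapam_slot_callbacks(const Relation *relation)
{
    (void) relation;            /* unused in this sketch */
    return &TTSOpsBufferTuple;
}

/*
 * A hypothetical AM that never touches the buffer manager could return a
 * different ops struct, e.g. the one for virtual (materialized) tuples.
 */
const TupleTableSlotOps *
inmem_slot_callbacks(const Relation *relation)
{
    (void) relation;
    return &TTSOpsVirtual;
}
```

The executor then builds slots using whichever ops struct the table's AM handed back, without knowing anything about the AM's tuple representation.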

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Robert Haas
Date:
On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> In the attached v1 patch, the prefetch_distance is calculated as
> effective_io_concurrency + 10. Also it has some cosmetic changes.

I did a brief review of this patch and noticed the following things.

+} PrefetchState;

That name seems too generic.

+/*
+ * An arbitrary way to come up with a pre-fetch distance that grows with io
+ * concurrency, but is at least 10 and not more than the max effective io
+ * concurrency.
+ */

This comment is kinda useless, because it only tells us what the code
does (which is obvious anyway) and not why it does that.  Saying that
your formula is arbitrary may not be the best way to attract support
for it.

+ for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)

It looks strange to me that next_item is stored in prefetch_state and
nitems is passed around as an argument.  Is there some reason why it's
like that?

+ /* prefetch a fixed number of pages beforehand. */

Not accurate -- the number of pages we prefetch isn't fixed.  It
depends on effective_io_concurrency.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Thu, 21 Feb 2019 at 04:17, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 8, 2019 at 5:18 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > In the attached v1 patch, the prefetch_distance is calculated as
> > effective_io_concurrency + 10. Also it has some cosmetic changes.
>
> I did a little brief review of this patch and noticed the following things.
>
> +} PrefetchState;
> That name seems too generic.

Ok, so something like XidHorizonPrefetchState? Along similar lines, does
the prefetch_buffer() function name sound too generic as well?

>
> +/*
> + * An arbitrary way to come up with a pre-fetch distance that grows with io
> + * concurrency, but is at least 10 and not more than the max effective io
> + * concurrency.
> + */
>
> This comment is kinda useless, because it only tells us what the code
> does (which is obvious anyway) and not why it does that.  Saying that
> your formula is arbitrary may not be the best way to attract support
> for it.

Well, I had checked the way the number of drive spindles
(effective_io_concurrency) is used to calculate the prefetch distance
for bitmap heap scans (ComputeIoConcurrency). Basically I think the
intention behind that method is to come up with a number that makes it
highly likely that we pre-fetch a block on each of the drive spindles.
But I didn't quite get how that works, much less for non-parallel
bitmap scans. The same holds for the pre-fetching that we do here
for the xid-horizon stuff, where we do the block reads sequentially.
Andres and I discussed this offline, and he was of the opinion that this
formula won't help here, and that instead we should just keep a constant
distance that is some number greater than effective_io_concurrency. I agree
that instead of saying "arbitrary" we should explain why we have done
it that way, and before that, come up with an agreed-upon formula.


>
> + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)
>
> It looks strange to me that next_item is stored in prefetch_state and
> nitems is passed around as an argument.  Is there some reason why it's
> like that?

We could keep the max count in the structure itself as well. There
isn't any specific reason for not keeping it there. It's just that
this function (prefetch_buffer()) is not a general function for
maintaining a prefetch state that spans function calls, so we
might as well just pass the max count to that function instead of
having another field in that structure. I am not specifically inclined
towards either of the approaches.

>
> + /* prefetch a fixed number of pages beforehand. */
>
> Not accurate -- the number of pages we prefetch isn't fixed.  It
> depends on effective_io_concurrency.

Yeah, will change that in the next patch version, according to what we
conclude about the prefetch distance calculation.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Robert Haas
Date:
On Thu, Feb 21, 2019 at 6:44 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Ok, so something like XidHorizonPrefetchState ? On similar lines, does
> prefetch_buffer() function name sound too generic as well ?

Yeah, that sounds good.  And, yeah, then maybe rename the function too.

> > +/*
> > + * An arbitrary way to come up with a pre-fetch distance that grows with io
> > + * concurrency, but is at least 10 and not more than the max effective io
> > + * concurrency.
> > + */
> >
> > This comment is kinda useless, because it only tells us what the code
> > does (which is obvious anyway) and not why it does that.  Saying that
> > your formula is arbitrary may not be the best way to attract support
> > for it.
>
> Well, I had checked the way the number of drive spindles
> (effective_io_concurrency) is used to calculate the prefetch distance
> for bitmap heap scans (ComputeIoConcurrency). Basically I think the
> intention behind that method is to come up with a number that makes it
> highly likely that we pre-fetch a block of each of the drive spindles.
> But I didn't get how that exactly works, all the less for non-parallel
> bitmap scans. Same is the case for the pre-fetching that we do here
> for xid-horizon stuff, where we do the block reads sequentially. Me
> and Andres discussed this offline, and he was of the opinion that this
> formula won't help here, and instead we just keep a constant distance
> that is some number greater than effective_io_concurrency. I agree
> that instead of saying "arbitrary" we should explain why we have done
> that, and before that, come up with an agreed-upon formula.

Maybe something like: We don't use the regular formula to determine
how much to prefetch here, but instead just add a constant to
effective_io_concurrency.  That's because it seems best to do some
prefetching here even when effective_io_concurrency is set to 0, but
if the DBA thinks it's OK to do more prefetching for other operations,
then it's probably OK to do more prefetching in this case, too.  It
may be that this formula is too simplistic, but at the moment we have
no evidence of that or any idea about what would work better.

> > + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)
> >
> > It looks strange to me that next_item is stored in prefetch_state and
> > nitems is passed around as an argument.  Is there some reason why it's
> > like that?
>
> We could keep the max count in the structure itself as well. There
> isn't any specific reason for not keeping it there. It's just that
> this function prefetch_state () is not a general function for
> maintaining a prefetch state that spans across function calls; so we
> might as well just pass the max count to that function instead of
> having another field in that structure. I am not inclined specifically
> towards either of the approaches.

All right, count me as +0.5 for putting a copy in the structure.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Thu, 21 Feb 2019 at 18:06, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Thu, Feb 21, 2019 at 6:44 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Ok, so something like XidHorizonPrefetchState ? On similar lines, does
> > prefetch_buffer() function name sound too generic as well ?
>
> Yeah, that sounds good.

> And, yeah, then maybe rename the function too.

Renamed the function to xid_horizon_prefetch_buffer().

>
> > > +/*
> > > + * An arbitrary way to come up with a pre-fetch distance that grows with io
> > > + * concurrency, but is at least 10 and not more than the max effective io
> > > + * concurrency.
> > > + */
> > >
> > > This comment is kinda useless, because it only tells us what the code
> > > does (which is obvious anyway) and not why it does that.  Saying that
> > > your formula is arbitrary may not be the best way to attract support
> > > for it.
> >
> > Well, I had checked the way the number of drive spindles
> > (effective_io_concurrency) is used to calculate the prefetch distance
> > for bitmap heap scans (ComputeIoConcurrency). Basically I think the
> > intention behind that method is to come up with a number that makes it
> > highly likely that we pre-fetch a block of each of the drive spindles.
> > But I didn't get how that exactly works, all the less for non-parallel
> > bitmap scans. Same is the case for the pre-fetching that we do here
> > for xid-horizon stuff, where we do the block reads sequentially. Me
> > and Andres discussed this offline, and he was of the opinion that this
> > formula won't help here, and instead we just keep a constant distance
> > that is some number greater than effective_io_concurrency. I agree
> > that instead of saying "arbitrary" we should explain why we have done
> > that, and before that, come up with an agreed-upon formula.

>
> Maybe something like: We don't use the regular formula to determine
> how much to prefetch here, but instead just add a constant to
> effective_io_concurrency.  That's because it seems best to do some
> prefetching here even when effective_io_concurrency is set to 0, but
> if the DBA thinks it's OK to do more prefetching for other operations,
> then it's probably OK to do more prefetching in this case, too.  It
> may be that this formula is too simplistic, but at the moment we have
> no evidence of that or any idea about what would work better.

Thanks for writing it down for me. I think this is good to go as a
comment, so I put it as-is into the patch.

>
> > > + for (i = prefetch_state->next_item; i < nitems && count < prefetch_count; i++)
> > >
> > > It looks strange to me that next_item is stored in prefetch_state and
> > > nitems is passed around as an argument.  Is there some reason why it's
> > > like that?
> >
> > We could keep the max count in the structure itself as well. There
> > isn't any specific reason for not keeping it there. It's just that
> > this function prefetch_state () is not a general function for
> > maintaining a prefetch state that spans across function calls; so we
> > might as well just pass the max count to that function instead of
> > having another field in that structure. I am not inclined specifically
> > towards either of the approaches.
>
> All right, count me as +0.5 for putting a copy in the structure.

Have put the nitems into the structure.

Thanks for the review. Attached v2.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company

Attachments

Re: Pluggable Storage - Andres's take

From
Robert Haas
Date:
On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> Thanks for the review. Attached v2.

Thanks.  I took this, combined it with Andres's
v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
some polishing of the code and comments, and pgindented.  Here's what
I ended up with; see what you think.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Attachments

Re: Pluggable Storage - Andres's take

From
Amit Khandekar
Date:
On Sat, 23 Feb 2019 at 01:22, Robert Haas <robertmhaas@gmail.com> wrote:
>
> On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Thanks for the review. Attached v2.
>
> Thanks.  I took this, combined it with Andres's
> v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
> some polishing of the code and comments, and pgindented.  Here's what
> I ended up with; see what you think.

Thanks, Robert! The changes look good.


-- 
Thanks,
-Amit Khandekar
EnterpriseDB Corporation
The Postgres Database Company


Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> I am not able to remove the complete t_tableOid from HeapTuple,
> because of its use in triggers, as the slot is not available in triggers
> and I need to store the tableOid also as part of the tuple.

What precisely do you mean by "use in triggers"? You mean that a trigger
might access a HeapTuple's t_tableOid directly, even though all of the
information is available in the trigger context?

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From
Haribabu Kommi
Date:

On Wed, Feb 27, 2019 at 11:10 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-01-21 10:32:37 +1100, Haribabu Kommi wrote:
> I am not able to remove the complete t_tableOid from HeapTuple,
> because of its use in triggers, as the slot is not available in triggers
> and I need to store the tableOid also as part of the tuple.

What precisely do you mean by "use in triggers"? You mean that a trigger
might access a HeapTuple's t_tableOid directly, even though all of the
information is available in the trigger context?

I forgot the exact scenario, but during trigger function execution, the
PL/pgSQL function execution accesses TableOidAttributeNumber from the stored
tuple using the heap_get* functions. Because of the lack of slot support in triggers,
we still need to maintain t_tableOid with the proper OID. The HeapTuple's t_tableOid
member is updated whenever the heap tuple is generated from a slot.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From
Heikki Linnakangas
Date:
I haven't been following this thread closely, but I looked briefly at 
some of the patches posted here:

On 21/01/2019 11:01, Andres Freund wrote:
> The patchset is now pretty granularly split into individual pieces.

Wow, 42 patches, very granular indeed! That's nice for reviewing, but 
are you planning to squash them before committing? Seems a bit excessive 
for the git history.

Patches 1-4:

* v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
* v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
* v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
* v12-0004-Remove-superfluous-tqual.h-includes.patch

These look good to me. I think it would make sense to squash these 
together, and commit now.


Patches 20 and 21:
* v12-0020-WIP-Slotified-triggers.patch
* v12-0021-WIP-Slotify-EPQ.patch

I like this slotification of trigger and EPQ code. It seems like a nice 
thing to do, independently of the other patches. You said you wanted to 
polish that up to committable state, and commit separately: +1 on that. 
Perhaps do that even before patches 1-4.

> --- a/src/include/commands/trigger.h
> +++ b/src/include/commands/trigger.h
> @@ -35,8 +35,8 @@ typedef struct TriggerData
>      HeapTuple    tg_trigtuple;
>      HeapTuple    tg_newtuple;
>      Trigger    *tg_trigger;
> -    Buffer        tg_trigtuplebuf;
> -    Buffer        tg_newtuplebuf;
> +    TupleTableSlot *tg_trigslot;
> +    TupleTableSlot *tg_newslot;
>      Tuplestorestate *tg_oldtable;
>      Tuplestorestate *tg_newtable;
>  } TriggerData;

Do we still need tg_trigtuple and tg_newtuple? Can't we always use the 
corresponding slots instead? Is it for backwards-compatibility with 
user-defined triggers written in C? (Documentation also needs to be 
updated for the changes in this struct)


I didn't look a the rest of the patches yet...

- Heikki


Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-02-27 18:00:12 +0800, Heikki Linnakangas wrote:
> I haven't been following this thread closely, but I looked briefly at some
> of the patches posted here:

Thanks!


> On 21/01/2019 11:01, Andres Freund wrote:
> > The patchset is now pretty granularly split into individual pieces.
> 
> Wow, 42 patches, very granular indeed! That's nice for reviewing, but are
> you planning to squash them before committing? Seems a bit excessive for the
> git history.

I've pushed a number of the preliminary patches since you replied. We're
down to 23 in my local count...

I do plan to / did squash some, but not actually that many. I find that
patches beyond a certain size are just too hard to give the necessary
final polish, especially if they do several independent things. Keeping
things granular also allows pushing incrementally, even when later
patches aren't quite ready - imo pretty important for a project this
size.


> Patches 1-4:
> 
> * v12-0001-WIP-Introduce-access-table.h-access-relation.h.patch
> * v12-0002-Replace-heapam.h-includes-with-relation.h-table..patch
> * v12-0003-Replace-uses-of-heap_open-et-al-with-table_open-.patch
> * v12-0004-Remove-superfluous-tqual.h-includes.patch
> 
> These look good to me. I think it would make sense to squash these together,
> and commit now.

I've pushed these a while ago.


> Patches 20 and 21:
> * v12-0020-WIP-Slotified-triggers.patch
> * v12-0021-WIP-Slotify-EPQ.patch
> 
> I like this slotification of trigger and EPQ code. It seems like a nice
> thing to do, independently of the other patches. You said you wanted to
> polish that up to committable state, and commit separately: +1 on
> that.

I pushed the trigger patch yesterday evening. Working to finalize the
EPQ patch now - I've polished it a fair bit since the version posted on
the list, but it still needs a bit more.

Once the EPQ patch (and two other simple preliminary ones) is pushed, I
plan to post a new rebased version to this thread. That's then really
only the core table AM work.


> > --- a/src/include/commands/trigger.h
> > +++ b/src/include/commands/trigger.h
> > @@ -35,8 +35,8 @@ typedef struct TriggerData
> >      HeapTuple    tg_trigtuple;
> >      HeapTuple    tg_newtuple;
> >      Trigger    *tg_trigger;
> > -    Buffer        tg_trigtuplebuf;
> > -    Buffer        tg_newtuplebuf;
> > +    TupleTableSlot *tg_trigslot;
> > +    TupleTableSlot *tg_newslot;
> >      Tuplestorestate *tg_oldtable;
> >      Tuplestorestate *tg_newtable;
> >  } TriggerData;
> 
> Do we still need tg_trigtuple and tg_newtuple? Can't we always use the
> corresponding slots instead? Is it for backwards-compatibility with
> user-defined triggers written in C?

Yes, the external trigger interface currently relies on those being
there. I think we probably ought to revise that, at the very least
because it'll otherwise be noticeably less efficient to have triggers on
!heap tables, but also because it's just cleaner.  But I feel like I
don't want more significantly sized patches on my plate right now, so my
current goal is to just put that on the todo after the pluggable storage
work.  Kinda wonder if we don't want to do that earlier in a release
cycle too, so we can do other breaking changes to the trigger interface
without breaking external code multiple times.  There's probably also an
argument for just not breaking the interface.


> (Documentation also needs to be updated for the changes in this
> struct)

Ah, nice catch, will do that next.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Ashwin Agrawal
Date:
Hi,

While playing with the tableam, usage of which starts with commit v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch, should we check for a NULL function pointer before actually calling it, and ERROR out as NOT_SUPPORTED or something along those lines?

I understand it's the kind of thing which should get caught during development. But currently it segfaults if you forget to define some AM function; it might be easier for iterative development to error out in a common place instead.

Or should there be an upfront check for NULL somewhere, if all the AM functions are mandatory and must not be left undefined?

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

Thanks for looking!

On 2019-03-05 18:27:45 -0800, Ashwin Agrawal wrote:
> While playing with the tableam, usage of which starts with commit
> v12-0023-tableam-Introduce-and-use-begin-endscan-and-do-i.patch, should we
> check for NULL function pointer before actually calling the same and ERROR
> out instead as NOT_SUPPORTED or something on those lines.

Scans seem like absolutely required part of the functionality, so I
don't think there's much point in that. It'd just bloat code and
runtime.


> Understand its kind of think which should get caught during development.
> But still currently it segfaults if missing to define some AM function,

The segfault itself doesn't bother me at all; it's just a NULL pointer
dereference. If we were to put Asserts somewhere it'd crash very
similarly. I think you have a point in that:

> might be easier for iterative development to error instead in common place.

Would make it a tiny bit easier to implement a new AM.  We could
probably add a few asserts to GetTableAmRoutine(), to check that
required functions are implemented.  Don't think that'd make a meaningful
difference for something like the scan functions, but it'd probably make
it easier to forward-port AMs to the next release - I'm pretty sure
we're going to add required callbacks in the next few releases.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-02-27 09:29:31 -0800, Andres Freund wrote:
> Once the EPQ patch (and two other simple preliminary ones) is pushed, I
> plan to post a new rebased version to this thread. That's then really
> only the core table AM work.

That's now done.  Here's my current submission of remaining
patches. I've polished the first patch, adding DDL support, quite a bit,
I'm planning to push that soon.

Changes:

- I've removed the ability to specify a table AM for partitioned tables,
  as discussed at [1]
- That happily shook out a number of bugs where the partitioned table's
  AM was used when the leaf partition's AM should have been used (via
  the slot). In particular this necessitated refactoring the way slots
  are used for ON CONFLICT on partitioned tables. That's the new WIP
  patch in the series. But I think the result is actually clearer.
- I've integrated the pg_dump and psql patches, although I've made
  HIDE_TABLEAM independent of whether \d+ is run on a table with the
  default AM or not.
- There's a good number of new tests in both the DDL and the pg_dump
  patch
- Lots of smaller cleanups


My next steps are:
- final polish & push the basic DDL and pg_dump patches
- cleanup & polish the ON CONFLICT refactoring
- cleanup & polish the patch adding the tableam based scan
  interface. That's by far the largest patch in the series. I might try
  to split it up further, but it's probably not worth it.
- improve documentation for the individual callbacks (integrating
  work done by others on this thread), in the header
- integrate docs patch
- integrate the revised version of the xid horizon patch by Amit
  Khandekar (reviewed by Robert Haas)
- fix naive implementation of slot based COPY, to not constantly
  drop/recreate slots upon partition change. I've hackplemented a better
  approach, which makes it faster than the current code in my testing.

Notes:
- I'm currently not targeting v13 with "tableam: Fetch tuples for
  triggers & EPQ using a proper snapshot.". While we need something like
  that for some AMs like zheap, I think it's a crap approach and needs
  more thought.

Greetings,

Andres Freund

[1] https://postgr.es/m/20190304234700.w5tmhducs5wxgzls@alap3.anarazel.de

Attachments

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches

Done and pushed. Some collation-dependent fallout; I'm hoping I've just
fixed that.

> - cleanup & polish the ON CONFLICT refactoring

Here's a cleaned up version of that patch.  David, Alvaro, you also
played in that area, any objections? I think this makes that part of the
code easier to read actually. Robert, thanks for looking at that patch
already.

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From:
David Rowley
Date:
On Thu, 7 Mar 2019 at 08:33, Andres Freund <andres@anarazel.de> wrote:
> Here's a cleaned up version of that patch.  David, Alvaro, you also
> played in that area, any objections? I think this makes that part of the
> code easier to read actually. Robert, thanks for looking at that patch
> already.

I only had a quick look and don't have a grasp of what the patch
series is doing to tuple slots, but I didn't see anything I found
alarming during the read.


-- 
 David Rowley                   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-03-07 11:56:57 +1300, David Rowley wrote:
> On Thu, 7 Mar 2019 at 08:33, Andres Freund <andres@anarazel.de> wrote:
> > Here's a cleaned up version of that patch.  David, Alvaro, you also
> > played in that area, any objections? I think this makes that part of the
> > code easier to read actually. Robert, thanks for looking at that patch
> > already.
> 
> I only had a quick look and don't have a grasp of what the patch
> series is doing to tuple slots, but I didn't see anything I found
> alarming during the read.

Thanks for looking.

Re slots - the deal basically is that going forward, low-level
operations, like fetching a row from a table etc., have to be done via a
slot that's compatible with the "target" table. You can get compatible
slot callbacks by calling table_slot_callbacks(), or directly create one
by calling table_gimmegimmeslot() (likely to be renamed :)).

The problem here was that the partition root's slot was used to fetch /
store rows from a child partition. By moving mt_existing into
ResultRelInfo that's not the case anymore.

- Andres


Re: Pluggable Storage - Andres's take

From:
Robert Haas
Date:
On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> slot that's compatible with the "target" table. You can get compatible
> slot callbakcs by calling table_slot_callbacks(), or directly create one
> by calling table_gimmegimmeslot() (likely to be renamed :)).

Hmm.  I assume the issue is that table_createslot() was already taken
for another purpose, so then when you needed another callback you went
with table_givemeslot(), and then when you needed a third API to do
something in the same area the best thing available was
table_gimmeslot(), which meant that the fourth API could only be
table_gimmegimmeslot().

Does that sound about right?

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
> On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> > slot that's compatible with the "target" table. You can get compatible
> > slot callbakcs by calling table_slot_callbacks(), or directly create one
> > by calling table_gimmegimmeslot() (likely to be renamed :)).
>
> Hmm.  I assume the issue is that table_createslot() was already taken
> for another purpose, so then when you needed another callback you went
> with table_givemeslot(), and then when you needed a third API to do
> something in the same area the best thing available was
> table_gimmeslot(), which meant that the fourth API could only be
> table_gimmegimmeslot().
>
> Does that sound about right?

It was 3 AM, and I thought it was hilarious...


Re: Pluggable Storage - Andres's take

From:
Robert Haas
Date:
On Thu, Mar 7, 2019 at 12:49 PM Andres Freund <andres@anarazel.de> wrote:
> On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
> > On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
> > > slot that's compatible with the "target" table. You can get compatible
> > > slot callbakcs by calling table_slot_callbacks(), or directly create one
> > > by calling table_gimmegimmeslot() (likely to be renamed :)).
> >
> > Hmm.  I assume the issue is that table_createslot() was already taken
> > for another purpose, so then when you needed another callback you went
> > with table_givemeslot(), and then when you needed a third API to do
> > something in the same area the best thing available was
> > table_gimmeslot(), which meant that the fourth API could only be
> > table_gimmegimmeslot().
> >
> > Does that sound about right?
>
> It was 3 AM, and I thought it was hilarious...

It is.  Just like me.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: Pluggable Storage - Andres's take

From:
ilmari@ilmari.org (Dagfinn Ilmari Mannsåker)
Date:
Andres Freund <andres@anarazel.de> writes:

> On 2019-03-07 08:52:21 -0500, Robert Haas wrote:
>> On Wed, Mar 6, 2019 at 6:11 PM Andres Freund <andres@anarazel.de> wrote:
>> > slot that's compatible with the "target" table. You can get compatible
>> > slot callbakcs by calling table_slot_callbacks(), or directly create one
>> > by calling table_gimmegimmeslot() (likely to be renamed :)).
>>
>> Hmm.  I assume the issue is that table_createslot() was already taken
>> for another purpose, so then when you needed another callback you went
>> with table_givemeslot(), and then when you needed a third API to do
>> something in the same area the best thing available was
>> table_gimmeslot(), which meant that the fourth API could only be
>> table_gimmegimmeslot().
>>
>> Does that sound about right?
>
> It was 3 AM, and I thought it was hilarious...

♫ Gimme! Gimme! Gimme! A slot after midnight ♫

- ilmari (SCNR) 
-- 
"I use RMS as a guide in the same way that a boat captain would use
 a lighthouse.  It's good to know where it is, but you generally
 don't want to find yourself in the same spot." - Tollef Fog Heen


Re: Pluggable Storage - Andres's take

From:
Haribabu Kommi
Date:

On Thu, Mar 7, 2019 at 6:33 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches

Done and pushed. Some collation dependent fallout, I'm hoping I've just
fixed that.

Thanks for the corrections that I missed, also for the extra changes.

Here I attached the rebased patches that I shared earlier. I am adding
comments to explain the APIs in the code, and will share those patches later.

I observed a crash with the latest patch series in the COPY command.
I am not sure whether the problem is with the reduce-tableOid patch;
I will check it and correct the problem.

Regards,
Haribabu Kommi
Fujitsu Australia
Attachments

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-03-05 23:07:21 -0800, Andres Freund wrote:
> My next steps are:
> - final polish & push the basic DDL and pg_dump patches
> - cleanup & polish the ON CONFLICT refactoring

Those are pushed.


> - cleanup & polish the patch adding the tableam based scan
>   interface. That's by far the largest patch in the series. I might try
>   to split it up further, but it's probably not worth it.

I decided not to split it up further, and even merged two small commits
into it. Subdividing it cleanly would have required making some changes
just to undo them in a subsequent patch.


> - improve documentation for the individual callbacks (integrating
>   work done by others on this thread), in the header

I've done that for the callbacks in the above commit.

Changes:
- I've added comments to all the callbacks in the first commit / the
  scan commit
- I've renamed table_gimmegimmeslot to table_slot_create
- I've made the callbacks and their wrappers more consistently named
- I've added asserts for necessary callbacks in scan commit
- Lots of smaller cleanup
- Added a commit message

While 0001 is pretty bulky, the interesting bits concentrate on a
comparatively small area. I'd appreciate if somebody could give the
comments added in tableam.h a read (both on callbacks, and their
wrappers, as they have different audiences). It'd make sense to first
read the commit message, to understand the goal (and I'd obviously also
appreciate suggestions for improvements there as well).

I'm pretty happy with the current state of the scan patch. I plan to do
two more passes through it (formatting, comment polishing, etc. I don't
know of any functional changes needed), and then commit it, lest
somebody objects.

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> Here I attached the rebased patches that I shared earlier. I am adding the
> comments to explain the API's in the code, will share those patches later.

I've started to add those for the callbacks in the first commit. I'd
appreciate a look!

I think I'll include the docs patches, sans the callback documentation,
in the next version. I'll probably merge them into one commit, if that's
OK with you?


> I observed a crash with the latest patch series in the COPY command.

Hm, which version was this? I'd at some point accidentally posted a
'tmp' commit that was just a performance hack.


Btw, your patches always are attached out of order:
https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
10, 1, 3, 4, 2 ...

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Dmitry Dolgov
Date:
> On Sat, Mar 9, 2019 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:
>
> While 0001 is pretty bulky, the interesting bits concentrate on a
> comparatively small area. I'd appreciate if somebody could give the
> comments added in tableam.h a read (both on callbacks, and their
> wrappers, as they have different audiences).

Potentially stupid question, but I'm curious about this one (couldn't find any
discussion about it):

    +/*
    + * Generic descriptor for table scans. This is the base-class for
table scans,
    + * which needs to be embedded in the scans of individual AMs.
    + */
    +typedef struct TableScanDescData
    // ...
    bool rs_pageatatime; /* verify visibility page-at-a-time? */
    bool rs_allow_strat; /* allow or disallow use of access strategy */
    bool rs_allow_sync; /* allow or disallow use of syncscan */

    + * allow_{strat, sync, pagemode} specify whether a scan strategy,
    + * synchronized scans, or page mode may be used (although not every AM
    + * will support those).
    // ...
    + TableScanDesc (*scan_begin) (Relation rel,

The last comment makes me think that those flags (allow_strat / allow_sync /
pageatatime) are more AM-specific - shouldn't they live in HeapScanDescData?


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-03-10 05:49:26 +0100, Dmitry Dolgov wrote:
> > On Sat, Mar 9, 2019 at 4:13 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > While 0001 is pretty bulky, the interesting bits concentrate on a
> > comparatively small area. I'd appreciate if somebody could give the
> > comments added in tableam.h a read (both on callbacks, and their
> > wrappers, as they have different audiences).
> 
> Potentially stupid question, but I'm curious about this one (couldn't find any
> discussion about it):

Not stupid...


>     +/*
>     + * Generic descriptor for table scans. This is the base-class for
> table scans,
>     + * which needs to be embedded in the scans of individual AMs.
>     + */
>     +typedef struct TableScanDescData
>     // ...
>     bool rs_pageatatime; /* verify visibility page-at-a-time? */
>     bool rs_allow_strat; /* allow or disallow use of access strategy */
>     bool rs_allow_sync; /* allow or disallow use of syncscan */
> 
>     + * allow_{strat, sync, pagemode} specify whether a scan strategy,
>     + * synchronized scans, or page mode may be used (although not every AM
>     + * will support those).
>     // ...
>     + TableScanDesc (*scan_begin) (Relation rel,
> 
> The last commentary makes me think that those flags (allow_strat / allow_sync /
> pageatime) are more like AM specific, shouldn't they live in HeapScanDescData?

They're common enough across AMs, but more importantly calling code
currently specifies them in several places. As they're thus essentially
generic, rather than AM specific, I think it makes sense to have them in
the general scan struct.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Haribabu Kommi
Date:

On Sat, Mar 9, 2019 at 2:18 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-03-09 11:03:21 +1100, Haribabu Kommi wrote:
> Here I attached the rebased patches that I shared earlier. I am adding the
> comments to explain the API's in the code, will share those patches later.

I've started to add those for the callbacks in the first commit. I'd
appreciate a look!

Thanks for the updated patches.

+ /* ------------------------------------------------------------------------
+ * Callbacks for hon-modifying operations on individual tuples
+ * ------------------------------------------------------------------------

Typo in tableam.h file. hon->non 

 
I think I'll include the docs patches, sans the callback documentation,
in the next version. I'll probably merge them into one commit, if that's
OK with you?

OK. 
For easy review, I will still maintain 3 or 4 patches instead of the current patch
series. 


> I observed a crash with the latest patch series in the COPY command.

Hm, which version was this? I'd at some point accidentally posted a
'tmp' commit that was just a performance hack

Yes, the version that I checked did have that commit.
Maybe that is the reason for the failure.

Btw, your patches always are attached out of order:
https://www.postgresql.org/message-id/CAJrrPGd%2Brkz54wE-oXRojg4XwC3bcF6bjjRziD%2BXhFup9Q3n2w%40mail.gmail.com
10, 1, 3, 4, 2 ...

Sorry about that.
I always wondered why they were ordered that way when I attached the patch files to the mail.
I thought it might be Gmail behavior, but after experimenting I found that, when adding multiple
patches, the last selected patch is given preference and listed as the first attachment.

I will take care that this problem doesn't happen again.

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
Hi,

On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> Changes:
> - I've added comments to all the callbacks in the first commit / the
>   scan commit
> - I've renamed table_gimmegimmeslot to table_slot_create
> - I've made the callbacks and their wrappers more consistently named
> - I've added asserts for necessary callbacks in scan commit
> - Lots of smaller cleanup
> - Added a commit message
> 
> While 0001 is pretty bulky, the interesting bits concentrate on a
> comparatively small area. I'd appreciate if somebody could give the
> comments added in tableam.h a read (both on callbacks, and their
> wrappers, as they have different audiences). It'd make sense to first
> read the commit message, to understand the goal (and I'd obviously also
> appreciate suggestions for improvements there as well).
> 
> I'm pretty happy with the current state of the scan patch. I plan to do
> two more passes through it (formatting, comment polishing, etc. I don't
> know of any functional changes needed), and then commit it, lest
> somebody objects.

Here's a further polished version. Pretty boring changes:
- newlines
- put tableam.h into the correct place
- a few comment improvements, including typos
- changed reorderqueue_push() to accept the slot. That avoids an
  unnecessary heap_copytuple() in some cases

No meaningful changes in later patches.

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
On 2019-03-11 12:37:46 -0700, Andres Freund wrote:
> Hi,
> 
> On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> > Changes:
> > - I've added comments to all the callbacks in the first commit / the
> >   scan commit
> > - I've renamed table_gimmegimmeslot to table_slot_create
> > - I've made the callbacks and their wrappers more consistently named
> > - I've added asserts for necessary callbacks in scan commit
> > - Lots of smaller cleanup
> > - Added a commit message
> > 
> > While 0001 is pretty bulky, the interesting bits concentrate on a
> > comparatively small area. I'd appreciate if somebody could give the
> > comments added in tableam.h a read (both on callbacks, and their
> > wrappers, as they have different audiences). It'd make sense to first
> > read the commit message, to understand the goal (and I'd obviously also
> > appreciate suggestions for improvements there as well).
> > 
> > I'm pretty happy with the current state of the scan patch. I plan to do
> > two more passes through it (formatting, comment polishing, etc. I don't
> > know of any functional changes needed), and then commit it, lest
> > somebody objects.
> 
> Here's a further polished version. Pretty boring changes:
> - newlines
> - put tableam.h into the correct place
> - a few comment improvements, including typos
> - changed reorderqueue_push() to accept the slot. That avoids an
>   unnecessary heap_copytuple() in some cases
> 
> No meaningful changes in later patches.

I pushed this.  There's a failure on 32bit machines, unfortunately. The
problem comes from the ParallelTableScanDescData embedded in BTShared -
after the change the compiler can't see that that actually needs more
alignment, because ParallelTableScanDescData doesn't have any 8-byte
members. That's a problem for just about all such "struct inheritance"
type tricks in postgres, but we normally just allocate them separately,
guaranteeing maxalign. Given that we here already allocate enough space
after the BTShared struct, it's probably easiest to just not embed the
struct anymore.

- Andres


Re: Pluggable Storage - Andres's take

From:
Andres Freund
Date:
On 2019-03-11 13:31:26 -0700, Andres Freund wrote:
> On 2019-03-11 12:37:46 -0700, Andres Freund wrote:
> > Hi,
> > 
> > On 2019-03-08 19:13:10 -0800, Andres Freund wrote:
> > > Changes:
> > > - I've added comments to all the callbacks in the first commit / the
> > >   scan commit
> > > - I've renamed table_gimmegimmeslot to table_slot_create
> > > - I've made the callbacks and their wrappers more consistently named
> > > - I've added asserts for necessary callbacks in scan commit
> > > - Lots of smaller cleanup
> > > - Added a commit message
> > > 
> > > While 0001 is pretty bulky, the interesting bits concentrate on a
> > > comparatively small area. I'd appreciate if somebody could give the
> > > comments added in tableam.h a read (both on callbacks, and their
> > > wrappers, as they have different audiences). It'd make sense to first
> > > read the commit message, to understand the goal (and I'd obviously also
> > > appreciate suggestions for improvements there as well).
> > > 
> > > I'm pretty happy with the current state of the scan patch. I plan to do
> > > two more passes through it (formatting, comment polishing, etc. I don't
> > > know of any functional changes needed), and then commit it, lest
> > > somebody objects.
> > 
> > Here's a further polished version. Pretty boring changes:
> > - newlines
> > - put tableam.h into the correct place
> > - a few comment improvements, including typos
> > - changed reorderqueue_push() to accept the slot. That avoids an
> >   unnecessary heap_copytuple() in some cases
> > 
> > No meaningful changes in later patches.
> 
> I pushed this.  There's a failure on 32bit machines, unfortunately. The
> problem comes from the ParallelTableScanDescData embedded in BTShared -
> after the change the compiler can't see that that actually needs more
> alignment, because ParallelTableScanDescData doesn't have any 8byte
> members. That's a problem for just about all such "struct inheritance"
> type tricks postgres, but we normally just allocate them separately,
> guaranteeing maxalign. Given that we here already allocate enough space
> after the BTShared struct, it's probably easiest to just not embed the
> struct anymore.

I've pushed an attempt to fix this, which locally fixes 32-bit
builds. It copies the alignment logic from shm_toc_allocate, namely
using BUFFERALIGN for alignment.  We should probably invent a more
appropriate define for this at some point...

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

From:
Kyotaro HORIGUCHI
Date:
Hello.

I had a look at the patch set. I cannot see the thread structure
due to the depth, and cannot get the full picture of all the patches,
but I have some comments. I apologize in advance for possible
duplicates with upthread comments.


0001-Reduce-the...

This doesn't apply to master.

>  TupleTableSlot *
>  ExecStoreHeapTuple(HeapTuple tuple,
>                     TupleTableSlot *slot,
> +                   Oid relid,
>                     bool shouldFree)

  The comment for ExecStoreHeapTuple is missing a description
  of the "relid" parameter.


> -        if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, hscan->rs_cbuf))
> +        if (HeapTupleSatisfiesVisibility(tuple, RelationGetRelid(hscan->rs_scan.rs_rd),
> +                                &SnapshotDirty, hscan->rs_cbuf))

The second parameter seems to always be
RelationGetRelid(...). Actually only the relid is required, but isn't
it better to take a Relation instead of an Oid as the second
parameter?



0005-Reorganize-am-as...

> +   catalog. The contents of an table are entirely under the control of its

 "of an table" => "of a table"

0006-Doc-update-of-Create-access..

> +      be <type>index_am_handler</type> and for <literal>TABLE</literal>
> +      access methods, it must be <type>table_am_handler</type>.
> +      The C-level API that the handler function must implement varies
> +      depending on the type of access method. The index access method API
> +      is described in <xref linkend="index-access-methods"/> and the table access method
> +      API is described in <xref linkend="table-access-methods"/>.

If table is the primary object, shouldn't table-am precede index-am?


0007-Doc-update-of-create-materi..

> +   This clause specifies optional access method for the new materialize view;

 "materialize view" => "materialized view"?

> +   If this option is not specified, then the default table access method

 I'm not sure the 'then' is needed.

> +   is chosen for the new materialized view. see <xref linkend="guc-default-table-access-method"/>

 "see" => "See"


0008-Doc-update-of-CREATE_TABLE..

> +[ USING <replaceable class="parameter">method</replaceable> ]

 *I* prefer access_method to just method.

> +  If this option is not specified, then the default table access method

 Same as 0007: "I'm not sure the 'then' is needed."

> +  is chosen for the new table. see <xref linkend="guc-default-table-access-method"/>

 Same as 0007: "see" => "See".

0009-Doc-of-CREATE-TABLE-AS

 Same as 0008.


0010-Table-access-method-API-

> +    Any new <literal>TABLE ACCSESS METHOD</literal> developers can refer the exisitng <literal>HEAP</literal>


> +    There are differnt type of API's that are defined and those details are below.

  "differnt" => "different"

> +     by the AM, in case if it support parallel scan.

  "support" => "supports"

> + This API to return the total size that is required for the AM to perform

  Total size of what? Shared memory chunk? Or parallel scan descriptor?


> +     the parallel table scan. The minimum size that is required is 
> +     <structname>ParallelBlockTableScanDescData</structname>.

  I don't get what "the minimum size" means here. Just reading
  this, I would always return the minimum size...


> +     This API to perform the initialization of the <literal>parallel_scan</literal>
> +     that is required for the parallel scan to be performed by the AM and also return
> +     the total size that is required for the AM to perform the parallel table scan.

 (Note: I'm not good at English..) Similar to the above, I cannot
 tell what the "size" is for.

 In the code it is used as:

  > Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

  (The variable name should be snapshot_offset.) It is the offset
  from the beginning of the parallel scan descriptor, but it should
  be described in some other way, which I'm not sure of..

  Something like this?

  > This API initializes a parallel scan for the AM and also
  > returns the size of the parallel scan descriptor consumed so far.
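The offset-returning convention being discussed can be sketched as a standalone illustration. Everything below is a simplified stand-in (the struct and callback names are mine, not the real PostgreSQL API): the AM-specific scan state sits at the start of the shared chunk, and initialize returns the offset just past it, where the caller places the serialized snapshot.

```c
#include <assert.h>
#include <stddef.h>

typedef size_t Size;

/* Simplified stand-in for ParallelBlockTableScanDescData. */
typedef struct ParallelBlockScanDesc
{
    unsigned int phs_nblocks;    /* number of blocks to scan */
    unsigned int phs_nextblock;  /* next block to hand out to a worker */
} ParallelBlockScanDesc;

/* "estimate" callback: shared-memory size needed for the AM's scan state. */
static Size
parallelscan_estimate(void)
{
    return sizeof(ParallelBlockScanDesc);
}

/*
 * "initialize" callback: set up the AM's state at the start of the shared
 * chunk, and return the offset just past it -- the caller stores the
 * serialized snapshot at that offset.
 */
static Size
parallelscan_initialize(void *pscan, unsigned int nblocks)
{
    ParallelBlockScanDesc *bscan = (ParallelBlockScanDesc *) pscan;

    bscan->phs_nblocks = nblocks;
    bscan->phs_nextblock = 0;
    return sizeof(ParallelBlockScanDesc);   /* == snapshot offset */
}
```

Under this convention, the total shared-memory allocation would presumably be the estimate plus the serialized snapshot's size, which is why the return value is an offset rather than a total.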

(Sorry for not finishing. Time's up today.)

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:


On Sat, Mar 9, 2019 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

While 0001 is pretty bulky, the interesting bits concentrate on a
comparatively small area. I'd appreciate if somebody could give the
comments added in tableam.h a read (both on callbacks, and their
wrappers, as they have different audiences). It'd make sense to first
read the commit message, to understand the goal (and I'd obviously also
appreciate suggestions for improvements there as well).

I'm pretty happy with the current state of the scan patch. I plan to do
two more passes through it (formatting, comment polishing, etc. I don't
know of any functional changes needed), and then commit it, lest
somebody objects.

I found a couple of typos in the committed patch; the attached patch fixes them.
I am not sure about one typo, please check it once.

And I reviewed the 0002 patch, which is pretty simple and can be committed.

Regards,
Haribabu Kommi
Fujitsu Australia
Вложения

Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

On Tue, Mar 12, 2019 at 7:28 PM Kyotaro HORIGUCHI <horiguchi.kyotaro@lab.ntt.co.jp> wrote:
Hello.

I had a look at the patch set. I cannot see the thread structure
due to the depth, and cannot get the picture of all the patches,
but I have some comments. I apologize in advance for possible
duplicates of upthread comments.


Thanks for the review.
 

0001-Reduce-the...

This doesn't apply to master.

Yes, these patches don't apply to master.
They can only be applied to the code present in [1].
 
>  TupleTableSlot *
>  ExecStoreHeapTuple(HeapTuple tuple,
>                                  TupleTableSlot *slot,
> +                                Oid relid,
>                                  bool shouldFree)

  The comment for ExecStoreHeapTuple is missing the description
  on "relid" parameter.

Corrected.
 
> -        if (HeapTupleSatisfiesVisibility(tuple, &SnapshotDirty, hscan->rs_cbuf))
> +        if (HeapTupleSatisfiesVisibility(tuple, RelationGetRelid(hscan->rs_scan.rs_rd),
> +                                &SnapshotDirty, hscan->rs_cbuf))

The second argument always seems to be
RelationGetRelid(...). Only the relid is actually required, but isn't
it better to take a Relation instead of an Oid as the second
parameter?

Currently the passed relid is used only in the historic MVCC
visibility check function. Passing the Relation pointer directly
will lessen the performance impact, as there is no need to
compute the relid.

Will update and share it.
 


0005-Reorganize-am-as...

> +   catalog. The contents of an table are entirely under the control of its

 "of an table" => "of a table"

Corrected. 
 
0006-Doc-update-of-Create-access..

> +      be <type>index_am_handler</type> and for <literal>TABLE</literal>
> +      access methods, it must be <type>table_am_handler</type>.
> +      The C-level API that the handler function must implement varies
> +      depending on the type of access method. The index access method API
> +      is described in <xref linkend="index-access-methods"/> and the table access method
> +      API is described in <xref linkend="table-access-methods"/>.

If table is the primary object, shouldn't table-am precede index-am?

Changed.
 

0007-Doc-update-of-create-materi..

> +   This clause specifies optional access method for the new materialize view;

 "materialize view" => "materialized view"?

Corrected.
 
> +   If this option is not specified, then the default table access method

 I'm not sure the 'then' is needed.

> +   is chosen for the new materialized view. see <xref linkend="guc-default-table-access-method"/>

 "see" => "See"


0008-Doc-update-of-CREATE_TABLE..

> +[ USING <replaceable class="parameter">method</replaceable> ]

 *I* prefer access_method to just method.

> +  If this option is not specified, then the default table access method

 Same as 0007: "I'm not sure the 'then' is needed."

> +  is chosen for the new table. see <xref linkend="guc-default-table-access-method"/>

 Same as 0007: "see" => "See".

0009-Doc-of-CREATE-TABLE-AS

 Same as 0008.

Corrected as per your suggestions.
 
0010-Table-access-method-API-

> +    Any new <literal>TABLE ACCSESS METHOD</literal> developers can refer the exisitng <literal>HEAP</literal>


> +    There are differnt type of API's that are defined and those details are below.

  "differnt" => "different"

> +     by the AM, in case if it support parallel scan.

  "support" => "supports"

Corrected both of the above.
 
> + This API to return the total size that is required for the AM to perform

  Total size of what? Shared memory chunk? Or parallel scan descriptor?

It returns the required parallel scan descriptor memory size.
 

> +     the parallel table scan. The minimum size that is required is
> +     <structname>ParallelBlockTableScanDescData</structname>.

  I don't get what "the minimum size" means here. Just reading
  this, I would always return the minimum size...


> +     This API to perform the initialization of the <literal>parallel_scan</literal>
> +     that is required for the parallel scan to be performed by the AM and also return
> +     the total size that is required for the AM to perform the parallel table scan.

 (Note: I'm not good at English..) Similar to the above, I cannot
 tell what the "size" is for.

 In the code it is used as:

  > Size snapshot_off = rel->rd_tableam->parallelscan_initialize(rel, pscan);

  (The variable name should be snapshot_offset.) It is the offset
  from the beginning of the parallel scan descriptor, but it should
  be described in some other way, which I'm not sure of..

  Something like this?

  > This API initializes a parallel scan for the AM and also
  > returns the size of the parallel scan descriptor consumed so far.

I updated the docs around those APIs to make them easier to understand.
Can you please check whether that helps?
 
updated patches are attached.

[1] -  https://github.com/anarazel/postgres-pluggable-storage.git

Regards,
Haribabu Kommi
Fujitsu Australia
Вложения

Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

On Sat, Mar 16, 2019 at 5:43 PM Haribabu Kommi <kommi.haribabu@gmail.com> wrote:


On Sat, Mar 9, 2019 at 2:13 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

While 0001 is pretty bulky, the interesting bits concentrate on a
comparatively small area. I'd appreciate if somebody could give the
comments added in tableam.h a read (both on callbacks, and their
wrappers, as they have different audiences). It'd make sense to first
read the commit message, to understand the goal (and I'd obviously also
appreciate suggestions for improvements there as well).

I'm pretty happy with the current state of the scan patch. I plan to do
two more passes through it (formatting, comment polishing, etc. I don't
know of any functional changes needed), and then commit it, lest
somebody objects.

I found a couple of typos in the committed patch; the attached patch fixes them.
I am not sure about one typo, please check it once.

And I reviewed the 0002 patch, which is pretty simple and can be committed.

As you are modifying the 0003 patch for the modify APIs, I went and reviewed the
existing patch and found a couple of corrections that are needed, in case you
have not taken care of them already.


+ /* Update the tuple with table oid */
+ slot->tts_tableOid = RelationGetRelid(relation);
+ if (slot->tts_tableOid != InvalidOid)
+ tuple->t_tableOid = slot->tts_tableOid;

The setting of slot->tts_tableOid is not required in this function;
right after it is set, it is checked for validity. The above code is present in
both heapam_heap_insert and heapam_heap_insert_speculative.


+ slot->tts_tableOid = RelationGetRelid(relation);

In heapam_heap_update, I don't think there is a need to update
slot->tts_tableOid.


+ default:
+ elog(ERROR, "unrecognized heap_update status: %u", result);

heap_update --> table_update?


+ default:
+ elog(ERROR, "unrecognized heap_delete status: %u", result);

same as above?


+ /*hari FIXME*/
+ /*Assert(result != HeapTupleUpdated && hufd.traversed);*/

Remove the commented-out code in both ExecDelete and ExecUpdate functions.


+ /**/
+ if (epqreturnslot)
+ {
+ *epqreturnslot = epqslot;
+ return NULL;
+ }

comment update missed?


Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

Hi,

The psql \dA command currently doesn't show the type for access methods of
type 'Table'.

postgres=# \dA heap
List of access methods
 Name | Type  
------+-------
 heap | 
(1 row)

Attached is a simple patch that fixes the problem; the output is as follows.

postgres=# \dA heap
List of access methods
 Name | Type  
------+-------
 heap | Table
(1 row)

The attached patch directly modifies the query that is sent to the server.
Servers < 12 don't have access methods of type 'Table', but the same query
still works, because of the CASE expression:

SELECT amname AS "Name",
          CASE amtype WHEN 'i' THEN 'Index' WHEN 't' THEN 'Table' END AS "Type"
        FROM pg_catalog.pg_am ...

Does anyone feel that a separate query is required for servers < 12?

Regards,
Haribabu Kommi
Fujitsu Australia
Вложения

Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

Attached is a version of just the first patch. I'm still updating it,
but it's getting closer to commit:

- There were no tests testing EPQ interactions with DELETE, and only an
  accidental test for EPQ in UPDATE with a concurrent DELETE. I've added
  tests. Plan to commit that ahead of the big change.

- I was pretty unhappy about how the EPQ integration looked before, I've
  changed that now.

  I still wonder if we should restore EvalPlanQualFetch and move the
  table_lock_tuple() calls in ExecDelete/Update into it. But it seems
  like it'd not gain that much, because there's custom surrounding code,
  and it's not that much code.

- I changed heapam_tuple_lock to return *WouldBlock rather than just
  the last result. I think that's one of the reasons Haribabu had
  neutered a few asserts.

- I moved comments from heapam.h to tableam.h where appropriate

- I updated the name of HeapUpdateFailureData to TM_FailureData,
  HTSU_Result to TM_Result, TM_Results members now properly distinguish
  between update vs modifications (delete & update).

- I separated the HEAP_INSERT_ flags into TABLE_INSERT_* and
  HEAP_INSERT_ with the latter being a copy of TABLE_INSERT_ with the
  sole addition of _SPECULATIVE. table_insert_speculative callers now
  don't specify that anymore.


Pending work:
- Wondering if table_insert/delete/update should rather be
  table_tuple_insert etc. Would be a bit more consistent with the
  callback names, but a bigger departure from existing code.

- I'm not yet happy with TableTupleDeleted computation in heapam.c, I
  want to revise that further

- formatting

- commit message

- a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)

- Rename TableTupleMayBeModified to TableTupleOk, but also probably a s/TableTuple/TableMod/

- I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

- two more passes through the patch



On 2019-03-21 15:07:04 +1100, Haribabu Kommi wrote:
> As you are modifying the 0003 patch for the modify APIs, I went and reviewed
> the existing patch and found a couple of corrections that are needed, in case
> you have not taken care of them already.

Some I have...



> + /* Update the tuple with table oid */
> + slot->tts_tableOid = RelationGetRelid(relation);
> + if (slot->tts_tableOid != InvalidOid)
> + tuple->t_tableOid = slot->tts_tableOid;
>
> The setting of slot->tts_tableOid is not required in this function;
> right after it is set, it is checked for validity. The above code is present in
> both heapam_heap_insert and heapam_heap_insert_speculative.

I'm not following? Those functions are independent?


> + slot->tts_tableOid = RelationGetRelid(relation);
>
> In heapam_heap_update, I don't think there is a need to update
> slot->tts_tableOid.

Why?


> + default:
> + elog(ERROR, "unrecognized heap_update status: %u", result);
>
> heap_update --> table_update?
>
>
> + default:
> + elog(ERROR, "unrecognized heap_delete status: %u", result);
>
> same as above?

I've fixed that in a number of places.


> + /*hari FIXME*/
> + /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
>
> Remove the commented-out code in both ExecDelete and ExecUpdate functions.

I don't think that's the right fix. I've refactored that code
significantly now, and restored the assert in a, imo, correct version.


> + /**/
> + if (epqreturnslot)
> + {
> + *epqreturnslot = epqslot;
> + return NULL;
> + }
>
> comment update missed?

Well, you'd deleted a comment around there ;). I've added something back
now...

Greetings,

Andres Freund

Вложения

Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

On Fri, Mar 22, 2019 at 5:16 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

Attached is a version of just the first patch. I'm still updating it,
but it's getting closer to commit:

- There were no tests testing EPQ interactions with DELETE, and only an
  accidental test for EPQ in UPDATE with a concurrent DELETE. I've added
  tests. Plan to commit that ahead of the big change.

- I was pretty unhappy about how the EPQ integration looked before, I've
  changed that now.

  I still wonder if we should restore EvalPlanQualFetch and move the
  table_lock_tuple() calls in ExecDelete/Update into it. But it seems
  like it'd not gain that much, because there's custom surrounding code,
  and it's not that much code.

- I changed heapam_tuple_lock to return *WouldBlock rather than just
  the last result. I think that's one of the reasons Haribabu had
  neutered a few asserts.

- I moved comments from heapam.h to tableam.h where appropriate

- I updated the name of HeapUpdateFailureData to TM_FailureData,
  HTSU_Result to TM_Result, TM_Results members now properly distinguish
  between update vs modifications (delete & update).

- I separated the HEAP_INSERT_ flags into TABLE_INSERT_* and
  HEAP_INSERT_ with the latter being a copy of TABLE_INSERT_ with the
  sole addition of _SPECULATIVE. table_insert_speculative callers now
  don't specify that anymore.


Pending work:
- Wondering if table_insert/delete/update should rather be
  table_tuple_insert etc. Would be a bit more consistent with the
  callback names, but a bigger departure from existing code.

- I'm not yet happy with TableTupleDeleted computation in heapam.c, I
  want to revise that further

- formatting

- commit message

- a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)

- Rename TableTupleMayBeModified to TableTupleOk, but also probably a s/TableTuple/TableMod/

- I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

- two more passes through the patch

Thanks for the corrections.

 
On 2019-03-21 15:07:04 +1100, Haribabu Kommi wrote:
> As you are modifying the 0003 patch for the modify APIs, I went and reviewed
> the existing patch and found a couple of corrections that are needed, in case
> you have not taken care of them already.

Some I have...



> + /* Update the tuple with table oid */
> + slot->tts_tableOid = RelationGetRelid(relation);
> + if (slot->tts_tableOid != InvalidOid)
> + tuple->t_tableOid = slot->tts_tableOid;
>
> The setting of slot->tts_tableOid is not required in this function;
> right after it is set, it is checked for validity. The above code is present in
> both heapam_heap_insert and heapam_heap_insert_speculative.

I'm not following? Those functions are independent?

In those functions, slot->tts_tableOid is set and then immediately checked
for validity. Callers of table_insert should have already set it, so is
setting the value and checking it on the next line really required?
The value cannot be InvalidOid.
 

> + slot->tts_tableOid = RelationGetRelid(relation);
>
> In heapam_heap_update, I don't think there is a need to update
> slot->tts_tableOid.

Why?

slot->tts_tableOid should have been updated before the call to heap_update;
is setting it again after heap_update required?

I also observed slot->tts_tableOid being set after the table_insert_XXX calls
in the ExecInsert function.

Is this to make sure that the AM hasn't modified that value?


> + default:
> + elog(ERROR, "unrecognized heap_update status: %u", result);
>
> heap_update --> table_update?
>
>
> + default:
> + elog(ERROR, "unrecognized heap_delete status: %u", result);
>
> same as above?

I've fixed that in a number of places.


> + /*hari FIXME*/
> + /*Assert(result != HeapTupleUpdated && hufd.traversed);*/
>
> Remove the commented-out code in both ExecDelete and ExecUpdate functions.

I don't think that's the right fix. I've refactored that code
significantly now, and restored the assert in a, imo, correct version.


OK.
 
> + /**/
> + if (epqreturnslot)
> + {
> + *epqreturnslot = epqslot;
> + return NULL;
> + }
>
> comment update missed?

Well, you'd deleted a comment around there ;). I've added something back
now...

This is not the only problem I could have introduced; all the comments
listed were introduced by me ;).

Regards,
Haribabu Kommi
Fujitsu Australia

Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

On 2019-03-21 11:15:57 -0700, Andres Freund wrote:
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.

I've left this as is.


> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further

I changed that. Found a bunch of untested paths, I've pushed tests for
those already.

> - formatting

Done that.


> - commit message

Done that.


> - a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)

Done that.


> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a s/TableTuple/TableMod/

It's now TM_*.

/*
 * Result codes for table_{update,delete,lock}_tuple, and for visibility
 * routines inside table AMs.
 */
typedef enum TM_Result
{
    /*
     * Signals that the action succeeded (i.e. update/delete performed, lock
     * was acquired)
     */
    TM_Ok,

    /* The affected tuple wasn't visible to the relevant snapshot */
    TM_Invisible,

    /* The affected tuple was already modified by the calling backend */
    TM_SelfModified,

    /*
     * The affected tuple was updated by another transaction. This includes
     * the case where tuple was moved to another partition.
     */
    TM_Updated,

    /* The affected tuple was deleted by another transaction */
    TM_Deleted,

    /*
     * The affected tuple is currently being modified by another session. This
     * will only be returned if (update/delete/lock)_tuple are instructed not
     * to wait.
     */
    TM_BeingModified,

    /* lock couldn't be acquired, action skipped. Only used by lock_tuple */
    TM_WouldBlock
} TM_Result;
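As a rough sketch of how calling code might dispatch on these result codes — the enum is copied from the message above, but stub_tuple_delete and handle_delete below are illustrative stubs I wrote, not the real executor or AM code:

```c
#include <assert.h>
#include <string.h>

/* Copied from the TM_Result definition quoted above (comments trimmed). */
typedef enum TM_Result
{
    TM_Ok,
    TM_Invisible,
    TM_SelfModified,
    TM_Updated,
    TM_Deleted,
    TM_BeingModified,
    TM_WouldBlock
} TM_Result;

/* Stub standing in for a table AM's delete callback. */
static TM_Result
stub_tuple_delete(int concurrently_deleted)
{
    return concurrently_deleted ? TM_Deleted : TM_Ok;
}

/* Dispatch on the result code, roughly the shape ExecDelete-style code takes. */
static const char *
handle_delete(int concurrently_deleted)
{
    switch (stub_tuple_delete(concurrently_deleted))
    {
        case TM_Ok:
            return "deleted";
        case TM_SelfModified:
            return "already modified by this backend; skip";
        case TM_Updated:
        case TM_Deleted:
            return "concurrent modification; re-check with EvalPlanQual or skip";
        default:
            return "unexpected result";
    }
}
```

The point of the renaming is visible here: distinct TM_Updated and TM_Deleted codes let callers share or split handling of concurrent modifications explicitly, instead of inferring the case from a single "updated" result.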


> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

Done.


> - two more passes through the patch

One of them completed. Which is good, because there was a subtle bug in
heapam_tuple_lock (*tid was adjusted to be the followup tuple after the
heap_fetch(), before going to heap_lock_tuple - but that's wrong, it
should only be adjusted when heap_fetch() ing the next version.).

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

(sorry, I somehow miskeyed, and sent a partial version of this email
before it was ready)


On 2019-03-21 11:15:57 -0700, Andres Freund wrote:
> Pending work:
> - Wondering if table_insert/delete/update should rather be
>   table_tuple_insert etc. Would be a bit more consistent with the
>   callback names, but a bigger departure from existing code.

I've left this as is.


> - I'm not yet happy with TableTupleDeleted computation in heapam.c, I
>   want to revise that further

I changed that. Found a bunch of untested paths, I've pushed tests for
those already.

> - formatting

Done that.


> - commit message

Done that.


> - a few comments need a bit of polishing (ExecCheckTIDVisible, heapam_tuple_lock)

Done that.


> - Rename TableTupleMayBeModified to TableTupleOk, but also probably a s/TableTuple/TableMod/

It's now TM_*.

/*
 * Result codes for table_{update,delete,lock}_tuple, and for visibility
 * routines inside table AMs.
 */
typedef enum TM_Result
{
    /*
     * Signals that the action succeeded (i.e. update/delete performed, lock
     * was acquired)
     */
    TM_Ok,

    /* The affected tuple wasn't visible to the relevant snapshot */
    TM_Invisible,

    /* The affected tuple was already modified by the calling backend */
    TM_SelfModified,

    /*
     * The affected tuple was updated by another transaction. This includes
     * the case where tuple was moved to another partition.
     */
    TM_Updated,

    /* The affected tuple was deleted by another transaction */
    TM_Deleted,

    /*
     * The affected tuple is currently being modified by another session. This
     * will only be returned if (update/delete/lock)_tuple are instructed not
     * to wait.
     */
    TM_BeingModified,

    /* lock couldn't be acquired, action skipped. Only used by lock_tuple */
    TM_WouldBlock
} TM_Result;


> - I'll probably move TUPLE_LOCK_FLAG_LOCK_* into tableam.h

Done.


> - two more passes through the patch

One of them completed. Which is good, because there was a subtle bug in
heapam_tuple_lock (*tid was adjusted to be the followup tuple after the
heap_fetch(), before going to heap_lock_tuple - but that's wrong, it
should only be adjusted when heap_fetch() ing the next version.).

I'm pretty happy with that last version (of the first patch). I'm
planning to do one more pass, and then push.

There are no meaningful changes to later patches in the series besides
followup changes required from changes in the first patch.

Greetings,

Andres Freund

Вложения

Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

On 2019-03-23 20:16:30 -0700, Andres Freund wrote:
> I'm pretty happy with that last version (of the first patch). I'm
> planning to do one more pass, and then push.

And done, after a bunch of mostly cosmetic changes (renaming
ExecCheckHeapTupleVisible to ExecCheckTupleVisible, removing an
unnecessary change in heap_lock_tuple parameters, a bunch of comments,
stuff like that).  Let's see what the buildfarm says.

The remaining commits luckily all are a good bit smaller.

Greetings,

Andres Freund


Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

On Wed, Mar 27, 2019 at 11:17 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-02-22 14:52:08 -0500, Robert Haas wrote:
> On Fri, Feb 22, 2019 at 11:19 AM Amit Khandekar <amitdkhan.pg@gmail.com> wrote:
> > Thanks for the review. Attached v2.
>
> Thanks.  I took this, combined it with Andres's
> v12-0040-WIP-Move-xid-horizon-computation-for-page-level-.patch, did
> some polishing of the code and comments, and pgindented.  Here's what
> I ended up with; see what you think.

I pushed this after some fairly minor changes, directly including the
patch to route the horizon computation through tableam. The only real
change is that I removed the table relfilenode from the nbtree/hash
deletion WAL record - it was only required to access the heap without
accessing the catalog and was unused now.  Also added a WAL version
bump.

It seems possible that some other AM might want to generalize the
prefetch logic from heapam.c, but I think it's fair to defer that until
such an AM wants to do so.

As I see that you are fixing some typos in the committed code,
I just want to share some more corrections that I found in the patches
committed so far.

Regards,
Haribabu Kommi
Fujitsu Australia
Вложения

Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
On 2019-03-29 18:38:46 +1100, Haribabu Kommi wrote:
> As I see that you are fixing some typos in the committed code,
> I just want to share some more corrections that I found in the patches
> committed so far.

Pushed both, thanks!



Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

On 2019-03-16 23:21:31 +1100, Haribabu Kommi wrote:
> updated patches are attached.

Now that nearly all of the tableam patches are committed (with the
exception of the copy.c changes which are for bad reasons discussed at
[1]) I'm looking at the docs changes.

What made you rename indexam.sgml to am.sgml, instead of creating a
separate tableam.sgml?  Seems easier to just have a separate file?

I'm currently not planning to include the function-by-function API
reference you've in your patchset, as I think it's more reasonable to
centralize all of it in tableam.h. I think I've included most of the
information there - could you check whether you agree?

[1] https://postgr.es/m/CAKJS1f98Fa%2BQRTGKwqbtz0M%3DCy1EHYR8Q-W08cpA78tOy4euKQ%40mail.gmail.com

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

On Tue, Apr 2, 2019 at 10:18 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-03-16 23:21:31 +1100, Haribabu Kommi wrote:
> updated patches are attached.

Now that nearly all of the tableam patches are committed (with the
exception of the copy.c changes which are for bad reasons discussed at
[1]) I'm looking at the docs changes.
 
Thanks.
 
What made you rename indexam.sgml to am.sgml, instead of creating a
separate tableam.sgml?  Seems easier to just have a separate file?

No specific reason, I just thought of putting all the access methods in one file.
I can change it to tableam.sgml.
 
I'm currently not planning to include the function-by-function API
reference you've in your patchset, as I think it's more reasonable to
centralize all of it in tableam.h. I think I've included most of the
information there - could you check whether you agree?

I checked all the comments, and the explanation provided in tableam.h is
good enough to understand. I also updated the docs section with some more
details from the tableam.h comments.

I understand your point about avoiding a function-by-function API reference,
as the user can check the code comments directly. Still, I feel some people
may refer to the docs for API changes. I am fine with removing it, based on your opinion.

Added current set of doc patches for your reference.

Regards,
Haribabu Kommi
Fujitsu Australia
Вложения

Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > What made you rename indexam.sgml to am.sgml, instead of creating a
> > separate tableam.sgml?  Seems easier to just have a separate file?
> >
> 
> No specific reason, I just thought of putting all the access methods in
> one file.
> I can change it to tableam.sgml.

I'd rather keep it separate. It seems likely that both table and indexam
docs will grow further over time, and they're not that closely
related. Additionally not moving sect1->sect2 etc will keep links stable
(which we could also achieve with different sect1 names, I realize
that).

> I understand your point about avoiding a function-by-function API reference,
> as the user can check the code comments directly. Still, I feel some people
> may refer to the docs for API changes. I am fine with removing it, based on
> your opinion.

I think having to keep both tableam.h and the sgml file current is
too much overhead - and anybody that's going to create a new tableam is
going to be able to look into the source anyway.

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

От
Haribabu Kommi
Дата:

On Tue, Apr 2, 2019 at 11:53 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-04-02 11:39:57 +1100, Haribabu Kommi wrote:
> > What made you rename indexam.sgml to am.sgml, instead of creating a
> > separate tableam.sgml?  Seems easier to just have a separate file?
> >
>
> No specific reason, I just thought of putting all the access methods in
> one file.
> I can change it to tableam.sgml.

I'd rather keep it separate. It seems likely that both table and indexam
docs will grow further over time, and they're not that closely
related. Additionally not moving sect1->sect2 etc will keep links stable
(which we could also achieve with different sect1 names, I realize
that).

OK.
 
> I understand your point about avoiding a function-by-function API reference,
> as the user can check the code comments directly. Still, I feel some people
> may refer to the docs for API changes. I am fine with removing it, based on
> your opinion.

I think having to keep both tableam.h and the sgml file current is
too much overhead - and anybody that's going to create a new tableam is
going to be able to look into the source anyway.

Here I attached the updated patches as per the discussion.
Is the description of table access methods enough, or do you want me to
add some more details?

Regards,
Haribabu Kommi
Fujitsu Australia
Вложения

Re: Pluggable Storage - Andres's take

От
Andres Freund
Дата:
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:

> +     <varlistentry id="guc-default-table-access-method" xreflabel="default_table_access_method">
> +      <term><varname>default_table_access_method</varname> (<type>string</type>)
> +      <indexterm>
> +       <primary><varname>default_table_access_method</varname> configuration parameter</primary>
> +      </indexterm>
> +      </term>
> +      <listitem>
> +       <para>
> +        The value is either the name of a table access method, or an empty string
> +        to specify using the default table access method of the current database.
> +        If the value does not match the name of any existing table access method,
> +        <productname>PostgreSQL</productname> will automatically use the default
> +        table access method of the current database.
> +       </para>

Hm, this doesn't strike me as right (there's no such thing as "default
table access method of the current database"). You just get an error in
that case. I think we should simply not allow setting to "" - what's the
point in that?

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-02 17:11:07 +1100, Haribabu Kommi wrote:
> From a72cfcd523887f1220473231d7982928acc23684 Mon Sep 17 00:00:00 2001
> From: Hari Babu <kommi.haribabu@gmail.com>
> Date: Tue, 2 Apr 2019 15:41:17 +1100
> Subject: [PATCH 1/2] tableam : doc update of table access methods
> 
> Providing basic explanation of table access methods
> including their structure details and reference heap
> implementation files.
> ---
>  doc/src/sgml/catalogs.sgml |  5 ++--
>  doc/src/sgml/filelist.sgml |  1 +
>  doc/src/sgml/postgres.sgml |  1 +
>  doc/src/sgml/tableam.sgml  | 56 ++++++++++++++++++++++++++++++++++++++
>  4 files changed, 61 insertions(+), 2 deletions(-)
>  create mode 100644 doc/src/sgml/tableam.sgml
> 
> diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
> index f4aabf5dc7..200708e121 100644
> --- a/doc/src/sgml/catalogs.sgml
> +++ b/doc/src/sgml/catalogs.sgml
> @@ -587,8 +587,9 @@
>     The catalog <structname>pg_am</structname> stores information about
>     relation access methods.  There is one row for each access method supported
>     by the system.
> -   Currently, only indexes have access methods.  The requirements for index
> -   access methods are discussed in detail in <xref linkend="indexam"/>.
> +   Currently, only table and indexes have access methods. The requirements for table
> +   access methods are discussed in detail in <xref linkend="tableam"/> and the
> +   requirements for index access methods are discussed in detail in <xref linkend="indexam"/>.
>    </para>

I also adapted pg_am.amtype.


> diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
> new file mode 100644
> index 0000000000..9eca52ee70
> --- /dev/null
> +++ b/doc/src/sgml/tableam.sgml
> @@ -0,0 +1,56 @@
> +<!-- doc/src/sgml/tableam.sgml -->
> +
> +<chapter id="tableam">
> + <title>Table Access Method Interface Definition</title>
> + 
> +  <para>
> +   This chapter defines the interface between the core <productname>PostgreSQL</productname>
> +   system and <firstterm>access methods</firstterm>, which manage <literal>TABLE</literal>
> +   types. The core system knows nothing about these access methods beyond
> +   what is specified here, so it is possible to develop entirely new access
> +   method types by writing add-on code.
> +  </para>
> +  
> +  <para>
> +   All Tables in <productname>PostgreSQL</productname> system are the primary
> +   data store. Each table is stored as its own physical <firstterm>relation</firstterm>
> +   and so is described by an entry in the <structname>pg_class</structname>
> +   catalog. A table's content is entirely controlled by its access method.
> +   (All the access methods furthermore use the standard page layout described
> +   in <xref linkend="storage-page-layout"/>.)
> +  </para>

I don't think there's actually any sort of dependency on the page
layout. It's entirely conceivable to write an AM that doesn't use
postgres' shared buffers.

> +  <para>
> +   A table access method handler function must be declared to accept a single
> +   argument of type <type>internal</type> and to return the pseudo-type
> +   <type>table_am_handler</type>.  The argument is a dummy value that simply
> +   serves to prevent handler functions from being called directly from SQL commands.

> +   The result of the function must be a palloc'd struct of type <structname>TableAmRoutine</structname>,
> +   which contains everything that the core code needs to know to make use of
> +   the table access method.

That's not been correct for a while...
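For illustration, the current shape - a handler returning a pointer to a
statically allocated struct of fixed properties and callback pointers, rather
than a palloc'd struct - can be sketched in self-contained C. All names below
are made up for the sketch, not PostgreSQL's actual definitions:

```c
#include <assert.h>

/* A much-simplified analogue of TableAmRoutine: fixed properties plus
 * pointers to support functions.  Illustrative only. */
typedef struct ToyAmRoutine
{
    const char *name;               /* a fixed property of the AM */
    int  (*tuple_width) (void);     /* a "support function" */
} ToyAmRoutine;

static int
toy_tuple_width(void)
{
    return 42;                      /* arbitrary demo value */
}

/* The handler returns a pointer to a single, statically allocated,
 * constant struct - no per-call allocation is needed, which is why
 * "must be a palloc'd struct" is no longer accurate wording. */
const ToyAmRoutine *
toy_am_handler(void)
{
    static const ToyAmRoutine routine = {
        .name = "toy",
        .tuple_width = toy_tuple_width,
    };
    return &routine;
}
```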


> The <structname>TableAmRoutine</structname> struct,
> +   also called the access method's <firstterm>API struct</firstterm>, includes
> +   fields specifying assorted fixed properties of the access method, such as
> +   whether it can support bitmap scans.  More importantly, it contains pointers
> +   to support functions for the access method, which do all of the real work to
> +   access tables. These support functions are plain C functions and are not
> +   visible or callable at the SQL level.  The support functions are described
> +   in <structname>TableAmRoutine</structname> structure. For more details, please
> +   refer the file <filename>src/include/access/tableam.h</filename>.
> +  </para>

This seems to not have been adapted after copying it from indexam?


I'm still working on this (in particular I think storage.sgml and
probably some other places need updates to make clear they apply to
heap, not generally; I think there need to be some references to generic
WAL records in tableam.sgml, ...), but I got to run a few errands.

One thing I want to call out is that I made the reference to
src/include/access/tableam.h a link to gitweb. I think that makes it
much more useful to the casual reader. But it also means that, barring
some infrastructure / procedure we don't have, the link will just
continue to point to master. I'm not particularly concerned about that,
but it seems worth pointing out, given that we have only a single link to
gitweb in the sgml docs so far.

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From
Justin Pryzby
Date:
I reviewed the new docs for $SUBJECT.

Find attached proposed changes.

There's one XXX item where I'm unsure what it's intended to say.

Justin

From a3d290bf67af2a34e44cd6c160daf552b56a13b5 Mon Sep 17 00:00:00 2001
From: Justin Pryzby <pryzbyj@telsasoft.com>
Date: Thu, 4 Apr 2019 00:48:09 -0500
Subject: [PATCH v1] Fine tune documentation for tableam

Added at commit b73c3a11963c8bb783993cfffabb09f558f86e37
---
 doc/src/sgml/catalogs.sgml        |  2 +-
 doc/src/sgml/config.sgml          |  4 ++--
 doc/src/sgml/ref/select_into.sgml |  6 +++---
 doc/src/sgml/storage.sgml         | 17 ++++++++-------
 doc/src/sgml/tableam.sgml         | 44 ++++++++++++++++++++-------------------
 5 files changed, 38 insertions(+), 35 deletions(-)

diff --git a/doc/src/sgml/catalogs.sgml b/doc/src/sgml/catalogs.sgml
index 58c8c96..40ddec4 100644
--- a/doc/src/sgml/catalogs.sgml
+++ b/doc/src/sgml/catalogs.sgml
@@ -587,7 +587,7 @@
    The catalog <structname>pg_am</structname> stores information about
    relation access methods.  There is one row for each access method supported
    by the system.
-   Currently, only table and indexes have access methods. The requirements for table
+   Currently, only tables and indexes have access methods. The requirements for table
    and index access methods are discussed in detail in <xref linkend="tableam"/> and
    <xref linkend="indexam"/> respectively.
   </para>
diff --git a/doc/src/sgml/config.sgml b/doc/src/sgml/config.sgml
index 4a9a1e8..90b478d 100644
--- a/doc/src/sgml/config.sgml
+++ b/doc/src/sgml/config.sgml
@@ -7306,8 +7306,8 @@ COPY postgres_log FROM '/full/path/to/logfile.csv' WITH csv;
         This parameter specifies the default table access method to use when
         creating tables or materialized views if the <command>CREATE</command>
         command does not explicitly specify an access method, or when
-        <command>SELECT ... INTO</command> is used, which does not allow to
-        specify a table access method. The default is <literal>heap</literal>.
+        <command>SELECT ... INTO</command> is used, which does not allow
+        specification of a table access method. The default is <literal>heap</literal>.
        </para>
       </listitem>
      </varlistentry>
diff --git a/doc/src/sgml/ref/select_into.sgml b/doc/src/sgml/ref/select_into.sgml
index 17bed24..1443d79 100644
--- a/doc/src/sgml/ref/select_into.sgml
+++ b/doc/src/sgml/ref/select_into.sgml
@@ -106,11 +106,11 @@ SELECT [ ALL | DISTINCT [ ON ( <replaceable class="parameter">expression</replac
   </para>
 
   <para>
-   In contrast to <command>CREATE TABLE AS</command> <command>SELECT
-   INTO</command> does not allow to specify properties like a table's access
+   In contrast to <command>CREATE TABLE AS</command>, <command>SELECT
+   INTO</command> does not allow specification of properties like a table's access
    method with <xref linkend="sql-createtable-method" /> or the table's
    tablespace with <xref linkend="sql-createtable-tablespace" />. Use <xref
-   linkend="sql-createtableas"/> if necessary.  Therefore the default table
+   linkend="sql-createtableas"/> if necessary.  Therefore, the default table
    access method is chosen for the new table. See <xref
    linkend="guc-default-table-access-method"/> for more information.
   </para>
diff --git a/doc/src/sgml/storage.sgml b/doc/src/sgml/storage.sgml
index 62333e3..5dfca1b 100644
--- a/doc/src/sgml/storage.sgml
+++ b/doc/src/sgml/storage.sgml
@@ -189,11 +189,11 @@ there.
 </para>
 
 <para>
- Note that the following sections describe the way the builtin
+ Note that the following sections describe the behavior of the builtin
  <literal>heap</literal> <link linkend="tableam">table access method</link>,
- and the builtin <link linkend="indexam">index access methods</link> work. Due
- to the extensible nature of <productname>PostgreSQL</productname> other types
- of access method might work similar or not.
+ and the builtin <link linkend="indexam">index access methods</link>. Due
+ to the extensible nature of <productname>PostgreSQL</productname>, other
+ access methods might work differently.
 </para>
 
 <para>
@@ -703,11 +703,12 @@ erased (they will be recreated automatically as needed).
 This section provides an overview of the page format used within
 <productname>PostgreSQL</productname> tables and indexes.<footnote>
   <para>
-    Actually, neither table nor index access methods need not use this page
-    format.  All the existing index methods do use this basic format, but the
+    Actually, use of this page format is not required for either table or index
+    access methods.
+    The <literal>heap</literal> table access method always uses this format.
+    All the existing index methods also use the basic format, but the
     data kept on index metapages usually doesn't follow the item layout
-    rules. The <literal>heap</literal> table access method also always uses
-    this format.
+    rules.
   </para>
 </footnote>
 Sequences and <acronym>TOAST</acronym> tables are formatted just like a regular table.
diff --git a/doc/src/sgml/tableam.sgml b/doc/src/sgml/tableam.sgml
index 8d9bfd8..0a89935 100644
--- a/doc/src/sgml/tableam.sgml
+++ b/doc/src/sgml/tableam.sgml
@@ -48,54 +48,56 @@
   callbacks and their behavior is defined in the
   <structname>TableAmRoutine</structname> structure (with comments inside the
   struct defining the requirements for callbacks). Most callbacks have
-  wrapper functions, which are documented for the point of view of a user,
-  rather than an implementor, of the table access method.  For details,
+  wrapper functions, which are documented from the point of view of a user
+  (rather than an implementor) of the table access method.  For details,
   please refer to the <ulink
url="https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/access/tableam.h;hb=HEAD">
   <filename>src/include/access/tableam.h</filename></ulink> file.
  </para>
 
  <para>
-  To implement a access method, an implementor will typically need to
-  implement a AM specific type of tuple table slot (see
+  To implement an access method, an implementor will typically need to
+  implement an AM-specific type of tuple table slot (see
   <ulink url="https://git.postgresql.org/gitweb/?p=postgresql.git;a=blob;f=src/include/executor/tuptable.h;hb=HEAD">
-   <filename>src/include/executor/tuptable.h</filename></ulink>) which allows
+   <filename>src/include/executor/tuptable.h</filename></ulink>), which allows
    code outside the access method to hold references to tuples of the AM, and
    to access the columns of the tuple.
  </para>
 
  <para>
-  Currently the the way an AM actually stores data is fairly
-  unconstrained. It is e.g. possible to use postgres' shared buffer cache,
-  but not required. In case shared buffers are used, it likely makes to
-  postgres' standard page layout described in <xref
+  Currently, the way an AM actually stores data is fairly
+  unconstrained. For example, it's possible but not required to use postgres'
+  shared buffer cache.  In case it is used, it likely makes
+XXX something missing here ?
+  to postgres' standard page layout described in <xref
   linkend="storage-page-layout"/>.
  </para>
 
  <para>
   One fairly large constraint of the table access method API is that,
   currently, if the AM wants to support modifications and/or indexes, it is
-  necessary that each tuple has a tuple identifier (<acronym>TID</acronym>)
+  necessary for each tuple to have a tuple identifier (<acronym>TID</acronym>)
   consisting of a block number and an item number (see also <xref
   linkend="storage-page-layout"/>).  It is not strictly necessary that the
-  sub-parts of <acronym>TIDs</acronym> have the same meaning they e.g. have
+  sub-parts of <acronym>TIDs</acronym> have the same meaning as used
   for <literal>heap</literal>, but if bitmap scan support is desired (it is
   optional), the block number needs to provide locality.
  </para>
 
  <para>
-  For crash safety an AM can use postgres' <link
-  linkend="wal"><acronym>WAL</acronym></link>, or a custom approach can be
-  implemented.  If <acronym>WAL</acronym> is chosen, either <link
-  linkend="generic-wal">Generic WAL Records</link> can be used — which
-  implies higher WAL volume but is easy, or a new type of
-  <acronym>WAL</acronym> records can be implemented — but that
-  currently requires modifications of core code (namely modifying
+  For crash safety, an AM can use postgres' <link
+  linkend="wal"><acronym>WAL</acronym></link>, or a custom implementation.
+  If <acronym>WAL</acronym> is chosen, either <link
+  linkend="generic-wal">Generic WAL Records</link> can be used,
+  or a new type of <acronym>WAL</acronym> records can be implemented.
+  Generic WAL Records are easy, but imply higher WAL volume.
+  Implementation of a new type of WAL record
+  currently requires modifications to core code (specifically,
   <filename>src/include/access/rmgrlist.h</filename>).
  </para>
 
  <para>
   To implement transactional support in a manner that allows different table
-  access methods be accessed within a single transaction, it likely is
+  access methods be accessed within a single transaction, it's likely
   necessary to closely integrate with the machinery in
   <filename>src/backend/access/transam/xlog.c</filename>.
  </para>
@@ -103,8 +105,8 @@
  <para>
   Any developer of a new <literal>table access method</literal> can refer to
   the existing <literal>heap</literal> implementation present in
-  <filename>src/backend/heap/heapam_handler.c</filename> for more details of
-  how it is implemented.
+  <filename>src/backend/heap/heapam_handler.c</filename> for details of
+  its implementation.
  </para>
 
 </chapter>
-- 
2.1.4


Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-04 00:51:38 -0500, Justin Pryzby wrote:
> I reviewed new docs for $SUBJECT.

> Find attached proposed changes.

> There's one XXX item I'm unsure what it's intended to say.

Thanks! I applied most of these, and filled in the XXX. I didn't like
the s/allow to specify properties/allow specification of properties/,
so I left those out. But I could be convinced otherwise...

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Heikki Linnakangas
Date:
I wrote a little toy implementation that just returns constant data to 
play with this a little. Looks good overall.

There were a bunch of typos in the comments in tableam.h, see attached. 
Some of the comments could use more copy-editing and clarification, I 
think, but I stuck to fixing just typos and such for now.

index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM 
doesn't use normal data files, that won't work. I bumped into that with 
my toy implementation, which wouldn't need to create any data files, if 
it wasn't for this.

The comments for relation_set_new_relfilenode() callback say that the AM 
can set *freezeXid and *minmulti to invalid. But when I did that, VACUUM 
hits this assertion:

TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) 
3)))", File: "vacuum.c", Line: 1323)

There's a little bug in index-only scan executor node, where it mixes up 
the slots to hold a tuple from the index, and from the table. That 
doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with 
my toy AM, which uses a virtual slot, it caused warnings like this from 
index-only scans:

WARNING:  problem in alloc set ExecutorState: detected write past chunk 
end in block 0x56419b0f88e8, chunk 0x56419b0f8f90

Attached is a patch with the toy implementation I used to test this. 
I'm not suggesting we should commit that - although feel free to do that 
if you think it's useful - but it shows how I bumped into these issues. 
The second patch fixes the index-only-scan slot confusion (untested, 
except with my toy AM).

- Heikki

Attachments

Re: Pluggable Storage - Andres's take

From
Fabrízio de Royes Mello
Date:


On Mon, Apr 8, 2019 at 9:34 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> I wrote a little toy implementation that just returns constant data to
> play with this a little. Looks good overall.
>
> There were a bunch of typos in the comments in tableam.h, see attached.
> Some of the comments could use more copy-editing and clarification, I
> think, but I stuck to fixing just typos and such for now.
>
> index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> doesn't use normal data files, that won't work. I bumped into that with
> my toy implementation, which wouldn't need to create any data files, if
> it wasn't for this.
>
> The comments for relation_set_new_relfilenode() callback say that the AM
> can set *freezeXid and *minmulti to invalid. But when I did that, VACUUM
> hits this assertion:
>
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)
>
> There's a little bug in index-only scan executor node, where it mixes up
> the slots to hold a tuple from the index, and from the table. That
> doesn't cause any ill effects if the AM uses TTSOpsHeapTuple, but with
> my toy AM, which uses a virtual slot, it caused warnings like this from
> index-only scans:
>
> WARNING:  problem in alloc set ExecutorState: detected write past chunk
> end in block 0x56419b0f88e8, chunk 0x56419b0f8f90
>
> Attached is a patch with the toy implementation I used to test this.
> I'm not suggesting we should commit that - although feel free to do that
> if you think it's useful - but it shows how I bumped into these issues.
> The second patch fixes the index-only-scan slot confusion (untested,
> except with my toy AM).
>

Awesome... it built and ran tests cleanly, but I got an assertion failure running VACUUM:

fabrizio=# vacuum toytab ;
TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId) 3)))", File: "vacuum.c", Line: 1323)
psql: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The connection to the server was lost. Attempting reset: 2019-04-08 12:29:16.204 -03 [20844] LOG:  server process (PID 24457) was terminated by signal 6: Aborted
2019-04-08 12:29:16.204 -03 [20844] DETAIL:  Failed process was running: vacuum toytab ;
2019-04-08 12:29:16.204 -03 [20844] LOG:  terminating any other active server processes
2019-04-08 12:29:16.205 -03 [24458] WARNING:  terminating connection because of crash of another server process

And backtrace is:

(gdb) bt
#0  0x00007f813779f428 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:54
#1  0x00007f81377a102a in __GI_abort () at abort.c:89
#2  0x0000000000ec0de9 in ExceptionalCondition (conditionName=0x10e3bb8 "!(((classForm->relfrozenxid) >= ((TransactionId) 3)))", errorType=0x10e33f3 "FailedAssertion", fileName=0x10e345a "vacuum.c", lineNumber=1323) at assert.c:54
#3  0x0000000000893646 in vac_update_datfrozenxid () at vacuum.c:1323
#4  0x000000000089127a in vacuum (relations=0x26c4390, params=0x7ffeb1a3fb30, bstrategy=0x26c4218, isTopLevel=true) at vacuum.c:452
#5  0x00000000008906ae in ExecVacuum (pstate=0x26145b8, vacstmt=0x25f46f0, isTopLevel=true) at vacuum.c:196
#6  0x0000000000c3a883 in standard_ProcessUtility (pstmt=0x25f4a50, queryString=0x25f3be8 "vacuum toytab ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at utility.c:670
#7  0x0000000000c3977a in ProcessUtility (pstmt=0x25f4a50, queryString=0x25f3be8 "vacuum toytab ;", context=PROCESS_UTILITY_TOPLEVEL, params=0x0, queryEnv=0x0, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at utility.c:360
#8  0x0000000000c3793e in PortalRunUtility (portal=0x265ba28, pstmt=0x25f4a50, isTopLevel=true, setHoldSnapshot=false, dest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:1175
#9  0x0000000000c37d7f in PortalRunMulti (portal=0x265ba28, isTopLevel=true, setHoldSnapshot=false, dest=0x25f4b48, altdest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:1321
#10 0x0000000000c36899 in PortalRun (portal=0x265ba28, count=9223372036854775807, isTopLevel=true, run_once=true, dest=0x25f4b48, altdest=0x25f4b48, completionTag=0x7ffeb1a3ffb0 "") at pquery.c:796
#11 0x0000000000c2a40e in exec_simple_query (query_string=0x25f3be8 "vacuum toytab ;") at postgres.c:1215
#12 0x0000000000c332a3 in PostgresMain (argc=1, argv=0x261fe68, dbname=0x261fca8 "fabrizio", username=0x261fc80 "fabrizio") at postgres.c:4249
#13 0x0000000000b051fc in BackendRun (port=0x2616d20) at postmaster.c:4429
#14 0x0000000000b042c3 in BackendStartup (port=0x2616d20) at postmaster.c:4120
#15 0x0000000000afc70a in ServerLoop () at postmaster.c:1703
#16 0x0000000000afb94e in PostmasterMain (argc=3, argv=0x25ed850) at postmaster.c:1376
#17 0x0000000000977de8 in main (argc=3, argv=0x25ed850) at main.c:228


Isn't it better to raise an exception, as you did in other functions?

static void
toyam_relation_vacuum(Relation onerel,
                      struct VacuumParams *params,
                      BufferAccessStrategy bstrategy)
{
    ereport(ERROR,
            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
             errmsg("function %s not implemented yet", __func__)));
}

Regards,

--
   Fabrízio de Royes Mello         Timbira - http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> There were a bunch of typos in the comments in tableam.h, see attached. Some
> of the comments could use more copy-editing and clarification, I think, but
> I stuck to fixing just typos and such for now.

I pushed these after adding three boring changes by pgindent. Thanks for
those!

I'd greatly welcome more feedback on the comments - I've been pretty
deep in this for so long that I don't see all of the issues anymore. And
a mild dyslexia doesn't help...


> index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> doesn't use normal data files, that won't work. I bumped into that with my
> toy implementation, which wouldn't need to create any data files, if it
> wasn't for this.

Hm. That should be fixed. I've been burning the candle at both ends for
too long, so I'll not get to it today. But I think we should fix it
soon.  I'll create an open item for it.


> The comments for relation_set_new_relfilenode() callback say that the AM can
> set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits
> this assertion:
> 
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)

Hm. That needs to be fixed - IIRC it previously worked, because zheap
doesn't have relfrozenxid either. Probably broke it when trying to
winnow down the tableam patches. I'm planning to rebase zheap onto the
newest version soon, so I'll re-encounter this.


> There's a little bug in index-only scan executor node, where it mixes up the
> slots to hold a tuple from the index, and from the table. That doesn't cause
> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
> uses a virtual slot, it caused warnings like this from index-only scans:

Hm. That's another one that I think I had fixed previously :(, and then
concluded that it's not actually necessary for some reason. Your fix
looks correct to me.  Do you want to commit it? Otherwise I'll look at
it after rebasing zheap, and checking it with that.


> Attached is a patch with the toy implementation I used to test this. I'm not
> suggesting we should commit that - although feel free to do that if you
> think it's useful - but it shows how I bumped into these issues.

Hm, probably not a bad idea to include something like it. It seems like
we kinda would need non-stub implementations of more functions for it to
test much, and to serve as an example.  I'm mildly inclined to just do
it via zheap / externally, but I'm not quite sure that's good enough.


> +static Size
> +toyam_parallelscan_estimate(Relation rel)
> +{
> +    ereport(ERROR,
> +            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> +             errmsg("function %s not implemented yet", __func__)));
> +}

The other stubbed functions seem like we should require them, but I
wonder if we should make the parallel stuff optional?
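If the parallel callbacks were made optional, core could treat a NULL
pointer as "not supported" and check before dispatching, instead of
requiring stubs that error out. A minimal self-contained sketch - all
names invented for illustration, not the actual tableam API:

```c
#include <assert.h>
#include <stddef.h>

/* Toy routine struct with one optional callback.  NULL means the AM
 * does not support parallel scans. */
typedef struct ToyAmRoutine
{
    size_t (*parallelscan_estimate) (void);     /* may be NULL */
} ToyAmRoutine;

/* Sample callback an AM with parallel support might install. */
static size_t
toy_parallelscan_estimate_cb(void)
{
    return 128;     /* pretend shared-memory size in bytes */
}

/* Core-side capability check: planner could skip parallel paths
 * entirely for AMs that leave the callback NULL. */
static int
toy_supports_parallel(const ToyAmRoutine *am)
{
    return am->parallelscan_estimate != NULL;
}

/* Core-side dispatcher: only legal after the capability check. */
static size_t
toy_parallelscan_estimate(const ToyAmRoutine *am)
{
    assert(am->parallelscan_estimate != NULL);
    return am->parallelscan_estimate();
}
```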


Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Heikki Linnakangas
Date:
On 08/04/2019 20:37, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> There's a little bug in index-only scan executor node, where it mixes up the
>> slots to hold a tuple from the index, and from the table. That doesn't cause
>> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
>> uses a virtual slot, it caused warnings like this from index-only scans:
> 
> Hm. That's another one that I think I had fixed previously :(, and then
> concluded that it's not actually necessary for some reason. Your fix
> looks correct to me.  Do you want to commit it? Otherwise I'll look at
> it after rebasing zheap, and checking it with that.

I found another slot type confusion bug, while playing with zedstore. In 
an Index Scan, if you have an ORDER BY key that needs to be rechecked, 
so that it uses the reorder queue, then it will sometimes use the 
reorder queue slot, and sometimes the table AM's slot, for the scan 
slot. If they're not of the same type, you get an assertion:

TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File: 
"execExprInterp.c", Line: 1905)

Attached is a test for this, again using the toy table AM, extended to 
be able to test this. And a fix.

>> Attached is a patch with the toy implementation I used to test this. I'm not
>> suggesting we should commit that - although feel free to do that if you
>> think it's useful - but it shows how I bumped into these issues.
> 
> Hm, probably not a bad idea to include something like it. It seems like
> we kinda would need non-stub implementations of more functions for it to
> test much, and to serve as an example.  I'm mildly inclined to just do
> it via zheap / externally, but I'm not quite sure that's good enough.

Works for me.

>> +static Size
>> +toyam_parallelscan_estimate(Relation rel)
>> +{
>> +    ereport(ERROR,
>> +            (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> +             errmsg("function %s not implemented yet", __func__)));
>> +}
> 
> The other stubbed functions seem like we should require them, but I
> wonder if we should make the parallel stuff optional?

Yeah, that would be good. I would assume it to be optional.

- Heikki

Attachments

Re: Pluggable Storage - Andres's take

From
Heikki Linnakangas
Date:
On 08/04/2019 20:37, Andres Freund wrote:
> Hi,
> 
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> There were a bunch of typos in the comments in tableam.h, see attached. Some
>> of the comments could use more copy-editing and clarification, I think, but
>> I stuck to fixing just typos and such for now.
> I pushed these after adding three boring changes by pgindent. Thanks for
> those!
> 
> I'd greatly welcome more feedback on the comments - I've been pretty
> deep in this for so long that I don't see all of the issues anymore. And
> a mild dyslexia doesn't help...

Here is another iteration on the comments. The patch is a mix of 
copy-editing and questions. The questions are marked with "HEIKKI:". I 
can continue the copy-editing, if you can reply to the questions, 
clarifying the intention on some parts of the API. (Or feel free to pick 
and push any of these fixes immediately, if you prefer.)

- Heikki

Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
> Here is another iteration on the comments. The patch is a mix of
> copy-editing and questions. The questions are marked with "HEIKKI:". I can
> continue the copy-editing, if you can reply to the questions, clarifying the
> intention on some parts of the API. (Or feel free to pick and push any of
> these fixes immediately, if you prefer.)

Thanks!

> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
> index f7f726b5aec..bbcab9ce31a 100644
> --- a/src/backend/utils/misc/guc.c
> +++ b/src/backend/utils/misc/guc.c
> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>          {"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
>              gettext_noop("Sets the default table access method for new tables."),
>              NULL,
> -            GUC_IS_NAME
> +            GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>          },
>          &default_table_access_method,
>          DEFAULT_TABLE_ACCESS_METHOD,

Hm, I think we should rather add it to sample. That's an oversight, not
intentional.


> index 6fbfcb96c98..d4709563e7e 100644
> --- a/src/include/access/tableam.h
> +++ b/src/include/access/tableam.h
> @@ -91,8 +91,9 @@ typedef enum TM_Result
>   * xmax is the outdating transaction's XID.  If the caller wants to visit the
>   * replacement tuple, it must check that this matches before believing the
>   * replacement is really a match.
> + * HEIKKI: matches what? xmin, but that's specific to the heapam.

It's basically just the old comment moved. I wonder if we can just get
rid of that field - because the logic to follow update chains correctly
is now inside the lock tuple callback. And as you say - it's not clear
what callers can do with it for the purpose of following chains.  The
counter-argument is that having it makes it a lot less annoying to adapt
external code that wants to migrate with a minimal set of changes, and
is only really interested in supporting heap for now.


>   * GetTableAmRoutine() asserts that required callbacks are filled in, remember
>   * to update when adding a callback.
> @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
>       *
>       * if temp_snap is true, the snapshot will need to be deallocated at
>       * scan_end.
> +     *
> +     * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> +     * a bit surprising for the AM, no? Can it be called when a scan is
> +     * already in progress?

Yea, it can be called when the scan is in-progress. I think we probably
should just fix calling code to not need that - it's imo weird that
nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
the snapshot. This isn't new code, but it's now going to be exposed to
more AMs, so I think there's a good argument to fix it now.

Robert: You committed that addition, in

commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
Author: Robert Haas <rhaas@postgresql.org>
Date:   2017-03-08 12:05:43 -0500

    Support parallel bitmap heap scans.

do you remember why that's done?



> +     * HEIKKI: A flags bitmask argument would be more readable than 6 booleans
>       */
>      TableScanDesc (*scan_begin) (Relation rel,
>                                   Snapshot snapshot,

I honestly don't have strong feelings about it. Not sure that I buy that
bitmasks would be much more readable - but perhaps we could just use the
struct trickery we started to use in

commit f831d4accda00b9144bc647ede2e2f848b59f39d
Author: Alvaro Herrera <alvherre@alvh.no-ip.org>
Date:   2019-02-01 11:29:42 -0300

    Add ArchiveOpts to pass options to ArchiveEntry


> @@ -194,6 +205,9 @@ typedef struct TableAmRoutine
>      /*
>       * Release resources and deallocate scan. If TableScanDesc.temp_snap,
>       * TableScanDesc.rs_snapshot needs to be unregistered.
> +     *
> +     * HEIKKI: I find this 'temp_snap' thing pretty weird. Can't the caller handle
> +     * deregistering it?
>       */
>      void        (*scan_end) (TableScanDesc scan);

It's old logic, just newly wrapped.  I think there's some argument that
some of this should be moved to tableam.c rather than the individual
AMs.



> @@ -221,6 +235,11 @@ typedef struct TableAmRoutine
>      /*
>       * Estimate the size of shared memory needed for a parallel scan of this
>       * relation. The snapshot does not need to be accounted for.
> +     *
> +     * HEIKKI: If this returns X, then the parallelscan_initialize() call
> +     * mustn't use more than X. So this is not just for optimization purposes,
> +     * for example. Not sure how to phrase that, but could use some
> +     * clarification.
>       */
>      Size        (*parallelscan_estimate) (Relation rel);

Hm. I thought I'd done that by adding the note that
parallelscan_initialize() gets memory sized by
parallelscan_estimate().


>      /*
>       * Reset index fetch. Typically this will release cross index fetch
>       * resources held in IndexFetchTableData.
> +     *
> +     * HEIKKI: Is this called between every call to index_fetch_tuple()?
> +     * Between every call to index_fetch_tuple(), except when call_again is
> +     * set? Can it be a no-op?
>       */
>      void        (*index_fetch_reset) (struct IndexFetchTableData *data);

It's basically just to release resources eagerly. I'll add a note.


> @@ -272,19 +297,22 @@ typedef struct TableAmRoutine
>       * test, return true, false otherwise.
>       *
>       * Note that AMs that do not necessarily update indexes when indexed
> -     * columns do not change, need to return the current/correct version of
> +     * columns don't change, need to return the current/correct version of
>       * the tuple that is visible to the snapshot, even if the tid points to an
>       * older version of the tuple.
> 
>       * *call_again is false on the first call to index_fetch_tuple for a tid.
> -     * If there potentially is another tuple matching the tid, *call_again
> -     * needs be set to true by index_fetch_tuple, signalling to the caller
> +     * If there potentially is another tuple matching the tid, the callback
> +     * needs to set *call_again to true, signalling to the caller
>       * that index_fetch_tuple should be called again for the same tid.
>       *
>       * *all_dead, if all_dead is not NULL, should be set to true by
>       * index_fetch_tuple iff it is guaranteed that no backend needs to see
> -     * that tuple. Index AMs can use that do avoid returning that tid in
> +     * that tuple. Index AMs can use that to avoid returning that tid in
>       * future searches.
> +     *
> +     * HEIKKI: Should the snapshot be given in index_fetch_begin()? Can it
> +     * differ between calls?
>       */
>      bool        (*index_fetch_tuple) (struct IndexFetchTableData *scan,
>                                        ItemPointer tid,

Hm. It could very well differ between calls. E.g. _bt_check_unique()
could benefit from that (although it currently uses the
table_index_fetch_tuple_check() wrapper), as it does one lookup with
SnapshotDirty, and then the next with SnapshotSelf.


> @@ -302,6 +330,8 @@ typedef struct TableAmRoutine
>       * Fetch tuple at `tid` into `slot`, after doing a visibility test
>       * according to `snapshot`. If a tuple was found and passed the visibility
>       * test, returns true, false otherwise.
> +     *
> +     * HEIKKI: explain how this differs from index_fetch_tuple.
>       */
>      bool        (*tuple_fetch_row_version) (Relation rel,
>                                              ItemPointer tid,

Currently the wrapper has:

 * See table_index_fetch_tuple's comment about what the difference between
 * these functions is. This function is the correct one to use outside of
 * index entry->table tuple lookups.

referencing

 * The difference between this function and table_fetch_row_version is that
 * this function returns the currently visible version of a row if the AM
 * supports storing multiple row versions reachable via a single index entry
 * (like heap's HOT). Whereas table_fetch_row_version only evaluates the
 * tuple exactly at `tid`. Outside of index entry ->table tuple lookups,
 * table_fetch_row_version is what's usually needed.

should we just duplicate that?

> @@ -311,14 +341,17 @@ typedef struct TableAmRoutine
>      /*
>       * Return the latest version of the tuple at `tid`, by updating `tid` to
>       * point at the newest version.
> +     *
> +     * HEIKKI: the latest version visible to the snapshot?
>       */
>      void        (*tuple_get_latest_tid) (Relation rel,
>                                           Snapshot snapshot,
>                                           ItemPointer tid);

It's such a bad interface :(. I'd love to just remove it.  Based on

https://www.postgresql.org/message-id/17ef5a8a-71cb-5cbf-1762-dbb71626f84e%40dream.email.ne.jp

I think we can basically just remove currtid_byreloid/byrelname.  I've
not sufficiently thought about TidNext() yet.


>      /*
> -     * Does the tuple in `slot` satisfy `snapshot`?  The slot needs to be of
> -     * the appropriate type for the AM.
> +     * Does the tuple in `slot` satisfy `snapshot`?
> +     *
> +     * The AM may modify the data underlying the tuple as a side-effect.
>       */
>      bool        (*tuple_satisfies_snapshot) (Relation rel,
>                                               TupleTableSlot *slot,

Hm, this obviously should be moved here from the wrapper. But I now
wonder if we can't phrase this better. Might try to come up with
something.


> +    /*
> +     * Copy all data from `OldHeap` into `NewHeap`, as part of a CLUSTER or
> +     * VACUUM FULL.
> +     *
> +     * If `OldIndex` is valid, the data should be ordered according to the
> +     * given index. If `use_sort` is false, the data should be fetched from the
> +     * index, otherwise it should be fetched from the old table and sorted.
> +     *
> +     * OldestXmin, FreezeXid, MultiXactCutoff are currently valid values for
> +     * the table.
> +     * HEIKKI: What does "currently valid" mean? Valid for the old table?

They are system-wide values, basically. Not sure how much detail about
that to go into here?


> +     * The callback should set *num_tuples, *tups_vacuumed, *tups_recently_dead
> +     * to statistics computed while copying for the relation. Not all might make
> +     * sense for every AM.
> +     * HEIKKI: What to do for the ones that don't make sense? Set to 0?
> +     */

I don't see much of an alternative, yea. I suspect we're going to have
to expand vacuum's reporting once we have a better grasp about what
other AMs want / need.


>      /*
>       * Prepare to analyze block `blockno` of `scan`. The scan has been started
> -     * with table_beginscan_analyze().  See also
> -     * table_scan_analyze_next_block().
> +     * with table_beginscan_analyze().
>       *
>       * The callback may acquire resources like locks that are held until
> -     * table_scan_analyze_next_tuple() returns false. It e.g. can make sense
> +     * table_scan_analyze_next_tuple() returns false. For example, it can make sense
>       * to hold a lock until all tuples on a block have been analyzed by
>       * scan_analyze_next_tuple.
> +     * HEIKKI: Hold a lock on what? A lwlock on the page?

Yea, that's what heapam does. I'm not particularly happy with this, but
I'm not sure how to do better.  I expect that we'll have to revise this
to be more general at some not too far away point.


> @@ -618,6 +666,8 @@ typedef struct TableAmRoutine
>       * internally needs to perform mapping between the internal and a block
>       * based representation.
>       *
> +     * HEIKKI: What TsmRoutine? Where is that?

I'm not sure what you mean. The SampleScanState has its associated
tablesample routine.  Would saying something like "will call the
NextSampleBlock() callback for the TsmRoutine associated with the
SampleScanState" be better?


>  /*
>   * Like table_beginscan(), but table_beginscan_strat() offers an extended API
> - * that lets the caller control whether a nondefault buffer access strategy
> - * can be used, and whether syncscan can be chosen (possibly resulting in the
> - * scan not starting from block zero).  Both of these default to true with
> - * plain table_beginscan.
> + * that lets the caller to use a non-default buffer access strategy, or
> + * specify that a synchronized scan can be used (possibly resulting in the
> + * scan not starting from block zero).  Both of these default to true, as
> + * with plain table_beginscan.
> + *
> + * HEIKKI: I'm a bit confused by 'allow_strat'. What is the non-default
> + * strategy that will get used if you pass allow_strat=true? Perhaps the flag
> + * should be called "use_bulkread_strategy"? Or it should be of type
> + * BufferAccessStrategyType, or the caller should create a strategy with
> + * GetAccessStrategy() and pass that.
>   */

That's really just a tableam port of the pre-existing heapam interface.
I don't like the API very much, but there are only so many things that
were realistic to change during this project (I think, there were
obviously lots of judgement calls).  I don't think there's much reason
to defend the current status - and I'm happy to collaborate on fixing
that.  But I think it's out of scope for 12.


>  /*
> - * table_beginscan_sampling is an alternative entry point for setting up a
> + * table_beginscan_sampling() is an alternative entry point for setting up a
>   * TableScanDesc for a TABLESAMPLE scan.  As with bitmap scans, it's worth
>   * using the same data structure although the behavior is rather different.
>   * In addition to the options offered by table_beginscan_strat, this call
>   * also allows control of whether page-mode visibility checking is used.
> + *
> + * HEIKKI: What is 'pagemode'?
>   */

That's a good question. My not defining it is pretty much a cop-out,
because there previously wasn't any explanation, and I wasn't sure there
*is* a meaningful definition.  I mean, it's basically largely an
efficiency hack inside heapam.c, but it's currently externally
determined e.g. in bernoulli.c (code from 11):

    /*
     * Use bulkread, since we're scanning all pages.  But pagemode visibility
     * checking is a win only at larger sampling fractions.  The 25% cutoff
     * here is based on very limited experimentation.
     */
    node->use_bulkread = true;
    node->use_pagemode = (percent >= 25);

If you have a suggestion how to either get rid of it, or how to properly
phrase this...


>   * TABLE_INSERT_NO_LOGICAL force-disables the emitting of logical decoding
>   * information for the tuple. This should solely be used during table rewrites
>   * where RelationIsLogicallyLogged(relation) is not yet accurate for the new
>   * relation.
> + * HEIKKI: Is this optional, too? Can the AM ignore it?

Hm. Currently logical decoding isn't really automatically extensible to
an AM (it works via WAL, and WAL isn't extensible) - so it'll currently
not mean anything to non-heap AMs (or AMs that patch/are part of core).


>   * Note that most of these options will be applied when inserting into the
>   * heap's TOAST table, too, if the tuple requires any out-of-line data.
> @@ -1041,6 +1100,8 @@ table_compute_xid_horizon_for_tuples(Relation rel,
>   * On return the slot's tts_tid and tts_tableOid are updated to reflect the
>   * insertion. But note that any toasting of fields within the slot is NOT
>   * reflected in the slots contents.
> + *
> + * HEIKKI: I think GetBulkInsertState() should be an AM-specific callback.
>   */

I agree. There was some of that in an earlier version of the patch, but
the interface wasn't yet right.  I think there's a lot such things that
just need to be added incrementally.


> @@ -1170,6 +1235,9 @@ table_delete(Relation rel, ItemPointer tid, CommandId cid,
>   * update was done.  However, any TOAST changes in the new tuple's
>   * data are not reflected into *newtup.
>   *
> + * HEIKKI: There is no 'newtup'.
> + * HEIKKI: HEAP_ONLY_TUPLE is AM-specific; do the callers peek into that, currently?

No, callers currently don't.  The callback does, and sets
*update_indexes accordingly.



> - * A side effect is to set indexInfo->ii_BrokenHotChain to true if we detect
> + * A side effect is to set index_info->ii_BrokenHotChain to true if we detect
>   * any potentially broken HOT chains.  Currently, we set this if there are any
>   * RECENTLY_DEAD or DELETE_IN_PROGRESS entries in a HOT chain, without trying
>   * very hard to detect whether they're really incompatible with the chain tip.
>   * This only really makes sense for heap AM, it might need to be generalized
>   * for other AMs later.
> + *
> + * HEIKKI: What does 'allow_sync' do?

Heh, I'm going to be responsible for everything that was previously
undocumented, aren't I ;).  I guess we should say something vague like
  "When allow_sync is set to true, an AM may use scans synchronized with
  other backends, if that makes sense. For some AMs that determines
  whether tuples are going to be returned in TID order".
It's vague, but I'm not sure we can do better.


Thanks!


Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Robert Haas
Date:
On Thu, Apr 11, 2019 at 12:49 PM Andres Freund <andres@anarazel.de> wrote:
> > @@ -179,6 +184,12 @@ typedef struct TableAmRoutine
> >        *
> >        * if temp_snap is true, the snapshot will need to be deallocated at
> >        * scan_end.
> > +      *
> > +      * HEIKKI: table_scan_update_snapshot() changes the snapshot. That's
> > +      * a bit surprising for the AM, no? Can it be called when a scan is
> > +      * already in progress?
>
> Yea, it can be called when the scan is in-progress. I think we probably
> should just fix calling code to not need that - it's imo weird that
> nodeBitmapHeapscan.c doesn't just delay starting the scan till it has
> the snapshot. This isn't new code, but it's now going to be exposed to
> more AMs, so I think there's a good argument to fix it now.
>
> Robert: You committed that addition, in
>
> commit f35742ccb7aa53ee3ed8416bbb378b0c3eeb6bb9
> Author: Robert Haas <rhaas@postgresql.org>
> Date:   2017-03-08 12:05:43 -0500
>
>     Support parallel bitmap heap scans.
>
> do you remember why that's done?

I don't think there was any brilliant idea behind it.  Delaying the
scan start until it has the snapshot seems like a good idea.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> +     * HEIKKI: A flags bitmask argument would be more readable than 6 booleans

> I honestly don't have strong feelings about it. Not sure that I buy that
> bitmasks would be much more readable

Sure they would be --- how's FLAG_FOR_FOO | FLAG_FOR_BAR not
better than unlabeled "true" and "false"?

> - but perhaps we could just use the
> struct trickery we started to use in

I find that rather ugly really.  If we're doing something other than a
dozen-or-so booleans, maybe it's the only viable option.  But for cases
where a flags argument will serve, that's our longstanding practice and
I don't see a reason to deviate.

            regards, tom lane



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> The comments for relation_set_new_relfilenode() callback say that the AM can
> set *freezeXid and *minmulti to invalid. But when I did that, VACUUM hits
> this assertion:
> 
> TRAP: FailedAssertion("!(((classForm->relfrozenxid) >= ((TransactionId)
> 3)))", File: "vacuum.c", Line: 1323)

Hm, that necessary change unfortunately escaped into the zheap tree
(which indeed doesn't set relfrozenxid). That's why I'd not noticed
this.  How about something like the attached?



I found a related problem in VACUUM FULL / CLUSTER while working on the
above, not fixed in the attached yet. Namely even if a relation doesn't
yet have a valid relfrozenxid/relminmxid before a VACUUM FULL / CLUSTER,
we'll set one after that. That's not great.

I suspect the easiest fix would be to make the relevant
relation_copy_for_cluster() FreezeXid, MultiXactCutoff arguments into
pointers, and allow the AM to reset them to an invalid value if that's
the appropriate one.

It'd probably be better if we just moved the entire xid limit
computation into the AM, but I'm worried that we actually need to move
it *further up* instead - independent of this change. I don't think it's
quite right to allow a table with a toast table to be independently
VACUUM FULL/CLUSTERed from the toast table. GetOldestXmin() can go
backwards for a myriad of reasons (or limited by
old_snapshot_threshold), and I'm fairly certain that e.g. VACUUM FULLing
the toast table, setting a lower old_snapshot_threshold, and VACUUM
FULLing the main table would result in failures.

I think we need to fix this for 12, rather than wait for 13. Does
anybody disagree?

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> doesn't use normal data files, that won't work. I bumped into that with my
> toy implementation, which wouldn't need to create any data files, if it
> wasn't for this.

There are a few more of these:

1) index_update_stats(), computing pg_class.relpages

   Feels like the number of both heap and index blocks should be
   computed by the index build and stored in IndexInfo. That'd also get
   a bit closer towards allowing indexams not going through smgr (useful
   e.g. for memory only ones).

2) commands/analyze.c, computing pg_class.relpages

   This should imo be moved to the tableam callback. It's currently done
   a bit weirdly imo, with fdws computing relpages in the callback, but
   then also returning the acquirefunc. Seems like it should entirely be
   computed as part of calling acquirefunc.

3) nodeTidscan, skipping over too large tids
   I think this should just be moved into the AMs, there's no need to
   have this in nodeTidscan.c

4) freespace.c, used for the new small-rels-have-no-fsm paths.
   That's being revised currently anyway. But I'm not particularly
   concerned even if it stays as is - freespace use is optional
   anyway. And I can't quite see an AM that doesn't want to use
   postgres' storage mechanism wanting to use freespace.c

   Therefore I'm inclined not to touch this independent of fixing the
   others.

I think none of these are critical issues for tableam, but we should fix
them.

I'm not sure about doing so for v12 though. 1) and 3) are fairly
trivial, but 2) would involve changing the FDW interface, by changing
the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
we're not even in beta1.

Comments?

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Tom Lane
Date:
Andres Freund <andres@anarazel.de> writes:
> ... I think none of these are critical issues for tableam, but we should fix
> them.

> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Probably better to fix those API issues now rather than later.

            regards, tom lane



Re: Pluggable Storage - Andres's take

From
Robert Haas
Date:
On Tue, Apr 23, 2019 at 6:55 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Andres Freund <andres@anarazel.de> writes:
> > ... I think none of these are critical issues for tableam, but we should fix
> > them.
>
> > I'm not sure about doing so for v12 though. 1) and 3) are fairly
> > trivial, but 2) would involve changing the FDW interface, by changing
> > the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> > we're not even in beta1.
>
> Probably better to fix those API issues now rather than later.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi Heikki, Ashwin, Tom,

On 2019-04-23 15:52:01 -0700, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> > index_update_stats() calls RelationGetNumberOfBlocks(<table>). If the AM
> > doesn't use normal data files, that won't work. I bumped into that with my
> > toy implementation, which wouldn't need to create any data files, if it
> > wasn't for this.
> 
> There are a few more of these:

> I'm not sure about doing so for v12 though. 1) and 3) are fairly
> trivial, but 2) would involve changing the FDW interface, by changing
> the AnalyzeForeignTable, AcquireSampleRowsFunc signatures. But OTOH,
> we're not even in beta1.

Hm. I think some of those changes would be a bit bigger than I initially
thought. Attached is a more minimal fix that'd route
RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
think it's definitely the right answer for 1), probably the pragmatic
answer to 2), but certainly not for 3).

I've for now made the AM return the size in bytes, and then convert that
into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
are going to continue to want it internally as pages (otherwise there's
going to be way too much churn, without a benefit I can see). So I
think that's OK.

There's also a somewhat weird bit of returning the total relation size
for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
postgres' current forks, but have equivalent concepts. And without that
there'd be no way to get that size.  I'm not sure I like this, input
welcome. But it seems good to offer the ability to get the entire size
somehow.

Btw, isn't RelationGetNumberOfBlocksForFork() currently weirdly placed?
I don't see why bufmgr.c would be appropriate? Although I don't think
it's particularly clear where it'd best reside - I'd tentatively say
storage.c.

Heikki, Ashwin, your inputs would be appreciated here, in particular the
tid fetch bit below.


The attached patch isn't intended to be applied as-is, just basis for
discussion.


> 1) index_update_stats(), computing pg_class.relpages
> 
>    Feels like the number of both heap and index blocks should be
>    computed by the index build and stored in IndexInfo. That'd also get
>    a bit closer towards allowing indexams not going through smgr (useful
>    e.g. for memory only ones).

Due to parallel index builds that'd actually be hard. Given the number
of places wanting to compute relpages for pg_class I think the above
patch routing RelationGetNumberOfBlocksForFork() through tableam is the
right fix.


> 2) commands/analyze.c, computing pg_class.relpages
> 
>    This should imo be moved to the tableam callback. It's currently done
>    a bit weirdly imo, with fdws computing relpages in the callback, but
>    then also returning the acquirefunc. Seems like it should entirely be
>    computed as part of calling acquirefunc.

Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
tableam wouldn't be the right minimal approach too. It has the
disadvantage of implying certain values for the
RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
would be to return the desired sampling range in
table_beginscan_analyze() - but that'd require some duplication because
currently that just uses the generic scan_begin() callback.

I suspect - as previously mentioned- that we're going to have to extend
statistics collection beyond the current approach at some point, but I
don't think that's now. At least to me it's not clear how to best
represent the stats, and how to best use them, if the underlying storage
is fundamentally not block based.  Nor how we'd avoid code duplication...


> 3) nodeTidscan, skipping over too large tids
>    I think this should just be moved into the AMs, there's no need to
>    have this in nodeTidscan.c

I think here it's *not* actually correct at all to use the relation
size. It's currently doing:

    /*
     * We silently discard any TIDs that are out of range at the time of scan
     * start.  (Since we hold at least AccessShareLock on the table, it won't
     * be possible for someone to truncate away the blocks we intend to
     * visit.)
     */
    nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);

which is fine (except for a certain abstraction leakage) for an AM like
heap or zheap, but I suspect strongly that that's not ok for Ashwin &
Heikki's approach where tid isn't tied to physical representation.


The obvious answer would be to just move that check into the
table_fetch_row_version implementation (currently just calling
heap_fetch()) - but that doesn't seem OK from a performance POV, because
we'd then determine the relation size once for each tid, rather than
once per tidscan.  And it'd also check in cases where we know the tid is
supposed to be valid (e.g. fetching trigger tuples and such).

The proper fix seems to be to introduce a new scan variant
(e.g. table_beginscan_tid()), and then have table_fetch_row_version take
a scan as a parameter.  But it seems we'd have to introduce that as a
separate tableam callback, because we'd not want to incur the overhead
of creating an additional scan / RelationGetNumberOfBlocks() checks for
triggers et al.


Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:
On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).
>
> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.

I will provide my inputs, Heikki please correct me or add your inputs.

I am not sure how much gain this practically provides, if the rest of
the system continues to use the returned value in terms of blocks. I
understand that being block based (and not just block based, but with
all blocks of a relation storing data and full tuples) is ingrained in
the system, so breaking out of that is a much larger change, not just
limited to the table AM API.

I feel most of the issues discussed here will be faced by zheap as
well, since not all blocks/pages contain data - e.g. TPD pages should
be excluded from sampling and TID scans, etc...

> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

Yes, I do think we should have a mechanism to get the total size as
well as the size for a specific purpose. Zedstore currently doesn't use
forks. Just a thought: instead of calling the argument forknum, call it
something like data vs. meta-data (or main-data vs. auxiliary-data)
size. Though I don't know of a usage where one wishes to get the size
of just some non-MAIN fork for heap/zheap; those pieces of code
shouldn't be in generic areas, but in AM-specific code only.

>
> > 2) commands/analyze.c, computing pg_class.relpages
> >
> >    This should imo be moved to the tableam callback. It's currently done
> >    a bit weirdly imo, with fdws computing relpages the callback, but
> >    then also returning the acquirefunc. Seems like it should entirely be
> >    computed as part of calling acquirefunc.
>
> Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> tableam wouldn't be the right minimal approach too. It has the
> disadvantage of implying certain values for the
> RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
> would be to return the desire sampling range in
> table_beginscan_analyze() - but that'd require some duplication because
> currently that just uses the generic scan_begin() callback.

Yes, just routing the relation size via the AM layer, using the
returned value still in terms of blocks, and performing sampling based
on those blocks doesn't feel like it resolves the issue. Maybe we need
to delegate sampling completely to the AM layer. Code duplication can
be avoided by similar AMs (heap and zheap) possibly using some common
utility functions to achieve the intended result.

>
> I suspect - as previously mentioned- that we're going to have to extend
> statistics collection beyond the current approach at some point, but I
> don't think that's now. At least to me it's not clear how to best
> represent the stats, and how to best use them, if the underlying storage
> is fundamentally not block based.  Nor how we'd avoid code duplication...

Yes, will have to give more thoughts into this.

>
> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
>         /*
>          * We silently discard any TIDs that are out of range at the time of scan
>          * start.  (Since we hold at least AccessShareLock on the table, it won't
>          * be possible for someone to truncate away the blocks we intend to
>          * visit.)
>          */
>         nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.

Agree, it's not nice to have that optimization performed based on the
number of blocks in the generic layer. I feel it's not efficient for
zheap either, due to the TPD pages mentioned above, as the number of
blocks returned will be higher than the number of actual data blocks.

>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).

Agreed, checking the relation size once per tuple is not a viable solution.

>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Thinking out loud here, we can possibly tackle this in multiple ways.
First, the above-mentioned check seems to be more of an optimization
than functionally required; correct me if I'm wrong. If that's true, we
could ask the AM whether it wishes to apply that relation-size-based
optimization at all. For Zedstore, instead of performing this
optimization, we would directly call fetch row version, and Zedstore
can quickly bail out based on the TID passed to it, since its meta page
records the highest allocated TID value. With concurrent inserts,
though, it may perform more work.

An alternative could be, instead of getting the relation size, to add a
callback that gets the highest TID value from the AM. heap and zheap
can return a TID built from the highest block number and the maximum
offset that block can hold; Zedstore can return the highest TID it has
assigned so far. Then either use the full TID to perform the check,
rather than just the block number, or extract the block number from the
TID and use that for the check instead. That would at least work for
the AMs we know of so far; it's hard to imagine how this would be used
by AMs that don't exist yet.

Irrespective of how we solve this problem, ctids are displayed, and
need to be specified, in (block, offset) fashion for tid scans :-)



Re: Pluggable Storage - Andres's take

From
Rafia Sabih
Date:
On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas <hlinnaka@iki.fi> wrote:
>
> On 08/04/2019 20:37, Andres Freund wrote:
> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> There's a little bug in index-only scan executor node, where it mixes up the
> >> slots to hold a tuple from the index, and from the table. That doesn't cause
> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
> >> uses a virtual slot, it caused warnings like this from index-only scans:
> >
> > Hm. That's another one that I think I had fixed previously :(, and then
> > concluded that it's not actually necessary for some reason. Your fix
> > looks correct to me.  Do you want to commit it? Otherwise I'll look at
> > it after rebasing zheap, and checking it with that.
>
> I found another slot type confusion bug, while playing with zedstore. In
> an Index Scan, if you have an ORDER BY key that needs to be rechecked,
> so that it uses the reorder queue, then it will sometimes use the
> reorder queue slot, and sometimes the table AM's slot, for the scan
> slot. If they're not of the same type, you get an assertion:
>
> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> "execExprInterp.c", Line: 1905)
>
> Attached is a test for this, again using the toy table AM, extended to
> be able to test this. And a fix.
>
> >> Attached is a patch with the toy implementation I used to test this. I'm not
> >> suggesting we should commit that - although feel free to do that if you
> >> think it's useful - but it shows how I bumped into these issues.
> >
> > Hm, probably not a bad idea to include something like it. It seems like
> > we kinda would need non-stub implementation of more functions for it to
> > test much / and to serve as an example.  I'm mildy inclined to just do
> > it via zheap / externally, but I'm not quite sure that's good enough.
>
> Works for me.
>
> >> +static Size
> >> +toyam_parallelscan_estimate(Relation rel)
> >> +{
> >> +    ereport(ERROR,
> >> +                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> +                     errmsg("function %s not implemented yet", __func__)));
> >> +}
> >
> > The other stubbed functions seem like we should require them, but I
> > wonder if we should make the parallel stuff optional?
>
> Yeah, that would be good. I would assume it to be optional.
>
I was trying the toyam patch and on make check it failed with
segmentation fault at

static void
toyam_relation_set_new_filenode(Relation rel,
 char persistence,
 TransactionId *freezeXid,
 MultiXactId *minmulti)
{
 *freezeXid = InvalidTransactionId;

Basically, on running create table t (i int, j int) using toytable,
leads to this segmentation fault.

Am I missing something here?


-- 
Regards,
Rafia Sabih



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
>On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas <hlinnaka@iki.fi>
>wrote:
>>
>> On 08/04/2019 20:37, Andres Freund wrote:
>> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> >> There's a little bug in index-only scan executor node, where it
>mixes up the
>> >> slots to hold a tuple from the index, and from the table. That
>doesn't cause
>> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy
>AM, which
>> >> uses a virtual slot, it caused warnings like this from index-only
>scans:
>> >
>> > Hm. That's another one that I think I had fixed previously :(, and
>then
>> > concluded that it's not actually necessary for some reason. Your
>fix
>> > looks correct to me.  Do you want to commit it? Otherwise I'll look
>at
>> > it after rebasing zheap, and checking it with that.
>>
>> I found another slot type confusion bug, while playing with zedstore.
>In
>> an Index Scan, if you have an ORDER BY key that needs to be
>rechecked,
>> so that it uses the reorder queue, then it will sometimes use the
>> reorder queue slot, and sometimes the table AM's slot, for the scan
>> slot. If they're not of the same type, you get an assertion:
>>
>> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
>> "execExprInterp.c", Line: 1905)
>>
>> Attached is a test for this, again using the toy table AM, extended
>to
>> be able to test this. And a fix.
>>
>> >> Attached is a patch with the toy implementation I used to test
>this. I'm not
>> >> suggesting we should commit that - although feel free to do that
>if you
>> >> think it's useful - but it shows how I bumped into these issues.
>> >
>> > Hm, probably not a bad idea to include something like it. It seems
>like
>> > we kinda would need non-stub implementation of more functions for
>it to
>> > test much / and to serve as an example.  I'm mildy inclined to just
>do
>> > it via zheap / externally, but I'm not quite sure that's good
>enough.
>>
>> Works for me.
>>
>> >> +static Size
>> >> +toyam_parallelscan_estimate(Relation rel)
>> >> +{
>> >> +    ereport(ERROR,
>> >> +                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
>> >> +                     errmsg("function %s not implemented yet",
>__func__)));
>> >> +}
>> >
>> > The other stubbed functions seem like we should require them, but I
>> > wonder if we should make the parallel stuff optional?
>>
>> Yeah, that would be good. I would assume it to be optional.
>>
>I was trying the toyam patch and on make check it failed with
>segmentation fault at
>
>static void
>toyam_relation_set_new_filenode(Relation rel,
> char persistence,
> TransactionId *freezeXid,
> MultiXactId *minmulti)
>{
> *freezeXid = InvalidTransactionId;
>
>Basically, on running create table t (i int, j int) using toytable,
>leads to this segmentation fault.
>
>Am I missing something here?

I assume you got compiler warnings compiling it? The API for some callbacks changed a bit.

Andres
--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:
On Mon, May 6, 2019 at 7:14 AM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> > char persistence,
> > TransactionId *freezeXid,
> > MultiXactId *minmulti)
> >{
> > *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
> I assume you got compiler warnings compiling it? The API for some callbacks changed a bit.

The attached patch gets the toy table AM implementation to match the
latest master API. The patch builds on top of the patch from Heikki in
[1]. It compiles and works, but the test still continues to fail with a
WARNING for the issue mentioned in [1].


I noticed a typo in the recently added comment for relation_set_new_filenode().

     * Note that only the subset of the relcache filled by
     * RelationBuildLocalRelation() can be relied upon and that the relation's
     * catalog entries either will either not yet exist (new relation), or
     * will still reference the old relfilenode.

seems like it should be

     * Note that only the subset of the relcache filled by
     * RelationBuildLocalRelation() can be relied upon and that the relation's
     * catalog entries will either not yet exist (new relation), or still
     * reference the old relfilenode.

Also, I wish to point out that while working on Zedstore, we realized
that the TupleDesc from the Relation object can be trusted at the AM
layer for the scan_begin() API. For the ALTER TABLE rewrite case
(ATRewriteTables()), the catalog is updated first, and hence the
relation object passed to the AM layer reflects the new TupleDesc. For
heapam it's fine, as it doesn't currently use the TupleDesc during
scans at the AM layer in scan_getnextslot(). Hence, the only TupleDesc
which can be trusted, and which matches the on-disk layout of the
tuples being scanned, is the one from the TupleTableSlot. Which is a
little unfortunate, as the TupleTableSlot is only available in
scan_getnextslot(), and not in scan_begin(). This means that if an AM
wishes to do some initialization based on the TupleDesc for scans, it
can't be done in scan_begin() and is forced to be delayed until it has
access to the TupleTableSlot. We should at least add a comment for
scan_begin() to strongly clarify not to trust the Relation object's
TupleDesc. Another alternative would be to have a separate API for the
rewrite case.


[1] https://www.postgresql.org/message-id/9a7fb9cc-2419-5db7-8840-ddc10c93f122%40iki.fi

Attachments

Re: Pluggable Storage - Andres's take

From
Rafia Sabih
Date:
On Mon, 6 May 2019 at 16:14, Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> >On Tue, 9 Apr 2019 at 15:17, Heikki Linnakangas <hlinnaka@iki.fi>
> >wrote:
> >>
> >> On 08/04/2019 20:37, Andres Freund wrote:
> >> > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
> >> >> There's a little bug in index-only scan executor node, where it
> >mixes up the
> >> >> slots to hold a tuple from the index, and from the table. That
> >doesn't cause
> >> >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy
> >AM, which
> >> >> uses a virtual slot, it caused warnings like this from index-only
> >scans:
> >> >
> >> > Hm. That's another one that I think I had fixed previously :(, and
> >then
> >> > concluded that it's not actually necessary for some reason. Your
> >fix
> >> > looks correct to me.  Do you want to commit it? Otherwise I'll look
> >at
> >> > it after rebasing zheap, and checking it with that.
> >>
> >> I found another slot type confusion bug, while playing with zedstore.
> >In
> >> an Index Scan, if you have an ORDER BY key that needs to be
> >rechecked,
> >> so that it uses the reorder queue, then it will sometimes use the
> >> reorder queue slot, and sometimes the table AM's slot, for the scan
> >> slot. If they're not of the same type, you get an assertion:
> >>
> >> TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
> >> "execExprInterp.c", Line: 1905)
> >>
> >> Attached is a test for this, again using the toy table AM, extended
> >to
> >> be able to test this. And a fix.
> >>
> >> >> Attached is a patch with the toy implementation I used to test
> >this. I'm not
> >> >> suggesting we should commit that - although feel free to do that
> >if you
> >> >> think it's useful - but it shows how I bumped into these issues.
> >> >
> >> > Hm, probably not a bad idea to include something like it. It seems
> >like
> >> > we kinda would need non-stub implementation of more functions for
> >it to
> >> > test much / and to serve as an example.  I'm mildy inclined to just
> >do
> >> > it via zheap / externally, but I'm not quite sure that's good
> >enough.
> >>
> >> Works for me.
> >>
> >> >> +static Size
> >> >> +toyam_parallelscan_estimate(Relation rel)
> >> >> +{
> >> >> +    ereport(ERROR,
> >> >> +                    (errcode(ERRCODE_FEATURE_NOT_SUPPORTED),
> >> >> +                     errmsg("function %s not implemented yet",
> >__func__)));
> >> >> +}
> >> >
> >> > The other stubbed functions seem like we should require them, but I
> >> > wonder if we should make the parallel stuff optional?
> >>
> >> Yeah, that would be good. I would assume it to be optional.
> >>
> >I was trying the toyam patch and on make check it failed with
> >segmentation fault at
> >
> >static void
> >toyam_relation_set_new_filenode(Relation rel,
> > char persistence,
> > TransactionId *freezeXid,
> > MultiXactId *minmulti)
> >{
> > *freezeXid = InvalidTransactionId;
> >
> >Basically, on running create table t (i int, j int) using toytable,
> >leads to this segmentation fault.
> >
> >Am I missing something here?
>
> I assume you got compiler warnings compiling it? The API for some callbacks changed a bit.
>
Oh yeah it does.


-- 
Regards,
Rafia Sabih



Re: Pluggable Storage - Andres's take

From
Rafia Sabih
Date:
On Mon, 6 May 2019 at 22:39, Ashwin Agrawal <aagrawal@pivotal.io> wrote:
>
> On Mon, May 6, 2019 at 7:14 AM Andres Freund <andres@anarazel.de> wrote:
> >
> > Hi,
> >
> > On May 6, 2019 3:40:55 AM PDT, Rafia Sabih <rafia.pghackers@gmail.com> wrote:
> > >I was trying the toyam patch and on make check it failed with
> > >segmentation fault at
> > >
> > >static void
> > >toyam_relation_set_new_filenode(Relation rel,
> > > char persistence,
> > > TransactionId *freezeXid,
> > > MultiXactId *minmulti)
> > >{
> > > *freezeXid = InvalidTransactionId;
> > >
> > >Basically, on running create table t (i int, j int) using toytable,
> > >leads to this segmentation fault.
> > >
> > >Am I missing something here?
> >
> > I assume you got compiler warnings compiling it? The API for some callbacks changed a bit.
>
> Attached patch gets toy table AM implementation to match latest master API.
> The patch builds on top of patch from Heikki in [1].
> Compiles and works but the test still continues to fail with WARNING
> for issue mentioned in [1]
>
Thanks Ashwin, this works fine with the mentioned warnings of course.

-- 
Regards,
Rafia Sabih



Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:

On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
>
> Also wish to point out, while working on Zedstore, we realized that
> TupleDesc from Relation object can be trusted at AM layer for
> scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> catalog is updated first and hence the relation object passed to AM
> layer reflects new TupleDesc. For heapam its fine as it doesn't use
> the TupleDesc today during scans in AM layer for scan_getnextslot().
> Only TupleDesc which can trusted and matches the on-disk layout of the
> tuple for scans hence is from TupleTableSlot. Which is little
> unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> and not in scan_begin(). Means if AM wishes to do some initialization
> based on TupleDesc for scans can't be done in scan_begin() and forced
> to delay till has access to TupleTableSlot. We should at least add
> comment for scan_begin() to strongly clarify not to trust Relation
> object TupleDesc. Or maybe other alternative would be have separate
> API for rewrite case.

Just to correct my typo: I meant to say that the TupleDesc from the
Relation object can't be trusted at the AM layer for the scan_begin() API.

Andres, any thoughts on the above? I see you had proposed to "change the
table_beginscan* API so it provides a slot" in [1], but it seems to have
received no response/comments at the time.

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-29 16:17:41 -0700, Ashwin Agrawal wrote:
> On Thu, Apr 25, 2019 at 3:43 PM Andres Freund <andres@anarazel.de> wrote:
> > Hm. I think some of those changes would be a bit bigger than I initially
> > though. Attached is a more minimal fix that'd route
> > RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> > think it's definitely the right answer for 1), probably the pragmatic
> > answer to 2), but certainly not for 3).
> >
> > I've for now made the AM return the size in bytes, and then convert that
> > into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> > are going to continue to want it internally as pages (otherwise there's
> > going to be way too much churn, without a benefit I can see). So I
> > think that's OK.
> 
> I will provide my inputs, Heikki please correct me or add your inputs.
>
> I am not sure how much gain this practically provides, if rest of the
> system continues to use the value returned in-terms of blocks. I
> understand things being block based (and not just really block based
> but all the blocks of relation are storing data and full tuple) is
> engraved in the system. So, breaking out of it is yes much larger
> change and not just limited to table AM API.

I don't think it's that ingrained in all that many parts of the
system. Outside of the places I listed upthread, and the one index case
that stashes extra info, which places are that "block based"?


> I feel most of the issues discussed here should be faced by zheap as
> well, as not all blocks/pages contain data like TPD pages should be
> excluded from sampling and TID scans, etc...

It's not a problem so far, and zheap works on tableam. You can just skip
such blocks during sampling / analyze, and return nothing for tidscans.


> > > 2) commands/analyze.c, computing pg_class.relpages
> > >
> > >    This should imo be moved to the tableam callback. It's currently done
> > >    a bit weirdly imo, with fdws computing relpages the callback, but
> > >    then also returning the acquirefunc. Seems like it should entirely be
> > >    computed as part of calling acquirefunc.
> >
> > Here I'm not sure routing RelationGetNumberOfBlocksForFork() through
> > tableam wouldn't be the right minimal approach too. It has the
> > disadvantage of implying certain values for the
> > RelationGetNumberOfBlocksForFork(MAIN) return value.  The alternative
> > would be to return the desire sampling range in
> > table_beginscan_analyze() - but that'd require some duplication because
> > currently that just uses the generic scan_begin() callback.
> 
> Yes, just routing relation size via AM layer and using its returned
> value in terms of blocks still and performing sampling based on blocks
> based on it, doesn't feel resolves the issue. Maybe need to delegate
> sampling completely to AM layer. Code duplication can be avoided by
> similar AMs (heap and zheap) possible using some common utility
> functions to achieve intended result.

I don't know what this is actually proposing.



> > I suspect - as previously mentioned- that we're going to have to extend
> > statistics collection beyond the current approach at some point, but I
> > don't think that's now. At least to me it's not clear how to best
> > represent the stats, and how to best use them, if the underlying storage
> > is fundamentally not block best.  Nor how we'd avoid code duplication...
> 
> Yes, will have to give more thoughts into this.
> 
> >
> > > 3) nodeTidscan, skipping over too large tids
> > >    I think this should just be moved into the AMs, there's no need to
> > >    have this in nodeTidscan.c
> >
> > I think here it's *not* actually correct at all to use the relation
> > size. It's currently doing:
> >
> >         /*
> >          * We silently discard any TIDs that are out of range at the time of scan
> >          * start.  (Since we hold at least AccessShareLock on the table, it won't
> >          * be possible for someone to truncate away the blocks we intend to
> >          * visit.)
> >          */
> >         nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> >
> > which is fine (except for a certain abstraction leakage) for an AM like
> > heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> > Heikki's approach where tid isn't tied to physical representation.
> 
> Agree, its not nice to have that optimization being performed based on
> number of block in generic layer. I feel its not efficient either for
> zheap too due to TPD pages as mentioned above, as number of blocks
> returned will be higher compared to actually data blocks.

I don't think there's a problem for zheap. The blocks are just
interspersed.

Having pondered this a lot more, I think this is the way to go for
v12. Then we can improve this for v13, which would be nice.


> > The proper fix seems to be to introduce a new scan variant
> > (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> > a scan as a parameter.  But it seems we'd have to introduce that as a
> > separate tableam callback, because we'd not want to incur the overhead
> > of creating an additional scan / RelationGetNumberOfBlocks() checks for
> > triggers et al.
> 
> Thinking out loud here, we can possibly tackle this in multiple ways.
> First above mentioned check seems more optimization to me than
> functionally needed, correct me if wrong. If that's true we can check
> with AM if wish to apply that optimization or not based on relation
> size.

It'd be really expensive to check this differently for heap. We'd have
to check the relation size, which is out of the question imo.

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:
> On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> > Also wish to point out, while working on Zedstore, we realized that
> > TupleDesc from Relation object can be trusted at AM layer for
> > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > catalog is updated first and hence the relation object passed to AM
> > layer reflects new TupleDesc. For heapam its fine as it doesn't use
> > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > Only TupleDesc which can trusted and matches the on-disk layout of the
> > tuple for scans hence is from TupleTableSlot. Which is little
> > unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> > and not in scan_begin(). Means if AM wishes to do some initialization
> > based on TupleDesc for scans can't be done in scan_begin() and forced
> > to delay till has access to TupleTableSlot. We should at least add
> > comment for scan_begin() to strongly clarify not to trust Relation
> > object TupleDesc. Or maybe other alternative would be have separate
> > API for rewrite case.
> 
> Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> be trusted at AM layer for scan_begin() API.
> 
> Andres, any thoughts on above. I see you had proposed "change the
> table_beginscan* API so it
> provides a slot" in [1], but seems received no response/comments that time.
> [1]
> https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de

I don't think passing a slot at beginscan time is a good idea. There's
several places that want to use different slots for the same scan, and
we probably want to increase that over time (e.g. for batching), not
decrease it.

What kind of initialization do you want to do based on the tuple desc at
beginscan time?

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:
On Wed, May 8, 2019 at 2:46 PM Andres Freund <andres@anarazel.de> wrote:
>
> Hi,
>
> On 2019-05-07 23:18:39 -0700, Ashwin Agrawal wrote:
> > On Mon, May 6, 2019 at 1:39 PM Ashwin Agrawal <aagrawal@pivotal.io> wrote:
> > > Also wish to point out, while working on Zedstore, we realized that
> > > TupleDesc from Relation object can be trusted at AM layer for
> > > scan_begin() API. As for ALTER TABLE rewrite case (ATRewriteTables()),
> > > catalog is updated first and hence the relation object passed to AM
> > > layer reflects new TupleDesc. For heapam its fine as it doesn't use
> > > the TupleDesc today during scans in AM layer for scan_getnextslot().
> > > Only TupleDesc which can trusted and matches the on-disk layout of the
> > > tuple for scans hence is from TupleTableSlot. Which is little
> > > unfortunate as TupleTableSlot is only available in scan_getnextslot(),
> > > and not in scan_begin(). Means if AM wishes to do some initialization
> > > based on TupleDesc for scans can't be done in scan_begin() and forced
> > > to delay till has access to TupleTableSlot. We should at least add
> > > comment for scan_begin() to strongly clarify not to trust Relation
> > > object TupleDesc. Or maybe other alternative would be have separate
> > > API for rewrite case.
> >
> > Just to correct my typo, I wish to say, TupleDesc from Relation object can't
> > be trusted at AM layer for scan_begin() API.
> >
> > Andres, any thoughts on above. I see you had proposed "change the
> > table_beginscan* API so it
> > provides a slot" in [1], but seems received no response/comments that time.
> > [1]
> > https://www.postgresql.org/message-id/20181211021340.mqaown4njtcgrjvr%40alap3.anarazel.de
>
> I don't think passing a slot at beginscan time is a good idea. There's
> several places that want to use different slots for the same scan, and
> we probably want to increase that over time (e.g. for batching), not
> decrease it.
>
> What kind of initialization do you want to do based on the tuple desc at
> beginscan time?

For Zedstore (a column store), we need to allocate a map (array or
bitmask) to mark which columns to project for the scan. We also need to
allocate AM-internal scan descriptors corresponding to the number of
attributes in the scan. Hence, we need access to the number of
attributes involved in the scan. Currently, since we are not able to
trust the Relation's TupleDesc, for Zedstore we worked around this by
allocating these things on the first call to getnextslot, when we have
access to the slot (by switching to the memory context used during
scan_begin()).



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:
> Hm. I think some of those changes would be a bit bigger than I initially
> though. Attached is a more minimal fix that'd route
> RelationGetNumberOfBlocksForFork() through tableam if necessary.  I
> think it's definitely the right answer for 1), probably the pragmatic
> answer to 2), but certainly not for 3).

> I've for now made the AM return the size in bytes, and then convert that
> into blocks in RelationGetNumberOfBlocksForFork(). Most postgres callers
> are going to continue to want it internally as pages (otherwise there's
> going to be way too much churn, without a benefit I can see). So I
> think that's OK.
> 
> There's also a somewhat weird bit of returning the total relation size
> for InvalidForkNumber - it's pretty likely that other AMs wouldn't use
> postgres' current forks, but have equivalent concepts. And without that
> there'd be no way to get that size.  I'm not sure I like this, input
> welcome. But it seems good to offer the ability to get the entire size
> somehow.

I'm still reasonably happy with this.  I'll polish it a bit and push.


> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
> 
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
> 
>     /*
>      * We silently discard any TIDs that are out of range at the time of scan
>      * start.  (Since we hold at least AccessShareLock on the table, it won't
>      * be possible for someone to truncate away the blocks we intend to
>      * visit.)
>      */
>     nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
> 
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
> 
> 
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
> 
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that just means we can use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. Which is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF) - which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?
Unless somebody protests, I'm going to push something along those lines
quite soon.

Greetings,

Andres Freund

Attachments

Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:

On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-04-25 15:43:15 -0700, Andres Freund wrote:

> > 3) nodeTidscan, skipping over too large tids
> >    I think this should just be moved into the AMs, there's no need to
> >    have this in nodeTidscan.c
>
> I think here it's *not* actually correct at all to use the relation
> size. It's currently doing:
>
>       /*
>        * We silently discard any TIDs that are out of range at the time of scan
>        * start.  (Since we hold at least AccessShareLock on the table, it won't
>        * be possible for someone to truncate away the blocks we intend to
>        * visit.)
>        */
>       nblocks = RelationGetNumberOfBlocks(tidstate->ss.ss_currentRelation);
>
> which is fine (except for a certain abstraction leakage) for an AM like
> heap or zheap, but I suspect strongly that that's not ok for Ashwin &
> Heikki's approach where tid isn't tied to physical representation.
>
>
> The obvious answer would be to just move that check into the
> table_fetch_row_version implementation (currently just calling
> heap_fetch()) - but that doesn't seem OK from a performance POV, because
> we'd then determine the relation size once for each tid, rather than
> once per tidscan.  And it'd also check in cases where we know the tid is
> supposed to be valid (e.g. fetching trigger tuples and such).
>
> The proper fix seems to be to introduce a new scan variant
> (e.g. table_beginscan_tid()), and then have table_fetch_row_version take
> a scan as a parameter.  But it seems we'd have to introduce that as a
> separate tableam callback, because we'd not want to incur the overhead
> of creating an additional scan / RelationGetNumberOfBlocks() checks for
> triggers et al.

Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that means we can just use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. That is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF), which previously computed the relation size once per
tuple.

Needs a bit of polishing, but I think this is the right direction?

High-level, this looks good to me. Will look into full details tomorrow. This aligns with the high-level thought I had raised, but implemented in a much better way: consult the AM on whether to perform the optimization or not. So, using the new table_tuple_tid_valid() callback, an AM can either implement some way to validate a TID and optimize the scan, or, if it has no way to check based on the scan descriptor, decide to always return true and let table_fetch_row_version() handle things.
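The two callback strategies described here can be sketched in miniature. This is an illustrative model only, not PostgreSQL code: the struct fields and function names below are simplified stand-ins for the real TableScanDesc / ItemPointer machinery.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for PostgreSQL's TableScanDesc and ItemPointer. */
typedef struct TableScanDesc
{
    uint32_t rs_nblocks;        /* relation size, captured once at scan start */
} TableScanDesc;

typedef struct ItemPointer
{
    uint32_t block;
    uint16_t offset;
} ItemPointer;

/*
 * heap-style AM: a TID can be rejected cheaply if its block number is
 * beyond the relation size recorded when the scan began, so the relation
 * size need not be recomputed per tuple.
 */
static bool
heap_tuple_tid_valid(const TableScanDesc *scan, const ItemPointer *tid)
{
    return tid->block < scan->rs_nblocks;
}

/*
 * AM with no cheap scan-level check: always report the TID as valid and
 * rely on the later table_fetch_row_version() to fail for bogus TIDs.
 */
static bool
opaque_tuple_tid_valid(const TableScanDesc *scan, const ItemPointer *tid)
{
    (void) scan;
    (void) tid;
    return true;
}
```

With this shape, nodeTidscan can call the callback once per TID while the relation size is computed only once, at scan start.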

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> High-level, this looks good to me. Will look into full details tomorrow.

Ping?

I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...

- Andres



Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:

On Fri, May 17, 2019 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
Hi,

On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> High-level, this looks good to me. Will look into full details tomorrow.

Ping?

I'll push the first of the patches soon, and unless you'll comment on
the second soon, I'll also push ahead. There's a beta upcoming...

Sorry for the delay, I didn't get to it yesterday. I looked into both patches; they both look good to me, thank you.

The relation size API still doesn't address the analyze case, as you mentioned, but that is something we can improve on later.

Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:
On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi> wrote:
On 08/04/2019 20:37, Andres Freund wrote:
> On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>> There's a little bug in index-only scan executor node, where it mixes up the
>> slots to hold a tuple from the index, and from the table. That doesn't cause
>> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy AM, which
>> uses a virtual slot, it caused warnings like this from index-only scans:
>
> Hm. That's another one that I think I had fixed previously :(, and then
> concluded that it's not actually necessary for some reason. Your fix
> looks correct to me.  Do you want to commit it? Otherwise I'll look at
> it after rebasing zheap, and checking it with that.

I found another slot type confusion bug, while playing with zedstore. In
an Index Scan, if you have an ORDER BY key that needs to be rechecked,
so that it uses the reorder queue, then it will sometimes use the
reorder queue slot, and sometimes the table AM's slot, for the scan
slot. If they're not of the same type, you get an assertion:

TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
"execExprInterp.c", Line: 1905)

Attached is a test for this, again using the toy table AM, extended to
be able to test this. And a fix.
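The failure mode can be modeled in miniature (hypothetical stand-in types; only the tts_ops comparison is modeled, not real executor code): a compiled expression step records the slot ops it was compiled for, and a slot of a different type fails the check.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for TupleTableSlotOps / TupleTableSlot / a fetch step. */
typedef struct TupleTableSlotOps { const char *name; } TupleTableSlotOps;

static const TupleTableSlotOps TTSOpsVirtual = { "virtual" };
static const TupleTableSlotOps TTSOpsHeapTuple = { "heap" };

typedef struct TupleTableSlot { const TupleTableSlotOps *tts_ops; } TupleTableSlot;
typedef struct FetchStep { const TupleTableSlotOps *kind; } FetchStep;

/* Models the Assert(op->d.fetch.kind == slot->tts_ops) check. */
static bool
fetch_kind_matches(const FetchStep *op, const TupleTableSlot *slot)
{
    return op->kind == slot->tts_ops;
}
```

An index scan that sometimes returns the reorder queue's slot and sometimes the table AM's slot hands the same compiled step two different tts_ops, which is exactly the mismatch the assertion catches.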

It seems the two patches from email [1] fixing slot confusion in Index Scans are pending to be committed.

[1] https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi

Re: Pluggable Storage - Andres's take

From
Ashwin Agrawal
Date:

On Wed, May 15, 2019 at 11:54 AM Andres Freund <andres@anarazel.de> wrote:
Attached is a prototype of a variation of this. I added a
table_tuple_tid_valid(TableScanDesc sscan, ItemPointer tid)
callback / wrapper. Currently it just takes a "plain" scan, but we could
add a separate table_beginscan variant too.

For heap that means we can just use HeapScanDesc's rs_nblocks to
filter out invalid tids, and we only need to call
RelationGetNumberOfBlocks() once, rather than on every
table_tuple_tid_valid() / table_get_latest_tid() call. That is a good
improvement for nodeTidscan's table_get_latest_tid() call (for WHERE
CURRENT OF), which previously computed the relation size once per
tuple.

A question on the patch, if it's not too late:
Why call table_beginscan() in TidNext() and not in ExecInitTidScan()? It seems cleaner to have it in ExecInitTidScan().

Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-05-17 16:56:04 -0700, Ashwin Agrawal wrote:
> A question on the patch, if it's not too late:
> Why call table_beginscan() in TidNext() and not in ExecInitTidScan()?
> It seems cleaner to have it in ExecInitTidScan().

Largely because it's symmetrical to where most other scans are started
(cf. nodeSeqscan.c, nodeIndexscan.c). But also, there's no need to incur
the cost of a smgrnblocks() etc when the node might never actually be
reached during execution.
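The lazy-start pattern described above can be sketched as follows (illustrative only; the names are simplified stand-ins, not executor code): the expensive scan setup is deferred from node initialization to the first tuple fetch, so a node that is never reached never pays for it.

```c
#include <assert.h>
#include <stdbool.h>

static int scans_started = 0;   /* stands in for smgrnblocks() etc. cost */

typedef struct TidScanState
{
    bool scan_started;
} TidScanState;

/* ExecInitTidScan() analogue: cheap, no scan startup cost here. */
static void
init_node(TidScanState *node)
{
    node->scan_started = false;
}

/* TidNext() analogue: pay the startup cost only on the first fetch. */
static void
next_tuple(TidScanState *node)
{
    if (!node->scan_started)
    {
        scans_started++;        /* table_beginscan() would happen here */
        node->scan_started = true;
    }
    /* ... fetch the next tuple ... */
}
```

A node initialized but never executed (e.g. a never-reached subplan) thus never triggers the relation-size computation.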

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Andres Freund
Date:
Hi,

On 2019-05-17 14:49:19 -0700, Ashwin Agrawal wrote:
> On Fri, May 17, 2019 at 12:54 PM Andres Freund <andres@anarazel.de> wrote:
> 
> > Hi,
> >
> > On 2019-05-15 23:00:38 -0700, Ashwin Agrawal wrote:
> > > High-level, this looks good to me. Will look into full details tomorrow.
> >
> > Ping?
> >
> > I'll push the first of the patches soon, and unless you'll comment on
> > the second soon, I'll also push ahead. There's a beta upcoming...
> >
> 
> Sorry for the delay, I didn't get to it yesterday. I looked into both
> patches; they both look good to me, thank you.

Pushed both now.


> The relation size API still doesn't address the analyze case, as you
> mentioned, but that is something we can improve on later.

I'm much less concerned about that. You can just return a reasonable
block size from the size callback, and it'll work for block sampling
(and you can just skip pages in the analyze callback if needed, e.g. for
zheap's tpd pages). And we assume that a reasonable block size is
returned by the size callback anyway, for planning purposes (both in
relpages and for estimate_rel_size).  We'll probably want to improve
that some day, but it doesn't strike me as hugely urgent.
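A rough sketch of the sampling idea (hypothetical helper names; the skip rule is made up for illustration): the size callback supplies a block count for block sampling, and the AM's analyze callback is free to skip blocks it doesn't want sampled, such as zheap's TPD pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up rule standing in for "this block holds AM metadata, skip it". */
static bool
is_metadata_block(uint32_t blkno)
{
    return blkno % 10 == 0;
}

/*
 * Model of block sampling: visit every step-th block up to the count the
 * size callback reported, letting the analyze callback skip some blocks.
 */
static int
analyze_sample_blocks(uint32_t nblocks, uint32_t step)
{
    int sampled = 0;

    for (uint32_t blkno = 0; blkno < nblocks; blkno += step)
    {
        if (is_metadata_block(blkno))
            continue;           /* the AM's analyze callback skips this page */
        sampled++;
    }
    return sampled;
}
```

The sampler only needs a plausible block count from the size callback; which individual pages contribute rows remains entirely the AM's decision.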

Greetings,

Andres Freund



Re: Pluggable Storage - Andres's take

From
Heikki Linnakangas
Date:
On 18/05/2019 01:19, Ashwin Agrawal wrote:
> On Tue, Apr 9, 2019 at 6:17 AM Heikki Linnakangas <hlinnaka@iki.fi 
> <mailto:hlinnaka@iki.fi>> wrote:
> 
>     On 08/04/2019 20:37, Andres Freund wrote:
>      > On 2019-04-08 15:34:46 +0300, Heikki Linnakangas wrote:
>      >> There's a little bug in index-only scan executor node, where it
>     mixes up the
>      >> slots to hold a tuple from the index, and from the table. That
>     doesn't cause
>      >> any ill effects if the AM uses TTSOpsHeapTuple, but with my toy
>     AM, which
>      >> uses a virtual slot, it caused warnings like this from
>     index-only scans:
>      >
>      > Hm. That's another one that I think I had fixed previously :(,
>     and then
>      > concluded that it's not actually necessary for some reason. Your fix
>      > looks correct to me.  Do you want to commit it? Otherwise I'll
>     look at
>      > it after rebasing zheap, and checking it with that.
> 
>     I found another slot type confusion bug, while playing with
>     zedstore. In
>     an Index Scan, if you have an ORDER BY key that needs to be rechecked,
>     so that it uses the reorder queue, then it will sometimes use the
>     reorder queue slot, and sometimes the table AM's slot, for the scan
>     slot. If they're not of the same type, you get an assertion:
> 
>     TRAP: FailedAssertion("!(op->d.fetch.kind == slot->tts_ops)", File:
>     "execExprInterp.c", Line: 1905)
> 
>     Attached is a test for this, again using the toy table AM, extended to
>     be able to test this. And a fix.
> 
> 
> It seems the two patches from email [1] fixing slot confusion in Index 
> Scans are pending to be committed.
> 
> 1] 
> https://www.postgresql.org/message-id/e71c4da4-3e82-cc4f-32cc-ede387fac8b0%40iki.fi

Pushed the first patch now. Andres already fixed the second issue in 
commit b8b94ea129.

- Heikki



Re: Pluggable Storage - Andres's take

From
Alvaro Herrera
Date:
On 2019-Jun-06, Heikki Linnakangas wrote:

> Pushed the first patch now. Andres already fixed the second issue in commit
> b8b94ea129.

Please don't omit the "Discussion:" tag in commit messages.

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



default_table_access_method is not in sample config file

From
Heikki Linnakangas
Date:
On 11/04/2019 19:49, Andres Freund wrote:
> On 2019-04-11 14:52:40 +0300, Heikki Linnakangas wrote:
>> diff --git a/src/backend/utils/misc/guc.c b/src/backend/utils/misc/guc.c
>> index f7f726b5aec..bbcab9ce31a 100644
>> --- a/src/backend/utils/misc/guc.c
>> +++ b/src/backend/utils/misc/guc.c
>> @@ -3638,7 +3638,7 @@ static struct config_string ConfigureNamesString[] =
>>           {"default_table_access_method", PGC_USERSET, CLIENT_CONN_STATEMENT,
>>               gettext_noop("Sets the default table access method for new tables."),
>>               NULL,
>> -            GUC_IS_NAME
>> +            GUC_NOT_IN_SAMPLE | GUC_IS_NAME
>>           },
>>           &default_table_access_method,
>>           DEFAULT_TABLE_ACCESS_METHOD,
> 
> Hm, I think we should rather add it to sample. That's an oversight, not
> intentional.

I just noticed that this is still an issue. default_table_access_method 
is not in the sample config file, and it's not marked with 
GUC_NOT_IN_SAMPLE. I'll add this to the open items list so we don't forget.

- Heikki



Re: default_table_access_method is not in sample config file

From
Michael Paquier
Date:
On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> On 11/04/2019 19:49, Andres Freund wrote:
>> Hm, I think we should rather add it to sample. That's an oversight, not
>> intentional.
>
> I just noticed that this is still an issue. default_table_access_method is
> not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> I'll add this to the open items list so we don't forget.

I think that we should give it the same visibility as default_tablespace,
so adding it to the sample file sounds good to me.
--
Michael

Attachments

Re: default_table_access_method is not in sample config file

From
Andres Freund
Date:
On 2019-08-13 15:03:13 +0900, Michael Paquier wrote:
> On Fri, Aug 09, 2019 at 11:34:05AM +0300, Heikki Linnakangas wrote:
> > On 11/04/2019 19:49, Andres Freund wrote:
> >> Hm, I think we should rather add it to sample. That's an oversight, not
> >> intentional.
> > 
> > I just noticed that this is still an issue. default_table_access_method is
> > not in the sample config file, and it's not marked with GUC_NOT_IN_SAMPLE.
> > I'll add this to the open items list so we don't forget.

Thanks!


> I think that we should give it the same visibility as default_tablespace,
> so adding it to the sample file sounds good to me.

> diff --git a/src/backend/utils/misc/postgresql.conf.sample b/src/backend/utils/misc/postgresql.conf.sample
> index 65a6da18b3..39fc787851 100644
> --- a/src/backend/utils/misc/postgresql.conf.sample
> +++ b/src/backend/utils/misc/postgresql.conf.sample
> @@ -622,6 +622,7 @@
>  #default_tablespace = ''        # a tablespace name, '' uses the default
>  #temp_tablespaces = ''            # a list of tablespace names, '' uses
>                      # only default tablespace
> +#default_table_access_method = 'heap'

Pushed, thanks.


>  #check_function_bodies = on
>  #default_transaction_isolation = 'read committed'
>  #default_transaction_read_only = off

Hm.  I find the current ordering there a bit weird. Unrelated to your
proposed change.  The header of the group is

#------------------------------------------------------------------------------
# CLIENT CONNECTION DEFAULTS
#------------------------------------------------------------------------------

# - Statement Behavior -

but I don't quite see GUCs like default_tablespace, search_path (due to
its determining a created table's schema), temp_tablespaces, and
default_table_access_method fitting reasonably well under that heading.
They can all affect persistent state. That seems pretty different from a
number of other settings (client_min_messages,
default_transaction_isolation, lock_timeout, ...) which only have
transient effects.

Should we perhaps split that group? Not that I have a good proposal for
better names.

Greetings,

Andres Freund



Re: default_table_access_method is not in sample config file

From
Michael Paquier
Date:
On Fri, Aug 16, 2019 at 03:29:30PM -0700, Andres Freund wrote:
> but I don't quite see GUCs like default_tablespace, search_path (due to
> its determining a created table's schema), temp_tablespaces, and
> default_table_access_method fitting reasonably well under that heading.
> They can all affect persistent state. That seems pretty different from a
> number of other settings (client_min_messages,
> default_transaction_isolation, lock_timeout, ...) which only have
> transient effects.

Agreed.

> Should we perhaps split that group? Not that I have a good proposal for
> better names.

We could have a section for transaction-related parameters, and move
the vacuum ones into the autovacuum section so that they get grouped
together, renaming that section "autovacuum and vacuum".  For
search_path, temp_tablespaces, default_tablespace & co, a group name
could be "object parameters" or "relation parameters", covering all
the parameters that affect object definitions.
--
Michael

Attachments