Thread: Volunteer: Large Tuples / Tuple chaining

Volunteer: Large Tuples / Tuple chaining

From
Christof Petig
Date:
Hello,

I'll donate some (read: all that is freely available) of my spare time
to implementing tuple chaining. It looks like this feature is most
wanted and it would be a pity to hold this until post 7.0. Personally I
don't need it, yet ... But I will definitely find a use for it once
available ;-) And it looks like a good start for hacking on pgsql.

I have already dived into the depths of pgsql's page and tuple
structures and it looks like it is possible. But before I start coding I
would like to hear some more experienced opinions on how to implement it.

Have you already discussed the technical side of the implementation? How
can I catch up on it? (Simply browse the mailing list archives?)

Here's an outline of how I imagine the work:

What is needed:
- lay out a tuple continuation structure
- put tuple into multiple chunks when pages are considered, reconcile
  when loaded from disk
  (how to continue a tuple - need a structure)
  how is a tuple (read page item) addressed? ItemPointerData
  I imagine to store a continuation address as the last bytes of the
  tuple unless it fits into one page.
  I need to mark large tuples (how, just one flag in tuple)
  How to tell a maximum possible size last block from a continued
  (which carries a pointer to the next one at its end)?
  Or don't care: make item continued and put last 6(?) bytes into a new
  block
- note that the continued tuples are not referenced directly (vacuum?)
  mark them as used. I hope vacuum operates on a tuple basis and has no
  concept of pages
- I guess that the tuple pointer points into page memory, if multiple
  pages are concatenated for a tuple, these pages must not reside in
  memory but the full tuple's memory must be allocated (from a memory
  similar to pages) (shared mem?)
- should be possible for memory only pages
  see PageGetPageSize but od_pagesize is 16bit!
  Reuse another variable? Another type of page? (32bit od_pagesize)
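
The continuation idea in the list above can be sketched roughly as follows. This is only an illustration under stated assumptions, not PostgreSQL code: `Chunk`, `ChunkPointer`, `FLAG_CONTINUED`, `split_tuple` and `join_tuple` are hypothetical names, the item capacity is a toy value, and the "address" of the next chunk is simply an index into an in-memory array. It follows the variant where one flag marks a continued item and only continued items give up their last 6 bytes to the pointer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ITEM_CAPACITY  64     /* toy item size; real pages are far larger */
#define CONT_PTR_SIZE  6      /* size of an ItemPointerData-style address */
#define FLAG_CONTINUED 0x01   /* hypothetical "more chunks follow" flag */

/* Hypothetical 6-byte address of the next chunk: block number + item. */
typedef struct {
    uint16_t block_hi, block_lo;  /* 32-bit block number in two halves */
    uint16_t offset;              /* item number within that block */
} ChunkPointer;

_Static_assert(sizeof(ChunkPointer) == CONT_PTR_SIZE, "pointer must be 6 bytes");

typedef struct {
    uint8_t  flags;
    uint16_t len;                  /* payload bytes in data[] */
    uint8_t  data[ITEM_CAPACITY];  /* payload, then the pointer if continued */
} Chunk;

/* Split a tuple into chunks; returns the chunk count, or 0 if out[] is too small. */
size_t split_tuple(const uint8_t *tuple, size_t len, Chunk *out, size_t max_chunks)
{
    size_t n = 0;
    while (n < max_chunks) {
        Chunk *c = &out[n];
        if (len <= ITEM_CAPACITY) {       /* final chunk: no pointer needed */
            c->flags = 0;
            c->len = (uint16_t) len;
            memcpy(c->data, tuple, len);
            return n + 1;
        }
        /* Continued chunk: reserve the last 6 bytes for the next address. */
        size_t payload = ITEM_CAPACITY - CONT_PTR_SIZE;
        ChunkPointer next = { 0, 0, (uint16_t) (n + 1) };
        c->flags = FLAG_CONTINUED;
        c->len = (uint16_t) payload;
        memcpy(c->data, tuple, payload);
        memcpy(c->data + payload, &next, CONT_PTR_SIZE);
        tuple += payload;
        len -= payload;
        n++;
    }
    return 0;
}

/* Follow the chain from chunks[0] and rebuild the tuple in out[]. */
size_t join_tuple(const Chunk *chunks, uint8_t *out)
{
    size_t total = 0, i = 0;
    for (;;) {
        const Chunk *c = &chunks[i];
        memcpy(out + total, c->data, c->len);
        total += c->len;
        if (!(c->flags & FLAG_CONTINUED))
            return total;
        ChunkPointer next;
        memcpy(&next, c->data + c->len, CONT_PTR_SIZE);
        i = next.offset;
    }
}
```

With the flag in place, a final chunk may use the full item capacity, which is one possible answer to the question of how to tell a maximum-size last block from a continued one.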
 
Very fascinated by this large beast of ancient code to explore
      Christof

PS: I think the documentation on page layout is far outdated (or points
into the future since it speaks about ItemContinuationData structures.)
Should I update it?
The table doesn't match actual structure components. At least I don't
understand what it's about. The source code mentions a different page
layout.

PPS: Do not pity me, I have ten+ years of coding experience in C.

PPPS: Could someone tell me in a few words what an access method is (a
tuple is an access method, log pages are another?)



RE: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: owner-pgsql-hackers@postgreSQL.org 
> [mailto:owner-pgsql-hackers@postgreSQL.org]On Behalf Of Christof Petig
> 
> Hello,
> 
> I'll donate some (read all freely available) of my spare time to
> implementing tuple
> chaining. It looks like this feature is most wanted and it would be a
> pity to hold this until post 7.0. Personally I don't need it, yet ...
> But I will definitely find a use for it once available ;-) And it looks
> like a good start for hacking on pgsql.
> 
> I already dived into the depth of pgsql's page and tuple structures and
> it looks like it is possible. But before I start coding I would like to
> hear some more experienced opinions on how to implement it.
>

Will you put a long tuple into a long logical page (continued over
multiple physical pages)?
I'm suspicious of any approach that allows non-page-formatted pages.

Anyway, it would need a big change around bufmgr/smgr etc.
Could someone estimate the influence/danger before going forward?

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp



Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
Bruce Momjian
Date:
Thanks.  Seems like Jan is going to be doing this.


> Hello,
> 
> I'll donate some (read all freely available) of my spare time to
> implementing tuple
> chaining. It looks like this feature is most wanted and it would be a
> pity to hold this until post 7.0. Personally I don't need it, yet ...
> But I will definitely find a use for it once available ;-) And it looks
> like a good start for hacking on pgsql.
> 
> I already dived into the depth of pgsql's page and tuple structures and
> it looks like it is possible. But before I start coding I would like to
> hear some more experienced opinions on how to implement it.
> 
> Did you alread discuss technical matters about the implementation? How
> can I get in touch with it? (Simply browse the mailing list archives?)
> 
> Here's a layout how I imagine the work:
> 
> What is needed:
> - lay out a tuple continuation structure
> - put tuple into multiple chunks when pages are considered, reconcile
> when
>   loaded from disk
>   (how to continue a tuple - need a structure)
>   how is a tuple (read page item) addressed? ItemPointerData
>   I imagine to store a continuation address as the last bytes of the
> tuple unless it
>   fits into one page.
>   I need to mark large tuples (how, just one flag in tuple)
>   How to tell a maximum possible size last block from a continued 
>   (which carries a pointer to the next one at its end)? 
>   Or don't care: make item continued and put last 6(?) bytes into a new
> block
> - note that the continued tuples are not referenced directly (vacuum?)
>   mark them as used. I hope vacuum operates on a tuple basis and has no
> concept of
>   pages
> - I guess that the tuple pointer points into page memory, if multiple
> pages 
>   are concatenated for a tuple, these pages must not reside in memory
> but
>   the full tuple's memory must be allocated (from a memory similar to
> pages)
>   (shared mem?)
> - should be possible for memory only pages 
>   see PageGetPageSize but od_pagesize is 16bit!
>   Reuse another variable? Another type of page? (32bit od_pagesize)
>   
> Very fascinated by this large beast of ancient code to explore
>       Christof
> 
> PS: I think the documentation on page layout is far outdated (or points
> into the future since it speaks about ItemContinuationData structures.)
> Should I update it?
> The table doesn't match actual structure components. At least I don't
> understand what it's about. The source code mentions a different page
> layout.
> 
> PPS: Do not pity me, I have ten+ years of coding experience in C.
> 
> PPPS: Could someone in few words tell me what an access method is (a
> tuple is an access method, log pages are another?)
> 
> 
> ************
> 


--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 


Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
Christof Petig
Date:
Hiroshi Inoue wrote:
> 
> Will you put a long tuple into a long logical page(continued multiple
> phisical(?) pages) ?
> I'm suspicious about the way that allows non-page-formatted page.
> 
> Anyway it would need a big change around bufmgr/smgr etc.
> Could someone estimate the influence/danger before going forward ?
> 

I planned to use as many of PostgreSQL's data structures unaltered as
possible. Storing one Tuple in multiple Items should not pose too much
danger to bufmgr and smgr unless they access tuple internals. (I didn't
check that yet). This would mean that on-disk Items no longer
correspond to Tuples. (Some of them might form one tuple).

I dropped the plan of unformatted pages very soon. But the issue of
in-memory tuple storage remains (I don't know the internals of
allocating/freeing, yet).

Christof




RE: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: christof@to.wtal.de [mailto:christof@to.wtal.de]On Behalf Of
> Christof Petig
> 
> Hiroshi Inoue wrote:
> > 
> > Will you put a long tuple into a long logical page(continued multiple
> > phisical(?) pages) ?
> > I'm suspicious about the way that allows non-page-formatted page.
> > 
> > Anyway it would need a big change around bufmgr/smgr etc.
> > Could someone estimate the influence/danger before going forward ?
> > 
> 
> I planned to use as many of PostgreSQL data structures unaltered as
> possible. Storing one Tuple in multiple Items should not pose too much
> danger on bufmgr and smgr unless they access tuple internals. (I didn't
> check that yet). This would mean that on disk Items do no longer
> correspond to Tuples. (Some of them might form one tuple).
>

Hmm, we have discussed LONG.
The change made by LONG is transparent to users and would mostly
resolve the big tuple problem.
I doubt that tuple chaining is worth the work now.

At least a consensus is needed before going ahead, I think.
A bad design would only introduce confusion.

Regards.

Hiroshi Inoue
Inoue@tpf.co.jp 


Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
Bruce Momjian
Date:
> > I planned to use as many of PostgreSQL data structures unaltered as
> > possible. Storing one Tuple in multiple Items should not pose too much
> > danger on bufmgr and smgr unless they access tuple internals. (I didn't
> > check that yet). This would mean that on disk Items do no longer
> > correspond to Tuples. (Some of them might form one tuple).
> >
> 
> Hmm,we have discussed about LONG.
> Change by LONG is transparent to users and would resolve
> the big tuple problem mostly.
> I'm suspicious that tuple chaining is worth the work now.
> 
> At least a consensus is needed before going,I think.
> Bad design would only introduce a confusion.

Agreed.

 


Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
wieck@debis.com (Jan Wieck)
Date:
Bruce Momjian wrote:

> > > I planned to use as many of PostgreSQL data structures unaltered as
> > > possible. Storing one Tuple in multiple Items should not pose too much
> > > danger on bufmgr and smgr unless they access tuple internals. (I didn't
> > > check that yet). This would mean that on disk Items do no longer
> > > correspond to Tuples. (Some of them might form one tuple).
> > >
> >
> > Hmm,we have discussed about LONG.
> > Change by LONG is transparent to users and would resolve
> > the big tuple problem mostly.
> > I'm suspicious that tuple chaining is worth the work now.
> >
> > At least a consensus is needed before going,I think.
> > Bad design would only introduce a confusion.
>
> Agreed.

Me too.

    I think that only a combination of LONG attributes and split
    tuples will be a complete solution.

    What I'm worried about is making the segments of a large
    tuple specialized things in the main table. The reliability
    of vacuum is one of the most important things for any system
    in production. While the general operation of vacuum seems to
    be well known, its requirements for atomicity of some actions
    appear to be less so. The more chunks a tuple consists of,
    the more likely an abort of vacuum in the middle of moving
    them becomes. So keeping the links of chained tuples intact
    in a fail-safe way is IMHO an issue that has been a little
    underestimated in this discussion.

    Maybe we can split tuples in another way, must think about it
    for another hour - 'til later.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #

Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
Tom Lane
Date:
wieck@debis.com (Jan Wieck) writes:
>     I  think that only a combination of LONG attributes and split
>     tuples will be a complete solution.

If we can do a good job with long attributes, I really think we
will not need to have split tuples too.

You'd be able to put perhaps 400 LONG attributes into an 8K tuple,
more than that if they are float8 or int or bool attributes.
If someone needs tables with even more columns than that, they
could bump BLCKSZ up to 32K and quadruple the number of columns.
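
The arithmetic behind these figures can be sketched as follows. The 20-byte size of a LONG reference is an assumption inferred from "400 LONG attributes into an 8K tuple"; it is not a documented constant, and `max_long_attrs` is a hypothetical helper that ignores page and tuple header overhead.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: each LONG attribute leaves a fixed-size reference of about
 * 20 bytes in the main tuple (inferred from "400 in 8K", not a real constant). */
#define ASSUMED_LONG_REF_SIZE 20

/* Upper bound on the number of LONG references that fit in one block,
 * ignoring page and tuple header overhead. */
size_t max_long_attrs(size_t blcksz, size_t ref_size)
{
    assert(ref_size > 0);
    return blcksz / ref_size;
}
```

Under that assumption, `max_long_attrs(8192, ASSUMED_LONG_REF_SIZE)` gives 409 and `max_long_attrs(32768, ASSUMED_LONG_REF_SIZE)` gives 1638, which matches the "perhaps 400" estimate and the fourfold gain from a 32K block.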

How many people are really going to be bumping into that limit?
Is it worth the work and reliability risk to support long tuples
for a few applications that are about three sigmas out on the bell
curve?  I doubt it.

I think the effort this would take would be *much* more profitably
spent on tuning the LONG-attribute support.  If we can make that
fast and robust, we will have very few complaints.
        regards, tom lane


Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
wieck@debis.com (Jan Wieck)
Date:
Tom Lane wrote:

> wieck@debis.com (Jan Wieck) writes:
> >     I  think that only a combination of LONG attributes and split
> >     tuples will be a complete solution.
>
> If we can do a good job with long attributes, I really think we
> will not need to have split tuples too.

    I really hope so, because there will be very severe problems
    coming up with a real tuple split at arbitrary cut points
    that can occur somewhere in the middle of an attribute.
    Arbitrary cut points are the only way to support single
    values over BLCKSZ.

    To name just one problem, the scan key tests during
    heap_getnext() are handed down into heapgettup() and
    performed with HeapTupleSatisfies, a macro using the
    in-buffer tuple here. IIRC it was turned into a macro in one
    of our last releases for performance reasons.

    If now faced with a tuple living in multiple pages, these
    checks will need to reconstruct the tuple in memory, to
    concatenate the attributes properly again.

    This now needs to lock multiple buffers at once during
    heapgettup(), where I'm not sure whether they must all keep
    the bumped refcount when returning the tuple or not. So
    ReleaseBuffer() might need to be changed into something
    where the HeapTuple remembers all the buffers that were
    locked for it.
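
A toy sketch of what "the HeapTuple remembers all the buffers" could look like is given below. Everything in it is hypothetical: `MultiBufferTuple`, `pin_buffer_for_tuple`, `release_tuple_buffers` and the `pin_counts` array are stand-ins for illustration and are not the bufmgr API.

```c
#include <assert.h>
#include <stddef.h>

#define NBUFFERS          16   /* toy buffer pool size */
#define MAX_TUPLE_BUFFERS 8    /* hypothetical cap on pages per tuple */

typedef int Buffer;            /* stand-in for a buffer id, 0..NBUFFERS-1 */

int pin_counts[NBUFFERS];      /* toy reference counts, one per buffer */

/* A tuple handle that records every buffer pinned on its behalf. */
typedef struct {
    Buffer buffers[MAX_TUPLE_BUFFERS];
    size_t nbuffers;
} MultiBufferTuple;

/* Pin one more buffer for this tuple; returns 0 if the handle is full. */
int pin_buffer_for_tuple(MultiBufferTuple *t, Buffer buf)
{
    assert(buf >= 0 && buf < NBUFFERS);
    if (t->nbuffers == MAX_TUPLE_BUFFERS)
        return 0;
    pin_counts[buf]++;
    t->buffers[t->nbuffers++] = buf;
    return 1;
}

/* Drop every pin that was taken for this tuple, in one call. */
void release_tuple_buffers(MultiBufferTuple *t)
{
    for (size_t i = 0; i < t->nbuffers; i++) {
        assert(pin_counts[t->buffers[i]] > 0);
        pin_counts[t->buffers[i]]--;
    }
    t->nbuffers = 0;
}
```

The point of the sketch is only that a single release call would have to walk a list of buffers, where today a single buffer id is enough.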

    Also this separate ReleaseBuffer() reminds me that there are
    some places in the backend that assume a tuple returned by
    the heap AM is always in a buffer! But that can't be true any
    more, because a buffer always has size BLCKSZ.

> I think the effort this would take would be *much* more profitably
> spent on tuning the LONG-attribute support.  If we can make that
> fast and robust, we will have very few complaints.

    *MUCH*!


Jan


RE: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
"Hiroshi Inoue"
Date:
> -----Original Message-----
> From: Jan Wieck [mailto:wieck@debis.com]
> Sent: Wednesday, December 15, 1999 3:45 AM
> 
> Bruce Momjian wrote:
> 
> > > > I planned to use as many of PostgreSQL data structures unaltered as
> > > > possible. Storing one Tuple in multiple Items should not 
> pose too much
> > > > danger on bufmgr and smgr unless they access tuple 
> internals. (I didn't
> > > > check that yet). This would mean that on disk Items do no longer
> > > > correspond to Tuples. (Some of them might form one tuple).
> > > >
> > >
> > > Hmm,we have discussed about LONG.
> > > Change by LONG is transparent to users and would resolve
> > > the big tuple problem mostly.
> > > I'm suspicious that tuple chaining is worth the work now.
> > >
> > > At least a consensus is needed before going,I think.
> > > Bad design would only introduce a confusion.
> >
> > Agreed.
> 
> Me too.
> 
>     I  think that only a combination of LONG attributes and split
>     tuples will be a complete solution.
> 
>     What I'm worried about is to make the  segments  of  a  large
>     tuple  specialized  things in the main table. The reliability
>     of Vacuum is one of the most important things for any  system
>     in production. While the general operation of vacuum seems to
>     be well known, it's requirements for atomicy of some  actions
>     appears  to  be  lesser. The more chunks a tuple consists of,
>     the more possible an abort of vacuum in the middle  of  their
>     moving  becomes.  So keeping the links of chained tuples fail
>     safe intact is IMHO an issue, a little underestimated in this
>     discussion.
>

There is another related problem.
Vacuum could hardly move big tuples if some tuples on each page
are long-lived. We have to move a long tuple all at once, but there
won't be that many clean pages.

Vacuum probably couldn't move even an 8K tuple in some cases.
The problem is already there, more or less.
But it seems very difficult to solve without giving up on
preserving consistency in case of a crash.
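
The fragmentation effect can be illustrated with a small sketch. It assumes a simplified model in which each page is described only by its free byte count and a tuple must land in a single page; `total_free_space` and `find_page_with_room` are hypothetical helpers, not vacuum code.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of free bytes over all pages. */
size_t total_free_space(const size_t *free_bytes, size_t npages)
{
    size_t total = 0;
    for (size_t i = 0; i < npages; i++)
        total += free_bytes[i];
    return total;
}

/* Index of the first page that can hold tuple_len bytes, or -1 if none can. */
long find_page_with_room(const size_t *free_bytes, size_t npages, size_t tuple_len)
{
    assert(free_bytes != NULL || npages == 0);
    for (size_t i = 0; i < npages; i++)
        if (free_bytes[i] >= tuple_len)
            return (long) i;
    return -1;
}
```

For pages with 3000, 2500, 4000 and 1000 free bytes, the total free space is 10500 bytes, yet a tuple of 8000 bytes finds no target page because no single page is clean enough.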

Regards.
Hiroshi Inoue
Inoue@tpf.co.jp


Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
Bruce Momjian
Date:
Remember, chaining tuples had all sorts of performance, vacuum, code
handling, and UPDATE problems.  They buy us very little, and almost
nothing if we have LONG tables.


 


Re: [HACKERS] Volunteer: Large Tuples / Tuple chaining

From
Christof Petig
Date:
Bruce Momjian wrote:
> 
> Remember, chaining tuples had all sorts of performance, vacuum, code
> handling, and UPDATE problems.  They buy us very little, and almost
> nothing if we have LONG tables.
> 

I have already contacted Jan by private email. Since we share a country,
native language and time zone, this is actually the most comfortable way.

I agree with the concerns you raised and will (most likely) start
helping Jan to implement LONG. I only saw your LONG discussion
_after_ my original post, so this was a strange coincidence. But I have
been following it with interest.
    Christof