Last thoughts about LONG

From: wieck@debis.com (Jan Wieck)
Subject: Last thoughts about LONG
Date:
Msg-id: m11wpGy-0003kGC@orion.SAPserv.Hamburg.dsh.de
In reply to: Re: [HACKERS] LONG  (wieck@debis.com (Jan Wieck))
Responses: Re: [HACKERS] Last thoughts about LONG  (Peter Eisentraut <e99re41@DoCS.UU.SE>)
           Re: [HACKERS] Last thoughts about LONG  (Tom Lane <tgl@sss.pgh.pa.us>)
List: pgsql-hackers
I wrote:

> Bruce Momjian wrote:
>
> > Should we use large objects for this, and beef them up.  Seems that
> > would be a good way.  I have considered putting them in a hash
> > bucket/directory tree for faster access to lots of large objects.
> >
> > There is a lot to say about storing long tuples outside the tables
> > because long tuples fill cache buffers and make short fields longer to
> > access.
>
>     I  thought  to  use  a  regular table. Of course, it will eat
>     buffers ...

    Looking at my actual implementation concept, I'm not sure if
    it will win or lose compared with text itself!  Amazingly, I
    think it could win already at relatively small text sizes
    (1-2K is IMHO small compared to what this type could store).

    Well, on to the implementation details.  I'd really like a
    few comments to verify that it's complete before I start.

    - A new field "rellongrelid" of type Oid is added to
      pg_class.  It contains the Oid of the long-value relation,
      or the invalid Oid for tables that have no LONG attributes.

    - At CREATE TABLE, a long value relation named
      "_LONG<tablename>" is created for those tables that need
      it, and it is of course dropped and truncated
      appropriately.  The schema of this table is

          rowid       Oid,          -- oid of our main data row
          rowattno    int2,         -- the attribute number in main data
          chunk_seq   int4,         -- the part number of this data chunk
          chunk       text          -- the content of this data chunk

      There is a unique index defined on (rowid, rowattno,
      chunk_seq).

    - The new data type is of variable size  with  the  following
      header:

          typedef struct LongData {
              int32           varsize;    /* total size incl. this header  */
              int32           datasize;   /* size of the value itself      */
              Oid             longrelid;  /* long relation, or invalid Oid */
              Oid             rowid;      /* oid of owning main data row   */
              int16           rowattno;   /* attribute number in main data */
          } LongData;

      The type's input function is very simple: allocate
      sizeof(LongData) + strlen(input), set varsize to that and
      datasize to strlen(input), set the rest to the invalid Oid
      and 0, then copy the input after the struct.

      The type's output function decides, based on the longrelid,
      what to do.  If it's invalid, it just outputs the bytes
      stored after the struct (it must then be a datum that
      resulted from an input operation).  If longrelid is valid,
      it does an index scan on that relation, fetching all tuples
      that match rowid and attno.  Since it knows the datasize,
      it doesn't need them in the correct order; it can put them
      at the right places into the allocated return buffer by
      their chunk_seq.  A little standalone sketch of both
      functions follows.
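
      To make that concrete, here is a minimal, self-contained
      sketch, assuming plain malloc() and made-up names like
      long_in(), place_chunk() and CHUNK_MAX; the real backend
      code would use palloc() and the fmgr conventions instead.

          #include <stdint.h>
          #include <stdio.h>
          #include <stdlib.h>
          #include <string.h>

          typedef uint32_t Oid;
          #define InvalidOid ((Oid) 0)

          typedef struct LongData {
              int32_t  varsize;    /* total size incl. this header  */
              int32_t  datasize;   /* size of the value itself      */
              Oid      longrelid;  /* long relation, or InvalidOid  */
              Oid      rowid;      /* owning main data row          */
              int16_t  rowattno;   /* owning attribute number       */
          } LongData;

          /* Input function: header plus a verbatim copy of the input. */
          static LongData *
          long_in(const char *input)
          {
              size_t    len = strlen(input);
              LongData *d = malloc(sizeof(LongData) + len);

              d->varsize   = (int32_t) (sizeof(LongData) + len);
              d->datasize  = (int32_t) len;
              d->longrelid = InvalidOid;   /* not stored externally */
              d->rowid     = InvalidOid;
              d->rowattno  = 0;
              memcpy((char *) d + sizeof(LongData), input, len);
              return d;
          }

          #define CHUNK_MAX 8   /* illustrative chunk size */

          /* Output side: since datasize is known, a fetched chunk
             can be placed at its final offset no matter in which
             order the index scan returns it. */
          static void
          place_chunk(char *buf, int chunk_seq, const char *chunk,
                      size_t len)
          {
              memcpy(buf + (size_t) chunk_seq * CHUNK_MAX, chunk, len);
          }

          int
          main(void)
          {
              LongData *d = long_in("hello, long value");
              char     *buf = malloc(d->datasize + 1);

              /* chunks arrive out of order - no sorting needed */
              place_chunk(buf, 1, "ong valu", 8);
              place_chunk(buf, 2, "e", 1);
              place_chunk(buf, 0, "hello, l", 8);
              buf[d->datasize] = '\0';
              printf("%s\n", buf);

              free(buf);
              free(d);
              return 0;
          }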

    - For now (until we have enough experience to judge) I think
      it would be better to forbid ALTER TABLE when LONG
      attributes are involved.  Sure, it must be implemented
      eventually, but IMHO not in the first evaluation attempt.

Now how the data goes in and out of the longrel.

    - On heap_insert(), we look for non-NULL LONG attributes in
      the tuple.  Whether there can be any is simply seen by
      looking at the rellongrelid in rd_rel.  We fetch the value
      either from the memory after the LongData or by using the
      type output function (when it has to be fetched from the
      relation where it currently is!).  Then we simply break it
      up into single chunks and store them with our tuple's
      information (see the chunking sketch below).  Now we need
      to do something tricky - to shrink the main data tuple's
      size, we form a new heap tuple with the datums of the
      original one, but replace all the LongData items we stored
      by faked ones whose varsize is sizeof(LongData) and whose
      other information is set up appropriately.  We append that
      faked tuple instead, copy the resulting information into
      the original tuple's header, and throw it away.

      This is a point where I'm not totally sure.  Could it
      possibly be better, or even required, to copy the entire
      faked tuple over the one we would have stored?  It could
      never need more space, so that wouldn't be a problem.
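
      As an illustration of just the chunking step, a standalone
      sketch, assuming a made-up store_chunk() that only prints
      what the real code would insert into the long relation:

          #include <stdio.h>
          #include <string.h>

          #define CHUNK_MAX 8   /* illustrative chunk size */

          /* Stand-in for inserting one row into _LONG<tablename>. */
          static void
          store_chunk(unsigned rowid, int rowattno, int chunk_seq,
                      const char *chunk, size_t len)
          {
              printf("rowid=%u rowattno=%d chunk_seq=%d chunk=\"%.*s\"\n",
                     rowid, rowattno, chunk_seq, (int) len, chunk);
          }

          /* Break one LONG value into chunks and store them. */
          static void
          store_long_value(unsigned rowid, int rowattno,
                           const char *data, size_t datasize)
          {
              size_t off;
              int    seq = 0;

              for (off = 0; off < datasize; off += CHUNK_MAX, seq++)
              {
                  size_t len = datasize - off;

                  if (len > CHUNK_MAX)
                      len = CHUNK_MAX;
                  store_chunk(rowid, rowattno, seq, data + off, len);
              }
          }

          int
          main(void)
          {
              const char *v = "a value too long for the main tuple";
              store_long_value(12345, 2, v, strlen(v));
              return 0;
          }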

    - On heap_replace(), we check each LONG attribute to see if
      it is NULL, or if the information in its longrelid, rowid
      and rowattno doesn't match our rellongrelid, tupleid, and
      attno.  In that case the attribute might have old content
      in the long relation, which we need to delete first.

      The rest of the operation is exactly as for heap_insert(),
      except when all of an attribute's information did match -
      then it's our own OLD value, unchanged, so we can simply
      skip it; the existing data is still valid (a sketch of this
      test follows below).
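
      The test itself would be something like this little sketch
      (field by field; the names are again only illustrative):

          #include <stdint.h>
          #include <stdio.h>

          typedef uint32_t Oid;

          /* Does this LongData still point at our own old value?
             If so, heap_replace() can skip the attribute. */
          static int
          long_value_unchanged(Oid value_longrelid, Oid value_rowid,
                               int16_t value_rowattno,
                               Oid rellongrelid, Oid tupleid,
                               int16_t attno)
          {
              return value_longrelid == rellongrelid
                  && value_rowid == tupleid
                  && value_rowattno == attno;
          }

          int
          main(void)
          {
              /* the old value of attribute 2 of row 12345, stored
                 in long relation 77777, is being kept */
              printf("unchanged: %d\n",
                     long_value_unchanged(77777, 12345, 2,
                                          77777, 12345, 2));
              return 0;
          }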

    - heap_delete() is so simple that I don't explain it.

    Now I hear you asking "how could this overhead be a win?" :-)

    That's  easy  to  explain.   As  long as you don't use a LONG
    column in the WHERE clause, when will the data be fetched? At
    the  time  it's finally clear that it's needed. That's when a
    result tuple is sent to the client (type output)  or  when  a
    tuple resulting from INSERT ... SELECT should be stored.

    Thus, all the tuples moving around in the execution tree,
    getting joined together, abused by sorts and aggregates, and
    filtered out again, always contain only the small LongData
    struct, not the data itself.  Weren't there recent reports
    about too-expensive sorts due to huge tuple sizes?

    Another bonus would be this: what happens on an UPDATE to a
    table having LONG attributes?  If the attribute is not
    modified, the OLD LongData will be found in the targetlist,
    and we won't waste any space by storing the same information
    again.  IIRC that was one of the biggest concerns about
    storing huge data in tuples, but it disappeared without
    leaving a trace - funny, eh?

    It is all so simple that I fear I've made some mistake
    somewhere.  But where?


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
