Re: [HACKERS] LONG

From: wieck@debis.com (Jan Wieck)
Subject: Re: [HACKERS] LONG
Msg-id: m11xOws-0003kGC@orion.SAPserv.Hamburg.dsh.de
In reply to: Re: [HACKERS] LONG  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses: Re: [HACKERS] LONG  (wieck@debis.com (Jan Wieck))
           Re: [HACKERS] LONG  (Bruce Momjian <pgman@candle.pha.pa.us>)
List: pgsql-hackers
Bruce Momjian wrote:

> > > No need for attno in there anymore.
> >
> >     I  still  need  it  to  explicitly  remove  one long value on
> >     update, while the other one is untouched. Otherwise  I  would
> >     have  to  drop  all  long  values  for  the  row together and
> >     reinsert all new ones.
>
> I am suggesting the longoid is not the oid of the primary or long*
> table, but a unique id we assigned just to number all parts of the long*
> tuple.  I thought that's what your oid was for.

    It's  not  even  an  Oid  of  any  existing  tuple,  just  an
    identifier to quickly find all the chunks of one  LONG  value
    by (non-unique) index.

    My idea is this now:

    The schema of the expansion relation is

        value_id        Oid
        chunk_seq       int32
        chunk_data      text

    with a non-unique index on value_id.
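
    Just to make the chunking concrete, a minimal sketch of how one
    LONG value could be split into rows of that relation follows.
    LONG_CHUNK_SIZE and store_chunk() are hypothetical names used
    for illustration only, they are not part of the proposal:

        #define LONG_CHUNK_SIZE 4000    /* bytes of chunk_data per row */

        static void
        store_long_value(Oid value_id, const char *data, int32 datasize)
        {
            int32       seq;
            int32       off;

            for (seq = 0, off = 0; off < datasize;
                 seq++, off += LONG_CHUNK_SIZE)
            {
                int32       len = (datasize - off < LONG_CHUNK_SIZE)
                                    ? (datasize - off) : LONG_CHUNK_SIZE;

                /*
                 * store_chunk() would heap_formtuple() a row of
                 * (value_id, chunk_seq, chunk_data) and heap_insert()
                 * it into the expansion relation.
                 */
                store_chunk(value_id, seq, data + off, len);
            }
        }

    Reassembling the value is then an index scan on value_id,
    concatenating the chunks in chunk_seq order.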

    We change heap_formtuple(), heap_copytuple() etc. not to
    allocate the entire thing in one palloc().  Instead, the tuple
    portion itself is allocated separately, and the current memory
    context is remembered in the HeapTuple struct as well (this is
    required below).
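
    A minimal sketch of what the HeapTuple struct would then carry.
    The field names, in particular t_datamcxt, are my assumption
    and not taken from this proposal:

        typedef struct HeapTupleData
        {
            uint32          t_len;       /* length of *t_data */
            ItemPointerData t_self;      /* item pointer (CTID) */
            MemoryContext   t_datamcxt;  /* context t_data lives in */
            HeapTupleHeader t_data;      /* separately palloc()'d data */
        } HeapTupleData;

        typedef HeapTupleData *HeapTuple;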

    The long value reference in a tuple is defined as:

        vl_len          int32;     /* high bit set, 32-bit = 18 */
        vl_datasize     int32;     /* real vl_len of long value */
        vl_valueid      Oid;       /* value_id in expansion relation */
        vl_relid        Oid;       /* Oid of "expansion" table */
        vl_rowid        Oid;       /* Oid of the row in "primary" table */
        vl_attno        int16;     /* attribute number in "primary" table */

    The tuple given to heap_update() (the most complex case) can
    now contain the usual VARLENA values of the format

        high-bit=0|31-bit-size|data

    or, if the value is the result of a scan, possibly

        high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno
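
    Telling the two formats apart then only requires a test of the
    high bit in the leading int32.  A possible sketch of the macro
    used further below (the names are assumptions):

        /* high bit of the leading int32 marks a long value reference */
        #define VARLENA_FLAG_LONG   0x80000000

        #define VARLENA_ISLONG(arg) \
            ((((struct varlena *)(arg))->vl_len & VARLENA_FLAG_LONG) != 0)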

    Now there are a couple of different cases (a rough sketch of
    the combined logic follows after the list).

    1.  The value found is a plain VARLENA  that  must  be  moved
        off.

        To move it off, a new Oid for value_id is obtained, the
        value itself is stored in the expansion relation, and the
        attribute in the tuple is replaced by the above structure
        with the values 1, 18, original VARSIZE(), value_id,
        "expansion" relid, "primary" tuple's Oid and attno.

    2.  The  value  found  is a long value reference that has our
        own "expansion" relid and the correct  rowid  and  attno.
        This  would  be  the result of an UPDATE without touching
        this long value.

        Nothing to be done.

    3.  The value found is a  long  value  reference  of  another
        attribute,  row or relation and this attribute is enabled
        for move off.

        The long value is fetched from the expansion relation it
        lives in, and the same as in case 1 is done with that
        value. There is room for optimization here, because we
        might be able to store the value plain. This can happen
        if the operation was an INSERT INTO t1 SELECT FROM t2,
        where t1 has a few small attributes plus one varsize
        attribute, while t2 has many, many long varsizes.

    4.  The value found is a  long  value  reference  of  another
        attribute, row or relation and this attribute is disabled
        for move off (either per column or because  our  relation
        does not have an expansion relation at all).

        The long value is fetched from the expansion relation it
        lives in, and the reference in our tuple is replaced with
        this plain VARLENA.
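
    Put together, the per-attribute loop inside heap_update() would
    roughly follow the outline below.  Apart from VARLENA_ISLONG()
    and expand_long(), every name here (attr_is_varsize(),
    attr_moveoff_enabled(), ref_is_ours(), make_long_ref(), and the
    values[] array holding the new tuple's varlena pointers) is a
    hypothetical helper used only for illustration:

        int         attno;

        for (attno = 0; attno < natts; attno++)
        {
            text       *value = values[attno];

            /* only non-null varsize attributes are of interest */
            if (value == NULL || !attr_is_varsize(tupdesc, attno))
                continue;

            if (!VARLENA_ISLONG(value))
            {
                /* case 1: plain VARLENA that must be moved off */
                if (attr_moveoff_enabled(rel, attno))
                    values[attno] = make_long_ref(rel, oldtup, attno, value);
            }
            else if (ref_is_ours(value, rel, oldtup, attno))
            {
                /* case 2: our own untouched long value - nothing to do */
            }
            else if (attr_moveoff_enabled(rel, attno))
            {
                /* case 3: foreign reference - expand it, move it off again */
                values[attno] = make_long_ref(rel, oldtup, attno,
                                              expand_long(value));
            }
            else
            {
                /*
                 * case 4: foreign reference, move off disabled - expand
                 * it and keep the plain VARLENA in the main tuple.
                 */
                values[attno] = expand_long(value);
            }
        }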

    This in-place replacement of values in the main tuple is the
    reason why we have to make a separate allocation for the
    tuple data and remember the memory context it was made in.
    Due to the above process, the tuple data can expand, and we
    then need to switch into that context and reallocate it.
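
    For illustration, the reallocation step could look like this
    (a sketch assuming the t_datamcxt field from above; new_len is
    the size of the tuple data after the replacements):

        if (new_len > tuple->t_len)
        {
            /* switch into the context the tuple data was allocated in */
            MemoryContext oldcxt = MemoryContextSwitchTo(tuple->t_datamcxt);

            tuple->t_data = (HeapTupleHeader) repalloc(tuple->t_data, new_len);
            tuple->t_len = new_len;

            MemoryContextSwitchTo(oldcxt);
        }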

    What heap_update() further must do is examine the OLD tuple
    (which it has already grabbed by CTID for header modification)
    and delete, by their value_id, all long values that are no
    longer present in the new tuple.
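
    A sketch of that cleanup.  collect_value_ids(),
    tuple_references_value() and delete_long_value() are
    hypothetical helpers; the last one would index-scan the
    expansion relation on value_id and heap_delete() every chunk
    found:

        Oid         old_ids[MaxHeapAttributeNumber];
        int         n_old,
                    i;

        /* collect the value_ids referenced by the OLD tuple */
        n_old = collect_value_ids(oldtup, tupdesc, old_ids);

        for (i = 0; i < n_old; i++)
        {
            /* delete every long value the new tuple no longer references */
            if (!tuple_references_value(newtup, tupdesc, old_ids[i]))
                delete_long_value(expansion_rel, old_ids[i]);
        }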

    The VARLENA arguments to type-specific functions can now
    also have both formats.  The macro

        #define VAR_GETPLAIN(arg) \
            (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))

    can be used to get a pointer to an always-plain
    representation, and the macro

        #define VAR_FREEPLAIN(arg,userptr) \
            do { if ((arg) != (userptr)) pfree(userptr); } while (0)

    is to be used to tidy up before returning.

    In this scenario, a function  like  smaller(text,text)  would
    look like

        text *
        smaller(text *t1, text *t2)
        {
            text *plain1 = VAR_GETPLAIN(t1);
            text *plain2 = VAR_GETPLAIN(t2);
            text *result;

            if ( /* whatever to compare plain1 and plain2 */ )
                result = t1;
            else
                result = t2;

            VAR_FREEPLAIN(t1,plain1);
            VAR_FREEPLAIN(t2,plain2);

            return result;
        }

    The LRU cache used in expand_long() will make the repeated
    expansion cheap enough.  The benefit would be that huge
    values resulting from table scans get passed around in the
    system (in and out of sorting, grouping etc.) as small
    references until they are modified or really stored/output.
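
    expand_long() itself could be sketched as below.  Everything in
    it is an assumption for illustration: long_ref stands for a
    struct holding the reference fields listed further up, and
    long_cache_fetch(), long_cache_insert() and fetch_long_chunks()
    stand for the LRU cache and for the index scan that
    concatenates the chunks:

        static text *
        expand_long(text *arg)
        {
            long_ref   *ref = (long_ref *) arg;
            text       *plain;

            /* already plain - hand it back unchanged */
            if (!VARLENA_ISLONG(arg))
                return arg;

            /*
             * The copy is palloc()'d in the caller's context, so that
             * VAR_FREEPLAIN can pfree() it safely afterwards.
             */
            plain = (text *) palloc(ref->vl_datasize);
            VARSIZE(plain) = ref->vl_datasize;

            /* the LRU cache makes repeated expansion cheap */
            if (!long_cache_fetch(ref->vl_relid, ref->vl_valueid, plain))
            {
                /* cache miss: scan the expansion relation by value_id */
                fetch_long_chunks(ref->vl_relid, ref->vl_valueid,
                                  VARDATA(plain));
                long_cache_insert(ref->vl_relid, ref->vl_valueid, plain);
            }

            return plain;
        }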

    And the LONG index stuff should already be covered here (free
    lunch)!  Index_insert() MUST always be called after
    heap_insert()/heap_update(), because it needs the CTID
    assigned there.  So by that time, the moved-off attributes
    have already been replaced in the tuple data by the
    references, and these references will be stored in the index
    instead of the values that were originally in the tuple.
    This should also work with hash indices, as long as the
    hashing functions use VAR_GETPLAIN as well.

    If we want to use auto compression too, no problem.  We code
    this into another bit of the first 32-bit vl_len.  The
    question whether to call expand_long() now changes to "is
    one of these bits set".  This way we can store both
    compressed and uncompressed values in both the "primary"
    tuple and the "expansion" relation; expand_long() will take
    care of it.
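
    A possible encoding of those bits, continuing the sketch from
    above (names and values are assumptions):

        #define VARLENA_FLAG_LONG       0x80000000  /* moved off   */
        #define VARLENA_FLAG_COMPRESSED 0x40000000  /* compressed  */

        /* "is one of these set" - the only test callers need to make */
        #define VARLENA_NEEDS_EXPAND(arg) \
            ((((struct varlena *)(arg))->vl_len & \
              (VARLENA_FLAG_LONG | VARLENA_FLAG_COMPRESSED)) != 0)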


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#========================================= wieck@debis.com (Jan Wieck) #
