Re: [HACKERS] LONG

From: Bruce Momjian
Subject: Re: [HACKERS] LONG
Date:
Msg-id: 199912140156.UAA22211@candle.pha.pa.us
In response to: Re: [HACKERS] LONG  (wieck@debis.com (Jan Wieck))
List: pgsql-hackers
This outline is perfect!


> > I am suggesting the longoid is not the oid of the primary or long*
> > table, but a unique id we assigned just to number all parts of the long*
> > tuple.  I thought that's what your oid was for.
> 
>     It's  not  even  an  Oid  of  any  existing  tuple,  just  an
>     identifier to quickly find all the chunks of one  LONG  value
>     by (non-unique) index.

Yes, I understood this and I think it is a great idea.  It allows UPDATE
to control whether it wants to replace the LONG value.


> 
>     My idea is this now:
> 
>     The schema of the expansion relation is
> 
>         value_id        Oid
>         chunk_seq       int32
>         chunk_data      text
> 
>     with a non unique index on value_id.

Yes, exactly.
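
Just so we are picturing the same thing, here is one chunk row as a
C struct (purely illustrative; the chunks are of course ordinary heap
tuples, and the field names are the ones from your schema):

    typedef struct LongChunk
    {
        Oid     value_id;    /* groups all chunks of one LONG value     */
        int32   chunk_seq;   /* 0, 1, 2, ... ordering within that value */
        text   *chunk_data;  /* this chunk's piece of the long datum    */
                             /* (stored inline as a varlena in the      */
                             /* real tuple)                             */
    } LongChunk;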

> 
>     We  change  heap_formtuple(),  heap_copytuple()  etc.  not to
>     allocate the entire thing in one palloc(). Instead the  tuple
>     portion itself is allocated separately and the current memory
>     context remembered too  in  the  HeapTuple  struct  (this  is
>     required below).

I read the later part.  I understand.
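
So HeapTupleData would grow roughly like this (t_datamcxt is just the
field name I am assuming for the remembered context; the point is that
t_data now gets its own palloc() and we keep track of where it came
from):

    typedef struct HeapTupleData
    {
        uint32           t_len;       /* length of *t_data              */
        ItemPointerData  t_self;      /* SelfItemPointer                */
        MemoryContext    t_datamcxt;  /* context t_data was allocated   */
                                      /* in (assumed field name)        */
        HeapTupleHeader  t_data;      /* tuple header and data, now     */
                                      /* allocated separately           */
    } HeapTupleData;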

> 
>     The long value reference in a tuple is defined as:
> 
>         vl_len          int32;     /* high bit set, 32-bit = 18 */
>         vl_datasize     int32;     /* real vl_len of long value */
>         vl_valueid      Oid;       /* value_id in expansion relation */
>         vl_relid        Oid;       /* Oid of "expansion" table */
>         vl_rowid        Oid;       /* Oid of the row in "primary" table */
>         vl_attno        int16;     /* attribute number in "primary" table */

I see you need vl_rowid and vl_attno so you don't accidentally reference
a LONG value twice.  Good point.  I hadn't thought of that.
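
In C terms the reference is a little fixed-size record like this
(illustrative typedef only; on disk it would of course be stored
packed, without alignment padding):

    typedef struct LongRef
    {
        int32   vl_len;       /* high bit set, remaining bits = 18    */
        int32   vl_datasize;  /* real VARSIZE() of the long value     */
        Oid     vl_valueid;   /* value_id in the expansion relation   */
        Oid     vl_relid;     /* Oid of the "expansion" table         */
        Oid     vl_rowid;     /* Oid of the row in the "primary"      */
        int16   vl_attno;     /* attribute number in the "primary"    */
    } LongRef;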

> 
>     The  tuple  given to heap_update() (the most complex one) can
>     now contain usual VARLENA values of the format
> 
>         high-bit=0|31-bit-size|data
> 
>     or if the value is the result of a scan eventually
> 
>         high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno
> 
>     Now there are a couple of different cases.
> 
>     1.  The value found is a plain VARLENA  that  must  be  moved
>         off.
> 
>         To  move  it  off a new Oid for value_id is obtained, the
>         value itself stored in the  expansion  relation  and  the
>         attribute in the tuple is replaced by the above structure
>         with the values  1,  18,  original  VARSIZE(),  value_id,
>         "expansion" relid, "primary" tuples Oid and attno.
> 
>     2.  The  value  found  is a long value reference that has our
>         own "expansion" relid and the correct  rowid  and  attno.
>         This  would  be  the result of an UPDATE without touching
>         this long value.
> 
>         Nothing to be done.
> 
>     3.  The value found is a  long  value  reference  of  another
>         attribute,  row or relation and this attribute is enabled
>         for move off.
> 
>         The long value is fetched from the expansion relation  it
>         is  living  in,  and the same as for 1. is done with that
>         value. There's space for optimization  here,  because  we
>         might have room to store the value plain. This can happen
>         if the operation was an INSERT INTO t1  SELECT  FROM  t2,
>         where t1 has a few small attributes plus one varsize
>         attribute, while t2 has many, many long varsizes.
> 
>     4.  The value found is a  long  value  reference  of  another
>         attribute, row or relation and this attribute is disabled
>         for move off (either per column or because  our  relation
>         does not have an expansion relation at all).
> 
>         The  long value is fetched from the expansion relation it
>         is living in, and the reference in our tuple is  replaced
>         with this plain VARLENA.

Yes.
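
Just to restate your four cases in pseudo-C (all of the helpers here,
att_moveoff_enabled(), is_own_reference(), store_long() and
fetch_long(), are names I made up for illustration; size checks
omitted):

    for (attno = 0; attno < natts; attno++)
    {
        Datum   val = values[attno];

        if (!VARLENA_ISLONG(DatumGetPointer(val)))
        {
            /* 1. plain VARLENA that must be moved off */
            if (att_moveoff_enabled(rel, attno))
                values[attno] = store_long(rel, rowid, attno, val);
        }
        else if (is_own_reference(rel, rowid, attno, val))
        {
            /* 2. already our own reference: nothing to be done */
        }
        else if (att_moveoff_enabled(rel, attno))
        {
            /* 3. someone else's reference, move off enabled:
             *    expand it, then store it under a new value_id */
            values[attno] = store_long(rel, rowid, attno,
                                       fetch_long(val));
        }
        else
        {
            /* 4. someone else's reference, move off disabled:
             *    expand it back into a plain VARLENA */
            values[attno] = fetch_long(val);
        }
    }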

> 
>     This in-place replacement of values in the main tuple is the
>     reason why we have to make a separate allocation for the
>     tuple data and remember the memory context in which it was
>     made. Due to the above process, the tuple data can expand,
>     and we then need to switch into that context and reallocate
>     it.


Yes, got it.
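
In code I picture it roughly as (t_datamcxt again being my assumed
name for the remembered context):

    if (new_data_len > tuple->t_len)
    {
        MemoryContext   oldcxt;

        /* reallocate the tuple data in the context it was made in */
        oldcxt = MemoryContextSwitchTo(tuple->t_datamcxt);
        tuple->t_data = (HeapTupleHeader)
            repalloc(tuple->t_data, new_data_len);
        tuple->t_len = new_data_len;
        MemoryContextSwitchTo(oldcxt);
    }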

> 
>     What heap_update() further must do is examine the OLD tuple
>     (which it has already grabbed by CTID for header
>     modification) and delete, by their value_id, all long values
>     that are no longer present in the new tuple.

Yes, and that makes vacuum run fine on the LONG* relation.
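
Something like this, I suppose (pseudo-C again; foreach_long_reference(),
tuple_references_value() and delete_long_chunks() are made-up helpers,
the last one scanning the expansion relation through the non-unique
value_id index and heap_delete()ing every chunk it finds):

    foreach_long_reference(oldtup, ref)
    {
        if (!tuple_references_value(newtup, ref->vl_valueid))
            delete_long_chunks(expansion_rel, ref->vl_valueid);
    }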

> 
>     The VARLENA arguments to type specific functions now can also
>     have both formats.  The macro
> 
>         #define VAR_GETPLAIN(arg) \
>             (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))
> 
>     can be used to get a pointer to an always plain
>     representation, and the macro
> 
>         #define VAR_FREEPLAIN(arg,userptr) \
>               if (arg != userptr) pfree(userptr);
> 
>     is to be used to tidy up before returning.

Got it.

> 
>     In this scenario, a function  like  smaller(text,text)  would
>     look like
> 
>         text *
>         smaller(text *t1, text *t2)
>         {
>             text *plain1 = VAR_GETPLAIN(t1);
>             text *plain2 = VAR_GETPLAIN(t2);
>             text *result;
> 
>             if ( /* whatever to compare plain1 and plain2 */ )
>                 result = t1;
>             else
>                 result = t2;
> 
>             VAR_FREEPLAIN(t1,plain1);
>             VAR_FREEPLAIN(t2,plain2);
> 
>             return result;
>         }

Yes.

> 
>     The LRU cache used in expand_long() will make the repeated
>     expansion cheap enough. The benefit would be that
>     huge  values resulting from table scans will be passed around
>     in the system (in and out of sorting,  grouping  etc.)  until
>     they are modified or really stored/output.

Yes.
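
I suppose expand_long() itself would look about like this (lru_lookup(),
lru_insert() and fetch_chunks() are made-up helpers, and LongRef is the
typedef sketched above; the one detail that matters is that it hands
back a palloc'd copy, so VAR_FREEPLAIN() can safely pfree() it):

    text *
    expand_long(LongRef *ref)
    {
        text   *val = lru_lookup(ref->vl_valueid);

        if (val == NULL)
        {
            /* cache miss: collect the chunks via the value_id index */
            val = fetch_chunks(ref->vl_relid, ref->vl_valueid,
                               ref->vl_datasize);
            lru_insert(ref->vl_valueid, val);
        }

        /* hand the caller its own copy */
        return (text *) memcpy(palloc(VARSIZE(val)), val, VARSIZE(val));
    }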

> 
>     And the LONG index stuff should be covered here already (free
>     lunch)!  Index_insert() MUST always be called after
>     heap_insert()/heap_update(), because it needs the CTID
>     assigned there. So at that time, the moved off attributes are
>     already replaced in the tuple data by the references, and
>     these will be stored instead of the values that were
>     originally in the tuple.  Should also work with hash indices,
>     as long as the hashing functions use VAR_GETPLAIN as well.

I hoped this would be true.  Great.

> 
>     If we want to use auto compression too, no problem.  We code
>     this into another bit of the first 32-bit vl_len.  The
>     question of whether to call expand_long() then becomes "is
>     one of these bits set".  This way, we can store both
>     compressed and uncompressed values in both the "primary"
>     tuple and the "expansion" relation; expand_long() will take
>     care of it.

Perfect.  Sounds great.
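
I imagine the test then just becomes something like (bit positions and
macro names are only my guess):

    #define VARLENA_FLAG_LONG        0x80000000
    #define VARLENA_FLAG_COMPRESSED  0x40000000

    /* "does this value have to go through expand_long()?" */
    #define VARLENA_NEEDS_EXPAND(ptr) \
        ((((struct varlena *) (ptr))->vl_len & \
          (VARLENA_FLAG_LONG | VARLENA_FLAG_COMPRESSED)) != 0)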

--
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
 

