Re: [HACKERS] LONG
From | Bruce Momjian |
---|---|
Subject | Re: [HACKERS] LONG |
Date | |
Msg-id | 199912140156.UAA22211@candle.pha.pa.us |
In reply to | Re: [HACKERS] LONG (wieck@debis.com (Jan Wieck)) |
List | pgsql-hackers |
This outline is perfect!

> > I am suggesting the longoid is not the oid of the primary or long*
> > table, but a unique id we assigned just to number all parts of the long*
> > tuple.  I thought that's what your oid was for.
>
>     It's not even an Oid of any existing tuple, just an
>     identifier to quickly find all the chunks of one LONG value
>     by (non-unique) index.

Yes, I understood this and I think it is a great idea.  It allows UPDATE
to control whether it wants to replace the LONG value.

>
>     My idea is this now:
>
>     The schema of the expansion relation is
>
>         value_id        Oid
>         chunk_seq       int32
>         chunk_data      text
>
>     with a non unique index on value_id.

Yes, exactly.

>
>     We change heap_formtuple(), heap_copytuple() etc. not to
>     allocate the entire thing in one palloc(). Instead the tuple
>     portion itself is allocated separately and the current memory
>     context remembered too in the HeapTuple struct (this is
>     required below).

I read the later part.  I understand.

>
>     The long value reference in a tuple is defined as:
>
>         vl_len          int32;     /* high bit set, 32-bit = 18 */
>         vl_datasize     int32;     /* real vl_len of long value */
>         vl_valueid      Oid;       /* value_id in expansion relation */
>         vl_relid        Oid;       /* Oid of "expansion" table */
>         vl_rowid        Oid;       /* Oid of the row in "primary" table */
>         vl_attno        int16;     /* attribute number in "primary" table */

I see you need vl_rowid and vl_attno so you don't accidentally reference
a LONG value twice.  Good point.  I hadn't thought of that.

>
>     The tuple given to heap_update() (the most complex one) can
>     now contain usual VARLENA values of the format
>
>         high-bit=0|31-bit-size|data
>
>     or if the value is the result of a scan eventually
>
>         high-bit=1|31-bit=18|datasize|valueid|relid|rowid|attno
>
>     Now there are a couple of different cases.
>
>     1.  The value found is a plain VARLENA that must be moved
>         off.
>
>         To move it off a new Oid for value_id is obtained, the
>         value itself stored in the expansion relation and the
>         attribute in the tuple is replaced by the above structure
>         with the values 1, 18, original VARSIZE(), value_id,
>         "expansion" relid, "primary" tuple's Oid and attno.
>
>     2.  The value found is a long value reference that has our
>         own "expansion" relid and the correct rowid and attno.
>         This would be the result of an UPDATE without touching
>         this long value.
>
>         Nothing to be done.
>
>     3.  The value found is a long value reference of another
>         attribute, row or relation and this attribute is enabled
>         for move off.
>
>         The long value is fetched from the expansion relation it
>         is living in, and the same as for 1. is done with that
>         value. There's space for optimization here, because we
>         might have room to store the value plain. This can happen
>         if the operation was an INSERT INTO t1 SELECT FROM t2,
>         where t1 has few small plus one varsize attribute, while
>         t2 has many, many long varsizes.
>
>     4.  The value found is a long value reference of another
>         attribute, row or relation and this attribute is disabled
>         for move off (either per column or because our relation
>         does not have an expansion relation at all).
>
>         The long value is fetched from the expansion relation it
>         is living in, and the reference in our tuple is replaced
>         with this plain VARLENA.

Yes.

>
>     This in place replacement of values in the main tuple is the
>     reason why we have to make another allocation for the tuple
>     data and remember the memory context where it was made. Due to
>     the above process, the tuple data can expand, and we then need
>     to change into that context and reallocate it.

Yes, got it.

>
>     What heap_update() further must do is to examine the OLD
>     tuple (that it already has grabbed by CTID for header
>     modification) and delete all long values by their value_id,
>     that aren't any longer present in the new tuple.

Yes, makes vacuum run fine on the LONG* relation.
>
>     The VARLENA arguments to type specific functions now can also
>     have both formats.  The macro
>
>         #define VAR_GETPLAIN(arg) \
>             (VARLENA_ISLONG(arg) ? expand_long(arg) : (arg))
>
>     can be used to get a pointer to an always plain
>     representation, and the macro
>
>         #define VAR_FREEPLAIN(arg,userptr) \
>             if (arg != userptr) pfree(userptr);
>
>     is to be used to tidy up before returning.

Got it.

>
>     In this scenario, a function like smaller(text,text) would
>     look like
>
>         text *
>         smaller(text *t1, text *t2)
>         {
>             text *plain1 = VAR_GETPLAIN(t1);
>             text *plain2 = VAR_GETPLAIN(t2);
>             text *result;
>
>             if ( /* whatever to compare plain1 and plain2 */ )
>                 result = t1;
>             else
>                 result = t2;
>
>             VAR_FREEPLAIN(t1,plain1);
>             VAR_FREEPLAIN(t2,plain2);
>
>             return result;
>         }

Yes.

>
>     The LRU cache used in expand_long() will make the repeated
>     expansion cheap enough. The benefit would be that huge values
>     resulting from table scans will be passed around in the
>     system (in and out of sorting, grouping etc.) until they are
>     modified or really stored/output.

Yes.

>
>     And the LONG index stuff should be covered here already (free
>     lunch)!  Index_insert() MUST always be called after
>     heap_insert()/heap_update(), because it needs the CTID
>     assigned there. So at that time, the moved off attributes are
>     replaced in the tuple data by the references. These will be
>     stored instead of the values that originally were in the
>     tuple.  Should also work with hash indices, as long as the
>     hashing functions use VAR_GETPLAIN as well.

I hoped this would be true.  Great.

>
>     If we want to use auto compression too, no problem.  We code
>     this into another bit of the first 32-bit vl_len. The
>     question of whether to call expand_long() changes now to "is
>     one of these set". This way, we can store both compressed and
>     uncompressed values in either the "primary" tuple or the
>     "expansion" relation. expand_long() will take care of it.

Perfect.  Sounds great.
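
To make the VAR_GETPLAIN / VAR_FREEPLAIN pattern concrete, here is a toy, standalone sketch. Everything in it is an assumption for illustration: `toy_text`, `toy_expansion`, `make_plain()`, `make_long_ref()`, the bit positions, and an `expand_long()` that looks values up in an in-memory array instead of the expansion relation (no LRU cache, no decompression). `malloc()`/`free()` stand in for `palloc()`/`pfree()`, the "is one of these set" test is folded into a hypothetical `VAR_NEEDS_EXPAND`, and the free macro is wrapped in `do { } while (0)` so it behaves as a single statement.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define VL_LONG_BIT  0x80000000u  /* high bit of vl_len, as in the proposal */
#define VL_COMPR_BIT 0x40000000u  /* assumed position of the compression bit */
#define VL_SIZE_MASK 0x3FFFFFFFu  /* remaining bits hold the total size */

typedef struct {
    uint32_t vl_len;    /* flag bits plus total size in bytes */
    char     vl_dat[];  /* payload: data, or a value_id for a reference */
} toy_text;

/* Stand-in for the expansion relation, indexed by value_id. */
static const char *toy_expansion[] = { "apple", "banana" };

static toy_text *make_plain(const char *s)
{
    size_t n = strlen(s);
    toy_text *t = malloc(sizeof(toy_text) + n);
    assert(t != NULL);
    t->vl_len = (uint32_t) (sizeof(toy_text) + n);
    memcpy(t->vl_dat, s, n);
    return t;
}

static toy_text *make_long_ref(uint32_t value_id)
{
    toy_text *t = malloc(sizeof(toy_text) + sizeof value_id);
    assert(t != NULL);
    t->vl_len = VL_LONG_BIT | (uint32_t) (sizeof(toy_text) + sizeof value_id);
    memcpy(t->vl_dat, &value_id, sizeof value_id);
    return t;
}

/* Toy expansion: returns a newly allocated plain copy of the value. */
static toy_text *expand_long(const toy_text *ref)
{
    uint32_t value_id;
    assert((ref->vl_len & VL_COMPR_BIT) == 0);  /* no compressor in this toy */
    memcpy(&value_id, ref->vl_dat, sizeof value_id);
    assert(value_id < sizeof toy_expansion / sizeof toy_expansion[0]);
    return make_plain(toy_expansion[value_id]);
}

#define VAR_NEEDS_EXPAND(arg) \
    (((arg)->vl_len & (VL_LONG_BIT | VL_COMPR_BIT)) != 0)
#define VAR_GETPLAIN(arg) \
    (VAR_NEEDS_EXPAND(arg) ? expand_long(arg) : (arg))
#define VAR_FREEPLAIN(arg, plain) \
    do { if ((arg) != (plain)) free(plain); } while (0)

static size_t plain_datalen(const toy_text *t)
{
    return (t->vl_len & VL_SIZE_MASK) - sizeof(toy_text);
}

/* Returns whichever ORIGINAL argument holds the bytewise smaller value. */
static toy_text *smaller(toy_text *t1, toy_text *t2)
{
    toy_text *p1 = VAR_GETPLAIN(t1);
    toy_text *p2 = VAR_GETPLAIN(t2);
    size_t n1 = plain_datalen(p1), n2 = plain_datalen(p2);
    int cmp = memcmp(p1->vl_dat, p2->vl_dat, n1 < n2 ? n1 : n2);
    toy_text *result = (cmp < 0 || (cmp == 0 && n1 <= n2)) ? t1 : t2;

    VAR_FREEPLAIN(t1, p1);
    VAR_FREEPLAIN(t2, p2);
    return result;
}
```

Note that `smaller()` hands back the original argument, possibly still a reference, so a large value is only expanded for the comparison and keeps travelling through the system in its short form.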
-- 
  Bruce Momjian                        |  http://www.op.net/~candle
  maillist@candle.pha.pa.us            |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026