Re: [HACKERS] LONG

From	Tom Lane
Subject	Re: [HACKERS] LONG
Date
Msg-id	26627.945021733@sss.pgh.pa.us
In reply to	Re: [HACKERS] LONG  (Bruce Momjian <pgman@candle.pha.pa.us>)
Responses	Re: [HACKERS] LONG  (wieck@debis.com (Jan Wieck))
	Re: [HACKERS] LONG  (Bruce Momjian <pgman@candle.pha.pa.us>)
	RE: [HACKERS] LONG  ("Hiroshi Inoue" <Inoue@tpf.co.jp>)
List	pgsql-hackers
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, my idea was to auto-enable longs for all varlena types, so short
> values stay in the table, while longer chained ones that take up lots of
> space and are expensive to expand are retrieved only when needed.

I missed most of yesterday's discussion (was off fighting a different
fire...).  This morning in the shower I had a brilliant idea, which
I now see Bruce has beaten me to ;-)

The idea of doing tuple splitting by pushing "long" fields out of line,
rather than just cutting up the tuple at arbitrary points, is clearly
a win for the reasons Bruce and Jan point out.  But I like Bruce's
approach (automatically do it for any overly-long varlena attribute)
much better than Jan's (invent a special LONG datatype).  A special
datatype is bad for several reasons:
* it forces users to kluge up their database schemas;
* inevitably, users will pick the wrong columns to make LONG (it's a
truism that programmers seldom guess right about what parts of their
programs consume the most resources; users would need a "profiler"
to make the right decisions);
* it doesn't solve the problems for arrays, which desperately need it;
* we'd need to add a whole bunch of operations on the special datatype;

I could live with all of those limitations if a "clean" datatype-based
solution were possible, ie, all the special code is in the datatype
functions.  But we already know that that's not possible --- there would
have to be special hacks for the LONG datatype in other places.  So I
think we ought to handle the problem as part of the tuple access
machinery, not as a special datatype.

I think that the right place to implement this is in heapam, and that
it should go more or less like this:

1. While writing out a tuple, if the total tuple size is "too big"
(threshold would be some fraction of BLCKSZ, yet to be chosen),
then the tuple manager would go through the tuple to find the longest
varlena attribute, and convert same into an out-of-line attribute.
Repeat if necessary until tuple size fits within threshold.
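In rough C, that loop might look something like the sketch below. This is purely illustrative: the threshold fraction, the struct, and every identifier (`Attr`, `shrink_tuple`, `TUPLE_THRESHOLD`, etc.) are invented for the example, not actual Postgres names.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative sketch of step 1: while the tuple is "too big"
 * (some fraction of BLCKSZ, yet to be chosen), repeatedly push the
 * longest in-line varlena attribute out of line.  All names here are
 * invented, not actual PostgreSQL identifiers. */

#define BLCKSZ 8192
#define TUPLE_THRESHOLD (BLCKSZ / 4)   /* threshold fraction: an assumption */

typedef struct {
    size_t len;          /* current in-line size of the attribute */
    int    is_varlena;   /* only varlena attrs may be moved out of line */
    int    out_of_line;  /* set once the value has been pushed out */
} Attr;

/* Returns the index of the longest in-line varlena attr, or -1. */
static int find_longest_varlena(const Attr *attrs, int natts)
{
    int best = -1;
    for (int i = 0; i < natts; i++)
        if (attrs[i].is_varlena && !attrs[i].out_of_line &&
            (best < 0 || attrs[i].len > attrs[best].len))
            best = i;
    return best;
}

/* Shrink the tuple until it fits, moving one attribute per pass. */
static size_t shrink_tuple(Attr *attrs, int natts, size_t tuple_size)
{
    while (tuple_size > TUPLE_THRESHOLD) {
        int i = find_longest_varlena(attrs, natts);
        if (i < 0)
            break;                     /* nothing left to move out */
        attrs[i].out_of_line = 1;      /* value gets written elsewhere */
        tuple_size -= attrs[i].len;    /* only a small reference remains */
        tuple_size += sizeof(void *);  /* stand-in for the reference size */
    }
    return tuple_size;
}
```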

2. While reading a tuple, fastgetattr() automatically fetches the
out-of-line value if it sees the requested attribute is out-of-line.
(I'd be inclined to mark out-of-line attributes in the same way that
NULL attributes are marked: one bit in the tuple header shows if any
out-of-line attrs are present, and if so there is a bitmap to show
which ones are out-of-line.  We could also use Bruce's idea of
commandeering the high-order bit of the varlena length word, but
I think that's a much uglier and more fragile solution.)
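The null-bitmap-style marking could be sketched like this; the flag value, struct layout, and names are all hypothetical, chosen only to show the shape of the scheme:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch of the proposed marking: one bit in the tuple
 * header says "some attrs are out of line", and a per-attribute bitmap
 * (laid out like the null bitmap) says which ones.  The flag value and
 * all names are invented for the example. */

#define HEAP_HASOUTOFLINE 0x0400       /* made-up infomask bit */

typedef struct {
    uint16_t infomask;                 /* tuple header flag word */
    uint8_t  ool_bitmap[8];            /* room for up to 64 attributes */
} TupleHeaderSketch;

static void mark_out_of_line(TupleHeaderSketch *hdr, int attno)
{
    hdr->infomask |= HEAP_HASOUTOFLINE;
    hdr->ool_bitmap[attno / 8] |= (uint8_t) (1u << (attno % 8));
}

static int attr_is_out_of_line(const TupleHeaderSketch *hdr, int attno)
{
    if (!(hdr->infomask & HEAP_HASOUTOFLINE))
        return 0;                      /* fast path: nothing out of line */
    return (hdr->ool_bitmap[attno / 8] >> (attno % 8)) & 1;
}
```

The fast-path test on the header bit means tuples with no out-of-line attrs pay essentially nothing, just as the null bitmap is skipped when no attr is NULL.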

I think that these two changes would handle 99% of the problem.
VACUUM would still need work, but most normal access to tuples would
just work automatically, because all access to varlena fields must go
through fastgetattr().

An as-yet-unsolved issue is how to avoid memory leaks of out-of-line
values after they have been read in by fastgetattr().  However, I think
that's going to be a nasty problem with Jan's approach as well.  The
best answer might be to solve this in combination with addressing the
problem of leakage of temporary results during expression evaluation,
say by adding some kind of reference-count convention to all varlena
values.
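One possible shape for such a reference-count convention, sketched in C; nothing like this exists in the sources, so every name and signature below is invented:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Minimal sketch of a reference-count convention for varlena values
 * fetched by fastgetattr() or produced as expression temporaries.
 * All names are hypothetical, not an actual PostgreSQL API. */

typedef struct {
    int    refcount;
    size_t len;
    char   data[];                 /* value bytes follow the header (C99) */
} RefCountedDatum;

static RefCountedDatum *datum_alloc(const char *bytes, size_t len)
{
    RefCountedDatum *d = malloc(sizeof(RefCountedDatum) + len);
    d->refcount = 1;               /* creator holds the first reference */
    d->len = len;
    memcpy(d->data, bytes, len);
    return d;
}

static RefCountedDatum *datum_retain(RefCountedDatum *d)
{
    d->refcount++;
    return d;
}

/* Returns 1 if this release freed the value. */
static int datum_release(RefCountedDatum *d)
{
    if (--d->refcount == 0) {
        free(d);
        return 1;
    }
    return 0;
}
```

The point is that expression evaluation could release each input as soon as it has consumed it, so out-of-line values read in by fastgetattr() would not accumulate for the life of the query.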

BTW, I don't see any really good reason to keep the out-of-line values
in a separate physical file (relation) as Jan originally proposed.
Why not keep them in the same file, but mark them as being something
different than a normal tuple?  Sequential scans would have to know to
skip over them (big deal), and VACUUM would have to handle them
properly, but I think VACUUM is going to have to have special code to
support this feature no matter what.  If we do make them a new primitive
kind-of-a-tuple on disk, we could sidestep the problem of marking all
the out-of-line values associated with a tuple when the tuple is
outdated by a transaction.  The out-of-line values wouldn't have
transaction IDs in them at all; they'd just be labeled with the CTID
and/or OID of the primary tuple they belong to.  VACUUM would consult
that tuple to determine whether to keep or discard an out-of-line value.
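A possible on-disk header for such an item might look like the following sketch; the field set and flag value are guesses at what the scheme needs, not a worked-out design:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical header for an out-of-line value kept in the same
 * relation file as ordinary tuples.  Per the proposal it carries no
 * transaction IDs of its own -- only the identity (CTID and/or OID)
 * of the primary tuple it belongs to, which VACUUM would consult to
 * decide whether to keep or discard it.  All names are invented. */

typedef struct {
    uint32_t owner_oid;     /* OID of the owning primary tuple */
    uint32_t owner_block;   /* CTID of the owner: block number ... */
    uint16_t owner_offset;  /* ... and line-pointer offset */
    uint16_t flags;         /* distinguishes this from a normal tuple */
    uint32_t value_len;     /* length of the value bytes that follow */
} OutOfLineHeaderSketch;

enum { OOL_ITEM_FLAG = 0x0001 };       /* made-up "not a tuple" marker */

/* A sequential scan would test the flag and simply skip such items. */
static int is_out_of_line_item(const OutOfLineHeaderSketch *h)
{
    return (h->flags & OOL_ITEM_FLAG) != 0;
}
```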
        regards, tom lane

