Improving on MAX_CONVERSION_GROWTH

From:
Tom Lane
Date:
Thinking about the nearby thread[1] about overrunning MaxAllocSize
during encoding conversion, it struck me that another thing
we could usefully do to improve that situation is to be smarter
about the growth factor we assume --- the existing one-size-fits-all
choice of MAX_CONVERSION_GROWTH = 4 is leaving a lot on the table.

In particular, it seems like we could usefully frame things as
having a constant max growth factor associated with each target
encoding, stored as a new field in pg_wchar_table[].  By definition,
the max growth factor cannot be more than the maximum character
length in the target encoding.  So this approach immediately gives
us a growth factor of 1 with any single-byte output encoding,
and even many of the multibyte encodings would have max growth 2
or 3 without having to think any harder than that.
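
For illustration, the allocation site might then look roughly like the
sketch below.  pg_encoding_max_length() already exists; the helper name
and the error wording are made up for the example.

    /*
     * Sketch only: size the conversion output buffer from the destination
     * encoding's maximum character length rather than the fixed
     * MAX_CONVERSION_GROWTH.  alloc_conversion_buffer() is an illustrative
     * name, not existing code.
     */
    static char *
    alloc_conversion_buffer(int srclen, int dest_encoding)
    {
        int     max_growth = pg_encoding_max_length(dest_encoding);
        Size    dstlen = (Size) srclen * max_growth + 1;

        if (dstlen > MaxAllocSize)
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("out of memory"),
                     errdetail("String of %d bytes is too long for encoding conversion.",
                               srclen)));

        return palloc(dstlen);
    }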

But we can do better, I think, recognizing that all the supported
encodings are ASCII extensions.  The only possible way to expend
4 output bytes per input byte is if there is some 1-byte character
that translates to a 4-byte character, and I think this is not the
case for converting any of our encodings to UTF8.  If you need at
least a 2-byte character to produce a 3-byte or 4-byte UTF8 character,
then UTF8 has max growth 2.  I'm not quite sure if that's true
for every source encoding, but I'm pretty certain it couldn't be
worse than 3.

It might be worth getting a bit more complex and having a 2-D
array, indexed by both source and destination encodings, to determine
the max growth factor.  I haven't yet run tests to verify empirically
what the per-pair maxima actually are.
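
If we went that way, the lookup could be as simple as the hypothetical
sketch below; the sample entries are placeholders, not measured values.

    /*
     * Hypothetical per-(source, destination) worst-case growth table.
     * Pairs we haven't analyzed fall back to MAX_CONVERSION_GROWTH;
     * the entries shown are placeholders, not verified numbers.
     */
    static const uint8 max_growth[_PG_LAST_ENCODING_][_PG_LAST_ENCODING_] = {
        [PG_LATIN1][PG_UTF8] = 2,   /* placeholder */
        [PG_EUC_JP][PG_UTF8] = 3,   /* placeholder */
    };

    static inline int
    pg_conversion_max_growth(int src_encoding, int dest_encoding)
    {
        uint8   g = max_growth[src_encoding][dest_encoding];

        return (g != 0) ? g : MAX_CONVERSION_GROWTH;
    }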

A fly in this ointment is: could a custom encoding conversion
function violate our conclusions about the max growth
factor?  Maybe it would be worth treating the growth factor
as a property of a particular conversion (i.e., add a column
to pg_conversion) rather than making it a hard-wired property.

In any case, it seems likely that we could end up with a
multiplier of 1, 2, or 3 rather than 4 in just about every
case of practical interest.  That sure seems like a win
when converting long strings.

Thoughts?

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/20190816181418.GA898@alvherre.pgsql



Re: Improving on MAX_CONVERSION_GROWTH

From:
Robert Haas
Date:
On Tue, Sep 24, 2019 at 5:15 PM Tom Lane <tgl@sss.pgh.pa.us> wrote:
> In any case, it seems likely that we could end up with a
> multiplier of 1, 2, or 3 rather than 4 in just about every
> case of practical interest.  That sure seems like a win
> when converting long strings.

+1. From what I've seen, I'd say this is a significant practical
problem for people who are trying to store large blobs of data in the
database.

A lot of that is because they hit the 1GB allocation limit, and I
wonder whether we shouldn't be trying harder to avoid imposing that
limit in multiple places. It's reasonable - and necessary - to impose
a limit on the size of an individual datum, but when that same limit
is imposed on other things, like the worst-case size of the encoding
conversion, the size of an individual message sent via the wire
protocol, etc., you end up with a situation where users have trouble
predicting what the behavior is going to be. >=1GB definitely won't
work, but it'll probably break at some point before you even get that
far depending on a bunch of complex factors that are hard to
understand, not really documented, and mostly the result of applying
a 1GB limit to every single memory allocation across the whole backend
without really thinking about what that does to the user-visible
behavior.

Now, that's not to say we should abandon MaxAllocSize, which I agree
serves as a useful backstop. But IMHO it would be smart to start with
the desired user-facing behavior -- we want to support datums up to X
size -- and then consider how we can get there while maintaining
MaxAllocSize as a general-purpose backstop. Our current strategy seems
to be mostly the reverse: write the code the way that feels natural,
enforce MaxAllocSize everywhere, and if that breaks things for a user,
well then that means - by definition - that the user was trying to do
something we don't support.

One approach I think we should consider is, for larger strings,
actually scan the string and figure out how much memory we're going to
need for the conversion and then allocate exactly that amount (and
fail if it's >=1GB). An extra scan over the string is somewhat costly,
but allocating hundreds of megabytes of memory on the theory that we
could hypothetically have needed it is costly in a different way. Memory
is more abundant today than it's ever been, but there are still plenty
of systems where a couple of extra allocations in the multi-hundred-MB
range can make the whole thing fall over. And even if it doesn't make
the whole thing fall over, the CPU efficiency of avoiding an extra
pass over the string really ought to be compared with the memory
efficiency of allocating extra storage.  Getting down from a
worst-case multiple of 4 to 2 is a great idea, but it still means that
converting a 100MB string will allocate 200MB when what you need will
very often be between 100MB and 105MB.  That's not an insignificant
cost, even though it's much better than allocating 400MB.
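
A cheaper variant of that, which doesn't compute the exact converted
size but already tightens the bound a lot, would be a single pass that
counts ASCII bytes (which convert 1:1, since all supported server
encodings are ASCII supersets) and charges the worst-case multiplier
only to the rest.  A sketch, with an illustrative helper name:

    /*
     * Sketch only: upper bound on the converted size that applies the
     * worst-case per-character expansion only to non-ASCII input bytes.
     * estimate_converted_size() is an illustrative name, not existing code.
     */
    static Size
    estimate_converted_size(const char *src, int srclen, int dest_encoding)
    {
        Size    ascii_bytes = 0;
        int     i;

        for (i = 0; i < srclen; i++)
        {
            if (!IS_HIGHBIT_SET(src[i]))
                ascii_bytes++;
        }

        /* ASCII converts byte-for-byte; everything else gets the worst case */
        return ascii_bytes +
            ((Size) srclen - ascii_bytes) * pg_encoding_max_length(dest_encoding) + 1;
    }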

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving on MAX_CONVERSION_GROWTH

From:
Andres Freund
Date:
Hi,

On 2019-09-27 10:53:48 -0400, Robert Haas wrote:
> A lot of that is because they hit the 1GB allocation limit, and I
> wonder whether we shouldn't be trying harder to avoid imposing that
> limit in multiple places.

> It's reasonable - and necessary - to impose
> a limit on the size of an individual datum, but when that same limit
> is imposed on other things, like the worst-case size of the encoding
> conversion, the size of an individual message sent via the wire
> protocol, etc., you end up with a situation where users have trouble
> predicting what the behavior is going to be. >=1GB definitely won't
> work, but it'll probably break at some point before you even get that
> far depending on a bunch of complex factors that are hard to
> understand, not really documented, and mostly the result of applying
> a 1GB limit to every single memory allocation across the whole backend
> without really thinking about what that does to the user-visible
> behavior.

+1 - that will be a long, piecemeal project, I think... But deciding
that we should do so is a good first step.

Note that one of the additional reasons for the 1GB limit is that it
protects against int overflows. I'm somewhat unconvinced that that's a
sensible approach, but ...

I wonder if we shouldn't make stringinfos use size_t lengths, btw. Only
supporting INT32_MAX (not even UINT32_MAX) seems weird these days. But
we'd presumably have to make it opt-in.
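
Purely as a sketch of the opt-in idea (nothing of the sort exists
today), it could be a parallel struct along these lines:

    /* Hypothetical opt-in variant of StringInfoData using size_t lengths. */
    typedef struct StringInfoLargeData
    {
        char       *data;       /* buffer, always null-terminated */
        size_t      len;        /* current string length, excluding terminator */
        size_t      maxlen;     /* allocated size of data */
        size_t      cursor;     /* scan cursor, as in StringInfoData */
    } StringInfoLargeData;

    typedef StringInfoLargeData *StringInfoLarge;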


> One approach I think we should consider is, for larger strings,
> actually scan the string and figure out how much memory we're going to
> need for the conversion and then allocate exactly that amount (and
> fail if it's >=1GB). An extra scan over the string is somewhat costly,
> but allocating hundreds of megabytes of memory on the theory that we
> could hypothetically have needed it is costly in a different way.

My proposal for this is something like
https://www.postgresql.org/message-id/20190924214204.mav4md77xg5u5wq6%40alap3.anarazel.de
which should avoid the overallocation without a second pass, and
hopefully without losing much efficiency.

It's worthwhile to note that additional passes over data are often quite
expensive; memory latency hasn't shrunk that much in the last decade or
so. I have frequently seen all the memcpys from one StringInfo/char*
into another StringInfo show up in profiles.

Greetings,

Andres Freund



Re: Improving on MAX_CONVERSION_GROWTH

From:
Robert Haas
Date:
On Fri, Sep 27, 2019 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
> Note that one of the additional reasons for the 1GB limit is that it
> protects against int overflows. I'm somewhat unconvinced that that's a
> sensible approach, but ...

It's not crazy. People use 'int' rather casually, just as they use
'palloc' rather casually, without necessarily thinking about what
could go wrong at the edges. I don't have any beef with that as a
general strategy; I just think we should be trying to do better in the
cases where it negatively affects the user experience.

> It's worthwhile to note that additional passes over data are often quite
> expensive; memory latency hasn't shrunk that much in the last decade or
> so. I have frequently seen all the memcpys from one StringInfo/char*
> into another StringInfo show up in profiles.

OK.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Improving on MAX_CONVERSION_GROWTH

From:
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> On Fri, Sep 27, 2019 at 11:40 AM Andres Freund <andres@anarazel.de> wrote:
>> Note that one of the additional reasons for the 1GB limit is that it
>> protects against int overflows. I'm somewhat unconvinced that that's a
>> sensible approach, but ...

> It's not crazy. People use 'int' rather casually, just as they use
> 'palloc' rather casually, without necessarily thinking about what
> could go wrong at the edges. I don't have any beef with that as a
> general strategy; I just think we should be trying to do better in the
> cases where it negatively affects the user experience.

A small problem with doing anything very interesting here is that the
int-is-enough-for-a-string-length approach is baked into the wire
protocol (read the DataRow message format spec and weep).

We could probably bend the COPY protocol enough to support multi-gig row
values --- dropping the rule that the backend doesn't split rows across
CopyData messages wouldn't break too many clients, hopefully.  That would
at least dodge some problems in dump/restore scenarios.

In the meantime, I still think we should commit what I proposed in the
other thread (<974.1569356381@sss.pgh.pa.us>), or something close to it.
Andres' proposal would perhaps be an improvement on that, but I don't
think it'll be ready anytime soon; and for sure we wouldn't risk
back-patching it, while I think we could back-patch what I suggested.
In any case, that patch is small enough that dropping it would be no big
loss if a better solution comes along.

Also, as far as the immediate subject of this thread is concerned,
I'm inclined to get rid of MAX_CONVERSION_GROWTH in favor of using
the target encoding's max char length.  The one use (in printtup.c)
where we don't know the target encoding could use MAX_MULTIBYTE_CHAR_LEN
instead.  Being smarter than that could help in some cases (mostly,
conversion of ISO encodings to UTF8), but it's not that big a win.
(I did some checks and found that some ISO encodings could provide a
max growth of 2x, but many are max 3x, so 4x isn't that far out of
line.)  If Andres' ideas don't pan out we could come back and work
harder on this, but for now something simple and back-patchable
seems like a useful stopgap improvement.
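
In sketch form, the two sizing cases would look something like the
fragment below (illustrative only, not a patch; the condition is a
stand-in for however the caller knows the target encoding).

    /* Illustrative only: known vs. unknown target encoding. */
    if (dest_encoding_is_known)     /* stand-in condition, not real code */
        dstlen = (Size) srclen * pg_encoding_max_length(dest_encoding) + 1;
    else
        dstlen = (Size) srclen * MAX_MULTIBYTE_CHAR_LEN + 1;    /* printtup.c-style case */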

            regards, tom lane



Re: Improving on MAX_CONVERSION_GROWTH

From:
Tom Lane
Date:
I wrote:
> In the meantime, I still think we should commit what I proposed in the
> other thread (<974.1569356381@sss.pgh.pa.us>), or something close to it.
> Andres' proposal would perhaps be an improvement on that, but I don't
> think it'll be ready anytime soon; and for sure we wouldn't risk
> back-patching it, while I think we could back-patch what I suggested.
> In any case, that patch is small enough that dropping it would be no big
> loss if a better solution comes along.

Not having heard any objections, I'll proceed with that.  Andres is
welcome to work on replacing it with his more-complicated idea...

> Also, as far as the immediate subject of this thread is concerned,
> I'm inclined to get rid of MAX_CONVERSION_GROWTH in favor of using
> the target encoding's max char length.

I realized after re-reading the comment for MAX_CONVERSION_GROWTH that
this thread is based on a false premise, namely that encoding conversions
always produce one "character" out per "character" in.  In the presence of
combining characters and suchlike, that premise fails, and it becomes
quite unclear just what the max growth ratio actually is.  So I'm going
to leave that alone for now.  Maybe this point is an argument for pushing
forward with Andres' approach, but I'm still dubious about the overall
cost/benefit ratio of that concept.

            regards, tom lane



Re: Improving on MAX_CONVERSION_GROWTH

From:
Andres Freund
Date:
Hi,

On 2019-10-03 12:12:40 -0400, Tom Lane wrote:
> I wrote:
> > In the meantime, I still think we should commit what I proposed in the
> > other thread (<974.1569356381@sss.pgh.pa.us>), or something close to it.
> > Andres' proposal would perhaps be an improvement on that, but I don't
> > think it'll be ready anytime soon; and for sure we wouldn't risk
> > back-patching it, while I think we could back-patch what I suggested.
> > In any case, that patch is small enough that dropping it would be no big
> > loss if a better solution comes along.
> 
> Not having heard any objections, I'll proceed with that.  Andres is
> welcome to work on replacing it with his more-complicated idea...

Yea, what I'm proposing is clearly not backpatchable. So +1


> Maybe this point is an argument for pushing forward with Andres'
> approach, but I'm still dubious about the overall cost/benefit ratio
> of that concept.

I think if it were just for MAX_CONVERSION_GROWTH, I'd be inclined to
agree. But I think it has other advantages, so I'm mildly positive that
it'll be an overall win...

Greetings,

Andres Freund