Thread: allocation limit for encoding conversion


allocation limit for encoding conversion

From: Alvaro Herrera
Date: 2019-08-16
Somebody ran into issues when generating large XML output (upwards of
256 MB) and then sending via a connection with a different
client_encoding.  This occurs because we pessimistically allocate 4x as
much memory as the string needs, and we run into the 1GB palloc
limitation.  ISTM we can do better now by using huge allocations, as per
the preliminary attached patch (which probably needs an updated overflow
check rather than have it removed altogether); but at least it is able
to process this query, which it wasn't without the patch:

select query_to_xml(
    'select a, cash_words(a::text::money) from generate_series(0, 2000000) a',
    true, false, '');

-- 
Álvaro Herrera

Attachments

Re: allocation limit for encoding conversion

From: Alvaro Herrera
Date:
On 2019-Aug-16, Alvaro Herrera wrote:

> Somebody ran into issues when generating large XML output (upwards of
> 256 MB) and then sending via a connection with a different
> client_encoding.

ref: https://postgr.es/m/43a889a1-45fb-1d60-31ae-21e607307492@gmail.com
(pgsql-es-ayuda)

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: allocation limit for encoding conversion

From: Tom Lane
Date: 2019-08-16 17:31:49 -0400
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> Somebody ran into issues when generating large XML output (upwards of
> 256 MB) and then sending via a connection with a different
> client_encoding.  This occurs because we pessimistically allocate 4x as
> much memory as the string needs, and we run into the 1GB palloc
> limitation.  ISTM we can do better now by using huge allocations, as per
> the preliminary attached patch (which probably needs an updated overflow
> check rather than have it removed altogether); but at least it is able
> to process this query, which it wasn't without the patch:

> select query_to_xml(
>     'select a, cash_words(a::text::money) from generate_series(0, 2000000) a',
>     true, false, '');

I fear that allowing pg_do_encoding_conversion to return strings longer
than 1GB is just going to create failure cases somewhere else.

However, it's certainly true that 4x growth is a pretty unlikely worst
case.  Maybe we could do something like

1. If string is short (say up to a few megabytes), continue to do it
like now.  This avoids adding overhead for typical cases.

2. Otherwise, run some lobotomized form of encoding conversion that
just computes the space required (as an int64, I guess) without saving
the result anywhere.

3. If space required > 1GB, fail.

4. Otherwise, allocate just the space required, and convert.
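
In pseudo-C, that would look roughly like this (hypothetical names:
pg_encoding_conversion_length stands in for the counting-only pass of
step 2, and SHORT_INPUT_LIMIT for the cutoff of step 1; neither exists):

    unsigned char *
    convert_checked(unsigned char *src, int len,
                    int src_encoding, int dest_encoding)
    {
        unsigned char *result;
        int64       needed;

        /* 1. short input: the pessimistic 4x allocation is harmless */
        if (len < SHORT_INPUT_LIMIT)    /* say, a few MB */
        {
            result = palloc((Size) len * MAX_CONVERSION_GROWTH + 1);
            /* ... convert into result as today ... */
            return result;
        }

        /* 2. counting-only pass: write nothing, just tally output bytes */
        needed = pg_encoding_conversion_length(src, len,
                                               src_encoding, dest_encoding);

        /* 3. refuse results exceeding the 1GB palloc limit */
        if (needed >= MaxAllocSize)
            ereport(ERROR,
                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                     errmsg("out of memory"),
                     errdetail("String of %d bytes is too long for encoding conversion.",
                               len)));

        /* 4. allocate exactly what's needed and convert for real */
        result = palloc((Size) needed + 1);
        /* ... convert into result as today ... */
        return result;
    }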

            regards, tom lane



Re: allocation limit for encoding conversion

From: Andres Freund
Date:
Hi,

On 2019-08-16 17:31:49 -0400, Tom Lane wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> > Somebody ran into issues when generating large XML output (upwards of
> > 256 MB) and then sending via a connection with a different
> > client_encoding.  This occurs because we pessimistically allocate 4x as
> > much memory as the string needs, and we run into the 1GB palloc
> > limitation.  ISTM we can do better now by using huge allocations, as per
> > the preliminary attached patch (which probably needs an updated overflow
> > check rather than have it removed altogether); but at least it is able
> > to process this query, which it wasn't without the patch:
> 
> > select query_to_xml(
> >     'select a, cash_words(a::text::money) from generate_series(0, 2000000) a',
> >     true, false, '');
> 
> I fear that allowing pg_do_encoding_conversion to return strings longer
> than 1GB is just going to create failure cases somewhere else.
> 
> However, it's certainly true that 4x growth is a pretty unlikely worst
> case.  Maybe we could do something like
> 
> 1. If string is short (say up to a few megabytes), continue to do it
> like now.  This avoids adding overhead for typical cases.
> 
> 2. Otherwise, run some lobotomized form of encoding conversion that
> just computes the space required (as an int64, I guess) without saving
> the result anywhere.
> 
> 3. If space required > 1GB, fail.
> 
> 4. Otherwise, allocate just the space required, and convert.

It's probably too big a hammer for this specific case, but I think at
some point we ought to stop using fixed size allocations for this kind
of work. Instead we should use something roughly like our StringInfo,
except that when exceeding the current size limit, the overflowing data
is stored in a separate allocation. And only once we actually need the
data in a consecutive form, we allocate memory that's large enough to
store all the separate allocations in their entirety.
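
In rough terms, such a buffer might look like this (a hypothetical
sketch; none of these names exist anywhere):

    typedef struct ChunkedStringInfo
    {
        StringInfoData cur;         /* chunk currently being filled */
        List       *full_chunks;    /* earlier chunks, each a StringInfo */
        Size        total_len;      /* combined length of all chunks */
    } ChunkedStringInfo;

    /* flatten into one consecutive buffer only when a caller demands it */
    static char *
    chunked_data(ChunkedStringInfo *buf)
    {
        char       *result = palloc(buf->total_len + 1);
        char       *ptr = result;
        ListCell   *lc;

        foreach(lc, buf->full_chunks)
        {
            StringInfo  chunk = (StringInfo) lfirst(lc);

            memcpy(ptr, chunk->data, chunk->len);
            ptr += chunk->len;
        }
        memcpy(ptr, buf->cur.data, buf->cur.len);
        ptr[buf->cur.len] = '\0';
        return result;
    }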

Greetings,

Andres Freund



Re: allocation limit for encoding conversion

From: Tom Lane
Date: 2019-09-24 16:19:41 -0400
Andres Freund <andres@anarazel.de> writes:
> On 2019-08-16 17:31:49 -0400, Tom Lane wrote:
>> I fear that allowing pg_do_encoding_conversion to return strings longer
>> than 1GB is just going to create failure cases somewhere else.
>> 
>> However, it's certainly true that 4x growth is a pretty unlikely worst
>> case.  Maybe we could do something like
>> 1. If string is short (say up to a few megabytes), continue to do it
>> like now.  This avoids adding overhead for typical cases.
>> 2. Otherwise, run some lobotomized form of encoding conversion that
>> just computes the space required (as an int64, I guess) without saving
>> the result anywhere.
>> 3. If space required > 1GB, fail.
>> 4. Otherwise, allocate just the space required, and convert.

> It's probably too big a hammer for this specific case, but I think at
> some point we ought to stop using fixed size allocations for this kind
> of work. Instead we should use something roughly like our StringInfo,
> except that when exceeding the current size limit, the overflowing data
> is stored in a separate allocation. And only once we actually need the
> data in a consecutive form, we allocate memory that's large enough to
> store all the separate allocations in their entirety.

That sounds pretty messy :-(.

I spent some time looking at what I proposed above, and concluded that
it's probably impractical.  In the first place, we'd have to change
the API spec for encoding conversion functions.  Now maybe that would
not be a huge deal, because there likely aren't very many people outside
the core code who are defining their own conversion functions, but it's
still a negative.  More importantly, unless we wanted to duplicate
large swaths of code, we'd end up inserting changes about like this
into the inner loops of encoding conversions:

-        *dest++ = code;
+        if (dest)
+            *dest++ = code;
+        outcount++;

which seems like it'd be bad for performance.

So I now think that Alvaro's got basically the right idea, except
that I'm still afraid to allow strings larger than MaxAllocSize
to run around loose in the backend.  So in addition to the change
he suggested, we need a final check on strlen(result) not being
too large.  We can avoid doing a useless strlen() if the input len
is small, though.

It then occurred to me that we could also repalloc the output buffer
down to just the required size, which is pointless if it's small
but not if we can give back several hundred MB.  This is conveniently
mergeable with the check to see whether we need to check strlen or not.

... or at least, that's what I thought we could do.  Testing showed
me that AllocSetRealloc never actually gives back any space, even
when it's just acting as a frontend for a direct malloc.  However,
we can fix that for little more than the price of swapping the order
of the is-it-a-decrease and is-it-a-large-chunk stanzas, as in the
0002 patch below.
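
In miniature, the intended effect is (hypothetical usage, sizes
illustrative only):

    char   *buf = palloc(400 * 1024 * 1024);    /* big single-chunk block */

    /* ... discover that only ~1MB of it was really needed ... */
    buf = repalloc(buf, 1024 * 1024);

With the 0002 patch, that repalloc passes the decrease through to
realloc(), so the surplus is actually released; without it, the block
keeps its full 400MB until the memory context is reset.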

I also put back the missing overflow check --- although that's unreachable
on a 64-bit machine, it's quite reachable on 32-bit.  The patch is still
useful on 32-bit though, since it still doubles the size of string
we can cope with.

I think this is committable, though surely another pair of eyeballs
on it wouldn't hurt.  Also, is it worth having a different error
message for the case where the output does exceed MaxAllocSize?

            regards, tom lane

diff --git a/src/backend/utils/mb/mbutils.c b/src/backend/utils/mb/mbutils.c
index bec54bb..6b08b77 100644
--- a/src/backend/utils/mb/mbutils.c
+++ b/src/backend/utils/mb/mbutils.c
@@ -349,16 +349,24 @@ pg_do_encoding_conversion(unsigned char *src, int len,
                         pg_encoding_to_char(dest_encoding))));

     /*
-     * Allocate space for conversion result, being wary of integer overflow
+     * Allocate space for conversion result, being wary of integer overflow.
+     *
+     * len * MAX_CONVERSION_GROWTH is typically a vast overestimate of the
+     * required space, so it might exceed MaxAllocSize even though the result
+     * would actually fit.  We do not want to hand back a result string that
+     * exceeds MaxAllocSize, because callers might not cope gracefully --- but
+     * if we just allocate more than that, and don't use it, that's fine.
      */
-    if ((Size) len >= (MaxAllocSize / (Size) MAX_CONVERSION_GROWTH))
+    if ((Size) len >= (MaxAllocHugeSize / (Size) MAX_CONVERSION_GROWTH))
         ereport(ERROR,
                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                  errmsg("out of memory"),
                  errdetail("String of %d bytes is too long for encoding conversion.",
                            len)));

-    result = palloc(len * MAX_CONVERSION_GROWTH + 1);
+    result = (unsigned char *)
+        MemoryContextAllocHuge(CurrentMemoryContext,
+                               (Size) len * MAX_CONVERSION_GROWTH + 1);

     OidFunctionCall5(proc,
                      Int32GetDatum(src_encoding),
@@ -366,6 +374,26 @@ pg_do_encoding_conversion(unsigned char *src, int len,
                      CStringGetDatum(src),
                      CStringGetDatum(result),
                      Int32GetDatum(len));
+
+    /*
+     * If the result is large, it's worth repalloc'ing to release any extra
+     * space we asked for.  The cutoff here is somewhat arbitrary, but we
+     * *must* check when len * MAX_CONVERSION_GROWTH exceeds MaxAllocSize.
+     */
+    if (len > 1000000)
+    {
+        Size        resultlen = strlen((char *) result);
+
+        if (resultlen >= MaxAllocSize)
+            ereport(ERROR,
+                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                     errmsg("out of memory"),
+                     errdetail("String of %d bytes is too long for encoding conversion.",
+                               len)));
+
+        result = (unsigned char *) repalloc(result, resultlen + 1);
+    }
+
     return result;
 }

@@ -682,16 +710,19 @@ perform_default_encoding_conversion(const char *src, int len,
         return unconstify(char *, src);

     /*
-     * Allocate space for conversion result, being wary of integer overflow
+     * Allocate space for conversion result, being wary of integer overflow.
+     * See comments in pg_do_encoding_conversion.
      */
-    if ((Size) len >= (MaxAllocSize / (Size) MAX_CONVERSION_GROWTH))
+    if ((Size) len >= (MaxAllocHugeSize / (Size) MAX_CONVERSION_GROWTH))
         ereport(ERROR,
                 (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
                  errmsg("out of memory"),
                  errdetail("String of %d bytes is too long for encoding conversion.",
                            len)));

-    result = palloc(len * MAX_CONVERSION_GROWTH + 1);
+    result = (char *)
+        MemoryContextAllocHuge(CurrentMemoryContext,
+                               (Size) len * MAX_CONVERSION_GROWTH + 1);

     FunctionCall5(flinfo,
                   Int32GetDatum(src_encoding),
@@ -699,6 +730,25 @@ perform_default_encoding_conversion(const char *src, int len,
                   CStringGetDatum(src),
                   CStringGetDatum(result),
                   Int32GetDatum(len));
+
+    /*
+     * Release extra space if there might be a lot --- see comments in
+     * pg_do_encoding_conversion.
+     */
+    if (len > 1000000)
+    {
+        Size        resultlen = strlen(result);
+
+        if (resultlen >= MaxAllocSize)
+            ereport(ERROR,
+                    (errcode(ERRCODE_PROGRAM_LIMIT_EXCEEDED),
+                     errmsg("out of memory"),
+                     errdetail("String of %d bytes is too long for encoding conversion.",
+                               len)));
+
+        result = (char *) repalloc(result, resultlen + 1);
+    }
+
     return result;
 }

diff --git a/src/backend/utils/mmgr/aset.c b/src/backend/utils/mmgr/aset.c
index 6b63d6f..5e964ea 100644
--- a/src/backend/utils/mmgr/aset.c
+++ b/src/backend/utils/mmgr/aset.c
@@ -1084,62 +1084,12 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
                  set->header.name, chunk);
 #endif

-    /*
-     * Chunk sizes are aligned to power of 2 in AllocSetAlloc(). Maybe the
-     * allocated area already is >= the new size.  (In particular, we always
-     * fall out here if the requested size is a decrease.)
-     */
-    if (oldsize >= size)
-    {
-#ifdef MEMORY_CONTEXT_CHECKING
-        Size        oldrequest = chunk->requested_size;
-
-#ifdef RANDOMIZE_ALLOCATED_MEMORY
-        /* We can only fill the extra space if we know the prior request */
-        if (size > oldrequest)
-            randomize_mem((char *) pointer + oldrequest,
-                          size - oldrequest);
-#endif
-
-        chunk->requested_size = size;
-
-        /*
-         * If this is an increase, mark any newly-available part UNDEFINED.
-         * Otherwise, mark the obsolete part NOACCESS.
-         */
-        if (size > oldrequest)
-            VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + oldrequest,
-                                        size - oldrequest);
-        else
-            VALGRIND_MAKE_MEM_NOACCESS((char *) pointer + size,
-                                       oldsize - size);
-
-        /* set mark to catch clobber of "unused" space */
-        if (size < oldsize)
-            set_sentinel(pointer, size);
-#else                            /* !MEMORY_CONTEXT_CHECKING */
-
-        /*
-         * We don't have the information to determine whether we're growing
-         * the old request or shrinking it, so we conservatively mark the
-         * entire new allocation DEFINED.
-         */
-        VALGRIND_MAKE_MEM_NOACCESS(pointer, oldsize);
-        VALGRIND_MAKE_MEM_DEFINED(pointer, size);
-#endif
-
-        /* Disallow external access to private part of chunk header. */
-        VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
-
-        return pointer;
-    }
-
     if (oldsize > set->allocChunkLimit)
     {
         /*
          * The chunk must have been allocated as a single-chunk block.  Use
-         * realloc() to make the containing block bigger with minimum space
-         * wastage.
+         * realloc() to make the containing block bigger, or smaller, with
+         * minimum space wastage.
          */
         AllocBlock    block = (AllocBlock) (((char *) chunk) - ALLOC_BLOCKHDRSZ);
         Size        chksize;
@@ -1153,11 +1103,19 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
         if (block->aset != set ||
             block->freeptr != block->endptr ||
             block->freeptr != ((char *) block) +
-            (chunk->size + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
+            (oldsize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ))
             elog(ERROR, "could not find block containing chunk %p", chunk);

+        /*
+         * Even if the new request is less than set->allocChunkLimit, we stick
+         * with the single-chunk block approach.  Therefore we need
+         * chunk->size to be bigger than set->allocChunkLimit, so we don't get
+         * confused about the chunk's status.
+         */
+        chksize = Max(size, set->allocChunkLimit + 1);
+        chksize = MAXALIGN(chksize);
+
         /* Do the realloc */
-        chksize = MAXALIGN(size);
         blksize = chksize + ALLOC_BLOCKHDRSZ + ALLOC_CHUNKHDRSZ;
         block = (AllocBlock) realloc(block, blksize);
         if (block == NULL)
@@ -1182,17 +1140,19 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)
 #ifdef MEMORY_CONTEXT_CHECKING
 #ifdef RANDOMIZE_ALLOCATED_MEMORY
         /* We can only fill the extra space if we know the prior request */
-        randomize_mem((char *) pointer + chunk->requested_size,
-                      size - chunk->requested_size);
+        if (size > chunk->requested_size)
+            randomize_mem((char *) pointer + chunk->requested_size,
+                          size - chunk->requested_size);
 #endif

         /*
-         * realloc() (or randomize_mem()) will have left the newly-allocated
+         * realloc() (or randomize_mem()) will have left any newly-allocated
          * part UNDEFINED, but we may need to adjust trailing bytes from the
          * old allocation.
          */
-        VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + chunk->requested_size,
-                                    oldsize - chunk->requested_size);
+        if (oldsize > chunk->requested_size)
+            VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + chunk->requested_size,
+                                        oldsize - chunk->requested_size);

         chunk->requested_size = size;

@@ -1217,6 +1177,56 @@ AllocSetRealloc(MemoryContext context, void *pointer, Size size)

         return pointer;
     }
+
+    /*
+     * Chunk sizes are aligned to power of 2 in AllocSetAlloc(). Maybe the
+     * allocated area already is >= the new size.  (In particular, we will
+     * fall out here if the requested size is a decrease.)
+     */
+    if (oldsize >= size)
+    {
+#ifdef MEMORY_CONTEXT_CHECKING
+        Size        oldrequest = chunk->requested_size;
+
+#ifdef RANDOMIZE_ALLOCATED_MEMORY
+        /* We can only fill the extra space if we know the prior request */
+        if (size > oldrequest)
+            randomize_mem((char *) pointer + oldrequest,
+                          size - oldrequest);
+#endif
+
+        chunk->requested_size = size;
+
+        /*
+         * If this is an increase, mark any newly-available part UNDEFINED.
+         * Otherwise, mark the obsolete part NOACCESS.
+         */
+        if (size > oldrequest)
+            VALGRIND_MAKE_MEM_UNDEFINED((char *) pointer + oldrequest,
+                                        size - oldrequest);
+        else
+            VALGRIND_MAKE_MEM_NOACCESS((char *) pointer + size,
+                                       oldsize - size);
+
+        /* set mark to catch clobber of "unused" space */
+        if (size < oldsize)
+            set_sentinel(pointer, size);
+#else                            /* !MEMORY_CONTEXT_CHECKING */
+
+        /*
+         * We don't have the information to determine whether we're growing
+         * the old request or shrinking it, so we conservatively mark the
+         * entire new allocation DEFINED.
+         */
+        VALGRIND_MAKE_MEM_NOACCESS(pointer, oldsize);
+        VALGRIND_MAKE_MEM_DEFINED(pointer, size);
+#endif
+
+        /* Disallow external access to private part of chunk header. */
+        VALGRIND_MAKE_MEM_NOACCESS(chunk, ALLOCCHUNK_PRIVATE_LEN);
+
+        return pointer;
+    }
     else
     {
         /*

Re: allocation limit for encoding conversion

From: Andres Freund
Date:
On 2019-09-24 16:19:41 -0400, Tom Lane wrote:
> Andres Freund <andres@anarazel.de> writes:
> > On 2019-08-16 17:31:49 -0400, Tom Lane wrote:
> >> I fear that allowing pg_do_encoding_conversion to return strings longer
> >> than 1GB is just going to create failure cases somewhere else.
> >>
> >> However, it's certainly true that 4x growth is a pretty unlikely worst
> >> case.  Maybe we could do something like
> >> 1. If string is short (say up to a few megabytes), continue to do it
> >> like now.  This avoids adding overhead for typical cases.
> >> 2. Otherwise, run some lobotomized form of encoding conversion that
> >> just computes the space required (as an int64, I guess) without saving
> >> the result anywhere.
> >> 3. If space required > 1GB, fail.
> >> 4. Otherwise, allocate just the space required, and convert.
>
> > It's probably too big a hammer for this specific case, but I think at
> > some point we ought to stop using fixed size allocations for this kind
> > of work. Instead we should use something roughly like our StringInfo,
> > except that when exceeding the current size limit, the overflowing data
> > is stored in a separate allocation. And only once we actually need the
> > data in a consecutive form, we allocate memory that's large enough to
> > store all the separate allocations in their entirety.
>
> That sounds pretty messy :-(.

I don't think it's too bad - except for not allowing the .data member of
such a 'chunked' StringInfo to be directly accessible, it'd be just
about the same interface as the current stringinfo. Combined perhaps
with a helper or two for "de-chunking" directly into another stringinfo,
network buffer etc, to avoid the unnecessary allocation of a buffer of
the overall size when the result is just going to be memcpy'd elsewhere.

Obviously a good first step would just to pass a StringInfo to the
encoding routines. That'd solve the need for pessimistic overallocation,
because the buffer can be enlarged. And by sizing the initial allocation
to the input string (or at least only a small factor above), we'd not
usually need (many) reallocations.

That'd also remove the unnecessary strlen/memcpy done at many encoding
conversion callsites, e.g.:

    p = pg_server_to_client(str, slen);
    if (p != str)                /* actual conversion has been done? */
    {
        slen = strlen(p);
        appendBinaryStringInfo(buf, p, slen);
        pfree(p);
    }

which do show up in profiles.
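
With a StringInfo-taking entry point, that whole dance could collapse to
something like (hypothetical function name):

    /*
     * Appends the converted string straight into buf, enlarging it as
     * needed: no temporary result, no strlen(), no memcpy, no pfree().
     */
    pg_server_to_client_into(buf, str, slen);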


> I spent some time looking at what I proposed above, and concluded that
> it's probably impractical.  In the first place, we'd have to change
> the API spec for encoding conversion functions.  Now maybe that would
> not be a huge deal, because there likely aren't very many people outside
> the core code who are defining their own conversion functions, but it's
> still a negative.  More importantly, unless we wanted to duplicate
> large swaths of code, we'd end up inserting changes about like this
> into the inner loops of encoding conversions:
>
> -        *dest++ = code;
> +        if (dest)
> +            *dest++ = code;
> +        outcount++;
> which seems like it'd be bad for performance.

One thing this made me wonder is if we shouldn't check the size of the
output string explicitly, rather than relying on overallocation. The
only reason we need an allocation bigger than MaxAllocSize here is that
we don't pass the output buffer size to the conversion routines. If
those routines instead checked whether the output buffer size is
exceeded, and returned with the number of converted input bytes *and*
the position in the output buffer, we wouldn't have to overallocate
quite so much.
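
Such an interface might look roughly like this (hypothetical; not the
actual conversion-function signature):

    typedef struct ConversionResult
    {
        int         src_consumed;   /* input bytes actually converted */
        int         dest_written;   /* output bytes produced */
    } ConversionResult;

    /*
     * Converts until the input is exhausted or fewer than
     * MAX_CONVERSION_GROWTH bytes remain free in dest; the caller checks
     * src_consumed, grows dest if necessary, and calls again.
     */
    extern ConversionResult pg_convert_bounded(const unsigned char *src,
                                               int srclen,
                                               unsigned char *dest,
                                               int destlen,
                                               int src_encoding,
                                               int dest_encoding);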

But I suspect using a StringInfo, as suggested above, would be better.

To avoid making the tight innermost loop more expensive, I'd perhaps
code it roughly like:



    /*
     * Initially size output buffer to the likely required length, to
     * avoid unnecessary reallocations while growing.
     */
    enlargeStringInfo(output, input_len * ESTIMATED_GROWTH_FACTOR);

    /*
     * Process input in chunks, to reduce overhead of maintaining output buffer
     * for each processed input char. Increasing the buffer size too much will
     * lead to memory being wasted due to the necessary over-allocation.
     */
    #define CHUNK_SIZE 128
    remaining_bytes = input_len;
    while (remaining_bytes > 0)
    {
        local_len = Min(remaining_bytes, CHUNK_SIZE);

        /* ensure we have output buffer space for this chunk */
        enlargeStringInfo(output, MAX_CONVERSION_GROWTH * local_len);

        /* growing the stringinfo may invalidate previous dest */
        dest = output->data + output->len;

        while (local_len > 0)
        {
                /* current conversion logic, barely any slower */
        }

        remaining_bytes -= local_len;
        output->len = dest - output->data;
    }

    Assert(remaining_bytes == 0);


And to avoid duplicating this code all over I think we could package it
in a inline function with a per-char callback. Just about every useful
compiler ought to be able to remove those levels of indirection.

So a concrete conversion routine might look like:

static inline void
iso8859_1_to_utf8_char(ConversionState *state)
{
    unsigned short c = *state->src;

    if (c == 0)
        report_invalid_encoding(PG_LATIN1, (const char *) state->src, state->len);
    if (!IS_HIGHBIT_SET(c))
        *state->dest++ = c;
    else
    {
        *state->dest++ = (c >> 6) | 0xc0;
        *state->dest++ = (c & 0x003f) | HIGHBIT;
    }
    state->src++;
    state->len--;
}

Datum
iso8859_1_to_utf8(PG_FUNCTION_ARGS)
{
    ConversionState state = {
        .conservative_growth_factor = 1.05,
        .max_perchar_overhead = 2,
        .src = (unsigned char *) PG_GETARG_CSTRING(2),
        .dest = (StringInfo *) PG_GETARG_POINTER(3),
        .len = PG_GETARG_INT32(4),
        };

    return encoding_conv_helper(&state, iso8859_1_to_utf8_char);
}

where encoding_conv_helper is a static inline function that does the
chunking described above.

There's probably some added complexity around making sure that the
chunking deals properly with multi-byte encodings, but that
seems solvable.


> It then occurred to me that we could also repalloc the output buffer
> down to just the required size, which is pointless if it's small
> but not if we can give back several hundred MB.  This is conveniently
> mergeable with the check to see whether we need to check strlen or not.
>
> ... or at least, that's what I thought we could do.  Testing showed
> me that AllocSetRealloc never actually gives back any space, even
> when it's just acting as a frontend for a direct malloc.  However,
> we can fix that for little more than the price of swapping the order
> of the is-it-a-decrease and is-it-a-large-chunk stanzas, as in the
> 0002 patch below.

That seems generally like a good idea.


Greetings,

Andres Freund



Re: allocation limit for encoding conversion

From: Tom Lane
Date: September 24, 2019 3:09:28 PM PDT
Andres Freund <andres@anarazel.de> writes:
> On 2019-09-24 16:19:41 -0400, Tom Lane wrote:
>> Andres Freund <andres@anarazel.de> writes:
>>> It's probably too big a hammer for this specific case, but I think at
>>> some point we ought to stop using fixed size allocations for this kind
>>> of work. Instead we should use something roughly like our StringInfo,
>>> except that when exceeding the current size limit, the overflowing data
>>> is stored in a separate allocation. And only once we actually need the
>>> data in a consecutive form, we allocate memory that's large enough to
>>> store all the separate allocations in their entirety.

>> That sounds pretty messy :-(.

> I don't think it's too bad - except for not allowing the .data member of
> such a 'chunked' StringInfo to be directly accessible, it'd be just
> about the same interface as the current stringinfo.

I dunno.  What you're describing would be a whole lotta work, and it'd
break a user-visible API, and no amount of finagling is going to prevent
it from making conversions somewhat slower, and the cases where it matters
to not preallocate a surely-large-enough buffer are really few and far
between.  I have to think that we have better ways to spend our time.

            regards, tom lane



Re: allocation limit for encoding conversion

From: Andres Freund
Date:
Hi,

On September 24, 2019 3:09:28 PM PDT, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>Andres Freund <andres@anarazel.de> writes:
>> On 2019-09-24 16:19:41 -0400, Tom Lane wrote:
>>> Andres Freund <andres@anarazel.de> writes:
>>>> It's probably too big a hammer for this specific case, but I think at
>>>> some point we ought to stop using fixed size allocations for this kind
>>>> of work. Instead we should use something roughly like our StringInfo,
>>>> except that when exceeding the current size limit, the overflowing data
>>>> is stored in a separate allocation. And only once we actually need the
>>>> data in a consecutive form, we allocate memory that's large enough to
>>>> store all the separate allocations in their entirety.
>
>>> That sounds pretty messy :-(.
>
>> I don't think it's too bad - except for not allowing the .data member of
>> such a 'chunked' StringInfo to be directly accessible, it'd be just
>> about the same interface as the current stringinfo.
>
>I dunno.  What you're describing would be a whole lotta work, and it'd
>break a user-visible API, and no amount of finagling is going to prevent
>it from making conversions somewhat slower, and the cases where it matters
>to not preallocate a surely-large-enough buffer are really few and far
>between.  I have to think that we have better ways to spend our time.

It'd not just avoid the overallocation, but also the strlen and memcpy
afterwards at the callsites, as well as the separate allocation. So I'd
bet it'd almost always be a win.

Andres

--
Sent from my Android device with K-9 Mail. Please excuse my brevity.



Re: allocation limit for encoding conversion

From: Tom Lane
Date: 2019-10-03
I wrote:
> [ v2-0001-cope-with-large-encoding-conversions.patch ]
> [ v2-0002-actually-recover-space-in-repalloc.patch ]

I've pushed these patches (after some more review and cosmetic
adjustments) and marked the CF entry closed.  Andres is welcome
to see if he can improve the situation further.

            regards, tom lane



Re: allocation limit for encoding conversion

From: Alvaro Herrera
Date:
On 2019-Oct-03, Tom Lane wrote:

> I wrote:
> > [ v2-0001-cope-with-large-encoding-conversions.patch ]
> > [ v2-0002-actually-recover-space-in-repalloc.patch ]
> 
> I've pushed these patches (after some more review and cosmetic
> adjustments) and marked the CF entry closed.  Andres is welcome
> to see if he can improve the situation further.

Many thanks!

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services