Discussion: BUG #17145: Invalid memory alloc request size, when searching in text column with big content > 250MB


BUG #17145: Invalid memory alloc request size, when searching in text column with big content > 250MB

From:
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      17145
Logged by:          James Inform
Email address:      james.inform@pharmapp.de
PostgreSQL version: 13.4
Operating system:   Ubuntu 18.04 LTS / MacOS 10.15.7
Description:

Hi,

While importing some log data into a PostgreSQL table, I came across this
issue.

When I search a text field holding more than 250 MB of text in a WHERE
clause, PostgreSQL runs out of memory for the request.

You can reproduce this behaviour with the following SQL statement:

with
    q1 as 
    (
                -- 260*1024*1024 = 260MB
        select repeat('x',260*1024*1024) as mydata
    )
select count(*) from q1 where lower(mydata) like '%a%'
;

ERROR:  invalid memory alloc request size 1090519044

Using just a 250 MB string:

with
    q1 as 
    (
                -- 250*1024*1024 = 250MB
        select repeat('x',250*1024*1024) as mydata
    )
select count(*) from q1 where lower(mydata) like '%a%'
;

Everything is fine!

The alloc request size seems to be 4 times the length of the text field.
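
Indeed, checking the numbers confirms the factor of 4 (my own arithmetic, not
output from the original report):

select 4 * 260*1024*1024;  -- 1090519040, within 4 bytes of the reported 1090519044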

The issue is also reproducible when using a table with a text column and
filling it with a string like above.

create table mytest as
 with
    q1 as 
    (
                -- 260*1024*1024 = 260MB
        select repeat('x',260*1024*1024) as mydata
    )
 select * from q1;

select count(*) from mytest where lower(mydata) like '%a%'
;
ERROR:  invalid memory alloc request size 1090519044

Also, creating an index, e.g.

> create extension pg_trgm;
> create index on mytest using gin(lower(mydata) gin_trgm_ops);

gives the same error.

Is this expected behaviour?
Why is PostgreSQL allocating 4 times the column's memory when doing a
search like the one above?

It seems that, currently, nobody can use a text field with more than 256 MB
of content with a function-based index, nor search for a record using such a
field in a WHERE clause.

If the behaviour of using 4 times the memory is due to performance
considerations, then maybe the underlying algorithm should be divided into an
efficient one for < 256 MB and a less efficient, but error-free, one for
>= 256 MB.

If we cannot change the behaviour, we should state this in the
documentation, saying that you can store up to 1 GB of string data in a
text column, but only up to 256 MB if you want to use the column in more
complex WHERE clauses.

Cheers, 
James


Re: BUG #17145: Invalid memory alloc request size, when searching in text column with big content > 250MB

From:
Julien Rouhaud
Date:
On Mon, Aug 16, 2021 at 4:46 PM PG Bug reporting form
<noreply@postgresql.org> wrote:
>
> When I search a text field holding more than 250 MB of text in a WHERE
> clause, PostgreSQL runs out of memory for the request.
>
> You can reproduce this behaviour with the following SQL statement:
>
> with
>         q1 as
>         (
>                 -- 260*1024*1024 = 260MB
>                 select repeat('x',260*1024*1024) as mydata
>         )
> select count(*) from q1 where lower(mydata) like '%a%'
> ;
>
> ERROR:  invalid memory alloc request size 1090519044
>
> Using just a 250MB string:
>
> with
>         q1 as
>         (
>                 -- 250*1024*1024 = 250MB
>                 select repeat('x',250*1024*1024) as mydata
>         )
> select count(*) from q1 where lower(mydata) like '%a%'
> ;
>
> Everything is fine!
>
> The alloc request size seems to be 4 times the length of the text field.
> [...]
> Is this expected behaviour?
> Why is PostgreSQL allocating 4 times the column's memory when doing a
> search like the one above?

This is unfortunately the expected behavior, assuming that you're not
dealing with the C/POSIX encoding.

This is because in a multibyte encoding each character can occupy up to
4 bytes.  Postgres needs to allocate a single chunk of memory to hold the
resulting text, and it has no way to know how many multibyte characters are
present in the input string or how many characters will have a different
size when down-cased, so it has to allocate the maximum size that may be
needed, which is 4 times the size of the input string.  And there's a strict
1 GB limit on the size of a single field, thus the ~256 MB limit.
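
To make the accounting concrete, here is a quick sketch of the arithmetic
(my addition, not part of the original reply; it assumes a multibyte server
encoding such as UTF8, and the ~1 GB per-allocation limit of 1073741823
bytes):

show server_encoding;            -- e.g. UTF8: up to 4 bytes per character
select 4 * (260*1024*1024 + 1);  -- 1090519044: the exact size in the error
select 1073741823 / 4;           -- 268435455: roughly the 256 MB character ceiling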



Re: BUG #17145: Invalid memory alloc request size, when searching in text column with big content > 250MB

From:
James Inform
Date:
Well, I understand. But in my scenario I am using ASCII characters, which
take just one byte in UTF-8, so the system should be capable of handling
that. Maybe a different approach could be implemented for text column
content > 256 MB: something like a "streaming" read, where memory is not
allocated at a fixed 4 bytes per character but based on the true length of
the characters found in the text column, or returned by a function like
lower().

But as a user I would expect to be able to store as many characters as
possible in a text field without getting errors of any kind.

The following example gives an error although the amount of text easily
fits into the column, because 'x' is ASCII and takes 1 byte:

create temp table mytest(mydata text);
create index on mytest(lower(mydata));
insert into mytest
select repeat('x',300*1024*1024) as mydata;

Without the index I can easily store 800 MB in the text column:
select repeat('x',800*1024*1024) as mydata;
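
For instance, a minimal sketch (my own completion of the example; mytest2 is
a hypothetical scratch table with no expression index on it):

create temp table mytest2(mydata text);
insert into mytest2 select repeat('x',800*1024*1024);  -- succeeds: 1 byte per 'x'
select length(mydata) from mytest2;                    -- 838860800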

So from what I see we should at least extend the documentation and tell
people that although a text field can hold up to 1 GB of string data, only
256 MB can safely be used if one wants to use function-based indexes or
functions on the column inside a WHERE clause.

In short, anyone using text columns in more than a basic way should simply
not store more than 256*1024*1024 characters in a text column.
