Обсуждение: Counting the occurences of a substring within a very large text

Поиск
Список
Период
Сортировка

Counting the occurences of a substring within a very large text

От
Marc Mamin
Дата:
Hello,

I'd like to count the number  linebreaks within a string,
but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Any idea for an alternative to this problem  ?

select count(*)-1 from
(  select regexp_split_to_table(full_message,'(\n)', 'g')
   from mytable
   where id =-2146999703
)foo;

ERROR:  invalid memory alloc request size 1447215584

regards,

Marc Mamin


Re: Counting the occurences of a substring within a very large text

От
Chris Mair
Дата:
> Hello,
>
> I'd like to count the number  linebreaks within a string,
> but I get a memory allocation error when using regexp_matches or regexp_split_to_table.
>
> Any idea for an alternative to this problem  ?
>
> select count(*)-1 from
> (  select regexp_split_to_table(full_message,'(\n)', 'g')
>    from mytable
>    where id =-2146999703
> )foo;
>
> ERROR:  invalid memory alloc request size 1447215584
>
> regards,
>
> Marc Mamin
>

Hi,

what's the size of full_message from mytable where id =-2146999703?

Also: Postgres version? OS? any extensions installed?

Bye,
Chris.




Re: Counting the occurences of a substring within a very large text

От
Albe Laurenz
Дата:
Marc Mamin wrote:
> I'd like to count the number  linebreaks within a string,
> but I get a memory allocation error when using regexp_matches or regexp_split_to_table.
> 
> Any idea for an alternative to this problem  ?
> 
> select count(*)-1 from
> (  select regexp_split_to_table(full_message,'(\n)', 'g')
>    from mytable
>    where id =-2146999703
> )foo;
> 
> ERROR:  invalid memory alloc request size 1447215584

Does any of these two work:

SELECT length(regexp_replace(full_message, '[^\n]', '', 'g'))
FROM mytable
WHERE id = -2146999703;

or

SELECT length(full_message) - length(replace(full_message, E'\n', ''))
FROM mytable
WHERE id = -2146999703;

Yours,
Laurenz Albe

Re: Counting the occurences of a substring within a very large text

От
Andy Colson
Дата:
On 6/24/2015 5:55 AM, Marc Mamin wrote:
> Hello,
>
> I'd like to count the number  linebreaks within a string,
> but I get a memory allocation error when using regexp_matches or regexp_split_to_table.
>
> Any idea for an alternative to this problem  ?
>
> select count(*)-1 from
> (  select regexp_split_to_table(full_message,'(\n)', 'g')
>     from mytable
>     where id =-2146999703
> )foo;
>
> ERROR:  invalid memory alloc request size 1447215584
>
> regards,
>
> Marc Mamin
>
>

If its a large enough string, switching to plperl/plpython might give
you a pleasant speedup.  There is a small overhead getting the string
to/from pg, but the string ops will be much faster.

-Andy


Re: Counting the occurences of a substring within a very large text

От
Marc Mamin
Дата:

> -----Original Message-----
> From: Albe Laurenz [mailto:laurenz.albe@wien.gv.at]
> Sent: Mittwoch, 24. Juni 2015 13:44
> To: Marc Mamin; Postgres General
> Subject: RE: Counting the occurences of a substring within a very large
> text
> 
> Marc Mamin wrote:
> > I'd like to count the number  linebreaks within a string, but I get a
> > memory allocation error when using regexp_matches or
> regexp_split_to_table.
> >
> > Any idea for an alternative to this problem  ?
> >
> > select count(*)-1 from
> > (  select regexp_split_to_table(full_message,'(\n)', 'g')
> >    from mytable
> >    where id =-2146999703
> > )foo;
> >
> > ERROR:  invalid memory alloc request size 1447215584
> 
> Does any of these two work:
> 
> SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM
> mytable WHERE id = -2146999703;
> 
> or
> 
> SELECT length(full_message) - length(replace(full_message, E'\n', ''))
> FROM mytable WHERE id = -2146999703;


no, they both yeld the same error.

a new string functions for this would be nice, as it could certainly be implemented in a more efficient way...

    BTW: the text to check is a single 350 MB error message from a log file :)





> 
> Yours,
> Laurenz Albe

Re: Counting the occurences of a substring within a very large text

От
Marc Mamin
Дата:

> -----Original Message-----
> From: Chris Mair [mailto:chris@1006.org]
> Sent: Mittwoch, 24. Juni 2015 13:26
> To: Marc Mamin; Postgres General
> Subject: Re: [GENERAL] Counting the occurences of a substring within a
> very large text
>
> > Hello,
> >
> > I'd like to count the number  linebreaks within a string, but I get a
> > memory allocation error when using regexp_matches or
> regexp_split_to_table.
> >
> > Any idea for an alternative to this problem  ?
> >
> > select count(*)-1 from
> > (  select regexp_split_to_table(full_message,'(\n)', 'g')
> >    from mytable
> >    where id =-2146999703
> > )foo;
> >
> > ERROR:  invalid memory alloc request size 1447215584
> >
> > regards,
> >
> > Marc Mamin
> >
>
> Hi,
>
> what's the size of full_message from mytable where id =-2146999703?

345 MB, Postgres 9.3.6

Marc Mamin

>
> Also: Postgres version? OS? any extensions installed?
>
> Bye,
> Chris.
>



Re: Counting the occurences of a substring within a very large text

От
Marc Mamin
Дата:
> > > I'd like to count the number  linebreaks within a string, but I get
> > > a memory allocation error when using regexp_matches or
> > regexp_split_to_table.
> > >
> > > Any idea for an alternative to this problem  ?
> > >
> > > select count(*)-1 from
> > > (  select regexp_split_to_table(full_message,'(\n)', 'g')
> > >    from mytable
> > >    where id =-2146999703
> > > )foo;
> > >
> > > ERROR:  invalid memory alloc request size 1447215584
> >
> > Does any of these two work:
> >
> > SELECT length(regexp_replace(full_message, '[^\n]', '', 'g')) FROM
> > mytable WHERE id = -2146999703;
> >
> > or
> >
> > SELECT length(full_message) - length(replace(full_message, E'\n',
> ''))
> > FROM mytable WHERE id = -2146999703;
> 
> 
> no, they both yeld the same error.
> 


and this fails too, which is more annoying as it looks like a bug:
SELECT replace(full_message, E'\n', '') FROM stadium_rprod.aserrorfull_20150623 WHERE id = -2146999703;

note that the 345MB text only contains 635 lines. This might be the issue...

Marc Mamin


> a new string functions for this would be nice, as it could certainly be
> implemented in a more efficient way...
> 
>     BTW: the text to check is a single 350 MB error message from a
> log file :)
> 
> 
> 
> 
> 
> >
> > Yours,
> > Laurenz Albe

Re: Counting the occurences of a substring within a very large text

От
Geoff Winkless
Дата:
On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote:
note that the 345MB text only contains 635 lines. This might be the issue...


There's similar issue discussed here:


Tom did seem to accept that the attempted alloc is larger than it needs to be, but even if it were fixed you would still run into problems further down the line managing values of that size.

Geoff

Re: Counting the occurences of a substring within a very large text

От
Marc Mamin
Дата:


>>On 24 June 2015 at 14:51, Marc Mamin <M.Mamin@intershop.de> wrote:
>>note that the 345MB text only contains 635 lines. This might be the issue...

>There's similar issue discussed here:

>http://www.postgresql.org/message-id/6046.1353874252@sss.pgh.pa.us

>Tom did seem to accept that the attempted alloc is larger than it needs to be, but even if it were fixed you would
stillrun into problems further down the line managing values of that size.
 


Thanks,

does it means that the issue would also occur in a (not yet existing) function that would just count the substring,
withouthaving to manipulate the text itself ?
 
or in other words, are there string functions that can handle such large text in stream, or is it always necessary to
allocenough memory for the whole text first ?
 

Marc 

Re: Counting the occurences of a substring within a very large text

От
Tom Lane
Дата:
Marc Mamin <M.Mamin@intershop.de> writes:
> I'd like to count the number  linebreaks within a string,
> but I get a memory allocation error when using regexp_matches or regexp_split_to_table.

Anything involving a regexp is going to have a maximum input string length
of about 256MB, as a result of conversion to pg_wchar format.
regexp_split_to_table(), for instance, does this:

    /* convert string to pg_wchar form for matching */
    orig_len = VARSIZE_ANY_EXHDR(orig_str);
    wide_str = (pg_wchar *) palloc(sizeof(pg_wchar) * (orig_len + 1));
    wide_len = pg_mb2wchar_with_len(VARDATA_ANY(orig_str), wide_str, orig_len);

palloc() would complain for requests beyond 1GB, and sizeof(pg_wchar) is
4, so 256MB is the longest orig_str this can handle.  (This coding is on
the hairy edge of being a security bug on 32-bit machines, too, but AFAICS
it is okay because there's a factor of 4 daylight between the max possible
input length of 1GB and overflowing uint32.)

In newer branches we could consider using MemoryContextAllocHuge to escape
the 1GB limit on wide_str, but it would take some research to be sure that
everything involved is using suitable datatypes for string indices and so
forth.  I'd be a bit worried about the performance implications of
throwing such large strings around, anyway.

            regards, tom lane


Re: Counting the occurences of a substring within a very large text

От
Albe Laurenz
Дата:
Marc Mamin wrote:
>>> I'd like to count the number  linebreaks within a string, but I get a
>>> memory allocation error when using regexp_matches or regexp_split_to_table.

>> Does any of these two work:
[...]

> no, they both yeld the same error.
>
> a new string functions for this would be nice, as it could certainly be implemented in a more efficient way...

This is a rather special case.
But it should be easy to write your own C function that does this efficiently.

Yours,
Laurenz Albe