Discussion: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq


[ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Cory Nemelka
Date:
I am getting very poor performance using libpq to process very large TEXT columns (300MB+). I suspect it is I/O related but can't be sure.

Has anyone had experience with the same issue who can help me resolve it?

--cnemelka

Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Aldo Sarmiento
Date:
I believe large columns get put into a TOAST table. The max page size is 8k, so a value that size will be split across many out-of-line chunks that have to be reassembled for each row: https://www.postgresql.org/docs/9.5/static/storage-toast.html
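If you want to see how much of the value is actually stored out of line, comparing pg_column_size() (the stored, possibly compressed size) with octet_length() (the full detoasted size) is a quick check. A minimal libpq sketch; the table name docs and column name body are placeholders for your own schema:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    /* Placeholder connection string; adjust for your environment. */
    PGconn *conn = PQconnectdb("dbname=mydb");
    if (PQstatus(conn) != CONNECTION_OK) {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        return 1;
    }

    /* pg_column_size() reports the on-disk (compressed/TOASTed) size;
     * octet_length() reports the detoasted size in bytes. */
    PGresult *res = PQexec(conn,
        "SELECT pg_column_size(body), octet_length(body) FROM docs LIMIT 1");
    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        printf("stored: %s bytes, detoasted: %s bytes\n",
               PQgetvalue(res, 0, 0), PQgetvalue(res, 0, 1));

    PQclear(res);
    PQfinish(conn);
    return 0;
}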

Aldo Sarmiento
President & CTO
8687 Research Dr, Irvine, CA 92618

On Thu, Oct 19, 2017 at 2:03 PM, Cory Nemelka <cnemelka@gmail.com> wrote:
I am getting very poor performance using libpq to process very large TEXT columns (300MB+). I suspect it is I/O related but can't be sure.

Has anyone had experience with the same issue who can help me resolve it?

--cnemelka

Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Cory Nemelka
Date:
Yes, but I should be able to read them much faster. The psql client can display an 11MB column in a little over a minute, while in C, using the libpq library, it takes over an hour.

Does anyone have experience with the same issue who can help me resolve it?

--cnemelka

On Thu, Oct 19, 2017 at 5:20 PM, Aldo Sarmiento <aldo@bigpurpledot.com> wrote:
I believe large columns get put into a TOAST table. The max page size is 8k, so a value that size will be split across many out-of-line chunks that have to be reassembled for each row: https://www.postgresql.org/docs/9.5/static/storage-toast.html




Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Cory Nemelka
Date:
All I am doing is iterating through the characters, so I know it isn't my code.

--cnemelka

On Fri, Oct 20, 2017 at 9:14 AM, Cory Nemelka <cnemelka@gmail.com> wrote:
Yes, but I should be able to read them much faster. The psql client can display an 11MB column in a little over a minute, while in C, using the libpq library, it takes over an hour.

Does anyone have experience with the same issue who can help me resolve it?

--cnemelka




Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Geoff Winkless
Date:
It's probably worth removing the iterating code Just In Case.

Apologies for egg-suck-education, but I assume you're not doing something silly like

for (i = 0; i < strlen(bigtextstring); i++) {
    ....
}

I know it sounds stupid, but you'd be amazed how many times that crops up, and for small strings it doesn't matter, but for large strings it's catastrophic.
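If that is what's happening, the usual fix is to compute the length once, outside the loop. A minimal sketch, with bigtextstring standing in for whatever PQgetvalue() handed you:

#include <string.h>

void process(const char *bigtextstring)
{
    /* strlen() is O(n), so calling it in the loop condition makes the
     * whole loop O(n^2), which is catastrophic on a 300MB string.
     * Hoist the length out of the loop. */
    size_t len = strlen(bigtextstring);
    size_t i;

    for (i = 0; i < len; i++) {
        /* ... do something with bigtextstring[i] ... */
    }
}

Better still, libpq already knows the length: PQgetlength(res, row, col) returns it without scanning the string at all.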

Geoff

On 20 October 2017 at 16:16, Cory Nemelka <cnemelka@gmail.com> wrote:
All I am doing is iterating through the characters, so I know it isn't my code.

--cnemelka





Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Cory Nemelka
Date:
I'll take out all the code that isn't directly related to reading the data and see if that helps. That was the next step I intended anyway.

Thank you for the reply.

--cnemelka

On Fri, Oct 20, 2017 at 9:43 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
It's probably worth removing the iterating code Just In Case.

Apologies for egg-suck-education, but I assume you're not doing something silly like

for (i = 0; i < strlen(bigtextstring); i++) {
    ....
}

I know it sounds stupid, but you'd be amazed how many times that crops up, and for small strings it doesn't matter, but for large strings it's catastrophic.

Geoff






Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Tom Lane
Date:
Cory Nemelka <cnemelka@gmail.com> writes:
> Yes, but I should be able to read them much faster.  The psql client can
> display an 11MB column in a little over a minute, while in C, using the
> libpq library, it takes over an hour.

Well, of course psql relies on libpq, so it seems unlikely that libpq
itself is where the time is going.  Have you tried applying a profiler?
"perf" or "oprofile" or similar tool ought to pinpoint the culprit
pretty easily.
        regards, tom lane



Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Bear Giles
Date:
As an aside, any halfway decent optimizer would realize that the result of strlen() is unchanged as long as the contents of the string it's passed aren't modified. That's a common enough pattern that it should be checked for.

What about buffer size? Are you using a smaller fetch size that results in lots of little packets?

Is there a way to specify a connection that uses compression?
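On the fetch-size point: by default PQexec() buffers the entire result set in client memory before returning. Single-row mode won't split one huge value (each row still arrives whole), but it does keep multiple large rows from piling up at once. A minimal sketch; connection string and query are placeholders, and error handling is omitted:

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=mydb");  /* placeholder */
    PGresult *res;

    /* Send the query, then switch to single-row mode before the first
     * PQgetResult() so rows are delivered one PGresult at a time. */
    PQsendQuery(conn, "SELECT bigcol FROM bigtable");
    PQsetSingleRowMode(conn);

    while ((res = PQgetResult(conn)) != NULL) {
        if (PQresultStatus(res) == PGRES_SINGLE_TUPLE)
            printf("row of %d bytes\n", PQgetlength(res, 0, 0));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}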

On Fri, Oct 20, 2017 at 10:55 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
Cory Nemelka <cnemelka@gmail.com> writes:
> Yes, but I should be able to read them much faster.  The psql client can
> display an 11MB column in a little over a minute, while in C, using the
> libpq library, it takes over an hour.

Well, of course psql relies on libpq, so it seems unlikely that libpq
itself is where the time is going.  Have you tried applying a profiler?
"perf" or "oprofile" or similar tool ought to pinpoint the culprit
pretty easily.

                        regards, tom lane



Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Geoff Winkless
Date:
On 20 October 2017 at 18:52, Bear Giles <bgiles@coyotesong.com> wrote:
As an aside, any halfway decent optimizer would realize that the result of strlen() is unchanged as long as the contents of the string it's passed aren't modified. That's a common enough pattern that it should be checked for.

IME this is a myth perpetuated by bad computer science lecturers who haven't thought through the consequences of what they're saying. strlen() can change because of changes inside the loop, but also because of other threads in the system; I've not yet seen a compiler optimise that away, and neither should it, IMO.

G

Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Bear Giles
Date:
In that case you must put a read lock on the string that covers the loop. If you're in a multi-threaded environment and not using locks when appropriate, then all bets are off.


On Fri, Oct 20, 2017 at 12:05 PM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
IME this is a myth perpetuated by bad computer science lecturers who haven't thought through the consequences of what they're saying. strlen() can change because of changes inside the loop, but also because of other threads in the system; I've not yet seen a compiler optimise that away, and neither should it, IMO.

G

Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Geoff Winkless
Date:
On 21 Oct 2017 12:32, "Bear Giles" <bgiles@coyotesong.com> wrote:

> In that case you must put a read lock on the string that covers the loop. If you're in
> a multi-threaded environment and not using locks when appropriate then all bets are off.

You reckon a compiler can decide to blow up your code by making
assumptions like that?

Your loop could set a var for a state machine in a processing thread
to modify the string. That doesn't preclude correct locking behaviour.

If you think that's too contrived then forget threads, you could make
a shared library call that the compiler can't assess at compile-time
that could change the string.

Yes, in either case, using strlen to check for that is poor code, but
the compiler can't assume you're not using poor code.

This argument is pretty pointless. The only way to be sure to avoid
the problem is to assume that the compiler won't optimize bad code.

FWIW gcc 4.8.5 with -O3 doesn't optimize away strlen even in code this simple:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int i;
    char *buff;

    buff = malloc(strlen(argv[1]) + 1);
    for (i = 0; i < strlen(argv[1]); i++) {
        buff[i] = argv[1][i];
    }
    buff[i] = '\0';
    printf("%s", buff);
}


.L3:
        movzbl  0(%rbp,%rbx), %edx
        movb    %dl, (%r12,%rbx)
        movq    8(%r13), %rbp
        addq    $1, %rbx
.L2:
        movq    %rbp, %rdi
        call    strlen
        cmpq    %rax, %rbx
        jb      .L3

However, it _does_ optimize this code:

int main(int argc, char **argv)
{
    int i;
    char *buff;
    char *buff2;

    buff2 = strdup(argv[1]);
    buff = malloc(strlen(buff2) + 1);
    for (i = 0; i < strlen(buff2); i++) {
        buff[i] = buff2[i];
    }
    buff[i] = '\0';
    printf("%s", buff);
}

I assume that's because it can be certain at compile time that, since
both buff and buff2 are local, nothing else is going to modify the
source string (without some stack smashing, anyway).

Geoff



Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Geoff Winkless
Date:
I wrote:
> I assume that's because it can be certain at compile time that, since
> both buff and buff2 are local, nothing else is going to modify the
> source string (without some stack smashing, anyway).

Ugh. Of course, they're not local: they're malloc'd; however, it's
still reasonable to assume that (since the pointers aren't passed
anywhere) the buffers are untouched elsewhere, unless other threads
are misbehaving.

Geoff



Re: [ADMIN] Processing very large TEXT columns (300MB+) using C/libpq

From
Bear Giles
Date:
I meant the developer needs to add the locks, not the compiler. I think they're just advisory, but everyone always writes well-behaved code, right? ;-)

On Mon, Oct 23, 2017 at 4:17 AM, Geoff Winkless <pgsqladmin@geoff.dj> wrote:
I wrote:
> I assume that's because it can be certain at compile time that, since
> both buff and buff2 are local, nothing else is going to modify the
> source string (without some stack smashing, anyway).

Ugh. Of course, they're not local: they're malloc'd; however, it's
still reasonable to assume that (since the pointers aren't passed
anywhere) the buffers are untouched elsewhere, unless other threads
are misbehaving.

Geoff