Discussion: general PG network slowness (possible cure) (repost)

general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
I set up pg to replace a plain gdbm database for my application.  But
even running to the same machine, via a unix socket

   * the pg database ran 100 times slower

Across the net it was

  * about 500 to 1000 times slower than local gdbm

with no cpu use to speak of.

I'd heard that networked databases are slow.  I might have left it at
that if curiosity hadn't led me to write a network server for gdbm
databases, and talk to _that_ just to get a comparison.

Lo and behold and smack me with a corncob if it wasn't _slower_ than pg.

On a whim I mapped the network bandwidth per packet size with the NPtcp
suite, and got surprising answers ..  at 1500B, naturally, the bandwidth
was the full 10Mb/s (minus overheads, say 8.5Mb/s) of my pathetic little
local net.  At 100B the bandwidth available was only 25Kb/s.  At 10B,
you might as well use tin cans and taut string instead.

I also mapped the network flows using ntop, and yes, the average packet
size for both gdbm and pg in one direction was only about 100B or
so.  That's it!  Clearly there are a lot of short queries going out and
the answers were none too big either ( I had a LIMIT 1 in all my PG
queries).

About 75% of traffic was in the 64-128B range while my application was
running, with the peak bandwidth in that range being about 75-125Kb/s
(and I do mean bits, not bytes).

Soooo ... I took a look at my implementation of remote gdbm, and did
a very little work to aggregate outgoing transmissions together into
lumps.  Three lines added in two places.  At the level of the protocol
where I could tell how long the immediate conversation segment would be,
I "corked" the tcp socket before starting the segment and "uncorked" it
after the segment (for "cork", see tcp(7), setsockopt(2) and TCP_CORK in
linux).
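
For the curious, the added lines amount to roughly this (a sketch with
made-up variable names, not the actual rgdbm code):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* hold back partial frames while one protocol segment is being written */
    static void cork(int sock, int on)
    {
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
    }

    /* per conversation segment:
     *    cork(sock, 1);
     *    ... write() the header and data for the whole segment ...
     *    cork(sock, 0);     -- uncork: everything accumulated goes out now
     */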

Surprise, ...  I got a speed up of hundreds of times.  The same application
that crawled under my original rgdbm implementation and under PG now
maxed out the network bandwidth at close to a full 10Mb/s and 1200
pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
server.

So

  * Is that what is holding up postgres over the net too?  Lots of tiny
    packets?

And if so

  * can one fix it the way I fixed it for remote gdbm?

The speedup was hundreds of times. Can someone point me at the relevant
bits of pg code? A quick look seems to say that fe-*.c is
interesting. I need to find where the actual read and write on the
conn->sock is done.

Very illuminating gnuplot outputs available on request.

Peter


Re: general PG network slowness (possible cure) (repost)

From: Richard Huxton
Peter T. Breuer wrote:
> I set up pg to replace a plain gdbm database for my application.  But
> even running to the same machine, via a unix socket
>
>    * the pg database ran 100 times slower

For what operations? Bulk reads? 19-way joins?

> Across the net it was
>
>   * about 500 to 1000 times slower than local gdbm
>
> with no cpu use to speak of.

Disk-intensive or memory intensive?

> I'd heard that networked databases are slow.  I might have left it at
> that if curiosity hadn't led me to write a network server for gdbm
> databases, and talk to _that_ just to get a comparison.
>
> Lo and behold and smack me with a corncob if it wasn't _slower_ than pg.
>
> On a whim I mapped the network bandwidth per packet size with the NPtcp
> suite, and got surprising answers ..  at 1500B, naturally, the bandwidth
> was the full 10Mb/s (minus overheads, say 8.5Mb/s) of my pathetic little
> local net.  At 100B the bandwidth available was only 25Kb/s.  At 10B,
> you might as well use tin cans and taut string instead.

This sounds like you're testing a single connection. You would expect
"dead time" to dominate in that scenario. What happens when you have 50
simultaneous connections? Or do you think it's just packet overhead?

> I also mapped the network flows using ntop, and yes, the average packet
> size for both gdbm and pg in one direction was only about 100B or
> so.  That's it!  Clearly there are a lot of short queries going out and
> the answers were none too big either ( I had a LIMIT 1 in all my PG
> queries).

I'm not sure that 100B query-results are usually the bottleneck.
Why would you have LIMIT 1 on all your queries?

> About 75% of traffic was in the 64-128B range while my application was
> running, with the peak bandwidth in that range being about 75-125Kb/s
> (and I do mean bits, not bytes).

None of this sounds like typical database traffic to me. Yes, there are
lots of small result-sets, but there are also typically larger (several
kilobytes) to much larger (10s-100s KB).

> Soooo ... I took a look at my implementation of remote gdbm, and did
> a very little work to aggregate outgoing transmissions together into
> lumps.  Three lines added in two places.  At the level of the protocol
> where I could tell how long the immediate conversation segment would be,
> I "corked" the tcp socket before starting the segment and "uncorked" it
> after the segment (for "cork", see tcp(7), setsockopt(2) and TCP_CORK in
> linux).

I'm a bit puzzled, because I'd have thought the standard Nagle algorithm
would manage this gracefully enough for short-query cases. There's no
way (that I know of) for a backend to handle more than one query at a time.

> Surprise, ...  I got a speed up of hundreds of times.  The same application
> that crawled under my original rgdbm implementation and under PG now
> maxed out the network bandwidth at close to a full 10Mb/s and 1200
> pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
> server.
>
> So
>
>   * Is that what is holding up postgres over the net too?  Lots of tiny
>     packets?

I'm not sure your setup is typical, interesting though the figures are.
Google a bit for pg_bench perhaps and see if you can reproduce the
effect with a more typical load. I'd be interested in being proved wrong.

> And if so
>
>   * can one fix it the way I fixed it for remote gdbm?
>
> The speedup was hundreds of times. Can someone point me at the relevant
> bits of pg code? A quick look seems to say that fe-*.c is
> interesting. I need to find where the actual read and write on the
> conn->sock is done.

You'll want to look in backend/libpq and interfaces/libpq I think
(although I'm not a developer).

--
   Richard Huxton
   Archonet Ltd

Re: general PG network slowness (possible cure) (repost)

From: "Steinar H. Gunderson"
On Fri, May 25, 2007 at 10:50:58AM +0200, Peter T. Breuer wrote:
> I set up pg to replace a plain gdbm database for my application.

Postgres and gdbm are completely different. You want to rethink your queries
so each does more work, instead of running a zillion of them over the network.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Kenneth Marshall:"
> > Surprise, ...  I got a speed up of hundreds of times.  The same application
> > that crawled under my original rgdbm implementation and under PG now
> > maxed out the network bandwidth at close to a full 10Mb/s and 1200
> > pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
> > server.
> >
> > So
> >
> >   * Is that what is holding up postgres over the net too?  Lots of tiny
> >     packets?
>
>
> This effect is very common, but you are in effect altering the query/

I imagined so, but no, I am not changing the behaviour - I believe you
are imagining something different here.  Let me explain.

It is usually the case that drivers and the network layer conspire to
emit packets when they are otherwise idle, since they have nothing
better to do.  That is, if the transmission unit is the normal 1500B and
there is 200B in the transmission buffer and nothing else is frisking
them about the chops, something along the line will shrug and say, OK,
I'll just send out a 200B fragment now, apologize, and send out another
fragment later if anything else comes along for me to chunter out.

It is also the case that drivers do the opposite .. that is, they do
NOT send out packets when the transmission buffer is full, even if they
have 1500B worth. Why? Well, on Ge for sure, and on 100BT most of the
time, it doesn't pay to send out individual packets because the space
required between packets is relatively too great to permit the network
to work at that speed  given the speed of light as it is, and the
spacing it implies between packets (I remember when I advised the
networking protocol people that Ge was a coming thing about 6 years
ago, they all protested and said it was _physically_ impossible. It is.
If you send packets one by one!).  An ethernet line is fundamentally
only electrical and only signals up or down (relative) and needs time to
quiesce. And then there's the busmastering .. a PCI bus is only about
33MHz, and 32 bits wide (well, or 16 on portables, or even 64, but
you're getting into heavy server equipment then).  That's 128MB/s in
one direction, and any time one releases the bus there's a re-setup time
that costs the earth and will easily lower bandwidth by 75%. So drivers
like to take the bus for a good few packets at a time. Even a single
packet (1500B) will take 400 multi-step bus cycles to get to the
card, and then it's a question of how much onboard memory it has or
whether one has to drive it synchronously. Most cards have something
like a 32-unit ring buffer, and I think each unit is considerable.

Now, if a driver KNOWS what's coming then it can alter its behavior in
order to mesh properly with the higher level layers. What I did was
_tell_ the driver and the protocol not to send any data until I well
and truly tell it to, and then told it to, when I was ready.  The result
is that a full communication unit (start, header, following data, and
stop codon) was sent in one blast.

That meant that there were NO tiny fragments blocking up the net, being
sent willy-nilly. And it also meant that the driver was NOT waiting for
more info to come in before getting bored and sending out what it had.
It did as I told it to.

The evidence from monitoring the PG network thruput is that 75% of its
packets are in the 64-128B range, including tcp header. That's hitting
the 100Kb/s (10KB/s) bandwidth regime on my network at the lower end.
It will be even _worse_ on a faster net, I think (feel free to send me a
faster net to compare with :).

I also graphed latency, but I haven't taken those results into account,
as the bandwidth measurements were so striking.

> response behavior of the database. Most applications expect an answer
> from the database after every query.

Well of course. Nothing else would work! (I imagine you have some kind
of async scheme, but I haven't investigated). I ask, the db replies. I
ask, the db replies. What I did was

  1) made the ASK go out as one lump.
  2) made the REPLY go out as one lump
  3) STOPPED the card waiting for several replies or asks to accumulate
     before sending out anything at all.

> If it could manage retrying failed
> queries later, you could use the typical sliding window/delayed ack
> that is so useful in improving the bandwidth utilization of many network

That is not what is going on (though that's not a bad idea). See
above for the explanation. One has to take into account the physical
hardware involved and its limitations, and arrange the communications
accordingly. All I did was send EACH query and EACH response as a
single unit, at the hardware level.

One could do better still by managing _several_ threads' communications
at once.

> programs. Maybe an option in libpq to tell it to use delayed "acks". I
> do not know what would be involved.

Nothing spectacular is required to see a considerable improvement, I
think, apart from a little direction from the high-level protocol down
to the driver about where the communication boundaries are.  1000%
speedup in my case.

Now, where is the actual socket send done in the pg code? I'd like to
check what's happening in there.




Peter

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Richard Huxton:"
> Peter T. Breuer wrote:
> > I set up pg to replace a plain gdbm database for my application.  But
> > even running to the same machine, via a unix socket
> >
> >    * the pg database ran 100 times slower
>
> For what operations? Bulk reads? 19-way joins?

The only operations being done are simple "find the row with this key",
or "update the row with this key". That's all. The queries are not an
issue (though why the PG thread chooses to max out cpu when it gets the
chance to do so through a unix socket, I don't know).

> > Across the net it was
> >
> >   * about 500 to 1000 times slower than local gdbm
> >
> > with no cpu use to speak of.
>
> Disk-intensive or memory intensive?

There is no disk as such...  it's running on a ramdisk at the server
end.  But assuming you mean i/o, i/o was completely stalled.  Everything
was idle, all waiting on the net.

> > On a whim I mapped the network bandwidth per packet size with the NPtcp
> > suite, and got surprising answers ..  at 1500B, naturally, the bandwidth
> > was the full 10Mb/s (minus overheads, say 8.5Mb/s) of my pathetic little
> > local net.  At 100B the bandwidth available was only 25Kb/s.  At 10B,
> > you might as well use tin cans and taut string instead.
>
> This sounds like you're testing a single connection. You would expect
> "dead time" to dominate in that scenario. What happens when you have 50

Indeed, it is single, because that's my application. I don't have
50 simultaneous connections.  The use of the database is as a permanent
storage area for the results of previous analyses (static analysis of
the linux kernel codes) from a single client.

Multiple threads accessing at the same time might help keep the network
drivers busier, which would help. They would always see their buffers
filling at an even rate and be able to send out groups of packets at
once.

> simultaneous connections? Or do you think it's just packet overhead?

It's not quite overhead in the sense of the logical layer. It's a
physical layer thing.  I replied in another mail on this thread, but in
summary, tcp behaves badly with small packets on ethernet, even on a
dedicated line (as this was). One needs to keep it on a tight rein.

> > I also mapped the network flows using ntop, and yes, the average packet
> > size for both gdbm and pg in one direction was only about 100B or
> > so.  That's it!  Clearly there are a lot of short queries going out and
> > the answers were none too big either ( I had a LIMIT 1 in all my PG
> > queries).
>
> I'm not sure that 100B query-results are usually the bottleneck.
> Why would you have LIMIT 1 on all your queries?

Because there is always only one answer to the query, according to the
logic.  So I can always tell the database manager to stop looking after
one, which will always help it.

> > About 75% of traffic was in the 64-128B range while my application was
> > running, with the peak bandwidth in that range being about 75-125Kb/s
> > (and I do mean bits, not bytes).
>
> None of this sounds like typical database traffic to me. Yes, there are
> lots of small result-sets, but there are also typically larger (several
> kilobytes) to much larger (10s-100s KB).

There's none here.

> > Soooo ... I took a look at my implementation of remote gdbm, and did
> > a very little work to aggregate outgoing transmissions together into
> > lumps.  Three lines added in two places.  At the level of the protocol
> > where I could tell how long the immediate conversation segment would be,
> > I "corked" the tcp socket before starting the segment and "uncorked" it
> > after the segment (for "cork", see tcp(7), setsockopt(2) and TCP_CORK in
> > linux).
>
> I'm a bit puzzled, because I'd have thought the standard Nagle algorithm
> would manage this gracefully enough for short-query cases. There's no

On the contrary, Nagle is also often wrong here because it will delay
sending in order to accumulate more data into buffers when only a little
has arrived, then give up when no more data arrives to be sent out, then
send out the (short) packet anyway, late. There's no other traffic
apart from my (single thread) application.

What we want is to direct the sending exactly in this situation, saying
when not to send, and when to send.  Disable Nagle for a start, use
async read (noblock), and sync write, with sends from the socket blocked
from initiation of a message until the whole message is ready to be sent
out.  Sending the message piecemeal just hurts too.

> way (that I know of) for a backend to handle more than one query at a time.

That's not the scenario.

> > Surprise, ...  I got a speed up of hundreds of times.  The same application
> > that crawled under my original rgdbm implementation and under PG now
> > maxed out the network bandwidth at close to a full 10Mb/s and 1200
> > pkts/s, at 10% CPU on my 700MHz client, and a bit less on the 1GHz
> > server.
> >
> > So
> >
> >   * Is that what is holding up postgres over the net too?  Lots of tiny
> >     packets?
>
> I'm not sure your setup is typical, interesting though the figures are.
> Google a bit for pg_bench perhaps and see if you can reproduce the
> effect with a more typical load. I'd be interested in being proved wrong.

But the load is typical HERE. The application works well against gdbm
and I was hoping to see speedup from using a _real_ full-fledged DB
instead.

Well, at least it's very helpful for debugging.

> > And if so
> >
> >   * can one fix it the way I fixed it for remote gdbm?
> >
> > The speedup was hundreds of times. Can someone point me at the relevant
> > bits of pg code? A quick look seems to say that fe-*.c is
> > interesting. I need to find where the actual read and write on the
> > conn->sock is done.
>
> You'll want to look in backend/libpq and interfaces/libpq I think
> (although I'm not a developer).

I'll look around there. Specific directions are greatly
appreciated.

Thanks.

Peter

Re: general PG network slowness (possible cure) (repost)

From: Richard Huxton
Peter T. Breuer wrote:
>
> The only operations being done are simple "find the row with this key",
> or "update the row with this key". That's all. The queries are not an
> issue (though why the PG thread chooses to max out cpu when it gets the
> chance to do so through a unix socket, I don't know).

> There is no disk as such...  it's running on a ramdisk at the server
> end.  But assuming you mean i/o, i/o was completely stalled.  Everything
> was idle, all waiting on the net.

> Indeed, it is single, because that's my application. I don't have
> 50 simultaneous connections.  The use of the database is as a permanent
> storage area for the results of previous analyses (static analysis of
> the linux kernel codes) from a single client.

>> I'm not sure your setup is typical, interesting though the figures are.
>> Google a bit for pg_bench perhaps and see if you can reproduce the
>> effect with a more typical load. I'd be interested in being proved wrong.
>
> But the load is typical HERE. The application works well against gdbm
> and I was hoping to see speedup from using a _real_ full-fledged DB
> instead.

I'm not sure you really want a full RDBMS. If you only have a single
connection and are making basic key-lookup queries then 90% of
PostgreSQL's code is just getting in your way. Sounds to me like gdbm
(or one of its alternatives) is a good match for you. Failing that,
sqlite is probably the next lowest-overhead solution.

Of course, if you want to have multiple clients interacting and
performing complex 19-way joins on gigabyte-sized tables with full-text
indexing and full transaction control then you *do* want a RDBMS.

--
   Richard Huxton
   Archonet Ltd

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Richard Huxton:"
> I'm not sure you really want a full RDBMS. If you only have a single
> connection and are making basic key-lookup queries then 90% of
> PostgreSQL's code is just getting in your way. Sounds to me like gdbm

Yep - I could happily tell it not to try and compile a special lookup
scheme each time, for example!  (how that?). I could presumably also
help it by preloading the commands I will run and sending over the
params only with a "do a no. 17 now!".

> (or one of its alternatives) is a good match for you. Failing that,
> sqlite is probably the next lowest-overhead solution.

Not a bad idea, but PG _will_ be useful when folk come to analyse the
result of the analyses being done. What is slow is getting the data
into the database now via simple store, fetch and update.

> Of course, if you want to have multiple clients interacting and
> performing complex 19-way joins on gigabyte-sized tables with full-text

Well, the dbs are in the tens of MB from a single run over a single
file (i.e analysis of a single 30KLOC source).  The complete analysis
space is something like 4000 times that, for 4300 C files in the linux
kernel source. And then there is all the linux kernel versions. Then
there is godzilla and apache source ..

> indexing and full transaction control then you *do* want a RDBMS.

We want one anyway. The problem is filling the data and the simple
fetch and update queries on it.

I really think it would be worthwhile getting some developer to tell me
where the network send is done in PG.

Peter

Re: general PG network slowness (possible cure) (repost)

From: Tom Lane
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> Soooo ... I took a look at my implementation of remote gdbm, and did
> a very little work to aggregate outgoing transmissions together into
> lumps.

We do that already --- for a simple query/response such as you are
describing, each query cycle will involve one physical client->server
message followed by one physical server->client message.  The only way
to aggregate more is for the application code to merge queries together.

Migrating a dbm-style application to a SQL database is often a real
pain, precisely because the application is designed to a mindset of
"fetch one record, manipulate it, update it", where "fetch" and "update"
are assumed to be too stupid to do any of the work for you.  The way
to get high performance with a SQL engine is to push as much of the work
as you can to the database side, and let the engine process multiple
records per query; and that can easily mean rewriting the app from the
ground up :-(

            regards, tom lane

Re: general PG network slowness (possible cure) (repost)

From: Richard Huxton
Peter T. Breuer wrote:
> "Also sprach Richard Huxton:"
>> I'm not sure you really want a full RDBMS. If you only have a single
>> connection and are making basic key-lookup queries then 90% of
>> PostgreSQL's code is just getting in your way. Sounds to me like gdbm
>
> Yep - I could happily tell it not to try and compile a special lookup
> scheme each time, for example!  (how that?). I could presumably also
> help it by preloading the commands I will run and sending over the
> params only with a "do a no. 17 now!".

PREPARE/EXECUTE (or the equivalent libpq functions).
Also - if you can have multiple connections to the DB you should be able
to have several queries running at once.
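
Roughly like this (a sketch only; the statement, table and column names
are invented, not from your schema):

    #include <libpq-fe.h>

    /* sketch: prepare the lookup once, then each call ships only the
     * statement name and the key, not the whole SQL text */
    static PGresult *lookup(PGconn *conn, const char *key)
    {
        const char *params[1] = { key };

        return PQexecPrepared(conn, "get_row", 1, params,
                              NULL, NULL, 0 /* 0 = text results */);
    }

    /* once, just after connecting:
     *    PQexec(conn, "PREPARE get_row (text) AS "
     *                 "SELECT val FROM results WHERE key = $1 LIMIT 1");
     */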

>> (or one of its alternatives) is a good match for you. Failing that,
>> sqlite is probably the next lowest-overhead solution.
>
> Not a bad idea. but PG _will_ be useful when folk come to analyse the
> result of the analyses being done. What is slow is getting the data
> into the database now via simple store, fetch and update.

I'd have an hourly/daily bulk-load running from the simple system into
PG. If you have to search all the data from your app that's not
practical of course.

>> Of course, if you want to have multiple clients interacting and
>> performing complex 19-way joins on gigabyte-sized tables with full-text
>
> Well, the dbs are in the tens of MB from a single run over a single
> file (i.e analysis of a single 30KLOC source).  The complete analysis
> space is something like 4000 times that, for 4300 C files in the linux
> kernel source. And then there is all the linux kernel versions. Then
> there is godzilla and apache source ..

If you're doing some sort of token analysis on source-code you probably
want to look into how tsearch2 / trigram / Gist+GIN indexes work. It
might be that you're doing work in your app that the DB can handle for you.

>> indexing and full transaction control then you *do* want a RDBMS.
>
> We want one anyway. The problem is filling the data and the simple
> fetch and update queries on it.

OK

> I really think it would be worthwhile getting some developer to tell me
> where the network send is done in PG.

--
   Richard Huxton
   Archonet Ltd

Re: general PG network slowness (possible cure) (repost)

From: Alvaro Herrera
Peter T. Breuer wrote:

> I really think it would be worthwhile getting some developer to tell me
> where the network send is done in PG.

See src/backend/libpq/pqcomm.c (particularly internal_flush()).

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Alvaro Herrera:"
> > I really think it would be worthwhile getting some developer to tell me
> > where the network send is done in PG.
>
> See src/backend/libpq/pqcomm.c (particularly internal_flush()).

Yes. Thanks. That looks like it. It calls secure_write continually
until the buffer is empty.

Secure_write is located in be-secure.c, but I'm not using ssl, so the
call reduces to just

    n = send(port->sock, ptr, len, 0);

And definitely all those could be grouped if there are several to do.
But under normal circumstances the send will be pushing against a
little resistance (the copy to the driver/protocol stack buffer is faster
than the physical network send, by a ratio of GB/s to MB/s, or 1000 to
1), and thus all these sends will probably complete as a single unit
once they have been started.

It's worth a try.  At first I thought this might be too low level, but it
looks as though internal_flush is only triggered when some other buffer
is full, or deliberately, so it may be useful to block until it fires.

I'll try it.


Peter

Re: general PG network slowness (possible cure) (repost)

From: Tom Lane
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> And definitely all those could be grouped if there are several to do.

Except that in the situation you're describing, there's only a hundred
or two bytes of response to each query, which means that only one send()
will occur anyway.  (The flush call comes only when we are done
responding to the current client query.)

It's possible that for bulk data transmission situations we could
optimize things a bit better --- in particular I've wondered whether we
can reliably find out the MTU of the connection and use that as the
output buffer size, instead of trusting the kernel to choose the best
message boundaries --- but for the situation you're worried about
there will be only one send.

            regards, tom lane

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Tom Lane:"
> "Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> > And definitely all those could be grouped if there are several to do.
>
> Except that in the situation you're describing, there's only a hundred
> or two bytes of response to each query, which means that only one send()
> will occur anyway.  (The flush call comes only when we are done
> responding to the current client query.)

It may still be useful. The kernel won't necessarily send data as you
push it down to the network protocols and driver. The driver may decide
to wait for more data to accumulate, particularly if it only has a
couple of hundred bytes to send so far and the medium is high speed and
medium latency (fast ethernet).  It'll get fed up with waiting for more
data eventually, and send it out, but it is essentially waiting on
_itself_ in that case, since the outgoing data is required at the other
side of the net as a response to be processed before another query can
be sent out, only then prompting the postmaster to start stuffing the
output buffer with more bytes.

Waiting on oneself is bad for us procrastinators. We need some whips.

I'll try and really force a send, and try some more tricks.
Unfortunately this isn't really quite the right level, so I have to use
some heuristics. Can you guarantee that internal_flush is not called
until (a) the internal buffer is full, OR (b) we have finished
composing a reply, AND (c) there is no other way to send out data?

I also need to find where we begin to compose a reply. That's somewhere
well before internal flush ever gets called. I want to block output at
that point.

As it is, I can either unblock just before internal_flush and block
after, or block just before internal_flush and unblock after (:-)
that's not quite as daft as it sounds, but needs care). Really I want
to do

   query received
   *block output
   process query
   create response
   *unblock output
   send

Instead, I have here to do

   query received
   process query
   create response
   *unblock output
   send
   *block output

Which is not quite the same. It may work though, because the driver
will know nothing is going to go out while it is listening for the next
query, and it will not have sent anything prematurely or kept it back
inopportunely.
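
In code, the second scheme would look something like this (a sketch only;
it reuses a cork() helper like the one sketched earlier in the thread, and
the function pointer merely stands in for the existing send loop in
internal_flush, it is not PG's real interface):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* sketch of "unblock just before the flush, block again after": the
     * reply leaves as full frames, and nothing dribbles out while the
     * next query is being processed */
    static int flush_corked(int sock, int (*do_flush)(void))
    {
        int off = 0, on = 1, rc;

        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &off, sizeof(off)); /* unblock */
        rc = do_flush();                  /* the existing send() loop goes here */
        setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));   /* block again */
        return rc;
    }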

> It's possible that for bulk data transmission situations we could
> optimize things a bit better --- in particular I've wondered whether we
> can reliably find out the MTU of the connection and use that as the
> output buffer size, instead of trusting the kernel to choose the best
> message boundaries --- but for the situation you're worried about

Don't bother, I think.  MTU is often effectively only notional these
days at the hardware level in many media.

OTOH, on my little net, MTU really does mean something because it's
10BT.

> there will be only one send.

True.

Peter

Re: general PG network slowness (possible cure) (repost)

From: Tom Lane
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> "Also sprach Tom Lane:"
>> Except that in the situation you're describing, there's only a hundred
>> or two bytes of response to each query, which means that only one send()
>> will occur anyway.  (The flush call comes only when we are done
>> responding to the current client query.)

> It may still be useful. The kernel won't necessarily send data as you
> push it down to the network protocols and driver. The driver may decide
> to wait for more data to accumulate,

No, because we set TCP_NODELAY.  Once we've flushed a message to the
kernel, we don't want the kernel sitting on it --- any delay there adds
directly to the elapsed query time.  At least this is the case for the
final response to a query.  I'm not too clear on whether this means we
need to be careful about intermediate message boundaries when there's a
lot of data being sent.

            regards, tom lane

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Tom Lane:"
> > It may still be useful. The kernel won't necessarily send data as you
> > push it down to the network protocols and driver. The driver may decide
> > to wait for more data to accumulate,
>
> No, because we set TCP_NODELAY.  Once we've flushed a message to the

That just means "disable Nagle", which is indeed more or less the
correct thing to do ..  you don't want to sit around waiting  for more
data when we're sure there will be none, as you say.  Yet you also don't
want to send short data out prematurely, which disabling Nagle can
cause.

And disabling Nagle doesn't actually force data out immediately you want
it to be sent ...  it just disables extra waits imposed by the Nagle
algorithm/protocol.  It doesn't stop the driver from waiting around
because it feels taking the bus might be a bit premature right now,
for example.
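
To put the two knobs side by side (plain socket sketches, nothing
PG-specific):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <netinet/tcp.h>

    /* Nagle off: each write() becomes a packet straight away, however small */
    static void set_nodelay(int sock)
    {
        int one = 1;
        setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one));
    }

    /* whereas TCP_CORK (as sketched earlier) lets the application, not the
     * kernel, decide where the real message boundaries are */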

> kernel, we don't want the kernel sitting on it --- any delay there adds
> directly to the elapsed query time.  At least this is the case for the
> final response to a query.  I'm not too clear on whether this means we
> need to be careful about intermediate message boundaries when there's a
> lot of data being sent.

It's unclear. But not my situation.


If I clear TCP_CORK all data is sent at that point. If I set TCP_CORK
data is held until I clear TCP_CORK, or 200ms have passed with no send.

Peter

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Kenneth Marshall:"
> improvement from coalescing the packets. Good luck in your investigations.

While I am recompiling stuff, just some stats.

Typical network traffic analysis during the PG runs:

        Total Packets Processed 493,499
        Unicast 100.0%  493,417
        Broadcast       0.0%    82
        Multicast       0.0%    0
        Shortest         42 bytes
        Average Size    192 bytes
        Longest       1,514 bytes
        <= 64 bytes             0.0%       158
        64 to 128 bytes        77.3%   381,532
        129 to 256 bytes        6.8%    33,362
        257 to 512 bytes        8.6%    42,535
        513 to 1024 bytes       4.0%    19,577
        1025 to 1518 bytes      3.3%    16,335

Typical application rusage stats:

       time ./c -timeout 12000 -database postgresql://pebbles/d /tmp/tty_io..c
       user   system elapsed cpu
       7.866u 6.038s 5:49.13 3.9%      0+0k 0+0io 0pf+0w

Those stats show the system lost in i/o. It's neither in kernel nor in
userspace. Presumably the other side plus networking was the holdup.

For comparison, against localhost via loopback ("fake" networking):

       time ./c -timeout 12000 -database postgresql://localhost/d /tmp/tty_io..c
       user   system elapsed cpu
       9.483u 5.321s 2:41.78 9.1%      0+0k 0+0io 0pf+0w

but in that case postmaster was doing about 54% cpu, so the overall
cpu for server + client is 63%.

I moved to a unix domain socket and postmaster alone went to 68%.


       time ./c -timeout 12000 -database postgresql://unix/var/run/postgresql/d /tmp/tty_io..c
       user   system elapsed cpu
       9.569u 3.698s 2:52.41 7.6%      0+0k 0+0io 0pf+0w

The elapsed time is not much different between unix and localhost. One can
see that there is some i/o holdup because the two threads ought to do 100%
between them if handover of info were costless. The difference is time
spent waiting (the system was quiescent otherwise, apart from the
monitoring software, which shows only a fraction of a percent of load).
There were no memory shortages, and swap was disabled for the test (on
both sides).

For comparison, running against gdbm straight to disk:

      time ./c -timeout 12000  /tmp/tty_io..c
      user   system elapsed cpu
      2.637u 0.735s 0:05.34 62.9%     0+0k 0+0io 0pf+0w

Through localhost:

      time ./c -timeout 12000 -database gdbm://localhost/ptb/c /tmp/tty_io..c
      user   system elapsed cpu
      2.746u 3.699s 0:16.00 40.1%     0+0k 0+0io 0pf+0w

(the server process was at 35% cpu, for 75% total).

Across the net:

      time ./c -timeout 12000 -database gdbm://pebbles/ptb/c /tmp/tty_io..c
      user   system elapsed cpu
      2.982u 4.430s 1:03.44 7.9%      0+0k 0+0io 0pf+0w

(the server was at 7% cpu)


Have to go shopping ....

Peter

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Richard Huxton:"
> > scheme each time, for example!  (how that?). I could presumably also
> > help it by preloading the commands I will run and sending over the
> > params only with a "do a no. 17 now!".
>
> PREPARE/EXECUTE (or the equivalent libpq functions).

Yes, thank you. It seems to speed things up by a factor of 2.

But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ...  statement?
The manual seems to say no.

Peter

Re: general PG network slowness (possible cure) (repost)

From: Tom Lane
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ...  statement?
> The manual seems to say no.

No, you just prepare the SELECT.  At the protocol level, DECLARE CURSOR
is a tad useless.  You can still fetch the data in binary if you want...

            regards, tom lane

Re: general PG network slowness (possible cure) (repost)

From: "Peter T. Breuer"
"Also sprach Tom Lane:"
> "Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> > But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ...  statement?
> > The manual seems to say no.
>
> No, you just prepare the SELECT.  At the protocol level, DECLARE CURSOR
> is a tad useless.  You can still fetch the data in binary if you want...

How?  It's a 7.4 server (or may be, more generally) and declare binary
cursor is the only way I know to get binary data off it. AFAIR the only
other way works only for an 8.* server and consists of sending the query
with an annotation that a binary reply is expected.

Peter

Re: general PG network slowness (possible cure) (repost)

From: Tom Lane
"Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> "Also sprach Tom Lane:"
>> No, you just prepare the SELECT.  At the protocol level, DECLARE CURSOR
>> is a tad useless.  You can still fetch the data in binary if you want...

> How?  It's a 7.4 server (or may be, more generally) and declare binary
> cursor is the only way I know to get binary data off it. AFAIR the only
> other way works only for an 8.* server and consists of sending the query
> with an annotation that a binary reply is expected.

No, that works for a 7.4 server too; we haven't changed the protocol
since then.  (I forget though to what extent 7.4 libpq exposes the
capability.)

            regards, tom lane

Re: general PG network slowness (possible cure) (repost)

From: Bruce Momjian
Please let us know if there is something we should change in the
PostgreSQL source code.

---------------------------------------------------------------------------

Peter T. Breuer wrote:
> "Also sprach Tom Lane:"
> > > It may still be useful. The kernel won't necessarily send data as you
> > > push it down to the network protocols and driver. The driver may decide
> > > to wait for more data to accumulate,
> >
> > No, because we set TCP_NODELAY.  Once we've flushed a message to the
>
> That just means "disable Nagle", which is indeed more or less the
> correct thing to do ..  you don't want to sit around waiting  for more
> data when we're sure there will be none, as you say.  Yet you also don't
> want to send short data out prematurely, which disabling Nagle can
> cause.
>
> And disabling Nagle doesn't actually force data out immediately you want
> it to be sent ...  it just disables extra waits imposed by the Nagle
> algorithm/protocol.  It doesn't stop the driver from waiting around
> because it feels taking the bus might be a bit premature right now,
> for example.
>
> > kernel, we don't want the kernel sitting on it --- any delay there adds
> > directly to the elapsed query time.  At least this is the case for the
> > final response to a query.  I'm not too clear on whether this means we
> > need to be careful about intermediate message boundaries when there's a
> > lot of data being sent.
>
> It's unclear. But not my situation.
>
>
> If I clear TCP_CORK all data is sent at that point. If I set TCP_CORK
> data is held until I clear TCP_CORK, or 200ms have passed with no send.
>
> Peter

--
  Bruce Momjian  <bruce@momjian.us>          http://momjian.us
  EnterpriseDB                               http://www.enterprisedb.com

  + If your life is a hard drive, Christ can be your backup. +

Re: general PG network slowness (possible cure) (repost)

From: "Merlin Moncure"
On 5/26/07, Peter T. Breuer <ptb@inv.it.uc3m.es> wrote:
> "Also sprach Tom Lane:"
> > "Peter T. Breuer" <ptb@inv.it.uc3m.es> writes:
> > > But can I prepare a DECLARE x BINARY CURSOR FOR SELECT ...  statement?
> > > The manual seems to say no.
> >
> > No, you just prepare the SELECT.  At the protocol level, DECLARE CURSOR
> > is a tad useless.  You can still fetch the data in binary if you want...
>
> How?  It's a 7.4 server (or may be, more generally) and declare binary
> cursor is the only way I know to get binary data off it. AFAIR the only
> other way works only for an 8.* server and consists of sending the query
> with an annotation that a binary reply is expected.

You want to call PQexecPrepared and flip the resultFormat.

http://www.postgresql.org/docs/7.4/interactive/libpq-exec.html#LIBPQ-EXEC-MAIN

IMO, it's usually not worth bothering with binary unless you are
dealing with bytea objects.
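
Something like this (a sketch; it reuses the statement name from the
PREPARE sketch earlier in the thread and assumes the first result column
is an int4):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>
    #include <arpa/inet.h>   /* ntohl */
    #include <libpq-fe.h>

    /* sketch: same prepared statement as earlier, but with resultFormat = 1
     * so the data comes back binary; an int4 column arrives as 4 bytes in
     * network byte order */
    static void fetch_binary(PGconn *conn, const char *key)
    {
        const char *params[1] = { key };
        PGresult   *res = PQexecPrepared(conn, "get_row", 1, params,
                                         NULL, NULL, 1 /* binary results */);

        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        {
            uint32_t raw;
            memcpy(&raw, PQgetvalue(res, 0, 0), sizeof(raw));
            printf("val = %d\n", (int) ntohl(raw));
        }
        PQclear(res);
    }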

merlin