Discussion: [HACKERS] libpq: why we need to fetch all rows?


[HACKERS] libpq: why we need to fetch all rows?

From:
Alexander Demenshin
Date:
Hello!

  Currently libpq works as follows:

  If we do a SELECT, all rows are fetched into memory, and only then
  is the result returned to the caller.

  That is fine, but if the number of rows is too big, we use a LOT of memory.

  And if the SELECT scans a large table, we get the first result only
  after a long (relatively ;) time...

  Many SQL servers return rows as soon as they are found (ASAF), and this
  behavior is more friendly - a human doesn't need to wait before the
  first result appears.

  Just imagine a typical query from the web: a large database, and a
  SELECT with a condition that takes a long time to evaluate.

  In my case, I have a database with >150,000 rows, and the query is
  often based on a regular expression, so the search takes a long time.
  To the end user this looks like a server hang, at least the first
  time he uses it.

  Of course, implementing ASAF would require some changes in programs
  that use libpq, and it would cost something - e.g., the program would
  not know the total number of rows returned (does anybody use the row
  count?).

  But this functionality should at least be an option, IMHO.

  In particular, when using psql in batch mode, nobody is interested
  in the total row count, since: 1) it is returned after the last row;
  2) after the last row is fetched from psql output, we already know
  the total row count. ;)

  I've modified libpq for my purposes, but what do you think about
  adding this to future versions? Maybe as an alternative to the
  currently distributed libpq...

/Al

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
"Thomas G. Lockhart"
Date:
>   Currently libpq works as follows:
>
>   If we do a SELECT, all rows are fetched into memory, and only then
>   is the result returned to the caller.
>
>   That is fine, but if the number of rows is too big, we use a LOT of memory.
>
>   And if the SELECT scans a large table, we get the first result only
>   after a long (relatively ;) time...
>
>   Many SQL servers return rows as soon as they are found (ASAF), and this
>   behavior is more friendly - a human doesn't need to wait before the
>   first result appears.

I don't think that this can possibly work for a true multi-user RDBMS.
At any point during row fetches to satisfy a query, something could
abort the transaction (e.g. deadlock if doing row or page level locking,
or finding an invalid internal value). If you've already started
returning rows, you can't gracefully abort the operation.

Please give examples of true RDBMS SQL servers which return rows before
completing the database scan.

            - Tom

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Igor
Date:
you can use a cursor to read rows one at a time...
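To make the suggestion concrete, a cursor-based session might look like the following illustrative SQL (the table, column names, and pattern are hypothetical, and PostgreSQL requires the cursor to live inside a transaction, as noted elsewhere in this thread):

```sql
BEGIN;

-- Declare a cursor over the expensive query; declaring it sends no rows.
DECLARE big_cur CURSOR FOR
    SELECT id, name FROM big_table WHERE name ~ 'some.*pattern';

-- Pull rows in small batches, so the client never has to hold the
-- whole result set in memory at once.
FETCH FORWARD 100 FROM big_cur;
-- ...repeat FETCH until it returns no rows...

CLOSE big_cur;
COMMIT;
```

The batch size of 100 is arbitrary; a client trades round trips against memory by tuning it.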

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sun, 29 Jun 1997, Alexander Demenshin wrote:

> Hello!
>
>   Currently libpq works as follows:
>
>   If we do a SELECT, all rows are fetched into memory, and only then
>   is the result returned to the caller.
>
>   That is fine, but if the number of rows is too big, we use a LOT of memory.
>
>   And if the SELECT scans a large table, we get the first result only
>   after a long (relatively ;) time...
>
>   Many SQL servers return rows as soon as they are found (ASAF), and this
>   behavior is more friendly - a human doesn't need to wait before the
>   first result appears.
>
>   Just imagine a typical query from the web: a large database, and a
>   SELECT with a condition that takes a long time to evaluate.
>
>   In my case, I have a database with >150,000 rows, and the query is
>   often based on a regular expression, so the search takes a long time.
>   To the end user this looks like a server hang, at least the first
>   time he uses it.
>
>   Of course, implementing ASAF would require some changes in programs
>   that use libpq, and it would cost something - e.g., the program would
>   not know the total number of rows returned (does anybody use the row
>   count?).
>
>   But this functionality should at least be an option, IMHO.
>
>   In particular, when using psql in batch mode, nobody is interested
>   in the total row count, since: 1) it is returned after the last row;
>   2) after the last row is fetched from psql output, we already know
>   the total row count. ;)
>
>   I've modified libpq for my purposes, but what do you think about
>   adding this to future versions? Maybe as an alternative to the
>   currently distributed libpq...
>
> /Al
>

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
> Please give examples of true RDBMS SQL servers which return rows before
> completing the database scan.

My comments exactly.  I know of no such SQL engine.  Now, there are some
which will keep the RESULTS on the server, and only return them
one-by-one.  That may be a speed savings.  Also, if your query uses
ORDER BY, I am in the process of improving the performance of that code.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Igor
Date:
What could be done (and this is similar to what the PROGRESS database
does, I think): you could request a select on a very large table. The
server would NOT load all tuples that match the select, only the Oids
of those tuples. When you are ready to read data, you do a fetch, at
which time the Oid is used to get the entire tuple. So neither the
server nor the client ever keeps more than one tuple in memory at a
time. This might not even be too hard to implement. A list of Oids
could be a doubly-linked list, which would make it easy to traverse
back and forth with NEXT/PREV and other statements. A lookup table of
portals/cursors and their corresponding Oid tables could be maintained
by the server.

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sun, 29 Jun 1997, Bruce Momjian wrote:

> > Please give examples of true RDBMS SQL servers which return rows before
> > completing the database scan.
>
> My comments exactly.  I know of no such SQL engine.  Now, there are some
> which will keep the RESULTS on the server, and only return them
> one-by-one.  That may be a speed savings.  Also, if your query uses
> ORDER BY, I am in the process of improving the performance of that code.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
"Thomas G. Lockhart"
Date:
Bruce Momjian wrote:
>
> > Please give examples of true RDBMS SQL servers which return rows before
> > completing the database scan.
>
> My comments exactly.  I know of no such SQL engine.  Now, there are some
> which will keep the RESULTS on the server, and only return them
> one-by-one.  That may be a speed savings.  Also, if your query uses
> ORDER BY, I am in the process of improving the performance of that code.

Another possibility is to multi-thread the libpq front end, so it can
start returning rows while rows are still coming from the backend. For
fast machines or for a few rows there would be no performance
improvement, but for really big queries you might notice a significant
difference.

There have been several threads of discussion regarding multi-threading
in the backend. IMHO, many of the performance gains people are hoping to
get from that would be addressed by multi-threading frontends instead.
For example, a web server version of a frontend could run as a daemon
and connect to several database sessions, then service web requests from
those permanent sessions.

            - Tom

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> What could be done (and this is similar to what the PROGRESS database
> does, I think): you could request a select on a very large table. The
> server would NOT load all tuples that match the select, only the Oids
> of those tuples. When you are ready to read data, you do a fetch, at
> which time the Oid is used to get the entire tuple. So neither the
> server nor the client ever keeps more than one tuple in memory at a
> time. This might not even be too hard to implement. A list of Oids
> could be a doubly-linked list, which would make it easy to traverse
> back and forth with NEXT/PREV and other statements. A lookup table of
> portals/cursors and their corresponding Oid tables could be maintained
> by the server.

The problem is that most results are joins, and there are multiple Oids
to deal with.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
"Michael J. Miller Jr."
Date:
On Sun, 29 Jun 1997, Bruce Momjian wrote:

> > statements. A lookup table of portals/cursors and their corresponding Oid
> > tables could be maintained by the server.
>
> The problem is that most results are joins, and there are multiple Oids
> to deal with.

In this case couldn't you create a temporary table and return the name
of the table to the client?  It could then "cursor" through the temporary
table.
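The temporary-table idea can be sketched in SQL (illustrative only; the table and column names are hypothetical, and a real implementation would generate an unguessable scratch-table name on the server side):

```sql
BEGIN;

-- Materialize the join result once, into a per-client scratch table.
SELECT a.id, b.payload
INTO scratch_12345
FROM a, b
WHERE a.id = b.a_id;

-- The client then pages through the materialized rows with a cursor.
DECLARE scratch_cur CURSOR FOR SELECT * FROM scratch_12345;
FETCH FORWARD 100 FROM scratch_cur;
-- ...repeat as needed...

CLOSE scratch_cur;
DROP TABLE scratch_12345;
COMMIT;
```

The materialization cost is paid once up front, after which fetches are simple single-table reads.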

Mike Miller             mke@netcom.com

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> On Sun, 29 Jun 1997, Bruce Momjian wrote:
>
> > > statements. A lookup table of portals/cursors and their corresponding Oid
> > > tables could be maintained by the server.
> >
> > The problem is that most results are joins, and there are multiple Oids
> > to deal with.
>
> In this case couldn't you create a temporary table and return the name
> of the table to the client?  It could then "cursor" through the temporary
> table.

Yes, we could, but you would not want to do that all the time because of
performance.  You would have to determine whether that particular select
statement was going to need it.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
"Michael J. Miller Jr."
Date:
On Sun, 29 Jun 1997, Bruce Momjian wrote:

> > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> >
> > In this case couldn't you create a temporary table and return the name
> > of the table to the client?  It could then "cursor" through the temporary
> > table.
>
> Yes, we could, but you would not want to do that all the time because of
> performance.  You would have to determine whether that particular select
> statement was going to need it.

Well, you'd want to base this on the number of rows returned.  For small
sets it would probably be best to just send it all to the client.

Also, you could implement a low-overhead table type to make this work
better for large data sets.  Just do minimal indexing (probably on
an Oid equivalent) and lay the data out in a way that is quick to write
and retrieve.  For a lot of medium-sized datasets you could just do
this in memory.

Of course, I can see some security issues here.  You'd either want to
make it hard for clients other than the one that created the table to
get to the data, or at the very least make it difficult for the name of
the temporary table to be guessed.

Mike Miller             mke@netcom.com

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> On Sun, Jun 29, 1997 at 01:44:36PM -0400, Bruce Momjian wrote:
>
> > Because it is not standard SQL.  No database I know of returns multiple
> > rows without a 'declare' statement and a cursor.

Actually, standard SQL requires a cursor so you can name the cursor to
fetch from, but PostgreSQL requires it to be in a transaction, which is
not standard.

>
>   Not so good... But then, why not to modify backend and libpq
>   in a way, which will allow to fetch rows one-by-one?

OK, can you run a test?  If you start a transaction, declare a cursor,
and do a fetch over and over again, does it return one row at a time
from the server?  I think it does.

Now, it will not return from the first fetch until the query has
finished executing, but I don't think it will send all the results to
the client, only one at a time.
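The test described above could be run as a psql script along these lines (illustrative; `big_table` is hypothetical, and the interesting part is watching the client's memory use or the socket traffic between FETCHes):

```sql
-- Run inside a transaction, as PostgreSQL requires for cursors.
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM big_table;

-- If results are streamed, each FETCH should pull a single row
-- across the socket rather than the whole result set.
FETCH 1 FROM c;
FETCH 1 FROM c;
FETCH 1 FROM c;

END;
```

Note that the first FETCH can still block until the executor produces its first row (or, with an ORDER BY, until the sort completes), matching the caveat above.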

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> On Sun, 29 Jun 1997, Bruce Momjian wrote:
>
> > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> > >
> > > In this case couldn't you create a temporary table and return the name
> > > of the table to the client?  It could then "cursor" through the temporary
> > > table.
> >
> > Yes, we could, but you would not want to do that all the time because of
> > performance.  You would have to determine whether that particular select
> > statement was going to need it.
>
> Well, you'd want to base this on the number of rows returned.  For small
> sets it would probably be best to just send it all to the client.
>
> Also, you could implement a low-overhead table type to make this work
> better for large data sets.  Just do minimal indexing (probably on
> an Oid equivalent) and lay the data out in a way that is quick to write
> and retrieve.  For a lot of medium-sized datasets you could just do
> this in memory.
>
> Of course, I can see some security issues here.  You'd either want to
> make it hard for clients other than the one that created the table to
> get to the data, or at the very least make it difficult for the name of
> the temporary table to be guessed.

But then, what is the goal?  A temp table is going to be a performance
hit, and so is passing one row at a time from the backend.


--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
"Michael J. Miller Jr."
Date:
On Sun, 29 Jun 1997, Bruce Momjian wrote:

> >
> > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> >
> > > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
>
> But then, what is the goal?  A temp table is going to be a performance
> hit, and so is passing one row at a time from the backend.

I see two goals: first, you save memory on the client; second, you
improve performance, since you don't have to transfer all of a large
dataset to the client.  On a multi-megabyte result this can be a
very big win.

Of course I may be misunderstanding the issue.

Mike Miller             mke@netcom.com
-
(Pseudo advertising ahead)
Check out MySQL (http://www.tcx.se/), a mostly free lightning fast SQL DB.
And my manual for MySQL (http://mysql.turbolift.com/)

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> On Sun, 29 Jun 1997, Bruce Momjian wrote:
>
> > >
> > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> > >
> > > > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> >
> > But then, what is the goal?  A temp table is going to be a performance
> > hit, and so is passing one row at a time from the backend.
>
> I see two goals, first you save memory on the client.  Next you
> improve performance since you don't have to transfer all of a large
> dataset to a client.  On a multi-megabyte result this can be a
> very big win.

OK, saving memory on the client is a valid goal.  You could process and
discard one before getting the next row.

As far as performance, if you are going to want to look at all the rows,
you might as well get them all at once.  There is no temp table on the
server side, and the transfer via the socket happens with only one
request to the backend.  Much faster.


--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Igor
Date:
Not necessarily. A join could produce some sort of temporary table (disk
based or in memory), and each row produced by the join could be assigned
an Oid.

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sun, 29 Jun 1997, Bruce Momjian wrote:

> >
> > What could be done (and this is similar to what the PROGRESS database
> > does, I think): you could request a select on a very large table. The
> > server would NOT load all tuples that match the select, only the Oids
> > of those tuples. When you are ready to read data, you do a fetch, at
> > which time the Oid is used to get the entire tuple. So neither the
> > server nor the client ever keeps more than one tuple in memory at a
> > time. This might not even be too hard to implement. A list of Oids
> > could be a doubly-linked list, which would make it easy to traverse
> > back and forth with NEXT/PREV and other statements. A lookup table of
> > portals/cursors and their corresponding Oid tables could be maintained
> > by the server.
>
> The problem is that most results are joins, and there are multiple Oids
> to deal with.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Igor
Date:
After the table is created it will be stored to disk (costly), but in
memory you would again keep only the Oids, so that malloc'ing every tuple
won't be necessary, offsetting the cost of a temporary table.
Of course, we are talking about 10-100 thousand tuples, where preloading
an entire table into memory wouldn't even be possible if the machine
doesn't have enough RAM...

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sun, 29 Jun 1997, Bruce Momjian wrote:

> >
> > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> >
> > > > statements. A lookup table of portals/cursors and their corresponding Oid
> > > > tables could be maintained by the server.
> > >
> > > The problem is that most results are joins, and there are multiple Oids
> > > to deal with.
> >
> > In this case couldn't you create a temporary table and return the name
> > of the table to the client?  It could then "cursor" through the temporary
> > table.
>
> Yes, we could, but you would not want to do that all the time because of
> performance.  You would have to determine whether that particular select
> statement was going to need it.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Igor
Date:
The temporary table should be a private table, accessible only by the
person executing the query, and cleaned up either when the transaction
terminates, when PQfinish() is issued, or when the connection to the
associated client terminates... Even indexing wouldn't be necessary,
because the list of Oids could already be considered an index.

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sun, 29 Jun 1997, Michael J. Miller Jr. wrote:

> On Sun, 29 Jun 1997, Bruce Momjian wrote:
>
> > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> > >
> > > In this case couldn't you create a temporary table and return the name
> > > of the table to the client?  It could then "cursor" through the temporary
> > > table.
> >
> > Yes, we could, but you would not want to do that all the time because of
> > performance.  You would have to determine whether that particular select
> > statement was going to need it.
>
> Well, you'd want to base this on the number of rows returned.  For small
> sets it would probably be best to just send it all to the client.
>
> Also, you could implement a low-overhead table type to make this work
> better for large data sets.  Just do minimal indexing (probably on
> an Oid equivalent) and lay the data out in a way that is quick to write
> and retrieve.  For a lot of medium-sized datasets you could just do
> this in memory.
>
> Of course, I can see some security issues here.  You'd either want to
> make it hard for clients other than the one that created the table to
> get to the data, or at the very least make it difficult for the name of
> the temporary table to be guessed.
>
> Mike Miller             mke@netcom.com
>
>

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Igor
Date:
Does the portal store all the rows at the backend and feed them one by one
to the client, or does it go out to disk to get new rows?

=+=------------------------/\---------------------------------=+=
       Igor Natanzon      |**|   E-mail: igor@sba.miami.edu
=+=------------------------\/---------------------------------=+=

On Sun, 29 Jun 1997, Bruce Momjian wrote:

> >
> > On Sun, Jun 29, 1997 at 01:44:36PM -0400, Bruce Momjian wrote:
> >
> > > Because it is not standard SQL.  No database I know of returns multiple
> > > rows without a 'declare' statement and a cursor.
>
> Actually, standard SQL requires a cursor so you can name the cursor to
> fetch from, but PostgreSQL requires it to be in a transaction, which is
> not standard.
>
> >
> >   Not so good... But then, why not to modify backend and libpq
> >   in a way, which will allow to fetch rows one-by-one?
>
> OK, can you run a test?  If you start a transaction, declare a cursor,
> and do a fetch over and over again, does it return one row at a time
> from the server?  I think it does.
>
> Now, it will not return from the first fetch until the query has
> finished executing, but I don't think it will send all the results to
> the client, only one at a time.
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>

------------------------------

End of hackers-digest V1 #404
*****************************

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> Not necessarily. A join could produce some sort of temporary table (disk
> based or in memory), and each row produced by the join could be assigned
> an Oid.

True, in fact, I am working on the sort code now.  However, if there is
an order by, you have to get all the rows and sort them, before
returning any of them.  How many of these queries don't have an order
by?  Usually most multi-row results have one.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> After the table is created it will be stored to disk (costly), but in
> memory you would again keep only the Oids, so that malloc'ing every tuple
> won't be necessary, offsetting the cost of a temporary table.
> Of course, we are talking about 10-100 thousand tuples, where preloading
> an entire table into memory wouldn't even be possible if the machine
> doesn't have enough RAM...

You have to malloc every tuple because the result usually comes from
a few fields OF SEVERAL TABLES.  The cost of re-building the result for
every Oid requested would be big.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
Bruce Momjian
Date:
>
> Does the portal store all the rows at the backend and feed them one by one
> to the client, or does it go out to disk to get new rows?

Check out PerformPortalFetch().  Looks like it creates a memory area for
each portal, and allows you to move around in it.  This is probably why
a transaction is needed, because the memory is probably in a
transaction-specific pool that gets automatically freed at the end of
transactions.

But then if you look at the ExecutorRun() call, it appears to re-execute
the entire query for each fetch, which is a terrible waste.  So it may
be only saving the executor 'plan' in memory, and reexecuting it for
each fetch, rather than saving off the result of the query.

Perhaps someone can comment on this.  Vadim is on vacation until July
24th.

--
Bruce Momjian
maillist@candle.pha.pa.us

------------------------------

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
"Michael J. Miller Jr."
Date:
On Sun, 29 Jun 1997, Bruce Momjian wrote:

>
> OK, saving memory on the client is a valid goal.  You could process and
> discard one before getting the next row.
>
> As far as performance, if you are going to want to look at all the rows,
> you might as well get them all at once.  There is no temp table on the
> server side, and the transfer via the socket happens with only one
> request to the backend.  Much faster.

Flexibility is what this scheme buys.  If you want the whole thing,
you send a request to the database to send you the whole thing.  If you
want the first one hundred records, or records 3355 through the end,
you can get those as well.  The client can do pretty much whatever it
wants, but it doesn't have to have, say, 3 million rows shoved down its
throat if it doesn't want them.

I believe for large data sets you have to have this sort of functionality,
at least with remote clients.

-
Mike Miller             mke@netcom.com
-

------------------------------

End of hackers-digest V1 #405
*****************************

Re: [HACKERS] libpq: why we need to fetch all rows?

From:
PostgreSQL
Date:
You can put my comments on the "Reader's corner", if you have one.

On Sun, 29 Jun 1997, Bruce Momjian wrote:

> >
> > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> >
> > > >
> > > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> > > >
> > > > > > On Sun, 29 Jun 1997, Bruce Momjian wrote:
> > >
> > > But then, what is the goal?  A temp table is going to be a performance
> > > hit, and so is passing one row at a time from the backend.
> >
> > I see two goals, first you save memory on the client.  Next you
> > improve performance since you don't have to transfer all of a large
> > dataset to a client.  On a multi-megabyte result this can be a
> > very big win.
>
> OK, saving memory on the client is a valid goal.  You could process and
> discard one before getting the next row.

Sending rows ASAF (as soon as they are fetched) could save memory on the
backend, which is also a valid goal!

>
> As far as performance, if you are going to want to look at all the rows,
> you might as well get them all at once.  There is no temp table on the
> server side, and the transfer via the socket happens with only one
> request to the backend.  Much faster.

I tend to agree that sending rows ASAF could hurt performance, at
least the server throughput. The only thing you must be careful about is
that (as pointed out by someone before) the client is now part of the
transaction, so the application cannot use the data until the commit
point. Rows can be sent one by one or thousand by thousand, so that small
results could be fetched at once. On large results the client could write
data to disk.

I will do some tests and report in a couple of days. Since I am doing
PhD research on optimization on the client side (using PostgreSQL, of
course!), I would like to pick up the idea. I formally request permission
to use it. You can choose to point to whoever owns it, send me a
note to be included in some paper, or just let me quote the whole
PostgreSQL development team :).

>
>
> --
> Bruce Momjian
> maillist@candle.pha.pa.us
>
Thanks, regards, aso
.co.

Costin Oproiu
co@deuroconsult.ro
postgres@deuroconsult.ro

------------------------------