COPY, bytea streaming and memory footprint

Поиск
Список
Период
Сортировка
От Jerome Wagner
Тема COPY, bytea streaming and memory footprint
Дата
Msg-id CA+=V_fNkQyZFcUo2XEwFz95=9eaOUViyNcW0JbgiTdy8RsqcFA@mail.gmail.com
обсуждение исходный текст
Список pgsql-hackers
Hello,

I am trying to understand/optimize how a COPY operation behaves when transfering a bytea from the database to a client.

For simplicity, I'll consider that I have only one bytea _image_ in the _images_ table.

Starting with
COPY (SELECT image FROM images) TO STDOUT BINARY

I understand that :
 - the server will create a linear buffer on the server side holding the whole image and then start sending it over the network in one big copyData message chunked in 64KB network chunks
 - the client can manage to extract this copyData payload by re-assembling those chunks in memory or by streaming the relevant data parts of the chunks elsewhere.

so the problem I see in a streaming situation is that the server actually needs to buffer the whole image in memory.

Now the image is already compressed so if I
ALTER TABLE images ALTER image SET STORAGE EXTERNAL
I can use the fact that substring on non compressed toasted values will fetch only the needed parts and do

COPY (
  SELECT (
    SELECT substring(image from n for 65536) from images)
    FROM generate_series(1, (select length(image) from images), 65536) n
  ) TO STDOUT BINARY

As I understand it, this would be less memory intensive on the server side if the server starts sending rows before all rows of the subselect are built because it would only need to prepare a sequence of  65536 bytes long buffers for the rows it would decide to have in memory.

but is there a way to know if such a COPY/SELECT statement will indeed start sending rows before they are all prepared on the server ? Does it depend on the request and  is there a difference if I add an order by on the select versus the natural order of the table ?
How many rows will be needed in memory before the sending begins ?

I hope my explanation was clear. I am looking for help in better understanding how the server decides to stream the COPY data out of the server vs the internal retrieval of the COPY'd subselect.

Thank you
Jérôme







В списке pgsql-hackers по дате отправления:

Предыдущее
От: David Rowley
Дата:
Сообщение: Re: Parallel Seq Scan vs kernel read ahead
Следующее
От: David Rowley
Дата:
Сообщение: Recording test runtimes with the buildfarm