Discussion: [PATCH] COPY .. COMPRESSED


[PATCH] COPY .. COMPRESSED

From: Stephen Frost
Greetings,

  Attached is a patch to add a 'COMPRESSED' option to COPY which will
  cause COPY to expect a gzip'd file on input and which will output a
  gzip'd file on output.  Included is support for backend COPY, psql's
  \copy, regression tests for both, and documentation.

  On top of this I plan to submit a trivial patch to add support for
  this to file_fdw, allowing creation of FDW tables which operate
  directly on compressed files (including CSVs, which is what I need
  this patch for).

  I've also begun working on a patch to allow this capability to be used
  through pg_dump/pg_restore which would reduce the bandwidth used
  between the client and the server for backups and restores.  Ideally,
  one would also be able to use custom format dumps, with compression,
  even if the client-side pg_dump/pg_restore wasn't compiled with zlib
  support.

      Thanks,

        Stephen

Attachments

Re: [PATCH] COPY .. COMPRESSED

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
>   Attached is a patch to add a 'COMPRESSED' option to COPY which will
>   cause COPY to expect a gzip'd file on input and which will output a
>   gzip'd file on output.  Included is support for backend COPY, psql's
>   \copy, regression tests for both, and documentation.

I don't think it's a very good idea to invent such a specialized option,
nor to tie it to gzip, which is widely considered to be old news.

There was discussion (and, I think, a patch in the queue) for allowing
COPY to pipe into or out of an arbitrary shell pipe.  Why would that not
be enough to cover this use-case?  That is, instead of a hard-wired
capability, people would do something like COPY TO '| gzip >file.gz'.
Or they could use bzip2 or whatever struck their fancy.
        regards, tom lane
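
For concreteness, the mechanism behind such a pipe option is small.  A
minimal sketch in C, with a hypothetical helper name (the queued patch's
actual code may well differ):

#include <stdio.h>
#include <stdbool.h>

/*
 * Hypothetical sketch of the pipe idea: a COPY target that starts with
 * '|' is opened as a shell pipeline via popen() rather than as a file.
 */
static FILE *
copy_open_target(const char *target, bool for_write)
{
    if (target[0] == '|')
        return popen(target + 1, for_write ? "w" : "r");   /* shell pipe */
    return fopen(target, for_write ? "w" : "r");           /* plain file */
}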



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
Tom,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> >   Attached is a patch to add a 'COMPRESSED' option to COPY which will
> >   cause COPY to expect a gzip'd file on input and which will output a
> >   gzip'd file on output.  Included is support for backend COPY, psql's
> >   \copy, regression tests for both, and documentation.
>
> I don't think it's a very good idea to invent such a specialized option,
> nor to tie it to gzip, which is widely considered to be old news.

We're already using gzip/zlib for pg_dump/pg_restore, so it was simple
and straightforward to add and would allow utilizing this option while
keeping the custom dump format the same.  It also happens to match what
I need.  While gzip might be 'old hat', it's still extremely popular.
I'd be happy to add support for bzip2 or something else that people are
interested in, and to support compression options for zlib if necessary
too.  This was intended to get the ball rolling, as the last discussion
I had seen while hunting through the archives was from 2006; obviously I
missed the boat on the last set of patches.

> There was discussion (and, I think, a patch in the queue) for allowing
> COPY to pipe into or out of an arbitrary shell pipe.  Why would that not
> be enough to cover this use-case?  That is, instead of a hard-wired
> capability, people would do something like COPY TO '| gzip >file.gz'.
> Or they could use bzip2 or whatever struck their fancy.

Sounds like a nice idea, but I can't imagine it'd be available to anyone
except for superusers, and looking at that patch, that's exactly the
restriction which is in place for it.  In addition, that patch's support
for "\copy" implements everything locally, making it little different
from "zcat mycsv.csv.gz | psql".  The patch that I proposed actually
sent the compressed stream across the wire, reducing bandwidth
utilization.

All that said, I've nothing against having the pipe option for the
backend COPY command; I'm a bit annoyed with myself for somehow missing
that patch.  I don't like what it's doing with psql's \copy command and
would rather we figure out a way to support PROGRAM .. TO STDOUT, but
that would still require superuser privileges.  I don't see any easy way to
support compressed data streaming to/from the server for COPY w/o
defining what methods are available or coming up with some ACL system
for what programs can be called by the backend.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Simon Riggs
On 14 January 2013 13:43, Stephen Frost <sfrost@snowman.net> wrote:
> Tom,
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>> >   Attached is a patch to add a 'COMPRESSED' option to COPY which will
>> >   cause COPY to expect a gzip'd file on input and which will output a
>> >   gzip'd file on output.  Included is support for backend COPY, psql's
>> >   \copy, regression tests for both, and documentation.
>>
>> I don't think it's a very good idea to invent such a specialized option,
>> nor to tie it to gzip, which is widely considered to be old news.
>
> We're already using gzip/zlib for pg_dump/pg_restore, so it was simple
> and straightforward to add and would allow utilizing this option while
> keeping the custom dump format the same.

Both thoughts are useful, I think.

There is a new option being added to pre/post process data, so it
seems like the best way to add new features - in general.

Specifically, we do support compressed output so a simple patch to
allow re-loading of the compressed data we generate does seem sensible
and reasonable.

--
Simon Riggs                   http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Simon Riggs (simon@2ndQuadrant.com) wrote:
> There is a new option being added to pre/post process data, so it
> seems like the best way to add new features - in general.

That structure appears to have no option for passing compressed data to
or from a client connection.  Instead, it actually overloads the typical
meaning for options sent to \copy (which, imv, is "run COPY on the server
with these options and have the results stored locally") to mean
something different (run part of the COPY command on the server and part
of it locally).

> Specifically, we do support compressed output so a simple patch to
> allow re-loading of the compressed data we generate does seem sensible
> and reasonable.

Right, we're already using gzip for pg_dump/pg_restore.  This just gives
an option to move that compression over to the server side.  Also, I'd
be happy to add support for other compression options.

I do like the idea of a generalized answer which just runs a
user-provided command on the server but that's always going to require
superuser privileges.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Claudio Freire
On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
> I do like the idea of a generalized answer which just runs a
> user-provided command on the server but that's always going to require
> superuser privileges.

Unless it's one of a set of superuser-authorized compression tools.



Re: [PATCH] COPY .. COMPRESSED

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> I do like the idea of a generalized answer which just runs a
> user-provided command on the server but that's always going to require
> superuser privileges.

The design that was being kicked around allowed pipes to be used on the
client side too, ie \copy foo to '| gzip ...'.  That form would not
require any special privileges, and might be thought preferable for
another reason too: it offloads the work from the server.
        regards, tom lane



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Claudio Freire (klaussfreire@gmail.com) wrote:
> On Mon, Jan 14, 2013 at 1:01 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > I do like the idea of a generalized answer which just runs a
> > user-provided command on the server but that's always going to require
> > superuser privileges.
>
> Unless it's one of a set of superuser-authorized compression tools.

Which would require a new ACL system for handling that, as I mentioned..
That certainly isn't what the existing patch does.

What would that look like?  How would it operate?  How would a user
invoke it or even know what options are available?  Would we provide
anything by default?  It's great to consider that possibility but
there's a lot of details involved.

I'm a bit nervous about having a generalized system which can run
anything on the system when called by a superuser, but which, when
called by a regular user, puts us on the hook to verify the request
against a superuser-provided list and then make sure nothing goes wrong.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Stephen Frost <sfrost@snowman.net> writes:
> > I do like the idea of a generalized answer which just runs a
> > user-provided command on the server but that's always going to require
> > superuser privileges.
>
> The design that was being kicked around allowed pipes to be used on the
> client side too, ie \copy foo to '| gzip ...'.  That form would not
> require any special privileges, and might be thought preferable for
> another reason too: it offloads the work from the server.

It's a different use-case which, imv, is really already trivially
covered:

psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz

While there is no option currently for having the server do the
compression before sending the data over the wire.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Peter Eisentraut
On 1/14/13 11:28 AM, Stephen Frost wrote:
> While there is no option currently for having the server do the
> compression before sending the data over the wire.

OpenSSL?



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Peter Eisentraut (peter_e@gmx.net) wrote:
> On 1/14/13 11:28 AM, Stephen Frost wrote:
> > While there is no option currently for having the server do the
> > compression before sending the data over the wire.
>
> OpenSSL?

To be honest, I expected that to come up earlier in this discussion.
It'd be redundant to use OpenSSL for compression and then ALSO do
compression on the client side to save into a custom format dump.
There's also plenty of reasons to not want to deal with OpenSSL just to
have compression support.  Now, protocol-level on-the-wire compression
is another option, but there's quite a few drawbacks to that and quite a
bit of work involved.  Having support for COPY-based compression could
be an answer for many cases where on-the-wire compression is desirable.

Being able to use pipes for the backend side of COPY is a good
solution for that.  I'm looking forward to having it and plan to review
the patch.  That said, I'd like to find an answer to some of these other
use cases, if possible.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Greg Stark
On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Peter Eisentraut (peter_e@gmx.net) wrote:
>> On 1/14/13 11:28 AM, Stephen Frost wrote:
>> > While there is no option currently for having the server do the
>> > compression before sending the data over the wire.
>>
>> OpenSSL?
>
> To be honest, I expected that to come up earlier in this discussion.
> It'd be redundant to use OpenSSL for compression and then ALSO do
> compression on the client side to save into a custom format dump.

For what it's worth there was a security announcement not long ago
that made OpenSSL disable compression in streams by default. I'm not
sure if it's relevant to Postgres or not.

-- 
greg



Re: [PATCH] COPY .. COMPRESSED

From: Claudio Freire
On Mon, Jan 14, 2013 at 11:33 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Now, protocol-level on-the-wire compression
> is another option, but there's quite a few drawbacks to that and quite a
> bit of work involved.  Having support for COPY-based compression could
> be an answer for many cases where on-the-wire compression is desirable.

Like?

Postgres' packetized protocol lends itself quite well to this kind of
thing. It could even be done on a packet-by-packet basis. The only
drawback I see, is that it pretty much rules out piping through
arbitrary commands (a protocol needs to be very clearly defined).



Re: [PATCH] COPY .. COMPRESSED

From: Peter Eisentraut
On 1/13/13 9:16 PM, Stephen Frost wrote:
> On top of this I plan to submit a trivial patch to add support for
>   this to file_fdw, allowing creation of FDW tables which operate
>   directly on compressed files (including CSVs, which is what I need
>   this patch for).
> 
>   I've also begun working on a patch to allow this capability to be used
>   through pg_dump/pg_restore which would reduce the bandwidth used
>   between the client and the server for backups and restores.  Ideally,
>   one would also be able to use custom format dumps, with compression,
>   even if the client-side pg_dump/pg_restore wasn't compiled with zlib
>   support.

I think a problem is that this code is now serving such different uses.

Operating on compressed files transparently in file_fdw is obviously
useful, but why only gzip?  The gold standard is GNU tar, which can
operate on any compressed file in a variety of compression formats
without even having to specify an option.

Writing compressed COPY output files on the backend has limited uses, at
least none have been clearly explained, and the popen patch might
address those better.

Writing compressed COPY output on the frontend can already be done
differently.

Compression on the wire is a different debate and it probably shouldn't
be snuck in through this backdoor.

Putting compressed COPY output from the backend straight into a
compressed pg_dump file sounds interesting, but this patch doesn't do
that yet, and I think there will be more issues to solve there.



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Peter Eisentraut (peter_e@gmx.net) wrote:
> Operating on compressed files transparently in file_fdw is obviously
> useful, but why only gzip?

This isn't really an argument, imv.  It's only gzip *right this moment*
because that's all that I implemented.  I've already offered to add
bzip2 or whatever else people would like.

> The gold standard is GNU tar, which can
> operate on any compressed file in a variety of compression formats
> without even having to specify an option.

Yes, that's what I was hoping to get to, eventually.

> Writing compressed COPY output files on the backend has limited uses, at
> least none have been clearly explained, and the popen patch might
> address those better.

I do see value in the popen patch for server-side operations.

> Writing compressed COPY output on the frontend can already be done
> differently.

Certainly.  In a similar vein, I'm not convinced that the popen patch
for psql's \copy is really a great addition.

> Compression on the wire is a different debate and it probably shouldn't
> be snuck in through this backdoor.

Considering the COPY-COMPRESSED-to-FE piece is the vast majority of the
patch, I hope you understand that it certainly wasn't my intent to try
and 'sneak it in'.  Support for reading and writing compressed files
with COPY directly from the FE was one of my goals from the start on
this.

> Putting compressed COPY output from the backend straight into a
> compressed pg_dump file sounds interesting, but this patch doesn't do
> that yet, and I think there will be more issues to solve there.

Let me just vent my dislike for the pg_dump code. :)  Probably half the
time spent on this overall patch was fighting with that to make it work,
and it's actually about 90% of the way there, imv.  Getting the
compressed data into pg_dump is working in my local branch when going to
a directory-format dump, but the custom format is causing me some
difficulties, which I believe are related to the blocking that's used
and the fact that the blocks coming off the wire are 'full-size', if you
will, instead of being chunked down to 4KB by the client-side
compression.  I've simply not had time to debug and fix it, and I wanted
to get the general patch out for discussion (which I'm glad that I did,
given that there's other related work going on).
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Claudio Freire (klaussfreire@gmail.com) wrote:
> Postgres' packetized protocol lends itself quite well to this kind of
> thing. It could even be done on a packet-by-packet basis. The only
> drawback I see, is that it pretty much rules out piping through
> arbitrary commands (a protocol needs to be very clearly defined).

Actually, wouldn't PG's packet-based protocol be exactly the wrong way
to do any kind of good on-the-wire compression?  You certainly wouldn't
want to waste time compressing small packets, such as a single INSERT
command or similar, as you'll always have to send a packet out anyway.
Even doing it at the protocol level with something ssl-like, where
you wrap the entire connection, wouldn't help if the client has a
process along the lines of:

send INSERT command
wait for response
send INSERT command
wait for response
..
..

Since you'd still have to flush after each small bit of data.

Where it does work well is when you move into a bulk-data mode (ala
COPY) and can compress relatively large amounts of data into a smaller
number of full-size packets to be sent.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Claudio Freire
On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote:
> Where it does work well is when you move into a bulk-data mode (ala
> COPY) and can compress relatively large amounts of data into a smaller
> number of full-size packets to be sent.

Well... exactly. COPY is one case, big result sets are another.
And packet headers can include whether each packet is compressed or
not, which is quite transparent and easy to handle. There could even
be a negotiation phase and make it backwards-compatible.



Re: [PATCH] COPY .. COMPRESSED

From: Garick Hamlin
On Tue, Jan 15, 2013 at 01:35:57PM +0000, Greg Stark wrote:
> On Tue, Jan 15, 2013 at 2:33 AM, Stephen Frost <sfrost@snowman.net> wrote:
> > * Peter Eisentraut (peter_e@gmx.net) wrote:
> >> On 1/14/13 11:28 AM, Stephen Frost wrote:
> >> > While there is no option currently for having the server do the
> >> > compression before sending the data over the wire.
> >>
> >> OpenSSL?
> >
> > To be honest, I expected that to come up earlier in this discussion.
> > It'd be redundant to use OpenSSL for compression and then ALSO do
> > compression on the client side to save into a custom format dump.
> 
> For what it's worth there was a security announcement not long ago
> that made OpenSSL disable compression in streams by default. I'm not
> sure if it's relevant to Postgres or not.

It's an interesting question.  It might be.  I thought at first it
wouldn't be relevant, but on reflection it is.

This attack is called the CRIME attack.  This class of attacks stems
from reuse of a dictionary across some sort of confidentiality boundary.
The attacker looks at the traffic and notices 'how big' the network
response is.  This tells the attacker whether the compressor has already
seen the text.

So imagine I have a website and I keep session cookies, user names, and
password crypts in a database.  Imagine the session key is a long
hexadecimal number.  As an attacker, I could send, in the username field
of the login form, guesses at the prefix of somebody else's session key
while I peek at the encrypted traffic from the script to the database.
As the prefix match length increases, the database reply gets shorter.
Essentially, it's a side-channel attack that ends up reducing guessing
to a radix search by prefix of all valid session keys.  In this attack,
I don't have access to the database except through the web form, but I
can see the encrypted database traffic.

It is not a huge vulnerability, but yes, in some use cases, if
PostgreSQL used compression it might provide a difficult but possible
route.

Garick
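
A small zlib demonstration of the length oracle described above (the
strings are purely illustrative, and with inputs this short the
difference is only a byte or two, when it shows at all):

#include <stdio.h>
#include <string.h>
#include <zlib.h>

/* Compress a string with zlib and return the compressed length. */
static uLong
compressed_len(const char *s)
{
    Bytef   out[256];
    uLongf  outlen = sizeof(out);

    compress(out, &outlen, (const Bytef *) s, strlen(s));
    return outlen;
}

int
main(void)
{
    /*
     * The secret ("deadbeef") and the attacker's guess travel in the same
     * compressed stream; a guess matching a prefix of the secret can be
     * encoded as a back-reference, so it tends to compress smaller.
     */
    printf("wrong prefix: %lu bytes\n",
           compressed_len("session=deadbeef user=0123"));
    printf("right prefix: %lu bytes\n",
           compressed_len("session=deadbeef user=dead"));
    return 0;
}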



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Claudio Freire (klaussfreire@gmail.com) wrote:
> On Tue, Jan 15, 2013 at 1:08 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > Where it does work well is when you move into a bulk-data mode (ala
> > COPY) and can compress relatively large amounts of data into a smaller
> > number of full-size packets to be sent.
>
> Well... exactly. COPY is one case, big result sets are another.
> And packet headers can include whether each packet is compressed or
> not, which is quite transparent and easy to handle. There could even
> be a negotiation phase and make it backwards-compatible.

COPY and a large result set are the only cases, and a large result set
could easily be put inside of a COPY statement.  I agree that large
result sets outside of COPY could benefit from compression and perhaps
we can formulate a way to support that also.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Bruce Momjian
On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote:
> * Peter Eisentraut (peter_e@gmx.net) wrote:
> > Operating on compressed files transparently in file_fdw is obviously
> > useful, but why only gzip?
> 
> This isn't really an argument, imv.  It's only gzip *right this moment*
> because that's all that I implemented.  I've already offered to add
> bzip2 or whatever else people would like.

And this leads to a support-my-compression-binary-of-the-day mess.  Why
not just allow them to do '|compression-binary'?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Bruce Momjian (bruce@momjian.us) wrote:
> On Tue, Jan 15, 2013 at 10:55:04AM -0500, Stephen Frost wrote:
> > * Peter Eisentraut (peter_e@gmx.net) wrote:
> > > Operating on compressed files transparently in file_fdw is obviously
> > > useful, but why only gzip?
> >
> > This isn't really an argument, imv.  It's only gzip *right this moment*
> > because that's all that I implemented.  I've already offered to add
> > bzip2 or whatever else people would like.
>
> And this leads to a support-my-compression-binary-of-the-day mess.  Why
> not just allow them to do '|compression-binary'?

You're right; to clarify, for *file_fdw*, which is a backend-only
operation, the popen patch is great (I thought I made that clear before).

The popen patch doesn't support the '|compression-binary' option through
the FE protocol.  Even if it did, it would only be available for
superusers as we can't allow regular users to run arbitrary commands on
the server-side.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Christopher Browne
On Tue, Jan 15, 2013 at 2:53 PM, Stephen Frost <sfrost@snowman.net> wrote:
> The popen patch doesn't support the '|compression-binary' option through
> the FE protocol.  Even if it did, it would only be available for
> superusers as we can't allow regular users to run arbitrary commands on
> the server-side.

That points towards a fix that involves having a set of non-arbitrary commands
that we allow plain users to use.

Hmm.  There's an interesting thought...

How about having a "pg_filters" table in pg_catalog which allows capturing
labels and names of known-to-be-safe binary filters:

insert into pg_filters (label, location)
values
('zcat', '/usr/bin/zcat'),
('bzip2', '/usr/bin/bzip2'),
('bunzip2', '/usr/bin/bunzip2');

And then having some capability to grant permissions to roles to use
these filters.

That's not a "version 1" capability...  Suppose we have, in 9.3, direct
references to "|/usr/bin/zcat" (and such), and then hope, in 9.4, to
tease this out into a non-superuser-capable facility via the above
pg_filters?

These filters should be useful for FDWs as well as for COPY.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: [PATCH] COPY .. COMPRESSED

From: Bruce Momjian
On Tue, Jan 15, 2013 at 03:37:07PM -0500, Christopher Browne wrote:
> On Tue, Jan 15, 2013 at 2:53 PM, Stephen Frost <sfrost@snowman.net> wrote:
> > The popen patch doesn't support the '|compression-binary' option through
> > the FE protocol.  Even if it did, it would only be available for
> > superusers as we can't allow regular users to run arbitrary commands on
> > the server-side.
> 
> That points towards a fix that involves having a set of non-arbitrary commands
> that we allow plain users to use.
> 
> Hmm.  There's an interesting thought...
> 
> How about having a "pg_filters" table in pg_catalog which allows capturing
> labels and names of known-to-be-safe binary filters:
> 
> insert into pg_filters (label, location)
> values
> ('zcat', '/usr/bin/zcat'),
> ('bzip2', '/usr/bin/bzip2'),
> ('bunzip2', '/usr/bin/bunzip2');
> 
> And then having some capability to grant permissions to roles to use
> these filters.
> 
> That's not a "version 1" capability...  Suppose we have, in 9.3, direct
> references to "|/usr/bin/zcat" (and such), and then hope, in 9.4, to
> tease this out into a non-superuser-capable facility via the above
> pg_filters?
> 
> These filters should be useful for FDWs as well as for COPY.

Well, COPY is super-user only, so it seems only useful for FDW, no?  We
already have lots of user-configuration FDW commands, so I can see
adding this one too.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Christopher Browne (cbbrowne@gmail.com) wrote:
> How about having a "pg_filters" table in pg_catalog which allows capturing
> labels and names of known-to-be-safe binary filters:

I was considering that (though I was thinking they'd be
"transformations" rather than filters; filter implies that you're
removing something, imv), but as I mentioned upthread, there are dangers
in that direction and having a default set of options strikes me as a
lot more challenging to provide.

> insert into pg_filters (label, location)
> values
> ('zcat', '/usr/bin/zcat'),
> ('bzip2', '/usr/bin/bzip2'),
> ('bunzip2', '/usr/bin/bunzip2');

We'd need to include which direction is supported also, I think.

> And then having some capability to grant permissions to roles to use
> these filters.

Yes, an additional ACL system, as I mentioned upthread, would be
required for this.

> That's not a "version 1" capability...  Suppose we have, in 9.3, direct
> references to "|/usr/bin/zcat" (and such), and then hope, in 9.4, to
> tease this out into a non-superuser-capable facility via the above
> pg_filters?

It would be good to flesh out what the syntax, etc, would look like for
this, if we're going to support it, before we go down a road that limits
us in what we can do.  For example, if we implement the existing popen
call, and then later want to allow non-superusers to use certain
filters, how would the non-superuser specify the filter?  I really don't
think we want to be taking the shell-like command provided by a
non-superuser and then trying to match it against a list of commands in a
table..

> These filters should be useful for FDWs as well as for COPY.

I'm not sure I see how any FDW beyond file_fdw would really benefit from
this..?  I don't think a MySQL FDW or Redis FDW would gain anything...
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Bruce Momjian (bruce@momjian.us) wrote:
> Well, COPY is super-user only, so it seems only useful for FDW, no?  We
> already have lots of user-configuration FDW commands, so I can see
> adding this one too.

COPY is most certainly not superuser-only..  COPY w/ popen, if that
popen can call anything, would certainly have to be superuser-only.

COPY TO STDOUT / FROM STDIN is available to and used a huge amount by
non-superusers.  Would be great if we could allow that to work with
compressed data also, imv.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Bruce Momjian (bruce@momjian.us) wrote:
>> And this leads to a support-my-compression-binary-of-the-day mess.  Why
>> not just allow them to do '|compression-binary'?

> The popen patch doesn't support the '|compression-binary' option through
> the FE protocol.  Even if it did, it would only be available for
> superusers as we can't allow regular users to run arbitrary commands on
> the server-side.

I find the argument that this supports compression-over-the-wire to be
quite weak, because COPY is only one form of bulk data transfer, and
one that a lot of applications don't ever use.  If we think we need to
support transmission compression for ourselves, it ought to be
integrated at the wire protocol level, not in COPY.

Just to not look like I'm rejecting stuff without proposing
alternatives, here is an idea about a backwards-compatible design for
doing that: we could add an option that can be set in the connection
request packet.  Say, "transmission_compression = gzip".
        regards, tom lane
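
From libpq's side this would presumably surface as just another
connection keyword.  A sketch under that assumption; note that
"transmission_compression" is purely hypothetical, and today's libpq
would reject it as an unknown parameter:

#include <stdio.h>
#include <libpq-fe.h>

/*
 * Hypothetical only: "transmission_compression" exists nowhere but in
 * this thread.  Because startup-packet options are keyword/value pairs,
 * an old server could refuse it cleanly, with no protocol version bump.
 */
PGconn *
connect_compressed(const char *host, const char *dbname)
{
    char        conninfo[256];

    snprintf(conninfo, sizeof(conninfo),
             "host=%s dbname=%s transmission_compression=gzip",
             host, dbname);
    return PQconnectdb(conninfo);
}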



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> I find the argument that this supports compression-over-the-wire to be
> quite weak, because COPY is only one form of bulk data transfer, and
> one that a lot of applications don't ever use.  If we think we need to
> support transmission compression for ourselves, it ought to be
> integrated at the wire protocol level, not in COPY.

As far as I can tell, COPY is the option which is strongly recommended
for bulk-data-operations.  I can see the use-case for wanting SELECT
results to be compressed, but it strikes me as the 10% case for PG users
rather than the 90% one.  Ditto for COPY vs. large INSERT .. VALUES.

Compressing every small packet seems like it'd be overkill and might
surprise people by actually reducing performance in the case of lots of
small requests.

It also strikes me as a bit silly to do something like:

zcat myfile.gz | \
psql -Z -c "COPY mytable FROM STDIN;"

> Just to not look like I'm rejecting stuff without proposing
> alternatives, here is an idea about a backwards-compatible design for
> doing that: we could add an option that can be set in the connection
> request packet.  Say, "transmission_compression = gzip".

Alright, do we want/need to support multiple options there?  What do
people think we should support?  Any other particular details or issues
that come to mind with such an implementation?

I'm willing to work through that if it's the route everyone agrees with.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Bruce Momjian
On Tue, Jan 15, 2013 at 04:22:48PM -0500, Stephen Frost wrote:
> * Bruce Momjian (bruce@momjian.us) wrote:
> > Well, COPY is super-user only, so it seems only useful for FDW, no?  We
> > already have lots of user-configuration FDW commands, so I can see
> > adding this one too.
> 
> COPY is most certainly not superuser-only..  COPY w/ popen, if that
> popen can call anything, would certainly have to be superuser-only.

COPY with a file name is super-user-only.  I am unclear how you would
use STDIN/STDOUT in any meaningful way with binary data produced by
compression.  I guess you could with libpq.

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Bruce Momjian (bruce@momjian.us) wrote:
> COPY with a file name is super-user-only.  I am unclear how you would
> use STDIN/STDOUT in any meaningful way with binary data produced by
> compression.  I guess you could with libpq.

The patch that I posted provided this:

psql -h myhost -c "COPY mytable FROM STDIN COMPRESSED;" < myfile.gz

With the compressed file being transferred unmolested to the server side
where it was decompressed and processed by the server.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> I find the argument that this supports compression-over-the-wire to be
>> quite weak, because COPY is only one form of bulk data transfer, and
>> one that a lot of applications don't ever use.  If we think we need to
>> support transmission compression for ourselves, it ought to be
>> integrated at the wire protocol level, not in COPY.

> As far as I can tell, COPY is the option which is strongly recommended
> for bulk-data-operations.  I can see the use-case for wanting SELECT
> results to be compressed, but it strikes me as the 10% case for PG users
> rather than the 90% one.  Ditto for COPY vs. large INSERT .. VALUES.

Really?  Given that libpq provides no useful support for doing anything
with COPY data, much less higher-level packages such as Perl DBI, I'd
venture that the real-world ratio is more like 90/10.  If not 99/1.
There might be a few souls out there who are hardy enough and concerned
enough with performance to have made their apps speak COPY protocol,
and not given up on it the first time they hit a quoting/escaping bug
... but not many, I bet.

> Compressing every small packet seems like it'd be overkill and might
> surprise people by actually reducing performance in the case of lots of
> small requests.

Yeah, proper selection and integration of a compression method would be
critical, which is one reason that I'm not suggesting a plugin for this.
You couldn't expect any-random-compressor to work well.  I think zlib
would be okay though when making use of its stream compression features.
The key thing there is to force a stream buffer flush (too lazy to look
up exactly what zlib calls it, but they have the concept) exactly when
we're about to do a flush to the socket.  That way we get cross-packet
compression but don't have a problem with the compressor failing to send
the last partial message when we need it to.
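
The zlib operation in question is Z_SYNC_FLUSH.  A minimal sketch of
the shape this takes (a hypothetical function, not proposed backend
code): compress with Z_NO_FLUSH normally, and switch to Z_SYNC_FLUSH
exactly when the buffer is about to go to the socket:

#include <stddef.h>
#include <zlib.h>

/*
 * Hypothetical sketch: one z_stream per connection (set up once with
 * deflateInit) gives cross-packet compression, while Z_SYNC_FLUSH at
 * socket-flush time forces out any partial compressed data, so the peer
 * never waits on bytes the compressor is still holding back.
 */
static size_t
compress_outgoing(z_stream *strm, const char *in, size_t inlen,
                  char *out, size_t outsz, int flushing_to_socket)
{
    strm->next_in = (Bytef *) in;
    strm->avail_in = inlen;
    strm->next_out = (Bytef *) out;
    strm->avail_out = outsz;

    deflate(strm, flushing_to_socket ? Z_SYNC_FLUSH : Z_NO_FLUSH);

    return outsz - strm->avail_out;     /* compressed bytes produced */
}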

(My suggestion of an expansible option is for future-proofing, not
because I think we'd try to support more than one option today.)
        regards, tom lane



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Really?  Given that libpq provides no useful support for doing anything
> with COPY data, much less higher-level packages such as Perl DBI, I'd
> venture that the real-world ratio is more like 90/10.  If not 99/1.

Perhaps I'm taking a bit too narrow view of the world, but my thinking
is OLTP won't want things compressed, as it increases latency of
requests, while OLAP users are operating with enough data that they'll
go through the effort to use COPY.

> There might be a few souls out there who are hardy enough and concerned
> enough with performance to have made their apps speak COPY protocol,
> and not given up on it the first time they hit a quoting/escaping bug
> ... but not many, I bet.

The Perl/PHP/Ruby/etc users that are writing OLTP systems aren't likely
going to be interested in this.  The OLAP users are likely using psql
(it's what we're using to load terabytes of data via COPY; JDBC, DBI,
etc, have all been tried and pale in comparison..).

> The key thing there is to force a stream buffer flush (too lazy to look
> up exactly what zlib calls it, but they have the concept) exactly when
> we're about to do a flush to the socket.  That way we get cross-packet
> compression but don't have a problem with the compressor failing to send
> the last partial message when we need it to.

Yes, I'm familiar with it.

> (My suggestion of an expansible option is for future-proofing, not
> because I think we'd try to support more than one option today.)

Fair enough.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Really?  Given that libpq provides no useful support for doing anything
>> with COPY data, much less higher-level packages such as Perl DBI, I'd
>> venture that the real-world ratio is more like 90/10.  If not 99/1.

> Perhaps I'm taking a bit too narrow view of the world, but my thinking
> is OLTP won't want things compressed, as it increases latency of
> requests, while OLAP users are operating with enough data that they'll
> go through the effort to use COPY.

I should think the argument for or against wire-protocol compression
depends mainly on your network environment, not the nature of your
application.  Either bytes sent are more expensive than CPU cycles at
each end, or vice versa.  Latency could be a big deal if we weren't
going to force compressor flushes at synchronization boundaries, but
if we are, any added latency is a matter of a few cycles at most.
        regards, tom lane



Re: [PATCH] COPY .. COMPRESSED

From: Tom Lane
Stephen Frost <sfrost@snowman.net> writes:
> Perhaps I'm taking a bit too narrow view of the world, but my thinking
> is OLTP won't want things compressed, as it increases latency of
> requests, while OLAP users are operating with enough data that they'll
> go through the effort to use COPY.

Also, if there are so many people using COPY in their apps, why have we
never seen any submitted patches to extend libpq with functions to
construct/deconstruct COPY data?  Surely somebody would have said to
themselves that they shouldn't be responsible for knowing those escaping
rules.
        regards, tom lane
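
For reference, the raw interface that does exist today: libpq will move
COPY data for you (PQputCopyData and friends), but composing a
text-format row, including all the quoting and escaping, is left
entirely to the application.  A minimal sketch:

#include <string.h>
#include <libpq-fe.h>

/*
 * What "speaking COPY" through libpq involves: the transfer calls below
 * are real libpq API, but the row itself must already be hand-escaped
 * (tab separators, backslash escapes, \N for NULL).
 */
static int
copy_one_row(PGconn *conn)
{
    PGresult   *res = PQexec(conn, "COPY foo FROM STDIN");

    if (PQresultStatus(res) != PGRES_COPY_IN)
    {
        PQclear(res);
        return -1;
    }
    PQclear(res);

    const char *row = "42\tO'Malley\t\\N\n";    /* escaping is our problem */

    if (PQputCopyData(conn, row, strlen(row)) != 1 ||
        PQputCopyEnd(conn, NULL) != 1)
        return -1;

    res = PQgetResult(conn);            /* final status of the COPY */
    int         ok = (PQresultStatus(res) == PGRES_COMMAND_OK) ? 0 : -1;

    PQclear(res);
    return ok;
}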



Re: [PATCH] COPY .. COMPRESSED

From: Andrew Dunstan
On 01/15/2013 06:22 PM, Tom Lane wrote:
> Stephen Frost <sfrost@snowman.net> writes:
>> Perhaps I'm taking a bit too narrow view of the world, but my thinking
>> is OLTP won't want things compressed, as it increases latency of
>> requests, while OLAP users are operating with enough data that they'll
>> go through the effort to use COPY.
> Also, if there are so many people using COPY in their apps, why have we
> never seen any submitted patches to extend libpq with functions to
> construct/deconstruct COPY data?  Surely somebody would have said to
> themselves that they shouldn't be responsible for knowing those escaping
> rules.

There are perfectly good libraries in Perl and other languages for 
constructing/deconstructing CSV data. If we didn't have CSV 
import/export I suspect we would have heard lots more howls by now.

cheers

andrew



Re: [PATCH] COPY .. COMPRESSED

From: Claudio Freire
On Tue, Jan 15, 2013 at 7:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Compressing every small packet seems like it'd be overkill and might
>> surprise people by actually reducing performance in the case of lots of
>> small requests.
>
> Yeah, proper selection and integration of a compression method would be
> critical, which is one reason that I'm not suggesting a plugin for this.
> You couldn't expect any-random-compressor to work well.  I think zlib
> would be okay though when making use of its stream compression features.
> The key thing there is to force a stream buffer flush (too lazy to look
> up exactly what zlib calls it, but they have the concept) exactly when
> we're about to do a flush to the socket.  That way we get cross-packet
> compression but don't have a problem with the compressor failing to send
> the last partial message when we need it to.

Just a "stream flush bit" (or stream reset bit) on the packet header
would do. First packet on any stream would be marked, and that's it.
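
Sketched as a struct, that might look like the following; this is
entirely hypothetical, since the existing v3 protocol message carries
only a type byte and a length word:

#include <stdint.h>

/*
 * Hypothetical extension of a protocol message header; no such flags
 * field exists today, so enabling it would need the negotiation phase
 * mentioned earlier in the thread.
 */
typedef struct
{
    char        type;       /* existing: message type byte */
    uint32_t    len;        /* existing: message length (network order) */
    uint8_t     flags;      /* new: bit 0 = payload is compressed,
                             * bit 1 = the "stream flush/reset bit" above */
} CompressedMessageHeader;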



Re: [PATCH] COPY .. COMPRESSED

From: Peter Eisentraut
On 1/15/13 2:53 PM, Stephen Frost wrote:
> You're right; to clarify, for *file_fdw*, which is a backend-only
> operation, the popen patch is great (I thought I made that clear before).

I would think that if we get writable FDWs, you would want file_fdw to
go through zlib so that it can write directly to the file.
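
Going through zlib directly is easy on the backend side, since its
gzFile layer behaves like stdio.  A minimal sketch, with a hypothetical
helper name (not file_fdw code):

#include <zlib.h>

/*
 * Hypothetical helper: zlib's gzFile layer lets the backend write a
 * gzip'd file directly, with no external program involved.
 */
static int
write_gz(const char *path, const char *buf, unsigned len)
{
    gzFile      f = gzopen(path, "wb");

    if (f == NULL)
        return -1;
    if (gzwrite(f, buf, len) != (int) len)
    {
        gzclose(f);
        return -1;
    }
    return (gzclose(f) == Z_OK) ? 0 : -1;
}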




Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Peter Eisentraut (peter_e@gmx.net) wrote:
> On 1/15/13 2:53 PM, Stephen Frost wrote:
> > You're right; to clarify, for *file_fdw*, which is a backend-only
> > operation, the popen patch is great (I thought I made that clear before).
>
> I would think that if we get writable FDWs, you would want file_fdw to
> go through zlib so that it can write directly to the file.

With the popen patch, I expect it could be defined as '|gzip >
myfile.gz'..  I believe that patch did that.  It'd be ideal to add
support for that to file_fdw also, certainly.  That shouldn't be hard as
file_fdw is essentially a wrapper around backend COPY already and this
would just be adding a few additional options to pass through.
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Robert Haas
On Mon, Jan 14, 2013 at 11:28 AM, Stephen Frost <sfrost@snowman.net> wrote:
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Stephen Frost <sfrost@snowman.net> writes:
>> > I do like the idea of a generalized answer which just runs a
>> > user-provided command on the server but that's always going to require
>> > superuser privileges.
>>
>> The design that was being kicked around allowed pipes to be used on the
>> client side too, ie \copy foo to '| gzip ...'.  That form would not
>> require any special privileges, and might be thought preferable for
>> another reason too: it offloads the work from the server.
>
> It's a different use-case which, imv, is really already trivially
> covered:
>
> psql -c 'COPY foo TO STDOUT;' | gzip > myfile.gz
>
> While there is no option currently for having the server do the
> compression before sending the data over the wire.

+1.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] COPY .. COMPRESSED

From: Robert Haas
On Tue, Jan 15, 2013 at 3:37 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
> That points towards a fix that involves having a set of non-arbitrary commands
> that we allow plain users to use.
>
> Hmm.  There's an interesting thought...
>
> How about having a "pg_filters" table in pg_catalog which allows capturing
> labels and names of known-to-be-safe binary filters:
>
> insert into pg_filters (label, location)
> values
> ('zcat', '/usr/bin/zcat'),
> ('bzip2', '/usr/bin/bzip2'),
> ('bunzip2', '/usr/bin/bunzip2');
>
> And then having some capability to grant permissions to roles to use
> these filters.

I suspect that's going to be less efficient than using a compression
library that's linked into the backend, because you have to copy all
the data through the kernel to another process and back.  And it's
certainly a lot more complex.

If it greatly broadened the applicability of this feature I might
think it was worthwhile, but I can't see that it does.  I suspect that
supporting zlib, which we already linked against, would cater to
something well upwards of 90% of the use cases here.  Sure, there are
other things, but zlib is very widely used and bzip2 IME is far too
slow to be taken seriously for this kind of application.  The
additional space savings that you get for the additional CPU
investment is typically small, and if you really need it, having to
un-gzip and re-bzip2 on the client is always an option.  If you're
using bzip2 you obviously have CPU time to burn.

At any rate, I think it would be good to avoid letting our desire for
infinite flexibility get in the way of doing something useful.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] COPY .. COMPRESSED

From: Christopher Browne
On Wed, Jan 16, 2013 at 5:15 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 15, 2013 at 3:37 PM, Christopher Browne <cbbrowne@gmail.com> wrote:
>> That points towards a fix that involves having a set of non-arbitrary commands
>> that we allow plain users to use.
>>
>> Hmm.  There's an interesting thought...
>>
>> How about having a "pg_filters" table in pg_catalog which allows capturing
>> labels and names of known-to-be-safe binary filters:
>>
>> insert into pg_filters (label, location)
>> values
>> ('zcat', '/usr/bin/zcat'),
>> ('bzip2', '/usr/bin/bzip2'),
>> ('bunzip2', '/usr/bin/bunzip2');
>>
>> And then having some capability to grant permissions to roles to use
>> these filters.
>
> I suspect that's going to be less efficient than using a compression
> library that's linked into the backend, because you have to copy all
> the data through the kernel to another process and back.  And it's
> certainly a lot more complex.

More complex, certainly.

By spawning a separate process, we'd get the benefit of multicore CPUs,
so I'm not sure I agree that it's necessarily slower.

> At any rate, I think it would be good to avoid letting our desire for
> infinite flexibility get in the way of doing something useful.

Oh, agreed.  I was actively thinking of the cooler bits of this pointing more
towards 9.4 than 9.3.
-- 
When confronted by a difficult problem, solve it by reducing it to the
question, "How would the Lone Ranger handle this?"



Re: [PATCH] COPY .. COMPRESSED

From: Robert Haas
On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> I find the argument that this supports compression-over-the-wire to be
> quite weak, because COPY is only one form of bulk data transfer, and
> one that a lot of applications don't ever use.  If we think we need to
> support transmission compression for ourselves, it ought to be
> integrated at the wire protocol level, not in COPY.
>
> Just to not look like I'm rejecting stuff without proposing
> alternatives, here is an idea about a backwards-compatible design for
> doing that: we could add an option that can be set in the connection
> request packet.  Say, "transmission_compression = gzip".

But presumably this would transparently compress at one end and
decompress at the other end, which is again a somewhat different use
case.  To get compressed output on the client side, you have to
decompress and recompress.  Maybe that's OK, but it's not quite the
same thing.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: [PATCH] COPY .. COMPRESSED

From: Stephen Frost
* Bruce Momjian (bruce@momjian.us) wrote:
> On Wed, Jan 16, 2013 at 06:19:09PM -0500, Robert Haas wrote:
> > But presumably this would transparently compress at one end and
> > decompress at the other end, which is again a somewhat different use
> > case.  To get compressed output on the client side, you have to
> > decompress and recompress.  Maybe that's OK, but it's not quite the
> > same thing.
>
> Is there a TODO here?

For some reason, I thought there was a patch...
Thanks,
    Stephen

Re: [PATCH] COPY .. COMPRESSED

From: Bruce Momjian
On Wed, Jan 16, 2013 at 06:19:09PM -0500, Robert Haas wrote:
> On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> > I find the argument that this supports compression-over-the-wire to be
> > quite weak, because COPY is only one form of bulk data transfer, and
> > one that a lot of applications don't ever use.  If we think we need to
> > support transmission compression for ourselves, it ought to be
> > integrated at the wire protocol level, not in COPY.
> >
> > Just to not look like I'm rejecting stuff without proposing
> > alternatives, here is an idea about a backwards-compatible design for
> > doing that: we could add an option that can be set in the connection
> > request packet.  Say, "transmission_compression = gzip".
> 
> But presumably this would transparently compress at one end and
> decompress at the other end, which is again a somewhat different use
> case.  To get compressed output on the client side, you have to
> decompress and recompress.  Maybe that's OK, but it's not quite the
> same thing.

Is there a TODO here?

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +



Re: [PATCH] COPY .. COMPRESSED

From: Claudio Freire
On Wed, Jan 16, 2013 at 8:19 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Jan 15, 2013 at 4:50 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> I find the argument that this supports compression-over-the-wire to be
>> quite weak, because COPY is only one form of bulk data transfer, and
>> one that a lot of applications don't ever use.  If we think we need to
>> support transmission compression for ourselves, it ought to be
>> integrated at the wire protocol level, not in COPY.
>>
>> Just to not look like I'm rejecting stuff without proposing
>> alternatives, here is an idea about a backwards-compatible design for
>> doing that: we could add an option that can be set in the connection
>> request packet.  Say, "transmission_compression = gzip".
>
> But presumably this would transparently compress at one end and
> decompress at the other end, which is again a somewhat different use
> case.  To get compressed output on the client side, you have to
> decompress and recompress.  Maybe that's OK, but it's not quite the
> same thing.

Well, libpq could give some access to raw compressed streams, but,
really, even with double compression on the client, it solves the
bandwidth issue, not only for pg_dump, pg_restore, and copy, but also
for all other transfer-intensive applications. I do think it's the
best option.