Re: Commitfest 2021-11 Patch Triage - Part 2

From: Stephen Frost
Subject: Re: Commitfest 2021-11 Patch Triage - Part 2
Msg-id: 20211115205857.GG26257@tamriel.snowman.net
In response to: Re: Commitfest 2021-11 Patch Triage - Part 2  (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
Greetings,

* Robert Haas (robertmhaas@gmail.com) wrote:
> On Mon, Nov 15, 2021 at 2:51 PM Stephen Frost <sfrost@snowman.net> wrote:
> > I get that just compressing the entire stream is simpler and easier and
> > such, but it's surely cheaper and more efficient to not decompress and
> > then recompress data that's already compressed.  Finding a way to pass
> > through data that's already compressed when stored as-is while also
> > supporting compression of everything else (in a sensible way- wouldn't
> > make sense to just compress each attribute independently since a 4 byte
> > integer isn't going to get smaller with compression) definitely
> > complicates the overall idea but perhaps would be possible to do.
>
> To me, this feels like an attempt to move the goalposts far enough to
> kill the project. Sure, in a perfect world, that would be nice. But,
> we don't do it anywhere else. If you try to store a JPEG into a bytea
> column, we'll try to compress it just like we would any other data,
> and it may not work out. If you then take a pg_basebackup of the
> database using -Z, there's no attempt made to avoid the CPU
> overhead of compressing those TOAST table pages that contain
> already-compressed data and not the others. And it's easy to
> understand why that's the case: when you insert data into the
> database, there's no way for the database to magically know whether
> that data has been previously compressed by some means, and if so, how
> effectively. And when you back up a database, the backup doesn't know
> which relfilenodes contain TOAST tables or which pages of those
> relfilenodes contain data that is already pre-compressed. In both cases,
> your options are either (1) shut off compression yourself or (2) hope
> that the compressor doesn't waste too much effort on it.

I'll grant that perhaps it's just a different project.

While, sure, we will try to compress things we don't understand by
default, the admin does have ways to tell us to not do that and so it
isn't always just guesswork.  As for other parts of the system not
being smarter: they should be.  I'd argue that up until quite recently
it didn't make sense to teach something like pg_basebackup about TOAST
tables because our compression was rather lacking and it was actually
useful to compress TOAST tables (perhaps still is with the compression
methods we have now, but we'll hopefully add more).  We should probably
look at trying to provide a way for pg_basebackup and other backup tools
to identify TOAST tables to avoid trying to compress them as it's just
wasted effort (though in that case, it's at least less wasted than
here, since the TOAST data isn't decompressed and then re-compressed).
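To make the "admin does have ways" point concrete, here's what that
looks like today (table and column names are just illustrative):

```sql
-- Store the column out of line but uncompressed, so the server doesn't
-- spend cycles trying to compress data that's already compressed:
ALTER TABLE images ALTER COLUMN jpeg_data SET STORAGE EXTERNAL;

-- And since v14 the TOAST compression method is also per-column:
ALTER TABLE docs ALTER COLUMN body SET COMPRESSION lz4;
```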

Using bytea for everything-and-the-kitchen-sink also just generally
seems like it's throwing away more information than we really should (if
we had a jpeg data type, for example, we could just have that default to
not being compressed and not spend time trying to compress jpegs).

I also just generally disagree that an inefficiency in one part of the
system justifies having it somewhere else.

> I think the same approach ought to be completely acceptable here. I
> don't even really understand how we could do anything else. printtup()
> just gets datums, and it has no idea whether or how they are toasted.
> It calls the type output functions which don't know that data is being
> prepared for transmission to the client as opposed to some other
> hypothetical way you could call that function, nor do they know what
> compression method the client wants. It does not seem at all
> straightforward to teach them that ... and even if they did, what
> then? It's not like every column value is sent as a separate packet;
> the whole row is a single protocol message, and some columns may be
> compressed and others uncompressed. Trying to guess what to do about
> that seems to boil down to a sheer guess. Unless you try to compress
> that mixture of compressed and uncompressed values - and it's
> moderately uncommon for every column of a table to even be
> toastable - you aren't going to know how well it will compress. You
> could easily waste more CPU cycles trying to guess than you would have
> spent just doing what the user asked for.

I agree that we wouldn't want to be doing this all based on guesswork
somehow.  Having a way to identify what's compressed and what isn't
would be needed, as well as what method was used to compress.  It's not
like we don't know that at the source; we'd just need a way to pass it
back through, probably more-or-less exactly as it's stored today.
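Just to sketch what "passing that back through" might look like on the
wire: a hypothetical per-datum framing (emphatically not the actual
protocol, and with zlib standing in for pglz/lz4) where each column
value carries a method tag, so already-compressed datums can be sent
verbatim while the client still knows how to decode them.

```python
import struct
import zlib

# Hypothetical framing (NOT the real PostgreSQL wire protocol): each
# column value is prefixed with a one-byte compression-method tag and a
# 4-byte length, so pre-compressed data passes through untouched.
METHOD_NONE = 0   # raw bytes, no compression
METHOD_ZLIB = 1   # stand-in for pglz/lz4 in this sketch

def encode_row(columns):
    """columns: list of (method, payload) pairs; payload is already in
    its on-the-wire form (compressed or not)."""
    out = bytearray()
    out += struct.pack("!H", len(columns))          # column count
    for method, payload in columns:
        out += struct.pack("!BI", method, len(payload))
        out += payload
    return bytes(out)

def decode_row(buf):
    (ncols,) = struct.unpack_from("!H", buf, 0)
    offset, values = 2, []
    for _ in range(ncols):
        method, length = struct.unpack_from("!BI", buf, offset)
        offset += 5
        payload = buf[offset:offset + length]
        offset += length
        values.append(zlib.decompress(payload)
                      if method == METHOD_ZLIB else payload)
    return values

# A row mixing a small integer (left alone, as compressing it would be
# pointless) and a large, already-compressed text datum:
row = encode_row([
    (METHOD_NONE, b"\x00\x00\x00\x2a"),
    (METHOD_ZLIB, zlib.compress(b"x" * 1000)),
])
```

The point of the sketch is only that the metadata we already keep in
the varlena header would have to survive, in some form, into the
protocol message.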

Tackling the actual protocol side of it seems like it'd be the more
sensible place to start working through a design and I don't think
that'd be trivial to do, so probably not something to try and hash out
in this particular thread.  I'm happy to have at least voiced that
thought.

Thanks,

Stephen
