Re: [HACKERS] Custom compression methods
From | Tomas Vondra |
---|---|
Subject | Re: [HACKERS] Custom compression methods |
Date | |
Msg-id | 29527031-c837-59e1-760d-677bb33d6b0f@2ndquadrant.com |
In reply to | [HACKERS] Custom compression methods (Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru>) |
Responses | Re: [HACKERS] Custom compression methods (Ildus Kurbangaliev <i.kurbangaliev@postgrespro.ru>); Re: [HACKERS] Custom compression methods (Robert Haas <robertmhaas@gmail.com>) |
List | pgsql-hackers |
Hi,

I see there's an ongoing discussion about the syntax and ALTER TABLE behavior when changing a compression method for a column. So the patch seems to be on the way to be ready in the January CF, I guess.

But let me play the devil's advocate for a while and question the usefulness of this approach to compression. Some of the questions were mentioned in the thread before, but I don't think they got the attention they deserve.

FWIW I don't know the answers, but I think it's important to ask them. Also, apologies if this post looks to be against the patch - that's part of the "devil's advocate" thing.

The main question I'm asking myself is what use cases the patch addresses, and whether there is a better way to do that. I see about three main use-cases:

1) Replacing the algorithm used to compress all varlena types (in a way that makes it transparent for the data type code).

2) Custom datatype-aware compression (e.g. the tsvector).

3) Custom datatype-aware compression with additional column-specific metadata (e.g. the jsonb with external dictionary).

Now, let's discuss those use cases one by one, and see if there are simpler (or better in some way) solutions ...

Replacing the algorithm used to compress all varlena values (in a way that makes it transparent for the data type code).
----------------------------------------------------------------------

While pglz served us well over time, it was repeatedly mentioned that in some cases it becomes the bottleneck. So supporting other state of the art compression algorithms seems like a good idea, and this patch is one way to do that.

But perhaps we should simply make it an initdb option (in which case the whole cluster would simply use e.g. lz4 instead of pglz)?

That seems like a much simpler approach - it would only require some ./configure options to add --with-lz4 (and other compression libraries), an initdb option to pick compression algorithm, and probably noting the choice in cluster controldata.
No dependencies tracking, no ALTER TABLE issues, etc.

Of course, it would not allow using different compression algorithms for different columns (although it might perhaps allow different compression level, to some extent).

Conclusion: If we want to offer a simple cluster-wide pglz alternative, perhaps this patch is not the right way to do that.

Custom datatype-aware compression (e.g. the tsvector)
----------------------------------------------------------------------

Exploiting knowledge of the internal data type structure is a promising way to improve compression ratio and/or performance.

The obvious question of course is why shouldn't this be done by the data type code directly, which would also allow additional benefits like operating directly on the compressed values.

Another thing is that if the datatype representation changes in some way, the compression method has to change too. So it's tightly coupled to the datatype anyway.

This does not really require any new infrastructure, all the pieces are already there.

In some cases that may not be quite possible - the datatype may not be flexible enough to support alternative (compressed) representation, e.g. because there are no bits available for "compressed" flag, etc.

Conclusion: IMHO if we want to exploit the knowledge of the data type internal structure, perhaps doing that in the datatype code directly would be a better choice.

Custom datatype-aware compression with additional column-specific metadata (e.g. the jsonb with external dictionary).
----------------------------------------------------------------------

Exploiting redundancy in multiple values in the same column (instead of compressing them independently) is another attractive way to help the compression. It is inherently datatype-aware, but currently can't be implemented directly in datatype code as there's no concept of column-specific storage (e.g. to store dictionary shared by all values in a particular column).
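To make the "dictionary shared by all values in a particular column" idea concrete, here is a minimal sketch in Python using zlib's preset-dictionary support. It is only an illustration of the general technique, not what the patch does and not anything PostgreSQL-specific: the sample rows, the `column_dict` contents, and the helper names `compress_value` / `decompress_value` are all made up for this example.

```python
import zlib

# Hypothetical column contents: small JSON documents that share the same keys.
rows = [
    b'{"customer_id": 1, "status": "active", "country": "CZ"}',
    b'{"customer_id": 2, "status": "inactive", "country": "RU"}',
    b'{"customer_id": 3, "status": "active", "country": "US"}',
]

# Hypothetical per-column dictionary: byte strings expected to recur in most
# values. In a real system this would have to live in column-specific storage.
column_dict = b', "status": "inactive", "country": "{"customer_id": '


def compress_value(value, zdict=None):
    """Compress one value, optionally priming zlib with a shared dictionary."""
    if zdict is None:
        comp = zlib.compressobj()
    else:
        comp = zlib.compressobj(zdict=zdict)
    return comp.compress(value) + comp.flush()


def decompress_value(data, zdict=None):
    """Decompress one value; the same dictionary must be supplied again."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(data) + decomp.flush()


# Each value compressed independently vs. with the shared column dictionary.
independent = [compress_value(r) for r in rows]
with_dict = [compress_value(r, column_dict) for r in rows]

print("independent:", sum(len(c) for c in independent), "bytes")
print("with dict:  ", sum(len(c) for c in with_dict), "bytes")
```

Note that a value compressed this way cannot be decompressed without the dictionary, which is why the dictionary has to be stored and tracked somewhere per column.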
I believe any patch addressing this use case would have to introduce such column-specific storage, and any solution doing that would probably need to introduce the same catalogs, etc.

The obvious disadvantage of course is that we need to decompress the varlena value before doing pretty much anything with it, because the datatype is not aware of the compression.

So I wonder if the patch should instead provide infrastructure for doing that in the datatype code directly.

The other question is if the patch should introduce some infrastructure for handling the column context (e.g. column dictionary). Right now, whoever implements the compression has to implement this bit too.

regards

--
Tomas Vondra http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services