Обсуждение: lztext and compression ratios...
I have been looking at using the lztext type and I have some
questions/observations.   Most of my experience comes from attempting to
compress text records in a different database (CTREE), but I think the
experience is transferable.
My typical table consists of variable length text records.  The average
length record is around 1K bytes.  I would like to compress my records
to save space and improve I/O performance (smaller records means more
records fit into the file system cache which means less I/O - or so the
theory goes).  I am not too concerned about CPU as we are using a 4-way
Sun Enterprise class server.  So compress seems like a good idea to me.
My experience with attempting to compress such a relatively small
(around 1K) text string is that the compression ration is not very
good.  This is because the string is not long enough for the LZ
compression algorithm to establish really good compression patterns and
the fact that the de-compression table has to be built into each
record.  What I have done in the past to get around these problems is
that I have "taught" the compression algorithm the patterns ahead of
time and stored the de-compression patterns in an external table.  Using
this technique, I have achieved *much* better compression ratios.
So my questions/comments are:
    - What are the typical compression rations on relatively small (i.e.
around 1K) strings seen with lztext?
    - Does anyone see a need/use for a generalized string compression
type that can be "trained" external to the individual records?
    - Am I crazy in even attempting to compress strings of this relative
size?  My largest table correct contains about 2 million entries of
roughly 1k size strings or about 2Gig of data.  If I could compress this
to about 33% of it's original size (not unreasonable with a trained LZ
compression), I would save a lot of disk space (not really important)
and a lot of file system cache space (very important) and be able to fit
the entire table into memory (very, very important).
Thank you,
Jeff
			
		Jeffery Collins <collins@onyx-technologies.com> writes:
> My experience with attempting to compress such a relatively small
> (around 1K) text string is that the compression ration is not very
> good.  This is because the string is not long enough for the LZ
> compression algorithm to establish really good compression patterns and
> the fact that the de-compression table has to be built into each
> record.  What I have done in the past to get around these problems is
> that I have "taught" the compression algorithm the patterns ahead of
> time and stored the de-compression patterns in an external table.  Using
> this technique, I have achieved *much* better compression ratios.
(Puts on compression-guru hat...)
There is much in what you say.  Perhaps we should consider keeping the
lztext type around (currently it's slated for doom in 7.1, since the
TOAST feature will make plain text do everything lztext does and more)
and having it be different from text in that a training sample is
supplied when the column is defined.  Not quite sure how that should
look or where to store the sample, but it could be a big win for tables
having a large number of moderate-sized text entries.
            regards, tom lane
			
		I was sitting here reading about TOAST and 7.1 and another question came into my mind.. I looked on the web page for some information about features to be added in 7.1 and an approximate 7.1 release date.. I found information on the TOAST feature as well as Referential Integrity but nothing else.. If I missed it please just point me in the right direction, if not I'd like to as about the approximate time of 7.1 arriving in a stable form and if full text indexing is going to be a part of 7.1.. I ask because if 7.1 is going to include full text indexing natively and is going to arrive pretty soon, I might not continue on this project as I have been (the new full text index trigger).. Thanks!!! -Mitch
figure sometime in October before v7.1 is released ... i believe the last
talk talked about a beta starting sometime in September, but nothing is
ever "set in stone" until we actually do it :)
On Wed, 5 Jul 2000, Mitch Vincent wrote:
> I was sitting here reading about TOAST and 7.1 and another question came
> into my mind.. I looked on the web page for some information about features
> to be added in 7.1 and an approximate 7.1 release date.. I found information
> on the TOAST feature as well as Referential Integrity  but nothing else.. If
> I missed it please just point me in the right direction, if not I'd like to
> as about the approximate time of 7.1 arriving in a stable form and if full
> text indexing is going to be a part of 7.1..
>
> I ask because if 7.1 is going to include full text indexing natively and is
> going to arrive pretty soon, I might not continue on this project as I have
> been (the new full text index trigger)..
>
> Thanks!!!
>
> -Mitch
>
Marc G. Fournier                   ICQ#7615664               IRC Nick: Scrappy
Systems Administrator @ hub.org
primary: scrappy@hub.org           secondary: scrappy@{freebsd|postgresql}.org
			
		"Mitch Vincent" <mitch@venux.net> writes:
> I was sitting here reading about TOAST and 7.1 and another question came
> into my mind.. I looked on the web page for some information about features
> to be added in 7.1 and an approximate 7.1 release date.. I found information
> on the TOAST feature as well as Referential Integrity  but nothing else.. If
> I missed it please just point me in the right direction, if not I'd like to
> as about the approximate time of 7.1 arriving in a stable form and if full
> text indexing is going to be a part of 7.1..
AFAIK there are no plans on the table to do more with full text
indexing; at least none of the core developers are working on it.
(If someone else is and I've forgotten, my apologies.)
Currently the plans for 7.1 look like this:
    * WAL (assuming Vadim gets it done soon)
    * TOAST (pretty far along already)
    * new fmgr (fmgr itself done, still need to turn crank on
        converting built-in functions)
    * better memory management, no more intraquery memory leaks
      (about half done)
    * some other stuff, but I think those are all the "big features"
      that anyone has committed to get done
    * whatever else gets done meanwhile
We were shooting for going beta around Aug 1 with release around Sep 1,
but don't hold us to that ;-).
Notably missing from this list is OUTER JOIN and other things that
depend on querytree redesign (such as fixing all the restrictions on
views).  We've agreed to put that off till 7.2 in hopes of keeping the
7.1 development cycle reasonably short.  Not sure what else will be
in 7.2, but the querytree work will be.
> I ask because if 7.1 is going to include full text indexing natively and is
> going to arrive pretty soon, I might not continue on this project as I have
> been (the new full text index trigger)..
Seems like you should keep at it.
            regards, tom lane
			
		Mitch Vincent wrote:
> I was sitting here reading about TOAST and 7.1 and another question came
> into my mind.. I looked on the web page for some information about features
> to be added in 7.1 and an approximate 7.1 release date.. I found information
> on the TOAST feature as well as Referential Integrity  but nothing else.. If
> I missed it please just point me in the right direction, if not I'd like to
> as about the approximate time of 7.1 arriving in a stable form and if full
> text indexing is going to be a part of 7.1..
    The  project page you've seen was an idea I had several month
    ago.  I had a big success with it since I  got  co-developers
    for  FOREIGN  KEY.   It  was the first time I something in an
    open source project with related developers, and  the  result
    was stunning - even for me.
    Unfortunately,  none  of  the other "inner-circle" developers
    catched up that idea. Maybe all they're doing is not  of  the
    nature    that    they   would   gain   anything   from   co-
    developers/volunteers.
    Not an answer to your real question, more like a kick in  the
    A.. of my colleagues.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
			
		Jeffery Collins wrote:
> I have been looking at using the lztext type and I have some
> questions/observations.   Most of my experience comes from attempting to
> compress text records in a different database (CTREE), but I think the
> experience is transferable.
    First   of   all  I  welcome  any  suggestions/input  to  the
    compression code I've implemented. Seems  you're  experienced
    in  this  area so keep going and hammer down any of my points
    below.
    You should know that the "lztext" type  will  disappear  soon
    and  be  replaced  with  the  general  "all varlena types are
    potentially compressable" approach of TOAST.
> My typical table consists of variable length text records.  The average
> length record is around 1K bytes.  I would like to compress my records
> to save space and improve I/O performance (smaller records means more
> records fit into the file system cache which means less I/O - or so the
> theory goes).  I am not too concerned about CPU as we are using a 4-way
> Sun Enterprise class server.  So compress seems like a good idea to me.
>
> My experience with attempting to compress such a relatively small
> (around 1K) text string is that the compression ration is not very
> good.  This is because the string is not long enough for the LZ
> compression algorithm to establish really good compression patterns and
> the fact that the de-compression table has to be built into each
> record.  What I have done in the past to get around these problems is
> that I have "taught" the compression algorithm the patterns ahead of
> time and stored the de-compression patterns in an external table.  Using
> this technique, I have achieved *much* better compression ratios.
    The compression algorithm used in "lztext" (and so  in  TOAST
    in  the  future)  doesn't have a de-compression table at all.
    It's  based  on  Adisak  Pochanayon's  SLZ  algorithm,  using
    <literal_char>  or  <token>.  A <token> just tells how far to
    go back in the OUTPUT-buffer and how many bytes to copy  from
    OUTPUT to OUTPUT. Look at the code for details.
    My  design  rules for the compression inside of Postgres have
    been
        -   beeing fast on decompression
        -   beeing good for relatively small values
        -   beeing fast on compression
    The first rule is met by  the  implementation  itself.  Don't
    underestimate  this  design rule! Usually you don't update as
    often as you query. And the implementation of TOAST  requires
    a super fast decompression.
    The   second   rule   is  met  by  not  needing  any  initial
    decompression table inside of the stored value.
    The third rule is controlled by the default strategy  of  the
    algorithm,        (unfortunately)        hardcoded       into
    utils/adt/pg_lzcompress.c.  It'll never try to compress items
    smaller  than 256 bytes. It'll fallback to plain storage (for
    speed advantage while decompressing a value) if less than 20%
    of  compression  is  gained.  It'll  stop  match loookup if a
    backward match of 128 or more bytes is found.
> So my questions/comments are:
>
>     - What are the typical compression rations on relatively small (i.e.
> around 1K) strings seen with lztext?
    Don't have that small items handy. But a table  containing  a
    file path and it's content. All files where HTML files.
        From - To      | count(*) | avg(length) | avg(octet_length)
        ---------------+----------+-------------+------------------
        1024 - 2047    |       14 |        1905 |              1470
        2048 - 4095    |       67 |        3059 |              1412
        4096 - 8191    |       45 |        5384 |              2412
        8192 -         |       25 |       17200 |              6323
        ---------------+----------+-------------+------------------
        all            |      151 |        5986 |              2529
>     - Does anyone see a need/use for a generalized string compression
> type that can be "trained" external to the individual records?
    Yes, of course. Maybe "lztext" can be a framework for you and
    we just tell the toaster "never apply your lousy  compression
    on that" (it's prepared for).
>     - Am I crazy in even attempting to compress strings of this relative
> size?  My largest table correct contains about 2 million entries of
> roughly 1k size strings or about 2Gig of data.  If I could compress this
> to about 33% of it's original size (not unreasonable with a trained LZ
> compression), I would save a lot of disk space (not really important)
> and a lot of file system cache space (very important) and be able to fit
> the entire table into memory (very, very important).
    Noone is crazy attempting to improve something. It might turn
    out not to work well, or beeing brain damaged from the start.
    But someone who never tries will miss all his chances.
    Final note:
    One  thing  to  keep  in mind is that the LZ algorithm you're
    thinking of must be distributable under the terms of the  BSD
    license.  If it's copyrighted or patented by any third party,
    not agreeing to these terms, it's out of discussion and  will
    never  appear in the Postgres source tree.  Especially the LZ
    algorithm used in GIF is one of these show-stoppers.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
			
		JanWieck@t-online.de (Jan Wieck) writes:
>     One  thing  to  keep  in mind is that the LZ algorithm you're
>     thinking of must be distributable under the terms of the  BSD
>     license.  If it's copyrighted or patented by any third party,
>     not agreeing to these terms, it's out of discussion and  will
>     never  appear in the Postgres source tree.  Especially the LZ
>     algorithm used in GIF is one of these show-stoppers.
As long as you brought it up: how sure are you that the method you've
used is not subject to any patents?  The mere fact that you learned
it from someone who didn't patent it does not guarantee anything ---
someone else could have invented it independently and filed for a
patent.
If you can show that this method uses no ideas not found in zlib,
then I'll feel reassured --- a good deal of legal research went into
zlib to make sure it didn't fall foul of any patents, and zlib has
now been around long enough that it'd be tough for anyone to get a
new patent on one of its techniques.  But if SLZ has any additional
ideas in it, then we could be in trouble.  There are an awful lot of
compression patents.
            regards, tom lane
			
		Jan,
Thank you for your comments on lztext.  They were very informative.  I hope
you (and the rest of the list) don't mind my ramblings on compression, but I
think there is some performance/space advantage if compression can integrated
into a table.  I admit to not being a compression expert.  All that I know
has been learn from reading some zlib source.
The compression that I have been working with (and am more familiar with) is
Huffman compression.  You are probably familiar with it and may have even
evaluated it when you were looking at compression techniques.  In my limited
understanding of LZ compression, there are some advantages and disadvantages
to Huffman compression.
The disadvantages seem to be:
    - In normal Huffman compression, the string to be compressed is examined
twice.  Once to build the translation table and once to do the translation.
    - The translation table must be written with the string.  This makes the
result larger and limits it's effectiveness with small strings.
    - I don't know for sure, but I wouldn't be surprised if uncompression is
slower with Huffman than LZ compression.
To get around these disadvantages, I modified my Huffman support in the
following way:
    - I trained the translation table and stored the table separately.  I was
able to do this because I knew before hand the type of text that would be in
my columns.
    - I "improved" the compression by giving the translation table the
ability to look for well known substrings in addition to byte values.  For
example the string "THE" in an English text might have it's own Huffman
translation value instead of relying on the individual T, H and E
translations.  Again I was able to do this because I knew the type of text I
would be storing in the column.
Because the translation table is external to the string, it is no longer
included in the string.  This improves the compression ratio and helps the
uncompression as the table does not need to be read and interpreted.  This
approach also allows for and takes advantage of cross-row commonalities.
After doing all of this, I was able to get the average compression ratio to
about 3.5 (i.e. orig size / new size) on text columns of 1K or less.
Of course the disadvantage to this is that if the dataset changes overtime,
the compression translations to not change with it and the compression ratio
will get worse.  In the very worst case, where the dataset totally changes
(say the language changed from English to Russian) the "compressed" string
could actually get larger than the original string because everything the
compression algorithm thought it knew was totally wrong.
As I said, I did this with a different database (CTREE).  My company is now
converting from CTREE to postgresql and I would like to find some fairly
elegant way to include this type of compression support.  My CTREE
implementation was rather a hack, but now that we are moving to a better
database, I would like to have a better solution for compression.
Originally I latched onto the idea of using lztext or something like lztext,
but I agree it is better if the column type is something standard and the
compression is hidden by the backend.  I guess my vision would be to be able
to define either a column type or a column attribute that would allow me to
indicate the type of compression to have and possibly the compression
algorithm and translation set to use.  Actually the real vision is to have it
all hidden and have the database learn about the column as rows are added and
modify the compression transalation to adapt to the changing column data.
Jeff
			
		Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> >     One  thing  to  keep  in mind is that the LZ algorithm you're
> >     thinking of must be distributable under the terms of the  BSD
> >     license.  If it's copyrighted or patented by any third party,
> >     not agreeing to these terms, it's out of discussion and  will
> >     never  appear in the Postgres source tree.  Especially the LZ
> >     algorithm used in GIF is one of these show-stoppers.
>
> As long as you brought it up: how sure are you that the method you've
> used is not subject to any patents?  The mere fact that you learned
> it from someone who didn't patent it does not guarantee anything ---
> someone else could have invented it independently and filed for a
> patent.
    Now that you ask for it: I'm not sure. Could be.
> If you can show that this method uses no ideas not found in zlib,
> then I'll feel reassured --- a good deal of legal research went into
> zlib to make sure it didn't fall foul of any patents, and zlib has
> now been around long enough that it'd be tough for anyone to get a
> new patent on one of its techniques.  But if SLZ has any additional
> ideas in it, then we could be in trouble.  There are an awful lot of
> compression patents.
    To  do  so  I  don't know enough about the algorithms used in
    zlib.  Is there someone out here who could verify that  if  I
    detailed enough describe what our compression code does?
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
			
		JanWieck@t-online.de (Jan Wieck) writes:
>> As long as you brought it up: how sure are you that the method you've
>> used is not subject to any patents?
>     Now that you ask for it: I'm not sure. Could be.
>> If you can show that this method uses no ideas not found in zlib,
>> then I'll feel reassured
>     To  do  so  I  don't know enough about the algorithms used in
>     zlib.  Is there someone out here who could verify that  if  I
>     detailed enough describe what our compression code does?
After a quick look at the code, I don't think there is anything
problematic about the data representation or the decompression
algorithm.  The compression algorithm is another story, and it's
not real well commented :-(.  The important issues are how you
search for matches in the past text and how you decide which match
is the best one to use.  Please update the code comments to describe
that, and I'll take another look.
            regards, tom lane
			
		Tom Lane wrote:
> JanWieck@t-online.de (Jan Wieck) writes:
> >> As long as you brought it up: how sure are you that the method you've
> >> used is not subject to any patents?
>
> >     Now that you ask for it: I'm not sure. Could be.
>
> >> If you can show that this method uses no ideas not found in zlib,
> >> then I'll feel reassured
>
> >     To  do  so  I  don't know enough about the algorithms used in
> >     zlib.  Is there someone out here who could verify that  if  I
> >     detailed enough describe what our compression code does?
>
> After a quick look at the code, I don't think there is anything
> problematic about the data representation or the decompression
> algorithm.  The compression algorithm is another story, and it's
> not real well commented :-(.  The important issues are how you
> search for matches in the past text and how you decide which match
> is the best one to use.  Please update the code comments to describe
> that, and I'll take another look.
    Done. You'll find a new section in the top comments.
    While  writing  it  I  noticed  that  the algorithm is really
    expensive for big items. The history lookup  table  allocated
    is  8  times (on 32 bit architectures) the size of the input.
    So if you want to have 1MB compressed, it'll allocate 8MB for
    the  history.  It  hit  me  when  I  was hunting a bug in the
    toaster earlier today. Doing an update to a toasted  item  of
    5MB, resulting in a new value of 10MB, the backend blew up to
    290MB of virtual memory - oh boy. I definitely need  to  make
    that smarter.
    When  I wrote it I never thought about items that big. It was
    before we had the idea of TOAST.
    This all might  open  another  discussion  I'll  start  in  a
    separate thread.
Jan
--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
			
		> > figure sometime in October before v7.1 is released ... i believe the last > talk talked about a beta starting sometime in September, but nothing is > ever "set in stone" until we actually do it :) I agree. August is a very useful month because a lot of people are slow at work during that month, and that gives them time to work on PostgreSQL. -- Bruce Momjian | http://candle.pha.pa.us pgman@candle.pha.pa.us | (610) 853-3000 + If your life is a hard drive, | 830 Blythe Avenue + Christ can be your backup. | Drexel Hill, Pennsylvania 19026
Hi! How can I count the number of rows affected by UPDATE or DELETE? Przem
> How can I count the number of rows affected by UPDATE or DELETE?
This depends on the interface you are using.
* The psql SQL monitor always reports the number of affected rows
  (it's the 2nd number)
* In DBI "do" returns the number of affected rows.
* In JDBC "executeUpdate" does the same. (By the way, is there a way
  to get the number of rows from a executeQuery without counting them?).
Regards,
Mit freundlichem Gruß,
    Holger Klawitter
--
Holger Klawitter                                    +49 (0)251 484 0637
holger@klawitter.de                            http://www.klawitter.de/