Re: lztext and compression ratios...

From: Tom Lane
Subject: Re: lztext and compression ratios...
Msg-id: 8993.962947840@sss.pgh.pa.us
In reply to: Re: [SQL] Re: [GENERAL] lztext and compression ratios... (JanWieck@t-online.de (Jan Wieck))
List: pgsql-hackers
JanWieck@t-online.de (Jan Wieck) writes:
> Tom Lane wrote:
>> After a quick look at the code, I don't think there is anything
>> problematic about the data representation or the decompression
>> algorithm.  The compression algorithm is another story, and it's
>> not real well commented :-(.  The important issues are how you
>> search for matches in the past text and how you decide which match
>> is the best one to use.  Please update the code comments to describe
>> that, and I'll take another look.

>     Done. You'll find a new section in the top comments.

I think we are probably OK.  Anyone who wants to check the issue might
want to start with http://www.faqs.org/faqs/compression-faq/,
particularly part 1 item 8.  The two patents I'm most concerned about
are Waterworth's (4701745) and Stac's (5016009) which both cover basic
variants of LZ77 (for full text of these or any other US patent consult
http://patent.womplex.ibm.com/).  But it looks to me like this
particular variant can be argued not to fall under either patent.
For example, Waterworth's patent specifies using 3-byte hash values
while you use 4-byte, and stupid as that is it's sufficient to get
you out from under.  (A more technically valid reason is that he
doesn't use hash collision chains.)  The Stac patent also specifies
a data structure considerably different from your hash chains.
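
To make the distinction concrete, here is a minimal sketch of a
four-byte hash whose buckets head collision chains.  Everything in it
(sizes, names, the hash mixing) is invented for illustration; it is
not the actual pg_lzcompress code:

    /* hash4_chain.c -- illustrative only, NOT the pg_lzcompress code */
    #include <stdio.h>
    #include <string.h>

    #define HASH_SIZE 8192              /* made-up bucket count       */
    #define HASH_MASK (HASH_SIZE - 1)
    #define NO_POS    (-1)

    static int head[HASH_SIZE];         /* bucket -> newest position  */
    static int chain[4096];             /* position -> older position */

    /* Key on FOUR input bytes; Waterworth's claims specify a
     * three-byte hash value. */
    static unsigned hash4(const unsigned char *p)
    {
        return (p[0] ^ (p[1] << 4) ^ (p[2] << 8) ^ (p[3] << 12))
               & HASH_MASK;
    }

    /* Push a position onto its bucket's chain, so a later lookup can
     * revisit EVERY prior window with the same hash -- the patented
     * scheme keeps one entry per bucket and checks it only once. */
    static void remember(const unsigned char *buf, int pos)
    {
        unsigned h = hash4(buf + pos);

        chain[pos] = head[h];
        head[h] = pos;
    }

    int main(void)
    {
        const unsigned char *s = (const unsigned char *) "abcabcabcabc";
        int n = (int) strlen((const char *) s);

        memset(head, -1, sizeof head);  /* all buckets start empty */

        /* Index every window except the last, which we look up. */
        for (int i = 0; i + 4 < n; i++)
            remember(s, i);

        /* Walk the chain of candidate matches for the final window. */
        for (int p = head[hash4(s + n - 4)]; p != NO_POS; p = chain[p])
            printf("candidate match at offset %d\n", p);
        return 0;
    }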

To give you an idea of what's involved here, I attach an old message
from Jean-loup Gailly, author of gzip and zlib, explaining why he
believes gzip doesn't fall foul of these two patents.  Not all of
his reasons apply to your method, but I think enough do.  (BTW,
Gailly's a good guy --- I worked with him on the PNG project.
I think I'll write to him and see if he's got time to review this
code for patent problems.)

>     While writing it I noticed that the algorithm is really
>     expensive for big items. The history lookup table allocated
>     is 8 times (on 32 bit architectures) the size of the input.
>     So if you want to have 1MB compressed, it'll allocate 8MB for
>     the history. It hit me when I was hunting a bug in the
>     toaster earlier today. Doing an update to a toasted item of
>     5MB, resulting in a new value of 10MB, the backend blew up to
>     290MB of virtual memory - oh boy. I definitely need to make
>     that smarter.

Yes, you need some smarter method for choosing the size of the
hash table... a braindead but possibly sufficient answer is to
have some hard limit on the size... or you could just use a
hard-wired constant size to begin with.  I think it's usually
considered wise to use a prime for the table size and reduce
the raw input bits modulo the prime.
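
A minimal sketch of that approach (the cap and the sizing heuristic
are invented for illustration, not taken from the lztext code):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define MAX_TABLE_SIZE 65536        /* hard cap -- made-up value */

    /* Smallest prime >= n, by trial division; fine for one-time setup. */
    static size_t next_prime(size_t n)
    {
        for (;; n++)
        {
            bool is_prime = (n >= 2);

            for (size_t d = 2; d * d <= n; d++)
                if (n % d == 0) { is_prime = false; break; }
            if (is_prime)
                return n;
        }
    }

    /* Scale the table to the input but never past the cap, and make
     * the size prime so reducing raw bits modulo it spreads entries. */
    static size_t choose_table_size(size_t input_len)
    {
        size_t want = input_len / 8 + 16;   /* heuristic, made up */

        if (want > MAX_TABLE_SIZE)
            want = MAX_TABLE_SIZE;
        return next_prime(want);
    }

    int main(void)
    {
        size_t sz = choose_table_size(1024 * 1024);     /* 1MB input */
        size_t raw_bits = 0xdeadbeef;   /* stand-in for hashed bits  */

        /* Reduce the raw input bits modulo the prime table size. */
        printf("table size %zu, slot %zu\n", sz, raw_bits % sz);
        return 0;
    }
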
        regards, tom lane


Article: 14604 of gnu.misc.discuss
Path: chorus!octave.chorus.fr!jloup
From: jloup@chorus.fr (Jean-loup Gailly)
Newsgroups: gnu.misc.discuss
Subject: Re: LPF, NY Times, GNU and compression algorithms
Message-ID: <7989@chorus.chorus.fr>
Date: 10 Mar 94 10:10:03 GMT
References: <RXN106.94Mar6103732@wilbur.cac.psu.edu> <HSU.94Mar7204324@laphroaig.cs.hut.fi>
<MARC.94Mar8090631@marc.watson.ibm.com>
Sender: news@chorus.chorus.fr
Distribution: gnu
Lines: 201

Marc Auslander <marc@watson.ibm.com> writes:

>   Two Stac patents in the case are 4701745 and 5016009.
>
>   From 4701745:
>
>   the data processing means including circuit means operable to check
>   whether a sequence of successive bytes to be processed is identical with
>   a sequence of bytes already processed, and including a hash generating
>   means responsive to the application of a predetermined number of bytes ...
>
>   From 5016009
>
>   in order to perform the hashing function, a data compression system
>   includes certain hash data structures including a history array
>   pointer ...
>
>   also:
>
>   ... is found ..., encoding said matching string ... a variable length
>   indicator ..., said predetermined strategy ensuring that a matching
>   string of two characters of said input data stream is compressed to
>   less than said two characters of said input data stream.
>
>   So - on the face of it, why isn't gzip covered by these patents?

Let's take each patent in turn. A clarification first: the Stac patent
4,701,745 was not invented by Stac. It was initially owned by Ferranti
(UK) and only recently bought by Stac. This was a very profitable
acquisition (assuming it cost Stac far less than the $120 million they
won by using this patent against Microsoft).

(a) 4,701,745

This algorithm is now known as LZRW1, because Ross Williams reinvented
it independently later and posted it on comp.compression on April 22,
1991. Exactly the same algorithm has also been patented by Gibson and
Graybill (5,049,881). The patent office failed to recognize that
it was the same algorithm.

This algorithm uses LZ77 with hashing but no collision chains and
outputs unmatched bytes directly without further compression.  gzip
uses collision chains of arbitrary length, and uses Huffman encoding
on the unmatched bytes:

- Claim 1 of the patent is restricted to (emphasis added by me):

    output means operable to APPLY to a transfer medium each byte of
    data not forming part of such an identical sequence; and

    ENCODING means responsive to the identification of such a sequence
    to APPLY to the transfer medium an identification signal which
    identifies both the location in the input store of the previous
    occurrence of the sequence of bytes and the number of bytes
    contained in the sequence.

  The claim thus makes a clear distinction between "encoding" and
  "applying to the transfer medium". A system which compresses the
  unmatched bytes does not infringe this patent.

- The description of the patent and claim 2 make clear that the check
  for identity of the sequences of bytes is to be made only once (no
  hash collision chains). Gzip performs an arbitrary number of such
  checks. The "means" enumerated in claim 1 specify what the hash
  table consists of, and this does not include any means for storing
  hash collision chains.
 

- Claim 2 also requires that *all* bytes participating in the hash
  function should be compared:

    A system as claimed in claim 1 in which the circuit means also
    includes check means operable to check for identity between EACH
    of the said predetermined number of bytes in sequence and EACH of
    a similar sequence of bytes contained in the input store at a
    location defined by a pointer read out from the temporary store at
    said address

  [in plain English, this is the check for hash collision]

    and to check whether identity exists between succeeding bytes in
    each sequence of bytes, and a byte counter operable to count the
    number of identical bytes in each sequence.

  [this is the determination of the match length]

  Gzip never checks for equality of the third byte used in the hash
  function. The hash function is such that on a hash hit with equality
  of the first two bytes, the third byte necessarily matches (the
  first sketch after this list shows why).
 

- In addition, gzip uses a "lazy" evaluation of string matches. Even
  when a match is found, gzip may encode (with Huffman coding) a
  single unmatched byte. This is done when gzip determines that it is
  more beneficial to parse the input string differently because a
  longer match follows (the second sketch after this list illustrates
  the idea). In the Waterworth patent, a string match is always
  encoded as a (length, pointer) pair.
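
As a concrete check of the third-byte claim, here is the shape of
gzip's rolling hash with what I believe are its default constants
(HASH_BITS 15, a shift of 5 per byte); the brute-force verification
loop is mine:

    #include <assert.h>
    #include <stdio.h>

    #define HASH_BITS 15                /* gzip's default, I believe */
    #define HASH_MASK ((1 << HASH_BITS) - 1)
    #define H_SHIFT   5                 /* (HASH_BITS+MIN_MATCH-1)/3 */

    /* gzip-style rolling hash over three bytes a, b, c. */
    static unsigned hash3(unsigned a, unsigned b, unsigned c)
    {
        unsigned h = 0;

        h = ((h << H_SHIFT) ^ a) & HASH_MASK;
        h = ((h << H_SHIFT) ^ b) & HASH_MASK;
        h = ((h << H_SHIFT) ^ c) & HASH_MASK;
        return h;                   /* = (a<<10 ^ b<<5 ^ c) & mask */
    }

    int main(void)
    {
        /* All 8 bits of the third byte survive in the hash, so with
         * the first two bytes fixed, different third bytes can never
         * collide: equal hash + equal first two bytes forces an equal
         * third byte, and gzip need not compare it. */
        for (unsigned c1 = 0; c1 < 256; c1++)
            for (unsigned c2 = c1 + 1; c2 < 256; c2++)
                assert(hash3('a', 'b', c1) != hash3('a', 'b', c2));
        puts("third byte is determined by hash and first two bytes");
        return 0;
    }

And here is the lazy-evaluation idea reduced to a toy: defer a found
match by one byte whenever the position one byte later matches longer.
The brute-force matcher and the sample string are mine; gzip's real
loop in deflate.c is far more elaborate:

    #include <stdio.h>
    #include <string.h>

    #define MIN_MATCH 3

    static const char *buf;
    static int len;

    /* Longest match for the string at 'pos' against any earlier,
     * non-overlapping position (brute force; gzip walks hash chains). */
    static int longest_match(int pos)
    {
        int best = 0;

        for (int start = 0; start < pos; start++)
        {
            int l = 0;

            while (pos + l < len && start + l < pos
                   && buf[start + l] == buf[pos + l])
                l++;
            if (l > best)
                best = l;
        }
        return best;
    }

    int main(void)
    {
        buf = "abc bcdef abcdef";
        len = (int) strlen(buf);

        for (int pos = 0; pos < len; )
        {
            int here = longest_match(pos);
            int next = (pos + 1 < len) ? longest_match(pos + 1) : 0;

            if (here >= MIN_MATCH && here >= next)
            {
                printf("match  len=%d \"%.*s\"\n", here, here, buf + pos);
                pos += here;
            }
            else
            {
                /* Defer: emitting one literal now may buy the longer
                 * match that starts at the next byte. */
                printf("literal '%c'\n", buf[pos]);
                pos++;
            }
        }
        return 0;
    }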
 

All other claims of the patent are dependent on claim 1
("a system as claimed in claim 1 in which ..."). Since gzip
does not infringe claim 1 it does not infringe the other claims.
In particular, claim 6 explicitly states that unmatched strings
are not compressed:
    A system as claimed in claim 5 in which the data receiving means
    includes decoder means operable to separate UNCOMPRESSED bytes of
    data from identification signals.
 

The gzip decoder never receives uncompressed bytes since all input is
compressed with Huffman coding [both literals and (length, offset) pairs].
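
To make that concrete: in the deflate format gzip emits, a single
Huffman-coded alphabet covers literals (0-255), end-of-block (256),
and match lengths (257-285), so the decoder's dispatch is just a
comparison on the decoded symbol. A toy sketch, with a canned symbol
stream standing in for the output of the real Huffman decoder:

    #include <stdio.h>

    #define END_OF_BLOCK 256

    int main(void)
    {
        /* Pretend the Huffman decoder produced these symbols. */
        int symbols[] = { 'h', 'i', 257 /* a length code */,
                          END_OF_BLOCK };

        for (int i = 0; ; i++)
        {
            int sym = symbols[i];

            if (sym < 256)                  /* 0..255: literal    */
                printf("literal byte 0x%02x\n", sym);
            else if (sym == END_OF_BLOCK)   /* 256: end of block  */
                break;
            else                            /* 257..285: length   */
                printf("match length code %d (distance follows)\n",
                       sym);
        }
        return 0;
    }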


The only "invention" in the Waterworth patent is the absence of hash
collision chains. The original description of the LZ77 algorithm
required searching for the true longest match, not just checking the
length of one particular match.  Using hashing for string searching
was very well known at the time of the patent application (March 86).
The patent description specifies explicitly that "Hash techniques are
well known and many different forms of hash will be suitable".

The --fast option of gzip was deliberately made slower than it could
be, precisely because of the existence of the Waterworth patent.
See in particular the following part of the gzip TODO file:
    Add a super-fast compression method, suitable for implementing
    file systems with transparent compression. One problem is that the
    best candidate (lzrw1) is patented twice (Waterworth 4,701,745 and
    Gibson & Graybill 5,049,881).
 


(b) 5,016,009

This is standard LZ77 with hashing, where collisions are resolved
using linked lists. There are several important restrictions which let gzip
escape from the patent:

- the linked lists are implemented only with offsets. The end of a
  chain is detected by adding together all offsets, until the sum
  becomes greater than the size of the history buffer. gzip uses
  direct indices, and so detects the end of the chains differently
  (a sketch after this list shows the direct-index scheme). The exact
  wording of claim 1 of the patent is:

    ... said data compression system comprising ... an offset array
    means ... said method comprising the steps of ...
    calculating a difference between said history array pointer
    and said pointer obtained from said hash table means,
    storing said difference into said offset array means entry
    pointed to by said history array pointer, ...

  gzip never calculates such a difference and does not have any
  offset array.

- unmatched strings are emitted as literal bytes without any
  compression. gzip uses Huffman encoding on the unmatched strings.
  This is the same argument as for the Waterworth patent.
 

- unmatched strings are preceded by

    ... a "raw" data tag indicating that no matching data string was
    found

  gzip does not use such a tag because it uses a single Huffman table
  for both string literals and match lengths. It is only the prefix
  property of Huffman codes which allows the decoder to distinguish
  the two cases (as sketched above). So there is not a unique "raw"
  tag preceding all literals. This is not a minor point. It is one of
  the reasons giving gzip superior compression to that obtained with
  the Stac algorithm.
 

- a string match is always encoded as a (length, pointer) pair. Gzip
  uses a "lazy" evaluation of string matches, as described above for
  the Waterworth patent.
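
A sketch of the direct-index bookkeeping, in the shape of deflate.c's
head[]/prev[] arrays (the sizes and demo values here are illustrative,
and this simplifies the real code):

    #include <stdio.h>

    #define WSIZE 32768u                /* gzip's sliding window size */
    #define WMASK (WSIZE - 1)
    #define HSIZE (1u << 15)
    #define NIL   0                     /* "no position" sentinel     */

    static unsigned head[HSIZE];        /* hash -> most recent position */
    static unsigned prev[WSIZE];        /* position -> previous position
                                         * with the same hash: a direct
                                         * index, not a stored difference */

    int main(void)
    {
        unsigned h = 1234;              /* pretend these hashed alike */
        unsigned positions[] = { 100, 4000, 9000 };

        for (int i = 0; i < 3; i++)
        {
            prev[positions[i] & WMASK] = head[h];   /* link to older */
            head[h] = positions[i];                 /* newest at head */
        }

        /* End of chain is an index outside the window (or NIL), found
         * by comparison -- not by summing stored differences. */
        unsigned strstart = 9050;
        unsigned limit = strstart > WSIZE ? strstart - WSIZE : NIL;

        for (unsigned cur = head[h]; cur > limit; cur = prev[cur & WMASK])
            printf("candidate at position %u\n", cur);
        return 0;
    }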
 

All other claims of the patent are dependent on claim 1 ("the method
of claim 1 wherein ..."). Since gzip does not infringe claim 1 it does
not infringe the other claims.  In any case, I have studied in detail
all the 77 claims to make sure that gzip does not infringe.

Unrelated note: this Stac patent is the only one where I found an
original and non-obvious idea. The hash table is refreshed using an
incremental mechanism, so that the refresh overhead is distributed
among all input bytes. This allows the real-time response necessary in
disk compressors such as Stacker (the Stac product). gzip does not use
this idea, and refreshes the hash table in a straightforward manner
every 32K bytes.
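
That straightforward refresh looks roughly like this: when the window
has slid by 32K, sweep both arrays once, subtracting the window size
and dropping positions that fall out. This is a simplified rendering
of what fill_window() does in deflate.c, with made-up demo values:

    #include <stdio.h>

    #define WSIZE 32768u                /* gzip's window size */
    #define HSIZE (1u << 15)
    #define NIL   0

    static unsigned head[HSIZE];
    static unsigned prev[WSIZE];

    /* One wholesale sweep: every stored position drops by WSIZE, and
     * anything now outside the window becomes NIL.  The cost is paid
     * all at once; the Stac scheme spreads it across the input. */
    static void slide_hash(void)
    {
        for (unsigned i = 0; i < HSIZE; i++)
            head[i] = head[i] >= WSIZE ? head[i] - WSIZE : NIL;
        for (unsigned i = 0; i < WSIZE; i++)
            prev[i] = prev[i] >= WSIZE ? prev[i] - WSIZE : NIL;
    }

    int main(void)
    {
        head[42] = 40000;               /* survives: 40000 - 32768 */
        head[43] = 1000;                /* too old: dropped to NIL */
        slide_hash();
        printf("head[42]=%u head[43]=%u\n", head[42], head[43]);
        return 0;
    }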


One final comment: I estimate that I have now spent more time studying
data compression patents than actually implementing data compression
algorithms.  I have a partial list of 318 data compression patents,
even though for the moment I restrict myself mostly to lossless
algorithms (ignoring most image compression patents).  Richard
Stallman has been *extremely* careful before accepting gzip as the GNU
compressor.  I continue to study new patents regularly. I would of
course very much prefer spending what's left of my spare time
improving the gzip compression algorithm instead of studying patents.
Some improvements that I would have liked to put in gzip have not been
incorporated because of patents.

In short, every possible precaution has been taken to make sure that
gzip isn't covered by patents.

Jean-loup Gailly, author of gzip.
jloup@chorus.fr

