//C- Copyright © 1999-2001 LizardTech, Inc. All Rights Reserved.
//C-
//C- This software (the "Original Code") is subject to, and may be
//C- distributed under, the GNU General Public License, Version 2.
//C- You may obtain a copy of the license from the Free Software
//C- Foundation at http://www.fsf.org .
//C-
//C- With respect to the Original Code, and subject to any third party
//C- intellectual property claims, LizardTech grants recipient a worldwide,
//C- royalty-free, non-exclusive license under patent claims infringed by
//C- making, using, or selling Original Code which are now or hereafter
//C- owned or controlled by LizardTech, but solely to the extent that any
//C- such patent is reasonably necessary to enable you to make, have made,
//C- practice, sell, or otherwise dispose of Original Code (or portions
//C- thereof) and not to any greater extent that may be necessary to utilize
//C- further modifications or combinations.
//C-
//C- The Original Code is provided "AS IS" WITHOUT WARRANTY OF ANY KIND,
//C- EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY
//C- OF NON-INFRINGEMENT, OR ANY IMPLIED WARRANTY OF MERCHANTABILITY OR
//C- FITNESS FOR A PARTICULAR PURPOSE.
//C-
#ifndef _BSBYTESTREAM_H
#define _BSBYTESTREAM_H
/** @name BSByteStream.h
Files #"BSByteStream.h"# and #"BSByteStream.cpp"# implement a very compact
general purpose compressor based on the Burrows-Wheeler transform. The
utility program \Ref{bzz} provides a front-end for this class. Although
this compression model is not currently used in DjVu files, it may be used
in the future for encoding textual data chunks.
{\bf Algorithms} --- The Burrows-Wheeler transform (also named Block-Sorting)
is performed using a combination of the Karp-Miller-Rosenberg and the
Bentley-Sedgewick algorithms. This is comparable to (Sadakane, DCC 98)
with a slightly more flexible ranking scheme. Symbols are then ordered
according to a running estimate of their occurrence frequencies, with
an adaptation rate that is selected on a per-block basis by trying
multiple settings and picking the one that yields the best compression.
The symbol ranks are then coded using a simple fixed tree and the
\Ref{ZPCodec} binary adaptive coder.
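The frequency-ordered ranking described above can be sketched as follows.
This is an illustrative reconstruction, not the shipped coder: the
#FreqRanker# name, the shift-based update increment, and the decay
threshold are all assumptions; the real implementation in
#"BSByteStream.cpp"# feeds the resulting ranks to the \Ref{ZPCodec} coder.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of rank coding driven by a running frequency estimate.
// Symbols are kept ordered by estimated occurrence frequency; emitting
// a symbol returns its current rank, rewards its estimate, and lets it
// bubble toward the front of the order.
struct FreqRanker {
  std::vector<unsigned char> sym;  // symbols, most frequent first
  std::vector<uint32_t> freq;      // running frequency estimates
  int shift;                       // adaptation rate (assumed encoding)

  explicit FreqRanker(int fshift) : sym(256), freq(256, 0), shift(fshift) {
    for (int i = 0; i < 256; ++i) sym[i] = (unsigned char)i;
  }

  // Return the rank of byte c before adaptation, then update estimates.
  int rank(unsigned char c) {
    int r = 0;
    while (sym[r] != c) ++r;
    int out = r;
    freq[r] += 1u << shift;                   // reward the emitted symbol
    while (r > 0 && freq[r] > freq[r - 1]) {  // bubble past lower estimates
      std::swap(sym[r], sym[r - 1]);
      std::swap(freq[r], freq[r - 1]);
      --r;
    }
    if (freq[r] > (1u << 24))                 // occasional decay keeps the
      for (uint32_t &f : freq) f >>= 1;       // estimates tracking local stats
    return out;
  }
};
```

A larger #fshift# makes the estimate adapt faster; trying several rates per
block and keeping the best one, as the encoder does, trades encoding time
for compression.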
{\bf File format} --- The compressed stream starts with a version byte
(currently 0xE1) followed by a sequence of compressed blocks. Each block
begins with a 2-bit adaptation rate selector and a 30-bit block size,
followed by the ZP-coded MTF ranks. The stream ends with a block having
size zero.
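The per-block header can be pictured as a flat 32-bit word as below. In
the actual stream these bits are arithmetic-coded with \Ref{ZPCodec}
rather than stored verbatim, so the plain packing, along with the
#BlockHeader# and #parse_header# names, is assumed purely for
illustration.

```cpp
#include <cstdint>

// Illustrative layout of the per-block header: the top 2 bits select
// one of four adaptation rates, the low 30 bits give the block size,
// and a size of zero terminates the stream.
struct BlockHeader {
  unsigned rate;   // 2-bit adaptation rate selector (0..3)
  uint32_t size;   // 30-bit block size in bytes; 0 = end of stream
};

BlockHeader parse_header(uint32_t word) {
  BlockHeader h;
  h.rate = (word >> 30) & 0x3u;
  h.size = word & 0x3FFFFFFFu;
  return h;
}

bool is_last_block(const BlockHeader &h) { return h.size == 0; }
```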
{\bf Performance} --- The basic algorithm is broadly similar to those
implemented in well known compressors like #bzip# or #bzip2#
(\URL{http://www.muraroa.demon.co.uk}). The adaptive binary coder,
however, introduces small differences. The adaptation noise may cost up
to 5\% in file size, but this penalty is usually offset by the benefits
of adaptation, notably when processing large and highly structured
files such as spreadsheets. Compression and decompression run at about
half the speed of #bzip2#, but the sorting algorithm is more robust.
Unlike #bzip2# (as of August 1998), this code can compress half a
megabyte of "abababab...." in bounded time.
Here are some comparative results (in bits per character) obtained on the
Canterbury Corpus (\URL{http://corpus.canterbury.ac.nz}) as of August
1998. The BSByteStream performance on the single spreadsheet file #Excl#
moves #bzz#'s weighted average ahead of much more sophisticated methods,
like Suzanne Bunton's #fsmxBest# system
\URL{http://corpus.canterbury.ac.nz/methodinfo/fsmx.html}. This result
will not last very long.
{\footnotesize
\begin{tabular}{lccccccccccccc}
& text & fax & Csrc & Excl & SPRC & tech
& poem & html & lisp & man & play & Weighted & Average \\
compress
& 3.27 & 0.97 & 3.56 & 2.41 & 4.21 & 3.06
& 3.38 & 3.68 & 3.90 & 4.43 & 3.51
& 2.55 & 3.31 \\
gzip -9
& 2.85 & 0.82 & 2.24 & 1.63 & 2.67 & 2.71
& 3.23 & 2.59 & 2.65 & 3.31 & 3.12
& 2.08 & 2.53 \\
bzip2 -9
& 2.27 & 0.78 & 2.18 & 1.01 & 2.70 & 2.02
& 2.42 & 2.48 & 2.79 & 3.33 & 2.53
& 1.54 & 2.23 \\
ppmd
& 2.31 & 0.99 & 2.11 & 1.08 & 2.68 & 2.19
& 2.48 & 2.38 & 2.43 & 3.00 & 2.53
& 1.65 & 2.20 \\
fsmx
& {\bf 2.10} & 0.79 & {\bf 1.89} & 1.48 & {\bf 2.52} & {\bf 1.84}
& {\bf 2.21} & {\bf 2.24} & {\bf 2.29} & {\bf 2.91} & {\bf 2.35}
& 1.63 & {\bf 2.06} \\
{\bf bzz (1998, params unknown)}
& 2.25 & 0.76 & 2.13 & 0.78 & 2.67 & 2.00
& 2.40 & 2.52 & 2.60 & 3.19 & 2.52
& {\bf 1.44} & 2.16 \\
{\bf bzz (current)}
& 2.24 & {\bf 0.75} & 2.13 & {\bf 0.78} & 2.66 & 1.99
& 2.38 & 2.52 & 2.60 & 3.17 & 2.51
& {\bf 1.44} & 2.16
\end{tabular}
}
Note that the DjVu people have several entries in this table. Program
#compress# was written some time ago by Joe Orost
(\URL{http://www.research.att.com/info/orost}). The #ppmc# method (a
precursor of #ppmd#) was created by Paul Howard
(\URL{http://www.research.att.com/info/pgh}). The #bzz# program is
right before your eyes.
@author
L\'eon Bottou <leonb@research.att.com> -- Initial implementation\\
Andrei Erofeev <eaf@research.att.com> -- Improved Block Sorting algorithm.
@memo
Simple Burrows-Wheeler general purpose compressor.
@version
#$Id: BSByteStream.h,v 1.2 2001-01-04 22:04:53 bcr Exp $# */
//@{
#include "GException.h"
#include "ByteStream.h"
#include "ZPCodec.h"
/** Performs bzz compression/decompression.
Class #BSByteStream# defines a \Ref{ByteStream} which transparently
performs the BZZ compression/decompression. The constructor of class
\Ref{BSByteStream} takes another \Ref{ByteStream} as argument. Any data
written to the BSByteStream is compressed and written to this second
ByteStream. Any data read from the BSByteStream is internally generated by
decompressing data read from the second ByteStream.
Program \Ref{bzz} demonstrates how to use this class. All the hard work
is achieved by a simple ByteStream to ByteStream copy, as shown below.
\begin{verbatim}
StdioByteStream in(infile,"rb");
StdioByteStream out(outfile,"wb");
if (encoding) {
    BSByteStream bsb(out, blocksize);
    bsb.copy(in);
} else {
    BSByteStream bsb(in);
    out.copy(bsb);
}
\end{verbatim}
Due to the block oriented nature of the Burrows-Wheeler transform, there
is a very significant latency between the data input and the data output.
You can use function #flush# to force data output at the expense of
compression efficiency. Destroying the BSByteStream performs an implicit
#flush#.
*/
class BSByteStream : public ByteStream
{
public:
/** Constructs a BSByteStream.
Argument #blocksize# determines whether the BSByteStream will be used
for compressing or decompressing data.
\begin{description}
\item[Decompression]
Setting #blocksize# to zero initializes the decompressor. Chunks of
data will be read from ByteStream #bs# and decompressed into an internal
buffer. Function #read# can be used to access the decompressed data.
\item[Compression]
Setting #blocksize# to a positive number between 100 and 1024
initializes the compressor. Data written to the BSByteStream will be
accumulated into an internal buffer. The buffered data will be
compressed and written to ByteStream #bs# whenever the buffer size
reaches the maximum value specified by argument #blocksize# (in
kilobytes). Using a larger block size usually increases the compression
ratio at the expense of computation time. However, there is no need to
specify a block size larger than the total number of bytes to compress.
Setting #blocksize# to #1024# is a good starting point.
\end{description} */
BSByteStream(ByteStream &bs, int blocksize=0);
// ByteStream Interface
~BSByteStream();
size_t read(void *buffer, size_t size);
size_t write(const void *buffer, size_t size);
long tell();
void flush();
private:
// Data
int encoding;
long offset;
int bptr;
unsigned int blocksize;
unsigned char *data;
int size;
int eof;
ByteStream *bs;
// Coder
ZPCodec *zp;
BitContext ctx[300];
private:
// Cancel C++ default stuff
BSByteStream(const BSByteStream &);
BSByteStream & operator=(const BSByteStream &);
// Helpers
unsigned int encode(void);
unsigned int decode(void);
// MTF encode/decode with fshift parameter
void encode_block(ZPCodec &zp, BitContext *ctx, int fshift, int markerpos);
void decode_block(int fshift);
};
//@}
#endif