//C- Copyright © 1999-2001 LizardTech, Inc. All Rights Reserved.
//C-
//C- This software (the "Original Code") is subject to, and may be
//C- distributed under, the GNU General Public License, Version 2.
//C- You may obtain a copy of the license from the Free Software
//C- Foundation at http://www.fsf.org .
//C-
//C- With respect to the Original Code, and subject to any third party
//C- intellectual property claims, LizardTech grants recipient a worldwide,
//C- royalty-free, non-exclusive license under patent claims infringed by
//C- making, using, or selling Original Code which are now or hereafter
//C- owned or controlled by LizardTech, but solely to the extent that any
//C- such patent is reasonably necessary to enable you to make, have made,
//C- practice, sell, or otherwise dispose of Original Code (or portions
//C- thereof) and not to any greater extent that may be necessary to utilize
//C- further modifications or combinations.
//C-
//C- The Original Code is provided "AS IS" WITHOUT WARRANTY OF ANY KIND,
//C- EITHER EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY
//C- OF NON-INFRINGEMENT, OR ANY IMPLIED WARRANTY OF MERCHANTABILITY OR
//C- FITNESS FOR A PARTICULAR PURPOSE.
//C-
#ifndef _BSBYTESTREAM_H
#define _BSBYTESTREAM_H
/** @name BSByteStream.h
Files #"BSByteStream.h"# and #"BSByteStream.cpp"# implement a very compact
general purpose compressor based on the Burrows-Wheeler transform. The
utility program \Ref{bzz} provides a front-end for this class. Although
this compression model is not currently used in DjVu files, it may be used
in the future for encoding textual data chunks.
{\bf Algorithms} --- The Burrows-Wheeler transform (also named Block-Sorting)
is performed using a combination of the Karp-Miller-Rosenberg and the
Bentley-Sedgewick algorithms. This is comparable to (Sadakane, DCC 98)
with a slightly more flexible ranking scheme. Symbols are then ordered
according to a running estimate of their occurrence frequencies, with
an adaptation rate that is selected on a per-block basis by trying
multiple settings and picking the one that yields the best compression.
The symbol ranks are then coded using a simple fixed tree and the
\Ref{ZPCodec} binary adaptive coder.
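The frequency-ordered ranking described above can be sketched as follows.
This is an illustrative reconstruction, not the shipped coder: the
#FreqRanker# name, the shift-based update increment, and the decay
threshold are all assumptions; the real implementation in
#"BSByteStream.cpp"# feeds the resulting ranks to the \Ref{ZPCodec} coder.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of rank coding driven by a running frequency estimate.
// Symbols are kept ordered by estimated occurrence frequency; emitting
// a symbol returns its current rank, rewards its estimate, and lets it
// bubble toward the front of the order.
struct FreqRanker {
  std::vector<unsigned char> sym;  // symbols, most frequent first
  std::vector<uint32_t> freq;      // running frequency estimates
  int shift;                       // adaptation rate (assumed encoding)

  explicit FreqRanker(int fshift) : sym(256), freq(256, 0), shift(fshift) {
    for (int i = 0; i < 256; ++i) sym[i] = (unsigned char)i;
  }

  // Return the rank of byte c before adaptation, then update estimates.
  int rank(unsigned char c) {
    int r = 0;
    while (sym[r] != c) ++r;
    int out = r;
    freq[r] += 1u << shift;                   // reward the emitted symbol
    while (r > 0 && freq[r] > freq[r - 1]) {  // bubble past lower estimates
      std::swap(sym[r], sym[r - 1]);
      std::swap(freq[r], freq[r - 1]);
      --r;
    }
    if (freq[r] > (1u << 24))                 // occasional decay keeps the
      for (uint32_t &f : freq) f >>= 1;       // estimates tracking local stats
    return out;
  }
};
```

A larger #fshift# makes the estimate adapt faster; trying several rates per
block and keeping the best one, as the encoder does, trades encoding time
for compression.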
{\bf File format} --- The compressed stream starts with a version byte
(currently 0xE1) followed by a sequence of compressed blocks. Each block
begins with a 2-bit adaptation rate selector and a 30-bit block size,
followed by the ZP-coded MTF ranks. The stream ends with a block having
size zero.
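The per-block header can be pictured as a flat 32-bit word as below. In
the actual stream these bits are arithmetic-coded with \Ref{ZPCodec}
rather than stored verbatim, so the plain packing, along with the
#BlockHeader# and #parse_header# names, is assumed purely for
illustration.

```cpp
#include <cstdint>

// Illustrative layout of the per-block header: the top 2 bits select
// one of four adaptation rates, the low 30 bits give the block size,
// and a size of zero terminates the stream.
struct BlockHeader {
  unsigned rate;   // 2-bit adaptation rate selector (0..3)
  uint32_t size;   // 30-bit block size in bytes; 0 = end of stream
};

BlockHeader parse_header(uint32_t word) {
  BlockHeader h;
  h.rate = (word >> 30) & 0x3u;
  h.size = word & 0x3FFFFFFFu;
  return h;
}

bool is_last_block(const BlockHeader &h) { return h.size == 0; }
```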
{\bf Performance} --- The basic algorithm is broadly similar to those
implemented in well known compressors like #bzip# or #bzip2#
(\URL{http://www.muraroa.demon.co.uk}). The adaptive binary coder,
however, introduces small differences. The adaptation noise may cost up
to 5\% in file size, but this penalty is usually offset by the benefits
of adaptation, notably when processing large and highly structured
files such as spreadsheets. Compression and decompression run at about
half the speed of #bzip2#, but the sorting algorithm is more robust.
Unlike #bzip2# (as of August 1998), this code can compress half a
megabyte of "abababab...." in bounded time.
Here are some comparative results (in bits per character) obtained on the
Canterbury Corpus (\URL{http://corpus.canterbury.ac.nz}) as of August
1998. The BSByteStream performance on the single spreadsheet file #Excl#
moves #bzz#'s weighted average ahead of much more sophisticated methods,
like Suzanne Bunton's #fsmxBest# system
\URL{http://corpus.canterbury.ac.nz/methodinfo/fsmx.html}. This result
will not last very long.
{\footnotesize
\begin{tabular}{lccccccccccccc}
& text & fax & Csrc & Excl & SPRC & tech
& poem & html & lisp & man & play & Weighted & Average \\
compress
& 3.27 & 0.97 & 3.56 & 2.41 & 4.21 & 3.06
& 3.38 & 3.68 & 3.90 & 4.43 & 3.51
& 2.55 & 3.31 \\
gzip -9
& 2.85 & 0.82 & 2.24 & 1.63 & 2.67 & 2.71
& 3.23 & 2.59 & 2.65 & 3.31 & 3.12
& 2.08 & 2.53 \\
bzip2 -9
& 2.27 & 0.78 & 2.18 & 1.01 & 2.70 & 2.02
& 2.42 & 2.48 & 2.79 & 3.33 & 2.53
& 1.54 & 2.23 \\
ppmd
& 2.31 & 0.99 & 2.11 & 1.08 & 2.68 & 2.19
& 2.48 & 2.38 & 2.43 & 3.00 & 2.53
& 1.65 & 2.20 \\
fsmx
& {\bf 2.10} & 0.79 & {\bf 1.89} & 1.48 & {\bf 2.52} & {\bf 1.84}
& {\bf 2.21} & {\bf 2.24} & {\bf 2.29} & {\bf 2.91} & {\bf 2.35}
& 1.63 & {\bf 2.06} \\
{\bf bzz (1998, params unknown)}
& 2.25 & 0.76 & 2.13 & 0.78 & 2.67 & 2.00
& 2.40 & 2.52 & 2.60 & 3.19 & 2.52
& {\bf 1.44} & 2.16 \\
{\bf bzz (current)}
& 2.24 & {\bf 0.75} & 2.13 & {\bf 0.78} & 2.66 & 1.99
& 2.38 & 2.52 & 2.60 & 3.17 & 2.51
& {\bf 1.44} & 2.16
\end{tabular}
}
Note that the DjVu people have several entries in this table. Program
#compress# was written some time ago by Joe Orost
(\URL{http://www.research.att.com/info/orost}). The #ppmc# method (a
precursor of #ppmd#) was created by Paul Howard
(\URL{http://www.research.att.com/info/pgh}). The #bzz# program is
right before your eyes.
@author
L\'eon Bottou <leonb@research.att.com> -- Initial implementation\\
Andrei Erofeev <eaf@research.att.com> -- Improved Block Sorting algorithm.
@memo
Simple Burrows-Wheeler general purpose compressor.
@version
#$Id: BSByteStream.h,v 1.2 2001-01-04 22:04:53 bcr Exp $# */
//@{
#include "GException.h"
#include "ByteStream.h"
#include "ZPCodec.h"
/** Performs bzz compression/decompression.
Class #BSByteStream# defines a \Ref{ByteStream} which transparently
performs the BZZ compression/decompression. The constructor of class
\Ref{BSByteStream} takes another \Ref{ByteStream} as argument. Any data
written to the BSByteStream is compressed and written to this second
ByteStream. Any data read from the BSByteStream is internally generated by
decompressing data read from the second ByteStream.
Program \Ref{bzz} demonstrates how to use this class. All the hard work
is achieved by a simple ByteStream to ByteStream copy, as shown below.
\begin{verbatim}
StdioByteStream in(infile,"rb");
StdioByteStream out(outfile,"wb");
if (encoding) {
    BSByteStream bsb(out, blocksize);
    bsb.copy(in);
} else {
    BSByteStream bsb(in);
    out.copy(bsb);
}
\end{verbatim}
Due to the block oriented nature of the Burrows-Wheeler transform, there
is a very significant latency between the data input and the data output.
You can use function #flush# to force data output at the expense of
compression efficiency. Destroying the BSByteStream performs an implicit
#flush#.
*/
class BSByteStream : public ByteStream
{
public:
/** Constructs a BSByteStream.
Argument #blocksize# determines whether the BSByteStream will be used
for compressing or decompressing data.
\begin{description}
\item[Decompression]
Setting #blocksize# to zero initializes the decompressor. Chunks of
data will be read from ByteStream #bs# and decompressed into an internal
buffer. Function #read# can be used to access the decompressed data.
\item[Compression]
Setting #blocksize# to a positive number between 100 and 1024
initializes the compressor. Data written to the BSByteStream will be
accumulated into an internal buffer. The buffered data will be
compressed and written to ByteStream #bs# whenever the buffer size
reaches the maximum value specified by argument #blocksize# (in
kilobytes). Using a larger block size usually increases the compression
ratio at the expense of computation time. However, there is no need to
specify a block size larger than the total number of bytes to compress.
Setting #blocksize# to #1024# is a good starting point.
\end{description} */
BSByteStream(ByteStream &bs, int blocksize=0);
// ByteStream Interface
~BSByteStream();
size_t read(void *buffer, size_t size);
size_t write(const void *buffer, size_t size);
long tell();
void flush();
private:
// Data
int encoding;
long offset;
int bptr;
unsigned int blocksize;
unsigned char *data;
int size;
int eof;
ByteStream *bs;
// Coder
ZPCodec *zp;
BitContext ctx[300];
private:
// Cancel C++ default stuff
BSByteStream(const BSByteStream &);
BSByteStream & operator=(const BSByteStream &);
// Helpers
unsigned int encode(void);
unsigned int decode(void);
// MTF encode/decode with fshift parameter
void encode_block(ZPCodec &zp, BitContext *ctx, int fshift, int markerpos);
void decode_block(int fshift);
};
//@}
#endif