Fourier analysis is a foundational mathematical technique widely employed across signal processing to decompose complex signals from the time/spatial domain into simpler sinusoidal components in the frequency domain. This transformation facilitates signal analysis, filtering, and compression, as it often reveals intrinsic signal characteristics that are otherwise hidden in their original representation.
Real-world signals, such as audio, images, and video, are often band-limited, with less energy concentrated in higher frequencies. By exploiting this phenomenon, Fourier-based compression methods, most notably the Discrete Cosine Transform (DCT), allow efficient representation of signals. For example, the widely used JPEG image compression standard relies on DCT to retain perceptually important low-frequency content while selectively discarding higher-frequency details that have minimal visual impact [1]. This results in substantial data reduction with minimal perceived quality loss.
This project presents a thoroughly verified hardware implementation of the Fast Integer DCT, grounded in a rigorous mathematical derivation.
For a continuous-time signal
provided
In many applications, signals are represented with samples at discrete intervals. Sampling a continuous-time signal
The corresponding Discrete-Time Fourier Transform (DTFT) is then a
with normalized angular frequency
As, in practice, signals are of finite length, consider
Splitting the DTFT sum into groups of length
Using periodicity
The second summation represents a Fourier series expansion of a Dirac comb (periodic impulse function):
Thus, the DTFT becomes:
This naturally motivates defining the
Hence, observe that the non-zero components of the DTFT of a
For real-word real-valued signals (
Taking the DFT of
Simplify the second summation by substituting
Simplifying the exponential:
Furthermore, factoring out the common exponent:
results in the DFT of the symmetrically extended sequence:
Discarding the phasor, and introducing a normalization factor
where
ensures orthogonality of the DCT basis. Additionally, scaling by
Within the lexicon of this report, DCT refers to DCT-II—the premier trigonometric transform in signal representation and data compression due to its strong energy compaction properties [3].
The computation of the DCT can be optimized by identifying and reusing common intermediate values. By factorizing the DCT into stages, as demonstrated by the Arai-Agui-Nakajima (AAN) algorithm [4], intermediate results can be shared across calculations, eliminating redundant operations. This optimization yields a computation complexity to
![]() |
|---|
| Image 1. The generalized signal flow graph for the DCT-II computation for N = 2, 4 and 8 based on the AAN algorithm [6]. |
Figure 1 visualizes the butterfly structure of the AAN algorithm for the 8-point DCT. Each node represents a weighted summation of its two preceding nodes (forming the butterfly structure), with weights (scaling factors) annotated on connecting edges. Dashed edges indicate that the corresponding input is negated. When two incoming edges share identical scaling factors, these factors are absorbed into the node itself.
The diagram illustrates the recursive radix-2 nature of the even branch, wherein it is successively decomposed into a 4-point DCT, which is further split into a 2-point DCT.
A manual computation of the DCT as defined by the signal flow graph, confirms its equivalence to the aforementioned mathematical expression.
In practical applications, an integer-based approximation of the DCT, called the Integer Discrete Cosine Transform (IntDCT), is commonly employed [7]. The IntDCT uses efficient fixed-point integer arithmetic applied to integer-valued input sequences
with scaling factor
Note that the division by
For a reference implementation, the computation is iteratively performed with floating-point arithmetic, with quantization applied at the final stage:
Note that both quantization approaches are asymptotically equivalent for sufficiently large scaling factors
This hardware core implements a Fast IntDCT algorithm with parameters
The transform is encapsulated within a SystemVerilog module (rtl/int_dct8.sv), implementing a pipelined butterfly structure as specified by the AAN algorithm (see Figure 1). This structure comprises a 4-stage pipeline, wherein each stage performs a single addition and two multiplication operations using integer arithmetic to ensure timing closure and consistent propagation delays. Additionally, a validation handshake signal is propagated through each stage to maintain pipeline synchronization and correctness of outputs.
Given 8-bit integer inputs and 16-bit Q15 scaling factors, intermediate computational stages maintain the Q15 fixed-point format with the minimum number of bits that avoid overflow. The the 9 least significant fractional bits are discarded, due to their inconsistency with the reference model, yielding a 20-bit Q6 fixed-point output.
The Q15 quantized representation of scaling factor constants is computed using the script aux/gen_q15_trig.py.
The verification environment (tb/) employs a modular and reuseable custom UVM framework implemented with Verilator, leveraging modern C++20 practices for clarity, maintainability, and performance.
agent: provides an interface to the Design Under Test (DUT) by driving and montoring the input/output signals, including the clock.reference: implements a naive, iterative ground-truth calculation of the DCT (see IntDCT) using floating-point arithmetic and quantization for verification purposes.test: encapsulates the testbench logic, by executing both custom test cases –including stress and edge conditions– and randomized test cases, while tracking and reporting discrepancies between the DUT and reference model via a scoreboard.
Using this environment, the hardware core has been verified to be accurate within a
Integration of the 8-point DCT core into larger systems is straightforward. For typical video-processing applications involving two-dimensional DCTs (2D DCT), such as JPEG compression, the core can serve as a fundamental building block. By partitioning data into 8×8 blocks and sequentially applying the DCT first on rows and then columns–using either two separate cores or reusing a single core with a transpose memory in between—this hardware facilitates efficient computation of generic-sized 2D DCTs.
By employing the Fast IntDCT hardware architecture described above, we anticipate excellent performance suitable for real-time data processing, even with a single DCT core. The throughput is particularly notable: a fully pipelined 8-point DCT module produces one complete 8-point result per clock cycle. At a moderate operating frequency of 100 MHz, this yields to 100 million DCT operations per second, sufficient for processing over 48 full Ultra High Definition (4K) video frames per second—each 4K frame consisting of 2,073,600 one-dimensional 8-point DCT computations. With higher clock frequencies on advanced FPGA platforms; post-synthesis analysis of a similar pipelined DCT architecture suggests operation frequencies approaching 2 GHz, yielding approximately 2 billion 8-point DCT computations per second, verifying our expected performance [8].
Moreover, the latency for a single 8-point DCT computation is just 4 clock cycles. For instance, at 100 MHz, the delay to process a full frame is less than half the frame interval for standard HD (1080p) video at 30 frames per second. Notably, this latency is deterministic and consistent, simplifying integration into real-time processing pipelines.
We also anticipate significant power and area efficiency gains. By significantly reducing the number of multiplications and primarily utilizing simpler addition and shift operations, the design minimizes power consumption, while requiring less area for the minimal arithmetic.
From the project's root directory, execute:
cmake -B build
cd build
makeIn the build directory, run the testbench by executing:
make run_simThe testbench executes both custom and randomized test cases, outputs a summary scoreboard, and provides detailed information on any failed test cases.
[1] G. K. Wallace, "The JPEG still picture compression standard," IEEE Transactions on Consumer Electronics, vol. 38, no. 1, pp. xviii–xxxiv, Feb. 1992. [Online]. Available: https://ieeexplore.ieee.org/document/125072. doi: 10.1109/30.125072.
[2] J. Makhoul, "A fast cosine transform in one and two dimensions," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 28, no. 1, pp. 27–34, Feb. 1980. [Online]. Available: https://ieeexplore.ieee.org/document/1163351. doi: 10.1109/TASSP.1980.1163351.
[3] V. Britanak, P. C. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. Academic Press, 2006, p. 66.
[4] Y. Arai, T. Agui, and M. Nakajima, "A fast DCT-SQ scheme for images," The Transactions of the IEICE, vol. E71, no. 11, pp. 1095–1097, Nov. 1989.
[5] M. Püschel and J. M. F. Moura, "The algebraic approach to the discrete cosine and sine transforms and their fast algorithms," SIAM Journal on Computing, vol. 32, no. 5, pp. 1280–1316, 2003. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/S009753970139272X. doi: 10.1137/S009753970139272X.
[6] V. Britanak, P. C. Yip, and K. R. Rao, Discrete Cosine and Sine Transforms: General Properties, Fast Algorithms and Integer Approximations. Academic Press, 2006, p. 98.
[7] J. Chen, S. Liu, G. Deng, and S. Rahardja, "Hardware efficient integer discrete cosine transform for efficient image/video compression," IEEE Access, vol. 7, pp. 152635–152645, 2019. [Online]. Available: https://ieeexplore.ieee.org/document/8868087. doi: 10.1109/ACCESS.2019.2947269.
[8] A. Edirisuriya, A. Madanayake, V. S. Dimitrov, R. J. Cintra, and J. Adikari, "VLSI architecture for 8-point AI-based Arai DCT having low area-time complexity and power at improved accuracy," Journal of Low Power Electronics and Applications, vol. 2, no. 2, pp. 127–142, 2012. [Online]. Available: https://www.mdpi.com/2079-9268/2/2/127. doi: 10.3390/jlpea2020127.
