This document discusses optimizing discrete wavelet transforms for CPU performance. It covers loop fusion, removal of loop prologs and epilogs, the influence of the CPU cache, SIMD vectorization, and parallelization. Benchmark results show that these optimizations achieve up to an 11x speedup over the separable diagonal implementation for a 10-megapixel image on an Intel Core 2 Quad CPU. Areas of future work include merging multiple levels and transforms.
7. What have I done?
loop fusion
removed prologs/epilogs
influence of CPU cache
SIMD vectorization
parallelization
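As an illustration of the loop fusion item, here is a minimal sketch in C. It assumes a CDF 5/3 lifting scheme with symmetric border extension and an even-length 1-D signal; the slides do not say which wavelet or border handling was used, and the function names lift53_two_pass and lift53_fused are hypothetical. The first function runs the predict and update steps as two separate loops; the second fuses them into one loop, so each input sample is read once.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration: CDF 5/3 lifting, symmetric extension, even n.
 * s receives n/2 low-pass coefficients, d receives n/2 high-pass ones. */

/* Reference version: predict and update as two separate passes. */
void lift53_two_pass(const float *x, size_t n, float *s, float *d)
{
    assert(n >= 2 && n % 2 == 0);
    size_t m = n / 2;

    /* Predict: high-pass coefficient from an odd sample and its even neighbours. */
    for (size_t i = 0; i < m; i++) {
        float right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i]; /* mirror at right border */
        d[i] = x[2 * i + 1] - 0.5f * (x[2 * i] + right);
    }

    /* Update: low-pass coefficient from an even sample and neighbouring d values. */
    for (size_t i = 0; i < m; i++) {
        float left = (i > 0) ? d[i - 1] : d[0]; /* mirror at left border */
        s[i] = x[2 * i] + 0.25f * (left + d[i]);
    }
}

/* Fused version: both lifting steps in a single loop.
 * The update for index i needs only d[i-1] and d[i], so the previous
 * high-pass value is carried in a local variable between iterations. */
void lift53_fused(const float *x, size_t n, float *s, float *d)
{
    assert(n >= 2 && n % 2 == 0);
    size_t m = n / 2;
    float prev_d = 0.0f; /* overwritten before first real use */

    for (size_t i = 0; i < m; i++) {
        float right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];
        float cur_d = x[2 * i + 1] - 0.5f * (x[2 * i] + right);
        float left = (i > 0) ? prev_d : cur_d;

        s[i] = x[2 * i] + 0.25f * (left + cur_d);
        d[i] = cur_d;
        prev_d = cur_d;
    }
}
```

Both functions produce the same coefficients. The fused form makes one pass over the data instead of two, which matters once the signal no longer fits in the CPU cache.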
David Barina Wavelets @ CPU April 15, 2014 7 / 16
15. Future Work
merge several levels
merge forward and inverse cores
other wavelets
combine with EAW
other platforms (ARM, GPU, FPGA)
other transforms
16. Example (AMD Opteron)
[Plot: time per pixel (log scale, 1 ns to 100 ns) versus image size in pixels (log scale, 1 k to 100 M). Series: naive vertical, naive diagonal, single-loop vertical, single-loop diagonal.]