Wavelets CPU Optimization Techniques


These slides cover CPU optimizations for the discrete wavelet transform: loop fusion, removal of prologs and epilogs, the influence of the CPU cache, SIMD vectorization, and parallelization. In the reported benchmark (CDF 9/7, one level, in-place, 10 Mpx image, Intel Core2 Quad at 2.00 GHz), the parallel (4) variant reaches 1.55 ns per pixel, an 11.1× speed-up over the separable implementation at 17.23 ns per pixel. Proposed future work includes merging several decomposition levels, merging the forward and inverse cores, and extending the approach to other wavelets, transforms, and platforms.


Wavelets @ CPU
David Barina
April 15, 2014
David Barina Wavelets @ CPU April 15, 2014 1 / 16

Wavelet

Discrete Wavelet Transform

Lifting
α
β
γ
δ
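
A minimal Python sketch of one level of lifting for the CDF 9/7 wavelet (the wavelet named on the Results slide), assuming α, β, γ, δ are the usual two predict/update pairs. The numeric constants are the commonly published CDF 9/7 values and are not taken from the slides; the scaling constant ZETA, the mirrored border handling, and all function names are assumptions made for illustration.

```python
# Commonly published CDF 9/7 lifting constants (assumption: not from the slides).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
ZETA = 1.149604398  # scaling constant; scaling conventions differ between implementations


def _lift(x, start, coeff):
    """In place: x[i] += coeff * (x[i-1] + x[i+1]) for i = start, start+2, ...

    Neighbours outside the signal are mirrored (symmetric extension).
    Requires len(x) >= 2.
    """
    n = len(x)
    for i in range(start, n, 2):
        left = x[i - 1] if i - 1 >= 0 else x[i + 1]
        right = x[i + 1] if i + 1 < n else x[i - 1]
        x[i] += coeff * (left + right)


def forward_97(signal):
    """One level of the forward transform; output stays interleaved."""
    x = [float(v) for v in signal]
    _lift(x, 1, ALPHA)  # first predict step, odd samples
    _lift(x, 0, BETA)   # first update step, even samples
    _lift(x, 1, GAMMA)  # second predict step
    _lift(x, 0, DELTA)  # second update step
    for i in range(len(x)):
        x[i] = x[i] / ZETA if i % 2 == 0 else x[i] * ZETA
    return x


def inverse_97(coeffs):
    """Undo forward_97 by running the steps backwards with negated weights."""
    x = [float(v) for v in coeffs]
    for i in range(len(x)):
        x[i] = x[i] * ZETA if i % 2 == 0 else x[i] / ZETA
    _lift(x, 0, -DELTA)
    _lift(x, 1, -GAMMA)
    _lift(x, 0, -BETA)
    _lift(x, 1, -ALPHA)
    return x
```

In this sketch the even indices of the result hold the low-pass coefficients and the odd indices hold the high-pass coefficients. Each step changes samples of one parity using only samples of the other parity, which is why the inverse is the same sequence in reverse with the signs flipped.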

2-D DWT

2-D Separability
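
A short Python sketch of what separability means in practice: a 2-D transform built by applying a 1-D transform to every row and then to every column. The unnormalised Haar step is only a compact stand-in for the 1-D transform, and all function names are hypothetical.

```python
def haar_1d(x):
    """One level of an unnormalised Haar lifting step (stand-in 1-D transform)."""
    out = list(x)
    for i in range(0, len(out) - 1, 2):
        out[i + 1] -= out[i]      # predict: detail = odd - even
        out[i] += out[i + 1] / 2  # update: even becomes the pair average
    return out


def transpose(img):
    return [list(col) for col in zip(*img)]


def transform_rows(img, t):
    """Apply the 1-D transform t to every row."""
    return [t(row) for row in img]


def transform_cols(img, t):
    """Apply the 1-D transform t to every column."""
    return transpose(transform_rows(transpose(img), t))


def dwt2_separable(img, t=haar_1d):
    """Separable 2-D transform: horizontal pass, then vertical pass."""
    return transform_cols(transform_rows(img, t), t)
```

Because the stand-in transform is linear, running the vertical pass first gives the same coefficients as running the horizontal pass first, up to floating-point rounding. Each pass still sweeps the whole image once, which is the memory traffic the later slides aim to reduce.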

What have I done?
loop fusion
removed prologs/epilogs
influence of CPU cache
SIMD-vectorization
parallelization

Loop Fusion
[Figure labels: read, write, F, F]
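
A minimal Python sketch of loop fusion applied to two lifting steps, assuming the idea is to replace two full passes over the signal with a single pass in which the second step trails the first by one sample pair. The CDF 5/3 weights are used only to keep the example small; the weights, the mirrored borders, and the function names are assumptions, not the author's code.

```python
P, U = -0.5, 0.25  # CDF 5/3 predict/update weights, chosen only for illustration


def mirror(i, n):
    """Map an out-of-range index back into [0, n) by symmetric extension."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def two_pass(signal):
    """Reference version: one loop per lifting step (len(signal) >= 2)."""
    x = [float(v) for v in signal]
    n = len(x)
    for i in range(1, n, 2):  # pass 1: predict every odd sample
        x[i] += P * (x[i - 1] + x[mirror(i + 1, n)])
    for i in range(0, n, 2):  # pass 2: update every even sample
        x[i] += U * (x[mirror(i - 1, n)] + x[mirror(i + 1, n)])
    return x


def fused(signal):
    """Fused version: both steps inside a single loop over the signal."""
    x = [float(v) for v in signal]
    n = len(x)
    for i in range(1, n + 1, 2):
        if i < n:
            # predict odd sample i from even neighbours that are not yet updated
            x[i] += P * (x[i - 1] + x[mirror(i + 1, n)])
        e = i - 1
        # update even sample e; both of its odd neighbours are final by now
        x[e] += U * (x[mirror(e - 1, n)] + x[mirror(e + 1, n)])
    return x
```

Both functions perform the same arithmetic on each sample, so their outputs are identical; the fused loop touches each part of the signal once instead of twice.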

Removed Prologs and Epilogs

Influence of CPU Cache

SIMD Vectorization
[Figure labels: 4 × 4, 6 × 2]

Image Processing and Buffers

Parallelization
[Figure labels: prolog, overlay, overlay, segment]
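
A hedged Python sketch of one way to split a lifting transform into independently processed segments, assuming each worker reads a few extra samples on both sides of its segment so that no worker needs another worker's results. Treating that extra margin as the slide's "overlay" is an assumption, as are the CDF 5/3 stand-in, the margin width HALO, and every name below. Python threads serve here only to show the decomposition; pure-Python arithmetic does not speed up under the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

P, U = -0.5, 0.25  # CDF 5/3 weights, a small stand-in for the transform
HALO = 2           # extra samples per side; enough for two lifting steps


def mirror(i, n):
    """Map an out-of-range index back into [0, n) by symmetric extension."""
    if i < 0:
        return -i
    if i >= n:
        return 2 * (n - 1) - i
    return i


def lift53(signal):
    """Sequential reference transform over the whole signal (len >= 2)."""
    x = [float(v) for v in signal]
    n = len(x)
    for i in range(1, n, 2):
        x[i] += P * (x[i - 1] + x[mirror(i + 1, n)])
    for i in range(0, n, 2):
        x[i] += U * (x[mirror(i - 1, n)] + x[mirror(i + 1, n)])
    return x


def lift53_segmented(signal, workers=4):
    """Transform each segment separately, then concatenate the results."""
    n = len(signal)
    step = -(-n // workers)  # ceiling division
    step += step % 2         # even segment starts keep sample parity aligned
    bounds = [(s, min(s + step, n)) for s in range(0, n, step)]

    def work(seg):
        s, e = seg
        lo, hi = max(s - HALO, 0), min(e + HALO, n)
        chunk = lift53(signal[lo:hi])  # margin samples absorb the border error
        return chunk[s - lo:e - lo]    # keep only this segment's own samples

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(work, bounds))
    return [v for part in parts for v in part]
```

The margin samples are computed redundantly by neighbouring workers and then discarded, which trades a small amount of extra work for the absence of synchronization between segments.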

Results
Intel Core2 Quad @ 2.00 GHz
10 Mpx
CDF 9/7, 1 level, in-place
approach      best algorithm   time/px    speed-up
separable     diag.            17.23 ns    1.0×
single-loop   diag. 2 × 2       9.55 ns    1.8×
core          diag. 2 × 2       8.79 ns    2.0×
super-core    vert. 4 × 4       5.33 ns    3.2×
parallel (4)  vert. 4 × 4       1.55 ns   11.1×
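
The speed-up column follows directly from the time/px column. The short Python check below recomputes it and converts the per-pixel times to whole-image times, assuming "10 Mpx" means 10,000,000 pixels; the variable names are illustrative.

```python
# Per-pixel times from the table above, in nanoseconds.
times_ns = {
    "separable": 17.23,
    "single-loop": 9.55,
    "core": 8.79,
    "super-core": 5.33,
    "parallel (4)": 1.55,
}

PIXELS = 10_000_000  # assumption: 10 Mpx taken as 10^7 pixels

baseline = times_ns["separable"]

# Speed-up relative to the separable approach, rounded as in the table.
speedup = {name: round(baseline / t, 1) for name, t in times_ns.items()}

# Time for the whole image in milliseconds (ns * pixels / 10^6).
image_ms = {name: t * PIXELS / 1e6 for name, t in times_ns.items()}

for name in times_ns:
    print(f"{name:13s} {speedup[name]:5.1f}x  {image_ms[name]:7.1f} ms")
```

Under that assumption, the whole image takes roughly 172 ms with the separable approach and roughly 15.5 ms with the parallel (4) variant.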

Future Work
merge several levels
merge forward and inverse cores
other wavelets
combine with EAW
other platforms (ARM, GPU, FPGA)
other transforms

Example (AMD Opteron)
[Plot: time/pixel (1 ns to 100 ns) versus pixels (1 k to 100 M); series: naive vertical, naive diagonal, single-loop vertical, single-loop diagonal]
