"On the Capability and Achievable
Performance of FPGAs for HPC
Applications"
Wim Vanderbauwhede
School of Computing Science, University of Glasgow, UK
Or in other words
"How Fast Can Those FPGA Thingies
Really Go?"
Outline
Part 1 The Promise of FPGAs for HPC
FPGAs
FLOPS
Performance Model
Part 2 How to Deliver this Promise
Assumptions on Applications
Computational Architecture
Optimising the Performance
A Matter of Programming
Enter TyTra
Conclusions
Part 1:
The Promise of FPGAs for HPC
High-end FPGA Board
FPGAs in a Nutshell
Field-Programmable Gate Array
Configurable logic
Matrix of look-up tables (LUTs) that can
be configured into any N-input logic
operation
e.g. a 2-input LUT configured as XOR:
Address Value
00 0
01 1
10 1
11 0
Combined with flip-flops to provide state
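As a minimal software model (my own illustration, not vendor tooling), a 2-input LUT is just a 4-entry truth table indexed by its inputs; loading a different table reprograms the function:

```python
# Minimal model of a 2-input LUT: a 4-entry table indexed by the input bits.
# Loading a different table "reconfigures" the LUT to another Boolean function.
XOR_TABLE = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

def lut(table, a, b):
    """Evaluate a 2-input LUT configured with the given truth table."""
    return table[(a, b)]

assert [lut(XOR_TABLE, a, b) for a, b in XOR_TABLE] == [0, 1, 1, 0]
```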
FPGAs in a Nutshell
Communication fabric:
island-style: a grid of
wires with islands of
LUTs; the wires connect
through switch boxes,
which provide full connectivity
Also dedicated on-chip
memory
FPGAs in a Nutshell
Programming
By configuring the LUTs and their
interconnection, one can create arbitrary
circuits
In practice, circuit description is written
in VHDL or Verilog, and converted into a
configuration file by the vendor tools
Two major vendors: Xilinx and Altera
Many "C-based" programming solutions
have been proposed and are
commercially available. They generate
VHDL or Verilog.
Most recently, OpenCL is available for
FPGA programming (specific
Altera-based boards only)
The Promise
FPGAs have great potential for HPC:
Low power consumption
Massive amount of fine grained
parallelism (e.g. Xilinx Virtex-6 has
about 600,000 LUTs)
Huge (TB/s) internal memory
bandwidth
Very high power efficiency
(GFLOPS/W)
The Challenge
FPGA Computing Challenge
Device clock speed is very low
Many times lower than memory clock
There is no cache
So random memory access will kill
the performance
Requires a very different
programming paradigm
So, it’s hard
But that shouldn’t stop us
Maximum Achievable Performance
The theoretical maximum computational performance is determined
by:
Available Memory bandwidth
Easy: read the datasheets!
Compute Capacity
Hard: what is the relationship between logic gates and “FLOPS”?
FLOPS?
What is a FLOP, anyway?
Ubiquitous measure of performance for
HPC systems
Floating-point Operations per second
Floating-point:
Single or double precision?
Number format: floating-point or
fixed-point?
Operations:
Which operations? Addition? Multiplication?
In fact, why floating point?
Depends on the application
FLOPS!
FLOPS on Multicore CPUs and GPGPUs
Fixed number of FPUs
Historically, FP operations had higher cost
than integer operations
Today, essentially no difference between
integer and floating-point operations
But scientific applications perform mostly
FP operations
Hence, FLOPS as a measure of
performance
An Aside: the GPGPU Promise
Many papers report huge speed-ups: 20x/50x/100x/...
And the vendors promise the world
However, theoretical FLOPS are comparable between
same-complexity CPUs and GPGPUs:
                               #cores   vector size   clock (GHz)   GFLOPS
CPU: Intel Xeon E5-2640          24         8             2.5         480
GPU: Nvidia GeForce GTX 480      15        32             1.4         672
CPU: AMD Opteron 6176 SE         48         4             2.3         442
GPU: Nvidia Tesla C2070          14        32             1.1         493
FPGA: GiDEL PROCStar-IV           ?         ?             0.2          ??
Difference is no more than 1.5x
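The peak figures in this table follow the simple convention of cores × vector width × clock, one operation per lane per cycle; a quick check in Python (my reproduction of the table's arithmetic, not vendor peak specifications, which would include FMA or dual issue):

```python
# Peak GFLOPS under the table's convention: #cores * vector width * clock (GHz),
# assuming one operation per vector lane per cycle.
devices = [
    ("CPU: Intel Xeon E5-2640",     24,  8, 2.5),
    ("GPU: Nvidia GeForce GTX 480", 15, 32, 1.4),
    ("CPU: AMD Opteron 6176 SE",    48,  4, 2.3),
    ("GPU: Nvidia Tesla C2070",     14, 32, 1.1),
]
for name, cores, vec, clk in devices:
    print(f"{name}: {cores * vec * clk:.0f} GFLOPS")
# -> 480, 672, 442, 493: the CPU/GPU gap is no more than ~1.5x.
```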
The GPGPU Promise (Cont’d)
Memory bandwidth is usually higher for GPGPU:
                               Memory BW (GB/s)
CPU: Intel Xeon E5-2640              42.6
GPU: Nvidia GeForce GTX 480         177.4
CPU: AMD Opteron 6176 SE             42.7
GPU: Nvidia Tesla C2070             144
FPGA: GiDEL PROCStar-IV              32
The difference is about 4.5x
So where do the 20x/50x/100x figures come from?
Unoptimised baselines!
FPGA Power Efficiency Model (1)
On FPGAs, different instructions (e.g. *, +, /) consume different
amounts of resources (area and time)
FLOPS should be defined on a per-application basis
We analyse the application code and compute the aggregated
resource requirements based on the count n_OP,i and resource
utilisation r_OP,i of the required operations:
r_app = Σ_{i=1}^{N_instrs} n_OP,i · r_OP,i
We take into account an area overhead fraction α for control logic, I/O etc.
Combined with the available resources on the board r_FPGA, the
clock speed f_FPGA and the power consumption P_FPGA, we can
compute the power efficiency:
Power Efficiency = (1 − α) · (r_FPGA / r_app) · f_FPGA / P_FPGA   GFLOPS/W
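A minimal sketch of this model in Python; the per-operation slice costs in the example call are placeholders, and counting one result per datapath copy per cycle is an assumption on my part:

```python
# Sketch of the per-application power-efficiency model (placeholder figures).
def r_app(op_counts, op_costs):
    """r_app = sum_i n_OP,i * r_OP,i: slices for one copy of the application datapath."""
    return sum(n * op_costs[op] for op, n in op_counts.items())

def power_efficiency(op_counts, op_costs, r_fpga, f_fpga_ghz, p_fpga_w, alpha=0.5):
    """(1 - alpha) * (r_FPGA / r_app) * f_FPGA / P_FPGA, in GFLOPS/W.
    alpha is the control/I/O area overhead fraction."""
    return (1 - alpha) * (r_fpga / r_app(op_counts, op_costs)) * f_fpga_ghz / p_fpga_w

# Placeholder example: 4 adds, 2 muls, 1 div on a hypothetical 400k-slice device.
print(power_efficiency({"+": 4, "*": 2, "/": 1},
                       {"+": 16, "*": 400, "/": 2000},
                       r_fpga=400_000, f_fpga_ghz=0.175, p_fpga_w=30.0))
```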
FPGA Power Efficiency Model (2)
Example: convection kernel from the FLEXPART Lagrangian particle
dispersion simulator
About 600 lines of Fortran 77
This would be a typical kernel for e.g. OpenCL or CUDA on a
GPU
Assuming a GiDEL PROCStar-IV board, P_FPGA = 30 W
Assume α = 0.5 (50% overhead, conservative) and clock speed
f_FPGA = 175 MHz (again, conservative)
Resulting power efficiency: 30 GFLOPS/W
By comparison: Tesla C2075 GPU: 4.5 GFLOPS/W
If we only did multiplications and similar operations, it would be 15
GFLOPS/W
If we only did additions and similar operations, it would be 225
GFLOPS/W
Depending on the application, the power efficiency can be up to
50x better on FPGA!
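These factors follow directly from the figures above (Tesla C2075 baseline of 4.5 GFLOPS/W):

```python
# Ratios implied by the slide's own numbers.
gpu_baseline = 4.5   # GFLOPS/W, Tesla C2075
for label, fpga in [("mixed convection kernel", 30),
                    ("multiplication-like ops only", 15),
                    ("addition-like ops only", 225)]:
    print(f"{label}: {fpga / gpu_baseline:.1f}x")
# -> 6.7x, 3.3x, 50.0x: hence "up to 50x", depending on the operation mix.
```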
Conclusion of Part 1
Conclusion of Part 1
The FPGA HPC promise is real!
Part 2:
How to Deliver this Promise
Assumptions on Applications
Suitable for streaming computation
Data parallelism
If it works well in OpenCL or CUDA, it
will work well on FPGA
Single-precision floating point, integer or
bit-level operations. Doubles take too
much space.
Suitable model for many scientific
applications (esp. numerical weather prediction)
But also for data search, filtering and
classification
So good for both HPC and data centres
Computational Architecture
Essentially, a network of processors
But "processors" defined very loosely
Very different from e.g. Intel CPU
Streaming processor
Minimal control flow
Single-instruction
Coarse-grained instructions
Main challenge is the parallelisation
Optimise memory throughput
Optimise computational performance
Example
A – somewhat contrived – example to illustrate our
optimisation approach:
We assume we have an application that
performs 4 additions, 2 multiplications and a
division
We assume that the relative areas of the
operations are 16, 400 and 2000 slices
for addition, multiplication and division respectively
We assume that the multiplication requires 2
clock cycles and the division requires 8 clock
cycles
The processor area would be
4*16 + 2*400 + 1*2000 = 2864 slices
The compute time: 4*1 + 2*2 + 1*8 = 16 cycles
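The arithmetic of this contrived example, spelled out (addition assumed to take 1 cycle, consistent with the 16-cycle total):

```python
# Worked numbers for the contrived example: 4 adds, 2 muls, 1 div per datum.
AREA   = {"+": 16, "*": 400, "/": 2000}   # relative slice costs, as stated above
CYCLES = {"+": 1,  "*": 2,   "/": 8}      # cycles per operation (add assumed 1)
counts = {"+": 4, "*": 2, "/": 1}

area = sum(n * AREA[op] for op, n in counts.items())     # 4*16 + 2*400 + 1*2000
time = sum(n * CYCLES[op] for op, n in counts.items())   # 4*1 + 2*2 + 1*8
print(area, time)                                        # 2864 slices, 16 cycles
```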
Lanes
Memory clock is several times
higher than FPGA clock:
f_MEM = n · f_FPGA
To match memory bandwidth
requires at least n parallel
lanes
For the GiDEL board, n = 4
So the area requirement is
about 11,500 slices (4 × 2864)
But the throughput is still
only 1/16th of the memory
bandwidth
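Continuing the same worked example:

```python
# Lanes: with f_MEM = n * f_FPGA, matching the memory rate needs n parallel
# lanes; n = 4 on the GiDEL board.
n_lanes = 4
area_per_processor = 2864                  # from the example above
print(n_lanes * area_per_processor)        # ~11,500 slices for 4 unpipelined lanes
# Each lane still needs 16 cycles per datum, so throughput is 1/16 of memory rate.
```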
Threads
Typically, each lane needs to
perform many operations on each
item of data read from memory (16
cycles per datum in the example)
So we need to parallelise the
computational units per lane
as well
A common approach is to use
data parallel threads to
achieve processing at
memory rate
In our example, this requires
16 threads per lane, so about
183,000 slices in total
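Putting numbers on this:

```python
# Threads: to consume one datum per cycle in each lane, replicate the
# 16-cycle processor 16 times per lane (data-parallel threads).
threads_per_lane = 16                      # = cycles per datum
n_lanes, area_per_processor = 4, 2864
print(threads_per_lane * n_lanes * area_per_processor)   # ~183,000 slices
```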
Pipelining
However, this approach is
wasteful:
Create a pipeline of the
operations
Each stage in the pipeline only
needs the unit for the operation
that it executes
In the example, this requires
4*16 + 2*400 + 1*2000 = 2864 slices
per pipeline, and 8 cycles per datum
Requires only 8 parallel
pipelines per lane to achieve memory
bandwidth, so about 92,000 slices
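And for the pipelined version:

```python
# Pipelining: one pipeline per lane costs 4*16 + 2*400 + 1*2000 slices and is
# limited by the 8-cycle divider, so 8 pipelines per lane reach the memory rate.
pipeline_area = 4*16 + 2*400 + 1*2000           # 2864 slices
pipelines_per_lane = 8                          # = cycles per datum (divider-bound)
print(pipelines_per_lane * 4 * pipeline_area)   # ~92,000 slices over 4 lanes
```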
Balancing the Pipeline
This is still not optimal:
As we assume a streaming mode, we can
replicate pipeline stages to balance the
pipeline
In this way, the pipeline will have optimal
throughput
In the example, this requires
4*16 + 2*2*400 + 8*2000 = 17,664 slices
per lane to process at 1 cycle per datum
So the total resource utilisation is
17,664*4=70,656 slices
To evaluate various trade-offs (e.g. lower
clock speed, smaller area, more cycles),
we use the notion of “Effective Slice Count”
(ESC) to express the number of slices
required by an operation in order to achieve
a balanced pipeline (sketched below).
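A sketch of the balanced-pipeline arithmetic, together with one possible reading of the ESC (the exact definition is not given on the slide, so the formula below is an assumption):

```python
# Balanced pipeline: replicate the slow stages (2 copies per multiplication,
# 8 per division) so every stage accepts one datum per cycle; then a single
# pipeline per lane keeps up with memory.
balanced_per_lane = 4*16 + 2*2*400 + 8*2000     # 17,664 slices
print(4 * balanced_per_lane)                    # 70,656 slices in total

# One reading of the "Effective Slice Count": the slices an operation
# effectively costs once replicated to deliver one result per cycle.
def esc(slices, cycles_per_result):
    return slices * cycles_per_result

print(esc(16, 1), esc(400, 2), esc(2000, 8))    # 16, 800, 16000
# 4*16 + 2*800 + 1*16000 = 17,664: the ESCs sum to the balanced-pipeline area.
```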
Coarse Grained Operations
We can still do better though:
By grouping fine-grained operations into coarser-grained ones,
we reduce the overhead of the pipeline.
This is effective as long as the clock speed does not degrade
Again, the ESC is used to evaluate the optimal grouping
Preliminary Result
We applied our approach manually to a small part of the
convection kernel
The balanced pipeline results in 10GFLOPS/W, without any
optimisation in terms of number representation
This is already better than a Tesla C2075 GPU
Application Size
The approach we outlined leads to optimal performance if the
circuit fits on the FPGA
What if the circuit is too large for the FPGA (and you can’t buy a
larger one)?
Only solution is to trade space for time, i.e. reduce throughput
Our approach is to group operations into processors
Each processor instantiates the instructions required to perform all
of its operations
Because some instructions are executed frequently, there is an
optimum for operations/area
As the search space is small, we perform an exhaustive search for
the optimal solution
The throughput drops with the number of operations per
processor; based on the theoretical model, for our example
case with 4 to 8 operations per processor it can still be
worthwhile to use the FPGA (see the sketch below).
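A hedged sketch of that exhaustive search, reusing the example's illustrative figures (the grouping strategy below, folding consecutive operations into processors of at most k operations, is my own simplification):

```python
# Space-for-time trade-off: fold up to k operations into each processor,
# score area and cycles-per-datum, and enumerate the (tiny) search space.
ops    = ["+"] * 4 + ["*"] * 2 + ["/"]     # the contrived kernel
AREA   = {"+": 16, "*": 400, "/": 2000}
CYCLES = {"+": 1,  "*": 2,   "/": 8}

def score(k):
    """Area and cycles-per-datum when ops are folded into groups of up to k."""
    groups = [ops[i:i + k] for i in range(0, len(ops), k)]
    area   = sum(sum(AREA[o] for o in set(g)) for g in groups)  # one unit per distinct op
    cycles = max(sum(CYCLES[o] for o in g) for g in groups)     # slowest processor sets the rate
    return area, cycles

for k in range(1, len(ops) + 1):
    print(k, score(k))
# Pick the configuration with the best throughput that still fits the device.
```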
Conclusion of Part 2
Conclusion of Part 2
The FPGA HPC Promise can be
delivered –
Conclusion of Part 2
– but it’s hard work!
A Matter of Programming
In practice, scientists don’t write “streaming
multiple-lane balanced-pipeline” code.
They write code like the listing shown on the original slide
And current high-level programming tools
still require a lot of programmer know-how
to get good performance, because
essentially the only way is to follow a course
as outlined in this talk.
So we need better programming tools
And specifically, better compilers
Enter the TyTra Project
Project between universities of Glasgow,
Heriot-Watt and Imperial College, funded by
EPSRC
The aim: compile scientific code efficiently
for heterogeneous platforms, including
multicore/manycore CPUs, GPGPUs and
FPGAs
The approach: TYpe TRAnsformations
Infer the type of all communication in a
program
Transform the types using a formal,
provably correct mechanism
Use a cost model to identify the suitable
transformations
Five-year project, started Jan 2014
But Meanwhile
A practical recipe:
Given a legacy Fortran application
And a high-level FPGA programming solution, e.g. Maxeler,
Impulse-C, Vivado or Altera OpenCL
Rewrite your code in data-parallel fashion, e.g. in OpenCL
There are tools to help you: automated refactoring, Fortran-to-C
translation
This will produce code suitable for streaming
Now rewrite this code to be similar to the pipeline model
described
Finally, rewrite the code obtained in this way for Maxeler,
Impulse-C etc.; this is mainly a matter of syntax
Conclusion
FPGAs are very promising for HPC
We presented a model to estimate the maximum achievable
performance on a per-application basis
Our conclusion is that the power efficiency can be up to 10x
better compared to GPU/multicore CPU
We presented a methodology to achieve the best possible
performance
Better tools are needed, but already with today’s tools very good
performance is achievable
Thank you