Lead, Follow. or...: O&G HPC: Guillaume Thomas-Colligan

+++CGG Veritas accelerators experience

They have done work on a few platforms.

FPGAs
The hardware platform was a Cray XT4. They used an FPGA library from ENS Lyon plus Xilinx tools to 'design the hardware'. It was an intensive process, iterating between Matlab and other tools to work out timing and other issues.

Pros

Good Density
Liked the closely coupled nature of working on regular Opteron systems

Cons

Complex to develop
Limited bandwidth hampered performance
Scaling efficiency wasn't that good.

Cell
Using the QS20 blade (2x3.2GHz Cell BE)

Porting process:

Port from Linux to Power... endian and Power compiler issues
Rearrange the code to make use of the SPEs
Write the vector code for one SPE. It is all vectors, so it is a new project, but it is not a major hurdle.
Optimize for multiple SPEs. This gets complicated.
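
The endian issue in the first porting step can be illustrated with a minimal Python sketch. x86 Linux hosts are little-endian while the Cell's Power core is big-endian, so raw binary floats written on one are misread on the other unless the byte order is stated explicitly. The sample values and the helper names `pack_floats` and `unpack_floats` are illustrative assumptions, not something from the talk.

```python
import struct

def pack_floats(values, byte_order):
    """Pack 32-bit floats; byte_order is '<' (little-endian) or '>' (big-endian)."""
    return struct.pack(f"{byte_order}{len(values)}f", *values)

def unpack_floats(raw, byte_order):
    """Unpack a buffer of 32-bit floats using an explicit byte order."""
    count = len(raw) // 4
    return list(struct.unpack(f"{byte_order}{count}f", raw))

# Hypothetical sample values, chosen to be exactly representable as float32.
samples = [1.0, -2.5, 1024.0]

# Data as a little-endian x86 host would write it to disk.
raw_le = pack_floats(samples, "<")

correct = unpack_floats(raw_le, "<")  # reader states the writer's byte order
misread = unpack_floats(raw_le, ">")  # reader assumes big-endian, as on Power

print("correct:", correct)
print("misread:", misread)
```

Stating the byte order at every read and write, rather than relying on the native order of the machine, is one common way to keep data files portable between the two platforms.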

The code is not human readable... Perhaps Geek readable, but certainly not for the average human.

Results need to be measured at the application level, since the kernels are up to 20x faster on the SPEs but code on the PPC is 3x slower. Guillaume wants a better host processor on the Cell.
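
Why the measurement has to be made at the application level can be sketched with an Amdahl-style estimate. The 20x and 3x figures are the ones quoted in the talk (20x being an upper bound); the kernel fractions in the loop are hypothetical, and the function name `application_speedup` is only for illustration.

```python
def application_speedup(kernel_fraction, kernel_speedup=20.0, host_slowdown=3.0):
    """Estimate whole-application speedup on the Cell.

    kernel_fraction: share of the original run time spent in kernels
                     that move to the SPEs (hypothetical input).
    kernel_speedup:  how much faster those kernels run on the SPEs.
    host_slowdown:   how much slower the remaining code runs on the PPC.
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    # Original run time is normalized to 1.0.
    new_time = (kernel_fraction / kernel_speedup
                + (1.0 - kernel_fraction) * host_slowdown)
    return 1.0 / new_time

# Hypothetical kernel fractions, not measurements from the talk.
for fraction in (0.5, 0.8, 0.95, 0.99):
    print(f"kernel fraction {fraction:.2f}: "
          f"{application_speedup(fraction):.2f}x overall")
```

Under these assumptions the port only breaks even once roughly 68% of the original run time sits in the accelerated kernels; below that, the slower host code outweighs the kernel gains.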

Pros

High volume CPU due to PS3
Performance on codes that fit
Memory bandwidth

Cons

Not a general purpose CPU
Complicated code
1GB of memory was certainly not enough

GPGPU
Now working on Nvidia's CUDA, which he feels is significantly ahead in the market; Nvidia has also been a good partner. CUDA was relatively straightforward to learn and use.
One of the problems is moving data from the host to the GPU, which can be optimized by using non-cached memory to reach 3.2GB/s. Working around the PCIe bottleneck is required for application performance. The approach works only for compute-intensive kernels where you can leave the data on the GPU as much as possible. The programmer needs to manage memory access constraints to get performance. This is analogous to a user-managed CPU cache, and a bit daunting.
There are unique memory features in the GPU that need to be understood. Texture, float and boundaries are areas he highlighted.
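
The cost of the host-to-GPU transfer can be put into rough numbers with a short sketch. The 3.2GB/s figure comes from the talk and is taken here to mean 3.2e9 bytes per second; the data size, CPU time per step, and kernel speedup are hypothetical, as are the function names.

```python
PCIE_BANDWIDTH_GB_PER_S = 3.2  # figure quoted in the talk

def transfer_seconds(n_bytes, bandwidth_gb_per_s=PCIE_BANDWIDTH_GB_PER_S):
    """Time to move n_bytes across PCIe, assuming 1 GB = 1e9 bytes."""
    return n_bytes / (bandwidth_gb_per_s * 1e9)

def offload_speedup(cpu_seconds, kernel_speedup, bytes_moved):
    """Speedup of a GPU offload once the PCIe transfer time is included."""
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds(bytes_moved)
    return cpu_seconds / gpu_seconds

# Hypothetical workload: 1 GB of data, 1 s of CPU work per step,
# and a kernel that runs 10x faster on the GPU.
data_bytes = 1e9
cpu_step_seconds = 1.0
kernel_speedup = 10.0

# Case 1: copy the data for a single step, then bring it back to the host side.
one_step = offload_speedup(cpu_step_seconds, kernel_speedup, data_bytes)

# Case 2: copy once, then run 100 steps with the data left on the GPU.
many_steps = offload_speedup(100 * cpu_step_seconds, kernel_speedup, data_bytes)

print(f"transfer time: {transfer_seconds(data_bytes):.4f} s")
print(f"one step per transfer:  {one_step:.2f}x")
print(f"100 steps per transfer: {many_steps:.2f}x")
```

The same kernel looks far better in the second case, which is the point about leaving data on the GPU as much as possible: the transfer is a fixed cost that has to be amortized over enough computation.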

So, what about the application performance? CGG ran a wave equation model. It required 'deep re-engineering of the code' but returned a 15x speedup. The interesting note is that the performance depends on the data set, a reflection of how strongly the optimization depends on the memory architecture.
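
The talk did not describe CGG's numerical scheme. As a stand-in, the textbook explicit finite-difference update for the 1D wave equation gives a feel for why such codes are sensitive to memory layout: every output point reads its neighbours and two earlier time levels while doing very little arithmetic. The function names, grid, and `courant_sq` value below are all assumptions for illustration.

```python
def wave_step(u_prev, u_curr, courant_sq):
    """Advance the 1D wave equation by one time step.

    u_prev, u_curr: field at the two previous time levels.
    courant_sq:     (c * dt / dx) ** 2, must be <= 1 for stability.
    Boundaries are held at zero.
    """
    n = len(u_curr)
    u_next = [0.0] * n
    for i in range(1, n - 1):
        # Three neighbouring reads from u_curr plus one from u_prev per point.
        laplacian = u_curr[i - 1] - 2.0 * u_curr[i] + u_curr[i + 1]
        u_next[i] = 2.0 * u_curr[i] - u_prev[i] + courant_sq * laplacian
    return u_next

def propagate(u0, steps, courant_sq=0.25):
    """Run several time steps starting from an initial field at rest."""
    u_prev, u_curr = list(u0), list(u0)
    for _ in range(steps):
        u_prev, u_curr = u_curr, wave_step(u_prev, u_curr, courant_sq)
    return u_curr

# Hypothetical initial condition: a single spike in the middle of the grid.
initial = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
print(propagate(initial, steps=3))
```

In 3D the same pattern touches neighbours that sit far apart in memory, so how the grid maps onto the GPU's memory features largely decides the achieved speed.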

Pros

Performance
C-like code
An add-in to regular systems
Fast evolution of CUDA (now on Mac too!)

Cons

Tricky to get full performance
Only for most intensive applications
Limited amount of memory
Limited debug & profiling tools

===
Bottom line
====

There are programming issues. The codes are hardware specific and the algorithms need to be modified, which requires understanding the hardware. There are too many codes to port this way.
It is not always rewarding, but it is getting better. CUDA was far and away their best experience, and don't forget that CPUs are getting faster. Whatever happens, parallelism is a part of your future.

Q: Size of the data structure and scaling
A: The code scales well to dual-core CPUs, but quad-core CPUs bottleneck on memory bandwidth.

Tuesday, March 4, 2008
