They have done work on a few platforms.
The hardware platform was a Cray XT4. They used an FPGA library from ENS Lyon plus Xilinx to 'design the hardware'. It was an intensive process, with iterations between Matlab and other tools to work out timing and other issues.
- Good Density
- Liked the closely coupled nature of working on a regular Opteron system
- Complex to develop
- Limited bandwidth hampered performance
- Scaling efficiency wasn't that good.
Using the QS20 blade (2x3.2GHz Cell BE)
- Port from Linux to Power... endian and Power compiler issues
- Rearrange the code to make use of SPEs
- Write the vector code for one SPE. It is all vectors, so it is a new project, but it is not a major hurdle.
- Optimize for multiple SPEs. This gets complicated
The code is not human readable... Perhaps geek readable, but certainly not for the average human.
Results need to be measured at the application level, since the kernels are up to 20x faster on the SPE but code on the PPC is 3x slower. Guillaume wants a better host processor on the Cell.
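To see why the application-level number matters, here is a rough sketch of the arithmetic. It assumes a simple two-part model that was not presented in the talk: some fraction of the original runtime is kernel work that runs 20x faster on the SPEs, and everything else runs 3x slower on the PPC. The function name and the sample fractions are hypothetical, chosen only for illustration.

```python
def application_speedup(kernel_fraction, kernel_speedup=20.0, host_slowdown=3.0):
    """Overall speedup under a simple two-part model (an assumption, not from the talk).

    kernel_fraction: share of the ORIGINAL runtime spent in kernels that move to the SPEs.
    kernel_speedup:  how much faster those kernels run on the SPEs (talk: up to 20x).
    host_slowdown:   how much slower the remaining code runs on the PPC (talk: 3x).
    """
    if not 0.0 <= kernel_fraction <= 1.0:
        raise ValueError("kernel_fraction must be between 0 and 1")
    # New runtime relative to an original runtime of 1.0.
    new_time = kernel_fraction / kernel_speedup + (1.0 - kernel_fraction) * host_slowdown
    return 1.0 / new_time


if __name__ == "__main__":
    # Hypothetical kernel fractions, just to show the trend.
    for fraction in (0.5, 0.7, 0.9, 0.99):
        print(f"kernel fraction {fraction:.2f}: {application_speedup(fraction):.2f}x overall")
```

Under these assumptions, an application with only half of its time in kernels ends up slower than before, and the break-even point sits at roughly two thirds of the runtime in kernels. Only codes dominated by their kernels get near the headline 20x.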
- High volume CPU due to PS3
- Performance on codes that fit
- memory bandwidth
- not a general purpose CPU
- complicated code
- 1GB was certainly not enough
Now working on Nvidia's CUDA, which he feels is significantly ahead in the market. Nvidia has also been a good partner. CUDA was relatively straightforward to learn and use.
One of the problems is moving the data from host to GPU, which can be optimized by using non-cached memory to get 3.2GB/s. Working around the PCIe bottleneck is required for application performance. The approach works only for compute-intensive kernels where you can leave the data on the GPU as much as possible. The programmer needs to manage memory access constraints to get performance. This is analogous to a user-managed CPU cache, and a bit daunting.
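A back-of-the-envelope sketch of why leaving data on the GPU matters. Only the 3.2GB/s transfer rate comes from the talk; the cost model itself, the function name, and the sample numbers (1 second of CPU work per pass, a 15x kernel speedup, 3.2GB moved) are assumptions for illustration.

```python
PCIE_BYTES_PER_SECOND = 3.2e9  # host-to-GPU rate quoted in the talk


def effective_speedup(cpu_seconds_per_pass, kernel_speedup, bytes_moved, passes_per_transfer=1):
    """Speedup including PCIe transfer time, under a simple additive cost model.

    cpu_seconds_per_pass: CPU time for one pass over the data (hypothetical input).
    kernel_speedup:       how much faster the GPU kernel is than the CPU version.
    bytes_moved:          bytes copied across PCIe for each batch of passes.
    passes_per_transfer:  compute passes run on the GPU before the data moves again.
    """
    transfer_seconds = bytes_moved / PCIE_BYTES_PER_SECOND
    cpu_total = cpu_seconds_per_pass * passes_per_transfer
    gpu_total = cpu_total / kernel_speedup + transfer_seconds
    return cpu_total / gpu_total


if __name__ == "__main__":
    # Hypothetical workload: 1 s of CPU work per pass, 15x kernel, 3.2 GB moved.
    for passes in (1, 10, 100):
        speedup = effective_speedup(1.0, 15.0, 3.2e9, passes_per_transfer=passes)
        print(f"{passes:3d} passes per transfer: {speedup:.2f}x")
```

In this model a single pass per transfer is slower than staying on the CPU, because the one-second copy swamps the kernel gain. Reusing the data for many passes amortizes the copy and brings the result back toward the raw kernel speedup.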
There are unique memory features in the GPU that need to be understood. Texture, float & boundaries are areas he highlighted.
So, what about the application performance? CGG ran a wave equation model. It required 'deep re-engineering of the code' but returned 15x performance. The interesting note is the performance is dependent upon the data set, a reflection of the optimization dependency on memory architecture.
- C like code
- an add-in to regular systems
- fast evolution of CUDA (now on mac too!)
- Tricky to get full performance
- Only for most intensive applications
- limited amount of memory
- limited debug & profiling tools
There are programming issues. The codes are hardware specific and the algorithms need to be modified, which requires understanding the hardware. There are too many codes to port this way.
It is not always rewarding, but it is getting better. CUDA was far and away their best experience, and don't forget the CPUs are getting faster. Whatever happens, parallelism is a part of your future.
Q: Size of the data structure and scaling
A: The code scales well to dual core CPU, but quad cores bottleneck on memory bandwidth.