Tuesday, March 4, 2008

O&G HPC Workshop: Samuel Brown

Sam Brown of University of Utah
"Optimal Thread Parallelism, etc."

His background includes a lot of work on FPGAs & other hardware for accelerated computing.
His core requirement is that the accelerator be integrated into the CPU environment.

Current accelerators have a lot in common:
  • Specialized HW
  • New programming languages
  • Rearrange your data

This is also true of the vector ops in current CPUs, where data arrangement and compiler flags change the code's behavior. No matter what happens, he expects some kind of heterogeneous silicon on the die. In his opinion, the right performance balance can't be reached with homogeneous cores.

Most of what he does is generate synthetic data as quickly as possible. He has a number of 8-core systems in his lab today, which allows the easy approach of running separate MPI processes. However, Matlab & other shared workloads sneak onto the general-purpose cores.

However, he does have a PS3, which he doesn't have to share (and which doesn't run Matlab). So what he needed was a code base that worked on both the PS3 and the 8-core machines. Most of the math is PDEs (Partial Differential Equations), which I can't capture here but which you can google.

Sam highlighted:
  • SPEs have only 256KB local memory, so you need to manage your data well
  • Explicit data movement via DMA
  • The SPE local store is like a user-controlled cache, which can be good
  • Out-of-core methodology: you asynchronously refresh the data during the compute
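
To make the last bullet concrete, here is a minimal sketch of the double-buffering idea in Python. It is not Sam's code: the names fetch, compute and process_double_buffered, the chunk size, and the sum-of-squares kernel are all hypothetical stand-ins, and a background thread plays the role of the DMA engine. The point is only the overlap: the next chunk is requested before the current one is computed.

```python
from concurrent.futures import ThreadPoolExecutor

LOCAL_STORE_ITEMS = 4  # hypothetical stand-in for the small SPE local store


def fetch(data, start, size):
    """Stand-in for a DMA transfer: copy one chunk into 'local' memory."""
    return list(data[start:start + size])


def compute(chunk):
    """Placeholder kernel (assumption): sum of squares of the chunk."""
    return sum(x * x for x in chunk)


def process_double_buffered(data, chunk_size=LOCAL_STORE_ITEMS):
    """Compute on one chunk while the next one is fetched in the background."""
    total = 0
    with ThreadPoolExecutor(max_workers=1) as dma:
        # Start the first transfer before entering the compute loop.
        pending = dma.submit(fetch, data, 0, chunk_size)
        for start in range(0, len(data), chunk_size):
            current = pending.result()  # wait for the in-flight transfer
            nxt = start + chunk_size
            if nxt < len(data):
                # Kick off the next transfer, then compute on the current chunk.
                pending = dma.submit(fetch, data, nxt, chunk_size)
            total += compute(current)
    return total
```

Only two chunks are ever held at once (the one being computed and the one in flight), which is the property that matters when the local memory is small.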

His approach was Control & Compute threads. The Control thread handles initialization, file IO & thread sync, while the Compute threads are kept simple. He did rearrange his data to be much more vector friendly, using intrinsics, & partitioned it to map to his command & data movement.
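
A rough sketch of the Control & Compute split, in Python rather than anything Sam showed. The function names, the queue-based hand-off, the partition size, and the sum-of-squares kernel are my assumptions for illustration; file IO is left out. The control side does setup, partitioning and synchronization, and each compute thread just loops over the partitions it is handed.

```python
import queue
import threading


def compute_worker(tasks, results):
    """Compute thread: pull a partition, run the kernel, report the result."""
    while True:
        item = tasks.get()
        if item is None:  # sentinel from the control thread: shut down
            break
        index, partition = item
        # Placeholder kernel (assumption): sum of squares of the partition.
        results.put((index, sum(x * x for x in partition)))


def control(data, n_workers=6, partition_size=4):
    """Control thread: initialization, partitioning, and thread sync."""
    tasks, results = queue.Queue(), queue.Queue()
    workers = [threading.Thread(target=compute_worker, args=(tasks, results))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    # Partition the data so each command maps to one block of data movement.
    partitions = [data[i:i + partition_size]
                  for i in range(0, len(data), partition_size)]
    for index, partition in enumerate(partitions):
        tasks.put((index, partition))
    for _ in workers:
        tasks.put(None)  # one shutdown sentinel per compute thread
    for w in workers:
        w.join()  # synchronization point: all compute threads are done
    partial = [0] * len(partitions)
    while not results.empty():
        index, value = results.get()
        partial[index] = value
    return partial
```

For example, control(list(range(10)), n_workers=3) returns one partial result per partition, in partition order, regardless of which thread computed it.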

Results from the PS3 and the x86 8-core machine showed how the heterogeneous processor scaled vs. the 8-core machine. On a single thread the performance is similar, but at 6 worker threads the PS3 is 2x faster. Scaling beyond 4 x86 Intel cores was poor. (At 4 threads, the PS3 was better than 1.5x the performance.)

What to remember is "Memory access is critical. The effort is applicable to both the PS3 & x86."
Sam had more specific points. I will see if I can find them and add to the comments.

However, this is not readily readable code for others, which is a problem for sustaining the work.

Q: Your results are almost exactly what we got! However, we need more scaling. Do you see an effect of IO performance on your ability to use the PS3 code?
A: Given his problem size there wasn't a need to address additional IO.

Q: What is the QS20 in your data?
A: Cell Blade.

Q: What limited the x86 code scaling?
A: The usual... Memory bandwidth. He hasn't had a chance to run on AMD systems with integrated memory controllers.
