Tuesday, March 4, 2008

O&G HPC Workshop: Scott Morton

Let me start with what is going on here...
Over 220 registered attendees, though that does include a strong showing of vendors. Looks like a room of 190 plus.
Seven corporate sponsors. Three of the supernationals are sponsors. Key names include Keith Gray, Henri Calandra & Ch-- Wong.

Scott Morton of Hess, on his experience with seismic algorithms on GPUs

What he wants
  • 10x price performance (commensurate with the improvement from SC to x86 clusters)
  • Commodity volumes
  • significant parallelism
  • "easy to program"
It is clear he is a technologist, since the only business requirements are 10x price/performance and a low cost to port.

They've looked at a variety of hardware platforms
  • DSPs
    • Commodity. Not easy to move the algorithms, especially for large data sets
  • FPGAs
    • Worked with SRC on a wave equation algo in 2003
    • 10x performance and 10x cost
    • Programming is graphical, which doesn't map to skills and tools.
    • I wonder what the cost/performance number looks like 5 years later since the FPGA vendors claim much higher than Moore's Law improvement in gates per $.
  • Cell
    • Tracking. Believes it is commodity, but they haven't dived into it yet.

Languages he mentioned working with included OpenGL, Peakstream, Nvidia Cg & Brook. (Brook was a positive experience, but under-supported. He believes they switched to Peakstream development.)

Hess did work with Peakstream and delivered a 5 to 10x speedup in 2D, but only 2 to 3x in 3D, for Kirchhoff. Once Google bought them, they disappeared. Now he's working with Nvidia Cg.

Comments on CUDA
  • Relatively easy to program, but hard to optimize
    • Two day course was useful
  • Used on Kirchhoff, Reverse-time & Wave Equation algorithms.
    • Showing ongoing work... This is not final!
Kirchhoff results came from a CSM intern working over the summer. He is a geophysicist, not a GPU programmer. He did 8 major iterations of the code. The second version was equal to the CPU baseline. The 8th version is 27x the CPU baseline. Every version was an improvement in memory management. The noted changes certainly read as fairly minor, e.g. Removed an if and for returned

Optimization focused on minimizing data movement between the GPU and CPU, with an emphasis on keeping most of the compute on the GPU.
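To make the data-movement point concrete for myself, here is a toy cost model in Python. Nothing in it comes from the talk: the function names and every timing are made-up assumptions, chosen only to show why copying data across the bus once, rather than on every step, can matter so much.

```python
# Toy cost model (hypothetical numbers, NOT from the talk): compare copying
# data across the CPU-GPU link on every step with copying it once and
# keeping it resident on the GPU.

def time_copy_every_step(steps, transfer_s, compute_s):
    """Total seconds when data goes to the GPU and back on every step."""
    return steps * (2 * transfer_s + compute_s)


def time_copy_once(steps, transfer_s, compute_s):
    """Total seconds when data is uploaded once and downloaded once."""
    return 2 * transfer_s + steps * compute_s


if __name__ == "__main__":
    steps = 100        # assumed number of GPU steps
    transfer_s = 0.5   # assumed seconds per one-way copy
    compute_s = 0.25   # assumed seconds of GPU compute per step

    naive = time_copy_every_step(steps, transfer_s, compute_s)
    resident = time_copy_once(steps, transfer_s, compute_s)
    print(f"copy every step: {naive:.1f} s")
    print(f"copy once:       {resident:.1f} s")
    print(f"ratio:           {naive / resident:.1f}x")
```

With these assumed numbers the GPU does exactly the same arithmetic in both cases; only the copying changes.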

Reverse-time algo is dominated by 3D FFTs. The results are 5x over a single CPU, which is 20% faster than a dual node quad core. Notably (and this is really interesting...) 1 quad core and 2 quad cores deliver the same performance.

Wave Equation Migration is an implicit solver. Prototype performance is 48x a single 3.6GHz Xeon. The core performance flattens at 4 quad core CPUs - 8 is no better.
(These problems are all 32-bit and just over 1GB in size.)
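The flat CPU scaling is what a simple roofline-style model would predict if the cores share one memory system (memory bandwidth is the reason given in the Q&A). The sketch below is my own illustration, not anything shown in the talk, and every number in it (per-core GFLOPS, bandwidth, flops per byte) is a hypothetical value picked so the curve flattens at 4 cores.

```python
# Roofline-style sketch (all numbers hypothetical, NOT from the talk):
# attainable throughput is capped either by compute, which grows with the
# core count, or by memory bandwidth, which is assumed to be shared.

def attainable_gflops(cores, gflops_per_core, mem_bw_gbs, flops_per_byte):
    """Return the lower of the compute ceiling and the bandwidth ceiling."""
    compute_bound = cores * gflops_per_core
    bandwidth_bound = mem_bw_gbs * flops_per_byte
    return min(compute_bound, bandwidth_bound)


if __name__ == "__main__":
    for cores in (1, 2, 4, 8):
        perf = attainable_gflops(
            cores,
            gflops_per_core=4.0,  # assumed peak per core
            mem_bw_gbs=8.0,       # assumed shared memory bandwidth, GB/s
            flops_per_byte=2.0,   # assumed arithmetic intensity of the kernel
        )
        print(f"{cores} cores -> {perf:.0f} GFLOPS attainable")
```

In this model, going from 4 to 8 cores buys nothing because the bandwidth ceiling is already reached; a kernel with higher arithmetic intensity would keep scaling for longer.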

They have ordered a 32 node system. In theory, 32 nodes of GPU & dual quad-core should outperform the current 4k Xeon cluster.
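I don't know how Hess arrived at that estimate, but a back-of-the-envelope check is easy to write down. The sketch below assumes that "4k" means 4,000 CPUs comparable to the single-CPU baselines, that the per-GPU speedups quoted above (27x, 5x, 48x) carry over to production runs, and that the host CPUs in the new nodes contribute nothing. All three are my assumptions, not statements from the talk.

```python
# Back-of-the-envelope sketch under stated assumptions (not Hess's method):
# how many GPUs per node would 32 nodes need to match a 4,000-CPU cluster,
# given a per-GPU speedup over one CPU?

def gpus_per_node_to_match(cluster_cpus, nodes, speedup_per_gpu):
    """GPUs needed in each node to equal the cluster's aggregate throughput."""
    return cluster_cpus / (nodes * speedup_per_gpu)


if __name__ == "__main__":
    cluster_cpus = 4000  # assumption: "4k Xeon cluster" read as 4,000 CPUs
    nodes = 32           # from the talk
    for label, speedup in (("27x", 27.0), ("5x", 5.0), ("48x", 48.0)):
        need = gpus_per_node_to_match(cluster_cpus, nodes, speedup)
        print(f"{label} speedup: about {need:.1f} GPUs per node to break even")
```

Under these assumptions the answer swings a lot with the algorithm: a 48x kernel breaks even with fewer than 3 GPUs per node, while a 5x kernel would need 25.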

Q: What is the hardware platform?
A: The external Tesla boxes connected via PCIe cables.

Q: Do you really think it will meet price/performance targets?
A: Yes, but we still need to develop the production code.

Q: Experience with heat density?
A: New hardware, so there are no problems so far :)

Q: Are slides going to be available?
A: Yes (I'll add them in the comments)

Q: Why the asymptotic performance on CPU scaling?
A: Memory bandwidth

Q: What about IEEE 754 & double precision?
A: We don't care about that right now.

1 comment:

JayO said...

It is interesting to me that neither presenter uses Matlab or some other scripting tool to structure problems. Is that because they perceive a performance hit? Are they programmers and don't need to script?