Tuesday, March 4, 2008

O&G Workshop: AM Sessions

*Vivek Sarkar*

_Portable Parallel Programming on Multicore Computing_

This talk is based on Vivek's work on Habanero and his class, and builds on the X10 work he did at IBM.

Hardware platforms are proliferating at an increasing rate, so we need portable parallel abstractions that are not hardware targeted. As a result, the scope is quite broad. The research spans everything from parallel applications down to multicore hardware. Vivek wants more industry interaction, especially regarding O&G applications.

Current targets include the usual parallel benchmarks, medical imaging, seismic data, graphics, and computational chemistry. In the spirit of eating the dog food, the Habanero compiler is also an application they are developing within the Habanero framework.

Vivek believes in portable managed runtimes as a result of the compiler and analysis work. This may be controversial. To accommodate the true geek and hardware-targeted code, there is a model for partitioned code.

Early work included running streaming vectors using Java on Cell, though it is really Java on the PPC for control and C on the SPE for compute.

The topology of the heterogeneous processor is in two dimensions. The first is distance from the main CPU - I am assuming he means memory access. Though we think of this as devices, it applies to NUMA as well.
The second is the degree of customization in the accelerator, which is a trade-off between programmability and efficiency. In his slides, he sees CPUs & DSPs as sequential and multicore and accelerators are parallel.

X10's core structures are Data, Places and Processing Elements (http://x10.sf.net). Using Places, programmers create lightweight activities. The message construct is async. X10 recognizes the improvement that results from affinity binding of threads to local data structures. This is not available in most shared memory models, such as OpenMP.
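To make the places-and-activities idea concrete, here is a rough sketch of my own. It is not X10 code and not from Vivek's slides: it uses Python's standard concurrent.futures as an analogy, and the names places, local_sum and parallel_sum are hypothetical. Each "place" owns one partition of the data, and each activity only touches the data in its own place.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for X10 places: each place owns one partition of the data.
places = {0: [1, 2, 3], 1: [4, 5, 6], 2: [7, 8, 9]}


def local_sum(place_id):
    # The activity reads only the data owned by its place (the affinity idea).
    return sum(places[place_id])


def parallel_sum():
    # submit() launches a lightweight task, loosely like an X10 "async".
    # Leaving the "with" block waits for every task, loosely like X10's "finish".
    with ThreadPoolExecutor(max_workers=len(places)) as pool:
        futures = [pool.submit(local_sum, place_id) for place_id in places]
    return sum(f.result() for f in futures)


if __name__ == "__main__":
    print(parallel_sum())
```

The analogy is loose: a Python thread pool does not pin work to memory the way X10 places are meant to, so this only shows the shape of the programming model, not its performance benefit.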

When porting X10 to GPGPU, the localization and affinity of memory will be critical.

So, what about implicit parallelism? The new auto-magic parallelization targets new codes, not dusty decks. Habanero extensions for Java would improve the success of parallelization in the code.

The case study is the Java Grande Forum Benchmarks with a certain subset of language extensions.

The bottom line for Vivek is that multi-core absolutely requires a change in languages, compilers and runtimes. He believes that managed runtimes are here to stay, but didn't go any further on what else will.

Q: What do you think are the minimal language extensions for co-array & UPC?
A: They don't have support for dynamic threading; they hold to the old SPMD model. What is needed is a threaded PGAS model, e.g. there is no facility for async.

Q: What about OpenMPI parameters to have shared & partitioned memory?
A: In general, MPI with threading is quite challenging; for example, identifying which thread is participating in communication.

*John Mellor-Crummey*
_Fine tuning your HPC Investments_

He is working on co-array Fortran in light of Vivek's talk. This is not the subject of his presentation today.

The challenge he is working on is performance across many cores/sockets, like the Cray install at ORNL and Cell B/G.

He states that CPUs are hard to program (yes, CPUs, not accelerators) because the CPU is pipelined, OOO & superscalar with multi-level access & parallelism. His tool, Rice HPCToolkit, needs to correlate measurements with code, be useful on serial & parallel execution, and be intuitive yet detailed enough for compiler folks to use.

John's design principles are a good guideline for anyone looking at performance analysis.

The measurement of performance is increasingly complex due to the increase in layered software design including repeated calls of procedures, the velocity of change in the hardware and the impact of context (which data, etc.) on performance.

*Stephane Bihan of CAPS*
_Addressing Heterogeneity in Manycore Applications_

I've seen Stephane present different versions of this before. The CAPS website also has information. They were also at SC07 in Reno.

They are tackling the gap between changing processing hardware and software development. Their runtime recognizes the distributed, heterogeneous & parallel requirements of future programming.

HMPP Concepts
1. Parallelism needs to be expressed by the programmer through the use of directives similar to OpenMP, but in standard languages (C & Fortran).
2. The runtime needs to deal with resource availability & scheduling.
3. Program for the hardware architecture, insulating the main body of the code & programmer.

The result of using HMPP directives is to enumerate codelets that can be executed synchronously or asynchronously. Directives guide the data transfer & barriers required for synchronization.

CAPS HMPP provides the workflow for x86 code to be linked with the hardware accelerator compiler. The directives of a codelet can be set to describe the codelet behavior when it is called. An interesting example: if the problem is small, run it on the CPU.
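The "small problems stay on the CPU" rule can be sketched roughly as below. This is my own illustration in Python, not HMPP directive syntax. The cutoff CPU_THRESHOLD and the functions scale_on_cpu, scale_on_accelerator, choose_target and scale are all hypothetical, and the "accelerator" path is only a placeholder that computes the same result on the host.

```python
CPU_THRESHOLD = 10_000  # hypothetical cutoff; a real value would come from measurement


def scale_on_cpu(values, factor):
    # Plain host implementation of the codelet.
    return [v * factor for v in values]


def scale_on_accelerator(values, factor):
    # Placeholder for an offloaded codelet; here it just reuses host code.
    return [v * factor for v in values]


def choose_target(problem_size):
    # Small problems are not worth the transfer overhead, so keep them on the CPU.
    return "cpu" if problem_size < CPU_THRESHOLD else "accelerator"


def scale(values, factor):
    target = choose_target(len(values))
    if target == "cpu":
        return target, scale_on_cpu(values, factor)
    return target, scale_on_accelerator(values, factor)


if __name__ == "__main__":
    print(scale([1, 2, 3], 2))
```

The point of keeping both paths behind one call is that the main body of the code never has to know which target was chosen.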

The current version supports C & Fortran, machine-independent directives, and CUDA & SSE. AMD/ATI software is targeted for support soon.

His O&G example is Reverse Time Migration on a GPU, run on a 5-node (dual quad-core) cluster with a Tesla S870 (4 GPUs). Using a 2D test case, they achieved an 8x improvement for compute, but 1/3 of the app is in disk IO.
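A quick back-of-the-envelope check of what that means for the whole application, using Amdahl's law. This is my own arithmetic, not a number from the talk, and it assumes the 1/3 disk IO share is measured on the original run and is not sped up at all. The function name overall_speedup is hypothetical.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    # Amdahl's law: the untouched part keeps its full cost,
    # the accelerated part shrinks by kernel_speedup.
    remaining = (1.0 - accelerated_fraction) + accelerated_fraction / kernel_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Assumption: 2/3 of the original runtime is compute (8x faster), 1/3 is disk IO.
    print(round(overall_speedup(2.0 / 3.0, 8.0), 2))
```

Under that assumption the 8x compute gain works out to roughly 2.4x end to end, and no amount of compute acceleration could push past 3x while the IO share stays fixed.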

Once again, the key optimizations were data alignment for the GPU and overlapping data transfers.

Panel Discussion

Q: (Henri Calandra of Total) Despite good small-scale support, the challenge in the O&G market is scaling the data as well. We can't optimize for accelerators without understanding the overheads of data movement on the application.
A: (JMC) The tools help identify the data movement bottlenecks. In the end, compilers will need to manage data movement. Explicit data movement is a 'lot of work'. Therefore, locality-aware programming languages are going to be critical.

A: (Scott Misner) If you have 1GB plus of GPU memory, that is similar to the granularity of the problems we see on the CPU already.

A: (Guillaume) Compressing the data to optimize the transfer is a technique to be considered.

A: (Vivek) Good news: increasing on-chip bandwidth will be there. IO is the next bottleneck, and OSes don't understand how to manage for that.

Q: (Christoph from T--- O&G developer) I am very excited about the potential, but I noted a thread that performance was achieved with 'careful development of the algorithms.' Our current optimized x86 8-core nodes achieve 80GFlops for $3k.
Can you comment on Performance/Dollar?
How do we better link computer science with the algorithm development since it appears that optimization will require CS knowledge?

A (Samuel Brown) There were examples of good performance/dollar here. As we move forward these will become new engineering practices and more available as a result.

A (Henri) The impact of TCO for scaling to Petaflop is very interesting to his group.

A (JMC): Looking at historical collaborative efforts between national labs and the O&G community may be an important model for optimizing code.

Q (Lee): What's the difference in heat & capital costs?
A: May need to pull in vendors.

Q: What are the advantages of the program analyzer for developing parallel programs?
A: (JMC): ...examples of what they measure... Bottom line, the Rice HPCToolkit allows for blackbox scalability.

Q (Joe): I have complicated unstructured finite element problems, much of which is in Java. Is there work that can help me?
A (Vivek): The model he is pursuing has promise for unstructured meshes, and IBM has published research on this work. You are correct that current accelerators are best suited to regular structured data. There is work on the Cell where each SPE runs a different instruction stream. The GPUs are much more of a challenge for this data. You would need more of a "Thinking Machines" approach there.

Q (Steve Joachems of Acceleware): Where do you see the hurdles from development to production?
A (Guillaume) There are the facility issues of heat, etc. that need to be addressed, including maintainability of the infrastructure.
A (Stephane): We're focused on portability tools.

Q (Jan Odegard): How easy does it need to get to be useful? VB v. Fortran.
A (Scott M): Need to limit to thousands of lines of code, not hundreds of thousands. Much of the old code doesn't need to change.
A (Guillaume) We will not rewrite the whole thing. Recompiling is OK.
A (Samuel Brown): Increased lifespan of the code is more critical than ease.
A (Stephane) It is a complex environment with many layers. The language itself can be in different run times.
A (JMC) It is less the language than it is the portability of the code into the future.
A (Vivek) Wholeheartedly agree there. Making a structure where functions do not have 'side effects' allows more separation of hardware and software.

Q: Any expectation of MTBF for the new hardware?
A: (Scott) That is what we're looking at with our new GPU cluster.
Q: Follow up... what's acceptable in Seismic?
A: PC clusters had a high initial failure rate, but yes it does need to stay up.

Q: What about the data from government labs that determined that GPUs were not cost effective?
A (Scott) We looked at it, but the national lab data is older.
A (Vivek) Look at Roadrunner for future data.

Q (Keith Gray) What is coming from the University to help the industry with this problem?
A (Vivek) Good university and industry relationships are critical. The ones Rice has are positive.
comments.... Demand for smart people

Q (from __ of Altair scheduling app) What's your impression of how long before this is readily in production and available? When do commercial ISVs need to acknowledge and port to it?
A: Despite some comments that we need some now, we have no idea when it needs to be ready.
