Wednesday, March 26, 2008

Observations from HPCC ’08

Newport, RI March 25-26, 2008

This is my first blog entry, ever….really. Before I talk about the conference I’ll share some credentials. I’ve worked in the high performance computing market for almost ten years. Prior to that, I held various positions ranging from commercial software, networking, and semiconductors to workstations. My roles were always in technical marketing or business development, except for a few short stints in sales….which is good for keeping one’s ego in check.

I would say the HPC market is probably the most interesting segment I’ve ever touched. I’m not saying that because it is where I am now….I say that because it is WHY I am where I am now. The people are interesting, their work is fascinating and, with the exception of a few players (and you know who you are), they are very pleasant. It is a familiar crowd. As a side observation, we need to start attracting more young people into applied science. I’ll save comments on that for another entry.

Enough said….on with the conference observations. I’m going to break my discussion into two sections. The first will deal with the general content of the presentations and the second will cover the business challenges of the market as presented by Dr. Stephen Wheat, Sr. Director in Intel’s HPC group.

Interesting science, interesting technologies…

A number of presenters from the national labs and academia presented their work, along with some overview of the science. I found the audience generally attentive. The problem statements were broad enough that listeners could see whether the approach applied to their area of interest. Frankly, some of the science was over my head...but it was still worthwhile.

John Grosh of LLNL had the best quote of the day: “The right answer is always the obvious one, once you find it!” So true in life and in science. Among other things, he described their biggest challenge as the application of computing technology to large scale predictive simulations. As it was explained, massive simulations must present “a result” that includes a quantified margin of error from the simulation. Quantifying the margin of error, or uncertainty, requires lots of data points. This has implications for the size of the data set, which places a load on memory subsystems, file systems, underlying hardware and the management and reliability of a complex computing system.

At the end of his presentation I was struck by the complexity of quantifying margin of error. I see at least three factors that could contribute to the uncertainty:
- Model uncertainty, based on the predictive validity of the model itself.
- Platform uncertainty, associated with the accuracy and predictability of a complex system executing the code.
- Inherent variability, or the range of possible results, in the system being modeled or simulated.

Computational scientists tend to worry about item one, would like to push item two to their systems’ vendors and leave item three to the domain experts. Do you agree? Can you think of other factors contributing to uncertainty?
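
For what it's worth, a back-of-the-envelope way to reason about these three sources, assuming they are independent (my framing, not something presented at the conference), is to combine them in quadrature:

\sigma_{total} = \sqrt{\sigma_{model}^2 + \sigma_{platform}^2 + \sigma_{system}^2}

If any two sources are correlated, a covariance term has to be added, which is part of what makes the quantification so hard.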

Dr. David Shaw talked about their work at D.E. Shaw Research. He mentioned they have about sixty technologists and associated staff at the lab. They collaborate with other researchers, typically people who specialize in experimentation to help validate the computational algorithms. They are looking at the interaction between small organic molecules and proteins within the body. Their efforts are aimed at scientific discovery with “a long time horizon”.

As someone who worked with life science researchers for a time, I found the content of this presentation the most intriguing. Dr. Shaw commented that we might find today's protein folding models to be of dubious value due to the quality of the force field models. D.E. Shaw Research is trying to reduce the time to simulate a millisecond reaction and has structured the problem to try to eliminate some of the deficiencies they see in today's models. In turn, they can reduce the problem to a set of computational ASICs. They have also developed a molecular dynamics code that runs on a machine built from these specialized ASICs. As he described it, the machine is incredibly fast and also incredibly focused on a single task. In other words, it is not your father’s general purpose, time share system….

They obtained the speedups through “the judicious use of arithmetic specialization” and “carefully choreographed communication”. In their system they only move data when they absolutely must. One wonders whether this approach could actually trickle down to commercial computing in some capacity. I think the two are almost mutually exclusive. This would imply that computational acceleration is always specialized and limited to specific markets, making it unfeasible for a vendor to pursue unless you are a small specialty shop. Do you agree?

Dr. Tim Germann of LANL presented simulation work done to inform a federal response plan for pandemic flu. The work was interesting and showed that some of the logical approaches (use of the flu vaccine emergency stockpile) would only delay, not mitigate, the impact of a pandemic. They were able to use demographic, population and social contact data to show that a variety of actions, taken in concert, would reduce the impact of a pandemic. The simulation also identified early indicators that would occur some sixty days before the pandemic was evident.

Truly useful stuff, but how do you take these techniques and use them to model other problems? What is the uncertainty in the simulation? Dr. Don Lamb of the University of Chicago talked about the concept of null physics as it applies to evaluating supernovas, and the same question arose in my mind…is it broadly usable?

I want to know because I’m a business development guy. Broad or narrow applicability doesn’t by itself make the case for vendors to help scientists solve problems, but it does have implications for the way we approach this as a market. This leads to the second section…

The business challenges in the HPC market….

I should point out that I do not, nor have I ever, worked for Intel. My observations are those of an interested market participant and outsider.

I’ve heard Steve Wheat present a number of times. Like all the Intel folks, he gives presentations that are crisp and “safe” for general public viewing. Steve opened with some general observations about the growth in HPC (greater than 30% of the enterprise server market) and made the appropriate comments about the importance of the market to Intel. It was the kind of “head nodding slide” those of us who present routinely use to make sure the audience is on our side. He then launched into an update on work at Intel relevant to HPC. This was good but rather routine; he spent some time discussing the reliability implications of deploying smaller geometries. I think it safe to say this is the kind of conditioning that Intel, AMD and any other processor vendor should be doing to explain to the market that this isn’t easy. The implication for this audience was that “HPC needs to help solve these problems” and it will benefit the entire industry….eventually. He also suggested that the industry think about the implications of multi-core processors on I/O and proposed that I/O be treated as a Grand Challenge problem.

He then spent time talking about the economic challenge of serving the HPC market. My interpretation (not Steve’s words) would characterize the HPC procurement cycle as one that barely allows vendors to recoup R&D costs. Steve pointed out that wins for large deployments typically have terms that penalize failure far more than they reward success. This appears to be a business problem that any sane vendor should avoid. Why pursue a high profile, high risk opportunity with normal return on investment as the best case? While the PR is good, I can think of other ways to garner good press without putting a company at risk. It feels like the HPC market’s answer to subprime mortgages. Do you agree?

Everyone believes that HPC technology eventually has a “trickle down” benefit to the entire market. However, the payoff is muted because margin degrades with volume and over time. I’m also unsure that the original developers ever see the lion’s share of the margin. Mosaic and Netscape come to mind. Can you think of others either making or disputing this point? Do you agree?

Steve closed with some very thought provoking business slides for an HPC conference. His points could be summarized with the question, “Given the needs of the HPC market and the associated economics, what are the dynamics that allow HPC vendors to make active investments to solve these problems?” He made the case that there needs to be an investment model that allows vendors to recoup R&D costs. I think it is an interesting topic and worth further conversation. Please post your views and questions.

Tuesday, March 25, 2008

Ridiculously Easy Group Formation

A correction! I didn't attribute correctly the first time.

"Ridiculously Easy Group Formation" is a phrase originally coined by Seb Paquet in 2002 (1) and greatly enhanced by Clay Shirky's recent work (2). It is also a guiding principle for Lead, Follow...

With that in mind, the next few weeks will feature posts by people who are not me. Hopefully, this will become a regular feature and not even worth mentioning... but certainly worth reading.

Expected writers include
  • Jay O, who drives business development, interesting research and other forward thinking activities in technology.
  • Eric S, who spends his time herding cats toward a common goal of open and interoperable systems at openfpga.org
  • The executive team of Mitrionics. A smart, driven group of people whose ideas are rooted in practical delivery.

good reading...
(1) Seb's Original Blog
(2) Here Comes Everybody

Monday, March 24, 2008

IDC measures the Data Explosion

I've been asked for backup data on the size and velocity of data growth. Here's one pointer...

IDC has published a 2008 update to their Information Growth Forecast. A couple of interesting tidbits to get you to follow the link.
  • More storage in the home than anywhere else, but enterprises still carry the responsibility for ensuring the data is available. (Think of your photos on Picasa or Flickr or...)
  • A huge driver of data growth is replicated copies. Though the example they use is email, I've also seen this done with Business Intelligence data. Replicated data is there to be analyzed!
  • They used "Data Tsunami" in 2007
It's no surprise EMC sponsored this work. http://www.emc.com/leadership/digital-universe/expanding-digital-universe.htm

Friday, March 21, 2008

Belfast Reconfigurable Conference

I'm still searching for a blogger for the Belfast conference...

http://www.mrsc2008.org/

Who's going? Anyone want to contribute? Just drop me a line or add to the comments!

Tuesday, March 11, 2008

O&G HPC Workshop slides

They are posted:

http://citi2.rice.edu/OG-HPC-WS/program2.htm

Wrap up from Rice's HPC for Oil & Gas

I learned something live blogging at the O&G HPC event at Rice... You can't simultaneously report and analyze. I owe myself (and others) a short reflection on the event. So here goes...

The current accelerated computing work is going full force. The options exist and the barrier to experimentation is very low. This is very good for Accelerated Computing.

However, I don't think the motivations are pure. The chief reason for working on silicon outside the mainstream x86 is fear of many-core. There is an expectation that x86 complexity is dramatically increasing while performance stays stagnant, and the cost of overcoming x86 many-core complexity is unknown.

The presentations are by smart, motivated people who are exploring the alternatives. What they have in common seemed to be the following:
  1. Current scaling options are running out. All presenters showed scaling on dual and quad core x86 CPUs, and the curves are all asymptotic.
  2. The compute is data driven. That is to say, there is a lot of data to be worked upon - and it is increasing.
  3. Achieving greater performance by scaling current x86 cores is going to be more expensive than historical trends suggest. The complexity of application management is emerging as both a motivator and a barrier to Accelerated Computing.
  4. They need to touch the compute kernels anyway. If they are going to rewrite the compute intensive sections, why not try the code on a different piece of silicon or ISA? They have been moving away from hardware specific code anyway.
The optimized point on the curve for the O&G group was compute kernel code that looked like human readable C (or Fortran), integrated with an x86 cluster, with a >10x return over single thread performance on a current quad core.

Mainstreaming Accelerated Computing will not happen without addressing the complexity of systems and application management. I don't know who is really working on this... Do you?

Tuesday, March 4, 2008

O&G HPC: Afternoon on Storage

The afternoon session is on storage growth. In the Oil & Gas market it quickly becomes about really big chunks of data.

Presenters are:
Jeff Denby of DataDirect Networks
Per Brashers of EMC NAS SW Engineering
Larry Jones of Panasas
Sean Cochrane of Sun Microsystems
Tom Reed of SGI
Dan Lee of IDC HPTC


Jeff talked about...
* SATA is 'the way to go' in geophysics
> Design has some issues: lower reliability, including silent data corruption
* Infiniband is becoming popular
* For DDN, a Petabyte shipment is not unusual

The state of the art is 6GB/s for network attached storage.


Per talked about the architecture of the pNFS NAS stack, how it differs from traditional NAS architectures, and how it differs from EMC's MPFS.
* Advantage of pNFS is separate route for metadata from data
* MPFS adds awareness of caches in client and storage array to assist in throughput.
* Increase in concurrency due to byte-level locking rather than file-level locking
* IB is about 650MB/s; quad Ethernet is 300 to 400MB/s

Larry talked about...
* Panasas uses an object based iSCSI SAN model using "Direct Flow Protocol"
* Parallel I/O for windows apps is underway
* A Landmark paper said Direct Flow improves application performance by greatly reducing CPU time spent waiting on data.
* Reliability is important
* Also support pNFS (NFS 4.1)
* Targeting 12GB/s


Sean talked about...

* He leads HPC storage out of the CTO office
* He presented the Sun best practices
* He described two bottlenecks: metadata, and connecting cluster storage with archive. Sun uses dedicated boxes (Thumpers & 4100s respectively) at those pain points.

Tom talked about...
* Humans: Interactive serial, costly interrupts, open loop, non-deterministic, expensive
* "Time Slicing and Atrophy make a bad lifestyle cocktail"
* Current storage solutions can't serve everyone

Dan says...
* HPC server revenue grew to over $11B in 2007 - It's mostly clusters
* YOY growth of 15% - double digit over last 5 years
* Oil & Gas reached $650m in 2007
* HPC storage exceeded $3.7B in 2006 at a faster growth than HPC servers
* pNFS is getting close to general availability.
* It eliminates custom clients
* Key driver is to have several threads access the same file system concurrently
* Block and Object versions may not be included in final spec.


===
Q (Keith Gray): What's the largest number of concurrent clients & largest file?
A (Jeff of DDN): ORNL, LANL, 25k clients. Performance throughput for GPFS at Livermore is 100s of GB/s
A (Per of EMC) 12k clients
A (Larry of Panasas) 8k clients at Intel and growing. LANL with 150GB/s
A (Sean of Sun): 25k clients on Lustre, about 120GB/s at CEA (same as DDN); 10s of petabytes on SANFS (tape)
A (SGI): depends upon file systems

Q: What's the next technology

A: (Sean of Sun) Likely flash for heavy metadata access
Q: Availability?
A: (Sean of Sun) Something from Sun later on this year. Can't comment on sizes, stay tuned.
A: (Per of EMC) EMC already has flash as option. Good for small random IO ops. Good for metadata, but not for throughput

Q: What happened with next gen, like Cray's old fast RAM disks?

Comment from the audience... Look at fabric cache from SciCortex
A: (DDN): The duty cycle of flash for lots of writes is about 2 years, so it doesn't map well to what we have. DDN is waiting on phase change memory to emerge.
A: (Panasas): The storage models are closer to workflow models, not raw data transfer. That usage model works well with fast cache.

Q: Are the storage vendors *really* going to get behind pNFS and drive it?

A (Sun): Yes, and on Lustre and ZFS backend filesystems
A (Panasas): There are pluggable modules in the standard which allow customization.
A (EMC): Yes, and our shipping code should be very close to the final spec.

O&G Workshop: AM Sessions

*Vivek Sarkar*

_Portable Parallel Programming on Multicore Computing_

This is based upon Vivek's work on Habanero, his class & built upon the X10 work he did at IBM.

Hardware platforms are proliferating at an increasing rate, so we need portable parallel abstractions that are not hardware targeted. As a result, the scope is quite broad: the research spans from parallel applications down to multicore hardware. Vivek wants more industry interaction, especially regarding O&G applications.

Current targets include the usual parallel benchmarks, medical imaging, seismic data, graphics, and computational chemistry. In the spirit of eating the dog food, the Habanero compiler is also an application they are developing within the Habanero framework.

Vivek believes in portable managed runtimes as a result of the compiler and analysis. This may be controversial. To accommodate the true geek and hardware targeted code, there is a model for partitioned code.

Early work included running streaming vectors using Java on Cell, though it is really Java on the PPC for control and C on the SPE for compute.

The topology of the heterogeneous processor is in two dimensions. The first is distance from the main CPU - I am assuming he means memory access. Though we think of this as devices, it applies to NUMA as well.
The second is the degree of customization in the accelerator, which is a trade-off of programmability against efficiency. In his slides, he sees CPUs & DSPs as sequential, while multicore and accelerators are parallel.

X10 structures are Data, Places and Processing Elements. http://x10.sf.net Using Places, programmers create lightweight activities. The message construct is async. X10 recognizes the improvement that results from affinity binding of threads to local data structures. This is not available in most shared memory models, such as OpenMP.

When porting X10 to GPGPU, the localization and affinity of memory will be critical.

So, what about implicit parallelism via new auto-magic parallelization, targeting new codes rather than dusty decks? Habanero extensions for Java would improve the success of parallelizing the code.

The case study is the Java Grande Forum benchmarks with a certain subset of the language extensions.

Bottom line for Vivek: multi-core absolutely requires a change in languages, compilers and runtimes. He believes that managed runtimes are here to stay, but didn't go any further on what else will change.

Q: What do you think are the minimal language extensions for co-array & UPC?
A: They don't have support for dynamic threading; they hold to the old SPMD model. What needs to emerge is a threaded PGAS model - e.g., there is no facility for async.

Q: What about OpenMPI parameters to have shared & partitioned memory?
A: In general MPI with Threading is quite challenging, like identifying which thread is participating in communication.



=====
*John Mellor-Crummey*
_Fine tuning your HPC Investments_

He is working on Co-array Fortran, mentioned in light of Vivek's talk, though it is not the subject of his presentation today.

The challenge he is working on is performance across many cores/sockets, like the Cray install at ORNL and Cell and Blue Gene systems.



He states that CPUs are hard to program (yes, CPUs, not accelerators) because the CPU is pipelined, OOO & superscalar with multi-level memory access & parallelism. The Rice HPCToolkit needs to be correlated with code, useful on serial & parallel execution, and intuitive yet detailed enough for compiler folks to use.

John's design principles are a good guideline for anyone looking at performance analysis.


The measurement of performance is increasingly complex due to the increase in layered software design including repeated calls of procedures, the velocity of change in the hardware and the impact of context (which data, etc.) on performance.

=====
*Stephane Bihan of CAPS *
_Addressing Heterogeneity in Manycore Applications_

I've seen Stephane present different versions of this before. The CAPS website also has information. They were also at SC07 in Reno.

They are tackling the gap between changing processing hardware and software development. Their runtime recognizes the distributed, heterogeneous & parallel requirements of future programming.

HMPP Concepts
1. Parallelism needs to be expressed by the programmer through the use of directives similar to OpenMP, but in standard languages (C & Fortran)
2. The runtime needs to deal with resource availability & scheduling
3. Program for the hardware architecture while insulating the main body of the code & the programmer.

The result of using HMPP directives is to enumerate codelets that can be executed synchronously or asynchronously. Directives guide the data transfers & barriers required for synchronization.

CAPS HMPP provides the workflow for x86 code to be linked with the hardware accelerator compiler. The directives on a codelet can be set to describe its behavior when it is called. The interesting example: if the problem is small, run it on the CPU.
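
As an illustration of that last point, here is a minimal sketch of the dispatch idea in plain C. This is my own hedged example, not HMPP directive syntax or a CAPS API; the function names and the threshold are invented for illustration.

```c
#include <stddef.h>

#define SMALL_PROBLEM_LIMIT (256 * 1024)   /* assumed tuning threshold */

/* Host (CPU) version of the kernel. */
static void axpy_cpu(float *y, const float *x, float a, size_t n)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}

/* Stand-in for the accelerator codelet; in HMPP this path would be
 * generated for the target hardware rather than hand-written. */
static void axpy_accel(float *y, const float *x, float a, size_t n)
{
    axpy_cpu(y, x, a, n);
}

/* Dispatch wrapper: small problems stay on the CPU because the
 * host-to-accelerator transfer overhead would dominate. */
void axpy(float *y, const float *x, float a, size_t n)
{
    if (n < SMALL_PROBLEM_LIMIT)
        axpy_cpu(y, x, a, n);
    else
        axpy_accel(y, x, a, n);
}
```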

The current version supports C & Fortran, machine independent directives, and CUDA & SSE. AMD/ATI software is targeted for support soon.

His O&G example is Reverse Time Migration on GPUs on a 5-node cluster (dual quad core) with a Tesla S870 (4 GPUs). Using a 2D test case, they achieved an 8x improvement for compute, but 1/3 of the app is in disk IO.

Once again, the key optimizations were data alignment for the GPU and overlapping data transfers.


====
Panel Discussion

Q: (Henri Calendara of Total) Despite good small scale support, the challenge in the O&G market is scaling the data as well. We can't optimize for accelerators without understanding the overheads of data movement in the application.
A: (JMC) The tools help understand the data movement and identify the data movement bottlenecks. In the end, compilers will need to manage data movement; explicit data movement is a 'lot of work.' Therefore, locality-aware programming languages are going to be critical.

A: (Scott Misner) If you have 1GB plus of GPU memory, that is similar to the granularity of the problems we already see on the CPU.

A: (Guillaume) Compressing the data to optimize the transfer is a technique to be considered.

A: (Vivek) Good news: increasing on-chip bandwidth will be there. IO is the next bottleneck, and OSes don't understand how to manage for that.

Q: (Christoph from T---, O&G developer) I am very excited about the potential, but I noted a thread that performance was achieved with 'careful development of the algorithms.' Our current optimized x86 8-core nodes achieve 80GFlops for $3k.
Can you comment on Performance/Dollar?
How do we better link computer science with the algorithm development since it appears that optimization will require CS knowledge?

A (Samuel Brown) There were examples of good performance/dollar here. As we move forward these will become new engineering practices and more available as a result.

A (Henri) The impact of TCO for scaling to Petaflop is very interesting to his group

A (JMC): Looking at historical collaborative efforts between national labs and the O&G community may be an important model for optimizing code.

Q (Lee): What's the difference in heat & capital costs?
A: We may need to pull in vendors for that.

Q: What are the advantages of the performance analyzer for developing parallel programs?
A: (JMC) [examples of what they measure]... Bottom line: the Rice HPCToolkit allows for black-box scalability analysis.

Q (Joe): I have complicated unstructured finite element problems, much of which is in Java. Is there work that can help me?
A (Vivek): The model he is pursuing has promise for unstructured meshes, and IBM has published research on that work. You are correct that current accelerators are best suited to regular, structured data. There is work on the Cell to run different instruction streams on each SPE. The GPUs are much more of a challenge for this kind of data. You would need more of a "Thinking Machines" approach there.



Q (Steve Joachems of Acceleware): Where do you see the hurdles in going from development to production?
A (Guillaume) There are the facility issues of heat, etc. that need to be addressed, including maintainability of the infrastructure.
A (Stephane): We're focused on portability tools.

Q (Jan Orgegard): How easy does it need to get to be useful... VB vs. Fortran?
A (Scott M): Need to limit it to thousands of lines of code, not hundreds of thousands. Much of the old code doesn't need to change.
A (Guillaume) We will not rewrite the whole thing. Recompiling is OK.
A (Samuel Brown): Increased lifespan of the code is more critical than ease.
A (Stephane) It is a complex environment with many layers. The language itself can be run on different runtimes.
A (JMC) It is less about the language than about the portability of the code into the future.
A (Vivek) Wholeheartedly agree there. Structuring the code so functions do not have 'side effects' allows more separation of hardware and software.

Q: Any expectation of MTBF for the new hardware?
A: (Scott) That is what we're looking at with our new GPU cluster.
Q: Follow up... what's acceptable in Seismic?
A: PC clusters had a high initial failure rate, but yes it does need to stay up.

Q: What about the data from government labs that determined that GPUs were not cost effective?
A (Scott) We looked at it, but the national lab data is older.
A (Vivek) Look at Roadrunner for future data.

Q (Keith Gray) What is coming from the universities to help the industry with this problem?
A (Vivek) Good university and industry relationships are critical. The ones Rice has are positive.
Comments... demand for smart people.

Q (from __ of Altair, scheduling app) What's your impression of how long before this is readily available and in production? When do commercial ISVs need to acknowledge and port to it?
A: Despite some comments of "we need some now," we have no idea when it needs to be ready.

O&G HPC: Guillaume Thomas-Colligan

+++ CGG Veritas accelerator experience

They have done work on a few platforms.

FPGAs
The hardware platform was a Cray XT4. They used an FPGA library from ENS Lyon plus Xilinx tools to 'design the hardware.' It was an intensive process, with iterations between Matlab & other tools to work out timing and other issues.

Pros
  • Good Density
  • Liked the closely coupled nature of working on regular Opteron systems

Cons
  • Complex to develop
  • Limited bandwidth hampered performance
  • Scaling efficiency wasn't that good.
Cell
Using the QS20 blade (2x3.2GHz Cell BE)

Porting process:
  1. Port from Linux to Power... endian and Power compiler issues
  2. Rearrange the code to make use of SPEs
  3. Write the vector code for one SPE. It is all vectors, so it is a new project, but it is not a major hurdle.
  4. Optimize for multiple SPEs. This gets complicated

The code is not human readable... Perhaps Geek readable, but certainly not for the average human.

Results need to be measured at the application level: the kernels are up to 20x faster on the SPEs, but code on the PPC is 3x slower. Guillaume wants a better host processor on the Cell.
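
To see why the application-level number is the one that matters, a simple Amdahl-style estimate helps. If a fraction f of the runtime is in the accelerated kernels (20x faster on the SPEs) and the remainder runs 3x slower on the PPC, the overall speedup is roughly

S = \frac{1}{3(1 - f) + f/20}

With f = 0.9 (an assumed fraction, purely for illustration) that gives S of roughly 2.9x, a long way from the 20x kernel number.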

Pros
  • High volume CPU due to PS3
  • Performance on codes that fit
  • memory bandwidth
Cons
  • not a general purpose CPU
  • complicated code
  • 1GB was certainly not enough

GPGPU
They are now working with Nvidia's CUDA, which he feels is significantly ahead in the market; Nvidia has been a good partner. CUDA was relatively straightforward to learn and use.
One of the problems is moving the data from host to GPU, which can be optimized with non-cached memory to get 3.2GB/s. Working around the PCIe bottleneck is required for application performance, and it works only for compute intensive kernels where you can leave the data on the GPU as much as possible. The programmer needs to manage memory access constraints to get performance. This is analogous to user-managed CPU cache, and a bit daunting.
There are unique memory features in the GPU that need to be understood; textures, floats & boundaries are areas he highlighted.
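
To make the data-residency point concrete, here is a minimal CUDA sketch of the two host-side habits described above: allocating page-locked ("non-cached") host memory for the PCIe transfer, and keeping the working set on the GPU across many kernel launches. This is my illustration, not CGG's code; the kernel, sizes and iteration count are placeholders.

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

__global__ void process(float *d_data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        d_data[i] *= 2.0f;                     /* stand-in for the real kernel */
}

int main(void)
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    float *h_data, *d_data;
    cudaMallocHost((void **)&h_data, bytes);   /* page-locked host buffer for faster PCIe copies */
    cudaMalloc((void **)&d_data, bytes);

    for (int i = 0; i < n; i++)
        h_data[i] = 1.0f;

    /* Copy once, then keep the data resident on the GPU across many kernel calls. */
    cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);
    for (int iter = 0; iter < 100; iter++)
        process<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

    printf("h_data[0] = %f\n", h_data[0]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```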

So, what about the application performance? CGG ran a wave equation model. It required 'deep re-engineering of the code' but returned 15x performance. The interesting note is that the performance depends on the data set, a reflection of how the optimizations depend on the memory architecture.

Pros
  • Performance
  • C-like code
  • An add-in to regular systems
  • fast evolution of CUDA (now on mac too!)

Cons
  • Tricky to get full performance
  • Only for most intensive applications
  • limited amount of memory
  • limited debug & profiling tools
===
Bottom line

====

There are programming issues. Hardware specific codes require algorithms to be modified, which requires understanding the hardware. There are too many codes to port this way.
It is not always rewarding, but it is getting better. CUDA was far and away their best experience, and don't forget that the CPUs are getting faster too. Whatever happens, parallelism is a part of your future.

Q: Size of the data structure and scaling
A: The code scales well to dual core CPU, but quad cores bottleneck on memory bandwidth.

O&G HPC Workshop: Samuel Brown

Sam Brown of University of Utah
"Optimal Thread Parallelism, etc."

His background includes a lot of work on FPGAs & other devices for accelerated computing.
His core requirement is that it needs to be integrated into the CPU environment.

Current accelerators have lots in common
  • Specialized HW
  • New programming languages
  • Rearrange your data

This is also true of the vector ops in current CPUs, where data arrangement and compiler flags will change the code behavior. No matter what happens, he expects some kind of heterogeneous silicon on the die; in his opinion, the performance balance can't be reached with homogeneous cores.

Most of what he does is generate synthetic data as quickly as possible. He has a number of 8-core systems in his lab today, which allows for the easy approach of separate MPI processes. However, Matlab & other shared workloads sneak onto the general purpose cores.

However, he does have a PS3, which he doesn't have to share (and which doesn't run Matlab). So what he needed was a code base that worked on both the PS3 and the 8-core machines. Most of the math is PDEs (Partial Differential Equations).

He walked through details which I can't capture here, and which you can google. Sam highlighted:
  • SPEs have only 256KB local memory, so you need to manage your data well
  • Explicit data movement via DMA
  • Local SPE is like a user-controlled cache, which can be good
  • Out of core methodology. You asynchronously refresh the data during the compute

His approach was Control & Compute threads. The Control thread handles initialization, file IO & thread sync, while the Compute threads stay simple. He did rearrange his data to be much more vector friendly using intrinsics, and partitioned it to map to his command & data movement.
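
The control/compute split with asynchronous refresh is essentially a double-buffering pattern. Below is a generic sketch of that pattern using POSIX threads and semaphores; it is my illustration of the idea only, not Sam's Cell code (which uses SPE DMA rather than a loader thread), and the block sizes are invented.

```c
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>
#include <string.h>

#define NBLOCKS 64
#define BLOCK   (64 * 1024)

static float buf[2][BLOCK];                    /* two buffers: one being filled, one in use */
static sem_t filled[2], consumed[2];

static void load_block(float *dst, int block)  /* stands in for file IO or a DMA transfer */
{
    memset(dst, 0, BLOCK * sizeof(float));
    dst[0] = (float)block;
}

static void *control_thread(void *arg)
{
    (void)arg;
    for (int b = 0; b < NBLOCKS; b++) {
        int slot = b & 1;
        sem_wait(&consumed[slot]);             /* wait until compute is done with this slot */
        load_block(buf[slot], b);
        sem_post(&filled[slot]);               /* hand the refreshed block to compute */
    }
    return NULL;
}

int main(void)
{
    pthread_t ctl;
    for (int s = 0; s < 2; s++) {
        sem_init(&filled[s], 0, 0);
        sem_init(&consumed[s], 0, 1);          /* both slots start out available for filling */
    }
    pthread_create(&ctl, NULL, control_thread, NULL);

    double sum = 0.0;
    for (int b = 0; b < NBLOCKS; b++) {
        int slot = b & 1;
        sem_wait(&filled[slot]);               /* block until the next chunk of data is ready */
        for (int i = 0; i < BLOCK; i++)        /* "compute" on the current block */
            sum += buf[slot][i];
        sem_post(&consumed[slot]);             /* let the control thread refill this slot */
    }
    pthread_join(ctl, NULL);
    printf("sum = %f\n", sum);
    return 0;
}
```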



Results on the PS3 and the x86 8-core machine showed the scaling of the heterogeneous processor vs. the 8-core machine. On a single thread the performance is similar, but at 6 worker threads the PS3 is 2x faster. Scaling beyond 4 x86 Intel cores was poor. (At 4 threads, the PS3 had better than 1.5x the performance.)

What to remember is that "memory access is the critical factor. The effort is applicable to both the Cell & the x86."
Sam had more specific points. I will see if I can find them and add to the comments.

However, this is not readily readable code for others... Which is a problem for sustaining the work.

Q: Your results are almost exactly what we got! However, we need more scaling. Do you see an effect of IO performance on your ability to use the PS3 code?
A: Given his problem size there wasn't a need to address additional IO.

Q: What is the QS20 in your data
A: Cell Blade.

Q: What limited the x86 code scaling...
A: The usual... Memory bandwidth. He hasn't had a chance to run on AMD systems with integrated memory controllers.

O&G HPC Workshop Scott Morton

Let me start with what is going on here...
Over 220 registered attendees, though that does include a strong showing of vendors. Looks like a room of 190 plus.
Seven corporate sponsors. Three of the supernationals are sponsors. Key names include Keith Gray, Henri Calandara & Ch-- Wong.


Scott Morton of Hess, on his experience with seismic algorithms on GPUs

What he wants
  • 10x price/performance (commensurate with the improvement from SC to x86 clusters)
  • Commodity volumes
  • significant parallelism
  • "easy to program"
It is clear he is a technologist, since the only business requirements are a 10x price/performance and low cost to port.

They've looked at a variety of hardware platforms
  • DSPs
    • Commodity. Not easy to move the algorithms, especially for large data sets
  • FPGAs
    • Worked with SRC on a wave equation algorithm in 2003
    • 10x performance and 10x cost
    • Programming is graphical, which doesn't map to skills and tools.
    • I wonder what the cost/performance number looks like 5 years later since the FPGA vendors claim much higher than Moore's Law improvement in gates per $.
  • Cell
    • Tracking it. He believes it is commodity, but hasn't dived into it yet.


Languages he mentioned working with included OpenGL, Peakstream, Nvidia Cg & Brook (Brook was a positive experience, but under-supported; he believes they switched to Peakstream development).

Hess did work with Peakstream and delivered a 5 to 10x speedup in 2D, but only 2 to 3x in 3D for Kirchhoff. Once Google bought them, they disappeared. Now he's working with Nvidia Cg.

Comments on CUDA
  • Relatively easy to program, but hard to optimize
    • Two day course was useful
  • Used on Kirchhoff, reverse-time & wave equation algorithms.
    • Showing ongoing work... This is not final!
The Kirchhoff results came from a CSM intern over the summer - not a GPU programmer, a geophysicist. He did 8 major iterations of the code. The second version was equal to the CPU baseline. The 8th version is 27x the CPU baseline. Every version was an improvement in memory management. The noted changes certainly read as fairly minor, e.g. removing an if and a for.

Optimization focused on minimizing the data movement between the GPU and CPU, with an emphasis on doing most of the compute on the GPU.
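
The slides didn't include code, but for flavor, here is a generic CUDA illustration (mine, not the intern's) of one kind of small change that pays off on a GPU: replacing a divergent branch inside a kernel with predicated arithmetic so that all threads in a warp follow the same path.

```cuda
/* Before: threads within a warp take different paths on the inner if,
 * which serializes execution of the warp. */
__global__ void accumulate_branch(float *out, const float *in, const float *w, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        if (w[i] > 0.0f)
            out[i] += w[i] * in[i];
    }
}

/* After: the same result computed with a select instead of a divergent
 * branch; every thread executes the same instructions. */
__global__ void accumulate_select(float *out, const float *in, const float *w, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float wi = w[i];
        out[i] += (wi > 0.0f ? wi : 0.0f) * in[i];
    }
}
```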

Reverse-time algo is dominated by 3D FFTs. The results are 5x over single CPU, which is 20% faster than a dual node quad core. Notably (and this is really interesting...) 1 quad core and 2 quad cores deliver the same performance.

Wave Equation Migration is an implicit solver. Prototype performance is 48x a single 3.6GHz Xeon. The CPU performance flattens at 4 quad core CPUs - 8 is no better.
(These problems are all 32-bit and just over 1GB in size.)

They have ordered a 32-node system. The theoretical performance of 32 nodes of GPUs & dual quad-cores should outperform their current 4k Xeon cluster.

Q: What is the hardware platform?
A: External Tesla boxes connected via PCIe cables.

Q: Do you really think it will meet the price/performance targets?
A: Yes. but we still need to develop the production code.

Q: Experience with heat density
A: New hardware, so there are no problems so far :)

Q: Are slides going to be available?
A: Yes (I'll add them in the comments)

Q: Why the asymptotic performance on CPU scaling
A: Memory bandwidth

Q: What about IEEE 754 & double precision
A: We don't care about that right now.

O&G HPC Workshop

I am at Rice for the Oil & Gas HPC Workshop 2008

My first attempt at this new Live Blogging...

It is also being webcast.

Monday, March 3, 2008

Clay's New Book

This was one of the key motivations to create this blog...

At dinner, one of the first things Clay asked me was whether my work was being influenced by John Carmack's positions on the future of useful computing. If Clay, who is up to his neck in thinking about people on the web, connects Accelerated Computing with Carmack's comments on game physics, PPUs & GPUs, then this is more than just something being considered in random labs.

+++ The good stuff

I attended Clay's public presentation at the launch of his new book at the Berkman last week. I have great notes thanks to Zinta, but she also pointed out that David Weinberger was live blogging. Zinta's notes are excellent. David's capturing of the event is different and has comments from others. Not many, but the dialog is valuable.

Inadvertently and with no malice, I was convinced over dinner by the 'Fellows of Berkman' to recognize the validity of the term cyberspace... a phrase I dislike due to historical overuse. However, what else are you going to call something that embraces Twitter, Skype, Flickr and email? It's more than 'the Internet.'

Here are the links

Clay Shirky & others on his excellent new book.


Weinberger on Shirky in the evening: The Book.
And in the afternoon on protest culture.

Of course, the talks by Clay will be posted to the Berkman Center site so you can have the original, rather than the distillate.

The Mission

++ The Mission

I am not a fan of blogs. In general, they are self-indulgent musings in search of an editor. I can say this from personal experience, though I think I successfully purged those old postings from the net. So, why have I started another one?

I attend a number of conferences and technology related events for work and personal growth. They fall into two categories: those that aren't generally covered by some flavor of cybernews and those that are. This blog captures my notes when they are the only ones available, and will point to others' coverage when it exists.

The areas which I expect to have here are:
Mostly, Accelerated Computing and all that it means (not well covered)
Often, Emergent organizations in cyberspace (covered very well)
When I can, Market forces that affect, or effect either of the above (lots of opinion, but really mixed)

A couple very important caveats:
Rule 1: I don't mention my company. This is my blog.
Though many of the notes here may be related to my work, I do not speak for my employer here. If you want to know my company's opinion on something, go Google it. If you want to hear what I say about my employer's technology, market position, etc., go Google that too. It isn't here.

Rule 2: Nothing is Confidential. This is a public forum.
You will never find anything here that was not revealed in a public forum. I have other places to put notes that aren't meant for all to see.

Rule 3: This is temporary.
I am undertaking this project because I need to organize better in a period of high-velocity. This is not a commitment to be a pundit, alpha geek or journalist. I know several of those.

Rule 4: Lead, follow or get out of the way!
I can't stand blogs that rehash good original work into pabulum. I'd rather link than write.

Rule 5: Cluetrain says...
I recognize I am letting something into the wild. Comments, assistance and C&D requests will be equally invited and acknowledged.

AND, if someone can point me to a better place to cover these subjects... I will.

Thanks,

doug