Monday, October 13, 2008
I started this blog because I have been working on the cutting edge of faster computing. Thanks to my position, resources and connections I regularly attended events with many people smarter and more expert in the field of computing than I am. Others in this field could and have benefited from my exposure. Clearly, this hasn't been happening lately.
Monday, August 18, 2008
Co-CPUs, and FPGAs in particular, are excellent crypto crackers. When I started thinking about using a co-processor as a security solution, I never thought of these applications. Crypto cracking is one of the few examples I know where the co-CPU is overwhelmingly faster than the x86 core. It helps that crypto cracking is really *just* an algorithm, while most of us use more complex applications.
It's worth reading Amir's summary and especially the video from Shmoo.
PS: I was off the conference circuit for a while. For those of you who know my personal life, it was an excellent extended summer vacation...
Wednesday, July 9, 2008
1. HPCSW Presentations are on line. There is a lot of discussion about how hard it is to program more cores. I still can't get the statement that several people made there, which was "I really want flat coherent memory." In another post, I need to highlight some of those requests and responses.
2. RSSI is this week... I'm not there, but another of our intrepid correspondents will be.
Monday, May 12, 2008
- Mitrionics with not-so-new news, pushing the Bio stack they showcased just about one year ago at ISC. They demoed it (again) at BioIT World on April 28th. They are also promoting the new personal SDK. I don't have experience with it, but would love to hear a story or two.... For the Bio demo news go here. For more on the SDK go to their home page, http://www.mitrionics.com/.
- Acceleware does EDA for the Koreans. In an example of working really hard at it, Acceleware has partnered with SPEAG, a specialized simulator for EM (electromagnetic radiation), to sell a GPGPU version of their SW as a turnkey solution. Korean handset radiation simulation is not the only use. Boston Scientific also uses the solution. They claim a 25x speedup.
- Angstrom, a Boston area system builder, has announced formal support for GPGPU with both a hardware platform and an accelerated GPGPU library. The library is a plug-in for ATLAS, a well-known linear algebra library. So this isn't really a solution, but it's a step in the right direction.
Speaking of BioIT World... Joe Landman was there. His post from the floor is here. It doesn't cover Accelerated Computing, but he deserves the plug.
Monday, April 28, 2008
I didn't see this covered too broadly, but it is a notable event:
French hybrid supercomputer to exceed 300 TFLOPS by 2009
French supercomputing institute Grand Equipement National de Calcul Intensif (GENCI) along with former nuclear research institute CEA (Commissariat à l'Energie Atomique) has asked Bull to make the first hybrid PC cluster in Europe. The new machine will be housed south of Paris in Bruyères le Châtel, a data centre also used by military institute CEA-DAM. The Bull Novascale series machine will be composed of 1068 cluster nodes, each consisting of eight Intel processor cores and an additional 48 GPU application accelerators with 512 cores each. The supercomputer will also have 25 TB of RAM and 1 PB of hard drive storage.
Friday, April 25, 2008
Now John E. West has added to the discussion instigated by Wheat. "If high performance computing wants to continue to be a distinguishable market space, it needs its own research and development activities." So, when is that funding coming?
The HPC Wire article offers several suggestions for fixing the procurement process, along with a call by Dan Reed, now of MSFT, for a coordinated national HPC R&D effort.
Fundamentally, the bottom line is the economics aren't working right now.
Ed Turkel's statement about tweaks on commodity systems is naive about the real economic costs of delivering commodity components. By definition commodity systems are mature markets with brutal margins. When you focus on holding the margin even the most modest tweak in the components is expensive. To support tweaks, you need modular designs to isolate the HPC embellishments from your mainstream delivery, or you need an industry that thinks these will be table stakes in the near future.
We're going to see how expensive and sustainable tweaks in silicon are in the GPGPU market. AMD & NVIDIA are delivering GPUs with functionality that no video display system will ever use. It's a real live experiment in action.
I just hope someone is watching... Oh yeah, we are :)
Dear President, I want America to spend more money on really big computers... See Michael Feldman's HPC Wire editorial on that one. I won't touch that discussion on these pages. At least not yet.
Tuesday, April 8, 2008
I go to a number of these and there seem to be more every year to attend. This post is about the HPC Science Week sponsored by several government agencies, plus some vendors.
However, as I prepped this I realized I also missed another one: the Council on Competitiveness hosted an HPC Application Summit the same week. Here's coverage from HPC Wire. My highlight from Michael's write-up is
"There was also extensive discussion of how best to conceive a software framework for integrating codes that must work together to run multiphysics models. Codes written in-house have to work with codes provided by independent software vendors and open-source codes being built by far-flung communities. A software framework could be the solution."
This was also a theme at HPCSW. Some people make rash statements about 1000-core chips and others talk about new programming languages, but everyone agrees that humans cannot program at the anticipated level of future complexity. Since we can't go faster, we're going to have to have more, and there is a limit to how much more a person can manage.
There are a few themes that are becoming inescapable.
There will be experts, near experts and the rest of us. Experts will want to be as close to the silicon as possible, while the rest of us are most interested in it working. This is going to require layers.
Data Movement is Expensive
We have plenty of computation, but moving the data to it is hard. We need methods that minimize the cost of data movement. Asynchronous threads, hardware synchronization and message queues are in vogue.
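As a minimal sketch of the message-queue idea (my own illustration, not something shown at HPCSW), the Python below overlaps a simulated data transfer with computation by placing a bounded queue between two threads. The function names, the chunk size, and the `time.sleep` stand-in for transfer latency are all assumptions.

```python
import queue
import threading
import time

def load_chunks(n_chunks, out_q, delay=0.01):
    """Producer: simulate moving data chunks toward the compute unit."""
    for i in range(n_chunks):
        time.sleep(delay)                      # stand-in for transfer latency (assumed)
        out_q.put(list(range(i * 4, i * 4 + 4)))
    out_q.put(None)                            # sentinel: no more data

def process_chunks(in_q):
    """Consumer: compute on each chunk as soon as it arrives."""
    total = 0
    while True:
        chunk = in_q.get()
        if chunk is None:
            break
        total += sum(x * x for x in chunk)     # toy compute kernel
    return total

def run_pipeline(n_chunks=8, depth=2):
    """Overlap loading and computing using a bounded queue of the given depth."""
    q = queue.Queue(maxsize=depth)             # bounded, so the loader cannot run far ahead
    loader = threading.Thread(target=load_chunks, args=(n_chunks, q))
    loader.start()
    result = process_chunks(q)
    loader.join()
    return result
```

The bounded queue is the design choice worth noting: it caps how much data is in flight, so the compute side never waits on more than one transfer and memory use stays fixed.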
Is it a Framework, or a Library?
The data is also more complicated, so a library doesn't seem to be sufficient. Programmers need constructs that handle data parameters and other odd bits. Libraries take rigor. Frameworks are application modules with architecture and external hooks. (See Guido's blog & comments for more on this.) Accelerated computing will take frameworks.
More User Control
Only the programmer knows... According to the pundits, future operating systems will allow the user to schedule threads, manage data access patterns, etc.
More on all these in the near future...
Friday, April 4, 2008
ECIT, Belfast, Northern Ireland
April 1-3, 2008
In short, the conference consensus is that accelerators are going to be an integral part of the future computing paradigm. This isn't surprising, given the nature of the conference, but rather than being speculative statements, there was increasing demonstration of community acceptance of heterogeneous computing as the next wave of innovation and performance.
Several presentations were made by vendors, SGI, Mitrion and Clearspeed, demonstrating cases where they have had success in proving out performance in real applications. Mitrion with BLAST, Clearspeed with everything from quantum chemistry to financial modeling (but all floating point intensive) and SGI partnering with both of these vendors. SGI's presentation provided several interesting perspectives.
- It took a big case (proving 70 FPGAs working together) to begin to draw out interest by many companies in the technology.
- Now people are approaching SGI on what one can do for them with FPGAs and accelerators. This isn't surprising either, because in this demonstration, SGI stretched the size of the box that was constraining interest in FPGAs.
- SGI has developed several examples using Quick Assist, but unfortunately, the details of the implementation and interface were not available.
- It was important to note that Quick Assist focuses on single node acceleration, which is potentially limiting.
Mitrion presented on their language and BLAST example. A primary take home point is that the parallel programming mindset needs to be developed earlier for scientists and programmers alike. Mitrion C helps enforce this mindset. Of course, Mitrion also emphasized the language's portability across parallel processor types.
Clearspeed was very interesting because of the speedup and density of performance they are able to achieve. Admittedly SIMD in nature and focused on floating point, the accelerator has a valuable niche, but isn't universal. It seems that Clearspeed is the CM5 coming around with updated technology. A notable point from Clearspeed was a call for common standards for acceleration, something akin to OpenMP but not OpenMP. Another notable point about Clearspeed was the availability of codes that had the Clearspeed implementations.
Several other presentations were given by Alan George regarding CHREC, Craig Steffan from NCSA, and Olaf Stoorasli from ORNL. Alan talked primarily about CHREC's effort to move the thought process up to the strategy for computing the solution, instead of low-level optimization on a specific processor. This is a good direction because it provides a more common ground for domain scientists to interact with the application performance experts.
Olaf talked about his work parallelizing Smith-Waterman across many, many FPGAs; with enough FPGAs, he achieved a 1000x speedup over a single CPU. This is another example of big cases providing visibility, and it shows that FPGA computing has a ways to go before it hits a limit.
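For readers who haven't met Smith-Waterman, here is a small serial sketch of the scoring recurrence in Python. It is not Olaf's FPGA implementation; the scoring values (match +2, mismatch -1, linear gap -1) are assumptions chosen for illustration, and it returns only the best local alignment score, not the alignment itself.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Return the best local alignment score between sequences a and b.

    Uses a linear gap penalty and keeps only two rows of the score
    matrix in memory. Scoring parameters are illustrative assumptions.
    """
    cols = len(b) + 1
    prev = [0] * cols                     # row i-1 of the score matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols                 # row i; column 0 stays 0
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap            # gap in b
            left = curr[j - 1] + gap      # gap in a
            curr[j] = max(0, diag, up, left)   # 0 restarts the local alignment
            best = max(best, curr[j])
        prev = curr
    return best
```

Each cell depends only on its left, upper, and upper-left neighbors, so all cells on one anti-diagonal can be computed at the same time. That independence is what makes the algorithm a natural fit for wide parallel hardware.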
Craig Steffan provided a good overview of NCSA's mission, which is to bring new computational capabilities to scientists. He provided good input on the steps needed for a successful deployment of new computing technologies, including:
- Making it easy to keep heterogeneous components (object files and bitstreams) together
- Making decisions at run time on how the application problem will be solved
- Making documentation available and consistent
- Providing access to the latest versions (even pre-release), which is useful when trying to work around compiler bugs present in early releases
Mike Giles from Oxford presented his experiences in financial modeling using GPGPUs. Showed good success. Commented that standards for GPGPUs will be many years off, but that OpenFPGA is a good sign from the RC community that standards are emerging. Mike also identified that having examples, tools and libraries, student projects and more conferences will be important to getting started in new technologies. For those experienced with parallel programming, it's a 2-4 week learning curve to use CUDA.
Greg Petersen (UT) talked about cyber chemistry virtual center between UT and UIUC. Question for chemists is how to use these machines. Whole research front is on this aspect in order to get to petascale systems. Talked about kernel for QMC applications including general interpolation framework. Looked at efforts using Monte Carlo stochastic methods with random numbers and a Markov process. Significant work on using numerical analysis underlying the chemistry results.
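To make "Monte Carlo with random numbers and a Markov process" concrete, here is a generic Metropolis sampler in Python. It is a textbook sketch, not the QMC kernel from the talk: the target distribution (a 1D Gaussian, p(x) proportional to exp(-x^2/2)), the step size, and the function name are my assumptions. The exact answer for the average of x^2 under this distribution is 1.

```python
import math
import random

def metropolis_mean_x2(n_steps=50000, step=1.0, seed=0):
    """Estimate <x^2> under p(x) ~ exp(-x^2 / 2) using a Metropolis Markov chain."""
    rng = random.Random(seed)
    x = 0.0
    total = 0.0
    for _ in range(n_steps):
        proposal = x + rng.uniform(-step, step)        # symmetric random-walk move
        # log of p(proposal) / p(x); accept with probability min(1, ratio)
        log_ratio = (x * x - proposal * proposal) / 2.0
        if log_ratio >= 0 or rng.random() < math.exp(log_ratio):
            x = proposal
        total += x * x                                 # rejected moves recount the old x
    return total / n_steps
```

The chain only ever needs the ratio of probabilities between the current and proposed states, which is why the method works even when the normalizing constant is unknown.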
Overall there are many applications using heterogeneous acceleration, many in life sciences ranging from MD, to drug docking and Monte Carlo techniques, and nearly all referencing image processing and financial applications that performed well with accelerators. There was overlap in the life sciences space, with nearly every accelerator type demonstrating acceleration for at least one application in this space.
Another significant time block was for the OpenFPGA forum. A show of hands indicated that only about 20% of the audience was aware of OpenFPGA, so I spent 30 minutes on an OpenFPGA overview before moving to the discussion of the general API. Part of the presentation included getting an interest level in assuring open interoperability for accelerators. There were no responses in the negative, many in the affirmative and some undecided.
The GenAPI discussion went pretty well. In short, there were no showstoppers indicating a wrong direction, but more discussion on technical details of argument specification, what is included and what is not specified. There was a strong interest in having more direction for new areas such as inter-FPGA communication, inter-node accelerator communication, etc., although all agreed it was too early to standardize because even the basics had yet to become standard.
There were some comments from those with a lot of history in HPC that the GenAPI looked similar to the model used by Floating Point Systems. There was general consensus that a first standard that is simple is best, allowing common use and then looking to emerging patterns as the basis for future standards. It appeared the application community would accept the standard if it were available.
Summarizing, the conference provided a good overview of work in moving computational science applications to new accelerator technologies, which are becoming the new mainstream way to get higher computing performance. The tools have matured enough that applications are being more broadly developed, and are beginning to be deployed.
Tuesday, April 1, 2008
Mitrion presented on their Mitrion-C programming language and illustrated the changes needed to about 1300 lines of code to make BLAST accelerate with Mitrion-C. 1300 lines out of a million plus lines of code isn't a bad percentage for porting a large application.
My notes are...
First, you need to articulate the underlying ideas, trends or technology that drive your competitive differentiation.
Second, understand how your culture and mission match the vision of the institution.
Third, dedicate an impassioned champion for this work who has access to high-level executives in your company.
Fourth, enable your champion to reach out well beyond the corporate silos with internal tools, support and leverage.
Claude from Schlumberger says:
- You get ideas & IP, not prototypes
- Interact frequently
- Bring lots of people to visit (your academic partners can show off)
- Learn from 'demo or die'
- Be prepared to constantly advocate the relationship
- Look to your future (based upon the Alchemy of Growth)
- Co-creation as a principle
- Learn, learn, learn...especially from the students
Scratch... tile based programming for kids. The cool item was the ability to send it also to your mobile phone. I've enabled my kids already.
Sticky Notes... that are smart. Linking physical notes & books with computer storage. I couldn't find the video online, but the research summary is online.
Sociable Robots for Weight Loss... A partner to help you achieve your long-term goals. Can't convince a friend to get you to the gym? Ask Autom to help. There is a video of this one. And, it will be commercialized.
Detecting Group Dynamics.... (I think there was a catchy title for this as well, but I was blogging) Using sensors to tell you if you are running Good Meetings or Bad Meetings. The cool part was the simple feedback model. I swear 90% of everything is the UI and again, they used a simple display on the cell phone.
Cognitive Machines... Searching for the highlights of the game. "Show me a video clip of Ortiz hitting a home run." Current state of the art is using the announcer, who is often filling in the dull spots. Therefore, you need to get a machine to "look" at the video to understand the patterns within it. THIS IS COMPUTATIONALLY INTENSIVE. (4th floor)
Information Spaces... What's this virtual world stuff anyway.... I liked this one because of my personal interest in how to use online more effectively for the things we really do as humans in meetings: social clues, consensus building, recognition of social clues.
Active Sensor Nodes... Small, fast, real-time data on movement via wearable sensors. This is what Jacoby means when he says "Show me & I can learn"
Common Sense Toolkit... Yes, you too can have Common Sense via C Code. Available online as a repository of simple statements. http://commons.media.mit.edu/ & a library of semantic analysis called DIVISI. (a potentially cool little library.)
Tangible Media... You live in the real world, why can't your computer interfaces act more like real items (paintbrushes, clothes and more) This lab is also doing the Gesture Object Interfaces, which is a really cool idea. Throwing your phone on the table is much better than voice control
Zero Energy Home... a project of Changing Places... And they are really building this house in Maine. Nice ideas that can be used today. I need to re-read this stuff.
Smart Cities and Roboscooter... No American wants this, but everyone else does! What Dean Kamen thought we should target with the Segway, but he hasn't made the leap.
Wednesday, March 26, 2008
This is my first blog entry, ever….really. Before I talk about the conference I’ll share some credentials. I’ve worked in the high performance computing market for almost ten years. Prior to that, I held various positions ranging from commercial software, networking and semiconductors to workstations. My roles were always in technical marketing or business development except a few short stints in sales….which is good for keeping one’s ego in check.
I would say the HPC market is probably the most interesting segment I’ve ever touched. I’m not saying that because it is where I am now….I say that because it is WHY I am where I am now. The people are interesting, their work is fascinating and with the exception of a few players (and you know who you are) they are very pleasant. It is a familiar crowd. A side observation is that we need to start attracting more young people into applied science. I’ll save comments on that for another entry.
Enough said….on with the conference observations. I’m going to break my discussion into two sections. The first will deal with the general content of the presentations and the second will cover the business challenges of the market as presented by Dr. Stephen Wheat, Sr. Director in Intel’s HPC group.
Interesting science, interesting technologies…
A number of presenters from the national labs and academia presented work they were doing with some overview of the science. I found the audience generally attentive. Presentation problem statements were broad enough that listeners could see if the approach applied to their area of interest. Frankly for me some of the science was over my head...but it was still worthwhile.
John Grosh, LLNL had the best quote of the day. It was, “The right answer is always the obvious one, once you find it!” So true in life and in science. Among other things, he described their biggest challenge as the application of computing technology to large scale predictive simulations. As it was explained, massive simulations must present “a result” and include a quantified margin of error from the simulation. Quantifying the margin of error or uncertainty requires lots of data points. This has implications for the size of the data set, which places a load on memory subsystems, file systems, underlying hardware and the management and reliability of a complex computing system.
At the end of his presentation I was struck by the complexity of quantifying margin of error. I see at least three factors that could contribute to the uncertainty. They are:
- Model uncertainty, based on the predictive validity of the model itself.
- Platform uncertainty associated with the accuracy and predictability of a complex system to execute code
- Uncertainty or a range of possible results that occur in the system being modeled or simulated
Computational scientists tend to worry about item one, would like to push item two to their systems’ vendors and leave item three to the domain experts. Do you agree? Can you think of other factors contributing to uncertainty?
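To illustrate why quantifying a margin of error "requires lots of data points," here is a toy ensemble sketch in Python. It is not LLNL's method. `toy_model` is a hypothetical stand-in for an expensive simulation, the input distribution is an assumption, and the sketch only captures uncertainty in an input parameter, not model or platform uncertainty.

```python
import random
import statistics

def toy_model(growth_rate, steps=10, start=1.0):
    """Hypothetical stand-in for an expensive simulation."""
    value = start
    for _ in range(steps):
        value *= 1.0 + growth_rate
    return value

def ensemble_estimate(n_runs=2000, mean_rate=0.05, rate_sd=0.01, seed=1):
    """Run the model many times with perturbed input; return (mean, 95% margin)."""
    rng = random.Random(seed)
    results = [toy_model(rng.gauss(mean_rate, rate_sd)) for _ in range(n_runs)]
    mean = statistics.mean(results)
    # Margin of error on the mean, assuming roughly normal sampling error.
    margin = 1.96 * statistics.stdev(results) / n_runs ** 0.5
    return mean, margin
```

The margin shrinks only with the square root of the number of runs, so cutting the error by 10x takes roughly 100x the simulations. That is where the pressure on memory, file systems and reliability comes from.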
Dr. David Shaw talked about their work at D.E. Shaw Research. He mentioned they have about sixty technologists and associated staff at the lab. They collaborate with other researchers, typically people who specialize in experimentation to help validate the computational algorithms. They are looking at the interaction between small organic molecules and proteins within the body. Their efforts are aimed at scientific discovery with “a long time horizon”.
As someone who worked with life science researchers for a time, I found the content of this presentation the most intriguing. Dr. Shaw commented we might find that today's protein folding models may be of dubious value due to the quality of force field models. D.E. Shaw Research is trying to reduce the time to simulate a millisecond reaction and have structured the problem to try to eliminate some of the deficiencies as they see them in today's models. In turn they can reduce the problem to a set of computational ASICs. They have also developed a molecular dynamic programming code that will run on a machine with these specialized ASICs. As he described it, the machine is incredibly fast and also incredibly focused on a single task. In other words, it is not your father’s general purpose, time share system….
They obtained the speedups through “the judicious use of arithmetic specialization” and “carefully choreographed communication”. In their system they only move data when they absolutely must. One wonders whether this approach could actually trickle down to commercial computing in some capacity. I think they are almost mutually exclusive. This would imply that accelerated computing is always specialized and limited to specific markets, making it unfeasible to pursue as a vendor unless you are a small specialty shop. Do you agree?
Dr. Tim Germann, LANL presented work on a simulation they did to provide content for a federal response plan to a pandemic flu infection. The work was interesting and showed that some of the logical approaches (use of the flu vaccine emergency stockpile) would only delay but not mitigate the impact of a pandemic. They were able to use demographic, population and social contact data to show that a variety of actions, taken in concert, would reduce the impact of a pandemic. The simulation also provided the early indicators that would occur some sixty days before the pandemic was evident.
Truly useful stuff but how do you take these techniques and use them to model other problems? What is the uncertainty in the simulation? Dr. Don Lamb, University of Chicago, talked about the concept of null physics as it applies to evaluating supernovas and the same question arose in my mind…is it broadly useable?
I want to know because I’m a business development guy. The implications of broad or narrow applicability do not make the case for vendors to help scientists solve problems. They do have implications for the way we approach this as a market. This leads to the second section…
The business challenges in the HPC market….
I should point out that I do not, nor have I ever worked for Intel. My observations are those of an interested market participant and outsider.
I’ve heard Steve Wheat present a number of times. As with all the Intel folks, his presentations are crisp and “safe” for general public viewing. Steve opened with some general observations about the growth in HPC (greater than 30% of the enterprise server market) and made the appropriate comments about the importance of the market to Intel. It was the kind of “head nodding slide” those of us who present routinely use to make sure the audience is on our side. He then launched into an update on work at Intel that was relevant to HPC. This was good but rather routine, spending some time discussing the implications of reliability when deploying smaller geometries. I think it safe to say, this is the kind of conditioning that Intel, AMD and any other processor vendor should be doing to explain to the market that this isn’t easy. The implication for this audience was that “HPC needs to help solve these problems” and it will benefit the entire industry….eventually. He also suggested that the industry think about the implications of multi-core processors on I/O and proposed that I/O be treated as a Grand Challenge problem.
He then spent time talking about the economic challenge of serving the HPC market. My interpretation (not Steve’s words) would characterize the HPC procurement cycle as one that barely allows vendors to recoup R&D costs. Steve pointed out that wins for large deployments typically have terms that penalize failure far more than rewarding success. This appears to be a business problem that any sane vendor should avoid. Why pursue a high profile, high risk opportunity with normal return on investment as the best case? While the PR is good, I can think of other ways to garner good press without putting a company at risk. It feels like the HPC market’s answer to subprime mortgages. Do you agree?
Everyone believes that HPC technology eventually has a “trickle down” benefit to the entire market. However, the payoff is muted because margin degrades with volume and over time. I’m also unsure that the original developers ever see the lion’s share of the margin. Mosaic and Netscape come to mind. Can you think of others either making or disputing this point? Do you agree?
Steve closed with some very thought provoking business slides for an HPC conference. His points could be summarized with the question, “given the needs of the HPC market and the associated economics, what are the dynamics that allow HPC vendors to make active investments to solve these problems”. He makes a case that there needs to be an investment and model that allows vendors to recoup R & D costs. I think it is an interesting topic and worth further conversation. Please post your views and questions.
Tuesday, March 25, 2008
Ridiculously Easy Group Formation is a phrase originally coined by Seb Paquet in 2002(1), but greatly enhanced by Clay Shirky's recent work(2). It is also a guiding principle for Lead, Follow...
With that in mind, the next few weeks will feature posts by people who are not me. Hopefully, this will be a regular feature and not even worth mentioning... But certainly worth reading.
Expected writers include
- Jay O, who drives business development, interesting research and other forward thinking activities in technology.
- Eric S, who spends his time herding cats toward a common goal of open and interoperable systems at openfpga.org
- The executive team of Mitrionics. A smart, driven group of people whose ideas are rooted in practical delivery.
(1) Seb's Original Blog
(2) Here Comes Everybody
Monday, March 24, 2008
IDC has published a 2008 update to their Information Growth Forecast. A couple interesting tidbits to get you to follow the link.
- More storage in the home than anywhere else, but enterprises still carry the responsibility for ensuring the data is available. (Think your photos on Picasa or Flickr or...)
- A huge driver of data growth is replicated copies. Though the example they use is email, I've also seen this done with Business Intelligence data. Replicated data is there to be analyzed!
- They used "Data Tsunami" in 2007
Friday, March 21, 2008
Tuesday, March 11, 2008
The current accelerated computing work is going full force. The options exist and the barrier to experimentation is very low. This is very good for Accelerated Computing.
However, I don't think the motivations are pure. The chief reason for working on silicon outside the mainstream x86 is fear of many-core. There is an expectation that x86 complexity is also dramatically increasing, while the performance is stagnant. The cost of overcoming x86 many-core complexity is unknown.
The presentations are by smart motivated people who are exploring the alternatives. What they have in common seemed to be the following:
- Current scaling options are running out. All presenters provided scale up on dual and quad core x86 CPUs. They are all asymptotic.
- The compute is data driven. That is to say there is a lot of data to be worked upon - and it is increasing.
- Achieving performance at greater scale on current x86 cores is going to be more expensive than historical trends suggest. Complexity of application management is emerging as both a motivator and a barrier to Accelerated Computing.
- They need to touch the compute kernels anyway. If they are going to rewrite the compute intensive sections, why not try code on a different piece of silicon or ISA. They have been moving away from hardware specific code anyway.
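One common way to reason about scaling curves that flatten out is Amdahl's law. The presenters did not necessarily frame their results this way, and the 10% serial fraction used in the note below is an assumed number, not one from the talks.

```python
def amdahl_speedup(serial_fraction, cores):
    """Speedup on `cores` cores when `serial_fraction` of the work cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)

def scaling_table(serial_fraction, core_counts=(1, 2, 4, 8, 16)):
    """Return (cores, speedup) pairs for a quick look at the curve."""
    return [(n, amdahl_speedup(serial_fraction, n)) for n in core_counts]
```

With 10% serial work, four cores give roughly a 3.1x speedup, and no number of cores can ever exceed 10x. Rewriting the compute kernel attacks the parallel part; the serial remainder sets the ceiling.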
Mainstreaming Accelerated Computing will not happen without addressing the complexity of systems and application management. I don't know who is really working on this... Do you?
Tuesday, March 4, 2008
Jeff Denby of DataDirect Networks
Per Brashers of EMC NAS SW Engineering
Larry Jones of Panasas
Sean Cochrane of Sun Microsystems
Tom Reed of SGI
Dan Lee of IDC HPTC
Jeff talked about...
* SATA is 'the way to go' in geophysics
> Design has some issues. Lower reliability including silent data corruption
* Infiniband is becoming popular
* For DDN, a Petabyte shipment is not unusual
The state of the art is 6GB/s for network attached storage.
Per talked about how the architecture of the pNFS NAS stack differs from traditional NAS architectures and from EMC's MPFS.
* Advantage of pNFS is separate route for metadata from data
* MPFS adds awareness of caches in client and storage array to assist in throughput.
* Increase in concurrency due to byte level locking, not file
* IB is about 650MB/s; quad Ethernet is 300 to 400MB/s
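To put the quoted rates in perspective, here is a back-of-the-envelope helper in Python. It assumes decimal units (1 MB = 10^6 bytes), that the quoted figures are bytes per second rather than bits, and that the rate is sustained for the whole transfer. None of that was stated in the talks.

```python
MB, GB, TB, PB = 10 ** 6, 10 ** 9, 10 ** 12, 10 ** 15   # decimal units (assumed)

def transfer_seconds(size_bytes, rate_bytes_per_s):
    """Idealized transfer time: no protocol overhead, no contention."""
    return size_bytes / rate_bytes_per_s

def transfer_hours(size_bytes, rate_bytes_per_s):
    """Same as transfer_seconds, expressed in hours."""
    return transfer_seconds(size_bytes, rate_bytes_per_s) / 3600.0
```

Under those assumptions, one terabyte over a 650MB/s link takes about 26 minutes, and a petabyte at 6GB/s takes about 46 hours.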
Larry talked about...
* Panasas uses an object based iSCSI SAN model using "Direct Flow Protocol"
* Parallel I/O for windows apps is underway
* Landmark paper said Direct Flow improves application performance by greatly reducing CPU wait on data.
* Reliability is important
* Also support pNFS (NFS 4.1)
* Targeting 12GB/s
Sean talked about...
* He leads the HPC storage out of the CTO office
* He presented the Sun best practices
* Describing two bottlenecks: MetaData & connecting cluster storage with archive. Sun uses dedicated boxes (Thumpers & 4100s respectively) at those pain points.
Tom talked about...
* Humans: Interactive Serial, costly interrupts, open loop, non-deterministic, expensive
* "Time Slicing and Atrophy make a bad lifestyle cocktail"
* Current storage solutions can't serve everyone
* HPC server revenue grew to over $11B in 2007 - It's mostly clusters
* YOY growth of 15% - double digit over last 5 years
* Oil & Gas reached $650m in 2007
* HPC storage exceeded $3.7B in 2006 at a faster growth than HPC servers
* pNFS is getting close to general availability.
* It eliminates custom clients
* Key driver is to have several threads access the same file system concurrently
* Block and Object versions may not be included in final spec.
Q (Keith Gray): What's the largest number of concurrent clients & largest file?
A (Jeff of DDN): ORNL, LANL, 25k clients. Performance throughput for GPFS at Livermore is 100s of GB/s
A (Per of EMC) 12k clients
A (Larry of Panasas) 8k clients at Intel and growing. LANL with 150GB/s
A (Sean of Sun): 25k of Lustre, about 120GB/s at CEA (same as DDN); 10s of petabytes on SANFS (tape)
A (SGI): depends upon file systems
Q: What's the next technology?
A: (Sean of Sun) Likely flash for heavy metadata access
A: (Sean of Sun) Something from Sun later on this year. Can't comment on sizes, stay tuned.
A: (Per of EMC) EMC already has flash as option. Good for small random IO ops. Good for metadata, but not for throughput
Q: What happened with next gen, like Cray's old fast RAM disks?
Comment from the audience... Look at fabric cache from SiCortex
A: (DDN): The duty cycle of flash for lots of writes is about 2 years, so it doesn't map well to what we have. DDN is waiting on phase change memory to emerge.
Q: Are the storage vendors *really* going to get behind pNFS and drive it?
A (Sun): Yes, on Lustre and ZFS backend filesystems
A (Panasas): There are pluggable modules in the standard, which do allow customization.
A (EMC): Yes, and our shipping code should be very close to final spec.
_Portable Parallel Programming on Multicore Computing_
This is based upon Vivek's work on Habanero, his class & built upon the X10 work he did at IBM.
Hardware platforms are proliferating at an increasing rate, so we need portable parallel abstractions that are not hardware targeted. As a result, the scope is quite broad. The research boundaries are parallel applications to multicore hardware. Vivek wants more industry interaction, especially regarding O&G applications.
Current targets include the usual parallel benchmarks, medical imaging, seismic data, graphics, and computational chemistry. In the spirit of eating the dog food, the Habanero compiler is also an application they are developing within the Habanero framework.
Vivek believes in portable managed runtimes as a result of the compiler and analysis. This may be controversial. To accommodate the true geek and hardware targeted code, there is a model for partitioned code.
Early work included running streaming vectors using Java on Cell, though it is really Java on the PPC for control and C on the SPE for compute.
The topology of the heterogeneous processor is in two dimensions. The first is distance from the main CPU - I am assuming he means memory access. Though we think of this as devices, it applies to NUMA as well.
The second is the degree of customization in the accelerator, which is a trade-off of programmability against efficiency. In his slides, he sees CPUs & DSPs as sequential, while multicore and accelerators are parallel.
X10 structures are Data, Places and Processing Elements. http://x10.sf.net Using Places, programmers create lightweight activities. The message construct is async. X10 recognizes the improvement that results from affinity binding of threads to local data structures. This is not available in most shared memory models, such as OpenMP.
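To make the Places idea concrete, here is a minimal Python sketch. It is not X10 and not Habanero: the `Place` class and the `spawn` and `finish` helpers are names I made up, loosely modeled on the idea of launching a lightweight activity at the place that owns the data and then waiting for all activities to complete. Each place gets a single dedicated worker so the work always runs next to its own partition.

```python
from concurrent.futures import ThreadPoolExecutor


class Place:
    """Hypothetical stand-in for a place: a data partition plus its own worker."""

    def __init__(self, data):
        self.data = data
        # One worker per place, so activities always run "next to" this data.
        self._pool = ThreadPoolExecutor(max_workers=1)

    def spawn(self, fn):
        """Start an asynchronous activity on this place's local data."""
        return self._pool.submit(fn, self.data)

    def close(self):
        self._pool.shutdown()


def finish(futures):
    """Wait for every spawned activity and collect the results in order."""
    return [f.result() for f in futures]


values = list(range(8))
# Partition the data across two places; each place owns its slice.
places = [Place(values[i::2]) for i in range(2)]

# One activity per place, each touching only local data.
partial = finish([p.spawn(sum) for p in places])
total = sum(partial)

for p in places:
    p.close()

print(partial, total)
```

The point of the sketch is only the shape: work is sent to where the data lives, rather than data being pulled to wherever a thread happens to run.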
When porting X10 to GPGPU, the localization and affinity of memory will be critical.
So, what about implicit parallelism via the new auto-magic parallelization with a target of new codes, not dusty decks? Habanero extensions for Java would improve the success of parallelization in the code.
The case study is the Java Grande Forum Benchmarks with a certain subset of language extensions.
Bottom line for Vivek is multi-core absolutely requires a change in language, compilers and runtimes. He believes that managed runtimes are here to stay, but didn't go any further on what else will.
Q: What do you think are the minimal language extensions for co-array & UPC?
A: They don't have support for dynamic threading; they hold to the old SPMD model. What needs to expand is a threaded PGAS model, e.g. there is no facility for async.
Q: What about OpenMPI parameters to have shared & partitioned memory?
A: In general MPI with Threading is quite challenging, like identifying which thread is participating in communication.
_Fine tuning your HPC Investments_
He is working on co-array Fortran in light of Vivek's talk. This is not the subject of his presentation today.
The challenge he is working on is performance across many cores/sockets, like the Cray install at ORNL and Cell B/G.
He states that CPUs are hard to program (yes, CPUs, not accelerators) because the CPU is pipelined, OOO & superscalar with multi-level access & parallelism. The Rice HPCToolkit needs to be correlated with code, useful on serial & parallel execution, and intuitive yet detailed enough for compiler folks to use.
John's design principles are a good guideline for anyone looking at performance analysis.
The measurement of performance is increasingly complex due to the increase in layered software design including repeated calls of procedures, the velocity of change in the hardware and the impact of context (which data, etc.) on performance.
*Stephane Bihan of CAPS *
_Addressing Heterogeneity in Manycore Applications_
I've seen Stephane present different versions of this before. The CAPS website also has information. They were also at SC07 in Reno.
They are tackling the gap between changing processing hardware and software development. Their runtime recognizes the distributed, heterogeneous & parallel requirements of future programming.
1. Parallelism needs to be expressed by the programmer through the use of directives similar to OpenMP, but in standard languages (C & Fortran)
2. The runtime needs to deal with resource availability & scheduling
3. Program for the Hardware architecture, insulating the main body of the code & programmer.
The result of using HMPP directives is to enumerate codelets that can be executed synchronously or asynchronously. Directives guide the data transfer & barriers required for synchronization.
CAPS HMPP provides the workflow for x86 code to be linked with the hardware accelerator compiler. The directives of a codelet can be set to describe the codelet behavior when it is called. The interesting example is if the problem is small, run it on the CPU.
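The "small problem, run it on the CPU" behavior can be sketched in a few lines of Python. This is not HMPP syntax or its runtime. `make_codelet`, `min_accel_size` and the two stand-in implementations are hypothetical names used only to illustrate the dispatch decision, including the fallback when no accelerator is available.

```python
def make_codelet(cpu_impl, accel_impl, min_accel_size):
    """Return a callable that picks a target each time it is called.

    cpu_impl / accel_impl: functions with the same signature.
    min_accel_size: hypothetical threshold below which offload is not worth it.
    """

    def call(data):
        # Fall back to the CPU if there is no accelerator or the problem is small.
        if accel_impl is None or len(data) < min_accel_size:
            return "cpu", cpu_impl(data)
        return "accel", accel_impl(data)

    return call


def scale_cpu(data):
    return [2 * x for x in data]


def scale_accel(data):
    # Stand-in for an offloaded kernel; here it just computes the same thing.
    return [2 * x for x in data]


scale = make_codelet(scale_cpu, scale_accel, min_accel_size=1000)

small_target, small_result = scale([1, 2, 3])
large_target, _ = scale(list(range(5000)))
no_accel_target, _ = make_codelet(scale_cpu, None, 1000)(list(range(5000)))

print(small_target, large_target, no_accel_target)
```

The calling code never changes; only the decision inside the codelet wrapper does, which is the insulation the directives are meant to provide.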
The current version supports C & Fortran, machine independent directives and CUDA & SSE. AMD/ATI software is targeted for support soon.
His O&G example is Reverse Time Migration on a GPU on a 5 node (dual quad core) cluster with Tesla S870 (4 GPUs). Using a 2D test case, they achieved an 8x improvement for compute, but 1/3 of the app is in disk IO.
Once again, the key optimizations were data alignment for the GPU and overlapping data transfers.
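A quick back-of-the-envelope check on what the 8x means for the whole application. This is my arithmetic, not a number from the talk, and it assumes the 1/3 spent in disk IO is not accelerated at all and does not overlap with compute.

```python
def overall_speedup(accelerated_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only part of it gets faster."""
    untouched = 1.0 - accelerated_fraction
    return 1.0 / (untouched + accelerated_fraction / kernel_speedup)


# 2/3 of the run is compute (8x faster), 1/3 is disk IO (assumed unchanged).
rtm = overall_speedup(2.0 / 3.0, 8.0)

# Even an infinitely fast GPU cannot beat 1 / (1/3) = 3x under these assumptions.
ceiling = overall_speedup(2.0 / 3.0, 1e12)

print(f"{rtm:.2f}x")  # prints 2.40x
```

So under those assumptions the application sees about 2.4x, and the ceiling is 3x no matter how fast the GPU gets. That is why overlapping the transfers and the IO matters as much as the kernel itself.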
Q: (Henri Calendara of Total) Despite good small scale support, the challenge on the O&G market is scaling the data as well. We can't optimize for accelerators without understanding the overheads of data movement on the application.
A: (JMC) The tools help understand the data movement and identify its bottlenecks. In the end, compilers will need to manage data movement. Explicit data movement is a 'lot of work.' Therefore, locality-aware programming languages are going to be critical.
A: (Scott Misner) With 1GB plus of GPU memory, the granularity is similar to the problems we already see on the CPU.
A: (Guillaume) Compression of the data for the data transfer to optimize transfer is a technique to be con
A: (Vivek) Good news: increasing on-chip bandwidth will be there. IO is the next bottleneck and OSes don't understand how to manage for that.
Q: (Christoph from T--- O&G developer) I am very excited about the potential, but I noted a thread that performance was achieved with 'careful development of the algorithms.' Our current optimized x86 8 core nodes achieve 80GFlops for $3k.
Can you comment on Performance/Dollar?
How do we better link computer science with the algorithm development since it appears that optimization will require CS knowledge?
A (Samuel Brown) There were examples of good performance/dollar here. As we move forward these will become new engineering practices and more available as a result.
A (Henri) The impact of TCO for scaling to Petaflop is very interesting to his group.
A (JMC): Looking at historical collaborative efforts between national labs and the O&G community may be an important model for optimizing code.
Q (Lee): What's the difference in heat & capital costs?
A: may need to pull in vendors.
Q: What are the advantages of the program analyzer for developing parallel programs?
A: (JMC): ...examples of what they measure... Bottom line, the Rice HPC Toolkit allows for blackbox scalability
Q (Joe): I have complicated unstructured finite element problems, much of which is in Java. Is there work that can help me?
A (Vivek): The model
Q (Steve Joachems of Acceleware): Where do you see the hurdles from development to production?
A (Guillaume) There are the facility issues of heat, etc. that need to be addressed, including maintainability of the infrastructure.
A (Stephane): We're focused on portability tools
Q (Jan Orgegard): How easy does it need to get to be useful.. VB v. Fortran.
A (Scott M): Need to limit to thousands of lines of code, not hundreds of thousands. Much of the old code doesn't need to change.
A (Guillaume) We will not rewrite the whole thing. Recompiling is OK.
A (Samuel Brown): Increased lifespan of the code is more critical than ease.
A (Stephane) It is a complex environment with many layers. The language itself can be in different run times.
A (JMC) It is less the language than it is the portability of the code into the future.
A (Vivek) Wholeheartedly agree there. Making a structure where the function will not have 'side effects' allows more separation of hardware and software.
Q: Any expectation of MTBF for the new hardware?
A: (Scott) That is what we're looking at with our new GPU cluster.
Q: Follow up... what's acceptable in Seismic?
A: PC clusters had a high initial failure rate, but yes it does need to stay up.
Q: What about the data from government labs that determined that GPUs were not cost effective?
A (Scott) We looked at it, but the national lab data is older.
A (Vivek) Look at Roadrunner for future data.
Q (Keith Gray) What is coming from the University to help the industry with this problem?
A (Vivek) Good university and industry relationships are critical. The ones Rice has are positive.
comments.... Demand for smart people
Q (from __ of altair scheduling app) What's your impression before this is readily in production and available? When do commercial ISVs need to acknowledge and port to it?
A: Despite some comments that we need some now, we have no idea when it needs to be ready.
They have done work on a few platforms.
The hardware platform was a Cray XT4. Used an FPGA library from ENS Lyon plus Xilinx to 'design the hardware.' It was an intensive process with iterations going between Matlab & other tools to work out timing and other issues.
- Good Density
- Liked the closely coupled nature of working on regular Opteron systems
- Complex to develop
- Limited bandwidth hampered performance
- Scaling efficiency wasn't that good.
Using the QS20 blade (2x3.2GHz Cell BE)
- Port from Linux to Power... Endian and Power compiler issues
- Rearrange the code to make use of SPEs
- Write the vector code for one SPE. It is all vectors, so it is a new project, but it is not a major hurdle.
- Optimize for multiple SPEs. This gets complicated
The code is not human readable... Perhaps Geek readable, but certainly not for the average human.
Results need to be measured at the application level, since the kernels are up to 20x faster on the SPE, but code on the PPC is 3x slower. Guillaume wants a better host processor on the Cell.
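To see why the measurement has to be at the application level, here is a rough model using the two figures above. The model itself is my assumption, not something presented: a fraction of the original runtime moves to the SPEs at 20x, and everything else stays on the PPC at 3x slower, with no overlap between the two.

```python
def cell_speedup(kernel_fraction, spe_gain=20.0, ppc_penalty=3.0):
    """Whole-application speedup versus the original host.

    kernel_fraction: share of the original runtime spent in SPE-friendly kernels.
    spe_gain: kernels run this many times faster on the SPEs.
    ppc_penalty: the remaining code runs this many times slower on the PPC.
    """
    new_time = kernel_fraction / spe_gain + (1.0 - kernel_fraction) * ppc_penalty
    return 1.0 / new_time


def break_even_fraction(spe_gain=20.0, ppc_penalty=3.0):
    """Kernel fraction where the port is exactly as fast as the original.

    Solves f / spe_gain + (1 - f) * ppc_penalty = 1 for f.
    """
    return (ppc_penalty - 1.0) / (ppc_penalty - 1.0 / spe_gain)


half = cell_speedup(0.5)     # below 1.0: the port is slower than the original
mostly = cell_speedup(0.9)   # comfortably above 1.0
needed = break_even_fraction()

print(f"break-even kernel fraction: {needed:.3f}")
```

Under this model roughly two thirds of the original runtime has to live in the kernels before the port even breaks even, which fits the "performance on codes that fit" comment.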
- High volume CPU due to PS3
- Performance on codes that fit
- memory bandwidth
- not a general purpose CPU
- complicated code
- 1GB was certainly not enough
Now working on Nvidia's CUDA, which he feels is significantly ahead in the market, and Nvidia has been a good partner. CUDA was relatively straightforward to learn and use.
One of the problems is moving the data from host to GPU, which can be optimized as non-cached memory to get 3.2GB/s. Working around the PCIe bottleneck is required for application performance. Works only for compute intensive kernels where you can leave the data on the GPU as much as possible. The programmer needs to manage memory access constraints to get performance. This is analogous to user managed CPU cache, and a bit daunting.
There are unique memory features in the GPU that need to be understood. Texture, float & boundaries are areas he highlighted.
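A rough way to see why the data has to stay on the GPU: charge every kernel call for its PCIe traffic at the 3.2GB/s quoted above. The kernel speedup, runtimes and data sizes below are hypothetical inputs of mine, and the model assumes transfers do not overlap with compute.

```python
PCIE_BYTES_PER_SEC = 3.2e9  # host-to-GPU rate quoted in the talk


def transfer_seconds(num_bytes, bandwidth=PCIE_BYTES_PER_SEC):
    """Time to push num_bytes across the bus at the given bandwidth."""
    return num_bytes / bandwidth


def effective_speedup(cpu_seconds, kernel_speedup, bytes_moved,
                      bandwidth=PCIE_BYTES_PER_SEC):
    """Speedup once transfer time is added to the accelerated kernel time."""
    gpu_seconds = cpu_seconds / kernel_speedup + transfer_seconds(bytes_moved, bandwidth)
    return cpu_seconds / gpu_seconds


one_gb = transfer_seconds(1e9)

# Hypothetical 15x kernel, 1GB moved per call.
light = effective_speedup(cpu_seconds=1.0, kernel_speedup=15.0, bytes_moved=1e9)
heavy = effective_speedup(cpu_seconds=100.0, kernel_speedup=15.0, bytes_moved=1e9)

print(f"1GB transfer: {one_gb:.4f}s, light kernel: {light:.2f}x, heavy kernel: {heavy:.2f}x")
```

With the same 1GB of traffic, a kernel that only replaces one second of CPU work loses most of its gain to the bus, while one that replaces 100 seconds keeps nearly all of it. Hence "only for compute intensive kernels."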
So, what about the application performance? CGG ran a wave equation model. It required 'deep re-engineering of the code' but returned 15x performance. The interesting note is the performance is dependent upon the data set, a reflection of the optimization dependency on memory architecture.
- C like code
- an add in to regular systems
- fast evolution of CUDA (now on Mac too!)
- Tricky to get full performance
- Only for most intensive applications
- limited amount of memory
- limited debug & profiling tools
There are programming issues. Hardware-specific codes need algorithms to be modified, which requires understanding the hardware. There are too many codes to port this way.
It is not always rewarding, but it is getting better. CUDA was far and away their best experience, and don't forget the CPUs are getting faster. Whatever happens, parallelism is a part of your future.
Q: Size of the data structure and scaling
A: The code scales well to dual core CPU, but quad cores bottleneck on memory bandwidth.
"Optimal Thread Parallelism, etc."
His background includes a lot of work on FPGAs & others for accelerated computing.
His core requirement is that it needs to be integrated into the CPU environment.
Current accelerators have lots in common
- Specialized HW
- New programming languages
- Rearrange your data
This is also true of the vector ops in current CPUs, where data arrangement and compiler flags will change the code behavior. No matter what happens, he expects some kind of heterogeneous silicon on the die. In his opinion, the performance balance can't be reached with homogeneous cores.
Most of what he does is generate synthetic data as quickly as possible. He has a number of 8 core systems in his lab today. This does allow for the easy approach of separate MPI processes. However, Matlab & other shared data sneaks into the general purpose cores.
However, he does have a PS3, which he doesn't have to share (and which doesn't run Matlab). So what he needed was a code base that worked on both the PS3 and the 8 core machines. Most of the math is PDEs (Partial Differential Equations).
- SPEs have only 256KB local memory, so you need to manage your data well
- Explicit data movement via DMA
- Local SPE is like a user-controlled cache, which can be good
- Out of core methodology. You asynchronously refresh the data during the compute
His approach was Control & Compute threads. The Control thread handles Initialization, File IO & Thread sync, while the Compute threads are simple. He did rearrange his data to be much more vector friendly using intrinsics & partitioned it to map to his command & data movement.
Results of PS3 and the x86 8 core machine showed the scaling of the heterogeneous processing v. the 8 core machine. On a single thread the performance is similar, but at 6 worker threads the PS3 is 2x faster. Scaling beyond 4 x86 Intel cores was poor. (At 4 threads, the PS3 was better than 1.5x performance.)
What to remember is "Memory access is critical. The effort is applicable to both the Cell & x86."
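The Control & Compute split can be sketched in Python as below. This is only the structure, not his code: `run_out_of_core`, the bounded queue and the toy `sum` kernel are my own stand-ins. The bounded queue plays the role of the small local store, so the control side keeps the next blocks staged while the compute threads work on the current ones. Python threads will not show the real speedup because of the interpreter lock.

```python
import queue
import threading


def run_out_of_core(blocks, kernel, num_workers=2, depth=2):
    """Control thread stages blocks; simple compute threads apply the kernel."""
    # Bounded queue: only a few blocks are ever "in flight", like a small local store.
    work = queue.Queue(maxsize=depth * num_workers)
    results = [None] * len(blocks)

    def compute():
        while True:
            item = work.get()
            if item is None:  # sentinel from the control thread: no more work
                break
            index, block = item
            results[index] = kernel(block)  # each index is written by one worker only

    workers = [threading.Thread(target=compute) for _ in range(num_workers)]
    for w in workers:
        w.start()

    # Control role (here, the calling thread): feed blocks while workers compute.
    for index, block in enumerate(blocks):
        work.put((index, block))  # blocks when the queue is full
    for _ in workers:
        work.put(None)
    for w in workers:
        w.join()
    return results


blocks = [list(range(i, i + 4)) for i in range(0, 16, 4)]
sums = run_out_of_core(blocks, kernel=sum)
print(sums)  # prints [6, 22, 38, 54]
```

The same skeleton applies whether the workers are SPEs fed by DMA or x86 cores fed from main memory, which is the portability he was after.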
Sam had more specific points. I will see if I can find them and add to the comments.
However, this is not readily readable code for others... Which is a problem for sustaining the work.
Q: Your results are almost exactly what we got! However, we need more scaling. Do you see an effect of IO performance on your ability to use PS3 code?
A: Given his problem size there wasn't a need to address additional IO.
Q: What is the QS20 in your data?
A: Cell Blade.
Q: What limited the x86 code scaling...
A: The usual... Memory bandwidth. He hasn't had a chance to run on AMD systems with integrated memory controllers.
Over 220 registered attendees though it does include a strong showing of vendors. Looks like a room of 190 plus.
Seven corporate sponsors. Three of the supernationals are sponsors. Key names include Keith Gray, Henri Calandara & Ch-- Wong.
Scott Morton of Hess on his experience with Seismic Algorithms on GPUs
What he wants
- 10x price performance (commensurate with the improvement from SC to x86 clusters)
- Commodity volumes
- significant parallelism
- "easy to program"
They've looked at a variety of hardware platforms
- Commodity. Not easy to move the algorithms, especially for large data sets
- Worked with SRC on a wave equation algo in 2003
- 10x performance and 10x cost
- Programming is graphical, which doesn't map to skills and tools.
- I wonder what the cost/performance number looks like 5 years later since the FPGA vendors claim much higher than Moore's Law improvement in gates per $.
- Tracking. Believes it is commodity, but hasn't dived into it yet.
Hess did work with Peakstream and delivered 5 to 10x speed up in 2D, but only 2 to 3x in 3D for Kirchhoff. Once Google bought them, they disappeared. Now he's working with Nvidia Cg.
Comments on CUDA
- Relatively easy to program, but hard to optimize
- Two day course was useful
- Used on Kirchhoff, Reverse-time & Wave Equation algorithms.
- Showing ongoing work... This is not final!
Optimization was on minimizing the data movement between the GPU and CPU with an emphasis on most compute on GPU.
Reverse-time algo is dominated by 3D FFTs. The results are 5x over single CPU, which is 20% faster than a dual node quad core. Notably (and this is really interesting...) 1 quad core and 2 quad cores deliver the same performance.
Wave Equation Migration is an implicit solver. Prototype performance is 48x a single 3.6GHz Xeon. The core performance flattens at 4 quad core CPUs - 8 is no better.
(These problems are all 32bit and just over 1GB in size.)
They have ordered a 32 node system. Their theoretical performance of 32 nodes of GPU & dual quad-core should outperform the current 4k Xeon cluster.
Q: What is the hardware platform?
A: The external Tesla boxes connected via PCIe cables.
Q: Do you really think it will meet price/performance targets?
A: Yes, but we still need to develop the production code.
Q: Experience with heat density?
A: New hardware, so there are no problems so far :)
Q: Are slides going to be available?
A: Yes (I'll add them in the comments)
Q: Why the asymptotic performance on CPU scaling?
A: Memory bandwidth
Q: What about IEEE 754 & double precision?
A: We don't care about that right now.
Monday, March 3, 2008
At dinner, one of the first things that Clay asked me was if my work was being influenced by John Carmack's positions on the future of useful computing. If Clay, who is up to his neck in thinking about people on the web, connects Accelerated Computing with Carmack's comments on Game Physics, PPUs & GPUs, this is more than just something being considered in random labs.
+++ The good stuff
I attended Clay's public presentation at the launch of his new book at the Berkman last week. I have great notes thanks to Zinta, but she also pointed out that David Weinberger was live blogging. Zinta's notes are excellent. David's capturing of the event is different and has comments from others. Not many, but the dialog is valuable.
Inadvertently and with no malice, I was convinced over dinner by the 'Fellows of Berkman' to recognize the validity of the term cyberspace... A phrase I dislike due to historical overuse. However, what else are you going to call something that embraces Twitter, Skype, Flickr and email? It's more than 'the Internet.'
Here are the links
Clay Shirky & others on his excellent new book.
Weinberger on Shirky in the evening: The Book.
And in the afternoon on protest culture.
Of course, the talks by Clay will be posted to the Berkman Center site so you can have the original, rather than the distillate.
I am not a fan of blogs. In general, they are self indulgent musings in search of an editor. I can say this from personal experience, though I think I successfully purged those old postings from the net. So, why have I started another one?
I attend a number of conferences and technology related events for work and personal growth. They fall into two categories. Those that aren't generally covered by some flavor of cybernews and those that are. This blog captures my notes when they are the only ones available, or will point to others when they are.
The areas which I expect to have here are:
Mostly, Accelerated Computing and all that it means (not well covered)
Often, Emergent organizations in cyberspace (covered very well)
When I can, Market forces that affect, or effect either of the above (lots of opinion, but really mixed)
A couple very important caveats:
Rule 1: I don't mention my company. This is my blog.
Though many of the notes here may be related to my work, here I do not speak for my employer. If you want to know about my company's opinion on something, go Google it. If you want to hear what I say about my employer's technology, market position, etc. go Google that too. It isn't here.
Rule 2: Nothing is Confidential. This is a public forum.
You will never find anything here that was not revealed in a public forum. I have other places to put notes that aren't meant for all to see.
Rule 3: This is temporary.
I am undertaking this project because I need to organize better in a period of high-velocity. This is not a commitment to be a pundit, alpha geek or journalist. I know several of those.
Rule 4: Lead, follow or get out of the way!
I can't stand blogs that rehash good original work into pabulum. I'd rather link than write.
Rule 5: Cluetrain says...
I recognize I am letting something into the wild. Comments, assistance and C&D requests will be equally invited and acknowledged.
AND, if someone can point me to a better place to cover these subjects... I will.