The Perils of Parallel: June 2010

Monday, June 14, 2010

WNPoTs and the Conservatism of Hardware Development

There are some things about which I am undoubtedly considered a crusty old fogey, the abominable NO man, an ostrich with its head in the sand, and so on. Oh frabjous day! I now have a word for such things, courtesy of Charlie Stross, who wrote:

Just contemplate, for a moment, how you'd react to some guy from the IT sector walking into your place of work to evangelize a wonderful new piece of technology that will revolutionize your job, once everybody in the general population shells out £500 for a copy and you do a lot of hard work to teach them how to use it. And, on closer interrogation, you discover that he doesn't actually know what you do for a living; he's just certain that his WNPoT is going to revolutionize it. Now imagine that this happens (different IT marketing guy, different WNPoT, same pack drill) approximately once every two months for a five year period. You'd learn to tune him out, wouldn't you?

I've been through that pack drill more times than I can recall, and yes, I tune them out. The WNPoTs in my case were all about technology for computing itself, of course. Here are a few examples; they are sure to step on a number of toes:

Any new programming language existing only for parallel processing, or any reason other than making programming itself simpler and more productive (see my post 101 parallel languages)
Multi-node single system image (see my post Multi-Multicore Single System Image)
Memristors, a new circuit type. A key point here is that exactly one company (HP) is working on it. Good technologies instantly crystallize consortia around themselves. Also, HP isn't a silicon technology company in the first place.
Quantum computing. Primarily good for just one thing: Cracking codes.
Brain simulation and strong artificial intelligence (really "thinking," whatever that means). Current efforts were beautifully characterized by John Horgan, in a SciAm guest blog: 'Current brain simulations resemble the "planes" and "radios" that Melanesian cargo-cult tribes built out of palm fronds, coral and coconut shells after being occupied by Japanese and American troops during World War II.'

Of course, for the most part those aren't new. They get re-invented regularly, though, and drooled over by ahistorical evangelists who don't seem to understand that if something has already failed, you need to lay out what has changed sufficiently that it won't just fail again.

The particular issue of retread ideas aside, genuinely new and different things have to face up to what Charlie Stross describes above, in particular the part about not understanding what you do for a living. That point, for processor and system design, is a lot more important than one might expect, due to a seldom-publicized social fact: Processor and system design organizations are incredibly, insanely, conservative. They have good reason to be. Consider:

Those guys are building some of the most, if not the most, intricately complex structures ever created in the history of mankind. Furthermore, they can't be fixed in the field with an endless stream of patches. They have to just plain work – not exactly in the first run, although that is always sought, but in the second or, at most, third; beyond that money runs out.

The result they produce must also please, not just a well-defined demographic, but a multitude of masters from manufacturing to a wide range of industries and geographies. And of course it has to be cost- and performance-competitive when released, which entails a lot of head-scratching and deep breathing when the multi-year process begins.

Furthermore, each new design does it all over again. I'm talking about the "tock" phase for Intel; there's much less development work in the "tick" process shrink phase. Development organizations that aren't Intel don't get that breather. You don't "re-use" much silicon. (I don't think you ever re-use much code, either, with a few major exceptions; but that's a different issue.)

This is a very high stress operation. A huge investment can blow up if one of thousands of factors is messed up.

What they really do to accomplish all this is far from completely documented. I doubt it's even consciously fully understood. (What gets written down by someone paid from overhead to satisfy an ISO requirement is, of course, irrelevant.)

In this situation, is it any wonder the organizations are almost insanely conservative? Their members cannot even conceive of something except as a delta from both the current product and the current process used to create it, because that's what worked. And it worked within the budget. And they have their total intellectual capital invested in it. Anything not presented as a delta of both the current product and process is rejected out of hand. The process and product are intertwined in this; what was done (product) was, with no exceptions, what you were able to do in the context (process).

An implication is that they do not trust anyone who lacks the scars on their backs from having lived that long, high-stress process. You can't learn it from a book; if you haven't done it, you don't understand it. The introduction of anything new by anyone without the tribal scars is simply impossible. This is so true that I know of situations where taking a new approach to processor design required forming a new, separate organization. It began with a high-level corporate Act of God that created a new high-profile organization from scratch, dedicated to the new direction, staffed with a mix of outside talent and a few carefully-selected high-talent open-minded people pirated from the original organization. Then, very gradually, more talent from the old organization was siphoned off and blended into the new one until there was no old organization left other than a maintenance crew. The new organization had its own process, along with its own product.

This is why I regard most WNPoT announcements from a company's "research" arm as essentially meaningless. Whatever it is, it won't get into products without an "Act of God" like that described above. WNPoTs from academia or other outside research? Fuggedaboudit. Anything from outside is rejected unless it was originally nurtured by someone with deep, respected tribal scars, sufficiently so that that person thinks they completely own it. Otherwise it doesn't stand a chance.

Now I have a term to sum up all of this: WNPoT. Thanks, Charlie.

Oh, by the way, if you want a good reason why the Moore's Law half-death that flattened clock speeds produced multi- / many-core as a response, look no further. They could only do more of what they already knew how to do. It also ties into how the very different computing designs that are the other reaction to flat clocks came not from CPU vendors but outsiders – GPU vendors (and other accelerator vendors; see my post Why Accelerators Now?). They, of course, were also doing more of what they knew how to do, with a bit of Sutherland's Wheel of Reincarnation and DARPA funding thrown in for Nvidia. None of this is a criticism, just an observation.

Tuesday, June 8, 2010

Ten Ways to Trash your Performance Credibility

Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming.

Danger: The showiest blooms are toxic to your credibility.

(My wife is planting flowers these days. Can you tell?)

There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:

"I would have to be pretty drunk to believe that."

Your great, customer-verified, most wonderful results have trashed your credibility.

Here are some examples:

In a recent talk, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.

In an online discussion on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a Quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under non-disclosure agreement.

In a video interview, a customer doing cell pathology image analysis reports their task going from 400 minutes to 65 milliseconds, for a speedup of just under 370,000X. (Update: Typo, he really does say "minutes" in the video.)
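For the record, the arithmetic behind those last two figures is easy to check. This is a minimal sketch that uses only the times quoted above; the variable names are mine, and everything is converted to milliseconds so the division is done on whole numbers.

```python
# Times quoted in the examples above, expressed in milliseconds.
cray_before_ms = 12 * 60 * 60 * 1000      # 12 hours
cray_after_ms = 1200                      # 1.2 seconds
pathology_before_ms = 400 * 60 * 1000     # 400 minutes
pathology_after_ms = 65                   # 65 milliseconds

# Speedup is simply the ratio of elapsed times, before over after.
cray_speedup = cray_before_ms / cray_after_ms
pathology_speedup = pathology_before_ms / pathology_after_ms

print(f"{cray_speedup:,.0f}X")        # 36,000X
print(f"{pathology_speedup:,.0f}X")   # 369,231X
```

So the quoted numbers are internally consistent. The question is what they are a measure of.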

None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find him- or herself dismissed as, well, the polite version would be that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.

What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.

This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler without vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.
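That three-step comparison boils down to splitting one big ratio into two smaller ones. Here is a minimal sketch with made-up timings; the numbers are purely illustrative and are not measurements from any system mentioned in this post.

```python
# Hypothetical timings in seconds; illustrative only, not measured results.
t_original = 1000.0       # grotty code, ordinary compiler
t_clean_with_hw = 10.0    # cleaned-up code, special hardware enabled
t_clean_no_hw = 12.5      # cleaned-up code, special hardware disabled (the step often skipped)


def speedup(before: float, after: float) -> float:
    """Ratio of elapsed times: how many times faster `after` is than `before`."""
    return before / after


total = speedup(t_original, t_clean_with_hw)             # the number that gets quoted
from_cleanup = speedup(t_original, t_clean_no_hw)        # credit due to fixing the code
from_hardware = speedup(t_clean_no_hw, t_clean_with_hw)  # credit due to the hardware

print(f"total {total:.0f}X = cleanup {from_cleanup:.0f}X * hardware {from_hardware:.2f}X")
```

With these invented numbers the headline 100X is an 80X cleanup multiplied by a 1.25X hardware gain. Without the third run you cannot tell which factor is which.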

The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related modernized version of this story is told in a recent paper Believe It or Not! Multi-core CPUs can Match GPU Performance, where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's using, great things can happen.

This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that Power7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution, and using that information, present your results with credit due to the appropriate actions.

The situation we're in is identical to the one that led someone (wish I remembered who), decades ago, to write a short paper titled, approximately, Ten Ways to Lie about Parallel Processing. I thought I kept a copy, but if I did I can't find it. It was back at the dawn of whatever, and I can't find it now even with Google Scholar. (If anyone out there knows the paper I'm referencing, please let me know.) Got it! It's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, by David H. Bailey. Thank you, Roland!

In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:

1. Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task.
2. Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,
3. Pick a problem size that just barely fits into the local memory you have available. Why? See #4.
4. Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not.
5. Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.
6. Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.
7. Allow a slightly different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't.
8. Change the operating system. Pick the one that does IO to your device fastest.
9. Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!
10. Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.
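Two of those items are easy to demonstrate in a few lines. This is a minimal sketch: the function name and the 1000X kernel figure are mine, chosen for illustration, and the floating-point operands were picked deliberately so the rounding is visible in ordinary double precision.

```python
def whole_task_speedup(kernel_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: only `kernel_fraction` of the original run time gets faster."""
    return 1.0 / ((1.0 - kernel_fraction) + kernel_fraction / kernel_speedup)


# First item: a (hypothetical) 1000X kernel speedup, where the kernel is
# just 5% of the total execution time of the whole task.
print(round(whole_task_speedup(0.05, 1000.0), 3))

# Seventh item: A*(X+Y) versus A*X + A*Y in double precision.
a, x, y = 3.0, 1e16, 1.0
factored = a * (x + y)       # x + y is rounded before the multiply
expanded = a * x + a * y     # the sum is rounded after the multiplies
print(factored == expanded)
print(expanded - factored)
```

The first line shows the whole task getting barely 5% faster no matter how spectacular the kernel number is. The last two show the "same" expression giving two different answers, neither of which is the exact one.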

This, of course, is just a start. I'm sure there are another ten or a hundred out there.

A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes.

For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list before you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.

*****************

Thanks to John Melonakos of Accelereyes for discussion and sharing his thoughts on this topic.

Friday, June 4, 2010

How Hardware Virtualization Works (Part 4)

This is the fourth and last in a series of posts about how hardware virtualization works. Catch it from Part 1 to understand the context.

Drown It in Silicon

In the previous discussion I might have led you to believe that paravirtualization is widely used in mainframes (IBM zSeries and clones). Sorry. It is used, but in many cases another technique is used, alone or in combination with paravirtualization.

Consider the example of reading the real time clock. All that has to happen is that a silly little offset is added. It is perfectly possible to build hardware that adds an offset all by itself, without any "help" from software. So that's what they did. (See figure below.)
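To make the "silly little offset" concrete, here is a minimal software sketch of the idea. The class and function names are hypothetical, and the notion that setting a guest's clock just changes its offset is my assumption about how the offset gets there. On the mainframe this logic lives in hardware, not in code like this.

```python
import time
from dataclasses import dataclass


@dataclass
class VirtualMachine:
    """Toy stand-in for the per-VM state; the field names are hypothetical."""
    name: str
    clock_offset: float = 0.0   # seconds added to the real clock for this VM


def read_clock(vm: VirtualMachine, real_clock=time.time) -> float:
    """What the guest sees: the one real clock plus this VM's private offset."""
    return real_clock() + vm.clock_offset


def set_clock(vm: VirtualMachine, wanted: float, real_clock=time.time) -> None:
    """A guest 'setting the clock' changes only its own offset, never the real clock."""
    vm.clock_offset = wanted - real_clock()


# Usage with a fake real clock, so the result does not depend on when it runs.
fake_real_clock = lambda: 1000.0
vm_a = VirtualMachine("a")
vm_b = VirtualMachine("b")
set_clock(vm_a, 5000.0, fake_real_clock)
print(read_clock(vm_a, fake_real_clock), read_clock(vm_b, fake_real_clock))
```

Each virtual machine believes it owns the clock, and neither can disturb the other's notion of time.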

They embedded nearly the whole shooting match directly into silicon. This implies that the bag 'o bits I've been glibly referring to becomes part of the hardware architecture: Now it's hardware that has to reach in and know where the clock offset resides. Not everything is as trivial as adding an offset, of course; what happens with the memory mapping gets, to me anyway, a tad scary in its complexity. But, of course, it can be made to work.
Nobody else is willing to invest a pound or so of silicon into doing this. Yet.

As Moore's Law keeps providing us with more and more transistors, perhaps at some point the industry will tire of providing even more cores, and spend some of those transistors on something that might actually be immediately usable.

A Bit About Input and Output

One reason for all this mainframe talk is that it provides an existence proof: Mainframes have been virtualizing IO basically forever, allowing different virtual machines to think they completely own their own IO devices when in fact they're shared. And, of course, it is strongly supported in yet more hardware. A virtual machine can issue an IO operation, have it directed to its address for an IO device (which may not be the "real" address), get the operation performed, and receive a completion interrupt, or an error, all without involving a hypervisor, at full hardware efficiency. So it can be done.

But until very recently, it could not be readily done with PCI and PCIe (PCI Express) IO. Both the IO interface and the IO devices need hardware support for this to work. As a result, IO operations have for commodity and RISC systems been done interpretively, by the hypervisor. This obviously increases overhead significantly. Paravirtualization can clearly help here: Just ask the hypervisor to go do the IO directly.

However, even with paravirtualization this requires the hypervisor to have its own IO driver set, separate from that of the guest operating systems. This is a redundancy that adds significant bulk to a hypervisor and isn't as reliable as one would like, for the simple reason that no IO driver is ever as reliable as one would like. And reliability is very strongly desired in a hypervisor. Errors within it can bring down all the guest systems running under it.

Another thing that can help is direct assignment of devices to guest systems. This gives a guest virtual machine sole ownership of a physical device. Together with hardware support that maps and isolates IO addresses, so a virtual machine can only access the devices it owns, this provides full speed operation using the guest operating system drivers, with no hypervisor involvement. However, it means you do need dedicated devices for each virtual machine, something that clearly inhibits scaling: Imagine 15 virtual servers, all wanting their own physical network card. This support is also not an industry standard. What we want is some way for a single device to act like multiple virtual devices.

Enter the PCI SIG. It has recently released a collection – yes, a collection – of specifications to deal with this issue. I'm not going to attempt to cover them all here. The net effect, however, is that they allow industry-standard creation of IO devices with internal logic that makes them appear as if they are several, separate, "virtual" devices (the SR-IOV and MR-IOV specifications); and add features supporting that concept, such as multiple different IO addresses for each device.
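The shape of the idea, one physical device presenting several isolated virtual devices, can be sketched as a toy model. Everything here is hypothetical: the class, its methods, and the ownership check are my illustration and do not reflect the actual interfaces in the PCI SIG specifications.

```python
class SharedDevice:
    """Toy model of one physical device that presents several virtual devices.

    Names are hypothetical; this is not the SR-IOV programming interface.
    """

    def __init__(self, virtual_count: int) -> None:
        self.virtual_count = virtual_count
        self.assigned: dict[int, str] = {}     # virtual device index -> owning VM
        self.log: list[tuple[int, str]] = []   # work done by the one real device

    def assign(self, index: int, vm: str) -> None:
        """Give one virtual device to one virtual machine."""
        if not 0 <= index < self.virtual_count:
            raise IndexError("no such virtual device")
        if index in self.assigned:
            raise ValueError("virtual device already assigned")
        self.assigned[index] = vm

    def submit(self, index: int, vm: str, request: str) -> None:
        """Isolation check: a VM may only use the virtual device it owns."""
        if self.assigned.get(index) != vm:
            raise PermissionError(f"{vm} does not own virtual device {index}")
        self.log.append((index, request))


# Fifteen virtual servers sharing one physical network card.
nic = SharedDevice(virtual_count=15)
for i in range(15):
    nic.assign(i, f"vm{i}")
nic.submit(0, "vm0", "send packet")
```

In the real thing the bookkeeping and the ownership check are done by logic inside the device and the IO interface, which is exactly why the device vendors have to be on board.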

A key point here is that this requires support by the IO device vendors. It cannot be done just by a purveyor of servers and server chipsets. So its adoption will be gated by how soon those vendors roll this technology out, how good a job they do, and how much of a premium they choose to charge for it. I am not especially sanguine about this. We have done too good a job beating a low cost mantra into too many IO vendors for them to be ready to jump on anything like this, which increases cost without directly improving their marketing numbers (GBs stored, bandwidth, etc.).

Conclusion

There is a joke, or a deep truth, expressed by the computer pioneer David Wheeler, co-inventor of the subroutine, as "All problems in computer science can be solved by another level of indirection."

Virtualization is not going to prove that false. It is effectively a layer of indirection or abstraction added between physical hardware and the systems running on it. By providing that layer, virtualization enables a collection of benefits that were recognized long ago, benefits that are now being exploited by cloud computing. In fact, virtualization is so often embedded in cloud computing discussions that many have argued, vehemently, that without virtualization you do not have cloud computing. As explained previously, I don't agree with that statement, especially when "virtualization" is used to mean "hardware virtualization," as it usually is.

However, there is no denying that the technology of virtualization makes cloud computing tremendously more economic and manageable.

Virtualization is not magic. It is not even all that complicated in its essence. (Of course its details, like the details of nearly anything, can be mind-boggling.) And despite what might first appear to be the case, it is also efficient; resources are not wasted by using it. There is still a hole to plug in IO virtualization, but solutions there are developing gradually if not necessarily expeditiously.

There are many other aspects of this topic that have not been touched on here, such as where the hypervisor actually resides (on the bare metal? Inside an operating system?), the role virtualization can play when migrating between hardware architectures, and the deep relationship that can, and will, exist between virtualization and security. But hopefully this discussion has provided enough background to enable some of you to cut through the marketing hype and the thicket of details that usually accompany most discussions of this topic. Good luck.

Tuesday, June 1, 2010

How Hardware Virtualization Works (Part 3)

This is the third in a series of posts about how hardware virtualization works. Catch it from Part 1 to understand the context.

Translate, Trap and Map

The basic Trap and Map technique described previously depends crucially on a hardware feature: The hardware must be able to trap on every instruction that could affect other virtual machines. Prior to the introduction of Intel's and AMD's specific additional hardware virtualization support, that was not true. For example, setting the real time clock was, in fact, not a trappable instruction. It wasn't even restricted to supervisors. (Note: not all Intel processors have virtualization support today; this is apparently done to segment the market.)

Yet VMware and others did provide, and continue to provide, hardware virtualization on such older systems. How? By using a load-time binary scan and patch. (See figure below.) Whenever a section of memory was marked executable – making that marking was, thankfully, trappable – the hypervisor would immediately scan the executable binary for troublesome instructions and replace each one with a trap instruction. In addition, of course, it augmented the bag 'o bits for that virtual machine with information saying what each of those traps was supposed to do originally.
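In miniature, scan and patch looks something like the sketch below. It is a toy: the "instructions" are strings, the opcode names are invented, and a real scanner has to cope with variable-length machine instructions, which this does not attempt.

```python
# Toy model: instructions are strings and the opcode names are invented.
TROUBLESOME = {"SET_CLOCK", "READ_CLOCK"}   # hypothetical untrappable instructions
TRAP = "TRAP"


def scan_and_patch(code: list[str]) -> tuple[list[str], dict[int, str]]:
    """Replace each troublesome instruction with TRAP and remember what it was."""
    patched = list(code)
    originals: dict[int, str] = {}   # address -> original instruction, kept with the VM's state
    for address, instruction in enumerate(code):
        if instruction in TROUBLESOME:
            patched[address] = TRAP
            originals[address] = instruction
    return patched, originals


guest_code = ["LOAD", "SET_CLOCK", "ADD", "READ_CLOCK", "STORE"]
patched, originals = scan_and_patch(guest_code)
print(patched)
print(originals)
```

When a patched trap fires, the hypervisor looks up the trapping address in the saved table to find out what the guest was really trying to do.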

Now, many software companies are not fond of the idea of someone else modifying their shipped binaries, and can even get sticky about things like support if that is done. Also, my personal reaction is that this is a horrendous kluge. But it is a necessary kluge, needed to get around hardware deficiencies, and it has proven to work well in thousands, if not millions, of installations.

Thankfully, it is not necessary on more recent hardware releases.

Paravirtualization

Whether or not the hardware traps all the right things, there is still unavoidable overhead in hardware virtualization. For example, think back to my prior comments about dealing with virtual memory. You can imagine the complex hoops a hypervisor must repeatedly jump through when the operating system in a client machine is setting up its memory map at application startup, or adjusting the working sets of applications by manipulating its map of virtual memory.

One way around overhead like that is to take a long, hard look at how prevalent you expect virtualization to be, and seriously ask: Is this operating system ever really going to run on bare metal? Or will it almost always run under a hypervisor?

Some operating system development streams decided the answer to that question is: No bare metal. A hypervisor will always be there. Examples: Linux with the Xen hypervisor, IBM AIX, and of course the IBM mainframe operating system z/OS (no mainframe has been shipped without virtualization since the mid-1980s).

If that's the case, things can be more efficient. If you know a hypervisor is always really behind memory mapping, for example, provide an actual call to the hypervisor to do things that have substantial overhead. For example: Don't do your own memory mapping, just ask the hypervisor for a new page of memory when you need it. Don't set the real-time clock yourself, tell the hypervisor directly to do it. (See figure below.)
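As a sketch of what "just ask the hypervisor" means, here is a toy guest that calls its hypervisor directly instead of doing the privileged work itself and getting trapped. The class and method names are hypothetical; they are not the calls of Xen, VMware, or any other real hypervisor.

```python
class Hypervisor:
    """Toy hypervisor; the method names are hypothetical, not a real para-API."""

    def __init__(self, total_pages: int) -> None:
        self.free_pages = list(range(total_pages))
        self.owner: dict[int, str] = {}            # physical page -> owning guest
        self.clock_offset: dict[str, float] = {}   # guest -> offset from the real clock

    def request_page(self, guest: str) -> int:
        """Direct call: hand the guest a fresh page and record who owns it."""
        page = self.free_pages.pop(0)   # raises IndexError when memory runs out
        self.owner[page] = guest
        return page

    def set_clock(self, guest: str, wanted: float, real_now: float) -> None:
        """Direct call: record how far this guest's clock is from the real one."""
        self.clock_offset[guest] = wanted - real_now


class ParavirtualizedGuest:
    """A guest that knows a hypervisor is always underneath it."""

    def __init__(self, name: str, hypervisor: Hypervisor) -> None:
        self.name = name
        self.hypervisor = hypervisor

    def need_memory(self) -> int:
        # No private memory mapping to be trapped and emulated; just ask.
        return self.hypervisor.request_page(self.name)

    def set_time(self, wanted: float, real_now: float) -> None:
        self.hypervisor.set_clock(self.name, wanted, real_now)


hv = Hypervisor(total_pages=4)
guest_a = ParavirtualizedGuest("a", hv)
guest_b = ParavirtualizedGuest("b", hv)
page_a = guest_a.need_memory()
page_b = guest_b.need_memory()
guest_a.set_time(wanted=5000.0, real_now=1000.0)
```

The saving comes from skipping the trap-and-figure-out-what-was-meant step: the guest states its intent in one call.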

This technique has become known as paravirtualization, and can lower the overhead of virtualization significantly. A set of "para-APIs" invoking the hypervisor directly has even been standardized, and is available in Xen, VMware, and other hypervisors.

The concept of paravirtualization actually dates back to around 1973 and the VM operating system developed in the IBM Cambridge Science Center. They had the not-unreasonable notion that the right way to build a time-sharing system was to give every user his or her own virtual machine, a notion somewhat like today's virtual desktop systems. The operating system run in each of those VMs used paravirtualization, but it wasn't called that back in the Computer Jurassic.

Virtualization is, in computer industry terms, a truly ancient art.

The next post covers the lowest-overhead technique used in virtualization, then input/output, and draws some conclusions. (Link will be added when it is posted.)