Monday, December 6, 2010

The Varieties of Virtualization

There appear to be many people for whom the term virtualization exclusively means the implementation of virtual machines à la VMware's products, Microsoft's Hyper-V, and so on. That's certainly a very important and common case, enough so that I covered various ways to do it in a separate series of posts; but it's scarcely the only form of virtualization in use.

There's a hint that this is so in the gaggle of other situations where the word virtualization is used, such as desktop virtualization, application virtualization, user virtualization (I like that one; I wonder what it's like to be a virtual user), and, of course, Java Virtual Machine (JVM). Talking about the latter as a true case of virtualization may cause some head-scratching; I think most people consign it to a different plane of existence than things like VMware.

This turns out not to be the case. They're not only all in the same (boringly mundane) plane, they relate to one another hierarchically. I see five levels to that hierarchy right now, anyway; I wouldn't claim this is the last word.

A key to understanding this is to adopt an appropriate definition of virtualization. Mine is that virtualization is the creation of isolated, idealized platforms on which computing services are provided. Anything providing that, whether it's hardware, software, or a mixture, is virtualization. The adjectives in front of "platform" could have qualifiers: Maybe it's not quite idealized in all cases, and isolation is never total. But lack of qualification is the intent.

Most types of virtualization allow hosting of several platforms on one physical or software resource, but that's not part of my definition because it's not universal; it could be just one, or a single platform could be created spanning multiple physical resources. It's also necessary to not always dwell all that heavily on boundaries between hardware and software. But that's starting to get ahead of the discussion. Let's go through the levels, starting at the bottom.

I'll relate this to the cloud computing's IaaS/PaaS/SaaS levels later.

Level 1: Hardware Partitioning

Some hardware is designed like a brick of chocolate that can be broken apart along various predefined fault lines, each piece a fully functional computer. Sun Microsystems (Oracle, now) famously did this with its .com workhorse, the Enterprise 10000 (UE10000). That system had multiple boards plugged into a memory-bus backplane, each board with processor(s), memory, and IO. Firmware let you set registers allowing or disallowing inter-board memory traffic, cache coherence and IO traffic, allowing you to create partitions of the whole machine built with any number of whole boards. The register setting, etc., is set up so that no code running on any of the processors can alter it or, usually, even tell it's there; a privileged console accesses them, under command of an operator, and that's it. HP, IBM and others have provided similar capabilities in large systems, often with the processors, memory, and IO in separate units, numbers of each assigned to different partitions.

Hardware partitioning has the big advantage that even hardware failures (for the most part) simply cannot propagate among partitions. With appropriate electrical design, you can even power-cycle one partition without affecting others. Software failures are of course also totally isolated within partitions (as long as one isn't performing a service for another, but that issue is on another plane of abstraction).

The big negative of hardware partitioning is that you usually cannot have very many of them. Even a single chip now contains multiple processors, so partitioning even by separate chips is far less granularity than is generally desirable. In fact, it's common to assign just a fraction of one CPU, and that can't be done without bending the notion of a hardware-isolated, power-cycle-able partition to the breaking point. In addition, there is always some hardware in common across the partition. For example, power supplies are usually shared, and whatever interconnects all the parts is shared; failure of that shared hardware cause all partitions to fail. (For more complete high availability, you need multiple completely separate physical computers, not under the same sprinkler head, preferably located on different tectonic plates, etc. depending on your personal level of paranoia.)

Despite its negatives, hardware partitioning is fairly simple to implement, useful, and still used. It or something like it, I speculate, is effectively what will be used for initial "virtualization" of GPUs when that starts appearing.

Level 2: Virtual Machines

This is the level of VMware and its kissin' cousins. All the hardware is shared en masse, and a special layer of software, a hypervisor, creates the illusion of multiple completely separate hardware platforms. Each runs its own copy of an operating system and any applications above that, and (ideally) none even knows that the others exist. I've previously written about how this trick can be performed without degrading performance to any significant degree, so won't go into it here.

The good news here is that you can create as many virtual machines as you like, independent of the number of physical processors and other physical resources – at least until you run out of resources. The hypervisor usually contains a scheduler that time-slices among processors, so sub-processor allocation is available. With the right hardware, IO can also fractionally allocated (again, see my prior posts).

The bad news is that you generally get much less hardware fault isolation than with hardware partitioning; if the hardware croaks, well, it's one basket and those eggs are scrambled. Very sophisticated hypervisors can help with that when there is appropriate hardware support (mainframe customers do get something for their money). In addition, and this is certainly obvious after it's stated: If you put N virtual machines on one physical machine, you are now faced with all the management pain of managing all N copies of the operating system and its applications.

This is the level often used in so-called desktop virtualization. In that paradigm, individuals don't own hardware, their own PC. Instead, they "own" a block of bits back on a server farm that happens to be the description of a virtual machine, and can request that their virtual machine be run from whatever terminal device happens to be handy. It might actually run back on the server, or might run on a local machine after downloading. Many users absolutely loathe this; they want to own and control their own hardware. Administrators like it, a lot, since it lets them own, and control, the hardware.

Level 3: Containers

This level was, as far as I know, originally developed by Sun Microsystems (Oracle), so I'll use their name for it: Containers. IBM (in AIX) and probably others also provide it, under different names.

With containers, you have one copy of the operating system code, but it provides environments, containers, which act like separate copies of the OS. In Unix/Linux terms, each container has its own file system root (including IO), process tree, shared segment naming space, and so on. So applications run as if they were running on their own copy of the operating system – but they are actually sharing one copy of the OS code, with common but separate OS data structures, etc.; this provides significant resource sharing that helps the efficiency of this level.

This is quite useful if you have applications or middleware that were written under the assumption that they were going to run on their own separate server, and as a result, for example, all use the same name for a temporary file. Were they run on the same OS, they would clobber each other in the common /tmp directory; in separate containers, they each have their own /tmp. More such applications exist than one would like to believe; the most quoted case is the Apache web server, but my information on that may be out of date and it may have been changed by now. Or not, since I'm not sure what the motivation to change would be.

I suspect container technology was originally developed in the Full Moon cluster single-system-image project, which needs similar capabilities. See my much earlier post about single-system-image if you want more information on such things.

In addition, there's just one real operating system to manage in this case, so management headaches are somewhat lessened. You do have to manage all those containers, so it isn't an N:1 advantage, but I've heard customers say this is a significant management savings.

A perhaps less obvious example of containerization is the multiuser BASIC systems that flooded the computer education system several decades back. There was one copy of the BASIC interpreter, run on a small minicomputer and used simultaneously by many students, each of whom had their own logon ID and wrote their own code. And each of whom could botch things up for everybody else with the wrong code that soaked up the CPU. (This happened regularly in the "computer lab" I supervised for a while.) I locate this in the container level rather than higher in the stack because the BASIC interpreter really was the OS: It ran on the bare metal, with no supervisor code below it.

Of course, fault isolation at this level is even less than in the prior cases. Now if the OS crashes, all the containers go down. (Or if the wrong thing is done in BASIC…) In comparison, an OS crash in a virtual machine is isolated to that virtual machine.

Level 4: Software Virtual Machines

We've reached the JVM level. It's also the .NET level, the Lisp level, the now more usual BASIC level, and even the CICS (and so on): the level of more-or-less programming-language based independent computing environments. Obviously, multiple of these can be run as applications under a single operating system image, each providing a separate environment for the execution of applications. At least this can be done in theory, and in many cases in practice; some environments were implemented as if they owned the computer they run on.

What you get out of this is, of course, a more standard programming environment that can be portable – run on multiple computer architectures – as well as extensions to a machine environment that provide services simplifying application development. Those extensions are usually the key reason this level is used. There's also a bit of fault tolerance, since if one of those dies of a fault in its support or application code, it need not always affect others, assuming a competent operating system implementation.

Fault isolation at this level is mostly software only; if one JVM (say) crashes, or the code running on it crashes, it usually doesn't affect others. Sophisticated hardware / firmware / OS can inject the ability to keep many of the software VMs up if a failure occurred that only affected one of them. (Mainframe again.)

Level 5: Multitenant / Multiuser Environment

Many applications allow multiple users to log in, all to the same application, with their own profiles, data collections, etc. They are legion. Examples include web-based email, Facebook,, Worlds of Warcraft, and so on. Each user sees his or her own data, and thinks he / she is doing things isolated from others except at those points where interaction is expected. They see their own virtual system – a very specific, particularized system running just one application, but a system apparently isolated from all others in any event.

The advantages here? Well, people pay to use them (or put up with advertising to use them). Aside from that, there is potentially massive sharing of resources, and, concomitantly, care must be taken in the software and system architecture to avoid massive sharing of faults.

All Together Now

Yes. You can have all of these levels of virtualization active simultaneously in one system: A hardware partition running a hypervisor creating a virtual machine that hosts an operating system with containers that each run several programming environments executing multi-user applications.

It's possible. There may be circumstances where it appears warranted. I don't think I'd want to manage it, myself. Imagining a performance tuning on a 5-layer virtualization cake makes me shudder. I once had a television system that had two volume controls in series: A cable set-top box had its volume control, feeding an audio system with its own. Just those two levels drove me nuts until I hit upon a setting of one of them that let the other, alone, span the range I wanted.

Virtualization and Cloud Computing

These levels relate to the usual IaaS/PaaS/SaaS (Infrastructure / Platform / Software as a Service) distinctions discussed in cloud computing circles, but are at a finer granularity than those.

IaaS relates to the bottom two layers: hardware partitioning and virtual machines. Those two levels, particularly virtual machines, make it possible to serve up raw computing infrastructure (machines) in a way that can utilize the underlying hardware far more efficiently than handing customers whole computers that they aren't going to use 100% of the time. As I've pointed out elsewhere, it is not a logical necessity that a cloud use this or some other form of virtualization; but in many situations, it is an economic necessity.

Software virtual machines are what PaaS serves up. There's a fairly close correspondence between the two concepts.

SaaS is, of course, a Multiuser environment. It may, however, be delivered by using software virtual machines under it.

Containers are a mix of IaaS and PaaS. It's doesn't provide pure hardware, but a plain OS is made available, and that can certainly be considered a software platform. It is, however, a fairly barren environment compared with what software virtual machines provide..


This post has been brought to you by my poor head, which aches every time I encounter yet another discussion over whether and how various forms of cloud computing do or do not use virtualization. Hopefully it may help clear up some of that confusion.

Oh, yes, and the obvious conclusion: There's more than one kind of virtualization, out there, folks.

Monday, November 15, 2010

The Cloud Got GPUs

Amazon just announced, on the first full day of SC10 (SuperComputing 2010), the availability of Amazon EC2 (cloud) machine instances with dual Nvidia Fermi GPUs. According to Amazon's specification of instance types, this "Cluster GPU Quadruple Extra Large" instance contains:

  • 22 GB of memory
  • 33.5 EC2 Compute Units (2 x Intel Xeon X5570, quad-core "Nehalem" architecture)
  • 2 x NVIDIA Tesla "Fermi" M2050 GPUs
  • 1690 GB of instance storage
  • 64-bit platform
  • I/O Performance: Very High (10 Gigabit Ethernet)
So it looks like the future virtualization features of CUDA really are for purposes of using GPUs in the cloud, as I mentioned in my prior post.

One of these XXXXL instances costs $2.10 per hour for Linux; Windows users need not apply. Or, if you reserve an instance for a year – for $5630 – you then pay just $0.74 per hour during that year. (Prices quoted from Amazon's price list as of 11/15/2010; no doubt it will decrease over time.)

This became such hot news that GPU was a trending topic on Twitter for a while.

For those of you who don't watch such things, many of the Top500 HPC sites – the 500 supercomputers worldwide that are the fastest at the Linpack benchmark – have nodes featuring Nvidia Fermi GPUs. This year that list notoriously includes, in the top slot, the system causing the heaviest breathing at present: The Tianhe-1A at the National Supercomputer Center in Tianjin, in China.

I wonder how well this will do in the market. Cloud elasticity – the ability to add or remove nodes on demand – is usually a big cloud selling point for commercial use (expand for holiday rush, drop nodes after). How much it will really be used in HPC applications isn't clear to me, since those are usually batch mode, not continuously operating, growing and shrinking, like commercial web services. So it has to live on price alone. The price above doesn't feel all that inexpensive to me, but I'm not calibrated well in HPC costs these days, and don't know how much it compares with, for example, the cost of running the same calculation on Teragrid. Ad hoc, extemporaneous use of HPC is another possible use, but, while I'm sure it exists, I'm not sure how much exists.

Then again, how about services running games, including the rendering? I wonder if, for example, the communications secret sauce used by OnLive to stream rendered game video fast enough for first-person shooters can operate out of Amazon instances. Even if it doesn't, games that can tolerate a tad more latency may work. Possibly games targeting small screens, requiring less rendering effort, are another possibility. That could crater startup costs for companies offering games over the web.

Time will tell. For accelerators, we certainly are living in interesting times.

Thursday, November 11, 2010

Nvidia Past, Future, and Circular

I'm getting tired about writing about Nvidia and its Fermi GPU architecture (see here and here for recent posts). So I'm going to just dump out some things I've considered for blog entries into this one, getting it all out of the way.

Past Fermi Product Mix

For those of you wondering about how much Nvidia's product mix is skewed to the low end, here's some data for Q3, 2010 from Investor Village:

Also, note that despite the raging hormones of high-end HPC, the caption indicates that their median and mean prices have decreased from Q2: They became more, not less, skewed towards the low end. As I've pointed out, this will be a real problem as Intel's and AMD's on-die GPUs assert some market presence, with "good enough" graphics for free – built into all PC chips. It won't be long now, since AMD has already started shipping its Zacate integrated-GPU chip to manufacturers.

Future Fermis

Recently Fermi's chief executive Jen-Hsun Huang gave an interview on what they are looking at for future features in the Fermi architecture. Things he mentioned were: (a) More development of their CUDA software; (b) virtual memory and pre-emption; (c) directly attaching InfiniBand, the leading HPC high-speed system-to-system interconnect, to the GPU. Taking these in that order:

More CUDA: When asked why not OpenCL, he said because other people are working on OpenCL and they're the only ones doing CUDA. This answer ranks right up there in the stratosphere of disingenuousness. What the question really meant was why they don't work to make OpenCL, a standard, work as well as their proprietary CUDA on their gear? Of course the answer is that OpenCL doesn't get them lock-in, which one doesn't say in an interview.

Virtual memory and pre-emption: A GPU getting a page fault, then waiting while the data is loaded from main memory, or even disk? I wouldn't want to think of the number of threads it would take to cover that latency. There probably is some application somewhere for which this is the ideal solution, but I doubt it's the main driver. This is a cloud play: Cloud-based systems nearly all use virtual machines (for very good reason; see the link), splitting one each system node into N virtual machines. Virtual memory and pre-emption allows the GPU to participate in that virtualization. The virtual memory part is, I would guess, more intended to provide memory mapping, so applications can be isolated from one another reliably and can bypass issues of contiguous memory allocation. It's effectively partitioning the GPU, which is arguably a form of virtualization. [UPDATE: Just after this was published, John Carmak (of Id Software ) wrote a piece laying out the case for paging into GPUs. So that may be useful in games and generally.]

 Direct InfiniBand attachment: At first glance, this sounds as useful as tits on a boar hog (as I occasionally heard from older locals in Austin). But it is suggested, a little, by the typical compute cycle among parallel nodes in HPC systems. That often goes like this: (a) Shove data from main memory out to the GPU. (b) Compute on the GPU. (c) Suck data back from GPU into main memory. (d) Using the interconnect between nodes, send part of that data from main memory to the main memory in other compute nodes, while receiving data into your memory from other compute nodes. (e) Merge the new data with what's in main memory. (f) Test to see if everybody's done. (g) If not, done, shove resulting new data mix in main memory out to the GPU, and repeat. At least naively, one might think that the copying to and from main memory could be avoided since the GPUs are the ones doing all the computing: Just send the data from one GPU to the other, with no CPU involvement. Removing data copying is, of course, good. In practice, however, it's not quite that straightforward; but it is at least worth looking at.

So, that's what may be new in Nvidia CUDA / Fermi land. Each of those are at least marginally justifiable, some very much so (like virtualization). But stepping back a little from these specifics, this all reminds me of dueling Nvidia / AMD (ATI) announcements of about a year ago.

That was the time of the Fermi announcement, which compared with prior Nvidia hardware doubled everything, yada yada, and added… ECC. And support for C++ and the like, and good speed double-precision floating-point.

At that time, Tech Report said that the AMD Radeon HD 5870 doubled everything, yada again, and added … a fancy new anisotropic filtering algorithm for smoothing out texture applications at all angles, and supersampling to better avoid antialiasing.

Fine, Nvidia doesn't think much of graphics any more. But haven't they ever heard of the Wheel of Reincarnation?

The Wheel of Reincarnation

The wheel of reincarnation is a graphics system design phenomenon discovered all the way back in 1968 by T. H. Meyers and Ivan Sutherland. There are probably hundreds of renditions of it floating around the web; here's mine.

Suppose you want to use a computer to draw pictures on a display of some sort. How do you start? Well, the most dirt-simple, least hardware solution is to add an IO device which, prodded by the processor with X and Y coordinates on the device, puts a dot there. That will work, and actually has been used in the deep past. The problem is that you've now got this whole computer sitting there, and all you're doing with it is putting stupid little dots on the screen. It could be doing other useful stuff, like figuring out what to draw next, but it can't; it's 100% saturated with this dumb, repetitious job.

So, you beef up your IO device, like by adding the ability to go through a whole list of X, Y locations and putting dots up at each specified point. That helps, but the computer still has to get back to it very reliably every refresh cycle or the user complains. So you tell it to repeat. But that's really limiting. It would be much more convenient if you could tell the device to go do another list all by itself, like by embedding the next list's address in block of X,Y data. This takes a bit of thought, since it means adding a code to everything, so the device can tell X,Y pairs from next-list addresses; but it's clearly worth it, so in it goes.

Then you notice that there are some graphics patterns that you would like to use repeatedly. Text characters are the first that jump out at you, usually. Hmm. That code on the address is kind of like a branch instruction, isn't it? How about a subroutine branch? Makes sense, simplifies lots of things, so in it goes.

Oh, yes, then some of those objects you are re-using would be really more useful if they could be rotated and scaled… Hello, arithmetic.

At some stage it looks really useful to add conditionals, too, so…

Somewhere along the line, to make this a 21st century system, you get a frame buffer in there, too, but that's kind of an epicycle; you write to that instead of literally putting dots on the screen. It eliminates the refresh step, but that's all.

Now look at what you have. It's a Turing machine. A complete computer. It's got a somewhat strange instruction set, but it works, and can do any kind of useful work.

And it's spending all its time doing nothing but putting silly dots on a screen.

How about freeing it up to do something more useful by adding a separate device to it to do that?

This is the crucial point. You've reached the 360 degree point on the wheel, spinning off a graphics processor on the graphics processor.

Every incremental stage in this process was very well-justified, and Meyers and Sutherland say they saw examples (in 1968!) of systems that were more than twice around the wheel: A graphics processor hanging on a graphics processor hanging on a graphics processor. These multi-cycles are often justified if there's distance involved; in fact, in these terms, a typical PC on the Internet can be considered to be twice around the wheel: It's got a graphics processor on a processor that uses a server somewhere else.

I've some personal experience with this. For one thing, back in the early 70s I worked for Ivan Sutherland at then-startup Evans and Sutherland Computing Corp., out in Salt Lake City; it was a summer job while I was in grad school. My job was to design nothing less than an IO system on their second planned graphics system (LDS-2). It was, as was asked for, a full-blown minicomputer-level IO system, attached to a system whose purpose in life was to do nothing but put dots on a screen. Why an IO system? Well, why bother the main system with trivia like keyboard and mouse (light pen) interrupts? Just attach them directly to the graphics unit, and let it do the job.

Just like Nvidia is talking about attaching InfiniBand directly to its cards.

Also, in the mid-80s in IBM Research, after the successful completion of an effort to build special-purpose parallel hardware system of another type (a simulator), I spent several months figuring out how to bend my brain and software into using it for more general purposes, with various and sundry additions taken from the standard repertoire of general-purpose systems.

Just like Nvidia is adding virtualization to its systems.

Each incremental step is justified – that's always the case with the wheel – just as in the discussion above, I showed a justification for every general-purpose additions to Nvidia architecture are justifiable.

The issue here is not that this is all necessarily bad. It just is. The wheel of reincarnation is a factor in the development over time of every special-purpose piece of hardware. You can't avoid it; but you can be aware that you are on it, like it or not.

With that knowledge, you can look back at what, in its special-purpose nature, made the original hardware successful – and make your exit from the wheel thoughtfully, picking a point where the reasons for your original success aren't drowned out by the complexity added to chase after ever-widening, and ever more shallow, market areas. That's necessary if you are to retain your success and not go head-to-head with people who have, usually with far more resources than you have, been playing the general-purpose game for decades.

It's not clear to me that Nvidia has figured this out yet. Maybe they have, but so far, I don't see it.

Sunday, October 17, 2010

RIP, Benoit Mandelbrot, father of fractal geometry

Benoit Mandelbrot, father of fractal geometry, has died.

See my post about him, and my interaction with him, in my mostly non-technical blog, Random Gorp: RIP, Benoit Mandelbrot, father of fractal geometry.

Saturday, September 4, 2010

Intel Graphics in Sandy Bridge: Good Enough

As I and others expected, Intel is gradually rolling out how much better the graphics in its next generation will be. Anandtech got an early demo part of Sandy Bridge and checked out the graphics, among other things. The results show that the "good enough" performance I argued for in my prior post (Nvidia-based Cheap Supercomputing Coming to an End) will be good enough to sink third party low-end graphics chip sets. So it's good enough to hurt Nvidia's business model, and make their HPC products fully carry their own development burden, raising prices notably.

The net is that for this early chip, with early device drivers, at low, but usable resolution (1024x768) there's adequate performance on games like "Batman: Arkham Asylum," "Call of Duty MW2," and a bunch of others, significantly including "Worlds of Warfare." And it'll play Blue-Ray 3D, too.

Anandtech's conclusion is "If this is the low end of what to expect, I'm not sure we'll need more than integrated graphics for non-gaming specific notebooks." I agree. I'd add desktops, too. Nvidia isn't standing still, of course; on the low end they are saying they'll do 3D, too, and will save power. But integrated graphics are, effectively, free. It'll be there anyway. Everywhere. And as a result, everything will be tuned to work best on that among the PC platforms; that's where the volumes will be.

Some comments I've received elsewhere on my prior post have been along the lines of "but Nvidia has such a good computing model and such good software support – Intel's rotten IGP can't match that." True. I agree. But.

There's a long history of ugly architectures dominating clever, elegant architectures that are superior targets for coding and compiling. Where are the RISC-based CAD workstations of 15+ years ago? They turned into PCs with graphics cards. The DEC Alpha, MIPS, Sun SPARC, IBM POWER and others, all arguably far better exemplars of the computing art, have been trounced by X86, which nobody would call elegant. Oh, and the IBM zSeries, also high on the inelegant ISA scale, just keeps truckin' through the decades, most recently at an astounding 5.2 GHz.

So we're just repeating history here. Volume, silicon technology, and market will again trump elegance and computing model.

PostScript: According to Bloomberg, look for a demo at Intel Developer Forum next week.

Wednesday, August 11, 2010

Nvidia-based Cheap Supercomputing Coming to an End

Nvidia's CUDA has been hailed as "Supercomputing for the Masses," and with good reason. Amazing speedups on scientific / technical code have been reported, ranging from a mere 10X through hundreds. It's become a darling of academic computing and a major player in DARPA's Exascale program, but performance alone is not the reason; it's price. For that computing power, they're incredibly cheap. As Sharon Glotzer of UMich noted, "Today you can get 2GF for $500. That is ridiculous." It is indeed. And it's only possible because CUDA is subsidized by sinking the fixed costs of its development into the high volumes of Nvidia's mass market low-end GPUs.

Unfortunately, that subsidy won't last forever; its end is now visible. Here's why:

Apparently ignored in the usual media fuss over Intel's next and greatest, Sandy Bridge, is the integration of Intel's graphics onto the same die as the processor chip.

The current best integration is onto the same package, as illustrated in the photo of the current best, Clarkdale (a.k.a. Westmere), as shown in the photo on the right. As illustrated, the processor is in 32nm silicon technology, and the graphics, with memory controller, is in 45nm silicon technology. Yes, the graphics and memory controller is the larger chip.

Intel has not been touting higher graphics performance from this tighter integration. In fact, Intel's press releasers for Clarkdale claimed that being on two die wouldn't reduce performance because they were in the same package. But unless someone has changed the laws of physics as I know them, that's simply false; at a minimum, eliminating off-chip drivers will reduce latency substantially. Also, being on the same die as the processor implies the same process, so graphics (and memory control) goes all the way from 45nm to 32nm, the same as the processor, in one jump; this certainly will also result in increased performance. For graphics, this is a very loud the Intel "Tock" in its "Tick-Tock" (architecture / silicon) alternation.

So I'll semi-fearlessly predict some demos of midrange games out of Intel when Sandy Bridge is ready to hit the streets, which hasn't been announced in detail aside from being in 2011.

Probably not coincidentally, mid-2011 is when AMD's Llano processor sees daylight. Also in 32nm silicon, it incorporates enough graphics-related processing to be an apparently decent DX11 GPU, although to my knowledge the architecture hasn't been disclosed in detail.

Both of these are lower-end units, destined for laptops, and intent on keeping a tight power budget; so they're not going to run high-end games well or be a superior target for HPC. It seems that they will, however, provide at least adequate low-end, if not midrange, graphics.

Result: All of Nvidia's low-end market disappears by the end of next year.

As long as passable performance is provided, integrated into the processor equates with "free," and you can't beat free. Actually, it equates with cheaper than free, since there's one less chip to socket onto the motherboard, eliminating socket space and wiring costs. The power supply will probably shrink slightly, too.

This means the end of the low-end graphics subsidy of high-performance GPGPUs like Nvidia's CUDA. It will have to pay its own way, with two results:

First, prices will rise. It will no longer have a huge advantage over purpose-built HPC gear. The market for that gear is certainly expanding. In a long talk at the 2010 ISC in Berlin, Intel's Kirk Skaugan (VP of Intel Architecture Group and GM, Data Center Group, USA) stated that HPC was now 25% of Intel's revenue – a number double the HPC market I last heard a few years ago. But larger doesn't mean it has anywhere near the volume of low-end graphics.

DARPA has pumped more money in, with Nvidia leading a $25M chunk of DARPA's Exascale project. But that's not enough to stay alive. (Anybody remember Thinking Machines?)

The second result will be that Nvidia become a much smaller company.

But for users, it's the loss of that subsidy that will hurt the most. No more supercomputing for the masses, I'm afraid. Intel will have MIC (son of Larrabee); that will have a partial subsidy since it probably can re-use some X86 designs, but that's not the same as large low-end sales volumes.

So enjoy your "supercomputing for the masses," while it lasts.

Thursday, July 29, 2010

Standards Are About the Money

Nonstandard Cloud
Standards for cloud computing are a never-ending topic of cloud buzz ranging all over the map: APIs (programming interfaces), system management, legal issues, and so on.

With a few exceptions where the motivation is obvious (like some legal issues in the EU), most of these discussions miss a key point: Standards are implemented and used if and only if they make money for their implementers.

Whether customers think they would like them is irrelevant – unless that liking is strong enough to clearly translate into increased sales, paying back the cost of defining and implementing appropriate standards. "Appropriate" always means "as close to my existing implementation as possible" to minimize implementation cost.

That is my experience, anyway, having spent a number of years as a company representative to the InfiniBand Trade Association and the PCI-SIG, along with some interaction with the PowerPC standard and observation of DMTF and IETF standards processes.

Right now there's an obvious tension, since cloud customers see clear benefits to having an industry-wide, stable implementation target that allows portability among cloud system vendors, a point well-detailed in the Berkeley report on cloud computing.

That's all very nice, but unless the cloud system vendors see where the money is coming from, standards aren't going to be implemented where they count. In particular, when there are major market leaders, like Amazon and Google right now, it has to be worth more to those leaders than the lock-in they get from proprietary interfaces. I've yet to see anything indicating that they will, so am not very positive about cloud standards at present time.

But it could happen. The road to any given standard is very often devious, always political, regularly suffused with all kinds of nastiness, and of course ultimately driven throughout by good old capitalist greed. An example I'm rather familiar with is the way InfiniBand came to be, and semi-failed.

The beginning was a presentation by Justin Rattner at the 1998 Intel Developer Forum, in which he declared Intel's desire for their micros to grow up to be mainframes (mmmm… really juicy profit margins!). He thought they had everything except for IO. Busses were bad. He actually showed a slide with a diagram that could have come right out of an IBM Parallel Sysplex white paper, complete with channels and channel directors (switches) connecting banks of storage with banks of computers. That was where we need to go, he said, at a commodity price point. 

Shortly thereafter, Intel founded the Next Generation IO Forum (NGIO), inviting
 other companies to join in the creation of this new industry IO standard. That sounds fine, and rather a step better than IBM did when trying to foist Microchannel architecture on the world (a dismal failure), until you read the fine print in the membership agreement. There you find a few nasties. Intel had 51% of every vote. Oh, and if you have any intellectual property (IP) (patents) in the area, they now all belonged to Intel. Several companies did join, like Dell; they like to be "tightly integrated" with their suppliers.

A few folks with a tad of IP in the IO area, like IBM and Compaq (RIP), understandably declined to join. But they couldn't just let Intel go off and define something they would then have to license. So a collection of companies – initially Compaq, HP, and IBM – founded the rival Future IO Developer's Forum (FIO). Its membership agreement was much more palatable: One company, one vote; and if you had IP that was used, you had to promise to license it with terms that were "reasonable and nondiscriminatory," a phrase that apparently means something quite specific to IP lawyers.

Over the next several months, there was a steady movement of companies out of NGIO and into FIO. When NGIO became only Intel and Dell (still tightly integrated), the two merged as the InfiniBand Trade Association (IBTA). They even had a logo for the merger itself! (See picture.) The name "InfiniBand" was dreamed up by a multi-company collection of marketing people, by the way; when a technical group member told them he thought it was a great name (a lie) they looked worried. The IBTA had, in a major victory for the FIO crowd, the same key terms and conditions as FIO. In addition, Roberts' rules of order were to be used, and most issues were to be decided by a simple majority (of companies).

Any more questions about where the politics comes in? Let's cover devious and nasty with a sub-story:

While on one of the IBTA groups, during a contentious discussion I happened to be leading for one side, I mentioned I was going on vacation for the next two weeks. The first day I was on vacation a senior-level executive of a company on the other side in the dispute, an executive not at all directly involved in IBTA, sent an email to another senior-level executive in a completely different branch of IBM, a branch with which the other company did a very large amount of business. It complained that I "was not being cooperative" and I had said on the IBTA mailing lists that certain IBM products were bad in some way. The obvious intent was that it be forwarded to my management chain through layers of people who didn't understand (or care) what was really going on, just that I had made this key customer unhappy and had dissed IBM products. At the very least, it would have chewed up my time disentangling the mess left after it wandered around forwards for two weeks (I was on vacation, remember?); at worst, it could have resulted in orders to me to be more "cooperative," and otherwise undermined me within my own company. Fortunately, and to my wife's dismay, I had taken my laptop on vacation and watched email; and a staff guy in the different division intercepted that email, forwarded it directly to me, and asked what was going on. As a result, I could nip it all in the bud.

It's sad and perhaps nearly unbelievable that precisely the same tactic – complain at a high level through an unrelated management chain – had been used by that same company against someone else who was being particularly effective against them.

Another, shorter, story: A neighbor of mine who was also involved in a similar inter-company dispute told me that, while on a trip (and he took lots of trips; he was a regional sales manager) he happened to return to his hotel room after checking out and found people going through his trash, looking for anything incriminating.

Standards can be nasty.

Anyway, after a lot of the dust settled and IB had taken on a fairly firm shape, Intel dropped development of its IB product. Exactly why was never explicitly stated, but the consensus I heard was that compared with others' implementations in progress it was not competitive. Without the veto power of NGIO, Intel couldn't shape the standard to match what it was implementing. With Intel out, Microsoft followed suit, and the end result was InfiniBand as we see it today: A great interconnect for high-end systems that pervades HPC, but not the commodity-volume server part the founders hoped that it would be. I suspect there are folks at Intel who think they would have been more successful at achieving the original purpose if they had their veto, since then it would have matched their inexpensive parts. I tend to doubt that, since in the meantime PCI has turned into a hierarchical switched fabric (PCI Express), eliminating many of the original problems stemming from it being a bus.

All this illustrates what standards are really about, from my perspective. Any relationship with pristine technical discussions or providing the "right thing" for customers is indirect, with all motivations leading through money – with side excursions through political, devious, and just plain nasty.

Thursday, July 15, 2010

OnLive Follow-Up: Bandwidth and Cost

As mentioned earlier in OnLive Works! First Use Impressions, I've tried OnLive, and it works quite well, with no noticeable lag and fine video quality. As I've discussed, this could affect GPU volumes, a lot, if it becomes a market force, since you can play high-end games with a low-end PC. However, additional testing has confirmed that users will run into bandwidth and data usage issues, and the cost is not what I'd like for continued use.

To repeat some background, for completeness: OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. It lets you run the highest-end games on very inexpensive systems, avoiding the cost of a rip-roaring gamer system. I've noted previously that this could hurt the mass market for GPUs, since OnLive doesn't need much graphics on the client. But there were serious questions (see my post Twilight of the GPU?) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?

As I said earlier, and can re-confirm: Video, check. I found no problems there; no artifacts, including in displayed text. Lag, hence gameplay, is perfectly adequate, at least for my level of skill. Those with sub-millisecond reflexes might feel otherwise; I can't tell. There's confirmation of the low lag from Eurogamer, which measured it at "150ms - similar to playing … locally".


Bandwidth, on the other hand, does not present a pretty picture.

When I was playing or watching action, OnLive continuously ran at about 5.8% - 6.4% utilization of a 100 Mb/sec LAN card. (OnLive won't run on WiFi, only on a wired connection.) This rate is very consistent. Displayed image resolution didn't cause it to vary outside that range, whether it was full-screen on my 1600 x 900 laptop display, full-screen on my 1920 x 1080 monitor, or windowed to about half the laptop screen area (which was the window size OnLive picked without input from me). When looking at static text displays, like OnLive control panels, it dropped down to a much smaller amount, in the 0.01% range; but that's not what you want to spend time doing with a system like this.

I observed these values playing (Borderlands) and watching game trailers for a collection of "coming soon" games like Deus Ex, Drive, Darksiders, Two Worlds, Driver, etc. If you stand still in a non-action situation, it does go down to about 3% (of 100 Mb/sec) for me, but with action games that isn't the point.

6.4% of 100 Mb/sec is about 2.9 GB (bytes) per hour. That hurts.

My ISP, Comcast, considers over 250 GB/month "excessive usage" and grounds for terminating your account if you keep doing it regularly. That limit and OnLive's bandwidth together mean that over a 30-day period, Comcast customers can't play more than 3 hours a day without being considered "excessive."


I also found that prices are not a bargain, unless you're counting the money you save using a bargain PC – one that costs, say, what a game console costs.

First, you pay for access to OnLive itself. For now that can be free, but after a year it's slated to be $4.95 a month. That's scarcely horrible. But you can't play anything with just access; you need to also buy a GamePass for each game you want to play.

A Full GamePass, which lets you play it forever (or, presumably, as long as OnLive carries the game) is generally comparable to the price of the game itself, or more for the PC version. For example, the Borderlands Full GamePass is $29.99, and the game can be purchased for $30 or less (one site lists it for $3! (plus about $9 shipping)). F.E.A.R. 2 is $19.99 GamePass, and the purchase price is $19-$12. Assassin's Creed II was a loser, with GamePass for $39.99 and purchased game available for $24-$17. The standalone game prices are from online sources, and don't include shipping, so OnLive can net a somewhat smaller total. And you can play it on a cheap PC, right? Hmmm. Or a console.

There are also, in many cases, 5 day and 3 day passes, typically $9-$7 for 5-day and $4-$6 for 3-day. As a try before you buy, maybe those are OK, but 30 minute free demos are available, too, making a reasonably adequate try available for free.

Not all the prices are that high. There's something called AAAAAAA, which seems to consist entirely of falling from tall buildings, with a full GamePass for $9.99; and Brain Challenge is $4.99. I'll bet Brain Challenge doesn't use much bandwidth, either.

The correspondence between Full GamePass and the retail price is obviously no coincidence. I wouldn't be surprised at all to find that relationship to be wired into the deals OnLive has with game publishers. Speculation, since I just don't know: Do the 5 or 3 day pass prices correspond to normal rental rates? I'd guess yes.

Simplicity & the Mac Factor

A real plus for OnLive is simplicity. Installation is just pure dead simple, and so is starting to play. Not only do you not have to acquire the game, there's no installation and no patching; you just select the game, get a PayPass (zero time with a required pre-registered credit card), and go. Instant gratification.

Then there's the Mac factor. If you have only Apple products – no console and no Windows PC – you are simply shut out of many games unless you pursue the major hassle of BootCamp, which also requires purchasing a copy of Windows and doing the Windows maintenance. But OnLive runs on Macs, so a wide game experience is available to you immediately, without a hassle.


To sum up:

Positive: great video quality, great playability, hassle-free instant gratification, and the Mac factor.

Negative: Marginally competitive game prices (at best) and bandwidth, bandwidth, bandwidth. The cost can be argued, and may get better over time, but your ISP cutting you off for excessive data usage is pretty much a killer.

So where does this leave OnLive and, as a consequence, the market for GPUs? I think the bandwidth issue says that OnLive will have little impact in the near future.

However, this might change. Locally, Comcast TV ads showing off their "Xfinity" rebranding had a small notice indicating that 105 Mb data rates would be available in the future. It seems those have disappeared, so maybe it won't happen. But a 10X data rate improvement wouldn't mean much if you also didn't increase the data usage cap, and a 10X usage cap increase would completely eliminate the bandwidth issue.

Or maybe the Net Neutrality guys will pick this up and succeed. I'm not sure on that one. It seems like trying to get water from a stone if the backbone won't handle it, but who knows?

The proof, however, is in the playing and its market share, so we can just watch to see how this works out. The threat is still there, just masked by bandwidth requirements.

(And I still think virtual worlds should evaluate this technology closely. Installation difficulty is a key inhibitor to several markets there, forcing extreme measures – like shipping laptops already installed – in one documented case; see Living In It: A Tale of Learning in Second Life.)

Monday, July 12, 2010

Who Does the Shoe Fit? Functionally Decomposed Hardware (GPGPUs) vs. Multicore.

This post is a long reply to the thoughtful comments on my post WNPoTs and the Conservatism of Hardware Development that were made by Curt Sampson and Andrew Richards. The issue is: Is functionally decomposed hardware, like a GPU, much harder to deal with than a normal multicore (SMP) system? (It's delayed. Sorry. For some reason I ended up in a mental deadlock on this subject.)

I agree fully with Andrew and Curt that using functionally decomposed hardware can be straightforward if the hardware performs exactly the function you need in the program. If it does not, massive amounts of ingenuity may have to be applied to use it. I've been there and done that, trying at one point to make some special-purpose highly-parallel hardware simulation boxes do things like chip wire routing or more general computing. It required much brain twisting and ultimately wasn't that successful.

However, GPU designers have been particularly good at making this match. Andrew made this point very well in a video'd debate over on Charlie Demerjian's SemiAccurate blog: Last minute changes that would be completely anathema to GP designs are apparently par for the course with GPU designs.

The embedded systems world has been dealing with functionally decomposed hardware for decades. In fact, a huge part of their methodology is devoted to figuring out where to put a hardware-software split to match their requirements. Again, though, the hardware does exactly what's needed, often through last-minute FPGA-based hardware modifications.

However, there's also no denying that the mainstream of software development, all the guys who have been doing Software Engineering and programming system design for a long time, really don't have much use for anything that's not an obvious Turing Machine onto which they can spin off anything they want. Traditional schedulers have a rough time with even clock speed differences. So, for example, traditional programmers look at Cell SPUs, with their manually-loaded local memory, and think they're misbegotten spawn on the devil or something. (I know I did initially.)

This train of thought made me wonder: Maybe traditional cache-coherent MP/multicore actually is hardware specifically designed for a purpose, like a GPU. That purpose is, I speculate, transaction processing. This is similar to a point I raised long ago in this blog (IT Departments Should NOT Fear Multicore), but a bit more pointed.

Don't forget that SMPs have been around for a very long time, and practically from their inception in the early 1970s were used transparently, with no explicit parallel programming and code very often written by less-than-average programmers. Strongly enabling that was a transaction monitor like IBM's CICS (and lots of others). All code is written as a relatively small chunk (debit this account) (and the cash on hand, and total cash in a bank…). That chunk is automatically surrounded by all locking it needs, called by the monitor when a customer implicitly invokes it, and can be backed out as needed either by facilities built into the monitor or by a back-end database system.

It works, and it works very well right up to the present, even with programmers so bad it's a wonder they don't make the covers fly off the servers. (OK, only a few are that bad, but the point is that genius is not required.)

Of course, transaction monitors aren't a language or traditional programming construct, and also got zero academic notice except perhaps for Jim Gray. But they work, superbly well on SMP / multicore. They can even work well across clusters (clouds) as long as all data is kept in a separate backend store (perhaps logically separate), which model, by the way, is the basis of a whole lot of cloud computing.

Attempts to make multicores/SMPs work in other realms, like HPC, have been fairly successful but have always produced cranky comments about memory bottlenecks, floating-point performance, how badly caches fit the requirements, etc., comments you don't hear from commercial programmers. Maybe this is because it was designed for them? That question is, by the way, deeply sarcastic; performance on transactional benchmarks (like TPC's) are the guiding light and laser focus of most larger multicore / SMP designs.

So, overall, this post makes a rather banal point: If the hardware matches your needs, it will be easy to use. If it doesn't, well, the shoe just doesn't fit, and will bring on blisters. However, the observation that multicore is actually a special purpose device, designed for a specific purpose, is arguably an interesting perspective.

Tuesday, July 6, 2010

OnLive Works! First Use Impressions

I've tried OnLive, and it works. At least for the games I tried, it seems to work quite well, with no noticeable lag and fine video quality. But I'm not sure about the bandwidth issue yet, or the cost.

OnLive is a service that runs games on their servers up in the cloud, streaming the video to your PC or Mac. I've noted previously that this could hurt the mass market for GPUs, since it doesn't need much graphics on the client. But there were serious questions (see my post Twilight of the GPU?) as to whether they could overcome bandwidth and lag issues: Can OnLive respond to your inputs fast enough for games to be playable? And could its bandwidth requirements be met with a normal household ISP?

As I said above: Lag, check. Video, check. I found no problems there. Bandwidth, inconclusive. Cost, ditto. More data will answer those, but I've not had the chance to gather it yet. Here's what I did:

I somehow was "selected" from their wait-list as an OnLive founding member, getting me free access for a year – which doesn't mean I play totally free for a year; see below – and tried it out today, playing free 30-minute demos of Assassin's Creed II a little bit, and Borderlands enough for a good impression.

Assassin's Creed II was fine through initial cutscenes and minor initial movement. But when I reached the point where I was reborn as a player in medieval times, I ran into a showstopper. As an introduction to the controls, the game wanted me to press <squiggle_icon> to move my legs. <squiggle_icon>, unfortunately, corresponds to no key on my laptop. I tried everything plus shift, control, and alt variations, and nothing worked. In the process I accidentally created a brag clip, went back to the OnLive dashboard, and did some other obscure things I never did figure out, but never did move my legs. I moved my arms with about four different key combinations, but the game wasn't satisfied with that. So I ditched it. For all I know there's something on the OnLive web site explaining this, but I didn't look enough to find it.

I was much more successful with Borderlands, a post-apocalyptic first-person shooter. I completed the initial training mission, leveled up, and was enjoying myself when the demo time – 30 minutes, which I consider adequately generous – ran out. Targeting and firing seemed to be just as good as native games on my system. I played both in a window and in fullscreen mode, and at no time was there noticeable lag or any visual artifacts. It just played smoothly and nicely.

I wanted to try Dragon Age – I'm more of an RPG guy – but while it shows up on the web site, I couldn't find it among the games available for play on the live system.

This is not to say there weren't hassles and pains involved in getting going. Here are some details.

First, my environment: The system I used is a Sony Vaio VGN-2670N, with Intel Core Duo @ 2.66 GHz, a 1600x900 pixel display, with 4GB RAM and an Nvidia GeForce 9300M; but the Nvidia display adapter wasn't being used. For those of you wondering about speed-of-light delays, my location is just North of Denver, CO, so this was all done more than 1000 miles from the closest server farm they have (Dallas, TX). My ISP is Comcast cable, nominally providing 10 Mb/sec; I have seen it peak as high as 15 Mb/sec in spurts during downloads. My OS is 32-bit Windows Vista. (I know…)

There was a minor annoyance at the start, since their client installer refuses to even try using Google Chrome as the browser. IE, Firefox, and Safari are supported. But that only required me to use IE, which I shun, for the install; it's not used running the client.

The much bigger pain is that OnLive adamantly refuses to run over Wifi. The launcher checks, gives you one option – exit – and points you to a FAQ, which pointer gets a 404 (page not found). I did find the relevant FAQ manually on the web site. There they apologize and say it "does indeed work well with good quality Wi-Fi connections, and in the future OnLive will support wireless" but initially they're scared of bad packet-dropping low-signal-strength crud. I can understand this; they're fighting an uphill battle convincing people this works at all, and do not need a multitude complaining they don't work when the problem is crummy Wi-Fi. (Or WiFi in a coffee shop – a more serious issue; see bandwidth discussion below.)

Nevertheless, this is a pain for me. I had to go down in the basement and set up a chair where my router is, next to my water heater, to get a wired connection. When I did go down there, after convincing Vista (I know!) to actually use the wired connection, things went as described above.

That leaves one question: Bandwidth. My ISP, Comcast, has a 250 GB/month limit beyond which I am an "excessive user" and apparently get a stern talking-to, followed by account termination if I don't mend my excessive ways. Up to now, this has been far from an issue. With OnLive, it may be a significant limitation.

Unfortunately, I didn't monitor my network use carefully when using OnLive, and ran out of time to go back and do better monitoring. I'll report more when I've done that. However, checking some numbers provided by Comcast after the fact, I can see the possibility that averaging four hours a day is all the OnLive I could do and not get terminated, since my hour of use may (just may) have sucked down 2 GB. This could be a significant issue, limiting OnLive to only very casual users, but I need better measurement to be sure.

This also points to a reason for not initially allowing Wifi that they didn't mention: I doubt your local free Wifi hot spot in a Starbucks or McDonald's is really up to the task of serving several OnLive players all day.

Finally, there's cost. What I have free is access to the OnLive system; after a year that's $4.95/month (which may be a "founding member" deal). But to play other than a free demo, I need to purchase a PlayPass for each game played. I didn't do that, and still need to check that cost. Sorry, time limitations again.

So where does this leave the market for GPUs? With the information I have so far, all I can say is that the verdict is inconclusive. I think they really have the lag and display issues licked; those just aren't a problem. If I'm wrong about the bandwidth (entirely possible), and the PlayPasses don't cost too much, it could over time deal a large blow to the mass market for GPUs, which among other problems would sink the volumes that make them relatively inexpensive for HPC use.

On the other hand, if the bandwidth and cost make OnLive suitable only for very casual gaming, there may actually be a positive effect on the GPU market, since OnLive could be used as a very good "try before you buy" facility. It worked for me; I've been avoiding first-person shooters in favor of RPGs, but found the Borderlands demo to be a lot more fun than I expected.

Finally, I'll just note that Second Life recently changed direction and is saying they're going to move to a browser-based client. They, and other virtual world systems, might do well to consider instead a system using this type of technology. It would expand the range of client systems dramatically, and, even though there is a client, simplify use dramatically.

Monday, June 14, 2010

WNPoTs and the Conservatism of Hardware Development

There are some things about which I am undoubtedly considered a crusty old fogey, the abominable NO man, an ostrich with its head in the sand, and so on. Oh frabjous day! I now have a word for such things, courtesy of Charlie Stross, who wrote:

Just contemplate, for a moment, how you'd react to some guy from the IT sector walking into your place of work to evangelize a wonderful new piece of technology that will revolutionize your job, once everybody in the general population shells out £500 for a copy and you do a lot of hard work to teach them how to use it, And, on closer interrogation, you discover that he doesn't actually know what you do for a living; he's just certain that his WNPoT is going to revolutionize it. Now imagine that this happens (different IT marketing guy, different WNPoT, same pack drill) approximately once every two months for a five year period. You'd learn to tune him out, wouldn't you?
I've been through that pack drill more times than I can recall, and yes, I tune them out. The WNPoTs in my case were all about technology for computing itself, of course. Here are a few examples; they are sure to step on number of toes:

  • Any new programming language existing only for parallel processing, or any reason other than making programming itself simpler and more productive (see my post 101 parallel languages)
  • Multi-node single system image (see my post Multi-Multicore Single System Image)
  • Memristors, a new circuit type. A key point here is that exactly one company (HP) is working on it. Good technologies instantly crystallize consortia around themselves. Also, HP isn't a silicon technology company in the first place.
  • Quantum computing. Primarily good for just one thing: Cracking codes.
  • Brain simulation and strong artificial intelligence (really "thinking," whatever that means). Current efforts were beautifully characterized by John Horgan, in a SciAm guest blog: 'Current brain simulations resemble the "planes" and "radios" that Melanesian cargo-cult tribes built out of palm fronds, coral and coconut shells after being occupied by Japanese and American troops during World War II.'
Of course, for the most part those aren't new. They get re-invented regularly, though, and drooled over by ahistorical evalgelists who don't seem to understand that if something has already failed, you need to lay out what has changed sufficiently that it won't just fail again.

The particular issue of retred ideas aside, genuinely new and different things have to face up to what Charlie Stross describes above, in particular the part about not understanding what you do for a living. That point, for processor and system design, is a lot more important than one might expect, due to a seldom-publicized social fact: Processor and system design organizations are incredibly, insanely, conservative. They have good reason to be. Consider:

Those guys are building some of the most, if not the most, intricately complex structures ever created in the history of mankind. Furthermore, they can't be fixed in the field with an endless stream of patches. They have to just plain work – not exactly in the first run, although that is always sought, but in the second or, at most, third; beyond that money runs out.

The result they produce must also please, not just a well-defined demographic, but a multitude of masters from manufacturing to a wide range of industries and geographies. And of course it has to be cost- and performance-competitive when released, which entails a lot of head-scratching and deep breathing when the multi-year process begins.

Furthermore, each new design does it all over again. I'm talking about the "tock" phase for Intel; there's much less development work in the "tick" process shrink phase. Development organizations that aren't Intel don't get that breather. You don't "re-use" much silicon. (I don't think you ever re-use much code, either, with a few major exceptions; but that's a different issue.)

This is a very high stress operation. A huge investment can blow up if one of thousands of factors is messed up.

What they really do to accomplish all this is far from completely documented. I doubt it's even consciously fully understood. (What gets written down by someone paid from overhead to satisfy an ISO requirement is, of course, irrelevant.)

In this situation, is it any wonder the organizations are almost insanely conservative? Their members cannot even conceive of something except as a delta from both the current product and the current process used to create it, because that's what worked. And it worked within the budget. And they have their total intellectual capital invested in it. Anything not presented as a delta of both the current product and process is rejected out of hand. The process and product are intertwined in this; what was done (product) was, with no exceptions, what you were able to do in the context (process).

An implication is that they do not trust anyone who lacks the scars on their backs from having lived that long, high-stress process. You can't learn it from a book; if you haven't done it, you don't understand it. The introduction of anything new by anyone without the tribal scars is simply impossible. This is so true that I know of situations where taking a new approach to processor design required forming a new, separate organization. It began with a high-level corporate Act of God that created a new high-profile organization from scratch, dedicated to the new direction, staffed with a mix of outside talent and a few carefully-selected high-talent open-minded people pirated from the original organization. Then, very gradually, more talent from the old organization was siphoned off and blended into the new one until there was no old organization left other than a maintenance crew. The new organization had its own process, along with its own product.

This is why I regard most WNPoT announcements from a company's "research" arm as essentially meaningless. Whatever it is, it won't get into products without an "Act of God" like that described above. WNPoTs from academia or other outside research? Fuggedaboudit. Anything from outside is rejected unless it was originally nurtured by someone with deep, respected tribal scars, sufficiently so that that person thinks they completely own it. Otherwise it doesn't stand a chance.

Now I have a term to sum up all of this: WNPoT. Thanks, Charlie.

Oh, by the way, if you want a good reason why the Moore's Law half-death that flattened clock speeds produced multi- / many-core as a response, look no further. They could only do more of what they already knew how to do. It also ties into how the very different computing designs that are the other reaction to flat clocks came not from CPU vendors but outsiders – GPU vendors (and other accelerator vendors; see my post Why Accelerators Now?). They, of course, were also doing more of what they knew how to do, with a bit of Sutherland's Wheel of Reincarnation and DARPA funding thrown in for Nvidia. None of this is a criticism, just an observation.

Tuesday, June 8, 2010

Ten Ways to Trash your Performance Credibility

Watered by rains of development sweat, warmed in the sunny smiles of ecstatic customers, sheltered from the hailstones of Moore's Law, the accelerator speedup flowers are blossoming.

Danger: The showiest blooms are toxic to your credibility.

(My wife is planting flowers these days. Can you tell?)

There's a paradox here. You work with a customer, and he's happy with the result; in fact, he's ecstatic. He compares the performance he got before you arrived with what he's getting now, and gets this enormous number – 100X, 1000X or more. You quote that customer, accurately, and hear:

"I would have to be pretty drunk to believe that."

Your great, customer-verified, most wonderful results have trashed your credibility.

Here are some examples:

In a recent talk, Prof. Sharon Glotzer just glowed about getting a 100X speedup "overnight" on the molecular dynamics codes she runs.

In an online discussion on LinkedIn, a Cray marketer said his client's task went from taking 12 hours on a Quad-core Intel Westmere 5600 to 1.2 seconds. That's a speedup of 36,000X. What application? Sorry, that's under non-disclosure agreement.

In a video interview, a customer doing cell pathology image analysis reports their task going from 400 minutes to 65 milliseconds, for a speedup of just under 370,000X. (Update: Typo, he really does say "minutes" in the video.)

None of these people are shading the truth. They are doing what is, for them, a completely valid comparison: They're directly comparing where they started with where they ended up. The problem is that the result doesn't pass the drunk test. Or the laugh test. The idea that, by itself, accelerator hardware or even some massively parallel box will produce 5-digit speedups is laughable. Anybody baldly quoting such results will instantly find him- or herself dismissed as, well, the polite version would be that they're living in la-la land or dipping a bit too deeply into 1960s pop pharmacology.

What's going on with such huge results is that the original system was a target-rich zone for optimization. It was a pile of bad, squirrely code, and sometimes, on top of that, interpreted rather than compiled. Simply getting to the point where an accelerator, or parallelism, or SIMD, or whatever, could be applied involved fixing it up a lot, and much of the total speedup was due to that cleanup – not directly to the hardware.

This is far from a new issue. Back in the days of vector supercomputers, the following sequence was common: Take a bunch of grotty old Fortran code and run it through a new super-duper vectorizing optimizing compiler. Result: Poop. It might even slow down. So, OK, you clean up the code so the compiler has a fighting chance of figuring out that there's a vector or two in there somewhere, and Wow! Gigantic speedup. But there's a third step, a step not always done: Run the new version of the code through a decent compiler without vectors or any special hardware enabled, and, well, hmmm. In lots of cases it runs almost as fast as with the special hardware enabled. Thanks for your help optimizing my code, guys, but keep your hardware; it doesn't seem to add much value.

The moral of that story is that almost anything is better than grotty old Fortran. Or grotty, messed-up MATLAB or Java or whatever. It's the "grotty" part that's the killer. A related modernized version of this story is told in a recent paper Believe It or Not! Multi-core CPUs can Match GPU Performance, where they note "The best performing versions on the Power7, Nehalem, and GTX 285 run in 1.02s, 1.82s, and 1.75s, respectively." If you really clean up the code and match it to the platform it's using, great things can happen.

This of course doesn't mean that accelerators and other hardware are useless; far from it. The "Believe It or Not!" case wasn't exactly hurt by the fact that Power7 has a macho memory subsystem. It does mean that you should be aware of all the factors that sped up the execution, and using that information, present your results with credit due to the appropriate actions.

The situation we're in is identical to the one that lead someone (wish I remembered who), decades ago, to write a short paper titled, approximately, Ten Ways to Lie about Parallel Processing. I thought I kept a copy, but if I did I can't find it. It was back at the dawn of whatever, and I can't find it now even with Google Scholar. (If anyone out there knows the paper I'm referencing, please let me know.) Got it! It's Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers, by David H. Bailey. Thank you, Roland!

In the same spirit, and probably duplicating that paper massively, here are my ten ways to lose your credibility:

  1. Only compare the time needed to execute the innermost kernel. Never mind that the kernel is just 5% of the total execution time of the whole task.
  2. Compare your single-precision result to the original, which computed in double precision. Worry later that your double precision is 4X slower, and the increased data size won't fit in your local memory. Speaking of which,
  3. Pick a problem size that just barely fits into the local memory you have available. Why? See #4.
  4. Don't count the time to initialize the hardware and load the problem into its memory. PCI Express is just as fast as a processor's memory bus. Not.
  5. Change the algorithm. Going from a linear to a binary search or a hash table is just good practice.
  6. Rewrite the code from scratch. It was grotty old Fortran, anyway; the world is better off without it.
  7. Allow a slightly different answer. A*(X+Y) equals A*X+A*Y, right? Not in floating point, it doesn't.
  8. Change the operating system. Pick the one that does IO to your device fastest.
  9. Change the libraries. The original was 32 releases out of date! And didn't work with my compiler!
  10. Change the environment. For example, get rid of all those nasty interrupts from the sensors providing the real-time data needed in practice.
This, of course, is just a start. I'm sure there are another ten or a hundred out there.

A truly fair accounting for the speedup provided by an accelerator, or any other hardware, can only be done by comparing it to the best possible code for the original system. I suspect that the only time anybody will be able to do that is when comparing formally standardized benchmark results, not live customer codes.

For real customer codes, my advice would be to list all the differences between the original and the final runs that you can find. Feel free to use the list above as a starting point for finding those differences. Then show that list before you present your result. That will at least demonstrate that you know you're comparing marigolds and peonies, and will help avoid trashing your credibility.


Thanks to John Melonakos of Accelereyes for discussion and sharing his thoughts on this topic.

Friday, June 4, 2010

How Hardware Virtualization Works (Part 4)

This is the fourth and last in a series of posts about how hardware virtualization works. Catch it from Part 1 to understand the context.

Drown It in Silicon

In the previous discussion I might have lead you to believe that paravirtualization is widely used in mainframes (IBM zSeries and clones). Sorry. It is used, but in many cases another technique is used, alone or in combination with paravirtualization.

Consider the example of reading the real time clock. All that has to happen is that a silly little offset is added. It is perfectly possible to build hardware that adds an offset all by itself, without any "help" from software. So that's what they did. (See figure below.)

They embedded nearly the whole shooting match directly into silicon. This implies that the bag 'o bits I've been glibly referring to becomes part of the hardware architecture: Now it's hardware that has to reach in and know where the clock offset resides. Not everything is as trivial as adding an offset, of course; what happens with the memory mapping gets, to me anyway, a tad scary in its complexity. But, of course, it can be made to work.
Nobody else is willing to invest a pound or so of silicon into doing this. Yet.

As Moore's Law keeps providing us with more and more transistors, perhaps at some point the industry will tire of providing even more cores, and spend some of those transistors on something that might actually be immediately usable.

A Bit About Input and Output

One reason for all this mainframe talk is that it provides an existence proof: Mainframes have been virtualizing IO basically forever, allowing different virtual machines to think they completely own their own IO devices when in fact they're shared. And, of course, it is strongly supported in yet more hardware. A virtual machine can issue an IO operation, have it directed to its address for an IO device (which may not be the "real" address), get the operation performed, and receive a completion interrupt, or an error, all without involving a hypervisor, at full hardware efficiency. So it can be done.

But until very recently, it could not be readily done with PCI and PCIe (PCI Express) IO. Both the IO interface and the IO devices need hardware support for this to work. As a result, IO operations have for commodity and RISC systems been done interpretively, by the hypervisor. This obviously increases overhead significantly. Paravirtualization can clearly help here: Just ask the hypervisor to go do the IO directly.

However, even with paravirtualization this requires the hypervisor to have its own IO driver set, separate from that of the guest operating systems. This is a redundancy that adds significant bulk to a hypervisor and isn't as reliable as one would like, for the simple reason that no IO driver is ever as reliable as one would like. And reliability is very strongly desired in a hypervisor. Errors within it can bring down all the guest systems running under them.

Another thing that can help is direct assignment of devices to guest systems. This gives a guest virtual machine sole ownership of a physical device. Together with hardware support that maps and isolates IO addresses, so a virtual machine can only access the devices it owns, this provides full speed operation using the guest operating system drivers, with no hypervisor involvement. However, it means you do need dedicated devices for each virtual machine, something that clearly inhibits scaling: Imagine 15 virtual servers, all wanting their own physical network card. This support is also not an industry standard. What we want is some way for a single device to act like multiple virtual devices.

Enter the PCI SIG. It has recently released a collection – yes, a collection – of specifications to deal with this issue. I'm not going to attempt to cover them all here. The net effect, however, is that they allow industry-standard creation of IO devices with internal logic that makes them appear as if they are several, separate, "virtual" devices (the SR-IOV and MR-IOV specifications); and add features supporting that concept, such as multiple different IO addresses for each device.

A key point here is that this requires support by the IO device vendors. It cannot be done just by a purveyor of servers and server chipsets. So its adoption will be gated by how soon those vendors roll this technology out, how good a job they do, and how much of a premium they choose to charge for it. I am not especially sanguine about this. We have done too good a job beating a low cost mantra into too many IO vendors for them to be ready to jump on anything like this, which increases cost without directly improving their marketing numbers (GBs stored, bandwidth, etc.).


There is a joke, or a deep truth, expressed by the computer pioneer David Wheeler, co-inventor of the subroutine, as "All problems in computer science can be solved by another level of indirection."

Virtualization is not going to prove that false. It is effectively a layer of indirection or abstraction added between physical hardware and the systems running on it. By providing that layer, virtualization enables a collection of benefits that were recognized long ago, benefits that are now being exploited by cloud computing. In fact, virtualization is so often embedded in cloud computing discussions that many have argued, vehemently, that without virtualization you do not have cloud computing. As explained previously, I don't agree with that statement, especially when "virtualization" is used to mean "hardware virtualization," as it usually is.

However, there is no denying that the technology of virtualization makes cloud computing tremendously more economic and manageable.

Virtualization is not magic. It is not even all that complicated in its essence. (Of course its details, like the details of nearly anything, can be mind-boggling.) And despite what might first appear to be the case, it is also efficient; resources are not wasted by using it. There is still a hole to plug in IO virtualization, but solutions there are developing gradually if not necessarily expeditiously.

There are many other aspects of this topic that have not been touched on here, such as where the hypervisor actually resides (on the bare metal? Inside an operating system?), the role virtualization can play when migrating between hardware architectures, and the deep relationship that can, and will, exist between virtualization and security. But hopefully this discussion has provided enough background to enable some of you to cut through the marketing hype and the thicket of details that usually accompany most discussions of this topic. Good luck.