Friday, January 23, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (5)

This is part 5 of a multi-post sequence on this topic which began here. This is the crux, the final part attempting to answer the question

Why Hasn't SSI Taken Over The World?

After last-minute productus interruptus pull-outs and half-hearted product introductions by so many major industry players – IBM, HP, Sun, Tandem, Intel, SCO – you have to ask what's wrong.

Semi-random events like business conditions, odd circumstances, and company politics always play a large part in whether any product sees the light of day, but over multiple attempts those should average out: It can't just be a massive string of bad luck. Can it? This stuff just seems so cool. It practically breaths breathless press-release hype. Why isn't it everywhere?

Well, I'm fully open to suggestions.

I will offer some possibilities here.

Marketing: A Tale of Focus Groups

In one of the last, biggest pushes for this within IBM, sometime around 2002, a real marketer got involved, got funding, and ran focus groups with several sets of customers in multiple cities to get a handle on who would buy it, and for how much. I was involved in the evaluation and listened to every dang cassette tape of every dang multi-hour session. I practically kicked them in frustration at some points.

The format: After a meet-and-greet with a round of donuts or cookies or whatever, a facilitator explained the concept to the group and asked them "Would you buy this? How much would you pay?" They then discussed it among themselves, and gave their opinions.

It turned out that the customers in the focus groups divided neatly into two completely separate groups:

Naives

These were the "What's a cluster?" people. They ran a small departmental server of their own, but had never even heard of the concept of a cluster. They had no clue what a single system image was or why anybody would conceivably want one, and the short description provided certainly didn't enlighten them. This was the part I was near to kicking about; I knew I could have done that better. Now, however, I realize that what they were told was probably fairly close to the snippets they would have heard in the normal course of marketing and trade press articles.

Result: They wouldn't buy it because they wouldn't understand what it was and why they should be interested.

Experts

These guys were sysadmins of their own clusters, and knew every last minute detail of the area. They'd built clusters (usually small ones), configured cluster databases, and kept them running come hell or high water with failover software. The Naives must have thought they were talking Swahili.

They got the point, they got it fast, and they got it deep. They immediately understood implications far beyond what the facilitator said. They were amazed that it was possible, and thought it was really neat, although some expressed doubt that anybody could actually make it work, as in "You mean this is actually possible? Geez."

After a short time, however, all of them, every last one, zeroed in on one specific flaw: Operating system updates are not HA, because you can't in many cases run a different fix level on different nodes simultaneously. Sometimes it can work, but not always. These folks did rolling OS upgrades all the time, updating one node's OS and seeing if it fell over, then moving to the next node, etc.; this is a standard way to avoid planned outages.

Result: They wouldn't buy it either, because, as they repeatedly said, they didn't want to go backwards in availability.

That was two out of two. Nobody would buy it. It's difficult to argue with that market estimate.

But what if those problems were fixed? The explanation can be fixed, for sure; as I said, I all but kicked the cassettes in frustration. Doing kernel updates without an outage is pretty hard, but in all but the worst cases it could also be done, with enough effort.

Even were that done, I'm not particularly hopeful, for the several other reasons discussed below.

Programming Model

As Seymour Cray said of virtual memory, "Memory is like an orgasm: It's better when you don't have to fake it." (Thank you, Eugene Miya and his comp.sys.super FAQ.)

That remains true when it's shared virtual memory bouncing between nodes, even despite a probable lack of disk accesses. Even were an application or framework written to scale up in shared-memory multiprocessor style over many multi-cores in many nodes – and significant levels of multiprocessor scaling will likely be achieved as the single-chip core count rises – that application is going to perform much better if it is rewritten to partition its data so each node can do nearly all accesses into local memory.

But hey, why bother with all that work? Lots of applications have already been set up to scale on separate nodes, so why not just run multiple instances of those applications, and tune them to run on separate nodes? It achieves the same purpose, and you just run the same code. Why not?

Because most of the time it won't work. They haven't been written to run multiple copies on the same OS. Apache is the poster child for this. Simple, silly things get in the way, like using the same names for temp files and other externally-visible entities. So you modify the file system, letting each have an instance-specific /temp and other files… But now you've got to find all those cases.

The massively dominant programming model of our times runs each application on its own copy of the operating system. That has been a major cause of server sprawl and the resultant killer app for virtualization. The issue isn't just silly duplicate file names, although that is still there. The issue is also performance isolation, fault isolation, security isolation, and even inter-departmental political isolation. "Modern" operating systems simply haven't implemented hardening of the isolation between their separate jobs, not because it's impossible – mainframes OSs did it and still do – but because nobody cares any more. Virtualization is instead used to create isolated copies of the OS.

But virtualizing to one OS instance per node on top of a single-system-image OS that unifies the nodes, that's – I'm struggling for strong enough words here. Crazy? Circular? Möbius-strip-like? It would negate the whole point of the SSI.

Decades of training, tool development, and practical experience are behind the one application / one OS / one node programming model. It has enormous cognitive momentum. Multinode single system image is simply swimming upstream against a very strong current on this one, even if in some cases it may seem simpler, at least initially, to implement parallelism using shared virtual memory across multiple nodes. This alone seems to guarantee isolation to a niche market.

Scaling and Single-OS Semantics

Scaling up an application is a massive logical AND. The hardware must scale AND the software must scale. For the hardware to scale, the processing must scale AND the memory bandwidth must scale AND the IO must scale. For the software to scale, the operating system, middleware, AND application must all scale. If anything at all in the whole stack does not scale, you are toast: You have a serial bottleneck.

There are many interesting and useful applications that scale to hundreds or thousands of parallel nodes. This is true both of batch-like big single operations – HPC simulations, MapReduce over Internet-scale data – and the continuous massive river of transactions running through web sites like Amazon.com and eBay. So there are many applications that scale, along with their middleware, on hardware – piles of separate computers – that scales more-or-less trivially. When there's a separate operating system on each node, that part scales trivially, too.

But what happens when the separate operating systems are replaced with a kernel-level SSI operating system? That LCC was used on the Intel Paragon seems to say that it can.

However, a major feature and benefit of the whole idea of SSI is that the single system image matches the semantics of a single-node OS exactly. Getting that exact match is not easy. At one talk I heard, Jerry Popek estimated that getting the last 5% of those semantics was over 80% of the work, but provided 95% of the benefit – because it guarantees that if any piece of code ran on one node, it will simply run on the distributed version. That guarantee is a powerful feature.

Unfortunately, single-node operating systems simply weren't designed with multi-node scaling in mind. The poster child for this one is Unix/Linux systems' file position pointer, which is associated with the file handle, not with the process manipulating the file. The intent, used in many programs, is that a parent process can open a file, read and process some, then pass the handle to a child, which continues reading and processing more of the file; when the child ends, the parent can pick up where the child left off. It's how command-line arguments traditionally get passed: The parent reads far enough to see what kind of child to start up, starts it, and implicitly passes the rest of the input file – the arguments – to the child, which slurps them up. On a single-node system, this is actually the simplest thing to do: You just use one index, associated with the handle, which everybody increments. For a parallel system, that one index is a built-in serial bottleneck. Spawn a few hundred child processes to work in parallel, and watch them contend for a common input command string. The file pointer isn't the only such case, but it's an obvious egregious example.

So how did they make it work on the Intel Paragon? By sacrificing strict semantic compatibility.

For a one-off high-end scientific supercomputer (not clear it was meant to be one-off, but that's life), it's not a big deal. One would assume a porting effort or the writing of a lot of brand new code. For a general higher-volume offering, cutting such corners would mean trouble; too much code wants to be reused.

Conclusions

Making a single image of an operating system span multiple computers sounds like an absolutely wonderful idea, with many clear benefits. It seems to almost magically solve a whole range of problems, cleanly and clearly. Furthermore, it very clearly can be done; it's appeared in a whole string of projects and low-volume products. In fact, it's been around for decades, in a stream of attempts that are a real tribute to the perseverance of the people involved, and to the seductiveness of the concept.

But it has never really gotten any traction in the marketplace. Why?

Well, maybe it was just bad luck. It could happen. You can't rule it out.

On the other hand, maybe (a) you can't sell it; (b) it's at right angles to the vastly dominant application programming model; (c) it scales less well than many applications you would like to run on it.

Bummer.

__________________________

(Postscript: Are there lessons here for cloud computing? I suspect so, but haven't worked it out, and at this point my brain is tired. See you later, on some other topic.)

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (4)

This is part 4 of a multi-post sequence on this topic which began here. This part discusses some implementation issues and techniques.

Implementation, Briefly

The only implementations I know much about were the Locus ones. I use the plural deliberately, since two different organizations were used over time.

Initially, the problem was approached the most straightforward way conceivable: Start at the root of the source tree and crawl through every line of code in the OS. Everywhere you see an assumption that some resource is only on one node, replace it with code that doesn't assume that, participates in cross-system consistency and recovery, and so on.

This is a massive undertaking, requiting a huge number of changes scattered throughout the source tree. It's a mess. And it's massively complicated. I am in awe that they actually got it to work. As you might imagine, after a few of their many ports, they seriously began looking for a better way, and found it in an analogy to Unix/Linux vnode interface.

Vnode is what enables distributed file systems to easily plug into Unix/Linux. Its basic idea is simple: Anytime you want to do anything at all to a file, you funnel the request through vnode; you don't do it any other way. Vnode itself is a simple switch, with an effect that's roughly like this:

  If    <the vnode ID shows this is a native file system file>
then  <just do it: call the native file system code>
Else  <ship the request to the implementer, and return its result>

If the implementer is on another computer, as it is with many distributed file implementations, this means that you ship the request off to the other computer, where it's done and the result passed back to the requestor.

What later implementations of the Locus line of code did was create a vproc interface analogous to vnode, but used for all manipulations of processes. If a process is local, just do it; otherwise, ship the request to wherever the process lives, do it there, and return the result. This neatly consolidates a whole lot of what has to be done into one place, enabling natural reuse and far greater consistency. It is definitely a win.

Unfortunately, the rest of the kernel isn't written to use the vproc interface. So you still have to crawl through OS and convert any code that directly manipulated a process into code that manipulated it through vproc. This is still a pain, but a much lesser one, and you get some better structure out of it. (I believe vproc got into the Linux kernel at some point.) However, this alone doesn't do the job for IO, doesn't create a single namespace for sockets, shared segments, and so on; all that has to be done independently. But those aspects are generally more localized than process manipulation, so you have fixed a significant problem.

Other implementations, like the Virtual Iron one and, I believe, the one by ScaleMP, take a different tack. They concentrate on shared memory first, implementing a distributed shared memory model of some sort (exactly what I'm not sure). That gets you the function of shared memory across nodes, which is a necessity for any full kernel SSI implementation and does appear in Locus, too. But it doesn't cover everything, of course.

Enough with all the background. The next post gets to the ultimate point: If it's so great, why hasn't it taken over the world?

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (3)

This is part 3 of a multi-post sequence on this topic which began here. This part recounts some implementations and history.

History and Examples

The earliest version of this concept that I'm aware of was the Locus project at UCLA, led by Jerry Popek. It had a long, and ultimately frustrating, history.

The research project was successful. It created a version of UNIX that ran across several computers (VAXen) and as a system didn't go down for something over a year of use by students. The project was funded by IBM, and that connection led to key elements being ported experimentally to IBM's AIX in the late 1980s; I saw AIX processes migrating between nodes in 1989-90. Then an IBM executive, in a fit of magnanimity and/or cluelessness, gifted all the IP back to the Locus project. They went off and founded Locus Computing Corporation (LCC). From there it was used on an Intel Paragon (a massively-parallel computing project), which says something about scalability. At one point it was also part of an attempt in IBM to make mainframes and desktops (PS/2s) together look like one big AIX (UNIX) system. Feel free to boggle at that concept. It did run.

But LCC ran into the boughts: They were bought by Tandem, where the code was productized as Tandem NonStop Clusters, which as far as I know shipped in small numbers. Tandem was bought by Compaq, which bought DEC, so the code was ported to DEC's Alpha, and almost used in DEC's brand of Unix before DEC was snuffed. Then Compaq was bought by HP, and shortly thereafter this code stream came perilously close to being part of HP-UX, HP's Unix: In 2004, there was a public announcement that it would be in the next major HP-UX release. About a month later, that was later officially de-committed. It also saw use in a PRPQ (limited-edition product) from IBM that was called POWER4. At some point it also almost became part of SCO's UnixWare.

This is a portrait of frustration for all involved.

The Locus thread, however, isn't the only implementation or planned product of full single system image. There's an open-source implementation now called MOSIX and based on Linux that you can download and try, designed specifically to support HPC. It also has a long history, starting life as MOS in the late 70s at The Hebrew University of Jerusalem. It doesn't distribute absolutely every element of the OS, but does quite enough to be useful.

Sun Microsystems published a fair amount of material about a version for Solaris, which it called Full Moon, including a roadmap indicating complete single-system-image in 1999. It hasn't happened yet, obviously. The most recent news I found about it was that some current variation would go open source (CDDL) in 2007. The odd name, by the way, was chosen because full moons make wolf packs howl, and Microsoft's cluster support was codenamed Wolfpack. ("Wolfpack" referred to my analogy of clusters being like packs of dogs in In Search of Clusters. "Dogpack" wouldn't have had quite the same connotations. But there's no mythological multi-headed wolf to play the part of the multiprocessor in my analogy.)

ScaleMP today sells a multi-multicore single system image product, vSMP foundation, aiming to ride on the speed of modern higher-performance interconnects like InfiniBand (and likely Data Center Ethernet or equivalent), but my understanding is that vSMP is hardware-assisted, a somewhat different subject. Similarly, Virtual Iron was at one point also pushing an ability to glue multiple multi-cores into one larger SMP-like system, with hardware assist, but that has apparently been dropped. Both of these, unlike prior efforts, have some flavor of virtualization to them and/or their implementation.

I wouldn't be at all surprised to learn that there are others who have heard the seductive Siren call of SSI.

So much for history. Implementation comments in the next post.

Tuesday, January 20, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (2)

This is part 2 of a multi-post sequence on this topic that began here. This post describes in more detail what this concept actually is.

What It Is

The figure below illustrates the conventional approach to building a cluster / farm / cloud / data center / whatever you want to call a bunch of computers working more-or-less together.

Each system, above the hardware and its interconnect (black line), has its own independent copy of:

an operating system,
various networking basics (think DNS and the like),
some means of accessing common data storage (like a distributed file system),
assorted application-area tools (like Apache, MapReduce, MPI)
and various management tools to wrangle all the moving parts into some semblance of order.

Each of those has its own inter-instance communication (colored lines), which has to be made to work, too.

The order in which I've stacked the elements above the operating system can be debated; feel free to pick your own, since that's not the point. The point is that there are a collection of separate elements to collect, install, manage, and train administrators and developers to use, all the time hoping the elements will in this case play nicely with each other. The pain of doing this has lessened over time as various stacks become popular, and hence well-understood, but it's still building things out of an erector set each time, carefully keeping in mind how each has to slot into another. Intel even has a Cluster Ready specification and configuration checker whose sole purpose is to avoid many of the standard shoot-yourself-in-the-foot situations when building clusters.

That just leaves one question: Why did I use that cloudy shape to represent the operating system? Because the normal function of an operating system is to turn the content of one node into a pool of resources, which it manages. This is particularly obvious when each node system is a multicore (multiprocessor): The collection of cores and threads is obviously a resource pool. But then again, so is the memory. It's perhaps less obvious that this is happening when you just consider the seemingly singular items attached to a typical system, but look closer and you find it, for example in network bandwidth and disk space.

I also introduced a bit of cosmetic cloudiness to set things up for this figure, by comparison:

Here, you have one single instance of an operating system that runs the whole thing, exactly like one operating system pools and allocates all the processors, memory, and other resources in a multicore chip or multiprocessor system. On that single operating system instance, you install, once, one (just one) set of the application tools, just like you would install it on one node. Since that tool set is then on the (singular) operating system, it can be used on any node by any application or applications.

The vision is of a truly distributed version of the original operating system, looking exactly the same as the single-node version as far as any administrator, user, or program can tell: It maintains all the system interfaces and semantics of the original operating system. That's why I called it a Kernel-level Single System Image in In Search of Clusters, as opposed to a single-system image at the file system level, or at lower hardware levels like memory.

Consider what that would mean (using some Linux/Unix terminology for the examples):

There's one file system with one root (including /dev), one process tree, one name space for sockets & shared segments, etc. They're all exactly the same as they are on the standard single-node OS. All the standard single-system tools for managing this just work.
There is inter-node process migration for load balancing, just like a multicore OS moves processes between cores; some just happen to be on other nodes. (Not as fast, obviously.) Since that works for load balancing, it can also be used to avoid planned outages: Move all the work over to this node, then bring down the just-emptied node.
Application code doesn't know or care about any of this. It never finds out it has been migrated, for example (even network connections). It just works, exactly the way it worked on a single system. The kernel interface (KPI) has a single system image, identical to the interface on a single-node operating system.

With such a system, you could:

add a user – once, for all nodes, using standard single-OS tools; there's no need to learn any new multi-system management tools.
install applications and patches – once, for all nodes, ditto.
set application parameters – once, for all nodes, ditto.

Not only that, it can be highly available under hardware failures. Believing this may take a leap of faith, but hang on for a bit and I'll cite the implementations which did it. While definitely not trivial, you can arrange things so that the system as a whole can survive the death of one, or more, nodes. (And one implementation, Locus, was known for doing rational things under the ugly circumstance of partitioning the cluster into multiple clusters.) Some folks immediately assume that one node has to be a master, and losing it gets you in big trouble. That's not the case, any more than one processor or thread in a multicore system is the master.

Note, however, that all this discussion refers to system availability, not application availability. Applications may need to be restarted if the node they were on got fried or drowned. But it provides plumbing on which application availability can be built.

So this technique provides major management simplification, high availability, and scale up in one blow. This seems particularly to win for the large number of small installations that don't have the budget to spend on an IT shop. Knowing how to run the basic OS is all you need. That SMB (Small and Medium Business) target market seems always to be the place for high growth, so this is smack dab on a sweet spot.

Tie me to the mast, boys, the Sirens are calling. This is a fantastically seductive concept.

Of course, it's not particularly wonderful if you make money selling big multiprocessors, like SGI, IBM, HP, Fujitsu, and a few other "minor" players. Then it's potentially devastating, since it appears that anybody can at least claim they can whip up a big SMP out of smaller, lower-cost systems.

Also, just look at that figure. It's a blinkin' cloud! All you need do is assume the whole thing is accessed by a link to the Internet. It turns a collection of computers into a uniform pool of resources that is automatically managed, directly by the (one) OS. You want elastic? We got elastic. Roll up additional systems, plug them in, it expands to cover them.

{Well, at least it's a cloud is by some definitions of cloud computing; defining "cloud computing" is a cottage industry. Of course for elasticity there's the little bitty problem that the applications have to elasticize, too, but that's just as true of every cloud computing scenario I'm aware of. This is the same issue with high availability.}

So, if it's so seductive, to say nothing of reminiscent of today's hot cloud buzz – and for many, but not all the same reasons – why hasn't it taken over the world?

That's the big question. Before tackling that, though, let's fill in some blanks. Can it actually be done? The answer to that is an unequivocal yes. It's been done, repeatedly. How is it done? Well, not trivial, but there are structuring techniques that help.

That's next. First, the history.

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (part 1)

Wouldn't it be wonderful if you could simply glue together several multicore systems with some software and make the result look like one bigger multicore system?

This sounds like something to make sysadmins salivate and venture capitalists trample each other in a rush to fund it. I spent a whole chapter of In Search of Clusters, both editions, explaining why I thought it was wonderful. Some of these reasons are echoed in today's Cloud Computing hubbub, particularly the ability to use a collection of computers as a seamless, single resource. The flavor is very different, and the implementation totally different, but that intent and the effects are the same.

Unfortunately, while this has been around for quite a while, it has never really caught on despite numerous very competent attempts. One must therefore ask the embarrassing question: Why? What kind of computer halitosis does it have to never have been picked up?

That's the subject of this series of posts. I'm going to start at the beginning, explaining what it really is (and why it's not unlike cloud computing); then describe some of the history of attempts to get it on the market; some of the technology that underlies it; and finally take a run at the central question: If it's so wonderful, can be implemented, and has appeared in products, why hasn't it taken over the world?

Doing this will span several posts; it's a long story. (This time I promise to link them in reading order.)

Note: This is about a purely software approach to multiplying the multi of multicore; it's done by distributing the operating system. Plain, ordinary, communications gear – Ethernet, likely, but whatever you like – is all that connects the systems. Using special hardware to pull the same trick will be the subject of later posts, as may techniques to distribute other entities to the same effect, like a JVM (Java Virtual Machine).

Next, a discussion of what this really is.

Sunday, January 18, 2009

A Redirection

I'm back. Been a while.

Sorry for the long delay. In large part this was due to a readjustment of my intentions and directions for this blog, and ultimately for a new book. It finally became clear to me that the overall negative message I was carrying had two undesirable effects:

"You are toast" is not a message likely to motivate most people to read a lot of material. Positive messages sell better.
It leaves me with less to talk about. The one focused negative theme made it difficult to justify talking about the range of topics I want to comment on, which is wider than just multicore. Realistically, I think this was the primary issue that turned me around.

You may have noticed my new blog subtitle, which I modified to reflect the new orientation. I didn't see any reason to change the title, though. Parallel is still perilous, but you'll be in less peril if you understand it, which is where I and this blog come in.

So you'll find a lot less haranguing about how multicore being a collective big wish that everything will still be just fine, and more about lessons learned and discussions of areas where I see broad confusion.

So, that's the intent from now on. You may, however, be interested or amused by a related random dollop of irony that I fetched up against.

Shortly after I decided to make this change, I just happened to catch The Dark Secret of Hendrick Schön on the Science Channel. Overall, the show is about how Schön was an uber-wunderkind of physics, publishing great discoveries at a ferocious rate, until he was fired in disgrace from Bell Labs (1). Saying more about Schön would be a spoiler and isn't particularly relevant to my point here, which is:

The show starts with a fairly long black-and-white segment that moves about a deserted, rundown city filled with ugly old industrial buildings, paper blowing in the streets, and an occasional apparently homeless person pushing a shopping cart stacked high with trash. While this dystopia rolls by, the narration breathlessly prophesies all the horrors that will unfold across all of modern society if one, specific, terrible technical tragedy occurs.

The tragedy?

Moore's Law runs out.

_________________________________________________________

1. I wasn't even aware that Bell Labs still existed. Apparently it does, still doing work in nanotechnology among other areas. This relates to why the show description in the link above goes gaga over Grey Goo. Grey Goo was not his "dark secret."