Tuesday, January 20, 2009

Multi-Multicore Single System Image / Cloud Computing. A Good Idea? (2)

This is part 2 of a multi-post sequence on this topic that began here. This post describes in more detail what this concept actually is.

What It Is

The figure below illustrates the conventional approach to building a cluster / farm / cloud / data center / whatever you want to call a bunch of computers working more-or-less together.

Each system, above the hardware and its interconnect (black line), has its own independent copy of:

  • an operating system,
  • various networking basics (think DNS and the like),
  • some means of accessing common data storage (like a distributed file system),
  • assorted application-area tools (like Apache, MapReduce, MPI)
  • and various management tools to wrangle all the moving parts into some semblance of order.

Each of those has its own inter-instance communication (colored lines), which has to be made to work, too.

The order in which I've stacked the elements above the operating system can be debated; feel free to pick your own, since that's not the point. The point is that there is a collection of separate elements to collect, install, manage, and train administrators and developers on, all the while hoping that in this case the elements will play nicely with each other. The pain of doing this has lessened over time as various stacks have become popular, and hence well understood, but it's still like building something out of an Erector Set each time, carefully keeping in mind how each piece slots into the others. Intel even has a Cluster Ready specification and configuration checker whose sole purpose is to avoid many of the standard shoot-yourself-in-the-foot situations that arise when building clusters.

That just leaves one question: Why did I use that cloudy shape to represent the operating system? Because the normal function of an operating system is to turn the contents of one node into a pool of resources, which it manages. This is particularly obvious when each node is a multicore (multiprocessor) system: The collection of cores and threads is obviously a resource pool. But then again, so is the memory. The pooling is perhaps less obvious when you consider the seemingly singular items attached to a typical system, but look closer and you find it there too, for example in network bandwidth and disk space.
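
To make the pooling concrete, here's a minimal sketch in plain C that just asks a single-node OS how big its pools are, using standard POSIX/Linux calls; the SSI angle is only an assumption noted in the comments, not a feature of any particular system:

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* Standard calls on one node: the OS is already managing these as pools. */
    long cpus  = sysconf(_SC_NPROCESSORS_ONLN);  /* cores/threads online */
    long pages = sysconf(_SC_PHYS_PAGES);        /* physical memory pages */
    long psize = sysconf(_SC_PAGE_SIZE);         /* bytes per page */

    printf("processors online: %ld\n", cpus);
    printf("physical memory:   %lld MiB\n",
           (long long)pages * psize / (1024 * 1024));

    /* Under a kernel-level single system image -- an assumption for
     * illustration, not a promise of any implementation -- the very same
     * calls would describe the pooled resources of the whole cluster. */
    return 0;
}
```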

I also introduced a bit of cosmetic cloudiness to set things up for this figure, by comparison:

Here, you have one single instance of an operating system that runs the whole thing, exactly as one operating system pools and allocates all the processors, memory, and other resources in a multicore chip or multiprocessor system. On that single operating system instance, you install, once, one (just one) set of the application tools, just as you would install them on one node. Since that tool set is then on the (singular) operating system, it can be used on any node by any application or applications.

The vision is of a truly distributed version of the original operating system, looking exactly the same as the single-node version as far as any administrator, user, or program can tell: It maintains all the system interfaces and semantics of the original operating system. That's why I called it a Kernel-level Single System Image in In Search of Clusters, as opposed to a single-system image at the file system level, or at lower hardware levels like memory.

Consider what that would mean (using some Linux/Unix terminology for the examples):

  • There's one file system with one root (including /dev), one process tree, one name space for sockets & shared segments, etc. They're all exactly the same as they are on the standard single-node OS. All the standard single-system tools for managing this just work.
  • There is inter-node process migration for load balancing, just like a multicore OS moves processes between cores; some just happen to be on other nodes. (Not as fast, obviously.) Since that works for load balancing, it can also be used to avoid planned outages: Move all the work over to this node, then bring down the just-emptied node.
  • Application code doesn't know or care about any of this. It never finds out it has been migrated, for example (even its network connections move with it). It just works, exactly the way it worked on a single system. The kernel interface (KPI) presents a single system image, identical to the interface of a single-node operating system. (A small sketch of this follows the list.)
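
Here's that sketch: perfectly ordinary single-node Unix C code, fork and wait, with nothing cluster-aware in it. That absence is the point. The comments spell out the assumption being illustrated; this isn't code for any particular SSI product.

```c
/* Ordinary single-node code: fork a child and wait for it. Under a
 * kernel-level SSI (the assumption illustrated here, not a feature of any
 * specific product), the kernel could place or later migrate the child onto
 * another node; nothing in this program would change, and getpid(),
 * getppid(), and waitpid() would keep behaving exactly as on one box. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    pid_t child = fork();
    if (child < 0) {
        perror("fork");
        return 1;
    }
    if (child == 0) {
        /* Child: might be running on any node; it neither knows nor cares. */
        printf("child %d (parent %d) doing some work\n",
               (int)getpid(), (int)getppid());
        exit(0);
    }

    /* Parent: the standard wait works even if the child was migrated. */
    int status;
    waitpid(child, &status, 0);
    printf("parent saw child %d exit with status %d\n",
           (int)child, WEXITSTATUS(status));
    return 0;
}
```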

With such a system, you could:

  • add a user – once, for all nodes, using standard single-OS tools; there's no need to learn any new multi-system management tools.
  • install applications and patches – once, for all nodes, ditto.
  • set application parameters – once, for all nodes, ditto.

Not only that, it can be highly available under hardware failures. Believing this may take a leap of faith, but hang on for a bit and I'll cite the implementations that did it. While definitely not trivial, you can arrange things so that the system as a whole survives the death of one, or more, nodes. (And one implementation, Locus, was known for doing rational things under the ugly circumstance of the cluster partitioning into multiple clusters.) Some folks immediately assume that one node has to be a master, and that losing it gets you in big trouble. That's not the case, any more than one processor or thread in a multicore system is the master.

Note, however, that all this discussion refers to system availability, not application availability. Applications may need to be restarted if the node they were running on got fried or drowned. But system availability provides the plumbing on which application availability can be built.
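
As a rough illustration of what such plumbing might look like, here's a hypothetical restart supervisor written as ordinary single-system code; it's a sketch under the stated assumptions, not anything a real SSI ships:

```c
/* A hypothetical sketch of application-availability plumbing built on top
 * of system availability: a tiny supervisor that restarts its child whenever
 * the child dies. The assumption (for illustration only) is that on a
 * kernel-level SSI, a child lost along with its node just looks like a dead
 * process to this ordinary single-system code, so the same loop brings the
 * application back up -- possibly on a surviving node. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    for (;;) {
        pid_t child = fork();
        if (child < 0) {
            perror("fork");
            return 1;
        }
        if (child == 0) {
            /* Placeholder application; substitute the real server binary. */
            execlp("sleep", "sleep", "60", (char *)NULL);
            perror("execlp");
            _exit(127);
        }

        int status;
        waitpid(child, &status, 0);
        fprintf(stderr, "child %d gone (raw status %d); restarting\n",
                (int)child, status);
        sleep(1);  /* brief back-off before restarting */
    }
}
```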

So this technique provides major management simplification, high availability, and scale-up in one blow. It seems like a particular win for the large number of small installations that don't have the budget for a full IT shop: knowing how to run the basic OS is all you need. That SMB (Small and Medium Business) target market always seems to be the place for high growth, so this is smack dab on a sweet spot.

Tie me to the mast, boys, the Sirens are calling. This is a fantastically seductive concept.

Of course, it's not particularly wonderful if you make money selling big multiprocessors, like SGI, IBM, HP, Fujitsu, and a few other "minor" players. Then it's potentially devastating, since it appears that anybody can at least claim they can whip up a big SMP out of smaller, lower-cost systems.

Also, just look at that figure. It's a blinkin' cloud! All you need do is assume the whole thing is accessed by a link to the Internet. It turns a collection of computers into a uniform pool of resources that is automatically managed, directly by the (one) OS. You want elastic? We got elastic. Roll up additional systems, plug them in, it expands to cover them.

{Well, at least it's a cloud by some definitions of cloud computing; defining "cloud computing" is a cottage industry. Of course, for elasticity there's the little bitty problem that the applications have to elasticize, too, but that's just as true of every cloud computing scenario I'm aware of. The same issue arises with high availability.}

So, if it's so seductive, to say nothing of reminiscent of today's hot cloud buzz (for many, but not all, of the same reasons), why hasn't it taken over the world?

That's the big question. Before tackling that, though, let's fill in some blanks. Can it actually be done? The answer to that is an unequivocal yes. It's been done, repeatedly. How is it done? Well, it's not trivial, but there are structuring techniques that help.

That's next. First, the history.
