The Perils of Parallel: October 2011

Friday, October 28, 2011

MIC and the Knights

Intel’s Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s Ferry (KF) software development kit. Well, I thought it was wide display, but I’m an IDF Newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn’t consider the display “wide” in IDF terms unless it’s in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here is my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larraabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.

You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new task manager display changed to be a better visualization of many more – up to 640 – cores. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC processors; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC codes have, like him, a very large body of legacy codes they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, R² force laws; irrelevant when sufficiently far, dominant when close.) Changing the way those tightly-coupled calculations and done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes from Gauss-Seidel iteration to Gauss-Jacobi simulation. I went into this in more detail way back in my book In Search of Clusters, (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, parenthesization of expressions, forcing order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get” tons more performance / lower power / whatever.

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple nodes. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on other nodes have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here are Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skill, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work at the level of intellectual academic accolades to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!

Wednesday, October 5, 2011

Will Knight’s Corner Be Different? Talking to Intel’s Joe Curley at IDF 2011

At the recent Intel Developer Forum (IDF), I was given the opportunity to interview Joe Curley, Director, Technical Computing Marketing of Intel’s Datacenter & Connected Systems Group in Hillsboro.

Intel-provided information about Joe:

Joe Curley, serves Intel® Corporation as director of marketing for technical computing in the Data Center Group. The technical computing marketing team manages marketing for high-performance computing (HPC) and workstation product lines as well as future Intel® Many Integrated Core (Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities that lead up to the announcement of the Intel® MIC Architecture in May of 2010. Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng Labs in a series of marketing and engineering leadership roles.

I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! to all who responded.)

This is the last in a series of three such transcripts. Hallelujah! Doing this has been a pain. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews. But some were in the interviews, including here.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

[We began by discovering we had similar deep backgrounds, both starting in graphics hardware. I designed & built a display processor (a prehistoric GPU), he built “the most efficient framework buffer controller you could possibly make”. Guess which one of us is in marketing?]

A: My experience in the [HPC] business really started relatively recently, a little under five years ago, [when] I started working on many-core processors. I won’t be able to go into history, but I can at least tell you what we’re doing and why.

Q: Why don’t we start there? At a high level, what are you doing, and why? High level for what you are doing, and as much detail on “why” as you can provide.

A: We have to narrow the question. So, at Intel, what we’re after first of all is in what we call our Technical Computing Marketing Group inside Data Center Group. That has really three major objectives. The first one is to specify the needs for high performance computing, how we can help our customers and developers build the best high performance computing systems.

Q: Let me stop you for a second right there. My impression for high performance computing is that they are people whose needs are that they want more. Just more.

A: Oh, yes, but more at what cost? What cost of power, what cost of programability, what cost of size. How are we going to build the IO system to handle it affordably or use the fabric of the day.

Q: Yes, they want more, but they want it at two bytes/FLOPS of memory bandwidth and communication bandwidth.

A: There’s an old thing called the Dilbert Spec, which is “I want it all, and by the way, can it be free?” But that’s not really what people tell us they want. People in HPC have actually been remarkably pragmatic about what it takes to develop innovation. So they really want us to do some things, and do them really well.

By the way, to finish what we do, we also have the workstation segment, and the MIC Many Integrated Core product line. The marketing for that is also in our group.

You asked “what are you doing and why.” It would probably take forever to go across all domains, but we could go into any one of them a little bit better.

Q: Can you give me a general “why” for HPC, and a specific “why” for MIC?

A: Well, HPC’s a really good business. I get stunned, somebody must be Asking really weird questions, asking “why are you doing HPC?”

Q: What I’ve heard is that HPC is traditionally 12% of the market.

A: Supercomputing is a relatively small percentage of the market. HPC and technical computing, combined, is, not exactly, but roughly, a third of our data center business. [emphasis added by me] Our data center business is a pretty robust business. And high performance computing is a business that requires very high end, high performance processors. It’s actually a very desirable business to be in, if you can do it, and if your systems work. It’s a business we spend a lot of time working on because it’s a good business.

Now, if you look at MIC, back in 2005 we made a tacit conclusion that the performance of a system will come out of parallelism. Parallelism could be expressed at Intel in a lot of different ways. You can look at it as threads, we have this concept called hyperthreading. You can look at it as cores. And we have the SSE instructions sitting around which are SIMD, that’s a form of parallelism; people argue about the definition, but yes, it is. [I agree.] So you take a look at the basic architectural constructs, ease of programming, you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or vectors, these common attributes have been adopted and well-used by a lot of programmers. There are programs across the continuum of coarse- to fine-grained parallel, embarrassingly parallel, pick your taxonomy. But there are applications that developers would be willing to trade the performance of any particular task or thread for the sum of what you can do inside the power envelope at a given period of time. Lots of people have different ways of defining that, you hear throughput, whatever, but this is the class of applications, and over time they’re growing.

Q: Growing relatively, or, say, compared to commercial processing, or…? Is the segment getting larger?

A: The number of people who have tasks they want to run on that kind of hardware is clearly growing. One of the reasons we’re doing MIC, maybe I should just cut it to the easiest answer, is developers and customers asked us to.

Q: Really?

A: And they came to us with a really simple question. We were struggling in the marketing group with how to position MIC, and one of our developers got worked up, like “Look, you give me the parallel performance of an accelerator, but you give me the ease of CPU programming!” Now, ease is a funny word; you can get into religious arguments about ease. But I think what he means is “I don’t have to re-think my algorithm, I don’t have to reorder my data set, there are some things that I don’t have to do. So that they wanted to have the idea of give me this architecture and get it to scale to be wildly parallel. And that is exactly what we’ve done with the MIC architecture. If you think about what the Kinght’s Ferry STP [? Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.] is, a 32 core, coherent, on a chip, teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the programming model is, surprisingly, kind of like a bunch of processor cores on a network, which a lot of people understand and can get a lot of utility out of in a very well-understood way. So, in a sense, we’re giving people what they want, and that, generally, is good business. And if you don’t give them what they want, they’ll have to go find someone else. So we’re simply doing what our marketplace asked us for.

Q: Well, let me play a little bit of devil’s advocate here, because MIC is very clearly derivative of Larrabee, and…

A: Knight’s Ferry is.

Q: … Knight’s Ferry is. Not MIC?

A: No. I think you have to take a look at what Larrabee was. Larrabee, by the way, was a really cool project, but what Larrabee was was a tile rendering graphics device, which meant its design point, was first of all the programming model was derived from what you do for graphics. It’s going to be API-based, the answer it’s going to generate is going to be a pixel, the pixel is going to have a defined level of sub-pixel accuracy. It’s a very predictable output. The internal optimizations you would make for a graphics implementation of a general many-core architecture is one very specific implementation. Let’s talk about the needs of the high performance computing market. I need bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have a frame buffer.

Q: It needed bandwidth to local memory [of which it didn’t have enough; see my post The Problem with Larrabee]

A: Yes, but less than you think, because the cache was the critical element in that architecture [again, see that post] if you look through the academic papers on that…

Q: OK, OK.

A: So, they have a common heritage, they’re both derived out of the thoughts that came out of the Intel Labs terascale research. They’re both many-core. But Knight’s Ferry came out with a few, they’re only a few, modifications. But the programming model is completely different. You don’t program a graphics device like you do a computer, and MIC is a computer.

Q: The higher-level programming model is different.

A: Correct.

Q: But it is a big, wide, cache-coherent SMP.

A: Well, yes, that’s what Knight’s Ferry is, but we haven’t talked about what Knight’s Corner yet, and unfortunately I won’t today, and we haven’t talked about where the product line will go from there, either. But there are many things that will remain the same, because there are things you can take and embellish and work and things that will be really different.

Q: But can you at least give me a hint? Is there a chance that Knight’s Corner will be a substantially different hardware model than Knight’s Ferry?

A: I’m going to really love to talk to you about Knight’s Corner. [his emphasis]

Q: But not today.

A: I’m going to duck it today.

Q: Oh, man…

A: The product is going to be in our 22 nm process, and 22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to have the buzz generated, we’ll start generating buzz. Right now, the big thing is that we’re making the investments in the Knight’s Ferry software development platform, to see how codes scale across the many-core, to get the environment and tools up, to let developers poke at it and find stuff, good stuff, bad stuff, in between stuff, that allow us to adjust the product line for ongoing generations. We’ve done that really well since we announced the architecture about 15 months ago.

Q: I was wondering what else I was going to talk about after having talked to both John Hengeveld and Jim Reinders. This is great. Nobody talked about where it really came from, and even hinted that there were changes to the MIC chip [architecture].

A: Oh, no, no, many things will be the same, many things will be different. If you’re targeting trying to do a pixel-renderer, go do a pixel-renderer. If you’re trying to do a general-purpose computing device, do a general-purpose computing device. You’ll see some things and say “well, it’s all the same” and other things “wow, it’s completely different.” We’ll get around to talking about the part when we’re a little closer.

The most important thing that James and/or John should have been talking about is that the key thing is the ability to not force the developer to completely and utterly re-think their problem to use your hardware. There are two models: In an accelerator model, which is something I spent a lot of my life working with, accelerators have the advantage of optimization. You can say “I want to do one thing really well.” So you can then describe a programming model for the hardware. You can say “build your data this way, write your program this way” and if you do it will work. The problem is that not everything fits into the box. Oh, you have sparse data. Oh, you have recursive code.

Q: And there’s madness in that direction, because if you start supporting that you wind yourself around to a general-purpose machine. […usually, a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel of Reincarnation” in this blog, haven’t I? Oh, there it is: The Cloud Got GPUs, back in November 2010.]

A: Then it’s not an accelerator any more. The thing that you get in MIC is the performance of one of those accelerators. We’ve shown this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision, without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve shown that. So you get performance, but you’re getting it out of the general purpose programming model.

Q: That’s running LINPACK, or… ?

A: That was an even more basic thing; I’m just talking about SGEMM [single-precision dense matrix multiply].

Q: I just wanted to ground the number.

A: For LU factorization, I think we showed hybrid LU, really cool, one of the great things about this hybrid…

Q: They’re demo-ing that downstairs.

A: … OK. When the matrix size is small, I keep it on the host; when the matrix size is large, I move it. But it’s all the same code, the same code either place. I’m just deciding where I want to run the code intelligently, based on the size of the matrix. You can get the exact number, but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]

Q: Yaahh, well, there are a lot of people who can deliver something like that.

A: We’ll keep working on it and making it better and better. So, what are we proving today. All we’ve proven today is that the architecture is capable of performance. We’ve got a lot of work to do before we have a product, but the architecture has shown itself to be capable. The programming model, we have people who will speak for us, like the quotes that came from LRZ [data center for the universities of Munich and the Bavarian Academy of Sciences], from Leibnitz [same place], a code they couldn’t port to other accelerators was running in two hours and optimized in two days. Now, actual mileage may vary, see dealer for…

Q: So, there are things that just won’t run on a CUDA model? Example?

A: Well, perhaps, again, the thing you try to get to is whether there is evidence growing that what you say is real. So we’re having people who are starting to be able to speak to that, and that gives people the confidence that we’re going to be able to get there. The other thing it ends up doing, it’s kind of an odd benefit, as people have started building their code, trying to optimize it for MIC, they’re finding the parallelism, they’re doing what we wanted them to do all along, they’re taking the same code on their current cluster and they’re getting benefits right now.

Q: That’s got a long history. People would have some grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler couldn’t make crap out of it. So they cleaned it up, made it obvious what was going on, and the vectorizer did its thing well. Then they put it back on the original machine and it ran twice as fast.

A: So, one of the nice things that’s happened is that as people are looking at ways to scale power, performance, they’re finally getting around to dealing with parallelism. The offer that we’re trying to provide is portable, high level, standards-based, and you can use it now.

You said “why.” That’s why. Our customers and developers say “if you can do that, that’s really valuable.” Now. We’re four men and a pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but the thought and the promise and the early data is really good.

Q: OK. Well, great.

A: Was that a good use of the time?

Q: That’s a very good use of the time. Let me poke on one thing a little bit. Conceptually, it ought to be simpler to write code to that kind of a shared memory model and get parallelism out of the code that way. Now, on the other hand, there was a talk – sorry, I forget his name, he was one of the software guys working on Larrabee [it was Tom Forsyth; see my post The Problem with Larrabee again] said someone on the project had written four renderers, and three of them were for Larrabee. He was having one hell of a time trying to get something that performed well. His big issue, at least what it came down to from what I remember of the talk, was memory bandwidth.

A: Well, first of all, we’ve said Larrabee’s not a product. As I’ve said, one of the things that is critical, you’ve got the compute-bound, you’ve got the memory-bound, and most people are somewhere in between, but you have to be able to handle the two edge cases. We understand that, and we intend to deliver a really good value across the spectrum. Now, Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation off the silicon we used, no one cares about that, but on Knight’s Ferry, the memory bus is 256 bits wide. Relatively narrow, and for a graphics processor, very narrow. There are definitely design decisions in how that chip was made that would limit the bandwidth. And the memory it was designed with is slower than the memory today, you have all of the normal things. But if you went downstairs to the show floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic ray-tracer.

[What follows is a bit confused. He didn’t mean the Austrian Crown stochastic ray-tracing demo, but rather the real-time ray-tracing demo. As I said in my immediately previous post (Random Things of Interest at IDF 2011), the real-time demo is on a set of Knight’s Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t seen the real-time demo, just the stochastic one; the latter is not on Knight’s Ferry.]

Q: I’ve seen that one. The Austrian Crown?

A: Yes.

Q: I thought that was on a cluster.

A: In the little box behind there, he’s able to scale from one to eight Knight’s Ferries.

Q: He never told me there was a Knight’s Ferry in there.

A: Yes, it’s all Knight’s Ferry.

Q: Well, I’m going to go down there and beat on him a little bit.

A: I’m about to point you to a YouTube site, it got compressed and thrown up on YouTube. You can’t get the impact of the complexity of the rays, but you can at least get the superficial idea of the responsiveness of the system from Knight’s Ferry.

[He didn’t point me to YouTube, or I lost it, but here’s one I found. Ignore the fact that the introduction is in Swedish or something [it's Dutch, actually]; Daniel – and it’s Daniel, not David – speaks English, and gives a good demo. Yes, everybody in the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the Random Things of Interest post to directly include it.]

Well, if you believe that what we’re going to do in our mainstream processors is roughly double the FLOPS every generation for the next many generations, that’s our intent. What if we can do that on the MIC line as well? By the time you get to where ray-tracing would be practical, you could see multiple of those being integrated into a single device [added in transcription: Multiple MICs in a single device? Hierarchical MIC?] becomes practical computationally. That won’t be far from now. So, it’s a nice demo. David’s an expert in his field, I didn’t hear what he said, but it you want to see the device downstairs actually running a fairly strenuous graphics workload, take a look at that.

Q: OK. I did go down there and I did see that, I just didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.] On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian Crown. It is.]

[At that point, Dave Patterson walked in, which interrupted us. We said hello – I know Dave of old, a bit – thanks were exchanged with Joe, and I departed.]

[I can’t believe this is the end of the last one. I really don’t like transcribing.]

Sunday, October 2, 2011

Random Things of Interest at IDF 2011 (Intel Developer Forum)

I still have one IDF interview to transcribe (Joe Curley), but I’m sick of doing transcriptions. So here are a few other random things I observed at the 2011 Intel Developers Forum. It is nothing like comprehensive. It’s also not yet the promised MIC dump; that will still come.

Exhibit Hall

I found very few products I had a direct interest in, but then again I didn’t look very hard.

On the right, immediately as you enter, was a demo of a Xeon/MIC combination clocking 600-700 GFLOPS (quite assuredly single precision) doing LRU Factorization. Questions to the guys running the demo indicated: (1) They did part on the Xeon, and there may have been two of those, they weren’t sure (the diagram showed two). (2) They really learned how to say “We don’t comment on competitors” and “We don’t comment on unannounced products.”

A 6-legged robot controlled by Atom, controlled by a game controller. I included this here only because it looked funky and I took a picture (q. v.). Also, for some reason it was in constant slight motion, like it couldn’t sit still, ever.

There were three things that were interesting to me in the Intel Labs section:

One Tbit/sec memory stack: To understand why this is interesting, you need to know that the semiconductor manufacturing processes used to make DRAM and logic are quite different. Putting both on the same chip requires compromises in one or the other. The logic that must exist on DRAM chips isn’t quite as good as it could be, for example. In this project, they separated the two onto separate chips in a stack: Logic is on one, the bottom one, that interfaces with the outside world. On top of this are multiple pure memory chips, multiple layers of pure DRAM, no logic. They connect by solder bumps or something (I’m not sure), and there are many (thousands of) “through silicon vias” that go all the way through the memory chips to allow connecting a whole stack to the logic at the bottom with very high bandwidth. This whole idea eliminates the need to compromise on semiconductor processes, so the DRAM can be dense (and fast), and the logic can be fast (and low power). One result is that they can suck 1 Tbit/sec of data out of one of these stacks. This just feels right to me as a direction. Too bad they’re unlikely to use the new IBM/3M thermally conductive glue to suck heat out of the stack.

Stochastic Ray-Tracing: What it says: Ray-tracing, but allows light to be probabilistically scattered off surfaces, so, for example, shiny matte surfaces have realistically blurred reflections on them, and produce more realistic color effects on other surfaces to which they reflect. Shiny matte surfaces like the surface of the golden dome in the center of the Austrian crown, reflecting the jewels in the outer band, which was their demo image. I have a picture here, but it comes nowhere near doing this justice. The large, high dynamic range monitor they had, though – wow. Just wow. Spectacular. A guy was explaining this to me pointing to a normal monitor when I happened to glance up at the HDR one. I was like “shut up already, I just want to look at that.” To run it they used a cluster of four Xeon-based nodes, each apparently about 4U high, and that was not in real time; several seconds were required per update. But wow.

Real-Time Ray-Tracing: This has been done before; I saw it a demo on a Cell processor back in about 2006. This, however, was a much more complex scene than I’d previously viewed. It had the usual shiny classic car, but that was now in the courtyard of a much larger old palace-like building, with lots of columns and crenellations and the like. It ran on a MIC, of course – actually, several of them, all attached to the same Xeon system. Each had a complete copy of the scene data in its memory, which is unrealistic but does serve to make the problem “pleasantly parallel” (which is what I’m told is now the PC way to describe what used to be called “embarrassingly parallel”). However, the demo was still fun. Here's a video of it I found. It apparently was shot at a different event, but still the same technology demonstrated. The intro is in Swedish, or something, but it reverts to English at the demo. And yes, all the Intel Labs guys wore white lab coats. I teased them a bit on that.

Keynotes

Otellini (CEO): Intel is going hot and heavy into supporting the venerable Trusted Platform technology, a collection of technology which might well work, but upon which nobody has yet bitten. This security emphasis clearly fits with the purchase of MacAfee (everybody got a free MacAfee package at registration, good for 3 systems). “2011 may be the year the industry got serious about security!” I remain unconvinced.

Mooley Eden (General Manager, Mobile Platforms): OK. Right now, I have to say that this is the one time in the course of these IDF posts that I am going to bow to Intel’s having paid for me to attend IDF, bite my tongue rather than succumbing to my usual practice of biting the hand that feeds me, and limit my comments to:

Mooley Eden must be an acquired taste.

To learn more of my personal opinions on this subject, you are going to have to buy me a craft beer (dark & hoppy) in a very noisy bar. Since I don’t like noisy bars, and that’s an unusual combination, I consider this unlikely.

Technically… Ultrabooks, ultrabooks, “beyond thin and light.” More security. They had a lame Ninja-garbed guy on stage, trying to hack into trusted-platform-protected system, and of course failing. (Please see this.) There was also a picture of a castle with a moat, and a (deliberately) crude animation of knights trying to cross the moat and falling in. (I mention this only because it’s relevant to something below.)

People never use hibernate, because it takes too long to wake up. The solution is… to have the system wake up regularly. And run sync operations. Eh what? Is this supposed to cause your wakeup to take less time because the wakeup time is actually spent syncing? My own wakeup time is mostly wakeup. All I know is that suspend/resume used to be really fast, reliable, and smart. Then it got transplanted to Windows from BIOS and has been unsatisfactory - slow and dumb - ever since.

This was my first time seeing Windows 8. It looks like Mango phone interface. Is making phones & PCs look alike supposed to help in some way? (Like boost Windows Phone sales?) I’m quite a bit less than intrigued. It means I’m going to have to buy another laptop before Win 8 becomes the standard.

Justin Rattner (CTO): Some of his stuff I covered in my first post on IDF. One I didn’t cover was the massive deal made of CERN and the LHC (Large Hadron Collider) (“the largest machine human beings have ever created”) (everybody please now go “ooooohhh”) using MICs. Look, folks, the major high energy physics apps are embarrassingly parallel: You get a whole lot, like millions, billions, of particle collisions, gather each one’s data, and do an astounding amount of floating-point computing on each completely independent set of collision data. Separately. Hoping to find out that one is a Higgs boson or something. I saw people doing this in the late 1980s at Fermilab on a homebrew parallel system. They even had a good software framework for using it: Write your (serial) code for analyzing a collision your way, and hand it to us; we run it many times in parallel, just handing out each event’s data to an instance of your code. The only thing that would be interesting about this would be if for some reason they actually couldn’t run HEP codes very well indeed. But they can run them well. Which makes it a yawn for me. I’ve no question that the LHC is ungodly impressive, of course. I just wish it were in Texas and called something else.

Intel Fellows Panel

Some interesting questions asked and answered, many questions lame. Like: “Will high-end Xeons pass mainframes?” Silly question. Depends on what “pass” means. In the sense in which most people may mean, they already have, and it doesn’t matter. Here are some others:

Q: Besides MIC, what else is needed for Exascale? A: We’re having to go all the way down to device level. In particular, we’re looking at subthreshold or near-threshold logic. We tried that before, but failed. Devices turn out to be most efficient 20mv above threshold. May have to run at 800MHz. [Implication: A whole lot of parallelism.] Funny how they talked about near-threshold logic, and Justin Rattner just happened to have a demo of that at the next day’s keynote.

Q: Are you running out of rare metals? A: It’s a question of cost. Yes, we always try to move off expensive materials. Rare earths needed, but not much; we only use them in layers like five atoms thick.

Q: Is Moore’s Law going to end? A: This was answered by Kelin J. Kuhn, Fellow & Director of the Technology and Manufacturing Group – i.e., she really knows silicon. She noted that, by observation, at every given generation it always looks like Moore’s Law ends in two generations. But it never has. Every time we see a major impediment to the physics – several examples given, going back to the 1980s and the end of Dennard scaling – something seems to come along to avoid the problem. The exception seems to be right now: Unlike prior eras when it will end in two generations, there don't seem to be any clouds on this particular horizon at all. (While I personally know of no reason to dispute this, keep in mind that this is from Intel, whose whole existence seems tied to Moore's Law, and it's said by the woman who probably has the biggest responsibility to make it all come about.)

An aside concerning the question-taking woman with the microphone on my side of the hall: I apparently reminded her of something she hates. She kept going elsewhere, even after standing right beside me for several minutes while I had my hand raised. What I was going to ask was: This morning in the keynote we saw a castle with a moat, and several knights dropping into the moat. The last two days we also heard a lot about a knight which appears to take a ferry across the moat of PCIe. Why are you strangling a TFLOP of computation with PCIe? Other accelerator vendors don’t have a choice with their accelerators, but you guys own the whole architecture. Surely something better could be done. Does this, perhaps, indicate a lack of integration or commitment to the new architecture across the organization?

Maybe she was fitted with a wiseass detection system.

Anyway, I guess I won’t find out this year.