Friday, October 28, 2011

MIC and the Knights

Intel’s Many-Integrated-Core architecture (MIC) was on wide display at the 2011 Intel Developer Forum (IDF), along with the MIC-based Knight’s Ferry (KF) software development platform. Well, I thought it was on wide display, but I’m an IDF newbie. There was mention in two keynotes, a demo in the first booth on the right in the exhibit hall, several sessions, etc. Some old hands at IDF probably wouldn’t consider the display “wide” in IDF terms unless it’s in your face on the banners, the escalators, the backpacks, and the bagels.

Also, there was much attempted discussion of the 2012 product version of the MIC architecture, dubbed Knight’s Corner (KC). Discussion was much attempted by me, anyway, with decidedly limited success. There were some hints, and some things can be deduced, but the real KC hasn’t stood up yet. That reticence is probably a turn for the better, since KF is the direct descendant of Intel’s Larrabee graphics engine, which was quite prematurely trumpeted as killing off such GPU stalwarts as Nvidia and ATI (now AMD), only to eventually be dropped – to become KF. A bit more circumspection is now certainly called for.

This circumspection does, however, make it difficult to separate what I learned into neat KF or KC buckets; KC is just too well hidden so far. Here are my best guesses, answering questions I received from Twitter and elsewhere as well as I can.

If you’re unfamiliar with MIC or KF or KC, you can call up a plethora of resources on the web that will tell you about it; I won’t be repeating that information here. Here’s a relatively recent one: Intel Larrabee Take Two. In short summary, MIC is the widest X86 shared-memory multicore anywhere: KF has 32 X86 cores, all sharing memory, four threads each, on one chip. KC has “50 or more.” In addition, and crucially for much of the discussion below, each core has an enhanced and expanded vector / SIMD unit. You can think of that as an extension of SSE or AVX, but 512 bits wide and with many more operations available.

An aside: Intel’s department of code names is fond of using place names – towns, rivers, etc. – for the externally-visible names of development projects. “Knight’s Ferry” follows that tradition; it’s a town up in the Sierra Nevada Mountains in central California. The only “Knight’s Corner” I could find, however, is a “populated area,” not even a real town, probably a hamlet or development, in central Massachusetts. This is at best an unlikely name source. I find this odd; I wish I’d remembered to ask about it.

Is It Real?

The MIC architecture is apparently as real as it can be. There are multiple generations of the MIC chip in roadmaps, and Intel has committed to supply KC (product-level) parts to the University of Texas TACC by January 2013, so at least the second generation is as guaranteed to be real as a contract makes it. I was repeatedly told by Intel execs I interviewed that it is as real as it gets, that the MIC architecture is a long-term commitment by Intel, and it is not transitional – not a step to other, different things. This is supposed to be the Intel highly-parallel technical computing accelerator architecture, period, a point emphasized to me by several people. (They still see a role for Xeon, of course, so they don't think of MIC as the only technical computing architecture.)

More importantly, Joe Curley (Intel HPC marketing) gave me a reason why MIC is real, and intended to be architecturally stable: HPC and general technical computing are about a third of Intel’s server business. Further, that business tends to be a very profitable third since those customers tend to buy high-end parts. MIC is intended to slot directly into that business, obviously taking the money that is now increasingly spent on other accelerators (chiefly Nvidia products) and moving that money into Intel’s pockets. Also, as discussed below, Intel’s intention for MIC is to greatly widen the pool of customers for accelerators.

The Big Feature: Source Compatibility

There is absolutely no question that Intel regards source compatibility as a primary, key feature of MIC: Take your existing programs, recompile with a “for this machine” flag set to MIC (literally: “-mmic” flag), and they run on KF. I have zero doubt that this will also be true of KC and is planned for every future release in their road map. I suspect it’s why there is a MIC – why they did it, rather than just burying Larrabee six feet deep. No binary compatibility, though; you need to recompile.
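
To make “just recompile” concrete, here is a minimal sketch of the workflow. The -mmic flag is Intel’s own; the file name, the OpenMP pragma, and the exact compiler invocations are my illustrative assumptions, not anything Intel showed.

    /* saxpy.c: an ordinary OpenMP-parallel loop; nothing MIC-specific. */
    #include <stdio.h>
    #define N 1000000
    static float x[N], y[N];

    int main(void) {
        int i;
        #pragma omp parallel for    /* same source for Xeon or for MIC */
        for (i = 0; i < N; i++)
            y[i] = 2.0f * x[i] + y[i];
        printf("y[0] = %f\n", y[0]);
        return 0;
    }

    /* Hypothetical build lines; one source, two targets:
         icc -openmp       saxpy.c    # ordinary Xeon binary
         icc -openmp -mmic saxpy.c    # cross-compiled for MIC       */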

You do need to be on Linux; I heard no word about Microsoft Windows. However, Microsoft Windows 8 has a new task manager display, redesigned to better visualize many more cores – up to 640 of them. So who knows; support is up to Microsoft.

Clearly, to get anywhere, you also need to be parallelized in some form; KF has support for MPI (messaging), OpenMP (shared memory), and OpenCL (GPUish SIMD), along with, of course, Intel’s Threading Building Blocks, Cilk, and probably others. No CUDA; that’s Nvidia’s product. It’s a real Linux, by the way, that runs on a few of the MIC cores; I was told “you can SSH to it.” The rest of the cores run some form of microkernel. I see no reason they would want any of that to become more restrictive on KC.

If you can pull off source compatibility, you have something that is wonderfully easy to sell to a whole lot of customers. For example, Sriram Swaminarayan of LANL has noted (really interesting video there) that over 80% of HPC users have, like him, a very large body of legacy code they need to carry into the future. “Just recompile” promises to bring back the good old days of clock speed increases when you just compiled for a new architecture and went faster. At least it does if you’ve already gone parallel on X86, which is far from uncommon. No messing with newfangled, brain-bending languages (like CUDA or OpenCL) unless you really want to. This collection of customers is large, well-funded, and not very well-served by existing accelerator architectures.

Right. Now, for all those readers screaming at me “OK, it runs, but does it perform?” –

Well, not necessarily.

The problem is that to get MIC – certainly KF, and it might be more so for KC – to really perform, on many applications you must get its 512-bit-wide SIMD / vector unit cranking away. Jim Reinders regaled me with a tale of a four-day port to MIC, where, surprised it took that long (he said), he found that it took one day to make it run (just recompile), and then three days to enable wider SIMD / vector execution. I would not be at all surprised to find that this is pleasantly optimistic. After all, Intel cherry-picked the recipients of KF, like CERN, which has one of the world’s most embarrassingly, ah pardon me, “pleasantly” parallel applications in the known universe. (See my post Random Things of Interest at IDF 2011.)
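
For a feel of why the SIMD part takes days rather than hours: often the data layout itself must change before a 512-bit unit can do anything useful. Below is the classic array-of-structures to structure-of-arrays rewrite, a sketch of my own with invented names, not Jim’s actual example.

    /* Array-of-structures: each particle's x, y, z are interleaved,
       so a 512-bit unit cannot load 16 consecutive x values at once. */
    struct particle_aos { float x, y, z; };

    /* Structure-of-arrays: all x values contiguous; one wide load
       fills every SIMD lane, and the loop below vectorizes cleanly. */
    struct particles_soa { float *x, *y, *z; };

    void advance(struct particles_soa *p, const float *vx,
                 int n, float dt) {
        int i;
        for (i = 0; i < n; i++)   /* unit stride: vectorizable */
            p->x[i] += vx[i] * dt;
    }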

Where, on this SIMD/vector issue, are the 80% of folks with monster legacy codes? Well, Sriram (see above) commented that when LANL tried to use Roadrunner – the world’s first PetaFLOPS machine, X86 cluster nodes with the horsepower coming from attached IBM Cell blades – they had a problem because to perform well, the Cell SPUs needed to crank up their two-way SIMD / vector units. Furthermore, they still have difficulty using earlier Xeons’ two-way (128-bit) vector/SIMD units. This makes it sound like using MIC’s 8-way (64-bit ops) SIMD / vector is going to be far from trivial in many cases.

On the other hand, getting good performance on other accelerators, like Nvidia’s, requires much wider SIMD; they need 100s of units cranking, minimally. Full-bore SIMD may in some cases be simpler to exploit than SIMD/vector instructions. But even going through gigabytes of grotty old FORTRAN code just to insert notations saying “do this loop in parallel,” without breaking the code, can be arduous. The programming language, by the way, is not the issue. Sriram reminded me of the old saying that great FORTRAN coders, who wrote the bulk of those old codes, can write FORTRAN in any language.

But wait! How can these guys be choking on 2-way parallelism when they have obviously exploited thousands of cluster nodes in parallel? The answer is that we have here two different forms of parallelism; the node-level one is based on scaling the amount of data, while the SIMD-level one isn’t.

In physical simulations, which many of these codes perform, what happens in this simulated galaxy, or this airplane wing, bomb, or atmosphere column over here has a relatively limited effect on what happens in that galaxy, wing, bomb or column way over there. The effects that do travel can be added as perturbations, smoothed out with a few more global iterations. That’s the basis of the node-level parallelism, with communication between nodes. It can also readily be the basis of processor/core-level parallelism across the cores of a single multiprocessor. (One basis of those kinds of parallelism, anyway; other techniques are possible.)

Inside any given galaxy, wing, bomb, or atmosphere column, however, quantities tend to be much more tightly coupled to each other. (Consider, for example, 1/R² force laws: irrelevant when sufficiently far away, dominant when close.) Changing the way those tightly-coupled calculations are done can often strongly affect the precision of the results, the mathematical properties of the solution, or even whether you ever converge to any solution. That part may not be simple at all to parallelize, even two-way, and exploiting SIMD / vector forces you to work at that level. (For example, you can get into trouble when going parallel and/or SIMD naively changes Gauss-Seidel iteration into Gauss-Jacobi iteration. I went into this in more detail way back in my book In Search of Clusters (Prentice-Hall), Chapter 9, “Basic Programming Models and Issues.”) To be sure, not all applications have this problem; those that don’t often can easily spin up into thousands of operations in parallel at all levels. (Also, multithreaded “real” SIMD, as opposed to vector SIMD, can in some cases avoid some of those problems. Note italicized words.)
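
To make the Gauss-Seidel / Gauss-Jacobi trap concrete, here is a minimal sketch of my own (not from the book). The sequential sweep uses values already updated during the current sweep; parallelize it naively and every iteration reads old values instead, which is a different algorithm with different convergence behavior.

    /* Sequential Gauss-Seidel sweep over a 1-D grid u[0..n-1]:
       u[i-1] was already updated THIS sweep, so information
       propagates within a single sweep. */
    void seidel_sweep(float *u, int n) {
        int i;
        for (i = 1; i < n - 1; i++)
            u[i] = 0.5f * (u[i - 1] + u[i + 1]);
    }

    /* The "same" loop parallelized naively: no iteration sees any
       other's update, so every read of u[i-1] gets the OLD value.
       That is Gauss-Jacobi, not Gauss-Seidel; it can converge more
       slowly, behave differently, or not converge at all. */
    void jacobi_sweep(const float *u, float *u_new, int n) {
        int i;
        #pragma omp parallel for
        for (i = 1; i < n - 1; i++)
            u_new[i] = 0.5f * (u[i - 1] + u[i + 1]);
    }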

The difficulty of exploiting parallelism in tightly-coupled local computations implies that those 80% are in deep horse puckey no matter what. You have to carefully consider everything (even, in some cases, the parenthesization of expressions, forcing the order of operations) when changing that code. Needing to do this to exploit MIC’s SIMD suggests an opening for rivals: I can just see Nvidia salesmen saying “Sorry for the pain, but it’s actually necessary for Intel, too, and if you do it our way you get tons more performance / lower power / whatever.”

Can compilers help here? Sure, they can always eliminate a pile of gruntwork. Automatically vectorizing compilers have been working quite well since the 80s, and progress continues to be made in disentangling the aliasing problems that limit their effectiveness (think FORTRAN COMMON). But commercial (or semi-commercial) products from people like CAPS and The Portland Group get better results if you tell them what’s what, with annotations. Those, of course, must be very carefully applied across mountains of old codes. (They even emit CUDA and OpenCL these days.)
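
One tiny example of “telling them what’s what”: the annotation below is standard C99 restrict, chosen because I’m certain of it; the products mentioned accept richer directives of the same flavor. Without it, the compiler must assume the arrays might overlap (the FORTRAN COMMON situation) and plays it safe.

    /* No annotation: a and b may alias, so SIMD execution could
       change the results; the vectorizer has to assume the worst. */
    void axpy_maybe_aliased(float *a, float *b, int n) {
        int i;
        for (i = 0; i < n; i++)
            a[i] += 2.0f * b[i];
    }

    /* One annotation, promising no overlap, and the same loop
       vectorizes freely. */
    void axpy(float *restrict a, const float *restrict b, int n) {
        int i;
        for (i = 0; i < n; i++)
            a[i] += 2.0f * b[i];
    }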

By the way, at least some of the parallelism often exploited by SIMD accelerators (as opposed to SIMD / vector) derives from what I called node-level parallelism above.

Returning to the main discussion, Intel’s MIC has the great advantage that you immediately get a simply ported, working program; and, in the cases that don’t require SIMD operations to hum, that may be all you need. Intel is pushing this notion hard. One IDF session presentation was titled “Program the SAME Here and Over There” (caps were in the title). This is a very big win, and can be sold easily because customers want to believe that they need do little. Furthermore, you will probably always need less SIMD / vector width with MIC than with GPGPU-style accelerators. Only experience over time will tell whether that really matters in a practical sense, but I suspect it does.

Several Other Things

Here are other MIC facts/factlets/opinions, each needing far less discussion.

How do you get from one MIC to another MIC? MIC, both KF and KC, is a PCIe-attached accelerator. It is only a PCIe target device; it does not have a PCIe root complex, so cannot source PCIe. It must be attached to a standard compute node. So all anybody was talking about was going down PCIe to node memory, then back up PCIe to a different MIC, all at least partially under host control. Maybe one could use peer-to-peer PCIe device transfers, although I didn’t hear that mentioned. I heard nothing about separate busses directly connecting MICs, like the ones that can connect dual GPUs. This PCIe use is known to be a bottleneck, um, I mean, “known to require using MIC on appropriate applications.” Will MIC be that way for ever and ever? Well, “no announcement of future plans”, but “typically what Intel has done with accelerators is eventually integrate them onto a package or chip.” They are “working with others” to better understand “the optimal arrangement” for connecting multiple MICs.
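
Here is a sketch of what that host-mediated path implies. Every function name below is a hypothetical stand-in, since no such API was announced to me; the point is only that each MIC-to-MIC transfer means two PCIe trips with host memory in the middle.

    /* Hypothetical host-side helpers; the names are invented.
       Moving a buffer from MIC 0 to MIC 1 takes two PCIe crossings: */
    #include <stddef.h>

    extern void mic_copy_to_host(int mic_id, void *buf, size_t len);
    extern void mic_copy_from_host(int mic_id, const void *buf, size_t len);

    void mic_to_mic(void *host_buf, size_t len) {
        mic_copy_to_host(0, host_buf, len);    /* PCIe trip 1: down to host */
        mic_copy_from_host(1, host_buf, len);  /* PCIe trip 2: back up      */
        /* Host memory sits in the middle: hence the bottleneck, um,
           the "appropriate applications" property noted above. */
    }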

What kind of memory semantics does MIC have? All I heard was flat cache coherence across all cores, with ordering and synchronizing semantics “standard” enough (= Xeon) that multi-core Linux runs on multiple cores. Not 32-way Linux, though, just 4-way (16, including threads). (Now that I think of it, did that count threads? I don’t know.) I asked whether the other cores ran a micro-kernel and got a nod of assent. It is not the same Linux that they run on Xeons. In some ways that’s obvious, since those microkernels on the other cores have to be managed; whether other things changed I don’t know. Each core has a private cache, and all memory is globally accessible.

Synchronization will likely change in KC. That’s how I interpret Jim Reinders’ comment that current synchronization is fine for 32-way, but over 40 will require some innovation. KC has been said to be 50 cores or more, so there you go. Will “flat” memory also change? I don’t know, but since it isn’t 100% necessary for source code to run (as opposed to perform), I think that might be a candidate for the chopping block at some point.

Is there adequate memory bandwidth for apps that strongly stream data? The answer was that they were definitely going to be competitive, which I interpret as saying they aren’t going to break any records, but will be good enough for less stressful cases. Some quite knowledgeable people I know (non-Intel) have expressed the opinion that memory chips will be used in stacks next to (not on top of) the MIC chip in the product, KC. Certainly that would help a lot. (This kind of stacking also appears in a leaked picture of a “far future prototype” from Nvidia, as well as an Intel Labs demo at IDF.)

Power control: Each core is individually controllable, and you can run all cores flat out, in their highest power state, without melting anything. That’s definitely true for KF; I couldn’t find out whether it’s true for KC. Better power controls than used in KF are now present in Sandy Bridge, so I would imagine that at least that better level of support will be there in KC.

Concluding Thoughts

Clearly, I feel the biggest point here is Intel’s planned commitment over time to a stable architecture that is source code compatible with Xeon. Stability and source code compatibility are clear selling points to the large fraction of the HPC and technical computing market that needs to move forward a large body of legacy applications; this fraction is not now well-served by existing accelerators. Also important is the availability of familiar tools, and more of them, compared with popular accelerators available now. There’s also a potential win in being able to evolve existing programmer skills, rather than replacing them. Things do change with the much wider core- and SIMD-levels of parallelism in MIC, but it’s a far less drastic change than that required by current accelerator products, and it starts in a familiar place.

Will MIC win in the marketplace? Big honking SIMD units, like Nvidia ships, will always produce more peak performance, which makes it easy to grab more press. But Intel’s architectural disadvantage in peak juice is countered by process advantage: They’re always two generations ahead of the fabs others use; KC is a 22nm part, with those famous “3D” transistors. It looks to me like there’s room for both approaches.

Finally, don’t forget that Nvidia in particular is here now, steadily increasing its already massive momentum, while a product version of MIC remains pie in the sky. What happens when the rubber meets the road with real MIC products is unknown – and the track record of Larrabee should give everybody pause until reality sets well into place, including SIMD issues, memory coherence and power (neither discussed here, but not trivial), etc.

I think a lot of people would, or should, want MIC to work. Nvidia is hard enough to deal with in reality that two best paper awards were given at the recently concluded IPDPS 2011 conference – the largest and most prestigious academic parallel computing conference – for papers that may as well have been titled “How I actually managed to do something interesting on an Nvidia GPGPU.” (I’m referring to the “PHAST” and “Profiling” papers shown here.) Granted, things like a shortest-path graph algorithm (PHAST) are not exactly what one typically expects to run well on a GPGPU. Nevertheless, this is not a good sign. People should not have to do work at the level of intellectual academic accolades to get something done – anything! – on a truly useful computer architecture.

Hope aside, a lot of very difficult hardware and software still has to come together to make MIC work. And…

Larrabee was supposed to be real, too.

**************************************************************

Acknowledgement: This post was considerably improved by feedback from a colleague who wishes to maintain his Internet anonymity. Thank you!

Wednesday, October 5, 2011

Will Knight’s Corner Be Different? Talking to Intel’s Joe Curley at IDF 2011


At the recent Intel Developer Forum (IDF), I was given the opportunity to interview Joe Curley, Director, Technical Computing Marketing of Intel’s Datacenter & Connected Systems Group in Hillsboro.

Intel-provided information about Joe:
Joe Curley serves Intel® Corporation as director of marketing for technical computing in the Data Center Group. The technical computing marketing team manages marketing for high-performance computing (HPC) and workstation product lines as well as future Intel® Many Integrated Core (Intel® MIC) products. Joe joined Intel in 2007 to manage planning activities that led up to the announcement of the Intel® MIC Architecture in May of 2010. Prior to joining Intel, Joe worked at Dell, Inc. and graphics pioneer Tseng Labs in a series of marketing and engineering leadership roles.
I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! to all who responded.)

This is the last in a series of three such transcripts. Hallelujah! Doing this has been a pain. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews. But some were in the interviews, including here.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

[We began by discovering we had similar deep backgrounds, both starting in graphics hardware. I designed & built a display processor (a prehistoric GPU), he built “the most efficient frame buffer controller you could possibly make”. Guess which one of us is in marketing?]

A: My experience in the [HPC] business really started relatively recently, a little under five years ago, [when] I started working on many-core processors. I won’t be able to go into history, but I can at least tell you what we’re doing and why.

Q: Why don’t we start there? At a high level, what are you doing, and why? High level for what you are doing, and as much detail on “why” as you can provide.

A: We have to narrow the question. So, at Intel, what we’re after first of all is in what we call our Technical Computing Marketing Group inside Data Center Group. That has really three major objectives. The first one is to specify the needs for high performance computing, how we can help our customers and developers build the best high performance computing systems.

Q: Let me stop you for a second right there. My impression for high performance computing is that they are people whose needs are that they want more. Just more.

A: Oh, yes, but more at what cost? What cost of power, what cost of programmability, what cost of size? How are we going to build the IO system to handle it affordably, or use the fabric of the day?

Q: Yes, they want more, but they want it at two bytes/FLOPS of memory bandwidth and communication bandwidth.

A: There’s an old thing called the Dilbert Spec, which is “I want it all, and by the way, can it be free?” But that’s not really what people tell us they want. People in HPC have actually been remarkably pragmatic about what it takes to develop innovation. So they really want us to do some things, and do them really well.

By the way, to finish what we do, we also have the workstation segment, and the MIC Many Integrated Core product line. The marketing for that is also in our group.

You asked “what are you doing and why.” It would probably take forever to go across all domains, but we could go into any one of them a little bit better.

Q: Can you give me a general “why” for HPC, and a specific “why” for MIC?

A: Well, HPC’s a really good business. I get stunned; somebody must be asking really weird questions, asking “why are you doing HPC?”

Q: What I’ve heard is that HPC is traditionally 12% of the market.

A: Supercomputing is a relatively small percentage of the market. HPC and technical computing, combined, is, not exactly, but roughly, a third of our data center business. [emphasis added by me] Our data center business is a pretty robust business. And high performance computing is a business that requires very high end, high performance processors. It’s actually a very desirable business to be in, if you can do it, and if your systems work. It’s a business we spend a lot of time working on because it’s a good business.

Now, if you look at MIC, back in 2005 we made a tacit conclusion that the performance of a system will come out of parallelism. Parallelism could be expressed at Intel in a lot of different ways. You can look at it as threads, we have this concept called hyperthreading. You can look at it as cores. And we have the SSE instructions sitting around which are SIMD, that’s a form of parallelism; people argue about the definition, but yes, it is. [I agree.] So you take a look at the basic architectural constructs, ease of programming, you know, a cache-based CISC model, and then scaling on cores, threads, SIMD or vectors, these common attributes have been adopted and well-used by a lot of programmers. There are programs across the continuum of coarse- to fine-grained parallel, embarrassingly parallel, pick your taxonomy. But there are applications that developers would be willing to trade the performance of any particular task or thread for the sum of what you can do inside the power envelope at a given period of time. Lots of people have different ways of defining that, you hear throughput, whatever, but this is the class of applications, and over time they’re growing.

Q: Growing relatively, or, say, compared to commercial processing, or…? Is the segment getting larger?

A: The number of people who have tasks they want to run on that kind of hardware is clearly growing. One of the reasons we’re doing MIC, maybe I should just cut it to the easiest answer, is developers and customers asked us to.

Q: Really?

A: And they came to us with a really simple question. We were struggling in the marketing group with how to position MIC, and one of our developers got worked up, like “Look, you give me the parallel performance of an accelerator, but you give me the ease of CPU programming!” Now, ease is a funny word; you can get into religious arguments about ease. But I think what he means is “I don’t have to re-think my algorithm, I don’t have to reorder my data set, there are some things that I don’t have to do.” So they wanted to have the idea of “give me this architecture and get it to scale to be wildly parallel.” And that is exactly what we’ve done with the MIC architecture. If you think about what the Knight’s Ferry STP [? Undoubtedly this is SDP - Software Development Platform; I just heard it wrong on the recording.] is, a 32 core, coherent, on a chip, teraflop part, it’s kind of like Paragon or ASCI Red on a chip. [but it is only a TFLOPS in single precision] And the programming model is, surprisingly, kind of like a bunch of processor cores on a network, which a lot of people understand and can get a lot of utility out of in a very well-understood way. So, in a sense, we’re giving people what they want, and that, generally, is good business. And if you don’t give them what they want, they’ll have to go find someone else. So we’re simply doing what our marketplace asked us for.

Q: Well, let me play a little bit of devil’s advocate here, because MIC is very clearly derivative of Larrabee, and…

A: Knight’s Ferry is.

Q: … Knight’s Ferry is. Not MIC?

A: No. I think you have to take a look at what Larrabee was. Larrabee, by the way, was a really cool project, but what Larrabee was was a tile rendering graphics device, which meant its design point was, first of all, a programming model derived from what you do for graphics. It’s going to be API-based, the answer it’s going to generate is going to be a pixel, the pixel is going to have a defined level of sub-pixel accuracy. It’s a very predictable output. The internal optimizations you would make for a graphics implementation of a general many-core architecture are one very specific implementation. Let’s talk about the needs of the high performance computing market. I need bandwidth. I need memory depth. Larrabee didn’t need memory depth; it didn’t have a frame buffer.

Q: It needed bandwidth to local memory [of which it didn’t have enough; see my post The Problem with Larrabee]

A: Yes, but less than you think, because the cache was the critical element in that architecture [again, see that post] if you look through the academic papers on that…

Q: OK, OK.

A: So, they have a common heritage, they’re both derived out of the thoughts that came out of the Intel Labs terascale research. They’re both many-core. But Knight’s Ferry came out with a few, they’re only a few, modifications. But the programming model is completely different. You don’t program a graphics device like you do a computer, and MIC is a computer.

Q: The higher-level programming model is different.

A: Correct.

Q: But it is a big, wide, cache-coherent SMP.

A: Well, yes, that’s what Knight’s Ferry is, but we haven’t talked about what Knight’s Corner is yet, and unfortunately I won’t today, and we haven’t talked about where the product line will go from there, either. But there are many things that will remain the same, because there are things you can take and embellish and work on, and things that will be really different.

Q: But can you at least give me a hint? Is there a chance that Knight’s Corner will be a substantially different hardware model than Knight’s Ferry?

A: I’m going to really love to talk to you about Knight’s Corner. [his emphasis]

Q: But not today.

A: I’m going to duck it today.

Q: Oh, man…

A: The product is going to be in our 22 nm process, and 22 nm isn’t shipping yet. When we get a little bit closer, when it deserves to have the buzz generated, we’ll start generating buzz. Right now, the big thing is that we’re making the investments in the Knight’s Ferry software development platform, to see how codes scale across the many-core, to get the environment and tools up, to let developers poke at it and find stuff,  good stuff, bad stuff, in between stuff, that allow us to adjust the product line for ongoing generations. We’ve done that really well since we announced the architecture about 15 months ago.

Q: I was wondering what else I was going to talk about after having talked to both John Hengeveld and Jim Reinders. This is great. Nobody talked about where it really came from, and even hinted that there were changes to the MIC chip [architecture].

A: Oh, no, no, many things will be the same, many things will be different. If you’re targeting trying to do a pixel-renderer, go do a pixel-renderer. If you’re trying to do a general-purpose computing device, do a general-purpose computing device. You’ll see some things and say “well, it’s all the same” and other things “wow, it’s completely different.” We’ll get around to talking about the part when we’re a little closer.

The most important thing that James and/or John should have been talking about is the ability to not force the developer to completely and utterly re-think their problem to use your hardware. There are two models: In an accelerator model, which is something I spent a lot of my life working with, accelerators have the advantage of optimization. You can say “I want to do one thing really well.” So you can then describe a programming model for the hardware. You can say “build your data this way, write your program this way” and if you do, it will work. The problem is that not everything fits into the box. Oh, you have sparse data. Oh, you have recursive code.

Q: And there’s madness in that direction, because if you start supporting that you wind yourself around to a general-purpose machine. […usually, a very odd-looking general-purpose machine. I’ve talked about Sutherland’s “Wheel of Reincarnation” in this blog, haven’t I? Oh, there it is: The Cloud Got GPUs, back in November 2010.]

A: Then it’s not an accelerator any more. The thing that you get in MIC is the performance of one of those accelerators. We’ve shown this. We’ve hit 960GF out of a peak 1.2TF without throwing away precision, without playing any circus tricks, just run the hardware. On Knight’s Ferry we’ve shown that. So you get performance, but you’re getting it out of the general purpose programming model.
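
[Checking that arithmetic, with one assumption of mine: 32 cores × 16 single-precision lanes (512 bits / 32 bits) × 2 operations per cycle for a fused multiply-add × roughly 1.2 GHz gives about 1.2 TF single precision, matching the peak he quotes; 960 GF is then about 80% of peak. The clock rate is my guess, not an Intel statement.]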

Q: That’s running LINPACK, or… ?

A: That was an even more basic thing; I’m just talking about SGEMM [single-precision dense matrix multiply].
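
[For grounding: SGEMM computes C = alpha*A*B + beta*C in single precision. The textbook kernel below is mine, added purely to define the operation; a tuned SGEMM looks nothing like this.]

    /* Textbook SGEMM: C = alpha*A*B + beta*C, all n-by-n, row-major.
       This defines the operation; it is nobody's tuned implementation. */
    void sgemm(int n, float alpha, const float *A, const float *B,
               float beta, float *C) {
        int i, j, k;
        for (i = 0; i < n; i++)
            for (j = 0; j < n; j++) {
                float sum = 0.0f;
                for (k = 0; k < n; k++)
                    sum += A[i * n + k] * B[k * n + j];
                C[i * n + j] = alpha * sum + beta * C[i * n + j];
            }
    }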

Q: I just wanted to ground the number.

A: For LU factorization, I think we showed hybrid LU, really cool, one of the great things about this hybrid…

Q: They’re demo-ing that downstairs.

A: … OK. When the matrix size is small, I keep it on the host; when the matrix size is large, I move it. But it’s all the same code, the same code either place. I’m just deciding where I want to run the code intelligently, based on the size of the matrix. You can get the exact number, but I think it’s on the order of 750GBytes/sec for LU [GFLOPS?], which is actually, for a first-generation part, not shabby. [They were doing 650-750 GF according to the meter I saw. That's single precision; Knight's Ferry was originally a graphics part.]
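
[The “same code either place” dispatch Joe describes boils down to something like the sketch below. All names here are invented stand-ins; Intel described the behavior to me, not an API.]

    #define THRESHOLD 2048   /* crossover size: a pure guess of mine */

    extern void lu_factor(float *A, int n);        /* one LU source...   */
    extern void lu_factor_on_mic(float *A, int n); /* ...two targets
                                                      (hypothetical)     */

    /* Pick where to run by matrix size: small stays on the host,
       large is worth the PCIe round trip to the accelerator. */
    void hybrid_lu(float *A, int n) {
        if (n < THRESHOLD)
            lu_factor(A, n);
        else
            lu_factor_on_mic(A, n);
    }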

Q: Yaahh, well, there are a lot of people who can deliver something like that.

A: We’ll keep working on it and making it better and better. So, what are we proving today. All we’ve proven today is that the architecture is capable of performance. We’ve got a lot of work to do before we have a product, but the architecture has shown itself to be capable. The programming model, we have people who will speak for us, like the quotes that came from LRZ [data center for the universities of Munich and the Bavarian Academy of Sciences], from Leibniz [same place], a code they couldn’t port to other accelerators was running in two hours and optimized in two days. Now, actual mileage may vary, see dealer for…

Q: So, there are things that just won’t run on a CUDA model? Example?

A: Well, perhaps, again, the thing you try to get to is whether there is evidence growing that what you say is real. So we’re having people who are starting to be able to speak to that, and that gives people the confidence that we’re going to be able to get there. The other thing it ends up doing, it’s kind of an odd benefit, as people have started building their code, trying to optimize it for MIC, they’re finding the parallelism, they’re doing what we wanted them to do all along, they’re taking the same code on their current cluster and they’re getting benefits right now.

Q: That’s got a long history. People would have some grotty old FORTRAN code, and want to vectorize it, but the vectorizing compiler couldn’t make crap out of it. So they cleaned it up, made it obvious what was going on, and the vectorizer did its thing well. Then they put it back on the original machine and it ran twice as fast.

A: So, one of the nice things that’s happened is that as people are looking at ways to scale power, performance, they’re finally getting around to dealing with parallelism. The offer that we’re trying to provide is portable, high level, standards-based, and you can use it now.

You said “why.” That’s why. Our customers and developers say “if you can do that, that’s really valuable.” Now. We’re four men and a pudding, we haven’t shipped a product yet, we’ve got a lot of work to do, but the thought and the promise and the early data is really good.

Q: OK. Well, great.

A: Was that a good use of the time?

Q: That’s a very good use of the time. Let me poke on one thing a little bit. Conceptually, it ought to be simpler to write code to that kind of a shared memory model and get parallelism out of the code that way. Now, on the other hand, there was a talk – sorry, I forget his name, he was one of the software guys working on Larrabee [it was Tom Forsyth; see my post The Problem with Larrabee again] – where he said someone on the project had written four renderers, and three of them were for Larrabee. He was having one hell of a time trying to get something that performed well. His big issue, at least what it came down to from what I remember of the talk, was memory bandwidth.

A: Well, first of all, we’ve said Larrabee’s not a product. As I’ve said, one of the things that is critical, you’ve got the compute-bound, you’ve got the memory-bound, and most people are somewhere in between, but you have to be able to handle the two edge cases. We understand that, and we intend to deliver a really good value across the spectrum. Now, Knight’s Ferry has the RVI silicon [RVI? I’m guessing here], it’s a variation off the silicon we used, no one cares about that,  but on Knight’s Ferry, the memory bus is 256 bits wide. Relatively narrow, and for a graphics processor, very narrow. There are definitely design decisions in how that chip was made that would limit the bandwidth. And the memory it was designed with is slower than the memory today, you have all of the normal things. But if you went downstairs to the show floor, and talk to Daniel Paul, he’s demonstrating a pretty dramatic ray-tracer.

[What follows is a bit confused. He didn’t mean the Austrian Crown stochastic ray-tracing demo, but rather the real-time ray-tracing demo. As I said in my immediately previous post (Random Things of Interest at IDF 2011), the real-time demo is on a set of Knight’s Ferries attached to a Xeon-based node. At the time of the interview, I hadn’t seen the real-time demo, just the stochastic one; the latter is not on Knight’s Ferry.]

Q: I’ve seen that one. The Austrian Crown?

A: Yes.

Q: I thought that was on a cluster.

A: In the little box behind there, he’s able to scale from one to eight Knight’s Ferries.

Q: He never told me there was a Knight’s Ferry in there.

A: Yes, it’s all Knight’s Ferry.

Q: Well, I’m going to go down there and beat on him a little bit.

A: I’m about to point you to a YouTube site, it got compressed and thrown up on YouTube. You can’t get the impact of the complexity of the rays, but you can at least get the superficial idea of the responsiveness of the system from Knight’s Ferry.

[He didn’t point me to YouTube, or I lost it, but here’s one I found. Ignore the fact that the introduction is in Swedish or something [it's Dutch, actually]; Daniel – and it’s Daniel, not David – speaks English, and gives a good demo. Yes, everybody in the “Labs” part of the showroom wore white lab coats. I did a bit of teasing. I also updated the Random Things of Interest post to directly include it.]

Well, if you believe that what we’re going to do in our mainstream processors is roughly double the FLOPS every generation for the next many generations, that’s our intent. What if we can do that on the MIC line as well? By the time you get to where ray-tracing would be practical, you could see multiples of those being integrated into a single device [added in transcription: Multiple MICs in a single device? Hierarchical MIC?]; it becomes practical computationally. That won’t be far from now. So, it’s a nice demo. David’s an expert in his field, I didn’t hear what he said, but if you want to see the device downstairs actually running a fairly strenuous graphics workload, take a look at that.

Q: OK. I did go down there and I did see that, I just didn’t know it was Knight’s Ferry. [It’s not, it’s not, still confused here.] On that HDR display that is gorgeous. [Where “it” = stochastically-ray-traced Austrian Crown. It is.]

[At that point, Dave Patterson walked in, which interrupted us. We said hello – I know Dave of old, a bit – thanks were exchanged with Joe, and I departed.]

[I can’t believe this is the end of the last one. I really don’t like transcribing.]

Sunday, October 2, 2011

Random Things of Interest at IDF 2011 (Intel Developer Forum)


I still have one IDF interview to transcribe (Joe Curley), but I’m sick of doing transcriptions. So here are a few other random things I observed at the 2011 Intel Developers Forum. It is nothing like comprehensive. It’s also not yet the promised MIC dump; that will still come.

Exhibit Hall

I found very few products I had a direct interest in, but then again I didn’t look very hard.

On the right, immediately as you enter, was a demo of a Xeon/MIC combination clocking 600-700 GFLOPS (quite assuredly single precision) doing LU factorization. Questions to the guys running the demo indicated: (1) They did part on the Xeon, and there may have been two of those; they weren’t sure (the diagram showed two). (2) They really learned how to say “We don’t comment on competitors” and “We don’t comment on unannounced products.”

A 6-legged robot controlled by Atom, controlled by a game controller. I included this here only because it looked funky and I took a picture (q. v.). Also, for some reason it was in constant slight motion, like it couldn’t sit still, ever.

There were three things that were interesting to me in the Intel Labs section:

One Tbit/sec memory stack: To understand why this is interesting, you need to know that the semiconductor manufacturing processes used to make DRAM and logic are quite different. Putting both on the same chip requires compromises in one or the other. The logic that must exist on DRAM chips isn’t quite as good as it could be, for example. In this project, they separated the two onto separate chips in a stack: Logic is on one, the bottom one, that interfaces with the outside world. On top of this are multiple pure memory chips, multiple layers of pure DRAM, no logic. They connect by solder bumps or something (I’m not sure), and there are many (thousands of) “through silicon vias” that go all the way through the memory chips to allow connecting a whole stack to the logic at the bottom with very high bandwidth. This whole idea eliminates the need to compromise on semiconductor processes, so the DRAM can be dense (and fast), and the logic can be fast (and low power). One result is that they can suck 1 Tbit/sec of data out of one of these stacks. This just feels right to me as a direction. Too bad they’re unlikely to use the new IBM/3M thermally conductive glue to suck heat out of the stack.

Stochastic Ray-Tracing:  What it says: Ray-tracing, but allows light to be probabilistically scattered off surfaces, so, for example, shiny matte surfaces have realistically blurred reflections on them, and produce more realistic color effects on other surfaces to which they reflect. Shiny matte surfaces like the surface of the golden dome in the center of the Austrian crown, reflecting the jewels in the outer band, which was their demo image. I have a picture here, but it comes nowhere near doing this justice. The large, high dynamic range monitor they had, though – wow. Just wow. Spectacular. A guy was explaining this to me pointing to a normal monitor when I happened to glance up at the HDR one. I was like “shut up already, I just want to look at that.” To run it they used a cluster of four Xeon-based nodes, each apparently about 4U high, and that was not in real time; several seconds were required per update. But wow.

Real-Time Ray-Tracing: This has been done before; I saw a demo on a Cell processor back in about 2006. This, however, was a much more complex scene than I’d previously viewed. It had the usual shiny classic car, but that was now in the courtyard of a much larger old palace-like building, with lots of columns and crenellations and the like. It ran on a MIC, of course – actually, several of them, all attached to the same Xeon system. Each had a complete copy of the scene data in its memory, which is unrealistic but does serve to make the problem “pleasantly parallel” (which is what I’m told is now the PC way to describe what used to be called “embarrassingly parallel”). However, the demo was still fun. Here's a video of it I found. It apparently was shot at a different event, but still the same technology demonstrated. The intro is in Swedish, or something, but it reverts to English at the demo. And yes, all the Intel Labs guys wore white lab coats. I teased them a bit on that.


Keynotes

Otellini (CEO): Intel is going hot and heavy into supporting the venerable Trusted Platform technology, a collection of technology which might well work, but upon which nobody has yet bitten. This security emphasis clearly fits with the purchase of McAfee (everybody got a free McAfee package at registration, good for 3 systems). “2011 may be the year the industry got serious about security!” I remain unconvinced.

Mooley Eden (General Manager, Mobile Platforms): OK. Right now, I have to say that this is the one time in the course of these IDF posts that I am going to bow to Intel’s having paid for me to attend IDF, bite my tongue rather than succumbing to my usual practice of biting the hand that feeds me, and limit my comments to:

Mooley Eden must be an acquired taste.

To learn more of my personal opinions on this subject, you are going to have to buy me a craft beer (dark & hoppy) in a very noisy bar. Since I don’t like noisy bars, and that’s an unusual combination, I consider this unlikely.

Technically… Ultrabooks, ultrabooks, “beyond thin and light.” More security. They had a lame Ninja-garbed guy on stage, trying to hack into a trusted-platform-protected system, and of course failing. (Please see this.) There was also a picture of a castle with a moat, and a (deliberately) crude animation of knights trying to cross the moat and falling in. (I mention this only because it’s relevant to something below.)

People never use hibernate, because it takes too long to wake up. The solution is… to have the system wake up regularly. And run sync operations. Eh what? Is this supposed to cause your wakeup to take less time because the wakeup time is actually spent syncing? My own wakeup time is mostly wakeup. All I know is that suspend/resume used to be really fast, reliable, and smart. Then it got transplanted to Windows from BIOS and has been unsatisfactory - slow and dumb - ever since.

This was my first time seeing Windows 8. It looks like the Mango phone interface. Is making phones & PCs look alike supposed to help in some way? (Like boosting Windows Phone sales?) I’m quite a bit less than intrigued. It means I’m going to have to buy another laptop before Win 8 becomes the standard.

Justin Rattner (CTO): Some of his stuff I covered in my first post on IDF. One I didn’t cover was the massive deal made of CERN and the LHC (Large Hadron Collider) (“the largest machine human beings have ever created”) (everybody please now go “ooooohhh”) using MICs. Look, folks, the major high energy physics apps are embarrassingly parallel: You get a whole lot, like millions, billions, of particle collisions, gather each one’s data, and do an astounding amount of floating-point computing on each completely independent set of collision data. Separately. Hoping to find out that one is a Higgs boson or something. I saw people doing this in the late 1980s at Fermilab on a homebrew parallel system. They even had a good software framework for using it: Write your (serial) code for analyzing a collision your way, and hand it to us; we run it many times in parallel, just handing out each event’s data to an instance of your code. The only thing that would be interesting about this would be if for some reason they actually couldn’t run HEP codes very well indeed. But they can run them well. Which makes it a yawn for me. I’ve no question that the LHC is ungodly impressive, of course. I just wish it were in Texas and called something else.
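
That framework, by the way, is about the simplest parallel pattern there is. Here is a sketch of my own of its shape; the names are mine, and the real Fermilab code surely differed:

    /* Event-level parallelism: the user supplies a serial analyze()
       for one collision; the framework farms events out in parallel.
       Each event's data is completely independent of every other's. */
    typedef struct { int detector_data; /* stand-in payload */ } event_t;

    void run_all(const event_t *events, int n_events,
                 void (*analyze)(const event_t *)) {
        int i;
        #pragma omp parallel for schedule(dynamic)
        for (i = 0; i < n_events; i++)
            analyze(&events[i]);
    }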

Intel Fellows Panel

Some interesting questions asked and answered, many questions lame. Like: “Will high-end Xeons pass mainframes?” Silly question. Depends on what “pass” means. In the sense in which most people may mean, they already have, and it doesn’t matter. Here are some others:

Q: Besides MIC, what else is needed for Exascale? A: We’re having to go all the way down to device level. In particular, we’re looking at subthreshold or near-threshold logic. We tried that before, but failed. Devices turn out to be most efficient 20 mV above threshold. May have to run at 800 MHz. [Implication: A whole lot of parallelism.] Funny how they talked about near-threshold logic, and Justin Rattner just happened to have a demo of that at the next day’s keynote.

Q: Are you running out of rare metals? A: It’s a question of cost. Yes, we always try to move off expensive materials. Rare earths needed, but not much; we only use them in layers like five atoms thick.

Q: Is Moore’s Law going to end? A: This was answered by Kelin J. Kuhn, Fellow & Director of the Technology and Manufacturing Group – i.e., she really knows silicon. She noted that, by observation, at any given generation it always looks like Moore’s Law ends in two generations. But it never has. Every time we see a major impediment to the physics – several examples given, going back to the 1980s and the end of Dennard scaling – something seems to come along to avoid the problem. The exception seems to be right now: unlike prior eras, when the end looked two generations away, there don’t seem to be any clouds on this particular horizon at all. (While I personally know of no reason to dispute this, keep in mind that this is from Intel, whose whole existence seems tied to Moore's Law, and it's said by the woman who probably has the biggest responsibility to make it all come about.)

An aside concerning the question-taking woman with the microphone on my side of the hall: I apparently reminded her of something she hates. She kept going elsewhere, even after standing right beside me for several minutes while I had my hand raised. What I was going to ask was: This morning in the keynote we saw a castle with a moat, and several knights dropping into the moat. The last two days we also heard a lot about a knight which appears to take a ferry across the moat of PCIe. Why are you strangling a TFLOP of computation with PCIe? Other accelerator vendors don’t have a choice with their accelerators, but you guys own the whole architecture. Surely something better could be done. Does this, perhaps, indicate a lack of integration or commitment to the new architecture across the organization?

Maybe she was fitted with a wiseass detection system.

Anyway, I guess I won’t find out this year.

Thursday, September 29, 2011

A Conversation with Intel’s James Reinders at IDF 2011


At the recent Intel Developer Forum (IDF), I was given the opportunity to interview James Reinders. James is a Director and Software Evangelist in Intel’s Software and Services Group in Oregon, and the conversation ranged far and wide, from programming languages, to frameworks, to transactional memory, to the use of CUDA, to Matlab, to vectorizing for execution on Intel’s MIC (Many Integrated Core) architecture.


Intel-provided information about James:
James Reinders is an expert on parallel computing. James is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including the systolic array systems WARP and iWarp, and the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for multiple Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. His most recent book is “Intel Threading Building Blocks” from O'Reilly Media, which has been translated into Japanese, Chinese and Korean. James has published numerous articles, contributed to several books, and one of his current projects is co-authoring a new book on parallel programming to be released in 2012.

I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! to all who responded.)

This is #2 in a series of three such transcripts. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

Pfister: [Discussing where I’m coming from, crowd-sourced question list, HPC & MIC focus here.] So where would you like to start?

Reinders: Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so do your best. It has been for a long, long time, so hopefully I have a very realistic view of what works and what doesn’t work. I think I surprise some people with optimism about where we’re going, but I have some good reasons to see there’s a lot of things we can do in the architecture and the software that I think will surprise people to make that a lot more approachable than you would expect. Amdahl’s law is still there, but some of the difficulties that we have with the systems in terms of programming, the nondeterminism that gets involved in the programming, which you know really destroys the paradigm of thinking how to debug, those are solvable problems. That surprises people a lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago, computers are so much faster and it benefits the tools. Think about how much more the tools can do. You know, your compiler still compiles in about the same time it did 10 years ago, but now it’s doing a lot more, and now that multicore has become very prevalent in our hardware architecture, there are some hooks that we are going to get into the hardware that will solve some of the debugging problems that debugging tools can’t do by themselves because we can catch the attention of the architects and we understand enough that there’s some give-and-take in areas that might surprise people, that they will suddenly have a tool where people say “how’d you solve that problem?” and it’s over there under the covers. So I’m excited about that.

[OK, so everybody forgive me for not jumping right away on his fix for nondeterminism. What he meant by that was covered later.]

Pfister: So, you’re optimistic?

Reinders: Optimistic that it’s not the end of the world.

Pfister: OK. Let me tell you where I’m coming from on that. A while back, I spent an evening doing a web survey of parallel programming languages, and made a spreadsheet of 101 parallel programming languages [see my much earlier post, 101 Parallel Programming Languages].

Reinders: [laughs] You missed a few.

Pfister: I’m sure I did. It was just one night. But not one of those was being used. MPI and OpenMP, that was it.

Reinders: And Erlang has had some limited popularity, but is dying out. They’re a lot like AI and some other things. They help solve some problems, and then if the idea is really an advance, you’ll see something from that materialize in C or C++, Java, or C#. Those languages teach us something that we then use where we really want it.

Pfister: I understand what you’re saying. It’s like MapReduce being a large-scale version of the old LISP mapcar.

Reinders: Which was around in the early 70s. A lot of people picked up on it, it’s not a secret but it’s still, you know, on the edge.

Pfister: I heard someone say recently that there was a programming crisis in the early 80s: How were you going to program all those PCs? It was solved not by programming, but by having three or four frameworks, such as Excel or Word, that some experts in a dark room wrote, everybody used, and it spread like crazy. Is there anything now like that which we could hope for?

Reinders: You see people talk about Python, you see Matlab. Python is powerful, but I think it’s sort of trapped between general-purpose programming and Matlab. It may be a big enough area; it certainly has a lot of followers. Matlab is a good example. We see a lot of people doing a lot in Matlab. And then they run up against barriers. Excel has the same thing. You see Excel grow up, and people do incredibly hairy things. We worked with Microsoft a few years ago, and they’ve added parallelism to Excel, and it’s extremely important to some people. Some people have spreadsheets out there that do unbelievable things. You change one cell, and it would take a computer from just a couple of years ago and just stall it for 30 minutes while it recomputes. [I know of people in the finance industry who go out for coffee for a few hours if they accidentally hit F5.] Now you can do that in parallel. I think people do gravitate towards those frameworks, as you’re saying. So which ones will emerge? I think there’s hope. I think Matlab is one; I don’t know that I’d put my money on that being the huge one. But I do think there’s a lot of opportunity for that to hide this compute power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did that, they removed something that you would have programmed over and over again, made it accessible without it looking like programming.

Pfister: People did spreadsheets without realizing they were programming, because it was so obvious.

Reinders: Yes, you’re absolutely right. I tend to think of it in terms of libraries, because I’m a little bit more of an engineer. I do see development of important libraries that use unbelievable amounts of compute power behind them and then simply do something that anyone could understand. Obviously image processing is one [area], but there are other manipulations that I think people will just routinely be able to throw into an application, but what stands behind them is an incredibly complex library that uses compute power to manipulate that data. You see Apple use a lot of this in their user interface, just doing this [swipes] or that to the screen, I mean the thing behind that uses parallelism quite well.

Pfister: But this [swipes] [meaning the thing you do] is simple.

Reinders: Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the modern equivalent of using spreadsheets or Word. It’s the user interfaces, and they are demanding a lot behind them. It’s unbelievable, the compute power that they can use.

Pfister: Yes, it is. And I really wonder how many times you’re going to want to scan your pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a couple of days.

Reinders: Right, but I think rather than that being an activity, it’s just something your computer does for you. It disappears. Most of us don’t want to organize things, we want it just done. And Google’s done that on the web. Instead of keeping a million bookmarks to find something, you do a search.

Pfister: Right. I used to have this incredible tree of bookmarks, and could never find anything there.

Reinders: Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.

Pfister: I remember when it was a major feature of Firefox that it provided searching of your bookmarks.

Reinders: [Laughter]

Pfister: You mentioned nondeterminism. Are there any things in the hardware that you’re thinking of? IBM Blue Gene just said they have transactional memory, in hardware, on a chip. I’m dubious.

Reinders: Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long time, we being the industry, Intel included. At first we hoped “Wow, get rid of locks, we’ll all just use transactional memory, it’ll just work.” Well, the shortest way I can say why it doesn’t work is that software people want transactions to be arbitrarily large, and hardware needs it to be constrained, so it can actually do what you’re asking it to do, like holding a buffer. That’s a nonstarter.

So now what’s happening? Sun was looking at this with Rock, a hybrid technique, and unfortunately they didn’t bring it to market. Nobody outside the team knows exactly what happened, but the project as a whole failed; it’s not that transactional memory was the cause of death. But they had a hard time figuring out how you engineer that buffering. A lot of smart people are looking at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a single socket. It makes sense to me why a constraint like that would be buildable. The hard part is then how you wrap that into a programming model. Blue Gene’s obviously a very high-end machine, so those developers have more familiarity with constraints and dealing with them. Making it general purpose is a very hard problem, very attractive, but I think that at the end of the day, all transactional memory will do is be another option, one that may be less error-prone, to use in frameworks or toolkits. I don’t see a big shift in programming model where people say “Oh, I’m using transactional memory.” It’ll be a piece of infrastructure that toolkits like Threading Building Blocks or OpenMP or Cilk Plus use. It’ll be important for us in that it gives better guarantees.
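[Editorial note: to make the lock-vs-transaction contrast concrete, here is a sketch using the draft C++ transactional-memory syntax (GCC’s -fgnu-tm implements something close to it). This is my illustration of the programming-model difference, not a description of Intel’s or IBM’s hardware.]

    #include <mutex>

    static int balance_a = 100, balance_b = 0;
    static std::mutex m;

    // Lock-based version: the programmer names the lock and its scope,
    // and is responsible for avoiding deadlock across all lock orderings.
    void transfer_locked(int amount) {
        std::lock_guard<std::mutex> guard(m);
        balance_a -= amount;
        balance_b += amount;
    }

    // TM-style version: the system detects conflicts and retries. But
    // hardware can only buffer a bounded amount of speculative state,
    // which is exactly the size constraint James describes above.
    void transfer_tm(int amount) {
        __transaction_atomic {
            balance_a -= amount;
            balance_b += amount;
        }
    }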

The thing I had more in mind is that you’re seeing a whole class of tools. We’ve got a tool that can do deadlock and race detection dynamically and find them; a very, very good tool. You see companies like TotalView looking at what they would call replaying, or unwinding – going backwards with debuggers. The problem with debuggers, if your program’s nondeterministic, is that you run it to a breakpoint and say, whoa, I want to see what happened back here. What we usually do is just pop out of the debugger and run it with an earlier breakpoint, or re-run it. If the program is nondeterministic, you don’t get what you want. So the question is, can the debugger keep enough information to back up? Well, the thing that backing up in a debugger, deadlock detection, and race detection all have in common is that they tend to run two or three orders of magnitude slower when you’re using those techniques. Well, that’s not compelling. But the cool part is, with the software, we’re showing how to detect those – just a thousand times slower than real time.

Now we have the cool engineering problem: Can you make it faster? Is there something you could do in the software or the hardware to make that faster? I think there is, and a lot of people do. I get really excited when you solve a hard problem – can you replay a debug? Yes, but it’s too slow. We use it to solve really hard problems, with customers that are really important, where you hunker down for a week or two using a tool that’s a thousand times slower to find the bug, and you’re so happy you found it – but I can’t stand out in a booth and market that and have a million developers use it. That won’t happen unless we get it closer to real time. I think that will happen. We’re looking at ways to do that. It’s a cooperative thing between hardware and software, and it’s not just an Intel thing; obviously the Blue Gene team worries about these things, and Sun’s team was worried about them. There’s actually a lot of discussion between those small teams. There aren’t that many people who understand what transactional memory is or how to implement it in hardware, and the people who do talk to each other across companies.

[In retrospect, while transcribing this, I find the sudden transition back to TM to be mysterious. Possibly James was veering away from unannounced technology, or possibly there’s some link between TM and 1000x speedups of playback. If there is such a link, it’s not exactly instantly obvious to me.]

Pfister: At a minimum, at conferences.

Reinders: Yes, yes, and they’d like to see the software stack on top of them come together, so they know what hardware to build to give whatever the software model turns out to be what it needs. One of the things we learned about transactional memory is that the software model is really hard. We have a transactional memory compiler that does it all in software. It’s really good. We found that when people used it, they treated transactional memory like locks and created new problems. They didn’t write a transactional memory program from scratch to use transactional memory; they took code they wrote for locks and tried to use transactional memory instead of locks, and that creates problems.

Pfister: The one place I’ve seen where rumors showed someone actually using it was the special-purpose Java machine Azul. 500 plus processors per rack, multiple racks, point-to-point connections with a very pretty diagram like a rosette. They got into a suit war with Sun. And some of the things they showed were obvious applications of transactional memory.

Reinders: Hmm.

Pfister: Let’s talk about support for things like MIC. One question I had was that things like CUDA, which let you just annotate your code, well, do more than that. But I think CUDA was really a big part of the success of Nvidia.

Reinders: Oh, absolutely. Because how else are you going to get anything to go down your shader pipeline for a computation if you don’t give people a model? And by lining up with one model, no matter the pros or cons, or how easy or hard it was, it gave a mechanism – actually a low-level mechanism – that turns out to be predictable, because the low-level mechanism isn’t trying to do anything too fancy for you; it’s basically giving you full control. It’s a problem to get a lot of people to program that way, but when a programmer does program that way, they get what the hardware can give them. We don’t need a fancy compiler that gets it right half the time on top of that, right? Now, everybody in the world would like a fancy compiler that always got it right, and when you can build that, then CUDA and that sort of thing just go poof! Gone. I’m not sure that’s a tractable problem on a device that’s not more general than that type of pipeline.

So, the challenge I see with CUDA, and OpenCL, and even C++ AMP is that they’re going down the road of saying look, there are going to be several classes of devices, and we need you the programmer to write a different version of your program for each class of device. Like in OpenCL, you can take a function and write a version for a CPU, a version for a GPU, and a version for an accelerator. In that terminology, a CPU is something like a Xeon, a GPU something like a Tesla, and an accelerator something like MIC. We have a hard enough problem getting one version of an optimized program written. I think that’s a fatal flaw in this thing being widely adopted. I think we can bring those together.
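[Editorial note: a minimal sketch of the device-class split James is describing, using the standard OpenCL host API; the three device-type constants below are the spec’s way of carving up the world. My illustration; error handling omitted.]

    #include <CL/cl.h>
    #include <cstdio>

    int main() {
        cl_platform_id platform;
        clGetPlatformIDs(1, &platform, nullptr);

        // OpenCL makes you enumerate each class of device separately; in
        // practice, kernels end up tuned (or rewritten) per class.
        const cl_device_type types[] = { CL_DEVICE_TYPE_CPU,           // e.g., a Xeon
                                         CL_DEVICE_TYPE_GPU,           // e.g., a Tesla
                                         CL_DEVICE_TYPE_ACCELERATOR }; // e.g., MIC
        const char *names[] = { "CPU", "GPU", "accelerator" };
        for (int i = 0; i < 3; ++i) {
            cl_device_id dev;
            cl_uint count = 0;
            if (clGetDeviceIDs(platform, types[i], 1, &dev, &count) == CL_SUCCESS
                && count > 0)
                std::printf("found a %s device\n", names[i]);
        }
        return 0;
    }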

What you’re really trying to say is that part of your program is going to be restrictive enough that it can be vectorized, done in parallel. I think there are alternatives to this that will catch on and mitigate the need to write much code in OpenCL and in CUDA. The other flaw with those techniques is that in a world where you have a GPU and a CPU, the GPU’s got a job to do on the user interface, and so far we’ve not described what happens when applications mysteriously try to send some work to the GPU and some to the CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK, mea culpa for not interrupting and mentioning Tianhe-1A.] AMD has proposed that in their future architectures they’re going to produce a meta-language that OpenCL targets, and then the hardware can send some of it to the GPU and some to the CPU. So I understand the problem, and I don’t know if that solution’s the right one, but it highlights that the problem’s understood if you write too much OpenCL code. I’m personally more of a believer in finding higher-level programming interfaces like Cilk Plus’s array notations – add array notation to C that explicitly tells you to vectorize, and the compiler can figure out whether that’s SSE, AVX, the 512-bit-wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But don’t pollute the programming language by telling the programmer to write three versions of your code. The good news is, though, if you do use OpenCL or CUDA to do that, you have extreme control of the hardware and will get the best hardware results you can, and we learn from that. I just think the learnings are going to drive us to more abstract programming models. That’s why I’m a big believer in the Cilk Plus stuff that we’re doing.
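[Editorial note: for readers who haven’t seen it, Cilk Plus array notation looks like the sketch below; [start : length] denotes an array section, and it’s an Intel language extension to C/C++, not standard. The point is that one source line lets the compiler pick the vector width – SSE, AVX, or MIC’s 512-bit unit.]

    // SAXPY written once, with no vector width baked in.
    void saxpy(int n, float a, const float *x, float *y) {
        y[0:n] = a * x[0:n] + y[0:n];   // whole-array-section statement
    }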

Pfister: But how many users of HPC systems are interested in squeezing that last drop of performance out?

Reinders: HPC users are extremely interested in squeezing performance if they can keep a single source code that they can run everywhere. I hear this all the time; you know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever, just don’t tell me it has to be rewritten in a strange language that’s only supported on your machine. It’s pretty consistent. So the success of CUDA, to be used on those machines, it’s limited in a way, but it’s been exciting. But it’s been a strain on the people who have done that, because CUDA code’s not going to run on an Intel machine [Well, actually, the Portland Group has a CUDA C/C++ compiler targeting x86. I do not know how good the output code performance is.]. OpenCL offers some opportunities to run everywhere, but then has problems of abstraction. Nvidia will talk about 400X speedups, which aren’t real – well, that depends on your definition of “real”.

Pfister: Let’s not start on that.

Reinders: OK, well, what we’re seeing constantly is that vectorization is a huge challenge. You talk to people who have taken their cluster code and moved it to MIC [Cluster? No shared memory?], and very consistently they’ll tell us stories like, oh, “We ported in three days.” The Intel marketing people are like “That’s great! Three days!” I ask why the heck it took three days. Everybody tells me the same thing: It ran right away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few days to vectorize, because otherwise performance was terrible. They’re trying to use the 512-bit-wide vectors, and their original code was written using SSE [Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations]. You can’t automatically translate that; you have to restructure the loop because it’s 512 bits wide. That should be automated, and if we don’t get it automated in the next decade, we’ve made a huge mistake as an industry. So I’m hopeful that we have solutions to that today, but I think a standardized solution will have to come forward.
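[Editorial note: a sketch of why SSE intrinsic code doesn’t just recompile for MIC’s wider vectors. The first version hard-codes a width of four floats; 512-bit vectors hold sixteen, so the loop structure, remainder handling, and alignment all change. The second version is width-agnostic and leaves the vectorizing to the compiler – the “specify it once” approach. My illustration; it assumes n is a multiple of 4 and x is 16-byte aligned.]

    #include <xmmintrin.h>

    // Width-4 SSE intrinsics: every step of the loop bakes in 128 bits.
    void scale_sse(float *x, float s, int n) {
        __m128 vs = _mm_set1_ps(s);
        for (int i = 0; i < n; i += 4) {
            __m128 v = _mm_load_ps(&x[i]);
            _mm_store_ps(&x[i], _mm_mul_ps(v, vs));
        }
    }

    // Width-agnostic: the compiler can emit SSE, AVX, or 512-bit code.
    void scale_portable(float *x, float s, int n) {
        for (int i = 0; i < n; ++i)
            x[i] *= s;
    }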

Pfister: I really wonder about that, because wildly changing the degree of parallelism, at least at a vector level – if it’s not there in the code today, you’ve just got to rewrite it.

Reinders: Right, so we’ve got low-hanging fruit, we’ve got codes that have the parallelism today, we need to give them a better way of specifying it. And then yes, over time, those need to migrate to that [way of specifying parallelism in programs]. But migrating the code where you have to restructure it a lot, and then you do it all in SSE intrinsics, that’s very painful. If it feels more readable, more intuitive, like array extensions to the language, I give it better odds. But it’s still algorithmic transformation. They have to teach people where to find their data parallelism; that’s where all the scaling is in an application. If you don’t know how to expose it or write programs that expose it, you won’t take advantage of this shift in the industry.

Pfister: Yes.

Reinders: I’m supposed to make sure you wander down at about 11:00.

Pfister: Yes, I’ve got to go to the required press briefing, so I guess we need to take off. Thanks an awful lot.

Reinders: Sure. If there are any other questions we need to follow up on, I’ll be happy to talk to you. I hope I’ve knocked off a few of your questions.

Pfister: And then some. Thanks.

[While walking down to the press briefing, I asked James whether the synchronization features he had from the X86 architecture were adequate for MIC. He said that they were OK for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something additional. Interestingly, after the conference, there was an Intel press release about the Intel/Dell “home run” win at TACC – using Knight’s Corner, “an innovative design that includes more than 50 cores.” This dovetails with what Joe Curley told me about Knight’s Corner not being the same as Knight’s Ferry. Stay tuned for the next interview.]

Monday, September 26, 2011

A Conversation with Intel’s John Hengeveld at IDF 2011

At the recent Intel Developer Forum (IDF), I was given the opportunity to interview John Hengeveld. John is in the Datacenter and Connected Systems Group in Hillsboro.

Intel-provided information about John:
John is responsible for end user and OEM marketing for Intel’s Workstation and HPC businesses and leads an outstanding team of industry visionaries. John has been at Intel for 6 years and was previously the senior business strategist for Intel’s Digital Enterprise Group and the lead strategist for Intel’s Many Core development initiatives. John has 20 years of experience in general management, strategy, and marketing leadership roles in high technology.
John is dedicated to life-long learning; he has taught Corporate Strategy; Business Strategy and Policy; Technology Management; and Marketing Research and Strategy for Portland State University’s Master of Business Administration program. John is a graduate of the Massachusetts Institute of Technology and holds an MBA from the University of Oregon.

I recorded our conversation. What follows is a transcript, rather than a summary, since our topics ranged fairly widely and in some cases information is conveyed by the style of the answer. Conditions weren’t optimal for recording; it was in a large open space with many other conversations going on and the “Intel Robotic Orchestra” playing in the background. Hopefully I got all the words right.

I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that resulted. (Thank you to all who responded!)

Full disclosure: As I noted in a prior post, Intel paid for me to attend IDF. Thanks, again.

Occurrences of [] indicate words I added for clarification. There aren’t many.

Pfister: What, overall, is HPC to Intel? Is it synonymous with MIC?

Hengeveld: No. Actually, HPC has a research effort: how to scale applications, how to deal with performance and power issues that are upcoming. That’s the labs portion of it. Then we have significant product activity around our mainstream Xeon products: how to support the software and infrastructure when those products are delivered in cluster form to supercomputing activities. In addition, those products also get delivered into what we refer to as the volume HPC market, which is small and medium-sized clusters being used for product design and research activities, such as those in biomed and some in visualization. Then comes the MIC part. So, when we look at MIC, we try to manage and characterize the collection of workloads we create optimized performance for. About 20% of those – and we think these are representative of workloads in the industry – map to what MIC does really well. And the rest, most customers have…

Pfister: What is the distinguishing characteristic?

Hengeveld: There are two distinguishing characteristics. One is what I would refer to as compute density – applications that have relatively small memory footprints but a high number of compute operations per memory access, and that parallelize well. Then there’s a second set of applications – streaming applications – where [memory footprint] size isn’t significant but memory bandwidth is the distinguishing factor. You see some portion of the workload space there.
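[Editorial note: to make the two classes concrete, a sketch of my own. The first loop is compute-dense – O(n^3) arithmetic on O(n^2) data, so with any caching at all it performs many operations per memory access. The second is the classic STREAM triad: two flops per iteration against twelve bytes of memory traffic, so bandwidth, not the floating-point unit, limits it.]

    // Compute-dense: naive n-by-n matrix multiply.
    void matmul(int n, const float *a, const float *b, float *c) {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j) {
                float sum = 0.0f;
                for (int k = 0; k < n; ++k)
                    sum += a[i*n + k] * b[k*n + j];
                c[i*n + j] = sum;
            }
    }

    // Streaming: STREAM triad, 2 flops per 8 bytes read + 4 bytes written.
    void triad(int n, float *a, const float *b, const float *c, float s) {
        for (int i = 0; i < n; ++i)
            a[i] = b[i] + s * c[i];
    }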

Pfister: Streaming is something I was specifically going to ask you about. It seems that with the accelerators being used today, there’s this bifurcation in HPC: Things that don’t need, or can’t use, memory streaming; and those that are limited by how fast you can move data to and from memory.

Hengeveld: That’s right. I agree.

Pfister: Is MIC designed for the streaming side?

Hengeveld: MIC will perform well for many streaming applications. Not all. There are some that require a memory access model MIC doesn’t map to particularly well. But a lot of streaming applications will do very well on MIC in one of its generations. We have a collection of generations of MIC on the roadmap, but we’re not talking about anything beyond the next “Corner” generation [Knight’s Corner, 2012 product successor to the current limited-production Knight’s Ferry software development vehicle]. Further down the roadmap, beyond that, you will see more and more effect for that class of application.

Pfister: So you expect that to be competitive in bandwidth and throughput with what comes out of Nvidia?

Hengeveld: Very much so. We’re competing in this market space to be successful; and we understand that we need to be competitive on a performance density, performance per watt basis. The way I kind of think about it is that we have a roadmap with exceptional performance, but, in addition to that, we have a consistent programming model with the rest of the Xeon platforms. The things you do to create an optimized cluster will work in the MIC space pretty much straightforwardly. We’ve done a number of demonstrations of that here and at ISC. That’s the main difference. So we’ll see the performance; we’ll be ahead in the performance. But the real difference is the programming model.

Pfister: But the application has to be amenable.

Hengeveld: The application has to be amenable. For many customers that do a wide range of applications – you know, if you are doing a few things, it’s quite possible that some of those few things will be these highly-parallel, many-core optimized kinds of things. But most customers are doing a range of things. The powerful general-purpose solution is still the mainstream Xeon architecture, which handles the widest range of workloads really robustly, and as we continue with our beat rate in the Xeon space – you know, with Sandy Bridge coming out we moved significantly forward in floating-point performance – you’ll see that again going forward. You see the charts going up and to the right 2X per release.

Pfister: Yes, all marketing charts go up and to the right.

Hengeveld: Yes, all marketing charts go up and to the right, but the point is that there’s a continued investment to drive floating-point performance and effective parallelism and power efficiency in a way that will be useful to HPC customers and mainstream customers.

Pfister: Is MIC going to be something that will continue over time? That you can write code for and expect it to continue to work in the future?

Hengeveld: Absolutely. It’s a major investment on our part on a distinct architectural approach that we expect to continue on as far out as our roadmaps envision today.

Pfister: Can you tell me anything about memory and connectivity? There was some indication at one point of memory being stacked on a MIC chip.

Hengeveld: A lot of research concepts are being explored for future products, and I can’t really talk about much of that kind of thing for things that are out in the roadmap. There’s a lot of work being done around innovative approaches about how to do the system work around this silicon.

Pfister: MIC vs. SCC – the Single-chip Cloud Computer.

Hengeveld: SCC! Got it! I thought you meant single chip computer.

Pfister: That would probably be SoC, System on a Chip. Is SCC part of your thinking on this?

Hengeveld: SCC was a research vehicle to try to explore extreme parallelism and some different instruction set architectures. It was a research vehicle. MIC is a series of products. It’s an architecture that underlies them. We always use “MIC” as an adjective: it’s a MIC architecture, MIC products, or something like that. It means Many Integrated Core: the Many Integrated Core architecture is an approach that underlies a collection of products – a product mix – from Intel. As opposed to SCC, which is a research vehicle. It’s intended to get the academic community thinking about how to solve some of the major problems that remain in parallelism, using computer science to solve those problems.

Pfister: One person noted that a big part of NVIDIA’s success in the space is CUDA…

Hengeveld: Yep.

Pfister: …which people can use to get, without too much trouble, really optimized code running on their accelerators. I know there are a lot of other things that can be re-used from Intel architecture – Threaded Building Blocks, etc. – but will CUDA be supported?

Hengeveld: That’s a question you have to ask NVIDIA. CUDA’s not my product. I have a collection of products that have an architectural approach.

Pfister: OpenCL is covered?

Hengeveld: OpenCL is part of our support roadmap, and we announced that previously. So, yes OpenCL.

Pfister: Inside a MIC chip, right now, there are dual counter-rotating rings. Are connections other than that being considered? I’m thinking of the SCC mesh and other stuff. Are they in your thinking at this point?

Hengeveld: Yes, so, further out in the roadmap. These are all part of the research concepts. That’s the reason we do SCC and things like that, to see if it makes sense to use that architecture in the longer term products. But that’s a long ways away. Right now we have a fairly reasonable architectural approach that takes us out a bit, and certainly into our first generation of products. We’re not discussing yet how we’re going to use these learnings in future MIC products. But you can imagine that’s part of the thinking.

Pfister: OK.

Hengeveld: So, here’s the key thing. There are problems in exascale that the industry doesn’t know how to solve yet, and we’re working with the industry very actively to try to figure out whether there are architectural breakthroughs, things like mesh architectures. Is that part of the solution to exascale conundrums? Are there workloads in exascale – sort of a wave-processing model that you might see in a mesh architecture – that might make sense? So, working with research centers, working with the labs, in part we’re trying to figure out how to crack some of these nuts. For us it’s about taking all the pieces people are thinking about and seeing what the whole is.

Pfister: I’m glad to hear you express it that way, since the way it seemed to be portrayed at ISC was, from Intel, “Exascale, we’ve got that covered.”

Hengeveld: So, at the very highest strategic level, we have it covered in that we are working closely with a collection of academic and industry partners to try and solve difficult problems. But exascale is a long way off yet. We’re committed to make it happen, committed to solve the problems. That’s the real meat of what Kirk declared at ISC. It’s not that we have the answer; it’s that we have a commitment to make it happen, and to make it happen in a relatively early time period, with a relatively sustainable product architectural approach. But there are many problems to solve in exascale; we can barely get our arms around it.

Pfister: Do you agree with the DARPA targets for exascale, particularly low power, or would you relax those?

Hengeveld: The Intel commitment, what we said in the declaration, was not inconsistent with the DARPA thing. It may be slightly relaxed. You can relax one of two things: you can relax time, or you can relax the DARPA targets. So I think you’re going to reach DARPA’s targets eventually – but when? The target that Kirk raised is right in there, in the same ballpark. Exascale in 20 MW is one set of rational numbers; I’ve heard 10 [MW], I’ve heard 40 [MW], somewhere between those, right? I think 40 [MW] is so easy it’s not worth thinking about. I don’t think it’s economically rational.
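[For scale: an exaflops in 20 MW works out to 50 GFLOPS per watt; in 40 MW, 25 GFLOPS per watt. The most power-efficient systems on the late-2011 Green500 list were delivering roughly 2 GFLOPS per watt, so even the “easy” number implies better than a tenfold improvement in efficiency.]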

Pfister: As you move forward, what do you think are the primary barriers to performance? There are two different axes here, technical barriers, and market barriers.

Hengeveld: The technical barriers are cracking bandwidth without violating the power budget, and tackling how to manage the thread complexity of an exascale system – how many threads are you going to need? A whole lot. So how do you get your arms around that? There are business barriers: How do you get a return on investment through productizing things that apply in the exascale world? This is a John [?] quote, not an Intel quote, but I am far less interested in the first exascale system than I am in the 100th. I would like a proliferation of exascale applications and performance, and to have it be accessible to a wide range of people and applications, some applications that don’t exist today. In any ecosystem-building task, you’ve got to create awareness of the need, and create economic momentum behind serving that need. Those problems are equally complex to solve [equal to the technical ones]. In my camp, I think that maybe in some ways the technical problems are more solvable, since you’re not training people in a new way of thinking and working and solving problems. It takes some time to do that.

Pfister: Yes, in some ways the science is on a totally different time schedule.

Hengeveld: Yes, I agree. I agree entirely. A lot of what I’m talking about today is leaps forward in science as technical computing advances, but as the capability grows, the science will move to match it. How will that science be used? Interesting question. How will it be proliferated? Genome work is a great target for some of this stuff. You probably don’t need exascale for genome. You can make it faster, you can make it more cost-effective.

Pfister: From what I have heard from people working on this at CSU, they have a whole lot more problems with storage than with computing capability.

Hengeveld: That’s exactly right.

Pfister: They throw data away because they have no place to put it.

Hengeveld: That’s a fine example of the business problems you have to crack along with the compute problems that you have to crack. There’s a whole infrastructure around those applications that has to grow up.

Pfister: Looking at other questions I had… You wouldn’t call MIC a transitional architecture, would you?

Hengeveld: No. Heavens no. It’s a design point for a set of workloads in HPC and other areas. We believe MIC fits more things than just HPC; we started with HPC. It’s a design point that persists as far out as we can see on the roadmap, and beyond. It’s not a transitional product.

Pfister: I have a lot of detailed technical questions which probably aren’t appropriate, like whether each of the MIC cores has equal latency to main memory.

Hengeveld: Yes, that’s a fine example of a question I probably shouldn’t answer.

Pfister: Returning to ultimate limits of computing, there are two that stand out, power and bandwidth, both to memory and between chips. Does either of those stand out to you as the sore thumb?

Hengeveld: Wow. So, the guts of that question gets to workload characterization. One of my favorite topics is “It’s the workload, stupid.” People say “it’s the economy, stupid,” well in this space it’s the workload. There aren’t general statements you can make about all workloads in this market.

Pfister: Yes, HPC is not one market.

Hengeveld: Right, it’s not one market, it’s not one class of usages, it’s not one architecture of solutions – that’s one reason why MIC is required; the differences aren’t invisible. One size doesn’t fit all. Xeon does a great job of solving a lot of it really well, but there are individual workloads that are valuable that we want to dive into with more capability in a more targeted way. There are workloads in the industry where the interconnect bandwidth between processors in a node, and between nodes in a cluster, is the dominant factor in performance. There are other workloads where the bandwidth to memory is the dominant factor in performance. All have to be solved. All have to be moved forward at a reasonable pace. I think the ones that are going to map to exascale best are the ones where the memory bandwidth required can be solved well by local memory, and the problems that can be addressed well are those that have rational scaling of interconnect requirements between nodes. You’re not going to see problems that have a massive explosion of communication; the bandwidth won’t exist to keep up with that. You can actually compute something I call “well-fed FLOPS,” which is how many FLOPS you can rationally support given the rest of the architecture. That’s something you have to know for each workload. You have to study it for each domain of HPC usage before you get to the answer about which is more important.
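[Editorial note: “well-fed FLOPS” reduces to a back-of-envelope formula: sustainable FLOPS ≈ memory bandwidth × the workload’s flops per byte. For example – numbers mine, purely illustrative – a node with 150 GB/s of memory bandwidth running a streaming kernel at 2 flops per 12 bytes (about 0.17 flops/byte) can keep only about 150 × 0.17 ≈ 25 GFLOPS well fed, regardless of the chip’s peak rating.]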

Pfister: You probably have to go now. I did want to say that I noticed the brass rat. Mine is somewhere in the Gulf of Mexico.

Hengeveld: That’s terrible. Class of ’80.

Pfister: Class of ’67.

Hengeveld: Wow.

Pfister: Stayed around for graduate school, too.

Hengeveld: When’d you leave?

Pfister: In ’74.

Hengeveld: We just missed overlapping, then. Have you been back recently?

Pfister: Not too recently. But there have been a lot of changes.

Hengeveld: That’s true, a lot of changes.

Pfister: But East Campus is still the same?

Hengeveld: You were in East Campus? Where’d you live?

Pfister: Munroe.

Hengeveld: I was in the black hall of fifth-floor Bemis.

Pfister: That doesn’t ring a bell with me.

Hengeveld: Back in the early 70s, they painted the hall black, and put in red lights in 5th-floor Bemis.

Pfister: Oh, OK. We covered all the lights with green gel.

Hengeveld: Yes, I heard of that. That’s something that they did even to my time period there.

Pfister: Anyway, thank you.

Hengeveld: A pleasure. Nice talking to you, too.