Thursday, September 29, 2011

A Conversation with Intel’s James Reinders at IDF 2011

At the recent Intel Developer Forum (IDF), I was given the opportunity to interview James Reinders. James is in the Director, Software Evangelist of Intel’s Software and Services Group in Oregon, and the conversation ranged far and wide, from programming languages, to frameworks, to transactional memory, to the use of CUDA, to Matlab, to vectorizing for execution on Intel’s MIC (Many Integrated Core) architecture.

Intel-provided information about James:
James Reinders is an expert on parallel computing. James is a senior engineer who joined Intel Corporation in 1989 and has contributed to projects including systolic arrays systems WARP and iWarp, and the world's first TeraFLOP supercomputer (ASCI Red), as well as compilers and architecture work for multiple Intel processors and parallel systems. James has been a driver behind the development of Intel as a major provider of software development products, and serves as their chief software evangelist. His most recent book is “Intel Threading Building Blocks” from O'Reilly Media which has been translated to Japanese, Chinese and Korean. James has published numerous articles, contributed to several books and is one of his current projects is as a co-author on a new book on parallel programming to be released in 2012.

I recorded our conversation; what follows is a transcript. Also, I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! To all who responded.)

This is #2 in a series of three such transcripts. I’ll have at least one additional post about IDF 2011, summarizing the things I learned about MIC and the Intel “Knight’s” accelerator boards using them, since some important things learned were outside the interviews.

Full disclosure: As I originally noted in a prior post, Intel paid for me to attend IDF. Thanks, again. It was a great experience, since I’d never before attended.

Occurrences of [] indicate words I added for clarification or comment post-interview.

Pfister: [Discussing where I’m coming from, crowd-sourced question list, HPC & MIC focus here.] So where would you like to start?

Reinders: Wherever you like. MIC and HPC – HPC is my life, and parallel programming, so do your best. It has been for a long, long time, so hopefully I have a very realistic view of what works and what doesn’t work. I think I surprise some people with optimism about where we’re going, but I have some good reasons to see there’s a lot of things we can do in the architecture and the software that I think will surprise people to make that a lot more approachable than you would expect. Amdahl’s law is still there, but some of the difficulties that we have with the systems in terms of programming, the nondeterminism that gets involved in the programming, which you know really destroys the paradigm of thinking how to debug, those are solvable problems. That surprises people a lot, but we have a lot at our disposal we didn’t have 20 or 30 years ago, computers are so much faster and it benefits the tools. Think about how much more the tools can do. You know, your compiler still compiles in about the same time it did 10 years ago, but now it’s doing a lot more, and now that multicore has become very prevalent in our hardware architecture, there are some hooks that we are going to get into the hardware that will solve some of the debugging problems that debugging tools can’t do by themselves because we can catch the attention of the architects and we understand enough that there’s some give-and-take in areas that might surprise people, that they will suddenly have a tool where people say “how’d you solve that problem?” and it’s over there under the covers. So I’m excited about that.

[OK, so everybody forgive me for not jumping right away on his fix for nondeterminism. What he meant by that was covered later.]

Pfister: So, you’re optimistic?

Reinders: Optimistic that it’s not the end of the world.

Pfister: OK. Let me tell you where I’m coming from on that. A while back, I spent an evening doing a web survey of parallel programming languages, and made a spreadsheet of 101 parallel programming languages [see my much earlier post, 101 Parallel Programming Languages].

Reinders: [laughs] You missed a few.

Pfister: I’m sure I did. It was just one night. But not one of those was being used. MPI and OpenMP, that was it.

Reinders: And Erlang has had some limited popularity, but is dying out. They’re a lot like AI and some other things. They help solve some problems, and then if the idea is really an advance, you’ll see something from that materialize in C or C++, Java, or C#. Those languages teach us something that we then use where we really want it.

Pfister: I understand what you’re saying. It’s like MapReduce being a large-scale version of the old LISP mapcar.

Reinders: Which was around in the early 70s. A lot of people picked up on it, it’s not a secret but it’s still, you know, on the edge.

Pfister: I heard someone say recently that there was a programming crisis in the early 80s: How were you going to program all those PCs? It was solved not by programming, but by having three or four frameworks, such as Excel or Word, that some experts in a dark room wrote, everybody used, and it spread like crazy. Is there anything now like that which we could hope for?

Reinders: You see people talk about Python, you see Matlab. Python is powerful, but I think it’s sort of trapped between general-purpose programming and the Matlab. It may be a big enough area; it certainly has a lot of followers. Matlab is a good example. We see a lot of people doing a lot in Matlab. And then they run up against barriers. Excel has the same thing. You see Excel grow up and people incredibly hairy things. We worked with Microsoft a few years ago, and they’ve added parallelism to Excel, and it’s extremely important to some people. Some people have spreadsheets out there that do unbelievable things. You change one cell, and it would take a computer from just a couple of years ago and just stall it for 30 minutes while it recomputes. [I know of people in the finance industry who go out for coffee for a few hours if they accidentally hit F5.] Now you can do that in parallel. I think people do gravitate towards those frameworks, as you’re saying. So which ones will emerge? I think there’s hope. I think Matlab is one; I don’t know that I’d put my money on that being the huge one. But I do think there’s a lot of opportunity for that to hide this compute power behind it. Yes, I agree with that, Word and Excel spreadsheets, they did that, they removed something that you would have programmed over and over again, made it accessible without it looking like programming.

Pfister: People did spreadsheets without realizing they were programming, because it was so obvious.

Reinders: Yes, you’re absolutely right. I tend to think of it in terms of libraries, because I’m a little bit more of an engineer. I do see development of important libraries that use unbelievable amounts of compute power behind them and then simply do something that anyone could understand. Obviously image processing is one [area], but there are other manipulations that I think people will just routinely be able to throw into an application, but what stands behind them is an incredibly complex library that uses compute power to manipulate that data. You see Apple use a lot of this in their user interface, just doing this [swipes] or that to the screen, I mean the thing behind that uses parallelism quite well.

Pfister: But this [swipes] [meaning the thing you do] is simple.

Reinders: Right, exactly. So I think that’s a lot like moving to spreadsheets; that’s the modern equivalent of using spreadsheets or Word. It’s the user interfaces, and they are demanding a lot behind them. It’s unbelievable the compute power that can use.

Pfister: Yes, it is. And I really wonder how many times you’re going to want to scan your pictures for all the images of Aunt Sadie. You’ll get tired of doing it after a couple of days.

Reinders: Right, but I think rather than that being an activity, it’s just something your computer does for you. It disappears. Most of us don’t want to organize things, we want it just done. And Google’s done that on the web. Instead of keeping a million bookmarks to find something, you do a search.

Pfister: Right. I used to have this incredible tree of bookmarks, and could never find anything there.

Reinders: Yes. You’d marvel at people who kept neat bookmarks, and now nobody keeps them.

Pfister: I remember when it was a major feature of Firefox that it provided searching of your bookmarks.

Reinders: [Laughter]

Pfister: You mentioned nondeterminism. Are there any things in the hardware that you’re thinking of? IBM Blue Gene just said they have transactional memory, in hardware, on a chip. I’m dubious.

Reinders: Yes, the Blue Gene/Q stuff. We’ve been looking at transactional memory a long time, we being the industry, Intel included. At first we hoped “Wow, get rid of locks, we’ll all just use transactional memory, it’ll just work.” Well, the shortest way I can say why it doesn’t work is that software people want transactions to be arbitrarily large, and hardware needs it to be constrained, so it can actually do what you’re asking it to do, like holding a buffer. That’s a nonstarter.

So now what’s happening? Rocks was looking at this in Sun, a hybrid technique, and unfortunately they didn’t bring that to market. Nobody outside the team knows exactly what happened, but the project as a whole failed, rather than saying transactional memory was the death. But they had a hard time figuring out how you engineer that buffering. A lot of smart people are looking at it. IBM’s come up with a solution, but I’ve heard it’s constrained to a single socket. It makes sense to me why a constraint like that would be buildable. The hard part is then how do you wrap that into a programming model. Blue Gene’s obviously a very high end machine, so those developers have more familiarity with constraints and dealing with it. Making it general purpose is a very hard problem, very attractive, but I think that at the end of the day, all transactional memory will do is be another option, that may be less error-prone, to use in frameworks or toolkits. I don’t see a big shift in programming model where people say “Oh, I’m using transactional memory.” It’ll be a piece of infrastructure that toolkits like Threading Building Blocks or OpenMP or Cilk+ use. It’ll be important for us in that it gives better guarantees.

The things I more had in mind is you’re seeing a whole class of tools. We’ve got a tool that can do deadlock and race detection dynamically and find it; a very, very good tool. You see companies like TotalView looking at what they would call replaying, or unwinding, going backwards, with debuggers. The problem with debuggers if your program’s nondeterministic is you run it to a breakpoint and say, whoa, I want to see what happened back here, what we usually do is just pop out of the debugger and run it with an earlier breakpoint, or re-run it. If the program is nondeterministic, you don’t get what you want. So the question is, can the debugger keep enough information to back up? Well, the thing that backing up and debugging, deadlock detection, and race detection, all those things have in common is that they tend to run two or three orders of magnitude slower when you’re using those techniques. Well, that’s not compelling. But, the cool part is, with the software, we’re showing how to detect those – just a thousand times slower than real time.

Now we have the cool engineering problem: Can you make it faster? Is there something you could do in the software or the hardware and make that faster? I think there is, and a lot of people do. I get really excited when you solve a hard problem, can you replay a debug, yeah, it’s too slow. We use it to solve really hard problems, with customers that are really important, where you hunker down for a week or two using a tool that’s a thousand times slower to find the bug, and you’re so happy you found it – I can’t stand out in a booth and market and have a million developers use it. That won’t happen unless we get it closer to real time. I think that will happen. We’re looking at ways to do that. It’s a cooperative thing between hardware and software, and it’s not just an Intel thing; obviously the Blue Gene team worries about these things, Sun’s team as worried about them. There’s actually a lot of discussion between those small teams. There aren’t that many people who understand what transactional memory is or how to implement it in hardware, and the people who do talk to each other across companies.

[In retrospect, while transcribing this, I find the sudden transition back to TM to be mysterious. Possibly james was veering away from unannounced technology, or possibly there’s some link between TM and 1000x speedups of playback. If there is such a link, it’s not exactly instantly obvious to me.]

Pfister: At a minimum, at conferences.

Reinders: Yes, yes, and they’d like to see the software stack on top of them come together, so they know what hardware to build to give whatever the software model is what it needs. One of the things we learned about transactional memory is that the software model is really hard. We have a transactional memory compiler that does it all in software. It’s really good. We found that when people used it, they treated transactional memory like locks and created new problems. They didn’t write a transactional memory program from scratch to use transactional memory, they took code they wrote for locks and tried to use transactional memory instead of locks, and that creates problems.

Pfister: The one place I’ve seen where rumors showed someone actually using it was the special-purpose Java machine Azul. 500 plus processors per rack, multiple racks, point-to-point connections with a very pretty diagram like a rosette. They got into a suit war with Sun. And some of the things they showed were obvious applications of transactional memory.

Reinders: Hmm.

Pfister: Let’s talk about support for things like MIC. One question I had was that things like CUDA, which let you just annotate your code, well, do more than that. But I think CUDA was really a big part of the success of Nvidia.

Reinders: Oh, absolutely. Because how else are you going to get anything to go down your shader pipeline for a computation if you don’t give a model? And by lining up with one model, no matter the pros or cons, or how easy or hard it was, it gave a mechanism, actually a low-level mechanism, that turns out to be predictable because the low-level mechanism isn’t trying to do anything too fancy for you, it’s basically giving you full control. That’s a problem to get a lot of people to program that way, but when a programmer does program that way, they get what the hardware can give them. We don’t need a fancy compiler that gets it right half the time on top of that, right? Now everybody in the world would like a fancy compiler that always got it right, and when you can build that, then CUDA and that sort of thing just poof! Gone. I’m not sure that’s a tractable problem on a device that’s not more general than that type of pipeline.

So, the challenge I see with CUDA, and OpenCL, and even C++AMP is that they’re going down the road of saying look, there are going to be several classes of devices, and we need you the programmer to write a different version of your program for each class of device. Like in OpenCL, you can take a function and write a version for a CPU, for a GPU, a version for an accelerator. So in this terminology, OpenCL is proposing CPU is like a Xeon, GPU is like a Tesla, an accelerator something like MIC. We have a hard enough problem getting one version of an optimized program written. I think that’s a fatal flaw in this thing being widely adopted. I think we can bring those together.

What you really are trying to say is that part of your program is going to be restrictive enough that it can be vectorized, done in parallel. I think there are alternatives to this that will catch on and mitigate the need to write much code in OpenCL and in CUDA. The other flaw with those techniques is that in a world where you have a GPU and a CPU, the GPU’s got a job to do on the user interface, and so far we’ve not described what happens when applications mysteriously try to send some to the GPU, some to the CPU. If you get too many apps pounding on the GPU, the user experience dies. [OK, mea culpa for not interrupting and mentioning Tianhe-1A.] AMD has proposed in their future architectures that they’re going to produce a meta-language that OpenCL targets, and then the hardware can target some to the GPU, and some to the CPU. So I understand the problem, and I don’t know if that solution’s the right one, but it highlights that the problem’s understood if you write too much OpenCL code. I’m personally more of a believer that we find higher-level programming interfaces like Cilk plusses, array notations, add array notations to C that explicitly tells you vectorize and the compiler can figure out whether that’s SSC, is it AVX, is it the 512-bit wide stuff on MIC, a GPU pipeline, whatever is on the hardware. But don’t pollute the programming language by telling the programmer to write three versions of your code. The good news is, though, if you do use OpenCL or CUDA to do that, you have extreme control of the hardware and will get the best hardware results you can, and we learn from that. I just think the learnings are going to drive us to more abstract programming models. That’s why I’m a big believer in the Cilk plus stuff that we’re doing.

Pfister: But how many users of HPC systems are interested in squeezing that last drop of performance out?

Reinders: HPC users are extremely interested in squeezing performance if they can keep a single source code that they can run everywhere. I hear this all the time, you know, you go to Oak Ridge, and they want to run some code. Great, we’ll run it on an Intel machine, or we’ll run it on a machine from IBM or HP or whatever, just don’t tell me it has to be rewritten in a strange language that’s only supported on your machine. It’s pretty consistent. So the success of CUDA, to be used on those machines, it’s limited in a way, but it’s been exciting. But it’s been a strain on the people who have done that because CUDA code because CUDA code’s not going to run on an Intel machine [Well, actually, the Portland Group has a CUDA C/C++ compiler targeting x86. I do not know how good the output code performance is.]. OpenCL offers some opportunities to run everywhere, but then has problems of abstraction. Nvidia will talk about 400X speedups, which aren’t real, well that depends on your definition of “real”.

Pfister: Let’s not start on that.

Reinders: OK, well, what we’re seeing constantly is that vectorization is a huge challenge. You talk to people who have taken their cluster code and moved it to MIC [Cluster? No shared memory?], very consistently they’ll tell us stories like, oh, “We ported in three days.” The Intel  marketing people are like “That’s great! Three days!” I ask why the heck did it take you three days? Everybody tells me the same thing: It ran right away, since we support MPI, OpenMP, Fortran, C++. Then they had to spend a few days to vectorize because otherwise performance was terrible. They’re trying to use the 512-bit-wide vectors, and their original code was written using SSE [Xeon SIMD/vector] with intrinsics [explicit calls to the hardware operations]. They can’t automatically translate, you have to restructure the loop because it’s 512 bits wide – that should be automated, and if we don’t get that automated in the next decade we’ve made a huge mistake as an industry. So I’m hopeful that we have solutions to that today, but I think a standardized solution to that will have to come forward.

Pfister: I really wonder about that, because wildly changing the degree of parallelism, at least at a vector level – if it’s not there in the code today, you’ve just got to rewrite it.

Reinders: Right, so we’ve got low-hanging fruit, we’ve got codes that have the parallelism today, we need to give them a better way of specifying it. And then yes, over time, those need to migrate to that [way of specifying parallelism in programs]. But migrating the code where you have to restructure it a lot, and then you do it all in SSE intrinsics, that’s very painful. If it feels more readable, more intuitive, like array extensions to the language, I give it better odds. But it’s still algorithmic transformation. They have to teach people where to find their data parallelism; that’s where all the scaling is in an application. If you don’t know how to expose it or write programs that expose it, you won’t take advantage of this shift in the industry.

Pfister: Yes.

Reinders: I’m supposed to make sure you wander down at about 11:00.

Pfister: Yes, I’ve got to go to the required press briefing, so I guess we need to take off. Thanks an awful lot.

Reinders: Sure. If there are any other questions we need to follow up on, I’ll be happy to talk to you. I hope I’ve knocked off a few of your questions.

Pfister: And then some. Thanks.

[While walking down to the press briefing, I asked James whether the synchronization features he had from the X86 architecture were adequate for MIC. He said that they were OK for the 30 or so cores in Knight’s Ferry, but when you got above 40, they would need to do something additional. Interestingly, after the conference, there was an Intel press release about the Intel/Dell “home run” win at TACC – using Knight’s Corner, “an innovative design that includes more than 50 cores.” This dovetails with what Joe Curley told me about Knight’s Corner not being the same as Knight’s Ferry. Stay tuned for the next interview.]

Monday, September 26, 2011

A Conversation with Intel’s John Hengeveld at IDF 2011

At the recent Intel Developer Forum (IDF), I was given the opportunity to interview John Hengeveld. John is in the Datacenter and Connected Systems Group in Hillsboro.

Intel-provided information about John:
John is responsible for end user and OEM marketing for Intel’s Workstation and HPC businesses and leads an outstanding team of industry visionaries.  John has been at Intel for 6 years and was previously the senior business strategist for Intel’s Digital Enterprise Group and the lead strategist for Intel’s Many Core development initiatives. John has 20 years of experience in general management, strategy and marketing leadership roles in high technology. 
John is dedicated to life-long learning, he has taught Corporate Strategy and Business Strategy and Policy; Technology Management; and Marketing Research and Strategy for Portland State University’s Master of Business Administration program.  John is a graduate of the Massachusetts Institute of Technology and holds his MBA from the University of Oregon. 

I recorded our conversation. What follows is a transcript, rather than a summary, since our topics ranged fairly widely and in some cases information is conveyed by the style of the answer. Conditions weren’t optimal for recording; it was in a large open space with many other conversations going on and the “Intel Robotic Orchestra” playing in the background. Hopefully I got all the words right.

I used Twitter to crowd-source questions, and some of my comments refer to picking questions out of the list that generated. (Thank you! To all who responded.)

Full disclosure: As I noted in a prior post, Intel paid for me to attend IDF. Thanks, again.

Occurrences of [] indicate words I added for clarification. There aren’t many.

Pfister: What, overall, is HPC to Intel? Is it synonymous with MIC?

Hengeveld: No. Actually, HPC has a research effort, how to scale applications, how to deal with performance and power issues that are upcoming. That’s the labs portion of it. Then we have significant product activity around our mainstream Xeon products, how to support the software and infrastructure when those products are delivered in cluster form to supercomputing activities. In addition to those products also get delivered into what we refer to as the volume HPC market, which is small and medium-sized clusters being used for product design, research activities, such as those in biomed, some in visualization. Then comes the MIC part. So, when we look at MIC, we try to manage and characterize the collection of workloads we create optimized performance for. About 20% of those, and we think these are representative of workloads in the industry, map to what MIC does really well. And the rest, most customers have…

Pfister: What is the distinguishing characteristic?

Hengeveld: There are two distinguishing characteristics. One is what I would refer to as compute density – applications that have relatively small memory footprints but have a high number of compute operations per memory access, and that parallelize well. Then there’s a second set of applications, streaming applications, where size isn’t significant but memory bandwidth is the distinguishing factor. You see some portion of the workload space there.

Pfister: Streaming is something I was specifically going to ask you about. It seems that with the accelerators being used today, there’s this bifurcation in HPC: Things that don’t need, or can’t use, memory streaming; and those that are limited by how fast you can move data to and from memory.

Hengeveld: That’s right. I agree.

Pfister: Is MIC designed for the streaming side?

Hengeveld: MIC will perform well for many streaming applications. Not all. There are some that require a memory access model MIC doesn’t map to particularly well. But a lot of the streaming applications will do very well on MIC in one of the generations. We have a collection of generations of MIC on the roadmap, but we’re not talking about anything beyond the next “Corner” generation [Knight’s Corner, 2012 product successor to the current limited-production Knight’s Ferry software development vehicle]. More beyond that, down the roadmap, you will see more and more effect for that class of application.

Pfister: So you expect that to be competitive in bandwidth and throughput with what comes out of Nvidia?

Hengeveld: Very much so. We’re competing in this market space to be successful; and we understand that we need to be competitive on a performance density, performance per watt basis. The way I kind of think about it is that we have a roadmap with exceptional performance, but, in addition to that, we have a consistent programming model with the rest of the Xeon platforms. The things you do to create an optimized cluster will work in the MIC space pretty much straightforwardly. We’ve done a number of demonstrations of that here and at ISC. That’s the main difference. So we’ll see the performance; we’ll be ahead in the performance. But the real difference is the programming model.

Pfister: But the application has to be amenable.

Hengeveld: The application has to be amenable. For many customers that do a wide range of applications – you know, if you are doing a few things, it’s likely possible that some of those few things will be these highly-parallel, many-core optimized kinds of things. But most customers are doing a range of things. The powerful general-purpose solution is still the mainstream Xeon architecture, which handles the widest range of workloads really robustly, and as we continue with our beat rate in the Xeon space, you know with Sandy Bridge coming out we moved significantly forward with floating-point performance, and you’ll see that again going forward. You see the charts going up and to the right 2X per release.

Pfister: Yes, all marketing charts go up and to the right.

Hengeveld: Yes, all marketing charts go up and to the right, but the point is that there’s a continued investment to drive floating-point performance and effective parallelism and power efficiency in a way that will be useful to HPC customers and mainstream customers.

Pfister: Is MIC going to be something that will continue over time? That you can write code for an expect it to continue to work in the future?

Hengeveld: Absolutely. It’s a major investment on our part on a distinct architectural approach that we expect to continue on as far out as our roadmaps envision today.

Pfister: Can you tell me anything about memory and connectivity? There was some indication at one point of memory being stacked on a MIC chip.

Hengeveld: A lot of research concepts are being explored for future products, and I can’t really talk about much of that kind of thing for things that are out in the roadmap. There’s a lot of work being done around innovative approaches about how to do the system work around this silicon.

Pfister: MIC vs. SCC – Single Chip Cluster.

Hengeveld: SCC! Got it! I thought you meant single chip computer.

Pfister: That would probably be SoC, System on a Chip. Is SCC part of your thinking on this?

Hengeveld: SCC was a research vehicle to try to explore extreme parallelism and some different instruction set architectures. It was a research vehicle. MIC is a series of products. It’s an architecture that underlies them. We always use “MIC” as an adjective: It’s a MIC architecture, MIC products, or something like that. It means Many Integrated Cores, Many Integrated Core architecture is an approach that underlies a collection of products, that are a product mix from Intel. As opposed to SCC, which is a research vehicle. It’s intended to get the academic community thinking about how to solve some of the major problems that remain in parallelism, using computer science to solve problems.

Pfister: One person noted that a big part of NVIDIA’s success in the space is CUDA…

Hengeveld: Yep.

Pfister: …which people can use to get, without too much trouble, really optimized code running on their accelerators. I know there are a lot of other things that can be re-used from Intel architecture – Threaded Building Blocks, etc. – but will CUDA be supported?

Hengeveld: That’s a question you have to ask NVIDIA. CUDA’s not my product. I have a collection of products that have an architectural approach.

Pfister: OpenCL is covered?

Hengeveld: OpenCL is part of our support roadmap, and we announced that previously. So, yes OpenCL.

Pfister: Inside of a MIC, right now, it has dual counter-rotating rings. Are connections other than that being considered? I’m thinking of the SCC mesh and other stuff. Are they in your thinking at this point?

Hengeveld: Yes, so, further out in the roadmap. These are all part of the research concepts. That’s the reason we do SCC and things like that, to see if it makes sense to use that architecture in the longer term products. But that’s a long ways away. Right now we have a fairly reasonable architectural approach that takes us out a bit, and certainly into our first generation of products. We’re not discussing yet how we’re going to use these learnings in future MIC products. But you can imagine that’s part of the thinking.

Pfister: OK.

Hengeveld: So, here’s the key thing. There are problems in exascale that the industry doesn’t know how to solve yet, and we’re working with the industry very actively to try to figure out whether there are architectural breakthroughs, things like mesh architectures. Is that part of the solution to exascale conundrums? Are there workloads in exascale, sort of a wave processing model, that you might see in a mesh architecture, that might make sense. So working with research centers, working with the labs, in part, we’re trying to figure out how to crack some of these nuts. For us it’s about taking all the pieces people are thinking about and seeing what the whole is.

Pfister: I’m glad to hear you express it that way, since the way it seemed to be portrayed at ISC was, from Intel, “Exascale, we’ve got that covered.”

Hengeveld: So, at the very highest strategic level, we have it covered in that we are working closely with a collection of academic and industry partners to try and solve difficult problems. But exascale is a long way off yet. We’re committed to make it happen, committed to solve the problems. That’s the real meat of what Kirk declared at ISC. It’s not that we have the answer; it’s that we have a commitment to make it happen, and to make it happen in a relatively early time period, with a relatively sustainable product architectural approach. But there are many problems to solve in exascale; we can barely get our arms around it.

Pfister: Do you agree with the DARPA targets for exascale, particularly low power, or would you relax those?

Hengeveld: The Intel commit, what we said in the declaration, was not inconsistent with the DARPA thing. It may be slightly relaxed. You can relax one of two things, you can relax time or you can relax DARPA targets. So I think you’re going to reach DARPA’s targets eventually – but when. So the target that Kirk raised is right in there, in the same ballpark. Exascale in 20MW is one set of rational numbers; I’ve heard 10 [MW], I’ve heard 40 [MW], somewhere between those, right? I think 40 [MW] is so easy it’s not worth thinking about. I don’t think it’s economically rational.

Pfister: As you move forward, what do you think are the primary barriers to performance? There are two different axes here, technical barriers, and market barriers.

Hengeveld: The technical barriers are cracking bandwidth and not violating the power budget; tracking how to manage the thread complexity of an exascale system – how many threads are you going to need? A whole lot. So how do you get your arms around that? There are business barriers: How do you get a return on investment through productizing things that apply in the exascale world? This is a John [?] quote, not an Intel quote, but I am far less interested in the first exascale system than I am in the 100th. I would like a proliferation of exascale applications and performance, and have it be accessible to a wide range of people and applications, some applications that don’t exist today. In any ecosystem-building task, you’ve got to create awareness of the need, and create economic momentum behind serving that need. Those problems are equally complex to solve [equal to the technical ones]. In my camp, I think that maybe in some ways the technical problems are more solvable, since you’re not training people in a new way of thinking and working and solving problems. It takes some time to do that.

Pfister: Yes, in some ways the science is on a totally different time schedule.

Hengeveld: Yes, I agree. I agree entirely. A lot of what I’m talking about today is leaps forward in science as technical computing advances, but as the capability grows, the science will move to match it. How will that science be used? Interesting question. How will it be proliferated? Genome work is a great target for some of this stuff. You probably don’t need exascale for genome. You can make it faster, you can make it more cost-effective.

Pfister: From what I have heard from people working on this at CSU, they have a whole lot more problems with storage than with computing capability.

Hengeveld: That’s exactly right.

Pfister: They throw data away because they have no place to put it.

Hengeveld: That’s a fine example of the business problems you have to crack along with the compute problems that you have to crack. There’s a whole infrastructure around those applications that has to grow up.

Pfister: Looking at other questions I had… You wouldn’t call MIC a transitional architecture, would you?

Hengeveld: No. Heavens no. It’s a design point for a set of workloads in HPC and other areas. We believe MIC fits more things than just HPC. We started with HPC. It’s a design point that has a persistence well beyond as far as we can see on the roadmap. It’s not a transitional product.

Pfister: I have a lot of detailed technical questions which probably aren’t appropriate, like whether each of the MIC cores has equal latency to main memory.

Hengeveld: Yes, that’s a fine example of a question I probably shouldn’t answer.

Pfister: Returning to ultimate limits of computing, there are two that stand out, power and bandwidth, both to memory and between chips. Does either of those stand out to you as the sore thumb?

Hengeveld: Wow. So, the guts of that question gets to workload characterization. One of my favorite topics is “It’s the workload, stupid.” People say “it’s the economy, stupid,” well in this space it’s the workload. There aren’t general statements you can make about all workloads in this market.

Pfister: Yes, HPC is not one market.

Hengeveld: Right, it’s not one market, it’s not one class of usages, it’s not one architecture of solutions, it’s one reason why MIC is required, it’s not invisible. One size doesn’t fit all. Xeon does a great job of solving a lot of it really well, but there are individual workloads that are valuable that we want to dive into with more capability in a more targeted way. There are workloads in the industry where the interconnect bandwidth between processors in a node and nodes in a cluster is the dominant factor in performance. There are other workloads where the bandwidth to memory is the dominant factor in performance. All have to be solved. All have to be moved forward at a reasonable pace. I think the ones that are going to map to exascale best are ones where the memory bandwidth required can be solved well by local memory, and the problems that can be addressed well are those that have rational scaling of interconnect requirement between nodes. You’re not going to see problems that have a massive explosion of communication; the bandwidth won’t exist to keep up with that. You can actually see something I call “well-fed FLOPS,” which is how many FLOPS can you rationally support given the rest of this architecture. That’s something you have to know for each workload. You have to study it for each domain of HPC usage before you get to the answer about which is more important.

Pfister: You probably have to go now. I did want to say that I noticed the brass rat. Mine is somewhere in the Gulf of Mexico.

Hengeveld: That’s terrible. Class of ’80.

Pfister: Class of ’67.

Hengeveld: Wow.

Pfister: Stayed around for graduate school, too.

Hengeveld: When’d you leave?

Pfister: In ’74.

Hengeveld: We just missed overlapping, then. Have you been back recently?

Pfister: Not too recently. But there have been a lot of changes.

Hengeveld: That’s true, a lot of changes.

Pfister: But East Campus is still the same?

Hengeveld: You were in East Campus? Where’d you live?

Pfister: Munroe.

Hengeveld: I was in the black hall of fifth-floor Bemis.

Pfister: That doesn’t ring a bell with me.

Hengeveld: Back in the early 70s, they painted the hall black, and put in red lights in 5th-floor Bemis.

Pfister: Oh, OK. We covered all the lights with green gel.

Hengeveld: Yes, I heard of that. That’s something that they did even to my time period there.

Pfister: Anyway, thank you.

Hengeveld: A pleasure. Nice talking to you, too.

Sunday, September 18, 2011

Impressions of a Newbie at Intel Developer Forum (IDF)

Out of the blue (which in this case is a pun), I received an invitation from an Intel representative to attend the 2011 Intel Developer Forum (IDF), in San Francisco, at Intel’s expense. Yes, I accepted. Thank you, Intel in general; and thank you in particular to the very nice lady who invited me and shepherded me through the process.

[There are some updates below, marked in this color.]

I’d never attended an IDF before, so I thought I’d spend an initial post on my overall impressions, describing the things that stood out to this wide-eyed IDF newbie. It may be boring to long-time IDF attendees – and there are very long-timers; a friend of mine has been to every domestic IDF for the last 12 years. But what the heck, they impressed me.

I do have some technical gee-whiz later in this post, but I’ll primarily go into more technical detail in subsequent posts. Those will including recountings of the three private interviews that were arranged for me with Intel HPC and MIC (Many-Integrated Core) executives (John Hengeveld, James Reinders, and Joe Curley), as well as other things I picked up along the way, primarily about MIC.

Here are my summary impressions: (1) Big. Very Big. (2) Incredibly slick and polished. (3) A fine attempt at Borgilation.

IDF is gigantic. It doesn’t surpass the mother of all trade shows, the Consumer Electronics show, but I wouldn’t be surprised to find that it is the largest single-company trade show. The Moscone Center West, filled by IDF on all three floors, is almost 300,000 sq. ft. Justin Rattner (Intel Fellow & CTO) said in his keynote that there were over 5,000 attendees, and that hauling in the gear and exhibits required 500 semis. I believe it.

There was of course the usual massive collection of trade-show booths covering one huge exhibit area (see photo of the center aisle of the exhibit area, below). That alone filled 100,000 sq. ft of exhibit space, completely.

In addition, all the large open areas each had their large well-manned pavilion dedicated to one thing or another: One had a bevy of ultrabooks (ultrabook = Intel’s push for a viable non-Apple MacBook Air) that you could play with. Another was an “Extreme Zone” with a battery of four high-end gaming systems (mostly playing what looked like Wolfenstein-y game). Another was a multi-player racing game with several drivers’ seats with steering wheels, etc. Another demoed twenty or thirty so different sizes and shapes of laptops (in addition to the displays in the exhibit area). Another was a contraption of pipes and random stuff spitting plastic balls onto pseudo-xylophones, cymbals, and so on, physically mimicking the famous YouTube video of several years back, demonstrating industrial controllers run by Atom processors. It didn’t actually play the music, but the video’s a pure animation so it’s one up on that. [Intel has a press release on this which seems to indicate that it actually played the music. Didn't seem like it to me, but might be.]

Everywhere could be found fanatic attention to detail and production values, extending down to even small details.

The keynotes were marvels of production; I’ve been to many IBM affairs, and nothing I saw over the years compared with these in slick, polished execution. Movies were theatre-quality cinematic productions (despite typical marketing fluff plots with occasional cheesy humor), and every one queued in at exactly the right instant, no hiccups. Every on-stage demo went right on the money, and even when one crashed – a momentary screen showing a windows driver crash – another was seamlessly switched in what seemed less than 2 seconds; I strongly suspect a hot backup, since no way does Windows recover that fast.

But smaller things had their share of attention, too. The technical sessions I attended all had fluent, personable speakers; meticulously designed slides; and perfect audio with nary a glitch in microphone use or (&deity. forbid) feedback. Even the backpacks handed out were high quality and custom-made. Simple customization is no big deal, but these came with Intel logos on the zipper pulls and a custom lining emblazoned with their chip-layout banner theme (see photos).

Speaking of that banner theme, it blared out at you over the entrance to each hall, on a photo at least 20 ft. high and 100 feet long (photo again), a huge illustration: You are a dull, chalky, dead, white – until Intel’s silicon brings you to vibrant, colored life. Not exactly subtle symbolism, but that’s marketing.

And speaking of marketing, the unmistakable overall message was: We will dominate everything. Everything with a processor in it, that is. Servers, with volumes ever-increasing at huge rates? Check. High-end 10+ core major stompers? Check. Midrange? Check. Low end? Super checkety-check-check-check. Ultrabook (future) with 14-day standby. (Standby? Do we really care?) Even a cell phone, demoed, run by an Intel processor. It’s the little black rectangle at the center-right of this pic:

(I couldn’t get a better picture, since after every keynote there was a “photo opportunity” that produced a paparazzi-dense melee/feeding frenzy on the stage. This is, I'm told, and IDF tradition. I’m not sufficiently a press-banger to elbow my way through that wall of bodies.)

The low-power demo that impressed me, though, was of a two-watt processor in a system showing a squee-worthy kitty video (and something else, but who noticed?), powered by a small solar panel. This was a demo of the future potential of near-threshold voltage operation, also touted (not, I’m sure, by accident) (not at all) in the Intel Fellows’ panel the day before. They used an old Pentium to do it, undoubtedly for reasons I’m not enough of a circuit jocky to understand. There was even what appeared – horrors! – to be an on-stage ad lib (!!) about “dumpster diving” for it. (Hey, eBay! Did they just call you a dumpster? The perils of ad libbing.) Some blatant futurism followed this, talking about 100 GF in that same 2W envelope; no hint when, fantastic if it ever happens.

There are chinks in the armor, though. You have to look seriously to find them, or have some comparisons on your side.

A friend happened to note to me, for example, that this IDF was three keynotes short of the usual full house of six. There was Otellini’s (CEO) general keynote, and Mooley Eden’s laptop ultrabook keynote, and Justin Rattner’s “futures” presentation in which he laughs too much for my taste. Those are regulars at every IDF. However, there was no keynote specifically devoted to Servers; understandable, I suppose, because they’re between big releases and have nothing major to announce (but they said a whole lot about the next-gen Ivy Bridge and the future server market in a media-only briefing). There was also no keynote for Digital Home; they are wrapped up with Sony [and other partners] on that one, and likely it hasn’t any splashes to make at this time (or else everybody’s figured out that connecting your TV to the Internet isn’t yet a world-shaking idea). And… dang, there was a third one historically, but I’ve lost it. Sorry. [The third missing keynote was on softtware and services, traditionally performed by Renee James.] Takeaway: Ambitions seem a bit shrunken, but it may just be circumstances. 

A big deal was made in a media briefing about how they were going to improve Intel's Atom SoCs (Systems-On-a-Chip) at double Moore’s Law. (I think you’re supposed to gasp now.) That sounds sexy, but I interpret it as meaning they figured out that Atom really needs to be done in their latest and greatest silicon technology, as opposed to lagging a couple of generations (nodes) back the way it now does particularly now that their highest-end technologies are focused on low power.

So they’re going to catch up. Everybody, including Atom, will be using use the same 14nm technology in 2014. (That’s an estimated, forward-looking 2014, see their prospectus for caveats, etc.) Until then, well, there are iterations. I take “double Moore’s Law” to mean that they can’t steer the massive ship of microprocessor development fast enough to catch up in a single release; and/or (likely) their existing Atom customer base can’t wait without any new Atom products for as long as a single leap would take.

Will this put a dent in ARM's dominance of the low-power arena? Or MIPS's share? Maybe, in time.

Then there was that graph, also in a media briefing, of future server shipments. (Wish I had a pic; can’t find the pdf on the Intel web site.) They extended it to show some trebling or quadrupling of server shipments in the next few years, but…

Maybe they have some data I don’t have. To me, the actual past data on the graph seemed to me to say that curve of shipment volumes recently started flattening out. Extrapolating based on the slope that existed a couple of quarters or years in the past doesn’t seem justified by what I saw purely based on that graph.

Hey, did I mention that I wuz a medium? I got in with media credentials, which was another personal first. (Thanks again!) Talk about being a newbie – I didn’t even know there was a special “media corridor” until half-way through the first day. Dang. I could have had a much better breakfast on the first day.

Now I have this itch to buy a fedora so I can put a press pass into the hatband.

More will come, but I’ve got a trip to Mesa Verde for the next few days, so it won’t be immediate. Sorry. The wait won’t be anywhere near as long as it has been between other recent posts.