Motivations and Design Approach for the IA-64 64-Bit Instruction Set Architecture
John Crawford, Intel Corporation
Jerry Huck, Hewlett-Packard
Microprocessor Forum

October 14, 1997
San Jose, Calif.

JOHN CRAWFORD: Thank you, Linley. It's a great privilege to be one of the two spokesmen here for the great work that many people have done, that Jerry and I get to stand up and present to you.

Let me start with the objectives of the talk, which are to unveil the ideas and the technology behind our next-generation instruction set. We're focusing on the instruction set architecture, and some of the key technology there, not on the implementation.

What we're going to do, I'm going to cover a little bit of the context, some of the history and motivation behind the new techniques, and then Jerry is going to come up and talk about how we fixed some of the problems, with some of the key techniques that we've built into the instruction set, and describe some of the benefits of that. Then I'll come back up, work through an example, and conclude.

Here we go. So this is my opportunity to talk a little bit about the history. Intel and HP decided to get together and jointly develop a 64-bit instruction set technology. We announced this partnership back in June of 1994, and I want to highlight here a couple of key things each party brought to the table.

Intel, of course, had great experience in volume microprocessor technology, and platform technology. At the time, we also were well along on our 64-bit instruction set definition. And here I'd like to take the opportunity to recognize a couple of people -- Don Alpert and Hans Mulder -- from Intel who were very instrumental in that activity.

HP on the other hand brought enterprise system technology. They had a lot of activity with PA-RISC and did a really excellent job of pushing performance ahead in high-end systems. So they brought that expertise to the partnership. They also had done some great architectural research at HP Laboratories, and a couple of the key people there I'd like to mention are Bill Worley and Rajiv Gupta, who pulled together some really good research.

They had also developed a 64-bit extension of the PA-RISC architecture and were late in the process of bringing that to market. So we had the two parties bringing together a number of things. We got together to jointly define this next-generation instruction set, which consists of an instruction set specification and, related to that, compiler optimizations which we worked on together, plus performance simulation and measurement.

The objectives we had in front of us, first of all, were to enable a new level of system performance with the instruction set; really to take advantage of a lot of good research activity; to pull together a set of really good ideas to provide a breakthrough level of performance as well as headroom for the future.

We wanted to enable compatibility with our existing software bases, Intel's IA-32 and of course HP's PA-RISC.

We also wanted to have this thing have a long life. We wanted to make sure we had an instruction set definition with which we could turn the crank for many years and scale forward as semiconductor technology provided more and more transistors to implement a processor. Finally, and this kind of goes without saying, it had to be 64 bits.

So this chart shows the state of the art or the mainstream architecture progression. A long time ago the industry mainstream was what we now call CISC. Then RISC technology came along, and offered a strategy to rely more on the compiler and software, and do less in the hardware. So RISC had simple fixed length instructions, with sequencing done by the compiler, emitting instruction sequences, rather than microcode sequencing of complex operations.

Then a number of influences came along. One of them was VLIW, which showed that it was possible for certain applications to get a lot of things going in parallel. VLIW was not a commercial success, but it provided the motivation behind superscalar and then today's out-of-order superscalar machines, where we have hardware doing a fair amount of work, looking for parallelism in your instruction code, searching for and dispatching independent instructions, and then enhancing that through out-of-order techniques, renaming a smaller architectural register set onto a larger physical register set, and in general, a lot of hardware activity to find implicit parallelism in your code and to take advantage of that.

So building on that along with a lot of good architecture research that's taken place since the RISC days, we factored a lot of these good ideas together, and then worked jointly, refining the techniques, developing something that is industrial strength, that has all the exceptions covered, that has all aspects of system architecture and so on to produce a really good instruction set that we're here to tell you about.

Before I end up on that note, let me back up and talk about a couple of performance limiters. One that's well-known, of course, is branches, which limit the performance of today's machines. The pipelines we build like to pull instructions in and execute them very quickly, and we manage with branch prediction to avoid a lot of the breaks in the pipeline, but they're still there. Especially when it mispredicts, you can pay a huge performance penalty. It's not unusual for some of today's highly pipelined machines to have penalties of 20 to 30 percent of their performance going to branch mispredicts. So this is clearly something important to fix. Another thing is that branches break your programs into small chunks, or small basic blocks that exist between the branches, which don't provide much opportunity for instruction level parallelism, or for rearranging things and finding parallel activities. This leads to poor utilization of wide machines.

Another limiter, again this one is as old as the hills, too, is latency to memory. We've done a great job of driving processor technology forward by doubling performance every 18 months, and that's a compound annual growth rate of about 60 percent. Memory technology is not keeping pace. It may be, if we're lucky, getting five percent better per year.

The load delays are evident not only in access to main memory but also in the various levels of cache we put in place to overcome the memory latency issues. In today's machines, it's not unusual to see a couple of clocks of delay in getting the data back even from the closest level of cache. These kinds of delays need to get hidden in order to have good performance. These load delays are compounded by machine width. You have an awful lot of empty instruction execution slots going unused unless you can find enough parallel activities to fill them.

The wider your machine, the more instruction slots you have to fill.

The third limiter I wanted to touch on is a new one, and it has to do with the fundamental sequential execution model we've been using since the early days of computing where a program is executed as a series of instructions, one after the other, in sequence. You fetch one instruction and execute it, then you fetch the next and execute it, and so on.

I'll try to illustrate here where that causes a problem.

Today you start with a source program. Your compiler will look at that. It will extract the parallelism it can see. It will even make transformations to try to shorten the critical paths through your program by issuing things in parallel. After it's done with all that good activity now it has to emit sequential machine code, and that gives an opportunity for some artificial sequentiality to creep in.

That is then fed to these wonderful out-of-order superscalar machines that look at the code, look for parallelism, look for independent instructions to issue. There is a fair amount of hardware required to do all of this.

In addition to the problem of devoting a lot of hardware resources, it also gives the compiler a limited, indirect view of the hardware, and it makes it more of a challenge for a compiler to optimize applications and know they are really going to improve things.

So in contrast to this implicit parallelism, a better way to go is with explicit parallelism. We already have compilers enhancing parallelism today in compiling for an out-of-order superscalar machine. So we would like to just make that parallelism explicit in the machine code, and in that way avoid the guesswork and avoid this difficult sequential medium that we are wrestling with today.

So, the idea then is to go with explicit parallelism, and that is the fundamental principle on which we've chosen to base the name of our technology. We call it "Explicitly Parallel Instruction Computing" or EPIC technology, which, as I indicated before, is the combination of a lot of good research activity that's taken place and a lot of hard work by the people on the instruction set definition team at Intel and HP.

And with that, I would like to turn it over to Jerry Huck.

Whoops. Not quite yet.

In order to make sure everything is clear here, let me just walk through the terminology.

EPIC is the next-generation technology we are talking about, kind of a generic philosophy or collection of techniques -- like RISC or CISC, for example. It's an instruction set technology. IA-64 is an actual instruction set, like, for example, IA-32 or PA-RISC: a complete description is available in a book, and it's a complete instruction set.

Now, getting more concrete, Merced is the code name for the first processor that's going to implement this IA-64 instruction set, just as, for example, the Pentium II processor implements IA-32 or the PA 8500 implements PA-RISC. So we have an architecture technology, an instruction set, and implementations of that instruction set.

And now with that, I think I can turn it over to Jerry.

JERRY HUCK: Let's go up.

Thank you, John. I'm happy to be here. My sinuses and nose weren't so thrilled about this talk, so I hope I can stay with all of you.

I want to cover four key features of the IA-64 architecture. We want to cover how the architectural resources of the machine are put together, discuss the instruction format, and then cover two key areas where we are creating parallelism in the code.

This is going to let machines execute a lot faster and expose this parallelism to the processor.

Now, not surprisingly, there's a lot more to this architecture than just these four features. There's the multimedia architecture and a lot of other things that we are going to talk about at another time, in another forum.

So the first issue is: what's the underlying philosophy of the architecture?

At a hardware level, we are trying to create a machine that can grab a large number of units of work, a large number of instructions, and just feed them to the functional units. The machine spends very little time pondering, "What should I do next? Where should I go next?"

We are trying to just feed lots of instructions, every clock, to the machine.

Here's what we are fundamentally building: a large number of registers that will be directly available to the compiler, a large register file with 128 registers. This is four times what is available today in the RISC machines.

In order to have more parallelism you need more registers. If it's going to be explicit, you need access to a large number of registers.

Now, in a traditional architecture to achieve this effect you have some renaming or other mechanism to create more resources that are needed to feed the width of the machine.

Now, architecturally, of course, there are 128 registers that will appear in the instruction format. But the functional units -- that's a decision of the machine designer. So they are going to be replicated and connected to this register file.

So it's up to the implementation to decide how many functional units, how many execution units, they are going to build.

Also, there are 128 floating-point registers, connected to floating-point units.

Now connected to these things are the memory ports that have the ability to talk to memory.

In this diagram memory is the whole cache hierarchy. That's up to the implementation to decide. But there will be, of course, a large number of read and write ports for the memory.

Now, this is inherently scalable when the functional units are just replicated out: it's a function of the size of the die, the number of transistors in the budget. It's not an architectural constraint. Now, by making it explicitly parallel, we avoid the out-of-order logic, the dependency logic, where transistors are used inefficiently. We're going to use them more efficiently to build functional units, caches and the register file.

Now, the next thing I want to talk about is the instruction format, the whole notion of where the explicit parallelism is in this machine. We're going to break the model where every instruction may depend on the previous instruction and it's up to the machine to figure out the parallelism.

In this, we're going to have an explicit instruction dependency specified by these, what we call, template bits, or little augmentation bits. And what they're going to say is: here is a group of instructions. They're all independent; just issue them to the functional units. It may be more instructions than you can actually swallow, but, you know, do as many as you can, depending on the width of the machine.

Now, the template bits specify not only dependency within a few instructions, but they also specify dependency between groups of instructions. The actual instruction format is what we call a bundle, a 128-bit bundle, and we pack in three instructions. So the template bits can say there's one independent instruction, two independent instructions, or they can say there are seven, eight, nine, or some interesting prime number of independent instructions.
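To make that concrete, here is a minimal sketch in IA-64-style assembly notation. The register numbers and template names are purely illustrative, not an excerpt from real code: braces delimit a 128-bit bundle, ".mii" names one of the bundle templates, and ";;" is the stop that marks the boundary between independent instruction groups.

    { .mii
            ld8     r4  = [r5]          // group 1: these three instructions are
            add     r6  = r7, r8        //   independent, so a wide machine may
            add     r9  = r10, r11 ;;   //   issue all of them at once; the stop
    }                                   //   (";;") ends the group
    { .mii
            ld8     r12 = [r13]         // group 2: free to consume group-1
            add     r14 = r6, r9        //   results (r6 and r9 were written above)
            nop.i   0                   // unused slot
    }

Note that a group can be shorter than a bundle or span many bundles; it's the stop encoded in the template bits, not the bundle boundary, that tells the hardware where the independence ends.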

So explicitly scheduling this parallelism will allow the compilers to expose greater parallelism, to create greater parallelism and express it to the machine.

Now, we're going to, of course, simplify the hardware because we're not going to have the dynamic mechanisms to figure these things out. It will be directly exposed to you.

Now, one important feature that was added, in contrast to how VLIW machines or other machines might perhaps have done this, is that we're going to have complete compatibility between family members, because the hardware is going to be fully interlocked.

So as latencies change -- cache latency changes, functional unit latency changes -- the hardware is still fully interlocked. The machine interlocks, scoreboards, whatever the mechanism of the implementation's choice. By having it fully interlocked, we don't have delay slots or other mechanisms that kind of constrain a binary to a particular implementation. So we'll always have full compatibility over the family of machines.

Now, because the register fields are slightly larger, and because of another field we'll talk about in a moment, instructions are bigger than what common RISC architectures provide. Normally you have 32 bits per instruction. In this case, we're going to expand that out and we're going to pack three instructions into 128 bits. From that you're going to get a modest code size increase, roughly that three-to-four ratio -- three RISC instructions take 96 bits where we take 128.

So from this new instruction format, we're going to get scalability with compatibility.

Now, the next feature I want to talk about is the whole notion of predication. The architecture is allowing parallelism, but that's not good enough. Just because you allow it doesn't mean you're going to get any of it. So these next two features -- their focus, their effort -- is to create parallelism where there wasn't any. So let's go back to John's example about how branches limit performance.

If you think of an analogy of an old style bank where you actually talk to people, you know, interacted with people, way back, you've got to think a while back.

If you went into that bank, you're a customer, you want to have fast service, in a traditional architecture it might have done some style of "I'll wait till you tell me what you want, I'll fill out a withdrawal slip, you'll do the interaction and away you go." A smarter bank might have said, "Well, you usually do withdrawals. I'll fill that out in advance." That's kind of like prediction. And if you wanted to do a withdrawal, we're great. We go at fairly fast speed.

Now, in predication, the idea is: as the teller sees you walk in, they're ambidextrous and they fill out both the withdrawal and the deposit at the same time and hand you whichever one you wanted.

So let's go back to this branch example: how branches limit performance. In a traditional architecture, an "if," "then," "else" construct creates basic blocks, where you have to execute a few instructions, branch around some other instructions, finally join together, and execute some final instructions.

So it's the control flow that creates branches. In the example, we have a conditional jump to go to the "else" clause and an unconditional jump around the "else" clause. Then we have a small number of instructions and a small amount of parallelism. So what we're going to introduce is the notion of predication. And the idea here is that every instruction in the instruction set is augmented with a field that says "execute this instruction if the predicate is true."

The predicate is a flag, a logical, a Boolean value that says true or false. In this example, you can see that in the "if" clause, when computing the conditional, we set a predicate, P1, to be true if the comparison is equal, and set P2 to be the complement -- false if that comparison is equal.

Then I annotate the "then" and the "else" clause by saying P1 is the predicate for the "then" clause, and I put it on instructions three and four. P2 is true when I want to execute the "else" clause, and I put it on instructions five and six.

At this point I don't need the branches. I can execute all these instructions in a flow and I'll get the right answer. If the condition was true, P1 will be true and I'll execute three and four. If it was false, P2 is true and I'll execute five and six.
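A minimal sketch of that "if"/"then"/"else" in the same illustrative IA-64-style notation (the register numbers and operations are hypothetical, not the slide's code): "cmp.eq p1, p2 = ..." writes a predicate and its complement, and a "(p1)" prefix means the instruction only takes effect when P1 is true.

            cmp.eq  p1, p2 = r8, r9     // P1 = (r8 == r9), P2 = its complement
            ;;
    (p1)    add     r10 = r11, r12      // "then" clause: instructions three and
    (p1)    adds    r13 = 8, r14        //   four, which take effect only if P1
    (p2)    sub     r10 = r11, r12      // "else" clause: instructions five and
    (p2)    adds    r13 = -8, r14       //   six, which take effect only if P2
            ;;
            add     r15 = r10, r13      // the join: runs either way

Since P1 and P2 are complements, at most one of each pair actually writes its result, so the "then" and "else" instructions can all sit in one instruction group and go to the functional units together.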

This is a simple example showing how predication removes branches.

You put this side by side, you can see what's happening here. A traditional architecture with branches and small basic blocks, fragmenting the instruction stream, limiting the parallelism that's available.

Now, contrast it with what an EPIC architecture is going to do, what IA-64 does which is predicate instructions, remove the branches and expose parallelism. What we have here now is the ability to execute both "then" and the "else" clause simultaneously, and in fact I can start merging instructions seven and eight, one and two with all of this. It's all freedom to the compiler to schedule as best as possible to minimize what we call the critical path through the code.

What's fundamental here is predication is going to create parallelism and enable a more effective use of the parallel hardware.

In summary again, what's going on is the compiler is given a larger scheduling scope. Nearly all instructions are predicated. We move down to kind of the new item here: there are 64 one-bit predicate registers. In the same way that we're going to have a large number of general registers, of integer registers, we're also going to have a large number of predicates. Predication is not an infrequent thing. In fact, when you think about much more complex control flows, more typical of, say, operating system code, or on-line transaction processing, or other important commercial codes, it's not just simple "if," "then," "else's." It's "if this," "if that," "then." Complex control flow creates a lot of individual branches that we can collapse away to create more parallelism, and there will be lots of predicates that are going to be live at that point.

So with these predicates, we're not only going to remove regular branches, which have this fragmentation problem, but we're also going to be removing mispredicted branches, one of the more serious side effects of a branch.

And finally, this parallel execution is really enabled through these larger basic blocks. Getting more parallel execution is what's going to give us better performance.

Now, there are some studies on this that have been done. This is a particular study from the University of Illinois, work that was done and presented at the International Symposium on Computer Architecture in '95 by Scott Mahlke et al., and in that study they measured how effective predication is at increasing performance. They had a hypothetical eight-wide machine. You can look this up. There's a web site at the University of Illinois where you can find this report.

They said, let's keep increasing the parallelism, increasing the amount of predication that's going on, as long as we keep getting more performance, and then stop. Of course you can predicate yourself into oblivion if you carry it on too far. But they kept increasing the predication, kept doing better optimization, until they got the maximum performance for an eight-wide machine.

Then they measured and said, "Well, how well did we do here?" They found that, on average, over half the branches were removed. That's a huge number of branches. As I was reading the report, I went, wow, that's a lot. I started doing the subtraction and saying, hey, this is pretty serious.

Now, on the surface you can't say whether those are going to be the easy branches or the hard branches. They had a hybrid model of a reasonably industrial-strength hardware predictor, asked how many mispredicted branches are gone, and found that over 40 percent of the mispredicted branches were removed. So you can start thinking about the penalties related to branches and how many of them are going away.

So the bottom line here is, again, we're exposing parallelism. We're not just creating a parallel machine. We have to expose more of it to the machine.

The final topic that I want to cover is the notion of speculation. To stretch the analogy a little bit: if every time a new customer came into my bank -- our bank, everyone's bank -- we were to give them a loan application, because new customers typically want a loan when they come in, they might fill that out in advance as they walk up to the teller. Then, of course, if they needed that loan, they would have covered the time it would have taken to fill out an application. All right, we're stretching things a little bit, but you get the idea. What we're trying to do is cover the problem of memory latency.

So we go back to John's example of how memory latency is causing delays. It's really the load part of memory operations; it's not the stores that are our big problem. Loads, we find, are often the first instruction of a dependency chain. You come into a new procedure, you need to load, load, load, compute, store, return. So covering that latency is a big problem. I may have "if this," "if that," "then," "then," "then," "load." Unfortunately, loads sit at the beginnings of basic blocks.

So if we look at programs -- and in this picture you'll see, in a traditional architecture, you have basic blocks that have jumps at the bottom of them, then a following basic block where we have a load. The straightforward strategy is to say, let's cover latency, let's move the load away from its use. That will try to cover latency. But you can't move it across a control flow. I can't move it above the branch, because that load may never have been supposed to happen; I might have taken the branch and never gotten to this block.

Now if I just move it anyway, I've got a problem, because those loads can blow up: they might cause exceptions. For example, the code might say, if the pointer is valid, then use it. If I use it ahead of that check, I'm in trouble. So we have to solve that kind of a problem. On the surface it looks like we have no hope. We can't move loads above branches.

So the strategy then is to create what we call speculation. Separate the load behavior -- that is, delivering the value -- from the exception behavior, which is the blowing up that you normally don't want to happen to a correct program.

So take that load instruction -- if we look at this example here -- and break it into two instructions. We're going to create a speculative load, ld.s in the picture, that initiates the load and detects the exception. But it doesn't blow up. It just says, "If there's an exception here, record that in some token that's stored with the target." So I get the data, I get a token, and I continue on computing.

Since I fired off the load, now I have more time to do the load, to cover the latency. Then when I get to the home block, then I do what's called a check and say, "Hey, was it okay? Am I okay getting here?"

Now, if you were supposed to get here and it was a correct program, it will be okay.
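As a sketch, again in illustrative IA-64-style notation with made-up register numbers: "ld8.s" is the speculative load, which defers any fault into a token carried with the target register, and "chk.s" is the check back in the home block that branches to recovery code if that token is set.

            ld8.s   r6 = [r7]           // speculative load, hoisted above the
                                        //   branch; a fault is not raised here,
                                        //   it is recorded in a token carried
                                        //   along with r6
            cmp.ne  p1, p0 = r8, r0     // the original test is unchanged
            ;;
    (p1)    br.cond elsewhere           //   and so is the original control flow
            ;;
            // the home block, where the load originally lived:
            chk.s   r6, recover         // was the speculative load okay?  if the
            ;;                          //   token is set, branch to recovery code
    home:
            add     r9 = r6, r10        // otherwise just use the value
            // ... rest of the home block
    recover:                            // recovery stub, normally placed out of line:
            ld8     r6 = [r7]           // redo the load for real, taking any
            ;;                          //   exception now that we know we need it,
            br      home                //   then resume in the home block
    elsewhere:
            // ... the path on which the load was never needed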

So, here is a mechanism that is now creating parallelism by exposing -- by covering this latency.

If we put them side-by-side, again the traditional architecture, the load is buried underneath the branch instruction. In the new architecture we have split that load. The load can now float up and freely fill in slots above branches. It can go above multiple control flows to be initiated as early as possible in the schedule. So, again, the compiler is now given more opportunity to create parallelism. It's not stuck with a small block, a small dynamic window of scheduling.

In the more dynamic machines, the out-of-order engines, they are exposed to just a certain window that the hardware kind of restricts them to, but in this case the compiler can expand that scope and push that load over multiple blocks, multiple levels of control flow.

Okay. So that's the whole notion behind speculation. The value of it is to create this parallelism, to give the machine more parallel work to do.

Now, I'll turn it back over to John, and he is going to go through an example that puts together speculation and predication.

JOHN CRAWFORD: Thank you, Jerry.

So what I've got here is a very simple example that hopefully will illustrate a couple of the key concepts. I have taken a key statement out of the "8 Queens Benchmark" program, which is basically a recursive, exhaustive-search strategy for finding a solution to placing eight queens on a chessboard so they don't attack each other. And the way it does that is it marches through the columns, placing a queen on each column such that it's not attacked by a queen previously placed on the same row or on either of the two diagonals. Here we have this array "B" which checks the row, and the arrays "A" and "C" which check the two different diagonals.

And basically, because it uses short-circuit AND operators, the thing is chopped into three tiny basic blocks and a lot of sequential code. What I've shown on the left-hand side is the original code. I have broken it into the clocks of execution, each clock of execution in a box, and I've charged each load with two clocks of latency.

The first thing I've done -- you'll notice the instructions in green are actually the addressing computations for the second and third block. And since those are simple arithmetic calculations, I am free to do those anytime. They are not going to cause exceptions which would have a bad impact on the program. So I have already hoisted them up, and as a matter of fact, a good compiler is going to do even better: move the computations out of the loop, turn those into induction variables and have a simple way to take care of those things.

Beyond that, I am faced with just a brutally sequential series of very simple operations: load, compare, branch; load, compare, branch; load, compare, branch, with each instruction dependent on the one previous to it. There is really little hope to do anything in parallel here.

To compound things even further, I am charging the loads with two clock latencies, so I have actually a dead clock here because there is actually nothing else to do in between the load and compare. So let me walk through.
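For reference, the statement in question is essentially a short-circuited test of the shape if (b[j] && a[i+j] && c[i-j+7]), though the exact indexing depends on the version of the benchmark. Naively scheduled, and written in the same illustrative notation with made-up registers already holding the three addresses, it has exactly the load, compare, branch shape just described:

            ld8     r14 = [r15]         // load b[j]
            ;;                          // two-clock load: a dead clock here
            cmp.eq  p1, p0 = r14, r0    // b[j] == 0 ?
            ;;
    (p1)    br.cond fail                // short-circuit: this square is attacked
            ;;
            ld8     r16 = [r17]         // load a[i+j]
            ;;                          // dead clock
            cmp.eq  p2, p0 = r16, r0
            ;;
    (p2)    br.cond fail
            ;;
            ld8     r18 = [r19]         // load c[i-j+7]
            ;;                          // dead clock
            cmp.eq  p3, p0 = r18, r0
            ;;
    (p3)    br.cond fail
            ;;
            // ... fall through into the body of the "if"
    fail:
            // ... try the next square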

The first transformation I will make is to apply speculation to cover some of that load latency.

There is no hope for the load in the first basic block unless we have a little more context to work with. Potentially, that one could be hoisted up higher, but in the context we are working with here I'll take this second load and do what Jerry explained before. I'll split it into a speculative load that I can hoist up beyond this branch, as well as a check that, of course, stays at the home block. I'll do that for the load in the third one as well.

And this is a nice transformation. It opens up a lot more parallelism.

Now, if I have a three-wide machine I can do all three loads in parallel. If not, I still have this delay clock, I could slide one of the loads in, for example -- there's plenty of opportunity to pair up these instructions.

The other thing that happens is even in the home block now I can do the compare and the check in parallel, since the checks and the compares are both actually using the results of the load.

So I've gotten a little bit more parallelism there.

So I've gone from the original picture on the left-hand side to the picture on the right here, where the original picture is a single stream -- the instructions are marching through single file, even with a couple of stops here and there.

So, now I've got parallelism of three, one, one, two, one, one. So I am getting some threes and twos in here, not all just ones, and by counting here on the bottom, the original actually works out -- if you take the worst-case path through this, through this IF statement, you are going to spend 13 cycles going through it, and what's worse, you'll have three branches that are going to mispredict at pretty serious rates, which I didn't talk about, but showed on the previous slide.

Then, the next transformation is to apply the predication transformation.

And in doing so then I can eliminate two of the three branches and compact it all into one block.

So let's see how that works.

I already had a compare which was generating two predicates, and now, instead of branches on the false predicate I can actually qualify execution of the check and the next compare on the true predicate.

This compare then generates two more results, one of which is going to qualify the final compare.

And I have now squashed this down into one basic block that has a lot more parallelism in it.

Now, we've got, you know, three, three, two, two, I guess there is a "one" in there someplace, but for the most part we've shrunk the longest path from 13 cycles down to 7.
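Putting the two transformations together, the same test collapses into roughly one block with one remaining branch. This is only an illustrative sketch -- the exact schedule and register assignments on the slide aren't reproduced here -- but it shows the shape: the compares are qualified by predicates instead of guarded by branches, and the checks sit next to the compares in the home block.

            ld8     r14 = [r15]             // b[j] loads normally -- it was always safe
            ld8.s   r16 = [r17]             // a[i+j] and c[i-j+7] load speculatively,
            ld8.s   r18 = [r19]             //   long before we know we need them
            ;;                              // one dead clock now covers all three loads
            cmp.ne  p1, p0 = r14, r0        // P1 = (b[j] != 0)
            ;;
    (p1)    chk.s   r16, recover_a          // check and compare in parallel,
    (p1)    cmp.ne.unc p2, p0 = r16, r0     //   both qualified by P1;
            ;;                              //   P2 = P1 && (a[i+j] != 0)
    (p2)    chk.s   r18, recover_c
    (p2)    cmp.ne.unc p3, p0 = r18, r0     // P3 = P2 && (c[i-j+7] != 0)
            ;;
    (p3)    br.cond place_queen             // the one branch left, and a much
                                            //   better-behaved one
            // (recover_a / recover_c redo the load and the work that used it,
            //  then branch back; they are omitted here)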

What is even more important is we've gotten rid of a lot of the mispredicts along with these branches. Originally they had mispredict rates of 30 or 40 percent, with a potential cost of ten or more clocks per mispredict, and that's a significant amount of computation.

Another factor enters in here: as you combine these compares and you combine this complex control flow, not only do you shorten the critical path and widen up the program, but in many cases, particularly with data-dependent branches like this, you can get an improvement in your branch prediction.

We've gone from a situation where we had 30 or 40 percent mispredicts to a situation where this one remaining branch is much more well-behaved. It's taken much less often, and the predictor can do a better job of predicting it, reducing the cycle counts even more.

I think the main thing here is to show how, even with a simple example, we can take something that's just brutally sequential -- one instruction after another -- apply just the basic predication and speculation transformations to it, shrink the critical path by a dramatic amount, and widen up the execution flow to take advantage of parallel hardware.

I guess the one thing I wanted to tease you with here is that there are additional things beyond this -- we've got more tricks up our sleeve that we'll talk about later. We can shrink this path even more and take advantage of even wider machines.

So I hope this serves to illustrate the success we can have in taking complex control flow and really shrinking it dramatically.

So now let me bounce away from the cycle counts and some of the details and pop up a level or two and try to summarize what we believe the key characteristics are of what we call EPIC or "Explicitly Parallel Instruction Computing", which rests on three fundamental characteristics.

Starting at the bottom, kind of the foundation is what Jerry talked about, these resources for parallel computing. We want to have a lot of registers. We want to be able to scale to a lot of functional units.

And really have an inherently scalable instruction set; that really forms the foundation.

As we go out in time with larger and larger transistor budgets we'll be able to build wider and wider machines that really form the basis for this parallelism.

Then on top of that, really the first thing, or the thing that's most characteristic of this, is this concept of explicit parallelism, where we want to have the instruction level parallelism explicit in the machine code -- have the compiler tell the hardware what's independent and what's dependent, and not have to have transistors detecting that and finding implicit parallelism. By making it explicit, it's right there in the machine code.

Another characteristic is that this gives the compiler the ability to go across a wide scope. We're not limited by a hardware window, but rather by the algorithm and the creativity that you have in the compiler algorithms.

The other thing that's very important, particularly in the explicit parallelism activity, and what distinguishes us from the earlier VLIW machines, is that we have built flexibility into specifying the parallelism, so we can offer scalability ahead, and compatibility ahead, as we go to wider and wider machines.

We're able to specify anything from a degree of parallelism of one up to an arbitrarily large amount of parallelism, and depending on the width of the machine, then, it can break off very large chunks into whatever its shovel size is and shovel that into the machine. Narrower machines of course have a narrower shovel and will shovel fewer instructions at a time. Wider machines can pick up these bigger chunks and feed them in. It will all be compatible as we go forward, and scalable both from a compatibility point of view and from a performance point of view.

Finally, the third angle of this is in order to provide the parallelism, and to take advantage of wide machines, particularly in commercial applications and non-numeric computing applications, we need to have facilities that enhance instruction level parallelism. So we need to provide functions such as predication and speculation, as I indicated before, that allow us to shrink that critical path through your program down by taking advantage of wider hardware. And two of the key techniques, predication and speculation, we talked about today. There are some others that we'll be talking about in the future.

So it's these three -- this collection of characteristics -- that we believe is the right technology to move ahead and take advantage of the kind of machines that we can build with a lot of parallelism.

So let me back up and summarize the talk here. IA-64 is an instruction set in which we've embodied these EPIC techniques. We believe it's going to enable industry leading performance. It's going to provide us a scalable, compatible architecture as we go forward. Again, explicitly parallel, this idea of making the parallelism explicit in the machine code, both makes the compiler's job easier in terms of describing this parallelism directly to the machine, and certainly makes the hardware easier by eliminating a need for a lot of transistors to detect that parallelism.

In order to make it scalable, we provided a lot of resources, a lot of registers that will support future generations as we scale this thing forward.

The other angle is full compatibility. We're going to carry forward both the Intel architecture IA-32 compatible application base as well as HP's PA-RISC application base as we go into the future, and then, of course, carry everything forward as we roll multiple generations of these products.

We think the instruction set has facilities that address the broad market, the high-end computing market: servers and workstations. We talked a little bit about some of the characteristics and some of the transformations we have for dealing with complex control flow, and the kinds of applications characterized by that, where memory latency and branches are serious performance limiters. We believe we've got the right set of techniques to address that, as well as some others we'll talk about in the future, that will give us really good performance on enterprise computing, commercial applications, database, decision support, that kind of thing.

Beyond that there's a lot of very interesting applications. We've heard a lot about 3D transformations this morning. 3D graphics, imaging, video, all kinds of applications where there's a lot of inherent parallelism, and the basic hardware that we've got in place -- lots of registers, lots of functional units -- really directly attacks those kinds of problems. It really lets you get at the parallelism that's in those applications.

In addition to that, the mechanisms we have for dealing with complex control flow allow us to handle the branching activity very smoothly, and those sometimes tortuous paths between these big loops -- to shrink even those up and make sure the whole application gets a very effective speed-up.

So we're happy to be here to announce what we believe is the next big step in computer architecture. And with that, I'll wrap up.

(Applause.)