Description: Professor Shun discusses races and parallelism, how cilkscale can analyze computation and detect determinacy races, and types of schedulers.
Instructor: Julian Shun
Lecture 7: Races and Parallelism
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high-quality educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
JULIAN SHUN: Good afternoon, everyone. So let's get started. So today, we're going to be talking about races and parallelism. And you'll be doing a lot of parallel programming for the next homework assignment and project.
One thing I want to point out is that it's important to meet with your MITPOSSE as soon as possible, if you haven't done so already, since that's going to be part of the evaluation for the Project 1 grade. And if you have trouble reaching your MITPOSSE members, please contact your TA and also make a post on Piazza as soon as possible.
So as a reminder, let's look at the basics of Cilk. So we have cilk_spawn and cilk_sync statements. In Cilk, this was the code that we saw in last lecture, which computes the nth Fibonacci number. So when we say cilk_spawn, it means that the named child function, the function right after the cilk_spawn keyword, can execute in parallel with the parent caller. So it says that fib of n minus 1 can execute in parallel with the fib function that called it.
And then cilk_sync says that control cannot pass this point until all of its spawned children have returned. So this is going to wait for fib of n minus 1 to finish before it goes on and returns the sum of x and y.
And recall that the Cilk keywords grant permission for parallel execution, but they don't actually force parallel execution. So this code here says that we can execute fib of n minus 1 in parallel with this parent caller, but it doesn't say that we necessarily have to execute them in parallel. And it's up to the runtime system to decide whether these different functions will be executed in parallel. We'll talk more about the runtime system today.
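For reference, here is a minimal sketch of the Fibonacci code being discussed, assuming the OpenCilk-style dialect used in this course (cilk_spawn and cilk_sync from cilk/cilk.h):

```c
#include <cilk/cilk.h>
#include <stdint.h>

int64_t fib(int64_t n) {
  if (n < 2) return n;
  int64_t x = cilk_spawn fib(n - 1);  // the spawned child may run in parallel with the parent
  int64_t y = fib(n - 2);             // the parent continues with the other subproblem
  cilk_sync;                          // wait for the spawned child before using x
  return x + y;
}
```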
And also, we talked about this example, where we wanted to do an in-place matrix transpose. And this used the cilk_for keyword. And this says that we can execute the iterations of this cilk_for loop in parallel.
And again, this says that the runtime system is allowed to schedule these iterations in parallel, but doesn't necessarily say that they have to execute in parallel. And under the hood, cilk_for statements are translated into nested cilk_spawn and cilk_sync calls. So the compiler is going to divide the iteration space in half, do a cilk_spawn on one of the two halves, call the other half, and then this is done recursively until we reach a certain size for the number of iterations in a loop, at which point it just creates a single task for that.
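The lowering described here can be pictured with the following conceptual sketch; it is not the actual code the compiler emits, and the loop body body() and the grain size G are placeholders:

```c
#include <cilk/cilk.h>

#define G 2048                 // coarsening grain size (illustrative value only)
void body(int i);              // whatever one iteration of the cilk_for does

// Conceptual translation of: cilk_for (int i = lo; i < hi; ++i) body(i);
static void loop_dac(int lo, int hi) {
  if (hi - lo <= G) {
    for (int i = lo; i < hi; ++i) body(i);  // a small chunk runs as one serial task
    return;
  }
  int mid = lo + (hi - lo) / 2;
  cilk_spawn loop_dac(lo, mid);  // spawn one half of the iteration space...
  loop_dac(mid, hi);             // ...and call the other half
  cilk_sync;
}
```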
So any questions on the Cilk constructs? Yes?
AUDIENCE: Is Cilk smart enough to recognize issues with reading and writing for matrix transpose?
JULIAN SHUN: So it's actually not going to figure out whether the iterations are independent for you. The programmer actually has to reason about that. But Cilk does have a nice tool, which we'll talk about, that will tell you which places your code might possibly be reading and writing the same memory location, and that allows you to localize any possible race bugs in your code. So we'll actually talk about races. But if you just compile this code, Cilk isn't going to know whether the iterations are independent.
So determinacy races-- so race conditions are the bane of concurrency. So you don't want to have race conditions in your code. And there are these two famous race bugs that cause disaster. So there is this Therac-25 radiation therapy machine, and there was a race condition in the software. And this led to three people being killed and many more being seriously injured. The North American blackout of 2003 was also caused by a race bug in the software, and this left 50 million people without power.
So these are very bad. And they're notoriously difficult to discover by conventional testing. So race bugs aren't going to appear every time you execute your program. And in fact, the hardest ones to find, which cause these events, are actually very rare events. So most of the times when you run your program, you're not going to see the race bug. Only very rarely will you see it.
So this makes it very hard to find these race bugs. And furthermore, when you see a race bug, it doesn't necessarily always happen in the same place in your code. So that makes it even harder.
So what is a race? So a determinacy race is one of the most basic forms of races. And a determinacy race occurs when two logically parallel instructions access the same memory location, and at least one of these instructions performs a write to that location. So let's look at a simple example.
So in this code here, I'm first setting x equal to 0. And then I have a cilk_for loop with two iterations, and each of the two iterations are incrementing this variable x. And then at the end, I'm going to assert that x is equal to 2.
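Here is the example as code, roughly as it appears on the slide:

```c
#include <assert.h>
#include <cilk/cilk.h>

int main(void) {
  int x = 0;
  cilk_for (int i = 0; i < 2; ++i) {
    x++;              // both iterations may update x in parallel: a determinacy race
  }
  assert(x == 2);     // can fail, because one of the two increments can be lost
  return 0;
}
```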
So there's actually a race in this program here. So in order to understand where the race occurs, let's look at the execution graph here. So I'm going to label each of these statements with a letter. The first statement, a, is just setting x equal to 0.
And then after that, we're actually going to have two parallel paths, because we have two iterations of this cilk_for loop, which can execute in parallel. And each of these paths are going to increment x by 1. And then finally, we're going to assert that x is equal to 2 at the end.
And this sort of graph is known as a dependency graph. It tells you what instructions have to finish before you execute the next instruction. So here it says that B and C must wait for A to execute before they proceed, but B and C can actually happen in parallel, because there is no dependency among them. And then D has to happen after B and C finish.
So to understand why there's a race bug here, we actually need to take a closer look at this dependency graph. So let's take a closer look. So when you run this code, x plus plus is actually going to be translated into three steps. So first, we're going to load the value of x into some processor's register, r1. And then we're going to increment r1, and then we're going to set x equal to the result of r1.
And the same thing for r2. We're going to load x into register r2, increment r2, and then set x equal to r2.
So here, we have a race, because both of these stores-- x equal to r1 in one branch and x equal to r2 in the other-- are actually writing to the same memory location. So let's look at one possible execution of this computation graph. And we're going to keep track of the values of x, r1, and r2.
So the first instruction we're going to execute is x equal to 0. So we just set x equal to 0, and everything's good so far. And then next, we can actually pick one of two instructions to execute, because both of these two instructions have their predecessors satisfied already. Their predecessors have already executed.
So let's say I pick r1 equal to x to execute. And this is going to place the value 0 into register r1. Now I'm going to increment r1, so this changes the value in r1 to 1. Then now, let's say I execute r2 equal to x.
So that's going to read x, which has a value of 0. It's going to place the value of 0 into r2. It's going to increment r2. That's going to change that value to 1. And then now, let's say I write r2 back to x. So I'm going to place a value of 1 into x.
Then now, when I execute this instruction, x equal to r1, it's also placing a value of 1 into x. And then finally, when I do the assertion, this value here is not equal to 2, and that's wrong. Because if you executed this sequentially, you would get a value of 2 here.
And the reason-- as I said, the reason why this occurs is because we have multiple writes to the same shared memory location, which could execute in parallel. And one of the nasty things about this example here is that the race bug doesn't necessarily always occur. So does anyone see why this race bug doesn't necessarily always show up? Yes?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Right. So the answer is because if one of these two branches executes all three of its instructions before we start the other one, then the final result in x is going to be 2, which is correct. So if I executed these instructions in order of 1, 2, 3, 7, 4, 5, 6, and then, finally, 8, the value is going to be 2 in x. So the race bug here doesn't necessarily always occur. And this is one thing that makes these bugs hard to find.
So any questions?
So there are two different types of determinacy races. And they're shown in this table here. So let's suppose that instruction A and instruction B both access some location x, and suppose A is parallel to B. So both of the instructions can execute in parallel. So if A and B are just reading that location, then that's fine. You don't actually have a race here.
But if one of the two instructions is writing to that location, whereas the other one is reading from that location, then you have what's called a read race. And the program might have a non-deterministic result when you have a read race, because the final answer might depend on whether the read happens before the other instruction updates the value or after it does. So the order of execution of A and B can affect the final result that you see.
And finally, if both A and B write to the same shared location, then you have a write race. And again, this will cause non-deterministic behavior in your program, because the final answer could depend on whether A did the write first or B did the write first.
And we say that two sections of code are independent if there are no determinacy races between them. So the two pieces of code can't have a shared location, where one computation writes to it and another computation reads from it, or if both computations write to that location.
Any questions on the definition?
So races are really bad, and you should avoid having races in your program. So here are some tips on how to avoid races. So I can tell you not to write races in your program, and you know that races are bad, but sometimes, when you're writing code, you just have races in your program, and you can't help it. But here are some tips on how you can avoid races.
So first, the iterations of a cilk_for loop should be independent. So you should make sure that the different iterations of a cilk_for loop aren't writing to the same memory location.
Secondly, between a cilk_spawn statement and a corresponding cilk_sync, the code of the spawn child should be independent of the code of the parent. And this includes code that's executed by additional spawned or called children by the spawned child. So you should make sure that these pieces of code are independent-- there's no read or write races between them.
One thing to note is that the arguments to a spawn function are evaluated in the parent before the spawn actually occurs. So you can't get a race in the argument evaluation, because the parent is going to evaluate these arguments. And there's only one thread that's doing this, so it's fine. And another thing to note is that the machine word size matters. So you need to watch out for races when you're reading and writing to packed data structures.
So here's an example. I have a struct x with two chars, a and b. And updating x.a and x.b may possibly cause a race. And this is a nasty race, because it depends on the compiler optimization level. Fortunately, this is safe on the Intel machines that we're using in this class. You can't get a race in this example. But there are other architectures that might have a race when you're updating the two variables a and b in this case.
So with the Intel machines that we're using, if you're using standard data types like chars, shorts, ints, and longs inside a struct, you won't get races. But if you're using non-standard types-- for example, you're using the C bit-field facility, and the sizes of the fields are not one of the standard sizes, then you could possibly get a race. In particular, if you're updating individual bits inside a word in parallel, then you might see a race there. So you need to be careful.
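Here is roughly the packed-struct example being described; whether the parallel updates race depends on whether the compiler and architecture update a and b with byte-sized stores (safe on the class's Intel machines) or by rewriting a wider word:

```c
#include <cilk/cilk.h>

struct { char a; char b; } x;   // two chars packed together

void set_a(void) { x.a = 1; }
void set_b(void) { x.b = 1; }

void update_both(void) {
  cilk_spawn set_a();   // update x.a and x.b in parallel; fine with byte-sized
  set_b();              // stores, but a race if either update is done as a
  cilk_sync;            // read-modify-write of the containing word
}
```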
Questions?
So fortunately, the Cilk platform has a very nice tool called the-- yes, question?
AUDIENCE: [INAUDIBLE] was going to ask, what causes that race?
JULIAN SHUN: Because the architecture might actually be updating this struct at the granularity of more than 1 byte. So if you're updating single bytes inside this larger word, then that might cause a race. But fortunately, this doesn't happen on Intel machines.
So the Cilksan race detector-- if you compile your code with the flag -fsanitize=cilk, then it's going to generate a Cilksan-instrumented program. And then if an ostensibly deterministic Cilk program run on a given input could possibly behave any differently than its serial elision, then Cilksan is guaranteed to report and localize the offending race. So Cilksan is going to tell you which memory location there might be a race on and which of the instructions were involved in this race.
So Cilksan employs a regression test methodology where the programmer provides it different test inputs. And for each test input, if there could possibly be a race in the program, then it will report these races. And it identifies the file names, the lines, the variables involved in the races, including the stack traces. So it's very helpful when you're trying to debug your code and find out where there's a race in your program.
One thing to note is that you should ensure that all of your program files are instrumented. Because if you only instrument some of your files and not the other ones, then you'll possibly miss out on some of these race bugs.
And one of the nice things about the Cilksan race detector is that it's always going to report a race if there is possibly a race, unlike many other race detectors, which are best efforts. So they might report a race some of the times when the race actually occurs, but they don't necessarily report a race all the time. Because in some executions, the race doesn't occur. But the Cilksan race detector is going to always report the race, if there is potentially a race in there.
Cilksan is your best friend. So use this when you're debugging your homeworks and projects.
Here's an example of the output that's generated by Cilksan. So you can see that it's saying that there's a race detected at this memory address here. And the line of code that caused this race is shown here, as well as the file name. So this is a matrix multiplication example. And then it also tells you how many races it detected.
So any questions on determinacy races?
So let's now talk about parallelism. So what is parallelism? Can we quantitatively define what parallelism is? So what does it mean when somebody tells you that their code is highly parallel?
So to have a formal definition of parallelism, we first need to look at the Cilk execution model. So this is a code that we saw before for Fibonacci. Let's now look at what a call to fib of 4 looks like. So here, I've color coded the different lines of code here so that I can refer to them when I'm drawing this computation graph.
So now, I'm going to draw this computation graph corresponding to how the computation unfolds during execution. So the first thing I'm going to do is I'm going to call fib of 4. And that's going to generate this magenta node here corresponding to the call to fib of 4, and that's going to represent this pink code here.
And this illustration is similar to the computation graphs that you saw in the previous lecture, but this is happening in parallel. And I'm only labeling the argument here, but you could actually also write the local variables there. But I didn't do it, because I want to fit everything on this slide.
So what happens when you call fib of 4? It's going to get to this cilk_spawn statement, and then it's going to call fib of 3. And when I get to a cilk_spawn statement, what I do is I'm going to create another node that corresponds to the child that I spawned. So this is this magenta node here in this blue box. And then I also have a continue edge going to a green node that represents the computation after the cilk_spawn statement. So this green node here corresponds to the green line of code in the code snippet.
Now I can unfold this computation graph one more step. So we see that fib 3 is going to call fib of 2, so I created another node here. And the green node here, which corresponds to this green line of code-- it's also going to make a function call. It's going to call fib of 2. And that's also going to create a new node.
So in general, when I do a spawn, I'm going to have two outgoing edges out of a magenta node. And when I do a call, I'm going to have one outgoing edge out of a green node. So this green node, the outgoing edge corresponds to a function call. And for this magenta node, its first outgoing edge corresponds to spawn, and then its second outgoing edge goes to the continuation strand.
So I can unfold this one more time. And here, I see that I'm creating some more spawns and calls to fib. And if I do this one more time, I've actually reached the base case. Because once n is equal to 1 or 0, I'm not going to make any more recursive calls.
And by the way, the color of these boxes that I used here correspond to whether I called that function or whether I spawned it. So a box with white background corresponds to a function that I called, whereas a box with blue background corresponds to a function that I spawned.
So now I've gotten to the base case, I need to now execute this blue statement, which sums up x and y and returns the result to the parent caller. So here I have a blue node. So this is going to take the results of the two recursive calls, sum them together.
And I have another blue node here. And then it's going to pass its value to the parent that called it. So I'm going to pass this up to its parent, and then I'm going to pass this one up as well. And finally, I have a blue node at the top level, which is going to compute my final result, and that's going to be the output of the program.
So one thing to note is that this computation dag unfolds dynamically during the execution. So the runtime system isn't going to create this graph at the beginning. It's actually going to create this on the fly as you run the program. So this graph here unfolds dynamically. And also, this graph here is processor-oblivious. So nowhere in this computation dag did I mention the number of processors I had for the computation.
And similarly, in the code here, I never mentioned the number of processors that I'm using. So the runtime system is going to figure out how to map these tasks to the number of processors that you give to the computation dynamically at runtime. So for example, I can run this on any number of processors. If I run it on one processor, it's just going to execute these tasks one after another.
In fact, it's going to execute them in a depth-first order, which corresponds to what the sequential algorithm would do. So I'm going to start with fib of 4, go to fib of 3, fib of 2, fib of 1, then pop back up and do fib of 0, and go back up, and so on. So if I use one processor, it's going to create and execute this computation dag in a depth-first manner. And if I have more than one processor, it's not necessarily going to follow a depth-first order, because I could have multiple computations going on.
Any questions on this example? I'm actually going to formally define some terms on the next slide so that we can formalize the notion of a computation dag. So dag stands for directed acyclic graph, and this is a directed acyclic graph. So we call it a computation dag.
So a parallel instruction stream is a dag G with vertices V and edges E. And each vertex in this dag corresponds to a strand. And a strand is a sequence of instructions not containing a spawn, a sync, or a return from a spawn. So the instructions inside a strand are executed sequentially. There's no parallelism within a strand.
We call the first strand the initial strand, so this is the magenta node up here. The last strand-- we call it the final strand. And then everything else, we just call it a strand.
And then there are four types of edges. So there are spawn edges, call edges, return edges, or continue edges. And a spawn edge corresponds to an edge to a function that you spawned. So these spawn edges are going to go to a magenta node.
A call edge corresponds to an edge that goes to a function that you called. So in this example, these are coming out of the green nodes and going to a magenta node.
A return edge corresponds to an edge going back up to the parent caller. So here, it's going into one of these blue nodes.
And then finally, a continue edge is just the other edge when you spawn a function. So this is the edge that goes to the green node. It's representing the computation after you spawn something.
And notice that in this computation dag, we never explicitly represented cilk_for, because as I said before, cilk_fors are converted to nested cilk_spawns and cilk_sync statements. So we don't actually need to explicitly represent cilk_fors in the computation DAG.
Any questions on this definition? So we're going to be using this computation dag throughout this lecture to analyze how much parallelism there is in a program.
So assuming that each of these strands executes in unit time-- this assumption isn't always true in practice. In practice, strands will take different amounts of time. But let's assume, for simplicity, that each strand here takes unit time. Does anyone want to guess what the parallelism of this computation is? So how parallel do you think this is? What's the maximum speedup you might get on this computation?
AUDIENCE: 5.
JULIAN SHUN: 5. Somebody said 5. Any other guesses? Who thinks this is going to be less than five? A couple people. Who thinks it's going to be more than five? A couple of people. Who thinks there's any parallelism at all in this computation?
Yeah, seems like a lot of people think there is some parallelism here. So we're actually going to analyze how much parallelism is in this computation. So I'm not going to tell you the answer now, but I'll tell you in a couple of slides. First need to go over some terminology.
So whenever you start talking about parallelism, somebody is almost always going to bring up Amdahl's Law. And Amdahl's Law says that if 50% of your application is parallel and the other 50% is serial, then you can't get more than a factor of 2 speedup, no matter how many processors you run the computation on. Does anyone know why this is the case? Yes?
AUDIENCE: Because you need it to execute for at least 50% of the time in order to get through the serial portion.
JULIAN SHUN: Right. So you have to spend at least 50% of the time in the serial portion. So in the best case, if I gave you an infinite number of processors, and you can reduce the parallel portion of your code to 0 running time, you still have the 50% of the serial time that you have to execute. And therefore, the best speedup you can get is a factor of 2.
And in general, if a fraction alpha of an application must be run serially, then the speedup can be at most 1 over alpha. So if 1/3 of your program has to be executed sequentially, then the speedup can be, at most, 3. Because even if you reduce the parallel portion of your code to a running time of 0, you still have the sequential part of your code that you have to wait for.
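Written out, with alpha the serial fraction:

```latex
\text{speedup} \;\le\; \frac{1}{\alpha}
\qquad\text{e.g.}\quad \alpha = \tfrac{1}{2} \Rightarrow \text{speedup} \le 2,
\qquad \alpha = \tfrac{1}{3} \Rightarrow \text{speedup} \le 3.
```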
So let's try to quantify the parallelism in this computation here. So how many of these nodes have to be executed sequentially? Yes?
AUDIENCE: 9 of them.
JULIAN SHUN: So it turns out to be less than 9. Yes?
AUDIENCE: 7.
JULIAN SHUN: 7. It turns out to be less than 7. Yes?
AUDIENCE: 6.
JULIAN SHUN: So it turns out to be less than 6.
AUDIENCE: 4.
JULIAN SHUN: Turns out to be less than 4. You're getting close.
AUDIENCE: 2.
JULIAN SHUN: 2. So turns out to be more than 2.
AUDIENCE: 2.5.
JULIAN SHUN: What's left?
AUDIENCE: 3.
JULIAN SHUN: 3. OK. So 3 of these nodes have to be executed sequentially. Because when you're executing these nodes, there's nothing else that can happen in parallel. For all of the remaining nodes, when you're executing them, you can potentially be executing some of the other nodes in parallel. But for these three nodes that I've colored in yellow, you have to execute those sequentially, because there's nothing else that's going on in parallel.
So according to Amdahl's Law, this says that the serial fraction of the program is 3 over 18. So there's 18 nodes in this graph here. So therefore, the serial fraction is 1 over 6, and the speedup is upper bounded by 1 over that, which is 6. So Amdahl's Law tells us that the maximum speedup we can get is 6. Any questions on how I got this number here?
So it turns out that Amdahl's Law actually gives us a pretty loose upper bound on the parallelism, and it's not that useful in many practical cases. So we're actually going to look at a better definition of parallelism that will give us a better upper bound on the maximum speedup we can get.
So we're going to define T sub P to be the execution time of the program on P processors. And T sub 1 is just the work. So T sub 1 is if you executed this program on one processor, how much stuff do you have to do? And we define that to be the work. Recall in lecture 2, we looked at many ways to optimize the work. This is the work term.
So in this example, the number of nodes here is 18, so the work is just going to be 18. We also define T of infinity to be the span. The span is also called the critical path length, or the computational depth, of the graph. And this is equal to the longest directed path you can find in this graph.
So in this example, the longest path is 9. So one of the students answered 9 earlier, and this is actually the span of this graph. So there are 9 nodes along this path here, and that's the longest one you can find. And we call this T of infinity because that's actually the execution time of this program if you had an infinite number of processors.
So there are two laws that are going to relate these quantities. So the work law says that T sub P is greater than or equal to T sub 1 divided by P. So this says that the execution time on P processors has to be greater than or equal to the work of the program divided by the number of processors you have. Does anyone see why the work law is true?
So the answer is that if you have P processors, on each time step, you can do, at most, P work. So if you multiply both sides by P, you get P times T sub P is greater than or equal to T1. If P times T sub P was less than T1, then that means you're not done with the computation, because you haven't done all the work yet. So the work law says that T sub P has to be greater than or equal to T1 over P.
Any questions on the work law?
So let's look at another law. This is called the span law. It says that T sub P has to be greater than or equal to T sub infinity. So the execution time on P processors has to be at least execution time on an infinite number of processors. Anyone know why the span law has to be true?
So another way to see this is that if you had an infinite number of processors, you can actually simulate a P processor system. You just use P of the processors and leave all the remaining processors idle. And that can't slow down your program. So therefore, you have that T sub P has to be greater than or equal to T sub infinity. If you add more processors to it, the running time can't go up.
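Summarizing the two laws:

```latex
\text{Work Law:}\quad T_P \;\ge\; \frac{T_1}{P}
\qquad\qquad
\text{Span Law:}\quad T_P \;\ge\; T_\infty
```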
Any questions?
So let's see how we can compose the work and the span quantities of different computations. So let's say I have two computations, A and B. And let's say that A has to execute before B. So everything in A has to be done before I start the computation in B. Let's say I know what the work of A and the work of B individually are. What would be the work of A union B? Yes?
AUDIENCE: I guess it would be T1 A plus T1 B.
JULIAN SHUN: Yeah. So why is that?
AUDIENCE: Well, you have to execute sequentially. So then you just take the time and [INAUDIBLE] execute A, then it'll execute B after that.
JULIAN SHUN: Yeah. So the work is just going to be the sum of the work of A and the work of B. Because you have to do all of the work of A and then do all of the work of B, so you just add them together.
What about the span? So let's say I know the span of A and I know the span of B. What's the span of A union B? So again, it's just a sum of the span of A and the span of B. This is because I have to execute everything in A before I start B. So I just sum together the spans.
So this is series composition. What if I do parallel composition? So let's say here, I'm executing the two computations in parallel. What's the work of A union B?
So it's not going to be the maximum. Yes?
AUDIENCE: It should still be T1 of A plus T1 of B.
JULIAN SHUN: Yeah, so it's still going to be the sum of T1 of A and T1 of B. Because you still have the same amount of work that you have to do. It's just that you're doing it in parallel. But the work is just the time if you had one processor. So if you had one processor, you wouldn't be executing these in parallel.
What about the span? So if I know the span of A and the span of B, what's the span of the parallel composition of the two? Yes?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yeah, so the span of A union B is going to be the max of the span of A and the span of B, because I'm going to be bottlenecked by the slower of the two computations. So I just take the one that has longer span, and that gives me the overall span. Any questions?
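To summarize the composition rules before moving on:

```latex
\text{Series (A then B):}\quad
T_1(A \cup B) = T_1(A) + T_1(B), \qquad
T_\infty(A \cup B) = T_\infty(A) + T_\infty(B)

\text{Parallel (A alongside B):}\quad
T_1(A \cup B) = T_1(A) + T_1(B), \qquad
T_\infty(A \cup B) = \max\{T_\infty(A),\, T_\infty(B)\}
```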
So here's another definition. So T1 divided by TP is the speedup on P processors. If I have T1 divided by TP less than P, then this means that I have sub-linear speedup. I'm not making use of all the processors. Because I'm using P processors, but I'm not getting a speedup of P.
If T1 over TP is equal to P, then I'm getting perfect linear speedup. I'm making use of all of my processors. I'm putting P times as many resources into my computation, and it becomes P times faster. So this is the good case.
And finally, if T1 over TP is greater than P, we have something called superlinear speedup. In our simple performance model, this can't actually happen, because of the work law. The work law says that TP has to be at least T1 divided by P. So if you rearrange the terms, you'll see that we get a contradiction in our model.
In practice, you might sometimes see that you have a superlinear speedup, because when you're using more processors, you might have access to more cache, and that could improve the performance of your program. But in general, you might see a little bit of superlinear speedup, but not that much. And in our simplified model, we're just going to assume that you can't have a superlinear speedup. And getting perfect linear speedup is already very good.
So because the span law says that TP has to be at least T infinity, the maximum possible speedup is just going to be T1 divided by T infinity, and that's the parallelism of your computation. This is a maximum possible speedup you can get.
Another way to view this is that it's equal to the average amount of work that you have to do per step along the span. So for every step along the span, you're doing this much work. And after all the steps, then you've done all of the work.
So what's the parallelism of this computation dag here?
AUDIENCE: 2.
JULIAN SHUN: 2. Why is it 2?
AUDIENCE: T1 is 18 and T infinity is 9.
JULIAN SHUN: Yeah. So T1 is 18. There are 18 nodes in this graph. T infinity is 9. And the last time I checked, 18 divided by 9 is 2. So the parallelism here is 2.
So now we can go back to our Fibonacci example, and we can also analyze the work and the span of this and compute the maximum parallelism. So again, for simplicity, let's assume that each of these strands takes unit time to execute. Again, in practice, that's not necessarily true. But for simplicity, let's just assume that. So what's the work of this computation?
AUDIENCE: 17.
JULIAN SHUN: 17. Right. So the work is just the number of nodes you have in this graph. And you can just count that up, and you get 17.
What about the span? Somebody said 8. Yeah, so the span is 8. And here's the longest path. So this is the path that has 8 nodes in it, and that's the longest one you can find here.
So therefore, the parallelism is just 17 divided by 8, which is 2.125. And so for all of you who guessed that the parallelism was 2, you were very close.
This tells us that using many more than two processors can only yield us marginal performance gains. Because the maximum speedup we can get is 2.125. So even if we throw eight processors at this computation, we're not going to get a speedup beyond 2.125.
So to figure out how much parallelism is in your computation, you need to analyze the work of your computation and the span of your computation and then take the ratio between the two quantities. But for large computations, it's actually pretty tedious to analyze this by hand. You don't want to draw these things out by hand for a very large computation.
And fortunately, Cilk has a tool called the Cilkscale Scalability Analyzer. So this is integrated into the Tapir/LLVM compiler that you'll be using for this course. And Cilkscale uses compiler instrumentation to analyze a serial execution of a program, and it's going to generate the work and the span quantities and then use those quantities to derive upper bounds on the parallel speedup of your program. So you'll have a chance to play around with Cilkscale in homework 4.
So let's try to analyze the parallelism of quicksort. And here, we're using a parallel quicksort algorithm. The function quicksort here takes two inputs. These are two pointers. Left points to the beginning of the array that we want to sort. Right points to one element after the end of the array. And what we do is we first check if left is equal to right. If so, then we just return, because there are no elements to sort.
Otherwise, we're going to call this partition function. The partition function is going to pick a random pivot-- so this is a randomized quicksort algorithm-- and then it's going to move everything that's less than the pivot to the left part of the array and everything that's greater than or equal to the pivot to the right part of the array. It's also going to return us a pointer to the pivot.
And then now we can execute two recursive calls. So we do quicksort on the left side and quicksort on the right side. And this can happen in parallel. So we use the cilk_spawn here to spawn off one of these calls to quicksort in parallel. And therefore, the two recursive calls are parallel. And then finally, we sync up before we return from the function.
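A sketch of the parallel quicksort being described, assuming a partition() helper with the stated behavior (it picks a random pivot, partitions [left, right) in place, and returns a pointer to the pivot); the helper itself is not shown:

```c
#include <cilk/cilk.h>
#include <stdint.h>

// Assumed helper: sequential randomized partition of [left, right),
// returning a pointer to the pivot's final position.
int64_t *partition(int64_t *left, int64_t *right);

void quicksort(int64_t *left, int64_t *right) {
  if (left == right) return;          // no elements to sort
  int64_t *pivot = partition(left, right);
  cilk_spawn quicksort(left, pivot);  // sort the low side in parallel...
  quicksort(pivot + 1, right);        // ...with the high side
  cilk_sync;
}
```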
So let's say we wanted to sort 1 million numbers with this quicksort algorithm. And let's also assume that the partition function here is written sequentially, so you have to go through all of the elements, one by one. Can anyone guess what the parallelism is in this computation?
AUDIENCE: 1 million.
JULIAN SHUN: So the guess was 1 million. Any other guesses?
AUDIENCE: 50,000.
JULIAN SHUN: 50,000. Any other guesses? Yes?
AUDIENCE: 2.
JULIAN SHUN: 2. It's a good guess.
AUDIENCE: Log 2 of a million.
JULIAN SHUN: Log base 2 of a million. Any other guesses? So log base 2 of a million, 2, 50,000, and 1 million. Anyone think it's more than 1 million? No. So no takers on more than 1 million.
So if you run this program using Cilkscale, it will generate a plot that looks like this. And there are several lines on this plot. So let's talk about what each of these lines mean.
So this purple line here is the speedup that you observe in your computation when you're running it. And you can get that by taking the single processor running time and dividing it by the running time on P processors. So this is the observed speedup. That's the purple line.
The blue line here is the line that you get from the span law. So this is T1 over T infinity. And here, this gives us a bound of about 6 for the parallelism.
The green line is the bound from the work law. So this is just a linear line with a slope of 1. It says that on P processors, you can't get more than a factor of P speedup.
So therefore, the maximum speedup you can get has to be below the green line and below the blue line. So you're in this lower right quadrant of the plot.
There's also this orange line, which is the speedup you would get if you used a greedy scheduler. We'll talk more about the greedy scheduler later on in this lecture.
So this is the plot that you would get. And we see here that the maximum speedup is about 5. So for those of you who guessed 2 and log base 2 of a million, you were the closest.
You can also generate a plot that just tells you the execution time versus the number of processors. And you can get this quite easily just by doing a simple transformation from the previous plot.
So Cilkscale is going to give you these useful plots that you can use to figure out how much parallelism is in your program. And let's see why the parallelism here is so low.
So I said that we were going to execute this partition function sequentially, and it turns out that that's actually the bottleneck to the parallelism. So the expected work of quicksort is order n log n. So some of you might have seen this in your previous algorithms courses. If you haven't seen this yet, then you can take a look at your favorite textbook, Introduction to Algorithms. It turns out that the parallel version of quicksort also has an expected work bound of order n log n, if you pick a random pivot. So the analysis is similar.
The expected span bound turns out to be at least n. And this is because on the first level of recursion, we have to call this partition function, which is going to go through the elements one by one. So that already has a linear span. And it turns out that the overall span is also order n, because the span actually works out to be a geometrically decreasing sequence and sums to order n.
And therefore, the maximum parallelism you can get is order log n. So you just take the work divided by the span. So for the student who guessed that the parallelism is log base 2 of n, that's very good. Turns out that it's not exactly log base 2 of n, because there are constants in these work and span bounds, so it's on the order of log of n. That's the parallelism.
And it turns out that order log n parallelism is not very high. In general, you want the parallelism to be much higher, something polynomial in n.
And in order to get more parallelism in this algorithm, what you have to do is parallelize this partition function, because right now I'm just executing it sequentially. But you can indeed write a parallel partition function that takes linear work and order log n span. And then this would give you an overall span bound of log squared n. And then if you take n log n divided by log squared n, that gives you an overall parallelism of n over log n, which is much higher than order log n here.
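In summary, for sorting n numbers with this algorithm:

```latex
\text{serial partition:}\quad T_1 = \Theta(n \log n),\;\; T_\infty = \Theta(n)
\;\;\Longrightarrow\;\; \text{parallelism} = \Theta(\log n)

\text{parallel partition:}\quad T_1 = \Theta(n \log n),\;\; T_\infty = \Theta(\log^2 n)
\;\;\Longrightarrow\;\; \text{parallelism} = \Theta(n/\log n)
```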
And similarly, if you were to implement a merge sort, you would also need to make sure that the merging routine is implemented in parallel, if you want to see significant speedup. So not only do you have to execute the two recursive calls in parallel, you also need to make sure that the merging portion of the code is done in parallel. Any questions on this example?
AUDIENCE: In the graph that you had, sometimes when you got to higher processor numbers, it got jagged, and so sometimes adding a processor was making it slower. What are some reasons [INAUDIBLE]?
JULIAN SHUN: Yeah so I believe that's just due to noise, because there's some noise going on in the machine. So if you ran it enough times and took the average or the median, it should be always going up, or it shouldn't be decreasing, at least. Yes?
AUDIENCE: So [INAUDIBLE] is also [INAUDIBLE]?
JULIAN SHUN: So at one level of recursion, the partition function takes order log n span. You can show that there are log n levels of recursion in this quicksort algorithm. I didn't go over the details of this analysis, but you can show that. And then therefore, the overall span is going to be order log squared. And I can show you on the board after class, if you're interested, or I can give you a reference. Other questions?
So it turns out that in addition to quicksort, there are also many other interesting practical parallel algorithms out there. So here, I've listed a few of them. And by practical, I mean that the Cilk program running on one processor is competitive with the best sequential program for that problem.
And so you can see that I've listed the work and the span of merge sort here. And if you implement the merge in parallel, the span of the overall computation would be log cubed n. And n log n divided by log cubed n is n over log squared n. That's the parallelism, which is pretty high. And in general, all of these computations have pretty high parallelism.
Another thing to note is that these algorithms are practical, because their work bound is asymptotically equal to the work of the corresponding sequential algorithm. That's known as a work-efficient parallel algorithm. It's actually one of the goals of parallel algorithm design, to come up with work-efficient parallel algorithms. Because this means that even if you have a small number of processors, you can still be competitive with a sequential algorithm running on one processor. And in the next lecture, we actually see some examples of these other algorithms, and possibly even ones not listed on this slide, and we'll go over the work and span analysis and figure out the parallelism.
So now I want to move on to talk about some scheduling theory. So I talked about these computation dags earlier, analyzed the work and the span of them, but I never talked about how these different strands are actually mapped to processors at running time. So let's talk a little bit about scheduling theory. And it turns out that scheduling theory is actually very general. It's not just limited to parallel programming. It's used all over the place in computer science, operations research, and math.
So as a reminder, Cilk allows the program to express potential parallelism in an application. And a Cilk scheduler is going to map these strands onto the processors that you have available dynamically at runtime. Cilk actually uses a distributed scheduler.
But since the theory of distributed schedulers is a little bit complicated, we'll actually explore the ideas of scheduling first using a centralized scheduler. And a centralized scheduler knows everything about what's going on in the computation, and it can use that to make a good decision. So let's first look at what a centralized scheduler does, and then I'll talk a little bit about the Cilk distributed scheduler. And we'll learn more about that in a future lecture as well.
So we're going to look at a greedy scheduler. And an idea of a greedy scheduler is to just do as much as possible in every step of the computation. So has anyone seen greedy algorithms before? Right. So many of you have seen greedy algorithms before. So the idea is similar here. We're just going to do as much as possible at the current time step. We're not going to think too much about the future.
So we're going to define a ready strand to be a strand where all of its predecessors in the computation dag have already executed. So in this example here, let's say I already executed all of these blue strands. Then the ones shaded in yellow are going to be my ready strands, because they have all of their predecessors executed already.
And there are two types of steps in a greedy scheduler. The first kind of step is called a complete step. And in a complete step, we have at least P strands ready. So if we had P equal to 3, then we have a complete step now, because we have 5 strands ready, which is greater than 3.
So what are we going to do in a complete step? What would a greedy scheduler do? Yes?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yeah, so a greedy scheduler would just do as much as it can. So it would just run any 3 of these, or any P in general. So let's say I picked these 3 to run. So it turns out that these are actually the worst 3 to run, because they don't enable any new strands to be ready. But I can pick those 3.
And then the incomplete step is one where I have fewer than P strands ready. So here, I have 2 strands ready, and I have 3 processors. So what would I do in an incomplete step?
AUDIENCE: Just run through the strands that are ready.
JULIAN SHUN: Yeah, so just run all of them. So here, I'm going to execute these two strands. And then we're going to use complete steps and incomplete steps to analyze the performance of the greedy scheduler.
There's a famous theorem which was first shown by Ron Graham in 1968 that says that any greedy scheduler achieves the following time bound-- T sub P is less than or equal to T1 over P plus T infinity. And you might recognize the terms on the right hand side-- T1 is the work, and T infinity is the span that we saw earlier.
And here's a simple proof for why this time bound holds. So we can upper bound the number of complete steps in the computation by T1 over P. And this is because each complete step is going to perform P work. So after T1 over P complete steps, we'll have done all the work in our computation. So that means that the number of complete steps can be at most T1 over P.
So any questions on this?
So now, let's look at the number of incomplete steps we can have. So the number of incomplete steps we can have is upper bounded by the span, or T infinity. And the reason why is that if you look at the unexecuted dag right before you execute an incomplete step, and you measure the span of that unexecuted dag, you'll see that once you execute an incomplete step, it's going to reduce the span of that dag by 1.
So here, this is the span of our unexecuted dag that contains just these seven nodes. The span of this is 5. And when we execute an incomplete step, we're going to process all the roots of this unexecuted dag, delete them from the dag, and therefore, we're going to reduce the length of the longest path by 1. So when we execute an incomplete step, it decreases the span from 5 to 4.
And then the time bound up here, T sub P, is just upper bounded by the sum of these two types of steps. Because after you execute T1 over P complete steps and T infinity incomplete steps, you must have finished the entire computation.
So any questions?
A corollary of this theorem is that any greedy scheduler achieves within a factor of 2 of the optimal running time. So this is the optimal running time of a scheduler that knows everything and can predict the future and so on. So let's let TP star be the execution time produced by an optimal scheduler.
We know that TP star has to be at least the max of T1 over P and T infinity. This is due to the work and span laws. So it has to be at least a max of these two terms. Otherwise, we wouldn't have finished the computation.
So now we can take the inequality we had before for the greedy scheduler bound-- so TP is less than or equal to T1 over P plus T infinity. And this is upper bounded by 2 times the max of these two terms. So A plus B is upper bounded by 2 times the max of A and B.
And then now, the max of T1 over P and T infinity is just upper bounded by TP star. So we can substitute that in, and we get that TP is upper bounded by 2 times TP star, which is the running time of the optimal scheduler. So the greedy scheduler achieves within a factor of 2 of the optimal scheduler.
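The chain of inequalities just described, with T_P^* the optimal P-processor running time:

```latex
T_P \;\le\; \frac{T_1}{P} + T_\infty
    \;\le\; 2\,\max\!\left\{\frac{T_1}{P},\, T_\infty\right\}
    \;\le\; 2\,T_P^{*}
```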
Here's another corollary. This is a more interesting corollary. It says that any greedy scheduler achieves near-perfect linear speedup whenever T1 divided by T infinity is greater than or equal to P.
To see why this is true-- if we have that T1 over T infinity is much greater than P-- so the double arrows here mean that the left hand side is much greater than the right hand side-- then this means that the span is much less than T1 over P. And the greedy scheduling theorem gives us that TP is less than or equal to T1 over P plus T infinity, but T infinity is much less than T1 over P, so the first term dominates, and we have that TP is approximately equal to T1 over P. And therefore, the speedup you get is T1 over P, which is P. And this is linear speedup.
The quantity T1 divided by P times T infinity is known as the parallel slackness. So this is basically measuring how much more parallelism you have in a computation than the number of processors you have. And if parallel slackness is very high, then this corollary is going to hold, and you're going to see near-linear speedup.
As a rule of thumb, you usually want the parallel slackness of your program to be at least 10. Because if you have a parallel slackness of just 1, you can't actually amortize the overheads of the scheduling mechanism. So therefore, you want the parallel slackness to be at least 10 when you're programming in Cilk.
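Written out:

```latex
\text{parallel slackness} \;=\; \frac{T_1 / T_\infty}{P} \;=\; \frac{T_1}{P\,T_\infty},
\qquad
\frac{T_1}{P\,T_\infty} \gg 1 \;\Longrightarrow\; T_P \approx \frac{T_1}{P}.
```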
So that was the greedy scheduler. Let's talk a little bit about the Cilk scheduler. So Cilk uses a work-stealing scheduler, and it achieves an expected running time of TP equal to T1 over P plus order T infinity. So instead of just summing the two terms, we actually have a big O in front of the T infinity, and this is used to account for the overheads of scheduling. The greedy scheduler I presented earlier-- I didn't account for any of the overheads of scheduling. I just assumed that it could figure out which of the tasks to execute.
So this Cilk work-stealing scheduler has this expected time provably, so you can prove this using random variables and tail bounds of distribution. So Charles Leiserson has a paper that talks about how to prove this. And empirically, we usually see that TP is more like T1 over P plus T infinity. So we usually don't see any big constant in front of the T infinity term in practice. And therefore, we can get near-perfect linear speedup, as long as the number of processors is much less than T1 over T infinity, the maximum parallelism.
And as I said earlier, the instrumentation in Cilkscale will allow you to measure the work and span terms so that you can figure out how much parallelism is in your program.
Any questions?
So let's talk a little bit about how the Cilk runtime system works. So in the Cilk runtime system, each worker or processor maintains a work deque-- deque is just short for double-ended queue. It maintains a work deque of ready strands, and it manipulates the bottom of the deque, just like you would with the stack of a sequential program. So here, I have four processors, and each one of them has its own deque, with frames on it for the functions it has called or spawned, saving the return addresses, local variables, and so on.
So a processor can call a function, and when it calls a function, it just places that function's frame at the bottom of its stack. You can also spawn things, so then it places a spawn frame at the bottom of its stack. And then these things can happen in parallel, so multiple processors can be spawning and calling things in parallel.
And you can also return from a spawn or a call. So here, I'm going to return from a call. Then I return from a spawn. And at this point, I don't actually have anything left to do for the second processor. So what do I do now, when I'm left with nothing to do? Yes?
AUDIENCE: Take a [INAUDIBLE].
JULIAN SHUN: Yeah, so the idea here is to steal some work from another processor. So when a worker runs out of work to do, it's going to steal from the top of a random victim's deque. So it's going to pick one of these processors at random. It's going to roll some dice to determine who to steal from.
And let's say that it picked the third processor. Now it's going to take all of the stuff at the top of the deque up until the next spawn and place it into its own deque. And then now it has stuff to do again. So now it can continue executing this code. It can spawn stuff, call stuff, and so on.
So the idea is that whenever a worker runs out of work to do, it's going to start stealing some work from other processors. But if it always has enough work to do, then it's happy, and it doesn't need to steal things from other processors. And this is why MIT gives us so much work to do, so we don't have to steal work from other people.
So a famous theorem says that with sufficient parallelism, workers steal very infrequently, and this gives us near-linear speedup. So with sufficient parallelism, the T1 over P term in our running time bound is going to dominate the order T infinity term, and that gives us near-linear speedup.
Let me actually show you a pseudoproof of this theorem. And I'm allowed to do a pseudoproof. It's not actually a real proof, but a pseudoproof. So I'm allowed to do this, because I'm not the author of an algorithms textbook. So here's a pseudoproof.
AUDIENCE: Yet.
JULIAN SHUN: Yet. So a processor is either working or stealing at every time step. And the total time that all processors spend working is just T1, because that's the total work that you have to do. And then when it's not doing work, it's stealing. And each steal has a 1 over P chance of reducing the span by 1, because one of the processors is contributing to the longest path in the computation dag. And there's a 1 over P chance that I'm going to pick that processor and steal some work from that processor and reduce the span of my remaining computation by 1.
And each time the span decreases by 1, I have to do P steals in expectation before I hit the processor that has the critical path. And I'm going to do this T infinity times, so my overall cost for stealing is order P times T infinity. And since there are P processors, I'm going to divide the total expected time by P-- so T1 plus O of P times T infinity, divided by P-- and that's going to give me the bound: T1 over P plus order T infinity.
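The arithmetic of the pseudoproof, written out:

```latex
\underbrace{T_1}_{\text{total work}} \;+\; \underbrace{O(P\,T_\infty)}_{\text{expected steal cost}}
\quad\text{processor-steps, divided among } P \text{ processors:}\qquad
T_P \;=\; \frac{T_1 + O(P\,T_\infty)}{P} \;=\; \frac{T_1}{P} + O(T_\infty).
```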
So this pseudoproof here ignores issues with independence, but it still gives you an intuition of why we get this expected running time. If you want to actually see the full proof, it's actually quite interesting. It uses random variables and tail bounds of distributions. And this is the paper that has this. This is by Blumofe and Charles Leiserson.
So another thing I want to talk about is that Cilk supports C's rules for pointers. So a pointer to a stack space can be passed from a parent to a child, but not from a child to a parent. And this is the same as the stack rule for sequential C programs.
So let's say I have this computation on the left here. So A is going to spawn off B, and then it's going to continue executing C. And then C is going to spawn off D and execute E. So we see on the right hand side the views of the stacks for each of the tasks here.
So A sees its own stack. B sees its own stack, but it also sees A's stack, because A is its parent. C will see its own stack, but again, it sees A's stack, because A is its parent. And then finally, D and E, they see the stack of C, and they also see the stack of A. So in general, a task can see the stack of all of its ancestors in this computation graph.
And we call this a cactus stack, because it sort of looks like a cactus, if you draw this upside down. And Cilk's cactus stack supports multiple views of the stacks in parallel, and this is what makes parallel calls to functions work while still obeying C's rules for pointers.
We can also bound the stack space used by a Cilk program. So let's let S sub 1 be the stack space required by the serial execution of a Cilk program. Then the stack space required by a P-processor execution is going to be bounded by P times S1. So SP is the stack space required by a P-processor execution. That's less than or equal to P times S1.
Here's a high-level proof of why this is true. So it turns out that the work-stealing algorithm in Cilk maintains what's called the busy leaves property. And this says that each of the leaves that are still active in the computation dag has a worker executing on it.
So in this example here, the vertices shaded in blue and purple-- these are the ones that are in my remaining computation dag. And all of the gray nodes have already been finished. And here-- for each of the leaves here, I have one processor on that leaf executing the task associated with it. So Cilk guarantees this busy leaves property.
And now, for each of these processors, the amount of stack space it needs is the stack space for its own task plus everything above it in this computation dag. And we can actually bound that by the stack space needed by a single-processor execution of the Cilk program, S1, because S1 is just the maximum stack space we need, which basically corresponds to the longest path in this graph.
And we do this for every processor. So therefore, the upper bound on the stack space required by a P-processor execution is just P times S1. And in general, this is quite a loose upper bound, because you're not necessarily going all the way down in this computation dag every time. Usually you'll be much higher up in this computation dag.
So any questions? Yes?
AUDIENCE: In practice, how much work is stolen?
JULIAN SHUN: In practice, if you have enough parallelism, then you're not actually going to steal that much in your algorithm. So if you guarantee that there's a lot of parallelism, then each processor is going to have a lot of its own work to do, and it doesn't need to steal very frequently. But if your parallelism is very low compared to the number of processors-- if it's equal to the number of processors, then you're going to spend a significant amount of time stealing, and the overheads of the work-stealing algorithm are going to show up in your running time.
AUDIENCE: So I meant in one steal-- like do you take half of the deque, or do you take one element of the deque?
JULIAN SHUN: So the standard Cilk work-stealing scheduler takes everything at the top of the deque up until the next spawn. So basically that's a strand. So it takes that. There are variants that take more than that, but the Cilk work-stealing scheduler that we'll be using in this class just takes the top strand.
Any other questions?
So that's actually all I have for today. If you have any additional questions, you can come talk to us after class. And remember to meet with your MITPOSSE mentors soon.