Flash and JavaScript are required for this feature.
Download the video from Internet Archive.
Description: Prof. Shun discusses Bentley Rules for optimizing work and discusses a number of examples including packing and encoding, compile-time initialization, loop unrolling, short-circuiting, fast paths, and combining tests.
Instructor: Julian Shun
Lecture 2: Bentley Rules fo...
The following content is provided under a Creative Commons license. Your support will help MIT OpenCourseWare continue to offer high quality, educational resources for free. To make a donation or to view additional materials from hundreds of MIT courses, visit MIT OpenCourseWare at ocw.mit.edu.
JULIAN SHUN: So welcome to the second lecture of 6.172, performance engineering of software systems. Today, we're going to be talking about Bentley rules for optimizing work. All right, so work, does anyone know what work means? You're all at MIT, so you should know.
So in terms of computer programming, there's actually a formal definition of work. The work of a program on a particular input is defined to be the sum total of all the operations executed by the program. So it's basically a gross measure of how much stuff the program needs to do.
And the idea of optimizing work is to reduce the amount of stuff that the program needs to do in order to improve the running time of your program, improve its performance. So algorithm design can produce dramatic reductions in the work of a program. For example, if you want to sort an array of elements, you can use a nlogn time QuickSort. Or you can use an n squared time sort, like insertion sort.
So you've probably seen this before in your algorithm courses. And for large enough values of n, a nlogn time sort is going to be much faster than a n squared sort.
So today, I'm not going to be talking about algorithm design. You'll see more of this in other courses here at MIT. And we'll also talk a little bit about algorithm design later on in this semester. We will be talking about many other cool tricks for reducing the work of a program.
But I do want to point out, that reducing the work of our program doesn't automatically translate to a reduction in running time. And this is because of the complex nature of computer hardware. So there's a lot of things going on that aren't captured by this definition of work. There's instruction level parallelism, caching, vectorization, speculation and branch prediction, and so on. And we'll learn about some of these things throughout this semester.
But reducing the work of our program does serve as a good heuristic for reducing the overall running time of a program, at least to a first order. So today, we'll be learning about many ways to reduce the work of your program.
So rules we'll be looking at, we call them Bentley optimization rules, in honor of John Lewis Bentley. So John Lewis Bentley wrote a nice little book back in 1982 called Writing Efficient Programs. And inside this book there are various techniques for reducing the work of a computer program. So if you haven't seen this book before, it's very good. So I highly encourage you to read it.
Many of the original rules in Bentley's book had to deal with the vagaries of computer architecture three and a half decades ago. So today, we've created a new set of Bentley rules just dealing with the work of a program. We'll be talking about architecture-specific optimizations later on in the semester. But today, we won't be talking about this.
One cool fact is that John Lewis Bentley is actually my academic great grandfather. So John Bentley was one of Charles Leiseron's academic advisors. Charles Leiserson was Guy Blelloch's academic advisor. And Guy Blelloch, who's a professor at Carnegie Mellon, was my advisor when I was a graduate student at CMU.
So it's a nice little fact. And I had the honor of meeting John Bentley a couple of years ago at a conference. And he told me that he was my academic great grandfather.
[LAUGHING]
Yeah, and Charles is my academic grandfather. And all of Charles's students are my academic aunts and uncles--
[LAUGHING]
--including your T.A. Helen. OK, so here's a list of all the work optimizations that we'll be looking at today. So they're grouped into four categories, data structures, loops, and functions. So there's a list of 22 rules on this slide today. In fact, we'll actually be able to look at all of them today. So today's lecture is going to be structured as a series of many lectures.
And I'm going to be spending one to three slides on each one of these optimizations. All right, so let's start with optimizations for data structures. So first optimization is packing and encoding your data structure.
And the idea of packing is to store more than one data value in a machine word. And the related idea of encoding is to convert data values into a representation that requires fewer bits. So does anyone know why this could possibly reduce the running time of a program? Yes?
AUDIENCE: Need less memory fetches.
JULIAN SHUN: Right, so good answer. The answer was, it might need less memory fetches. And it turns out that that's correct, because computer program spends a lot of time moving stuff around in memory. And if you reduce the number of things that you have to move around in memory, then that's a good heuristic for reducing the running time of your program.
So let's look at an example. Let's say we wanted to encode dates. So let's say we wanted to code this string, September 11, 2018. You can store this using 18 bytes. So you can use one byte per character here.
And this would require more than two double words, because each double word is eight bytes or 64 bits. And you have 18 bytes. You need more than two double words. And you have to move around these words every time you want to manipulate the date. But turns out that you can actually do better than using 18 bytes.
So let's assume that we only want to store years between 4096 BCE and 4096 CE. So there are about 365.25 times 8,192 dates in this range, which is three million approximately. And you can use log base two of three million bits to represent all the dates within this range.
So the notation lg here means log base of two. That's going to be the notation I'll be using in this class. And L-O-G will mean log base 10. So we take the ceiling of log base two or three million, and that gives us 22 bits.
So a good way to remember how to compute the log base two of something, you can remember that the log base two of one million is 20, log base two of 1,000 is 10. And then you can factor this out and then add in log base two of three, rounded up, which is two. So that gives you 22 bits. And that easily fits within one 32-bit words. Now, you only need one word instead of three words, as you did in the original representation.
But with this modified representation, now determining the month of a particular date will take more work, because now you're not explicitly storing the month in your representation. Whereas, with the string representation, you are explicitly storing it at the beginning of the string. So this does take more work, but it requires less space. So any questions so far?
OK, so it turns out that there's another way to store this, which also makes it easy for you to fetch the month, the year, or the day for a particular date. So here, we're going to use the bit fields facilities in C. So we're going to create a struct called date underscore t with three fields, the year, the month, and the date. And the integer after the semicolon specifies how many bits I want to assign to this particular field in the struct.
So this says, I need 13 bits for the year, four bits for the month, and five bits for the day. So the 13 bits for the year is, because I have 8,192 possible years. So I need 13 bits to store that. For the month, I have 12 possible months.
So I need log base two of 12 rounded up, which is four. And then finally, for the day, I need log base two of 31 rounded up, which is five. So in total, this still takes 22 bits.
But now the individual fields can now be accessed much more quickly, than if we had just encoded the three million dates using sequential integers, because now you can just extract a month just by saying whatever you named your struct. You can just say that struct dot month. And that give you the month. Yes?
AUDIENCE: Does C actually store it like that, because I know C++ it makes it finalize. So then you end up taking more space.
JULIAN SHUN: Yeah, so this will actually pad the struct a little bit at the end, yeah. So you actually do require a little bit more than 22 bits. That's a good question. But this representation is much more easy to access, than if you just had encoded the integers as sequential integers.
Another point is that sometimes unpacking and decoding are the optimization, because sometimes it takes a lot of work to encode the values and to extract them. So sometimes you want to actually unpack the values so that they take more space, but they're faster to access. So it depends on your particular application. You might want to do one thing or the other. And the way to figure this out is just to experiment with it.
OK, so any other questions? All right, so the second optimization is data structure augmentation. And the idea here is to add information to a data structure to make common operations do less work, so that they're faster.
And let's look at an example. Let's say we had two singly linked list and we wanted to append them together. And let's say we only stored the head pointer to the list, and then each element in the list has a pointer to the next element in the list.
Now, if you want to spend one list to another list, well, that's going to require you walking down the first list to find the last element, so that you can change the pointer of the last element to point to the beginning of the next list. And this might be very slow if the first list is very long. So does anyone see a way to augment this data structure so that appending two lists can be done much more efficiently? Yes?
AUDIENCE: Store a pointer to the last value.
JULIAN SHUN: Yeah, so the answer is to store a pointer to the last value. And we call that the tail pointer. So now we have two pointers, both the head and the tail. The head points to the beginning of the list. The tail points to the end of the list.
And now you can just append two lists in constant time, because you can access the last element in the list by following the tail pointer. And then now you just change the successor pointer of the last element to point to the head of the second list. And then now you also have to update the tail to point to the end of the second list.
OK, so that's the idea of data structure augmentation. We added a little bit of extra information to the data structure, such that now appending two lists is much more efficient than in the original method, where we only had a head pointer. Questions?
OK, so the next optimization is precomputation. The idea of precomputation is to perform some calculations in advance so as to avoid doing these computations at mission-critical times, to avoid doing them at runtime. So let's say we had a program that needed to use binomial coefficients. And here's a definition of a binomial coefficient.
So it's basically the choose function. So you want to count the number of ways that you can choose k things from a set of n things. And the formula for computing this is, n factorial divided by the product of k factorial and n minus k factorial.
Computing this choose function can actually be quite expensive, because you have to do a lot of multiplications to compute the factorial, even if the final result is not that big, because you have to compute one term in the numerator and then two factorial terms in the denominator. And then you also might run into integer overflow issues, because n factorial grows very fast. It grows super exponentially. It grows like n to the n, which is even faster than two to the n, which is exponential. So doing this computation, you have to be very careful with integer overflow issues.
So one idea to speed up a program that uses these binomials coefficients is to precompute a table of coefficients when you initialize the program, and then just perform table lookup on this precomputed table at runtime when you need the binomial coefficient. So does anyone know what the table that stores binomial coefficients is called? Yes?
AUDIENCE: [INAUDIBLE]
JULIAN SHUN: Yea, Pascal's triangles, good. So here is what Pascal's triangle looks like. So on the vertical axis, we have different values of n. And then on the horizontal axis, we have different values of k. And then to get and choose k, you just go to the nth row in the case column and look up that entry.
Pascal's triangle has a nice property, that for every element, it can be computed as a sum of the element directly above it and above it and to the left of it. So here, 56 is the sum of 35 and 21. And this gives us a nice formula to compute the binomial coefficients.
So we first check if n is less than k in this choose function. If n is less than k, then we just return zero, because we're trying to choose more things than there are in a set. If n is equal to zero, then we just return one, because here k must also be equal to zero, since we had the condition n less than k above. And there's one way to choose zero things from a set of zero things.
And then if k is equal to zero, we also return one, because there's only one way to choose zero things from a set of any number of things. You just don't pick anything.
And then finally, we recursively call this choose function. So we call choose of n minus one k minus one. This is essentially the entry above and diagonal to this. And then we add in choose of n minus one k, which is the entry directly above it.
So this is a recursive function for generating this Pascal's triangle. But notice that we're actually still not doing precomputation, because every time we call this choose function, we're making two recursive calls. And this can still be pretty expensive.
So how can we actually precompute this table? So here's some C code for precomputing Pascal's triangle. And let's say we only wanted coefficients up to choose sides of 100. So we initialize matrix of 100 by 100. And then we call this an init choose function.
So first it goes from n equal zero, all the way up to choose size minus one. And then it says, choose n of zero to be one. It also sets choose of n, n to be one. So the first line is, because there's only one way to choose zero things from any number of things. And the second line is, because there's only one way to choose n things from n things, which is just to pick all of them.
And then now we have a second loop, which goes from n equals one, all the way up to choose size minus one. Then first we set choose of zero n to be zero, because here n is-- or k is greater than n. So there's no way to pick more elements from a set of things that is less than the number of things you want to pick.
And then now you loop from k equals one, all the way up to n minus one. And then your apply this recursive formula. So choose of n, k is equal to choose of n minus one, k minus one plus choose of n minus one k.
And then you also set choose of k, n to be zero. So this is basically all of the entries above the diagonal here, where k is greater than n. And then now inside the program whenever we need a binomial coefficient that's less than 100, we can just do table lookup into this table. And we just index and then just choose array.
So does this make sense? Any questions? It's pretty easy so far, right?
So one thing to note is, that we're still computing this table at runtime, because we have to initialize this table at runtime. And if we want to run our program many times, then we have to initialize this table many times. So is there a way to only initialize this table once, even though we might want to run the program many times? Yes?
AUDIENCE: Put in the source code.
JULIAN SHUN: Yeah, so good, so put it in the source code. And so we're going to do compile-time initialization. And if you put the table in the source code, then the compiler will compile this code and generate the table for you that compile time. So now whenever you run it, you don't have to spend time initializing the table. So idea of compile-time initialization is to store the values of constants during compilation and, therefore, saving work at runtime.
So let's say we wanted choose values up to 10. This is the table, the 10 by 10 table storing all of the binomial coefficients up to 10. So if you put this in your source code, now when you run the program, you can just index into this table to get the appropriate constant here. But this table was just a 10 by 10 table. What if you wanted a table of 1,000 by 1,000?
Does anyone actually want to type this in, a table of 1,000 by 1,000? So probably not. So is there any way to get around this? Yes?
AUDIENCE: You could make a program that uses it. And the function will be defined [INAUDIBLE] prints out the zero [INAUDIBLE].
JULIAN SHUN: Yeah, so the answer is to write a program that writes your program for you. And that's called metaprogramming.
So here's a snippet of code that will generate this table for you. So it's going to call this init choose function that we defined before. And then now it's just going to print out C code.
So it's going to print out the declaration of this array choose, followed by a left bracket. And then for each row of the table, we're going to print another left bracket and then print the value of each entry in that row, followed by a right bracket. And we do that for every row. So this will give you the C code. And then now you can just copy and paste this and place it into your source code.
This is a pretty cool technique to get your computer to do work for you. And you're welcome to use this technique in your homeworks and projects if you'd need to generate large tables of constant values. So this is a very good technique to know. So any questions? Yes?
AUDIENCE: Is there a way to write the output other programs to a file, as oppose to having to copy and paste into the source code?
JULIAN SHUN: Yeah, so you can pipe the output of this program to a file. Yes?
AUDIENCE: So are there compiler tools that can-- so we have three processor tools. Is there [INAUDIBLE] processor can do that? We compile the code, run it, and then [INAUDIBLE].
JULIAN SHUN: Yeah, so I think you can write macros to actually generate this table. And then the compiler will run those macros to generate this table for you. Yeah, so you don't actually need to copy and paste it yourself. Yeah?
CHARLES: And you know, you don't have to write it in C. If it's quicker to write with Python, you'd be writing in Python, just put it in the make file for the system you're building. So if it's in the make file, says, well, we're making this thing, first generate the file in the table and now you include that in whatever you're compiling or/and it's just one more step in the process. And for sure, it's generally easier to write these tables with the scripting language like Python than writing them in C. On the other hand, if you need experience writing in C, practice writing in C.
JULIAN SHUN: Right, so as Charles says, you can write your metaprogram using any language. You don't have to write it in C. You can write it in Python if you're more familiar with that. And it's often easier to write it using a scripting language like Python.
OK, so let's look at the next optimization. So we're already gone through a couple of mini lectures already. So congratulations to all of you who are still here.
So the next optimization is caching. The idea of caching is to store results that have been accessed recently, so that you don't need to compute them again in the program. So let's look at an example.
Let's say we wanted to compute the hypotenuse of a right triangle with side lengths A and B. So the formula for computing this is, you take the square root of A times A plus B times B. OK, so turns out that the square root operator is actually a relatively expensive, more expensive than additions and multiplications on modern machines. So you don't want to have to call the square root function if you don't have to.
And one way to avoid doing that is to create a cache. So here I have a cache just storing the previous hypotenuse that I calculated. And I also store the values of A and B that were passed to the function.
And then now when I call the hypotenuse function, I can first check if A is equal to the cached value of A and if B is equal to the cached value of B. And if both of those are true, then I already computed the hypotenuse before. And then I can just return cached of h.
But if it's not in my cache, now I need to actually compute it. So I need to call the square root function. And then I store the result into cached h. And I also store A and B into cached A and cached B respectively. And then finally, I returned cached h.
So this example isn't actually very realistic, because my cache is only a size one. And it's very unlikely, in a program, you're going to repeatedly call some function with the same input arguments. But you can actually make a larger cache. You can make a cache of size 1,000, storing the 1,000 most recently computer hypotenuse values. And then now when you call the hypotenuse function, you can just check if it's in your cache.
Checking the larger cache is going to be more expensive, because there are more values to look at. But they can still save you time overall. And hardware also does caching for you, as we'll talk about later on in the semester. But the point of this optimization is that you can also do caching yourself. You can do it in software. You don't have to let hardware do it for you.
And turns out for this particular program here, actually, it is about 30% faster if you do hit the cache about 2/3 of the time. So it does actually save you time if you do repeatedly compute the same values over and over again. So that's caching. Any questions?
OK, so the next optimization we'll look at is sparsity. The idea of exploiting sparsity, in an input, is to avoid storage and computing on zero elements of that input. And the fastest way to compute on zero is to just not compute on them at all, because we know that any value plus zero is just that original value. And any value times zero is just zero. So why waste a computation doing that when you already know the result?
So let's look at an example. This is matrix-vector multiplication. So we want to multiply a n by n matrix by a n by one vector. We can do dense matrix-vector multiplication by just doing a dot product of each row in the matrix with the column vector. And then that will give us the output vector.
But if you do dense matrix-vector multiplication, that's going to perform n squared or 36, in this example, scalar multiplies. But it turns out, only 14 of these entries in this matrix are zero or are non-zero. So you just wasted work doing the multiplication on the zero elements, because you know that zero times any other element is just zero.
So a better way to do this, is instead of doing the multiplication for every element, you first check if one of the arguments is zero. And if it is zero, then you don't have to actually do the multiplication. But this is still kind of slow, because you still have to do a check for every entry in your matrix, even though many of the entries are zero.
So it's actually a pretty cool data structure that won't actually store these zero entries. And this will speed up your matrix-vector multiplication if your matrix is sparse enough. So let me describe how this data structure works.
It's called compressed sparse row or CSR. There is an analogous representation called compressed sparse column or CSC. But today, I'm just going to talk about CSR.
So we have three arrays. First, we have the rows array. The length of the rows array is just equal to the number of rows in a matrix plus one. And then each entry in the rows array just stores an offset into the columns array or the cols array. And inside the cols array, I'm storing the indices of the non-zero entries in each of the rows.
So if we take row one, for example, we have rows of one is equal to two. That means I start looking at the second entry in the cols array. And then now I have the indices of the non-zero columns in the first row. So it's just one, two, four, and five. These are the indices for the non-zero entries.
And then I have another array called vals. The length of this array is the same as the cols array. And then this array just stores the actual value in these indices here. So the vals array for row one is going to store four, one, five, and nine, because these are the non-zero entries in the first row.
Right, so the rows array just serves as an index into this cols array. So you can basically get the starting index in this cols array for any row just by looking at the entry stored at the corresponding location in the rows array. So for example, row two starts at location six. So it starts here. And you have indices three and five, which are the non-zero indices.
So does anyone know how to compute the length, the number of non-zeros in a row by looking at the rows array? Yes, yes?
AUDIENCE: You go to the rows array and just drag the [INAUDIBLE]
JULIAN SHUN: Right.
AUDIENCE: [INAUDIBLE] the number of elements that are [INAUDIBLE].
JULIAN SHUN: Yeah, so to get the length of a row, you just take the difference between that row's offset and the next row's offset. So we can see that the length of the first row is four, because it's offset is two. And the offset for row two is six. So you just take the difference between those two entries.
We have an additional entry here. So we have the sixth row here, because we want to be able to compute the length of the last row without overflowing in our array. So we just created an additional entry in the rows array for that.
So turns out that this representation will save you space if your matrix is sparse. So the storage required by the CSR format is order n plus nnz, where nnz is the number of non-zeros in your matrix. And the reason why you have n plus nnz, well, you have two arrays here, cols and vals, whose length is equal to the number of non-zeros in the matrix. And then you also have this rows array, whose length is n plus one. So that's why we have n plus nnz.
And if the number of non-zeros is much less than n squared, then this is going to be significantly more compact than the dense matrix representation. However, this isn't always going to be the most compact representation. Does anyone see why? Why might the dense representation sometimes take less space? Yeah? Sorry.
AUDIENCE: Less space or more space?
JULIAN SHUN: Why might the dense representation sometimes take less space?
AUDIENCE: I mean, if you have not many zeros, then you can figure it out n squared plus something else with the sparse created.
JULIAN SHUN: Right. So if you have a relatively dense matrix, then it might take more space than storing it. It might take more space in the CSR representation, because you have these two arrays. So if you take the extreme case where all of the entries are non-zeros, then both of these arrays are going to be of length and squares. So you already have 2n squared there. And then you also need this rows array, which is of length and plus one.
OK, so now I gave you this more compact representation for storing the matrix. So how do we actually do stuff with this representation? So turns out that you can still do matrix-vector multiplication using this compressed sparse row format. And here's the code for doing it.
So we have this struct here, which is the CSR representation. We have the rows array, the cols array, and then the vals array. And then we also have the number of rows, n, and the number of non-zeros, nnz. And then now what we do, we call this SPMV or sparse matrix-vector multiply. We pass in our CSR representation, which is A, and then the input array, which is x. And then we store the result in an output array y.
So first, we loop through all the rows. And then we set y of i to be zero. This is just initialization. And then for each of my rows, I'm going to look at the column indices for the non-zero elements. And I can do that by starting at k equals to rows of i and going up to rows of i plus one.
And then for each one of these entries, I just look up the index, the column index for the non-zero element. And I can do that with cols of k, so let that be j. And then now I know which elements to multiply. I multiply vals of k by x of j. And then now I just add that to y of i. And then after I finish with all of these multiplications and additions, this will give me the same result as if I did the dense matrix-vector multiplication.
So this is actually a pretty cool program. So I encourage you to look at this program offline, to convince yourself that it's actually computing the same thing as the dense matrix-vector multiplication version. So I'm not going to approve this during lecture today. But you can feel free to ask me or any of your TAs after class, if you have questions about this.
And the number of scalar multiplication that you have to do using this code is just going to be nnz, because you're just operating on the non-zero elements. You don't have to touch all of the zero elements. And in contrast, the dense matrix-vector multiply algorithm would take n squared multiplication. So this can be significantly faster for a sparse matrices.
So turns out that you can also use a similar structure to store a sparse static graph. So I assume many of you have seen graphs in your previous courses. See, here's what the sparse graph representation looks like.
So again, we have these arrays. We have these two arrays. We have offsets and edges. The offsets array is analogous to the rows array. And the edges array is analogous to the columns array for the CSR representation.
And then in this offsets array, we store for each vertex where its neighbor's start in this edges array. And then in the edges array, we just write the indices of its neighbor's there.
So let's take vertex one, for example. The offset of vertex one is two. So we know that its outgoing neighbor start at position two in this edges array. And then we see that vertex one has outgoing edges to vertices two, three, and four. And we see in the edges array two, three, four listed there.
And you can also get the degree of each vertex, which is analogous to the length of each row, by taking the difference between consecutive offsets. So here we see that the degree of vertex one is three, because its offset is two. And the offset of vertex two is five.
And it turns out that using this representation, you can run many classic graph algorithms such as breadth-first search and PageRank quite efficiently, especially when the graph is sparse. So this would be much more efficient than using a dense matrix to represent the graph and running these algorithms.
You can also store weights on the edges. And one way to do that is to just create an additional array called weights, whose length is equal to the number of edges in the graph. And then you just store the weights in that array. And this is analogous to the values array in the CSR representation.
But there's actually a more efficient way to store this, if you always need to access the weight whenever you access an edge. And the way to do this is to interleave the weights with the edges, so to store the weight for a particular edge right next to that edge, and create an array of twice number of edges in the graph. And the reason why this is more efficient is, because it gives you improved cache locality.
And we'll talk much more about cache locality later on in this course. But the high-level idea is, that whenever you access an edge, the weight for that edge will also likely to be on the same cache line. So you don't need to go to main memory to access the weight of that edge again.
And later on in the semester we'll actually have a whole lecture on doing optimizations for graph algorithms. And today, I'm just going to talk about one representation of graphs. But we'll talk much more about this later on. Any questions?
OK, so that's it for the data structure optimizations. We still have three more categories of optimizations to go over. So it's a pretty fun lecture. We get to learn about many cool tricks for reducing the work of your program.
So in the next class of optimizations we'll look at is logic, so first thing is constant folding and propagation. The idea of constant folding and propagation is to evaluate constant expressions and substitute the result into further expressions, all at compilation times. You don't have to do it at runtime. So again, let's look at an example.
So here we have this function called orrery. Does anyone know what orrery means? You can look it up on Google.
[LAUGHING]
OK, so an orrery is a model of a solar system. So here we're constructing a digital orrery. And in an orrery we have these whole bunch of different constants. We have the radius, the diameter, the circumference, cross area, surface area, and also the volume.
But if you look at this code, you can notice that actually all of these constants can be defined in compile time once we fix the radius. So here we set the radius to be this constant here, six million, 371,000. I don't know where that constant comes from, by the way. But Charles made these slides, so he probably does.
CHARLES: [INAUDIBLE]
JULIAN SHUN: Sorry?
CHARLES: Radius of the Earth.
JULIAN SHUN: OK, radius of the Earth. Now, the diameter is just twice this radius. The circumference is just pi times the diameter. Cross area is pi times the radius squared.
Surface area is circumference times the diameter. And finally, volume is four times pi times the radius cube divided by three.
So you can actually evaluate all of these two constants at compile time. So with a sufficiently high level of optimization, the compiler will actually evaluate all of these things at compile time. And that's the idea of constant folding and propagation.
It's a good idea to know about this, even though the compiler is pretty good at doing this, because sometimes the compiler won't do it. And in those cases, you can do it yourself. And you can also figure out whether the compiler is actually doing it when you look at the assembly code.
OK, so the next optimization is common subexpression elimination. And the idea here is to avoid computing the same expression multiple times by evaluating the expression once and storing the result for later use. So let's look at this simple four-line program.
We have a equal to b plus c. The we set b equal to a minus d. Then we set c equal to b plus c. And finally, we set d equal to a minus d.
So notice her that the second and the fourth lines are computing the same expression. They're both computing a minus d. And they evaluate to the same thing.
So the idea of common subexpression elimination would be to just substitute the result of the first evaluation into the place where you need it in future line. So here, we still evaluate the first line for a minus d. But now in the second time we need a minus d. We just set the value to b. So now d is equal to b instead of a minus d.
So in this example, the first and the third line, the right hand side of those lines actually look the same. They're both b plus c. Does anyone see why you can't do common subexpression elimination here?
AUDIENCE: b minus changes the second line.
JULIAN SHUN: Yeah, so you can't do common subexpression for the first and the third lines, because the value of b changes in between. So the value of b changes on the second line. So on the third line when you do b plus c, it's not actually computing the same thing as the first evaluation of b plus c.
So again, the compiler is usually smart enough to figure this optimization out. So it will do this optimization for you in your code. But again, it doesn't always do it for you. So it's a good idea to know about this optimization so that you can do this optimization by hand when the compiler doesn't do it for you. Questions so far?
OK, so next, let's look at algebraic identities. The idea of exploiting algebraic identities is to replace more expensive algebraic expressions with equivalent expressions that are cheaper to evaluate. So let's look at an example.
Let's say we have a whole bunch of balls. And we want to detect whether two balls collide with each other. Say, ball has a x-coordinate, a y-coordinate, a z-coordinate, as well as a radius. And the collision test works as follows.
We set d equal to the square root of the sum of the squares of the differences between each of the three coordinates of the two balls. So here, we're taking the square of b1's x-coordinate minus b2's x-coordinate, and then adding the square of b1's y-coordinate minus b2's y-coordinate, and finally, adding the square of b1 z-coordinate minus b2's z-coordinate. And then we take the square root of all of that. And then if the result is less than or equal to the sum of the two radii of the ball, then that means there is a collision, and otherwise, that means there's not a collision.
But it turns out that the square root operator, as I mentioned before, is relatively expensive compared to doing multiplications and additions and subtractions on modern machines. So how can we do this without using the square root operator? Yes.
AUDIENCE: You add the two radii, and the distance is more than the distance between the centers, then you know that they must be overlying.
JULIAN SHUN: Right, so that's actually a good fast path check. I don't think it necessarily always gives you the right answer. Is there another? Yes?
AUDIENCE: You can square the ignition of the radii and compare that instead of taking the square root of [INAUDIBLE].
JULIAN SHUN: Right, right, so the answer is, that you can actually take the square of both sides. So now you don't have to take the square root anymore. So we're going to use the identity that says, that if the square root of u is less than or equal to v exactly when u is less than or equal to v squared. So we're just going to take the square of both sides. And here's the modified code.
So now I don't have this square root anymore on the right hand side when I compute d squared. But instead, I square the sum of the two radii. So this will give you the same answer. However, you do have to be careful with floating point operations, because they don't work exactly in the same way as real numbers.
So some numbers might run into overflow issues or rounding issues. So you do have to be careful if you're using algebraic identities and floating point computations. But the high-level idea is that you can use equivalent algebraic expressions to reduce the work of your program. And we'll come back to this example late on in this lecture when we talk about some other optimizations, such as the fast path optimization, as one of the students pointed out. Yes?
AUDIENCE: Why do you square the sum of these squares [INAUDIBLE]?
JULIAN SHUN: Which? Are you talking about--
AUDIENCE: Yeah.
JULIAN SHUN: --this line? So before we were comparing d.
AUDIENCE: [INAUDIBLE].
JULIAN SHUN: Yeah, yeah, OK, is that clear? OK. OK, so the next optimization is short-circuiting. The idea here is, that when we're performing a series of tests, we can actually stop evaluating this series of tests as soon as we know what the answer is So here's an example.
Let's say we have an array, a, containing all non-negative integers. And we want to check if the sum of the values in a exceed some limit. So the simple way to do this is, you just sum up all of the values of the array using a for loop. And then at the end, you check if the total sum is greater than the limit. So using this approach, you always have to look at all the elements in the array.
But there's actually a better way to do this. And the idea here is, that once you know the partial sum exceeds the limit that you're testing against, then you can just return true, because at that point you know that the sum of the elements in the array will exceed the limit, because all of the elements in the array are non-negative. And then if you get all the way to the end of this for loop, that means you didn't exceed this limit. And you can just return false.
So this second program here will usually be faster, if most of the time you exceed the limit pretty early on when you loop through the array. But if you actually end up looking at most of the elements anyways, or even looking at all the elements, this second program will actually be a little bit slower, because you have this additional check inside this for loop that has to be done for every iteration. So when you apply this optimization, you should be aware of whether this will actually be faster or slower, based on the frequency of when you can short-circuit the test. Questions?
OK, and I want to point out that there are short-circuiting logical operators. So if you do double ampersand, that's short-circuiting logical and operator. So if it evaluates the left side to be false, it means that the whole thing has to be false. So it's not even going to evaluate the right side.
And then the double vertical bar is going to be a short-circuiting or. So if it knows that the left side is true, it knows the whole thing has to be true, because or just requires one of the two sides to be true. And it's going to short circuit.
In contrast, if you just have a single ampersand or a single vertical bar, these are not short-circuiting operators. They're going to evaluate both sides of the argument. The single ampersand and single vertical bar turn out to be pretty useful when you're doing bit manipulation. And we'll be talking about these operators more on Thursday's lecture. Yes?
AUDIENCE: So if your program going to send false, if it were to call the function and that function was on the right hand side of an ampersand, would it mean that would never get called, even though-- and you possibly now find out that, the right hand side would crash simply because the left hand side was false?
JULIAN SHUN: Yeah, if you use a double ampersand, then that would be true. Yes?
AUDIENCE: [INAUDIBLE] check [INAUDIBLE] that would cause the cycle of left hand, so that the right hand doesn't get [INAUDIBLE].
JULIAN SHUN: Yeah. I guess one example is, if you might possibly index an array out of balance, you can first check whether you would exceed the limit or be out of bounds. And if so, then you don't actually do the index.
OK, a related idea is to order tests, suss out the tests that are more often successful or earlier. And the ones that are less frequently successful are later in the order, because you want to take advantage of short-circuiting. And similarly, inexpensive tests should precede expensive tests, because if you do the inexpensive tests and your test short-circuit, then you don't have to do the more expensive tests. So here's an example.
Here, we're checking whether a character is whitespace. So character's whitespace, if it's equal to the carriage return character, if it's equal to the tab character, if it's equal to space, or if it's equal to the newline character. So which one of these tests do you think should go at the beginning? Yes?
AUDIENCE: Probably the space.
JULIAN SHUN: Why is that?
AUDIENCE: Oh, I mean [INAUDIBLE]. Well, maybe the newline [INAUDIBLE]. Either of those could be [INAUDIBLE].
JULIAN SHUN: Yeah, yeah, so it turns out that the space and the newline characters appear more frequently than the carriage return. And the tab and the space is the most frequent, because you have a lot of spaces in text. So here I've reordered the test, so that the check for space is first. And then now if you have a character, that's a space. You can just short circuit this test and return true.
Next, the newline character, I have it as a second test, because these are also pretty frequent. You have a newline for every new line in your text. And then less frequent is the tab character, and finally, the carriage return for character isn't that frequently used nowadays. So now with this ordering, the most frequently successful tests are going to appear first.
Notice that this only actually saves you work if the character is a whitespace character. It it's not a whitespace character, than you're going to end up evaluating all of these tests anyways.
OK, so now let's go back to this example of detecting collision of balls. So we're going to look at the idea of creating a fast path. And the idea of creating a fast path is, that there might possibly be a check that will enable you to exit the program early, because you already know what the result is.
And one fast path check for this particular program here is, you can check whether the bounding boxes of the two balls intersect. If you know the bounding boxes of the two balls don't intersect, then you know that the balls cannot collide. If the bounding boxes of the two balls do intersect, well, then you have to do the more expensive test, because that doesn't necessarily mean that the balls will collide.
So here's what the fast path test looks like. We're first going to check whether the bounding boxes intersect. And we can do this by looking at the absolute value of the difference on each of the coordinates and checking if that's greater than the sum of the two radii. And if so, that means that for that particular coordinate the bounding boxes cannot intersect. And therefore, the balls cannot collide. And then we can return false of any one of these tests returned true.
And otherwise, we'll do the more expensive test of comparing d square to the square of the sum of the two radii. And the reason why this is a fast path is, because this test here is actually cheaper to evaluate than this test below. Here, we're just doing subtractions, additions, and comparisons.
And below we're using the square operator, which requires a multiplication. And multiplications are usually more expensive than additions on modern machines. So ideally, if we don't need to do the multiplication, we can avoid it by going through our fast path.
So for this example, it probably isn't worth it to do the fast path check since it's such a small program. But in practice there are many applications and graphics that benefit greatly from doing fast path checks. And the fast path check will greatly improve the performance of these graphics programs.
There's actually another optimization that we can do here. I talked about this optimization couple of slides ago. Does anyone see it? Yes?
AUDIENCE: You can factor out the sum of the radii for [INAUDIBLE].
JULIAN SHUN: Right. So we can apply common subexpression elimination here, because we're computing the sum of the two radii four times. We can actually just compute it once, store it in a variable, and then use it for the subsequent three calls.
And then similarly, when we're taking the difference between each of the coordinates, we're also doing it twice. So again, we can store that in a variable and then just use the result in the second time. Any questions?
OK, so the next idea is to combine tests together. So here, we're going to replace a sequence of tests with just one test or switch statement. So here's an implementation of a full adder. So a full adder is a hardware device that takes us input three bits. And then it returns the carry bit and the sum bit as output.
So here's a table that specifies for every possible input to the full adder of what the output should be. And there are eight possible inputs to the full adder, because it takes three bits. And there are eight possibilities there.
And this program here is going to check all the possibilities. It's first going to check if a is equal to zero. If so, it checks if b is equal to zero. If so, it checks if c is equal to zero.
And if that's true, it returns zero and zero for the two bits. And otherwise, it returns one and zero and so on.
So this is basically a whole bunch of if else statements nested together. Does anyone think this is a good way to write the program? Who thinks this is a bad way to write the program? OK, so most of you think it's a bad way to write the program. And hopefully, I can convince the rest of you who didn't raise your hand.
So here's a better way to write this program. So we're going to replace these multiple if else clauses with a single switch statement. And what we're going to do is, we're going to create this test variable. That is a three-bit variable.
So we're going to place the c bit in the least significant digit. The b bit, we're going to shift it over by one, so in the second least significant digit, and then the a bit in the third least significant digit. And now the value of this test variable is going to range from zero to seven. And then for each possibility, we can just specify what the sum and the carry bits should be. And this requires just a single switch statement, instead of a whole bunch of if else clauses.
There's actually an even better way to do this, for this example, which is to use table lookups. You just precompute all these answers, store it in a table, and then just look it up at runtime. But the idea here is that you can combine multiple tests in a single test. And this not only makes your code cleaner, but it can also improve the performance of your program, because you're not doing so many checks. And you won't have as many branch misses. Yes?
AUDIENCE: Would coming up with logic gates for this be better or no?
JULIAN SHUN: Maybe. Yeah, I mean, I encourage you to see if you can write a faster program for this. All right, so we're done with two categories of optimizations. We still have two more to go.
The third category is going to be about loops. So if we didn't have any loops in our programs, well, there wouldn't be many interesting programs to optimize, because most of our programs wouldn't be very long running. But with loops we can actually optimize these loops and then realize the benefits of performance engineering.
The first loop optimization I want to talk about is hoisting. The goal of hoisting, which is also called loop-invariant code motion, is to avoid recomputing a loop-invariant code each time through the body of a loop. So if you have a for loop where in each iteration are computing the same thing, well, you can actually save work by just computing it once.
So in this example here, I'm looping over an array of length N. And them I'm setting Y of i equal to X of i times the exponential of the square root of pi over two. But this exponential square root of pi over two is actually the same in every iteration. So I don't actually have to compute that every time.
So here's a version of the code that does hoisting. I just move this expression outside of the for loop and stored it in a variable factor. And then now inside the for loop, I just have to multiply by factor. I already computed what this expression is. And this can save running time, because computing the exponential, the square root of pi over two, is actually relatively expensive.
So turns out that for this example, you know, the compiler can probably figure it out and do this hoisting for you. But in some cases, the compiler might not be able to figure it out, especially if these functions here might change throughout the program. So it's a good idea to know what this optimization is, so you can apply it in your code when the compiler doesn't do it for you.
OK, sentinels, so sentinels are special dummy values placed in a data structure to simplify the logic of handling boundary conditions, and in particular the handling of loop exit tests. So here, I, again, have this program that checks whether-- so I have this program that checks whether the sum of the elements in sum array A will overflow if I added all of the elements together. And here, I've specified that all of the elements of A are non-negative.
So how I do this is, in every iteration I'm going to increment some by A of i. And then I'll check if the resulting sum is less than A of i. Does anyone see why this will detect if I had an overflow? Yes?
AUDIENCE: We're a closed algorithm. It's not taking any values.
JULIAN SHUN: Yeah, so if the thing I added in causes an overflow, then the result is going to wrap around. And it's going to be less than the thing I added in. So this is why the check here, that checks whether the sum is less than negative i, will detect an overflow.
OK, so I'm going to do this check in every iteration. If it's true, I'll just return true. And otherwise, I get to the end of this for loop where I just return false.
But here on every iteration, I'm doing two checks. I'm first checking whether I should exit the body of this loop. And then secondly, I'm checking whether the sum is less than A of i. It turns out that I can actually modify this program, so that I only need to do one check in every iteration of the loop.
So here's a modified version of this program. So here, I'm going to assume that I have two additional entries in my array A. So these are A of n and A of n minus one. So I assume I can use these locations. And I'm going to set A of n to be the largest possible 64-bit integer, or INT64 MAX. And I'm going to set A of n plus one to be one.
And then now I'm going to initialize my loop variable i to be zero. And then I'm going to set the sum equal to the first element in A or A of zero. And then now I have this loop that checks whether the sum is greater than or equal to A of i. And if so, I'm going to add A of i plus one to the sum. And then I also increment i.
OK, and this code here does the same thing as a thing on the left, because the only way I'm going to exit this while loop is, if I overflow. And I'll overflow if A of i becomes greater than sum, or if the sum becomes less than A of i, which is what I had in my original program. And then otherwise, I'm going to just increment sum by A of i.
And then this code here is going to eventually overflow, because if the elements in my array A don't cause the program to overflow, I'm going to get to A of n. And A of n is a very large integer. And if I add that to what I have, it's going to cause the program to overflow. And at that point, I'm going to exit this for loop or this while loop.
And then after I exit this loop, I can check why I overflowed. If I overflowed because of sum element of A, then the loop index i is going to be less than n, and I return true. But if I overflowed because I added in this huge integer, well, than i is going to be equal to n. And then I know that the elements of A didn't caused me to overflow, the A of n value here did. So then I just return false.
So does this makes sense? So here in each iteration, I only have to do one check instead of two checks, as in my original code. I only have to check whether the sum is greater than or equal to A of i. Does anyone know why I set A of n plus one equal to one? Yes?
AUDIENCE: If everything else in the array was zero, then you still wouldn't have overflowed. If you had been at 64 max, it would overflow.
JULIAN SHUN: Yeah, so good. So the answer is, because if all of my elements were zero in my original array, that even though I add in this huge integer, it's still not going to overflow. But now when I get to A of n plus one, I'm going to add one to it.
And then that will cause the sum to overflow. And then I can exit there. So this is a deal with the boundary condition when all the entries in my array are zero.
OK, so next, loop unrolling, so loop unrolling attempts to save work by combining several consecutive iterations of a loop into a single iteration. Thereby, reducing the total number of iterations of the loop and consequently the number of times that the instructions that control the loop have to be executed.
So there are two types of loop unrolling. There's full loop unrolling, where I unroll all of the iterations of the for loop, and I just get rid of the control-flow logic entirely. Then there's partial loop unrolling, where I only unroll some of the iterations but not all of the iterations. So I still have some control-flow code in my loop.
So let's first look at full loop unrolling. So here, I have a simple program that just loops for 10 iterations. The fully unrolled loop just looks like the code on the right hand side. I just wrote out all of the lines of code that I have to do in straight-line code, instead of using a for loop. And now I don't need to check on every iteration, whether I need to exit the for loop.
So this is for loop unrolling. This is actually not very common, because most of your loops are going to be much larger than 10. And oftentimes, many of your loop bounds are not going to be determined at compile time. They're determined at runtime. So the compiler can't fully unroll that loop for you.
For small loops like this, the compiler will probably unroll the loop for you. But for larger loops, it actually doesn't benefit to unroll the loop fully, because you're going to have a lot of instructions. And that's going to pollute your instruction cache.
So the more common form of loop unrolling is partial loop unrolling. And here, in this example here, I've unrolled the loop by a factor of four. So I reduce the number of iterations of my for loop by a factor of four. And then inside the body of each iteration I have four instructions.
And then notice, I also changed the logic in the control-flow of my for loops. So now I'm incrementing the variable j by four instead of just by one.
And then since n might not necessarily be divisible by four, I have to deal with the remaining elements. And this is what the second for loop is doing here. It's just dealing with the remaining elements.
And this is the more common form of loop unrolling. So the first benefit of doing this is, that you have fewer checks to the exit condition for the loop, because you only have to do this check every four iterations instead of every iteration. But the second and much bigger benefit is, that it allows more compiler optimizations, because it increases the size of the loop body. And it gives the compiler more freedom to play around with code and to find ways to optimize the performance of that code. So that's usually the bigger benefit.
If you unroll the loop by too much, that actually isn't very good, because now you're going to be pleading your instruction cache. And every time you fetch an instruction, it's likely going to be a miss in your instruction cache. And that's going to decrease the performance of your program. And furthermore, if your loop body is already very big, you don't really get additional improvements from having the compiler do more optimizations, because it already has enough code to work with. So giving it more code doesn't actually give you much there.
OK, so I just said this. The benefits of loop unrolling a lower number of instructions and loop control code. And then it also enables more compiler optimizations. And the second benefit here is usually the much more important benefit. And we'll talk more about compiler optimizations in a couple of lectures.
OK, any questions? OK, so the next optimization is loop fusion. This is also called jamming. And the idea here is to combine multiple loops over the same index range into a single loop, thereby saving the overhead of loop control.
So here, I have two loops. They're both looping from i equal zero, all the way up to n minus one. The first loop, I'm computing the minimum of A of i and B of i and storing the result in C of i. The second loop, I'm computing the maximum of A of i and B of i and storing the result in D of i.
So since these are going over the same index rang, I can fused together the two loops, giving me a single loop that does both of these lines here. And this reduces the overhead of loop control code, because now instead of doing this exit condition check two n times, I only have to do it n times.
This also gives you better cache locality. Again, we'll talk more about cache locality in a future lecture. But at a high level here, what it gives you is, that once you load A of i and B of i into cache, when you compute C of i, it's also going to be in cache when you compute D of i. Whereas, in the original code, when you compute D of i, it's very likely that A of i and B of i are going to be kicked out of cache already, even though you brought it in when you computed C of i. For this example here, again, there's another optimization you can do, common subexpression elimination, since you're computing this expression A of i is less than or equal to B of i twice.
OK, next, let's look at eliminating wasted iterations. The idea of eliminating wasted iterations is to modify the loop bounds to avoid executing loop iterations over essentially empty loop bodies. So here, I have some code to transpose, a matrix.
So I go from i equal zero to n minus one, from j equals zero to n minus one. And then I check if i is greater than j. And if so, I'll swap the entries in A of i, j and A of j, i.
The reason why I have this check here is, because I don't want to do the swap twice. Otherwise, I'll just end up with the same matrix I had before. So I only have to do the swap when i is greater than j.
One disadvantage of this code here is, I still have to loop for n squared iterations, even though only about half of the iterations are actually doing useful work, because about half of the iterations are going to fail this check here, that checks whether i is greater than j. So here's a modified version of the program, where I basically eliminate these wasted iterations. So now I'm going to loop from i equals one to n minus one, and then from j equals zero all the way up to i minus one.
So now instead of going up to n minus one, I'm going just up to i minus one. And that basically puts this check, whether i is greater than j, into the loop control code. And that saves me the extra wasted iterations.
OK, so that's the last optimization on loops. Are there any questions? Yes?
AUDIENCE: Isn't the checks still have [INAUDIBLE]?
JULIAN SHUN: So the check is-- so you still have to do the check in the loop control code. But here, you also had to do it. And now you just don't have to do it again inside the body of the loop. Yes?
AUDIENCE: In some cases, where it might be more complex to do it, is it also [INAUDIBLE] before you optimize it, but it's still going to be fast enough [INAUDIBLE]. Like in the first example, even though the loop is empty, most of the time you'll be able to process [INAUDIBLE] run the instructions.
JULIAN SHUN: Yes, so most of these are going to be branch hits. So it's still going to be pretty fast. But it's going to be even faster if you just don't do that check at all. So I mean, basically you should just text it out in your code to see whether it will give you a runtime improvement.
OK, so last category of optimizations is functions. So first, the idea of inlining is to avoid the overhead of a function call by replacing a call to the function with the body of the function itself. So here, I have a piece of code that's computing the sum of squares of elements in an array A.
And so I have this for loop that in each iteration is calling this square function. And the square function is defined above here. It just does x times x for input argument x.
But it turns out that there is actually some overhead to doing a function call. And the idea here is to just put the body of the function inside the function that's calling it. So instead of calling the square function, I'm just going to create a variable temp. And then I set sum equal to sum plus temp times temp. So now I don't have to do the additional function call.
You don't actually have to do this manually. So if you declare your function to be static inline, then the compiler is going to try to inline this function for you by placing the body of the function inside the code that's calling it. And nowadays, the compiler is pretty good at doing this. So even if you don't declare static inline, the compiler will probably still inline this code for you. But just to make sure, if you want to inline a function, you should declare it as static inline.
And you might ask, why can't you just use a macro to do this? But it turns out, that inline functions nowadays are just as efficient as macros. But they're better structured, because they evaluate all of their arguments. Whereas, macros just do a textual substitution.
So if you have an argument that's very expensive to evaluate, the macro might actually paste that expression multiple times in your code. And if the compiler isn't good enough to do common subexpression elimination, then you've just wasted a lot of work.
OK, so there's one more optimization-- or there's two more optimizations that I'm not going to have time to talk about. But I'm going to post these slides on Learning Modules, so please take a look at them, tail-recursion elimination and coarsening recursion.
So here are a list of most of the roles that we looked at today. There are two of the function optimizations I didn't get to talk about, please take a look at that offline, and ask your TAs if you have any questions. And some closing advice is, you should avoid premature optimizations.
So all of the things I've talked about today improve the performance of your program. But you first need to make sure that your program is correct. If you have a program that doesn't do the right thing, then it doesn't really benefit you to make it faster.
And to preserve correctness, you should do regression testing, so develop a suite of tests to check the correctness of your program every time you change the program. And as I said before, reducing the work of a program doesn't necessarily decrease its running time. But it's a good heuristic. And finally, the compiler automates many low-level optimizations. And you can look at the assembly code to see whether the compiler did something.