Administrative announcement: your accounts on kaduna.mines.edu and kafanchan.mines.edu (the two PPL machines) have been setup. If you have any questions, please contact the PPL systems administrator, Steve Chadsey (schadsey@mines).
Today, we will continue our discussion on algorithms, but we will start looking at how parallel algorithms can be analyzed.
Before we do that, though, let us test our knowledge on a few simple examples. Suppose, we had:
Can you come up with constants n and c such that
?
If so, we can say that f(x) is upper bounded by .
Now, can you come up with constants such that ?
If so, f(x) is both upper and lower bounded by
, and
we say that we have found a tight bound on f(x).
To determine the time complexity of an algorithm, you need to count the number of time steps that the algorithm takes, as a function of the problem size, and for the worst case input. To do this, you need to make some simplifying assumptions. For instance, you might choose to assume that all sequential machines have the same speed. Also, you might assume that every arithmetic operation (addition, subtraction, multiplication, division, etc.) take the same number of steps (say one step). Although these assumptions might not be realistic, they do not affect our conclusions since we are interested in the gross behavior of the algorithm. Once we have a general idea of the overall behavior, we can fill in the details of an actual implementation in the Implementation phase.
To obtain an accurate representation of the number of steps that an algorithm takes, one often has to use skills such as summation of series, or solving recurrence relations. An introductory text on algorithm analysis can provide you with these skills if you do not already have them. Once you have counted the number of steps that your algorithm requires, you are probably left with an expression for the time complexity that is a (hopefully polynomial) function of the size of the problem. To arrive at the asymptotic complexity, one often:
The Complexity of a problem is defined to be the minimum complexity among all algorithms for solving the problem (thus, this is the lower bound on the complexities of all algorithms that can solve that problem). To design an optimal algorithm for that problem, one has to come up with an algorithm whose worst case upper bound complexity matches the complexity of the problem.
Parallel Computational Complexity: Things get a little more complicated when we extend the analysis of algorithms from the sequential to the parallel domain. The one obvious complication is that now, the complexity of an algorithm depends not only on the size of the problem (n), but also on the number of processoring elements (PEs) available to you (we will refer to this quantity as p). Sequential algorithm complexity assumed that we were dealing with the von Neumann/RAM model of computation. When we talk about parallel algorithm complexity, it is important however to specify which model of parallel machine we are talking about (e.g. PRAM, multicomputer, etc.). To complicate things further, we will need to consider how the PEs are connected with each other-but we will ignore that actual communication cost for the moment, and assume that communicating one item of data takes one time step. Thus, the time complexity of an algorithm is referred to by a function of two parameters, T(n,p). To determine the time complexity, we count the total number of ``parallel'' time steps-the amount of real time used by the algorithm.
As an example of parallel complexity, an O(n) sequential algorithm is often transformed to the following parallel complexity:
The first term in the equation refers to the amount of work done locally on each of the PEs, and the log(p) time is spent on combining the results. Draw out the operations taking place for this setup. What happens when each of the local operations can be performed in constant (also referred to as O(1)) time?
An important consideration for parallel algorithms is the speedup that one attains over the sequential version of the algorithm. This is defined formally as:
What is the maximum speedup that one can expect by parallelizing an algorithm? Since there are p PEs solving the problem, usually, the best one can do is to get a speedup S(n,p) of p. This kind of speedup is usually called linear speedup, and is highly desirable. In this case, all the PEs are working equally at maximum efficiency. We therefore define the efficiency of the parallel algorithm to be 1 (the highest possible). In other cases, the efficiency of a parallel algorithm is defined as:
You will use both these definitions to evaluate the quality of your parallel solution (once you have chosen the appropriate model of parallel computation). Notice that all this can be done during the design (method) stage-well before you actually implement the solution! This analysis will provide you with a chance to predict the benefits of using parallel computing to solve the problem. Of course, once you have implemented the algorithm, you will also report actual speedup, and discuss its relationship with the theoretical speedup (if any :=).
Another concept that is important to look into is the concept of ``processor-time optimality''. This defines a parallel algorithm to be optimal if the product of the number of processors used times the parallel time complexity is equal to the time complexity of the best sequential algorithm to solve the problem. You will asked to evaluate your solution using this criterion too.
Next, let us look at a parallel algorithm, and analyze it in the above terms (this discussion is inspired by Highly Parallel Computing by Almasi and Gottlieb (Benjamin/Cummings Publishing House).
The example we look at involves summing n numbers. Sequentially, this can be done by initializing a variable to zero, and then adding each of the n numbers to it-this results in an optimal algorithm with complexity O(n). One way to parallelize this algorithm is to add pairs of numbers in parallel, and do this recursively on the resulting sums till we get a single number. On a distributed memory message passing machine, this would optimally require a binary tree interconnection network. Let us start with the case where we have one processor per number, p = n.
Architecture: If the sum was computed on a parallel machine with a binary tree interconnection, the height of this tree would be given by:
This gives the height h to be:
Initial data mapping: We start off by assigning each processor one number (since p = n).
Algorithm: The summation is done by each leaf node sending its item up one level-the nodes at this level add the two items they receive, to the locally stored item, and pass the result up the tree. The algorithm terminates when we reach the root.
Analysis: Each local operation is an addition of three numbers, and can be done in constant (O(1)) time. Thus, we are essentially looking at communication time that is given by the height of the tree:
What is the speedup and efficiency in this case? What comments do you have about the efficiency in this case (both intuitively, as well as from the formula).