UNDERGRADUATE COMPUTATIONAL ENGINEERING AND SCIENCE

UCES

"Multiprocessor Computing and Heat Transfer"

Course:   Methods and Analysis in UCES

Prerequisites:

  1. Math:   Algebra 2
  2. Computing:   Nested loops, arrays
  3. Science:   Physics

Objectives:

  1. Math:   Another time and space model
  2. Computing:   Amdahl's speedup model, parallel algorithms
  3. Science:   Combination of Fourier and Newton model

    General Information: Calculations are done on the Alliant FX/40 which has two processors each with a four segment vector pipeline. These calculations illustrate speedup, and the model the heat conduction in a thin wire which is being cooled by the surrounding region.

    Contact Person:   R. E. White, NCSU, white@math.ncsu.edu

    Revision Date:   7-28-94

    Copyright ©1994, R. E. White. All rights reserved.

    Return to Table of Contents


    1.4   Multiprocessor Computing and Heat Transfer

    1.4.1.   Introduction.

    In this lecture we will describe multiprocessing computers. We will study the next heat transfer model which is a variation of the thin wire problem, but now we will allow heat to be lost to the surrounding region. We will present some calculations for this problem using a vector/multiprocessing computer.

    A multiprocessing computer is a computer with more than one "tightly" coupled CPU. Here "tightly" means that there is relatively fast communication among the CPUs; this is in contrast with a "network" of computers. There are several classification schemes that are commonly use to describe various multiprocessors: memory, data streams, and interconnection.

    Two examples of memory are shared and distributed. The shared memory multiprocessors communicate via the global shared memory. The distributed memory multiprocessors communicate by explicit message passing which must be part of the computer code. Shared memory multiprocessors often have in code directives that indicate which parts are to be executed concurrently. The Cray Y-MP is an example of a shared memory computer, and an Intel hypercube is an example of a distributed computer as given in the following figure.

    Figure: Two Common Memory Types

    Classification by data streams has two main categoies: SIMD and MIMD. The first represents single instruction and multiple data, and an example is a vector pipeline. The second is multiple instruction and multiple data. The Cray Y-MP is a example of an MIMD. One can send different data and different code to the various processors. However, MIMD computers are often programmed like SIMD computers, that is, the same code is executed, but different data is input to the various CPUs.

    Interconnection schemes are important because of certain types of applications. For example in a closed loop system a ring interconnection might be the best. Or, if a problem requires a great deal of communication between processors, then the complete interconnection scheme might be appropriate. The ring interconnection has only two paths per processor regardless of the number of processors. The complete interconnection will have p-1 paths per processor where p is the number of processors (see the figure below). An interconnection scheme that is a compromise between these two extremes is the hypercube distrubuted memory which will have p = 2d processors with d paths per processor (see the above figure where d = 3).

    Figure: Ring and Complete Interconnection

    Another important consideration is the number of processors to be used. In order to be able to effectively use p processors, one must have p independent tasks to be performed. Vary rarely is this exactly the case; parts of the code may have no independent parts, two independent parts and so forth. In order to model the effectiveness of a multiprocessor with p processors, Amdahl's timing model has been widely used. It makes the assumption that alpha is the fraction with p independent parts and the rest (1 - alpha) has one independent part.

    Amdahl's Timing Model.

    Let p = the number of processors,

    alpha = the fraction with p independent parts,

    1 - alpha = the fraction with one independent part,

    T1 = serial execution time,

    (1 - alpha)T1   = execution time for the 1 independent part and

    alpha T1/p = execution time for the p independent parts.

    Speedup = Sp (alpha ) =

    Example. Consider a dotproduct of two vectors of dimension 100. There are 100 scalar products and 99 additions, and we may measure execution time in terms of operations so that T1 = 199. If p = 4 and the dotproducts are broken into four smaller dotproducts of dimension 25, then the parallel part will have 4(49) operations and the serial part will require 3 operations to add the smaller dotpoducts. Thus, alpha = 196/199 and S4 = 199/52.

    If alpha = 1, then the speedup is p, the ideal case. If alpha = 0, then the speedup is 1! Another parameter is the efficiency which is defined to be the speedup divided by the number of processors. Thus, for a fixed code the alpha will be fixed, and the efficiency will decrease as the number of processors increases. Another way to view this is in the following table where alpha = .9 and p varies from 2 to 16.

    Table: Speedup and Efficiency for alpha = .9
    Number of
      processors  
      Speedup     Efficiency  
    2 1.8 .90
    4 3.1 .78
    8 4.7 .59
    16 6.4 .40


    1.4.2.   Applied Area.

    We return to the problem in the previous section where we were interested in the heat transfer within a thin electrical wire. There we assumed that the wire did not loose any heat to the surrounding region.

    We will assume that there is some loss of heat to the surrounding region and that the amount will be given by a Newton-like law of cooling. The amount of heat lost to the surrounding region should be proportional to the difference in the surrounding temperature and the temperature of the small cylindrical segment of the wire. Other factors should include the time step, the lateral surface area, and the effectiveness of the insulation on the lateral surface.

    The importance of this type of problem is related to electrical sensors for wind speed and temperatures of the surrounding region. If the temperature of the wire changes, then the electrical resistivity will change and will alter the current running through the wire. This current can be calibrated with conditions of the surrounding region.


    1.4.3.   Model.

    The change in heat content over a small time interval delta t for a small cylindrical segment of volume hA where A = pi r2 is the cross section area of the wire will be a sum of the heat from electrical resistance, diffusion from the right side, diffusion from the left side and heat loss through the lateral surface. The last term is the additional contribution and will have the form

    csurdelta t pi 2rh (usur - uik )
    where csur is the proportionality constant.

    If csur = 0, then this is the perfect insulation case and is identical to the model of the previous section. As csur increases to +infinity, the internal temperature of the wire will approach the surrounding temperature, usur. The form of all this, using the notation of the previous section, is

    rho c uik+1 Ah - rho c uik Ah = A h delta t f
    + A delta t K (ui+1k - uik)/h
    - A delta t K (uik - ui-1k)/h
    + csur delta t pi 2rh (usur - uik).

    Now, divide by rho cAh, let alpha = K/(rho c)(delta t/h2 ) and explicitly solve for uik+1 .

    Explicit Model of Heat Transfer with Cooling to the Surrounding Region.

    uik+1 = (delta t/(rho c)) (f + csur (2/r)usur ) + alpha (ui+1k + ui-1k )
        + (1 - (delta t/(rho c)) csur (2/r) - 2 alpha) uik
    (1)
    where

    i = 1,...,n-1,
    k = 0,...,maxit-1,
    ui0 = 0 for i = 1,...,n-1 and (2)
    u0k = unk = 0 for k= 1,...,maxit. (3)

    Equation (2) is the initial temperature set equal to zero and (3) is the temperature at the left and right ends set equal to zero.

    An extremely important restriction on the time step delta t is to make sure the algorithm is stable. For example, consider the case n = 2 where the above is a scalar equation and we have the simplest first order finite difference model. Here a = 1 - (delta t/(rho c)) csur (2/r) - 2 alpha, and we must require |a| < 1. If a = 1 - (delta t/(rho c)) csur (2/r) - 2 alpha 0 , csur 0, and alpha > 0, then this condition will hold. If n is larger than 2, this simple condition will also imply stability. This will make sure there are no blowups provided the source term f is bounded. The illustration of the stability condition and an analysis will be presented later.

    Stability Condition for (1).

    1 - (delta t/(rho c)) csur (2/r) - 2 alpha 0 , csur 0, and alpha > 0.


    1.4.4.   Method.

    The following is a slight modification of the explicit algorithm in the last lecture. The inner most loop has been modified. Also, one will have to be careful to further restrict the time step as csur increases or as r decreases.

    Explicit Algorithm for Problem (1), (2) and (3).

            define f, n, maxit, rho, c, cond, dt, dx, csur, usur, r
            kappa = cond/(rho*c)
            alpha = kappa*dt/(dx*dx)
            for i = 1,n+1
                u(i,1) = 0
            endloop
            for k = 1,maxit+1
                u(1,k) = 0
                u(n+1,k) = 0
            endloop
            for k = 1,maxit
                for i = 2,n
                    u(i,k+1) = dt/(rho*c)*(f + csur *(2/r)*usur)
                                        + alpha*(u(i-1,k) + u(i+1,k))  
                                        + (1 - dt/(rho*c)*csur*2/r - 2*alpha)*u(i,k)
                endloop
            endloop.
    


    1.4.5.   Implementation.

    The first implementation uses Maple and illustrates the stability condition. Here we have set rho = c = 1, K = .001, r = .1, csur = .01 and usur = 0. Then the stablity condition requires

    1 - .4 delta t 0.

    Now the time step must be less than 2.5 and not 5.0 when we have perfect insulation, csur = 0. In the calculations we used a time step equal to one, and the solution converged quickly to the steady state solution indicated in the graph below where the time steps are along the column axis and the space steps are along the row axis.

    Maple Code for Heat Transfer in a Thin Wire with Cooling to Surrounding Region.

    
       Input Data:
            > f:=1:
            > Kappa:=.001:
            > n:=10:
            > dx := .1:
            > dt := 1:
            > n:=10:
            > maxit:= 30:
    
      Procedure is Defined:
            > HEAT1D:=proc(f,Kappa,dt,dx,n,maxit)
            > with(linalg):
            > with(plots):
            > csur := .01:
            > r:= .1:
            > u:=matrix(n+1,maxit+1,0):
            > alpha := Kappa*dt/(dx*dx):
            > for k from 1 to maxit do
                     for i from 2 to n do
                         u[i,k+1]:= dt*f + alpha*(u[i-1,k] + u[i+1,k]) + 
                                        (1-2*alpha - dt*csur*2/r)*u[i,k];
            >     od;
            > od;
            > matrixplot(u,style=wireframe,axes=normal);
            > end;
    
      Output is Graphed:
    
    Figure: Temperature of Wire with r = .1 and csur = .01

    The second implementation is on the Alliant FX/40 which has two processors each with a vector pipeline. The listing below has the code with parallel directives and an analysis of which loops were done either concurrently or vector or both. Concurrent computations of the inner most loop, loop 30, is achieved by dividing the length into two equal parts. This is done by the array index which is define in the beginning of the program. Loop 25 is to be done concurrently by the two available processors. The work inside loop 30 can be done by the vector pipeline associated with each processor.

    There are three directives being used. The 'cvd$l concur' directive turns on the concurrent calculation of the following loop. The directive 'cvd$l nodepchk' turns off the check for data dependency (lack of independent calculations) in the following loop. The 'cvd$l novector' directive will turn off the vector pipe calculations in the following loop.

    The loop summary analysis of the compiled code is given in the lower part. Loops 12 and 10 are just initializing the temperature array to zero. All calculations are independent. The outer time loop, loop 20, was not done in either concurrent or vector mode. This loop has a serious data dependency because in order to compute the temperature at the next time step, one must know the temperature at the previous time. Loop 25 was done concurrently as was directed. Also, loop 30 was done using a vector pipeline.

    Alliant FX/40 Fortran Code and Analysis of Parallelism for the Explicit Finite Difference Algorithm.

        Alliant FX/Fortran Compiler V4.2.40 (2.24D41)  23 Mar 1994 15:48:49  Page 1
    
        Source Listing of /usr/users/faculty/white/ma673.dir/heatp.f
    
        1         Program heat
        2         dimension u(512,80),index(3,2)
        3         real tt1(2),tt2(2)
        4         n = 510
        5         index(1,1) = 2
        6         index(1,2) = n/2
        7         index(2,1) = index(1,2) + 1
        8         index(2,2) = n
        9         maxit = 30
       10         f = 1.0 
       11         kappa = .001
       12         dt = 5
       13         dx = .1
       14         alpha = kappa*dt/(dx*dx) 
       15         do 10 k = 1,maxit
       16           do 12 i = 1,n+1
       17             u(i,k) = 0.0
       18    12     continue
       19    10   continue
       20         tbegin = etime(tt1)
       21         do 20 k = 1,maxit
       22   cvd$l concur
       23   cvd$l nodepchk
       24           do 25 ipar=1,2
       25   cvd$l vector
       26   cvd$l nodepchk
       27             do 30 i = index(ipar,1),index(ipar,2)
       28               u(i,k+1) = dt*f + alpha*(u(i-1,k) + u(i+1,k)) +
       29        $                 (1 - .2*dt - 2*alpha)*u(i,k)
       30    30       continue
       31    25     continue
       32    20   continue
       33         tend = etime(tt2)
       34         print*,  'time =', tend - tbegin
       35         stop
       36         end
    
    
    
        Alliant FX/Fortran Compiler V4.2.40 (2.24D41)  23 Mar 1994 15:48:49  Page 2
    
        Source Listing of /usr/users/faculty/white/ma673.dir/heatp.f
    
    
      -------------------    LOOP SUMMARY FOR ROUTINE HEAT    -------------------
    
       LABEL  INDEX   START   END NEST    COMMENT          CD? DP?    ITERATIONS
    
        10      K      15     19   1      CONCURRENT OUTER                30
        12      I      16     18   2      VECTORIZED                     511
        20      K      21     32   1      NOT CHOSEN                      30
        25      IPAR   24     31   2      CONCURRENT OUTER                 2
        30      I      27     30   3      VECTORIZED
    	
           2 VECTOR        0 CONCURRENT    0 VECT-CONCR    2 CON. OUTER
           1 NOT OPTMZD    0 NOT INNER     0 TOO SHORT     0 DEPENDENT
           5 TOTAL         5 EXAMINED      4 OPTIMIZED
    
    

    In order to see the effect of concurrent and vector calculations, we did four runs which are recorded in the table below. The serial mode is done by turning off both concurrent and vector computations: use noconcur and novector. The remaining three calculations can be done by the appropriate use of noconcur, concur, novector and vector. The ideal speedup for concurrent calculations is two because the computer has two processors. The ideal speedup for vectorization is four because the vector pipelines of this computer have four segments. In both calculations, we came fairly close. Larger problems (larger n) will increase the speedups. Also, one can combine concurrent and vector computations for at most a speedup of eight; we got about six for this particular problem.

    Table: Effect of Vector/Multiprocessing
    Compiler Option Execution Times (10-3 sec.)
    Serial 38.7
    Concurrent, 2 processors 20.4
    Vector, 1 processor 10.8
    Concurrent/Vector, 2 processors 6.5


    1.4.6.   Assessment.

    The effective use of vector pipelines and multiprocessing computers will depend on the particular code being executed. There must exist independent calculations within the code. Some computer codes have a large number of independent parts and some have almost none. The use of timing models can give insight to possible performance of codes. Also, some codes can be restructured to have more independent parts.

    The modified heat transfer problem for the thin wire resulted in a more restrictive stability condition on the time step. In this case, this restriction was not too severe as the problem reached the steady state solution very quickly. However, if the surrounding temperature is a function of time, this may not happen over such a short time interval. Many computer simulations range over periods of years. In such cases these restrictions on the time step may be too severe. Later we will describe alternative methods to the explicit time discretization.


    1.4.7.   Homework.

    1. Consider the dotproduct example of Amdahl's timing model. Repeat the calculations for n = 200 and 400. Why does the efficiency increase?

    2. If an engineering firm usually runs jobs where the fraction of computation that have four independent part is .8, should the firm buy one multiprocessor with four processors for $80,000, or should they buy four serial workstations for $23,000 each? Assume the performance of each CPU is the same.

    3. Use the above Maple code and observe the stability condition.

    4. Use the above Maple code and vary csur = 0, .0001, .001, .01, .1.

    5. Use the above Maple code and vary usur = 0, 10, 50, 100.

    6. Use the above Maple code with usur = 100t.