UCES
Course: Methods and Analysis in UCES
Applied Area: heat diffusion and surrounding region
Model: discrete Fourier and Newton heat laws, stability
Method: explicit time discretization
Implementation: Maple, Fortran (multiprocessing)
Prerequisites:
Objectives:
General Information: Calculations are done on the Alliant FX/40 which has two processors each with a four segment vector pipeline. These calculations illustrate speedup, and the model the heat conduction in a thin wire which is being cooled by the surrounding region.
Contact Person: R. E. White, NCSU, white@math.ncsu.edu
Revision Date: 7-28-94
Copyright ©1994, R. E. White. All rights reserved.
In this lecture we will describe multiprocessing computers. We will study the next heat transfer model which is a variation of the thin wire problem, but now we will allow heat to be lost to the surrounding region. We will present some calculations for this problem using a vector/multiprocessing computer.
A multiprocessing computer is a computer with more than one "tightly" coupled CPU. Here "tightly" means that there is relatively fast communication among the CPUs; this is in contrast with a "network" of computers. There are several classification schemes that are commonly use to describe various multiprocessors: memory, data streams, and interconnection.
Two examples of memory are shared and distributed. The shared memory multiprocessors communicate via the global shared memory. The distributed memory multiprocessors communicate by explicit message passing which must be part of the computer code. Shared memory multiprocessors often have in code directives that indicate which parts are to be executed concurrently. The Cray Y-MP is an example of a shared memory computer, and an Intel hypercube is an example of a distributed computer as given in the following figure.
![]() |
| Figure: Two Common Memory Types |
Classification by data streams has two main categoies: SIMD and MIMD. The first represents single instruction and multiple data, and an example is a vector pipeline. The second is multiple instruction and multiple data. The Cray Y-MP is a example of an MIMD. One can send different data and different code to the various processors. However, MIMD computers are often programmed like SIMD computers, that is, the same code is executed, but different data is input to the various CPUs.
Interconnection schemes are important because of certain types of applications. For example in a closed loop system a ring interconnection might be the best. Or, if a problem requires a great deal of communication between processors, then the complete interconnection scheme might be appropriate. The ring interconnection has only two paths per processor regardless of the number of processors. The complete interconnection will have p-1 paths per processor where p is the number of processors (see the figure below). An interconnection scheme that is a compromise between these two extremes is the hypercube distrubuted memory which will have p = 2d processors with d paths per processor (see the above figure where d = 3).
![]() |
| Figure: Ring and Complete Interconnection |
Another important consideration is the number of processors to be used.
In order to be able to effectively use p processors, one must have p
independent tasks to be performed. Vary rarely is this exactly the case;
parts of the code may have no independent parts, two independent parts and so
forth. In order to model the effectiveness of a multiprocessor with p
processors, Amdahl's timing model has been widely used. It makes the
assumption that
is the fraction with p
independent parts and the rest
)
| Let | p | = the number of processors, |
![]() |
= the fraction with p independent parts, | |
1 - ![]() |
= the fraction with one independent part, | |
| T1 | = serial execution time, | |
(1 - )T1 |
= execution time for the 1 independent part and | |
T1/p |
= execution time for the p independent parts. | |
| Speedup | = Sp ( ) =
|
Example. Consider a dotproduct of two vectors of dimension 100. There
are 100 scalar products and 99 additions, and we may measure execution time in
terms of operations so that T1 = 199. If p = 4 and the dotproducts
are broken into four smaller dotproducts of dimension 25, then the parallel
part will have 4(49) operations and the serial part will require 3 operations
to add the smaller dotpoducts. Thus,
= 196/199
If
= 1, then the speedup is p, the ideal
case. If
= 0,
will be fixed, and the efficiency
will decrease as the number of processors increases. Another way to view this
is in the following table where
= .9
Table: Speedup and Efficiency for
= .9
|
||
| Number of processors |
Speedup | Efficiency |
|---|---|---|
| 2 | 1.8 | .90 |
| 4 | 3.1 | .78 |
| 8 | 4.7 | .59 |
| 16 | 6.4 | .40 |
We return to the problem in the previous section where we were interested in the heat transfer within a thin electrical wire. There we assumed that the wire did not loose any heat to the surrounding region.
We will assume that there is some loss of heat to the surrounding region and that the amount will be given by a Newton-like law of cooling. The amount of heat lost to the surrounding region should be proportional to the difference in the surrounding temperature and the temperature of the small cylindrical segment of the wire. Other factors should include the time step, the lateral surface area, and the effectiveness of the insulation on the lateral surface.
The importance of this type of problem is related to electrical sensors for wind speed and temperatures of the surrounding region. If the temperature of the wire changes, then the electrical resistivity will change and will alter the current running through the wire. This current can be calibrated with conditions of the surrounding region.
The change in heat content over a small time interval
t for a small
cylindrical segment of volume hA where
r2
t
2rh (usur
- uik )
If csur = 0, then this is the perfect insulation case and is
identical to the model of the previous section. As csur increases
to
,
c
uik+1 Ah - c
uik Ah = |
A h t f + A t K (ui+1k - uik)/h- A t K (uik - ui-1k)/h+ csur t 2rh (usur -
uik). |
Now, divide by
cAh, let
= K/(
c)(
t/h2 )
Explicit Model of Heat Transfer with Cooling to the Surrounding Region.
| uik+1 = | ( t/( c)) (f + csur (2/r)usur ) +
(ui+1k + ui-1k )
+ (1 - ( t/( c)) csur (2/r) - 2 )
uik |
(1) |
| where | ||
| i = 1,...,n-1, k = 0,...,maxit-1, |
||
| ui0 = 0 for i = 1,...,n-1 and | (2) | |
| u0k = unk = 0 for k= 1,...,maxit. | (3) | |
Equation (2) is the initial temperature set equal to zero and (3) is the temperature at the left and right ends set equal to zero.
An extremely important restriction on the time step
t is to make sure the
algorithm is stable. For example, consider the case
t/(
c)) csur (2/r) - 2
,
t/(
c)) csur (2/r) - 2
0 ,
0,
> 0,
t/(
c)) csur (2/r) - 2
0 ,
csur
0, and
> 0.
The following is a slight modification of the explicit algorithm in the last lecture. The inner most loop has been modified. Also, one will have to be careful to further restrict the time step as csur increases or as r decreases.
Explicit Algorithm for Problem (1), (2) and (3).
define f, n, maxit, rho, c, cond, dt, dx, csur, usur, r
kappa = cond/(rho*c)
alpha = kappa*dt/(dx*dx)
for i = 1,n+1
u(i,1) = 0
endloop
for k = 1,maxit+1
u(1,k) = 0
u(n+1,k) = 0
endloop
for k = 1,maxit
for i = 2,n
u(i,k+1) = dt/(rho*c)*(f + csur *(2/r)*usur)
+ alpha*(u(i-1,k) + u(i+1,k))
+ (1 - dt/(rho*c)*csur*2/r - 2*alpha)*u(i,k)
endloop
endloop.
The first implementation uses Maple and illustrates the stability condition.
Here we have set
= c = 1,
t
0.
Now the time step must be less than 2.5 and not 5.0 when we have perfect insulation, csur = 0. In the calculations we used a time step equal to one, and the solution converged quickly to the steady state solution indicated in the graph below where the time steps are along the column axis and the space steps are along the row axis.
Maple Code for Heat Transfer in a Thin Wire with Cooling to Surrounding Region.
Input Data:
> f:=1:
> Kappa:=.001:
> n:=10:
> dx := .1:
> dt := 1:
> n:=10:
> maxit:= 30:
Procedure is Defined:
> HEAT1D:=proc(f,Kappa,dt,dx,n,maxit)
> with(linalg):
> with(plots):
> csur := .01:
> r:= .1:
> u:=matrix(n+1,maxit+1,0):
> alpha := Kappa*dt/(dx*dx):
> for k from 1 to maxit do
for i from 2 to n do
u[i,k+1]:= dt*f + alpha*(u[i-1,k] + u[i+1,k]) +
(1-2*alpha - dt*csur*2/r)*u[i,k];
> od;
> od;
> matrixplot(u,style=wireframe,axes=normal);
> end;
Output is Graphed:
![]() |
| Figure: Temperature of Wire with r = .1 and csur = .01 |
The second implementation is on the Alliant FX/40 which has two processors each with a vector pipeline. The listing below has the code with parallel directives and an analysis of which loops were done either concurrently or vector or both. Concurrent computations of the inner most loop, loop 30, is achieved by dividing the length into two equal parts. This is done by the array index which is define in the beginning of the program. Loop 25 is to be done concurrently by the two available processors. The work inside loop 30 can be done by the vector pipeline associated with each processor.
There are three directives being used. The 'cvd$l concur' directive turns on the concurrent calculation of the following loop. The directive 'cvd$l nodepchk' turns off the check for data dependency (lack of independent calculations) in the following loop. The 'cvd$l novector' directive will turn off the vector pipe calculations in the following loop.
The loop summary analysis of the compiled code is given in the lower part. Loops 12 and 10 are just initializing the temperature array to zero. All calculations are independent. The outer time loop, loop 20, was not done in either concurrent or vector mode. This loop has a serious data dependency because in order to compute the temperature at the next time step, one must know the temperature at the previous time. Loop 25 was done concurrently as was directed. Also, loop 30 was done using a vector pipeline.
Alliant FX/40 Fortran Code and Analysis of Parallelism for the Explicit Finite Difference Algorithm.
Alliant FX/Fortran Compiler V4.2.40 (2.24D41) 23 Mar 1994 15:48:49 Page 1
Source Listing of /usr/users/faculty/white/ma673.dir/heatp.f
1 Program heat
2 dimension u(512,80),index(3,2)
3 real tt1(2),tt2(2)
4 n = 510
5 index(1,1) = 2
6 index(1,2) = n/2
7 index(2,1) = index(1,2) + 1
8 index(2,2) = n
9 maxit = 30
10 f = 1.0
11 kappa = .001
12 dt = 5
13 dx = .1
14 alpha = kappa*dt/(dx*dx)
15 do 10 k = 1,maxit
16 do 12 i = 1,n+1
17 u(i,k) = 0.0
18 12 continue
19 10 continue
20 tbegin = etime(tt1)
21 do 20 k = 1,maxit
22 cvd$l concur
23 cvd$l nodepchk
24 do 25 ipar=1,2
25 cvd$l vector
26 cvd$l nodepchk
27 do 30 i = index(ipar,1),index(ipar,2)
28 u(i,k+1) = dt*f + alpha*(u(i-1,k) + u(i+1,k)) +
29 $ (1 - .2*dt - 2*alpha)*u(i,k)
30 30 continue
31 25 continue
32 20 continue
33 tend = etime(tt2)
34 print*, 'time =', tend - tbegin
35 stop
36 end
Alliant FX/Fortran Compiler V4.2.40 (2.24D41) 23 Mar 1994 15:48:49 Page 2
Source Listing of /usr/users/faculty/white/ma673.dir/heatp.f
------------------- LOOP SUMMARY FOR ROUTINE HEAT -------------------
LABEL INDEX START END NEST COMMENT CD? DP? ITERATIONS
10 K 15 19 1 CONCURRENT OUTER 30
12 I 16 18 2 VECTORIZED 511
20 K 21 32 1 NOT CHOSEN 30
25 IPAR 24 31 2 CONCURRENT OUTER 2
30 I 27 30 3 VECTORIZED
2 VECTOR 0 CONCURRENT 0 VECT-CONCR 2 CON. OUTER
1 NOT OPTMZD 0 NOT INNER 0 TOO SHORT 0 DEPENDENT
5 TOTAL 5 EXAMINED 4 OPTIMIZED
In order to see the effect of concurrent and vector calculations, we did four runs which are recorded in the table below. The serial mode is done by turning off both concurrent and vector computations: use noconcur and novector. The remaining three calculations can be done by the appropriate use of noconcur, concur, novector and vector. The ideal speedup for concurrent calculations is two because the computer has two processors. The ideal speedup for vectorization is four because the vector pipelines of this computer have four segments. In both calculations, we came fairly close. Larger problems (larger n) will increase the speedups. Also, one can combine concurrent and vector computations for at most a speedup of eight; we got about six for this particular problem.
| Table: Effect of Vector/Multiprocessing | |
| Compiler Option | Execution Times (10-3 sec.) |
|---|---|
| Serial | 38.7 |
| Concurrent, 2 processors | 20.4 |
| Vector, 1 processor | 10.8 |
| Concurrent/Vector, 2 processors | 6.5 |
The effective use of vector pipelines and multiprocessing computers will depend on the particular code being executed. There must exist independent calculations within the code. Some computer codes have a large number of independent parts and some have almost none. The use of timing models can give insight to possible performance of codes. Also, some codes can be restructured to have more independent parts.
The modified heat transfer problem for the thin wire resulted in a more restrictive stability condition on the time step. In this case, this restriction was not too severe as the problem reached the steady state solution very quickly. However, if the surrounding temperature is a function of time, this may not happen over such a short time interval. Many computer simulations range over periods of years. In such cases these restrictions on the time step may be too severe. Later we will describe alternative methods to the explicit time discretization.