Comparison of Two Parallel Rendering Algorithms
By Karl Schurig
Final Project CS 584
-
- Abstract
-
- ICASE Report No. 91-3 presents a parallel algorithm for rendering
large numbers of triangles. This algorithm uses a typical domain
decomposition method, distributing the triangles as well as the frame
buffer across processors. This paper presents an alternative
algorithm which duplicates the frame buffer on each processor, which
reduces communication in some cases and distributes the load more
equally between processors. Analysis of performance models indicates
that in many cases communication is not reduced. Experimental data
indicates that the gains from a more equally distributed work load
may sufficiently offset the increases in communication to make the
new algorithm more efficient.
- Introduction
-
- Many real-time applications, such as animation and scientific
visualization, require high-performance rendering of
three-dimensional scenes. In many cases
parallel algorithms are required to deliver adequate performance
levels. Crockett and Orloff's paper presents one such algorithm.
This paper presents a modified algorithm which increases
communication overhead but avoids a load balancing problem
associated with Crockett and Orloff's algorithm. A brief
description of the rendering problem is provided below.
-
- Simply put, the rendering problem consists of taking a
three-dimensional scene description and using information about
viewpoint, lighting conditions, and reflectance properties to produce
a realistic-looking 2D image of the scene. Three main steps in the
rendering process account for most of the processing time
(reference). These steps are:
-
1. The floating point
operations performed on objects, such as transforming, lighting, and
clipping.
-
2. The rasterization of
primitives transformed into screen coordinates.
-
3. Writing pixels to the
frame buffer.
-
Both algorithms in this paper approach the problem from a domain
decomposition perspective. Specifics of each algorithm are described
below.
-
- Description of the Algorithms
-
- Crockett and Orloff's Method
-
- Crockett and Orloff's algorithm, which takes a domain decomposition
approach, allows each of the three stages to be parallelized, and
distributes large data structures (the scene description and the
frame buffer) between processors. Data structures are distributed in
the following way:
-
1. The triangles are
distributed evenly in round-robin fashion to all processors.
-
2. The frame buffer is
divided among the processors by horizontal strips.
-
3. Small data structures,
such as the lights and viewing parameters, are replicated on each
processor.
-
- Once the data has been distributed, the algorithm proceeds in the
following manner:
-
1. The shading,
transforming, and clipping steps are performed by each processor on
its local triangles.
-
2. Before rasterizing a triangle, it is first transformed into
screen coordinates, then split (if necessary) into trapezoids along
local frame buffer boundaries. Each trapezoid is then sent to the
processor which owns the segment of the frame buffer in which it
lies. For efficiency, trapezoid sends are buffered.
-
3. Upon receiving a
trapezoid, a given processor rasterizes it into its local frame
buffer using a standard z-buffer algorithm to eliminate hidden
surfaces.
-
- A termination algorithm is
also described.
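The data distribution and the routing step described above can be
sketched as follows. This is a minimal illustration, not Crockett and
Orloff's code: the function names are hypothetical, the strips are
assumed to be contiguous and of near-equal height, and a triangle is
reduced to its inclusive scan-line span rather than being split into
true trapezoids.

```python
def round_robin(items, p):
    """Deal items to p processors in round-robin order."""
    return [items[rank::p] for rank in range(p)]


def strip_bounds(height, p):
    """Return one (start, end) scan-line range per processor, end exclusive.

    Assumption: strips are contiguous and differ in height by at most one.
    """
    base, extra = divmod(height, p)
    bounds, start = [], 0
    for rank in range(p):
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def split_by_strip(y_min, y_max, bounds):
    """Split the inclusive span [y_min, y_max] into (owner, lo, hi) pieces.

    Each piece stands in for one trapezoid that would be sent to `owner`.
    """
    pieces = []
    for owner, (start, end) in enumerate(bounds):
        lo, hi = max(y_min, start), min(y_max, end - 1)
        if lo <= hi:
            pieces.append((owner, lo, hi))
    return pieces
```

A triangle whose span crosses k strip boundaries produces k + 1
pieces, which is why the trapezoid count in the performance model
grows with triangle height.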
- New Method
-
- The new method also takes a domain decomposition approach, allowing
each of the three stages of rendering to be parallelized, but it
decomposes only the triangles; the frame buffer is not partitioned.
Data structures are distributed as follows:
-
1. The triangles are
distributed evenly in round-robin fashion to all processors.
-
2. Small data structures,
such as the lights and viewing parameters, are replicated on each
processor.
-
3. The frame buffer is
also replicated on each processor.
-
- Because the frame buffer is replicated, triangles can be rasterized
without any communication between processors. After all triangles
have been processed, the replicated frame buffers can be consolidated
using their respective Z buffers to determine which pixel values to
use.
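The consolidation step amounts to a per-pixel depth comparison
between two buffers. The sketch below is one way to write it, under
the assumptions that buffers are flat Python lists of equal length and
that a smaller Z value means closer to the viewer; the function name
is hypothetical.

```python
def merge_buffers(color_a, depth_a, color_b, depth_b):
    """Merge two replicated frame buffers pixel by pixel.

    For each pixel, keep the color whose depth is smaller (assumed to be
    closer to the viewer). Ties keep the value from buffer A.
    """
    color_out, depth_out = [], []
    for ca, za, cb, zb in zip(color_a, depth_a, color_b, depth_b):
        if zb < za:
            color_out.append(cb)
            depth_out.append(zb)
        else:
            color_out.append(ca)
            depth_out.append(za)
    return color_out, depth_out
```

Pixels that a processor never touched can carry an infinite depth
(float("inf")) so that any rendered pixel from another processor
replaces them.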
-
- An example image, rendered by the new algorithm from 100,000
randomly generated triangles, follows.
-
-
[Figure: image rendered from 100,000 randomly generated triangles]
-
- Performance Models For the Algorithms
-
- The following terms are
defined for use in the performance analysis
-
- p = number of processors
-
n = number of triangles
-
y = height of frame buffer (in
scan lines)
-
h = average height of a triangle (in pixels)
-
d = trapezoid buffer depth
-
t = number of trapezoids
generated per processor
-
L = latency for sending a message
-
s = size in bytes of a trapezoid
-
B = time per byte to send a
message
-
C = time to process a single
triangle
-
f = size in bytes of the frame
buffer
-
z = size in bytes of the z buffer
-
- Crockett and Orloff's Method
-
- Crockett and Orloff provide a
detailed analysis of their algorithm's performance. For simplicity,
and for the purposes of comparison, I will ignore some of the minor
contributing factors. A simplified performance model is as follows
-
- Total Time = Rendering Time + Communication Time + Termination Algorithm
-
- Rendering Time = Cn/p
-
Communication Time = (t/d)L + tsB
-
Termination Algorithm = 2L(p - 1 + log2(p))
-
t = nh/(2y) + n/p
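The simplified model above translates directly into a small function,
which makes it easy to see how each term responds to n and p. This is
only a transcription of the formulas in this section; the function
names are hypothetical and the arguments use the symbols defined
earlier, in consistent units (for example seconds and bytes).

```python
import math


def trapezoids_per_processor(n, p, h, y):
    """t = nh/(2y) + n/p, the trapezoids generated per processor."""
    return n * h / (2 * y) + n / p


def crockett_orloff_time(n, p, h, y, d, L, s, B, C):
    """Total time under the simplified Crockett and Orloff model."""
    t = trapezoids_per_processor(n, p, h, y)
    rendering = C * n / p
    # t/d buffered messages pay latency L; every trapezoid pays s*B to move.
    communication = (t / d) * L + t * s * B
    termination = 2 * L * (p - 1 + math.log2(p))
    return rendering + communication + termination
```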
-
- New Method
-
- The performance model for the
new method is relatively simple and can be stated as follows
-
- Total Time = Rendering Time + Time to Collapse Frame Buffers
-
- Rendering Time = Cn/p
-
- The frame buffers may be consolidated efficiently using a binary
collapse. The performance of the binary collapse depends only on
hardware characteristics and the size of the frame buffer, as
follows:
-
- Collapse Time = log2(p)(L + B(f + z))
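A binary collapse can be pictured as log2(p) rounds in which half of
the remaining processors each send their frame buffer and Z buffer to
a partner. The sketch below lists one possible pairing per round and
evaluates the model. The pairing pattern and the function names are
assumptions made for illustration (the paper does not specify them),
and p is assumed to be a power of two.

```python
import math


def collapse_schedule(p):
    """Return, for each round, the (receiver, sender) pairs of a binary collapse.

    Assumes p is a power of two; processor 0 ends up with the final image.
    """
    rounds, stride = [], 1
    while stride < p:
        rounds.append([(r, r + stride) for r in range(0, p, 2 * stride)])
        stride *= 2
    return rounds


def collapse_time(p, L, B, f, z):
    """Collapse Time = log2(p) * (L + B*(f + z))."""
    return math.log2(p) * (L + B * (f + z))


def new_method_time(n, p, C, L, B, f, z):
    """Total time for the new method: rendering plus frame buffer collapse."""
    return C * n / p + collapse_time(p, L, B, f, z)
```

Each round moves one full frame buffer and Z buffer per pair, which is
where the factor (L + B(f + z)) per round comes from.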
-
- Comparison of the Algorithms
-
- This section compares the performance of Crockett and Orloff's
method with the performance of the new method using the performance
models given above.
- From the equations, it becomes apparent that the primary
communication cost in Crockett and Orloff's method grows linearly
with the number of triangles. In contrast, communication time in the
new method grows linearly with the size of the frame buffer. So the
relative performance of the algorithms depends on the ratio of
triangles to frame buffer size. One value of interest is the number
of triangles, for a given frame buffer size and number of processors,
at which the algorithms perform approximately equally. Setting the
two total-time equations equal to each other and solving for n gives
the following equation:
-
- n = 2yp [log2(p)(L + B(f + z)) - 2L(p - 1 + log2(p))] / [(L/d + sB)(hp + 2y)]
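The intermediate step is short: the Cn/p rendering terms cancel,
leaving t(L/d + sB) = log2(p)(L + B(f + z)) - 2L(p - 1 + log2(p)),
and substituting t = n(hp + 2y)/(2yp) gives the expression for n. A
small function makes the break-even point easy to evaluate for
different configurations. The function name is hypothetical, and any
value chosen for s is an assumption, since no trapezoid size is listed
among the constants used in this paper.

```python
import math


def break_even_triangles(p, y, h, d, L, s, B, f, z):
    """Number of triangles n at which both models predict equal total time."""
    collapse = math.log2(p) * (L + B * (f + z))
    termination = 2 * L * (p - 1 + math.log2(p))
    per_trapezoid = L / d + s * B  # communication cost of one trapezoid
    return (collapse - termination) * 2 * y * p / (per_trapezoid * (h * p + 2 * y))
```

For triangle counts below this value the model predicts lower
communication cost for Crockett and Orloff's method; above it, the
new method's fixed collapse cost is the smaller of the two. A
non-positive result means the collapse is already cheaper than the
termination algorithm alone.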
-
- A graph of these values was created using the following constants:
-
- L = 0.1 ms
-
B = 8e-9 s per byte
-
d = 100 for p = 8, 40 for p = 16, 20 for p = 32, 10 for p = 64,
6 for p = 128
-
h = 10
-
-
- These values indicate that for small frame buffers the new
algorithm competes well in communication cost, although for larger
frame buffers it performs poorly.
-
- Though the new algorithm requires more communication overhead in
many instances, the effect on overall performance is relatively small
since computation, not communication, accounts for most of the
algorithm's execution time. Following is a graph showing the
percentage of execution time spent in communication as a function of
the number of processors and the number of triangles rendered, using
a 512x512 frame buffer. These numbers were obtained on P2 266
machines with the latency and bandwidth values given above.
-
-
- As these results show, communication costs become increasingly
influential as the number of processors increases, but for relatively
small numbers of processors communication costs are small.
-
- The major advantage the new algorithm has over Crockett and
Orloff's method relates to rendering actual scenes. Actual scenes
tend to have areas of very highly concentrated triangles along with
areas of very few. Since Crockett and Orloff's algorithm partitions
the frame buffer based on locality, it will tend to have an unequal
work distribution. The new algorithm avoids this problem.
Unfortunately, no quantitative data is available with which to
address this issue.
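The effect can at least be illustrated with a toy count. The sketch
below compares how many triangles land on each processor when work
follows strip ownership versus round-robin assignment. It is purely
illustrative: the function names and the scan-line positions in the
usage note are hypothetical, strips are assumed to be of equal
height, and counting triangles ignores differences in triangle size,
so it is not a substitute for measured data.

```python
from collections import Counter


def strip_load(centers, height, p):
    """Count triangles per strip owner from their center scan lines.

    Assumption: p equal-height horizontal strips over `height` scan lines.
    """
    strip = height / p
    counts = Counter(min(int(c // strip), p - 1) for c in centers)
    return [counts.get(rank, 0) for rank in range(p)]


def round_robin_load(n, p):
    """Count triangles per processor under round-robin distribution."""
    return [len(range(rank, n, p)) for rank in range(p)]
```

For example, with eight triangles centered on scan lines 10, 20, 30,
40, 50, 60, 300, and 500 of a 512-line buffer split over four
processors, strip ownership gives one processor six of the eight
triangles, while round-robin gives each processor two.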
-
- Actual speedups using the new algorithm are reasonably close to
expected results. Following is a graph summarizing speedups for
various numbers of processors and numbers of triangles rendered.
-
-
- Conclusion
-
- In conclusion, the new algorithm presented in this paper is an
effective alternative to the Crockett and Orloff algorithm. Though
the new algorithm requires slightly more communication time, it
avoids some load balancing problems associated with the Crockett and
Orloff algorithm. In addition, it is much simpler to implement.