Comparison of Two Parallel Rendering Algorithms


By Karl Schurig


Final Project CS 584





Abstract

ICASE Report No. 91-3 presents a parallel algorithm for rendering large numbers of triangles. This algorithm uses a typical domain decomposition method, distributing the triangles as well as the frame buffer across processors. This paper presents an alternative algorithm which duplicates the frame buffer on each processor, reducing communication in some cases and distributing the load more equally between processors. Analysis of performance models indicates that in many cases communication is not reduced. Experimental data indicates that the gains from a more equally distributed work load may sufficiently offset the increases in communication to make the new algorithm more efficient.


Introduction

Many real-time applications, such as animation and scientific visualization, require high-performance rendering of three-dimensional scenes. In many cases parallel algorithms are required to deliver adequate performance. Crockett and Orloff's paper presents one such algorithm. This paper presents a modified algorithm which increases communication overhead but avoids a load balancing problem associated with Crockett and Orloff's algorithm. A brief description of the rendering problem is provided below.

Simply put, the rendering problem consists of taking a three-dimensional scene description and using information about viewpoint, light conditions, and reflectance properties to produce a realistic-looking 2D image of the scene. Three main steps in the rendering process account for most of the processing time (reference). These steps are:
1. The floating point operations performed on objects, such as transforming, lighting, and clipping.
2. The rasterization of primitives transformed into screen coordinates.
3. Writing pixels to the frame buffer.
Both algorithms in this paper approach the problem from a domain decomposition perspective. Specifics of each algorithm are described below.

Description of the Algorithms

Crockett and Orloff's Method

Crockett and Orloff's algorithm takes a domain decomposition approach that allows each of the three stages to be parallelized, and it distributes the large data structures (the scene description and the frame buffer) between processors. Data structures are distributed in the following way:
1. The triangles are distributed evenly in round-robin fashion to all processors.
2. The frame buffer is divided among the processors by horizontal strips.
3. Small data structures, such as the lights and viewing parameters, are replicated on each processor.
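
The distribution above can be sketched in a few lines of Python. This is only an illustration, not code from either implementation: the function names are hypothetical, and the sketch assumes the strips are contiguous and as equal in height as possible, which the description does not state explicitly.

```python
def round_robin(num_triangles, p):
    """Return, for each of the p processors, the triangle indices it holds."""
    return [list(range(rank, num_triangles, p)) for rank in range(p)]


def strip_bounds(rank, y, p):
    """Return the half-open scan-line range [start, stop) owned by `rank`.

    Assumption: contiguous strips; when y is not divisible by p, the first
    (y mod p) processors get one extra scan line each.
    """
    base, extra = divmod(y, p)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def strip_owner(scan_line, y, p):
    """Return the rank of the processor whose strip contains `scan_line`."""
    for rank in range(p):
        start, stop = strip_bounds(rank, y, p)
        if start <= scan_line < stop:
            return rank
    raise ValueError("scan line outside the frame buffer")
```

With round-robin assignment every processor holds within one triangle of n/p triangles, regardless of where those triangles fall on the screen.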

Once the data has been distributed, the algorithm proceeds in the following manner:
1. The shading, transforming, and clipping steps are performed by each processor on its local triangles.
2. Before a triangle is rasterized, it is first transformed into screen coordinates, then split (if necessary) into trapezoids along local frame buffer boundaries. Each trapezoid is then sent to the processor which owns the segment of the frame buffer in which it lies. For efficiency, trapezoid sends are buffered.
3. Upon receiving a trapezoid, a given processor rasterizes it into its local frame buffer using a standard z-buffer algorithm to eliminate hidden surfaces.

A termination algorithm is also described.
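
A rough sketch of steps 2 and 3 follows. It is a simplification under stated assumptions rather than Crockett and Orloff's code: it splits only the vertical extent of a triangle at strip boundaries (the real algorithm produces geometric trapezoids), it assumes y is divisible by p so that all strips have equal height, and it assumes smaller z values are nearer the viewer. The function names are hypothetical.

```python
def split_by_strips(y_min, y_max, y, p):
    """Split the inclusive scan-line span [y_min, y_max] at strip boundaries.

    Returns a list of (owner_rank, first_line, last_line) pieces, one per
    strip touched. Assumes y is divisible by p (equal-height strips).
    """
    strip_height = y // p
    pieces = []
    line = y_min
    while line <= y_max:
        owner = line // strip_height
        last = min(y_max, (owner + 1) * strip_height - 1)
        pieces.append((owner, line, last))
        line = last + 1
    return pieces


def zbuffer_write(color_buf, z_buf, x, row, z, color):
    """Standard z-buffer test: keep the fragment only if it is nearer.

    Assumption: smaller z means closer to the viewer.
    """
    if z < z_buf[row][x]:
        z_buf[row][x] = z
        color_buf[row][x] = color
```

Each piece returned by split_by_strips corresponds to one trapezoid that would be placed in the send buffer for its owning processor.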


New Method

The new method also takes a domain decomposition approach, allowing each of the three stages of rendering to be parallelized, but it distributes only the triangles. Data structures are distributed as follows:
1. The triangles are distributed evenly in round-robin fashion to all processors.
2. Small data structures, such as the lights and viewing parameters, are replicated on each processor.
3. The frame buffer is also replicated on each processor.

Because the frame buffer is replicated, each processor can rasterize its triangles without any communication between processors. After all triangles have been processed, the replicated frame buffers can be consolidated using their respective z-buffers to determine which pixel values to use.
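
The consolidation of two replicated frame buffers can be sketched as a per-pixel depth comparison. This is a minimal illustration, not the project's actual code: the buffers are represented as flat Python lists, the function name is hypothetical, and smaller z is assumed to mean nearer the viewer.

```python
def merge_buffers(color_a, z_a, color_b, z_b):
    """Merge frame buffer B into frame buffer A in place.

    For every pixel, the color with the smaller z value (assumed nearer)
    survives. All four lists must have the same length.
    """
    for i in range(len(z_a)):
        if z_b[i] < z_a[i]:
            z_a[i] = z_b[i]
            color_a[i] = color_b[i]
```

Because the comparison keeps the minimum z at each pixel, the result does not depend on the order in which the processors' buffers are merged, up to ties in z.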

An example image, rendered by the new algorithm from 100,000 randomly generated triangles, follows.


[Figure: image rendered from 100,000 randomly generated triangles]




Performance Models For the Algorithms



The following terms are defined for use in the performance analysis:

p = number of processors
n = number of triangles
y = height of frame buffer (in scan lines)
h = average height of a triangle (in scan lines)
d = trapezoid buffer depth
t = number of trapezoids generated per processor
L = latency for sending a message
s = size in bytes of a trapezoid
B = time per byte to send a message
C = time to process a single triangle
f = size in bytes of the frame buffer
z = size in bytes of the z buffer

Crockett and Orloff's Method

Crockett and Orloff provide a detailed analysis of their algorithm's performance. For simplicity, and for the purposes of comparison, I will ignore some of the minor contributing factors. A simplified performance model is as follows:

Total Time = Rendering Time + Communication Time + Termination Algorithm

Rendering Time = Cn/p
Communication Time = (t/d)L + tsB
Termination Algorithm = 2L(p - 1 + log2(p))
t = nh/(2y) + n/p
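
The simplified model can be written directly as a small Python function, which makes it easy to plug in different parameter values. The function names are illustrative only; the parameters are the terms defined at the start of this section, and all times must be given in the same unit.

```python
from math import log2


def trapezoids_per_processor(n, p, h, y):
    """t = nh/(2y) + n/p, the number of trapezoids generated per processor."""
    return n * h / (2 * y) + n / p


def crockett_orloff_time(n, p, h, y, d, s, L, B, C):
    """Total time under the simplified Crockett and Orloff model."""
    t = trapezoids_per_processor(n, p, h, y)
    rendering = C * n / p
    # One message latency per full buffer of d trapezoids,
    # plus per-byte cost for every trapezoid sent.
    communication = (t / d) * L + t * s * B
    termination = 2 * L * (p - 1 + log2(p))
    return rendering + communication + termination
```

Note that the communication term is proportional to t, and t is proportional to n, so this cost grows linearly with the number of triangles.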


New Method

The performance model for the new method is relatively simple and can be stated as follows:

Total Time = Rendering Time + Time to Collapse Frame Buffers

Rendering Time = Cn/p

The frame buffers may be consolidated efficiently using a binary collapse. The performance of the binary collapse depends only on hardware characteristics, the number of processors, and the size of the frame buffer, as follows:

Collapse Time = log2(p)(L + B(f + z))
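
The log2(p) factor comes from the tree-shaped communication pattern of the collapse. The sketch below shows one possible schedule along with the model's time estimate. The pairing of ranks is an assumption made for illustration (the paper does not specify which processors exchange buffers in each round), and the function names are hypothetical.

```python
from math import log2


def collapse_schedule(p):
    """Return the rounds of a binary collapse onto rank 0.

    Each round is a list of (source_rank, destination_rank) pairs. In every
    round each surviving destination merges in the buffer of the rank
    `stride` positions above it, and the stride doubles between rounds.
    """
    rounds = []
    stride = 1
    while stride < p:
        rounds.append(
            [(dst + stride, dst) for dst in range(0, p - stride, 2 * stride)]
        )
        stride *= 2
    return rounds


def collapse_time(p, L, B, f, z):
    """Collapse Time = log2(p)(L + B(f + z))."""
    return log2(p) * (L + B * (f + z))
```

When p is a power of two the schedule has exactly log2(p) rounds, and each round sends one full frame buffer and one full z buffer per pair, which matches the (L + B(f + z)) cost per round in the model.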


Comparison of the Algorithms

This section compares the performance of Crockett and Orloff's method with the performance of the new method using the performance models given above.

From the equations, it becomes apparent that the primary communication cost in Crockett and Orloff's method grows linearly with the number of triangles. In contrast, communication time in the new method grows linearly with the size of the frame buffer. So the relative performance of the algorithms depends on the ratio of the number of triangles to the frame buffer size. One value of interest is the number of triangles, for a given frame buffer size and number of processors, at which the two algorithms perform approximately equally. Because the rendering time Cn/p is the same in both models, it cancels. Setting the remaining terms equal to each other and solving for n gives the following equation:

n = ( (log2(p)(L + B(f + z)) - 2L(p - 1 + log2(p))) * 2yp ) / ( (L/d + sB)(hp + 2y) )
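
The break-even expression can be evaluated with a short Python function. This is a sketch for experimentation only: the function name is hypothetical, and the paper does not list the values of s, y, f, and z used for its graph, so any values passed for those parameters are the reader's own assumptions.

```python
from math import log2


def break_even_triangles(p, h, y, d, s, L, B, f, z):
    """Number of triangles n at which both models predict equal total time."""
    collapse = log2(p) * (L + B * (f + z))
    termination = 2 * L * (p - 1 + log2(p))
    # Communication cost per triangle in Crockett and Orloff's model:
    # (L/d + sB) * t / n, where t/n = (hp + 2y) / (2yp).
    per_triangle = (L / d + s * B) * (h * p + 2 * y) / (2 * y * p)
    return (collapse - termination) / per_triangle
```

For scenes with more triangles than the returned value, the model predicts lower communication cost for the new method; for fewer triangles, it favors Crockett and Orloff's method. A negative result means the collapse is cheaper than the termination algorithm alone, so the model favors the new method for any number of triangles.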

A graph of these values was created using the following constants:

L = 0.1 ms
B = 8e-9 s per byte
d = 100 for p = 8
    40 for p = 16
    20 for p = 32
    10 for p = 64
    6 for p = 128
h = 10





These values indicate that for small frame buffers the new algorithm competes well in communication cost, although for larger frame buffers it performs poorly.

Though the new algorithm requires more communication overhead in many instances, the effect on overall performance is relatively small since computation, not communication, accounts for most of the algorithm's execution time. Following is a graph showing the percentage of execution time spent in communication as a function of the number of processors and the number of triangles rendered using a 512x512 frame buffer. These numbers were obtained on P2 266 machines with the latency and bandwidth values given above.



As these results show, communication costs become increasingly influential as the number of processors increases, but for relatively small numbers of processors communication costs are small.

The major advantage the new algorithm has over Crockett and Orloff's method relates to rendering actual scenes. Actual scenes tend to have areas with very high concentrations of triangles along with areas containing very few. Since Crockett and Orloff's algorithm partitions the frame buffer based on locality, the processors that own densely covered strips must rasterize far more pixels than the others, so the work tends to be distributed unequally. The new algorithm avoids this problem because each processor rasterizes only its own round-robin share of the triangles. Unfortunately, no quantitative data is available with which to address this issue.

Actual speedups using the new algorithm are reasonably close to expected results. Following is a graph summarizing speedups for various numbers of processors and numbers of triangles rendered.





Conclusion

In conclusion, the new algorithm presented in this paper is an effective alternative to Crockett and Orloff's algorithm. Though the new algorithm requires slightly more communication time, it avoids some load balancing problems associated with Crockett and Orloff's algorithm. In addition, it is much simpler to implement.