Eric Sokolowsky

CS 584 Final Project

A Parallel Ray Tracing Program

Many opportunities for parallelism exist in the realm of computer graphics. One application that benefits greatly from parallel computing is ray tracing. In this paper I describe the ray tracing problem briefly, the requirements I had for a parallel implementation, and my implementation. There are two parts to the implementation of my parallel ray tracer: the communication, and the ray tracing. Each of these will be described.

Ray Tracing

Ray tracing is a method in computer graphics to render a scene, or make an image of a scene. Objects, usually geometric primitives such as polygons and spheres, are placed in the scene, and a camera is positioned to "take a picture" of the scene. Rays emanate from the camera toward the scene on a per-pixel basis, or even a sub-pixel basis. When the ray hits an object, material properties such as smoothness and color are taken into account, and depending on other properties of the material, such as reflectivity or transparentness, other rays may begin at the intersection point. In addition to these rays, antialiasing (or producing images with softer edges) may achieved by generating multiple rays per pixel and averaging the results to get a pixel value. Thus, one pixel in the final image may require hundreds or even thousands of rays to be generated and calculated, causing a large number of floating point mathematical operations.

Parallel Ray Tracing

Since each pixel is traced separately, and since it is difficult to find coherency between neighboring rays (making each ray an independent operation), ray tracing lends itself very well to parallelization, because of relatively low communication requirements. Each processor can take a certain number of pixels and calculate the color of each one, independently of what other processors are doing.

There are many ways to divide up a ray tracing job among several processors. The main ones are scanline-based, block-based, or pixel-based, where the image is divided along scan lines (each processor gets a certain number of scan lines), rectangular blocks of pixels, or even random or regular per-pixel assignments, respectively. The scene can be distributed statically (before any computation is made), or dynamically, where each processor is given a static assignment to begin with, and then changes are made on the fly to keep each processor busy. In general, dynamic load-balancing algorithms tend to work more efficiently because it is very difficult to predict which pixels will require more work than others. [Heirich98] discusses many load-balancing strategies for parallel ray tracers.

A client/server architecture is often used for a ray tracing program. Under this scheme, the server is responsible for dividing up the task, and each client is responsible for tracing some of the rays. This type of architecture also lends itself well to dynamic load balancing, as each processor can request just enough work to do to keep it busy for a little while, but not too much so that everybody's waiting for a slower machine to finish its task. When a client is done with a particular part of the job, it can request more work to do.

I wanted the following features in my parallel ray tracer:

Written in C++ using classes and dynamic memory allocation using the new operator for ease of programming and debugging
Client/server architecture for load balancing
Portable across any UNIX system (specifically SGI, Linux, and HP-UX)
Cross-architecture compatibility (a single job can run on many architectures at once)
New clients can join a job in progress at any time
Take advantage of multiple processors on a single machine automatically
Use POSIX threads (pthreads) to make easy sharing of data between tasks on the same machine
Use optimizing tricks to make the raytracer as fast as possible

Many parallel APIs are available for use with C/C++. MPI (version 1) is one that we have used in class for various projects, and it works well for most applications. Its one biggest drawback is that host determination is static; the hosts participating in a job are decided at the beginning and there is no chance to add new ones later. There are some other things about MPI that make it inconvenient to work with, especially with executables across platforms. Also, nodes working on a multi-processor machine might not communicate efficiently with other nodes on the same machine. Also, since MPI is primarily written in strict C, it is difficult to use C++ features such as classes and reference variable passing in an MPI program. Also, MPI doesn't take advantage of multiple processors automatically; these must be specified manually in the configureation file.

Other APIs, such as PVM, will be described here as soon as I find out about them. HERE

I decided to write my own API, which I call Eric's Parallel Interface, or EPI. It overcomes the inconveniences described above. It uses sockets to communicate between machines, and IPC facilities to communicate between nodes on the same machine, to make communication as efficient as possible.

EPI - Eric's Parallel Interface

The EPI is designed to abstract away the details of communicating between tasks on different machines. While I wrote it specifically to implement a parallel ray tracer, it can be used for other applications as well.

I designate one machine as a master. This is usually the machine on which a particular job is started. This machine is responsible for accepting connections from new hosts and assigning a unique host identification number to each host that attaches to the job. Since this machine knows about all the hosts working on the same job, it is consulted when a host needs to know how to contact another host that's not the master. However, once a host knows about another host, communication can proceed between those two hosts directly, without further need for going through the master. Thus, the master can be treated as any other machine participating in a job; its extra responsibilities are usually required at only the beginning, so not much time is lost performing them.

On each machine there are two processes running in an EPI program. (UNIX processes are created via the fork system call.) The first one is the communication process, and it is responsible for communicating messages between tasks on that machine and tasks on other machines. The second process is the user process, and it is split into pthreads to run the user's job. There is one pthread for each processor on the machine. Hopefully, the CPU time required by the communication process is minimal compared to the amount of work being done by each thread.

Communication

Because I wanted to be able to communicate between UNIX machines anywhere, regardless of location, sockets are the only possible mechanism which I could use. However, there are many types of sockets. There are three families of sockets: Internet, UNIX, and Raw. UNIX sockets are only useful between processes sharing a common file system. Raw sockets are only useful between machines of the same architecture. Thus, Internet sockets were the only viable choice. Within the Internet socket family, there are two types: Datagram and Stream. Datagram sockets send messages one at a time, but delivery is unreliable. Stream sockets are treated as a constant stream of bytes, where there are no imposed message boundaries but delivery is guaranteed. I wanted to send discrete messages with EPI, but the reliability of stream sockets outweighed the inconvenience of determining message boundaries, so I used stream sockets for my implementation of EPI.

Under standard UNIX, there are many means by which inter-process communication (IPC) may be achieved. The main methods I considered are shared-memory message queues and pipes. As mentioned above, sockets of the UNIX family can also be used, but they will not be discussed in this paper.

Shared memory queues are somewhat difficult to set up and to clean up. They are created with the msgget system call, and referred to by an identification number. They remain on the system until either it is rebooted or else they are explicitly destroyed. The command-line utilities ipcs and ipcrm may be used to see and remove existing message queues, but it is inconvenient to use them from within a program.

Once they are set up, message queues are easy to use. Messages can be placed on the queue with msgsnd and retrieved with msgrcv. Messages can be given an identification number so that any particular type of message can be retrieved from the queue, even if it's not the first message. In addition, it is very easy to see if there are any messages waiting in a particular queue, so a process can check to see if there is any incoming message without blocking. Messages are sent integrally; either all of a message is present or none of it is. Since these message queues reside in shared memory, any process can use them.

Pipes perform a similar function as message queues, but their use is much different. First of all, pipes do not impose message boundaries. The pipe system call is used to create a set of pipes. It actually creates two pipes and places a pair of file descriptors used to refer to those pipes in an array. Thus, pipes are represented using file descriptors, just as if it were a regular file. The first pipe can be read from, and the second pipe can be written to. Because pipes are treated as files by UNIX, most system calls usable on files are also usable on pipes. For example, the fstat system call is normally used on a regular file to find out information about that opened file, including file size. This call can also be used on a pipe to find out how many bytes are ready for reading.

Because both pipes and sockets are described by file descriptors, and because stream sockets and pipes both do not preserve message boundaries, they can share code to read and write messages. This greatly simplifies the passing of messages. One function can be written to send and one to receive, with the file descriptor passed in as an argument.

Pipes have another advantage over shared-memory message queues in that they are easier to destroy. Pipes (and sockets) are destroyed automatically when a program exits, if they haven't been closed previously. This is helpful because a user may interrupt a program, and when this happens a program may not be able to clean up after itself. Pipes and sockets are self-cleaning, while shared-memory queues are not.

Pipes can also be shared among processes on the same machine, as long as a process creates the pipes to be shared before the fork. (Named pipes (known as FIFOs), also exist, allowing unrelated processes to communicate with each other, but these will not be discussed in this paper.) All child processes will have full access to all of the pipes, though it might make sense to close some of the pipes from some of the processes if other file descriptors are needed. The pipes remain open to all other processes even if they are closed in some.

Another variation of IPC with pipes can be used with the popen and pclose system calls. popen launches another program on the same computer, returning a FILE * that can be either read from or written to, but not both. Text written to this file appears on the standard input of the child process (if the file was opened for writing), and any output that the program produces appears in the file (if it was opened for reading). This functionality can be convenient for many tasks.

I decided to use pipes for IPC. I create a pair of pipes for each thread, and an additional pair for the communication process. When the communication process receives a message destined for one of its threads, it just writes the message into the proper writing pipe. Each thread reads from its read pipe. Threads that need to send messages to other machines send them to the communication process' write pipe, and the communication process reads such messages from its read pipe, and then sends it out on the appropriate socket. If the communication process cannot, for some reason, immediately send a message, it puts the message back on its write pipe to be read in later, after other messages are processed. In this way, the communication process need not block, waiting to write to the intended destination. A disadvantage of this method of message passing is that messages from one thread on one machine to another thread on a different machine must go through at least three complete transfers (two pipes and one socket), which decreases performance.

Using EPI

To use EPI, you must get four files:

epi.h
epi.cpp
eargs.h
eargs.cpp

Epi.h and epi.cpp contain prototypes and the implementation details of the EPI public functions, described below, as well as other private functions required for implementation. Eargs.h and eargs.cpp contain functions useful for parsing command-line options. While these files are written primarily in C++, they only use two features of that language: reference parameter passing and declaring local variables in the middle of a block. In every other way, these functions could be used with a strict C compiler, and it would not be difficult to modify these files to provide C compatibility. As I was primarily interested in using this interface in a C++ program, strict C compatibility was not a top priority.

The following command-line arguments are recognized in an EPI program:

-epi_master <mastername>
-epim <mastername>

This option (either form may be used) specifies the name of the master machine. Usually user programs will not use this option when invoked. It is used primarily when starting new processes on remote machines.

-epi_config <configfile>
-epic <configfile>

This option (either form may be used) specifies the name of a configuration file, used to determine which machines will start participating in a job. A configuration file consists of any number of lines. On each line are two tokens, separated by whitespace. Other whitespace is ignored. The first token on a line is the name of another machine. This can be a fully qualified internet domain name or a locally-known name. The second token on the line is the full path of the executable to run on the remote machine. If the first non-whitespace character on a line is "#", then the line is considered commented out, and ignored.

If this option is given, then the file is parsed, one line at a time. An attempt is made to launch a remote shell (via rsh for most versions of UNIX, but with remsh for HP-UX) on each machine listed in the configuration file, but no errors are reported if any line fails for some reason.

An interesting thing happens when the above two command-line options are used together. Each remote process tries to connect to the master specified on the command line, rather than the machine that launched the other programs. In this way, a machine can be used to start a number of machines working on a job already in progress somewhere else.

-epi_port <portnum>
-epip <portnum>

Use a different port number instead of the default, which is 12738. Port numbers must be available for use. Only the superuser may use ports less than 1024.

EPI Public Interface

There are several functions that can be used by EPI programs. These are explained below.

EPI_Start

int EPI_Start(int argc, char *argv[], void (*start_func)(int argc, char **argv, int id, int thread_id));

This function should be called first. It takes three arguments: argc, argv, and a startup function. Argc and argv should be the same as those passed to the main program. EPI_Start will parse through the command line, removing any EPI-specific arguments, perform its startup procedures, and call the startup function that is passed as the third parameter. This startup function should return void and take four arguments: argc, argv, the machine's identification number, and the thread's identification number. Each thread will be started with this function, so if a program is to be run on a machine with multiple processors, it should be thread-safe. Machine ID numbers start at 0 (the master is always ID 0), and thread ID numbers start at 1.

The return value is not meaningful.

EPI_Send

int EPI_Send(void *buf, int msglen, int src_id, int src_th, int dst_id, int dst_th);

Use this function to send a message to any machine and/or thread. Buf should be a pointer to the message. The contents of this message are not changed in any way during the transfer, so if integer or floating point values are sent, and if a mixture of big-endian machines and little-endian machines is used on the same job, the user is responsible for endian translation. The following functions are useful for network/host translation of integer values: htons, ntohs, htonl, and ntohl. These functions can be stated as "host to network short," "network to host short," "host to network long," and "network to host long," respectively. Longs are 4-byte values and shorts are 2-byte values. A similar byte-swapping is needed to send floating point values, both with floats and doubles.

In addition to the buffer, the message length (in bytes), the machine's host ID and thread ID, plus the destination ID and thread ID must be sent. A message may be sent to a particular machine but a random thread by specifying the destination thread ID as 0. Whichever thread has the shortest incoming message queue will get the message.

The return value is the number of bytes sent, or -1 if there was an error.

EPI_Recv

int EPI_Recv(void *buf, int src_th, EPI_Status *stat);

Use this function to receive messages sent by other threads. A pointer to the buffer where the message should be received is passed as an argument, as well as the thread_id of the receiving thread, and a pointer to a status structure with the following members (as defined in epi.h):

int msglen;
int src_id;
int src_th;
int dst_id;
int dst_th;

These are the same values that are passed as parameters to the EPI_Send function.

If there are no messages waiting for the calling thread, then this function call blocks until a message comes.

The return value is the number of bytes read, or -1 if there was an error.

EPI_RecvReady

int EPI_RecvReady(int src_th);

Sometimes a thread should not block, waiting for an incoming message. This function can be used to determine if there is a message waiting for a thread.

The return value is 1 if there is a message waiting, 0 otherwise.

EPI_BeginCritical

int EPI_BeginCritical(int which = 0);

When changing or accessing global data in multi-threaded programs, care should be taken to allow only one thread access to data at a time, or else the data may become corrupted. The use of semaphores can prevent such data corruption. Use EPI_BeginCritical and EPI_EndCritical around sections of code that modify data that is shared among threads. Different pieces of data may be protected by passing in a number for the parameter. There are currently 16 semaphores available for use (numbered 0 to 15). If no argument is given, 0 is assumed.

The return value is not significant.

EPI_EndCritical

int EPI_EndCritical(int which = 0);

Use this function at the end of a critical block of code.

The return value is not significant.

Compiling an EPI program

Compilation was tested on the following four Unix operating systems: SGI's IRIX, Linux for the x86 family, Linux for the DEC Alpha, and HP's HP-UX. Other operating systems might need further modifications, depending on desired functionality.

Epi.cpp expects that one of the following will be defined (using the -D command-line compiler option), depending on the operating system:

IRIX: OS_IRIX
Linux for x86: OS_Linux
Linux for Alpha: OS_alpha
HP-UX: OS_HPUX

If a different operating system is required, then code should be added to determine the number of processors available on the machine in the function GetNumProcessors in epi.cpp. Any other required changes should also be made.

One other symbol may be defined at compile-time (again, using the -D option): USE_PTHREADS. If this symbol is given, then multi-thread support using POSIX threads (pthreads) will be used. Otherwise, threads are not supported, and multi-processor machines are not used to full advantage.

EPI Implementation

EPI_Start

When this function starts, it takes care of initialization of global variables, and also takes care of EPI-specific command-line options. Then the current working directory is changed to be that of the executable, so that launched processes on other machines can take advantage of relative filenames. This is done with the chdir system call.

The next step is to create a set of pipes that can be used by the thread 0 for this purpose. An array of thread communication structures is created and initialized, and each one holds, among other things, a pair of file descriptors for the pipes used for that thread. I treat the comunication process as thread 0 in this array, so accessing a particular pipe is simplified. A call to pipe is made for each thread and once for the communication process.

After the pipes are created, the program forks into a communication process and a user process.

The Communication Process

After the communication process forks, it tries to get its host identification number. The machine determines whether it is master or not by the absence or presence of the "-epim" command-line option, respectively. If the machine is not the master, then it contacts the master by looking up the master's name and getting its IP address. Then it attempts to connect a socket (with the connect system call) to the well-known port on the master's machine. Once a connection is made, the master will send the machine's new host ID number. Otherwise, if the machine is the master, then the id is set to 0.

Then a socket bound to the well known port (using the bind system call) is created to listen for incoming connections. This is done for all machines, not just the master. This allows a direct socket connection with any other host, without always having to go through the master.

However, a deadlock condition could arise where two hosts, neither of which is the master, each has a message for the other. Each tries to connect to the other, and since the communication process is used for both sending and receiving, both are trying to talk and neither is listening. This is overcome by only allowing hosts of a higher host ID number to directly contact hosts whose host ID is lower. For example, host 2 can try to connect to host 1, but the reverse is not allowed to happen. If host 1 has a message for host 2, however, he can tell the master that host 2 should make a connection to host 1, and if host 2 is connected, it will do so. Meanwhile, host 1 keeps its message until a connection is made. This paradigm also applies to the master, with the exception that if it has a message for a host that is not yet connected, it can't ask anyone to tell that host to connect to the master. It must wait for the connection from that host.

When a host has a message for another host, its host ID number is greater than the recipient's, then a check is made to see if the sender has the IP address of the recipient. If not, the host asks the master for the IP address. Once obtained, the host can directly connect to the recipient. The two machines exchange host numbers, and then the message can be sent.

Most messages passed between hosts and between a thread and other threads has a message header of six 4-byte integer values: A message ID, the message length in bytes, the source host ID, the source thread ID, the destination host ID, and the destination thread ID. This message header goes with a message until it gets to its destination. Network translation is performed on the message header when needed (when sending a message over a socket or receiving a message from a socket), so that architecture compatibility is maintained. A very few messages consist of only one or two 4-byte integer values; these are used for initial handshaking when new connections are received and sent, and convey host ID numbers.

The communication process enters a loop, waiting for incoming messages and dispatching them. It uses the select system call. Select allows the user to register any number of file descriptors which will be tested for input, output, or exceptional conditions until a timeout value is reached. If one or more file descriptors become available, select will terminate, and the program will then be able to act on those incoming messages. In the case of the communication process, it just needed to register the incoming socket descriptor, the communcation process' incoming pipe, and all connected sockets for read availability. If a message is received, then it is dispatched. If an incoming connection is detected, it is received. This continues until one of the following conditions have been met:

If the machine is not the master, then it quits after a "thread quit" message has been received for each thread running on that machine. If any other connections to other hosts are still active, then a "communication quit" message is sent to each one.
If the machine is the master, then it quits after all of its threads have quit, and after it receives a "communication quit" message from all connected hosts. Even after that, it waits a few seconds in case another host tries to connect to it before it stops listening.

The User Process

The first thing the user process does is wait for the communication process to tell it what ID it is. This comes as a message through a pipe. Then the user process launches a remote program (with rsh or remsh) to each program in the configuration file, if one was specified. Then it starts one pthread for each processor on the machine, using the function that was given to EPI_Start. If pthread support is disabled (by omitting the USE_PTHREAD compiler token), then the function is called directly. In all cases, thread 1 on a machine is the one that is called directly, without using pthread_create.

Since thread 1 is launched directly, we don't want it to exit until all other threads have performed their tasks. Therefore, this thread waits for all other threads to finish before it is allowed to exit. This is done with a semaphore. Thread 1 waits for the semaphore once for each thread that was started using pthread_create. Each pthread thus started posts to that semaphore just before it exits.

Getting the number of processors available on each machine proved to be the only non-portable difference between the various architectures I tested. On the SGI machines, a simple system call to sysconf with an argument of _SC_NPROC_ONLN (number of online processors) directly returns the number of processors available. Under Linux it is a bit more complex. Processor information is stored in the file /proc/cpuinfo. There are several lines describing each processor, such as name, whether an FPU is present, etc. There is a group of these for each processor. So to find out how many processors there are on a machine, this file must be parsed. For the x86 architecture the number of occurrences of the word "processor" must be counted. For Linux on the DEC alpha, the number of occurrences of "Alpha" must be counted. Fortunately, UNIX supplies tools to make such parsing easy. I use "grep" to print the occurrences of the required name out of /proc/cpuinfo, and pipe the results into "wc" (word count) with an option to only count the number of lines. The result is one line of text that only contains one number. I can read this in as a file by launching the command with popen, open for reading. I can then read in the number and convert it to an integer with atoi.

Under the version of HP-UX on the machines in our computer lab, there is currently no support for threads, and none of them have more than one processor anyway, so I don't check for the number of processors. It just gets set to 1.

Ray Tracer

I modified the ray tracer I wrote for CS 555 (advanced graphics) to be more easily parallelizable. There are several arguments that can be used when launching the ray tracer (in addition to the EPI-specific arguments mentioned earlier):

-o <outputname>

Specify the output file name. Default is out.ppm. Currently only PPM (binary) format is available.

-v

Visualize the ray tracing job as it completes. Currently this is only supported on SGI machines, because it uses IRIS GL for rendering. If this option is desired, the ray tracer must be compiled with the -DUSE_IRISGL command-line option.

-single

Runs the ray tracer in single CPU mode. Thread 1 on machine 0 is the only one that contributes to the final image. This disables the default behavior of client/server with thread 1 on machine 0 being the server, and all other threads being clients.

-stat

Print statistics for each node after the image is generated and collected. This prints a list of how many pixels were drawn by each thread. In addition, a period is printed on the screen as the image is collected: one dot for each packet of results returned to the server.

-p <pixels_at_once>

This controls how many pixels are sent at a time, when a processor needs more work to do. Larger values tend to make a job complete more quickly. Default is 20.

-g <groups_at_start>

This controls how many groups of pixels are sent at the beginning, when a client first contacts the server. This is done in an effort to keep each processor busy even while transferring completed work back to the server. It can keep working on the other messages waiting for it. Default is 2.

When the ray tracer is run, a scene file must be given on the command line. I designed a simple scene file description, and a sample scene description file is included with this paper. I use the extension .esf for these files (Eric's Scene File). It can specify lights, spheres, and polygon files in .dat format. More information about .dat files can be found here.

Parallel Ray Tracing

Unless the "-single" option is given on the command line, the ray tracer operates in client/server mode. Thread 1 on host 0 becomes the server, and all other threads become clients. As soon as a client thread starts, it sends a message to the server declaring its existence. The server then sends a number of work packets to the client, as specified with the "-g" option above. Messages to send work consist of the following:

A message identifier representing a work message.
A number indicating how many pixels are in the message.
A list of integer values indicating the index into the image of each pixel to draw.

Since each machine has the name of the scene file, each machine loads the scene file before it can start working. A semaphore (through the EPI_BeginCritical and EPI_EndCritical API calls) is used to ensure that only one thread creates the scene. Once the scene is created, it is not further modified, and so it can be read by any of the threads without danger of collision.

Responses from clients to servers indicating work done consists of the following:

A message identifier indicating a results message.
A number indicating how many pixels are in the message.
A pair of integer values for each pixel, with the index into the image as the first integer, and the RGBA value of the pixel as the second.

The master reassembles the image as it is received from the clients. As soon as the image is completely received, the master sends a message to each client, telling them that the job is done and that they may now quit. The master then saves the image in the file specified.

Optimizations

As long as the master doesn't have any incoming messages to take care of immediately, it looks ahead and prepares extra work to be transmitted to clients.

There are many things that can be done to optimize the actual ray tracing process. The majority of the time is spent doing ray-object collision tests. There are two typical approaches to increasing performance: reducing the number of calculations, and reducing the complexity of calculations. I used a combination of techniques to reduce computation time.

A naive approach would be to test each ray against each object. However, shapes (polygons and spheres) are seldom very large, and should only be tested if the ray comes near. Bounding volumes can be placed around a group of objects to minimize the amount of calculation performed. If a ray misses a bounding volume, then there is no chance that any of the polygons or shapes bounded by that volume can be intersected by that ray. I implemented a bounding-sphere hierarchy to enclose groups of polygons because ray-sphere intersection tests are extremely fast. This scheme proved to decrease the amount of time required dramatically.

Another test that consumes a lot of time is a ray-polygon intersection. I decided to place an axis-aligned bounding box around each polygon, so that six simple checks can be made before of the more expensive ray-polygon intersection. Of course, if the object passes this test then the more expensive polygon-ray test must be performed, but most of the rays pass outside the boundaries of the polygon. Placing this bounding box proved to be very effective.

Another optimization I used is that I separated the intersection test from the lighting calculation. Once I find the closest object intersected along a ray, then I perform the lighting calculation on that point. This is much faster than the alternative, which is to calculate lighting information whenever a hit is made, even if that is not the closest hit. This throws a lot of uselessly-generated information away.

Results

I achieved some speedup using my parallel ray tracer over the serial version (with the "-single" option). The following tests were performed on a cluster of dual-processor Linux machines (times are in seconds):

Processors Time(s)
1 374.36 374.03
2 374.97 375.25
4 218.28 221.70
6 156.80 167.25
8 125.62 129.44
10 107.59 110.66
12 98.32 93.69
14 85.28 84.32 83.54 85.38 87.95
16 76.23 76.13 74.49

Processors	Time(s)
1	374.36	374.03
2	374.97	375.25
4	218.28	221.70
6	156.80	167.25
8	125.62	129.44
10	107.59	110.66
12	98.32	93.69
14	85.28	84.32	83.54	85.38	87.95
16	76.23	76.13	74.49

These times were created using the following arguments:

-p 50 -g 2 -stat

Better absolute times can be achieved by increasing the values for the "-p" and "-g" options. For example, on 4 CPUs with -p 100 and -g 4 I achieved a time of 161.24 seconds, on the same cluster, much faster than the above results. However, some problems were happening when too much was being sent, and I'm not sure why, though I suspect that too much information was being placed in the pipes on the master host between the communication process and the master thread (my pipes were getting clogged up). So I could not consistently test any other combination of values and still have it work.

A table of speedup results shows:

Processors Speedup
1 1.0000
2 0.9975
4 1.7143
6 2.3854
8 2.9775
10 3.4764
12 3.9922
14 4.4773
16 5.0212

Processors	Speedup
1	1.0000
2	0.9975
4	1.7143
6	2.3854
8	2.9775
10	3.4764
12	3.9922
14	4.4773
16	5.0212

Performance actually decreases slightly between 1 and 2 processors. This is expected, because there is still just one processor working on the actual generation of the scene in the case of 2 processors (the other one is the server). The cost of communication through pipes adds a little bit of time to the total.

These values represent a nearly linear speedup, though it's not nearly as fast as it could be. I suspect that the way I'm doing communcation is slowing things down considerably.

On the other hand, any processor added decreases the amount of time required to perform the ray tracing job. More processors means a faster time, even if new processors don't decrease it by much.

The best absolute time I achieved is 53.71 seconds, with 8 single-processor machines and 4 dual-processor machines working all at once. The biggest gain found was to exist when the master was on a multi-processor machine, because the communcation costs are so low. When this is the case, the other threads running on the same machine get about 3-4 times as much work to finish as any external threads, and so it completes more quickly.

Future Work

I would like to change the way communication is performed. Instead of using pipes, it would be more efficient to have a communication thread and just to pass a pointer to it with a message to be sent. Thus, overhead in switching processes is much reduced, and more time can be spent doing work instead of communicating.

For More Information

Anyone interested in UNIX programming must read a copy of [Stevens92]. It contains very detailed information about all aspects of programming with UNIX system calls. While it is a bit old, it is still full of very useful information. It doesn't deal much with sockets, and not at all with POSIX threads. For threading information see [Lewis98] or [Wagner95]. For socket information, [Radin91] is fairly complete, and its information is applicable to more than just IRIX. [Hall95] is also quite good, and offers practical details missing in other references. Other useful information can be found in [Comer95], [Comer96], [Leffler89], and [Shay99]. [Pipes] is a really good reference about pipes.

[Foley96] is often considered the best graphics reference available. It contains lots of information about ray tracing. [Glassner90] and [Kirk92], along with the other Graphics Gems books, have lots of information about optimizing ray tracing.

References

[Comer95] Douglas E. Comer. Internetworking with TCP/IP, Volume I (Third Edition). Prentice Hall, 1995.

[Comer96] Douglas E. Comer, David L. Stevens. Internetworking with TCP/IP, Volume III (Second Edition). Prentice Hall, 1996.

[Foley96] James D. Foley, Andries van Dam, Steven K. Feiner, John F. Hughes. Computer Graphics: Principles and Practice. Addison-Wesley, 1996 edition.

[Glassner90] Andrew Glassner, editor. Graphics Gems. Academic Press, 1990.

[Hall95] Brian Hall. Beej's Guide to Network Programming. Internet: http://www.ecst.csuchico.edu/~beej/guide/net. 1995.

[Heirich98] Alan Heirich, James Arvo. A Competitive Analysis of Load Balancing Strategies for Parallel Ray Tracing. Found in Supercomputing 1998.

[Kirk92] David Kirk, editor. Graphics Gems, volume III. Academic Press, 1992.

[Leffler89] Samuel J. Leffler, Marshall Kirk McKusick, Michael J. Karels, John S. Quarterman. The Design and Implementation of the 4.3BSD UNIX Operating System. Addison-Wesley, 1989.

[Lewis98] Bil Lewis, Daniel J. Berg. Multithreaded Programming with Pthreads. Prentice Hall, 1998.

[Pipes] Pipes. Internet: http://www.cs.adfa.oz.au/teaching/studinfo/csa2/OSLabNotes/node9.html.

[Radin91] Judith Radin. IRIX Network Programming Guide. Silicon Graphics, Inc, 1991.

[Shay99] William A. Shay. Understanding Data Communications & Networks. Brooks/Cole Publishing Company, 1999.

[Stevens92] W. Richard Stevens. Advanced Programming in the UNIX Environment. Addison-Wesley, 1992.

[Wagner95] Tom Wagner, Don Towsley. Getting Started with POSIX Threads. Internet: http://centaurus.cs.umass.edu/~wagner/threads_html/tutorial.html. 1995.