Unfortunately, general-purpose languages (including Java) don't natively support inter-processor communication, and the common way to correct this shortcoming is to provide language extensions or non-portable libraries. However, we believe that threads are a good abstraction for distributed computation; while other mechanisms may improve performance, they are most useful as optimizations and may often be automated. Of course, performance is the whole point of parallel processing, and properly so. Our goal is simply to show that basic threads are enough to build distributed applications.
From a program organization standpoint, there are two main paradigms in distributed computing: shared memory and message passing. Each method seems more natural for particular problems. Shared memory is very useful for recording the current search space in discrete optimization problems, and message passing is useful for multiple nodes to communicate values to neighbors in finite element problems. We will evaluate how threaded computations that use one or the other method can be transformed for execution on separate processors.
The next section discusses background and related work. Section 3 shows how threads that pass messages to one another can be translated to standard MPI programs. Section 4 shows how threads that share memory can be translated. Section 5 offers some final thoughts and honest opinions of this work.
Titanium [Yelick98] extends Java for parallel programming. It is built on an SPMD model of parallelism, where processes synchronize at programmer-specified points, and it requires its own compiler. HPJava is a similar project that also extends Java to support SPMD parallelism. In contrast, the mechanisms in this paper should apply to threaded programs in general, and our project works with any program that the standard Java compiler accepts.
We realize that introducing the DistributedThread class below works against our goal of avoiding extensions to the Java language. We justify the addition for two reasons. First, it is a simple class written wholly in standard Java. Second, to communicate at all, threads need access to some basic facilities, such as the number of parallel threads running and the current thread's own identifying number. Note that the transformation in section 4 (for shared-memory threads) requires no external class of this kind.
public abstract class DistributedThread implements Runnable {
    // the number of threads working together
    protected int count();

    // this thread's identifying number (0 <= id() < count())
    protected int id();

    // sends a data array to the recipient thread
    protected synchronized void send(boolean[] buffer, int recipient);
    protected synchronized void send(char[] buffer, int recipient);
    ... // one for each primitive type

    // receives a data array from the sender thread
    protected synchronized void receive(boolean[] buffer, int sender);
    protected synchronized void receive(char[] buffer, int sender);
    ... // one for each primitive type
}

Note that DistributedThread contains no MPI communication procedures more advanced than send and receive. We simply wish to show that this type of transformation is possible. The more advanced MPI functions can be implemented from the ideas in this paper without much more work.
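As a concrete illustration, here is a minimal sketch of a message-passing thread written against this interface. The class name RingPass and the token value are ours, and we assume the DistributedThread methods are implemented by the runtime (the listing above shows only their signatures):

public class RingPass extends DistributedThread {
    public void run() {
        int[] token = new int[1];
        if (id() == 0) {
            token[0] = 42;                      // thread 0 creates the token
            send(token, 1 % count());           // pass it to the next thread
            receive(token, count() - 1);        // and wait for it to come back around
        } else {
            receive(token, id() - 1);           // wait for the previous thread's token
            send(token, (id() + 1) % count());  // forward it around the ring
        }
    }
}

Each thread passes an integer token once around a ring of count() threads, using nothing beyond the procedures declared above.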
When running on a single processor, threads written with DistributedThread calls can communicate with one another, and of course this happens without any complicated networking.
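For example, since DistributedThread implements Runnable, a single-processor run can use ordinary java.lang.Thread objects; the thread count of four below is arbitrary, and how id() and count() are assigned is left to the DistributedThread implementation:

for (int i = 0; i < 4; i++) {
    new Thread(new RingPass()).start();   // each RingPass sees its own id() and the shared count()
}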
When the user wishes to run on more than one processor, a simple program transformation makes it possible. In essence, each procedure in DistributedThread has a primitive MPIJ counterpart: id(), count(), send(), and receive() all map directly to MPIJ calls, and Thread-Slinger simply converts every call to the former into the corresponding MPIJ call.
Note that we do not take advantage of some MPI features such as the tags and the status variables. These, too, are features that we could implement without much trouble using the ideas presented here.
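To give a flavor of the translation, the sketch below shows roughly what the four procedures look like after conversion, written against the mpiJava-style bindings (package mpi) and shown for the int[] overloads; the exact class and method names MPIJ uses, and the fixed tag of 0, are our assumptions rather than Thread-Slinger's actual output:

import mpi.*;

class TranslatedCalls {
    // send(buffer, recipient) becomes, roughly:
    static void send(int[] buffer, int recipient) throws MPIException {
        MPI.COMM_WORLD.Send(buffer, 0, buffer.length, MPI.INT, recipient, 0);
    }

    // receive(buffer, sender) becomes, roughly:
    static void receive(int[] buffer, int sender) throws MPIException {
        MPI.COMM_WORLD.Recv(buffer, 0, buffer.length, MPI.INT, sender, 0);
    }

    // id() and count() become, roughly:
    static int id() throws MPIException    { return MPI.COMM_WORLD.Rank(); }
    static int count() throws MPIException { return MPI.COMM_WORLD.Size(); }
}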
These four translations do the bulk of the work for threads that use message passing. In addition, a few other transformations must be made to initialize the process, make the MPI variables and functions available, and so on: the extends DistributedThread clause in each class declaration is rewritten, the necessary startup and shutdown calls are inserted, and every instantiation of the thread class is replaced by a remotely running process.
Note the last conversion: it assumes that every instance of the thread class can be replaced by a remotely running process. This does not hold for programs that dynamically create threads or that rely on some property of the order in which threads are created, so we cannot transform such programs. We think this is a reasonable constraint, especially since the current version of MPIJ only supports a fixed number of processes at any one time.
After all these transformations, the user has source code that will run successfully on remote machines, passing messages to accomplish its task. Running these programs in parallel requires the DOGMA system, which we do not describe here; the interested reader may find more information at [web1].
The general approach here is to create and run a new thread that maintains all the shared memory. In object-oriented terms, variables that are shared among objects of a class are labelled static; the compiler then allocates only one location for that variable, accessible to all objects of that class.
Therefore, we create a new thread, called MemoryDaemon, to encapsulate all the static variables. Its only purpose is to serve read and write requests from all the other threads, which do the computation. It sits in a loop, responding to memory accesses, until all the threads signal that they are finished with their computations.
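Conceptually, the daemon's run() method is just a request-service loop. The sketch below shows that shape only; the class name MemoryDaemonSketch and the abstract helpers stand in for the MPI messaging and request decoding described in the following sections, and are ours rather than the generated code:

abstract class MemoryDaemonSketch implements Runnable {
    private final int computationThreads;        // how many threads can send requests

    MemoryDaemonSketch(int computationThreads) {
        this.computationThreads = computationThreads;
    }

    // stand-ins for the MPI transport and request decoding
    protected abstract int[] receiveRequest();            // blocks until a request arrives
    protected abstract boolean isDone(int[] request);     // "I have finished" signal
    protected abstract boolean isRead(int[] request);
    protected abstract void serveRead(int[] request);     // send the requested value back
    protected abstract void serveWrite(int[] request);    // receive and store the new value

    public void run() {
        int finished = 0;
        while (finished < computationThreads) {
            int[] request = receiveRequest();
            if (isDone(request)) {
                finished++;                // one more computation thread is done
            } else if (isRead(request)) {
                serveRead(request);
            } else {
                serveWrite(request);
            }
        }
    }
}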
Before discussing the mechanics of this transformation, we wish to address the obvious overhead incurred by programs that use something like MemoryDaemon. When done naively, every memory access instead becomes two communications across the network (the request and the response with data). If the network is the internet, even a simple program that prints array items may take seconds or minutes to execute. This is a problem we attempt to address below (see sections 4-2 and 4-3).
However, these problems don't detract from the fact that threads can be compiled to run on remote processors and still carry out their intended function. It is possible that the basic mechanisms described here will never improve performance beyond a certain threshold, but that is an issue for another paper.
To distribute shared-memory threads, most of the work is done in deriving the MemoryDaemon class. It must receive and serve memory requests correctly. We first explain how to generate MemoryDaemon, and then we describe how to generate the threads for the computation.
One of the biggest problems in creating the MemoryDaemon is that MPI message data must be in an array of one of the primitive types. We use messages of type integer to signal memory requests, and since the actual data may have a different type, we have to send a second message containing it. This doesn't affect read requests, which still take two messages (one to request and one to receive). But write requests, which usually take just one message, now take two: the node must request the write and then send the data for the write separately.
Each request the MemoryDaemon receives is an array of integers encoding which operation is wanted, which variable it applies to, and, for array variables, the offset into each dimension; a request may therefore contain zero or more dimension offsets.
The following declarations describe how MemoryDaemon is generated from the static variables; for each static variable X of type T, a pair of accessor functions is produced:

private class MemoryDaemon implements Runnable

private T read_X() throws MPIException
private void write_X(int dim0.., T newvalue) throws MPIException
As mentioned earlier, each function initiates two communications: one to request the memory operation, and the other to receive the old value or send the new value.
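For instance, for a hypothetical shared variable static double[] X, the generated pair, seen from the requesting thread's side, might look roughly like the sketch below, under the same mpiJava-style assumptions as the earlier sketch. The operation codes READ_X and WRITE_X, the request layout, and the daemon's rank memoryNode are our assumptions about one plausible encoding (TOMEMORY_TAG is discussed below):

// read one element of the shared array X: request it, then receive the value
private double read_X(int dim0) throws MPIException {
    int[] request = { READ_X, dim0 };            // operation code plus the dimension offset
    MPI.COMM_WORLD.Send(request, 0, request.length, MPI.INT, memoryNode, TOMEMORY_TAG);
    double[] reply = new double[1];
    MPI.COMM_WORLD.Recv(reply, 0, 1, MPI.DOUBLE, memoryNode, TOMEMORY_TAG);
    return reply[0];
}

// write one element of the shared array X: request it, then send the new value
private void write_X(int dim0, double newvalue) throws MPIException {
    int[] request = { WRITE_X, dim0 };
    MPI.COMM_WORLD.Send(request, 0, request.length, MPI.INT, memoryNode, TOMEMORY_TAG);
    double[] data = { newvalue };
    MPI.COMM_WORLD.Send(data, 0, 1, MPI.DOUBLE, memoryNode, TOMEMORY_TAG);
}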
Finally, we must initialize and finalize things within the main procedure: public static void main(String[] args) is transformed so that it starts up MPIJ, launches the MemoryDaemon alongside the computation threads, and shuts MPIJ down once the computation is complete.
Note that the MemoryDaemon is run on the last node. Also note the TOMEMORY_TAG, which is used to distinguish messages intended for the MemoryDaemon from those intended for a normal thread that may also be running on the last node.
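To make the shape of the transformed main concrete, here is a rough sketch under the same mpiJava-style naming assumptions as before; the Worker class and the decision to start the daemon in its own java.lang.Thread are ours, not Thread-Slinger's actual output:

public static void main(String[] args) throws MPIException {
    MPI.Init(args);                                // bring up the MPI runtime
    int rank = MPI.COMM_WORLD.Rank();
    int size = MPI.COMM_WORLD.Size();

    if (rank == size - 1) {
        // the last node hosts the MemoryDaemon; requests reach it via TOMEMORY_TAG
        new Thread(new MemoryDaemon()).start();
    }
    new Worker(rank).run();                        // every node also runs a computation thread

    // a real program would wait for all threads to finish before finalizing
    MPI.Finalize();                                // shut the MPI runtime down
}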
One limitation is that a thread cannot directly read a shared array's length. This can be remedied in the future by special-purpose messages to the MemoryDaemon.
These limitations place restrictions on the programmer, but we see no compelling reasons against distributing threaded programs in the manner we describe. In fact, most of these limitations exist simply by the nature of distributed computation. Programmers who understand these constraints will organize their programs to minimize the communication caused by accessing shared memory or passing messages.
synchronized and accessible to the regular computation thread, which are simple modifications.
[Aho88] Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988
[Haines94] Haines, Cronk, and Mehrotra, "On the Design of Chant: A Talking Threads Package", Proceedings of Supercomputing 94, November 1994, http://computer.org/conferen/sc94/hainesm/hainesm.html
[web1] DOGMA homepage http://zodiac.cs.byu.edu/DOGMA/
[web2] Java Compiler Compiler - The Java Parser Generator http://www.suntest.com/JavaCC/
[Yelick98] Yelick, Semenzato, Pike, Miyamoto, Liblit, Krishnamurthy, Hilfinger, Graham, Gay, Colella, Aiken, "Titanium: A High-Performance Java Dialect," ACM 1998 Workshop on Java for High-Performance Network Computing, Stanford, California, February 1998.