Unfortunately, general-purpose languages (including Java) don't natively support inter-processor communication, and the common way to correct this shortcoming is to provide language extensions or non-portable libraries. However, we believe that threads are a good abstraction for distributed computation; while other mechanisms may improve performance, they are most useful as optimizations and may often be automated. Of course, performance is the whole point of parallel processing, and properly so. Our goal is simply to show that basic threads are enough to build distributed applications.
From a program organization standpoint, there are two main paradigms in distributed computing: shared memory and message passing. Each method seems more natural for particular problems. Shared memory is very useful for recording the current search space in discrete optimization problems, and message passing is useful for multiple nodes to communicate values to neighbors in finite element problems. We will evaluate how threaded computations that use one or the other method can be transformed for execution on separate processors.
The next section discusses background and related work. Section 3 shows how threads that pass messages to one another can be translated to standard MPI programs. Section 4 shows how threads that share memory can be translated. Section 5 offers some final thoughts and honest opinions of this work.
Titanium [Yelick98] extends Java for parallel programming. It is built on an SPMD model of parallelism, where processes synchronize at programmer-specified points, and it requires its own compiler. HPJava is a similar project that also extends Java to support SPMD parallelism. In contrast, the mechanisms in this paper should apply to threaded programs in general, and our project works with any program that the standard Java compiler accepts.
We realize that introducing the DistributedThread class below works against our goal of avoiding extensions to the Java language. We justify the addition for two reasons. First, it is a simple class written wholly in standard Java. Second, to communicate at all, threads need access to some basic facilities, such as the number of parallel threads running and the current thread's own identifying number. Note that the transformation in section 4 (for shared-memory threads) requires no external class of this kind.
public abstract class DistributedThread implements Runnable {
    // the number of threads working together
    protected int count();

    // this thread's identifying number (0 <= id() < count())
    protected int id();

    // sends a data array to the recipient thread
    protected synchronized void send(boolean[] buffer, int recipient);
    protected synchronized void send(char[] buffer, int recipient);
    ... // one for each primitive type

    // receives a data array from the sender thread
    protected synchronized void receive(boolean[] buffer, int sender);
    protected synchronized void receive(char[] buffer, int sender);
    ... // one for each primitive type
}

Note that DistributedThread contains no MPI communication procedures more advanced than send and receive. We simply wish to show that this type of transformation is possible. The more advanced MPI functions can be implemented from the ideas in this paper without much more work.
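As a concrete illustration, here is a minimal sketch of a message-passing thread written against this interface. The class name RingPass and the token value are ours, and we assume the DistributedThread methods are implemented by the runtime (the listing above shows only their signatures):

public class RingPass extends DistributedThread {
    public void run() {
        int[] token = new int[1];
        if (id() == 0) {
            token[0] = 42;                      // thread 0 creates the token
            send(token, 1 % count());           // pass it to the next thread
            receive(token, count() - 1);        // and wait for it to come back around
        } else {
            receive(token, id() - 1);           // wait for the previous thread's token
            send(token, (id() + 1) % count());  // forward it around the ring
        }
    }
}

Each thread passes an integer token once around a ring of count() threads, using nothing beyond the procedures declared above.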
When running on a single processor, threads written with DistributedThread calls can communicate with one another, and of course this happens without any complicated networking.
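For example, since DistributedThread implements Runnable, a single-processor run can use ordinary java.lang.Thread objects; the thread count of four below is arbitrary, and how id() and count() are assigned is left to the DistributedThread implementation:

for (int i = 0; i < 4; i++) {
    new Thread(new RingPass()).start();   // each RingPass sees its own id() and the shared count()
}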
When the user wishes to run on more than one processor, a simple program transformation makes it possible. In essence, each procedure in DistributedThread has a primitive MPIJ counterpart: id(), count(), send(), and receive() all map directly to MPIJ calls, and Thread-Slinger simply converts every call to the former into the corresponding MPIJ call.
Note that we do not take advantage of some MPI features such as the tags and the status variables. These, too, are features that we could implement without much trouble using the ideas presented here.
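To give a flavor of the translation, the sketch below shows roughly what the four procedures look like after conversion, written against the mpiJava-style bindings (package mpi) and shown for the int[] overloads; the exact class and method names MPIJ uses, and the fixed tag of 0, are our assumptions rather than Thread-Slinger's actual output:

import mpi.*;

class TranslatedCalls {
    // send(buffer, recipient) becomes, roughly:
    static void send(int[] buffer, int recipient) throws MPIException {
        MPI.COMM_WORLD.Send(buffer, 0, buffer.length, MPI.INT, recipient, 0);
    }

    // receive(buffer, sender) becomes, roughly:
    static void receive(int[] buffer, int sender) throws MPIException {
        MPI.COMM_WORLD.Recv(buffer, 0, buffer.length, MPI.INT, sender, 0);
    }

    // id() and count() become, roughly:
    static int id() throws MPIException    { return MPI.COMM_WORLD.Rank(); }
    static int count() throws MPIException { return MPI.COMM_WORLD.Size(); }
}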
These four translations do the bulk of the work for threads that use message passing. In addition, a few other transformations must be made to initialize the process, make the MPI variables and functions available, and so on: the extends DistributedThread clause in each class declaration is rewritten, the necessary startup and shutdown calls are inserted, and every instantiation of the thread class is replaced by a remotely running process.
Note the last conversion: it assumes that every instance of the thread class can be replaced by a remotely running process. This does not hold for programs that dynamically create threads or that rely on some property of the order in which threads are created, so we cannot transform such programs. We think this is a reasonable constraint, especially since the current version of MPIJ only supports a fixed number of processes at any one time.
After all these transformations, the user has source code that will run successfully on remote machines, passing messages to accomplish its task. Running these programs in parallel requires the DOGMA system, which we do not describe here; the interested reader may find more information at [web1].
The general approach here is to create and run a new thread that maintains all the shared memory. In object-oriented terms, variables that are shared among objects of a class are labelled static; the compiler then allocates only one location for that variable, accessible to all objects of that class.
Therefore, we create a new thread, called MemoryDaemon, to encapsulate all the static variables. Its only purpose is to serve read and write requests from all the other threads, which do the computation. It sits in a loop, responding to memory accesses, until all the threads signal that they are finished with their computations.
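Conceptually, the daemon's run() method is just a request-service loop. The sketch below shows that shape only; the class name MemoryDaemonSketch and the abstract helpers stand in for the MPI messaging and request decoding described in the following sections, and are ours rather than the generated code:

abstract class MemoryDaemonSketch implements Runnable {
    private final int computationThreads;        // how many threads can send requests

    MemoryDaemonSketch(int computationThreads) {
        this.computationThreads = computationThreads;
    }

    // stand-ins for the MPI transport and request decoding
    protected abstract int[] receiveRequest();            // blocks until a request arrives
    protected abstract boolean isDone(int[] request);     // "I have finished" signal
    protected abstract boolean isRead(int[] request);
    protected abstract void serveRead(int[] request);     // send the requested value back
    protected abstract void serveWrite(int[] request);    // receive and store the new value

    public void run() {
        int finished = 0;
        while (finished < computationThreads) {
            int[] request = receiveRequest();
            if (isDone(request)) {
                finished++;                // one more computation thread is done
            } else if (isRead(request)) {
                serveRead(request);
            } else {
                serveWrite(request);
            }
        }
    }
}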
Before discussing the mechanics of this transformation, we wish to address the obvious overhead incurred by programs that use something like MemoryDaemon. When done naively, every memory access instead becomes two communications across the network (the request and the response with data). If the network is the internet, even a simple program that prints array items may take seconds or minutes to execute. This is a problem we attempt to address below (see sections 4-2 and 4-3).
However, these problems don't detract from the fact that threads can be compiled to run on remote processors and still carry out their intended function. It is possible that the basic mechanisms described here will never improve performance beyond a certain threshold, but that is an issue for another paper.
To distribute shared-memory threads, most of the work is done in deriving the MemoryDaemon class. It must receive and serve memory requests correctly. We first explain how to generate MemoryDaemon, and then we describe how to generate the threads for the computation.
One of the biggest problems in creating the MemoryDaemon is that MPI message data must be in an array of one of the primitive types. We use messages of type integer to signal memory requests, and since the actual data may have a different type, we have to send a second message containing it. This doesn't affect read requests, which still take two messages (one to request and one to receive). But write requests, which usually take just one message, now take two: the node must request the write and then send the data for the write separately.
Each request the MemoryDaemon receives is an array of integers encoding which operation is wanted, which variable it applies to, and, for array variables, the offset into each dimension; a request may therefore contain zero or more dimension offsets.
The following declarations describe how MemoryDaemon is generated from the static variables; for each static variable X of type T, a pair of accessor functions is produced:

private class MemoryDaemon implements Runnable

private T read_X() throws MPIException
private void write_X(int dim0.., T newvalue) throws MPIException
As mentioned earlier, each function initiates two communications: one to request the memory operation, and the other to receive the old value or send the new value.
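For instance, for a hypothetical shared variable static double[] X, the generated pair, seen from the requesting thread's side, might look roughly like the sketch below, under the same mpiJava-style assumptions as the earlier sketch. The operation codes READ_X and WRITE_X, the request layout, and the daemon's rank memoryNode are our assumptions about one plausible encoding (TOMEMORY_TAG is discussed below):

// read one element of the shared array X: request it, then receive the value
private double read_X(int dim0) throws MPIException {
    int[] request = { READ_X, dim0 };            // operation code plus the dimension offset
    MPI.COMM_WORLD.Send(request, 0, request.length, MPI.INT, memoryNode, TOMEMORY_TAG);
    double[] reply = new double[1];
    MPI.COMM_WORLD.Recv(reply, 0, 1, MPI.DOUBLE, memoryNode, TOMEMORY_TAG);
    return reply[0];
}

// write one element of the shared array X: request it, then send the new value
private void write_X(int dim0, double newvalue) throws MPIException {
    int[] request = { WRITE_X, dim0 };
    MPI.COMM_WORLD.Send(request, 0, request.length, MPI.INT, memoryNode, TOMEMORY_TAG);
    double[] data = { newvalue };
    MPI.COMM_WORLD.Send(data, 0, 1, MPI.DOUBLE, memoryNode, TOMEMORY_TAG);
}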
Finally, we must initialize and finalize things within the main procedure: public static void main(String[] args) is transformed so that it starts up MPIJ, launches the MemoryDaemon alongside the computation threads, and shuts MPIJ down once the computation is complete.
Note that the MemoryDaemon is run on the last node. Also note the TOMEMORY_TAG, which is used to distinguish messages intended for the MemoryDaemon from those intended for a normal thread that may also be running on the last node.
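To make the shape of the transformed main concrete, here is a rough sketch under the same mpiJava-style naming assumptions as before; the Worker class and the decision to start the daemon in its own java.lang.Thread are ours, not Thread-Slinger's actual output:

public static void main(String[] args) throws MPIException {
    MPI.Init(args);                                // bring up the MPI runtime
    int rank = MPI.COMM_WORLD.Rank();
    int size = MPI.COMM_WORLD.Size();

    if (rank == size - 1) {
        // the last node hosts the MemoryDaemon; requests reach it via TOMEMORY_TAG
        new Thread(new MemoryDaemon()).start();
    }
    new Worker(rank).run();                        // every node also runs a computation thread

    // a real program would wait for all threads to finish before finalizing
    MPI.Finalize();                                // shut the MPI runtime down
}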
One limitation is that a thread cannot directly read a shared array's length. This can be remedied in the future by special-purpose messages to the MemoryDaemon.
These limitations place restrictions on the programmer, but we see no compelling reasons against distributing threaded programs in the manner we describe. In fact, most of these limitations exist simply by the nature of distributed computation. Programmers who understand these constraints will organize their programs to minimize the communication caused by accessing shared memory or passing messages.
synchronized and accessible to the regular computation thread, which are simple modifications.
[Aho88] Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988
[Haines94] Haines, Cronk, and Mehrotra, "On the Design of Chant: A Talking Threads Package", Proceedings of Supercomputing 94, November 1994, http://computer.org/conferen/sc94/hainesm/hainesm.html
[web1] DOGMA homepage http://zodiac.cs.byu.edu/DOGMA/
[web2] Java Compiler Compiler - The Java Parser Generator http://www.suntest.com/JavaCC/
[Yelick98] Yelick, Semenzato, Pike, Miyamoto, Liblit, Krishnamurthy, Hilfinger, Graham, Gay, Colella, Aiken, "Titanium: A High-Performance Java Dialect," ACM 1998 Workshop on Java for High-Performance Network Computing, Stanford, California, February 1998.