Parallel Programming with Threaded Programs

An argument for threads as distributed programming primitives

Trent Larson

December 18, 1998

Abstract

The native Java language supports threads and provides monitors for writing thread-safe programs. This paper explores how threaded programs can be converted to parallel programs without user intervention. In particular, we show how to convert threaded programs, written in standard Java for a standard JVM on one machine, into programs that run on multiple, heterogeneous machines over the internet using standard MPI calls. We examine the issues of message passing and shared memory and how each can be optimized for speed.

  1. Introduction

    Modern, general-purpose languages support multiple threads of execution, either natively or in libraries. The most recent addition to this family, Java, gives threads first-class status as objects and makes it simple to create and run threads. This alone is sufficient for a programmer to write a distributed program.

    Unfortunately, general-purpose languages (including Java) don't natively support inter-processor communication, and the common way to correct this shortcoming is to provide language extensions or non-portable libraries. However, we believe that threads are a good abstraction for distributed computation; while other mechanisms may improve performance, they are most useful as optimizations and may often be automated. Of course, performance is the whole point of parallel processing, and rightly so. Our goal is simply to show that basic threads are enough to build distributed applications.

    From a program organization standpoint, there are two main paradigms in distributed computing: shared memory and message passing. Each method seems more natural for particular problems. Shared memory is very useful for recording the current search space in discrete optimization problems, and message passing is useful for multiple nodes to communicate values to neighbors in finite element problems. We will evaluate how threaded computations that use one or the other method can be transformed for execution on separate processors.

    The next section discusses background and related work. Section 3 shows how threads that pass messages to one another can be translated to standard MPI programs. Section 4 shows how threads that share memory can be translated. Section 5 offers some final thoughts and honest opinions of this work.

  2. Background and Related Work

    In this section, we present the basic ideas behind our work and the tools we use for our implementation, called Thread-Slinger. We also discuss some related work currently going on and contrast it with our approach.

    1. Background

      1. Language

        For communicating processes, users have a choice of languages and standards. We chose to work with the current language fad, Java, because it is portable in both source- and byte-code forms. It also has language mechanisms to support threaded programming.

      2. Communication standard

        Our choice of an underlying communication standard is somewhat arbitrary. We want to send arbitrary messages around on the internet, so we wanted message passing that was as simple as possible. We like MPI because of its single procedure calls for sending and receiving messages (unlike PVM). We must also admit that, having done other projects using MPI, we are familiar with its tools. There are also many Java MPI implementations to choose from; we selected MPIJ [web1] because it is similar to other MPI implementations and it comes with the run-time environment DOGMA, which we describe next.

      3. Runtime tools

        Running Java programs on more than one machine takes special tools. DOGMA [web1] is a "distributed metacomputing architecture" that runs Java programs on any node connected by a network, including any internet machine that runs a Java-compliant browser. It is a fairly stable environment, and its author plans to extend it to run using only applet viewers, which is of great interest to us and would be very beneficial to this work.

      4. Program transformations

        This paper describes how threaded programs can be easily transformed into distributed MPI programs. We modify a program's source code in two steps:

        • First, we read the program into an abstract syntax tree representation. We chose the Java tools JavaCC and JJTree [web2] to do this step. These tools use a grammar file (similar to lex and yacc) to create a parser and generate an abstract syntax tree based on the source. (A sketch of how the generated parser might be driven appears after this list.)

        • Next, we write out the source with certain substitutions, thereby generating another valid Java source file. We had to do this part ourselves, since the tools we use cannot reconstruct the original source. The real work comes in changing the source code; this took some effort to ensure type correctness as well as syntactic correctness. We won't provide any more details on this process; we recommend the interested reader consult [Aho88].
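
        The following is a minimal sketch of how the parsing step might be driven. It assumes a JJTree-decorated Java grammar in which PARSER_BEGIN names the generated parser JavaParser and the start production CompilationUnit() is written to return the root node; these names are illustrative and depend entirely on the grammar file, while SimpleNode and dump() are the node class and helper that JJTree generates.

        import java.io.FileInputStream;

        // Sketch: parse a Java source file and print its abstract syntax tree.
        // JavaParser and SimpleNode are generated by JavaCC/JJTree from the grammar.
        public class ParseExample {
          public static void main(String[] args) throws Exception {
            JavaParser parser = new JavaParser(new FileInputStream(args[0]));
            SimpleNode root = parser.CompilationUnit();  // root of the syntax tree
            root.dump("");                               // JJTree's built-in tree printer
          }
        }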

    2. Related work

      We haven't found much attention to threads as a basic element of parallel processing. In fact, in 1994, [Haines94] asserted that "thread packages for distributed memory systems have received little attention". The authors describe Chant, a system that extends threads to allow point-to-point communication as well as remote procedure calls. Our goal was to build a simpler system that allows any threaded program to be executed on remote systems with little or no extension to the existing thread tools.

      Titanium [Yelick98] extends Java for parallel programming. It is built on an SPMD model of parallelism, in which processes synchronize at programmer-specified points, and it requires its own compiler. HPJava is a similar project that also extends Java to support SPMD parallelism. In contrast, the mechanisms in this paper should apply to threaded programs in general, and our project works with any program that the standard Java compiler accepts.

  3. Message-Passing Threads

    This section discusses how to translate message-passing threads into processes that use MPI calls. Unfortunately, Java threads have no means of directly passing a message to another thread; to allow communication, one must create a shared variable through which the threads share data. We believe that the ability to pass messages directly to other threads is a useful tool, so we've added a class to do just that, called DistributedThread.

    We realize that this violates our goal to avoid extending the Java language. We justify this addition for two reasons. First, it is a simple class written wholly in standard Java. Second, in order to have any kind of communication, threads need access to some basic functions such as the number of parallel threads running and the current thread's own identifying number. Note that the transformation in section 4 (for shared-memory threads) requires no external class such as this.

    
    public abstract class DistributedThread implements Runnable {
    
      // the number of threads working together 
      protected int count();
    
      // this thread's identifying number (0 <= id() < count())
      protected int id();
    
      // sends a data array to the recipient thread
      protected synchronized void send(boolean[] buffer, int recipient);
      protected synchronized void send(char[]    buffer, int recipient);
      ...  // one for each primitive type
    
      // receives a data array from the sender thread
      protected synchronized void receive(boolean[] buffer, int sender);
      protected synchronized void receive(char[]    buffer, int sender);
      ...  // one for each primitive type
    }
    
    
    Note that DistributedThread contains no MPI communication procedures more advanced than send and receive. We simply wish to show that this type of transformation is possible. We can implement the more advanced MPI functions based on the ideas in this paper without much more work.

    When running on a single processor, a thread that uses the DistributedThread calls can communicate with any other thread; of course, this happens without any complicated networking.
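
    As an illustration (this example is ours, not part of Thread-Slinger), a minimal pipeline can be built with nothing but the calls above. The sketch below assumes the int[] overloads of send and receive implied by the listing; each thread waits for a token from its left neighbour, increments it, and forwards it to the right.

    // Hypothetical example: pass a token down a pipeline of threads.
    public class PipelineThread extends DistributedThread {
      public void run() {
        int[] token = new int[1];
        if (id() > 0)
          receive(token, id() - 1);       // wait for the left neighbour
        token[0]++;                       // do some "work" on the token
        if (id() < count() - 1)
          send(token, id() + 1);          // forward to the right neighbour
        else
          System.out.println("final value: " + token[0]);
      }
    }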

    When the user wishes to run on more than one processor, a simple program transformation makes it possible. In essence, there are primitive MPI calls for each of the procedures in DistributedThread, so those calls are replaced by their MPIJ counterparts. Here are the MPIJ versions of our procedures:

    DistributedThread code        MPIJ code
    id()                          MPI.COMM_WORLD.rank()
    count()                       MPI.COMM_WORLD.size()
    send(message, recipient);     MPI.COMM_WORLD.send(message, MPI.TYPE, recipient, 0);
    receive(message, sender);     MPI.COMM_WORLD.recv(message, MPI.TYPE, sender, MPI.ANY_TAG, status);
    Simply put, Thread-Slinger converts every function call on the left to the corresponding function call on the right.
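
    For example, a fragment that forwards a value to the next thread would be rewritten roughly as follows (a sketch under the signatures in the table above; MPI.TYPE becomes the concrete type constant, here MPI.INT):

    // Before the transformation (threaded source):
    int[] data = new int[1];
    data[0] = id();
    send(data, (id() + 1) % count());

    // After the transformation (MPIJ source, sketch):
    int[] data = new int[1];
    data[0] = MPI.COMM_WORLD.rank();
    MPI.COMM_WORLD.send(data, MPI.INT,
        (MPI.COMM_WORLD.rank() + 1) % MPI.COMM_WORLD.size(), 0);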

    Note that we do not take advantage of some MPI features such as the tags and the status variables. These, too, are features that we could implement without much trouble using the ideas presented here.

    These four translations do the bulk of the work for threads that use message passing. In addition, a few other function calls need to be added to initialize the process, make the MPI variables and functions available, and so on. The following table shows all the remaining transformations that need to take place:

    Old code or location                       MPIJ code
    extends DistributedThread                  extends MPIApplication
    public void run()                          public void run() throws MPIException
    beginning of run()                         Status status;
    public static void main(String[] args)     public void MPIMain(String[] args)
    beginning of main()                        try {
                                                 MPI.init();
    end of main()                                run();
                                                 MPI.finalize();
                                               } catch (Exception e) {}
    within main()                              remove code that creates and runs threads of this type

    Note the last conversion. This is based on the assumption that all instances of this thread class can be replaced by remotely running processes. This does not work for programs that dynamically create threads or that rely on some property of an ordered creation of threads, so we cannot transform such programs. We think this is a reasonable constraint, especially since the current version of MPIJ only supports a fixed number of processes at any one time.
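
    Putting the table together, the main() transformation might look roughly like this (a sketch only; MPIApplication, MPIMain, MPI.init, and MPI.finalize come from the table above, while the thread-creation loop and the worker class name are hypothetical):

    // Before: standard threaded main(), creating the worker threads itself
    public static void main(String[] args) {
      for (int i = 0; i < 4; i++)
        new Thread(new PipelineThread()).start();   // removed by the transformation
    }

    // After (sketch): each MPI process simply runs the thread body itself
    public void MPIMain(String[] args) {
      try {
        MPI.init();
        run();
        MPI.finalize();
      } catch (Exception e) {}
    }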

    After all these transformations, the user has source code that will run successfully on remote machines, passing messages to accomplish its task. Running these programs in parallel requires the DOGMA system, which we do not describe here; the interested reader may find more information at [web1].

  4. Shared-Memory Threads

    This section describes how to translate threads which share variables into processes that use MPI calls. This is a more general process that applies to any threaded program. However, it is much more complicated to maintain a consistent memory.

    The general approach here is to create and run a new thread that maintains all the shared memory. In object-oriented terms, variables that are shared among objects of a class are declared static; the compiler then allocates a single location for each such variable, accessible to all objects of that class.

    Therefore, we create a new thread, called MemoryDaemon, to encapsulate all the static variables. Its only purpose is to serve read and write requests from all the other threads, which do the computation. It sits in a loop, responding to memory accesses, until all the threads signal that they are finished with their computations.

    Before discussing the mechanics of this transformation, we wish to address the obvious overhead incurred by programs that use something like MemoryDaemon. When done naively, every memory access instead becomes two communications across the network (the request and the response with data). If the network is the internet, even a simple program to print array items may take seconds or minutes to execute. This is a problem we attempt to solve below (see sections 4-2 and 4-3).
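
    To make this cost concrete with an assumed (not measured) figure: reading a 10,000-element shared array element by element issues 10,000 request/response round trips; at an internet round-trip time of 100 ms, that is 10,000 × 0.1 s ≈ 1,000 seconds, or roughly 17 minutes of latency alone, before any bandwidth or computation costs.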

    However, these problems don't detract from the fact that threads can be compiled to run on remote processors and still carry out their intended function. It is possible that the basic mechanisms described here will never improve performance beyond a certain threshold, but that is an issue for another paper.

    To distribute shared-memory threads, most of the work is done in deriving the MemoryDaemon class. It must receive and serve memory requests correctly. We first explain how to generate MemoryDaemon, and then we describe how to generate the threads for the computation.

    1. MemoryDaemon class

      One of the biggest problems in creating the MemoryDaemon is that MPI message data must be in an array of one of the primitive types. We use messages of type integer to signal memory requests, and since the actual data may have a different type, we have to send a second message containing it. This doesn't affect read requests, which still take two messages (one to request and one to receive). But write requests, which usually just take one message, now take two; the node must request the write, and then send the data for the write separately.

      A receive request contains an array of integers, where each element has the following meaning:

      Note that the request may contain 0 or more dimension offsets.
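
      As an illustration only (the actual request layout and the generated code are not reproduced here), the following sketch shows the shape of the daemon's service loop for a single shared static int array. The layout we assume, an op code in element 0, the requesting rank in element 1, and an index in element 2, is hypothetical, as are the constants READ_MSG, WRITE_MSG, and FROMMEMORY_TAG; DONE_MSG and TOMEMORY_TAG appear in the tables later in this section, the daemon runs on the last node as described below, and we assume an MPI.ANY_SOURCE constant analogous to the MPI.ANY_TAG used earlier.

      // Sketch of the daemon's service loop; not the generated code.
      // (The MPIJ import and the exact way a Status object is created depend on the MPIJ distribution.)
      public class MemoryDaemon implements Runnable {
        static final int READ_MSG = 1, WRITE_MSG = 2, DONE_MSG = 3;   // hypothetical op codes
        static final int TOMEMORY_TAG = 10, FROMMEMORY_TAG = 11;      // hypothetical tag values
        static int[] counts = new int[100];                           // an example shared static variable

        public void run() {
          try {
            Status status = new Status();
            int finished = 0;
            int workers = MPI.COMM_WORLD.size() - 1;                  // every node but this one computes
            while (finished < workers) {
              int[] request = new int[3];
              MPI.COMM_WORLD.recv(request, MPI.INT, MPI.ANY_SOURCE, TOMEMORY_TAG, status);
              int op = request[0], from = request[1], index = request[2];
              if (op == READ_MSG) {                                   // reply with the current value
                int[] value = { counts[index] };
                MPI.COMM_WORLD.send(value, MPI.INT, from, FROMMEMORY_TAG);
              } else if (op == WRITE_MSG) {                           // a second message carries the data
                int[] value = new int[1];
                MPI.COMM_WORLD.recv(value, MPI.INT, from, TOMEMORY_TAG, status);
                counts[index] = value[0];
              } else if (op == DONE_MSG) {
                finished++;                                           // one more worker has finished
              }
            }
          } catch (Exception e) {}
        }
      }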

      The following algorithm describes how MemoryDaemon is generated from the static variables:

    2. Computation threads

      Each normal thread must have a new function for reading from or writing to the shared variables, and it must notify the MemoryDaemon when it has finished its processing. For each variable X of type T, we generate:

      As mentioned earlier, each function initiates two communications: one to request the memory operation, and the other to receive the old value or send the new value.
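
      As an illustration only, a pair of generated wrappers for a shared static int array might look roughly like the following sketch, which reuses the hypothetical message layout and constants (READ_MSG, WRITE_MSG, FROMMEMORY_TAG) from the MemoryDaemon sketch above; only DONE_MSG and TOMEMORY_TAG come from the paper's own tables.

      // Sketch: wrappers that replace direct reads and writes of a shared static int array.
      // (Status creation shown as in the daemon sketch; the exact form depends on MPIJ.)
      protected int readCounts(int index) throws MPIException {
        Status status = new Status();
        int memoryNode = MPI.COMM_WORLD.size() - 1;                   // the MemoryDaemon's node
        int[] request = { READ_MSG, MPI.COMM_WORLD.rank(), index };
        MPI.COMM_WORLD.send(request, MPI.INT, memoryNode, TOMEMORY_TAG);
        int[] value = new int[1];
        MPI.COMM_WORLD.recv(value, MPI.INT, memoryNode, FROMMEMORY_TAG, status);
        return value[0];
      }

      protected void writeCounts(int index, int newValue) throws MPIException {
        int memoryNode = MPI.COMM_WORLD.size() - 1;
        int[] request = { WRITE_MSG, MPI.COMM_WORLD.rank(), index };
        MPI.COMM_WORLD.send(request, MPI.INT, memoryNode, TOMEMORY_TAG);
        int[] value = { newValue };
        MPI.COMM_WORLD.send(value, MPI.INT, memoryNode, TOMEMORY_TAG);
      }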

      Finally, we must initialize and finalize things within the main procedure:

      Old code or location                       MPIJ code
      public static void main(String[] args)     public void MPIMain(String[] args)
      beginning of main()                        try {
                                                   MPI.init();
                                                   if (MPI.COMM_WORLD.rank() ==
                                                       MPI.COMM_WORLD.size()-1)
                                                     new Thread(new MemoryDaemon()).start();
      end of main()                                run();
                                                   int[] offset = {DONE_MSG};
                                                   MPI.COMM_WORLD.send(offset, MPI.INT,
                                                     MPI.COMM_WORLD.size()-1, TOMEMORY_TAG);
                                                   MPI.finalize();
                                                 } catch (Exception e) {}
      within main()                              remove code that creates and runs threads of this type

      Note that the MemoryDaemon is run on the last node. Also note the TOMEMORY_TAG; it is used to distinguish messages intended for the MemoryDaemon from those for a normal thread that may also be running on the last node.

  5. Conclusions and Future Work

    We have presented a general way to transform threaded programs into parallel programs. This allows any threaded Java program to be distributed among remote machines without any extensions to the language. While the inter-process communication mechanisms may be less efficient than special-purpose languages or libraries, the transformation is simple, and our tools give anyone who can write a threaded program the ability to distribute their threads among multiple processors. This work is significant because it shows that the existence of threads in an object-oriented program is sufficient to distribute it to multiple processors. We believe that threads are a good abstraction for distributed and parallel programming: they allow programmers to concentrate on the algorithm and the communication, but they hide the ugly details required by most parallel programming libraries.

    1. Limitations

      There are some limitations to the method described here:

      These limitations place restrictions on the programmer, but we see no compelling reasons against distributing threaded programs in the manner we describe. In fact, most of these limitations exist simply by the nature of distributed computation. Programmers who understand these constraints will organize their programs to minimize the communication caused by accessing shared memory or passing messages.

    2. Future Work

      This process is new and not well tested, and we wish to remedy this with substantial examples. We have especially not dealt with language issues, such as inheritance of threaded classes, and we wish to explore these issues as well to see how these mechanisms work in complex programs. There are a few ways we can improve performance by relaxing some constraints on memory:

    3. References

      [Aho88] Aho, Sethi, and Ullman, Compilers: Principles, Techniques, and Tools, Addison-Wesley, 1988.

      [Haines94] Haines, Cronk, and Mehrotra, "On the Design of Chant: A Talking Threads Package", Proceedings of Supercomputing 94, November 1994, http://computer.org/conferen/sc94/hainesm/hainesm.html

      [web1] DOGMA homepage http://zodiac.cs.byu.edu/DOGMA/

      [web2] Java Compiler Compiler - The Java Parser Generator http://www.suntest.com/JavaCC/

      [Yelick98] Yelick, Semenzato, Pike, Miyamoto, Liblit, Krishnamurthy, Hilfinger, Graham, Gay, Colella, Aiken, "Titanium: A High-Performance Java Dialect," ACM 1998 Workshop on Java for High-Performance Network Computing, Stanford, California, February 1998.