Mark Rosenvall's Parallel Database Project

Parallel Database Project
Mark Rosenvall
December 10, 1998

Abstract

Databases are usually a key component in a large computing environment. Increasing the performance and reliability of a database server is a major goal for many organizations. This project explores some of the performance issues of using Java with Databases, and a method of increasing the capabilities of a simple database program. Performance capabilities of this new system are presented, and some of the weaknesses of this approach are also presented.

Introduction

Database servers are a principal part of computer use. Increasing the performance of a large database server can be challenging task The most important database server performance metric is response time. As more clients access a database server, the average response time of that server decreases. This project explores the use of parallel concepts to increase the performance of a standard non-parallel, non-distributed database. The goals of this project are: to decrease the response time of the server under increasing load, provide automatic update replication, and to provide a simple server load balancing mechanism.

The Project

This project used a simple three tier architecture consisting of the database server, a client tier, and a middle tier managing the connection between the other two tiers. The server used in this project was Sybase SQL Anywhere, a simple database engine that is included with a number of application development suites. This database runs under Windows NT, which was the platform used in the tests. This database is not designed for heavy use. It is designed to be a simple, standalone database for testing programs. The data in the database was initially replicated on each server node. Each node must have an identical data set when the program begins to execute.

One possible future enhancement would be to automatically synchronize the databases when a server connection occurs, and update all of the nodes so they each contain identical data, however that problem is not addressed in this project. Also, some large databases may not fit on a single node. Data partitioning between database nodes is another problem that is too large to be addressed here. Allowing for the partitioning of data across multiple nodes would increase the complexity of the middle tier by a large amount, and would not be possible to implement given the time constraints of the project.

The middle tier is a simple parallel application written in Java. This application manages the connection between each of the database nodes and also handles client connections. The middle tier connects to the database using the standard Java JDBC/ODBC bridge. This is a generic database connectivity API, and allows a Java program to connect to any database and execute queries. No database vendor specific functionality is possible using the generic driver. A read operation occurs on a single database node. A write operation is required to occur on every database node, which becomes a parallel broadcast, followed by an update on each server node.

The third tier clients are also fairly simple. Load balancing is handled by the clients. Each client uses a pseudo random algorithm to determine which node of the database to connect to, which has the effect of balancing the load on the server nodes. The clients used to test the server simulate user database usage. Each client connects to the server, executes a query, and disconnects from the server repeatedly. The response time of the server is measured and reported. The clients were also Java programs, and were running under Windows 95 and NT.

Results

Database Read Operations

The primary goal of this project was to increase database server performance. Read operations are the most common database action, so the primary focus was to improve database read performance. The following performance table shows the results observed during execution of the test application:

This table shows the response time of the database when a simple read query is executed. No attempt at authenticating the client was made during the test. Also, no communication between the server nodes was required during the test.

Database Write Operations

A database write operation must be treated much differently than a read operation. A write operation must successfully execute on all of the server nodes. If a transaction does not successfully execute on all of the nodes, then the transaction must be rolled back. Because all write operations must execute on all nodes of the server, a write operation should be slower when multiple processors are used. The single node server should have approximately the same performance. The following performance chart confirms this expectation:

Database write operations must be handled much more carefully than read operations. Extra time is required to ensure database integrity. This also has the effect of replicating the data across multiple servers, allowing for the possibility of allowing for server failure. While this project did not explore this avenue, it is a possible path to explore in the future.

Conclusion

The results of this project show that it is possible to increase database server by applying parallel design techniques. Significant performance gains are possible during read operations when the client load is distributed among several servers. However, database update times were not improved in this test. In order to provide the best performance increase, both read and write operations must improve when multiple processors are available. Also, data integrity and security must be guaranteed by the server. These issues must be resolved before this can become a viable solution.

References

Kumar, Vipin, Ananth Grama, Anshul Gupta, and George Karypis. "Introduction to Parallel Computing: Design and Analysis of Algorithms" Benjamin Cummings Publishing Company, Inc.

Download the source code