A Genetic Algorithm for Maximum-

A Genetic Algorithm for Maximum-Likelihood Phylogeny Inference

Using Nucleotide Sequence Data

CS584 Final Project

December, 1998

Project Director

Professor Quinn Snell

Project Members

Wendy Chen

Yiqing Lai

Table of Contents

I Introduction

II Explanation of the algorithm

Inferring a phylogeny is an estimation procedure, we are making a "best estimate" of an evolutionary history based on the incomplete information contained in the data (Hills, Moritz and Mable, 1996). Usually, we don't have information about the past. We can only access the current species instead. Because we can postulate evolutionary scenarios by which any chosen phylogeny could have produced the observed data, we must have some basis for selecting one or more preferred trees from among the set of possible phylogenies (Hills, Moritz and Mable, 1996).

The phylogeny reconstruction is a classical NP-complete problem. The number of possible tree topologies is (2n-5)!/[(n-3)!2^(n-3)], where n is the number of taxa. For example, given 14 taxa, we could construct more than seven trillion unrooted phylogentic trees. For a big computational problem such as the phylogeny reconstruction, heuristic search method is normally adapted to solve the problem instead of exhaustive search. Professor Paul O. Lewis from University of New Mexico biology department has developed a genetic algorithm to solve the maximum-likelihood phylogeny inference by using nucleotide sequence (Lewis, 1998).

Professor Lewis has also written a program to implement this algorithm. However, even the sequential version of this GA program takes average about 14 hours CPU time to run on a 55-taxon problem (lewis, 1998). The genetic algorithm is well suited for parallel processing because at every generation individuals are independently selected for mutation and recombination following some probability distribution (Sena, Megherbi and Isern). Professor Lewis' implementation also has the capability of distributing the work of computing likelihood of different individuals on separate processors. The parallel processors running result of the implementation demonstrates significant speedup over the sequential version of the same implementation.

II Explanation of the algorithm

In genetic algorithm, the term chromosome typically refers to a candidate solution to a problem, often encoded as a bit string (Mitchell, 1998). The evaluation for a genetic algorithm problem is the fitness of each chromosome. By applying genetic operators (mutation, recombination, etc.) to an original population of chromosomes, a new set of population can be created. The best-fit individuals will be obtained after generation of evolution. Genetic algorithm is used to solve a diversity of optimization problems.

In Professor Lewis' GA phylogeny reconstruction algorithm, a single phylogenetic tree with its branch length and some other parameters (used for fitness computation) represents a chromosome. The natural log likelihood (lnL) score for maximum likelihood method of a tree represents the fitness function in GA. Maximum likelihood methods of phylogenetic inference evaluates a hypothesis about evolutionary history in terms of the probability that a proposed model of the evolutionary process and the hypothesized history would give rise to the observed data (Hills, Moritz and Mable, 1996). The higher lnL score a tree has, the better the tree is determined by maximum-likelihood criterion. Biologists hope that, by using the maximum likelihood method, the topology tree with the highest lnL score in a population represents the ancient ancestor of the species. The program written by Processor Lewis implements the HKY nucleotide substitution model (Hasegawa, Kishino and Yano, 1985).

The program begins with a population of n individuals, each of the individual is a tree topology. The number n of individuals is set in the configuration file. All initial n individuals are obtained by random stepwise addition. The random-addition tree is started with a tree composed of three randomly chosen taxa. A forth taxa is then added to the three taxa tree. The four taxa tree topology that has the highest lnL value is chosen for adding the fifth taxa on. It continues until all taxa are added on the tree. All n individuals in the population are also assigned with an arbitrary branch length, a transition/transversion rate ratio k, and equilibrium nucleotide frequencies.

There are three steps occur in each generation. First, the lnL score (fitness) of each individual in the population is calculated. The lnL score depends on the full tree topology, the branch length and the base frequency. No optimization of branch length is done at this step to save time. Second, the individuals are ranked according to their lnL score. The individuals with higher lnL scores have higher probability to be passed on to the next generation. The probability is defined as (n-i+1), where i is the ranking and n is the number of individuals in the population. In the third step, the n individuals for next generation are chosen. Out of n individuals, k of them are the best-fit (highest lnL score) individuals from the current generation. The remaining (n-k) individuals are chosen by the ranking of their probabilities. All individuals of the next generation except the best-fit individual undergo branch length mutation, and/or topological mutation and/or recombination. The first individual is protected from mutation and recombination to ensure that the genotype of the best individual found thus far will always be present (Lewis, 1998).

A random proportion lambda (same as the brlenmutprob in the configuration file) of the branch length of given individuals are changed. The new branch length equals to the original branch length times a gamma-distributed factor. A random proportion mu (same as the topomutprob in the configuration file) of individuals in a generation undergo topological mutation. Topological mutation involves with removing a random subtree and reattaching it at some other site in the original tree (this is also called SPR branch swapping strategy). A random proportion pi (same as kappaprob the configuration file) of k values (transition/transverstion rate ratio) of trees are mutated. The new k value is the old k value multiplies a gamma-distributed factor. If the gamma-distributed factor is less than 1.0, then it is set to 1.0. This is because that there is usually excess transition compared to transverstion. Branch recombination is applied with probability of r (same as crossoverprob in the configuration file). r of the individuals from the next generation will combine with a second individual randomly chosen from the previous generation. A subtree is first pruned from the individual of the next generation. Same taxa tips are pruned away from the individual of the previous generation, and remaining parts of the two individuals are rejoined together. This step allows the good portions of two distinct trees to be brought together.

Genetic algorithm is inherently parallel (Sena, Megherbi and Isern). Several populations can be distributed among available processors, so the genetic algorithm can be applied in parallel. Professor Lewis' GA program also has the capability to run on more than one processor in parallel. The parallel parts of the implementation use the Message Passing Interface (MPI). The computation of likelihood (lnL) scores of different individuals is distributed on different processors to gain speedup. A master-slave paradigm is adapted for implementation. At each generation, each slave processor sends master processor its best lnL score after calculation. The master processor chooses the highest lnL score among all slave processes as the best lnL score for the generation.

III Our Work

Understand the background knowledge and basic algorithm
Modify the Makefile and part of pop.cpp to generate executables in both sequential and parallel versions on HP-Unix platform
Run the sequential version of the program under 3 different processors over a period about 64-70 hours
Run the MPI parallel version under 4 processors, 8 processors, 16 processors, and 25 processors over a period about 64-70 hours (note: all the processors are different open lab HP-Unix machines)
Plot the graphs from the output data for all runs
Compare sequential results and parallel MPI results, calculate speedup

IV Results

This figure shows that in a certain amount of time different number of processors can reach different number of generation, i.e. after 70 hours, 1 processor only reach 3550 generation, while 25 processors are already done 70,000 generation.

The reason of the linear distribution shown in the picture is that all generation running are independent, takes same time unit for each generation.

This figure differs from the previous on because it only counts the CPU time of the master computer, which only deals with the message passing and doesn't do any computation. So, basically this picture shows the communication time for different processor number. (Except for 1 np, in which case master is doing all the work.) That's why 25 processors even spend more time in this case.

There are four lines in this figure, which represent 4 different calculating method for speedup.

1000 gen: record the time for reaching 1000 generation, then calculate speedup.

2000 gen: record the time for reaching 2000 generation.

3000 gen: record the time for reaching 3000 generation.

63.4 hours: This one count the number of generation finished by 1, 4, 8, 16, and 25 processors -- after running for 63.4 hours. And calculate the speedup from that.

Since generation_time is linear distributed, the results for all these different calculating method are almost identical.

This figure shows that the best score we could get after some generation. We can see that after 10000 generation there's not much improvement anymore. That's what we expect for a genetic algorithm.

Also, from the picture we can see that the best score only relates to the generation number, not to the processor number. This means parallism can improve the speed, but has little to do with result accuracy.

V File Download

The original source code that we received from Professor Lewis is here: original_gaml.tar

The original source code didn't compile for us. We did minor modification to it. There are two files we changed from the original directory. One is the Makefile. The new Makefile now has the ability to compile MPI code. The second file we changed is the pop.cpp. Line #207 and #213 of the old pop.cpp are changed to Line #224 and #236 in the new pop.cpp, respectively. The newand the code after modification is here: working_gaml.tar

The sequential and parallel executable files are here: exe.tar

Help file on the sequential version of gaml is here: seq_help.txt, help file on parallel version of gaml is here: mpi_help.txt

Our output files on 1 processor are here: 1pr_out_1.tar, 1pr_out_2.tar and 1pr_out_3.tar since we run it three times

Our output files on 4, 8, 16 and 25 processors are: 4pr_out.tar, 8pr_out.tar, 16pr_out.tar and 25pr_out.tar, respectively

VI Reference

Hasegawa, M., H. Kishino and T. Yano, 1985. Dating of the Human-Ape Splitting by a Molecular Clock of Mitochondrial DNA. J. Mol. Evol. 21:160-174.

Hills, D. M., C. Moritz and B. Mable, 1996. Molecular Systematics, second edition. Sinauer Associates, Inc.

Lewis, P. O., 1998. A Genetic Algorithm for Maximum-likelihood Phylogeny Inference Using Nucleotide Sequence Data. Mol. Biol. Evol. 15(3):277-283.

Mitchell, M., 1998. An Introduction to Genetic Algorithms. The MIP Press.

Sena, G. A., D. Megherbi, G. Isern. A Parallel and Distributed Genetic Algorithm for the Traveling Salesman Problem.