University of Sydney
| Author | Lars Jermiin |
| Contact | SoBS¹, Heydon-Laurence Bldg A08, University of Sydney, NSW 2006, Australia, or SUBIT², Medical Foundation Bldg K25, University of Sydney, NSW 2006, Australia |
| Phone | +61-2-9351-3717 |
| Fax | +61-2-9351-4119 |
| Email | lars.jermiin [at] usyd.edu.au |
¹ School of Biological Sciences
² Sydney University Biological Informatics and Technology Centre
____________________________
Hetero is a computer program designed to simulate the evolution of a nucleotide sequence across a tree with four tips. It allows its user to specify the lineage-specific nucleotide substitution models that are used in the simulation, together with information on the ancestral sequence and the order and timing of the divergence events. It has a simple user-interface and output, making it equally useful in the teaching and research of phylogenetics.
____________________________
Read this first... | Installation | Running Hetero | The Results | Teaching | References
____________________________
Read this first...
Permission is granted to use Hetero, free of charge, for non-commercial use, subject to appropriate acknowledgement of the source. Commercial use of the program requires prior written permission from the University of Sydney, which may incur a fee.
The University of Sydney does not invite reliance upon, nor accept responsibility for, the information it provides through Hetero.
The University of Sydney gives no guarantees, undertakings or warranties, either expressed or implied, concerning the accuracy, completeness or up-to-date nature of the information provided or suitability of the information for a particular purpose or application. Users should confirm information from another source if it is of sufficient importance for them to do so.
Under no circumstances will the University of Sydney be liable for indirect or consequential damages, including, without limitation, loss of income or use of information.
Hetero is made available by the University of Sydney through the auspices of its School of Biological Sciences and the Sydney University Biological Informatics & Technology Centre (the 'hosts').
Whilst all care is taken to ensure a high degree of accuracy, users are invited to notify the author of any discrepancies.
____________________________
Installation
uncompress hetero.tar.Z
followed by
tar xvf hetero.tar
The result is the appearance of the executable file, hetero. If you have superuser access to the computer, then you may wish to change the ownership and the group to the relevant choice, e.g.:
chown root:staff hetero
After having done so, place the executable in the appropriate directory, e.g.:
/usr/local/bin
or
/user/biotools/bin
Remember to make the chosen directory accessible to your group members. By now, hetero is ready to be used (all documentation is included in these web pages).
On PCs and similar types of computer platforms, download Hetero to the directory of your choice. Double-click on the file to run the installation program. The installation program allows you to specify the directory in which you would like to install hetero; the default directory is
c:\Program Files\Hetero
The installation program will also place a shortcut to the program on your Desktop. You may then delete the original installation program, as it is no longer required by the program. By now, hetero is ready to be used (all documentation is included in these web pages).
Anyone wishing to obtain a copy of the source code (ANSI C compatible) must have a software license to Numerical Recipes because the source code includes a licensed algorithm from Numerical Recipes in C (Cambridge University Press). A software license can be ordered from Numerical Recipes' web site or by email from order@nr.com.
____________________________
Running Hetero
hetero
in a terminal window, and answer the questions as they appear on the monitor. One of the two output files contains the information that was used to generate the sequence alignments stored in the other output file. Both files can be viewed using a text editor, and the file containing the aligned nucleotide sequences can be analysed directly, without any reformatting, using the relevant phylogenetic programs from PHYLIP.
hetero.exe
in a terminal window, and answer the questions as they appear on the monitor. One of the two output files contains the information that was used to generate the sequence alignments stored in the other output file. Both files can be viewed using a text editor, and the file with aligned nucleotide sequences can be analysed directly, without any reformatting, using the relevant programs from PHYLIP.
hetero
in a terminal window, and answer the questions as they appear on the monitor. One of the two output files contains the information that was used to generate the sequence alignments stored in the other output file. Both files can be viewed using a text editor, and the file with aligned nucleotide sequences can be analysed directly, without any reformatting, using the relevant programs from PHYLIP.
Whenever you enter an answer, the program will, in most cases, assess whether the answer conforms to its expectations. If the answer is not correct, the program will in most instances tell you what is wrong. If you keep on entering unexpected answers, the program will abort after three attempts.
Only in one known instance is it possible to cause the program to hang; this can occur when (i) you enter parameters that produce a tree with a length just below 1.0 substitution per site, and (ii) you disallow multiple substitutions at the same site. When this happens, you should:
Hetero was stress tested on four computer systems: (i) SunBlade 1000 running SunOS 5.8; (ii) Macintosh PowerBook G3 running MacOS 9.2; (iii) Macintosh G4 running MacOSX; (iv) Toshiba Presario 1550AP running Windows XP.
____________________________
The Results
The output of hetero appears in two files. The first output file contains all the details that were entered before the simulation was begun, and it also contains most of the results. An example of the first output file is given below - all details in the output file are indented and written in a different typeface.
The first few lines (below) are largely self-explanatory - the output file contains the aligned nucleotide sequences that can be analysed using many of the programs from PHYLIP, and the time gives the exact point in time when the simulation was done.
DETAILS OF PROGRAM
Program Hetero 1.0
Copyright Hetero is Copyright to the University of Sydney, 2003
Output file test.aln
Time Fri Nov 7 19:42:40 2003
The next few lines (below) show the length of the six edges in the tree, and that the length of these edges was entered in terms of time units (and not in terms of average substitutions per site).
PROPERTIES OF THE TREE
Edges entered in terms of time (units)
Length of a 0.95000
Length of b 0.95000
Length of c 0.95000
Length of d 0.95000
Length of e 0.05000
Length of f 0.05000
The next few lines (below) show the lengths of all the edges - the tree length is then obtained by adding the six edge lengths together. If the tree length is equal to or larger than 1.0, then multiple substitutions will always be allowed (see below).
The tree used with the Monte Carlo simulation,
written in the Newick format with edge lengths
given in terms of (1) time or (2) average rate
of nucleotide substitution per site:
(1) - ((SeqA:0.9500,SeqB:0.9500):0.0500,(SeqC:0.9500,SeqD:0.9500):0.0500);
(2) - ((SeqA:0.1425,SeqB:0.1425):0.0075,(SeqC:0.1425,SeqD:0.1425):0.0075);
The following few lines are largely self-explanatory.
AVERAGE RATES OF CHANGE ALONG THE EDGES
Rate along edge a 0.15000
Rate along edge b 0.15000
Rate along edge c 0.15000
Rate along edge d 0.15000
Rate along edge e 0.15000
Rate along edge f 0.15000
The following few lines outline the other relevant information.
OTHER RELEVANT INFORMATION
Sequence length 100
Number of cycles 5
Seed -57
Multiple hits yes
Order of nucleotides A, C, G & T
The sequence length is the number of nucleotides in the alignment; the number of cycles is the number of times that the ancestral sequence at the root of the tree is allowed to evolve towards the tips of the tree; the seed is a number needed to prime the random number generator (note: the number entered here was 57, but it appears as -57 in the printout; if the number in the printout is not negative, then you have entered an invalid seed); if multiple hits are accepted, the answer is "yes"; the order of nucleotides is given because it is needed to read the properties of the substitution models (see below) and those of the threshold matrices (see below).
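The two Newick strings in the example differ only in their units: each edge length in (2) equals the edge length in time units multiplied by the average rate along that edge, and the tree length is the sum of the six converted edge lengths. The following is a minimal Python sketch of that arithmetic using the values from the example output; the variable names are illustrative and are not part of Hetero.

```python
# Edge lengths in time units, copied from the example output above.
time_lengths = {"a": 0.95, "b": 0.95, "c": 0.95, "d": 0.95, "e": 0.05, "f": 0.05}

# Average rate of change along each edge, also from the example output.
average_rates = {edge: 0.15 for edge in time_lengths}

# Edge length in substitutions per site = time * average rate along the edge.
subst_lengths = {edge: time_lengths[edge] * average_rates[edge] for edge in time_lengths}

# Tree length = sum of the six edge lengths in substitutions per site.
tree_length = sum(subst_lengths.values())

for edge, length in subst_lengths.items():
    print(f"Length of {edge} (substitutions per site): {length:.4f}")
print(f"Tree length: {tree_length:.4f}")
```

With these inputs the converted edges agree with the second Newick string (0.1425 and 0.0075), and the tree length comes to 0.585, which is below 1.0.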
The following few lines are largely self-explanatory.
PROPERTIES OF THE ANCESTRAL SEQUENCE
Frequency of A 0.25000
Frequency of C 0.25000
Frequency of G 0.25000
Frequency of T 0.25000
The next many lines require some explanation. For each of the six edges, a matrix of rates of change is developed on the basis of the equilibrium nucleotide frequencies and the conditional rates of change of the model. The rates of change are measured in terms of time units.
PROPERTIES OF THE SUBSTITUTION MODELS
Model Ra -- Nucleotide frequencies
0.25000 0.25000 0.25000 0.25000
Model Ra -- Conditional rates
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
Model Ra -- Rates of change along edge a
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
Model Rb -- Nucleotide frequencies
0.25000 0.25000 0.25000 0.25000
Model Rb -- Conditional rates
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
Model Rb -- Rates of change along edge b
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
Model Rc -- Nucleotide frequencies
0.25000 0.25000 0.25000 0.25000
Model Rc -- Conditional rates
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
Model Rc -- Rates of change along edge c
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
Model Rd -- Nucleotide frequencies
0.25000 0.25000 0.25000 0.25000
Model Rd -- Conditional rates
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
Model Rd -- Rates of change along edge d
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
Model Re -- Nucleotide frequencies
0.25000 0.25000 0.25000 0.25000
Model Re -- Conditional rates
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
Model Re -- Rates of change along edge e
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
Model Rf -- Nucleotide frequencies
0.25000 0.25000 0.25000 0.25000
Model Rf -- Conditional rates
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
Model Rf -- Rates of change along edge f
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
The next many lines require some explanation. To understand how the nucleotide sequences are generated by this program, it is necessary to explain an integral component of the Monte Carlo simulation, i.e., the generation of random mutations. The rate matrix for a given edge in the tree defines how random mutations are generated along that edge, and it does so through its corresponding threshold matrix. The threshold matrix is produced by (i) adding the identity matrix, I, to the rate matrix, and (ii) replacing each value in the resulting matrix with the sum of that value and the values in the preceding columns of the same row. A comparison of this method of generating nucleotide sequences with other such methods is presented in Ababneh et al. (2006). The relationship between the rate matrix and its corresponding threshold matrix is illustrated below:
Rate matrix (R):
-0.15000 0.05000 0.05000 0.05000
0.05000 -0.15000 0.05000 0.05000
0.05000 0.05000 -0.15000 0.05000
0.05000 0.05000 0.05000 -0.15000
Identity matrix (I):
1.00000 0.00000 0.00000 0.00000
0.00000 1.00000 0.00000 0.00000
0.00000 0.00000 1.00000 0.00000
0.00000 0.00000 0.00000 1.00000
Sum of rate matrix and identity matrix (R + I):
0.85000 0.05000 0.05000 0.05000
0.05000 0.85000 0.05000 0.05000
0.05000 0.05000 0.85000 0.05000
0.05000 0.05000 0.05000 0.85000
Threshold matrix:
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
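The matrices illustrated above can be reproduced with a few lines of Python. This is a sketch only, not Hetero's source code: it assumes that each off-diagonal rate equals the conditional rate multiplied by the equilibrium frequency of the target nucleotide (an assumption that is consistent with the example output, where 0.20 x 0.25 = 0.05), and the function names are hypothetical.

```python
# Order of nucleotides: A, C, G, T (as in the output file).
frequencies = [0.25, 0.25, 0.25, 0.25]
conditional = [
    [0.0, 0.2, 0.2, 0.2],
    [0.2, 0.0, 0.2, 0.2],
    [0.2, 0.2, 0.0, 0.2],
    [0.2, 0.2, 0.2, 0.0],
]

def rate_matrix(freqs, cond):
    """Assumed form: rate i->j = cond[i][j] * freqs[j]; each row sums to zero."""
    n = len(freqs)
    rates = [[cond[i][j] * freqs[j] if i != j else 0.0 for j in range(n)]
             for i in range(n)]
    for i in range(n):
        rates[i][i] = -sum(rates[i])
    return rates

def threshold_matrix(rates):
    """Add the identity matrix, then accumulate each row from left to right."""
    thresholds = []
    for i, row in enumerate(rates):
        running, cumulative = 0.0, []
        for j, value in enumerate(row):
            running += value + (1.0 if i == j else 0.0)
            cumulative.append(running)
        thresholds.append(cumulative)
    return thresholds

R = rate_matrix(frequencies, conditional)
T = threshold_matrix(R)
for row in T:
    print(" ".join(f"{value:.5f}" for value in row))
```

For the parameters used here, the printed rows agree with the threshold matrix shown above.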
Remembering the order of the four nucleotides (i.e., A, C, G & T), it is now possible to use randomly generated numbers between 0.0 and 1.0 to determine whether a nucleotide at a given site will remain the same or change to one of the three other nucleotides. For example, if the nucleotide at a given site is an A, then we must focus on the first row in the threshold matrix. If the random number returned by the random number generator equals 0.3567, then the A will remain an A because 0.0000 < 0.3567 <= 0.8500. If, on the other hand, the random number is 0.9232, then the A will change to a G because 0.9000 < 0.9232 <= 0.9500. The six threshold matrices are listed below.
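The look-up described in this paragraph can be sketched in Python as follows. The function name is hypothetical, and Python's random module is only a stand-in for Hetero's own random number generator, so the sequences it produces will not match Hetero's output for the same seed.

```python
import bisect
import random

NUCLEOTIDES = "ACGT"

# Threshold matrix from the example, one row per current nucleotide.
THRESHOLDS = {
    "A": [0.85, 0.90, 0.95, 1.00],
    "C": [0.05, 0.90, 0.95, 1.00],
    "G": [0.05, 0.10, 0.95, 1.00],
    "T": [0.05, 0.10, 0.15, 1.00],
}

def next_nucleotide(current, u):
    """Return the nucleotide whose interval (lower, upper] contains u."""
    # bisect_left finds the first threshold that is >= u.
    index = bisect.bisect_left(THRESHOLDS[current], u)
    return NUCLEOTIDES[index]

# The two worked examples from the text.
print(next_nucleotide("A", 0.3567))  # A remains A
print(next_nucleotide("A", 0.9232))  # A changes to G

# Applying the same rule repeatedly to a single site (illustrative only).
rng = random.Random(57)
site = "A"
for _ in range(10):
    site = next_nucleotide(site, rng.random())
print(site)
```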
THRESHOLD MATRICES
Thresholds used under Model Ra (edge a)
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
Thresholds used under Model Rb (edge b)
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
Thresholds used under Model Rc (edge c)
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
Thresholds used under Model Rd (edge d)
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
Thresholds used under Model Re (edge e)
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
Thresholds used under Model Rf (edge f)
0.85000 0.90000 0.95000 1.00000
0.05000 0.90000 0.95000 1.00000
0.05000 0.10000 0.95000 1.00000
0.05000 0.10000 0.15000 1.00000
The next many lines contain some of the results that can be gleaned from analysing the aligned nucleotide sequences - these results are stored in three tables.
The first table contains for each simulation the differences in the GC content between the four sequences, the number of constant sites, and the number of sites with different types of splits in the data.
RESULTS OF SIMULATION
Column [1] Dif. in GC content (SeqA vs SeqB)
[2] Dif. in GC content (SeqA vs SeqC)
[3] Dif. in GC content (SeqA vs SeqD)
[4] Dif. in GC content (SeqB vs SeqC)
[5] Dif. in GC content (SeqB vs SeqD)
[6] Dif. in GC content (SeqC vs SeqD)
[7] Constant sites
[8] Split A (A|BCD)
[9] Split B (B|ACD)
[10] Split C (C|ABD)
[11] Split D (D|ABC)
[12] Split E (AB|CD)
[13] Split F (AC|BD)
[14] Split G (AD|BC)
[15] Hypervariable sites
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
0.010 0.030 0.000 0.020 -0.010 -0.030 57.0 11.0 10.0 8.0 7.0 1.0 1.0 0.0 5.0
-0.020 -0.020 -0.030 0.000 -0.010 -0.010 68.0 2.0 11.0 5.0 4.0 2.0 0.0 0.0 8.0
0.050 0.040 0.000 -0.010 -0.050 -0.040 53.0 6.0 10.0 8.0 11.0 2.0 1.0 1.0 8.0
0.000 -0.060 -0.060 -0.060 -0.060 0.000 56.0 12.0 10.0 5.0 5.0 4.0 3.0 1.0 4.0
-0.010 0.030 0.000 0.040 0.010 -0.030 63.0 2.0 3.0 15.0 12.0 0.0 1.0 1.0 3.0
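The fifteen columns can be derived from a four-sequence alignment as sketched below in Python. Two details are assumptions rather than facts taken from Hetero: the sign of each GC difference (first sequence minus second) and the definition of a hypervariable site as any site with three or four different nucleotides. The toy alignment and the function names are made up for illustration.

```python
from collections import Counter
from itertools import combinations

def gc_content(sequence):
    """Fraction of G and C nucleotides in a sequence."""
    return sum(base in "GC" for base in sequence) / len(sequence)

def classify_site(a, b, c, d):
    """Assign one alignment column (SeqA..SeqD) to one of the nine site categories."""
    if a == b == c == d:
        return "constant"
    if b == c == d:
        return "A|BCD"
    if a == c == d:
        return "B|ACD"
    if a == b == d:
        return "C|ABD"
    if a == b == c:
        return "D|ABC"
    if a == b and c == d:
        return "AB|CD"
    if a == c and b == d:
        return "AC|BD"
    if a == d and b == c:
        return "AD|BC"
    # Assumption: all remaining patterns (three or four states) are hypervariable.
    return "hypervariable"

def summarise(alignment):
    """Return pairwise GC differences and site-category counts for one alignment."""
    gc = {name: gc_content(seq) for name, seq in alignment.items()}
    # Assumed sign convention: GC content of the first sequence minus the second.
    gc_diffs = {(x, y): gc[x] - gc[y] for x, y in combinations(alignment, 2)}
    patterns = Counter(classify_site(*column) for column in zip(*alignment.values()))
    return gc_diffs, patterns

# Toy alignment (not produced by Hetero).
toy = {"SeqA": "ACGTAC", "SeqB": "ACGTCA", "SeqC": "ACTTAG", "SeqD": "AGTTAT"}
gc_diffs, patterns = summarise(toy)
print(gc_diffs)
print(patterns)
```

Because the nine site categories are mutually exclusive, their counts add up to the sequence length, as they do in each row of the table above.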
The next table contains the same quantities averaged over all simulations: the average differences in the GC content between the four sequences, the average number of constant sites, and the average number of sites with different types of splits in the data.
AVERAGE VALUES
[1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
0.006 0.004 -0.018 -0.002 -0.024 -0.022 59.4 6.6 8.8 8.2 7.8 1.8 1.2 0.6 5.6
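The averages are simply the column means of the per-simulation rows. A short Python check using the five rows from the example output (the variable names are illustrative):

```python
# The five rows of the RESULTS OF SIMULATION table, columns [1] to [15].
rows = [
    [0.010, 0.030, 0.000, 0.020, -0.010, -0.030, 57.0, 11.0, 10.0, 8.0, 7.0, 1.0, 1.0, 0.0, 5.0],
    [-0.020, -0.020, -0.030, 0.000, -0.010, -0.010, 68.0, 2.0, 11.0, 5.0, 4.0, 2.0, 0.0, 0.0, 8.0],
    [0.050, 0.040, 0.000, -0.010, -0.050, -0.040, 53.0, 6.0, 10.0, 8.0, 11.0, 2.0, 1.0, 1.0, 8.0],
    [0.000, -0.060, -0.060, -0.060, -0.060, 0.000, 56.0, 12.0, 10.0, 5.0, 5.0, 4.0, 3.0, 1.0, 4.0],
    [-0.010, 0.030, 0.000, 0.040, 0.010, -0.030, 63.0, 2.0, 3.0, 15.0, 12.0, 0.0, 1.0, 1.0, 3.0],
]

# zip(*rows) turns the list of rows into a sequence of columns.
averages = [sum(column) / len(column) for column in zip(*rows)]
print(" ".join(f"{value:.3f}" for value in averages))
```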
The next table contains the average number of sites that have changed X times - these values are known exactly only in a simulation; for real data they can only be guessed or estimated.
Average number of sites with X hits
X # Sites Percentage
0 58.600 58.600
1 29.800 29.800
2 8.800 8.800
3 1.800 1.800
4 0.800 0.800
5 0.200 0.200
6 0.000 0.000
7 0.000 0.000
8 0.000 0.000
9 0.000 0.000
10 0.000 0.000
11 0.000 0.000
12 0.000 0.000
13 0.000 0.000
14 0.000 0.000
15 0.000 0.000
16 0.000 0.000
17 0.000 0.000
18 0.000 0.000
19 0.000 0.000
The second output file contains the aligned nucleotide sequences in a sequential PHYLIP format, and since five cycles were done in this simulation, there are five alignments in this file. The file content is illustrated below:
4 100
SeqA ACTCACGTGATTCGAGGAATCTCTGGTAGTGCTGCACCAGTTTCTCGTGCGCACTCCAGCCATAAGTCTAGGAGCTGCAAGCTTGGGAATCGAGATAGTC
SeqB AGTCAGGTGATTCGAGGAATGTCGCGTGGTGATGAAACACCTACGCGTACGCACTCCAAGTATGAGTCTAAGGGCTGCTTGCTTGAGATTCGAGTTCGTC
SeqC ACTCAGGTGATTCGACGAATTTGTGGTGGTGCTGCTACAGATTCGCTTACGCTCAACAGGCATCAGTCTAAGAGCCGAATGATTGGGATTCGAGTTCGTC
SeqD ACTCAGGTGATTCGAGGAATTTCCGGCGGTGCCGCAAAAGATTCGCTTAAGCACTCCAGGCATCAGTCTAACAGCAGCTTGCTTGGGATTCAGGTTCGTC
4 100
SeqA GCTGAAACGGAGCGTTGCTTGGTTCGCGTGACTAGGCAGCTACATGTCGCATCGACGACTGTAACTAGAATGCTTCGTCGTTAGTATGCGGATGTTCACT
SeqB GCAGATACGGATCGTTCTTTGGTTCCTGTGGCTAGGCAACGACATGTCGCGTCGACCACTGAAACTAGTATGCCACCTCGTTACTGTGCGGATGTTCCCA
SeqC GCTGAGACGGAGAGTTCCTTGGATCGCTTGCCTAGGTACCTACATGTCGGATCGACGACTGTTAGTAGTATGCCCCGTCGTTAGTATGCGGATGTTCGCA
SeqD GCTGAGACGGAGAGTTTCTTGGTTCGCGTGGCTAGGTAACTACATCTCGCATCGACGACTGTAACGAGCATGCCACGTAGTTAGTCCGCGGATGTTCGCA
4 100
SeqA AACGTCCTGGAAACTCGCTTCGACGCATTACATTGAGAGCTCACCCGCGTATTGAATGGTTAAACCCCGGAAACTTATTAATGCAACCGCGGCAGCTCCT
SeqB AACAATCTCGTAACTCGCTTTCACGCAGTACATCGAGGGATCACCAGCATAGTGAATGGTTATAACCTGTTAACTGATTGATGCATCCACGGTAGCCACT
SeqC AAGGGCCTAGAAACTCTCTTCAACGCATTACAGAGCGGGCTCACCAACATATTGAATGGTAAAAACTCATTAGCTTATTCATGCATCCGCGGCAACCGCT
SeqD TACGTCATCGAAACTTGCTTCCACGCCTTACATAGAGGGCTCAGCAACATATTGAACGGTTTGACCTCGTCAACTGATTCGTGGATCCGTAGCAGCCGCT
4 100
SeqA ATTTCAGTCTAACGCGGCCAGGCCTAGAGGGTTTGTATCGGCGGCTCCTGCGGTTTTGGTACCCTCAATGTACTACTCACACCAGACACTCGGGCCCATG
SeqB ATTCCAATCGGACTGGTCTACGCCGAGCGGGCTTGTCCCGGCGGCTCCTTGGGATTTGGTAACCTCAGTGTATGGCTCAAGCAAGAGATAAGGGCCCATA
SeqC CTTTCAGTGGGACTCGGCCAGGCCGAGAGGGCTTGTCTCGGCGGCTCCTTGGGACTTGGTACCCTCAATCCATGACACAAGACAGCGTCACGGGCCCATG
SeqD ATACCCGTCGGACTCGTCCAGGGCGAGAGGGCTTGTCTCGGCGGCTCCTCGAGATTTGGTACCCTCACTCTATAGCACCAGCCAGCGTCTCGGGCCCATT
4 100
SeqA TTCTCGTTTGGGTCGTCCAGGAGCTGAAGAATTTGGCCTCATGACGATTTTCCCTCTGTAGGGGATAGATCGCGGTATACACTCCGAAAGTCGCTTGCTG
SeqB TTCTCGTTTGGTTCGTCCAGGGGCTGGAGAATTTGGCCTCATGACGAGTTTACCTCTGTAGGGGATAGATCGCGATATACACTCCGAAAGCGGCTTGCTG
SeqC TTCTCGTTTGGGTCTTACAGTAACGGAAGAATCAGGCCTAATGACGATTTTACCTATGCACGGAGTGGATCGCGTTATACACTCAGAAACCGGCTTGCCG
SeqD TTCCGGTTTGGGTCGTCCACGAGCATATGAATTTGGCCTCATCACGAGTTTACCTCTGTGGTGGCTATATCGCGATAAACACACCGAAAGCCGCTTCCTG
This file can easily be analysed using many of the programs available from PHYLIP.
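For analyses outside PHYLIP, the alignments can be read back into memory with a short script. The Python sketch below assumes the layout shown above: a header line with the number of sequences and the number of sites, followed by one line per sequence with the name and the sequence separated by whitespace. The function name is hypothetical and the demo text is made up.

```python
def read_alignments(lines):
    """Yield one {name: sequence} dict per alignment in a sequential PHYLIP file."""
    lines = [line.strip() for line in lines if line.strip()]
    position = 0
    while position < len(lines):
        # Header line: number of sequences and number of sites.
        n_seqs, n_sites = (int(field) for field in lines[position].split())
        alignment = {}
        for line in lines[position + 1 : position + 1 + n_seqs]:
            name, sequence = line.split(None, 1)
            sequence = sequence.replace(" ", "")
            if len(sequence) != n_sites:
                raise ValueError(f"{name}: expected {n_sites} sites, found {len(sequence)}")
            alignment[name] = sequence
        yield alignment
        position += 1 + n_seqs

# Small made-up example with two alignments of four sites each.
demo = """4 4
SeqA ACGT
SeqB ACGA
SeqC ACTT
SeqD AGTT
4 4
SeqA TTGA
SeqB TTGC
SeqC TAGC
SeqD TAGA
"""
alignments = list(read_alignments(demo.splitlines()))
print(len(alignments), alignments[0]["SeqA"])
```

To read a real output file such as test.aln, pass an open file handle instead of the demo lines, e.g. `list(read_alignments(open("test.aln")))`.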
____________________________
Teaching
Students may wish to work in pairs to improve their performance and understanding of the subject - if so, then I recommend that they simulate their data under conditions that are known only to themselves; after having generated the data, the students exchange data sets, and the task then becomes to determine the evolutionary pattern and process. This often turns out to be harder than expected, and because it is an emulation of a real-case scenario, it will sharpen the students' approach to phylogenetics.
There are many interesting questions that can be addressed using hetero, and some of them are currently being investigated at the University of Sydney, and elsewhere. The results from some of these experiments have been published (e.g., Jermiin et al. 2004; Ho and Jermiin 2004). Below I outline two simple examples that illustrate the use of hetero - many more examples are possible, but I leave it to the imagination of the inquisitive investigator to find out what they are.
Long Branch Attraction
Jermiin et al. (2003) outline an example of how hetero might be used to study the problem that is commonly referred to as the 'long branch attraction' effect. The problem has been studied extensively by others (e.g., Felsenstein 1978; Hasegawa et al. 1991; Steel et al. 1993), and can easily be illustrated using two sets of parameters, which I outline below.
Follow the ordered list, and analyse the alignments using the phylogenetic method of your choice.
- Use hetero to generate 100 alignments of 1000 nucleotides using the following steps and parameters:
- Use time to measure the length of edges in the phylogenetic tree.
- Enter the following set of edge lengths:
Length of a ... 0.95
Length of b ... 0.95
Length of c ... 0.95
Length of d ... 0.95
Length of e ... 0.05
Length of f ... 0.05
- Enter the following nucleotide content in ancestral sequence:
0.25000 0.25000 0.25000 0.25000
- Enter the following nucleotide content for the six models:
0.25000 0.25000 0.25000 0.25000
- Enter the following conditional rates for the six models:
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
- Allow multiple substitutions at the same site to occur.
- Save the information that describes the input in a file called sim_1.xls and the corresponding 100 alignments in a file called sim_1.aln.
- Analyse the data in sim_1.aln using your favourite phylogenetic program, and record how many times the phylogenetic program recovers the correct phylogenetic tree (which is listed in sim_1.xls).
- Repeat the simulation experiment but this time with different conditional rates.
- Enter the following conditional rates for the models Re and Rf:
------- 0.20000 0.20000 0.20000
0.20000 ------- 0.20000 0.20000
0.20000 0.20000 ------- 0.20000
0.20000 0.20000 0.20000 -------
- Enter the following conditional rates for the models Ra and Rd:
------- 0.02000 0.02000 0.02000
0.02000 ------- 0.02000 0.02000
0.02000 0.02000 ------- 0.02000
0.02000 0.02000 0.02000 -------
- Enter the following conditional rates for the models Rb and Rc:
------- 0.38000 0.38000 0.38000
0.38000 ------- 0.38000 0.38000
0.38000 0.38000 ------- 0.38000
0.38000 0.38000 0.38000 -------
- Save the information that describes the input in a file called sim_2.xls and the corresponding 100 alignments in a file called sim_2.aln.
- Analyse the data in sim_2.aln using the same phylogenetic program as before, and record how many times the program recovers the correct phylogenetic tree (which is listed in sim_2.xls).
- Compare the phylogenetic results and discuss why they may or may not be different.
- Repeat the phylogenetic analysis with another phylogenetic program and discuss why the two programs produce similar or different phylogenetic results.
Convergence of Nucleotide Content
Jermiin et al. (2004), among others (e.g., Galtier and Gouy 1995; Conant and Lewis 2001), have studied the effects of convergence of nucleotide content on phylogenetic estimates, and it is clear that as the nucleotide content becomes more and more heterogeneous, some phylogenetic methods are increasingly unable to recover the correct tree. This problem can easily be illustrated using two sets of parameters, which I outline below.
Follow the ordered list, and analyse the alignments using the phylogenetic method of your choice.
- Generate 100 alignments of 1000 nucleotides using the following steps and parameters:
- Use time to measure the length of edges in the phylogenetic tree.
- Enter the following set of edge lengths:
Length of a ... 0.475
Length of b ... 0.475
Length of c ... 0.475
Length of d ... 0.475
Length of e ... 0.025
Length of f ... 0.025
- Enter the following nucleotide content in ancestral sequence:
0.25000 0.25000 0.25000 0.25000
- Enter the following nucleotide content for the six models:
0.25000 0.25000 0.25000 0.25000
- Enter the following conditional rates for the six models:
------- 0.40000 0.40000 0.40000
0.40000 ------- 0.40000 0.40000
0.40000 0.40000 ------- 0.40000
0.40000 0.40000 0.40000 -------
- Allow multiple substitutions at the same site to occur.
- Save the information that describes the input in a file called sim_3.xls and the corresponding 100 alignments in a file called sim_3.aln.
- Analyse the data in sim_3.aln using your favourite phylogenetic program, and record how many times the phylogenetic program recovers the correct phylogenetic tree (which is listed in sim_3.xls).
- Repeat the simulation experiment but this time with different nucleotide frequencies for the six models.
- Enter the following nucleotide content for models Re and Rf:
0.25000 0.25000 0.25000 0.25000
- Enter the following nucleotide content for models Ra and Rd:
0.50000 0.00000 0.00000 0.50000
- Enter the following nucleotide content for models Rb and Rc:
0.00000 0.50000 0.50000 0.00000
- Save the information that describes the input in a file called sim_4.xls and the corresponding 100 alignments in a file called sim_4.aln.
- Analyse the data in sim_4.aln using the same phylogenetic program as before, and record how many times the phylogenetic program recovers the correct phylogenetic tree (which is listed in sim_4.xls).
- Compare the phylogenetic results and discuss why they may or may not be different.
- Repeat the phylogenetic analysis with another phylogenetic program, and discuss why the programs may produce similar or different phylogenetic results.
I have outlined two exercises above but have deliberately not included the results because doing so would defeat the purpose of the exercises. There are a number of other exercises that are readily available using hetero; many of these involve using other parameter values for the same exercises as those outlined above. Another exercise, which will not be outlined here, involves combining convergence in the nucleotide content and rate heterogeneity among the diverging lineages - this combination formed the basis for several unexpected results (Ho and Jermiin 2004).
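Both exercises ask how often a phylogenetic program recovers the correct tree. With four sequences, this amounts to checking which two sequences are grouped together, whatever the rooting or edge lengths. The Python sketch below shows one way to do the tally, assuming the phylogenetic program reports each inferred tree as a Newick string; the function name and the list of inferred trees are hypothetical.

```python
import re

TAXA = frozenset({"SeqA", "SeqB", "SeqC", "SeqD"})

def split_of(newick):
    """Return the internal split of a four-taxon tree, or None if it is unresolved."""
    # Remove whitespace and edge lengths such as ":0.9500".
    stripped = re.sub(r":[0-9.eE+-]+", "", newick.replace(" ", ""))
    # An innermost pair of parentheses holding exactly two names is a cherry.
    cherry = re.search(r"\(([^(),;]+),([^(),;]+)\)", stripped)
    if cherry is None:
        return None
    pair = frozenset(cherry.groups())
    return frozenset({pair, TAXA - pair})

# The tree used in the simulation, as listed in the first output file.
true_tree = "((SeqA:0.9500,SeqB:0.9500):0.0500,(SeqC:0.9500,SeqD:0.9500):0.0500);"

# Hypothetical inferred trees, one per simulated alignment.
inferred = [
    "(SeqB:0.13,(SeqC:0.15,SeqD:0.14):0.01,SeqA:0.14);",
    "((SeqA,SeqC),(SeqB,SeqD));",
    "(SeqA,SeqB,SeqC,SeqD);",
]

correct = sum(split_of(tree) == split_of(true_tree) for tree in inferred)
print(f"{correct} of {len(inferred)} trees match the true topology")
```

The comparison ignores rooting and edge lengths on purpose, because these differ between phylogenetic programs even when the inferred topology is the same.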
____________________________
References
- Ababneh F, Jermiin LS, Robinson J (2006). Generation of the exact distribution and simulation of matched nucleotide sequences on a phylogenetic tree. Journal of Mathematical Modelling and Algorithms 5, 291-308.
- Conant GC, Lewis PO (2001). Effects of nucleotide composition bias on the success of the parsimony criterion on phylogenetic inference. Molecular Biology and Evolution 18, 1024-1033.
- Felsenstein J (1978). Cases in which parsimony and compatibility methods will be positively misleading. Systematic Zoology 27, 401-410.
- Galtier N, Gouy M (1995). Inferring phylogenies from DNA sequences of unequal base compositions. Proceedings of the National Academy of Sciences USA 92, 11317-11321.
- Hasegawa M, Kishino H, Saitou N (1991). On the maximum likelihood method in molecular phylogenetics. Journal of Molecular Evolution 32, 443-445.
- Ho SYW, Jermiin LS (2004). Tracing the decay of the historical signal in biological sequence data. Systematic Biology 53, 623-637.
- Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2003). Hetero: a program to simulate the evolution of DNA on a four-taxon tree. Applied Bioinformatics 2, 159-163.
- Jermiin LS, Ho SYW, Ababneh F, Robinson J, Larkum AWD (2004). The biasing effect of compositional heterogeneity on phylogenetic estimates may be underestimated. Systematic Biology 53, 638-643.
- Steel MA, Hendy MD, Penny D (1993). Parsimony can be consistent! Systematic Biology 42, 581-587.
Updated 9-Oct-2006