DATA MINING and DISTANCE MATRIX FILE

How to obtain a Distance Matrix File from
ClustalW alignment, FASTA or BLAST search results
and use it with GenomePixelizer and PhyloGrapher

by Alexander Kozik, Ivan Kozik, Garrick Ng and Richard Michelmore
University of California at Davis

INTRODUCTION

Feedback from many users of GenomePixelizer and PhyloGrapher made us realize that setting up the Matrix File is not a trivial task for users who are not familiar with the Unix or Linux environment. In our own research we used custom scripts to generate matrix file(s), but these scripts were not perfect, were not easy to use, and required frequent modifications to handle different data sets.

Here we describe two basic approaches, namely, ClustalW/Phylip and FASTA/BLAST searches, that can be used to generate a matrix file. We have also provided Tcl/Tk and Perl scripts to parse ClustalW/Phylip and FASTA/BLAST outputs to generate matrix file(s) for GenomePixelizer and PhyloGrapher.

Part I: How to set up a matrix file using ClustalW
The matrix contains data that shows relationships between a given set of elements. In molecular biology these elements are protein or DNA sequences. Values in the matrix file show distance, similarity or identity between different sequences.

For example, for the set of protein sequences A,B,C,D and E:


>A
GARFIELDMADCATLIKETHAT
>B
GARFIELDBADCATLIKETHIS
>C
GARFIELDFATRATTAKETHAT
>D
WHATEVERFATRATTAKETHAT
>E
THISISADISTINCTPEPTIDE

we can get a multiple sequence alignment:


A   -GARFIELDMADCATLIKETHAT
B   -GARFIELDBADCATLIKETHIS
C   -GARFIELDFATRATTAKETHAT
D   WHATEVER-FATRATTAKETHAT
E   -THISISADISTINCTPEPTIDE
         :. . :      : *

Based on the alignment we can set up the DISTANCE matrix file:


       "A"     "B"     "C"     "D"     "E"
"A"     0       0.136   0.227   0.476   0.864
"B"     0.136   0       0.318   0.571   0.864
"C"     0.227   0.318   0       0.238   0.773
"D"     0.476   0.571   0.238   0       0.857
"E"     0.864   0.864   0.773   0.857   0

or similar matrix file that shows the IDENTITY instead of distance:


       "A"     "B"     "C"     "D"     "E"
"A"     1       0.864   0.773   0.524   0.136
"B"     0.864   1       0.682   0.429   0.136
"C"     0.773   0.682   1       0.762   0.227
"D"     0.524   0.429   0.762   1       0.143
"E"     0.136   0.136   0.227   0.143   1

Note that the first and second matrices contain identical information. The difference is that the first file shows DISTANCE, and the second one contains IDENTITY data.
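As a rough illustration of where such numbers can come from, the sketch below computes a simple mismatch fraction over the ungapped columns of two aligned sequences and converts it to identity. That this matches how a particular ClustalW version calculates its distances is an assumption, and the function name p_distance is ours. For the pairs A-B, A-C and A-D of the example alignment it reproduces the values in the matrices above.

```python
def p_distance(seq_a, seq_b, gap="-"):
    """Fraction of mismatches among columns where neither sequence has a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    mismatches = 0
    for a, b in zip(seq_a, seq_b):
        if a == gap or b == gap:
            continue  # columns with a gap are not counted
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


# Aligned sequences copied from the example alignment above.
alignment = {
    "A": "-GARFIELDMADCATLIKETHAT",
    "B": "-GARFIELDBADCATLIKETHIS",
    "C": "-GARFIELDFATRATTAKETHAT",
    "D": "WHATEVER-FATRATTAKETHAT",
}

for other in ("B", "C", "D"):
    distance = p_distance(alignment["A"], alignment[other])
    identity = 1.0 - distance  # IDENTITY = 1 - DISTANCE
    print(f"A\t{other}\tdistance={distance:.3f}\tidentity={identity:.3f}")
```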

GenomePixelizer and PhyloGrapher take as input a matrix file based on IDENTITY. ClustalW output usually contains DISTANCE data, so you will need to recalculate all DISTANCES into IDENTITIES using the simple formula IDENTITY = 1 - DISTANCE.

You may obtain a distance matrix file by running ClustalW. Align your sequences using the standard ClustalW procedure, check the alignment and fix it if it is not perfect. Then run ClustalW's "Phylogenetic trees". In "Output format options" set "Toggle Phylip distance matrix output = ON". Then run "Draw tree now"; the program will ask you to "Enter name for distance matrix output file". Press "Enter" to accept the default file name with the extension .dst. Finally you should get a matrix file which looks like this:


     5
A           0.000  0.136  0.227  0.476  0.864 
B           0.136  0.000  0.318  0.571  0.864 
C           0.227  0.318  0.000  0.238  0.773 
D           0.476  0.571  0.238  0.000  0.857 
E           0.864  0.864  0.773  0.857  0.000
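A file in this layout can be read and converted with a few lines of Python. The sketch below is not the code of the scripts distributed with this page; it assumes that every row of the matrix fits on one line and that sequence names contain no spaces, which holds for the small example here but may not hold for large Phylip files. The function names are hypothetical.

```python
def read_phylip_distances(text):
    """Parse a square Phylip-style distance matrix (one row per line)."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    n = int(lines[0].split()[0])  # first line holds the number of sequences
    names, rows = [], []
    for ln in lines[1:1 + n]:
        fields = ln.split()
        names.append(fields[0])
        rows.append([float(x) for x in fields[1:]])
    return names, rows


def to_identity(rows):
    """Apply IDENTITY = 1 - DISTANCE to every cell."""
    return [[round(1.0 - d, 3) for d in row] for row in rows]


dst_text = """
     5
A           0.000  0.136  0.227  0.476  0.864
B           0.136  0.000  0.318  0.571  0.864
C           0.227  0.318  0.000  0.238  0.773
D           0.476  0.571  0.238  0.000  0.857
E           0.864  0.864  0.773  0.857  0.000
"""

names, distances = read_phylip_distances(dst_text)
identity = to_identity(distances)
for name, row in zip(names, identity):
    print(name, *row, sep="\t")
```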

After a conversion of DISTANCE into IDENTITY we get:


       "A"     "B"     "C"     "D"     "E"
"A"     1       0.864   0.773   0.524   0.136
"B"     0.864   1       0.682   0.429   0.136
"C"     0.773   0.682   1       0.762   0.227
"D"     0.524   0.429   0.762   1       0.143
"E"     0.136   0.136   0.227   0.143   1

Note that the full N x N matrix contains redundant information, because every pair of sequences appears twice. A non-redundant file may look like:


       "A"     "B"     "C"     "D"     "E"
"A"     1       0.864   0.773   0.524   0.136
"B"             1       0.682   0.429   0.136
"C"                     1       0.762   0.227
"D"                             1       0.143
"E"                                     1

or an even simpler version because we assume that all sequences are identical to themselves.


       "A"     "B"     "C"     "D"     "E"
"A"             0.864   0.773   0.524   0.136
"B"                     0.682   0.429   0.136
"C"                             0.762   0.227
"D"                                     0.143
"E"

Another form of presentation of the data above looks like this:


A     B	   0.864
A     C	   0.773
A     D	   0.524
A     E	   0.136
B     C	   0.682
B     D	   0.429
B     E	   0.136
C     D	   0.762
C     E	   0.227
D     E	   0.143

which is "binary" or a three column type of matrix representation. This matrix file contains the same amount of information as all files above, just in a different format. This type of matrix file format is used by GenomePixelizer and PhyloGrapher. It is easy to calculate the number of lines in this file using a simple formula:


                  (NUMBER_OF_GENES)^2 - NUMBER_OF_GENES
NUMBER_OF_LINES = -------------------------------------
                                    2

So, for five genes we have: (25 - 5)/2 = 10 lines
one hundred genes give us: (10000 - 100)/2 = 4950 lines
one thousand genes give us: (1000000 - 1000)/2 = 499500 lines
the whole Arabidopsis genome: (25,000*25,000 - 25,000)/2 = 312,487,500 lines (about 3.12e+08)
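The counting formula and the three column layout can be checked with a short Python sketch. It lists each unordered pair once, taken from the upper triangle of the identity matrix, and compares the number of lines with the formula. The function names count_pairs and matrix_to_pairs are ours, chosen for illustration.

```python
from itertools import combinations


def count_pairs(n_genes):
    """Number of lines in the three column file: (n^2 - n) / 2."""
    return (n_genes * n_genes - n_genes) // 2


def matrix_to_pairs(names, identity):
    """List each unordered pair once, taken from the upper triangle."""
    return [(names[i], names[j], identity[i][j])
            for i, j in combinations(range(len(names)), 2)]


names = ["A", "B", "C", "D", "E"]
identity = [
    [1, 0.864, 0.773, 0.524, 0.136],
    [0.864, 1, 0.682, 0.429, 0.136],
    [0.773, 0.682, 1, 0.762, 0.227],
    [0.524, 0.429, 0.762, 1, 0.143],
    [0.136, 0.136, 0.227, 0.143, 1],
]

pairs = matrix_to_pairs(names, identity)
for first, second, value in pairs:
    print(f"{first}\t{second}\t{value}")

for n in (5, 100, 1000, 25000):
    print(n, "genes ->", count_pairs(n), "lines")
```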

Even half a million lines is too much for a matrix file. Fortunately, we can reduce the size of a given matrix file by eliminating lines for pairs of genes that have low similarity or identity. For example, suppose that for a particular project we are not interested in pairs of genes with identity lower than 0.2. In this case the matrix file above becomes smaller:


A     B	   0.864
A     C	   0.773
A     D	   0.524
B     C	   0.682
B     D	   0.429
C     D	   0.762
C     E	   0.227
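The cutoff step amounts to a one-pass filter over the three column file. A minimal sketch, assuming whitespace-separated columns with the identity value in the third column (the function name filter_by_identity is hypothetical):

```python
def filter_by_identity(lines, cutoff):
    """Keep only lines whose third column (identity) is >= cutoff."""
    kept = []
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        if float(fields[2]) >= cutoff:
            kept.append(line)
    return kept


matrix_lines = [
    "A\tB\t0.864", "A\tC\t0.773", "A\tD\t0.524", "A\tE\t0.136",
    "B\tC\t0.682", "B\tD\t0.429", "B\tE\t0.136",
    "C\tD\t0.762", "C\tE\t0.227", "D\tE\t0.143",
]

kept = filter_by_identity(matrix_lines, 0.2)
print("\n".join(kept))
```

With a cutoff of 0.2 the three pairs A-E, B-E and D-E are dropped, which leaves the seven lines listed above.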

When you analyze the set of genes for a whole genome, it is unlikely that all of them (20,000 - 30,000) have high homology to each other. If that is the case, re-check your data set or re-sequence the given genome. From our experience, the Arabidopsis genome matrix file for all genes with an identity level greater than 0.6 contains about 20,000 lines of pairwise data. GenomePixelizer can handle a matrix file of this size without any problems.

To use an "N x N" matrix file derived from ClustalW you need to convert it into the "binary", three column type of matrix described above. "Phylip2Genopix_024.tcl" is a Tcl/Tk script that converts an "N x N" ClustalW DISTANCE matrix file into the GenomePixelizer or PhyloGrapher matrix format. The script will prompt you for the input file (ClustalW/Phylip format), the output file (three column type) and the identity cutoff (the script transforms all DISTANCES into IDENTITIES and excludes all data below the specified identity cutoff). Finally, you can sort the output file by the "Identity" column using the sort command (on Linux):


 "sort -n -r -k 3 Matrix.txt > Matrix.sorted"

or on Windows using an Excel spreadsheet.

The approach and program described above are valid for a Phylip distance matrix too.
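If neither the Unix sort command nor a spreadsheet is at hand, the sorting step can also be done in Python. This is only a sketch under the assumption that the file has whitespace-separated columns with the identity value in the third one; the function name sort_by_identity is ours.

```python
def sort_by_identity(lines):
    """Sort three column matrix lines by identity, highest first."""
    rows = [ln.split() for ln in lines if ln.strip()]
    rows.sort(key=lambda fields: float(fields[2]), reverse=True)
    return ["\t".join(fields) for fields in rows]


sample = ["A\tD\t0.524", "A\tB\t0.864", "C\tE\t0.227", "A\tC\t0.773"]
sorted_lines = sort_by_identity(sample)
print("\n".join(sorted_lines))
```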

Part II: How to set up a matrix file using FASTA or BLAST search output
It is incorrect to try to obtain a distance matrix file for a whole genome using ClustalW. It simply does not work. You cannot get a meaningful alignment, and therefore a distance matrix, for sets of unrelated sequences. Another limitation is that ClustalW runs out of memory with a huge set of sequences. However, we can get what we need (distance matrix data) from the results of a FASTA or BLAST search by running "everything against everything". This means that the analyzed database should contain all protein or DNA sequences for a given genome, and we run queries for all of these protein or DNA sequences against that database.

There are two types of data we may want to extract from a FASTA or BLAST search to input into GenomePixelizer or PhyloGrapher: expectation values and identity values. There is an endless discussion about which data are more important for sequence analysis, and everyone, most likely, is right in defending a particular point of view. If you use only the expectation value, you miss valuable similarity/identity data. For example, very long sequences may align with good expectation values even though their similarity/identity is relatively low. Short sequences may have high similarity/identity values with a poor expectation value. If the database is small and contains many similar sequences, you should not use expectation values as a criterion for sequence homology, because the statistics depend on the size and heterogeneity of the database. So the type of data to extract depends strongly on your particular type of research, your needs and your scientific point of view. In our particular case, GenomePixelizer and PhyloGrapher matrix files should contain sequence identity or similarity data.

Results of a FASTA or BLAST search usually have two subsections:
A. Best scores ("bit" scores and expectation values only)
B. Alignments (expectation values, identity data and other detailed statistical information)
A good tutorial about BLAST principles, working and data interpretation was published in Genome Biology 2001, 2(10):reviews2002.1-2002.10

For example, a FASTA search of Arabidopsis At1g10920 against the nr protein NCBI database:


At1g10920.txt: 821 aa
 >At1g10920
 vs  /usr/local/genome/database/nr library

section "A":


The best scores are:                                                                               opt bits E(731512)

gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399 [A  ( 821) 5477 1184       0

gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana]           (1584) 3197  695 6.6e-198

gi|10177352|dbj|BAB10695.1| (AB015468) disease resistance protein [Arabidopsis thaliana]   ( 908) 3171  689 1.8e-196

gi|11357253|pir||T48898 disease resistance protein RPP8 [validated] - Arabidopsis thalian  ( 906) 3146  684 7.4e-195

gi|11357252|pir||T48899 disease resistance protein rpp8 [similarity] - Arabidopsis thalia  ( 908) 3133  681 5.1e-194

gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thal  ( 909) 3106  675 2.9e-192

gi|5080812|gb|AAD39321.1|AC007258_10 (AC007258) Putative disease resistance protein [Arab  ( 906) 2306  504 1.4e-140

gi|12321052|gb|AAG50648.1|AC082643_12 (AC082643) disease resistance protein, putative [Ar  ( 899) 2249  491 6.6e-137

gi|12321042|gb|AAG50638.1|AC082643_2 (AC082643) disease resistance protein, putative [Ara  ( 907) 2208  483  3e-134

gi|14475950|gb|AAK62797.1|AC027036_18 (AC027036) viral resistance protein, putative [Arab  (1155) 2166  474  2e-131

and section "B":


>>gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399 [Arabidopsis
thaliana] (821 aa)
 initn: 5477 init1: 5477 opt: 5477  Z-score: 6340.7  bits: 1184.2 E():    0
Smith-Waterman score: 5477;  100.000% identity (100.000% ungapped) in 821 aa overlap (1-821:1-821)

>>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana]                (1584 aa)
 initn: 2867 init1: 1575 opt: 3197  Z-score: 3690.8  bits: 694.8 E(): 6.6e-198
Smith-Waterman score: 3631;  69.112% identity (72.727% ungapped) in 845 aa overlap (3-814:316-1151)

>>gi|10177352|dbj|BAB10695.1| (AB015468) disease resistance protein [Arabidopsis thaliana]        (908 aa)
 initn: 2244 init1: 1185 opt: 3171  Z-score: 3665.0  bits: 689.2 E(): 1.8e-196
Smith-Waterman score: 3579;  67.376% identity (70.545% ungapped) in 846 aa overlap (1-814:1-840)

>>gi|11357253|pir||T48898 disease resistance protein RPP8 [validated] - Arabidopsis thaliana
gi|3928862|gb|AAC83165.1| (AF089710) disease resistance protein RPP8 [Arabidopsis thaliana] (906 aa)
 initn: 1860 init1: 1179 opt: 3146  Z-score: 3636.0  bits: 683.9 E(): 7.4e-195
Smith-Waterman score: 3548;  67.021% identity (70.347% ungapped) in 846 aa overlap (1-814:1-838)

>>gi|11357252|pir||T48899 disease resistance protein rpp8 [similarity] - Arabidopsis thaliana
gi|3901294|gb|AAC78631.1| (AF089711) rpp8 [Arabidopsis thaliana] gi|8843900|dbj|BAA97426.1|
(AB025638) dise (908 aa)
 initn: 2230 init1: 1183 opt: 3133  Z-score: 3620.9  bits: 681.1 E(): 5.1e-194
Smith-Waterman score: 3540;  66.548% identity (69.678% ungapped) in 846 aa overlap (1-814:1-840)

>>gi|9758146|dbj|BAB08703.1| (AB015477) disease resistance protein [Arabidopsis thaliana]         (901 aa)
 initn: 2999 init1: 1053 opt: 3112  Z-score: 3596.6  bits: 676.6 E(): 1.2e-192
Smith-Waterman score: 3545;  67.456% identity (71.250% ungapped) in 845 aa overlap (1-814:1-831)

>>gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thaliana]  (909 aa)
 initn: 2596 init1: 1160 opt: 3106  Z-score: 3589.6  bits: 675.3 E(): 2.9e-192
Smith-Waterman score: 3529;  67.060% identity (70.297% ungapped) in 847 aa overlap (1-814:1-841)

>>gi|5080812|gb|AAD39321.1|AC007258_10 (AC007258) Putative disease resistance protein [Arabidopsis
thaliana] (906 aa)
 initn: 1327 init1: 433 opt: 2306  Z-score: 2661.7  bits: 503.6 E(): 1.4e-140
Smith-Waterman score: 2306;  47.986% identity (51.396% ungapped) in 844 aa overlap (1-814:6-823)

>>gi|12321052|gb|AAG50648.1|AC082643_12 (AC082643) disease resistance protein, putative [Arabidopsis
thaliana] (899 aa)
 initn: 1437 init1: 756 opt: 2249  Z-score: 2595.6  bits: 491.3 E(): 6.6e-137
Smith-Waterman score: 2493;  50.236% identity (53.383% ungapped) in 848 aa overlap (5-814:3-838)

>>gi|12321042|gb|AAG50638.1|AC082643_2 (AC082643) disease resistance protein, putative [Arabidopsis
thaliana] (907 aa)
 initn: 1653 init1: 478 opt: 2208  Z-score: 2548.0  bits: 482.5 E(): 3e-134
Smith-Waterman score: 2501;  49.941% identity (52.861% ungapped) in 851 aa overlap (1-814:1-841)

>>gi|14475950|gb|AAK62797.1|AC027036_18 (AC027036) viral resistance protein, putative [Arabidopsis
thaliana] (1155 aa)
 initn: 1816 init1: 511 opt: 2166  Z-score: 2497.4  bits: 473.5 E(): 2e-131
Smith-Waterman score: 2429;  48.174% identity (50.998% ungapped) in 849 aa overlap (1-813:1-838)

As you can see from section "B", there is no direct correlation between the expectation values and identity levels. A hit with a better expectation value may have a lower identity value, and the opposite is also true: gi|11357252 (E = 5.1e-194) has 66.548% identity, while gi|9758146, with a worse expectation value (E = 1.2e-192), has 67.456% identity. By parsing section "B" (data extraction) you can get the data for a matrix file; in our example it looks like this:


At1g10920	1931650		0		100.000
At1g10920	7769860		6.6e-198	69.112
At1g10920	10177352	1.8e-196	67.376
At1g10920	11357253	7.4e-195	67.021
At1g10920	11357252	5.1e-194	66.548
At1g10920	9758146		1.2e-192	67.456
At1g10920	7110565		2.9e-192	67.060
At1g10920	5080812		1.4e-140	47.986
At1g10920	12321052	6.6e-137	50.236
At1g10920	12321042	3e-134		49.941
At1g10920	14475950	2e-131		48.174

where the first column is the query, the second column is the hit (subject), the third column is the expectation value and the fourth column is the identity. Setting up such a table by "copy - paste" takes an enormous amount of time, and the final matrix will almost certainly be full of errors. This procedure should be avoided.
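To illustrate what automated extraction from section "B" involves, here is a minimal Python sketch. It is not the logic of the Tcl/Tk script distributed with this page; it assumes that each hit starts with a ">>" line, that the expectation value follows "E():" and that the identity appears as "NN.NNN% identity", as in the excerpt above. The function name parse_fasta_hits and the regular expressions are ours.

```python
import re

HEADER = re.compile(r"^>>(\S+)")            # start of a hit record
EXPECT = re.compile(r"E\(\):\s*(\S+)")      # expectation value
IDENTITY = re.compile(r"([\d.]+)% identity")  # first percentage on the line


def parse_fasta_hits(query, text):
    """Return (query, subject, expectation, identity) tuples as strings."""
    rows = []
    subject = expect = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            fields = m.group(1).split("|")
            # For "gi|NUMBER|..." headers use the gi number as subject id.
            gi_style = fields[0] == "gi" and len(fields) > 1
            subject = fields[1] if gi_style else fields[0]
            expect = None
            continue
        m = EXPECT.search(line)
        if m and subject is not None:
            expect = m.group(1)
            continue
        m = IDENTITY.search(line)
        if m and subject is not None and expect is not None:
            rows.append((query, subject, expect, m.group(1)))
            subject = expect = None
    return rows


SAMPLE = """\
>>gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog (821 aa)
 initn: 5477 init1: 5477 opt: 5477  Z-score: 6340.7  bits: 1184.2 E():    0
Smith-Waterman score: 5477;  100.000% identity (100.000% ungapped) in 821 aa overlap (1-821:1-821)

>>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana] (1584 aa)
 initn: 2867 init1: 1575 opt: 3197  Z-score: 3690.8  bits: 694.8 E(): 6.6e-198
Smith-Waterman score: 3631;  69.112% identity (72.727% ungapped) in 845 aa overlap (3-814:316-1151)
"""

rows = parse_fasta_hits("At1g10920", SAMPLE)
for row in rows:
    print("\t".join(row))
```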

Here, we provide a Tcl/Tk script, Fasta2Phylopix_024.tcl, to parse FASTA search output. It should work fine with FASTA files of version 3. It should also work with the output of the MPI version of FASTA. The script will prompt you for the input file (results of FASTA search) and output file (four column matrix) and other parameters (type of FASTA search).
Example input to generate a matrix:


./Fasta2Phylopix_024.tcl
FASTA program was "(m)pi" or "(r)egular": r
Enter the SOURCE file name: fasta_search_At1g10920.txt
Enter the DESTINATION file name: fasta.matrix
Does SOURCE contain description line? (y/n): y
extract DESCRIPTION line (Y/N): n
Subject DELIMITER type "|"(t) or " "(s): t
type of FASTA search was fasta(n) fasta(x) fasta(p): p

If your input file is fasta_search_At1g10920.txt the program should generate an output like fasta.matrix.
This "fasta.matrix" file is not yet ready as input for GenomePixelizer or PhyloGrapher, for three reasons.
Reason 1: we need to swap columns three and four and divide identity values by 100.
Reason 2: the output file may contain redundant information like this:

gene_A  gene_B  2e-10  55
gene_B  gene_A  2e-10  55

Reason 3: you may want to reduce the size of the matrix file by specifying an identity cutoff.
The Perl script phylopix_redundancy.pl does the last steps for us, namely:
1. swaps column three and four
2. gets rid of redundant information
3. extracts data with specified identity cutoff.
Example of program usage:


"perl phylopix_redundancy.pl fasta.matrix fasta.matrix.final 0.2"

After all of these procedures, the "three column matrix file" fasta.matrix.final generated from the results of a FASTA search can be used by GenomePixelizer or PhyloGrapher. (The file still has a fourth column, which is ignored by GenomePixelizer and PhyloGrapher and can be deleted by the user.)
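The three post-processing steps can be sketched in Python as follows. This is an illustration of the idea, not a port of phylopix_redundancy.pl: keeping the first of two reciprocal hits is our assumption, and the third sample row (gene_C) is invented for the example.

```python
def finalize_matrix(rows, cutoff):
    """Swap columns 3 and 4, scale identity to 0..1, drop reciprocal
    duplicates and rows below the identity cutoff."""
    seen = set()
    out = []
    for query, subject, expect, identity in rows:
        fraction = float(identity) / 100.0
        if fraction < cutoff:
            continue
        key = tuple(sorted((query, subject)))  # same key for A-B and B-A
        if key in seen:
            continue
        seen.add(key)
        out.append((query, subject, round(fraction, 5), expect))
    return out


four_column = [
    ("gene_A", "gene_B", "2e-10", "55"),
    ("gene_B", "gene_A", "2e-10", "55"),  # reciprocal hit, redundant
    ("gene_A", "gene_C", "1e-3", "15"),   # hypothetical row below the cutoff
]

final = finalize_matrix(four_column, 0.2)
for query, subject, identity, expect in final:
    print(f"{query}\t{subject}\t{identity}\t{expect}")
```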

You may process several FASTA files at once. To do that, concatenate them into one file using the Unix "cat" program, for example:


"cat *.txt > Fasta_Results"

and then run Fasta2Phylopix_024.tcl as described above.

----------------------

BLAST output is similar to FASTA output.
For example, a BLAST search of Arabidopsis At1g10920 against the nr protein NCBI database:


>gi|15220237|ref|NP_172561.1| disease resistance protein RPM1 isolog [Arabidopsis thaliana]
 gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399
           [Arabidopsis thaliana]
          Length = 821

 Score = 1584 bits (4101), Expect = 0.0
 Identities = 821/821 (100%), Positives = 821/821 (100%)
-------
>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana]
          Length = 1584

 Score = 1079 bits (2791), Expect = 0.0
 Identities = 583/844 (69%), Positives = 668/844 (79%), Gaps = 40/844 (4%)
-------
>gi|15239027|ref|NP_199673.1| disease resistance protein [Arabidopsis thaliana]
 gi|10177352|dbj|BAB10695.1| (AB015468) disease resistance protein [Arabidopsis thaliana]
          Length = 908

 Score = 1042 bits (2694), Expect = 0.0
 Identities = 570/846 (67%), Positives = 658/846 (77%), Gaps = 38/846 (4%)
-------
>gi|11357253|pir||T48898 disease resistance protein RPP8 [validated] - Arabidopsis thaliana
 gi|3928862|gb|AAC83165.1| (AF089710) disease resistance protein RPP8 [Arabidopsis thaliana]
          Length = 906

 Score = 1035 bits (2677), Expect = 0.0
 Identities = 567/846 (67%), Positives = 654/846 (77%), Gaps = 40/846 (4%)
-------
>gi|15239876|ref|NP_199160.1| disease resistance protein RPP8 [Arabidopsis thaliana]
 gi|11357252|pir||T48899 disease resistance protein rpp8 [similarity] - Arabidopsis thaliana
 gi|3901294|gb|AAC78631.1| (AF089711) rpp8 [Arabidopsis thaliana]
 gi|8843900|dbj|BAA97426.1| (AB025638) disease resistance protein RPP8 [Arabidopsis thaliana]
          Length = 908

 Score = 1032 bits (2669), Expect = 0.0
 Identities = 563/846 (66%), Positives = 651/846 (76%), Gaps = 38/846 (4%)
-------
>gi|15238507|ref|NP_198395.1| disease resistance protein [Arabidopsis thaliana]
 gi|9758146|dbj|BAB08703.1| (AB015477) disease resistance protein [Arabidopsis thaliana]
          Length = 901

 Score = 1045 bits (2701), Expect = 0.0
 Identities = 569/845 (67%), Positives = 654/845 (77%), Gaps = 45/845 (5%)
-------
>gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thaliana]
          Length = 909

 Score = 1024 bits (2647), Expect = 0.0
 Identities = 568/847 (67%), Positives = 655/847 (77%), Gaps = 39/847 (4%)
-------
>gi|15218909|ref|NP_176187.1| disease resistance protein, putative [Arabidopsis thaliana]
 gi|5080812|gb|AAD39321.1|AC007258_10 (AC007258) Putative disease resistance protein [Arabidopsis
           thaliana]
          Length = 906

 Score =  662 bits (1707), Expect = 0.0
 Identities = 405/851 (47%), Positives = 539/851 (62%), Gaps = 70/851 (8%)
-------
>gi|15217959|ref|NP_176137.1| disease resistance protein, putative [Arabidopsis thaliana]
 gi|12321052|gb|AAG50648.1|AC082643_12 (AC082643) disease resistance protein, putative [Arabidopsis
           thaliana]
          Length = 899

 Score =  686 bits (1770), Expect = 0.0
 Identities = 414/834 (49%), Positives = 551/834 (65%), Gaps = 48/834 (5%)
-------
>gi|15217954|ref|NP_176135.1| disease resistance protein, putative [Arabidopsis thaliana]
 gi|12321042|gb|AAG50638.1|AC082643_2 (AC082643) disease resistance protein, putative [Arabidopsis
           thaliana]
          Length = 907

 Score =  730 bits (1884), Expect = 0.0
 Identities = 429/853 (50%), Positives = 564/853 (65%), Gaps = 51/853 (5%)
-------
>gi|15217999|ref|NP_176151.1| viral resistance protein, putative [Arabidopsis thaliana]
 gi|14475950|gb|AAK62797.1|AC027036_18 (AC027036) viral resistance protein, putative [Arabidopsis
           thaliana]
          Length = 1155

 Score =  683 bits (1763), Expect = 0.0
 Identities = 405/848 (47%), Positives = 539/848 (62%), Gaps = 45/848 (5%)
-------

We may set up a table with the extracted data similar to a FASTA search:


At1g10920	1931650		0.0		100	     100
At1g10920	7769860		0.0		69	     79
At1g10920	10177352	0.0		67	     77
At1g10920	11357253	0.0		67	     77
At1g10920	11357252	0.0		66	     76
At1g10920	9758146		0.0		67	     77
At1g10920	7110565		0.0		67	     77
At1g10920	5080812		0.0		47	     62
At1g10920	12321052	0.0		49	     65
At1g10920	12321042	0.0		50	     65
At1g10920	14475950	0.0		47	     62

You see here that BLAST output is almost identical to FASTA search results. BLAST results provide additional information such as sequence similarity data or "positives" (fifth column of the table above). To extract data from BLAST search results you can use the Blast2Phylopix_024.tcl program. You need to specify the input and output files and what kind of data you want to extract. Example input to generate a matrix:


./Blast2Phylopix_024.tcl 
Enter the SOURCE file name: blast_search_At1g10920_no_filter.txt
Enter the DESTINATION file name: blast.matrix
extract DESCRIPTION line (Y/N): n
type of BLAST search was blast(n) blast(x) blast(p): p

If your input file is blast_search_At1g10920_no_filter.txt the program should generate output like blast.matrix. The output file will contain data in a "four column style" and you need to use the perl script phylopix_redundancy.pl to
1. swap column three and four
2. get rid of redundant information
3. extract data with specified identity cutoff.
An example of program usage:


"perl phylopix_redundancy.pl blast.matrix blast.matrix.final 0.2"

After all of these procedures, the file blast.matrix.final generated from the results of a BLAST search can be used by GenomePixelizer or PhyloGrapher. Note that the fourth column of this file is ignored by GenomePixelizer and PhyloGrapher.
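The extraction step for BLAST output can be sketched in the same spirit. The Python below is not the Blast2Phylopix_024.tcl code; it assumes the plain-text layout shown in the excerpt above, takes the first "gi|" number on the header line as the subject (for entries that merge several identifiers this may differ from the number used in the table above), and keeps only the first alignment reported for each subject. The function name parse_blast_hits is hypothetical.

```python
import re

HEADER = re.compile(r"^>gi\|(\d+)")
EXPECT = re.compile(r"Expect = (\S+)")
IDENT = re.compile(r"Identities = \d+/\d+ \((\d+)%\)")
POSIT = re.compile(r"Positives = \d+/\d+ \((\d+)%\)")


def parse_blast_hits(query, text):
    """Return (query, subject, expect, identity %, positives %) tuples."""
    rows = []
    subject = expect = None
    for line in text.splitlines():
        m = HEADER.match(line)
        if m:
            subject, expect = m.group(1), None
            continue
        if subject is None:
            continue
        m = EXPECT.search(line)
        if m and expect is None:
            expect = m.group(1).rstrip(",")
            continue
        m = IDENT.search(line)
        if m and expect is not None:
            p = POSIT.search(line)  # absent in nucleotide searches
            positives = int(p.group(1)) if p else None
            rows.append((query, subject, expect, int(m.group(1)), positives))
            subject = None  # ignore further alignments to this subject
    return rows


SAMPLE = """\
>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana]
          Length = 1584

 Score = 1079 bits (2791), Expect = 0.0
 Identities = 583/844 (69%), Positives = 668/844 (79%), Gaps = 40/844 (4%)

>gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thaliana]
          Length = 909

 Score = 1024 bits (2647), Expect = 0.0
 Identities = 568/847 (67%), Positives = 655/847 (77%), Gaps = 39/847 (4%)
"""

rows = parse_blast_hits("At1g10920", SAMPLE)
for row in rows:
    print(*row, sep="\t")
```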

A warning about BLAST searches with the default "low complexity" filter: if you want to generate a matrix file using BLAST you need to disable this option, otherwise you may get wrong identity or similarity values. You can compare the results of a BLAST search with filtering (blast_search_At1g10920.txt) to the results without filtering (blast_search_At1g10920_no_filter.txt). The filtering alters similarity/identity values, and BLAST identity scores can fall below 100% even for completely identical proteins.

Input file format requirements

To parse the results of a BLAST search using "Blast2Phylopix_024.tcl", the BLAST output file should contain "gi|" GenBank identifiers in the description line. "Fasta2Phylopix_024.tcl" is more flexible with the input file format: there is no requirement for a "gi|" identifier. However, more flexibility probably means more bugs. Send us a bug report and an example of the input file, and we will try to fix the problem.

SUMMARY

On this web site we describe the usage of three Tcl/Tk scripts and one Perl script that generate matrix file(s) for GenomePixelizer and PhyloGrapher from the output of ClustalW/Phylip and FASTA/BLAST searches.

1. Phylip2Genopix_024.tcl - to transform "N x N" type of matrix into "binary", three column type of matrix.

2. Fasta2Phylopix_024.tcl - to parse results of FASTA search.

3. Blast2Phylopix_024.tcl - to parse results of BLAST search.

4. phylopix_redundancy.pl - to remove redundancy from Fasta2Phylopix_024.tcl and Blast2Phylopix_024.tcl outputs and generate a matrix file in the "proper" format for GenomePixelizer or PhyloGrapher.

We hope that these scripts improve the capability of GenomePixelizer and PhyloGrapher. Any comments from users that help improve the documentation and the code are very welcome.

email: Alexander Kozik
email: Richard Michelmore

Last modified June 26, 2002
