>A
GARFIELDMADCATLIKETHAT
>B
GARFIELDBADCATLIKETHIS
>C
GARFIELDFATRATTAKETHAT
>D
WHATEVERFATRATTAKETHAT
>E
THISISADISTINCTPEPTIDE
A  -GARFIELDMADCATLIKETHAT
B  -GARFIELDBADCATLIKETHIS
C  -GARFIELDFATRATTAKETHAT
D  WHATEVER-FATRATTAKETHAT
E  -THISISADISTINCTPEPTIDE
   :. . : : *
"A" "B" "C" "D" "E" "A" 0 0.136 0.227 0.476 0.864 "B" 0.136 0 0.318 0.571 0.864 "C" 0.227 0.318 0 0.238 0.773 "D" 0.476 0.571 0.238 0 0.857 "E" 0.864 0.864 0.773 0.857 0
"A" "B" "C" "D" "E" "A" 1 0.864 0.773 0.524 0.136 "B" 0.864 1 0.682 0.429 0.136 "C" 0.773 0.682 1 0.762 0.227 "D" 0.524 0.429 0.762 1 0.143 "E" 0.136 0.136 0.227 0.143 1
5
A  0.000  0.136  0.227  0.476  0.864
B  0.136  0.000  0.318  0.571  0.864
C  0.227  0.318  0.000  0.238  0.773
D  0.476  0.571  0.238  0.000  0.857
E  0.864  0.864  0.773  0.857  0.000
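Comparing the distance and identity matrices in this section, each identity value equals one minus the corresponding distance (for example, 1 - 0.136 = 0.864). A minimal Python sketch of that conversion follows; the function name `to_identity` is hypothetical and the relationship is read off the example numbers rather than stated in the text.

```python
def to_identity(distance, digits=3):
    """Convert a pairwise distance (0..1) to an identity value (0..1).

    Assumption: identity = 1 - distance, as the example matrices suggest.
    Rounding hides floating-point noise such as 0.14300000000000002.
    """
    return round(1.0 - distance, digits)


if __name__ == "__main__":
    for pair, distance in [("A B", 0.136), ("A D", 0.476), ("D E", 0.857)]:
        print(pair, to_identity(distance))
```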
"A" "B" "C" "D" "E" "A" 1 0.864 0.773 0.524 0.136 "B" 0.864 1 0.682 0.429 0.136 "C" 0.773 0.682 1 0.762 0.227 "D" 0.524 0.429 0.762 1 0.143 "E" 0.136 0.136 0.227 0.143 1
"A" "B" "C" "D" "E" "A" 1 0.864 0.773 0.524 0.136 "B" 1 0.682 0.429 0.136 "C" 1 0.762 0.227 "D" 1 0.143 "E" 1or an even simpler version because we assume that all sequences are identical to themselves.
"A" "B" "C" "D" "E" "A" 0.864 0.773 0.524 0.136 "B" 0.682 0.429 0.136 "C" 0.762 0.227 "D" 0.143 "E"Another form of presentation of the data above looks like this:
A B 0.864
A C 0.773
A D 0.524
A E 0.136
B C 0.682
B D 0.429
B E 0.136
C D 0.762
C E 0.227
D E 0.143

which is a "binary", or three-column, matrix representation. This matrix file contains the same information as all the files above, just in a different format. This matrix file format is used by GenomePixelizer and PhyloGrapher. The number of lines in this file is easy to calculate with a simple formula:
NUMBER_OF_LINES = ((NUMBER_OF_GENES)^2 - NUMBER_OF_GENES) / 2

So, for five genes we have: (25 - 5)/2 = 10 lines
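The conversion from a square identity matrix to the three-column form, and the line-count formula, can be sketched in Python as follows. This is an illustration only, not the code used by GenomePixelizer or PhyloGrapher; `square_to_pairs` and `number_of_lines` are hypothetical names.

```python
from itertools import combinations

NAMES = ["A", "B", "C", "D", "E"]
IDENTITY = [
    [1, 0.864, 0.773, 0.524, 0.136],
    [0.864, 1, 0.682, 0.429, 0.136],
    [0.773, 0.682, 1, 0.762, 0.227],
    [0.524, 0.429, 0.762, 1, 0.143],
    [0.136, 0.136, 0.227, 0.143, 1],
]


def square_to_pairs(names, matrix):
    """List each unordered pair once, taking values from the upper triangle."""
    return [
        (names[i], names[j], matrix[i][j])
        for i, j in combinations(range(len(names)), 2)
    ]


def number_of_lines(n_genes):
    """(n^2 - n) / 2: the number of unordered pairs among n genes."""
    return (n_genes ** 2 - n_genes) // 2


if __name__ == "__main__":
    for gene1, gene2, identity in square_to_pairs(NAMES, IDENTITY):
        print(gene1, gene2, identity)
    print(number_of_lines(len(NAMES)), "lines")
```

The same formula gives 312,487,500 lines for 25,000 genes, which is why a complete all-against-all pair list is impractical at genome scale.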
A B 0.864
A C 0.773
A D 0.524
B C 0.682
B D 0.429
C D 0.762
C E 0.227

When you analyze a set of genes for a whole genome, it is unlikely that all of them (20,000 - 30,000) have high homology to each other. If that is the case, re-check your data set or re-sequence the given genome. From our experience, the Arabidopsis genome matrix file for all gene pairs with an identity level greater than 0.6 has about 20,000 lines of pairwise data. GenomePixelizer can handle a matrix file of this size without any problems.
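The shortened list above keeps exactly the pairs of the five-gene example whose identity is at least 0.2. A minimal Python sketch of such a cutoff filter follows; the function name and the inclusive comparison (>=) are assumptions, since the source does not say whether the cutoff value itself is kept.

```python
# Full three-column pair list from the five-gene example.
PAIRS = [
    ("A", "B", 0.864), ("A", "C", 0.773), ("A", "D", 0.524), ("A", "E", 0.136),
    ("B", "C", 0.682), ("B", "D", 0.429), ("B", "E", 0.136),
    ("C", "D", 0.762), ("C", "E", 0.227),
    ("D", "E", 0.143),
]


def filter_by_identity(pairs, cutoff):
    """Keep only pairs whose identity is at or above the cutoff (assumed inclusive)."""
    return [pair for pair in pairs if pair[2] >= cutoff]


if __name__ == "__main__":
    for gene1, gene2, identity in filter_by_identity(PAIRS, 0.2):
        print(gene1, gene2, identity)
```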
"sort -n -r +2 Matrix.txt > Matrix.sorted"or on Windows using an Excel spreadsheet.
At1g10920.txt: 821 aa
>At1g10920
vs /usr/local/genome/database/nr library

section "A":
The best scores are: opt bits E(731512)
gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399 [A ( 821) 5477 1184 0
gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana] (1584) 3197 695 6.6e-198
gi|10177352|dbj|BAB10695.1| (AB015468) disease resistance protein [Arabidopsis thaliana] ( 908) 3171 689 1.8e-196
gi|11357253|pir||T48898 disease resistance protein RPP8 [validated] - Arabidopsis thalian ( 906) 3146 684 7.4e-195
gi|11357252|pir||T48899 disease resistance protein rpp8 [similarity] - Arabidopsis thalia ( 908) 3133 681 5.1e-194
gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thal ( 909) 3106 675 2.9e-192
gi|5080812|gb|AAD39321.1|AC007258_10 (AC007258) Putative disease resistance protein [Arab ( 906) 2306 504 1.4e-140
gi|12321052|gb|AAG50648.1|AC082643_12 (AC082643) disease resistance protein, putative [Ar ( 899) 2249 491 6.6e-137
gi|12321042|gb|AAG50638.1|AC082643_2 (AC082643) disease resistance protein, putative [Ara ( 907) 2208 483 3e-134
gi|14475950|gb|AAK62797.1|AC027036_18 (AC027036) viral resistance protein, putative [Arab (1155) 2166 474 2e-131

and section "B":
>>gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399 [Arabidopsis thaliana] (821 aa)
initn: 5477 init1: 5477 opt: 5477 Z-score: 6340.7 bits: 1184.2 E(): 0
Smith-Waterman score: 5477; 100.000% identity (100.000% ungapped) in 821 aa overlap (1-821:1-821)

>>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana] (1584 aa)
initn: 2867 init1: 1575 opt: 3197 Z-score: 3690.8 bits: 694.8 E(): 6.6e-198
Smith-Waterman score: 3631; 69.112% identity (72.727% ungapped) in 845 aa overlap (3-814:316-1151)

>>gi|10177352|dbj|BAB10695.1| (AB015468) disease resistance protein [Arabidopsis thaliana] (908 aa)
initn: 2244 init1: 1185 opt: 3171 Z-score: 3665.0 bits: 689.2 E(): 1.8e-196
Smith-Waterman score: 3579; 67.376% identity (70.545% ungapped) in 846 aa overlap (1-814:1-840)

>>gi|11357253|pir||T48898 disease resistance protein RPP8 [validated] - Arabidopsis thaliana gi|3928862|gb|AAC83165.1| (AF089710) disease resistance protein RPP8 [Arabidopsis thaliana] (906 aa)
initn: 1860 init1: 1179 opt: 3146 Z-score: 3636.0 bits: 683.9 E(): 7.4e-195
Smith-Waterman score: 3548; 67.021% identity (70.347% ungapped) in 846 aa overlap (1-814:1-838)

>>gi|11357252|pir||T48899 disease resistance protein rpp8 [similarity] - Arabidopsis thaliana gi|3901294|gb|AAC78631.1| (AF089711) rpp8 [Arabidopsis thaliana] gi|8843900|dbj|BAA97426.1| (AB025638) dise (908 aa)
initn: 2230 init1: 1183 opt: 3133 Z-score: 3620.9 bits: 681.1 E(): 5.1e-194
Smith-Waterman score: 3540; 66.548% identity (69.678% ungapped) in 846 aa overlap (1-814:1-840)

>>gi|9758146|dbj|BAB08703.1| (AB015477) disease resistance protein [Arabidopsis thaliana] (901 aa)
initn: 2999 init1: 1053 opt: 3112 Z-score: 3596.6 bits: 676.6 E(): 1.2e-192
Smith-Waterman score: 3545; 67.456% identity (71.250% ungapped) in 845 aa overlap (1-814:1-831)

>>gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thaliana] (909 aa)
initn: 2596 init1: 1160 opt: 3106 Z-score: 3589.6 bits: 675.3 E(): 2.9e-192
Smith-Waterman score: 3529; 67.060% identity (70.297% ungapped) in 847 aa overlap (1-814:1-841)

>>gi|5080812|gb|AAD39321.1|AC007258_10 (AC007258) Putative disease resistance protein [Arabidopsis thaliana] (906 aa)
initn: 1327 init1: 433 opt: 2306 Z-score: 2661.7 bits: 503.6 E(): 1.4e-140
Smith-Waterman score: 2306; 47.986% identity (51.396% ungapped) in 844 aa overlap (1-814:6-823)

>>gi|12321052|gb|AAG50648.1|AC082643_12 (AC082643) disease resistance protein, putative [Arabidopsis thaliana] (899 aa)
initn: 1437 init1: 756 opt: 2249 Z-score: 2595.6 bits: 491.3 E(): 6.6e-137
Smith-Waterman score: 2493; 50.236% identity (53.383% ungapped) in 848 aa overlap (5-814:3-838)

>>gi|12321042|gb|AAG50638.1|AC082643_2 (AC082643) disease resistance protein, putative [Arabidopsis thaliana] (907 aa)
initn: 1653 init1: 478 opt: 2208 Z-score: 2548.0 bits: 482.5 E(): 3e-134
Smith-Waterman score: 2501; 49.941% identity (52.861% ungapped) in 851 aa overlap (1-814:1-841)

>>gi|14475950|gb|AAK62797.1|AC027036_18 (AC027036) viral resistance protein, putative [Arabidopsis thaliana] (1155 aa)
initn: 1816 init1: 511 opt: 2166 Z-score: 2497.4 bits: 473.5 E(): 2e-131
Smith-Waterman score: 2429; 48.174% identity (50.998% ungapped) in 849 aa overlap (1-813:1-838)

As you can see from section "B", the ranking by expectation value does not strictly follow the ranking by identity level. For example, hit 9758146 has a higher expectation value (1.2e-192) than hit 11357252 (5.1e-194) but also a higher identity (67.456% versus 66.548%). Hits with a higher expectation value may have lower identity values, and the opposite is also true. By parsing section "B" (extracting the data) you can obtain the data for the matrix file; in our example it looks like this:
At1g10920  1931650   0         100.000
At1g10920  7769860   6.6e-198  69.112
At1g10920  10177352  1.8e-196  67.376
At1g10920  11357253  7.4e-195  67.021
At1g10920  11357252  5.1e-194  66.548
At1g10920  9758146   1.2e-192  67.456
At1g10920  7110565   2.9e-192  67.060
At1g10920  5080812   1.4e-140  47.986
At1g10920  12321052  6.6e-137  50.236
At1g10920  12321042  3e-134    49.941
At1g10920  14475950  2e-131    48.174

where the first column is the query, the second column is the hit (subject), the third column is the expectation value and the fourth column is the identity. Setting up such a table by copy-and-paste takes an enormous amount of time, and the final matrix will almost certainly contain errors. This procedure should be avoided.
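To show what automated extraction from section "B" can look like, here is a minimal Python sketch that pulls the subject gi number, expectation value and identity from each hit. It is not the Fasta2Phylopix program and makes assumptions of its own: the regular expression, the function name `parse_fasta_hits`, and the choice to take the first gi number in each header are illustrative.

```python
import re

# Two hits copied from section "B" of the example FASTA output.
SAMPLE = """\
>>gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399 [Arabidopsis thaliana] (821 aa)
 initn: 5477 init1: 5477 opt: 5477 Z-score: 6340.7 bits: 1184.2 E(): 0
Smith-Waterman score: 5477; 100.000% identity (100.000% ungapped) in 821 aa overlap (1-821:1-821)
>>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana] (1584 aa)
 initn: 2867 init1: 1575 opt: 3197 Z-score: 3690.8 bits: 694.8 E(): 6.6e-198
Smith-Waterman score: 3631; 69.112% identity (72.727% ungapped) in 845 aa overlap (3-814:316-1151)
"""

HIT_RE = re.compile(
    r">>gi\|(?P<gi>\d+)\|"                 # first gi number in the hit header
    r".*?E\(\):\s*(?P<evalue>\S+)"         # expectation value
    r".*?(?P<identity>[\d.]+)% identity",  # percent identity
    re.DOTALL,
)


def parse_fasta_hits(query, text):
    """Return (query, subject gi, expectation, identity) tuples as strings."""
    return [
        (query, m["gi"], m["evalue"], m["identity"])
        for m in HIT_RE.finditer(text)
    ]


if __name__ == "__main__":
    for row in parse_fasta_hits("At1g10920", SAMPLE):
        print(" ".join(row))
```

Values are kept as strings so that the expectation values are written out exactly as they appear in the search results.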
./Fasta2Phylopix_024.tcl
FASTA program was "(m)pi" or "(r)egular": r
Enter the SOURCE file name: fasta_search_At1g10920.txt
Enter the DESTINATION file name: fasta.matrix
Does SOURCE contain description line? (y/n): y
extract DESCRIPTION line (Y/N): n
Subject DELIMITER type "|"(t) or " "(s): t
type of FASTA search was fasta(n) fasta(x) fasta(p): p

If your input file is fasta_search_At1g10920.txt, the program should generate output like fasta.matrix.
gene_A gene_B 2e-10 55
gene_B gene_A 2e-10 55

Reason 3: you may want to reduce the size of the matrix file by specifying an identity cutoff
"perl phylopix_redundancy.pl fasta.matrix fasta.matrix.final 0.2"After all of these procedures the "three column matrix file" (it still has a fourth column which is ignored by GenomePixelizer and PhyloGrapher and can be deleted by user) fasta.matrix.final generated from the results of a FASTA search can be used by GenomePixelizer or PhyloGrapher programs.
"cat *.txt > Fasta_Results"
>gi|15220237|ref|NP_172561.1| disease resistance protein RPM1 isolog [Arabidopsis thaliana] gi|1931650|gb|AAB65485.1| (U95973) disease resistance protein RPM1 isolog; 80607-83399 [Arabidopsis thaliana]
Length = 821
Score = 1584 bits (4101), Expect = 0.0
Identities = 821/821 (100%), Positives = 821/821 (100%)
-------
>gi|7769860|gb|AAF69538.1|AC008007_13 (AC008007) F12M16.25 [Arabidopsis thaliana]
Length = 1584
Score = 1079 bits (2791), Expect = 0.0
Identities = 583/844 (69%), Positives = 668/844 (79%), Gaps = 40/844 (4%)
-------
>gi|15239027|ref|NP_199673.1| disease resistance protein [Arabidopsis thaliana] gi|10177352|dbj|BAB10695.1| (AB015468) disease resistance protein [Arabidopsis thaliana]
Length = 908
Score = 1042 bits (2694), Expect = 0.0
Identities = 570/846 (67%), Positives = 658/846 (77%), Gaps = 38/846 (4%)
-------
>gi|11357253|pir||T48898 disease resistance protein RPP8 [validated] - Arabidopsis thaliana gi|3928862|gb|AAC83165.1| (AF089710) disease resistance protein RPP8 [Arabidopsis thaliana]
Length = 906
Score = 1035 bits (2677), Expect = 0.0
Identities = 567/846 (67%), Positives = 654/846 (77%), Gaps = 40/846 (4%)
-------
>gi|15239876|ref|NP_199160.1| disease resistance protein RPP8 [Arabidopsis thaliana] gi|11357252|pir||T48899 disease resistance protein rpp8 [similarity] - Arabidopsis thaliana gi|3901294|gb|AAC78631.1| (AF089711) rpp8 [Arabidopsis thaliana] gi|8843900|dbj|BAA97426.1| (AB025638) disease resistance protein RPP8 [Arabidopsis thaliana]
Length = 908
Score = 1032 bits (2669), Expect = 0.0
Identities = 563/846 (66%), Positives = 651/846 (76%), Gaps = 38/846 (4%)
-------
>gi|15238507|ref|NP_198395.1| disease resistance protein [Arabidopsis thaliana] gi|9758146|dbj|BAB08703.1| (AB015477) disease resistance protein [Arabidopsis thaliana]
Length = 901
Score = 1045 bits (2701), Expect = 0.0
Identities = 569/845 (67%), Positives = 654/845 (77%), Gaps = 45/845 (5%)
-------
>gi|7110565|gb|AAF36987.1|AF234174_1 (AF234174) viral resistance protein [Arabidopsis thaliana]
Length = 909
Score = 1024 bits (2647), Expect = 0.0
Identities = 568/847 (67%), Positives = 655/847 (77%), Gaps = 39/847 (4%)
-------
>gi|15218909|ref|NP_176187.1| disease resistance protein, putative [Arabidopsis thaliana] gi|5080812|gb|AAD39321.1|AC007258_10 (AC007258) Putative disease resistance protein [Arabidopsis thaliana]
Length = 906
Score = 662 bits (1707), Expect = 0.0
Identities = 405/851 (47%), Positives = 539/851 (62%), Gaps = 70/851 (8%)
-------
>gi|15217959|ref|NP_176137.1| disease resistance protein, putative [Arabidopsis thaliana] gi|12321052|gb|AAG50648.1|AC082643_12 (AC082643) disease resistance protein, putative [Arabidopsis thaliana]
Length = 899
Score = 686 bits (1770), Expect = 0.0
Identities = 414/834 (49%), Positives = 551/834 (65%), Gaps = 48/834 (5%)
-------
>gi|15217954|ref|NP_176135.1| disease resistance protein, putative [Arabidopsis thaliana] gi|12321042|gb|AAG50638.1|AC082643_2 (AC082643) disease resistance protein, putative [Arabidopsis thaliana]
Length = 907
Score = 730 bits (1884), Expect = 0.0
Identities = 429/853 (50%), Positives = 564/853 (65%), Gaps = 51/853 (5%)
-------
>gi|15217999|ref|NP_176151.1| viral resistance protein, putative [Arabidopsis thaliana] gi|14475950|gb|AAK62797.1|AC027036_18 (AC027036) viral resistance protein, putative [Arabidopsis thaliana]
Length = 1155
Score = 683 bits (1763), Expect = 0.0
Identities = 405/848 (47%), Positives = 539/848 (62%), Gaps = 45/848 (5%)
-------

We may set up a table with the extracted data similar to a FASTA search:
At1g10920  1931650   0.0  100  100
At1g10920  7769860   0.0  69   79
At1g10920  10177352  0.0  67   77
At1g10920  11357253  0.0  67   77
At1g10920  11357252  0.0  66   76
At1g10920  9758146   0.0  67   77
At1g10920  7110565   0.0  67   77
At1g10920  5080812   0.0  47   62
At1g10920  12321052  0.0  49   65
At1g10920  12321042  0.0  50   65
At1g10920  14475950  0.0  47   62

You can see here that the BLAST output is almost identical to the FASTA search results. BLAST results provide additional information such as sequence similarity data, or "positives" (fifth column of the table above). To extract data from BLAST search results you can use the Blast2Phylopix_024.tcl program. You need to specify the input and output files and the kind of data you want to extract. Example input to generate a matrix:
./Blast2Phylopix_024.tcl
Enter the SOURCE file name: blast_search_At1g10920_no_filter.txt
Enter the DESTINATION file name: blast.matrix
extract DESCRIPTION line (Y/N): n
type of BLAST search was blast(n) blast(x) blast(p): p

If your input file is blast_search_At1g10920_no_filter.txt, the program should generate output like blast.matrix. The output file will contain data in a "four column style" and you need to use the perl script phylopix_redundancy.pl to
"perl phylopix_redundancy.pl blast.matrix blast.matrix.final 0.2"After all of these procedures blast.matrix.final is generated from the results of a BLAST search can be used by GenomePixelizer or PhyloGrapher. Note that fourth column of this file is ignored by GenomePixelizer or PhyloGrapher.
Last modified June 26, 2002
email: Alexander Kozik
email: Richard Michelmore