| 
 
        tcl_blast_parser_123_V007.tcl 
        PROGRAM DESCRIPTION and USAGE
 
       This web page describes "our own 
        version" of BLAST parser, its usage, input and output file formats. 
        Almost any bioinformatician wrote at least once in his/her lifetime a
        BLAST parser for some particular needs. There are many 
        parsers over the Internet: Google search for keywords "blast parser". Here 
        we present our contribution into highthroughput analysis of results BLAST 
        searches. Major advantages of our parser are: 
        1- automatic generation of the "Matrix file" suitable for 
        protein/sequence clustering analysis and
 2- generation of the 
        "Two Hits File" which helps to distinguish orthologs from paralogs in 
        the case of BLAST search of one genome against another, for example: 
        Lettuce/Sunflower 
        COS candidates (Conserved Orthologs 
        Set) web page.
 
 It is better to describe 
        BLAST parser functionality with a real example: For instance, we have a 
        set of Arabidopsis resistance genes in file: NB-ARC.fasta. 
        We would like to BLAST this set against itself and set up a MATRIX 
        file, which is the input for 
        PhyloGrapher
        or GenomePixelizer or any other 
        bioinformatics program.
 
 It is 
        assumed that NCBI 
        BLAST is installed on your computer. First format the database file.
        Type in the following command:
 
 formatdb -i NB-ARC.fasta -p T
 
 There are three new 
        files generated. In this case, they are NB-ARC.fasta.phr, 
        NB-ARC.fasta.pin, and NB-ARC.fasta.psq .
 
 To run BLAST in command 
        line we need to execute a command:
 
 blastall -p blastp -F F -d NB-ARC.fasta -i NB-ARC.fasta -o 
        NB-ARC_blastp.out -e 1e-10 -I T -v 200 -b 200
 
 where by 
        different options:
 
 "blastall -p blastp"  
        to run blastp program (proteins against proteins)
 "-F 
        F" we do not want to use low complexity 
        filter
 "-d NB-ARC.fasta" our database
 "-i 
        NB-ARC.fasta" our input file
 "-o NB-ARC_blastp.out"  
        our output file
 "-e 1e-10" expectation cutoff value
 "-I 
        T" if GenBank IDs are present in database file they will be 
        displayed in output file
 "-v 200 -b 200" we want to see in the 
        output file up to 200 short description lines for hits if found and up 
        to 200 sequence 
        alignments
 
 After some time 
        (several minutes on a 1 GHz computer), when the search is done BLAST 
        should generate NB-ARC_blastp.out 
        output file. It is a very large file, about 50 Mb after uncompression. 
        To test it with BLAST parser you need to download it and uncompress 
        using "gunzip" on UNIX or "WinZip" on Windows. To view part of this file 
        click 
        here . Then we can process BLAST output using 
        tcl_blast_parser_123_V007.tcl and generate tab delimited parsed files 
        suitable for further data analysis.
 
 
 
       tcl_blast_parser_123_V007.tcltakes as input the 
        results of stand alone BLAST search. You need to run program in command 
        shell or DOS prompt with six arguments: 
 tcl_blast_parser_123_V007.tcl 
        [input_file] [output_file] [expect_cutoff] [identity_cutoff] 
        [overlap_cutoff] [matrix_option]
 
 For example, 
        in UNIX and in DOS respectively:
 
 tcl_blast_parser_123_V007.tcl 
        NB-ARC_blastp.out NB-ARC_blastp.out 20 40 100 MATRIX
 
 tclsh 
        tcl_blast_parser_123_V007.tcl NB-ARC_blastp.out NB-ARC_blastp.out 
        20 40 100 MATRIX
 
 in this case file with name 
        NB-ARC_blastp.out will be analyzed by parser and six new 
        files with names:
 
 NB-ARC_blastp.out.all_hits
 NB-ARC_blastp.out.best_hit
 NB-ARC_blastp.out.blast_stat
 NB-ARC_blastp.out.id_list
 NB-ARC_blastp.out.matrix
 NB-ARC_blastp.out.two_hits
 
 will 
        be generated upon analysis. You can specify output file name the same as 
        input file name, parser will add file extensions to each type of output, 
        so input file will remain without any changes.
 
 Below is the 
        detailed description of every file produced by parser:
 
 NB-ARC_blastp.out.all_hits
 contains 
        info about ALL HITS in blast report
 
 
          NB-ARC_blastp.out.best_hit 1 column: "Query" sequence ID 
           2 column: "Subject" sequence ID 
           3 column: normalized expectation (-log(Exp)) 
           4 column: percent of identity 
           5 column: number of perfect matches 
           6 column: length of the alignment 
           7 column: hit number for primary alignment (1 is the best 
          first hit) 
           8 column: hit number for alternative alignment 
           9 column: PRM stands for primary alignment, ALT for 
          alternative alignment 
          10 column: first position of the "Query" sequence in the alignment 
          11 column: last position of the "Query" sequence in the alignment 
          12 column: first position of the "Subject" sequence in the 
          alignment 
          13 column: last position of the "Subject" sequence in the 
          alignment  contains 
        info about FIRST BEST hits only
 
 
          In our particular case when we run the set of genes 
        against itself, best hit, of course, is to itself. This file is really 
        useful when you run one set of genes against another. For example, Lettuce/Sunflower 
        COS candidates (Conserved Orthologs 
        Set) BLAST search against Arabidopsis genome and its 
        validation by tcl_blast_parser_123_V007.tcl 1 column: "Query" sequence ID 
           2 column: "Subject" sequence ID 
           3 column: normalized expectation (-log(Exp)) 
           4 column: percent of identity 
           5 column: number of perfect matches 
           6 column: length of the alignment 
           7 column: hit number for primary alignment (1 is the best 
          first hit)  
 NB-ARC_blastp.out.blast_stat
 contains 
        info of HOW MANY HITS every "query" sequence has. If it is
        just a single hit, special mark "SINGLE_HIT" is placed at an additional 
        column
 
 NB-ARC_blastp.out.id_list
 list 
        of all gene IDs found in NB-ARC_blastp.out file including all "query" 
        IDs as well as all "subject" IDs
 
 NB-ARC_blastp.out.matrix
 pairwise 
        (binary) matrix for all primary hits if they are better than 20 (1e-20) 
        expectation cutoff value, 40% identity and 100 aa (or nt) alignment 
        overlap length. The three values of expectation, identity and overlap 
        cutoff are the third, fourth and fifth arguments correspondingly when the 
        script is executed from command line.
 
 
          This MATRIX 
        file can be used by PhyloGrapher and GenomePixelizer without any further 
        modifications. 1st and 2nd column: pair of genes 
           3rd column: IDENTITY value normalized between 0 and 1 
           4th column: normalized expectation (-log(Exp)) 
           5th column: length of the alignment  
 NB-ARC_blastp.out.two_hits
 if 
        "query" has two or more hits, additional data will be written into the 
        "TWO HITS" file. This file compares expectation, identity and 
        alignment length scores for the best first hit to the next after him. If 
        differences of these scores are greater than cutoff values defined in the 
        body of parser, then special mark "GOOD_DIFF" will be added to the last 
        column. Otherwise the differences are considered as "BAD_DIFF".
 
 You can check how useful this file on the example Lettuce/Sunflower 
        COS candidates (Conserved Orthologs 
        Set) web page.
 
 
 
         
 PROTEIN SEQUENCE CLUSTERING
 
 If you run 
        tcl_blast_parser_123_V007.tcl with "GRAPH" option, for example:
 
 tcl_blast_parser_123_V007.tcl 
        NB-ARC_blastp.out NB-ARC_blastp.out 20 40 100 GRAPH
 
 in this case two additional files will be generated:
 
 NB-ARC_blastp.out.adj_list
 NB-ARC_blastp.out.group_degree_info
 
 NB-ARC_blastp.out.adj_list
 is 
        an Adjacency List for protein sequences based on Matrix 
        File. If query (gene, first column) is similar to other genes 
        (subject in BLAST report) within defined cutoff values (expectation, 
        identity and alignment length) then these gene IDs are written into 
        corresponding row. In other words, this is a list of nodes (genes) with 
        direct connection to each others.
 
 NB-ARC_blastp.out.group_degree_info
 is 
        a Group Degree Info file. Genes (protein sequences) may have 
        similarity to each other via transitive homology. Group Degree 
        Info file represents information about clusters of genes belonging 
        to distinct groups. Each group in Group Degree Info file is a 
        connected graph. It means each node (gene) can reach all other nodes 
        (genes) via edges (identity scores).
 
          tcl_blast_parser_123_V007.tcl can easily cluster 
        group of genes up to 1000 in a set. To process a set of genes with 
        number greater than 1000 we recommend to use Graph9 
        program. It is written in C. It works much faster and provides 
        additional information about articulation point (or bridges) which helps 
        identify multidomain sequences.1 column -- Node (gene) ID 
          2 column -- Number of other genes connected to the current gene 
          3 column -- Graph size (number of genes in the graph) 
          4 column -- Group number 
          5 column -- Either "****" or nothing. The "****" indicates the 
          beginning of a new group. This is only for visual purpose. 
         
        DOWNLOAD: tcl_blast_parser_123_V007.tcl
 Tcl/Tk ToolKit
 
 tcl_blast_parser_123_V017.tcl
 New version "017" is almost identical to version "007". New version "017" generates two additional "info1"
	and "info2" files extracting description line from BLAST report. "info1" and "info2" files are ready to go
	into mySQL database Click here to read more about "info" output file formats.
 
 Download BLAST 2.2.10 binaries (win, linux, mac and sparc) 
	BLAST_2_2_10_BIN.tar.gz
 
 
 |