BLAST PARSER, EXTRACTION OF DESCRIPTION LINE

How to parse results of BLAST search (ESTs vs proteins)
and set up tables for mySQL database.


by Alexander Kozik, Brian Chan and Richard Michelmore,
University of California, Davis




tcl_blast_parser_123_V017.tcl
PROGRAM DESCRIPTION and USAGE

For example, we have a set of 4448 Cacao ESTs (Theobroma cacao) downloaded from NCBI in file: Cacao_EST.fasta. We would like to BLAST this set against predicted proteins of Arabidopsis thaliana Ath_NCBI.faa.gz and set up tab-delimited tables which can be used as an input for mySQL database.

      It is assumed that NCBI BLAST is installed on your computer. First, format the database file. Type in the following command:

formatdb -i Ath_NCBI.faa -p T

      There are three new files generated. In this case, they are Ath_NCBI.faa.phr, Ath_NCBI.faa.pin, and Ath_NCBI.faa.psq .

      To run BLAST in command line we need to execute a command:

blastall -p blastx -F T -d Ath_NCBI.faa -i Cacao_EST.fasta -o Cacao_EST_vs_Ath_NCBI.blastX -e 1e-05 -I T -v 12 -b 12

      where by different options:

"blastall -p blastx" to run blastx program (DNA against proteins)
"-F T" we want to use low complexity filter
"-d Ath_NCBI.faa" our database
"-i Cacao_EST.fasta" our input file
"-o Cacao_EST_vs_Ath_NCBI.blastX" our output file
"-e 1e-05" expectation cutoff value
"-I T" if GenBank IDs are present in database file they will be displayed in output file
"-v 12 -b 12" we want to see in the output file up to 12 short description lines for hits if found and up to 12 sequence alignments

      After some time (~ two hours on a 1 GHz computer), when the search is done BLAST should generate Cacao_EST_vs_Ath_NCBI.blastX.gz output file. Size of file is about 20 Mb. Then we can process BLAST output using tcl_blast_parser_123_V017.tcl and generate tab delimited parsed files suitable for further data analysis.


      tcl_blast_parser_123_V017.tcl takes as input the results of stand alone BLAST search. You need to run program in command shell or DOS prompt with six arguments:

      tcl_blast_parser_123_V017.tcl [input_file] [output_file] [expect_cutoff] [identity_cutoff] [overlap_cutoff] [matrix_option]

For example, in UNIX and in DOS respectively:

      tcl_blast_parser_123_V017.tcl Cacao_EST_vs_Ath_NCBI.blastX Cacao_EST_vs_Ath_NCBI.blastX 20 40 100 GRAPH

      tclsh tcl_blast_parser_123_V017.tcl Cacao_EST_vs_Ath_NCBI.blastX Cacao_EST_vs_Ath_NCBI.blastX 20 40 100 GRAPH

in this case file with name Cacao_EST_vs_Ath_NCBI.blastX will be analyzed by parser and ten new files with names:

      Cacao_EST_vs_Ath_NCBI.blastX.adj_list
      Cacao_EST_vs_Ath_NCBI.blastX.all_hits
      Cacao_EST_vs_Ath_NCBI.blastX.best_hit
      Cacao_EST_vs_Ath_NCBI.blastX.blast_stat
      Cacao_EST_vs_Ath_NCBI.blastX.group_degree_info
      Cacao_EST_vs_Ath_NCBI.blastX.id_list
      Cacao_EST_vs_Ath_NCBI.blastX.info1
      Cacao_EST_vs_Ath_NCBI.blastX.info2
      Cacao_EST_vs_Ath_NCBI.blastX.matrix
      Cacao_EST_vs_Ath_NCBI.blastX.two_hits

will be generated upon analysis. You can specify output file name the same as input file name, parser will add file extensions to each type of output, so input file will remain without any changes.

Below is the detailed description of every file produced by parser:

Cacao_EST_vs_Ath_NCBI.blastX.all_hits
contains info about ALL HITS in blast report

  •  1 column: "Query" sequence ID
  •  2 column: "Subject" sequence ID
  •  3 column: normalized expectation (-log(Exp))
  •  4 column: percent of identity
  •  5 column: number of perfect matches
  •  6 column: length of the alignment
  •  7 column: hit number for primary alignment (1 is the best first hit)
  •  8 column: hit number for alternative alignment
  •  9 column: PRM stands for primary alignment, ALT for alternative alignment
  • 10 column: frame of translation (+1, +2, +3, -1, -2 or -3)
  • 11 column: first position of the "Query" sequence in the alignment
  • 12 column: last position of the "Query" sequence in the alignment
  • 13 column: first position of the "Subject" sequence in the alignment
  • 14 column: last position of the "Subject" sequence in the alignment
Cacao_EST_vs_Ath_NCBI.blastX.best_hit
contains info about FIRST BEST hits only
  •  1 column: "Query" sequence ID
  •  2 column: "Subject" sequence ID
  •  3 column: normalized expectation (-log(Exp))
  •  4 column: percent of identity
  •  5 column: number of perfect matches
  •  6 column: length of the alignment
  •  7 column: hit number for primary alignment (1 is the best first hit)
Cacao_EST_vs_Ath_NCBI.blastX.blast_stat
contains info HOW MANY HITS every "query" sequence has. If it is just a single hit, special mark "SINGLE_HIT" is placed at an additional column

Cacao_EST_vs_Ath_NCBI.blastX.id_list
list of all gene IDs found in Cacao_EST_vs_Ath_NCBI.blastX file including all "query" IDs as well as all "subject" IDs

Cacao_EST_vs_Ath_NCBI.blastX.matrix
pairwise (binary) matrix for all primary hits if they are better than 20 (1e-20) expectation cutoff value, 40% identity and 100 aa (or nt) alignment overlap length. The three values of expectation, identity and overlap cutoff are the third, fourth and fifth arguments correspondingly when the script is executed from command line.
  •  1st and 2nd column: pair of genes
  •  3rd column: IDENTITY value normalized between 0 and 1
  •  4th column: normalized expectation (-log(Exp))
  •  5th column: length of the alignment
Cacao_EST_vs_Ath_NCBI.blastX.two_hits
if "query" has two or more hits, additional data will be written into the "TWO HITS" file. This file compares expectation, identity and alignment length scores for the best first hit to the next after him. If differences of these scores are greater than cutoff values defined in the body of parser, then special mark "GOOD_DIFF" will be added to the last column. Otherwise the differences are considered as "BAD_DIFF". Values for cutoff criteria are:
  • diff expectation values should be greater than 20 (1e-20)
  • identity scores for second hit should be two times less than for first hit
  • alignment length generated by BLAST for second hit should be shorter than for first hit.
This file is really useful when you run one set of genes against another and want to find a set of gene (EST) which have single strong hits to a target genome (Arabidopsis in our case). For example, Lettuce/Sunflower COS candidates (Conserved Orthologs Set) BLAST search against Arabidopsis genome and its validation by tcl_blast_parser_123_V017.tcl

Cacao_EST_vs_Ath_NCBI.blastX.out.adj_list
is an Adjacency List for protein sequences based on Matrix File. If query (gene, first column) is similar to other genes (subject in BLAST report) within defined cutoff values (expectation, identity and alignment length) then these gene IDs are written into corresponding row. In other words, this is a list of nodes (genes) with direct connection to each others.

Cacao_EST_vs_Ath_NCBI.blastX.group_degree_info
is a Group Degree Info file. Genes (protein sequences) may have similarity to each other via transitive homology. Group Degree Info file represents information about clusters of genes belonging to distinct groups. Each group in Group Degree Info file is a connected graph. It means each node (gene) can reach all other nodes (genes) via edges (identity scores).
  • 1 column -- Node (gene) ID
  • 2 column -- Number of other genes connected to the current gene
  • 3 column -- Graph size (number of genes in the graph)
  • 4 column -- Group number
  • 5 column -- Either "****" or nothing. The "****" indicates the beginning of a new group. This is only for visual purpose.
tcl_blast_parser_123_V017.tcl can easily cluster group of genes up to 1000 in a set. To process a set of genes with number greater than 1000 in a group we recommend to use Graph9 program. It is written in C. It works much faster and provides additional information about articulation point (or bridges) which helps identify multidomain sequences. Also, if you are using tcl_blast_parser_123_V017.tcl with MATRIX option, Adjacency List and Group Degree Info are not generated. Parser in this case will stop the BLAST output analysis at the creation of Matrix file and total time of analysis will be much shorter.


Finally, description of Info1 and Info2 files:

Cacao_EST_vs_Ath_NCBI.blastX.info1
contains info about FIRST BEST hits only and description line of the subject from BLAST report
  •  1 column: "Query" sequence ID
  •  2 column: "Subject" sequence ID
  •  3 column: description line for "Subject" sequence
  •  4 column: normalized expectation (-log(Exp)/100) [note, that expectation is normalized between 0 and 1]
  •  5 column: percent of identity
  •  6 column: number of perfect matches
  •  7 column: length of the alignment
  •  8 column: hit number for primary alignment (1 is the best first hit)

Cacao_EST_vs_Ath_NCBI.blastX.info2
contains info about ALL PRIMARY HITS in blast report. It is almost identical to All Hits file with exception that it contains description line for subject, expectation is normalized between 0 and 1 and there is no information for ALT - alternative alignments
  •  1 column: "Query" sequence ID
  •  2 column: "Subject" sequence ID
  •  3 column: description line for "Subject" sequence
  •  4 column: normalized expectation (-log(Exp)/100) [note, that expectation is normalized between 0 and 1]
  •  5 column: percent of identity
  •  6 column: number of perfect matches
  •  7 column: length of the alignment
  •  8 column: hit number for primary alignment (1 is the best first hit)
  •  9 column: frame of translation (+1, +2, +3, -1, -2 or -3)
  • 10 column: first position of the "Query" sequence in the alignment
  • 11 column: last position of the "Query" sequence in the alignment
  • 12 column: first position of the "Subject" sequence in the alignment
  • 13 column: last position of the "Subject" sequence in the alignment
  • 14 column: Length of "Query"(nt or aa) / Length of "Subject"(nt or aa)

Tables like "Info2" can be used as input data table for mySQL database. For example: Lettuce_vs_Arabidopsis table at CGPDB. It has phpMyAdmin interface to manage data. phpMyAdmin provides basic tools to query and sort data in database via web browser.

DOWNLOAD:
tcl_blast_parser_123_V017.tcl
Tcl/Tk ToolKit


tcl_blast_parser_123_V012.tcl - previous version (almost identical to tcl_blast_parser_123_V017 with exception that last column #14 of Info2 file (Query Length/Subject Length) is not generated.

Work in progress! This version of BLAST parser tcl_blast_parser_123_V025.tcl extracts query and subject sequences from BLAST report and writes them into separate file "*.align". For example: Cacao_EST_vs_Ath_NCBI.blastX.align will be generated with Cacao BLAST example above. This is very useful when we want to extract sequence alignment from BLAST report.

Further Development:
tcl_blast_parser_123_V033.tcl - to extract sequences in tab-delimited format from BLAST report
tcl_blast_parser_123_V038.tcl - current working version

Download BLAST 2.2.10 binaries (win, linux, mac and sparc) BLAST_2_2_10_BIN.tar.gz

cover cover

email: Alexander Kozik

Last modified June 29, 2005