Step 5: Extraction of sequence IDs from BLAST output
We used tcl_blast_parser_123 to parse the results of the BLAST search, and
fastacmd (included with the stand-alone NCBI BLAST distribution) to extract
a subset of sequences from the large Lycopersicon_esculentum.fasta file.
The following commands were executed:
$ tcl_blast_parser_123_V012.tcl \
    Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn \
    Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn 20 40 100 MATRIX
$ tcl_blast_parser_123_V012.tcl \
    Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn \
    Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn 20 40 100 MATRIX
Among several output files, tcl_blast_parser_123 generates a "*.id_list"
file, which contains all sequence IDs found in the BLAST report.
To get all sequence IDs for Lycopersicon esculentum that have hits to
Lycopersicon hirsutum and Lycopersicon pennellii ESTs, the following UNIX
grep, sort and uniq commands were executed on the
Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn.id_list and
Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn.id_list files:
$ grep "A_" Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn.id_list \
    > Lycopersicon_esculentum.with_hits.IDs_temp
$ grep "A_" Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn.id_list \
    >> Lycopersicon_esculentum.with_hits.IDs_temp
$ sort Lycopersicon_esculentum.with_hits.IDs_temp | uniq > Lycopersicon_esculentum.with_hits.IDs
In the example above, only the EST IDs for Lycopersicon esculentum are
extracted, because the prefix "A_" is specific to this genotype.
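The three commands above can also be combined into a single pipeline that needs no temporary file. The following is only a minimal sketch: the short file names and the sample IDs are hypothetical, and it assumes that each "*.id_list" file holds one sequence ID per line with the ID at the start of the line. Under that assumption the pattern can be anchored as '^A_', which avoids matching "A_" in the middle of another ID (the unanchored grep "A_" matches the string anywhere in a line).

```shell
#!/bin/sh
# Work in a scratch directory so no real files are touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Hypothetical sample ID lists standing in for the real *.id_list files
# produced by tcl_blast_parser_123 (assumed format: one ID per line).
printf 'A_0001\nB_0007\nA_0002\n' > hirsutum_vs_esculentum.id_list
printf 'A_0002\nC_0011\nA_0003\n' > pennellii_vs_esculentum.id_list

# Concatenate both lists, keep only lines starting with the "A_" prefix,
# then sort and remove duplicates in one step (sort -u = sort | uniq).
cat hirsutum_vs_esculentum.id_list pennellii_vs_esculentum.id_list \
    | grep '^A_' \
    | sort -u > esculentum.with_hits.IDs

cat esculentum.with_hits.IDs
```

With the sample data, the resulting file contains A_0001, A_0002 and A_0003, each listed once. If the real ID lists carry the ID in another position on the line, the unanchored pattern from the original commands should be kept.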