SNP/INDEL discovery pipeline based on CAP3 assembly

Step 5: Extraction of sequence IDs from BLAST output

We used tcl_blast_parser_123 to parse the results of the BLAST search, and fastacmd (included with the NCBI BLAST stand-alone distribution) to extract a subset of sequences from the large Lycopersicon_esculentum.fasta file.

The following commands were executed:

$ tcl_blast_parser_123_V012.tcl Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn 20 40 100 MATRIX

$ tcl_blast_parser_123_V012.tcl Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn 20 40 100 MATRIX

Among several output files, tcl_blast_parser_123 generates a "*.id_list" file, which contains all sequence IDs found in the BLAST report.

To get all sequence IDs for Lycopersicon esculentum that have hits to either Lycopersicon hirsutum or Lycopersicon pennellii ESTs, the following UNIX grep, sort and uniq commands were executed on the Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn.id_list and Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn.id_list files:

$ grep "A_" Lycopersicon_hirsutum_vs_Lycopersicon_esculentum.blastn.id_list > Lycopersicon_esculentum.with_hits.IDs_temp

$ grep "A_" Lycopersicon_pennellii_vs_Lycopersicon_esculentum.blastn.id_list >> Lycopersicon_esculentum.with_hits.IDs_temp

$ sort Lycopersicon_esculentum.with_hits.IDs_temp | uniq > Lycopersicon_esculentum.with_hits.IDs
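
The three commands above can also be condensed into a single pipeline. The sketch below is illustrative only: the demo file names and IDs are made up, and it assumes that an "*.id_list" file holds one sequence ID per line with the ID at the start of the line (hence the anchored pattern "^A_", which is stricter than the unanchored "A_" used above).

```shell
# Illustrative demo data only; real *.id_list files come from tcl_blast_parser_123.
# Assumption: one sequence ID per line, starting in the first column.
workdir=$(mktemp -d)
cd "$workdir"
printf 'A_001\nB_104\nA_002\n' > hirsutum_demo.id_list
printf 'A_002\nC_007\nA_003\n' > pennellii_demo.id_list

# Concatenate both lists, keep only IDs starting with "A_",
# then sort and drop duplicates in one step (sort -u == sort | uniq).
cat hirsutum_demo.id_list pennellii_demo.id_list \
    | grep "^A_" \
    | sort -u > esculentum_demo.with_hits.IDs

# Prints A_001, A_002 and A_003, one per line.
cat esculentum_demo.with_hits.IDs
```

This avoids the temporary file, at the cost of not keeping the unsorted intermediate list for inspection.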

In the grep commands above, only the EST IDs for Lycopersicon esculentum are kept, because the prefix "A_" is specific to this genotype.
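
The fastacmd extraction mentioned at the start of this step is not shown above. A minimal sketch of how it might look is given below. It is not the exact command that was run: the output file name is hypothetical, and it assumes that the FASTA file is indexed with formatdb using "-o T" so that fastacmd can look sequences up by ID, and that the IDs in the list match the IDs in the FASTA definition lines.

```shell
db=Lycopersicon_esculentum.fasta              # FASTA file named in this step
ids=Lycopersicon_esculentum.with_hits.IDs     # ID list built by the sort | uniq command
out=Lycopersicon_esculentum.with_hits.fasta   # hypothetical output file name

if command -v formatdb >/dev/null 2>&1 && command -v fastacmd >/dev/null 2>&1 \
   && [ -f "$db" ] && [ -f "$ids" ]; then
    # Assumption: build a nucleotide database (-p F) with parsed,
    # indexed sequence IDs (-o T), which ID-based retrieval relies on.
    formatdb -i "$db" -p F -o T
    # -d database, -i file with one ID per line, -o output FASTA file
    fastacmd -d "$db" -i "$ids" -o "$out"
else
    echo "legacy BLAST tools or input files not found; nothing extracted" >&2
fi
```

If the database was already formatted with indexed IDs for the BLAST search, the formatdb line can be skipped.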