Step 3: Pre-processing of the example dataset
So far, we have downloaded:
150193 ESTs for Lycopersicon esculentum, saved
in the file Lycopersicon_esculentum.fasta
2504 ESTs for Lycopersicon hirsutum, saved
in the file Lycopersicon_hirsutum.fasta
8346 ESTs for Lycopersicon pennellii, saved
in the file Lycopersicon_pennellii.fasta
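As a quick sanity check, the number of records in each file can be compared with the counts listed above by counting the header lines, which start with ">". This is only a minimal sketch: it assumes the three files are in the current directory, and the helper name count_fasta_records is ours, used for illustration only.

```shell
#!/bin/sh
# Count FASTA records in a file: each record has exactly one header line
# starting with ">". count_fasta_records is a hypothetical helper name.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Report the count for each downloaded file that is present in this directory.
for f in Lycopersicon_esculentum.fasta \
         Lycopersicon_hirsutum.fasta \
         Lycopersicon_pennellii.fasta; do
    if [ -f "$f" ]; then
        printf '%s\t%s\n' "$f" "$(count_fasta_records "$f")"
    fi
done
```

If a reported number differs from the count given above, the download was probably incomplete and should be repeated before pre-processing.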
To distinguish these three genotypes, we have modified the EST IDs in the fasta files.
After pre-processing, every EST ID carries a genotype prefix:
Lycopersicon_esculentum.fasta - prefix "A_"
Lycopersicon_hirsutum.fasta - prefix "C_"
Lycopersicon_pennellii.fasta - prefix "B_"
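This was done by running the following perl commands in a UNIX shell
(s/find/replace/ regular expressions). For each file, the first command
replaces the leading ">gi|" of every header line with the genotype prefix,
and the second replaces the next "|" on each line with a space, so that the
prefixed GI number stands alone as the EST ID. The -i option makes perl
edit the files in place, so keep a copy of the originals if you need them: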
$ perl -p -i -e 's/^\>gi\|/\>A_/' Lycopersicon_esculentum.fasta
$ perl -p -i -e 's/\|/ /' Lycopersicon_esculentum.fasta
$ perl -p -i -e 's/^\>gi\|/\>C_/' Lycopersicon_hirsutum.fasta
$ perl -p -i -e 's/\|/ /' Lycopersicon_hirsutum.fasta
$ perl -p -i -e 's/^\>gi\|/\>B_/' Lycopersicon_pennellii.fasta
$ perl -p -i -e 's/\|/ /' Lycopersicon_pennellii.fasta