The Arabidopsis EST assembly was built from a relatively small set
of RAFL ESTs (61,281 sequences) downloaded from NCBI (as of May 2002).
CAP3 was run with default parameters:
overlap length cutoff - 40 nt
overlap percent identity cutoff - 80%
clipping range - 250 nt
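The parameters above map onto CAP3's -o, -p and -y command-line options. A minimal sketch of how such a run could be assembled from Python follows. The executable name "cap3" and the uncompressed input file name are assumptions, and build_cap3_command is a hypothetical helper, not part of the DIS pipeline.

```python
def build_cap3_command(fasta_path, overlap_len=40, percent_identity=80, clip_range=250):
    """Return a CAP3 argument list for the three parameters listed above."""
    return [
        "cap3",                       # assumed name of the CAP3 executable on PATH
        fasta_path,                   # input FASTA file (must be uncompressed)
        "-o", str(overlap_len),       # overlap length cutoff, nt
        "-p", str(percent_identity),  # overlap percent identity cutoff, %
        "-y", str(clip_range),        # clipping range, nt
    ]


# Hypothetical input name; the downloadable file is gzip-compressed.
cmd = build_cap3_command("RAFL_CA_HTrm.fasta")
print(" ".join(cmd))
```

The list can be passed to subprocess.run with stdout redirected to a file to capture the detailed assembly report.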
The assembly produced 9,487 contigs, which were then
analyzed with the DIS pipeline. Input and output files are available
for download:
RAFL_CA_HTrm.fasta.gz - set of 61,281 RAFL ESTs
RAFL_CA_HTrm.cap.contigs.gz - 9,487 contigs
RAFL_CA_HTrm_CAP3.out.gz - CAP3 output with detailed information about the assembly
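After downloading, the sequence counts quoted above can be checked by counting FASTA header lines. The sketch below assumes that both the EST set and the contigs file are gzip-compressed FASTA and that they sit in the current directory. count_fasta_records is a hypothetical helper name.

```python
import gzip
import os


def count_fasta_records(path):
    """Count sequences in a gzip-compressed FASTA file by counting '>' header lines."""
    with gzip.open(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))


# Counts stated on this page for each downloadable file.
EXPECTED = {
    "RAFL_CA_HTrm.fasta.gz": 61281,
    "RAFL_CA_HTrm.cap.contigs.gz": 9487,
}

for name, expected in EXPECTED.items():
    if os.path.exists(name):  # skip files that have not been downloaded
        found = count_fasta_records(name)
        status = "OK" if found == expected else "MISMATCH"
        print(name, found, status)
```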
RAFL_CA_HTrm_CAP3.out has been processed by
the DIS pipeline (see steps 10, 11 and 12).
All 9,487 alignments in CAP3 format can be downloaded here (Ath_Alignments.tar.gz file)
Output files with information about polymorphic sites:
ath_deletions.good.sorted
ath_insertions.good.sorted
ath_substitutions.good.sorted
These files were examined, and several interesting examples were found and analyzed with Contig Viewer:
ATH_Contig5584.align - possible alternatively spliced ESTs in the assembly
(see graphical output here)
ATH_Contig660.align - set of paralogs assembled into one contig
(see graphical output here)
ATH_Contig5295.align - another interesting case
(see graphical output here)
Note that this Arabidopsis assembly represents a single
genotype. In this case, the Python_CAP3_contig_poly_DIS_Feb_27_2004.py
script can detect only so-called "partial" substitutions.
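The term "partial" is not defined on this page. The sketch below assumes it refers to alignment columns in which only some of the reads differ from the majority base, and shows one simple way to flag such columns. It does not reproduce the logic of the script named above. The function name, the input layout and the min_minor threshold are all hypothetical.

```python
from collections import Counter


def mixed_columns(aligned_reads, min_minor=2):
    """Return (position, base counts) for columns where reads disagree.

    aligned_reads: equal-length strings; '-' marks a gap and ' ' marks
                   a column that the read does not cover.
    min_minor:     minimum number of reads that must carry the second
                   most common base (assumed threshold, not from the source).
    """
    hits = []
    for pos, column in enumerate(zip(*aligned_reads)):
        # Count bases only; gaps and uncovered positions are ignored.
        counts = Counter(base.upper() for base in column if base not in "- ")
        ranked = counts.most_common(2)
        if len(ranked) == 2 and ranked[1][1] >= min_minor:
            hits.append((pos, dict(counts)))
    return hits


# Toy alignment: column 2 carries G in three reads and C in two reads.
reads = ["ACGTAC", "ACGTAC", "ACCTAC", "ACCTAC", "ACGTAC"]
print(mixed_columns(reads))
```

Requiring at least two reads for the minor base is a common way to avoid reporting single-read sequencing errors; the right threshold depends on read depth and quality.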
mailto: Alexander Kozik
last modified: March 08 2004