Contig Viewer and Arabidopsis Assembly

The Arabidopsis EST assembly was done on a relatively small set of RAFL ESTs (61,281 sequences) downloaded from NCBI (sequences available as of May 2002).

CAP3 was used with default parameters:
overlap length cutoff - 40 nt
overlap percent identity cutoff - 80%
clipping range - 250 nt
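
The three parameters above correspond to the CAP3 command-line options -o, -p and -y. A minimal Python sketch of how such a run might be launched is shown below. The function names, the assumption that a `cap3` executable is on the PATH, and the output file name are illustrative only; this is not the exact command used for this assembly.

```python
import subprocess


def build_cap3_command(fasta_path, overlap_len=40, identity=80, clip_range=250):
    """Return the CAP3 argument list for the parameter values listed above."""
    return [
        "cap3", fasta_path,       # assumes the executable is named "cap3"
        "-o", str(overlap_len),   # overlap length cutoff, nt
        "-p", str(identity),      # overlap percent identity cutoff
        "-y", str(clip_range),    # clipping range, nt
    ]


def run_cap3(fasta_path, out_path):
    """Run CAP3 and save its standard output (the detailed assembly report)."""
    with open(out_path, "w") as out:
        subprocess.run(build_cap3_command(fasta_path), stdout=out, check=True)
```

Calling run_cap3("RAFL_CA_HTrm.fasta", "RAFL_CA_HTrm_CAP3.out") would write the report to the named file, while CAP3 itself places the contigs next to the input as RAFL_CA_HTrm.fasta.cap.contigs.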

The assembly produced 9,487 contigs, which were then analyzed by the DIS pipeline. Input and output files are available for download:

RAFL_CA_HTrm.fasta.gz - set of 61,281 RAFL ESTs
RAFL_CA_HTrm.cap.contigs.gz - 9,487 contigs
RAFL_CA_HTrm_CAP3.out.gz - CAP3 output with detailed information about the assembly
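
After downloading, the sequence counts quoted above can be checked by counting FASTA header lines. The sketch below is a hypothetical helper, not part of the DIS pipeline; it assumes the EST and contig files are in FASTA format, with each record starting with a ">" line.

```python
import gzip


def count_fasta_records(path):
    """Count sequences in a FASTA file (gzip-compressed or plain) by header lines."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for line in handle if line.startswith(">"))
```

If the downloaded files match the description, count_fasta_records("RAFL_CA_HTrm.fasta.gz") should give 61281 and count_fasta_records("RAFL_CA_HTrm.cap.contigs.gz") should give 9487.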

RAFL_CA_HTrm_CAP3.out has been processed by the DIS pipeline (see steps 10, 11 and 12). All 9,487 alignments in CAP3 format can be downloaded here (Ath_Alignments.tar.gz file).

The output files with information about polymorphic sites were examined:
ath_deletions.good.sorted
ath_insertions.good.sorted
ath_substitutions.good.sorted

Some interesting examples were found and analyzed with Contig Viewer:

ATH_Contig5584.align - possible alternatively spliced ESTs in the assembly (see graphical output here)

ATH_Contig660.align - set of paralogs assembled into one contig (see graphical output here)

ATH_Contig5295.align - another interesting case (see graphical output here)

Note that this Arabidopsis assembly represents a single genotype. In this case the Python_CAP3_contig_poly_DIS_Feb_27_2004.py script can detect only so-called "partial" substitutions.

mailto: Alexander Kozik
last modified: March 08 2004