What the program does
GenBank Data Parser is a Python script that translates the region of
DNA sequence specified in the CDS part of each gene into a protein
sequence. It also generates additional files that are designed to
assist in GenBank data analysis.
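The translation step can be illustrated with a minimal sketch. It is not the script's own code (the real translation table lives in dna_to_prot.py); the function names extract_cds and translate, and the toy sequence, are assumptions made for illustration. It splices the intervals of a CDS location string and translates them with the standard genetic code.

```python
import re
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def extract_cds(sequence, location):
    """Splice the 1-based, inclusive intervals named in a CDS location string."""
    intervals = [(int(a), int(b))
                 for a, b in re.findall(r"(\d+)\.\.(\d+)", location)]
    spliced = "".join(sequence[start - 1:end] for start, end in intervals)
    if location.startswith("complement"):
        # Reverse strand: reverse-complement the spliced sequence.
        spliced = spliced.translate(COMPLEMENT)[::-1]
    return spliced.upper()


def translate(dna):
    """Translate complete codons; codons not in the table become 'X'."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - len(dna) % 3, 3))


# Hypothetical toy sequence with two exons (positions 3..8 and 13..15).
genome = "ccATGGCCttttTAAgg"
print(translate(extract_cds(genome, "join(3..8,13..15)")))  # prints MA*
```

The sketch only handles plain, join(...) and complement(...) locations with simple numeric positions; positions carrying > or < signs are not parsed.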
How the program works
The program reads a user-defined SOURCE file that was generated by the
GenBank database. This file usually has a .gbk extension. The program
produces 10 output files with the extensions .500, .cds, .gene,
.inbetween, .join, .msg, .out, .position, .protein and .protein.dupl.
If files with these extensions already exist in the current directory,
you will be asked whether you want to overwrite each file or provide a
different name for it. When the program is done, it displays a message
that the output files have been created.
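The overwrite check can be pictured with a short sketch. It assumes that each output name is the SOURCE file name plus the extension (as in the example output NT_033777.gbk.position); the function name existing_outputs is hypothetical and the script's actual logic may differ.

```python
import os

# The ten output extensions listed above.
OUTPUT_EXTENSIONS = [".500", ".cds", ".gene", ".inbetween", ".join",
                     ".msg", ".out", ".position", ".protein", ".protein.dupl"]


def existing_outputs(source_name, directory="."):
    """Return the output file names that already exist in `directory`."""
    candidates = [source_name + ext for ext in OUTPUT_EXTENSIONS]
    return [name for name in candidates
            if os.path.exists(os.path.join(directory, name))]


# Each name returned here is one the user would be asked about.
for name in existing_outputs("NC_003075.gbk"):
    print("File exists:", name)
```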
User input
After being prompted for a SOURCE file name, enter your GenBank
file's name.
Example of a source file: NC_003075.gbk
You are then asked to choose the header format.
The header format affects the .500, .join, .msg, .protein and
.protein.dupl files, which have FASTA format headers containing
gene_name, protein_name, genbank_id, CDS, and orientation. The
.protein.dupl file header also contains a flag as its last entry to
indicate whether the sequence is unique (UNIQUE_SET) or a duplicate
(DUPLIC_SET). You can arrange gene_name, protein_name and genbank_id in
your FASTA header in any order you like: enter a position from 1 to 3
when you are prompted for each header element.
Example:
*********************************************
Please ascribe the order to the following header elements:
gene_name (1 2 3)
protein_name (1 2 3)
genbank_id (1 2 3)
********************************************
Enter the sequence order for GENE_NAME: 3
Enter the sequence order for PROTEIN_NAME: 2
Enter the sequence order for GENEBANK_ID: 1
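How the chosen order could shape a header is sketched below. The function name build_header, the space separator and the field values are assumptions for illustration only; the script's real header layout may differ.

```python
def build_header(fields, order):
    """Arrange three header elements by their user-chosen positions (1 to 3).

    fields: element name -> value
    order:  element name -> position entered at the prompt
    """
    if sorted(order.values()) != [1, 2, 3]:
        raise ValueError("each position from 1 to 3 must be used exactly once")
    names_in_order = sorted(order, key=order.get)
    return ">" + " ".join(fields[name] for name in names_in_order)


# Hypothetical values; the order matches the prompt example above.
fields = {"gene_name": "geneA", "protein_name": "protA", "genbank_id": "12345"}
order = {"gene_name": 3, "protein_name": 2, "genbank_id": 1}
print(build_header(fields, order))  # prints >12345 protA geneA
```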
Next, you are prompted to choose which of the duplicated sequences you
would like to use: the one that has the longest interval, or the first
sequence found with the duplicated name.
Example:
****************************************
In case of duplicated sequences,
would you like to extract
1. longest sequence
2. first sequence
****************************************
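The two choices can be sketched as follows. This is an illustration, not the script's code: the name resolve_duplicates is hypothetical, and "longest interval" is assumed here to mean the largest total exon length.

```python
def total_length(exons):
    """Sum of the lengths of 1-based, inclusive (start, end) intervals."""
    return sum(end - start + 1 for start, end in exons)


def resolve_duplicates(records, mode="longest"):
    """Keep one entry per name: the longest one or the first one found.

    records: list of (name, exons) pairs in file order.
    """
    if mode not in ("longest", "first"):
        raise ValueError("mode must be 'longest' or 'first'")
    chosen = {}
    for name, exons in records:
        if name not in chosen:
            chosen[name] = exons
        elif mode == "longest" and total_length(exons) > total_length(chosen[name]):
            chosen[name] = exons
    return chosen


# Hypothetical records: geneA appears twice.
records = [
    ("geneA", [(100, 199)]),
    ("geneB", [(500, 560)]),
    ("geneA", [(100, 199), (250, 349)]),
]
print(resolve_duplicates(records, "longest")["geneA"])
print(resolve_duplicates(records, "first")["geneA"])
```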
Program output
The program produces ten output files:
Example: for a CDS located at
join(3138..3142,3152..3160)
the entry in the .500 file starts at position 2638 (500 bases before
the first exon) in lower case letters. The sequence between positions
3138 and 3142 is capitalized, the sequence between positions 3142 and
3151 is in lower case letters, the sequence between 3152 and 3160 is
capitalized again, and the sequence between 3160 and 3660 (500 bases
after the last exon) is in lower case again.
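The casing scheme of the .500 entry can be reproduced with a small sketch. The function name flanked_cds is hypothetical, coordinates are taken as 1-based and inclusive, and clipping at the sequence ends is an assumption about edge cases the text does not cover.

```python
def flanked_cds(genome, exons, flank=500):
    """Return the CDS region with `flank` bases on each side.

    Exon bases are upper case; flanks and introns are lower case.
    exons: list of 1-based, inclusive (start, end) pairs in ascending order.
    """
    start = max(1, exons[0][0] - flank)
    end = min(len(genome), exons[-1][1] + flank)
    chars = list(genome[start - 1:end].lower())
    for exon_start, exon_end in exons:
        for pos in range(exon_start, exon_end + 1):
            chars[pos - start] = chars[pos - start].upper()
    return "".join(chars)


# Toy example with a 2-base flank instead of 500.
print(flanked_cds("acgt" * 5, [(5, 7), (10, 12)], flank=2))  # prints gtACGtaCGTac
```

With the coordinates from the example, join(3138..3142,3152..3160) on a sufficiently long sequence gives an entry of 1023 bases covering positions 2638 to 3660.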
Example of a single CDS entry:
CDS join(2752579..2752879,2753222..2753280)
    /gene="At4g05430"
    /codon_start=1
    /protein_id="NP_192452.1"
    /db_xref="GI:15235581"
- .gene - contains all extracted genes.
Example of a single gene entry:
gene 17792..20066
    /gene="F6N15.11"
    /note="locus_tag: At4g00050; bHLH protein family"
- .inbetween - contains a non-redundant list of the sequences between
the end position of one gene's CDS (stop codon) and the start position
of another gene's CDS (start codon). If two sequences overlap (e.g. the
stop codon of one sequence is at position 10348960 and the start codon
of the other sequence is at position 10348889), no in-between sequence
is displayed and a warning message is written to the .msg file.
The header for each entry has the format:
>at_gene_name1_at_gene_name2 [ gene_name_cds1_gene_name_cds2 ] [orientation of gene1][orientation of gene2]
Example:
>At4g21360_At4g21366 [ T6K22.90_T6K22.4 ] RF
R - reverse orientation, F - forward orientation
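The way in-between regions and overlap warnings could be derived is sketched below. The function name inbetween_regions and the tuple layout are hypothetical, and the gene list is assumed to be sorted by CDS start position.

```python
def inbetween_regions(genes):
    """Find the region between each pair of neighbouring CDSs.

    genes: list of (name, cds_start, cds_end), sorted by cds_start,
           with 1-based, inclusive coordinates.
    Returns (regions, warnings); overlapping pairs yield a warning
    instead of a region.
    """
    regions, warnings = [], []
    for (name1, _, end1), (name2, start2, _) in zip(genes, genes[1:]):
        if start2 <= end1:
            warnings.append(f"{name1} and {name2} share an overlapping fragment")
        else:
            regions.append((f"{name1}_{name2}", end1 + 1, start2 - 1))
    return regions, warnings


# Hypothetical coordinates: g2 and g3 overlap.
genes = [("g1", 100, 200), ("g2", 250, 300), ("g3", 290, 400)]
regions, warnings = inbetween_regions(genes)
print(regions)   # [('g1_g2', 201, 249)]
print(warnings)  # ['g2 and g3 share an overlapping fragment']
```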
1. warnings for gene region:
- no "note" field in gene region
- duplicated gene name found
2. warnings for CDS region:
- 2 genes share overlapping sequence fragment
- gene name is not unique
- position without pair (single letter sequence)
- position number contains > or < signs
- the number of nucleotides is not divisible by 3
- premature stop codon, no stop codon, more than one stop codon
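Some of the CDS checks in this list can be sketched in a few lines. This is an illustration under assumptions, not the script's own code: the function name check_cds is hypothetical, the divisibility check is applied to the nucleotide length, and only the three standard stop codons are considered.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def check_cds(dna):
    """Return a list of warnings for a spliced coding sequence."""
    dna = dna.upper()
    warnings = []
    if len(dna) % 3 != 0:
        warnings.append("length not divisible by 3")
    # Only complete codons are examined.
    codons = [dna[i:i + 3] for i in range(0, len(dna) - len(dna) % 3, 3)]
    stop_positions = [i for i, codon in enumerate(codons)
                      if codon in STOP_CODONS]
    if not stop_positions:
        warnings.append("no stop codon")
    else:
        if len(stop_positions) > 1:
            warnings.append("more than one stop codon")
        if stop_positions[0] != len(codons) - 1:
            warnings.append("premature stop codon")
    return warnings


print(check_cds("ATGGCCTAA"))     # []
print(check_cds("ATGTAAGCCTAA"))  # ['more than one stop codon', 'premature stop codon']
```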
Example of an individual entry:
At4g00030 F6N15.13 NP_191914.1 CDS forward 15236688
13965* 13565-14028|14114-14179|14258-14366
* This number is the middle position: the sum of the first number in
the position list (13565) and the last number in the position list
(14366), divided by 2 and rounded down.
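The middle position can be checked with a short sketch; the function name middle_position is hypothetical, and the position list is assumed to use the start-end|start-end layout shown in the entry above.

```python
def middle_position(position_list):
    """Floor of (first start + last end) / 2 for a 'start-end|start-end' list."""
    intervals = position_list.split("|")
    first = int(intervals[0].split("-")[0])
    last = int(intervals[-1].split("-")[1])
    return (first + last) // 2


print(middle_position("13565-14028|14114-14179|14258-14366"))  # prints 13965
```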
- .protein* - contains the protein translations of the spliced exons in
the CDS part of each gene, for the non-redundant data set.
- .protein.dupl* - contains the protein translations of the spliced
exons in the CDS part of each gene, for the redundant data set.
* For these files you choose the order of the header elements.
Download
In order to run GenBank Parser you need to download two files:
GenBank_parser_Uni_Feb_25_2004.py (should work with any GenBank file)
or GenBank_parser_Ath_Feb_25_2004.py (adapted for Arabidopsis GenBank files)
and
dna_to_prot.py (standard translation table)
======================================================
GenBank_parser_Uni_Mar_18_2006.py (current working version)
dna_to_prot.py (standard translation table)
Example input file ("/locus_tag" field is required): NT_033777.gbk - D. melanogaster (3R chr)
Example output files:
NT_033777.gbk.xclean.cds_dna.fasta
NT_033777.gbk.xclean.protein.fasta
NT_033777.gbk.position
NT_033777.gbk.500
NT_033777.gbk.out
======================================================
Run script on Unix: in your shell prompt type
python GenBank_parser_Uni_Feb_25_2004.py
Run script on Windows: in your command prompt type
GenBank_parser_Uni_Feb_25_2004.py
Python software is available at: http://www.python.org/
parser.py is under the GNU General Public License
Copyright © 2003 University of California at Davis