GenBank Data Parser

How it works
User Input
Program Output

What the program does

GenBank Data Parser is a Python script that translates the DNA region specified in the CDS feature of each gene into a protein sequence. It also generates additional files designed to assist in GenBank data analysis.
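As a rough illustration of the translation step, the sketch below builds the standard genetic code and translates a coding sequence codon by codon. The names `CODON_TABLE` and `translate` are assumptions made for this example; the script itself uses a separate translation table file and may work differently.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1), codons ordered
# TTT, TTC, TTA, TTG, TCT, ... with the third base varying fastest.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(dna):
    """Translate a coding sequence; codons with unknown bases become 'X'.

    Trailing bases that do not form a complete codon are ignored
    (an assumption of this sketch).
    """
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, usable, 3))

print(translate("atggcctga"))  # MA*
```

Stop codons are shown as `*` here; whether the script prints or strips them is not stated in this document.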

How the program works

The program reads a user-specified SOURCE file generated by the GenBank database. This file usually has a .gbk extension. The program produces 10 output files with the extensions .500, .cds, .gene, .inbetween, .join, .msg, .out, .position, .protein, and .protein.dupl. If files with these extensions already exist in the current directory, you will be asked whether to overwrite each file or provide a different name for it. When the program finishes, it displays a message confirming that the output files have been created.

User input

After being prompted for a SOURCE file name, enter your GenBank file's name.

Example of a source file: NC_003075.gbk

You are then asked to choose the header format. The header format affects the .500, .join, .msg, .protein and .protein.dupl files, which have FASTA-format headers containing gene_name, protein_name, genbank_id, CDS, and orientation. The .protein.dupl header also contains a flag as its last entry to indicate whether the sequence is unique (UNIQUE_SET) or a duplicate (DUPLIC_SET). You can arrange gene_name, protein_name and genbank_id in your FASTA header in any order you like: enter a position from 1 to 3 when you are prompted for each header element.


     Please ascribe the order to the    
        following header elements:       
               gene_name (1 2 3)          
            protein_name (1 2 3)          
               genbank_id (1 2 3)          

Enter the sequence order for GENE_NAME: 3

Enter the sequence order for PROTEIN_NAME: 2

Enter the sequence order for GENEBANK_ID: 1
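The effect of these answers on the header can be sketched as follows. This is only an illustration: the function `build_header`, the space separator, and the mapping of the sample values to gene_name, protein_name and genbank_id are assumptions, and the CDS and orientation fields of the real header are left out.

```python
def build_header(fields, order):
    """Arrange the three identifiers by their chosen positions (1 to 3)."""
    if sorted(order.values()) != [1, 2, 3]:
        raise ValueError("each position from 1 to 3 must be used exactly once")
    ranked = sorted(order, key=order.get)  # element names, lowest position first
    return ">" + " ".join(fields[name] for name in ranked)

# Answers from the prompt example: GENE_NAME 3, PROTEIN_NAME 2, GENEBANK_ID 1.
order = {"gene_name": 3, "protein_name": 2, "genbank_id": 1}
fields = {
    "gene_name": "At4g00030",      # assumed sample values
    "protein_name": "F6N15.13",
    "genbank_id": "NP_191914.1",
}
print(build_header(fields, order))  # >NP_191914.1 F6N15.13 At4g00030
```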

Next, you are prompted to choose which of the duplicated sequences to use: the one with the longest interval, or the first sequence found with the duplicated name.


 In case of duplicated sequences, 
       would you like to extract 

        1. longest sequence 
        2. first sequence 


Program output

The program produces ten output files:

  • .500* - contains the region specified in the CDS part of the gene, together with the 500 nucleotides preceding the start codon and the 500 nucleotides following the stop codon. Exons are shown in capital letters. This file contains the redundant data set.



For example, if the CDS consists of the exons 3138..3142 and 3152..3160, the .500 entry starts at position 2638 in lower case; positions 3138 to 3142 are capitalized, positions 3143 to 3151 are in lower case, positions 3152 to 3160 are capitalized again, and positions 3161 to 3660 are in lower case.
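The layout described above can be sketched as follows. This is an illustration only: it handles the forward strand, clips the flanks at the ends of the sequence (an assumption), and the name `flanked_region` is hypothetical.

```python
def flanked_region(genome, exons, flank=500):
    """Return the CDS region plus flanks, exons upper case, the rest lower case.

    exons: ascending list of 1-based inclusive (start, end) pairs.
    """
    first = max(1, exons[0][0] - flank)
    last = min(len(genome), exons[-1][1] + flank)
    region = list(genome[first - 1:last].lower())
    for start, end in exons:
        for pos in range(start, end + 1):
            region[pos - first] = region[pos - first].upper()
    return "".join(region)

# Positions taken from the example above, on a made-up sequence.
genome = "acgt" * 1000
entry = flanked_region(genome, [(3138, 3142), (3152, 3160)])
print(len(entry))        # 1023 (positions 2638 to 3660)
print(entry[495:525])    # end of the 5' flank, both exons, start of the 3' flank
```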

  • .cds - contains all extracted CDS entries.

Example of a single CDS entry:

CDS       join(2752579..2752879,2753222..2753280)
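A location string like the one above can be split into numeric intervals with a short regular expression. The sketch below is an assumption about one possible approach, not the script's own parser; `parse_location` is a hypothetical name, and single positions without a pair are simply skipped.

```python
import re

def parse_location(location):
    """Return (orientation, [(start, end), ...]) for a GenBank location string."""
    orientation = "reverse" if "complement" in location else "forward"
    # Optional < or > markers in front of a number are tolerated and dropped.
    pairs = re.findall(r"[<>]?(\d+)\.\.[<>]?(\d+)", location)
    return orientation, [(int(start), int(end)) for start, end in pairs]

print(parse_location("join(2752579..2752879,2753222..2753280)"))
# ('forward', [(2752579, 2752879), (2753222, 2753280)])
```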

  • .gene - contains all extracted gene entries.

    Example of a single gene entry:

    gene      17792..20066

       /note="locus_tag: At4g00050; bHLH protein family"

  • .inbetween - contains a non-redundant list of the sequences between the end of one gene's CDS (stop codon) and the start of the next gene's CDS (start codon). If two sequences overlap (e.g. the stop codon of one sequence is at position 10348960 and the start codon of the next is at position 10348889), no in-between sequence is displayed and a warning message is written to the .msg file.

The header for each gene entry has the format:
>at_gene_name1_at_gene_name2 [ gene_name_cds1_gene_name_cds2 ] [orientation of gene1][orientation of gene2]


>At4g21360_At4g21366 [ T6K22.90_T6K22.4 ] RF
R - reverse orientation, F - forward orientation
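The in-between logic can be sketched as below: consecutive CDS regions are compared, gaps are reported, and overlaps produce a warning instead. The function name, the tuple layout and the warning text are assumptions for illustration only.

```python
def intergenic_regions(genes):
    """Find the gaps between consecutive CDS regions.

    genes: list of (name, cds_start, cds_end), 1-based inclusive, sorted by start.
    Returns (regions, warnings); each region is (label, start, end).
    """
    regions, warnings = [], []
    for (name1, _, end1), (name2, start2, _) in zip(genes, genes[1:]):
        if start2 <= end1:
            warnings.append(f"{name1} and {name2} share an overlapping fragment")
        elif start2 - end1 > 1:  # adjacent genes leave no sequence in between
            regions.append((f"{name1}_{name2}", end1 + 1, start2 - 1))
    return regions, warnings

genes = [("A", 100, 200), ("B", 250, 300), ("C", 290, 400)]
print(intergenic_regions(genes))
# ([('A_B', 201, 249)], ['B and C share an overlapping fragment'])
```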

  • .join* - contains only the spliced exons (the joined intervals specified in parentheses, usually preceded by the words complement, join, or complement join) of a non-redundant data set.

  • .msg* - contains warnings generated during the execution of the program.

1.   warnings for gene region:

  • no "note" field in gene region
  • duplicated gene name found

2.   warnings for CDS region:

  • 2 genes share overlapping sequence fragment
  • gene name is not unique
  • position without pair (single letter sequence)
  • position number contains > or < signs
  • the number of nucleotides is not divisible by 3
  • premature stop codon, no stop codon, or more than one stop codon
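The length and stop-codon checks from the list above could look roughly like this. The warning wording and the name `check_cds` are assumptions; the actual messages written to the .msg file are not shown in this document.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # stop codons of the standard genetic code

def check_cds(name, cds):
    """Return a list of warning strings for one spliced coding sequence."""
    cds = cds.upper()
    warnings = []
    if len(cds) % 3 != 0:
        warnings.append(f"{name}: length {len(cds)} is not divisible by 3")
    usable = len(cds) - len(cds) % 3
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    stops = [i for i, codon in enumerate(codons) if codon in STOP_CODONS]
    if not stops:
        warnings.append(f"{name}: no stop codon")
    elif len(stops) > 1:
        warnings.append(f"{name}: more than one stop codon")
    elif stops[0] != len(codons) - 1:
        warnings.append(f"{name}: premature stop codon")
    return warnings

print(check_cds("gene1", "ATGGCCTGA"))   # []
print(check_cds("gene2", "ATGTGAGCC"))   # ['gene2: premature stop codon']
```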
  • .out - contains the DNA sequence with all CDS intervals for every gene capitalized. If two sequences overlap, the interval from the beginning of one to the end of the other is capitalized.
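The overlap rule for the .out file amounts to merging overlapping intervals before capitalizing them. A minimal sketch, assuming 1-based inclusive intervals and using the hypothetical names `merge_intervals` and `capitalize_intervals`:

```python
def merge_intervals(intervals):
    """Merge overlapping (start, end) intervals into their union."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def capitalize_intervals(genome, intervals):
    """Lower-case the whole sequence, then upper-case the merged intervals."""
    out = list(genome.lower())
    for start, end in merge_intervals(intervals):
        out[start - 1:end] = [base.upper() for base in out[start - 1:end]]
    return "".join(out)

print(capitalize_intervals("acgtacgtac", [(2, 4), (3, 6), (9, 9)]))  # aCGTACgtAc
```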

  • .position - lists all the intervals for each CDS entry of the non-redundant data set.

Example of an individual entry:

At4g00030 F6N15.13 NP_191914.1 CDS forward 15236688 13965* 13565-14028|14114-14179|14258-14366

* This number is the middle position: the sum of the first number in the position list (13565) and the last number in the position list (14366), divided by 2 and rounded down, i.e. floor((13565 + 14366) / 2) = 13965.
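The middle position can be reproduced from the position list with integer division. The function name `middle_position` is an assumption for this sketch.

```python
def middle_position(position_list):
    """Floor of (first start + last end) / 2 for a 'a-b|c-d|...' position list."""
    intervals = [
        tuple(int(number) for number in part.split("-"))
        for part in position_list.split("|")
    ]
    return (intervals[0][0] + intervals[-1][1]) // 2

print(middle_position("13565-14028|14114-14179|14258-14366"))  # 13965
```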

  • .protein* - contains the protein translations of the spliced exons in the CDS part of each gene, for the non-redundant data set.

  • .protein.dupl* - contains the protein translations of the spliced exons in the CDS part of each gene, for the redundant data set.

* For these files you choose the order of header elements.


In order to run GenBank Parser you need to download two files: (should work with any GenBank file)
or (adapted for Arabidopsis GenBank files)
and (standard translation table)

====================================================== (current working version) (standard translation table)

Example input file ("/locus_tag" field is required) NT_033777.gbk - D.melanogaster (3R chr)

Example output files:

Run script on Unix: in your shell prompt type

Run script on Windows: in your command prompt type

Python software is available at:
GenBank Data Parser is under the GNU General Public License
Copyright 2003 University of California at Davis


email: Alexander Kozik
email: Elena Kochetkova

last modified: February 25, 2006