#################################################### 
# 
# Author: Brian Chan (birdchan@ucdavis.edu)
# Supervisor : Alex Kozik (akozik@atgc.org)
# Date: July 24th, 2002
# 
#################################################### 
# 
# In short, this script reads in the contig ID info
#    and contig mismatch info, then outputs details
#    about possible polymorphisms.
# 
############ Detailed description #################
# 
# It first asks for a misMatch file, which is in this format:
#    ContigName \t seqName \t (index1:symbol)*
# For example, a line could look like this:
#    QH_CA_Contig23 QHA10B27.yg.ab1+ 72:D|29:I
# The symbol D means deletion, S means substitution, and I
#    means insertion. So 72:D means that there is a deletion
#    at index 72 of that sequence in that contig.
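# 
# A minimal sketch of reading one misMatch line, written in
#    Python for illustration (not necessarily the language
#    of this script). The function name is hypothetical, and
#    it assumes tab-separated fields with events joined by
#    "|", as in the example above.

```python
def parse_mismatch_line(line):
    """Split one misMatch line into (contig, seq_name, events).

    Assumes three tab-separated fields; the third holds "index:symbol"
    items joined by "|", where symbol is one of D, I or S.
    """
    contig, seq_name, events = line.rstrip("\n").split("\t")
    parsed = []
    for item in events.split("|"):
        if not item:
            continue  # tolerate an empty event field
        index, symbol = item.split(":")
        parsed.append((int(index), symbol))
    return contig, seq_name, parsed


if __name__ == "__main__":
    print(parse_mismatch_line("QH_CA_Contig23\tQHA10B27.yg.ab1+\t72:D|29:I"))
```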
# 
# It then asks for a contig ID file, which is in this format:
#    ContigName \t numOfSeq \t (seqName w/ start and end indexes)*
# For example, a line could look like this:
#    QG_CA_Contig72 2 Seq1+(2,200)|Seq2+(1,234)
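# 
# A minimal sketch of reading one contig ID line, again in
#    Python for illustration only. The function name and the
#    regular expression are assumptions based on the example
#    above (tab-separated fields, members joined by "|").

```python
import re

# Matches entries such as "Seq1+(2,200)": a name followed by (start,end).
_MEMBER_RE = re.compile(r"^(.+)\((-?\d+),(-?\d+)\)$")


def parse_contig_id_line(line):
    """Split one contig ID line into (contig, num_seq, members).

    Each member is returned as a (seq_name, start, end) tuple.
    """
    contig, num_seq, members = line.rstrip("\n").split("\t")
    seqs = []
    for item in members.split("|"):
        match = _MEMBER_RE.match(item)
        if match is None:
            raise ValueError("unexpected member entry: %r" % item)
        seqs.append((match.group(1), int(match.group(2)), int(match.group(3))))
    return contig, int(num_seq), seqs


if __name__ == "__main__":
    print(parse_contig_id_line("QG_CA_Contig72\t2\tSeq1+(2,200)|Seq2+(1,234)"))
```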
# 
# It then asks whether we want to use a prefix or a
#    suffix to sort the seqs into two groups.
# 
# It then asks how long the pattern is. This pattern 
#    corresponds to the prefix or suffix we chose.
# 
# Then it asks for the patterns for group 1, then for
#    group 2.
# 
# Say we input QGA, QGB, QGC, QGD and QGI for
#    group 1, and QGE, QGF, QGG, QGH and QGJ for
#    group 2. Then every seq whose prefix or suffix
#    (depending on what we chose) matches one of the
#    group 1 patterns will be placed in group 1. The
#    same applies to group 2. For example, with
#    prefixes, seq QGB1122 will be placed in group 1.
#    All unidentified seqs will be ignored.
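# 
# The grouping rule above can be sketched as follows, in
#    Python for illustration only. The function and
#    parameter names are hypothetical; the pattern length
#    is assumed to be a positive integer.

```python
def assign_group(seq_name, group1, group2, length, use_prefix=True):
    """Return 1 or 2 for the matching group, or None if unidentified.

    The first (or last) `length` characters of the name are compared
    with the patterns entered for each group.
    """
    key = seq_name[:length] if use_prefix else seq_name[-length:]
    if key in group1:
        return 1
    if key in group2:
        return 2
    return None  # unidentified seqs are ignored by the caller


if __name__ == "__main__":
    g1 = {"QGA", "QGB", "QGC", "QGD", "QGI"}
    g2 = {"QGE", "QGF", "QGG", "QGH", "QGJ"}
    for name in ("QGB1122", "QGE0001", "XYZ0001"):
        print(name, assign_group(name, g1, g2, 3))
```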
# 
# After all this input, the script analyzes
#    the content of the two input files. It forms
#    the groups and checks whether a possible
#    polymorphism exists. If so, it writes its findings
#    to the corresponding output file.
# 
# The output format is as follows:
#    ContigName DIS ALL PartialFraction LowPriority
# For example, the following are valid outputs:
#    QH_CA_Contig921 S 1 2/5 1
#    QH_CA_Contig3499 I 7 1/13 0
# 
# DIS is an element of {D, I, S}
# 
# ALL means the number of sequences in a group that
#    have this DIS at a particular index, when every
#    sequence of that group covering the index has
#    this DIS and no sequence in the other group has
#    this DIS at that particular index.
# 
# PartialFraction is similar to ALL, except that
#    some sequences in the group don't have that
#    particular DIS, while the sequences in the other
#    group still all lack that DIS.
# 
# LowPriority is either 0 or 1. When ALL is greater
#    than 1, LowPriority is 0, meaning this contig
#    has some priority. When ALL is less than or
#    equal to 1, LowPriority is 1, meaning this
#    contig has low priority.
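# 
# The LowPriority rule and the output line can be sketched
#    as follows, in Python for illustration only. The
#    function name is hypothetical, and the tab separator
#    is an assumption (the header does not state which
#    separator the output uses).

```python
def format_result(contig, dis, all_count, partial=(0, 1)):
    """Build one output line: ContigName DIS ALL PartialFraction LowPriority.

    `partial` is a (numerator, denominator) pair; (0, 1) stands for
    the default partial fraction 0/1.
    """
    # ALL greater than 1 means the contig has some priority (flag 0).
    low_priority = 0 if all_count > 1 else 1
    fields = [contig, dis, str(all_count), "%d/%d" % partial, str(low_priority)]
    return "\t".join(fields)  # assumed separator


if __name__ == "__main__":
    print(format_result("QH_CA_Contig921", "S", 1, (2, 5)))
    print(format_result("QH_CA_Contig3499", "I", 7, (1, 13)))
```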
# 
# As an illustration,
#    -----D---------
#    -----D-------------S---------D---
#    -----D-------------S---
#       --D---------
#      ========I==S=======D===
#      ========I==S====
#      ========I=====
# 
# The output would be
#    ContigName D 4 0/1 0
#    ContigName I 3 0/1 0
#    ContigName S 2 2/3 0
# 
# Note that the default output of the partial fraction
#    is 0/1.
# 
# The following outputs would be wrong:
#    ContigName D 1 0/1 1  (since we have 4 D's for ALL)
#    ContigName S 0 2/3 0  (since we do have something for ALL)
# 
# By the rules of this script, we keep the
#    biggest "ALL" and "PartialFraction".
# 
####################################################