####################################################
#
# Author: Brian Chan (birdchan@ucdavis.edu)
# Supervisor : Alex Kozik (akozik@atgc.org)
# Date: July 24th, 2002
#
####################################################
#
# In short, this script reads in the contig ID info
# and contig mismatch info, then outputs details
# about possible polymorphisms.
#
############ Detailed description #################
#
# It first asks for the misMatch file, which is in this format:
# ContigName \t seqName \t (index:symbol)*
# with the index:symbol pairs separated by "|".
# For example, a line could look like this:
# QH_CA_Contig23 QHA10B27.yg.ab1+ 72:D|29:I
# The symbol D means deletion, S means substitution, and
# I means insertion. So 72:D means that there is a
# deletion at index 72 of that seq in that contig.
#
# It then asks for a contig ID file, which is in this format:
# ContigName \t numOfSeq \t (seqName w/ start and end indexes)*
# For example, a line could look like this:
# QG_CA_Contig72 2 Seq1+(2,200)|Seq2+(1,234)
#
# It then asks whether we want to use a prefix or a
# suffix to sort the seqs into two groups.
#
# It then asks how long the pattern is. This pattern
# corresponds to the prefix or suffix we chose.
#
# Then it asks for the patterns for group 1, and then
# for group 2.
#
# Say we input QGA, QGB, QGC, QGD and QGI for
# group 1, and QGE, QGF, QGG, QGH and QGJ for
# group 2. Then every seq whose prefix or suffix
# (depending on what we chose) matches one of the
# group 1 patterns is placed in group 1. The same
# applies to group 2. For example, seq QGB1122
# is placed in group 1. All seqs that match
# neither group are ignored.
#
# After all this input, the script analyzes the
# contents of the two input files. It forms the
# groups and checks whether a possible polymorphism
# exists. If so, it writes its findings to the
# corresponding output file.
#
# The output format is as follows:
# ContigName DIS ALL PartialFraction LowPriority
# For example, the following are valid output lines:
# QH_CA_Contig921 S 1 2/5 1
# QH_CA_Contig3499 I 7 1/13 0
#
# DIS is an element of {D, I, S}
#
# ALL is the number of seqs in a group that have
# this DIS at a particular index, in the case where
# every seq of that group covering the index has
# the DIS, while no seq of the other group has
# this DIS at that index.
#
# PartialFraction is similar to ALL, except that
# only some of the seqs of the group covering the
# index have the DIS, while the seqs of the other
# group still all lack it.
#
# LowPriority is either 0 or 1. Basically, when
# ALL is greater than 1, then LowPriority is 0,
# meaning this contig has some priority. When
# ALL is less than or equal to 1, then LowPriority
# is 1, meaning this contig has low priority.
#
# As an illustration,
# -----D---------
# -----D-------------S---------D---
# -----D-------------S---
# --D---------
# ========I==S=======D===
# ========I==S====
# ========I=====
#
# The output would be
# ContigName D 4 0/1 0
# ContigName I 3 0/1 0
# ContigName S 2 2/3 0
#
# Note that the default output of the partial
# fraction is 0/1.
#
# The following outputs are wrong
# ContigName D 1 0/1 1 (since we have 4 D's for ALL)
# ContigName S 0 2/3 0 (since we do have something for ALL)
#
# By the rules of this script, only the biggest
# "ALL" and "PartialFraction" are kept.
#
####################################################
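#
# Illustrative sketch only, NOT the script's own code.
# It assumes this file is Python; all names below
# (sketch_*) are hypothetical and exist only to show
# how the two input formats, the prefix/suffix grouping
# and the LowPriority rule described above might be
# handled. Fields are split on any whitespace, which
# assumes contig and seq names contain no spaces.
#
import re

def sketch_parse_mismatch_line(line):
    """Return (contig, seq, [(index, symbol), ...]) for one misMatch line."""
    fields = line.split()
    contig, seq = fields[0], fields[1]
    events = fields[2] if len(fields) > 2 else ""
    pairs = []
    for item in events.split("|"):
        if item:
            index, symbol = item.split(":")
            pairs.append((int(index), symbol))
    return contig, seq, pairs

def sketch_parse_contig_id_line(line):
    """Return (contig, numOfSeq, [(seqName, start, end), ...])."""
    fields = line.split()
    contig, num_seq = fields[0], int(fields[1])
    members = []
    for item in fields[2].split("|"):
        # Each item looks like Seq1+(2,200): name, then (start,end).
        match = re.match(r"^(.*)\((\d+),(\d+)\)$", item)
        if match:
            members.append((match.group(1),
                            int(match.group(2)),
                            int(match.group(3))))
    return contig, num_seq, members

def sketch_assign_group(seq_name, use_prefix, length, group1, group2):
    """Return 1 or 2 by prefix/suffix pattern, or None if unidentified."""
    if use_prefix:
        key = seq_name[:length]
    else:
        key = seq_name[-length:]
    if key in group1:
        return 1
    if key in group2:
        return 2
    return None  # unidentified seqs are ignored

def sketch_low_priority(all_count):
    """LowPriority is 0 when ALL is greater than 1, otherwise 1."""
    if all_count > 1:
        return 0
    return 1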