####################################################
#
# Author: Brian Chan (birdchan@ucdavis.edu)
# Supervisor : Alex Kozik (akozik@atgc.org)
# Date: July 24th, 2002
#
####################################################
#
# In short, this script reads in the contig ID info
# and contig mismatch info, then outputs details
# about possible polymorphism.
#
############ Detailed description #################
#
# It first asks for a misMatch file, which is in this format:
#     ContigName \t seqName \t (index1:symbol)*
# For example, a line could look like the following:
#     QH_CA_Contig23 QHA10B27.yg.ab1+ 72:D|29:I
# The symbol D means deletion, S means substitution,
# and I means insertion. For instance, 72:D means that
# there is a deletion at index 72 of that seq in that
# contig.
#
# It then asks for a contig ID file, which is in this format:
#     ContigName \t numOfSeq \t (seqName w/ start and end indexes)*
# For example, a line could look like this:
#     QG_CA_Contig72 2 Seq1+(2,200)|Seq2+(1,234)
#
# It then asks whether we want to use a prefix or a
# suffix to sort seqs into two groups.
#
# It then asks how long the pattern is. This pattern
# corresponds to the prefix or suffix we chose.
#
# Then it asks for the patterns for group 1, and then
# for group 2.
#
# Say we input QGA, QGB, QGC, QGD and QGI for
# group 1, and QGE, QGF, QGG, QGH and QGJ for
# group 2. Then every seq whose prefix or suffix
# (depending on what we chose) is one of the group 1
# patterns will be placed in group 1. The same
# applies to group 2. For example, seq QGB1122
# will be placed in group 1. All unidentified
# seqs will be ignored.
#
# After all this input, the script analyzes
# the content of the two input files. It forms
# the groups and checks whether a possible
# polymorphism exists. If so, it writes its findings
# to the corresponding output file.
#
# The output format is as follows:
#     ContigName DIS ALL PartialFraction LowPriority
# For example, the following are valid outputs:
#     QH_CA_Contig921 S 1 2/5 1
#     QH_CA_Contig3499 I 7 1/13 0
#
# DIS is an element of {D, I, S}
#
# ALL means the number of sequences in a group that
# carry this DIS at a particular index, when every
# sequence in that group covering the index has
# this DIS, while the sequences in the other group
# all do not have this DIS at that particular index.
#
# PartialFraction is similar to ALL, except that
# only some of the sequences in the group have that
# particular DIS, while the sequences in the other
# group all don't have that DIS.
#
# LowPriority is either 0 or 1. Basically, when
# ALL is greater than 1, then LowPriority is 0,
# meaning this contig has some priority. When
# ALL is less than or equal to 1, then LowPriority
# is 1, meaning this contig has low priority.
#
# As an illustration,
#     -----D---------
#     -----D-------------S---------D---
#     -----D-------------S---
#        --D---------
#     ========I==S=======D===
#     ========I==S====
#     ========I=====
#
# The output would be
#     ContigName D 4 0/1 0
#     ContigName I 3 0/1 0
#     ContigName S 2 2/3 0
#
# Notice that the default output of the partial
# fraction is 0/1.
#
# The following outputs are wrong:
#     ContigName D 1 0/1 1 (since we have 4 D's for ALL)
#     ContigName S 0 2/3 0 (since we do have something for ALL)
#
# Per the rules given for this script, we keep the
# largest "ALL" and "PartialFraction".
#
####################################################