#################################################################
#
#                         MAD MAPPER
#
#                    MAD MAPPING PROGRAM
#
#                   PART 1 ( CLUSTERING )
#
#                           and
#
#                PART 2 ( MAP CONSTRUCTION )
#
#                    COPYRIGHT 2004 2005
#                      Alexander Kozik
#
#                   http://www.atgc.org/
#
#           UCD Genome Center. R.Michelmore group
#
#################################################################
#################################################################
#                          +-------+
#                          |  BIT  |
#          SCORING SYSTEM: |       |
#                          |  REC  |
#                          +-------+
#
#         +-------+-------+-------+-------+-------+-------+
#         |   A   |   B   |   C   |   D   |   H   |   -   |
# +-------+-------+-------+-------+-------+-------+-------+
# |       |   6   |  -6   |  -4   |   4   |  -2   |   0   |
# |   A   |       |       |       |       |       |       |
# |       |   0   |   1   |   1   |   0   |  0.5  |   0   |
# +-------+-------+-------+-------+-------+-------+-------+
# |       |  -6   |   6   |   4   |  -4   |  -2   |   0   |
# |   B   |       |       |       |       |       |       |
# |       |   1   |   0   |   0   |   1   |  0.5  |   0   |
# +-------+-------+-------+-------+-------+-------+-------+
# |       |  -4   |   4   |   4   |  -4   |   0   |   0   |
# |   C   |       |       |       |       |       |       |
# |       |   1   |   0   |   0   |   1   |   0   |   0   |
# +-------+-------+-------+-------+-------+-------+-------+
# |       |   4   |  -4   |  -4   |   4   |   0   |   0   |
# |   D   |       |       |       |       |       |       |
# |       |   0   |   1   |   1   |   0   |   0   |   0   |
# +-------+-------+-------+-------+-------+-------+-------+
# |       |  -2   |  -2   |   0   |   0   |   2   |   0   |
# |   H   |       |       |       |       |       |       |
# |       |  0.5  |  0.5  |   0   |   0   |   0   |   0   |
# +-------+-------+-------+-------+-------+-------+-------+
# |       |   0   |   0   |   0   |   0   |   0   |   0   |
# |   -   |       |       |       |       |       |       |
# |       |   0   |   0   |   0   |   0   |   0   |   0   |
# +-------+-------+-------+-------+-------+-------+-------+
#
# NOTES:
#   C - NOT A ( H or B )
#   D - NOT B ( H or A )
#   H - A and B
#
#################################################################
#################################################################
#
# EXAMPLES OF SCORING:
#
#
# POSITIVE LINKAGE:
#
#   AAAAAAAAAAAAAAAAAAAA    BIT SCORE = 6*20 = 120
#   AAAAAAAAAAAAAAAAAAAA    REC SCORE = 0 (0.0)
#                     ..
#   AAAAAAAAAAAAAAAAAAAA    BIT SCORE = 6*18 - 6*2 = 96
#   AAAAAAAAAAAAAAAAAABB    REC SCORE = 2 (2/20 = 0.1)
#
#   AAAAAAAAAABBBBBBBBBB    BIT SCORE = 6*10 + 6*10 = 120
#   AAAAAAAAAABBBBBBBBBB    REC SCORE = 0 (0.0)
#            ..
#   AAAAAAAAABABBBBBBBBB    BIT SCORE = 6*18 - 6*2 = 96
#   AAAAAAAAAABBBBBBBBBB    REC SCORE = 2 (2/20 = 0.1)
#
#
# NO LINKAGE:
#             ..........
#   AAAAAAAAAAAAAAAAAAAA    BIT SCORE = 6*10 - 6*10 = 0
#   AAAAAAAAAABBBBBBBBBB    REC SCORE = 10 (10/20 = 0.5)
#    . . . . . . . . . .
#   BBBAABBAAAAAAABAABBB    BIT SCORE = 6*10 - 6*10 = 0
#   BABBAABBABABABBBAABA    REC SCORE = 10 (10/20 = 0.5)
#
#
# NEGATIVE LINKAGE:
#     ..................
#   AAAAAAAAAAAAAAAAAAAA    BIT SCORE = 6*2 - 6*18 = -96
#   AABBBBBBBBBBBBBBBBBB    REC SCORE = 18 (18/20 = 0.9)
#     ..................
#   ABABABABABABABABABAB    BIT SCORE = 6*2 - 6*18 = -96
#   ABBABABABABABABABABA    REC SCORE = 18 (18/20 = 0.9)
#
#################################################################
##################################################################################
#1   5   10   15   20   25   30   35   40   45   50   55   60   65   70   75   80#
#-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5-=3+5#
##################################################################################
##################################################################################
#                          --+============+--
#                            README_FIRST:
#                          PYTHON MAD MAPPER
#                    HOW IT WORKS AND WHAT IT DOES:
#              RULES AND ALGORITHMS OF MADMAPPER APPROACH
#         ---++==========================================++---
#
#  --------------------------------------------------------------------------
#  NOTE:
#  LABELS LIKE ### README_00X ### ARE INCORPORATED INTO THE SOURCE CODE
#  TO HELP TO FIND FUNCTIONS AND VARIABLES DESCRIBED IN THIS DOCUMENT
#  --------------------------------------------------------------------------
#
#  = ASSUMPTIONS AND RESTRICTIONS =
#  0. MadMapper is designed to work with RILs of 6-th generation or higher where
#     the number of heterozygotes in the set of scored individuals is low.
#     MadMapper is designed to work with high-density genetic maps where
#     recombination values between adjacent markers are 0.2 or better (lower).
#
#  = WHAT MAD MAPPER DOES =
#  MadMapper analyses raw marker scores (loc file).
#  Based on this analysis it DOES:
#
#  A. MadMapper generates pairwise distance scores for all markers in a dataset.
#  B. MadMapper does grouping/clustering analysis based on pairwise distances.
#  C. MadMapper helps to find genetic bins (sets of tightly linked markers).
#  D. MadMapper assigns new markers in a dataset to known linkage groups.
#  E. MadMapper validates all markers in a dataset and sorts them into
#     'good' and 'bad' groups based on several criteria and cutoff values.
#  F. MadMapper does so called 'triplet' or 'trio' analysis finding sets of
#     tightly linked triplets (3 markers in a row) and their relative order.
#  G. MadMapper analyzes markers with negative linkage.
#
#  H. MadMapper DOES NOT construct genetic maps yet. However, triplet analysis
#     is close enough to this final goal.
#
#  MadMapper utilizes BIT scoring system instead of LOD scores to manipulate
#  genetic data.
#  ____________________________________________________________________________
#
#  = INPUT FILES =
#  1. Locus file with raw marker scores is required.
#     ### README_001 ###
#     Framework marker list (map) is optional.
#
#  = BASIC DATA STRUCTURE =
#     ### README_002 ###
#  2. MadMapper reads locus file into memory and creates two-dimensional array:
#        sequence_array[id,q] = data_point
#        where: id         - marker_id
#               q          - individual RIL
#               data_point - raw marker score
#     sequence_array[id,q] contains marker scores 'as is' in the input file:
#     'A' 'B' 'C' 'D' 'H' and '-' (heterozygotes are allowed and considered)
#
#     ### README_003 ###
#  3. At the same time MadMapper creates a second two-dimensional array:
#        sequence_array_bin[id,q] = score_point
#     where all heterozygotes are removed or replaced with homozygotes.
#     This binary array has 'A' and 'B' values only.
#        'H' is considered as "no data"
#        'C' -> 'B'    'D' -> 'A'
#     There is an assumption that 'C' and 'D' scores should be homozygous for
#     RIL set of the 6-th generation or higher.
#     This 'sequence_array_bin' will be used to build non-redundant data set
#     based on marker scores as well as to filter 'bad' scores (see #4 below).
#
#     NOTE: If the initial dataset has 'A' and 'B' values only then
#     array 'sequence_array_bin' is identical to 'sequence_array'
#  ____________________________________________________________________________
#
#  = NON-REDUNDANT MARKER SCORES =
#     ### README_004 ###
#  4. '*.z_nr_scores.loc' will be generated with non-redundant binary scores.
#     If two or more markers have identical scores based on the analysis of the
#     sequence_array_bin[id,q] then only one of them will be written into
#     '*.z_nr_scores.loc' file.
#     '*.z_scores_dupl' has info about duplicated markers and which marker
#     is used as a master marker.
#     NOTE: THIS NON-REDUNDANT SET IS FILTERED FOR "BAD" MARKERS (SEE BELOW #5).
#     IT MEANS THAT THIS SET DOES NOT CONTAIN MARKERS WITH ALLELE DISTORTION
#     AND MARKERS WITH THE NUMBER OF MISSING DATA ABOVE DEFINED CUTOFF VALUES.
#
#  = FILTERING OF BAD MARKERS =
#     ### README_005 ###
#  5. MadMapper has two variables:
#        'allele_dist' = 0.33  ( allele distortion )
#        'abs_loss'    = 50    ( absolute data loss )
#     to filter those markers that do not pass criteria:
#        allele ratio (distortion) should be less than 1:3 (0.33)
#        number of missing scores should be less than 50
#     ### README_006 ###
#     these cutoff values can be changed using MadMapper arguments/options
#     at the time of script execution.
#     ### README_007 ###
#     variable 'nr_good_list' stores marker IDs which passed criteria above.
#     Marker IDs stored in 'nr_good_list' will be used for TRIO analysis.
#
#     NOTE: FILTERING DOES NOT AFFECT CALCULATION OF PAIRWISE DISTANCES AND
#     GLOBAL MATRIX WITH PAIRWISE DISTANCES AND FURTHER CLUSTERING ANALYSIS.
#     NON-REDUNDANT CLEAN (FILTERED) SET CAN BE USED ON A SECOND ROUND OF
#     THE ANALYSIS IF NECESSARY. HOWEVER, FILTERED NON-REDUNDANT SET IS USED
#     FOR TRIPLET ANALYSIS TO FIND BEST TRIO SETS (SEE BELOW).
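The binary conversion (paragraph 3), the removal of duplicated scores (paragraph 4) and the filtering of 'bad' markers (paragraph 5) can be sketched as below. This is only an illustration and not the MadMapper source: the function names `binarize`, `passes_filter` and `nonredundant_good_set` are hypothetical, and the exact form of the allele distortion test is an assumption (here the ratio of the less frequent to the more frequent allele must be at least 'allele_dist'). The first marker seen with a given binary score is treated as the master marker.

```python
# Mapping from paragraph 3: 'C' -> 'B', 'D' -> 'A', 'H' is treated as "no data".
TO_BINARY = {"A": "A", "B": "B", "C": "B", "D": "A", "H": "-", "-": "-"}


def binarize(scores):
    """Return the 'A'/'B'/'-' version of a raw score string."""
    return "".join(TO_BINARY[s] for s in scores)


def passes_filter(bin_scores, allele_dist=0.33, abs_loss=50):
    """Assumed filter: reject markers with too many missing scores
    or with a minor/major allele ratio below allele_dist."""
    n_a = bin_scores.count("A")
    n_b = bin_scores.count("B")
    missing = len(bin_scores) - n_a - n_b
    if missing >= abs_loss:
        return False
    if n_a == 0 or n_b == 0:
        return False
    return min(n_a, n_b) / max(n_a, n_b) >= allele_dist


def nonredundant_good_set(raw, allele_dist=0.33, abs_loss=50):
    """Return (nr_good_list, duplicates) for a dict marker_id -> raw scores.

    duplicates maps each redundant marker ID to its master marker ID.
    """
    master_by_score = {}
    nr_good_list = []
    duplicates = {}
    for marker_id, scores in raw.items():
        bin_scores = binarize(scores)
        if not passes_filter(bin_scores, allele_dist, abs_loss):
            continue
        if bin_scores in master_by_score:
            duplicates[marker_id] = master_by_score[bin_scores]
        else:
            master_by_score[bin_scores] = marker_id
            nr_good_list.append(marker_id)
    return nr_good_list, duplicates


if __name__ == "__main__":
    toy = {
        "m1": "AABBAABBAB",
        "m2": "DDCCDDCCDC",  # same binary score as m1
        "m3": "AAAAAAAAAB",  # allele distortion
        "m4": "A-----B---",  # too many missing scores
        "m5": "ABABHBABAB",
    }
    print(nonredundant_good_set(toy, allele_dist=0.33, abs_loss=5))
```

In this toy set 'm2' duplicates 'm1', while 'm3' and 'm4' are dropped by the two filters.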
#  ____________________________________________________________________________
#
#  = CALCULATION OF PAIRWISE DISTANCES =
#     ### README_008 ###
#  6. BIT scoring matrix values and REC (recombination) scoring matrix values
#     are defined under function 'Define_Bit_Scores'. For each pair of markers
#     haplotype pairwise distances are calculated according to this BIT and REC
#     matrix scoring system.
#
#  = GLOBAL MATRIX WITH PAIRWISE DISTANCES =
#     ### README_009 ###
#  7. Creation of Global Matrix Data for further marker clustering:
#     There are three cutoff values to create Matrix with pairwise distances
#     for further clustering:
#        'rec_cut' (recombination or distance cutoff)   0.2 - default
#        'bit_cut' (BIT score cutoff)                   100 - default
#        'dat_cut' (datapoints cutoff)                   25 - default
#
#     Explanation of datapoints:
#            0    1    1    2    2    3    3    4    4    5
#        ====5====0====5====0====5====0====5====0====5====0
#     M1 ----AAABBBAAABBBAAABBBAAABBBAA--------------------
#     M2 ---------BAAABBBAAABBBAAABBBAAAABBB---------------
#     M3 --------------BBAAABBBAAABBBAAAABBBBBAAA----------
#     M4 -------------------BBBAAABBBAAAABBBBBAAAAABBB-----
#     M5 ------------------------ABBBAAAABBBBBAAAAABBBBBAAA
#
#     In this example the number of individuals (RILs) is 50.
#     If 'data_cut' value for this small set is 12 then
#     at least 12 scores between two markers must be compared
#     to get pairwise matrix values. In this example pairwise scores for pairs
#     of markers: M1-M4 M1-M5 M2-M5 are not assigned because they have
#     overlapping scores below data_cut cutoff (12).
#
#  = MATRIX OUTPUT FILES [ PAIRWISE SCORES ] =
#     ### README_010 ###
#     '*.pairs_all' global matrix file will contain pairwise scores for all
#     markers if data overlap is greater than 'data_cut' value (25 by default).
#     BIT score cutoff and REC score cutoff do not affect this file and will be
#     used for further clustering (grouping) analysis only.
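The pairwise calculation described in paragraphs 6 and 7 can be sketched as follows, using the BIT and REC values from the scoring table at the top of this document. This is not the 'Define_Bit_Scores' function itself: the names `BIT`, `REC` and `pairwise_scores` are hypothetical, a datapoint is assumed to be any position where neither marker has '-', and a pair is assumed to be scored when it has at least 'dat_cut' datapoints.

```python
ALPHABET = "ABCDH-"

# Rows and columns follow the order A B C D H - of the scoring table.
_BIT_ROWS = {
    "A": (6, -6, -4, 4, -2, 0),
    "B": (-6, 6, 4, -4, -2, 0),
    "C": (-4, 4, 4, -4, 0, 0),
    "D": (4, -4, -4, 4, 0, 0),
    "H": (-2, -2, 0, 0, 2, 0),
    "-": (0, 0, 0, 0, 0, 0),
}
_REC_ROWS = {
    "A": (0, 1, 1, 0, 0.5, 0),
    "B": (1, 0, 0, 1, 0.5, 0),
    "C": (1, 0, 0, 1, 0, 0),
    "D": (0, 1, 1, 0, 0, 0),
    "H": (0.5, 0.5, 0, 0, 0, 0),
    "-": (0, 0, 0, 0, 0, 0),
}
BIT = {(r, c): _BIT_ROWS[r][i] for r in ALPHABET for i, c in enumerate(ALPHABET)}
REC = {(r, c): _REC_ROWS[r][i] for r in ALPHABET for i, c in enumerate(ALPHABET)}


def pairwise_scores(scores_a, scores_b, dat_cut=25):
    """Score two markers; return None when the data overlap is too small."""
    if len(scores_a) != len(scores_b):
        raise ValueError("markers must be scored on the same set of RILs")
    bit = 0
    rec_events = 0.0
    datapoints = 0
    for x, y in zip(scores_a, scores_b):
        if x == "-" or y == "-":
            continue  # assumed: missing data on either side is not a datapoint
        datapoints += 1
        bit += BIT[x, y]
        rec_events += REC[x, y]
    if datapoints == 0 or datapoints < dat_cut:
        return None
    total = len(scores_a)
    return {
        "rec": rec_events / datapoints,    # recombination value
        "bit": bit,                        # BIT score
        "fraction": datapoints / total,    # fraction of datapoints
        "rec_events": rec_events,
        "datapoints": datapoints,
        "data_loss": total - datapoints,
        "total": total,
    }


if __name__ == "__main__":
    m1 = "-" * 4 + "AAABBBAAABBBAAABBBAAABBBAA" + "-" * 20
    m2 = "-" * 9 + "BAAABBBAAABBBAAABBBAAAABBB" + "-" * 15
    m4 = "-" * 19 + "BBBAAABBBAAAABBBBBAAAAABBB" + "-" * 5
    print(pairwise_scores(m1, m2, dat_cut=12))
    print(pairwise_scores(m1, m4, dat_cut=12))
```

With the M1, M2 and M4 scores from the datapoints example and a cutoff of 12, the M1-M2 pair is scored on 21 datapoints while M1-M4 (11 overlapping scores) is left unassigned.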
#     ### README_011 ###
#     '*.pairs_positive' will contain pairwise data for all pairs of markers
#     with recombination 0.4 or less (positive linkage)
#     ### README_012 ###
#     '*.pairs_negative' will contain pairwise data for all pairs of markers
#     with recombination 0.6 or greater (negative linkage)
#
#     STRUCTURE OF '*.pairs_all' '*.pairs_positive' '*.pairs_negative' FILES:
#     * first column:   marker ID "A"
#     * second column:  marker ID "B"
#     * third column:   recombination value between markers "A" and "B"
#     * fourth column:  BIT score for pair "A" and "B"
#     * fifth column:   datapoints value
#                       (fraction of datapoints for pair "A" and "B")
#     * sixth column:   "***" - visual mark
#     * seventh column: total number of recombination events
#     * eighth column:  total number of datapoints
#     * ninth column:   total number of data loss
#     * tenth column:   total number of possible comparisons (number of RILs)
#
#     PAIRWISE SCORES FOR THE EXAMPLE ABOVE:
#        M1  M1  0.0  156  0.52  ***  0  26  24  50
#        M1  M2  0.0  126  0.42  ***  0  21  29  50
#        M1  M3  0.0   96  0.32  ***  0  16  34  50
#        M2  M1  0.0  126  0.42  ***  0  21  29  50
#        M2  M2  0.0  156  0.52  ***  0  26  24  50
#        M2  M3  0.0  126  0.42  ***  0  21  29  50
#        M2  M4  0.0   96  0.32  ***  0  16  34  50
#        M3  M1  0.0   96  0.32  ***  0  16  34  50
#        M3  M2  0.0  126  0.42  ***  0  21  29  50
#        M3  M3  0.0  156  0.52  ***  0  26  24  50
#        M3  M4  0.0  126  0.42  ***  0  21  29  50
#        M3  M5  0.0   96  0.32  ***  0  16  34  50
#        M4  M2  0.0   96  0.32  ***  0  16  34  50
#        M4  M3  0.0  126  0.42  ***  0  21  29  50
#        M4  M4  0.0  156  0.52  ***  0  26  24  50
#        M4  M5  0.0  126  0.42  ***  0  21  29  50
#        M5  M3  0.0   96  0.32  ***  0  16  34  50
#        M5  M4  0.0  126  0.42  ***  0  21  29  50
#        M5  M5  0.0  156  0.52  ***  0  26  24  50
#
#  ____________________________________________________________________________
#
#  = CLUSTERING AND GROUPING =
#     ### README_014 ###
#  8. Regardless of filtering described above (see paragraphs 4 and 5)
#     MadMapper will do clustering with all markers from input dataset.
#     Clustering with filtered (selected) markers can be done on a second run
#     of MadMapper if necessary.
#
#     ### README_015 ###
#     DFS (Depth First Search) procedure is repeated 16 times with different
#     recombination (haplotype distance) cutoff values starting with 0.2 and
#     ending with 0.0. Markers are grouped together based on transitive linkage.
#     If marker 'M1' is linked to marker 'M2' and marker 'M2' is linked to 'M3'
#     then all three markers 'M1' 'M2' and 'M3' belong to the same linkage group
#     even if 'M1' is not linked directly to marker 'M3'.
#
#  = CLUSTERING/GROUPING OUTPUT FILES =
#     Information about grouping is stored in three files per iteration:
#        '*.matrix'     - pairwise distances for a given group
#        '*.adj_list'   - adjacency list
#        '*.group_info' - group info
#
#     STRUCTURE OF '*.matrix' FILE:
#     * first column:  marker ID "A"
#     * second column: marker ID "B"
#     * third column:  recombination value between markers "A" and "B"
#     * fourth column: BIT score for markers "A" and "B"
#     * fifth column:  datapoints value
#                      (fraction of datapoints for pair "A" and "B")
#
#     STRUCTURE OF '*.group_info' FILE:
#     * first column:  marker ID
#     * second column: length of an adjacency list for given marker or how many
#       other markers are linked to given marker directly
#     * third column:  size of the given group (how many markers in this
#       particular linked group)
#     * fourth column: arbitrary group number
#     * fifth column:  visual mark ("*****" separates different group)
#     * sixth column:  information about framework markers
#     * seventh column: type of graph [SINGLETON/LINKED/COMPLETE].
#       If node is a singleton (is not connected to any other node) then it is
#       labeled as 'SINGLE____NODE'
#       If nodes form complete graph (all nodes linked to each other directly)
#       then such group is labeled as 'COMPLETE_GRAPH'
#       If group is not a complete graph (some nodes do not have direct links
#       or connections to other nodes) then such group is labeled as
#       'LINKED___GROUP'.
#     * eighth column: type of node (SATURATED or DILUTED). If a node has all
#       possible connections to all other nodes in a group [in other words: node
#       is connected directly to all other nodes in a group] then such node is
#       labeled as 'SATURATED_NODE'. 'DILUTED___NODE' is an indication that a
#       group does not form complete graph. Group can be considered as a bin if
#       all nodes in a given group are SATURATED and graph is COMPLETE.
#  ____________________________________________________________________________
#
#  = CLUSTERING/GROUPING SUMMARY FOR ALL 16 ITERATIONS =
#                   [ DENDRO-CLUSTERING ]
#     STRUCTURE OF *.x_tree_clust FILE:
#     * 1-st column: group ID for clustering with cutoff 0.20
#     * 2-nd column: group ID for clustering with cutoff 0.18
#     * 3-rd column: group ID for clustering with cutoff 0.16
#     * 4-th column: group ID for clustering with cutoff 0.14
#     * 5-th column: group ID for clustering with cutoff 0.12
#     * 6-th column: group ID for clustering with cutoff 0.10
#     * 7-th column: group ID for clustering with cutoff 0.09
#     * 8-th column: group ID for clustering with cutoff 0.08
#     * 9-th column: group ID for clustering with cutoff 0.07
#     *10-th column: group ID for clustering with cutoff 0.06
#     *11-th column: group ID for clustering with cutoff 0.05
#     *12-th column: group ID for clustering with cutoff 0.04
#     *13-th column: group ID for clustering with cutoff 0.03
#     *14-th column: group ID for clustering with cutoff 0.02
#     *15-th column: group ID for clustering with cutoff 0.01
#     *16-th column: group ID for clustering with cutoff 0.00
#     *17-th column: type of graph (COMPLETE or LINKED) for the last iteration
#     *18-th column: type of node (SATURATED or DILUTED) for the last iteration
#     *19-th column: '***' - visual mark
#     *20-th column: linkage group from framework marker list
#     *21-st column: position on the map of frame marker (if available)
#     *22-nd column: '***' - visual mark
#     *23-rd column: "ABC" (alphabetical) order of markers prior to clustering
#     *24-th column: '***' - visual mark
#     *25-th column: "LG" - reserved field for manipulation in Excel-like editor
#     *26-th column: marker ID
#     *27-th column: '*' - visual mark
#     *28-th column: sum of 'A' scores (A+D)
#     *29-th column: sum of 'B' scores (B+C)
#     *30-th column: total number of possible scores (number of RILs)
#     *31-st column: '*' - visual mark
#     *32-nd column: order of markers according to dendro-clustering
#
#  ____________________________________________________________________________
#
#  = TRIO/TRIPLET ANALYSIS =
#
#  9. MadMapper performs TRIPLET or TRIO analysis. It finds tightly linked
#     triplets (3 markers in a row) and their relative order.
#
#     Each marker from 'GOOD NON-REDUNDANT SET' (see #5 above) is checked for
#     all possible combinations with other markers. In this case 'test' marker
#     takes a middle position in a trio and recombination scores between middle
#     and flanking markers are analyzed.
#
#     DATA STRUCTURE FOR TRIO ANALYSIS:
#
#     ALL TRIOS (madmapper_scores.loc.out.z_trio_all):
#
#     MV 0.1 240 1.0 MP 0.06 264 1.0 MI *** 0.16 204 1.0 *** 0
#     MV 0.1 240 1.0 MP 0.0455 240 0.88 MK *** 0.1591 180 0.88 *** 0
#     MV 0.1 240 1.0 MP 0.04 276 1.0 MR *** 0.06 264 1.0 *** 2
#     MV 0.1 240 1.0 MP 0.08 252 1.0 MS *** 0.02 288 1.0 *** 4
#     ..............................
#     MG 0.125 216 0.96 MR 0.14 216 1.0 MT *** 0.2708 132 0.96 *** 0
#     MG 0.125 216 0.96 MR 0.06 264 1.0 MV *** 0.1875 180 0.96 *** 0
#     MH 0.1111 210 0.9 MR 0.1458 204 0.96 MF *** 0.0444 246 0.9 *** 5
#     MH 0.1111 210 0.9 MR 0.125 216 0.96 MG *** 0.0222 258 0.9 *** 5
#
#
#     GOOD TRIOS (madmapper_scores.loc.out.z_trio_good):
#
#     MK 0.0909 216 0.88 MR 0.06 264 1.0 MV *** 0.1591 180 0.88 *** 0 --- 0.15
#     MP 0.04 276 1.0 MR 0.04 276 1.0 MS *** 0.08 252 1.0 *** 0 +++ 0.0
#     MP 0.04 276 1.0 MR 0.14 216 1.0 MT *** 0.18 192 1.0 *** 0 --- 0.2
#     MP 0.04 276 1.0 MR 0.06 264 1.0 MV *** 0.1 240 1.0 *** 0 --- 0.04
#
#
#     BEST TRIOS (madmapper_scores.loc.out.z_trio_best)
#
#     MH 0.0 234 0.78 MK 0.0 264 0.88 MI *** 0.0 270 0.9 *** 0 +++ 0.0
#     MI 0.0 264 0.88 MK 0.0 234 0.78 MH *** 0.0 270 0.9 *** 0 +++ 0.0
#     MK 0.0455 240 0.88 MP 0.04 276 1.0 MR *** 0.0909 216 0.88 *** 0 +++ 0.0
#     MR 0.04 276 1.0 MP 0.0455 240 0.88 MK *** 0.0909 216 0.88 *** 0 +++ 0.0
#     MP 0.04 276 1.0 MR 0.04 276 1.0 MS *** 0.08 252 1.0 *** 0 +++ 0.0
#     MS 0.04 276 1.0 MR 0.04 276 1.0 MP *** 0.08 252 1.0 *** 0 +++ 0.0
#
#
#     BAD TRIOS (madmapper_scores.loc.out.z_trio_bad)
#
#     MS 0.1 240 1.0 MT 0.12 228 1.0 MV *** 0.02 288 1.0 *** 5 +++ 0.0
#     MV 0.12 228 1.0 MT 0.1 240 1.0 MS *** 0.02 288 1.0 *** 5 +++ 0.0
#
#     (columns are numbered 1 to 17 from left to right)
#
#     * 1-st column - 'upper' marker ID (above 'target' marker)
#     * 2-nd column - distance between 'upper' marker and 'target' [DIST_UP]
#     * 3-rd column - BIT score between 'upper' marker and 'target'
#     * 4-th column - fraction of datapoints between 'upper' marker and 'target'
#     * 5-th column - 'target' marker ID (middle position in a triplet)
#     * 6-th column - distance between 'lower' marker and 'target' [DIST_DN]
#     * 7-th column - BIT score between 'lower' marker and 'target'
#     * 8-th column - fraction of datapoints between 'lower' marker and 'target'
#     * 9-th column - 'lower' marker ID (below 'target' marker)
#     *10-th column - '***' - visual mark
#     *11-th column - distance between 'upper' marker and 'lower' marker
#                     [distance between flanking markers] [DIST_FL]
#     *12-th column - BIT score between flanking markers
#     *13-th column - fraction of datapoints between flanking markers
#     *14-th column - '***' - visual mark
#     *15-th column - number of double cross-overs in a triplet
#     *16-th column - '+++' stands for 'best' case ('---' for other than 'best')
#     *17-th column - trio distance summary to calculate the 'best' case
#                     it is a sum of three values:
#                     [DIST_UP] + [DIST_DN] + [DIST_FL] minus [BEST_VALUE]
#        ( it must be 0.0 for 'best' case and greater than 0.0 for other cases )
#        [BEST_VALUE] is defined as a minimum (lowest value) of sum of
#        [DIST_UP] + [DIST_DN] + [DIST_FL] for 'best' case
#
#
#  = BRIEF DESCRIPTION OF TRIO APPROACH =
#
#     ### README_016 ###
#     [A] - All possible trios are analyzed and TRIO MATRIX is created
#
#     ### README_017 ###
#     [B] - best_recomb_trio[item_c,dk] array contains trios that have a number
#           of double crossovers less than cutoff value (3 by default)
#           ['double_cross' is the last argument/option of MadMapper program]
#
#     ### README_018 ###
#     [C] - Best recombination value is calculated for each 'target' marker
#           [BEST_VALUE] = MIN OF SUM: [DIST_UP] + [DIST_DN] + [DIST_FL]
#
#     ### README_019 ###
#     [D] - Extraction of BEST TRIOS into '*.z_trio_best' file
#
#  ____________________________________________________________________________
#
#                                                                     NUMBER OF
#  = BRIEF DESCRIPTION OF ALL OUTPUT FILES =                           FILES:
#
#  PAIRWISE DISTANCES AND CLUSTERING:
#  *.adj_list_(01-16)   - adjacency lists (positive clustering)            16
#  *.adj_list_N(1-3)    - adjacency lists (negative clustering)             3
#  *.group_info_(01-16) - group info (positive clustering)                 16
#  *.group_info_N(1-3)  - group info (negative clustering)                  3
#  *.group_info_Summary -                                                   1
#                 - summary for all 16 iterations of positive clustering
#  *.matrix_(01-16)     - pairwise distance matrix (positive clustering)   16
#  *.matrix_N(1-3)      - pairwise distance matrix (negative clustering)    3
#  *.pairs_all      - pairwise distance matrix with all available scores    1
#  *.pairs_negative - pairwise distance matrix with all negative scores     1
#  *.pairs_positive - pairwise distance matrix with all positive scores     1
#
#  ERROR CHECKING:
#  *.set_dupl - duplicated marker IDs | DO NOT CONFUSE WITH |               1
#  *.set_uniq - unique marker IDs     | DUPLICATED SCORES   |               1
#
#  LOG FILE:
#  *.x_log_file - log file with run parameters recorded                     1
#
#  MARKER SCORES INFO:
#  *.x_scores_stat - detailed information about scores/linkage stat         1
#                    (see *.x_scores_stat structure below README_013)
#
#  GROUPING OF MARKERS ( DENDRO-CLUSTERING ):
#  *.x_tree_clust - grouping of markers based on the analysis               1
#                   of all 16 *.group_info_(01-16) files
#
#  NON-REDUNDANT SCORES SET AND FILTERING OF 'BAD' MARKERS:
#  *.z_marker_sum                                                           1
#  *.z_nr_scores.loc                                                        1
#  *.z_scores_dupl                                                          1
#
#  TRIO (TRIPLET) ANALYSIS:
#  *.z_trio_all                                                             1
#  *.z_trio_bad                                                             1
#  *.z_trio_best                                                            1
#  *.z_trio_good                                                            1
#  *.z_trio_graph                                                           1
#  *.z_trio_map                                                             1
#
#  BIN CONSENSUS SET (EXPERIMENTAL):
#  *.z_xconsensus.conv_adjc                                                 1
#  *.z_xconsensus.conv_real                                                 1
#  *.z_xconsensus.debug                                                     1
#  *.z_xconsensus.frame                                                     1
#  *.z_xconsensus.loc_all                                                   1
#  *.z_xconsensus.loc_nr                                                    1
#  *.z_xconsensus.matrix                                                    1
#
#  TOTAL NUMBER OF OUTPUT FILES:                                           82
#  ____________________________________________________________________________
#
#  MARKER SCORES INFO OUTPUT FILE FORMAT:
#     ### README_013 ###
#  * 1-st column (MARKER_ID): marker ID
#  * 2-nd column (00_01): number of markers linked to a given marker within
#    0.0 - 0.1 recombination frequency (strong positive linkage)
#  * 3-rd column (01_02): number of markers linked to a given marker within
#    0.1 - 0.2 recombination frequency (strong positive linkage)
#  * 4-th column (02_03): number of markers linked to a given marker within
#    0.2 - 0.3 recombination frequency (positive linkage)
#  * 5-th column (03_04): number of markers linked to a given marker within
#    0.3 - 0.4 recombination frequency
#  * 6-th column (04_05): number of markers linked to a given marker within
#    0.4 - 0.5 recombination frequency
#  * 7-th column (05_06): number of markers linked to a given marker within
#    0.5 - 0.6 recombination frequency
#  * 8-th column (06_07): number of markers linked to a given marker within
#    0.6 - 0.7 recombination frequency
#  * 9-th column (07_08): number of markers linked to a given marker within
#    0.7 - 0.8 recombination frequency (negative linkage)
#  *10-th column (08_09): number of markers linked to a given marker within
#    0.8 - 0.9 recombination frequency (strong negative linkage)
#  *11-th column (09_10): number of markers linked to a given marker within
#    0.9 - 1.0 recombination frequency (strong negative linkage)
#  *12-th column (REC:P-N): difference between strong positive and strong
#    negative events [ (00_01+01_02) minus (08_09+09_10) ]
#  *13-th column (REC_ABS): absolute recombination value for a given marker
#    (accumulative value for all possible recombination values)
#    [ Summary of all (0.5 - recombination frequency) ]
#  *14-th column (***): "***" - visual mark
#  *15-th column (BIT_POS): number of markers having BIT score 100 or higher
#    to a given marker (positive BIT scores)
#  *16-th column (BIT_MED): number of markers having BIT score within
#    100 and -100 range (low or medium BIT scores)
#  *17-th column (BIT_NEG): number of markers having BIT score -100 or lower
#    to a given marker (negative BIT scores)
#  *18-th column (BIT:P-N): difference between positive and negative BIT
#    scores [ BIT_POS minus BIT_NEG ]
#  *19-th column (BIT_ABS): absolute BIT value (score) for a given marker
#    (accumulative value for all possible BIT scores)
#    [ Summary of all BIT scores ]
#  *20-th column (***): "***" - visual mark
#     ### README_13LG ###
#  *21-st column (LG_SUMMARY): attempt to classify markers by analysis of
#    2-nd through 11-th columns [can be ignored currently]
#  *22-nd column (NP_SUMMARY): brief summary about positive/negative dominance
#  *23-rd column (***): "***" - visual mark
#  *24-th to 33-rd columns: stat summary for 'AD-BC-H' scores ('X' - no data)
#
##################################################################################
##################################################################################
#
#  = EXAMPLES OF DATA PROCESSING BY MAD MAPPER =
#
#  *** INPUT RAW MARKER SCORES:
#
#  M1 ----AAABBBAAABBBAAABBBAAABBBAA--------------------
#  M2 ---------BAAABBBAAABBBAAABBBAAAABBB---------------
#  M3 --------------BBAAABBBAAABBBAAAABBBBBAAA----------
#  M4 -------------------BBBAAABBBAAAABBBBBAAAAABBB-----
#  M5 ------------------------ABBBAAAABBBBBAAAAABBBBBAAA
#  M6 -------------------BBBAAABBBAAAABBBBBAAAAABBBBBAAB
#  M7 ------------------ABBBAAABBBAAAABBBBBAAAAABBBBBABB
#  M8 ----------------AAABBBAAABBBAAAAABBBBAAAAABBBBBBBB
#  M9 ----------------AAABBBAAABBBAAAAAABBBAAAAABBBBBBBB
#  MF -AABBB-AABBBBBABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MG -AABBB-AABBBBAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MH -AABBB-AABBBAAA-BBBAAAAAAAA-AAAAAAABBBBBBBB-BBBBBB
#  MI AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MJ AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MK AA-BBBA-ABBBAAABBBB---AAAAAAAAAAAAABBBBBBBBBBBBBB-
#  ML DD-CCCD-DCCCDDDCCCC---DDDDDDDDDDDDDCCCCCCCCCCCCCC-
#  MN BBBAAABBBAAABBBAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAB
#  MP AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBAAA
#  MR AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBAAAAA
#  MS AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBAAAAAAA
#  MT ABABBBAAABABAAABBBBAABAAAABAAAAAAAABBBBABBBAAAAAAA
#  MV AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBAAAAAAAA
#  MW -AAAAAAAAAAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAABBBBBB
#  MX ------------A------------B------------A----------B
#  MY -----------BA-----------AB-----------AA---------BB
#  MZ ------------AA----------ABB-----------AAA--------B
#
#
#  *** PROGRAM EXECUTION:
#  $ python Python_MadMapper_V248_RECBIT.py madmapper_scores.loc
#                                           madmapper_scores.loc.out
#                                           0.2 100 12 X 0.33 25 TRIO 3
#
#
#  *** X_LOG (madmapper_scores.loc.out.x_log_file) OUTPUT FILE:
#  =============================================
#     RUN PARAMETERS:
#   1. INPUT  FILE:  madmapper_scores.loc
#   2. OUTPUT FILE:  madmapper_scores.loc.out
#   3. RECM CUTOFF:  0.2
#   4. BITS CUTOFF:  100
#   5. DATA CUTOFF:  12
#   6. ALLELE DIST:  0.33
#   7. MISSING DAT:  25
#   8. FRAME  LIST:  X
#   9. TRIO ANALYS:  TRUE
#  10. DOUBLE LIMT:  3
#  =============================================
#  =============================================
#  26  UNIQ IDs IN THE SET FOUND
#  0  IDs ARE DUPLICATED
#  =============================================
#  CONTINUE ANALYSIS WITH 26 SEQUENCES OUT OF 26
#  =============================================
#  7 GROUPS WERE FOUND (STEP 01)
#  7 GROUPS WERE FOUND (STEP 02)
#  ........
#
#
#  *** NON-REDUNDANT SCORES (madmapper_scores.loc.out.z_nr_scores.loc):
#
#  M1 ----AAABBBAAABBBAAABBBAAABBBAA--------------------
#  M2 ---------BAAABBBAAABBBAAABBBAAAABBB---------------
#  M3 --------------BBAAABBBAAABBBAAAABBBBBAAA----------
#  M4 -------------------BBBAAABBBAAAABBBBBAAAAABBB-----
#  M5 ------------------------ABBBAAAABBBBBAAAAABBBBBAAA
#  M6 -------------------BBBAAABBBAAAABBBBBAAAAABBBBBAAB
#  M7 ------------------ABBBAAABBBAAAABBBBBAAAAABBBBBABB
#  M8 ----------------AAABBBAAABBBAAAAABBBBAAAAABBBBBBBB
#  M9 ----------------AAABBBAAABBBAAAAAABBBAAAAABBBBBBBB
#  MF -AABBB-AABBBBBABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MG -AABBB-AABBBBAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MH -AABBB-AABBBAAA-BBBAAAAAAAA-AAAAAAABBBBBBBB-BBBBBB
#  MI AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MK AA-BBBA-ABBBAAABBBB---AAAAAAAAAAAAABBBBBBBBBBBBBB-
#  MN BBBAAABBBAAABBBAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAB
#  MP AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBAAA
#  MR AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBAAAAA
#  MS AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBAAAAAAA
#  MT ABABBBAAABABAAABBBBAABAAAABAAAAAAAABBBBABBBAAAAAAA
#  MV AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBAAAAAAAA
#  NOTE THAT MARKERS: MJ and ML were removed because of scores duplication
#                     MW was removed because of allele distortion
#                     MX MY MZ were removed because of missing data
#
#
#  *** INFO ABOUT DUPLICATED SCORES (madmapper_scores.loc.out.z_scores_dupl):
#  MI == MJ *** MI
#  MJ == MI *** MI
#  MK == ML *** MK
#  ML == MK *** MK
#  (last column - 'master' marker)
#
#
#  *** GROUP INFO FILE (iteration #16 madmapper_scores.loc.out.group_info_16):
#
#  M1  1  7  1   *****  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M2  3  7  1   -----  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M3  4  7  1   -----  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M4  4  7  1   -----  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M5  1  7  1   -----  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M6  2  7  1   -----  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M7  3  7  1   -----  _NONE_  LINKED___GROUP_00001  DILUTED___NODE
#  M8  0  1  2   *****  _NONE_  SINGLE____NODE_00002  SATURATED_NODE
#  M9  0  1  3   *****  _NONE_  SINGLE____NODE_00003  SATURATED_NODE
#  MF  0  1  4   *****  _NONE_  SINGLE____NODE_00004  SATURATED_NODE
#  MG  0  1  5   *****  _NONE_  SINGLE____NODE_00005  SATURATED_NODE
#  MH  4  5  6   *****  _NONE_  COMPLETE_GRAPH_00006  SATURATED_NODE
#  MI  4  5  6   -----  _NONE_  COMPLETE_GRAPH_00006  SATURATED_NODE
#  MJ  4  5  6   -----  _NONE_  COMPLETE_GRAPH_00006  SATURATED_NODE
#  MK  4  5  6   -----  _NONE_  COMPLETE_GRAPH_00006  SATURATED_NODE
#  ML  4  5  6   -----  _NONE_  COMPLETE_GRAPH_00006  SATURATED_NODE
#  MN  0  1  7   *****  _NONE_  SINGLE____NODE_00007  SATURATED_NODE
#  MP  0  1  8   *****  _NONE_  SINGLE____NODE_00008  SATURATED_NODE
#  MR  0  1  9   *****  _NONE_  SINGLE____NODE_00009  SATURATED_NODE
#  MS  0  1  10  *****  _NONE_  SINGLE____NODE_00010  SATURATED_NODE
#  MT  0  1  11  *****  _NONE_  SINGLE____NODE_00011  SATURATED_NODE
#  MV  0  1  12  *****  _NONE_  SINGLE____NODE_00012  SATURATED_NODE
#  MW  0  1  13  *****  _NONE_  SINGLE____NODE_00013  SATURATED_NODE
#  MX  0  1  14  *****  _NONE_  SINGLE____NODE_00014  SATURATED_NODE
#  MY  0  1  15  *****  _NONE_  SINGLE____NODE_00015  SATURATED_NODE
#  MZ  0  1  16  *****  _NONE_  SINGLE____NODE_00016  SATURATED_NODE
#
#  NOTE THE MAJOR DIFFERENCE BETWEEN GROUP 1 AND GROUP 6:
#  GROUP 1 is a 'LINKED GROUP'
#  GROUP 6 is a 'COMPLETE GRAPH'
#
#
#  *** CONSENSUS (madmapper_scores.loc.out.z_xconsensus.loc_nr):
#
#  GC_00001 --------------------------------------------------
#  GC_00008 ----------------AAABBBAAABBBAAAAABBBBAAAAABBBBBBBB
#  GC_00009 ----------------AAABBBAAABBBAAAAAABBBAAAAABBBBBBBB
#  GC_00010 -AABBB-AABBBBBABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  GC_00011 -AABBB-AABBBBAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  GC_00012 AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  GC_00015 BBBAAABBBAAABBBAAAABBBBBBBBBBBBBBBBAAAAAAAAAAAAAAB
#  GC_00016 AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBAAA
#  GC_00017 AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBAAAAA
#  GC_00018 AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBAAAAAAA
#  GC_00019 ABABBBAAABABAAABBBBAABAAAABAAAAAAAABBBBABBBAAAAAAA
#  GC_00020 AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBAAAAAAAA
#
#
#  *** CONSENSUS MATRIX (madmapper_scores.loc.out.z_xconsensus.matrix):
#
#  GC_00008 M8 1.0
#  GC_00009 M9 1.0
#  GC_00010 MF 1.0
#  GC_00011 MG 1.0
#  GC_00012 MH 1.0
#  GC_00012 MJ 1.0
#  GC_00012 MK 1.0
#  GC_00012 ML 1.0
#  GC_00012 MI 1.0
#  GC_00013 MI 1.0
#  GC_00013 MK 1.0
#  GC_00013 MH 1.0
#  GC_00013 MJ 1.0
#  GC_00013 ML 1.0
#  GC_00014 MK 1.0
#  GC_00014 MI 1.0
#  GC_00014 MH 1.0
#  GC_00014 ML 1.0
#  GC_00014 MJ 1.0
#  GC_00015 MN 1.0
#  GC_00016 MP 1.0
#  GC_00017 MR 1.0
#  GC_00018 MS 1.0
#  GC_00019 MT 1.0
#  GC_00020 MV 1.0
#
#  NOTE THAT GC_00012, GC_00013 and GC_00014 are identical (redundant).
#  This is the reason why only GC_00012 was written into non-redundant loc file
#
#  Markers M1 - M7 were unable to generate a consensus because they did not
#  form 'COMPLETE GRAPH'
#
#
#  *** CONSENSUS EXAMPLE (madmapper_scores.loc.out.z_xconsensus.debug):
#  ........
#  ===================================================
#  MH       -AABBB-AABBBAAA-BBBAAAAAAAA-AAAAAAABBBBBBBB-BBBBBB
#  MJ       AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  MK       AA-BBBA-ABBBAAABBBB---AAAAAAAAAAAAABBBBBBBBBBBBBB-
#  ML       AA-BBBA-ABBBAAABBBB---AAAAAAAAAAAAABBBBBBBBBBBBBB-
#  MI       AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  GC_00012 AAABBBAAABBBAAABBBBAAAAAAAAAAAAAAAABBBBBBBBBBBBBBB
#  ===================================================
#
#
##################################################################################
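The transitive grouping described under CLUSTERING AND GROUPING (README_015) and the graph and node labels of the '*.group_info' file can be sketched with a plain depth-first search. This is an illustration only, not the MadMapper implementation: the names `cluster` and `describe_group` are hypothetical, and it is assumed that two markers are linked directly when their recombination value is at or below 'rec_cut' and their BIT score is at or above 'bit_cut'.

```python
def cluster(markers, pairs, rec_cut=0.2, bit_cut=100):
    """Group markers by transitive linkage.

    pairs maps (marker_a, marker_b) -> (recombination value, BIT score);
    every marker named in pairs is expected to be listed in markers.
    Returns (adjacency, groups).
    """
    adjacency = {m: set() for m in markers}
    for (a, b), (rec, bit) in pairs.items():
        if a != b and rec <= rec_cut and bit >= bit_cut:
            adjacency[a].add(b)
            adjacency[b].add(a)

    seen = set()
    groups = []
    for start in markers:
        if start in seen:
            continue
        # Iterative depth-first search from an unvisited marker.
        stack = [start]
        seen.add(start)
        group = []
        while stack:
            node = stack.pop()
            group.append(node)
            for neighbour in sorted(adjacency[node]):
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
        groups.append(sorted(group))
    return adjacency, groups


def describe_group(adjacency, group):
    """Return (graph label, {marker: node label}) for one group."""
    size = len(group)
    saturated = {m: len(adjacency[m]) == size - 1 for m in group}
    if size == 1:
        graph = "SINGLE____NODE"
    elif all(saturated.values()):
        graph = "COMPLETE_GRAPH"
    else:
        graph = "LINKED___GROUP"
    nodes = {m: "SATURATED_NODE" if saturated[m] else "DILUTED___NODE"
             for m in group}
    return graph, nodes


if __name__ == "__main__":
    markers = ["M1", "M2", "M3", "W", "X", "Y", "Z"]
    pairs = {
        ("M1", "M2"): (0.0, 126),
        ("M2", "M3"): (0.0, 126),
        ("M1", "M3"): (0.0, 96),   # BIT score below bit_cut: no direct link
        ("X", "Y"): (0.05, 200),
        ("Y", "Z"): (0.05, 200),
        ("X", "Z"): (0.10, 150),
    }
    adjacency, groups = cluster(markers, pairs)
    for group in groups:
        print(group, describe_group(adjacency, group))
```

Running the 16 iterations would amount to calling `cluster` once per recombination cutoff (0.20 down to 0.00) and recording the group of each marker at every step.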