megamerger
The sequences can be very long. The program first finds all exact word matches of length 20 (the default word size) between the two sequences. It then reduces these to a minimal set of non-overlapping matches: the matches are sorted by size, largest first, and for each match in turn any smaller match that overlaps it is removed. The result is a set of the longest ungapped alignments between the two sequences that do not overlap with each other. If the two sequences are identical in their region of overlap then there will be one region of match and no mismatches.
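The word-match and reduction steps described above can be sketched in Python as follows. This is only an illustration, not megamerger's actual implementation: the function names are hypothetical, and it assumes that a smaller match is discarded entirely when it overlaps a larger one in either sequence.

```python
from collections import defaultdict


def ungapped_matches(a, b, wordsize=20):
    """Return maximal exact matches as (start_a, start_b, length), 0-based."""
    # Index every word of length `wordsize` in sequence a.
    index = defaultdict(list)
    for i in range(len(a) - wordsize + 1):
        index[a[i:i + wordsize]].append(i)

    matches = []
    for j in range(len(b) - wordsize + 1):
        for i in index.get(b[j:j + wordsize], ()):
            # A hit whose preceding bases also match is part of a longer
            # match that starts earlier on the same diagonal; skip it.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            length = wordsize
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


def _intervals_overlap(s1, l1, s2, l2):
    return s1 < s2 + l2 and s2 < s1 + l1


def reduce_overlaps(matches):
    """Keep the largest matches; drop smaller ones that overlap a kept match.

    Assumption: overlap in either sequence is enough to drop a match.
    """
    kept = []
    for m in sorted(matches, key=lambda m: m[2], reverse=True):
        clash = any(
            _intervals_overlap(m[0], m[2], k[0], k[2])
            or _intervals_overlap(m[1], m[2], k[1], k[2])
            for k in kept
        )
        if not clash:
            kept.append(m)
    return sorted(kept)
```

With a small word size for readability, `reduce_overlaps(ungapped_matches("GATTACAGGC", "TTGATTACAT", wordsize=4))` yields a single match of length 7 starting at position 0 in the first sequence and position 2 in the second.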
It should be possible to merge sequences that are megabytes long. Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman-Wunsch algorithm but uses much more memory.
The sequences should ideally be identical in their region of overlap. If there are any mismatches between the two sequences then megamerger will still attempt to create a merged sequence, but you should check that this is what you required.
A report of the actions of megamerger is written out. Any action that requires a choice between regions of the two sequences where they have a mismatch is marked with the word WARNING!. The sequence in these regions is written out in uppercase. All other regions of the output sequence are written in lowercase.
Where there is a mismatch, the sequence chosen to supply the mismatch region in the final merged sequence is the one whose mismatch region is furthest from the start or end of that sequence.
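The selection rule can be illustrated with a short Python sketch. The function names are hypothetical, regions are given as 0-based half-open `(start, end)` pairs, and choosing the first sequence on a tie is an assumption, since this documentation does not say how ties are resolved. The `prefer` argument mirrors the `-prefer` qualifier, which always selects the first sequence.

```python
def distance_from_nearest_end(start, end, seq_length):
    """Distance from the region [start, end) to the closer end of its sequence."""
    return min(start, seq_length - end)


def choose_source(region_a, len_a, region_b, len_b, prefer=False):
    """Return 'a' or 'b': which sequence supplies the mismatch region."""
    if prefer:
        # Mirrors -prefer: the first sequence always wins.
        return "a"
    dist_a = distance_from_nearest_end(region_a[0], region_a[1], len_a)
    dist_b = distance_from_nearest_end(region_b[0], region_b[1], len_b)
    # Assumption: ties go to the first sequence.
    return "a" if dist_a >= dist_b else "b"
```

For instance, a mismatch 10 bases from the start of a 1500-base first sequence and 48 bases from the end of a 3078-base second sequence would be taken from the second sequence, because 48 is greater than 10.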
    % megamerger tembl:v00295 tembl:v00296
    Merge two large overlapping nucleic acid sequences
    Word size [20]:
    output sequence [v00295.merged]:
    Output file [v00295.megamerger]: report
    Standard (Mandatory) qualifiers:
      [-asequence]  sequence  Nucleotide sequence filename and optional format, or reference (input USA)
      [-bsequence]  sequence  Nucleotide sequence filename and optional format, or reference (input USA)
       -wordsize    integer   [20] Word size (Integer 2 or more)
      [-outseq]     seqout    [
Qualifier | Description | Allowed values | Default |
---|---|---|---|
Standard (Mandatory) qualifiers | | | |
[-asequence] (Parameter 1) | Nucleotide sequence filename and optional format, or reference (input USA) | Readable sequence | Required |
[-bsequence] (Parameter 2) | Nucleotide sequence filename and optional format, or reference (input USA) | Readable sequence | Required |
-wordsize | Word size | Integer 2 or more | 20 |
[-outseq] (Parameter 3) | Sequence filename and optional format (output USA) | Writeable sequence | <*>.format |
[-outfile] (Parameter 4) | Output file name | Output file | <*>.megamerger |
Additional (Optional) qualifiers | | | |
-prefer | When a mismatch between the two sequences is discovered, one or other of the two sequences must be used to create the merged sequence over this mismatch region. The default action is to create the merged sequence using the sequence where the mismatch is closest to that sequence's centre. If this option is used, then the first sequence (seqa) will always be used in preference to the other sequence when there is a mismatch. | Boolean value Yes/No | No |
Advanced (Unprompted) qualifiers | | | |
(none) | | | |
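For scripted use, the qualifiers in the table can be supplied on the command line so that no prompts are needed. The Python sketch below only assembles the argument list from the qualifiers listed above. The helper names are hypothetical, and it assumes that `megamerger` is on the `PATH` and that giving `-prefer` on its own switches the option on.

```python
import subprocess


def megamerger_command(asequence, bsequence, outseq, outfile,
                       wordsize=20, prefer=False):
    """Build an argument list using only the qualifiers documented above."""
    if wordsize < 2:
        raise ValueError("wordsize must be an integer of 2 or more")
    cmd = [
        "megamerger",
        "-asequence", asequence,
        "-bsequence", bsequence,
        "-wordsize", str(wordsize),
        "-outseq", outseq,
        "-outfile", outfile,
    ]
    if prefer:
        # Assumption: the bare flag enables the boolean qualifier.
        cmd.append("-prefer")
    return cmd


def run_megamerger(*args, **kwargs):
    """Run megamerger; raises CalledProcessError on a non-zero exit status."""
    return subprocess.run(megamerger_command(*args, **kwargs), check=True)
```

Calling `run_megamerger("tembl:v00295", "tembl:v00296", "v00295.merged", "report")` would correspond to the interactive session shown earlier, with the default word size of 20.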
ID V00295; SV 1; linear; genomic DNA; STD; PRO; 1500 BP. XX AC V00295; XX DT 09-JUN-1982 (Rel. 01, Created) DT 07-JUL-1995 (Rel. 44, Last updated, Version 4) XX DE E. coli lacY gene (codes for lactose permease). XX KW membrane protein. XX OS Escherichia coli OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; OC Enterobacteriaceae; Escherichia. XX RN [1] RP 1-1500 RX DOI; 10.1038/283541a0. RX PUBMED; 6444453. RA Buechel D.E., Gronenborn B., Mueller-Hill B.; RT "Sequence of the lactose permease gene"; RL Nature 283(5747):541-545(1980). XX CC lacZ is a beta-galactosidase and lacA is transacetylase. CC KST ECO.LACY XX FH Key Location/Qualifiers FH FT source 1..1500 FT /organism="Escherichia coli" FT /mol_type="genomic DNA" FT /db_xref="taxon:562" FT CDS <1..54 FT /codon_start=1 FT /transl_table=11 FT /note="reading frame (lacZ)" FT /db_xref="GOA:P00722" FT /db_xref="PDB:1BGL" FT /db_xref="PDB:1BGM" FT /db_xref="PDB:1DP0" FT /db_xref="PDB:1F49" FT /db_xref="PDB:1F4A" FT /db_xref="PDB:1F4H" FT /db_xref="PDB:1GHO" FT /db_xref="PDB:1HN1" FT /db_xref="PDB:1JYN" FT /db_xref="PDB:1JYV" FT /db_xref="PDB:1JYW" FT /db_xref="PDB:1JYX" FT /db_xref="PDB:1JYY" [Part of this file has been deleted for brevity] FT /db_xref="PDB:2CFP" FT /db_xref="PDB:2CFQ" FT /db_xref="UniProtKB/Swiss-Prot:P02920" FT /protein_id="CAA23571.1" FT /translation="MYYLKNTNFWMFGLFFFFYFFIMGAYFPFFPIWLHDINHISKSDT FT GIIFAAISLFSLLFQPLFGLLSDKLGLRKYLLWIITGMLVMFAPFFIFIFGPLLQYNIL FT VGSIVGGIYLGFCFNAGAPAVEAFIEKVSRRSNFEFGRARMFGCVGWALCASIVGIMFT FT INNQFVFWLGSGCALILAVLLFFAKTDAPSSATVANAVGANHSAFSLKLALELFRQPKL FT WFLSLYVIGVSCTYDVFDQQFANFFTSFFATGEQGTRVFGYVTTMGELLNASIMFFAPL FT IINRIGGKNALLLAGTIMSVRIIGSSFATSALEVVILKTLHMFEVPFLLVGCFKYITSQ FT FEVRFSATIYLVCFCFFKQLAMIFMSVLAGNMYESIGFQGAYLVLGLVALGFTLISVFT FT LSGPGPLSLLRRQVNEVA" FT CDS 1423..>1500 FT /transl_table=11 FT /note="reading frame (lacA)" FT /db_xref="GOA:P07464" FT /db_xref="PDB:1KQA" FT /db_xref="PDB:1KRR" FT /db_xref="PDB:1KRU" FT /db_xref="PDB:1KRV" FT 
/db_xref="UniProtKB/Swiss-Prot:P07464" FT /protein_id="CAA23572.1" FT /translation="MNMPMTERIRAGKLFTDMCEGLPEKR" XX SQ Sequence 1500 BP; 315 A; 342 C; 357 G; 486 T; 0 other; ttccagctga gcgccggtcg ctaccattac cagttggtct ggtgtcaaaa ataataataa 60 ccgggcaggc catgtctgcc cgtatttcgc gtaaggaaat ccattatgta ctatttaaaa 120 aacacaaact tttggatgtt cggtttattc tttttctttt acttttttat catgggagcc 180 tacttcccgt ttttcccgat ttggctacat gacatcaacc atatcagcaa aagtgatacg 240 ggtattattt ttgccgctat ttctctgttc tcgctattat tccaaccgct gtttggtctg 300 ctttctgaca aactcgggct gcgcaaatac ctgctgtgga ttattaccgg catgttagtg 360 atgtttgcgc cgttctttat ttttatcttc gggccactgt tacaatacaa cattttagta 420 ggatcgattg ttggtggtat ttatctaggc ttttgtttta acgccggtgc gccagcagta 480 gaggcattta ttgagaaagt cagccgtcgc agtaatttcg aatttggtcg cgcgcggatg 540 tttggctgtg ttggctgggc gctgtgtgcc tcgattgtcg gcatcatgtt caccatcaat 600 aatcagtttg ttttctggct gggctctggc tgtgcactca tcctcgccgt tttactcttt 660 ttcgccaaaa cggatgcgcc ctcttctgcc acggttgcca atgcggtagg tgccaaccat 720 tcggcattta gccttaagct ggcactggaa ctgttcagac agccaaaact gtggtttttg 780 tcactgtatg ttattggcgt ttcctgcacc tacgatgttt ttgaccaaca gtttgctaat 840 ttctttactt cgttctttgc taccggtgaa cagggtacgc gggtatttgg ctacgtaacg 900 acaatgggcg aattacttaa cgcctcgatt atgttctttg cgccactgat cattaatcgc 960 atcggtggga aaaacgccct gctgctggct ggcactatta tgtctgtacg tattattggc 1020 tcatcgttcg ccacctcagc gctggaagtg gttattctga aaacgctgca tatgtttgaa 1080 gtaccgttcc tgctggtggg ctgctttaaa tatattacca gccagtttga agtgcgtttt 1140 tcagcgacga tttatctggt ctgtttctgc ttctttaagc aactggcgat gatttttatg 1200 tctgtactgg cgggcaatat gtatgaaagc atcggtttcc agggcgctta tctggtgctg 1260 ggtctggtgg cgctgggctt caccttaatt tccgtgttca cgcttagcgg ccccggcccg 1320 ctttccctgc tgcgtcgtca ggtgaatgaa gtcgcttaag caatcaatgt cggatgcggc 1380 gcgacgctta tccgaccaac atatcataac ggagtgatcg cattgaacat gccaatgacc 1440 gaaagaataa gagcaggcaa gctatttacc gatatgtgcg aaggcttacc ggaaaaaaga 1500 // |
ID V00296; SV 1; linear; genomic DNA; STD; PRO; 3078 BP. XX AC V00296; XX DT 13-JUL-1983 (Rel. 03, Created) DT 18-APR-2005 (Rel. 83, Last updated, Version 5) XX DE E. coli gene lacZ coding for beta-galactosidase (EC 3.2.1.23). XX KW galactosidase. XX OS Escherichia coli OC Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; OC Enterobacteriaceae; Escherichia. XX RN [1] RP 1-3078 RX PUBMED; 6313347. RA Kalnins A., Otto K., Ruether U., Mueller-Hill B.; RT "Sequence of the lacZ gene of Escherichia coli"; RL EMBO J. 2(4):593-597(1983). XX RN [2] RX PUBMED; 3038536. RA Zell R., Fritz H.J.; RT "DNA mismatch-repair in Escherichia coli counteracting the hydrolytic RT deamination of 5-methyl-cytosine residues"; RL EMBO J. 6(6):1809-1815(1987). XX CC Data kindly reviewed (18-MAY-1983) by U. Ruether XX FH Key Location/Qualifiers FH FT source 1..3078 FT /organism="Escherichia coli" FT /mol_type="genomic DNA" FT /db_xref="taxon:562" FT CDS <1..3072 FT /transl_table=11 FT /note="galactosidase" FT /db_xref="GOA:P00722" FT /db_xref="PDB:1BGL" FT /db_xref="PDB:1BGM" FT /db_xref="PDB:1DP0" FT /db_xref="PDB:1F49" FT /db_xref="PDB:1F4A" FT /db_xref="PDB:1F4H" FT /db_xref="PDB:1GHO" FT /db_xref="PDB:1HN1" FT /db_xref="PDB:1JYN" [Part of this file has been deleted for brevity] gaggcccgca ccgatcgccc ttcccaacag ttgcgcagcc tgaatggcga atggcgcttt 180 gcctggtttc cggcaccaga agcggtgccg gaaagctggc tggagtgcga tcttcctgag 240 gccgatactg tcgtcgtccc ctcaaactgg cagatgcacg gttacgatgc gcccatctac 300 accaacgtaa cctatcccat tacggtcaat ccgccgtttg ttcccacgga gaatccgacg 360 ggttgttact cgctcacatt taatgttgat gaaagctggc tacaggaagg ccagacgcga 420 attatttttg atggcgttaa ctcggcgttt catctgtggt gcaacgggcg ctgggtcggt 480 tacggccagg acagtcgttt gccgtctgaa tttgacctga gcgcattttt acgcgccgga 540 gaaaaccgcc tcgcggtgat ggtgctgcgt tggagtgacg gcagttatct ggaagatcag 600 gatatgtggc ggatgagcgg cattttccgt gacgtctcgt tgctgcataa accgactaca 660 caaatcagcg atttccatgt tgccactcgc tttaatgatg atttcagccg cgctgtactg 720 
gaggctgaag ttcagatgtg cggcgagttg cgtgactacc tacgggtaac agtttcttta 780 tggcagggtg aaacgcaggt cgccagcggc accgcgcctt tcggcggtga aattatcgat 840 gagcgtggtg gttatgccga tcgcgtcaca ctacgtctga acgtcgaaaa cccgaaactg 900 tggagcgccg aaatcccgaa tctctatcgt gcggtggttg aactgcacac cgccgacggc 960 acgctgattg aagcagaagc ctgcgatgtc ggtttccgcg aggtgcggat tgaaaatggt 1020 ctgctgctgc tgaacggcaa gccgttgctg attcgaggcg ttaaccgtca cgagcatcat 1080 cctctgcatg gtcaggtcat ggatgagcag acgatggtgc aggatatcct gctgatgaag 1140 cagaacaact ttaacgccgt gcgctgttcg cattatccga accatccgct gtggtacacg 1200 ctgtgcgacc gctacggcct gtatgtggtg gatgaagcca atattgaaac ccacggcatg 1260 gtgccaatga atcgtctgac cgatgatccg cgctggctac cggcgatgag cgaacgcgta 1320 acgcgaatgg tgcagcgcga tcgtaatcac ccgagtgtga tcatctggtc gctggggaat 1380 gaatcaggcc acggcgctaa tcacgacgcg ctgtatcgct ggatcaaatc tgtcgatcct 1440 tcccgcccgg tgcagtatga aggcggcgga gccgacacca cggccaccga tattatttgc 1500 ccgatgtacg cgcgcgtgga tgaagaccag cccttcccgg ctgtgccgaa atggtccatc 1560 aaaaaatggc tttcgctacc tggagagacg cgcccgctga tcctttgcga atacgcccac 1620 gcgatgggta acagtcttgg cggtttcgct aaatactggc aggcgtttcg tcagtatccc 1680 cgtttacagg gcggcttcgt ctgggactgg gtggatcagt cgctgattaa atatgatgaa 1740 aacggcaacc cgtggtcggc ttacggcggt gattttggcg atacgccgaa cgatcgccag 1800 ttctgtatga acggtctggt ctttgccgac cgcacgccgc atccagcgct gacggaagca 1860 aaacaccagc agcagttttt ccagttccgt ttatccgggc aaaccatcga agtgaccagc 1920 gaatacctgt tccgtcatag cgataacgag ctcctgcact ggatggtggc gctggatggt 1980 aagccgctgg caagcggtga agtgcctctg gatgtcgctc cacaaggtaa acagttgatt 2040 gaactgcctg aactaccgca gccggagagc gccgggcaac tctggctcac agtacgcgta 2100 gtgcaaccga acgcgaccgc atggtcagaa gccgggcaca tcagcgcctg gcagcagtgg 2160 cgtctggcgg aaaacctcag tgtgacgctc cccgccgcgt cccacgccat cccgcatctg 2220 accaccagcg aaatggattt ttgcatcgag ctgggtaata agcgttggca atttaaccgc 2280 cagtcaggct ttctttcaca gatgtggatt ggcgataaaa aacaactgct gacgccgctg 2340 cgcgatcagt tcacccgtgc accgctggat aacgacattg gcgtaagtga agcgacccgc 2400 attgacccta 
acgcctgggt cgaacgctgg aaggcggcgg gccattacca ggccgaagca 2460 gcgttgttgc agtgcacggc agatacactt gctgatgcgg tgctgattac gaccgctcac 2520 gcgtggcagc atcaggggaa aaccttattt atcagccgga aaacctaccg gattgatggt 2580 agtggtcaaa tggcgattac cgttgatgtt gaagtggcga gcgatacacc gcatccggcg 2640 cggattggcc tgaactgcca gctggcgcag gtagcagagc gggtaaactg gctcggatta 2700 gggccgcaag aaaactatcc cgaccgcctt actgccgcct gttttgaccg ctgggatctg 2760 ccattgtcag acatgtatac cccgtacgtc ttcccgagcg aaaacggtct gcgctgcggg 2820 acgcgcgaat tgaattatgg cccacaccag tggcgcggcg acttccagtt caacatcagc 2880 cgctacagtc aacagcaact gatggaaacc agccatcgcc atctgctgca cgcggaagaa 2940 ggcacatggc tgaatatcga cggtttccat atggggattg gtggcgacga ctcctggagc 3000 ccgtcagtat cggcggaatt ccagctgagc gccggtcgct accattacca gttggtctgg 3060 tgtcaaaaat aataataa 3078 // |
    # Report of megamerger of: V00295 and V00296

    V00295 overlap starts at 1
    V00296 overlap starts at 3019

    Using V00296 1-3018 as the initial sequence

    Matching region V00295 1-60 : V00296 3019-3078
    Length of match: 60

    V00295 overlap ends at 60
    V00296 overlap ends at 3078

    Using V00295 61-1500 as the final sequence
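The coordinates in this report can be used as a sanity check on the merged sequence length. The Python sketch below is a hypothetical helper that handles only the line types appearing in the example report (the "Using ... as the initial/final sequence" lines and "Matching region" lines). Reports containing mismatches have further lines that it does not interpret.

```python
import re

USING = re.compile(r"Using (\S+) (\d+)-(\d+) as the (initial|final) sequence")
MATCH = re.compile(r"Matching region (\S+) (\d+)-(\d+) : (\S+) (\d+)-(\d+)")

EXAMPLE_REPORT = """\
# Report of megamerger of: V00295 and V00296
V00295 overlap starts at 1
V00296 overlap starts at 3019
Using V00296 1-3018 as the initial sequence
Matching region V00295 1-60 : V00296 3019-3078
Length of match: 60
V00295 overlap ends at 60
V00296 overlap ends at 3078
Using V00295 61-1500 as the final sequence
"""


def expected_merged_length(report):
    """Sum the lengths of the initial, matching and final regions."""
    total = 0
    for _name, start, end, _role in USING.findall(report):
        total += int(end) - int(start) + 1
    for _a, a_start, a_end, _b, _b_start, _b_end in MATCH.findall(report):
        # Each matching region is counted once, using the first sequence's coordinates.
        total += int(a_end) - int(a_start) + 1
    return total


# 3018 (V00296 1-3018) + 60 (match) + 1440 (V00295 61-1500) = 4518
print(expected_merged_length(EXAMPLE_REPORT))
```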
>V00295 V00295.1 E. coli lacY gene (codes for lactose permease). accatgattacggattcactggccgtcgttttacaacgtcgtgactgggaaaaccctggc gttacccaacttaatcgccttgcagcacatccccctttcgccagctggcgtaatagcgaa gaggcccgcaccgatcgcccttcccaacagttgcgcagcctgaatggcgaatggcgcttt gcctggtttccggcaccagaagcggtgccggaaagctggctggagtgcgatcttcctgag gccgatactgtcgtcgtcccctcaaactggcagatgcacggttacgatgcgcccatctac accaacgtaacctatcccattacggtcaatccgccgtttgttcccacggagaatccgacg ggttgttactcgctcacatttaatgttgatgaaagctggctacaggaaggccagacgcga attatttttgatggcgttaactcggcgtttcatctgtggtgcaacgggcgctgggtcggt tacggccaggacagtcgtttgccgtctgaatttgacctgagcgcatttttacgcgccgga gaaaaccgcctcgcggtgatggtgctgcgttggagtgacggcagttatctggaagatcag gatatgtggcggatgagcggcattttccgtgacgtctcgttgctgcataaaccgactaca caaatcagcgatttccatgttgccactcgctttaatgatgatttcagccgcgctgtactg gaggctgaagttcagatgtgcggcgagttgcgtgactacctacgggtaacagtttcttta tggcagggtgaaacgcaggtcgccagcggcaccgcgcctttcggcggtgaaattatcgat gagcgtggtggttatgccgatcgcgtcacactacgtctgaacgtcgaaaacccgaaactg tggagcgccgaaatcccgaatctctatcgtgcggtggttgaactgcacaccgccgacggc acgctgattgaagcagaagcctgcgatgtcggtttccgcgaggtgcggattgaaaatggt ctgctgctgctgaacggcaagccgttgctgattcgaggcgttaaccgtcacgagcatcat cctctgcatggtcaggtcatggatgagcagacgatggtgcaggatatcctgctgatgaag cagaacaactttaacgccgtgcgctgttcgcattatccgaaccatccgctgtggtacacg ctgtgcgaccgctacggcctgtatgtggtggatgaagccaatattgaaacccacggcatg gtgccaatgaatcgtctgaccgatgatccgcgctggctaccggcgatgagcgaacgcgta acgcgaatggtgcagcgcgatcgtaatcacccgagtgtgatcatctggtcgctggggaat gaatcaggccacggcgctaatcacgacgcgctgtatcgctggatcaaatctgtcgatcct tcccgcccggtgcagtatgaaggcggcggagccgacaccacggccaccgatattatttgc ccgatgtacgcgcgcgtggatgaagaccagcccttcccggctgtgccgaaatggtccatc aaaaaatggctttcgctacctggagagacgcgcccgctgatcctttgcgaatacgcccac gcgatgggtaacagtcttggcggtttcgctaaatactggcaggcgtttcgtcagtatccc cgtttacagggcggcttcgtctgggactgggtggatcagtcgctgattaaatatgatgaa aacggcaacccgtggtcggcttacggcggtgattttggcgatacgccgaacgatcgccag ttctgtatgaacggtctggtctttgccgaccgcacgccgcatccagcgctgacggaagca 
aaacaccagcagcagtttttccagttccgtttatccgggcaaaccatcgaagtgaccagc gaatacctgttccgtcatagcgataacgagctcctgcactggatggtggcgctggatggt aagccgctggcaagcggtgaagtgcctctggatgtcgctccacaaggtaaacagttgatt gaactgcctgaactaccgcagccggagagcgccgggcaactctggctcacagtacgcgta gtgcaaccgaacgcgaccgcatggtcagaagccgggcacatcagcgcctggcagcagtgg cgtctggcggaaaacctcagtgtgacgctccccgccgcgtcccacgccatcccgcatctg accaccagcgaaatggatttttgcatcgagctgggtaataagcgttggcaatttaaccgc cagtcaggctttctttcacagatgtggattggcgataaaaaacaactgctgacgccgctg cgcgatcagttcacccgtgcaccgctggataacgacattggcgtaagtgaagcgacccgc attgaccctaacgcctgggtcgaacgctggaaggcggcgggccattaccaggccgaagca gcgttgttgcagtgcacggcagatacacttgctgatgcggtgctgattacgaccgctcac gcgtggcagcatcaggggaaaaccttatttatcagccggaaaacctaccggattgatggt agtggtcaaatggcgattaccgttgatgttgaagtggcgagcgatacaccgcatccggcg cggattggcctgaactgccagctggcgcaggtagcagagcgggtaaactggctcggatta gggccgcaagaaaactatcccgaccgccttactgccgcctgttttgaccgctgggatctg ccattgtcagacatgtataccccgtacgtcttcccgagcgaaaacggtctgcgctgcggg acgcgcgaattgaattatggcccacaccagtggcgcggcgacttccagttcaacatcagc cgctacagtcaacagcaactgatggaaaccagccatcgccatctgctgcacgcggaagaa ggcacatggctgaatatcgacggtttccatatggggattggtggcgacgactcctggagc ccgtcagtatcggcggaattccagctgagcgccggtcgctaccattaccagttggtctgg tgtcaaaaataataataaccgggcaggccatgtctgcccgtatttcgcgtaaggaaatcc attatgtactatttaaaaaacacaaacttttggatgttcggtttattctttttcttttac ttttttatcatgggagcctacttcccgtttttcccgatttggctacatgacatcaaccat atcagcaaaagtgatacgggtattatttttgccgctatttctctgttctcgctattattc caaccgctgtttggtctgctttctgacaaactcgggctgcgcaaatacctgctgtggatt attaccggcatgttagtgatgtttgcgccgttctttatttttatcttcgggccactgtta caatacaacattttagtaggatcgattgttggtggtatttatctaggcttttgttttaac gccggtgcgccagcagtagaggcatttattgagaaagtcagccgtcgcagtaatttcgaa tttggtcgcgcgcggatgtttggctgtgttggctgggcgctgtgtgcctcgattgtcggc atcatgttcaccatcaataatcagtttgttttctggctgggctctggctgtgcactcatc ctcgccgttttactctttttcgccaaaacggatgcgccctcttctgccacggttgccaat gcggtaggtgccaaccattcggcatttagccttaagctggcactggaactgttcagacag 
ccaaaactgtggtttttgtcactgtatgttattggcgtttcctgcacctacgatgttttt gaccaacagtttgctaatttctttacttcgttctttgctaccggtgaacagggtacgcgg gtatttggctacgtaacgacaatgggcgaattacttaacgcctcgattatgttctttgcg ccactgatcattaatcgcatcggtgggaaaaacgccctgctgctggctggcactattatg tctgtacgtattattggctcatcgttcgccacctcagcgctggaagtggttattctgaaa acgctgcatatgtttgaagtaccgttcctgctggtgggctgctttaaatatattaccagc cagtttgaagtgcgtttttcagcgacgatttatctggtctgtttctgcttctttaagcaa ctggcgatgatttttatgtctgtactggcgggcaatatgtatgaaagcatcggtttccag ggcgcttatctggtgctgggtctggtggcgctgggcttcaccttaatttccgtgttcacg cttagcggccccggcccgctttccctgctgcgtcgtcaggtgaatgaagtcgcttaagca atcaatgtcggatgcggcgcgacgcttatccgaccaacatatcataacggagtgatcgca ttgaacatgccaatgaccgaaagaataagagcaggcaagctatttaccgatatgtgcgaa ggcttaccggaaaaaaga |
A merged sequence is written out.
Where there has been a mismatch between the two sequences, the merged sequence is written out in uppercase and the sequence whose mismatch region is furthest from the edges of the sequence is used in the merged sequence.
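Because mismatch regions are the only uppercase parts of the merged sequence, they can be located by scanning the output for uppercase runs. The following Python sketch is a hypothetical helper that takes the sequence as a plain string (with any header line and line breaks already removed) and returns 1-based inclusive coordinates.

```python
import re


def uppercase_regions(sequence):
    """Return (start, end) pairs, 1-based inclusive, for each uppercase run."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[A-Z]+", sequence)]
```

For example, `uppercase_regions("acgtACgtT")` returns `[(5, 6), (9, 9)]`, and a merged sequence with no mismatches gives an empty list.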
The name and description of the first input sequence are used for the name and description of the output sequence.
A report of the merger is written out.
Program name | Description |
---|---|
cons | Creates a consensus from multiple alignments |
merger | Merge two overlapping sequences |
Compare this with the program merger, which does a more accurate alignment of more divergent sequences using the Needleman-Wunsch algorithm but uses much more memory.
A graphical dotplot of the matches used in this merge can be displayed using the program dotpath.