5.6 years ago
Roelof
Hi there,
I've been handed a transcriptome of consensus sequences stored in a Word document. For some reason, this person also placed each contig entry in the FASTA file on a single line, with no separating character between the name and the sequence.
The resulting file looks as follows. I've tried to solve this using SeqIO in Biopython, but to no avail; my regex just isn't good enough to separate the two. Later in the file the contig names carry more elaborate descriptions, so matching a pattern on the names seems error-prone.
>rmi3_Contig1AAAATATGACAGTCTTTATTCCAGTCTATTTTAGCCAATACAGCACACACAGCTCAAGAAGTTCTTAAAATGACAGTCTAGAAATGACTACAAAGCATTTTCTTTTGCTGGAACTTTGTACAAATACAAGTAATTGTCTAAACAACTAGTTAAATATTGGCTCAAGAATCTGCCGTTCATCATAAAAAACTAGTGACAGAAAGTTGCAGTAAACCGAATAAATTTTCTGACATGCACTACCGCATCAAACGGCGCATGGATTTTATTCTGGAACAATGAGAGCATTGCACACGGCATCTGCTTCTCATGGCGTTCCTGCAAACGTTCGTTTGTGCTACAATTTCGACATTCTCATGCGTGCACTTCAAACTCTTTGTATATTATGCTTGCATATACTCTAAGGAAGCAGCATGTTCGCAGAATGATTTCTAGGAATATGTTCTGAAGGCGAGACATGACAGCGCATCGTATGCTACACACAGATCTGCTTCTGATACTGATTTACCCGGTGACCTTAGTGCTAAATCCCCTTGATATTTCTGCACTGCTCCACAAAAATGCGGGTATATTTTAGCGAGAAGTTGGATTCTGGAACAGAATTTGTGTGCACAGAAGTGTGCGTAGGAACATATGAAAAAAACTAATCTTGTGCGAATACTCGAACGAGTAATACGATATTCGAATTTGTTTCAATTAGAATTTAAATTATTGAAGATTTTGAAGAATCAAAATGAGTGAATAGCTGTGTATGAATCTGAATGTAACCTCCTGTAGAGATAGTTTGGTGGCAGTATAGAAGTATTAAGTTGTGAAAACACCTAACTAGAGGACATCTGCACTGGCACAAAACCTCGCTTCAATTTTAAATGAAATTATCACCCTCACACCGATTAATAATTTTTTTAAGTTTAAAAGCTTGTTACACCGGTTTATATGTCTGACGTATAGAAACTTTTTAAGAGTAACATACTTTACAGGTTATAACTTGCATTTTACCGAAAGTCAAATCCTGCTACTGTTCAAAGTTGTTTTCACTTCTTTTGAACCAATAAAACAAAACAAAAAAGGCACTTGCACGGGCTTTGTTTCAATTTAATAAGATAAATATTGTGCACATTCGTTAGGTGATGACAAATATGCCTATGTTTAAAATAATATTTATAATAAGTGAGATATAATATTTTATTTGATTCAAAATAATTTGACCAAAATTGCTATTCGTTTCAAATTCGCTACGAGCCTAAAATTAACTATTCGCACAAGCGTAAAAAAAAAGTTATTGTTAGATGCATATTGCTATACTGGAGATAGTAAAGCCTGCTTATAAAATATAACACATACTTCCTGGGAACAGGTATATTTTTGTGAAGATCGTGTAGCCTTATATGGTAAAATTTTGGCAAGTGCAGGCTCACATTCACGGTTATTCTTTGAAGCAAAGTAACAAATCCCAACTTGCTGAGCTTGAGCTTCTTCTTTTCTTATTGCACGTCCACATTGGGATCAGAGATGTGATTAGGGGGTTCCTAAGTTCACTTGAAAACTGTCAGGCTCTGGCCATATAATACCTGCTATGGCCAGACCTGTTATATTGTATACCTGTTATTTCCTTTGAGGCACGAAACATCAAATTAACACTAGATGTCCAATAATGCTCAATGCGCAATCTAAAAAACAAAGTTGCAAGAACACTCGCACAGCAAAACGATTTTACTAGGATACTGCGACCAATACAGTTGTCCAAATGCTTTTATACTTTTCTGATAGTTCGAGAAGCGCGCTTTAAGTAGGGAAAGGAATAACAGTAGATTGCAAGATGCGCCTATGGAATTAGGAAGTTCACTTTGCCGTATTTCTTGGTCCATTACCAGCATACCATATGTTTATAAAAAATGCATTGATAATTATGATTTTGCTTGAAAAATTATGACCTAGAAAGATGAAATGGAAGCCGCATATTGTGGGATTCCTAAGAACTGGTTTGCTA
CGATTCAAAGCATTCTGATGATTTAAAAAATGAAGTGAAAAAAGAATCACGAGAAGGACTAGATGCTTGAACATAATTATGATTTTCGAAACAAGATGCATGCATTGAAAACCCGTGAAATCATTTGTCATGATTTAACAATTTTTTTTCGCAAAATGAAAATGAAGCGTGGACATCAGGACGAGGAGACAAACAATATTCAGATAATATATGTCATCCGATGCTATCCCTGCGATAGATGCCGGCAACATCAGACTTCGGGGTGCCATATTTTGACTTAGTGGGAACGGCAGTTGGGCGGCATCCGGTGAATGCTGCATCATAACGAGATGTTTTTCCAGCAGAACAGGATGACCCCGATGGAGGCGAGCACGGCGGCTGCTGAAAAATCAACCTTGTTGGCGAATGATGGCACGCCCATGTAGTCTTGGACGTTGGCCTTGTCCCATCCGACTACCTCATTTTTGATGCGCTCGTCAAGCCATTTCTCGAGCGGCTCGTAGTACTTCTTTAAGGATGAAGCTGACATCTGCCGCGTGCCAGCCATGATCTCGAGGACATCGGGCCAAGGCTTAGAGCGACCCAGCGACAGTCCCTTCTTGAGGACGTCGCCAGCGTTCTTCTCTCCGTAAATGTCGCATTCATGGAATGGATGATGTTCGTCCACCTTCTTTGCAACTGTGCATAAATGCTCGTGAAACTGGAACTGAAGGATGAACGCCACGAAATATCTCAGGTACGGAACGTGCAAAGCCACGTGGTACTTAGCTCCGCCGTCAAAGAAGGACTCGTTGCGCTTTACCGGAGGTGACACGCCTTGGTATTTTATCCTGTATTCCCAGAACTTCTCGTTCATCTTGTCGAACGGCGTCTCGCCGGTGAATATGGTCCAGCGCCACTTGTCCAACAGGTACCCGAATGGCAAGAAGGCGATCTTGTCAAGTGCCGACATAAGCAGGAGGTCAACAGCATTGTATTTATCCGTTGGTTTCAGCAAGCTAAGCTTTCCGTAATGTGTTTTTGTGGCAACTGAAAGGGCTATCAGATCTCCGACGGCCTCATGGAAACCTTCGTTGGCTCCCTCTTGCAGCAGGACGTGCAGGTGCTTGTACTGCATGTAATACTCGATGTGGCCCATCTCGTGGTGGACAGTGCGCAGTTCCTCGACGCTGGGGTCGGTGCACATCTTTATTCTGAAGTCGTCGCCGTTGTACATGTTCCAGGCGGAGGCGTGACACTGAATCTCTCGGTCTTCGGGCTTTGTAAGGATGGACTTGCTCCAAAACTCGCTGGTCATGTTGTCCAGGCCCAGACTCGTAAAGAAGTCCTCCGCTGCGTGGAACATCTTTTGGGCATCCCATTTCTGTTCCACCATTGTCTTCGAGATATCCAAAGGTTTGTCTTCCATCGTTAGGTGAGGGTATAGTGTGCCCCACTCTTGCGCCCACATGTTCCCTAACAGATGGGCTGGTATCGTGCCATCTTCGGGCAGGCGTCCAGGATAAATCTCCCTGAGCTTCATTCTCACGTAGGCGTGCAGCTTTTTGTACAGCGGAGACAGATCTTCCCACAGTTTGTCGACGATCTCGGTCATGTTTTCCGTCTCGTAGTCACTGAGCCAGGCGCTCTTGATATTGTCGTAACCGTCCAGAGACGCTGCTTCATTGGACAGCTTGATGTAGGGAATGTAGTACTGTTTTATAGCCGGACCAACTGCGTTATGCCATGCCAGCCAAGTTTGCAGTAATTTATCGTAATTGCCAACTTCCTTCATATTCCGGGTGAGATCTGGTTCCAGAGGAAGGTCCTTATCCTTGCCAACGGTCACCTTGGTCGATCCATATATGGCGGCCATCTTCGAACTGAGACTGGTCGCATTCTCAAGCTTGTCGTCGGGAAGAGCTGCCAGGCCAATGGTAGCGACGTGCCTAAATAGTCGTTTGAGCGAATCATTTTTGAAATTGTGCCAGTCGAAACGCTTCGCCGTAATTCCAAACTGC
CGCTCCATTTTTGAGACTTCGGTGGAGACCTTATTCGACATGTTTTGGTTGTAATCGGTGATGTTGGAAGCATAGTCCCAACTAGAAGATGAGTCCACGTTGTTAATCGTTGTATATGGGTCATTCAAGCCTTCTATAAAGGCAACGCCCATTGCTTCATCTTTTATCAAGGCTGATACATTCGATAAAGTTGCCAAGTAGGTGTCGAAGTTGTCTGCGGCTGCTGTCGCGTACAGCGCGGTGGCCAGGAGAGCGACGGCCACGAAGCGATCGGCGGCCGACGATCCCGATCGAGCAGCCATGTCGGGCCGTTTCGGTGCGGGTGAAGCTCCGCAGCTGCTCTGGTTTGTTGAGGATGTTGCGCGCGCTCTTCGCTGCTCACCGACGAGAGCGCACTCCG
>rmi3_Contig2CAAGAGCGCATCTGAGCATGCGCACTGGTATGTTTGCAACCCTCTTTACTAGGCCTAGTGCATTTTAACATGGACCCAGAGGGAAGCCGTGAAAGATCCTGAAACTATTTAATTTAGTGCAAAGTTTATTGATTTAGTGTTGTTGCGAGGTGCCTGCAGTTGGCTACAAGCACATTTAGGATCCATGGACAGTACGTCCATAATTACTCAAGTGAACAGAGAAGAGGAACAACTAACAAATTTTCCTCCTGCTGATCGAGTGCCGCCCTCCAGCAGGAAGCCCCGGCAGCGCGGCTTCCTCAACACGCTGCTCTGCTGCTTCGGGAGCAACAACCAAGGCAACAACCCCGTGATTGCCGAGGAAAATGGCCAGTACTCGCCCAAGCTCCAGGGCAAGTACCTGCTGCCACCCGTGCGGCATCAGGATATTCGCAAGATATGCCTCATCATTGACCTCGATGAGACATTGGTCCATAGCTCATTCAAGCCCATCAACAATGCTGACTTCGTGGTGCCTGTAGAGATAGATGGCACGGTGCACCAGGTGTATGTCTTGAAGAGGCCTTATGTGGACGAGTTCTTGCAGCGAGTTGGCGATGCCTACGAATGTGTCTTATTCACAGCCAGCCTTGCAAAGTATGCTGACCCTGTGGCTGACCTGCTGGACAAGTGGGGTGTCTTCCGGGCACGGTTATTCCGAGAGTCTTGTGTCTTCTACCGAGGAAACTATGTCAAGGACCTTGGTAGGCTGGGCCGGGATCTGCGCAGAGTGGTCATAATAGACAACTCACCGGCCTCGTACATCTTCCATCCTGACAATGCAGTACCTGTCAACTCGTGGTTCGATGACATGTCAGACACGGAGCTGCGGGACCTGATGCCATTGTTCGACGAACTGAGCCGTGTCGAGGACGTGTACACGGTGCTGCGCAACTCCAACAACGCGGCTGGCGGTGGCGGCGGCTCTCCGGCCTTCCCTGCACCACTCCTGATGAACGGCAGCGCGGTGGCTTTGCACAACAGCGGTTCCTAGCATTCCGCACAGTGCGGCTTGTGCAATAGCCCCTTCTCCGCCGGCAGTACAAAAGCGCTTACGGGTCCCGTGCTAGTCTCGCCGGCCTACTTAACGTCGGAGGGGGGCTGCCCCTTGTGCCTTGTCTCTTCCGCTCTGGACGAGAGTTTGTATAATAACGGTGTTCCATAATCTCGCCTGTATCATAGATTAAAGACGACTATTTCAGCCTGCAAAA
Does anyone know how to convert this into the following?
>rmi3_Contig1
AAAATATGACAGTCTTTATTCCAGTCTATTTTAGCCAATACAGCACACACAGCTCAAGAAGTTCTTAAAATGACAGTCTAGAAATGACTACAAAGCATTTTCTTTTGCTGGAACTTTGTACAAATACAAGTAATTGTCTAAACAACTAGTTAAATATTGGCTCAAGAATCTGCCGTTCATCATAAAAAACTAGTGACAGAAAGTTGCAGTAAACCGAATAAATTTTCTGACATGCACTACCGCATCAAACGGCGCATGGATTTTATTCTGGAACAATGAGAGCATTGCACACGGCATCTGCTTCTCATGGCGTTCCTGCAAACGTTCGTTTGTGCTACAATTTCGACATTCTCATGCGTGCACTTCAAACTCTTTGTATATTATGCTTGCATATACTCTAAGGAAGCAGCATGTTCGCAGAATGATTTCTAGGAATATGTTCTGAAGGCGAGACATGACAGCGCATCGTATGCTACACACAGATCTGCTTCTGATACTGATTTACCCGGTGACCTTAGTGCTAAATCCCCTTGATATTTCTGCACTGCTCCACAAAAATGCGGGTATATTTTAGCGAGAAGTTGGATTCTGGAACAGAATTTGTGTGCACAGAAGTGTGCGTAGGAACATATGAAAAAAACTAATCTTGTGCGAATACTCGAACGAGTAATACGATATTCGAATTTGTTTCAATTAGAATTTAAATTATTGAAGATTTTGAAGAATCAAAATGAGTGAATAGCTGTGTATGAATCTGAATGTAACCTCCTGTAGAGATAGTTTGGTGGCAGTATAGAAGTATTAAGTTGTGAAAACACCTAACTAGAGGACATCTGCACTGGCACAAAACCTCGCTTCAATTTTAAATGAAATTATCACCCTCACACCGATTAATAATTTTTTTAAGTTTAAAAGCTTGTTACACCGGTTTATATGTCTGACGTATAGAAACTTTTTAAGAGTAACATACTTTACAGGTTATAACTTGCATTTTACCGAAAGTCAAATCCTGCTACTGTTCAAAGTTGTTTTCACTTCTTTTGAACCAATAAAACAAAACAAAAAAGGCACTTGCACGGGCTTTGTTTCAATTTAATAAGATAAATATTGTGCACATTCGTTAGGTGATGACAAATATGCCTATGTTTAAAATAATATTTATAATAAGTGAGATATAATATTTTATTTGATTCAAAATAATTTGACCAAAATTGCTATTCGTTTCAAATTCGCTACGAGCCTAAAATTAACTATTCGCACAAGCGTAAAAAAAAAGTTATTGTTAGATGCATATTGCTATACTGGAGATAGTAAAGCCTGCTTATAAAATATAACACATACTTCCTGGGAACAGGTATATTTTTGTGAAGATCGTGTAGCCTTATATGGTAAAATTTTGGCAAGTGCAGGCTCACATTCACGGTTATTCTTTGAAGCAAAGTAACAAATCCCAACTTGCTGAGCTTGAGCTTCTTCTTTTCTTATTGCACGTCCACATTGGGATCAGAGATGTGATTAGGGGGTTCCTAAGTTCACTTGAAAACTGTCAGGCTCTGGCCATATAATACCTGCTATGGCCAGACCTGTTATATTGTATACCTGTTATTTCCTTTGAGGCACGAAACATCAAATTAACACTAGATGTCCAATAATGCTCAATGCGCAATCTAAAAAACAAAGTTGCAAGAACACTCGCACAGCAAAACGATTTTACTAGGATACTGCGACCAATACAGTTGTCCAAATGCTTTTATACTTTTCTGATAGTTCGAGAAGCGCGCTTTAAGTAGGGAAAGGAATAACAGTAGATTGCAAGATGCGCCTATGGAATTAGGAAGTTCACTTTGCCGTATTTCTTGGTCCATTACCAGCATACCATATGTTTATAAAAAATGCATTGATAATTATGATTTTGCTTGAAAAATTATGACCTAGAAAGATGAAATGGAAGCCGCATATTGTGGGATTCCTAAGAACTGGTTTGCTACGATTCAAAGCAT
TCTGATGATTTAAAAAATGAAGTGAAAAAAGAATCACGAGAAGGACTAGATGCTTGAACATAATTATGATTTTCGAAACAAGATGCATGCATTGAAAACCCGTGAAATCATTTGTCATGATTTAACAATTTTTTTTCGCAAAATGAAAATGAAGCGTGGACATCAGGACGAGGAGACAAACAATATTCAGATAATATATGTCATCCGATGCTATCCCTGCGATAGATGCCGGCAACATCAGACTTCGGGGTGCCATATTTTGACTTAGTGGGAACGGCAGTTGGGCGGCATCCGGTGAATGCTGCATCATAACGAGATGTTTTTCCAGCAGAACAGGATGACCCCGATGGAGGCGAGCACGGCGGCTGCTGAAAAATCAACCTTGTTGGCGAATGATGGCACGCCCATGTAGTCTTGGACGTTGGCCTTGTCCCATCCGACTACCTCATTTTTGATGCGCTCGTCAAGCCATTTCTCGAGCGGCTCGTAGTACTTCTTTAAGGATGAAGCTGACATCTGCCGCGTGCCAGCCATGATCTCGAGGACATCGGGCCAAGGCTTAGAGCGACCCAGCGACAGTCCCTTCTTGAGGACGTCGCCAGCGTTCTTCTCTCCGTAAATGTCGCATTCATGGAATGGATGATGTTCGTCCACCTTCTTTGCAACTGTGCATAAATGCTCGTGAAACTGGAACTGAAGGATGAACGCCACGAAATATCTCAGGTACGGAACGTGCAAAGCCACGTGGTACTTAGCTCCGCCGTCAAAGAAGGACTCGTTGCGCTTTACCGGAGGTGACACGCCTTGGTATTTTATCCTGTATTCCCAGAACTTCTCGTTCATCTTGTCGAACGGCGTCTCGCCGGTGAATATGGTCCAGCGCCACTTGTCCAACAGGTACCCGAATGGCAAGAAGGCGATCTTGTCAAGTGCCGACATAAGCAGGAGGTCAACAGCATTGTATTTATCCGTTGGTTTCAGCAAGCTAAGCTTTCCGTAATGTGTTTTTGTGGCAACTGAAAGGGCTATCAGATCTCCGACGGCCTCATGGAAACCTTCGTTGGCTCCCTCTTGCAGCAGGACGTGCAGGTGCTTGTACTGCATGTAATACTCGATGTGGCCCATCTCGTGGTGGACAGTGCGCAGTTCCTCGACGCTGGGGTCGGTGCACATCTTTATTCTGAAGTCGTCGCCGTTGTACATGTTCCAGGCGGAGGCGTGACACTGAATCTCTCGGTCTTCGGGCTTTGTAAGGATGGACTTGCTCCAAAACTCGCTGGTCATGTTGTCCAGGCCCAGACTCGTAAAGAAGTCCTCCGCTGCGTGGAACATCTTTTGGGCATCCCATTTCTGTTCCACCATTGTCTTCGAGATATCCAAAGGTTTGTCTTCCATCGTTAGGTGAGGGTATAGTGTGCCCCACTCTTGCGCCCACATGTTCCCTAACAGATGGGCTGGTATCGTGCCATCTTCGGGCAGGCGTCCAGGATAAATCTCCCTGAGCTTCATTCTCACGTAGGCGTGCAGCTTTTTGTACAGCGGAGACAGATCTTCCCACAGTTTGTCGACGATCTCGGTCATGTTTTCCGTCTCGTAGTCACTGAGCCAGGCGCTCTTGATATTGTCGTAACCGTCCAGAGACGCTGCTTCATTGGACAGCTTGATGTAGGGAATGTAGTACTGTTTTATAGCCGGACCAACTGCGTTATGCCATGCCAGCCAAGTTTGCAGTAATTTATCGTAATTGCCAACTTCCTTCATATTCCGGGTGAGATCTGGTTCCAGAGGAAGGTCCTTATCCTTGCCAACGGTCACCTTGGTCGATCCATATATGGCGGCCATCTTCGAACTGAGACTGGTCGCATTCTCAAGCTTGTCGTCGGGAAGAGCTGCCAGGCCAATGGTAGCGACGTGCCTAAATAGTCGTTTGAGCGAATCATTTTTGAAATTGTGCCAGTCGAAACGCTTCGCCGTAATTCCAAACTGCCGCTCCATTTTTG
AGACTTCGGTGGAGACCTTATTCGACATGTTTTGGTTGTAATCGGTGATGTTGGAAGCATAGTCCCAACTAGAAGATGAGTCCACGTTGTTAATCGTTGTATATGGGTCATTCAAGCCTTCTATAAAGGCAACGCCCATTGCTTCATCTTTTATCAAGGCTGATACATTCGATAAAGTTGCCAAGTAGGTGTCGAAGTTGTCTGCGGCTGCTGTCGCGTACAGCGCGGTGGCCAGGAGAGCGACGGCCACGAAGCGATCGGCGGCCGACGATCCCGATCGAGCAGCCATGTCGGGCCGTTTCGGTGCGGGTGAAGCTCCGCAGCTGCTCTGGTTTGTTGAGGATGTTGCGCGCGCTCTTCGCTGCTCACCGACGAGAGCGCACTCCG
>rmi3_Contig2
CAAGAGCGCATCTGAGCATGCGCACTGGTATGTTTGCAACCCTCTTTACTAGGCCTAGTGCATTTTAACATGGACCCAGAGGGAAGCCGTGAAAGATCCTGAAACTATTTAATTTAGTGCAAAGTTTATTGATTTAGTGTTGTTGCGAGGTGCCTGCAGTTGGCTACAAGCACATTTAGGATCCATGGACAGTACGTCCATAATTACTCAAGTGAACAGAGAAGAGGAACAACTAACAAATTTTCCTCCTGCTGATCGAGTGCCGCCCTCCAGCAGGAAGCCCCGGCAGCGCGGCTTCCTCAACACGCTGCTCTGCTGCTTCGGGAGCAACAACCAAGGCAACAACCCCGTGATTGCCGAGGAAAATGGCCAGTACTCGCCCAAGCTCCAGGGCAAGTACCTGCTGCCACCCGTGCGGCATCAGGATATTCGCAAGATATGCCTCATCATTGACCTCGATGAGACATTGGTCCATAGCTCATTCAAGCCCATCAACAATGCTGACTTCGTGGTGCCTGTAGAGATAGATGGCACGGTGCACCAGGTGTATGTCTTGAAGAGGCCTTATGTGGACGAGTTCTTGCAGCGAGTTGGCGATGCCTACGAATGTGTCTTATTCACAGCCAGCCTTGCAAAGTATGCTGACCCTGTGGCTGACCTGCTGGACAAGTGGGGTGTCTTCCGGGCACGGTTATTCCGAGAGTCTTGTGTCTTCTACCGAGGAAACTATGTCAAGGACCTTGGTAGGCTGGGCCGGGATCTGCGCAGAGTGGTCATAATAGACAACTCACCGGCCTCGTACATCTTCCATCCTGACAATGCAGTACCTGTCAACTCGTGGTTCGATGACATGTCAGACACGGAGCTGCGGGACCTGATGCCATTGTTCGACGAACTGAGCCGTGTCGAGGACGTGTACACGGTGCTGCGCAACTCCAACAACGCGGCTGGCGGTGGCGGCGGCTCTCCGGCCTTCCCTGCACCACTCCTGATGAACGGCAGCGCGGTGGCTTTGCACAACAGCGGTTCCTAGCATTCCGCACAGTGCGGCTTGTGCAATAGCCCCTTCTCCGCCGGCAGTACAAAAGCGCTTACGGGTCCCGTGCTAGTCTCGCCGGCCTACTTAACGTCGGAGGGGGGCTGCCCCTTGTGCCTTGTCTCTTCCGCTCTGGACGAGAGTTTGTATAATAACGGTGTTCCATAATCTCGCCTGTATCATAGATTAAAGACGACTATTTCAGCCTGCAAAA
Hopelessly stuck R
try:
or
with bash: the regex works only if the name contains a six-letter word ("Contig" in this example) that is preceded by an underscore and followed by a single digit (the contig number in this example), so it probably works with the OP's data only:
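As a rough Python sketch of the same name-pattern rule (not the bash command itself), the snippet below splits each line that starts with `>` right after the contig number and passes all other lines through unchanged. The function name `split_fused_headers` and the exact pattern are assumptions for illustration. It uses `\d+` rather than a single digit, so a name such as Contig10 would also be split after the full number.

```python
import re

# Assumed header shape: '>' + prefix + '_' + six letters + digits,
# e.g. '>rmi3_Contig12'. Everything after the digits is treated as sequence.
HEADER_RE = re.compile(r"^(>[^_]*_[A-Za-z]{6}\d+)(.*)$")


def split_fused_headers(lines):
    """Yield lines, splitting fused '>nameSEQUENCE' lines into two lines."""
    for line in lines:
        line = line.rstrip("\n")
        match = HEADER_RE.match(line)
        if match:
            yield match.group(1)          # the header, e.g. '>rmi3_Contig1'
            if match.group(2):
                yield match.group(2)      # the sequence that was fused to it
        else:
            yield line                    # continuation lines are left as-is


# Shortened stand-in for the data in the question.
example = [">rmi3_Contig1AAAATATGACAG", "CGATTCAAAGC", ">rmi3_Contig2CAAGAGCG"]
fixed = list(split_fused_headers(example))
print("\n".join(fixed))
```

To process a real file, the same generator can be fed an open file handle and its output written line by line to a new file.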
PS: I removed part of the Contig1 sequence to stay within the 5000-character limit.
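If names later in the file do not follow that pattern, an alternative sketch is to split each header line at the start of its trailing run of nucleotide letters instead of matching the name. This assumes the sequences contain only A, C, G, T and N, and that no name ends in one of those letters (a name ending in "A", for instance, would lose that letter to the sequence). `split_on_trailing_sequence` is a hypothetical helper name.

```python
import re

# Matches the run of nucleotide letters that reaches the end of the line.
TRAILING_SEQ_RE = re.compile(r"[ACGTNacgtn]+$")


def split_on_trailing_sequence(line):
    """Return [header, sequence] for a fused header line, else [line]."""
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return [line]                     # plain sequence line, keep as-is
    match = TRAILING_SEQ_RE.search(line)
    # No trailing sequence, or nothing but '>' would remain as the name.
    if match is None or match.start() <= 1:
        return [line]
    return [line[:match.start()], match.group(0)]


parts = split_on_trailing_sequence(">rmi3_Contig1AAAATATGACAG")
print(parts)
```

Because the split point depends only on the sequence alphabet, descriptive names are handled as long as they end in a character that is not a nucleotide letter, such as a digit.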