Hi, I have two files: one contains the protein sequences of a group of genes, and the other contains their corresponding nucleotide sequences.
>maker-scaffold10x_338_pilon-snap-gene-0.71-mRNA-1
MHLKNGDPKPTIKPNQCTLFGFRFCPYVDRVRMVLQYYNVPHDNVWIHLYSKPDWYLELY
PVGKVPLLITKEGKTIVESDAIIRYLDETIGNKSLMSLCGEAEFERAGKLASKLMAQSHG
ILFGASVAEANASAYRDVCQEINDTIKGPYLLGDKLTLADFLLFSHVNHFEPIMARLDGL
APSDVHDLKATDQYRTKWPRLTTFLDVMRRLPCVLTVREPSQKLALFAETYRQGQPNPDL
>augustus_masked-scf7180006947290-processed-gene-0.5-mRNA-1
MSEIRSLNIFDANSQNSSEFRRNIPDFLRPYECYRCVIGKKKPDDVEYICRYSLSCLGDC
AKEKDYARYLEMKPCIFLQVNKVYGWIPDIVGENLLVKCFGKVGLIKIILNSITPEIVFN
YFGIKYDKVLINLQDKPEWFLKMYPEGKVPFIIDKQRQLGDSEIIIRDYDSKNNNKLITA
CGEEKFSETKDLISSFFGLCYTILFKDNKISDENADLFLKALEKVEAKIVGPFMWGDQLS
LADVILFTHLNMFECSLSRIEGIHPDQVKDGYPNAAREASFVKIPAYLKQMRNHSAVKDV
YVHPNDISKYAVGLRIGKPNPEGDN
>DILT_0000424901-mRNA-1
MGWVLGGDGSFLPTGCANHGDPEPSVNPENVTLYDMQFCPYCQRVRYTLDYHKIPYDRIL
IDLMSKPSWYLKMYPVGKVPLLLYRGKTMAESDVIMKYCDQMKGAKASLLSVCGEEGFKR
ALNLTSSVSLLLIALLRYKLLFSPDVTRADADSLKAALSNLDKAIQGPYLMDLLPFLTFE
GKELSLADLALFPFLHAWDLLISRLKDVGDDSDESAEPVAPRWPNVLKYCQLMNQKPFIM
KTAFRDDEFSKYMDTRLQAARP
>MS3_04642.1
MHLKRSDPKPLIDPNRLTLIGFRFCPYVDRVRLILSYYKIDYDLINVSLASKPEWFLKMY
PIGKVPLLLLPNEQKLPESDEIIRHIDKLYGSETLLSHCGIEEFEKVKELITGISRPSYM
IMCVQEINLCDVSLYRAACNKINDAIKGPYFTGSELSLADLILFPHLHRFEVVMGRITGK
KPEEINELNINDELRKEFPKLTEFLDTMRKQSFVIDVTIPYRIHVQYAASVLSGHANPDI
E
Here are the nucleotide sequences (I have truncated the last 3 sequences):
>maker-scaffold10x_338_pilon-snap-gene-0.71-mRNA-1
ATGCATCTGAAAAACGGTGACCCAAAACCTACCATCAAGCCTAATCAATGTACTCTATTT
GGTTTTCGATTCTGTCCCTATGTGGATCGTGTCAGAATGGTACTCCAATATTACAACGTC
CCGCATGATAATGTTTGGATACATTTATACTCAAAACCGGATTGGTATCTGGAATTATAT
CCGGTCGGCAAAGTACCTCTTTTGATTACCAAAGAGGGGAAGACAATTGTGGAATCGGAT
GCGATTATACGGTATTTGGACGAAACGATCGGAAACAAGTCTCTGATGTCTTTGTGTGGT
GAAGCGGAGTTTGAGCGGGCCGGGAAATTGGCGTCTAAACTCATGGCTCAATCGCATGGT
ATTTTATTCGGCGCCAGTGTCGCGGAAGCTAATGCGTCTGCGTATCGTGACGTCTGTCAA
GAAATAAATGATACAATCAAGGGACCATACTTGTTGGGCGACAAGTTGACATTGGCCGAT
TTTCTGTTATTCTCTCATGTGAACCACTTCGAACCGATCATGGCTCGTTTAGACGGTCTA
GCACCCAGTGACGTTCATGATCTGAAAGCGACCGATCAGTACAGGACGAAATGGCCCCGG
TTGACCACCTTCTTGGATGTTATGCGTCGTTTGCCCTGTGTGCTTACCGTACGTGAGCCG
TCCCAAAAGCTTGCCCTTTTTGCGGAAACATATCGTCAAGGTCAACCAAATCCGGATCTA
TGA
>augustus_masked-scf7180006947290-processed-gene-0.5-mRNA-1
ATGAGTGAAATACGGAGTTTAAACATTTTCGATGCCAACAGCCAGAACTCA.......
>DILT_0000424901-mRNA-1
ATGGGCTGGGTATTAGGTGGCGACGGCTCCTTCTTACCCACCGGTTGTGCTAA.......
>MS3_04642.1
ATGCACCTCAAACGAAGTGACCCTAAACCACTGATTGATCCTAATC..........
After aligning my protein sequences, the output looks something like this (I have deleted the rest to save space):
CLUSTAL W (1.81) multiple sequence alignment
augustus_masked-scf7180006947290 MSEIRSLNIFDANSQNSSEFRRNIPD-FLRPYECYRCVIGKKKPDDVEYICRYSLSCLGD
DILT_0000424901-mRNA-1 ---MGWVLGGDGSFLPTGCANHGDPEPSVNPENV--------------------------
maker-scaffold10x_338_pilon-snap -----------------MHLKNGDPKPTIKPNQC--------------------------
MS3_04642.1 -----------------MHLKRSDPKPLIDPNRL--------------------------
... *. : *
augustus_masked-scf7180006947290 CAKEKDYARYLEMKPCIFLQVNKVYGWIPDIVGENLLVKCFGKVGLIKIILNSITPEIVF
DILT_0000424901-mRNA-1 --------TLYDMQFCPYCQRVR----------------------------------YTL
maker-scaffold10x_338_pilon-snap --------TLFGFRFCPYVDRVR----------------------------------MVL
MS3_04642.1 --------TLIGFRFCPYVDRVR----------------------------------LIL
:. * : : . :
augustus_masked-scf7180006947290 NYFGIKYDKVLINLQDKPEWFLKMYPEGKVPFIIDK-QRQLGDSEIIIRDYDSKNNNK--
DILT_0000424901-mRNA-1 DYHKIPYDRILIDLMSKPSWYLKMYPVGKVPLLLYR-GKTMAESDVIMKYCDQMKGAKAS
maker-scaffold10x_338_pilon-snap QYYNVPHDNVWIHLYSKPDWYLELYPVGKVPLLITKEGKTIVESDAIIRYLDETIGNK-S
MS3_04642.1 SYYKIDYDLINVSLASKPEWFLKMYPIGKVPLLLLPNEQKLPESDEIIRHIDKLYGSE-T
.*. : :* : : * .**.*:*::** ****::: . : :*: *:. *. . :
The challenge I'm having is generating a new file in which the nucleotide sequences carry the same sequence IDs as the alignment (the alignment software shortens the sequence IDs to 32 characters, I think) and appear in the same order that the alignment software assigned to them. Doing this by hand is not practical when I have many sequences to align. The nucleotide sequences should end up looking like this (I've deleted some of the nucleotide sequences):
>augustus_masked-scf7180006947290
ATGAGTGAAATACGGAGTTTAAACATTTTCGATGCCAACAGCCAGAACTCATCAGAATTT
AGACGTAATATTCCAGATTTCCTGAGACCCTATGAGTGTTATCGCTGTGTTATCGGGAAA
AAGAAGCCGGATGATGTTGAATACATTTGCAGATATTCTCTGTCATGTTTAGGTGATTGT
GCAAAAGAAAAGGACTATGCAAGGTATCTGGAAATGAAACCCTGCATTTTTCTTCAAGTC
AATAAAGTTTATGGCTGGATTCCAGACATTGTTGGTGAAAATTTACTCGTGAAATGTTTC
GGAAAGGTCGGTTTAATTAAAATTATATTAAATAGTATAACACCTGAAATTGTATTCAAC
TACTTCGGGATCAAATATGACAAGGTTCTAATAAATCTACAGGATAAACCTGAATGGTTT
CTCAAAATGTACCCTGAAGGCAAGGTTCCATTCATCATTGATAAACAGAGACAACTTGGT
GACTCTGAGATTATCATTCGAGACTATGACTCAAAGAACAATAATAAATTGATTACTGCC
TGTGGCGAAGAAAAGTTTTCTGAAACTAAAGATCTCATCTCAAGCTTCTTTGGCCTTTGC
TATACCATTCTCTTCAAGGATAATAAAATTTCCGATGAGAATGCTGATCTCTTCTTGAAA
GCTCTCGAGAAGGTTGAAGCGAAAATTGTTGGCCCCTTCATGTGGGGAGATCAACTATCT
CTAGCCGATGTAATTCTCTTCACACATTTGAACATGTTCGAGTGCTCTTTATCGAGAATC
GAGGGAATTCATCCTGACCAAGTGAAAGATGGTTATCCCAATGCCGCAAGGGAAGCTAGC
TTCGTCAAGATTCCCGCCTATCTGAAGCAAATGAGAAATCACTCTGCAGTTAAAGATGTC
TACGTTCATCCTAATGATATTTCCAAGTACGCTGTCGGTTTAAGAATTGGAAAGCCAAAT
CCGGAAGGCGATAACTAG
>DILT_0000424901-mRNA-1
ATGGGCTGGGTATTAGGTGGCGACGGCTCCTTCTTACCCACCGG.......
>maker-scaffold10x_338_pilon-snap
ATGCATCTGAAAAACGGTGACCCAAAACCTACCA..........
>MS3_04642.1
ATGCACCTCAAACGAAGTGACCCTAAACCACTGATTGATCCTAAT.........
I need the final nucleotide sequence file and the alignment file for another analysis, but I am stuck. How can I do this? I have been thinking of using Python, but my scripting skills aren't the best yet. Any advice? Apologies for the long post. Thanks, kay
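To make the task concrete, here is a minimal sketch in plain Python of what I think the script would have to do. It is not a tested solution. It rests on several assumptions: the alignment is in standard Clustal format (a header line starting with `CLUSTAL`, and conservation lines that begin with whitespace), each shortened ID in the alignment is a prefix of exactly one full ID in the nucleotide file, and the function names (`read_fasta`, `clustal_id_order`, `reorder_by_alignment`, `write_fasta`, `reorder_files`) and the file paths passed to them are hypothetical names chosen for illustration.

```python
def read_fasta(handle):
    """Return {sequence_id: sequence} from an open FASTA file handle."""
    records = {}
    current_id = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The ID is the header text up to the first whitespace.
            current_id = line[1:].split()[0]
            records[current_id] = []
        elif current_id is not None:
            records[current_id].append(line)
    return {seq_id: "".join(parts) for seq_id, parts in records.items()}


def clustal_id_order(handle):
    """Return sequence IDs in the order they first appear in a Clustal file."""
    order = []
    for line in handle:
        # Skip blank lines, the "CLUSTAL ..." header, and conservation lines
        # (assumed to start with whitespace, as in standard Clustal output).
        if not line.strip() or line.startswith("CLUSTAL") or line[0].isspace():
            continue
        seq_id = line.split()[0]
        if seq_id not in order:
            order.append(seq_id)
    return order


def reorder_by_alignment(aligned_ids, nucleotides):
    """Pair each (possibly shortened) aligned ID with its nucleotide sequence."""
    result = []
    for short_id in aligned_ids:
        if short_id in nucleotides:
            # An exact match wins over any prefix match.
            matches = [short_id]
        else:
            # Assumption: the aligner shortened the ID to a prefix of the original.
            matches = [full for full in nucleotides if full.startswith(short_id)]
        if len(matches) != 1:
            raise ValueError(f"{short_id!r} matches {len(matches)} nucleotide IDs")
        result.append((short_id, nucleotides[matches[0]]))
    return result


def write_fasta(records, handle, width=60):
    """Write (id, sequence) pairs as FASTA, wrapping sequences at `width`."""
    for seq_id, seq in records:
        handle.write(f">{seq_id}\n")
        for start in range(0, len(seq), width):
            handle.write(seq[start:start + width] + "\n")


def reorder_files(alignment_path, nucleotide_path, output_path):
    """Hypothetical driver: read both inputs and write the reordered FASTA."""
    with open(alignment_path) as aln, open(nucleotide_path) as nuc:
        records = reorder_by_alignment(clustal_id_order(aln), read_fasta(nuc))
    with open(output_path, "w") as out:
        write_fasta(records, out)
```

Calling something like `reorder_files("proteins.aln", "nucleotides.fasta", "nucleotides_reordered.fasta")` (placeholder file names) should then write the nucleotide sequences under the shortened IDs and in the alignment's order. The script stops with an error if a shortened ID matches zero or several nucleotide IDs, rather than guessing. Biopython's `Bio.SeqIO` and `Bio.AlignIO` modules could probably replace the hand-written parsers, but I have not tried that.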