make fasta sequence which is multiple of three
2
0
Entering edit mode
5.6 years ago
alim.hcu ▴ 20

I have fasta sequence of different length. But I want to make number of nucleotide in fasta sequences multiple of three by adding AA or -- at the end. Like if the sequence length is 149 then i want to make it 150 by adding -- or any nucleotides.

Thank You

alignment • 4.5k views
3
Entering edit mode

This looks like an XY Problem. Why do you want to do this multiple of 3 thing? The 3 part tells me if probably has to do with codon usage, but in no scenario can I imagine needing this functionality. Can you explain your end goal please?

0
Entering edit mode

Hello, if you can program in Java, you may use the SEDA API (https://github.com/sing-group/seda) in order to create this operation that you need. You can even use the new Java 9 shell to easily do this as I show in this post:

Regards.

0
Entering edit mode

Thanks for you reply. I downloaded this tools. But the video is not much clear. which option i should use to make multiple of three nucleotides.

0
Entering edit mode

Actually I have to calculate dn/ds using paml. And it does not take any sequences which is not multiplle of three.

3
Entering edit mode

I have to calculate dn/ds using paml

0
Entering edit mode

You should have said so in the original question to avoid half the comments you received here :-).

0
Entering edit mode

Please use ADD COMMENT or ADD REPLY to answer to previous reactions, as such this thread remains logically structured and easy to follow. Your reaction was moved but as you can see it's not optimal. Adding an answer should only be used for providing a solution to the question asked.

0
Entering edit mode

If you are planning on using PAML then you'll need to pairwise align those sequences anyway. So the more cleaver thing to do is to "codon-aware" align the nucleotide sequence and strip off anything that is not aligned (== will automatically result in modulo3 'sequences')

0
Entering edit mode

Ok.. I got it..is there anything to do this in paml. I am aligning those sequences using maft. How should i trim those sequences.

0
Entering edit mode

Start by searching the forum for MAFTT posts and if you don't find an appropriate post, ask a new question.

0
Entering edit mode

ah, no you will have to do this prior to providing them to PAML.

In my lab we have custom script to do this (not helping much :-) ), but here is a tool that does it as well : trimAl and there will be others around

make sure that you do codon-align in mafft (this is crucial for dN/dS estimations!!)

0
Entering edit mode

Specifically "AA" or any nucleotide?

0
Entering edit mode

Thank You very much..it worked..

0
Entering edit mode

Wouter literally just asked you to not add answers. Please be more careful. This should be a comment or comment reply. I've moved it to a comment on the top level post, but it ideally belongs elsewhere. Please read the posts under http://biostars.org/t/how-to to structure your posts better.

1
Entering edit mode
5.6 years ago

Beside the correct question from Ram why you want to do this, here is a python solution:

from Bio import SeqIO

original_file = "./original.fasta"
extended_file = "./extended.fasta"

with open(original_file) as original, open(extended_file, 'w') as extended:
records = SeqIO.parse(original_file, 'fasta')
for record in records:
if len(record.seq)%3 == 1:
record.seq = record.seq + "--"
elif len(record.seq)%3 == 2:
record.seq = record.seq + "-"
SeqIO.write(record, extended, 'fasta')


fin swimmer

1
Entering edit mode

Minor comment, but there is no need to with open(...) your input file when using SeqIO

0
Entering edit mode

I want to run this script in loop..Please suggest me to modify it for running in this script in loop..

Thank You

3
Entering edit mode

why are you still insisting in fiddling about with your input sequences? while the solution offered by finswimmer will work perfectly , several other people have pointed out to you that this is not the way you need to go!

0
Entering edit mode

Actually i am working on transposons. And i have to calculate insertion time for LTR pairs present in LTR elements. So basically these are the nucleotide sequences.

1
Entering edit mode

Actually i am working on transposons.

Why did we have to wait 20 hours and so many comments before you tell us what you are working on? Don't you think all of this thread is a big waste of time?

0
Entering edit mode

alright, I was kinda expecting this already.

anyway the point remains: you should not add bases to get to modulo3 sequences, You first have to align them and subsequently remove unaligned parts (rather than adding bases!!).

Moreover have you looked into the literature on estimating transposon insertion time? There will be software around that specifically does this kind of analysis (though I can not think of one from the top of my head)

0
Entering edit mode

"You first have to align them and subsequently remove unaligned parts (rather than adding bases!!" I would be interested in doing this how to do this? Since i was trying to do this "If you're intending to run it through PAML to do dNdS " but again the multiple of 3 errors comes.

0
Entering edit mode

have a look at TrimAl , or such ... in general tools to clean multiple seq alignments

or manual if there are not that much sequences?

0
Entering edit mode

thanks will look into it trimAI

1
Entering edit mode
5.6 years ago
Joe 21k

It can be done pretty easily with BioPython, as FinSwimmer has shown, particularly with the MutableSeq objects.

If you're intending to run it through PAML to do dNdS etc, then a far more common approach, rather than hacking at your sequences, is just to use a dedicated codon-aware aligner like the imaginatively named CodonAlign

EDIT:

1. I'd strongly suggest creating files with all pairwise comparison sequences you're interested in. If you have many sequences, this may rapidly become untenable though, so think very carefully about what you're trying to show/achieve.

• Its an n choose k problem so, all pairwise comparisons of even just 100 sequences is 4950.
2. CRUCIALLY, use a codon-aware aligner as we've mentioned. Do not try to hack your sequences to a multiple of 3 by adding random characters as this will affect your results.

3. Santise your alignments if necessary. This you may need to do manually.

4. Run PAML.

0
Entering edit mode

Can You describe the pipe line. I have thousand of sequences in pair wise alignment. I want to calculate dn/ds of each pairwise alignment in paml. If any script you can suggest, it would be very helpful for me.

Thank You

0
Entering edit mode

The program is supposed to be downloadable from here: http://homepage.mac.com/barryghall/CodonAlign.html but the webpage seems to be down.

0
Entering edit mode

To my knowledge (limited as it is in this instance), you can't do this with pairwise alignments, they will need to be multiple sequence alignments, else there isn't sufficient information for the dNdS rates to be meaningful. I could be totally wrong here though.

As for scripts, no I don't know I'm afraid. But CodonAlign, and tools like it, are self contained executables, so you shouldn't need much if any scripting in order to use them.

If I am wrong, then it will probably be as simple as concatenating all your sequence pairs in to files of their own, and then running CodonAlign and PAML for each file in a loop or parallel.

0
Entering edit mode

I have to disagree here: it's perfectly OK to calculate dN and/or dS from pairwise alignments (this is how all these Ks plots are being generated).

Otherwise the approach you describe is exactly what I would do :

1. create set of files with each a pair of sequences in it to be aligned
2. align the sequences with something like CodonAlign (codon-aware aligner is key here!!)
3. clean the alignments (to remove gaps etc)
4. run PAML (codeml) on each alignment result file
0
Entering edit mode

I stand corrected on the pairwise in that case (I've never done it myself), but this is absolutely the workflow, so we all agree there!

I'm curious though, if you're just using pairwise sequences, how can you decide which sequence is 'right' relative to the other? You'd have no other information about what the most common codon in that position is to know whether a particular sequence is drifting/being selected for or against?

(this is my pure naivety though I'm sure)

1
Entering edit mode

Indeed, in a pairwise context you can't tell, but what you are referring to is already step 2 of this kind of procedure. step 1 is to calculate (estimate rather) the dS and dN for each pair. We're not looking for the most common codon or such yet, we're simply looking at the divergence on nucleotide level between 2 genes that once shared a common ancestor (origin).

In step 2 we can start looking at the ration of dS over dN which (coming to think of it) we can actually also do in a pairwise manner. The thing we are looking for is to see whether there are more non-synonymous or more synonymous substitutions happened dN/dS > 1 is positive (driving) selection , = 1 is neutral, < 1 is purifying selection . There should in "theory" not be any bias between codons used , we assume that mutations are happening at a constant rate (kuch ;) ) and thus we are only interested to see if the mutations we observe are more dN or are more dS