How to split the sequences in a multi-FASTA file into pieces of a specific length
7.8 years ago
vasilislenis ▴ 150

Hello all,

I'm trying to split the sequences inside a multi-FASTA file into smaller ones using a length threshold. For example, let's say that I have sequences that are longer than 500 bp and I want to split them into sequences of 250 bp. I know that faSplit from Kent's toolbox does the job if I use the -bysize, but the problem occurs when a sequence length is not an exact multiple of my threshold. For example, let's say that there is a sequence 550 bp long. faSplit will give me 3 subsequences of 250 bp, 250 bp and 50 bp, respectively. The problem is that this leaves a lot of small sequences. In this example, what I would like to have is 2 subsequences, one of 250 bp and the other of 550 - 250 = 300 bp. In other words, I want to split the sequences into pieces of a specific length, but if the last remaining part is shorter than my threshold, I would like it to be concatenated with the previous piece. Is there any way that I can do this? Thank you very much in advance.

PS. I have noticed that a similar question was already answered, but I think that my question is a little bit different.

FASTA FASPLIT • 3.0k views

What you are asking for is a very specific feature that you are not likely to find in a ready-made program. You can either cat the last 2 files yourself (I assume you are asking because there are many?) or write a script. The former should not be too hard, since you can always find the last two files by their names.


Thank you very much for your reply. What I want to do is split the sequences into subsequences with a specific number of bases each, not split the file into smaller files. For example, with a threshold of 250, a large sequence of 1100 bp should be split into 3 pieces of 250 bp and one of 350 bp, because I don't want any sequences shorter than the threshold. All the sequences will stay in the same FASTA file.
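The rule described here (fixed-size pieces, with a short remainder folded into the last piece) can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the function name `split_merge_tail` is hypothetical, and it assumes a sequence shorter than twice the threshold should be left whole.

```python
def split_merge_tail(seq, size):
    """Split seq into pieces of `size` bases.

    A final remainder shorter than `size` is appended to the last full
    piece instead of becoming its own short piece (assumed behaviour).
    """
    if size <= 0:
        raise ValueError("size must be positive")
    # Number of full-size pieces; at least one so short sequences stay whole.
    n_pieces = max(1, len(seq) // size)
    starts = [i * size for i in range(n_pieces)]
    # Every piece ends where the next one starts; the last runs to the end.
    ends = starts[1:] + [len(seq)]
    return [seq[s:e] for s, e in zip(starts, ends)]


if __name__ == "__main__":
    for length in (550, 1100, 400):
        pieces = split_merge_tail("A" * length, 250)
        print(length, [len(p) for p in pieces])
    # 550 [250, 300]
    # 1100 [250, 250, 250, 350]
    # 400 [400]
```

The pieces can then be written back to a single FASTA file under whatever header scheme suits, for example the original name plus the piece index.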


My apologies. Since you were referring to faSplit I jumped to the wrong conclusion.

You may be able to use reformat.sh from BBMap to achieve what you want to do with the following option. You will need to deal with the last two pieces manually.

reformat.sh in=your.fa out=new_file.fa breaklength=250

Thank you. I tried it, but the problem is that if the last part is shorter than 250 bp it splits it off anyway. I need some way to check whether the last part is shorter than 250 bp and, if so, not split it off. It does more or less the same job as faSplit.


My original comments are going to apply for reformat.sh as well. It is just the way these tools are doing the splitting. I have my doubts that you are going to find a ready program for this (unless @Brian, author of BBMap, who participates here, knows some other way to do it with reformat.sh).


Thank you very much, you were right. In the end, I wrote a piece of code to do this. I keep the coordinates of the sequences in a tab-separated (BED-style) format and split them where needed. After this, I extract the sequences again.
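The code itself was not posted, so the following is only a guess at what a coordinate-based approach could look like. The function names `bed_intervals` and `extract` and the `name:start-end` header style are assumptions made for illustration. Coordinates are 0-based and half-open, as in BED.

```python
def bed_intervals(name, length, size):
    """Yield (name, start, end) intervals of `size` bases.

    The last interval is extended to the end of the sequence, so a
    remainder shorter than `size` never forms its own interval.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    n_pieces = max(1, length // size)
    for i in range(n_pieces):
        start = i * size
        end = length if i == n_pieces - 1 else start + size
        yield (name, start, end)


def extract(seqs, intervals):
    """Yield (header, subsequence) for each interval.

    `seqs` is a dict mapping sequence name to sequence string.
    """
    for name, start, end in intervals:
        yield f"{name}:{start}-{end}", seqs[name][start:end]


if __name__ == "__main__":
    seqs = {"seq1": "A" * 550}  # made-up example record
    intervals = list(bed_intervals("seq1", len(seqs["seq1"]), 250))
    for name, start, end in intervals:
        print(name, start, end, sep="\t")  # tab-separated, BED style
    for header, sub in extract(seqs, intervals):
        print(f">{header}\n{sub}")
```

Keeping the intervals as a separate tab-separated table means the same coordinates could also be handed to an existing extraction tool instead of the `extract` helper.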
