Split multi-fasta file into smaller files using R
3
0
Entering edit mode
3.9 years ago

Hi! I'm very new to bioinformatics and was wondering if someone could help me.

I am trying to split a multi-fasta file (1.5 millions of contigs) into smaller files, which would contain each 1000 sequences (except for the last one, probably). My goal is to BLASTn the smaller files after. The thing is that I cannot use "split" in Unix directly on the .fasta file, since each sequence has a different lenght. If I use "split", it breaks the sequence in the middle.

I have the option of putting all the sequences in one line so I can split them after. I managed to do it, however, I really want to learn more about coding (I feel like this would be the "easy option").

My supervisor has told me that I could do that in R with a for loop. I have read my data using the package ape so the .fasta file is now a DNAbin. When I do length(mydata), the result is the number of sequences. I have made a try by doing and it gave me my first two sequences. So we can assume that for now, each line = one sequence.

x <- mydata[1:2]

Basically, I want that everytime we see a multiple of 1000 (1000, 2000, 3000, 4000, ...) R writes the lines from 0:1000 into one file, the lines from 1001:2000 in another file, ... I just started recently with Python and with R, so I've been trying a few things but nothing is working.

Is that something doable in R?

Thank you in advance!

R multi-fasta fasta blastn sequences • 2.7k views
ADD COMMENT
2
Entering edit mode
3.9 years ago
GenoMax 142k

If using an alternate to R is an option then there are many other ways of doing this. faSplit utility by Jim Kent (UCSC) is likely the most performant option. Search for split fasta files site:biostars.org.

ADD COMMENT
0
Entering edit mode

Thank you for your answer!!

ADD REPLY
0
Entering edit mode
3.9 years ago
shimbalama ▴ 10

hi. what you want to do is very easy with python and biopython. if ur supervisor said to use a for loop in R then they clearly aren't a Bioinformatician - avoid iterating in R.

I'm using one hand on a phone so this wont be syntacticly correct but it should get u close. there are faster ways, like seqtk (or just basic py), but this is teaching u some biopython, which comes in handy for little tasks like this.

from Bio import SeqIO

for i, fa in enumerate(SeqIO.parse('seqs.fa', 'fasta')):
    if i%1000==0:
        if i==0:
            fout = open(str(i)+'out.fa', 'w')
        else:
            fout.close()
            fout = open(str(i)+'out.fa', 'w')
    SeqIO.write(fa, fout, 'fasta')

fout.close()

I'm sure there's better ways but that's the first thing that came to mind. and sorry if hard to read. phone making it hard

ADD COMMENT
0
Entering edit mode

Thank you very much!! I'll look into that!!

ADD REPLY
0
Entering edit mode
3.9 years ago
Juke34 8.6k

Several solutions (None in R) documented in this mini-review

ADD COMMENT
0
Entering edit mode

Thank you for your answer!! I'll look into that this week!

ADD REPLY

Login before adding your answer.

Traffic: 1426 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6