Split multi-fasta file into smaller files using R
3
0
Entering edit mode
19 months ago

Hi! I'm very new to bioinformatics and was wondering if someone could help me.

I am trying to split a multi-fasta file (1.5 millions of contigs) into smaller files, which would contain each 1000 sequences (except for the last one, probably). My goal is to BLASTn the smaller files after. The thing is that I cannot use "split" in Unix directly on the .fasta file, since each sequence has a different lenght. If I use "split", it breaks the sequence in the middle.

I have the option of putting all the sequences in one line so I can split them after. I managed to do it, however, I really want to learn more about coding (I feel like this would be the "easy option").

My supervisor has told me that I could do that in R with a for loop. I have read my data using the package ape so the .fasta file is now a DNAbin. When I do length(mydata), the result is the number of sequences. I have made a try by doing and it gave me my first two sequences. So we can assume that for now, each line = one sequence.

x <- mydata[1:2]

Basically, I want that everytime we see a multiple of 1000 (1000, 2000, 3000, 4000, ...) R writes the lines from 0:1000 into one file, the lines from 1001:2000 in another file, ... I just started recently with Python and with R, so I've been trying a few things but nothing is working.

Is that something doable in R?

R multi-fasta fasta blastn sequences • 1.1k views
2
Entering edit mode
19 months ago
GenoMax 111k

If using an alternate to R is an option then there are many other ways of doing this. faSplit utility by Jim Kent (UCSC) is likely the most performant option. Search for split fasta files site:biostars.org.

0
Entering edit mode

0
Entering edit mode
19 months ago
shimbalama ▴ 10

hi. what you want to do is very easy with python and biopython. if ur supervisor said to use a for loop in R then they clearly aren't a Bioinformatician - avoid iterating in R.

I'm using one hand on a phone so this wont be syntacticly correct but it should get u close. there are faster ways, like seqtk (or just basic py), but this is teaching u some biopython, which comes in handy for little tasks like this.

from Bio import SeqIO

for i, fa in enumerate(SeqIO.parse('seqs.fa', 'fasta')):
if i%1000==0:
if i==0:
fout = open(str(i)+'out.fa', 'w')
else:
fout.close()
fout = open(str(i)+'out.fa', 'w')
SeqIO.write(fa, fout, 'fasta')

fout.close()


I'm sure there's better ways but that's the first thing that came to mind. and sorry if hard to read. phone making it hard

0
Entering edit mode

Thank you very much!! I'll look into that!!

0
Entering edit mode
19 months ago
Juke34 ★ 6.5k

Several solutions (None in R) documented in this mini-review

0
Entering edit mode