Question: Split multi-fasta file into smaller files using R
0
gravatar for slushverte01
10 days ago by
slushverte010 wrote:

Hi! I'm very new to bioinformatics and was wondering if someone could help me.

I am trying to split a multi-fasta file (1.5 millions of contigs) into smaller files, which would contain each 1000 sequences (except for the last one, probably). My goal is to BLASTn the smaller files after. The thing is that I cannot use "split" in Unix directly on the .fasta file, since each sequence has a different lenght. If I use "split", it breaks the sequence in the middle.

I have the option of putting all the sequences in one line so I can split them after. I managed to do it, however, I really want to learn more about coding (I feel like this would be the "easy option").

My supervisor has told me that I could do that in R with a for loop. I have read my data using the package ape so the .fasta file is now a DNAbin. When I do length(mydata), the result is the number of sequences. I have made a try by doing and it gave me my first two sequences. So we can assume that for now, each line = one sequence.

x <- mydata[1:2]

Basically, I want that everytime we see a multiple of 1000 (1000, 2000, 3000, 4000, ...) R writes the lines from 0:1000 into one file, the lines from 1001:2000 in another file, ... I just started recently with Python and with R, so I've been trying a few things but nothing is working.

Is that something doable in R?

Thank you in advance!

ADD COMMENTlink modified 10 days ago by Juke344.4k • written 10 days ago by slushverte010
2
gravatar for genomax
10 days ago by
genomax85k
United States
genomax85k wrote:

If using an alternate to R is an option then there are many other ways of doing this. faSplit utility by Jim Kent (UCSC) is likely the most performant option. Search for split fasta files site:biostars.org.

ADD COMMENTlink written 10 days ago by genomax85k

Thank you for your answer!!

ADD REPLYlink written 10 days ago by slushverte010
0
gravatar for shimbalama
10 days ago by
shimbalama0
shimbalama0 wrote:

hi. what you want to do is very easy with python and biopython. if ur supervisor said to use a for loop in R then they clearly aren't a Bioinformatician - avoid iterating in R.

I'm using one hand on a phone so this wont be syntacticly correct but it should get u close. there are faster ways, like seqtk (or just basic py), but this is teaching u some biopython, which comes in handy for little tasks like this.

from Bio import SeqIO

for i, fa in enumerate(SeqIO.parse('seqs.fa', 'fasta')):
    if i%1000==0:
        if i==0:
            fout = open(str(i)+'out.fa', 'w')
        else:
            fout.close()
            fout = open(str(i)+'out.fa', 'w')
    SeqIO.write(fa, fout, 'fasta')

fout.close()

I'm sure there's better ways but that's the first thing that came to mind. and sorry if hard to read. phone making it hard

ADD COMMENTlink modified 10 days ago • written 10 days ago by shimbalama0

Thank you very much!! I'll look into that!!

ADD REPLYlink written 10 days ago by slushverte010
0
gravatar for Juke34
10 days ago by
Juke344.4k
Sweden
Juke344.4k wrote:

Several solutions (None in R) documented in this mini-review

ADD COMMENTlink written 10 days ago by Juke344.4k

Thank you for your answer!! I'll look into that this week!

ADD REPLYlink written 10 days ago by slushverte010
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1732 users visited in the last hour