Question: Tips to speed up read.fasta() in R for large files?
oliver.bayfield110 wrote, 3.9 years ago (United States):

I found some tips for speeding up read.table() here; I wondered if anyone can suggest something similar for read.fasta(), which is part of the seqinr library. Not that the two functions are really comparable.

I have a ~6GB file.

sequence R fasta • 4.3k views
modified 3.9 years ago by John12k • written 3.9 years ago by oliver.bayfield110

Why do you need R? Can't you use another program for your project? A Linux pipeline, C, etc...

written 3.9 years ago by Pierre Lindenbaum119k

No particular reason; I'm slightly more familiar with R, and I'm not sure how to translate my script into another language.

written 3.9 years ago by oliver.bayfield110
SES8.2k wrote, 3.9 years ago (Vancouver, BC):

Since that function is written in R, I don't think optimizing your code will result in much of a speedup compared to using a library written in another language. If you want to work in R, it would probably be easier to use the system() command to transform your sequences with another program, or just use the command line as Pierre suggested. If you have a specific task, I'm sure people can provide a solution.
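For example, counting the sequences in a FASTA file can be delegated to grep from within R via system2(). This is a minimal sketch; "test.fa" is a placeholder file name and it assumes a Unix-like system with grep available:

```r
# Sketch: delegate work to a command-line tool from R via system2().
# "test.fa" is a placeholder; grep counts header lines far faster
# than parsing the whole file in R would.
fa <- "test.fa"
n_seqs <- as.integer(system2("grep", c("-c", shQuote("^>"), shQuote(fa)),
                             stdout = TRUE))
n_seqs
```

The same pattern works for any transformation step (awk, seqtk, samtools, ...): let the external tool do the heavy I/O and read only its small result back into R.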

modified 3.9 years ago • written 3.9 years ago by SES8.2k

Thanks @SES.
My post on the script itself is C: R script to extract sequences using seqinr returns Error Subscript out of bounds; the script now works fine (but slowly for .fasta files) thanks to @Martombo's comment.
Any suggestions and examples on moving it over to another program or another language would be a great learning experience.
I have no particular allegiance to R, other than that it's been useful for me so far and most of the relevant posts I've found for what I'm doing use R. I wasn't aware of its speed limitations until now.

written 3.9 years ago by oliver.bayfield110
cyril-cros890 wrote, 3.9 years ago (France):

If you have a huge file (just reread your post, 6Gb), the best way is to use an index (this is how databases work...). Take a look at this link: http://stackoverflow.com/questions/23173215/how-to-subset-sequences-in-fasta-file-based-on-sequence-id-or-name

It depends on what you want to do: do you only need a subset of these sequences at any given time, or do you require every one of them? You would need a way to load and store them in RAM, rather than re-reading a 6Gb file each time you tinker with it. NB: you can increase R's memory limit to avoid errors.
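To make the indexing idea concrete, here is a toy base-R sketch of what a .fai-style index records per sequence and how those values let you seek straight to a subsequence without reading the rest of the file. The file, offsets, and layout values below are made up for illustration; in practice samtools or Rsamtools builds and reads the real index for you:

```r
# Toy illustration of how a .fai-style index works (base R only; the
# offsets below are computed by hand for this tiny made-up file).
fa <- tempfile(fileext = ".fa")
writeLines(c(">123456",
             "AGTGATGATG",
             "ATGATGATGC",
             ">98765432",
             "AGTCGATGCA"), fa)

# A .fai index stores, per sequence: name, length, byte offset of the
# first base, bases per line, and bytes per line (newline included).
# For ">123456" above: offset 8, 10 bases/line, 11 bytes/line.
fetch <- function(path, offset, start, stop, linebases, linewidth) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Map a 1-based sequence coordinate to a byte offset, skipping newlines.
  byte_start <- offset + ((start - 1) %/% linebases) * linewidth +
                (start - 1) %% linebases
  seek(con, byte_start)
  raw <- readChar(con, (stop - start + 1) * 2)  # over-read, then drop newlines
  substr(gsub("\n", "", raw), 1, stop - start + 1)
}

# Bases 5..14 of sequence 123456, without reading the rest of the file:
fetch(fa, offset = 8, start = 5, stop = 14, linebases = 10, linewidth = 11)
# "ATGATGATGA"
```

This is also why indexing requires every sequence line to have the same length: the offset arithmetic above only works if bases-per-line is constant within a record.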

written 3.9 years ago by cyril-cros890

Reading an indexed (or maybe compressed) file is definitely the solution here.

written 3.9 years ago by Giovanni M Dall'Olio26k

Sounds ideal!
However, when I follow the instructions in that link and index my file in R, my original FASTA ends up about a tenth of its original size, and only about 90 out of 2000 sequences (big sequences) get indexed. What's happening there, such that my original file gets overwritten?!

library(Rsamtools)
indexFa("foo.fasta")

My fasta file is of the form:
>123456
AGTGATGATGATGATGATGCAGCTAGACTGCTAGCTACGTCAGACT(...)
>98765432
AGTCGATGCATCGCATACGACTAGCTCAGCATCGACTACGTACGATC(...)



modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110

Can you confirm this? You should get an index file in .fai format, and no changes to your old file. Please check that the lines with DNA sequences have a consistent length.

EDIT

C: What would make fasta file indexing stop prematurely? had the same problem as you. It was fixed by running the code below rather than indexFa("yourFileName"):

 library(Rsamtools)
 file_path <- "myfile.fa"
 indexFa(file_path)

 

modified 3.9 years ago • written 3.9 years ago by cyril-cros890

Thanks cyril-cros. Indexing looks like the sensible way to go. However, how would I now go about incorporating it into my script (below)? Other than reading in this large FASTA file, it works fine. I assume that instead of read.fasta() I would additionally call on a .fai file, so as to read in only one entry at a time? I'm unfamiliar with how index files work, basically.

 #!/usr/bin/Rscript

 library(seqinr)
 infas=read.fasta("test.fa")
 posi=read.delim("positiontest.txt", header=FALSE, sep="\t")
 outfas=file("orf2seqs.fa", 'w')

 #construct a function to extract subsequences from infas based on start/stop positions defined in posi. E.g. posi may define 3 subsequences per fasta entry (gi) to be extracted; each line read into posi is one such subsequence: gi, start, stop, strand.
 doitall=function(x) {
   seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- infas[[as.character(seq_id)]]; strand <- x[[4]]
   seqv <- seq[start:stop]
   header <- paste(sep="", ">", attr(seq, "name"), ":", start, "-", stop, "_(", strand, ")", "\n")
   cat(file=outfas, header)
   cat(file=outfas, toupper(paste(sep="", collapse="", seqv)), "\n")
 } 
 #execute function. invisible isn't necessary, just hides the NULL
 invisible(apply(posi, 1, doitall))
 close(outfas)
modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110
John12k wrote, 3.9 years ago (Germany):

There are only two ways to speed up loading data - buy a faster hard drive, or read less data from it :P
I can't help with the former, but for the latter you could try converting your FASTA files to 2bit and loading that. A 6Gb FASTA file will only be 1.5Gb in 2bit format - and that's before compression. It might even fit into RAM as 2bit, and a quick Google suggests that there are packages in R which understand the 2bit format. I don't know if it then converts it to something else after reading it... who knows.

But 6Gb isn't much data really... hard drives typically read over 100MB/s, so unless 60 seconds is too long for you, I doubt getting more bang for your RPM is going to make a difference. Most likely the logic part of your program is the slow part, not the reading part. R is a famously slow language, particularly in its loops, which is why packages like plyr exist to try and speed things up.
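The four-fold saving is easy to see: with an alphabet of only A/C/G/T, each base needs 2 bits, so four bases fit in one byte. A toy base-R packer illustrates the arithmetic (this is the idea only, not the actual 2bit file format, which also has headers and N-masking):

```r
# Rough illustration of why 2bit is ~4x smaller than FASTA: each base
# fits in 2 bits, so 4 bases pack into one byte (toy encoder, base R).
pack2bit <- function(seq) {
  codes <- c(A = 0L, C = 1L, G = 2L, T = 3L)
  b <- codes[strsplit(seq, "")[[1]]]
  # pad to a multiple of 4 bases, then pack 4 bases per byte
  b <- c(b, rep(0L, (4 - length(b) %% 4) %% 4))
  m <- matrix(b, nrow = 4)
  as.raw(m[1, ] * 64L + m[2, ] * 16L + m[3, ] * 4L + m[4, ])
}

length(pack2bit(strrep("ACGT", 100)))  # 400 bases -> 100 bytes
```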

But all of this is just speculation - since we dont know what your program does, we cannot really say how to speed it up. If you really want to speed up your code (and become a better programmer in the process) definitely check out this: http://adv-r.had.co.nz/Profiling.html
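As a concrete starting point for that, base R's built-in sampling profiler Rprof() shows where the time actually goes (the output file name and the sort() workload below are arbitrary stand-ins for the real script):

```r
# Profile a toy workload with base R's sampling profiler.
Rprof("prof.out")
for (i in 1:20) x <- sort(runif(5e5))  # stand-in for the slow code
Rprof(NULL)
summaryRprof("prof.out")$by.self       # time spent per function
```

If the table is dominated by your own logic rather than by read.fasta, then speeding up the file reading would not help much, which is exactly the point above.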
Good luck! :)

written 3.9 years ago by John12k
cyril-cros890 wrote, 3.9 years ago (France):

I advise you to read the GRanges documentation on Bioconductor, as well as the Rsamtools documentation and its getSeq method.

  • To read a fasta file efficiently using an index, use the method FaFile(filepath) - I don't know if read.fasta does the same
  • To store and manipulate annotation data or position records in the form <gi, start, stop, strand>, use a GRanges object.
  • You can then access all the sequences corresponding to records in a GRanges object using the method getSeq
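Put together, a hedged sketch of that workflow (requires the Bioconductor packages Rsamtools and GenomicRanges; the file name, sequence name, and coordinates below are placeholders, not values from the original script):

```r
# Sketch: indexed random access with Rsamtools/GenomicRanges
# (Bioconductor packages; install via BiocManager if missing).
library(Rsamtools)
library(GenomicRanges)

indexFa("test.fa")                       # writes test.fa.fai, done once
fa  <- FaFile("test.fa")
rng <- GRanges("123456", IRanges(start = 5, end = 14))
getSeq(fa, rng)                          # reads only the requested slice
```

In the script above, the posi table could be turned into one GRanges object and passed to getSeq in a single call, replacing both read.fasta() and the per-row subsetting.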
written 3.9 years ago by cyril-cros890