Question: What would make fasta file indexing stop prematurely?
0
gravatar for oliver.bayfield
3.9 years ago by
United States
oliver.bayfield110 wrote:

I'm following the pointers in this post, regarding the use of Rsamtools and the indexFa() function to index a large fasta file of around 7GB, about 2000 nt sequences. However only the first ~87 are indexed, I see no reason why it would stop at this point unless it reaches a limitation.
What's more, the file the function is operating on, indexFa("seqs.fa"), becomes truncated to these first 87 entries also (I don't know how the function works, but that was quite unexpected). Any solution to this or other methods of indexing a large fasta file to call in sequences by name to avoid trying to read in the whole fasta file with read.fasta() in one go (which gets Killed prematurely too).
Thanks again,
 

library(Rsamtools)
indexFa("seqs.fa")   # I think this is stumbling point
fa = FaFile("seqs.fa")
gr = as(seqinfo(fa), "GRanges")
getSeq(fa, gr[2:1])
idx = pmatch("296783888", names(gr)) 
seq = getSeq(fa, gr[idx])
gene sequence R fasta • 1.1k views
ADD COMMENTlink modified 3.9 years ago by cyril-cros890 • written 3.9 years ago by oliver.bayfield110
1
gravatar for Antonio R. Franco
3.9 years ago by
Spain. Universidad de Córdoba
Antonio R. Franco4.0k wrote:

several reasons can explain this

1. Fasta files has to have a first line starting with ">", then the name, then some optional commentaries. The next line must be compulsorily DNA data. If some commentaries go to the second line, the parsing of the fasta file can fail

2. The presence of letters which are not nucleic acid

If the fasta file fail in loading you have several choices to discover where. One is a Fasta Validator

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by Antonio R. Franco4.0k

Thanks Antonio, on 2. would "N"'s qualify as non-nucleic acid? What's the standard approach to deal with N: remove or mutate?

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110
1

"N" (ambiguous base call) is perfectly acceptable in FASTA DNA sequences.

ADD REPLYlink written 3.9 years ago by Matt Shirley8.9k

Okay that's not the reason then. Nor is variable length.

ADD REPLYlink written 3.9 years ago by oliver.bayfield110
1
gravatar for Matt Shirley
3.9 years ago by
Matt Shirley8.9k
Cambridge, MA
Matt Shirley8.9k wrote:

To index the file using this method, all sequence lines (except the last) in each entry need to be of a consistent length. There should also be no blank lines between entries.

ADD COMMENTlink written 3.9 years ago by Matt Shirley8.9k

Thanks Matt, my sequences fall short on the consistent length criterion. However I don't find this to be the case with some test files.

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110
1
gravatar for cyril-cros
3.9 years ago by
cyril-cros890
France
cyril-cros890 wrote:

In R try using memory.limit(size=7000) to increase your RAM max allowance - it should be around 2Go by default. You need a 64bit system AND enough RAM obviously.

Size is in Mb, and just using memory.limit() will display your current setting.

ADD COMMENTlink written 3.9 years ago by cyril-cros890

Thanks, I expect memory has held me back at one point here, though I've managed to use Rsamtools indexing by changing my commands slightly (above). +1

ADD REPLYlink written 3.9 years ago by oliver.bayfield110
0
gravatar for Brian Bushnell
3.9 years ago by
Walnut Creek, USA
Brian Bushnell16k wrote:

It might help if you could explain what you are trying to accomplish, and the characteristics of your computer, both physical and software.

ADD COMMENTlink written 3.9 years ago by Brian Bushnell16k

I am trying to read in a large fasta file to R, to extract subsequences from. It's too big for read.fasta() however so it was suggested I index the file and only call the sequence when its required for extraction; but I'm having problems with this Rsamtools idx as described above. I'm on a Mac OS if it makes a difference.

ADD REPLYlink written 3.9 years ago by oliver.bayfield110
0
gravatar for oliver.bayfield
3.9 years ago by
United States
oliver.bayfield110 wrote:

I'm coming to the conclusion its a memory thing, which I'd hoped indexing would circumvent. When I open the fasta file, its only loads up to the 87th entry, which is the same number that it is able to index in R. However if I grep the fasta headers "^>" I can see the rest (all 2000 odd) are still in the file, and you can tell it based on the file size too. Any way to expand memory or am I going to have to do it in 20 or 30 chunks?

ADD COMMENTlink modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110

How big is the file?

ADD REPLYlink written 3.9 years ago by Matt Shirley8.9k

7Gb

ADD REPLYlink written 3.9 years ago by oliver.bayfield110

Can you index it using samtools faidx?

ADD REPLYlink written 3.9 years ago by Matt Shirley8.9k

Is that different to the one I'm using above? I don't think it will index which is my current problem

ADD REPLYlink written 3.9 years ago by oliver.bayfield110

It's a different implementation of the same method so might inform you of what the error is. Also could use "pyfaidx" which includes a faidx program that will tell you what causes the indexing to fail.

ADD REPLYlink written 3.9 years ago by Matt Shirley8.9k

So turns out if I change the commands slightly the .fai file is produced in full, and it doesn't simultaneously cut down my orginal file size to a fraction of what it was:

 library(Rsamtools)
 file_path <- "myfile.fa"
 indexFa(file_path)

Previously I was running:

 library(Rsamtools)
 indexFa("myfile.fa")
ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by oliver.bayfield110

That doesn't really make sense, but glad you got it to work!

ADD REPLYlink written 3.9 years ago by Matt Shirley8.9k

I know. Don't look a gift horse in the mouth eh.

ADD REPLYlink written 3.9 years ago by oliver.bayfield110
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 749 users visited in the last hour