R script to extract sequences using seqinr returns Error Subscript out of bounds?
1
0
Entering edit mode
6.4 years ago

I borrowed a script from How To Extract Specific Coordinates From Multifasta File to extract some sequences from a fasta nt file test.fa using the positions stored in the file positiontest.txt. I was getting a NULL returned I think at the apply() step, so I made some simpler files to test it on. Now executing it returns Error in infas[[seq_id]] : subscript out of bounds, Calls: apply -> FUN. If you could help me debug and find where I'm asking for something that isn't there that will help, and hopefully I can proceed to the NULL problem.

#!/usr/bin/Rscript

library(seqinr)

outfas=file("orf2seqs.fa", 'w')

doitall=function(x) {
seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- infas[[seq_id]]
seqv <- seq[start:stop]
header <- paste(sep="", ">", attr(seq, "name"), "_", start, "_", stop, "\n")
cat(file=outfas, toupper(paste(sep="", collapse="", seqv)), "\n")
}
apply(posi, 1, doitall)
close(outfas)

test.fa:

>1234567
AATGATAATGATGATGATGAT
>1357924
GTGATGATGATGATGAAGATGGATGATG

positionstest.txt :

1234567    2    9
1357924    5    10

(There are probably other scripts / approaches, but for the sake of learning I would rather persevere with this script)

R sequence fasta extract debug • 2.7k views
1
Entering edit mode
6.4 years ago
Martombo ★ 2.9k

the main problem is that if your sequence ids are something like 1234567, they are going to be interpreted as integer indices instead of strings (R is evil). Try the following correction in the first line of your function (or use sequence names like seq1 and seq2):

infas[[as.character(seq_id)]]

also, you didn't define the variable strand

0
Entering edit mode

Thanks. Inserting as.character() removes the Out of bounds error and that makes sense. Now I'm back to the NULL.
How would strand be defined?

\$./extractseqs.r
Warning messages:
1: package ‘seqinr’ was built under R version 3.1.2
2: package ‘ade4’ was built under R version 3.1.3
NULL

1
Entering edit mode

heh that's a good question. there's nothing in your data that defines the strand of the sequences. you can just remove this variable from the header. you'll see that the output will be written to the file orf2seqs.fa.

0
Entering edit mode

Oh I'm sorry, this is removed in the script, previously I was trying it with strand info but I simplified it to troubleshoot.

Ignore my last comment, the NULL seems to be there regardless, the output is there as it should be now!

I fixed the NULL (or rather hid it) with invisible() wrapping.

Thanks.