R script to extract sequences using seqinr returns Error Subscript out of bounds?
1
0
Entering edit mode
6.4 years ago

I borrowed a script from How To Extract Specific Coordinates From Multifasta File to extract some sequences from a fasta nt file test.fa using the positions stored in the file positiontest.txt. I was getting a NULL returned I think at the apply() step, so I made some simpler files to test it on. Now executing it returns Error in infas[[seq_id]] : subscript out of bounds, Calls: apply -> FUN. If you could help me debug and find where I'm asking for something that isn't there that will help, and hopefully I can proceed to the NULL problem.

     #!/usr/bin/Rscript

     library(seqinr)

     infas=read.fasta("test.fa")
     posi=read.delim("positiontest.txt", header=FALSE, sep="\t")
     outfas=file("orf2seqs.fa", 'w')

     doitall=function(x) {
       seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- infas[[seq_id]]
       seqv <- seq[start:stop]
       header <- paste(sep="", ">", attr(seq, "name"), "_", start, "_", stop, "\n")
       cat(file=outfas, header)
       cat(file=outfas, toupper(paste(sep="", collapse="", seqv)), "\n")
     }
     apply(posi, 1, doitall)
     close(outfas)

 

test.fa:

     >1234567
     AATGATAATGATGATGATGAT
     >1357924
     GTGATGATGATGATGAAGATGGATGATG

positionstest.txt :


     1234567    2    9
     1357924    5    10

(There are probably other scripts / approaches, but for the sake of learning I would rather persevere with this script)

R sequence fasta extract debug • 2.7k views
ADD COMMENT
1
Entering edit mode
6.4 years ago
Martombo ★ 2.9k

the main problem is that if your sequence ids are something like 1234567, they are going to be interpreted as integer indices instead of strings (R is evil). Try the following correction in the first line of your function (or use sequence names like seq1 and seq2):

infas[[as.character(seq_id)]]

also, you didn't define the variable strand

ADD COMMENT
0
Entering edit mode

Thanks. Inserting as.character() removes the Out of bounds error and that makes sense. Now I'm back to the NULL.
How would strand be defined?

      $./extractseqs.r
      .Rprofile: Setting UK repositorynLoading required package: ade4
      Warning messages:
      1: package ‘seqinr’ was built under R version 3.1.2
      2: package ‘ade4’ was built under R version 3.1.3
      NULL

      

ADD REPLY
1
Entering edit mode

heh that's a good question. there's nothing in your data that defines the strand of the sequences. you can just remove this variable from the header. you'll see that the output will be written to the file orf2seqs.fa.

ADD REPLY
0
Entering edit mode

Oh I'm sorry, this is removed in the script, previously I was trying it with strand info but I simplified it to troubleshoot.

Ignore my last comment, the NULL seems to be there regardless, the output is there as it should be now!

I fixed the NULL (or rather hid it) with invisible() wrapping.

Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 2783 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6