Question: R script to extract sequences using seqinr returns Error Subscript out of bounds?
0
gravatar for oliver.bayfield
2.6 years ago by
United States
oliver.bayfield90 wrote:

I borrowed a script from How To Extract Specific Coordinates From Multifasta File to extract some sequences from a fasta nt file test.fa using the positions stored in the file positiontest.txt. I was getting a NULL returned I think at the apply() step, so I made some simpler files to test it on. Now executing it returns Error in infas[[seq_id]] : subscript out of bounds, Calls: apply -> FUN. If you could help me debug and find where I'm asking for something that isn't there that will help, and hopefully I can proceed to the NULL problem.

     #!/usr/bin/Rscript

     library(seqinr)

     infas=read.fasta("test.fa")
     posi=read.delim("positiontest.txt", header=FALSE, sep="\t")
     outfas=file("orf2seqs.fa", 'w')

     doitall=function(x) {
       seq_id <- x[[1]]; start <- x[[2]]; stop <- x[[3]]; seq <- infas[[seq_id]]
       seqv <- seq[start:stop]
       header <- paste(sep="", ">", attr(seq, "name"), "_", start, "_", stop, "\n")
       cat(file=outfas, header)
       cat(file=outfas, toupper(paste(sep="", collapse="", seqv)), "\n")
     }
     apply(posi, 1, doitall)
     close(outfas)

 

test.fa:

     >1234567
     AATGATAATGATGATGATGAT
     >1357924
     GTGATGATGATGATGAAGATGGATGATG

positionstest.txt :


     1234567    2    9
     1357924    5    10

(There are probably other scripts / approaches, but for the sake of learning I would rather persevere with this script)

debug extract sequence R fasta • 1.3k views
ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by oliver.bayfield90
1
gravatar for Martombo
2.6 years ago by
Martombo1.7k
Seville, ES
Martombo1.7k wrote:

the main problem is that if your sequence ids are something like 1234567, they are going to be interpreted as integer indices instead of strings (R is evil). Try the following correction in the first line of your function (or use sequence names like seq1 and seq2):

infas[[as.character(seq_id)]]

also, you didn't define the variable strand

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by Martombo1.7k

Thanks. Inserting as.character() removes the Out of bounds error and that makes sense. Now I'm back to the NULL.
How would strand be defined?

      $./extractseqs.r
      .Rprofile: Setting UK repositorynLoading required package: ade4
      Warning messages:
      1: package ‘seqinr’ was built under R version 3.1.2
      2: package ‘ade4’ was built under R version 3.1.3
      NULL

      

ADD REPLYlink written 2.6 years ago by oliver.bayfield90
1

heh that's a good question. there's nothing in your data that defines the strand of the sequences. you can just remove this variable from the header. you'll see that the output will be written to the file orf2seqs.fa.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by Martombo1.7k

Oh I'm sorry, this is removed in the script, previously I was trying it with strand info but I simplified it to troubleshoot.

Ignore my last comment, the NULL seems to be there regardless, the output is there as it should be now!

I fixed the NULL (or rather hid it) with invisible() wrapping.

Thanks.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by oliver.bayfield90
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1007 users visited in the last hour