Question

Extract matched pattern with Index from fasta file in Biostrings R

0

Entering edit mode

9.5 years ago

thiele.ben • 0

Hey,Ive been working on extracting Sequences from a fasta file for some days now and finally decided to post my problem here. I have a fasta file containing approximately 15000 Sequences and I'd like to split these Sequences based on a certain pattern. I'm using the Biostrings package in R with the following functions:

library(Biostrings)
fasta <- readDNAStringSet("~/Downloads/NGS/IgsubPrimer/fasta", "fasta")
matched <- vmatchPattern("GACCACGTTCCCATCT", fasta, max.mismatch=1)

In the docus I find that vmatchPattern gives back an MIndex file on which basis it is possible to extract all matching sequences and write them to a new fasta file. Ive tried extractAllMatches(fasta, matched) which seems just to work with XString files and not in this case. Ive also tried the following line from some documentary:

x <- lapply(seq(along=fasta), function(x) as.character(Views(fasta[[x]], start(matched[[x]]), end(matched[[x]]))))

which gives:

Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views

For a hint I would be very grateful, I'm really stock atm... :(

Biostrings sequencing R • 6.6k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by thiele.ben • 0

Ram · Answer 1 · 2014-11-03

2

Entering edit mode

9.5 years ago

Chris S. ▴ 320

First, please try to use a reproducible example. For example, check ?readDNAStringSet and you'll see this command that everyone with Biostrings can run

x <- readDNAStringSet(system.file("extdata", "someORF.fa", package="Biostrings"))

Next, run vmatchPattern

mi0 <- vmatchPattern("TTCCCAGC", x, max.mismatch=1)

Finally, check ?matchPDict for some code to run on MIndex objects including Views. Again, no one knows what the "IgA" object is in your example

startIndex(mi0)

Views(x[[1]], mi0[[1]])

    start  end width
[1]  2309 2316     8 [TTCACAGC]
[2]  5077 5084     8 [TTCCCAGA]
[3]  5380 5387     8 [TTCCAAGC]

mapply(Views, x, mi0)  #OR
lapply(seq_along(x), function(i) Views(x[[i]], mi0[[i]]))

 unlist(lapply(seq_along(x), function(i) as.character( Views(x[[i]], mi0[[i]]))))
 [1] "TTCACAGC" "TTCCCAGA" "TTCCAAGC" "TTCCCAGC" "TTCTCAGC" "TTCCCAGC" "TTCCCATC" "GTCCCAGC" "TTCCCAGG" "TTACCAGC" "TTCCCAAC"

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Chris S. ▴ 320

0

Entering edit mode

Thanks alot and apologies for my not reproducible problem! I can reproduce your suggestions with no problem at all and get my final fasta of matched sequences. But If I apply this to my data, where fasta is the subject and IgA is the Mindex file returned from vmatchpattern:

> x <- lapply(seq_along(fasta), function(i) as.character(Views(fasta[[i]], IgA[[i]])))
 Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views

So I'm still getting the error message and I cant explain why. Any idea?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by thiele.ben • 0

0

Entering edit mode

I'm not sure. Try running traceback() after you get the error. And can you run Views(fasta[[1]], IgA[[1]]) without errors? If so, drop lapply and run a loop with a print statement and see where it goes out of limits

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Chris S. ▴ 320

0

Entering edit mode

Views(fasta[[1]], IgA[[1]]) is working fine. When I run traceback it returns:

 Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views 
8 stop("'x' has \"out of limits\" views") 
7 fromXStringViewsToStringSet(x, out.of.limits = out.of.limits, 
    use.names = use.names) 
6 .local(x, ...) 
5 as.character(Views(fasta[[i]], IgA[[i]])) 
4 as.character(Views(fasta[[i]], IgA[[i]])) 
3 FUN(1:14980[[1574L]], ...) 
2 lapply(seq_along(fasta), function(i) as.character(Views(fasta[[i]], 
    IgA[[i]]))) 
1 lapply(seq_along(fasta), function(i) as.character(Views(fasta[[i]], 
    IgA[[i]])))

Returning with Debug gives me:

function (x, out.of.limits = c("ok", "warning", "error"), use.names = FALSE) 
{
    out.of.limits <- match.arg(out.of.limits)
    ans_ranges <- restrict(as(x, "IRanges"), start = 1L, end = nchar(subject(x)), 
        keep.all.ranges = TRUE, use.names = use.names)
    if (out.of.limits != "ok" && any(width(ans_ranges) < width(x))) {
        if (out.of.limits == "warning") 
            warning("trimming \"out of limits\" views")
        else stop("'x' has \"out of limits\" views")
    }
    extractList(subject(x), ans_ranges)
}

I'm not really sure how to interpret that, cause I'm just getting started with R and this Biostrings package. I'm also a bit lost with your recommendation to run a loop with print statements. Please excuse my ignorance!

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by thiele.ben • 0

0

Entering edit mode

I don't know either - the experts on the BioC mialing list may be able to help more, but you should post a reproducible example there. You could also try using toString instead of as.character. Also, to change lapply to a loop, try this to catch where the error occurs.

z<-vector("list", length(fasta))
for(i in seq_along(fasta)){ print(i); z[[i]]<- as.character(Views(fasta[[i]], IgA[[i]])) }

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Chris S. ▴ 320

0

Entering edit mode

Thank you so much for your advice! With toString instead of as.character I get the same error message. And if I'm looping it like you suggested I get also the same error when getting to 1574.

[1] 1574
Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views

When I take a look at that specific Sequence with Views() it looks exactly like all the others before. So maybe the function cannot handle so much data!? To get the matched Sequence anyway I did a little work around. Ill post it as an answer below...

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by thiele.ben • 0

0

Entering edit mode

Can you attach the output from fasta[[1574]] and IgA[[1574]]?

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Chris S. ▴ 320

0

Entering edit mode

> IgA[[1574]]
IRanges of length 1
    start end width
[1]     0  15    16

>  fasta[[1574]]
  250-letter "DNAString" instance
seq: ACCACGTTCCCATCTGGCTGGGTGCTGCAGAGGCTCAGCGCCCCGGCCGTAATGGCCACTCTGCGTAGATCGGAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAACAGAAAACTGTGGACGGGCAGCACAATAACAGAATCACGAGGCGACGAACCCTAGATACAGATAGGTGACATGATCCGACCACTAACCAACGCTCGAAACACAAACCACTTC

When I compare it to the sequences before, it doesn't look different..

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by thiele.ben • 0

0

Entering edit mode

With a start at 0, you'll get a space in the first position and then an error converting the View to a string. You could probably check for that in the loop or better yet track down why vmatchPattern returns a start at 0.

fasta<-DNAStringSet("ACCACGTTCCCATCTGGCTGGGTGCTGCAGAGGCTCAGCGCCCCGGCCGTAATGGCCACT") 
IgA<- IRanges(0,15)

Views(fasta[[1]], IgA)
  Views on a 60-letter DNAString subject
subject: ACCACGTTCCCATCTGGCTGGGTGCTGCAGAGGCTCAGCGCCCCGGCCGTAATGGCCACT
views:
    start end width
[1]     0  15    16 [ ACCACGTTCCCATCT]

as.character(Views(fasta[[1]], IgA))
Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  :
  'x' has "out of limits" views

ADD REPLY • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Chris S. ▴ 320

Ram · Answer 2 · 2014-11-05

Why I'm getting the above mentioned error still seems to be a mystery... Anyway for retrieving the matched sequences I did the following:

x <- readDNAStringSet(system.file("extdata", "someORF.fa", package="Biostrings"))
mi0 <- vmatchPattern("TTCCCAGC", x, max.mismatch=1)
mi0.mat <- as.matrix(unlist(mi0))
mi0.frame <- data.frame(mi0.mat)
x.frame <- data.frame(as.character(x))
mi0_table <-subset(x.frame,rownames(x.frame) %in% rownames(mi0.frame))

library(seqRFLP)
dataframe2fas(mi0_table, "~/desired path with filename for export")