Extract matched pattern with Index from fasta file in Biostrings R
2
0
Entering edit mode
7.0 years ago
thiele.ben • 0

Hey,Ive been working on extracting Sequences from a fasta file for some days now and finally decided to post my problem here. I have a fasta file containing approximately 15000 Sequences and I'd like to split these Sequences based on a certain pattern. Im using the Biostrings package in R with the following functions:

library(Biostrings)
fasta <- readDNAStringSet("~/Downloads/NGS/IgsubPrimer/fasta", "fasta")
matched <- vmatchPattern("GACCACGTTCCCATCT", fasta, max.mismatch=1)

In the docus I find that vmachPattern gives back an MIndex file on which basis it is possible to extract all matching sequences and write them to a new fasta file. Ive tried extractAllMatches(fasta, matched) which seems just to work with XString files and not in this case. Ive also tried the following line from some documentary:

x <- lapply(seq(along=fasta), function(x) as.character(Views(fasta[[x]], start(matched[[x]]), end(matched[[x]]))))

which gives:

Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views 

For a hint I would be very greatful, Im really stock atm... :(

 

R Biostrings sequencing • 4.7k views
ADD COMMENT
2
Entering edit mode
7.0 years ago
Chris S. ▴ 310

First, please try to use a reproducible example.  For example, check ?readDNAStringSet and you'll see this command that everyone with Biostrings can run

x <- readDNAStringSet(system.file("extdata", "someORF.fa", package="Biostrings"))

Next, run vmatchPattern

mi0 <- vmatchPattern("TTCCCAGC", x, max.mismatch=1)

Finally, check ?matchPDict for some code to run on MIndex objects including Views.   Again, no one knows what the "IgA" object is in your example

startIndex(mi0)

Views(x[[1]], mi0[[1]])

    start  end width
[1]  2309 2316     8 [TTCACAGC]
[2]  5077 5084     8 [TTCCCAGA]
[3]  5380 5387     8 [TTCCAAGC]

mapply(Views, x, mi0)  #OR
lapply(seq_along(x), function(i) Views(x[[i]], mi0[[i]]))

 unlist(lapply(seq_along(x), function(i) as.character( Views(x[[i]], mi0[[i]]))))
 [1] "TTCACAGC" "TTCCCAGA" "TTCCAAGC" "TTCCCAGC" "TTCTCAGC" "TTCCCAGC" "TTCCCATC" "GTCCCAGC" "TTCCCAGG" "TTACCAGC" "TTCCCAAC"

 

 

 

 

ADD COMMENT
0
Entering edit mode

Thanks alot and apologies for my not reproducible problem! I can reproduce your suggestions with no problem at all and get my final fasta of matched sequences. But If I apply this to my data, where fasta is the subject and IgA is the Mindex file returned from vmatchpattern:

> x <- lapply(seq_along(fasta), function(i) as.character(Views(fasta[[i]], IgA[[i]])))
 Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views 

So Im still getting the error message and I cant explain why. Any idea?

 

ADD REPLY
0
Entering edit mode

I'm not sure.  Try running traceback() after you get the error.  And can you run Views(fasta[[1]], IgA[[1]]) without errors?   If so, drop lapply and run a loop with a print statement and see where it goes out of limits

ADD REPLY
0
Entering edit mode

Views(fasta[[1]], IgA[[1]]) is working fine. When I run traceback it returns:

 Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views 
8 stop("'x' has \"out of limits\" views") 
7 fromXStringViewsToStringSet(x, out.of.limits = out.of.limits, 
    use.names = use.names) 
6 .local(x, ...) 
5 as.character(Views(fasta[[i]], IgA[[i]])) 
4 as.character(Views(fasta[[i]], IgA[[i]])) 
3 FUN(1:14980[[1574L]], ...) 
2 lapply(seq_along(fasta), function(i) as.character(Views(fasta[[i]], 
    IgA[[i]]))) 
1 lapply(seq_along(fasta), function(i) as.character(Views(fasta[[i]], 
    IgA[[i]]))) 

Returning with Debug gives me:

function (x, out.of.limits = c("ok", "warning", "error"), use.names = FALSE) 
{
    out.of.limits <- match.arg(out.of.limits)
    ans_ranges <- restrict(as(x, "IRanges"), start = 1L, end = nchar(subject(x)), 
        keep.all.ranges = TRUE, use.names = use.names)
    if (out.of.limits != "ok" && any(width(ans_ranges) < width(x))) {
        if (out.of.limits == "warning") 
            warning("trimming \"out of limits\" views")
        else stop("'x' has \"out of limits\" views")
    }
    extractList(subject(x), ans_ranges)
}

Im not really sure how to interpret that, cause Im just getting started with R and this Biostrings package. Im also abit lost with your recommendation to run a loop with print statements. Please excuse my ignorance!

ADD REPLY
0
Entering edit mode

I don't know either - the  experts on the BioC mialing list may be able to help more, but you should post a reproducible example there.  You could also try using toString instead of as.character.   Also, to change lapply to a loop, try this to catch where the error occurs.

z<-vector("list", length(fasta))
for(i in seq_along(fasta)){ print(i); z[[i]]<- as.character(Views(fasta[[i]], IgA[[i]])) }
ADD REPLY
0
Entering edit mode

Thank you so much for your advice! With toString instead of as.character I get the same error message. And if Im looping it like you suggested I get also the same error when getting to 1574.

[1] 1574
Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  : 
  'x' has "out of limits" views

When I take a look at that specific Sequence with Views() it looks exactly like all the others before. So maybe the function cannot handle so much data!? To get the matched Sequence anyway I did a little work around. Ill post it as an answer below...

ADD REPLY
0
Entering edit mode

Can you attach the output from fasta[[1574]] and IgA[[1574]]?

ADD REPLY
0
Entering edit mode
> IgA[[1574]]
IRanges of length 1
    start end width
[1]     0  15    16

>  fasta[[1574]]
  250-letter "DNAString" instance
seq: ACCACGTTCCCATCTGGCTGGGTGCTGCAGAGGCTCAGCGCCCCGGCCGTAATGGCCACTCTGCGTAGATCGGAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATTAAAAAAAAAAAAAAACAGAAAACTGTGGACGGGCAGCACAATAACAGAATCACGAGGCGACGAACCCTAGATACAGATAGGTGACATGATCCGACCACTAACCAACGCTCGAAACACAAACCACTTC

When I compare it to the sequences before, it doesn't look different..

ADD REPLY
0
Entering edit mode

With a start at 0, you'll get a space in the first position and then an error converting the View to a string.  You could probably check for that in the loop or better yet track down why vmatchPattern returns a start at 0.

fasta<-DNAStringSet("ACCACGTTCCCATCTGGCTGGGTGCTGCAGAGGCTCAGCGCCCCGGCCGTAATGGCCACT") 
IgA<- IRanges(0,15)

Views(fasta[[1]], IgA)
  Views on a 60-letter DNAString subject
subject: ACCACGTTCCCATCTGGCTGGGTGCTGCAGAGGCTCAGCGCCCCGGCCGTAATGGCCACT
views:
    start end width
[1]     0  15    16 [ ACCACGTTCCCATCT]

as.character(Views(fasta[[1]], IgA))
Error in fromXStringViewsToStringSet(x, out.of.limits = out.of.limits,  :
  'x' has "out of limits" views

 

ADD REPLY
0
Entering edit mode
7.0 years ago
thiele.ben • 0

Why Im getting the above mentioned error still seems to be a mystery... Anyway for retrieving the matched sequences I did the following:

x <- readDNAStringSet(system.file("extdata", "someORF.fa", package="Biostrings"))
mi0 <- vmatchPattern("TTCCCAGC", x, max.mismatch=1)
mi0.mat <- as.matrix(unlist(mi0))
mi0.frame <- data.frame(mi0.mat)
x.frame <- data.frame(as.character(x))
mi0_table <-subset(x.frame,rownames(x.frame) %in% rownames(mi0.frame))

library(seqRFLP)
dataframe2fas(mi0_table, "~/desired path with filename for export")

 

 

 

 
ADD COMMENT

Login before adding your answer.

Traffic: 2455 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6