Question

Matching short reads to a reference sequence considering wildcards with the Biostrings package

0

Entering edit mode

9.6 years ago

vitor.aguiar ▴ 10

Given a sequence my_seq:

my_seq = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"

We can make a collection of 10-bp reads my_reads:

my_reads = substring(my_seq,1:(nchar(my_seq)-9),10:nchar(my_seq))

which I want to map to a target sequence targetseq:

targetseq = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"

treating "N" in my_reads as a wildcard, so all reads map to targetseq.

I want to do this kind of string matching in a large database of strings, and I found the Bioconductor package Biostrings be very efficient to do that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here my_reads) into a dictionary of class "pdict" using the function PDidc(). You can use this 'dictionary' later in functions like countPDict() to count matches against a target sequence.

In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, PDict() will complain about the read 15 if you use a trusted band of minimum width = 1 like:

> PDict(my_reads[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
    non base DNA letter found in Trusted Band for pattern 15

because the "N" is right in the trusted band.

> PDict(my_reads[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
    - with a head of width 4
    - with a Trusted Band of width 1
    - with a tail of width 5

Is there any way to circumvent this limitation imposed by Biostrings? Or does anyone have a different idea? I'm almost giving up and using a following approach: generate reads and an index with the target sequence (saving both as fasta), use a mapper like GEM to map the reads to the index, then parse the output file to count the matches!

P.S.: I also have "N" in the target sequence sometimes, but I guess I can address that by setting the argument 'fixed' in countPDict and related functions to 'FALSE'.

RNA-Seq R Biostrings • 2.5k views

ADD COMMENT • link 9.6 years ago by vitor.aguiar ▴ 10

0

Entering edit mode

I could solve that problem by creating different dictionaries of reads, taking all reads with N's in the trusted band and creating a separate dictionary with a trusted band with a different position. Then I could perform my match reads against target sequence with these different dictionary of reads and sum the matches across dictionaries. Far from elegant though.

ADD REPLY • link 9.6 years ago by vitor.aguiar ▴ 10