Matching short reads to a reference sequence considering wildcards with the Biostrings package
0
0
Entering edit mode
7.1 years ago
vitor.aguiar ▴ 10

Given a sequence my_seq:

my_seq = "AGCTTCATGAAGCTGAGTNGGACGCGATGATGCG"

We can make a collection of 10-bp reads my_reads:

my_reads = substring(my_seq,1:(nchar(my_seq)-9),10:nchar(my_seq))

which I want to map to a target sequence targetseq:

targetseq = "AGCTTCATGAAGCTGAGTGGGACGCGATGATGCGACTAGGGACCTTAGCAGC"

treating "N" in my_reads as a wildcard, so all reads map to targetseq.

I want to do this kind of string matching in a large database of strings, and I found the Bioconductor package Biostrings be very efficient to do that. But, in order to be efficient, Biostrings requires you to convert your collection of substrings (here my_reads) into a dictionary of class "pdict" using the function PDidc(). You can use this 'dictionary' later in functions like countPDict() to count matches against a target sequence.

In order to use wildcards, you have to divide your dictionary in 3 parts: a head (left), a trusted band (middle) and a tail (right). You can only have wildcards, like "N", in the head or tail, but not in the trusted band, and you cannot have a trusted band of width = 0. So, for example, PDict() will complain about the read 15 if you use a trusted band of minimum width = 1 like:

> PDict(my_reads[1:15],tb.start=5,tb.end=5)
Error in .Call2("ACtree2_build", tb, pp_exclude, base_codes, nodebuf_ptr,  :
    non base DNA letter found in Trusted Band for pattern 15

because the "N" is right in the trusted band.

> PDict(my_reads[1:14],tb.start=5,tb.end=5) #is OK
TB_PDict object of length 14 and width 10 (preprocessing algo="ACtree2"):
    - with a head of width 4
    - with a Trusted Band of width 1
    - with a tail of width 5

Is there any way to circumvent this limitation imposed by Biostrings? Or does anyone have a different idea? I'm almost giving up and using a following approach: generate reads and an index with the target sequence (saving both as fasta), use a mapper like GEM to map the reads to the index, then parse the output file to count the matches!

P.S.: I also have "N" in the target sequence sometimes, but I guess I can address that by setting the argument 'fixed' in countPDict and related functions to 'FALSE'.

RNA-Seq R Biostrings • 1.9k views
ADD COMMENT
0
Entering edit mode

I could solve that problem by creating different dictionaries of reads, taking all reads with N's in the trusted band and creating a separate dictionary with a trusted band with a different position. Then I could perform my match reads against target sequence with these different dictionary of reads and sum the matches across dictionaries. Far from elegant though.

ADD REPLY

Login before adding your answer.

Traffic: 2312 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6