Question: Pattern Matching - Biological Compatibility Between Query Reads And Reference Sequence?
gravatar for Agatha
8.5 years ago by
Agatha340 wrote:


I was trying to manually align RNA-seq reads to a RNA sequence in R using Biostrings.

Considering that my sequences are collapsed

 A DNAStringSet instance of length 1400027
      width seq                                                               names               

  [1]    22 GTCTGTGATGAATTGCTTTGAC                                            1-1694429
  [2]    22 AGTCTGTGATGAATTGCTTTGA                                            2-1546669
  [3]    22 GCATTGGTGGTTCAGTGGTAGA                                            3-928598

And I want to map them using my own algorithm that uses the matchPattern function to a pre-miRNA seq downloaded from mirBASE.

72-letter "RNAString" instance

How should I actually convert my reads in order to be biologically compatible? Is it relevant to compare match(DNAString, DNAString(reverseComplement(RNAString_instance)) -considering how the reads are generated using Illumina?

Otherwise, how should I convert either the source or the destination string for a correct mapping?

illumina cdna rna R mirna • 1.9k views
ADD COMMENTlink modified 8.5 years ago by Jeremy Leipzig19k • written 8.5 years ago by Agatha340

I would recommend to also align the reads against the reference genome unless the protocol somehow selected specifically (pre)miRNA, because otherwise you might be seeing repeated sequences from normal transcripts and that might go unnoticed. Then check the regions of interest in the genome for high coverage, (possibly after removing duplicate hits).

ADD REPLYlink written 8.5 years ago by Michael Dondrup47k
gravatar for Jeremy Leipzig
8.5 years ago by
Philadelphia, PA
Jeremy Leipzig19k wrote:

As much as I like R I am ambivalent to recommend it for first round of mapping -> Bowtie/BWA/Novoalign/BFast/etc.. should be able to properly map reads to hairpins. Once they are mapped, realigning using dynamic programming is not too costly.

Here is some code I use to do that (sorry it's a bit messy):


 #used for target positions
  #used for identifying the edits

  #pattern indels is given relative to the subject

  #we won't see deletions on the target side


  #remove N's



ADD COMMENTlink written 8.5 years ago by Jeremy Leipzig19k

@Jeremy Leipzig-thank you, I had a look at the code, I will try it asap to see what I am getting. As I have already mentioned, I did align them using bowtie, and none of the reads mapped to the a certain pre-miRNA (true positive)..Even though it sounds trivial, i think it has something to do with what Michael said, whether the sequencing platform is single or strand specific..and according to that I will have to transform my ref sequences into a format that is compatible with my query reads. Do I need to follow any other preprocessing steps of the data before mapping?

ADD REPLYlink written 8.5 years ago by Agatha340

@Jeremy Leipzig I meant: should my reads be as they are (dna reads) mapped to the hairpins? Maybe this may sound trivial but this is my first contact with sequencing data and I don't know exactly how to approach it, when I cannot find what I am looking for, knowing that it should be there.

ADD REPLYlink written 8.5 years ago by Agatha340

well it could be one of a dozen or so things - but all aligners know about reverse complements, so that is not one of them. using a simple grep (i.e. search in your favorite text editor) find a read that you think should map to a hairpin and post it (and the hairpin) as a new Biostar question entitled 'Why didn't this read map to this sequence"

ADD REPLYlink written 8.5 years ago by Jeremy Leipzig19k
gravatar for Michael Dondrup
8.5 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

Whether or not you have to consider matches in both strands depends on whether or not your protocol is strand specific. Afaik, there are both strand specific and non-strand specific sequencing protocols for RNA-seq with Illumina around, but you should ask the sequencing lab about this. Note that the name of the function is matchPattern not match, and that pattern must be a single string or DNAString object not a DNAStringSet.

If you want to use a mapping of reads to a database of multiple RNA sequences (and want to do that in R of course) then matchPDict with a preprocessed dictionary might be much faster. In addition, it might be reasonable to check if there is any chance of mismatches and if the tools allow mismatches and or indels. If that is required (matchPDict supports mismatches but no indels, matchPattern does but might be a bit slow) and you have a lot of sequences, an external tool might be a better choice.

Edit: if you don't know and cannot find out if the protocol was strand-specific, then I would simply assume it was not and test for both strands, better than to miss out on some hits.

ADD COMMENTlink modified 8.5 years ago • written 8.5 years ago by Michael Dondrup47k

@Michael Dondrup- yes I am aware of that, I tried to write as pseudocode..I want to loop through the reads, and for each one of them, test where it aligns to the rna seq with different mismathces..I got my data from NCBI so I am not sure if I can get that info from somewhere..thank you

ADD REPLYlink written 8.5 years ago by Agatha340

Well, then I would simply test both strands. How many sequences do you have and how large is your reference database? If they are of the size of a real illumina run, the R functions might be too slow anyway.

ADD REPLYlink written 8.5 years ago by Michael Dondrup47k

Around 1400000 reads and 1527 ref sequences..I have also tried Bowtie, but for some reason, the reads did not map to a miRNA that is very present in the tissues from which the rna was extracted and sequenced and I am not sure what I did wrong so I am trying to do it manually. How do I know, when I test both strands which one is correct or not?

ADD REPLYlink written 8.5 years ago by Agatha340

I think that will take forever with the R functions, especially if you loop over the 1.4e7 reads. Try bwa or bfast then if you cannot get a match of a true positive example. With the two strands, actually there is no way of telling directly, but you can look at the surrounding sequence of the alignments and check if they fulfill the biological properties set for your pre-miRNA.

ADD REPLYlink written 8.5 years ago by Michael Dondrup47k

thank you, Michael. One more question, if possible. What could be the explanation of the miRNA missing in the alignment file? Is it related to the preprocessing steps? Is it possible for the respective miRNA not to be present in a certain sample? Some of the miRNAs reported by the already published paper as miRNAs with max number of hits have the highest number of hits in my alignment...however, that miRNA it's completely absent (that miRNA is of interest for my analysis)

ADD REPLYlink written 8.5 years ago by Agatha340
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1428 users visited in the last hour