Question

Pattern Matching - Biological Compatibility Between Query Reads And Reference Sequence?

2

Entering edit mode

12.2 years ago

Agatha ▴ 350

Hi,

I was trying to manually align RNA-seq reads to a RNA sequence in R using Biostrings.

Considering that my sequences are collapsed

 A DNAStringSet instance of length 1400027
      width seq                                                               names               

  [1]    22 GTCTGTGATGAATTGCTTTGAC                                            1-1694429
  [2]    22 AGTCTGTGATGAATTGCTTTGA                                            2-1546669
  [3]    22 GCATTGGTGGTTCAGTGGTAGA                                            3-928598

And I want to map them using my own algorithm that uses the matchPattern function to a pre-miRNA seq downloaded from mirBASE.

72-letter "RNAString" instance
 seq: UGUCGGGUAGCUUAUCAGACUGAUGUUGACUGUUGAAUCUCAUGGCAACACCAGUCGAUGGGCUGUCUGACA

How should I actually convert my reads in order to be biologically compatible? Is it relevant to compare match(DNAString, DNAString(reverseComplement(RNAString_instance)) -considering how the reads are generated using Illumina?

Otherwise, how should I convert either the source or the destination string for a correct mapping?

illumina rna r mirna cdna • 3.1k views

ADD COMMENT • link updated 12.2 years ago by Jeremy Leipzig 22k • written 12.2 years ago by Agatha ▴ 350

1

Entering edit mode

I would recommend to also align the reads against the reference genome unless the protocol somehow selected specifically (pre)miRNA, because otherwise you might be seeing repeated sequences from normal transcripts and that might go unnoticed. Then check the regions of interest in the genome for high coverage, (possibly after removing duplicate hits).

ADD REPLY • link 12.2 years ago by Michael 54k

score 2 · Answer 1 · 2012-01-23

As much as I like R I am ambivalent to recommend it for first round of mapping -> Bowtie/BWA/Novoalign/BFast/etc.. should be able to properly map reads to hairpins. Once they are mapped, realigning using dynamic programming is not too costly.

Here is some code I use to do that (sorry it's a bit messy):

pwa<-pairwiseAlignment(targetseq,queryseq,type="local-global")

 #used for target positions
  target_poss_sub<-unlist(pwa@pattern@mismatch)
  #used for identifying the edits
  seq_poss_sub<-unlist(pwa@subject@mismatch)

  #pattern indels is given relative to the subject
  #pattern: [15] TCCCTGAGACCCTT-TAACCTGT 
  #subject:  [1] TCCCTGAGACCCTTTTAACCTGT 
  #"30i"
  target_poss_indel_ins<-concat(as.character(start(unlist(pwa@pattern@indel))+pwa@pattern@range@start),'i',sep='')
  seq_poss_indel_ins<-start(unlist(pwa@pattern@indel))

  #we won't see deletions on the target side
  #1234567890123456789012345678901234567
  #TGCCAGTCTCTAGGTCCCTGAGACCCTTTAACCTGTGAGGACATCCAGGGTCACAGGTGA
  #              TCCCTGAGACCCTT-AACCTGTG
  #pattern: [15] TCCCTGAGACCCTTTAACCTGTG 
  #subject:  [1] TCCCTGAGACCCTT-AACCTGTG 
  target_poss_indel_del<-concat(as.character(start(unlist(pwa@subject@indel))+pwa@pattern@range@start-1),'d',sep='')
  seq_poss_indel_del<-start(unlist(pwa@subject@indel))

  target_poss<-c(target_poss_sub,target_poss_indel_ins,target_poss_indel_del)
  seq_poss<-c(seq_poss_sub,seq_poss_indel_del,seq_poss_indel_ins)

  nucs<-strsplit(queryseq,'')[[1]][seq_poss]
  edits<-data.frame(nucs=nucs,poss=target_poss)
  editList<-list()
  if(N_is_reference){
    editList<-as.list(edits[edits$nucs!='N',])
  }else{
    editList<-as.list(edits)
  }
  #remove N's
  maskedSeq<-maskedEdits(as.character(subject(pwa)),as.character(pattern(pwa)))
  spacedSeq<-str_pad(maskedSeq,pwa@pattern@range@start+pwa@pattern@range@width-1)

  #formatEdits(nucs,target_poss)

  list(editList=editList,spacedSeq=spacedSeq)

maskedEdits<-function(subject,pattern){
  concat(apply(X=rbind(strsplit(subject,'')[[1]],strsplit(pattern,'')[[1]]),MARGIN=2,FUN=function(x){if(x[1]=='N'){x[2]}else{if(x[1]=='-'){''}else{x[1]}}}),collapse="")
}

score 1 · Answer 2 · 2012-01-22

1

Entering edit mode

12.2 years ago

Michael 54k

Whether or not you have to consider matches in both strands depends on whether or not your protocol is strand specific. Afaik, there are both strand specific and non-strand specific sequencing protocols for RNA-seq with Illumina around, but you should ask the sequencing lab about this. Note that the name of the function is matchPattern not match, and that pattern must be a single string or DNAString object not a DNAStringSet.

If you want to use a mapping of reads to a database of multiple RNA sequences (and want to do that in R of course) then matchPDict with a preprocessed dictionary might be much faster. In addition, it might be reasonable to check if there is any chance of mismatches and if the tools allow mismatches and or indels. If that is required (matchPDict supports mismatches but no indels, matchPattern does but might be a bit slow) and you have a lot of sequences, an external tool might be a better choice.

Edit: if you don't know and cannot find out if the protocol was strand-specific, then I would simply assume it was not and test for both strands, better than to miss out on some hits.

ADD COMMENT • link 12.2 years ago by Michael 54k

0

Entering edit mode

@Michael Dondrup- yes I am aware of that, I tried to write as pseudocode..I want to loop through the reads, and for each one of them, test where it aligns to the rna seq with different mismathces..I got my data from NCBI so I am not sure if I can get that info from somewhere..thank you

ADD REPLY • link 12.2 years ago by Agatha ▴ 350

0

Entering edit mode

Well, then I would simply test both strands. How many sequences do you have and how large is your reference database? If they are of the size of a real illumina run, the R functions might be too slow anyway.

ADD REPLY • link 12.2 years ago by Michael 54k

0

Entering edit mode

Around 1400000 reads and 1527 ref sequences..I have also tried Bowtie, but for some reason, the reads did not map to a miRNA that is very present in the tissues from which the rna was extracted and sequenced and I am not sure what I did wrong so I am trying to do it manually. How do I know, when I test both strands which one is correct or not?

ADD REPLY • link 12.2 years ago by Agatha ▴ 350

0

Entering edit mode

I think that will take forever with the R functions, especially if you loop over the 1.4e7 reads. Try bwa or bfast then if you cannot get a match of a true positive example. With the two strands, actually there is no way of telling directly, but you can look at the surrounding sequence of the alignments and check if they fulfill the biological properties set for your pre-miRNA.

ADD REPLY • link 12.2 years ago by Michael 54k

0

Entering edit mode

thank you, Michael. One more question, if possible. What could be the explanation of the miRNA missing in the alignment file? Is it related to the preprocessing steps? Is it possible for the respective miRNA not to be present in a certain sample? Some of the miRNAs reported by the already published paper as miRNAs with max number of hits have the highest number of hits in my alignment...however, that miRNA it's completely absent (that miRNA is of interest for my analysis)

ADD REPLY • link 12.2 years ago by Agatha ▴ 350