Question: microRNA Illumina Sequencing - Very low alignment rate
4
gravatar for wynstep
4.1 years ago by
wynstep70
wynstep70 wrote:

Dear colleagues,
I want to share my strange experience, to ask your opinion and help.
I'm working on microRNA sequencing of a never-studied plant. I received raw data from our service company, as FASTQ file. 30 
million read, 50bp length.
I already did RNAseq analysis and I was quite familiar with several tools such as fastqc, trimmomatic, bowtie2, bowtie etc...

I removed the 3' adapter provided to me by the service company. They used Illumina technology. Quality control confirmed that the adaptor sequence is right.
After adaptor trimming, I have great peaks between 19 and 39 bp (first strange thing for me...) and also some minor peaks between 39 and 50...

I downloaded the "hairpin.fa" file from MirBase, without filtering for a specific organism, changing all U in T and removing items with uncommon chars (Y,X,K etc...).
The alignment rate at this step is really low... about 3%

So I did the alignment again, this time versus the A.thaliana genome. The alignment rate increased to 65% (reads aligned only 1 time about 15%). if I launch htseq-count in order to count alignments in genome regions coding for microrna, I found 0 values for all!

I really tried everything and I don't know how to solve this problem:

  • adaptor trimming with: trimmomatic, cutadapt, fastx-clipper, novoalign
  • mapping with: bowtie, bowtie2 (using local parameter...), mirdeep2, novoalign

Waiting for your help and suggestions!

Thank you in advance

sequencing rna-seq assembly • 3.4k views
ADD COMMENTlink modified 4.1 years ago • written 4.1 years ago by wynstep70
2

If you got 65% alignment to the genome, you should probably take a look at the alignments with regard to location.  It sounds suspiciously like you got DNA sequences and not miRNA, but a quick look at the alignments will give you a good sense of whether that is the problem.

ADD REPLYlink written 4.1 years ago by Sean Davis25k
1

You can also check alignment against non-coding RNA http://www.ensembl.org/info/genome/genebuild/ncrna.html. A collaborator got better with subsequent preparations from IIRC 50% NCRNA (a lot or rRNA, tRNA) down to 10%. In 2 preparations at the beginning I could also detect a CT rich sequence that was neither the used adaptor nor any known blast result instead of the adaptor. This lead to too long sequences - similar to your case. If you really have a lot longer than 23nt, I would check also check these for common 3' sequences.

ADD REPLYlink written 4.1 years ago by Ido Tamir4.9k

Ido thank you for your suggestion! I used miRDeep2 to align again my reads and I've found few miRNAs expressed...

I will try to filter sequences longer to 23 nts, search for common 3' sequences (I have to trim them???) and then try alignments with ncrna fasta files.

Thank for your help!

ADD REPLYlink written 4.1 years ago by wynstep70

Sorry I took long to answer, but BioStars does not allow me to answer more then 5 times in less then 6 hours. Restrictions for potential spammers :)

Update: using miRDeep2 with hairpin and mature sequences for all plant species, in order to find levels of known miRNAs, gave to me some results. Few miRNAs, expressed in lot of copies!

I have read a paper (http://www.ncbi.nlm.nih.gov/pubmed/22647250) where this strange alignment seems to be due to a ligation bias of illumina adapter. Do you have any idea on how to try to solve or reduce this problem? 

as Ido Tamir suggested I will also try to align all discarded reads, against the RFam fasta, in order to find possible matches with other ncRNAs... what do you think?

ADD REPLYlink written 4.1 years ago by wynstep70

We have adaptors with 4 random nucleotides at the 3'end. i.e. the ligation product is ADAPTORNNNNSAMPLEINSERT . Of course then you have to adjust the pipeline appropriately. The idea is that the random 4 nt should overcome ligation bias and ligate with everything. like http://www.biooscientific.com/Portals/0/White%20Papers/Randomized-Adapters-for-Reducing-Bias-in-Small-RNA-Seq-Libraries.pdf

I forgot now if both adaptors have the random sequence or just the 5' one.

ADD REPLYlink written 4.1 years ago by Ido Tamir4.9k

I filtered all the reads with more than 23 nucleotides. Launched versus RFam and found that about 97% are ncRNA but not miRNAs! Is it normal?

 

ADD REPLYlink written 4.1 years ago by wynstep70

how many percent are these? maybe size selection was not good?

ADD REPLYlink written 4.1 years ago by Ido Tamir4.9k

In what sense? I used all the discarded reads from the alignment (using mirdeep) versus precursor and mature mirnas. is the 97% of all the discarded reads (about the 97% of all the initial aligned reads :P).

I don't know if I answered to your question.

I also found a strange thing inside the discarded reads. I show you an example for the most "populated" reads inside my file:

>seq_0_x1357010 TCCGTTGTCGTCCAGCGGTTAGGATATCTGGCT
>seq_1357010_x942324 GCGCCTGTAGCTCAGTGGA
>seq_2299334_x259229 TCCGTTGTCGTCCAGCGGTTAGGATATCTGGC
>seq_2558563_x250361 CCCAGTCCCGAACCCGTCGGC
>seq_2808924_x194922 GTCGTTGTAGTATAGTGGTGAGTATTCCCGCCT
>seq_3003846_x174056 GGTGGCTGTAGTTTAGTGGTAAGAATTCCACGTTGT
>seq_3177902_x165695 TCCGTTGTAGTCTAGTCGGTTAGGATATTCGGCT
>seq_3343597_x130364 CCAGTCCCGAACCCGTCGGC
>seq_3473961_x104610 TCCGTTGTCGTCCAGCGGTTAGGATATCTGGCTTTC

I tried to highlight common regions. There's no common sequences on 3'. It seems that one is a sub-sequence of another... These reads, for example, align with tRNAs into RFam...

Thanks a lot!!!

ADD REPLYlink written 4.1 years ago by wynstep70

If you used htseq-count, then what did you use as the annotation file? Wouldn't it just make more sense to count how many non-multimappers align to each contig? Better yet, have you tried using tools like mirDeep, which are designed exactly with microRNAs in mind?

ADD REPLYlink written 4.1 years ago by Devon Ryan84k

When I aligned against the A. thaliana genome, I used the annotation file provided by MirBase (gff3 file).
Against the precursors file, I desisted to continue with htseq-count, because of a too much low alignment rate.

I also tried with mirDeep2, without success! :( the alignment rate is always too much low... I was thinking that maybe is a problem of adaptor trimming... what do you think?

ADD REPLYlink written 4.1 years ago by wynstep70

This is the results of alignment without trimming the adapter:

bowtie2 --local -L 20 -N 0 -x hairpin -U Original_Sequence.fastq -S alignment.sam

31290964 reads; of these:

  31290964 (100.00%) were unpaired; of these:

    30973063 (98.98%) aligned 0 times

    94741 (0.30%) aligned exactly 1 time

    223160 (0.71%) aligned >1 times

1.02% overall alignment rate

If I provide the trimmed and collapsed fast, the percentage is around 4%

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by wynstep70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1042 users visited in the last hour