Question: miRNA mapping rate is very low.. (less than 0.03%)
5
gravatar for illinois.ks
3.8 years ago by
illinois.ks140
Korea, Republic Of
illinois.ks140 wrote:

Hello, 

I haven't worked with miRNA mapping..( the specie is mouse) It is the first time to do it.

I have read a lot of posting and web-site to map my microRNA data. 

Here is what I follow.

1) trim the adapter using cutadapt

When I looked at my fastq file (single-end), read length was 36 bp.. by inspecting the eye-ball test, I could not find my common sequences. (So, I am not sure my fastq file has been already trimed or not . So I kept both versions... )This is my command... (comes from http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data)

cutadapt -a CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
         -a TACAGTCCGACGATC \
         -a ATCTCGTATGCCGTCTTCTGCTTG \
         -e 0.1 -O 5 -m 15 \
         -o xxx_microRNA_adaprm.fastq xxx_microRNA.fastq 

 

So, I have 2 version fastq files. (xxx_microRNA.fastq, xxx_microRNA_adaprm.fastq)

2) Local bowtie2 alignment of miRNA data

(I generated the reference with bowtie-build directly from the hairpin FASTA file downloaded from miRBase.)

 > Bowtie2-build hairpin_mms.fa hairpin_mms.fa

> Bowtie2-build mature_mms.fa mature_mms.fa

> Bowtie2 –local –N 1 –L 16 –x hairpin_mms.fa –U fastq/xxx_microRNA(_adaprm).fastq  -S xxxx.sam

The bowtie2 mapping returns the following result.. (for both versions)

I shold have done something wrong...

===========================================================

2849144 reads; of these:
  2849144 (100.00%) were unpaired; of these:
    2847350 (99.94%) aligned 0 times
    933 (0.03%) aligned exactly 1 time
    861 (0.03%) aligned >1 times
0.06% overall alignment rate

 

=======================================

 

Could you please help me with this?

=====================

Oops, I forgot to convert U to T in hairpin.fa file.. (so stupid about it..) Once I run it, I will confirm the result again..

 

 

mirna ngs; • 4.4k views
ADD COMMENTlink modified 3.8 years ago by David Langenberger8.7k • written 3.8 years ago by illinois.ks140
1

Can you provide the read length distribution after clipping the adapter?

Chaning U's to T's shouldn't change the mapping result. And you don't have to delete non-mouse hairpins. Mapping the reads against all sequences is absolutely fine. 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by David Langenberger8.7k
1

If your current tests don't prove more fruitful then try blasting a few of these. Perhaps you don't really have miRNAs but instead piRNAs (that'll happen in some tissues).

ADD REPLYlink written 3.8 years ago by Devon Ryan88k

Hello Devon, thank you for your comments,

How can I map to piRNA? I am trying to search regarding to piRNA, but there aren't many info.. 

Could you give some comments?

THanks

 

ADD REPLYlink written 3.8 years ago by illinois.ks140

There are a number of sources of piRNA reference sequences: piRNABank, RNAcentral, etc. The actual mapping part isn't any different really, alignment is alignment.

ADD REPLYlink written 3.8 years ago by Devon Ryan88k
1

And check bowtie2, if it returns all multiple mappings of a read. See Attention: Bowtie2 And Multiple Hits

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by David Langenberger8.7k
6
gravatar for David Langenberger
3.8 years ago by
Deutschland
David Langenberger8.7k wrote:

Try to clip the adapter of the Illumina HiSeq 2000 miRNA protocol (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC)

And always check your read length distribution after clipping. There should be a peak at around 24nt.

ADD COMMENTlink written 3.8 years ago by David Langenberger8.7k

Hello david, thank you so much for your help.

Yes, I checked the read length distribution after cliping the adpater including the sequences you suggested.

(-a CGACAGGTTCAGAGTTCTACAGTCCGACGATC -a TACAGTCCGACGATC -a ATCTCGTATGCCGTCTTCTGCTTG -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC )

Then, I can see many reads are chipped... (orignical my read length was 36bp.. for all reads) 

here is the read distribution

--------------------------------------------

5585 15
7407 16
10051 17
13174 18
21148 19
18613 20
15112 21
16347 22
16967 23
19215 24
17687 25
25776 26
40074 27
94922 28
372518 29
1001995 30
417201 31
444296 36

=========================

as you can see, instead of 24bp, 30bp has read peak!! Do you think I need to clip more?? Then what sequences? 

When I try to map with bowties as it is (with 30 bp peak reads), I can improve my mapping rate a little, but still mapping rate is low.. 

2558088 (100.00%) were unpaired; of these:
    2553160 (99.81%) aligned 0 times
    2885 (0.11%) aligned exactly 1 time
    2043 (0.08%) aligned >1 times
0.19% overall alignment rate

 

 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by illinois.ks140

In addition, I have quick question. 

Since I have no idea about adapter sequences for my smallRNA data, I contacted web lab and got the follwing info.. However I am not sure about experiment in detail, cannot interprete this.. Could you please help me with this info to extract extact adpter sequences?? 

exp ng/ul ul Quantifluor
(ng/ul)
Quantifluor
(nM)
Realtime
(dilution.
pM)
Realtime
(nM)
index sequence reverse compliment detail
 
small RNA_A1 0.857 3.50 0.838 8.608115 0.52 0.0156693 CGTGAT ATCACG 3’ NEXTflex™ Adenylated Adapter : 5’ rApp /NNNNTGGAATTCTCGGGTGCCAAGG/ 3ddC/
5’ NEXTflex™ Adapter : 5' GUUCAGAGUUCUACAGUCCGACGAUCNNNN
NEXTflex™ RT Primer : 5’ GCCTTGGCACCCGAGAATTCCA
NEXTflex™ Barcode Primer 1*† : 5' CAAGCAGAAGACGGCATACGAGAT******GTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
NEXTflex™ Universal Primer : 5' AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA
† The barcode sequence(******) will be read as the reverse complement in the index read.
small RNA_A2 0.803 3.74 GCCTAA TTAGGC
small RNA_A3 1.041 2.88 TCAAGT ACTTGA
small RNA_C1 0.628 4.78 CTGATC GATCAG
small RNA_C2 0.798 3.76 AAGCTA TAGCTT
small RNA_C3 0.764 3.93 GTAGCC GGCTAC

 

 

These index sequences should be my adapter?? then, for each sample, I have to use different adpter?? 

So confusing... :(

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by illinois.ks140
1

Ah ok, it looks like they did multiplexing. That means that they sequenced several samples on one flowcell, in your case it looks like three A replicates and three C replicates. To keep track of the samples, every molecule from A1 gets an A1 index sequence attached and every molecule from A2 an A2 index, etc..., before put in the machine. This index sequence is normally on the 5' end of your read, while the adapter is on the 3' end (index-sequence-3'adapter). 

To get rid of the index, you have to do demultiplexing. Normally, demultiplexing takes place automatically during the instrument's data post-processing protocols. But there are tools available for that task, you can google for that. These tools will take your fastq file together with a file of your index sequence and then create (in your case) six different files (three A replicates and 3 C replicates).

I hope that helps.

That would actually shift the peak of your read-length distribution to 24. ;)

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by David Langenberger8.7k

Dear David,

Thank you so much for your helps. Your explanation helps me a lot.. ;) 

I contacted the web experiment person and he said that the fastq file I got from him has already been demultiplexed.. (oh.. :() so it seems that the low mapping rate was not caused by demultiplexing problem.. 

Then, what I can do more? Could you please give some comments ? ( it is hard to map the miRNA.. ;() 

As I mentioned, when I clip the reads with  four adapter sequences, I have 30bp peak distribution of read-length.

The mapping rate with this 30bp peak read distribution was still very low.. (less than 1 %).. 

So frustrating.. what I am missing?? 

 

FYI : this experiment has been done in miSeq machine.  ( It might help to identify the adapter sequence.)

=======================================================================

In case, I have attached the part of my fastq file content to have a idea of my sequences or adapter.. 

A1 Sample Fastq 

@M02847:38:000000000-ACP9R:1:1101:18991:1866 1:N:0:1
CTGGGTCCAGTTTTCCCAGGAATCCCAAGACTGGAA
+
AC9,,CCE-,CEFFD@@,,,,,CC@@88,,CF88,,
@M02847:38:000000000-ACP9R:1:1101:14369:1882 1:N:0:1
AGGATGTAAACATCCTACACTCTCAGCTCCAATGGA
+
---,A,C,-,<,CCCF,C,CEEEC-,;CC@,,C,,,
@M02847:38:000000000-ACP9R:1:1101:19762:1898 1:N:0:1
CTATTGAGGTAGTAGTTTGTGCTGTTTATTATGGAA
+
CC@CC99,-C,<C9,CFF9E<@F<FFGAFE9F9,,,
@M02847:38:000000000-ACP9R:1:1101:18203:1899 1:N:0:1
AGGGTGAGATGAAGCACTGTAGCTCTTCTATGGAAT
+
-6-,A,<,-C,,,,;8CF,C9,;C@EFEF9E@,,,C
@M02847:38:000000000-ACP9R:1:1101:11269:1904 1:N:0:1
TGCATGAGGTAGTAGATTGTATAGTCTCCTGGAATT

.......

A2 Sample Fastq 

@M02847:38:000000000-ACP9R:1:1101:12431:1882 1:N:0:2
GCACTGTAAACATCCCCGACTGGAAGCGATATGGAA
+
-8ACCCE9<,C9C8<@,8+@C+,,-,;+,C,@,,,,
@M02847:38:000000000-ACP9R:1:1101:14093:1890 1:N:0:2
TACGTTTAGTAGTACCGTCCCTTTGCCTTCTGGAAT
+
C9B<CFF8-C9,C<@C,D@@@EFF-6CCEEF,,,,C
@M02847:38:000000000-ACP9R:1:1101:11008:1909 1:N:0:2
CGGCTAGCTTATCAGACTGATGTTGATGATGGAATT
+
A<9BC;<FFGDGF9,,CF8,E9EF9,C<9F<9,<EF
@M02847:38:000000000-ACP9R:1:1101:11811:1918 1:N:0:2
ATGATGAGGTAGTAGGTTGTGTGGTATTATGGAATT
+
-A@8C<,,-C,,C,,,CF@F@F@7F9EF9FA8,,CE
@M02847:38:000000000-ACP9R:1:1101:14883:1918 1:N:0:2
CGCCTGAGGTAGTAGTTTGTACAGTGGTCTGGAATT
+
@<BCC<,,-C,,C9,FFG9F9E99F9,CEF88,,CE
@M02847:38:000000000-ACP9R:1:1101:12297:1923 1:N:0:2
TTGGTGAGGTAGTAGATTGTATAGTTCAACTGGAAT

 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by illinois.ks140
1

When blasting your reads against mature miRNAs from the MirBase (http://www.mirbase.org/search.shtml) I realized that the first 4 nucleotides of your reads never match and the rest perfectly, until the adapter starts. 

And when looking at your sequences, it seems pretty obvious, that the adapter I gave you (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC) is the correct one. Almost all reads have a "TGGAATT" at their ends (which is the 5' part of the adapter sequence). You can see that by eye. 

What you can try: Clip your reads using the "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC" adapter and then trim the first four nucleotides of all reads... or vice versa. Then you should be able to map your reads. At least for the subset you provided, it worked perfectly.

BUT: I would highly recommend you to find out what these first four nucleotides are. It is always important to understand the data you work with! Perhaps this post from SEQanswers might give you a hint: http://seqanswers.com/forums/archive/index.php/t-30100.html

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by David Langenberger8.7k

Dear David,

I really appreciate your helps. I got it!! 

I will try to follow your comments!  Thank you .. I have learned a lot!! 

KS 

ADD REPLYlink written 3.8 years ago by illinois.ks140

Hello David,

Thank you so much.. I am learning a lot about miRNA processing.(I only have done for mRNA mapping. )

I just finished this mapping.. 

Hm.. but still mapping rate is low..

After trimmed with adapter and cut 4 bp from front of read,  

=============================

mapping to mature.fa

2565353 reads; of these:
  2565353 (100.00%) were unpaired; of these:
    2564319 (99.96%) aligned 0 times
    504 (0.02%) aligned exactly 1 time
    530 (0.02%) aligned >1 times
0.04% overall alignment rate

===========================

mapping to hairpin.fa

2565353 reads; of these:
  2565353 (100.00%) were unpaired; of these:
    2555740 (99.63%) aligned 0 times
    4536 (0.18%) aligned exactly 1 time
    5077 (0.20%) aligned >1 times
0.37% overall alignment rat

==============================================

THis results looks weird.. Since when I just blast my trimmed reads from the MirBase, I can see almost the perfect match. But, my bowtie2 results looks bad..  Still I should miss something.. ;(

This is my bowtie2 option command 

Bowtie2 –local –N 1 –L 16 –x hairpin.fa –U fastq/A1_trimmed_mirnaseq.fastq  -S xxxx.sam

Do I have to explicitly change the U to T in hairpin.fa??  Or other options?

KS

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by illinois.ks140
2

I don't use Bowtie for short read mapping, but when googling for bowtie2 and short reads, I found:


The chief differences between Bowtie 1 and Bowtie 2 are:

  1. For reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive.


http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

 

Perhaps you should try a different mapping algorithm? 

ADD REPLYlink written 3.8 years ago by David Langenberger8.7k
2

I checked the alignments again and it looks like there are another four nucleotides at the end of your read that seem not to match the genomic sequences. As long as there are that many nucleotides that do not match, no alignment tool will map your reads. 

My first thought here would be four random nucleotides ligated to both ends of the molecule. I read about such a strategy to overcome ligation biases that occur especially with short RNA-Seq. Since mature microRNAs are highly expressed and all molecules of one miRNA have the exact same sequence (in contrast to shotgun RNA-Seq), this can result in huge biases, if some ends do ligate easier than others. So, clever guys started adding random ends, to get rid of this error.

Can you check that with your lab-guys?

ADD REPLYlink written 3.8 years ago by David Langenberger8.7k

Hello David

Thank you so  much!. Yes I contacted the lab-guy.. and he just said that trimmed the first 4 bp and last 4bp.. ( as you found)

So,  I firstly trimmed the adapter sequences(TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC) 

And then, trimmed the first 4bp and last 4bp from reads, which leads to the 22bp peak of read-length distribution(instead of 24bp)

Anyhow, I tried to map with bowtie2 again. 

> bowtie2 --local -N 1 -L 16 -x ../miRNA_reference/hairpin_UtoT.fa -U first4bptrimmed_A1-SmallRNA_S1_L001_R1_001_Illuminaadpatertrim.fastq -S f4_trimmed.sam

 

I also changed hairpin.fa file (U to T) 

Oh.. thank you David,

Finallly, I got 

2565353 reads; of these:
  2565353 (100.00%) were unpaired; of these:
    479292 (18.68%) aligned 0 times
    11959 (0.47%) aligned exactly 1 time
    2074102 (80.85%) aligned >1 times
81.32% overall alignment rate

=========================================

 

 

Thank you so much for your help !! 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by illinois.ks140

BTW, I have one more question!!  

When I look at the hairpin.fa file,  it has many miRNAs from different species.

So, it seems that our reads are mapped to all different miRNA from different species even though my one came from Mouse.  Do I need to only extract the mouse miRNAs from hairpin.fa file?? Or does it okay?

 

When I look at the output sam file, it seems that our reads are mapping across all miRNAS from different species.. , which means our 81% mapping rate is not the real mapping rate.. It should be decreased... Is it right? 

First three lines of output sam file  

=======================================================================

M02847:38:000000000-ACP9R:1:1101:18991:1866     0       ptr-miR-145     1       1       22M1S   *       0       0       GTCCAGTTTTCCCAGGAATCCCA ,CCE-,CEFFD@@,,,,,CC@@8 AS:i:44 XS:i:44 XN:i:0  XM:i:        0 XO:i:0  XG:i:0  NM:i:0  MD:Z:22 YT:Z:UU

 

 M02847:38:000000000-ACP9R:1:1101:14369:1882     0       tgu-miR-30c-5p  1       1       24M4S   *       0       0       TGTAAACATCCTACACTCTCAGCTCCAA    A,C,-,<,CCCF,C,CEEEC-,;CC@,,    AS:i:48 XS:i:        48        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:24 YT:Z:UU

 M02847:38:000000000-ACP9R:1:1101:19762:1898     0       mmu-let-7j      1       0       22M1S   *       0       0       TGAGGTAGTAGTTTGTGCTGTTT C99,-C,<C9,CFF9E<@F<FFG AS:i:38 XS:i:38 XN:i:0  XM:i:        1 XO:i:0  XG:i:0  NM:i:1  MD:Z:7T14       YT:Z:UU

=====================================================

I checked with this reads and it seems that these ones are mapping to mouse miRNA well . 

 

For example, the first read "GTCCAGTTTTCCCAGGAATCCCA" is mapping to  ptr-miR-145 . But when I blast to mirbase(http://www.mirbase.org/search.shtml), it is hitted to mmu miRNA correctly..  Hm.. they just mapped  randomly?? then, definitely, I have to make my own hairpin.fa file having miRNAs of my speicies. Is it right? 

Anyhow, I am learning a lot from this study... So happy with this~

 

Thanks in advance. 

ADD REPLYlink modified 3.8 years ago • written 3.8 years ago by illinois.ks140
2

Hi,

glad the analysis starts working. You can filter the one mapping to mmu* miRNAs and quantify how many you have just to make you an idea if seems are good now. It is normal to get hits to other miRNAs, more in conserved miRNAs, so you should be an enriched of reads with hits at mmu* if you count that ones. 

Mapping to all helps to see any contamination, but normally you can create the hairpin.fa index only for your species and if get the majority there, you are good.

ADD REPLYlink written 3.8 years ago by Lorena Pantano360
1

Just one add here: I normally map my small RNA-Seq data to the complete genome, instead of only the miRBase hairpins. Imagine a sequence that maps to the hairpins with, let's say, 2 errors. You cannot know, if there is a hit in the genome with 0 hits, since you didn't look there. If you have the genome available, I would always use the genome and not a subset. Otherwise, you might get a bias. 

And as Lorena already mentioned: It's absolutely normal that you map one read to several homologous microRNAs, since most mature sequences are highly conserved.

ADD REPLYlink written 3.8 years ago by David Langenberger8.7k

Okay. I see.   

It makes sense. I will also try to map to the whole genome.. not only for miRBase genome.  and see the result too.!   

Thank you so much!!  ;) I feel that I am ready to analyze the miRNA now! ;) 

ADD REPLYlink written 3.8 years ago by illinois.ks140

Okay. I see.   

It makes sense. I will also try to map to the whole genome.. not only for miRBase genome.  and see the result too.!   

Thank you so much!!  ;) I feel that I am ready to analyze the miRNA now! ;) 

ADD REPLYlink written 3.8 years ago by illinois.ks140

Hello Lorena,

Thank you for your comments. Yes. I will make my own hairpin.fa index for my species.. and try.. hope it works well. 

I will also try the one you suggested too !! 

Thank you! 

ADD REPLYlink written 3.8 years ago by illinois.ks140
1

Did you do the index for hairpin.fa? I use to change U by T, but not sure if it does automatically. If you only want to map to mirbase you can use miraligner:. You can download from here: https://github.com/lpantano/seqbuster/raw/master/modules/miraligner/miraligner.jar and the usage is: 

 java -jar miraligner.jar -sub 1 -trim 3 -add 3 -s hsa -i fastq -db mirbase_DB_folder  -pre -o prefix 

mirbase_DB is a folder with the hairpin.fa and miRNA.str files from here: ftp://mirbase.org/pub/mirbase/CURRENT/

The good thing about this tool is that is very sensitive. Normally it works with collapse reads (normally around 50000 different sequences in a normal experiment), so you can try first with the first 100000 reads of your file to test it. 

last thought: What happens if you map against the genome? maybe you don't have a lot miRNA there. I saw the peak after trimming at 30 nt, that would be 26 nt after removing the first 4 nt.  I wouldn't use local, no big difference for short reads. 

ADD REPLYlink written 3.8 years ago by Lorena Pantano360
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1608 users visited in the last hour