Question

miRNA mapping rate is very low.. (less than 0.03%)

8

Entering edit mode

8.9 years ago

illinois.ks ▴ 210

Hello,

I haven't worked with miRNA mapping..( the specie is mouse) It is the first time to do it.

I have read a lot of posting and web-site to map my microRNA data.

Here is what I follow.

1) trim the adapter using cutadapt

When I looked at my fastq file (single-end), read length was 36 bp.. by inspecting the eye-ball test, I could not find my common sequences. (So, I am not sure my fastq file has been already trimmed or not . So I kept both versions... )This is my command... (comes from http://www.ark-genomics.org/events-online-training-eu-training-course/adapter-and-quality-trimming-illumina-data)

cutadapt -a CGACAGGTTCAGAGTTCTACAGTCCGACGATC \
         -a TACAGTCCGACGATC \
         -a ATCTCGTATGCCGTCTTCTGCTTG \
         -e 0.1 -O 5 -m 15 \
         -o xxx_microRNA_adaprm.fastq xxx_microRNA.fastq

So, I have 2 version fastq files. (xxx_microRNA.fastq, xxx_microRNA_adaprm.fastq)

2) Local bowtie2 alignment of miRNA data

(I generated the reference with bowtie-build directly from the hairpin FASTA file downloaded from miRBase.)

Bowtie2-build hairpin_mms.fa hairpin_mms.fa
Bowtie2-build mature_mms.fa mature_mms.fa
Bowtie2 -local -N 1 -L 16 -x hairpin_mms.fa -U fastq/xxx_microRNA(_adaprm).fastq -S xxxx.sam

The bowtie2 mapping returns the following result.. (for both versions)

I should have done something wrong...

2849144 reads; of these:
  2849144 (100.00%) were unpaired; of these:
    2847350 (99.94%) aligned 0 times
    933 (0.03%) aligned exactly 1 time
    861 (0.03%) aligned >1 times
0.06% overall alignment rate

Could you please help me with this?

Oops, I forgot to convert U to T in hairpin.fa file.. (so stupid about it..) Once I run it, I will confirm the result again.

NGS miRNA • 9.9k views

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

1

Entering edit mode

Can you provide the read length distribution after clipping the adapter?

Changing U's to T's shouldn't change the mapping result. And you don't have to delete non-mouse hairpins. Mapping the reads against all sequences is absolutely fine.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

1

Entering edit mode

If your current tests don't prove more fruitful then try blasting a few of these. Perhaps you don't really have miRNAs but instead piRNAs (that'll happen in some tissues).

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Devon Ryan 104k

0

Entering edit mode

Hello Devon, thank you for your comments,

How can I map to piRNA? I am trying to search regarding to piRNA, but there aren't many info..

Could you give some comments?

THanks

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

0

Entering edit mode

There are a number of sources of piRNA reference sequences: piRNABank, RNAcentral, etc. The actual mapping part isn't any different really, alignment is alignment.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Devon Ryan 104k

1

Entering edit mode

And check bowtie2, if it returns all multiple mappings of a read. See Attention: Bowtie2 And Multiple Hits

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

Ram · Accepted Answer · 2015-05-29

6

Entering edit mode

8.9 years ago

David Langenberger 11k

Try to clip the adapter of the Illumina HiSeq 2000 miRNA protocol (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC)

And always check your read length distribution after clipping. There should be a peak at around 24nt.

ADD COMMENT • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

0

Entering edit mode

Hello david, thank you so much for your help.

Yes, I checked the read length distribution after clipping the adapter including the sequences you suggested.

(-a CGACAGGTTCAGAGTTCTACAGTCCGACGATC -a TACAGTCCGACGATC -a ATCTCGTATGCCGTCTTCTGCTTG -a TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC)

Then, I can see many reads are clipped. (Originally my read length was 36bp for all reads)

Here is the read distribution

as you can see, instead of 24bp, 30bp has read peak! Do you think I need to clip more? Then what sequences?

When I try to map with bowtie as it is (with 30 bp peak reads), I can improve my mapping rate a little, but still mapping rate is low..

2558088 (100.00%) were unpaired; of these:
    2553160 (99.81%) aligned 0 times
    2885 (0.11%) aligned exactly 1 time
    2043 (0.08%) aligned >1 times
0.19% overall alignment rate

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

0

Entering edit mode

In addition, I have quick question.

Since I have no idea about adapter sequences for my smallRNA data, I contacted web lab and got the following info.. However I am not sure about experiment in detail, cannot interpret this. Could you please help me with this info to extract exact adapter sequences?

exp             ng/ul    ul      Quantifluor    Quantifluor    Realtime    Realtime    index sequence    reverse compliment    detail
                                 (ng/ul)        (nM)           (dilution.  (nM)
                                                                   pM)                 
small RNA_A1    0.857    3.50    0.838          8.608115       0.52        0.0156693   CGTGAT            ATCACG                3' NEXTflex™ Adenylated Adapter : 5' rApp /NNNNTGGAATTCTCGGGTGCCAAGG/ 3ddC/
small RNA_A2    0.803    3.74                                                          GCCTAA            TTAGGC                5' NEXTflex™ Adapter : 5' GUUCAGAGUUCUACAGUCCGACGAUCNNNN
small RNA_A3    1.041    2.88                                                          TCAAGT            ACTTGA                NEXTflex™ RT Primer : 5' GCCTTGGCACCCGAGAATTCCA
small RNA_C1    0.628    4.78                                                          CTGATC            GATCAG                NEXTflex™ Barcode Primer 1*† : 5' CAAGCAGAAGACGGCATACGAGAT******GTGACTGGAGTTCCTTGGCACCCGAGAATTCCA
small RNA_C2    0.798    3.76                                                          AAGCTA            TAGCTT                NEXTflex™ Universal Primer : 5' AATGATACGGCGACCACCGAGATCTACACGTTCAGAGTTCTACAGTCCGA
small RNA_C3    0.764    3.93                                                          GTAGCC            GGCTAC                † The barcode sequence(******) will be read as the reverse complement in the index read.

These index sequences should be my adapter?? then, for each sample, I have to use different adapter??

So confusing... :(

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

1

Entering edit mode

Ah ok, it looks like they did multiplexing. That means that they sequenced several samples on one flowcell, in your case it looks like three A replicates and three C replicates. To keep track of the samples, every molecule from A1 gets an A1 index sequence attached and every molecule from A2 an A2 index, etc..., before put in the machine. This index sequence is normally on the 5' end of your read, while the adapter is on the 3' end (index-sequence-3'adapter).

To get rid of the index, you have to do demultiplexing. Normally, demultiplexing takes place automatically during the instrument's data post-processing protocols. But there are tools available for that task, you can google for that. These tools will take your fastq file together with a file of your index sequence and then create (in your case) six different files (three A replicates and 3 C replicates).

I hope that helps.

That would actually shift the peak of your read-length distribution to 24. ;)

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

0

Entering edit mode

Dear David,

Thank you so much for your helps. Your explanation helps me a lot.. ;)

I contacted the web experiment person and he said that the fastq file I got from him has already been demultiplexed.. (oh.. :() so it seems that the low mapping rate was not caused by demultiplexing problem..

Then, what I can do more? Could you please give some comments ? ( it is hard to map the miRNA.. ;()

As I mentioned, when I clip the reads with four adapter sequences, I have 30bp peak distribution of read-length.

The mapping rate with this 30bp peak read distribution was still very low.. (less than 1 %)..

So frustrating.. what I am missing??

FYI: this experiment has been done in miSeq machine. ( It might help to identify the adapter sequence.)

In case, I have attached the part of my fastq file content to have a idea of my sequences or adapter..

A1 Sample Fastq

@M02847:38:000000000-ACP9R:1:1101:18991:1866 1:N:0:1
CTGGGTCCAGTTTTCCCAGGAATCCCAAGACTGGAA
+
AC9,,CCE-,CEFFD@@,,,,,CC@@88,,CF88,,
@M02847:38:000000000-ACP9R:1:1101:14369:1882 1:N:0:1
AGGATGTAAACATCCTACACTCTCAGCTCCAATGGA
+
---,A,C,-,<,CCCF,C,CEEEC-,;CC@,,C,,,
@M02847:38:000000000-ACP9R:1:1101:19762:1898 1:N:0:1
CTATTGAGGTAGTAGTTTGTGCTGTTTATTATGGAA
+
CC@CC99,-C,<C9,CFF9E<@F<FFGAFE9F9,,,
@M02847:38:000000000-ACP9R:1:1101:18203:1899 1:N:0:1
AGGGTGAGATGAAGCACTGTAGCTCTTCTATGGAAT
+
-6-,A,<,-C,,,,;8CF,C9,;C@EFEF9E@,,,C
@M02847:38:000000000-ACP9R:1:1101:11269:1904 1:N:0:1
TGCATGAGGTAGTAGATTGTATAGTCTCCTGGAATT
.......

A2 Sample Fastq

@M02847:38:000000000-ACP9R:1:1101:12431:1882 1:N:0:2
GCACTGTAAACATCCCCGACTGGAAGCGATATGGAA
+
-8ACCCE9<,C9C8<@,8+@C+,,-,;+,C,@,,,,
@M02847:38:000000000-ACP9R:1:1101:14093:1890 1:N:0:2
TACGTTTAGTAGTACCGTCCCTTTGCCTTCTGGAAT
+
C9B<CFF8-C9,C<@C,D@@@EFF-6CCEEF,,,,C
@M02847:38:000000000-ACP9R:1:1101:11008:1909 1:N:0:2
CGGCTAGCTTATCAGACTGATGTTGATGATGGAATT
+
A<9BC;<FFGDGF9,,CF8,E9EF9,C<9F<9,<EF
@M02847:38:000000000-ACP9R:1:1101:11811:1918 1:N:0:2
ATGATGAGGTAGTAGGTTGTGTGGTATTATGGAATT
+
-A@8C<,,-C,,C,,,CF@F@F@7F9EF9FA8,,CE
@M02847:38:000000000-ACP9R:1:1101:14883:1918 1:N:0:2
CGCCTGAGGTAGTAGTTTGTACAGTGGTCTGGAATT
+
@<BCC<,,-C,,C9,FFG9F9E99F9,CEF88,,CE
@M02847:38:000000000-ACP9R:1:1101:12297:1923 1:N:0:2
TTGGTGAGGTAGTAGATTGTATAGTTCAACTGGAAT

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

1

Entering edit mode

When blasting your reads against mature miRNAs from the MirBase (http://www.mirbase.org/search.shtml) I realized that the first 4 nucleotides of your reads never match and the rest perfectly, until the adapter starts.

And when looking at your sequences, it seems pretty obvious, that the adapter I gave you (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC) is the correct one. Almost all reads have a "TGGAATT" at their ends (which is the 5' part of the adapter sequence). You can see that by eye.

What you can try: Clip your reads using the "TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC" adapter and then trim the first four nucleotides of all reads... or vice versa. Then you should be able to map your reads. At least for the subset you provided, it worked perfectly.

BUT: I would highly recommend you to find out what these first four nucleotides are. It is always important to understand the data you work with! Perhaps this post from SEQanswers might give you a hint: http://seqanswers.com/forums/archive/index.php/t-30100.html

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

0

Entering edit mode

Dear David,

I really appreciate your helps. I got it!

I will try to follow your comments! Thank you .. I have learned a lot!

KS

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

0

Entering edit mode

Hello David,

Thank you so much.. I am learning a lot about miRNA processing.(I only have done for mRNA mapping. )

I just finished this mapping..

Hm.. but still mapping rate is low..

After trimmed with adapter and cut 4 bp from front of read,

mapping to mature.fa

2565353 reads; of these:
  2565353 (100.00%) were unpaired; of these:
    2564319 (99.96%) aligned 0 times
    504 (0.02%) aligned exactly 1 time
    530 (0.02%) aligned >1 times
0.04% overall alignment rate

mapping to hairpin.fa

2565353 reads; of these:
  2565353 (100.00%) were unpaired; of these:
    2555740 (99.63%) aligned 0 times
    4536 (0.18%) aligned exactly 1 time
    5077 (0.20%) aligned >1 times
0.37% overall alignment rate

This results looks weird.. Since when I just blast my trimmed reads from the MirBase, I can see almost the perfect match. But, my bowtie2 results looks bad.. Still I should miss something.. ;(

This is my bowtie2 option command

bowtie2 -local -N 1 -L 16 -x hairpin.fa -U fastq/A1_trimmed_mirnaseq.fastq -S xxxx.sam

Do I have to explicitly change the U to T in hairpin.fa?? Or other options?

KS

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

2

Entering edit mode

I don't use Bowtie for short read mapping, but when googling for bowtie2 and short reads, I found:

The chief differences between Bowtie 1 and Bowtie 2 are:

For reads longer than about 50 bp Bowtie 2 is generally faster, more sensitive, and uses less memory than Bowtie 1. For relatively short reads (e.g. less than 50 bp) Bowtie 1 is sometimes faster and/or more sensitive.

http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml

Perhaps you should try a different mapping algorithm?

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

2

Entering edit mode

I checked the alignments again and it looks like there are another four nucleotides at the end of your read that seem not to match the genomic sequences. As long as there are that many nucleotides that do not match, no alignment tool will map your reads.

My first thought here would be four random nucleotides ligated to both ends of the molecule. I read about such a strategy to overcome ligation biases that occur especially with short RNA-Seq. Since mature microRNAs are highly expressed and all molecules of one miRNA have the exact same sequence (in contrast to shotgun RNA-Seq), this can result in huge biases, if some ends do ligate easier than others. So, clever guys started adding random ends, to get rid of this error.

Can you check that with your lab-guys?

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

0

Entering edit mode

Hello David

Thank you so much!. Yes I contacted the lab-guy.. and he just said that trimmed the first 4 bp and last 4bp.. ( as you found)

So, I firstly trimmed the adapter sequences (TGGAATTCTCGGGTGCCAAGGAACTCCAGTCAC)

And then, trimmed the first 4bp and last 4bp from reads, which leads to the 22bp peak of read-length distribution(instead of 24bp)

Anyhow, I tried to map with bowtie2 again.

bowtie2 --local -N 1 -L 16 -x ../miRNA_reference/hairpin_UtoT.fa \
        -U first4bptrimmed_A1-SmallRNA_S1_L001_R1_001_Illuminaadpatertrim.fastq \
        -S f4_trimmed.sam

I also changed hairpin.fa file (U to T)

Oh.. thank you David,

Finally, I got

2565353 reads; of these:
  2565353 (100.00%) were unpaired; of these:
    479292 (18.68%) aligned 0 times
    11959 (0.47%) aligned exactly 1 time
    2074102 (80.85%) aligned >1 times
81.32% overall alignment rate

=========================================

Thank you so much for your help!

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

0

Entering edit mode

BTW, I have one more question!!

When I look at the hairpin.fa file, it has many miRNAs from different species.

So, it seems that our reads are mapped to all different miRNA from different species even though my one came from Mouse. Do I need to only extract the mouse miRNAs from hairpin.fa file?? Or does it okay?

When I look at the output sam file, it seems that our reads are mapping across all miRNAS from different species.. , which means our 81% mapping rate is not the real mapping rate.. It should be decreased... Is it right?

First three lines of output sam file

=======================================================================

M02847:38:000000000-ACP9R:1:1101:18991:1866     0       ptr-miR-145     1       1       22M1S   *       0       0       GTCCAGTTTTCCCAGGAATCCCA ,CCE-,CEFFD@@,,,,,CC@@8 AS:i:44 XS:i:44 XN:i:0  XM:i:        0 XO:i:0  XG:i:0  NM:i:0  MD:Z:22 YT:Z:UU
M02847:38:000000000-ACP9R:1:1101:14369:1882     0       tgu-miR-30c-5p  1       1       24M4S   *       0       0       TGTAAACATCCTACACTCTCAGCTCCAA    A,C,-,&lt;,CCCF,C,CEEEC-,;CC@,,    AS:i:48 XS:i:        48        XN:i:0  XM:i:0  XO:i:0  XG:i:0  NM:i:0  MD:Z:24 YT:Z:UU
M02847:38:000000000-ACP9R:1:1101:19762:1898     0       mmu-let-7j      1       0       22M1S   *       0       0       TGAGGTAGTAGTTTGTGCTGTTT C99,-C,<C9,CFF9E<@F<FFG AS:i:38 XS:i:38 XN:i:0  XM:i:        1 XO:i:0  XG:i:0  NM:i:1  MD:Z:7T14       YT:Z:UU

=====================================================

I checked with this reads and it seems that these ones are mapping to mouse miRNA well.

For example, the first read GTCCAGTTTTCCCAGGAATCCCA is mapping to ptr-miR-145. But when I blast to mirbase, it is hitted to mmu miRNA correctly.. Hm.. they just mapped randomly?? then, definitely, I have to make my own hairpin.fa file having miRNAs of my species. Is it right?

Anyhow, I am learning a lot from this study... So happy with this~

Thanks in advance.

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

2

Entering edit mode

Hi,

glad the analysis starts working. You can filter the one mapping to mmu* miRNAs and quantify how many you have just to make you an idea if seems are good now. It is normal to get hits to other miRNAs, more in conserved miRNAs, so you should be an enriched of reads with hits at mmu* if you count that ones.

Mapping to all helps to see any contamination, but normally you can create the hairpin.fa index only for your species and if get the majority there, you are good.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Lorena Pantano ▴ 380

1

Entering edit mode

Just one add here: I normally map my small RNA-Seq data to the complete genome, instead of only the miRBase hairpins. Imagine a sequence that maps to the hairpins with, let's say, 2 errors. You cannot know, if there is a hit in the genome with 0 hits, since you didn't look there. If you have the genome available, I would always use the genome and not a subset. Otherwise, you might get a bias.

And as Lorena already mentioned: It's absolutely normal that you map one read to several homologous microRNAs, since most mature sequences are highly conserved.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by David Langenberger 11k

0

Entering edit mode

Okay. I see.

It makes sense. I will also try to map to the whole genome.. not only for miRBase genome and see the result too!

Thank you so much! ;) I feel that I am ready to analyze the miRNA now! ;)

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

0

Entering edit mode

Hello Lorena,

Thank you for your comments. Yes. I will make my own hairpin.fa index for my species.. and try.. hope it works well.

I will also try the one you suggested too!

Thank you!

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by illinois.ks ▴ 210

1

Entering edit mode

Did you do the index for hairpin.fa? I use to change U by T, but not sure if it does automatically. If you only want to map to mirbase you can use miraligner. You can download from here: https://github.com/lpantano/seqbuster/raw/master/modules/miraligner/miraligner.jar and the usage is:

java -jar miraligner.jar -sub 1 -trim 3 -add 3 -s hsa -i fastq -db mirbase_DB_folder  -pre -o prefix

mirbase_DB is a folder with the hairpin.fa and miRNA.str files from here: ftp://mirbase.org/pub/mirbase/CURRENT/

The good thing about this tool is that is very sensitive. Normally it works with collapse reads (normally around 50000 different sequences in a normal experiment), so you can try first with the first 100000 reads of your file to test it.

last thought: What happens if you map against the genome? maybe you don't have a lot miRNA there. I saw the peak after trimming at 30 nt, that would be 26 nt after removing the first 4 nt. I wouldn't use local, no big difference for short reads.

ADD REPLY • link updated 15 months ago by Ram 43k • written 8.9 years ago by Lorena Pantano ▴ 380