Question: Microrna Pipeline
7
gravatar for xiaojuhu13
5.5 years ago by
xiaojuhu13150
China
xiaojuhu13150 wrote:

I have two miRNA-seq dataset, already removing adapters appeared in raw data. What I want to do is to detect the differential expression between two samples(no replicates, I know it is not a good experimental design). The pipeline: 1)remove contaminations like mRNA, tRNA, snoRNA and so on; 2) select reads length between 18-26; 3) mapping to mature miRNA downloaded from miRBasev20 using bwa; 4) detect the differential expression between two samples using DESeq. However, I cannot do the first step, is there any mRNA, tRNA, snoRNA database downloaded online and is this step operated in linux or windows? And is this pipeline reasonable or not, do you have any more suggestions?

mirna • 8.4k views
ADD COMMENTlink modified 2.3 years ago by k.kathirvel93200 • written 5.5 years ago by xiaojuhu13150

you can use: miRNAkey - A software pipeline for the analysis of microRNA Deep Sequencing data

ADD REPLYlink written 5.5 years ago by Rm7.9k

You can download a lot of annotations using the UCSC table browser.

ADD REPLYlink written 5.5 years ago by David Langenberger9.0k
7
gravatar for David Langenberger
5.5 years ago by
Deutschland
David Langenberger9.0k wrote:

What I recommend to do:

  1. Clip adapter sequences
  2. Map the reads against a complete reference genome and not only against miRBase microRNAs.
  3. Quantify microRNA expression using e.g. bedtools intersect or featureCounts
  4. Use DESeq to find differentially expressed microRNAs

I don't think you have to filter out reads of length between 18-26. The length fraction you get after clipping should be in that range. The microRNA-Seq protocol should result in this length anyway.

Also reads mapping to rRNAs, mRNAs, etc. should be no problem. They map to these regions... no problem for your analysis.

Important: Allow multiple hits of each read when mapping! there are multiple copies of some microRNAs in the genome and if you do not allow it, the results might be misleading, or wrong.

ADD COMMENTlink written 5.5 years ago by David Langenberger9.0k

Thanks for so detailed suggestion, and I'm sorry for reply later because of time lag. Take one sample example,

I don't think you have to filter out reads of length between 18-26 In my case, after removingadapter(TGGAATTCTCGGGTGCCAAGGAACTC), Do FastQC check, there is a peak between 32-33 : mainly compose of GTTTCCGTAGTGTAGTGGTTATCACGTTCGCCT; TGCCTATGCTGAAACCCAGAGGCTGTTTCTGAGC; GCATTGGTGGTTCAGTGGTAGAATTCTCGCCT. so should I just let it alone, and do the mapping step?

Use DESeq to find differentially expressed microRNAs this step, which .gtf file should I select for the reads count? This species total .gtf file including mRNA, tRNA, miRNA annotation, or just gtf file including miRNA only downloaded from miRBase?

ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by xiaojuhu13150

First of all, you should find a peak somewhere around 24nt. I do not know the sequencing protocol, you use, but normally, when doing microRNA-Seq, molecules of length 18-30 (more or less) are filtered and sequenced. If you have a different protocol and a bigger length region (e.g. < 100) was sequenced, I would also filter for read of length between 18-30. Otherwise you might count microRNA precursors (~70nt).

You don't have to filter out reads overlapping with annotated mRNAs, or ncRNAs. They just map to their locus of origin and give you information. There are known microRNAs overlapping with snoRNAs or tRNAs. Read for example [1] or [2]. So, when deleting these reads, you will loose information. Also reads mapping to mRNAs... you might loose mirtrons (microRNAs in introns). I highly recommend to just map all reads. When filtering data, or mapping to just a subset (miRBase precursors), you will always loose information. Especially, if you want to predict new microRNAs. ;)

Your question about what .gtf file to use. What I would do:

  • Do some read counting fist. How many reads overlap with mRNAs? How many reads overlap with different kinds of ncRNAs (rRNAs, tRNAs, snoRNAs, etc.)? How many reads overlap with microRNAs? And so on.... That gives you a feeling for your data. Does it make sense, what you see?
  • If you want to find differentially expressed miRBase microRNAs, then you should count these reads only for DESeq. Take the miRBase file.
  • If you want to predict new microRNAs and do a differential expression analysis with these, then predict microRNAs and take this file for your DESeq analysis (or combine known with predicted).
ADD REPLYlink modified 5.5 years ago • written 5.5 years ago by David Langenberger9.0k

you should find a peak somewhere around 24nt--- yes, there are two peaks, one is around 22bp, another is 32-33bp. Most 32-33bp peak (I listed), I blast against blastn, are mRNA.

In my case, for microRNA data. I have Hiseq2000, single-end 100bp sequencing.

Do some read counting fist.--I am a newbie for bioinformatic, I really want to do this step, but I do not know how to do it. I know how to collapse reads (cluster same reads and have their counts) using bwa_sam_converter.pl, but how to know reads are tRNA, snoRNA et. Using blast tools or some other softwares. I am sorry, can you list what do you do for this step?

I used mirdeep2 to get the known and novel microRNA, but the output result(.csv), there are only known miRNA counts, then how to combine known and novel microRNA together? which other softwares can do this step?

ADD REPLYlink written 5.5 years ago by xiaojuhu13150

To count reads overlapping with mRNAs, try something like

bedtools intersect -abam sorted_bam_file.bam -b list_of_mRNA_exons.bed (get these from UCSC Table Browser) -u | wc -l

That returns all entries of your sorted bam file that overlap with exon of mRNAs. You need the -u option, because some entries may overlap with more then one exon and you don't want to count them twice. With wc -l you count the lines of the result from bedtools intersect.

The new bedtools version also supports BED12 format using the -split option, but I never tried it. If it works, you could also input a BED12 file of complete genes, instead of exons. But in your case, it doesn't matter, since you don't wanna quantify genes.

ADD REPLYlink written 5.5 years ago by David Langenberger9.0k

Thanks for your answers, I just found that in the latest version of bedtools. "-bed" option needs to be specified since the default output is binary format.

ADD REPLYlink written 9 months ago by chengrui70

Hi I am just curious about filtering spike-in before maping to bowtie2. I also have miRNA fastqc data to analyze and want to know how to filter spike-ins. I created a separate *.fa file listing all the spike-ins used. How should I proceed with Bowtie2 ? e.g. spike-ins

UniSP100
TCCCAAATGTAGACAAAGCA
UniSP101
TGAAGCTGCCAGCATGATCTA
UniSP102
CAGCCAAGGATGACTTGCCGG

ADD REPLYlink written 2.2 years ago by Björn40
1
gravatar for Charles Warden
5.5 years ago by
Charles Warden7.2k
Duarte, CA
Charles Warden7.2k wrote:

If you are aligning to a genomic reference, then I don't think you need to filter the results per se so much as you need to annotate them appropriately. In fact, a QC plot showing the distribution of reads by gene type is probably useful.

If your institution has a license, I think you can do all of these steps in Partek.

I haven't personally tried them out, but I think these are open-source alternatives:

A: microRNA reads alignment

ADD COMMENTlink written 5.5 years ago by Charles Warden7.2k

It also looks like eRNA is another (new) option (which is compared to miRDeep2 and miRspring in the paper): http://www.biomedcentral.com/1471-2164/15/176/abstract

ADD REPLYlink written 5.5 years ago by Charles Warden7.2k
0
gravatar for Ashutosh Pandey
5.5 years ago by
Philadelphia
Ashutosh Pandey11k wrote:

You can use RFAM (ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT) database that contains sequences for non-coding RNAs. It contains sequences for rRNA, snRNA, snoRNA, tRNA etc. It also has a few sequences for miRNA that you would like to remove before your analysis. The Rfam.fasta.gz file contains sequences for all the organisms, so you may have to grep sequences specific to your species.

ADD COMMENTlink written 5.5 years ago by Ashutosh Pandey11k

I use miRExpress (http://mirexpress.mbc.nctu.edu.tw/) or perform the alignment using SHRiMP2 "mirna" mode against the miRBase.

ADD REPLYlink written 5.5 years ago by Ashutosh Pandey11k
0
gravatar for David Langenberger
5.5 years ago by
Deutschland
David Langenberger9.0k wrote:

You can also try miRanalyzer. It maps the reads against Rfam and other ncRNA databases, quantifies known microRNAs and predict new microRNA candidates.

There is an online version: http://bioinfo5.ugr.es/miRanalyzer/miRanalyzer.php

and a stand-alone version: http://bioinfo5.ugr.es/miRanalyzer/standalone.html

ADD COMMENTlink written 5.5 years ago by David Langenberger9.0k

Sorry David in case I there is no miRBase microRNAs for Aspergillus fumigatus, can I use another gff3 contains cDNA, proreins and ncRNA or what should I do??

thank you

ADD REPLYlink written 3.0 years ago by Angel3.5k
0
gravatar for k.kathirvel93
2.3 years ago by
k.kathirvel93200
India
k.kathirvel93200 wrote:

Hi everyOne, This conversation is more helpful for me, coz the same question what i have, is discussed here. I have an another doubt, can anyone help me? How can i exclude microRNA sequences which are below 18 and above 30 N. Is there any Tools ?

ADD COMMENTlink written 2.3 years ago by k.kathirvel93200

Can you post this as another question? Also what you need is a length filter which has nothing to do with miRNA, and has already been covered by several posts on biostar.

ADD REPLYlink written 2.3 years ago by Rohit1.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1063 users visited in the last hour