Question: miRna-Seq, featureCounts problem
0
gravatar for AngAdd
13 months ago by
AngAdd0
AngAdd0 wrote:

Hi everyone!

I'm new to Bioinformatics. I need to write a pipeline for miRNA-Seq. The situation is the following: I have three normal samples and three cancer samples. I downloaded the stem-loop and the mature human sequences from miRBase, then I performed the mapping using Bowtie. Now I need to obtain the read counts for each sample; I'm trying to use featureCounts with mirBase hsa.gff3 file and my bam file but featureCounts doesn't work! It returns this:

Assigned    0 

Unassigned_Unmapped 0

Unassigned_MappingQuality   0

Unassigned_Chimera  0

Unassigned_FragmentLength   0

Unassigned_Duplicate    0

Unassigned_MultiMapping 0

Unassigned_Secondary    0

Unassigned_Nonjunction  0

Unassigned_NoFeatures   1831739

Unassigned_Overlapping_Length   0

Unassigned_Ambiguity    0

This is my command:

> featureCounts -t miRNA -g 'Name' -a my_path/hsa.gff3 -o my_path/my_bam_file

I have also tried with -g ID but the result is the same.

I have done a lot of researches but I don't understand where is the problem.

Thank you.

mirna featurecounts mirbase • 755 views
ADD COMMENTlink modified 13 months ago by Buffo1.8k • written 13 months ago by AngAdd0
0
gravatar for i.sudbery
13 months ago by
i.sudbery8.2k
Sheffield, UK
i.sudbery8.2k wrote:

featureCounts allows you to count the number of reads mapped to a whole genome sequence using a gff file that contains the co-ordinates of your features (in this case miRNAs).

However, you have not mapped to the whole genome, you have mapped only to a set of sequences derived from miRNAs themselves. As such your bam file will say things like "Read 1 is mapped to base 1 of miR-29a", where as featureCounts is looking for something like "Read1 is mapped to base 34,234,546 of chromosome 5, and miR-29a covers bases 34,234,546- 34,234,566".

In order to get the numbers of reads aligned to each miR after having mapped to a miR sequence collection, simply use samtools idxstats like so:

samtools idxstats my_path/hsa.gff3 -o mirna/cancer/featureCountsmy_path/my_bam_file
ADD COMMENTlink written 13 months ago by i.sudbery8.2k

Hi, thank you for your answer!

I tried to run your command line but I had this error message:

samtools idxstats: fail to load index for "my_path/hsa.gff3"

Then I tried to run this:

samtools index 4T_mapped.sorted.bam

samtools idxstats my_path/4T_mapped.sorted.bam

and I have obtained a result like this:

hsa-miR-576-3p 22 5 0

hsa-miR-140-5p 22 15 0

hsa-miR-522-5p 22 1 0

hsa-miR-1298-5p 22 0 0

hsa-miR-133a-3p 22 1 0

hsa-miR-4743-3p 21 0 0

hsa-miR-557 23 0 0

hsa-miR-548ao-3p 22 0 0

hsa-miR-5088-5p 24 1 0

hsa-miR-4649-5p 24 0 0

hsa-miR-665 20 0 0

.....

Is it what I'm searching for?

ADD REPLYlink written 13 months ago by AngAdd0
1

Yes. The first column is the length. The second column is the number of reads mapped to that miRNA, the thrid column is the number of unmapped reads on that miRNA (this is a pretty meaningless number, and will normally be 0).

ADD REPLYlink written 13 months ago by i.sudbery8.2k
0
gravatar for Buffo
13 months ago by
Buffo1.8k
Buffo1.8k wrote:

In addition to i.sudbery answer, I recommend you to map the reads against the genome, not against the mature/hairpin sequences. This second pipeline cause biased analysis since some of the actually known miRNAs are redundant in sequence (especially in mouse).

ADD COMMENTlink written 13 months ago by Buffo1.8k

The fact that the sequences are redundant is exactly why we should map to the hairpin sequences, rather than the genome. If miR-X is encoded in two places in the genome, then a read from miR-X will always be multi-mapped. Pipelines will then either ignore this read, or count it twice (once for miR-Xa and miR-Xb). If you map to a non-redundant set of mature sequences, then you will just get one, single mapping, hit against miR-X.

ADD REPLYlink written 13 months ago by i.sudbery8.2k
 If miR-X is encoded in two places in the genome, then a read from miR-X will always be multi-mapped.

That is exactly why you should use bowtie instead of bowtie2 and sequences of high quality.

Pipelines will then either ignore this read, or count it twice (once for miR-Xa and miR-Xb)

I prefer to ignore redundant sequences (if they are) in a genome than count them erroneously because they are in a set of "described" sequences.

If you map to a non-redundant set of mature sequences, then you will just get one, single mapping, hit against miR-X.

mirbase does not contain a set of non-redundant sequences, fasta files contains all hairpin and mature mirnas.

My point is, I consider correct to map the reads against the entire genome because smallRNAseq libraries are not specific to miRNAs, they sequence everything in your sample that is small in length. If you map your reads against to small set of sequences (such as mirbase files), yes, your mapping summary will be higher because you will count all redundant reads or even low-quality reads will increase the mapping. Even creating the nr-file you are deviating your analysis supposing that there are not new potential miRNAs.

ADD REPLYlink written 13 months ago by Buffo1.8k

So given a miRNA with two identical copies in the genome, you rather say the cell contains none of that miRNA, rather than say it contains some (and a no less accurate an estimate how much compare to a non-duplicated miRNA), but we have no way of knowing which copy it came from?

My point is, I consider correct to map the reads against the entire genome because smallRNAseq libraries are not specific to miRNAs, they sequence everything in your sample that is small in length.

Fair enough, we have been mapping to Rfam rather than mirbase. But even so, if you care about miRNAs, do you really care if there are some non-miRNA reads in your dataset that fail to map? Even if you map to the genome, you are unlikely to quantify those reads that don't map to annotated miRNAs unless you are specifically looking for them.

ADD REPLYlink written 13 months ago by i.sudbery8.2k
So given a miRNA with two identical copies in the genome

No sense! it does not exist. You don't understand what redundant sequence is.

but we have no way of knowing which copy it came from?

Yes, we can, mapping to the genome.

Fair enough, we have been mapping to Rfam rather than mirbase. But even so, if you care about miRNAs, do you really care if there are some non-miRNA reads in your dataset that fail to map?

Again, no sense!. After trimming smallRNAseq reads by quality we obtain more than 95% of mapped reads.

Even if you map to the genome, you are unlikely to quantify those reads that don't map to annotated miRNAs unless you are specifically looking for them.

That's my point, I do both things. Quantify known miRNAs and look for new ones, something like mirdeep2 proposes.

Anyways AngAdd, please tell us about your results of rt-qpcr validation after quantifying against mirbase files, these results will be very interesting.

ADD REPLYlink written 13 months ago by Buffo1.8k

No sense! it does not exist.

I'm afraid that's not true. For example, the mature sequence miR-29b-3p is expressed as identical sequences from two different locations, miR-29b-1 (chr7:130877459-130877539:(-)) and miR-29b-2 (chr1: 207802443-207802523:(-)). The mature product from these two hairpins is identical (UAGCACCAUUUGAAAUCAGUGUU) and thus indistinguishable from sequence. This is why the miBase fasta only contains one record for miR-29b-3p despite their being miR-29b-3p-1 and miR-29b-3p-2 in the genome.

That's my point, I do both things.

Good for you, but I'd propose that most people studying Humans or well annotated model organisms are unlikely to care about loci that are unannotated at this point, unless small RNA biology is their particular area of study.

ADD REPLYlink written 13 months ago by i.sudbery8.2k
The mature product from these two hairpins is identical (UAGCACCAUUUGAAAUCAGUGUU) and thus indistinguishable from sequence

That is not a new debate, it is an old question about isomirs, and again, if you map your reads against mirbase files you will count multi-hit reads as single. If you consider it correct ¯(°_o)/¯.

I'd propose that most people studying Humans or well annotated model organisms are unlikely to care about loci that are unannotated at this point, unless small RNA biology is their particular area of study.

Well annotated genomes do not mean they are completely annotated, and especially about new RNAs such as mirnas. Probably these unknown miRNAs are the answer to your biologic question, Don't you think?. But do not worry that is what researchers like to do, they will do for you.

Thanks for this nice discussion.

ADD REPLYlink written 13 months ago by Buffo1.8k

if you map your reads against mirbase files you will count multi-hit reads as single. If you consider it correct ¯(°_o)/¯.

Yes, because I care about how many molecules with the sequence UAGCACCAUUUGAAAUCAGUGUU are floating around in the cytoplasm. I couldn't care less if they came from a single genomic locus, 2 loci or 200 loci. A genome mapping pipeline that discards multi-mappers would say there was no miR-29b-3p in the cell, even if there was loads.

ADD REPLYlink written 13 months ago by i.sudbery8.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1361 users visited in the last hour