Question: Difference between mapped reads and counts
gravatar for josmantorres
5 months ago by
josmantorres0 wrote:

Hello Biostars members,

I am analysing my RNA-seq data (20 M PE, 150 bp) using STAR (2.5V) to map the trimmed reads and "bedtools multicov" tool to obtain the transcript counts. After generating the genome index (the genome and GFF are available), I used the following STAR script to map my libraries:

STAR --runMode alignReads --runThreadN 20 \
        --genomeDir /Indexs/ --readFilesIn F.paired.1.fastq R.paired.2.fastq \
        --bamRemoveDuplicatesType UniqueIdentical --outFilterMultimapNmax 1 \
        --outBAMsortingThreadN 6 --quantMode GeneCounts \
        --outSAMtype BAM SortedByCoordinate

The mapping seems to be ok according to the files that shows 90-88% of mapped reads (18M ) and average mapped length was 266 bp in the different libraries. After that, I used samtools index to create the index of the STAR output files Aligned.sortedByCoord.out.bam. Finally, I used:

bedtools multicov -bams -bed filtered.gff3 > TranscriptCounts.txt

Including after -bams the different paths to the different .out.bam files

When I check the TranscriptCounts file, the total number of counts in each library is close to 36 M, which is almost twice the number of mapped reads. I have check the sorted and indexed bam files as well as the GFF file and they seem to be ok. Even, I have run STAR to generate unsorted BAM files as output and after that I made the sort and index steps and the result was the same.

Thanks a lot for your help

All the best


sequencing rna-seq • 234 views
ADD COMMENTlink modified 5 months ago by RamRS25k • written 5 months ago by josmantorres0

Don't use multicov. It has its applications but RNA-seq is none of it. Use dedicated tools such as HTseq, featureCounts or a leightweight quantifier such as salmon or kallisto to obtain a count matrix from your alignment. The main point is probably what i.sudbery said below with respect to the gff.

ADD REPLYlink modified 5 months ago • written 5 months ago by ATpoint28k

Thanks for your quick answers and comments !

I will try HTseq and Featurecounts to obtain the count matrix instead of bedtools multicov

Regarding Ian Sudbery's comment about gff3 filter, I used only gene lines, avoiding the problem with the isoforms. So, it could be a problem with bedtools multicov sensitivity on gene count process.

All the best

Thanks for your help


ADD REPLYlink written 5 months ago by josmantorres0

Please do not add an answer unless it is an actual answer to the top level question. Also, please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLYlink written 5 months ago by RamRS25k

If you want to be free of all that trouble, simply use a quantifer like salmon or kallisto to directly quantify your fastq files against a reference transcriptome in fasta format. Isoforms do exist and simply removing them is probably oversimplifying the reality. Most protein-coding genes have > 1 isoform which influence the overall gene expression level. salmon will take care of that all, give it a try.

ADD REPLYlink modified 5 months ago • written 5 months ago by ATpoint28k
gravatar for i.sudbery
5 months ago by
Sheffield, UK
i.sudbery6.6k wrote:

This is almost certainly because of the contents of your filtered.gff3 file. Most likely this file contains a record for each exon (if you have used it more or less from one of the standard annotation sources). Alternatively, the name filtered suggests you may have filitered this first. Have you perhaps filtered it to transcript records?

Either way, the mostly likely explanation for what is happening here is that you are counting reads more than once because the features in your filtered.gff3 overlap. Alternative splicing means that genes may have more than one transcript associated with them. Indeed, I think the average number of transcripts associated with a human gene is quite high, and only a small number of genes in the human genome have only one transcript.

The situation might look like this:


        |>>>>t2.exon1 >>|----------|>>>>>t2.exon2>>>>>|---|>t2.exon3>>|


This would be encoded in the gff3 file something like:

chr1   gencode   transcript   1000000    1001000    .   +   .   ID="transcript1"; Parent="gene1"
chr1   gencode   exon         1000000    1000300    .   +   .   ID="t1exon1"; Parent="transcript1"
chr1   gencode   exon         1000600    1001000    .   +   .   ID="t1exon2"; Parent="transcript1"
chr1   gencode   transcript   1000000    1002000    .   +   .   ID="transcript2"; Parent="gene1"
chr1   gencode   exon         1000000    1000300    .   +   .   ID="t2exon1"; Parent="transcript2"
chr1   gencode   exon         1000600    1001100    .   +   .   ID="t2exon2"; Parent="transcript2"
chr1   gencode   exon         1001600    1002000    .   +   .   ID="t2exon3"; Parent="transcript2"

The read shown overlaps both the transcript lines, and two of the exon lines, and so multicov would count this read 3 times.

The solutions to this problem are:

  1. Use a quantifier designed for quantifying genes from BAM rather than just quantifying arbitrary intervals. I generally use featureCounts from the subread pacakge.
  2. Filter out only the gene lines from your gff3 file. It is also possible to get genes that overlap, but the problem will be less bad than it is now.
  3. Quantify using transcripts using a transcript quantifier that uses EM to assign fractions of reads to the most likely transcripts. Examples here are RSEM, Salmon and Kalisto.
ADD COMMENTlink written 5 months ago by i.sudbery6.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 618 users visited in the last hour