Question

Difference between mapped reads and counts

0

Entering edit mode

4.7 years ago

josmantorres ▴ 10

Hello Biostars members,

I am analysing my RNA-seq data (20 M PE, 150 bp) using STAR (2.5V) to map the trimmed reads and "bedtools multicov" tool to obtain the transcript counts. After generating the genome index (the genome and GFF are available), I used the following STAR script to map my libraries:

STAR --runMode alignReads --runThreadN 20 \
        --genomeDir /Indexs/ --readFilesIn F.paired.1.fastq R.paired.2.fastq \
        --bamRemoveDuplicatesType UniqueIdentical --outFilterMultimapNmax 1 \
        --outBAMsortingThreadN 6 --quantMode GeneCounts \
        --outSAMtype BAM SortedByCoordinate

The mapping seems to be ok according to the Log.final.out files that shows 90-88% of mapped reads (18M ) and average mapped length was 266 bp in the different libraries. After that, I used samtools index to create the index of the STAR output files Aligned.sortedByCoord.out.bam. Finally, I used:

bedtools multicov -bams -bed filtered.gff3 > TranscriptCounts.txt

Including after -bams the different paths to the different .out.bam files

When I check the TranscriptCounts file, the total number of counts in each library is close to 36 M, which is almost twice the number of mapped reads. I have check the sorted and indexed bam files as well as the GFF file and they seem to be ok. Even, I have run STAR to generate unsorted BAM files as output and after that I made the sort and index steps and the result was the same.

Thanks a lot for your help

All the best

Jose

rna-seq sequencing • 1.4k views

ADD COMMENT • link updated 4.7 years ago by Ram 43k • written 4.7 years ago by josmantorres ▴ 10

1

Entering edit mode

Don't use multicov. It has its applications but RNA-seq is none of it. Use dedicated tools such as HTseq, featureCounts or a leightweight quantifier such as salmon or kallisto to obtain a count matrix from your alignment. The main point is probably what i.sudbery said below with respect to the gff.

ADD REPLY • link 4.7 years ago by ATpoint 82k

0

Entering edit mode

Thanks for your quick answers and comments !

I will try HTseq and Featurecounts to obtain the count matrix instead of bedtools multicov

Regarding Ian Sudbery's comment about gff3 filter, I used only gene lines, avoiding the problem with the isoforms. So, it could be a problem with bedtools multicov sensitivity on gene count process.

All the best

Thanks for your help

Jose

ADD REPLY • link 4.7 years ago by josmantorres ▴ 10

1

Entering edit mode

Please do not add an answer unless it is an actual answer to the top level question. Also, please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.
code_formatting

ADD REPLY • link 4.7 years ago by Ram 43k

0

Entering edit mode

If you want to be free of all that trouble, simply use a quantifer like salmon or kallisto to directly quantify your fastq files against a reference transcriptome in fasta format. Isoforms do exist and simply removing them is probably oversimplifying the reality. Most protein-coding genes have > 1 isoform which influence the overall gene expression level. salmon will take care of that all, give it a try.

ADD REPLY • link 4.7 years ago by ATpoint 82k

score 1 · Answer 1 · 2019-08-16

This is almost certainly because of the contents of your filtered.gff3 file. Most likely this file contains a record for each exon (if you have used it more or less from one of the standard annotation sources). Alternatively, the name filtered suggests you may have filitered this first. Have you perhaps filtered it to transcript records?

Either way, the mostly likely explanation for what is happening here is that you are counting reads more than once because the features in your filtered.gff3 overlap. Alternative splicing means that genes may have more than one transcript associated with them. Indeed, I think the average number of transcripts associated with a human gene is quite high, and only a small number of genes in the human genome have only one transcript.

The situation might look like this:

        |>>>>>>>>>>transcript1>>>>>>>>>>>>>>>>|
        |>>>>t1.exon1>>>|----------|>t1.exon2>|

        |>>>>>>>>>>transcript2>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>|
        |>>>>t2.exon1 >>|----------|>>>>>t2.exon2>>>>>|---|>t2.exon3>>|

             ---read----

This would be encoded in the gff3 file something like:

chr1   gencode   transcript   1000000    1001000    .   +   .   ID="transcript1"; Parent="gene1"
chr1   gencode   exon         1000000    1000300    .   +   .   ID="t1exon1"; Parent="transcript1"
chr1   gencode   exon         1000600    1001000    .   +   .   ID="t1exon2"; Parent="transcript1"
chr1   gencode   transcript   1000000    1002000    .   +   .   ID="transcript2"; Parent="gene1"
chr1   gencode   exon         1000000    1000300    .   +   .   ID="t2exon1"; Parent="transcript2"
chr1   gencode   exon         1000600    1001100    .   +   .   ID="t2exon2"; Parent="transcript2"
chr1   gencode   exon         1001600    1002000    .   +   .   ID="t2exon3"; Parent="transcript2"

The read shown overlaps both the transcript lines, and two of the exon lines, and so multicov would count this read 3 times.

The solutions to this problem are:

Use a quantifier designed for quantifying genes from BAM rather than just quantifying arbitrary intervals. I generally use featureCounts from the subread pacakge.
Filter out only the gene lines from your gff3 file. It is also possible to get genes that overlap, but the problem will be less bad than it is now.
Quantify using transcripts using a transcript quantifier that uses EM to assign fractions of reads to the most likely transcripts. Examples here are RSEM, Salmon and Kalisto.