Question: Difference in number of reads in tophat and htseq-count outputs
0
5.3 years ago by
nalandaatmi90
United States
nalandaatmi90 wrote:

Dear all,

I have paired end reads from my sample and the total number of raw reads is 132,938,316.

1) After applying trimmomatic, I have

Output from Trimmomatic:

Input Read Pairs: 66469158 Both Surviving: 66113117 (99.46%) Forward Only Surviving: 330055 (0.50%) Reverse Only Surviving: 24010 (0.04%) Dropped: 1976 (0.00%)
TrimmomaticPE: Completed successfully

Difference in reads pairs (raw-survived) => (66,469,158-66,113,117) = 356,041 (phred score <=20)

Difference in reads (Raw-Trimmed) => (132,938,316 - 132,226,234) = 712,082

2) After aligning using tophat

TopHat Output align summary:

Input     :  66,113,117
Mapped   :  54521571 (82.5% of input)
of these:   4509697 ( 8.3%) have multiple alignments (78742 have >20)
Input     :  66,113,117
Mapped   :  53610826 (81.1% of input)
of these:   4412079 ( 8.2%) have multiple alignments (78753 have >20)

Aligned pairs:  51538017
of these:   4275585 ( 8.3%) have multiple alignments
780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate.

Questions:

What is the difference between right_reads_out (66,070,354) and right reads (53,610,826 )?

Why these numbers are different?

What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?

3) samtools flagstat accepted_hits.bam

0 + 0 duplicates
131374951 + 0 mapped (100.00%:-nan%)
131374951 + 0 paired in sequencing
107575718 + 0 properly paired (81.88%:-nan%)
124948346 + 0 with itself and mate mapped
6426605 + 0 singletons (4.89%:-nan%)
9330692 + 0 with mate mapped to a different chr
810494 + 0 with mate mapped to a different chr (mapQ>=5)

Questions:

a) what is this QC-passed reads (131,374,951)?

4) samtools flagstat unmapped.bam
0 + 0 duplicates
0 + 0 mapped (0.00%:-nan%)
24093837 + 0 paired in sequencing
0 + 0 properly paired (0.00%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

5) samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951

samtools view unmapped.bam | cut -f1 | sort | uniq | wc -l

6) I used htseq-count to get gene counts for accepted_hits.bam

 Globin Genes Stranded-Yes(Sense) Stranded-reverse (Antisense) HBB 36,472 7 HBA1 20,398 14 HBA2 144,174 79 HBG1 2,825 0 HBG2 2,638 0 HBD 4 0 HBE1 0 1 HBZ 0 0 HBQ1 3 6 MB 0 4 CYGB 2 354 NGB 2 319 Total Globin 206,518 784 Total reads 126,970,700 126,970,700 % of Globin_reads 0.1627 0.0006

Again, I used the accepted_hits.bam as in the input for htseq-count.

Questions:

Why the total reads (126,970,700) is different from other numbers?

To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?

modified 5.1 years ago by Carlo Yague5.5k • written 5.3 years ago by nalandaatmi90

Your question(s) is too polluted and confusing. I will just point to the fact that if you did not sort "accepted_hits.bam" by name (samtools sort -n), htseq-count will could each read from a pair separately, effectively doubling your read counts for paired-end sequencing.

Dear h.mon,

Actually, I need some clarifications on each section of this pipeline. So I provided the read counts in each step I performed in my pipeline. From sequencing, I received 66,469,158 these many raw read pairs (R1 and R2). After trimming using trimmomatic, I received 66,113,117 trimmed reads pairs (R1 and R2). 356,041 read pairs don't have phred score >20. So they are removed. It is clear till this point.

1. Why numbers varying between tophat alignment summary and prep_reads_info? Can you explain how to interpret these numbers in both outputs.
2. Using samtools flag, I got a summary for accepted_hits.bam output file from tophat analysis. How to interpret these numbers from samtools.
3. As you suggested, now I sorted bam file based on name and provided the sorted to htseq-count. Below is the output

``````                          Unsorted-Bam     Sorted-Bam
Globin Genes                 Stranded: YES    Stranded: YES
HBB                          36472            19800
HBA1                         20398            4249
HBA2                         144174           47553
HBG1                         2825             2384
HBG2                         2638             1942
HBD                          4                3
HBE1                         0                0
HBZ                          0                0
HBQ1                         3                2
MB                           0                0
CYGB                         2                1
NGB                          2                1
Total Globin                 206,518          75,935
__no_feature                 94,038,451       51,420,979
__ambiguous                  46,175           56,695
__too_low_aQual              0                0
__not_aligned                0                0
__alignment_not_unique       32,018,089       16,952,589
Total reads for all genes    126,969,861      68,900,778
``````
4. To get the percentage of reads for globin genes, is it correct to use following values? ((75,935)/(68,900,778))*100 = 0.1102%

5. How come the total number of reads for sorted bam file 68,900,778 and unsorted bam file 126,969,861 are different.
6. Why the reads count 68,900,778 from htseq-count is not matching to trimmed reads count or aligned reads?
1
5.1 years ago by
Carlo Yague5.5k
Carlo Yague5.5k wrote:

You probably could have found the answers by yourself, reading the manuals. Anyway:

54,521,571 (left reads) is the number of reads mapped. 66,079,886 (left_reads_out) is the number of reads considered for mapping (tophat does is own filtering too).

What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?

Overall mapping rate = percentage of total reads mapped.

concordant pair alignment rate = Percentage of read pairs (you have paired-end data) that map concordantly, i.e, that map while respecting constrains : they are on the same chromosome and not too far apart.

what is this QC-passed reads (131,374,951)?

QC means quality-control. Those reads are well mapped according to samtools criteria.

What is this : Unique Reads: 56,594,380?

This is obviously the number of reads that are unique in your file.

Why the total reads (126,970,700) is different from other numbers?

Because HTseq-counts has its own requirements regarding the reads used as input. see http://www-huber.embl.de/HTSeq/doc/counting.html#handling-paired-end-reads

To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?

You should divide by total pairs (as h.mon comment suggest it, since you sorted your bam file). 0.1102% is therefore the right proportion of read pairs mapped on globin genes.