Question: Difference in number of reads in tophat and htseq-count outputs
0
gravatar for nalandaatmi
3.9 years ago by
nalandaatmi70
United States
nalandaatmi70 wrote:

Dear all,

I have paired end reads from my sample and the total number of raw reads is 132,938,316.

1) After applying trimmomatic, I have 

R1 66,113,117 reads
R2  66,113,117 reads
total 132,226,234 reads

Output from Trimmomatic: 

Input Read Pairs: 66469158 Both Surviving: 66113117 (99.46%) Forward Only Surviving: 330055 (0.50%) Reverse Only Surviving: 24010 (0.04%) Dropped: 1976 (0.00%)
TrimmomaticPE: Completed successfully

Difference in reads pairs (raw-survived) => (66,469,158-66,113,117) = 356,041 (phred score <=20)

Difference in reads (Raw-Trimmed) => (132,938,316 - 132,226,234) = 712,082 

2) After aligning using tophat

TopHat Output align summary:

Left reads:
          Input     :  66,113,117
           Mapped   :  54521571 (82.5% of input)
            of these:   4509697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads:
          Input     :  66,113,117
           Mapped   :  53610826 (81.1% of input)
            of these:   4412079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate.

Aligned pairs:  51538017
     of these:   4275585 ( 8.3%) have multiple alignments
                  780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate. 

TopHat Output Prep_reads_info:

left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886

right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354

Both left reads input and Left_read_in have 66,113,117 sample value,

Questions:

What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?

What is the difference between right_reads_out (66,070,354) and right reads (53,610,826 )?

Why these numbers are different?

What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?

3) samtools flagstat accepted_hits.bam

131374951 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
131374951 + 0 mapped (100.00%:-nan%)
131374951 + 0 paired in sequencing
66297524 + 0 read1
65077427 + 0 read2
107575718 + 0 properly paired (81.88%:-nan%)
124948346 + 0 with itself and mate mapped
6426605 + 0 singletons (4.89%:-nan%)
9330692 + 0 with mate mapped to a different chr
810494 + 0 with mate mapped to a different chr (mapQ>=5)

Questions:

a) what is this QC-passed reads (131,374,951)?

b) What is this Read1 (66,297,524) and Read2 (65,077,427)? 

4) samtools flagstat unmapped.bam
24,093,837 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:-nan%)
24093837 + 0 paired in sequencing
11,591,546 + 0 read1
12,502,291 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

5) samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

Unique Reads: 56,594,380 

What about this number 56,594,380?

samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951

samtools view unmapped.bam | cut -f1 | sort | uniq | wc -l
Unique Reads:14,575,100

6) I used htseq-count to get gene counts for accepted_hits.bam

Globin Genes Stranded-Yes(Sense) Stranded-reverse (Antisense)
HBB 36,472 7
HBA1 20,398 14
HBA2 144,174 79
HBG1 2,825 0
HBG2 2,638 0
HBD 4 0
HBE1 0 1
HBZ 0 0
HBQ1 3 6
MB 0 4
CYGB 2 354
NGB 2 319
Total Globin 206,518 784
Total reads 126,970,700 126,970,700
% of Globin_reads 0.1627 0.0006

Again, I used the accepted_hits.bam as in the input for htseq-count.

Questions:

Why the total reads (126,970,700) is different from other numbers?

To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?

 

 

ADD COMMENTlink modified 3.8 years ago by Carlo Yague4.6k • written 3.9 years ago by nalandaatmi70

Your question(s) is too polluted and confusing. I will just point to the fact that if you did not sort "accepted_hits.bam" by name (samtools sort -n), htseq-count will could each read from a pair separately, effectively doubling your read counts for paired-end sequencing.

ADD REPLYlink written 3.9 years ago by h.mon27k

Dear h.mon,

Actually, I need some clarifications on each section of this pipeline. So I provided the read counts in each step I performed in my pipeline. From sequencing, I received 66,469,158 these many raw read pairs (R1 and R2). After trimming using trimmomatic, I received 66,113,117 trimmed reads pairs (R1 and R2). 356,041 read pairs don't have phred score >20. So they are removed. It is clear till this point.

1) Why numbers varying between tophat alignment summary and prep_reads_info? Can you explain how to interpret these numbers in both outputs.

2) Using samtools flag, I got a summary for accepted_hits.bam output file from tophat analysis. How to interpret these numbers from samtools.

3) As you suggested, now I sorted bam file based on name and provided the sorted to htseq-count. Below is the output

  Unsorted-Bam Sorted-Bam
Globin Genes Stranded: YES Stranded: YES
HBB 36472 19800
HBA1 20398 4249
HBA2 144174 47553
HBG1 2825 2384
HBG2 2638 1942
HBD 4 3
HBE1 0 0
HBZ 0 0
HBQ1 3 2
MB 0 0
CYGB 2 1
NGB 2 1
Total Globin 206,518 75,935
__no_feature 94,038,451 51,420,979
__ambiguous 46,175 56,695
__too_low_aQual 0 0
__not_aligned 0 0
__alignment_not_unique 32,018,089 16,952,589
Total reads for all genes 126,969,861 68,900,778
% of Globin_reads 0.1626 0.1102 

 

4) To get the percentage of reads for globin genes, is it correct to use following values?

 ((75,935)/(68,900,778))*100 = 0.1102%

5) How come the total number of reads for sorted bam file 68,900,778 and unsorted bam file 126,969,861 are different.

6) Why the reads count 68,900,778 from htseq-count is not matching to trimmed reads count or aligned reads?

ADD REPLYlink modified 3.9 years ago • written 3.9 years ago by nalandaatmi70
1
gravatar for Carlo Yague
3.8 years ago by
Carlo Yague4.6k
Belgium
Carlo Yague4.6k wrote:

You probably could have found the answers by yourself, reading the manuals. Anyway :

 

1) What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?

54,521,571 (left reads) is the number of reads mapped. 66,079,886 (left_reads_out) is the number of reads considered for mapping (tophat does is own filtering too).

 

2) What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?

Overall mapping rate = percentage of total reads mapped.

concordant pair alignment rate = Percentage of read pairs (you have paired-end data) that map concordantly, i.e, that map while respecting constrains : they are on the same chromosome and not too far apart.

 

3) what is this QC-passed reads (131,374,951)?

QC means quality-control. Those reads are well mapped according to samtools criteria.

 

4) What is this Read1 (66,297,524) and Read2 (65,077,427)? 

Once again, you have paired end data. Reads1 are the first reads of the pairs, Reads2 the second reads.

 

5) What is this : Unique Reads: 56,594,380 ?

This is obviously the number of reads that are unique in your file.

 

6) Why the total reads (126,970,700) is different from other numbers?

Because HTseq-counts has its own requirements regarding the reads used as input. see http://www-huber.embl.de/HTSeq/doc/counting.html#handling-paired-end-reads

 

7) To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?

You should divide by total pairs (as h.mon comment suggest it, since you sorted your bam file). 0.1102% is therefore the right proportion of read pairs mapped on globin genes.

ADD COMMENTlink modified 3.8 years ago • written 3.8 years ago by Carlo Yague4.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 792 users visited in the last hour