Question

Difference in number of reads in tophat and htseq-count outputs

0

Entering edit mode

9.7 years ago

nalandaatmi ▴ 110

Dear all,

I have paired end reads from my sample and the total number of raw reads is 132,938,316.

1) After applying trimmomatic, I have

R1    66,113,117 reads
R2    66,113,117 reads
total    132,226,234 reads

Output from Trimmomatic:

Input Read Pairs: 66469158
Both Surviving: 66113117 (99.46%)
Forward Only Surviving: 330055 (0.50%)
Reverse Only Surviving: 24010 (0.04%)
Dropped: 1976 (0.00%)
TrimmomaticPE: Completed successfully

Difference in reads pairs (raw-survived) => (66,469,158 - 66,113,117) = 356,041 (phred score <=20)

Difference in reads (Raw-Trimmed) => (132,938,316 - 132,226,234) = 712,082

2) After aligning using tophat

TopHat Output align summary:

Left reads:
          Input     :  66,113,117
           Mapped   :  54521571 (82.5% of input)
            of these:   4509697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads:
          Input     :  66,113,117
           Mapped   :  53610826 (81.1% of input)
            of these:   4412079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate.

Aligned pairs:  51538017
     of these:   4275585 ( 8.3%) have multiple alignments
                  780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate.

TopHat Output Prep_reads_info:

left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886
right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354

Both left reads input and Left_read_in have 66,113,117 sample value,

Questions:

What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?
What is the difference between right_reads_out (66,070,354) and right reads (53,610,826 )?

Why are these numbers different?

What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?

3) samtools flagstat accepted_hits.bam

131374951 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
131374951 + 0 mapped (100.00%:-nan%)
131374951 + 0 paired in sequencing
66297524 + 0 read1
65077427 + 0 read2
107575718 + 0 properly paired (81.88%:-nan%)
124948346 + 0 with itself and mate mapped
6426605 + 0 singletons (4.89%:-nan%)
9330692 + 0 with mate mapped to a different chr
810494 + 0 with mate mapped to a different chr (mapQ>=5)

Questions:

a) what is this QC-passed reads (131,374,951)?

b) What is this Read1 (66,297,524) and Read2 (65,077,427)?

4) samtools flagstat unmapped.bam

24,093,837 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:-nan%)
24093837 + 0 paired in sequencing
11,591,546 + 0 read1
12,502,291 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

5) samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

Unique Reads: 56,594,380

What about this number 56,594,380?

samtools view accepted_hits.bam | cut -f1 | sort | wc -l

131,374,951

samtools view unmapped.bam | cut -f1 | sort | uniq | wc -l

Unique Reads:14,575,100

6) I used htseq-count to get gene counts for accepted_hits.bam

Globin Genes        Stranded-Yes(Sense)   Stranded-reverse (Antisense)
HBB                 36,472                7
HBA1                20,398                14
HBA2                144,174               79
HBG1                2,825                 0
HBG2                2,638                 0
HBD                 4                     0
HBE1                0                     1
HBZ                 0                     0
HBQ1                3                     6
MB                  0                     4
CYGB                2                     354
NGB                 2                     319
Total Globin        206,518               784
Total reads         126,970,700           126,970,700
% of Globin_reads   0.1627                0.0006

Again, I used the accepted_hits.bam as in the input for htseq-count.

Questions:

Why the total reads (126,970,700) is different from other numbers?

To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?

next-gen RNA-Seq ChIP-Seq sequence alignment • 4.1k views

ADD COMMENT • link updated 2.8 years ago by Ram 45k • written 9.7 years ago by nalandaatmi ▴ 110

0

Entering edit mode

Your question(s) is too polluted and confusing. I will just point to the fact that if you did not sort "accepted_hits.bam" by name (samtools sort -n), htseq-count will could each read from a pair separately, effectively doubling your read counts for paired-end sequencing.

ADD REPLY • link 9.7 years ago by h.mon 35k

0

Entering edit mode

Dear h.mon,

Actually, I need some clarifications on each section of this pipeline. So I provided the read counts in each step I performed in my pipeline. From sequencing, I received 66,469,158 these many raw read pairs (R1 and R2). After trimming using trimmomatic, I received 66,113,117 trimmed reads pairs (R1 and R2). 356,041 read pairs don't have phred score >20. So they are removed. It is clear till this point.

Why numbers varying between tophat alignment summary and prep_reads_info? Can you explain how to interpret these numbers in both outputs.
Using samtools flag, I got a summary for accepted_hits.bam output file from tophat analysis. How to interpret these numbers from samtools.

As you suggested, now I sorted bam file based on name and provided the sorted to htseq-count. Below is the output

                           Unsorted-Bam     Sorted-Bam
Globin Genes                 Stranded: YES    Stranded: YES
HBB                          36472            19800
HBA1                         20398            4249
HBA2                         144174           47553
HBG1                         2825             2384
HBG2                         2638             1942
HBD                          4                3
HBE1                         0                0
HBZ                          0                0
HBQ1                         3                2
MB                           0                0
CYGB                         2                1
NGB                          2                1
Total Globin                 206,518          75,935
__no_feature                 94,038,451       51,420,979
__ambiguous                  46,175           56,695
__too_low_aQual              0                0
__not_aligned                0                0
__alignment_not_unique       32,018,089       16,952,589
Total reads for all genes    126,969,861      68,900,778
% of Globin_reads            0.1626           0.1102

To get the percentage of reads for globin genes, is it correct to use following values? ((75,935)/(68,900,778))*100 = 0.1102%
How come the total number of reads for sorted bam file 68,900,778 and unsorted bam file 126,969,861 are different.
Why the reads count 68,900,778 from htseq-count is not matching to trimmed reads count or aligned reads?

ADD REPLY • link updated 2.8 years ago by Ram 45k • written 9.7 years ago by nalandaatmi ▴ 110

Ram · Answer 1 · 2015-12-15

You probably could have found the answers by yourself, reading the manuals. Anyway:

What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?

54,521,571 (left reads) is the number of reads mapped. 66,079,886 (left_reads_out) is the number of reads considered for mapping (tophat does is own filtering too).

What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?

Overall mapping rate = percentage of total reads mapped.

concordant pair alignment rate = Percentage of read pairs (you have paired-end data) that map concordantly, i.e, that map while respecting constrains : they are on the same chromosome and not too far apart.

what is this QC-passed reads (131,374,951)?

QC means quality-control. Those reads are well mapped according to samtools criteria.

What is this Read1 (66,297,524) and Read2 (65,077,427)?

Once again, you have paired end data. Reads1 are the first reads of the pairs, Reads2 the second reads.

What is this : Unique Reads: 56,594,380?

This is obviously the number of reads that are unique in your file.

Why the total reads (126,970,700) is different from other numbers?

Because HTseq-counts has its own requirements regarding the reads used as input. see http://www-huber.embl.de/HTSeq/doc/counting.html#handling-paired-end-reads

To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?

You should divide by total pairs (as h.mon comment suggest it, since you sorted your bam file). 0.1102% is therefore the right proportion of read pairs mapped on globin genes.