Dear all,
I have paired end reads from my sample and the total number of raw reads is 132,938,316.
1) After applying trimmomatic, I have
R1 | 66,113,117 reads |
R2 | 66,113,117 reads |
total | 132,226,234 reads |
Output from Trimmomatic:
Input Read Pairs: 66469158 Both Surviving: 66113117 (99.46%) Forward Only Surviving: 330055 (0.50%) Reverse Only Surviving: 24010 (0.04%) Dropped: 1976 (0.00%)
TrimmomaticPE: Completed successfully
Difference in reads pairs (raw-survived) => (66,469,158-66,113,117) = 356,041 (phred score <=20)
Difference in reads (Raw-Trimmed) => (132,938,316 - 132,226,234) = 712,082
2) After aligning using tophat
TopHat Output align summary:
Left reads:
Input : 66,113,117
Mapped : 54521571 (82.5% of input)
of these: 4509697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads:
Input : 66,113,117
Mapped : 53610826 (81.1% of input)
of these: 4412079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate.
Aligned pairs: 51538017
of these: 4275585 ( 8.3%) have multiple alignments
780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate.
TopHat Output Prep_reads_info:
left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886
right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354
Both left reads input and Left_read_in have 66,113,117 sample value,
Questions:
What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?
What is the difference between right_reads_out (66,070,354) and right reads (53,610,826 )?
Why these numbers are different?
What is the difference between 81.8% overall read mapping rate vs 76.8% concordant pair alignment rate?
3) samtools flagstat accepted_hits.bam
131374951 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
131374951 + 0 mapped (100.00%:-nan%)
131374951 + 0 paired in sequencing
66297524 + 0 read1
65077427 + 0 read2
107575718 + 0 properly paired (81.88%:-nan%)
124948346 + 0 with itself and mate mapped
6426605 + 0 singletons (4.89%:-nan%)
9330692 + 0 with mate mapped to a different chr
810494 + 0 with mate mapped to a different chr (mapQ>=5)
Questions:
a) what is this QC-passed reads (131,374,951)?
b) What is this Read1 (66,297,524) and Read2 (65,077,427)?
4) samtools flagstat unmapped.bam
24,093,837 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
0 + 0 mapped (0.00%:-nan%)
24093837 + 0 paired in sequencing
11,591,546 + 0 read1
12,502,291 + 0 read2
0 + 0 properly paired (0.00%:-nan%)
0 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
5) samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l
Unique Reads: 56,594,380
What about this number 56,594,380?
samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951
samtools view unmapped.bam | cut -f1 | sort | uniq | wc -l
Unique Reads:14,575,100
6) I used htseq-count to get gene counts for accepted_hits.bam
Globin Genes | Stranded-Yes(Sense) | Stranded-reverse (Antisense) |
HBB | 36,472 | 7 |
HBA1 | 20,398 | 14 |
HBA2 | 144,174 | 79 |
HBG1 | 2,825 | 0 |
HBG2 | 2,638 | 0 |
HBD | 4 | 0 |
HBE1 | 0 | 1 |
HBZ | 0 | 0 |
HBQ1 | 3 | 6 |
MB | 0 | 4 |
CYGB | 2 | 354 |
NGB | 2 | 319 |
Total Globin | 206,518 | 784 |
Total reads | 126,970,700 | 126,970,700 |
% of Globin_reads | 0.1627 | 0.0006 |
Again, I used the accepted_hits.bam as in the input for htseq-count.
Questions:
Why the total reads (126,970,700) is different from other numbers?
To get percentage of reads, do I need to divide either by total reads (126,970,700) or unique reads: (56,594,380)?
Your question(s) is too polluted and confusing. I will just point to the fact that if you did not sort "accepted_hits.bam" by name (samtools sort -n), htseq-count will could each read from a pair separately, effectively doubling your read counts for paired-end sequencing.
Dear h.mon,
Actually, I need some clarifications on each section of this pipeline. So I provided the read counts in each step I performed in my pipeline. From sequencing, I received 66,469,158 these many raw read pairs (R1 and R2). After trimming using trimmomatic, I received 66,113,117 trimmed reads pairs (R1 and R2). 356,041 read pairs don't have phred score >20. So they are removed. It is clear till this point.
As you suggested, now I sorted bam file based on name and provided the sorted to htseq-count. Below is the output
To get the percentage of reads for globin genes, is it correct to use following values? ((75,935)/(68,900,778))*100 = 0.1102%