Question: Difference in htseq-count values for sorted bam and unsorted bam by name
1
gravatar for nalandaatmi
3.4 years ago by
nalandaatmi60
United States
nalandaatmi60 wrote:

Dear All,

I am working on getting the read counts for some genes. Firstly, I aligned my trimmed RNASeq reads with human and mouse genome to get alignment file (accepted_hits.bam) in bam format using tophat. I provided htseq-count tool with two different bam files (sorted and unsorted).

Case1: I used the accepted_hits.bam from tophat directly as the input to the htseq-count and counted the reads for each gene. 

htseq-count -f bam -s yes -i gene_id accepted_hits.bam hg19_genes.gtf > HTSeq-count_$sampleid$divider$currentDate.txt

Note: By default accepted_hits.bam file is sorted based on coordinates by tophat.

Case2: I sorted the accepted_hits.bam file based on "NAMES" using samtools. Then provided as input to the htseq-count and counted the reads for each gene.

$ samtools sort -n accepted_hits.bam accepted_hits_sorted.bam

$ htseq-count -f bam -r name -s yes -i gene_name accepted_hits_sorted.bam hg19_genes.gtf  > HTSeq-count_$sampleid$divider$currentDate.txt

  Unsorted-Bam Sorted-Bam
Globin Genes Stranded: YES Stranded: YES
HBB 36472 19800
HBA1 20398 4249
HBA2 144174 47553
HBG1 2825 2384
HBG2 2638 1942
HBD 4 3
HBE1 0 0
HBZ 0 0
HBQ1 3 2
MB 0 0
CYGB 2 1
NGB 2 1
Total Globin 206,518 75,935
__no_feature 94,038,451 51,420,979
__ambiguous 46,175 56,695
__too_low_aQual 0 0
__not_aligned 0 0
__alignment_not_unique 32,018,089 16,952,589
Total reads 126,969,861 68,900,778
% of Globin_reads 0.1626 0.1102

Can anyone explain why the total reads value is low for sorted bam compared to unsorted bam?

ADD COMMENTlink modified 3.4 years ago by Devon Ryan88k • written 3.4 years ago by nalandaatmi60
3
gravatar for Devon Ryan
3.4 years ago by
Devon Ryan88k
Freiburg, Germany
Devon Ryan88k wrote:

The name sorted results are correct, the coordinate ones are not because you lied to htseq-count and told it you had a name-sorted file (it's the default, you don't need to say -r name) instead of the coordinate-sorted file that you do have (accepted_hits.bam is coordinate sorted by default). Try adding -r pos with accepted_hits.bam and you'll likely get more or less identical results.

ADD COMMENTlink written 3.4 years ago by Devon Ryan88k

Thanks Devon for your answers. I submitted the jobs for htseq-count with -r pos parameters. Currently, the jobs are running. I have another query,

Why the reads count 68,900,778 from htseq-count is not matching to either the trimmed reads count (66,113,117 trimmed reads pairs (R1 and R2))

or

the aligned reads count (Unique reads: 56,594,380 or 131,374,951) from bam file?

$samtools view accepted_hits.bam | cut -f1 | sort | uniq | wc -l

Unique Reads: 56,594,380 

$samtools view accepted_hits.bam | cut -f1 | sort | wc -l
131,374,951

 

 

ADD REPLYlink written 3.4 years ago by nalandaatmi60

Tophat may output multiple alignments for a given read/pair. htseq-count won't include them in any real counts (they'll be in "__alignment_not_unique"). You don't want to count reads, but fragments, so there are ~57 million of them.

ADD REPLYlink written 3.4 years ago by Devon Ryan88k

Yeah that make sense, there are ~57 million unique fragments. So in that case, a fragment can be single read or combination of multiple reads. 

The below details are from this post C: Difference in number of reads in tophat and htseq-count outputs.

2) After aligning the trimmed reads using tophat

TopHat Output align summary:

Left reads: (Left reads alone aligned)
          Input     :  66,113,117
           Mapped   :  54,521,571 (82.5% of input)
            of these:   4,509,697 ( 8.3%) have multiple alignments (78742 have >20)
Right reads: (Right reads alone aligned)
          Input     :  66,113,117
           Mapped   :  53,610,826 (81.1% of input)
            of these:   4,412,079 ( 8.2%) have multiple alignments (78753 have >20)
81.8% overall read mapping rate. (is average of 82.5% and 81.1%)

Aligned pairs:  51,538,017 (Both left and right reads as a pair aligned)
     of these:   4,275,585 ( 8.3%) have multiple alignments
                  780763 ( 1.5%) are discordant alignments
76.8% concordant pair alignment rate. 

Left reads Input - Mapped reads = 11,591,546 reads (i.e 17.5% is not mapped from left)

Right reads Input - Mapped reads = 12,502,291 reads (18.9% is not mapped from right)

TopHat Output Prep_reads_info:

left_min_read_len=20
left_max_read_len=101
left_reads_in =66,113,117
left_reads_out=66,079,886

right_min_read_len=20
right_max_read_len=101
right_reads_in =66,113,117
right_reads_out=66,070,354

left_reads_in - left_reads_out = 33,231 reads (what happened to these reads?)

right_reads_in - right_reads_out= 42,763 (what happened to these reads?)

Both left reads input and Left_read_in have 66,113,117 sample value,

Questions:

What is the difference between left_reads_out (66,079,886) and left reads (54,521,571)?

What is the difference between right_reads_out (66,070,354) and right reads (53,610,826 )?

What is the difference between 81.8% overall read mapping rate and 76.8% concordant pair alignment rate?

ADD REPLYlink written 3.4 years ago by nalandaatmi60

left_reads_out/right_reads_out are what remain after tophat filters for length. You can google "concordant alignment" for the rest of your question.

ADD REPLYlink written 3.4 years ago by Devon Ryan88k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1374 users visited in the last hour