Question

Htseq-count: discrepancy between raw count value and viewable value of reads on BAM with IGV

1

Entering edit mode

6.3 years ago

sophie.ehresmann ▴ 10

Hello,

I have aligned a trimmed FASTQ with STAR then sorted it and counted with htseq-count. When I look at the pos- sorted BAM on IGV for a specific gene, I get around 20 reads all over this gene for sample 1, 150 for sample 2, and 6000 for sample 3. However in the raw count read file, I see the same amount of reads for samples 1 and 2. There is no other gene in that region, so I don't think the reads would be discarded because they are ambiguous. Could anybody shed some light on this? Here is the code for both STAR alignment, name sorting, and htseq-count:

STAR --runMode alignReads \
  --genomeDir reference.Merged \
  --readFilesIn \
    trim/Rseq_13/Readset11.trim.pair1.fastq.gz \
    trim/Rseq_13/Readset11.trim.pair2.fastq.gz \
  --runThreadN 16 \
  --readFilesCommand zcat \
  --outStd Log \
  --outSAMunmapped Within \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix alignment/Rseq_13/Readset11/ \
  --limitGenomeGenerateRAM 70000000000 \
  --limitBAMsortRAM 70000000000 \
  --limitIObufferSize 1000000000 \
  --outWigType wiggle read1_5p --outWigStrand Stranded --outWigReferencesPrefix chr \
  --chimSegmentMin 21 && \

module load /java/openjdk-jdk1.8.0_72 mugqic/picard/2.0.1 && \
 java -Djava.io.tmpdir=/gs/scratch/$USER -XX:ParallelGCThreads=1 -Xmx27G -jar $PICARD_HOME/picard.jar SortSam \
 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true \
 TMP_DIR=/gs/scratch/$USER \
 INPUT=alignment/Rseq_13/Rseq_13.sorted.bam \
 OUTPUT=alignment/Rseq_13/Rseq_13.QueryNameSorted.bam \
 SORT_ORDER=queryname \
 MAX_RECORDS_IN_RAM=5750000

module load /samtools/1.3.1 mugqic/python/2.7.13 && \
mkdir -p raw_counts && \
samtools view -F 4 \
  alignment/Rseq_13/Rseq_13.QueryNameSorted.bam | \
htseq-count - \
  -m intersection-nonempty \
  --stranded=reverse \
  --format=sam \
  /annotations/Homo_sapiens.GRCh37.Ensembl75.gtf \
  >raw_counts/Rseq_13.readcounts.csv

rna-seq htseq-count • 2.2k views

ADD COMMENT • link updated 6.1 years ago by Biostar 20 • written 6.3 years ago by sophie.ehresmann ▴ 10

1

Entering edit mode

Look at the SAM flags of the reads. Maybe some are multi-mappers, which are not counted by default by HTSeq.

ADD REPLY • link 6.3 years ago by h.mon 35k

0

Entering edit mode

I went to look at the specific region, which is around 34,000 bp, for all of my samples.

samtools view -o view_TBC_${SAMPLE}.tsv /gs/scratch/p1044860/Pipeline_output_SK_TBC/alignment/${SAMPLE}/${SAMPLE}.sorted.mdup.bam 16:2475051-2509560

For sample 1: 4402 total reads, of which only 1 is bad quality/multi-mapper For sample 2: 3714 total reads, of which 18 are bad quality/multimapper For sample 3: 5215 total reads, of which 36 are bad quality/multimapper.

The rest are fine! All the reads are also in the same direction.

So my follow up question would be: how is there a lower amount of total reads in that region in sample 2, but when I look at it in IGV, the average is 4-5x more in the gene's exons compared to sample 1? Also, how can there be only ~1000 more reads in sample 3 for that region, but a 300x difference in average read count in IGV?

ADD REPLY • link 6.3 years ago by sophie.ehresmann ▴ 10

1

Entering edit mode

And the read directions are the same in all samples? Since you are doing strand specific counting...

ADD REPLY • link 6.3 years ago by WouterDeCoster 47k