Htseq-count: discrepancy between raw count value and viewable value of reads on BAM with IGV
0
1
Entering edit mode
6.3 years ago

Hello,

I have aligned a trimmed FASTQ with STAR then sorted it and counted with htseq-count. When I look at the pos- sorted BAM on IGV for a specific gene, I get around 20 reads all over this gene for sample 1, 150 for sample 2, and 6000 for sample 3. However in the raw count read file, I see the same amount of reads for samples 1 and 2. There is no other gene in that region, so I don't think the reads would be discarded because they are ambiguous. Could anybody shed some light on this? Here is the code for both STAR alignment, name sorting, and htseq-count:

STAR --runMode alignReads \
  --genomeDir reference.Merged \
  --readFilesIn \
    trim/Rseq_13/Readset11.trim.pair1.fastq.gz \
    trim/Rseq_13/Readset11.trim.pair2.fastq.gz \
  --runThreadN 16 \
  --readFilesCommand zcat \
  --outStd Log \
  --outSAMunmapped Within \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix alignment/Rseq_13/Readset11/ \
  --limitGenomeGenerateRAM 70000000000 \
  --limitBAMsortRAM 70000000000 \
  --limitIObufferSize 1000000000 \
  --outWigType wiggle read1_5p --outWigStrand Stranded --outWigReferencesPrefix chr \
  --chimSegmentMin 21 && \

module load /java/openjdk-jdk1.8.0_72 mugqic/picard/2.0.1 && \
 java -Djava.io.tmpdir=/gs/scratch/$USER -XX:ParallelGCThreads=1 -Xmx27G -jar $PICARD_HOME/picard.jar SortSam \
 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true \
 TMP_DIR=/gs/scratch/$USER \
 INPUT=alignment/Rseq_13/Rseq_13.sorted.bam \
 OUTPUT=alignment/Rseq_13/Rseq_13.QueryNameSorted.bam \
 SORT_ORDER=queryname \
 MAX_RECORDS_IN_RAM=5750000

module load /samtools/1.3.1 mugqic/python/2.7.13 && \
mkdir -p raw_counts && \
samtools view -F 4 \
  alignment/Rseq_13/Rseq_13.QueryNameSorted.bam | \
htseq-count - \
  -m intersection-nonempty \
  --stranded=reverse \
  --format=sam \
  /annotations/Homo_sapiens.GRCh37.Ensembl75.gtf \
  >raw_counts/Rseq_13.readcounts.csv
rna-seq htseq-count • 2.2k views
ADD COMMENT
1
Entering edit mode

Look at the SAM flags of the reads. Maybe some are multi-mappers, which are not counted by default by HTSeq.

ADD REPLY
0
Entering edit mode

I went to look at the specific region, which is around 34,000 bp, for all of my samples.

samtools view -o view_TBC_${SAMPLE}.tsv /gs/scratch/p1044860/Pipeline_output_SK_TBC/alignment/${SAMPLE}/${SAMPLE}.sorted.mdup.bam 16:2475051-2509560

For sample 1: 4402 total reads, of which only 1 is bad quality/multi-mapper For sample 2: 3714 total reads, of which 18 are bad quality/multimapper For sample 3: 5215 total reads, of which 36 are bad quality/multimapper.

The rest are fine! All the reads are also in the same direction.

So my follow up question would be: how is there a lower amount of total reads in that region in sample 2, but when I look at it in IGV, the average is 4-5x more in the gene's exons compared to sample 1? Also, how can there be only ~1000 more reads in sample 3 for that region, but a 300x difference in average read count in IGV?

ADD REPLY
1
Entering edit mode

And the read directions are the same in all samples? Since you are doing strand specific counting...

ADD REPLY

Login before adding your answer.

Traffic: 2966 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6