Question: Htseq-count: discrepancy between raw count value and viewable value of reads on BAM with IGV
1
gravatar for sophie.ehresmann
15 months ago by
sophie.ehresmann10 wrote:

Hello,

I have aligned a trimmed FASTQ with STAR then sorted it and counted with htseq-count. When I look at the pos- sorted BAM on IGV for a specific gene, I get around 20 reads all over this gene for sample 1, 150 for sample 2, and 6000 for sample 3. However in the raw count read file, I see the same amount of reads for samples 1 and 2. There is no other gene in that region, so I don't think the reads would be discarded because they are ambiguous. Could anybody shed some light on this? Here is the code for both STAR alignment, name sorting, and htseq-count:

STAR --runMode alignReads \
  --genomeDir reference.Merged \
  --readFilesIn \
    trim/Rseq_13/Readset11.trim.pair1.fastq.gz \
    trim/Rseq_13/Readset11.trim.pair2.fastq.gz \
  --runThreadN 16 \
  --readFilesCommand zcat \
  --outStd Log \
  --outSAMunmapped Within \
  --outSAMtype BAM SortedByCoordinate \
  --outFileNamePrefix alignment/Rseq_13/Readset11/ \
  --limitGenomeGenerateRAM 70000000000 \
  --limitBAMsortRAM 70000000000 \
  --limitIObufferSize 1000000000 \
  --outWigType wiggle read1_5p --outWigStrand Stranded --outWigReferencesPrefix chr \
  --chimSegmentMin 21 && \

module load /java/openjdk-jdk1.8.0_72 mugqic/picard/2.0.1 && \
 java -Djava.io.tmpdir=/gs/scratch/$USER -XX:ParallelGCThreads=1 -Xmx27G -jar $PICARD_HOME/picard.jar SortSam \
 VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true \
 TMP_DIR=/gs/scratch/$USER \
 INPUT=alignment/Rseq_13/Rseq_13.sorted.bam \
 OUTPUT=alignment/Rseq_13/Rseq_13.QueryNameSorted.bam \
 SORT_ORDER=queryname \
 MAX_RECORDS_IN_RAM=5750000

module load /samtools/1.3.1 mugqic/python/2.7.13 && \
mkdir -p raw_counts && \
samtools view -F 4 \
  alignment/Rseq_13/Rseq_13.QueryNameSorted.bam | \
htseq-count - \
  -m intersection-nonempty \
  --stranded=reverse \
  --format=sam \
  /annotations/Homo_sapiens.GRCh37.Ensembl75.gtf \
  >raw_counts/Rseq_13.readcounts.csv
rna-seq htseq-count • 765 views
ADD COMMENTlink modified 13 months ago by Biostar ♦♦ 20 • written 15 months ago by sophie.ehresmann10
1

Look at the SAM flags of the reads. Maybe some are multi-mappers, which are not counted by default by HTSeq.

ADD REPLYlink written 15 months ago by h.mon24k

I went to look at the specific region, which is around 34,000 bp, for all of my samples.

samtools view -o view_TBC_${SAMPLE}.tsv /gs/scratch/p1044860/Pipeline_output_SK_TBC/alignment/${SAMPLE}/${SAMPLE}.sorted.mdup.bam 16:2475051-2509560

For sample 1: 4402 total reads, of which only 1 is bad quality/multi-mapper For sample 2: 3714 total reads, of which 18 are bad quality/multimapper For sample 3: 5215 total reads, of which 36 are bad quality/multimapper.

The rest are fine! All the reads are also in the same direction.

So my follow up question would be: how is there a lower amount of total reads in that region in sample 2, but when I look at it in IGV, the average is 4-5x more in the gene's exons compared to sample 1? Also, how can there be only ~1000 more reads in sample 3 for that region, but a 300x difference in average read count in IGV?

ADD REPLYlink written 15 months ago by sophie.ehresmann10
1

And the read directions are the same in all samples? Since you are doing strand specific counting...

ADD REPLYlink written 15 months ago by WouterDeCoster38k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 727 users visited in the last hour