Entering edit mode
6.3 years ago
sophie.ehresmann
▴
10
Hello,
I have aligned a trimmed FASTQ with STAR then sorted it and counted with htseq-count. When I look at the pos- sorted BAM on IGV for a specific gene, I get around 20 reads all over this gene for sample 1, 150 for sample 2, and 6000 for sample 3. However in the raw count read file, I see the same amount of reads for samples 1 and 2. There is no other gene in that region, so I don't think the reads would be discarded because they are ambiguous. Could anybody shed some light on this? Here is the code for both STAR alignment, name sorting, and htseq-count:
STAR --runMode alignReads \
--genomeDir reference.Merged \
--readFilesIn \
trim/Rseq_13/Readset11.trim.pair1.fastq.gz \
trim/Rseq_13/Readset11.trim.pair2.fastq.gz \
--runThreadN 16 \
--readFilesCommand zcat \
--outStd Log \
--outSAMunmapped Within \
--outSAMtype BAM SortedByCoordinate \
--outFileNamePrefix alignment/Rseq_13/Readset11/ \
--limitGenomeGenerateRAM 70000000000 \
--limitBAMsortRAM 70000000000 \
--limitIObufferSize 1000000000 \
--outWigType wiggle read1_5p --outWigStrand Stranded --outWigReferencesPrefix chr \
--chimSegmentMin 21 && \
module load /java/openjdk-jdk1.8.0_72 mugqic/picard/2.0.1 && \
java -Djava.io.tmpdir=/gs/scratch/$USER -XX:ParallelGCThreads=1 -Xmx27G -jar $PICARD_HOME/picard.jar SortSam \
VALIDATION_STRINGENCY=SILENT CREATE_INDEX=true \
TMP_DIR=/gs/scratch/$USER \
INPUT=alignment/Rseq_13/Rseq_13.sorted.bam \
OUTPUT=alignment/Rseq_13/Rseq_13.QueryNameSorted.bam \
SORT_ORDER=queryname \
MAX_RECORDS_IN_RAM=5750000
module load /samtools/1.3.1 mugqic/python/2.7.13 && \
mkdir -p raw_counts && \
samtools view -F 4 \
alignment/Rseq_13/Rseq_13.QueryNameSorted.bam | \
htseq-count - \
-m intersection-nonempty \
--stranded=reverse \
--format=sam \
/annotations/Homo_sapiens.GRCh37.Ensembl75.gtf \
>raw_counts/Rseq_13.readcounts.csv
Look at the SAM flags of the reads. Maybe some are multi-mappers, which are not counted by default by HTSeq.
I went to look at the specific region, which is around 34,000 bp, for all of my samples.
For sample 1: 4402 total reads, of which only 1 is bad quality/multi-mapper For sample 2: 3714 total reads, of which 18 are bad quality/multimapper For sample 3: 5215 total reads, of which 36 are bad quality/multimapper.
The rest are fine! All the reads are also in the same direction.
So my follow up question would be: how is there a lower amount of total reads in that region in sample 2, but when I look at it in IGV, the average is 4-5x more in the gene's exons compared to sample 1? Also, how can there be only ~1000 more reads in sample 3 for that region, but a 300x difference in average read count in IGV?
And the read directions are the same in all samples? Since you are doing strand specific counting...