[RNA-seq] Extremely Unconsistent Raw Count from Different Tools; a dark mistery
0
1
Entering edit mode
3.1 years ago
ReWeeda ▴ 120

Good morning fellow bioinformaticians,

It's been a full week since I started struggling with a problem related to reads count extraction from RNA-seq data and I ran out of ideas.

I mainly work with RNA-seq data from FFPE samples (I know... FFPE samples sucks but this is another story) and I developed a dedicated workflow for data analysis.

After a pre-processing phase (quality check, quality filtering and trimming), raw reads are aligned with STAR (v.2.7.3.a; with 2pass mode), then I perform reads count extraction using salmon (v.1.4.0).

Recently, I had to focus on a specific gene (CD99) that is expressed in my samples. I noticed that, while salmon returns me a reasonable value for that specific gene, STAR (option "--quantMode GeneCounts" provided in the alignment phase) gives me 0 counts for CD99.

Given this observation, I tried to produce reads count with different tools such as htseq-count (v.0.13.5) and featurecounts (v.2.4.0). Again, htseq-count and featurecount returned me 0 counts for CD99. Checking the bam file of the tested sample with IGV, CD99 is clearly covered by a good amount of aligned reads.

First, I thought I set wrong some flags related to the strandness of my data in htseq-count and featurecounts so I brutally tried all the possibile flags combination. Unfortunately, no solution popped up (still 0 counts for CD99)

Given that both htseq-counts and featurecounts take bam file in input, I thought about some problem related to the alignment phase so I regenerated the STAR genome index and then I realigned my sample. Again, STAR, htseq-count and featurecounts returned me 0 counts for CD99.

At the same time, uploading original fastq files to BaseSpace, I get counts for that gene as expected. Moreover, trying to produce counts in R with featurecounts (v.1.3.4) I get counts for CD99.

Right now, I've no clue about what's going on. Can anyone help me explaining such behaviour?

  • Reference Genome is HG38
  • GTF file is from Gencode v33
  • Library is Paired-end; reverse strand

Code I use for STAR:

STAR --genomeDir $star_index_path --readFilesIn $R1 $R2 --runThreadN 8 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --readFilesCommand gunzip -c --sjdbGTFfile $gtfFile --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --outSAMstrandField intronMotif --outSAMtype None --outSAMmode None --outFileNamePrefix $output_prefix

STAR --runMode genomeGenerate --genomeDir $idx_path_1pass --genomeFastaFiles $reference_path --sjdbOverhang 100 --runThreadN 8 --sjdbFileChrStartEnd $output_prefix'SJ.out.tab'

STAR --genomeDir $idx_path_1pass --readFilesIn $R1 $R2 --runThreadN 8 --outFilterMultimapScoreRange 1 --outFilterMultimapNmax 20 --outFilterMismatchNmax 10 --alignIntronMax 500000 --alignMatesGapMax 1000000 --sjdbScore 2 --alignSJDBoverhangMin 1 --limitBAMsortRAM 0 --readFilesCommand gunzip -c --sjdbGTFfile $gtfFile --quantMode GeneCounts --outFilterMatchNminOverLread 0.33 --outFilterScoreMinOverLread 0.33 --sjdbOverhang 100 --alignSJstitchMismatchNmax 5 -1 5 5 --chimSegmentMin 10 --chimOutType Junctions WithinBAM HardClip --chimJunctionOverhangMin 10 --chimScoreDropMax 30 --chimScoreJunctionNonGTAG 0 --chimScoreSeparation 1 --chimSegmentReadGapMax 3 --chimMultimapNmax 0 --outSAMstrandField intronMotif --outFileNamePrefix $output_prefix --outSAMunmapped Within --outSAMtype BAM SortedByCoordinate

Code for Salmon:

salmon quant -i $salmon_transcriptome_index -l ISR -1 $R1 -2 $R2 --threads 8 --validateMappings --output $salmon_output_dir

Code for htseq-count:

htseq-count -q -f bam -s reverse -r pos -t exon -i gene_name -n 8 $bam $gtf > './htseq/'$idn'_htseq_name-sort_rw.txt'

Code for featurecounts :

featureCounts --primary -C -t exon -g gene_name -a $GTF --ignoreDup -s 2 -T 8 -o './featurecounts/'$idn'.quant.FC.name-sorted.rw.txt' $bam
RNA-Seq alignment counts • 1.3k views
ADD COMMENT
1
Entering edit mode

Are the reads covering that gene of good mapping quality or multimappers (MAPQ=0, or at least low)?

ADD REPLY
0
Entering edit mode
samtools view sample_CD99.bam | awk '{sum+=$5} END { print "Mean MAPQ =",sum/NR}'

Mean MAPQ = 2.94356

(*) CD99 (chrX) has got its PAR on chrY

Edit for: copy-pasting error.

ADD REPLY
1
Entering edit mode

That is pretty low, guess that's the answer then. I would use salmon, in fact I do :)

ADD REPLY
1
Entering edit mode

Well, I was expecting such low MAPQ value for a gene in pseudoautosomal region. I didn't realize until now that it could be a problem during reads count extraction.

I guess I will stick to Salmon too. :-)

ADD REPLY
2
Entering edit mode
3.1 years ago

Is it possible that all reads that map to CD99 are multimapping? Both featureCounts and htseq-count ignore multimapping reads, salmon uses EM to distribute their signal across all locations.

ADD COMMENT
1
Entering edit mode

That's the answer. I was aware that htseq-count and featurecounts ignore multimapping reads but I didn't think that genes in psudoautosomal regions are by definition multimapper in chromosome X and Y.

ADD REPLY
0
Entering edit mode

Any time a gene has a big number on the end you have to think about why. Are there 98 other genes with sequence similarity? I would want to keep multi-mapping to some extent. Or get fancy like Salmon or Cufflinks tries to disambiguate, and dont even look at "raw counts". Edit this will also be an issue when you try to qPCR or Sanger validate some findings, beware there could be 98 more genes with sequence similarity.

ADD REPLY
0
Entering edit mode

I would not blame CD99 sequence similarity alone. While it may have similar regions, it also has unique regions. You can get many counts (CPM > 10) for CD99 based on unique mappers alone.

If you are using Salmon, you should strongly consider using decoys in your index. Otherwise, low-quality samples may end up with highly inflated counts.

ADD REPLY

Login before adding your answer.

Traffic: 3205 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6