I am rather new to RNAseq, particularly when it comes to processing my own data. Please excuse any naivete on my part. And please excuse the length of the post. I wanted to be detailed (and have really run myself into a place of confusion).
For my current experiment, I have six samples derived from RNA from human cells with about 76M reads per sample from 2x150 Illumina sequencing. My RNA and libraries were prepped targeting 150bp inserts, and all library QC before sequencing reflects this. As a helpful "control", I do have single-cell RNAseq data collected from the same samples concurrently with the 10x Genomics platform, so I can verify some of these expression trends using those data.
For each sample, I trimmed adapters using bbduk and then aligned my reads using hisat2. I converted the resulting SAM files to BAM using samtools sort. Using
samtools flagstat, I can see that about ~92% of my reads map and ~85% are properly paired. This seems adequate for me for now.
From there, I want to get expression data, but I'm trying to discern which program is best for me to use based on their outputs and integration into downstream programs for analysis of differential expression. So I've turned to two different options: stringtie and htseq - but I am getting very different results between these two programs, and I'm not sure why.
A sample stringtie input is as follows (the sample is named "V184"); I intended to run the output through either edgeR or ballgown for differential expression analysis:
stringtie -p 8 -G [GRCh38 GTF] -B -o V184/transcripts.gtf -A V184/gene_abundances.tsv [alignment directory]/V2D.bam
For whatever it is worth here, stringtie throws an error if my BAM input is sorted by transcript name, so I have to sort by by position using samtools.
A sample input for htseq-count is as follows (same sample):
htseq-count --format bam --order name --mode union --stranded reverse --minaqual 1 --type exon --idattr gene_id [Alignment directory]/V184_name.bam [GRCh38 GTF] > V184_gene.tsv
htseq-count reports memory errors if I use a BAM sorted by position; I found a previous post on this site with the same issue where they solved it by sorting the file by read name. Thus I did so accordingly, sorting a second BAM file from the source SAM file (
samtools sort -n). Those BAMs should be identical other than the sort.
So the first question:
How do I tell the strandedness of my paired reads? The arguments in
--stranded reverse versus
--stranded yes give very different outputs. I initially ran
stranded-reverse because my library preparation kit gave no indication one way or another, and so I figured I'd try both. Generally, the outputs of
stranded-yes better mirror what I see in my corresponding single-cell data. But going forward, how do I think about this?
If I move forward with comparisons now of stringtie and htseq, the results are very different (I fully recognize that the FPKM/TPM outputs of stringtie are different than the raw counts of htseq). Generally, the stringtie output corresponds better to the output of
htseq-count --stranded reverse - which seems biologically incorrect.
Second question: Did I miss something in my stringtie input that has reversed the strandedness of my reads?
But even if I move forward and compare the outputs of
htseq-count --stranded reverse, there are meaningful differences:
My most-abundant transcript by FPKM from stringtie is VIM-AS1/ENSG00000229124 (FPKM = 70600) - but in htseq, this transcript is the most abundant but has an FPKM of 540. GAPDH/ENSG00000111640 shows the following: 10.6 FPKM (stringtie) versus 77253 total reads, calculating to 860 FPKM (htseq-count reports 64.8M total reads for this sample). I could go on and on...
Possibly relevant is that the output of
htseq-count shows 29% of reads having "no_feature", and another 27% either not aligning, having low quality, or non-unique alignment. The "no_feature" output is 71% of my reads (and the 27% for alignment issues is consistent) when I use the
stranded --reverse argument.
Final question, then: Why do the outputs between stringtie and htseq-count vary so much? Is this a function of the large number of reads being discarded by htseq? Can I fix this somehow?
Thank you so much for your time and any insight you can provide!