I have a discrepancy between STAR and other software I cannot reconcile. I'm working on RNA-seq data from B. subtilis. The sequencing facility is adamant they used stranded libraries. And indeed, when I run infer_experiment.py
on the BAM file I get:
This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0744
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9256
This strongly indicates a strand-specific data set. I have also run featureCounts
with -s 1 and -s 2 options. The first one returned about 4.8 M total counts, while the second returned about 76 M total counts, as expected.
The problem occurs when I use STAR for quantification. Considering the above results, I'd expect many more reads in column 4 of the ReadsPerGene.out.tab file than in column 3. However, I get the following total counts in genes, in each column:
col2 col3 col4
38879280 19555164 19324116
This is unexpected. The STAR quantification suggests unstranded data, while both infer_experiment.py
and featureCounts
indicate a stranded set.
Here are the options I used to create the index:
--sjdbGTFfeatureExon gene
--genomeSAindexNbases 10
and for mapping:
--sjdbGTFfeatureExon gene
--outFilterType BySJout
--outSAMtype BAM SortedByCoordinate
--outFilterMultimapNmax 2
--alignIntronMax 1
--limitBAMsortRAM 3000000000
--readFilesCommand zcat
--quantMode GeneCounts
The --alignIntronMax 1 option prohibits splicing - we have a bacterium here. The --sjdbGTFfeatureExon gene option is included both in the index build and mapping, because the GFT file does not have exons and gene is the main feature.
EDIT
I used htseq-count
on the BAM files created by STAR. According to the STAR documentation they should give exactly the same counts. Here is comparison of the htseq-count
with -s reverse -t gene
options versus column 4 in the ReadsPerGene.out.tab file. The red line shows x = y.
To extend this comparison further I also ran featureCount
with -s 2 -t gene
options, here it is how it compares to htseq-count
result:
While htseq and featureCounts agree very well, STAR is an outlier.
Are the genes in the GTF stranded ?
Discussion in https://groups.google.com/g/rna-star/c/O6gUuDi9_os may also be relevant since this is bacterial RNAseq. Do not include the annotation file when generating index per author of STAR.
Hm, that might be a good point. However, I can't see how this would affect the distribution of counts between 3rd and 4th column in quantification.
Given the wildly good correlation I have a feeling it has to do with your forward and reverse reads being counted towards each strand separately. It is probably related to the STAR parameters but I’m not an expert on non-vertebrates. Did you try matching up the counts manually in IGV looking at the BAM for a smaller gene/region
Oh, I wouldn't know what to look for in IGV. The issue is with quantification, and looking at a BAM file in IGV does not tell me how a given read was counted by STAR or htseq-count. I think the issue must be with STAR and/or its parameters. The STAR manual states "The counts coincide with those produced by htseq-count with default parameters" and then specifically "column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)". But my results disagree a lot.
Since the program author has recommended it, can you try indexing without the GTF and then counting again. You don't have polycistronic transctipts, correct? If you see reads crossing over more than one gene in IGV that may be case.
I tried STAR counting with an index created without GTF. The results are identical to those based on index created with GTF. Every single count is the same.
Looks like you have already opened an issue on STAR repo: https://github.com/alexdobin/STAR/issues/2650
That was going to be the next suggestion.
Since STAR is mainly designed to be splice-aware perhaps some interaction of parameters is causing this.
Yes, GTF file contains a strand column and it is either "+" or "-", in about 50-50 ratio.
As long as the genes don’t overlap one another then it should be fine.