Strandedness issues in STAR quantification
1
0
Entering edit mode
9 weeks ago

I have a discrepancy between STAR and other software I cannot reconcile. I'm working on RNA-seq data from B. subtilis. The sequencing facility is adamant they used stranded libraries. And indeed, when I run infer_experiment.py on the BAM file I get:

This is PairEnd Data
Fraction of reads failed to determine: 0.0000
Fraction of reads explained by "1++,1--,2+-,2-+": 0.0744
Fraction of reads explained by "1+-,1-+,2++,2--": 0.9256

This strongly indicates a strand-specific data set. I have also run featureCounts with -s 1 and -s 2 options. The first one returned about 4.8 M total counts, while the second returned about 76 M total counts, as expected.

The problem occurs when I use STAR for quantification. Considering the above results, I'd expect many more reads in column 4 of the ReadsPerGene.out.tab file than in column 3. However, I get the following total counts in genes, in each column:

    col2     col3     col4
38879280 19555164 19324116

This is unexpected. The STAR quantification suggests unstranded data, while both infer_experiment.py and featureCounts indicate a stranded set.

Here are the options I used to create the index:

--sjdbGTFfeatureExon gene
 --genomeSAindexNbases 10

and for mapping:

--sjdbGTFfeatureExon gene
--outFilterType BySJout
--outSAMtype BAM SortedByCoordinate
--outFilterMultimapNmax 2
--alignIntronMax 1
--limitBAMsortRAM 3000000000
--readFilesCommand zcat
--quantMode GeneCounts

The --alignIntronMax 1 option prohibits splicing - we have a bacterium here. The --sjdbGTFfeatureExon gene option is included both in the index build and mapping, because the GFT file does not have exons and gene is the main feature.


EDIT

I used htseq-count on the BAM files created by STAR. According to the STAR documentation they should give exactly the same counts. Here is comparison of the htseq-count with -s reverse -t gene options versus column 4 in the ReadsPerGene.out.tab file. The red line shows x = y.

STAR vs htseq-count

To extend this comparison further I also ran featureCount with -s 2 -t gene options, here it is how it compares to htseq-count result:

featureCount vs htseq-count

While htseq and featureCounts agree very well, STAR is an outlier.

STAR stranded • 1.3k views
ADD COMMENT
0
Entering edit mode

Are the genes in the GTF stranded ?

ADD REPLY
0
Entering edit mode

Discussion in https://groups.google.com/g/rna-star/c/O6gUuDi9_os may also be relevant since this is bacterial RNAseq. Do not include the annotation file when generating index per author of STAR.

ADD REPLY
0
Entering edit mode

Hm, that might be a good point. However, I can't see how this would affect the distribution of counts between 3rd and 4th column in quantification.

ADD REPLY
0
Entering edit mode

Given the wildly good correlation I have a feeling it has to do with your forward and reverse reads being counted towards each strand separately. It is probably related to the STAR parameters but I’m not an expert on non-vertebrates. Did you try matching up the counts manually in IGV looking at the BAM for a smaller gene/region

ADD REPLY
0
Entering edit mode

Oh, I wouldn't know what to look for in IGV. The issue is with quantification, and looking at a BAM file in IGV does not tell me how a given read was counted by STAR or htseq-count. I think the issue must be with STAR and/or its parameters. The STAR manual states "The counts coincide with those produced by htseq-count with default parameters" and then specifically "column 4: counts for the 2nd read strand aligned with RNA (htseq-count option -s reverse)". But my results disagree a lot.

ADD REPLY
0
Entering edit mode

Since the program author has recommended it, can you try indexing without the GTF and then counting again. You don't have polycistronic transctipts, correct? If you see reads crossing over more than one gene in IGV that may be case.

ADD REPLY
0
Entering edit mode

I tried STAR counting with an index created without GTF. The results are identical to those based on index created with GTF. Every single count is the same.

ADD REPLY
0
Entering edit mode

Looks like you have already opened an issue on STAR repo: https://github.com/alexdobin/STAR/issues/2650

That was going to be the next suggestion.

Since STAR is mainly designed to be splice-aware perhaps some interaction of parameters is causing this.

ADD REPLY
0
Entering edit mode

Yes, GTF file contains a strand column and it is either "+" or "-", in about 50-50 ratio.

ADD REPLY
0
Entering edit mode

As long as the genes don’t overlap one another then it should be fine.

ADD REPLY
3
Entering edit mode
8 weeks ago

Problem solved with help from rfran010 on STAR GitHub. The GTF file I use has a blank attribute transcript_id "" for each gene. The default STAR option --sjdbGTFtagExonParentTranscript=transcript_id points to this blank attribute and causes problems. After replacing it with --sjdbGTFtagExonParentTranscript=gene_id, the reads are now properly counted and stranded, most of them are in column 4.

Also, to get the same results from STAR and featureCounts, I had to add --countReadPairs argument to featureCounts, otherwise it counts reads, not pairs and reports twice the counts of STAR.

ADD COMMENT
1
Entering edit mode

Thanks for coming back and providing the solution.

I had to add --countReadPairs argument to featureCounts, otherwise it counts reads, not pairs and reports twice the counts of STAR.

That option was added to featureCounts (in subread release v.2.0.2 in 2021) and is noted explicitly in the manual.

Having an empty feature name perhaps breaks the GTF file format (though it is not specifically noted as such in spec).

This issue emphasizes the fact that default values for program parameters always apply. Something easy to overlook with a program like STAR that has many.

ADD REPLY

Login before adding your answer.

Traffic: 4549 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6