I'm trying to generate read counts based on a GTF using featureCounts.
When I last ran an RNAseq project using Subread v2.0.3, the following line worked. I used -t CDS because not all of the 'exon' entries in my file have a 'gene_id' available:
featureCounts \
-a $ANNOTATION \
-o ${OUTPUT_DIR}/counts_v5gtf.txt \
-t CDS \
-g gene_id \
-p \
--countReadPairs \
Now, in v2.0.8, using the same code above, my job is failing with an error that the 9th column in the GTF has other options besides just 'gene_id'. Makes it seem like -t is not working properly?
Has anyone experienced similar issues? Or any suggestions on what else I might be missing?
Try the older version? Is it the same GTF from before?
I would guess a formatting issue with the 9th column, but I haven't tried this version.
The exact error message and a couple example lines of the GTF may be helpful too.
Unfortunately, I'm working on a managed cluster, so I can't test with the older version easily. But it's the same GTF as before. In both cases, I've rearranged the original GTF so that the column containing the gene_id is available in the 9th column.
The issue with the GTF seems to be that not all of the exon entries have a gene_id available. Previously, I was able to circumvent this issue by simply telling featureCounts to work with the CDS lines, but that's the option that no longer seems to be working? (
-t CDS
)Example lines from the rearranged GTF:
The error message that I get is:
It would seem the gene_id is not available for one of the records. You can check if the CDS option is working by extracting a few CDS only annotations and seeing if counting is working. If it is working, then there's an error during rearrangement. You can manually add it, or change the gene identifier if that would make sense in your use case.
e.g. to extract CDS annotations (if the incorrect record, gene-ND1 is in the first 10 CDS, it would fail too though...)