featureCounts -t option not working in v2.0.8?
0
0
Entering edit mode
12 weeks ago
atan • 0

I'm trying to generate read counts based on a GTF using featureCounts.

When I last ran an RNAseq project using Subread v2.0.3, the following line worked. I used -t CDS because not all of the 'exon' entries in my file have a 'gene_id' available:

featureCounts \ -a $ANNOTATION \ -o ${OUTPUT_DIR}/counts_v5gtf.txt \ -t CDS \ -g gene_id \ -p \ --countReadPairs \

Now, in v2.0.8, using the same code above, my job is failing with an error that the 9th column in the GTF has other options besides just 'gene_id'. Makes it seem like -t is not working properly?

Has anyone experienced similar issues? Or any suggestions on what else I might be missing?

subread featureCounts RNAseq • 533 views
ADD COMMENT
1
Entering edit mode

Try the older version? Is it the same GTF from before?

I would guess a formatting issue with the 9th column, but I haven't tried this version.

The exact error message and a couple example lines of the GTF may be helpful too.

ADD REPLY
0
Entering edit mode

Unfortunately, I'm working on a managed cluster, so I can't test with the older version easily. But it's the same GTF as before. In both cases, I've rearranged the original GTF so that the column containing the gene_id is available in the 9th column.

The issue with the GTF seems to be that not all of the exon entries have a gene_id available. Previously, I was able to circumvent this issue by simply telling featureCounts to work with the CDS lines, but that's the option that no longer seems to be working? (-t CDS)

Example lines from the rearranged GTF:

NW_022144751.1  Gnomon  CDS     7912    8280    .   -   0   gene_id "gene-LOC115922266"; transcript_id "rna-XM_030981128.1"; gene_name "LOC115922266";

NW_022144751.1  Gnomon  transcript  10108   15985   .   -   .   gene_id "gene-LOC115927140"; transcript_id "gene-LOC115927140"; gene_name "LOC115927140"

NW_022144751.1  Gnomon  exon    10108   10427   .   -   .   transcript_id "gene-LOC115927140"; gene_name "LOC115927140";

The error message that I get is:

ERROR: failed to find the gene identifier attribute in the 9th column of the provided GTF file.
The specified gene identifier attribute is 'gene_id'
An example of attributes included in your GTF annotation is 'transcript_id "gene-ND1"; gene_name "ND1";'.
ADD REPLY
0
Entering edit mode

It would seem the gene_id is not available for one of the records. You can check if the CDS option is working by extracting a few CDS only annotations and seeing if counting is working. If it is working, then there's an error during rearrangement. You can manually add it, or change the gene identifier if that would make sense in your use case.

e.g. to extract CDS annotations (if the incorrect record, gene-ND1 is in the first 10 CDS, it would fail too though...)

awk '$3 == "CDS"' GTF_FILE | head -n 10 > GTF_FILE_CDS_ONLY
ADD REPLY

Login before adding your answer.

Traffic: 3994 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6