Extracting GeneID from Dbxref section in GFF file while using featureCounts
13 months ago

Hi all,

I'm trying to generate feature count files for the DESeq2 pipeline, but I've run into an issue while using featureCounts.

I see that the gene IDs I need aren't in the same format as the rest of the attributes; they sit inside the Dbxref section instead. How can I extract just the gene ID so that featureCounts will produce an output?

Thanks and kind regards

featurecounts gff • 508 views
13 months ago
vkkodali_ncbi ★ 3.4k

One solution is to use the gene attribute with featureCounts (that is, pass gene to the -g option). Separately, you can generate a GeneID to gene name map from the GFF3 file using something like this:

zgrep 'GeneID' GCF_900626175.2_cs10_genomic.gff.gz \
| cut -f9 | perl -pe 's/ID.*(GeneID:\d+).*gene=([^;]*).*/\1\t\2/g' \
| sort -u > genes.txt


Finally, join the featureCounts output table to the genes.txt file on the gene name column.
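
A minimal sketch of that join step, assuming the featureCounts table was produced with the gene attribute (so its first column holds gene names) and that genes.txt has the two tab-separated columns written by the command above. The function name add_geneid and the file names in the usage line are hypothetical:

```shell
# Prepend a GeneID column to a featureCounts table.
#   $1: genes.txt  (GeneID:<number> <TAB> gene name)
#   $2: featureCounts output (first column = gene name)
add_geneid() {
    awk -F'\t' -v OFS='\t' '
        # First file: remember gene name -> GeneID
        NR == FNR       { id[$2] = $1; next }
        # featureCounts writes one comment line starting with "#"; keep it as is
        /^#/            { print; next }
        # Header row of the count table
        $1 == "Geneid"  { print "GeneID", $0; next }
        # Data rows: unknown gene names get NA instead of being dropped
        { g = ($1 in id) ? id[$1] : "NA"; print g, $0 }
    ' "$1" "$2"
}
```

Usage would look like: add_geneid genes.txt counts.txt > counts_with_geneid.txt. Rows marked NA are worth checking by hand, since they point to gene names that did not appear in the map.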


Thanks for your response! I tried using 'gene' with the -g flag, but it gave me unsatisfactory results (no features were found for any of my samples). I would hypothesize that the gene ID should be just the number, without the LOC. I was doing a long-winded series of awk commands to execute your second alternative, but this is far neater. Thanks again!


I would hypothesize that the gene ID should be just the number, without the LOC.

Yes, if you come across any LOC-style identifiers, you can be sure that the numeric suffix is the GeneID.
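
As a rough illustration of that rule, here is a small sketch that strips the LOC prefix from a list of gene names, one per line. The function name loc_to_geneid is made up for this example, and names that are not LOC followed only by digits are deliberately left unchanged, because the rule applies to LOC-style identifiers only:

```shell
# Read gene names on stdin; turn LOC<digits> into <digits>, leave other names alone.
loc_to_geneid() {
    sed -E 's/^LOC([0-9]+)$/\1/'
}
```

For example, printf 'LOC123\nabcA\n' | loc_to_geneid turns the first name into 123 and passes abcA through untouched.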