Question

Extracting GeneID from Dbxref section in GFF file while using featureCounts

2.5 years ago

Shraddha ▴ 90

Hi all,

I'm trying to generate feature count files for the DESeq2 pipeline, but I've run into an issue while using featureCounts (see image).

The gene IDs that I need aren't stored as a separate attribute like the rest, but sit inside the Dbxref attribute. How can I extract just the gene ID so that featureCounts will produce output?

Thanks and kind regards

featurecounts gff • 1.0k views

ADD COMMENT • link updated 2.5 years ago by vkkodali_ncbi ★ 3.7k • written 2.5 years ago by Shraddha ▴ 90

score 1 · Answer 1 · 2021-11-01

2.5 years ago

vkkodali_ncbi ★ 3.7k

One solution is to use the gene attribute with featureCounts (that is, pass gene to the -g flag). Separately, you can generate a GeneID-to-gene-name map from the GFF3 file using something like this:

# Map each GeneID to its gene name (column 9 holds the GFF3 attributes).
zgrep 'GeneID' GCF_900626175.2_cs10_genomic.gff.gz \
  | cut -f9 \
  | perl -ne 'print "$1\t$2\n" if /(GeneID:\d+).*;gene=([^;\n]*)/' \
  | sort -u > genes.txt

Finally, join the featureCounts output table to the genes.txt file on the gene name column.
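
The join itself could be sketched with awk along these lines. This assumes featureCounts was run with -g gene, so the first column (Geneid) of its output holds gene names, and that genes.txt has two tab-separated columns: GeneID, then gene name. The function name add_geneid, the file names, and the NCBI_GeneID header are placeholders, not part of featureCounts.

```shell
# counts.txt is assumed to come from a run along the lines of:
#   featureCounts -t exon -g gene -a annotation.gff -o counts.txt sample.bam
#
# Hypothetical helper: prepend the GeneID to every featureCounts row.
#   $1 = genes.txt (GeneID<TAB>gene name)
#   $2 = featureCounts output table
add_geneid() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR      { id[$2] = $1; next }            # first file: gene name -> GeneID
    /^#/           { print; next }                  # keep the leading comment line
    $1 == "Geneid" { print "NCBI_GeneID", $0; next }  # header row gets a new column name
    {
      g = ($1 in id) ? id[$1] : "NA"                # NA when the name has no mapping
      print g, $0
    }
  ' "$1" "$2"
}
```

Usage would be something like add_geneid genes.txt counts.txt > counts_with_geneid.txt. If a gene name maps to more than one GeneID in genes.txt, this sketch keeps only the last one it reads.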

ADD COMMENT • link 2.5 years ago by vkkodali_ncbi ★ 3.7k

Thanks for your response! I tried using 'gene' with the -g flag, but it gave me unsatisfactory results (no features were found for any of my samples). I would hypothesize that the gene ID should be just the number, without the LOC. I was doing a long-winded series of awk commands to execute your second alternative, but this is far neater. Thanks again!

ADD REPLY • link 2.5 years ago by Shraddha ▴ 90

"I would hypothesize that the gene ID should be just the number, without the LOC."

Yes, if you come across any LOC-style identifiers, you can be sure that the numeric suffix is the GeneID.
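
As a small illustration of that rule, the GeneID can be recovered from a LOC-style name by stripping the prefix. The function name loc_to_geneid is hypothetical and the identifier in the usage note is made up. Names that are not LOC followed only by digits produce no output here, since ordinary gene symbols do not encode the GeneID.

```shell
# Hypothetical helper: print the GeneID encoded in a LOC-style gene name.
# Prints nothing for names that are not "LOC" followed only by digits.
loc_to_geneid() {
  printf '%s\n' "$1" | sed -n 's/^LOC\([0-9][0-9]*\)$/\1/p'
}
```

For example, loc_to_geneid LOC123456 would print 123456, while a regular gene symbol would print nothing and would still need the genes.txt lookup.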

ADD REPLY • link 2.5 years ago by vkkodali_ncbi ★ 3.7k