Question: Finding Only Protein Coding Genes from Gencode GTF
Ark wrote, 18 months ago:


I am trying to filter out only the protein coding genes from the gencode gtf file found here:

I have found a discrepancy in how some of the genes are labeled, and it has been causing me a bit of grief! Most of the genes I am interested in have "protein_coding" as their biotype, as you would expect, and these are easy enough to parse. However, there are more protein coding genes that carry a different biotype and are not caught this way. I need a way to extract all the protein coding genes, including the ones with non-uniform biotypes.

As an example, take the IGHD1-1 gene (ENSG00000236170), which we know is protein coding. In the GENCODE GTF file, however, its biotype is listed as "IG_D_gene". Many of the immunoglobulin genes are missed by my parsing because of this.

Could anyone help me out? Or maybe suggest other ways to filter a list of Ensembl IDs for only protein coding genes?

Thank you!

Tags: rna-seq, R
igor (United States) wrote, 18 months ago:

Have you checked Gene/Transcript Biotypes in GENCODE & Ensembl? There is extensive explanation of the annotation labels there.

There is not really a hierarchical structure, so some transcripts could fall into multiple categories (more and less specific). You really have to go through all the options and figure out which labels make sense for your application.
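One way to go through the options is to tally every biotype label that actually appears in the file. Below is a minimal sketch in Python using only the standard library. It assumes the GENCODE layout of nine tab-separated columns with a `gene_type "..."` entry in the attribute column; the function name `count_gene_types` and the sample records (including their coordinates and the `ENSG_EXAMPLE_*` IDs) are made up for illustration.

```python
import re
from collections import Counter

# Matches key "value" pairs in the GTF attribute column,
# e.g. gene_type "IG_D_gene";  Unquoted numeric attributes are ignored.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')


def count_gene_types(lines):
    """Count gene_type labels over the 'gene' records of a GTF."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # header / comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        counts[attrs.get("gene_type", "NA")] += 1
    return counts


# Made-up records for illustration; coordinates are placeholders.
sample = [
    "##format: gtf\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_EXAMPLE_1"; gene_type "protein_coding"; gene_name "GENE_A"; level 2;\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSG_EXAMPLE_1"; gene_type "protein_coding"; gene_name "GENE_A";\n',
    'chr14\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "ENSG00000236170"; gene_type "IG_D_gene"; gene_name "IGHD1-1";\n',
    'chr2\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "ENSG_EXAMPLE_2"; gene_type "lncRNA"; gene_name "GENE_B";\n',
]

counts = count_gene_types(sample)
for gene_type, n in counts.most_common():
    print(gene_type, n)
```

On a real file you would pass an open file handle instead of `sample`. The resulting list of labels is what you then compare against the biotype documentation to decide what counts as protein coding for your purposes.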


Thanks for your response Igor!

I had not seen that page. Thank you for the link! I will dig through and try to determine which labels are useful for me.

Do you have any recommendations on a better way to separate the protein coding genes? I am working in R and currently I am using awk and some regular expressions to select the genes that match my desired parameters. I know this isn't exactly optimal, but I couldn't find a better way.

Thanks again!

Reply written 18 months ago by Ark

There are some suggestions in this previous discussion: Transform a GTF file into a data frame in R

You can import the GTF into R as a data frame and then use dplyr to filter it.
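The load-then-filter idea is the same in any language. As a rough illustration, here is a Python sketch with only the standard library, standing in for the R data frame and dplyr steps. The set `CODING_TYPES` is an assumption: it adds the IG and TR gene segment labels to "protein_coding", and you may want a different set. The function name `read_gtf_genes` and the sample records are made up.

```python
import io
import re

# Matches key "value" pairs in the GTF attribute column.
ATTR_RE = re.compile(r'(\w+) "([^"]*)"')

# Assumed choice of labels to treat as protein coding:
# protein_coding plus IG_{C,D,J,V}_gene and TR_{C,D,J,V}_gene.
CODING_TYPES = {"protein_coding"} | {
    f"{prefix}_{segment}_gene" for prefix in ("IG", "TR") for segment in "CDJV"
}


def read_gtf_genes(handle):
    """Return one dict per 'gene' record, like rows of a data frame."""
    rows = []
    for line in handle:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 9 or fields[2] != "gene":
            continue
        attrs = dict(ATTR_RE.findall(fields[8]))
        rows.append({
            "seqname": fields[0],
            "start": int(fields[3]),
            "end": int(fields[4]),
            "strand": fields[6],
            "gene_id": attrs.get("gene_id"),
            "gene_type": attrs.get("gene_type"),
            "gene_name": attrs.get("gene_name"),
        })
    return rows


# Made-up records for illustration; coordinates are placeholders.
gtf_text = (
    "##format: gtf\n"
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSG_EXAMPLE_1"; gene_type "protein_coding"; gene_name "GENE_A";\n'
    'chr14\tHAVANA\tgene\t100\t200\t.\t-\t.\tgene_id "ENSG00000236170"; gene_type "IG_D_gene"; gene_name "IGHD1-1";\n'
    'chr2\tHAVANA\tgene\t100\t500\t.\t+\t.\tgene_id "ENSG_EXAMPLE_2"; gene_type "lncRNA"; gene_name "GENE_B";\n'
)

genes = read_gtf_genes(io.StringIO(gtf_text))
coding = [row for row in genes if row["gene_type"] in CODING_TYPES]
coding_ids = [row["gene_id"] for row in coding]
print(coding_ids)
```

Filtering on set membership rather than a regular expression keeps the list of accepted labels explicit, so it is easy to add or drop a biotype later.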

You can also try plyranges:

Reply written 18 months ago by igor

I see, that is definitely a better approach than I am currently using.

Thanks for the suggestions and the links!

Reply written 17 months ago by Ark