Question: Finding Only Protein Coding Genes from Gencode GTF
Ark (US) wrote 5 weeks ago:

Hello!

I am trying to filter out only the protein coding genes from the gencode gtf file found here: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz

I have found a discrepancy in the way some of the genes are labeled and it has been causing me a bit of grief! Most of the genes I am interested in are labeled "protein_coding" as their biotype (as you would expect), and these are easy enough to parse. However, there are more protein coding genes that aren't caught this way because they are not labeled "protein_coding", and therein lies my problem. I need to find a way to extract all the protein coding genes, including the ones with the non-uniform biotypes.

As an example, if you look at the IGHD1-1 gene (ENSG00000236170), we know this is protein coding. However, if you look for it in the GENCODE GTF file, its biotype is listed as "IG_D_gene". Many of the immunoglobulin genes are missed by my parsing because of this.

Could anyone help me out? Or maybe suggest other ways to filter a list of Ensembl IDs for only protein coding genes?

Thank you!

rna-seq R • 168 views
igor (United States) wrote 5 weeks ago:

Have you checked Gene/Transcript Biotypes in GENCODE & Ensembl? There is an extensive explanation of the annotation labels there.

There is not really a hierarchical structure, so some transcripts could fall into multiple categories (more and less specific). You really have to go through all the options and figure out which labels make sense for your application.
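One way to see which labels actually occur before choosing among them is to tabulate the `gene_type` attribute of the gene-level rows. The following is a minimal sketch in awk, assuming the GENCODE GTF layout (tab-separated, feature type in column 3, attributes in column 9). The function names and the sample rows are hypothetical and exist only so the snippet runs on its own.

```shell
#!/bin/sh
# Hypothetical sample rows standing in for a real GENCODE GTF.
make_sample_gtf() {
  printf 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G1.1"; gene_type "protein_coding";\n'
  printf 'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "G1.1"; gene_type "protein_coding";\n'
  printf 'chr2\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "G2.1"; gene_type "protein_coding";\n'
  printf 'chr14\tHAVANA\tgene\t1\t50\t.\t-\t.\tgene_id "G3.1"; gene_type "IG_D_gene";\n'
  printf 'chr3\tHAVANA\tgene\t1\t80\t.\t+\t.\tgene_id "G4.1"; gene_type "lincRNA";\n'
}

# Read a GTF on stdin and print "count<TAB>gene_type" for gene rows only,
# most frequent label first.
count_gene_types() {
  awk -F'\t' '
    $3 == "gene" && match($9, /gene_type "[^"]+"/) {
      # Skip the 11-character prefix (gene_type, space, quote) and the closing quote.
      counts[substr($9, RSTART + 11, RLENGTH - 12)]++
    }
    END { for (t in counts) printf "%d\t%s\n", counts[t], t }
  ' | sort -k1,1nr -k2,2
}

make_sample_gtf | count_gene_types
```

On the real file this would be run as `gzip -dc gencode.v28.basic.annotation.gtf.gz | count_gene_types`. Restricting to `gene` rows avoids counting each gene once per transcript or exon.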


Thanks for your response Igor!

I had not seen that page. Thank you for the link! I will dig through and try to determine which labels are useful for me.

Do you have any recommendations on a better way to separate the protein coding genes? I am working in R and currently I am using awk and some regular expressions to select the genes that match my desired parameters. I know this isn't exactly optimal, but I couldn't find a better way.

Thanks again!

written 5 weeks ago by Ark

There are some suggestions in this previous discussion: Transform a GTF file into a data frame in R

You can import the GTF into R as a data frame and then use dplyr to filter it.

You can also try plyranges: https://bioconductor.org/packages/release/bioc/vignettes/plyranges/inst/doc/an-introduction.html
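For anyone who prefers to stay on the command line, the same filter-by-label step could also be sketched in awk. This is an alternative to the data frame route, not what the linked discussion shows. The set of labels kept here (`protein_coding` plus the `IG_*_gene` and `TR_*_gene` biotypes, excluding pseudogenes) is an assumption about what counts as protein coding for this use case and should be adjusted. The function names and sample rows are hypothetical.

```shell
#!/bin/sh
# Hypothetical sample rows standing in for a real GENCODE GTF.
make_sample_gtf() {
  printf 'chr1\tHAVANA\tgene\t1\t100\t.\t+\t.\tgene_id "ENSG_A.1"; gene_type "protein_coding";\n'
  printf 'chr1\tHAVANA\ttranscript\t1\t100\t.\t+\t.\tgene_id "ENSG_A.1"; gene_type "protein_coding";\n'
  printf 'chr14\tHAVANA\tgene\t1\t50\t.\t-\t.\tgene_id "ENSG_B.2"; gene_type "IG_D_gene";\n'
  printf 'chr14\tHAVANA\tgene\t60\t90\t.\t-\t.\tgene_id "ENSG_C.1"; gene_type "IG_V_pseudogene";\n'
  printf 'chr2\tHAVANA\tgene\t1\t80\t.\t+\t.\tgene_id "ENSG_D.3"; gene_type "lincRNA";\n'
}

# Read a GTF on stdin and print one gene ID per line for gene rows whose
# gene_type matches the chosen set. The match is anchored, so labels such
# as IG_V_pseudogene are not picked up by accident.
coding_gene_ids() {
  awk -F'\t' -v keep='^(protein_coding|(IG|TR)_[CDJV]_gene)$' '
    $3 == "gene" {
      id = ""; biotype = ""
      if (match($9, /gene_id "[^"]+"/))   id      = substr($9, RSTART + 9,  RLENGTH - 10)
      if (match($9, /gene_type "[^"]+"/)) biotype = substr($9, RSTART + 11, RLENGTH - 12)
      sub(/\.[0-9]+$/, "", id)   # drop a trailing version suffix such as ".2"
      if (biotype ~ keep) print id
    }
  '
}

make_sample_gtf | coding_gene_ids
# prints:
# ENSG_A
# ENSG_B
```

On the real file this would be run as `gzip -dc gencode.v28.basic.annotation.gtf.gz | coding_gene_ids > coding_ids.txt`, and the resulting list can then be used to subset a table of Ensembl IDs in R.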

written 5 weeks ago by igor

I see, that is definitely a better approach than I am currently using.

Thanks for the suggestions and the links!

written 5 weeks ago by Ark