Finding Only Protein Coding Genes from Gencode GTF
5.5 years ago
Ark ▴ 90

Hello!

I am trying to filter out only the protein coding genes from the gencode gtf file found here: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz

I have found a discrepancy in the way some of the genes are labeled, and it has been causing me a bit of grief! Most of the genes I am interested in have "protein_coding" as their biotype (as you would expect), and these are easy enough to parse. However, some protein coding genes carry a different biotype label and are not caught this way, and therein lies my problem. I need a way to extract all the protein coding genes, including the ones with the non-uniform biotypes.

As an example, take the IGHD1-1 gene (ENSG00000236170), which we know is protein coding. If you look for it in the gencode gtf file, its biotype is listed as "IG_D_gene". Many of the immunoglobulin genes are missed by my parsing because of this.

Could anyone help me out? Or maybe suggest other ways to filter a list of Ensembl IDs for only protein coding genes?

Thank you!

RNA-Seq R • 4.2k views
5.5 years ago
igor 13k

Have you checked the "Gene/Transcript Biotypes in GENCODE & Ensembl" page? There is an extensive explanation of the annotation labels there.

There is not really a hierarchical structure, so some transcripts could fall into multiple categories (more and less specific). You really have to go through all the options and figure out which labels make sense for your application.
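One way to "go through all the options" is to tally the biotype labels that actually occur in the file before deciding which ones to keep. The following is only a minimal sketch, written in Python rather than R for illustration. It assumes the GENCODE convention of storing the gene biotype in a `gene_type` attribute in column 9; the helper names and the example rows (including their coordinates and the `ENSGEXAMPLE...` IDs) are made up for the demonstration.

```python
import gzip
import re
from collections import Counter

# GENCODE GTFs keep the gene biotype in a `gene_type "..."` attribute (column 9).
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def open_text(path):
    """Open a plain or gzip-compressed GTF file as text (hypothetical helper)."""
    path = str(path)
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def count_gene_types(lines):
    """Count gene_type labels over rows whose feature column is 'gene'."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        match = GENE_TYPE_RE.search(fields[8])
        if match:
            counts[match.group(1)] += 1
    return counts


# Synthetic rows for illustration only; coordinates and ENSGEXAMPLE IDs are invented.
EXAMPLE_LINES = [
    "##description: made-up example\n",
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSGEXAMPLE0001.3"; gene_type "protein_coding";\n',
    'chr1\tHAVANA\ttranscript\t100\t900\t.\t+\t.\tgene_id "ENSGEXAMPLE0001.3"; gene_type "protein_coding";\n',
    'chr14\tHAVANA\tgene\t100\t130\t.\t-\t.\tgene_id "ENSG00000236170.1"; gene_type "IG_D_gene";\n',
    'chr2\tHAVANA\tgene\t500\t800\t.\t+\t.\tgene_id "ENSGEXAMPLE0002.1"; gene_type "lincRNA";\n',
]

example_counts = count_gene_types(EXAMPLE_LINES)
for gene_type, n in example_counts.most_common():
    print(gene_type, n)
```

On a real file, something like `with open_text("gencode.v28.basic.annotation.gtf.gz") as fh: count_gene_types(fh)` should give the full list of labels to review against the biotype documentation.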


Thanks for your response Igor!

I had not seen that page. Thank you for the link! I will dig through and try to determine which labels are useful for me.

Do you have any recommendations on a better way to separate the protein coding genes? I am working in R and currently I am using awk and some regular expressions to select the genes that match my desired parameters. I know this isn't exactly optimal, but I couldn't find a better way.

Thanks again!


There are some suggestions in this previous discussion: Transform a GTF file into a data frame in R

You can import the GTF into R as a data frame and then use dplyr to filter it.

You can also try plyranges: https://bioconductor.org/packages/release/bioc/vignettes/plyranges/inst/doc/an-introduction.html
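The same parse-then-filter idea can be sketched without any packages. The example below is in Python rather than R, purely as an illustration, and is not the method of the linked discussion or of plyranges. It rests on two assumptions that should be checked against the biotype documentation: that the biotype lives in the `gene_type` attribute, and that "protein coding" for your purposes means `protein_coding` plus the IG and TR gene segment biotypes (pseudogene variants excluded). The function names and example rows are hypothetical.

```python
import re

# ASSUMPTION: the set of biotypes treated as protein coding. Adjust it after
# reviewing the GENCODE/Ensembl biotype definitions for your application.
CODING_LIKE_TYPES = {
    "protein_coding",
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",
}

# Matches quoted `key "value"` pairs in GTF column 9 (unquoted values are ignored).
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(text):
    """Turn the GTF attribute column into a dict (last value wins for repeated keys)."""
    return dict(ATTR_RE.findall(text))


def coding_gene_ids(lines, keep_types=CODING_LIKE_TYPES):
    """Return unversioned gene IDs of 'gene' rows whose gene_type is in keep_types."""
    ids = set()
    for line in lines:
        if line.startswith("#"):  # skip header/comment lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "gene":
            continue
        attrs = parse_attributes(fields[8])
        gene_id = attrs.get("gene_id")
        if gene_id and attrs.get("gene_type") in keep_types:
            # Drop the version suffix (".1") so IDs match plain Ensembl IDs.
            ids.add(gene_id.split(".")[0])
    return ids


# Synthetic rows for illustration only; coordinates and ENSGEXAMPLE IDs are invented.
EXAMPLE_LINES = [
    'chr1\tHAVANA\tgene\t100\t900\t.\t+\t.\tgene_id "ENSGEXAMPLE0001.3"; gene_type "protein_coding"; level 2;\n',
    'chr14\tHAVANA\tgene\t100\t130\t.\t-\t.\tgene_id "ENSG00000236170.1"; gene_type "IG_D_gene"; gene_name "IGHD1-1";\n',
    'chr2\tHAVANA\tgene\t500\t800\t.\t+\t.\tgene_id "ENSGEXAMPLE0002.1"; gene_type "lincRNA";\n',
]

kept_ids = coding_gene_ids(EXAMPLE_LINES)
print(sorted(kept_ids))
```

The resulting set can then be intersected with any list of Ensembl IDs, for example `[g for g in my_ids if g in kept_ids]`, where `my_ids` is a placeholder for your own list.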


I see, that is definitely a better approach than I am currently using.

Thanks for the suggestions and the links!
