I am trying to extract only the protein coding genes from the GENCODE GTF file found here: ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_28/gencode.v28.basic.annotation.gtf.gz
I have found a discrepancy in how some of the genes are labeled, and it has been causing me a bit of grief! Most of the genes I am interested in have "protein_coding" as their biotype, as you would expect, and these are easy enough to parse. However, some protein coding genes carry a different biotype, so a filter on "protein_coding" alone misses them. I need a way to extract all the protein coding genes, including the ones with non-uniform biotypes.
As an example, take the IGHD1-1 gene (ENSG00000236170), which we know is protein coding. If you look for it in the GENCODE GTF file, its biotype is listed as "IG_D_gene" rather than "protein_coding". Many of the immunoglobulin genes are missed by my parsing because of this.
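To make the problem concrete, here is a minimal Python sketch of this kind of biotype filter, widened beyond "protein_coding". The list of extra biotypes (the IG and TR gene segment types) is my assumption about what should count as protein coding, not an official definition, and the function names are made up for illustration. It reads the `gene_type` attribute from the ninth GTF column of each `gene` feature.

```python
import gzip
import re

# Assumption: biotypes to treat as protein coding. "protein_coding" plus the
# immunoglobulin (IG) and T-cell receptor (TR) gene segment biotypes.
# Adjust this set to match your own definition.
CODING_BIOTYPES = {
    "protein_coding",
    "IG_C_gene", "IG_D_gene", "IG_J_gene", "IG_V_gene",
    "TR_C_gene", "TR_D_gene", "TR_J_gene", "TR_V_gene",
}

GENE_ID_RE = re.compile(r'gene_id "([^"]+)"')
GENE_TYPE_RE = re.compile(r'gene_type "([^"]+)"')


def coding_gene_ids(lines, biotypes=CODING_BIOTYPES):
    """Return gene IDs (version suffix removed) whose gene_type is in biotypes."""
    keep = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        fields = line.rstrip("\n").split("\t")
        # Only look at well-formed "gene" features (feature type is column 3).
        if len(fields) < 9 or fields[2] != "gene":
            continue
        gene_id = GENE_ID_RE.search(fields[8])
        gene_type = GENE_TYPE_RE.search(fields[8])
        if gene_id and gene_type and gene_type.group(1) in biotypes:
            # "ENSG00000236170.1" -> "ENSG00000236170"
            keep.add(gene_id.group(1).split(".")[0])
    return keep


def coding_gene_ids_from_file(path):
    """Hypothetical helper: apply the filter to a gzipped GTF file."""
    with gzip.open(path, "rt") as handle:
        return coding_gene_ids(handle)
```

Calling `coding_gene_ids_from_file("gencode.v28.basic.annotation.gtf.gz")` on a local copy of the file would give a set of Ensembl gene IDs that a list of IDs can then be intersected with. The weak point is the hand-maintained biotype set, which is exactly the part I am unsure about.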
Could anyone help me out? Or maybe suggest other ways to filter a list of Ensembl IDs for only protein coding genes?