Question

Gencode annotation filterring problems

0

Entering edit mode

19 months ago

Sarah • 0

Hello everyone!

I'm trying to apply the following filtering to the Human gencode annotation

library(rtracklayer)
library(GenomicFeatures)

path_to_gtf = "gencode.v41lift37.annotation.gtf"
gtf = rtracklayer::import(path_to_gtf)

gtf_filtered = gtf[which(
  gtf$type %in% c('gene', 'transcript','exon','CDS','UTR') &
  gtf$level %in% c('1','2') &
  gtf$gene_status=="KNOWN" &
  gtf$transcript_status=="KNOWN" &
  gtf$transcript_type=="protein_coding" &
  !(gtf$tag %in% c('cds_start_NF','cds_end_NF','mRNA_start_NF','mRNA_end_NF'))
                           )]
length(gtf_filtered)

So I want only 'gene', 'transcript','exon','CDS','UTR' records, where genes and transcripts are known, and transcripts code for proteins, with full ORFs

In the beginning I had 3380508 records, 252785 of which were transcripts After filtering I have only 4201 records, 242 are transcripts I was expecting 10-20 times more of transcripts

When I checked gene_type column I found out that 3335463 records have NA as gene_type KNOWN NOVEL PUTATIVE <NA> 30963 13715 367 3335463

Do you have any ideas why I have so little transcripts left after filtering?

Gencode R filtering • 518 views

ADD COMMENT • link updated 19 months ago by Istvan Albert 100k • written 19 months ago by Sarah • 0

0

Entering edit mode

If you are willing to try alternate software then try https://github.com/NBISweden/AGAT

ADD REPLY • link 19 months ago by GenoMax 141k

score 0 · Answer 1 · 2022-09-20

0

Entering edit mode

19 months ago

Istvan Albert 100k

It seems your filtering is too stringent. Especially when you require the presence of those tags like cds_start_NF. To be honest I have no idea what that stands for or why you select for it.

And then, you join all conditions with AND operator, thus all of the conditions have to be met.

ADD COMMENT • link 19 months ago by Istvan Albert 100k