6.0 years ago by
Children's Hospital of Philadelphia, Philadelphia, PA
In the Ensembl GTF file, there are many types of genes (gene biotypes):
Out of these, we usually keep protein_coding and lincRNA because we are interested in identifying differentially expressed and novel lincRNAs. On one occasion we also kept pseudogene, antisense and miRNA, because our aim was to test whether such genes are differentially expressed and, if so, whether they lie near any of the differentially expressed protein-coding genes (to explore whether a pseudogene, antisense gene or miRNA might be regulating a protein-coding gene). So depending on your aim, you may keep or filter out different gene types. We usually apply a secondary filter based on the "expected" length of the gene (for example, filtering out lincRNAs that are <200 bp long).
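As a rough illustration of the two filters described above, here is a minimal Python sketch, not the exact pipeline we ran. It assumes the biotype is stored in a `gene_biotype` attribute on `gene` feature lines (true for recent Ensembl releases; older GTFs may store it differently), and it approximates the gene length by the genomic span (end - start + 1) rather than the spliced transcript length. The function names, the `KEEP_BIOTYPES` and `MIN_LENGTH` settings, and the example records (G1 to G4) are made up for the example.

```python
import re

# Hypothetical settings: biotypes to keep and a per-biotype minimum length (bp).
KEEP_BIOTYPES = {"protein_coding", "lincRNA"}
MIN_LENGTH = {"lincRNA": 200}

# Matches GTF attribute pairs of the form: key "value";
ATTR_RE = re.compile(r'(\S+) "([^"]*)"')


def parse_attributes(field):
    """Turn the 9th GTF column into a dict of attribute name -> value."""
    return dict(ATTR_RE.findall(field))


def filter_genes(lines, keep=KEEP_BIOTYPES, min_length=MIN_LENGTH):
    """Return (gene_id, biotype, span) for gene records passing both filters."""
    kept = []
    for line in lines:
        # Skip header/comment lines and blank lines.
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        # Only look at well-formed 'gene' feature lines.
        if len(cols) < 9 or cols[2] != "gene":
            continue
        attrs = parse_attributes(cols[8])
        biotype = attrs.get("gene_biotype")
        # Primary filter: keep only the biotypes of interest.
        if biotype not in keep:
            continue
        # Secondary filter: genomic span as a simple stand-in for gene length.
        span = int(cols[4]) - int(cols[3]) + 1
        if span < min_length.get(biotype, 0):
            continue
        kept.append((attrs.get("gene_id"), biotype, span))
    return kept


# Made-up records for illustration only.
example = [
    '#!genome-build example',
    '1\tsrc\tgene\t100\t1099\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
    '1\tsrc\tgene\t2000\t2149\t.\t-\t.\tgene_id "G2"; gene_biotype "lincRNA";',
    '1\tsrc\tgene\t3000\t3499\t.\t+\t.\tgene_id "G3"; gene_biotype "lincRNA";',
    '1\tsrc\tgene\t4000\t4999\t.\t+\t.\tgene_id "G4"; gene_biotype "pseudogene";',
    '1\tsrc\texon\t100\t300\t.\t+\t.\tgene_id "G1"; gene_biotype "protein_coding";',
]

for gene in filter_genes(example):
    print(gene)
```

To run it on a real file, pass an open file handle instead of the list, e.g. `filter_genes(open(path))`, and adjust `KEEP_BIOTYPES` to your aim (for instance adding pseudogene, antisense and miRNA).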