I downloaded the FASTQ files from ebi.ac.uk and aligned them to hg38 on Galaxy. To obtain read counts, I then downloaded the latest version of the GTF (release 27) from http://www.gencodegenes.org/stats/archive.html#a27. As you can see there, the GTF file has 58288 genes, of which 19836 are protein coding.
The problem is that when you obtain the count matrix from your BAM files using this GTF, you get a matrix with 58288 rows, one for every gene in the GTF file, and thousands of these genes have a count of zero in all samples. How should I filter my count data down to a reasonable number of genes? 58288 genes seems like far too many, especially when so many of them are zero everywhere.
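
To make the question concrete, here is a minimal sketch of the kind of filtering I have in mind. Everything in it is an assumption for illustration: the gene IDs and counts are made up, `filter_low_counts` is a hypothetical helper rather than a function from any package, and the thresholds are arbitrary placeholders, not recommended values.

```python
# Toy count matrix: gene ID -> read counts per sample.
# Gene IDs and numbers are invented for illustration only.
counts = {
    "GENE_A": [120, 98, 143],
    "GENE_B": [0, 0, 0],
    "GENE_C": [0, 3, 1],
    "GENE_D": [0, 0, 0],
    "GENE_E": [15, 0, 22],
}


def filter_low_counts(counts, min_total=1, min_samples=1, min_count=1):
    """Hypothetical helper: keep genes whose summed count is at least
    min_total and that reach min_count reads in at least min_samples samples."""
    kept = {}
    for gene, row in counts.items():
        samples_passing = sum(1 for c in row if c >= min_count)
        if sum(row) >= min_total and samples_passing >= min_samples:
            kept[gene] = row
    return kept


# Default thresholds: drop only genes that are zero in every sample.
nonzero = filter_low_counts(counts)

# Stricter, arbitrary thresholds: at least 10 reads in at least 2 samples.
stricter = filter_low_counts(counts, min_samples=2, min_count=10)

print(len(counts), len(nonzero), len(stricter))
```

With the defaults this only removes the all-zero rows. What I am unsure about is whether that is enough, or whether a stricter cutoff like the second call is what people normally use.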