I have an RNA-seq dataset of 40K genes across about 800 individuals. For many genes (around 30% of the total) the expression value is 0 in almost every individual. That is, the expression level for gene X is 0 in 90% or 95% of the individuals. How should I deal with these genes? Should I simply remove them from my analysis? I want to use this data to build a predictor (classifier). Each individual has a class value (i.e., Control vs. Target), and I use the gene expression values to train my model.
Yes, remove them. You have almost no data on them and, depending on the questions you are trying to answer, they seem unlikely to contribute anything and may just get in the way. You mention 40k genes, but rather than thinking of them as genes, why not think of them as 40k points in the genome from which you might expect to see expression? That might change your expectations. How strong is the evidence that they actually are genes, and how relevant is that to the questions you are trying to answer? At some later point, you might want to go back and examine those loci that show expression in only a few individuals (again, for a given question that we haven't yet defined).
We also have no idea about the relationship between sequencing depth and your ability to detect expression from those loci. Often, gaining more depth does not significantly help these loci.
Lastly, does your data reflect multiple tissues across individuals? Some of these loci may be tissue/context specific, and not show expression for reasons having to do with the experiment itself.
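As a rough illustration of the filtering suggested in this answer, here is a minimal sketch in plain Python. The gene names, the toy values, and the 0.9 cutoff are all assumptions made for the example; the right threshold depends on your data and your question.

```python
def zero_fraction(values):
    """Fraction of individuals in which the gene's expression is exactly 0."""
    return sum(1 for v in values if v == 0) / len(values)


def filter_sparse_genes(expression, max_zero_fraction=0.9):
    """Keep only genes whose zero fraction is below max_zero_fraction."""
    return {
        gene: values
        for gene, values in expression.items()
        if zero_fraction(values) < max_zero_fraction
    }


# Hypothetical toy data: gene name -> one expression value per individual.
expression = {
    "GENE_A": [5.1, 3.2, 0, 4.4, 6.0, 2.2, 3.9, 0, 5.5, 4.1],  # 20% zeros
    "GENE_B": [0, 0, 0, 0, 0, 0, 0, 0, 0, 1.7],                # 90% zeros
    "GENE_C": [0] * 10,                                        # all zeros
}

kept = filter_sparse_genes(expression, max_zero_fraction=0.9)
print(sorted(kept))  # ['GENE_A']
```

Keeping the dropped genes in a separate table makes it easy to go back to them later if a question comes up that they might answer.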
Removing all of them a priori may not be a good decision, depending on what you intend to do and the source of the data. Is the data from one particular tissue in all individuals? Remember that in any given cell type we only expect about 20% or so of the genome to be actively expressed in adults. Depending on what your classifier is meant to do, the absence of expression can be as important in a transcriptional profile as the level at which something is expressed. A zero is still a data point, and can be quite informative. In many diseases, particularly cancer, you see inappropriate expression of a transcript in a cell type where it is not normally seen.
We may need some further details to clarify exactly what you are doing and what your classifier will be looking at specifically, but my gut instinct is not to remove null points, for the reasons I outlined above.
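To make the point about informative zeros concrete, here is a small sketch with made-up numbers. The gene below is 0 in 90% of all individuals, so a blanket filter would drop it, yet every non-zero value falls in the Target class. The class sizes and expression values are hypothetical.

```python
def detection_rate(values):
    """Fraction of individuals with non-zero expression."""
    return sum(1 for v in values if v > 0) / len(values)


# Hypothetical gene measured in 30 Control and 10 Target individuals.
control = [0] * 30
target = [0] * 6 + [3.4, 7.1, 2.8, 5.0]

all_values = control + target
overall_zero_fraction = 1 - detection_rate(all_values)

print(overall_zero_fraction)    # 0.9 -> looks like an "empty" gene
print(detection_rate(control))  # 0.0
print(detection_rate(target))   # 0.4
```

Comparing detection rates per class before filtering is a cheap way to check whether a mostly-zero gene is actually separating the groups.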
I agree with the other responses: removing them is adequate if you only want to look at the genes that are expressed. I have just one comment: be careful with your data. If the values are in RPKM, many of those zeroes do not really mean that zero reads were observed. In general, one or more genes are highly expressed while the other genes have very few reads assigned (1 or 2), so when you compute the RPKM those values are rounded to zero. Also, I would prefer highly expressed genes for training a classifier.
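A quick worked example of that rounding effect, using the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The read count, gene length, and library size are made-up values, and the one-decimal rounding is an assumption about how the table was exported.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Hypothetical gene: 2 reads, 2 kb long, in a library of 50 million mapped reads.
value = rpkm(read_count=2, gene_length_bp=2000, total_mapped_reads=50_000_000)

print(value)            # 0.02
print(round(value, 1))  # 0.0 -> shows up as "zero" in a rounded table
```

If you have access to the raw counts, checking them is the simplest way to tell a true zero from a rounded one.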
If the goal of your analysis were differential gene expression, then removing them would make sense, since these low-expression genes tend to produce spurious differential expression results. But this is not what you are using the data for. Instead, you have a classification task.
Because this is a learning task, your decision might be better served by looking at how your classifier performs with these genes. Ideally, your classifier should have an objective measure that will be low for such genes, so that they are excluded automatically. You really don't want to babysit your predictor.
This is obviously distinct from genes with zero expression in all samples: there's nothing to learn from there and those genes should be excluded a priori.
Therefore, I would recommend that you keep those genes because they may carry tissue-specific information (as has been pointed out) that may be crucial for your learning task.
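One way to picture such an objective measure is a per-gene score computed against the class labels. The sketch below is only a stand-in, not what any particular classifier does internally: it drops genes that are zero in every sample, then scores the rest by the mutual information (in bits) between a "detected / not detected" indicator and the class label. Gene names and values are hypothetical, and the binary indicator deliberately ignores expression level, so it only suits the sparse genes discussed here.

```python
import math
from collections import Counter


def mutual_information(indicator, labels):
    """Mutual information (bits) between a binary indicator and class labels."""
    n = len(labels)
    joint = Counter(zip(indicator, labels))
    p_ind = Counter(indicator)
    p_lab = Counter(labels)
    return sum(
        (count / n) * math.log2(count * n / (p_ind[x] * p_lab[y]))
        for (x, y), count in joint.items()
    )


labels = ["Control"] * 5 + ["Target"] * 5

# Hypothetical sparse genes: one value per individual, in the same order as labels.
expression = {
    "GENE_INFORMATIVE": [0, 0, 0, 0, 0, 4.2, 3.1, 6.0, 0, 0],
    "GENE_NOISE": [1.3, 0, 0, 0, 0, 2.0, 0, 0, 0, 0],
    "GENE_ALL_ZERO": [0] * 10,
}

# Genes with zero expression in all samples are excluded a priori.
candidates = {g: v for g, v in expression.items() if any(x > 0 for x in v)}

scores = {
    gene: mutual_information([x > 0 for x in values], labels)
    for gene, values in candidates.items()
}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)  # ['GENE_INFORMATIVE', 'GENE_NOISE']
```

Here the noise gene scores 0 because it is detected equally often in both classes, while the informative gene scores well above it despite being zero in 70% of individuals. If you score genes this way, do it inside the training folds to avoid leaking label information into the evaluation.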