Which genes would be called expressed genes? Which set of genes would be called Differentially Expressed Genes?
All the genes which have reads assigned to them are being 'expressed' in some sense.
From what I understood, the genes in the count file undergo normalisation, and the genes that then pass the FDR<0.05 and logFC thresholds are called differentially expressed. Please correct me if I am wrong.
This is tool dependent, but when looking for differentially expressed genes, the software package you use will generally build a model of the gene expression distribution within each sample's replicates, then normalize across samples to try to remove most of the noise (technical and otherwise). The list it generates will tell you which genes are differentially expressed and how much confidence can be ascribed to any particular gene, based on the assumptions made by the model. What q-value or fold-change cutoff you decide to use is up to you, but generally speaking a q-value of <0.05 combined with a fold change >2 is considered 'stringent' and will give you results that are most likely true and can be validated (via qPCR etc).
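As a rough sketch of what applying those cutoffs looks like once a tool has produced its results table: the gene names, numbers, and the `select_degs` helper below are made up for illustration, and the fold-change cutoff is applied in both directions (an assumption, so that strongly downregulated genes are kept too).

```python
import math

# Hypothetical DE results; gene names and values are invented for illustration.
results = [
    {"gene": "geneA", "log2fc": 2.3, "qvalue": 0.001},
    {"gene": "geneB", "log2fc": -1.7, "qvalue": 0.02},
    {"gene": "geneC", "log2fc": 0.4, "qvalue": 0.0005},  # significant but small change
    {"gene": "geneD", "log2fc": 3.1, "qvalue": 0.2},     # large change but not significant
]


def select_degs(rows, q_cutoff=0.05, fc_cutoff=2.0):
    """Return names of genes passing both the q-value and fold-change cutoffs."""
    min_abs_log2fc = math.log2(fc_cutoff)  # fold change 2 corresponds to log2FC 1
    return [
        row["gene"]
        for row in rows
        if row["qvalue"] < q_cutoff and abs(row["log2fc"]) > min_abs_log2fc
    ]


degs = select_degs(results)
print(degs)
```

Only geneA and geneB pass both filters here; geneC fails the fold-change cutoff and geneD fails the q-value cutoff.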
Which parameter should I consider to select the DEGs? I read that the most commonly used are FDR<0.05 and logFC, but somewhere else Pvalue<0.05 is also mentioned.
First of all, "FDR" is an approach to adjusting the p-value that accounts for multiple testing. I know that it's popular to say "...an FDR<0.05" but this is as incorrect as saying "...a Bonferroni < 0.05". What you have is an FDR-adjusted p-value, which is called a q-value (or an 'adjusted p-value'). That being said, you always want to use the q-value when drawing conclusions from multiple testing. The raw p-value is, at best, not trustworthy and at worst meaningless for many genes, because the number of false positives expected at a fixed p-value cutoff grows with the number of genes tested.
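To make the adjustment concrete, here is a minimal sketch of the Benjamini-Hochberg procedure, one common way of computing FDR-adjusted p-values (the thread does not say which method a given tool uses, so treat the choice as an assumption). The `bh_adjust` function name and the p-values are invented for the example.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for five genes.
pvals = [0.001, 0.008, 0.039, 0.041, 0.60]
qvals = bh_adjust(pvals)

raw_hits = sum(p < 0.05 for p in pvals)
adjusted_hits = sum(q < 0.05 for q in qvals)
print(raw_hits, adjusted_hits)
```

With these numbers, four genes pass a raw p-value cutoff of 0.05 but only two pass the same cutoff after adjustment, which is the difference the answer above is pointing at.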
Somewhere I found that all the genes expressed are Differentially expressed and the PValue and logFC cutoff is done to select the significant DEGs. Is that the concept?
A couple of things you need to understand to be able to answer this question: (1) Biologically speaking, most genes in the genome are 'expressed' in the sense that if you look for a single transcript of that gene, it's probably there if you look hard enough. (2) Practically speaking, you will never have exactly the same number of transcripts when comparing two cells. Even if you sequence the exact same sample three times, you will notice that you get a different number of reads each time. This is because the cell, like most biology, is at some level noisy. The technology, like most technology, is at some level noisy. What you are looking for when doing DE analysis is differences which you can be confident are not due to noise. This is what most of the software seeks to do. It builds models, it normalizes, it adjusts p-values, etc. In the end what you're left with is the software telling you, "Look... here's a list of genes that, based on our models and assumptions, appear to be significantly different between the samples. We are this confident that they are different (q-value)."
It's fine to say FDR < 0.05 because that's the error rate you're setting/allowing/controlling. There is some ground truth FDR (it could be 0.03) but FDR-controlling means you're trying to guarantee that the error rate is less than 0.05.
It's similar to the concept of FWER (another type of error rate which Bonferroni controls for).
However, it's incorrect to say a gene got an FDR of 0.0026 (that's an adjusted p-value).
Fold change >2 only reports upregulated genes. To symmetrically obtain downregulated genes, use fold change < 0.5. Fold changes are usually log2-transformed for convenience, so upregulated genes are those with log2FC > 1 and downregulated genes are those with log2FC < -1.
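A small sketch of why the log2 scale makes the two directions symmetric; the `classify` helper and its default cutoff are hypothetical names chosen for this example.

```python
import math


def classify(fold_change, fc_cutoff=2.0):
    """Label a raw fold change as 'up', 'down', or 'unchanged'."""
    log2fc = math.log2(fold_change)
    threshold = math.log2(fc_cutoff)  # fold change 2 -> log2FC 1
    if log2fc > threshold:
        return "up"
    if log2fc < -threshold:  # fold change below 1 / fc_cutoff
        return "down"
    return "unchanged"


print(math.log2(2.0), math.log2(0.5))  # 1.0 -1.0
print(classify(4.0), classify(0.25), classify(1.5))  # up down unchanged
```

On the raw scale the cutoffs are 2 and 0.5, which look lopsided; on the log2 scale they become +1 and -1, so a single absolute-value test covers both directions.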
Thank you for the detailed reply. I understand that using q-value cutoff of <0.05 instead of PValue is to reduce false positives.