I am quite new to RNA-Seq analysis and I have some confusion as to how to perform certain filtering steps in my pipeline.
Experiment conditions: I tested three conditions in viral infection experiments in human cells: Uninfected, Wildtype Virus Infected, and Mutant Virus Infected, with three replicates per condition (9 samples in total).
I have a few goals in mind:
- Identify human long non-coding RNAs (lncRNAs) that are significantly upregulated between my conditions.
- Compare the expression of the significantly upregulated human lncRNAs with that of protein-coding genes (co-expression?).
I aligned my raw FASTQ files with HISAT2 using a merged reference consisting of the human (GENCODE) and viral genomes, together with annotation from LNCipedia (lncRNA only). After QC of the alignments, I ran htseq-count to generate a read count matrix. Because I am aware of the bias of using lncRNA-only annotation, I repeated htseq-count with the full annotation (human genome (GENCODE), viral genome, and LNCipedia annotation) to generate a second count matrix containing both lncRNA and protein-coding genes.
DESeq2 was then used to perform DE analysis on both sets of counts. Genes with row counts of less than 30 were filtered out. For each count matrix I now have 3 CSV gene lists of differentially expressed genes (Wildtype vs Uninfected, Mutant vs Uninfected, and Wildtype vs Mutant), so 6 files in total: 3 from DESeq2 using the lncRNA-only count matrix and 3 from DESeq2 using the full-annotation count matrix.
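To make the count filter concrete, here is a minimal Python sketch of the idea. It is not the code I actually ran. It assumes that "row count" means the sum of raw counts for a gene across all 9 samples, and the gene IDs and count values are made up for illustration.

```python
# Hypothetical raw counts: gene ID -> counts for the 9 samples
# (3 uninfected, 3 wildtype-infected, 3 mutant-infected). Values are made up.
counts = {
    "geneA": [12, 15, 9, 40, 38, 45, 20, 22, 18],
    "geneB": [0, 1, 0, 2, 0, 1, 0, 0, 1],
    "geneC": [3, 4, 2, 5, 3, 4, 3, 4, 2],
}

MIN_ROW_SUM = 30  # threshold from the post; assumed to apply to the row sum


def filter_low_counts(count_table, min_row_sum=MIN_ROW_SUM):
    """Keep genes whose summed count across all samples is >= min_row_sum."""
    return {
        gene: values
        for gene, values in count_table.items()
        if sum(values) >= min_row_sum
    }


kept = filter_low_counts(counts)
print(sorted(kept))
```

With these made-up values, geneB (row sum 5) is dropped, while geneC (row sum exactly 30) is kept because only rows with fewer than 30 counts are removed.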
Recently I was told that I also need to filter by an FPKM threshold of 1.0. I ran StringTie on the BAM files generated by HISAT2 using the same annotation (GENCODE human genome, viral genome, and LNCipedia annotation) to generate FPKM values, but I noticed that many lncRNAs reported as significant in the DESeq2 results have FPKM values below 0.1. I am unsure how to filter out low-expression genes, since I am aware that lncRNAs are expressed at low levels compared with mRNAs.
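To illustrate why a gene can pass a raw-count filter and still fall far below an FPKM of 1.0, here is a minimal Python sketch of the standard FPKM formula (fragments per kilobase of transcript per million mapped fragments). StringTie estimates FPKM with its own transcript-level model, so its values will not match this back-of-the-envelope calculation exactly. The fragment counts, transcript lengths, and library size below are made up.

```python
FPKM_THRESHOLD = 1.0  # threshold I was told to use


def fpkm(fragments, length_bp, total_mapped_fragments):
    """Fragments per kilobase of transcript per million mapped fragments."""
    return fragments * 1e9 / (length_bp * total_mapped_fragments)


def passes_fpkm_filter(value, threshold=FPKM_THRESHOLD):
    """Return True if the FPKM value meets the threshold."""
    return value >= threshold


# Hypothetical lncRNA: few fragments spread over a long transcript.
lncrna_fpkm = fpkm(fragments=10, length_bp=5000, total_mapped_fragments=30_000_000)

# Hypothetical mRNA: many fragments on a shorter transcript.
mrna_fpkm = fpkm(fragments=600, length_bp=2000, total_mapped_fragments=30_000_000)

for name, value in [("lncRNA", lncrna_fpkm), ("mRNA", mrna_fpkm)]:
    print(name, round(value, 3), passes_fpkm_filter(value))
```

In this toy case the lncRNA comes out below 0.1 FPKM in a single sample even though 10 fragments per sample across 9 samples would clear the row-count filter of 30, which is the kind of discrepancy I am seeing.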
- Would it be better to use the full annotation rather than lncRNA only annotation for HTSeq to identify differentially expressed lncRNA?
- Are there more appropriate filters that I should use to identify differentially upregulated lncRNA genes?
- At what stage in the RNA-Seq pipeline is filtering supposed to be done (before or after DESeq2)?
Your thoughts and feedback are most appreciated. Thanks,