Hi, I have human RNA-Seq (Total RNA) data of 24 paired samples. My samples are labelled as S1,S2,S3 and so on. S1-S2 form a pair and S3-S4 for another (odd represents Normal samples and even diseased samples). I have used STAR to align the reads and have obtained an overall alignment of > 90% for most cases. I have then used Stringtie for transcript assembly using the following command:
stringtie -p 3 -e -B -G reference_annotation_file -l S1 --rf -o ..S1.transcripts.gtf S1Aligned.sortedByCoord.out.quality.filtered.reads.bam
This command has been used individually for all the samples.
Then I have used the online script prepDE.py to convert the fpkm values of reads to raw counts. (http://ccb.jhu.edu/software/stringtie/dl/prepDE.py)
Thereafter I have used edgeR to perform paired differential analysis for the data.
I am facing several ambiguities with the data:
The annotation file was downloaded from Gencode (hg19 version). I had used the complete annotation including contigs, scaffolds, patch. Now, I have seen that there are several genes which have names but different gene IDs, let's say gene1_ID1 and gene1_ID2. In case of some of these genes, I find that both gene1_ID1 and gene1_ID2 are differential, but their logFC and FDR values are differing significantly.
My set of 24 samples have been classified into 2 groups, based on an additional factor being present and absent in the patients. Differential analysis was performed separately in the two groups and the results compared. In these I am finding that for some genes, gene_ID1 is differential in the total set (24 samples), while the same gene with different ID (gene_ID2) is found to be uniquely differential in one of the two groups (of 12 samples). How can this be?
In case of a gene with 2 unique IDs, I find that one is upregulated and the other is downregulated, though both are significant. I shall give the example here and provide the links for the gene with 2 IDs.
- One important gene that is known to be upregulated in my disease, as supported by numerous publications, has been upregulated in my sample set. However the logFC of this gene is only 0.5.
These results have left me confused and wondering whether I have made some mistakes in my workflow. I am relatively new to RNA-Seq data. In case I have not been able to clarify my doubts properly, please feel free to probe me further.