Hi, my question is about RNA-seq data analysis, specifically differential gene expression analysis between species.
I have RNA-seq reads for one tissue type from human and chimpanzee, and I need to identify differentially expressed genes (DEGs). These are the three sequential steps I am following:
1. kallisto: to pseudo-align reads to the transcriptome and get transcript counts
2. tximport: to convert transcript counts to gene counts
3. DESeq: to get DEGs based on the gene counts
I am using the latest release of the transcriptome file (cdna.fa) for both human and chimpanzee, but the human cdna.fa file contains ~200,000 transcripts while the chimpanzee cdna.fa file contains ~50,000. I think this is because the human genome annotation is more complete. My question is whether this difference will lead to higher gene counts for human and thus bias the determination of DEGs. I am asking because tximport (and other transcript-to-gene count converters) sums the counts of all transcripts of a gene to get the gene count.
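To make the concern concrete, here is a minimal Python sketch of the summing step as I understand it. It is not tximport's actual code; the function name `sum_to_gene`, the transcript and gene IDs, and the counts are all made up for illustration.

```python
from collections import defaultdict

def sum_to_gene(transcript_counts, tx2gene):
    """Add up the estimated counts of all transcripts that belong to each gene."""
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_counts[tx2gene[transcript_id]] += count
    return dict(gene_counts)

# Hypothetical transcript-to-gene mapping and estimated counts.
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
transcript_counts = {"tx1": 10.0, "tx2": 5.5, "tx3": 7.0}

gene_counts = sum_to_gene(transcript_counts, tx2gene)
print(gene_counts)  # {'geneA': 15.5, 'geneB': 7.0}
```

A gene with more annotated transcripts has more terms in its sum, which is why I am wondering whether the richer human annotation could inflate the human gene counts.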
I think that the difference in the number of transcripts between human and chimpanzee won't lead to higher gene counts for human and won't affect the determination of DEGs. My reasoning comes from how kallisto works: if a read pseudo-aligns to more than one transcript, kallisto distributes the count of that read among those transcripts rather than giving a whole count to each of them. But I want to double check.
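Here is a toy Python sketch of that reasoning. It is a simplification and not kallisto's algorithm: as far as I understand, kallisto weights the split with an EM procedure rather than splitting equally, and the equal split below is only my assumption to keep the example short. It also assumes every read is compatible with at least one annotated transcript in both species. All IDs and the function name `gene_counts_equal_split` are hypothetical.

```python
from collections import defaultdict

def gene_counts_equal_split(read_compatibility, tx2gene):
    """Split each read equally among its compatible transcripts, then sum per gene."""
    transcript_counts = defaultdict(float)
    for compatible_transcripts in read_compatibility:
        share = 1.0 / len(compatible_transcripts)  # each read adds up to one count in total
        for transcript_id in compatible_transcripts:
            transcript_counts[transcript_id] += share
    gene_counts = defaultdict(float)
    for transcript_id, count in transcript_counts.items():
        gene_counts[tx2gene[transcript_id]] += count
    return dict(gene_counts)

# Richly annotated gene: four isoforms, three reads with different compatibility sets.
rich_reads = [["h1", "h2", "h3", "h4"], ["h1", "h2"], ["h3"]]
rich_tx2gene = {"h1": "GENE1", "h2": "GENE1", "h3": "GENE1", "h4": "GENE1"}

# Sparsely annotated ortholog: one isoform, the same three reads.
sparse_reads = [["c1"], ["c1"], ["c1"]]
sparse_tx2gene = {"c1": "GENE1"}

rich_gene_counts = gene_counts_equal_split(rich_reads, rich_tx2gene)
sparse_gene_counts = gene_counts_equal_split(sparse_reads, sparse_tx2gene)
print(rich_gene_counts, sparse_gene_counts)
```

In this toy case both annotations give the same gene-level total, because each read contributes exactly one count no matter how many transcripts share it. What I am unsure about is whether this holds for real data.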