I am new to RNA-seq analysis. I am currently trying to use the salmon, tximport, edgeR pipeline to process my human RNA-seq data on Galaxy. The cDNA library for my RNA-seq was prepared with poly(A) selection.
I am a bit confused about the normalisation steps.
For salmon, I aligned my reads to the human transcriptome and supplied the human GFF file to get the quant.genes.sf output. However, the TPM values are still labelled with ENST00000XXXXXX.X IDs instead of ENSGXXXXXXXXXXX. Does that mean salmon failed to recognise the GFF file, and my TPM values are still per transcript rather than per gene?
If salmon failed to produce a correct quant.genes.sf file, I would like to use tximport to aggregate my transcripts to genes from my quant.sf files. But tximport offers 4 options under "Summarization using the abundance (TPM) values?": i) No, ii) scaled up to library size, iii) scaled using the avg. transcript length over samples and then the library size, iv) scaled using the median transcript length among isoforms of a gene, and then library size.
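To make clear what I am asking, here is how I understand options ii) and iii), written as a small Python sketch. This is only my reading of the option descriptions, not tximport's actual code. The function name `counts_from_abundance` and the toy numbers are made up for illustration, and I am assuming the two options correspond to tximport's `countsFromAbundance` values `scaledTPM` and `lengthScaledTPM`.

```python
def counts_from_abundance(counts, tpm, lengths, mode="scaledTPM"):
    """Turn TPM values into count-like values (hypothetical helper).

    counts, tpm, lengths: lists of rows, one row per transcript/gene,
    one column per sample.
    """
    n_feat = len(counts)
    n_samp = len(counts[0])

    if mode == "scaledTPM":
        # Option ii): start from the TPM values as they are.
        new = [list(row) for row in tpm]
    elif mode == "lengthScaledTPM":
        # Option iii): multiply TPM by the feature's average length over samples.
        new = [
            [tpm[i][j] * (sum(lengths[i]) / n_samp) for j in range(n_samp)]
            for i in range(n_feat)
        ]
    else:
        raise ValueError("mode must be 'scaledTPM' or 'lengthScaledTPM'")

    # Both options: rescale each sample so its column sums to the
    # original library size (the sum of the estimated counts).
    for j in range(n_samp):
        lib_size = sum(counts[i][j] for i in range(n_feat))
        total = sum(new[i][j] for i in range(n_feat))
        for i in range(n_feat):
            new[i][j] *= lib_size / total
    return new


# Toy example: 2 features, 2 samples (invented numbers).
toy_counts = [[100, 200], [300, 400]]
toy_tpm = [[250000, 300000], [750000, 700000]]
toy_lengths = [[1000, 1000], [2000, 2000]]

scaled = counts_from_abundance(toy_counts, toy_tpm, toy_lengths, "scaledTPM")
length_scaled = counts_from_abundance(toy_counts, toy_tpm, toy_lengths, "lengthScaledTPM")
```

If this reading is right, both options give columns that still sum to the library size, and they differ only in whether transcript length is folded back in before rescaling.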
Which option should I use if I want to follow up with edgeR on Degust? Will I "over-normalise" my results if I choose the wrong option before running edgeR?
Any help would be appreciated. Many thanks in advance!