Question

Differential expression analysis

3.6 years ago

wes ▴ 90

Dear All

I have RNA-seq data from different plant tissues (e.g., root, leaf, fruit). I performed de novo assembly using Trinity, then proceeded with TGICL to obtain Unigenes.

To obtain gene expression values, I used both RSEM and salmon.

Example of script for RSEM

rsem-prepare-reference Unigenes.fasta Unigenes.fasta
rsem-calculate-expression --bowtie2 -p 20 --paired-end SRR5904767_1.trimmed.fastq.gz SRR5904767_2.trimmed.fastq.gz Unigenes.fasta R_A1

Example of Script for salmon

salmon index -t Unigenes.fasta -i unigenes_index 
perl batch.pl Trimmed_data_set

batch.pl file

#!/usr/bin/perl
use strict;
use warnings;
use autodie;

# Input: tab-separated file; the second column is the sample name,
# the third and fourth columns are the paired read files.
open my $fh, '<', $ARGV[0];
while (my $line = <$fh>) {
    chomp $line;
    my @column = split /\t/, $line;
    print "Running $column[2] and $column[3] against unigenes_index. Output to quants/$column[1]\n";
    system("salmon quant -i unigenes_index -l A -1 $column[2] -2 $column[3] -p 20 --validateMappings -o quants/$column[1]");
}
close $fh;

I noticed that RSEM produces both TPM and FPKM values, whereas salmon produces TPM and NumReads. However, the TPM values from RSEM and salmon differ slightly. Which value (FPKM from RSEM, or TPM from salmon or RSEM) should I use for differential expression analysis? If TPM is the one to use, what should I do with the FPKM values?
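For reference, within a single sample TPM is just FPKM rescaled so that the values sum to one million, so the two RSEM columns carry the same relative information. A minimal sketch of that rescaling follows; the function name and the FPKM values are hypothetical and only illustrate the arithmetic.

```python
def fpkm_to_tpm(fpkm):
    """Rescale one sample's FPKM values so that they sum to one million."""
    total = sum(fpkm.values())
    if total == 0:
        raise ValueError("all FPKM values are zero")
    return {gene: value / total * 1e6 for gene, value in fpkm.items()}


# Hypothetical FPKM values for three unigenes in one sample.
sample_fpkm = {"unigene_1": 10.0, "unigene_2": 30.0, "unigene_3": 60.0}
sample_tpm = fpkm_to_tpm(sample_fpkm)

for gene, value in sample_tpm.items():
    print(gene, round(value, 1))
```

The ratios between unigenes are unchanged by the conversion; only the per-sample scale differs.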

According to the paper linked below, normalization metrics such as RPKM should be avoided because they have been shown to be inconsistent, and Transcripts Per Million (TPM) is preferable: https://www.intechopen.com/books/applications-of-rna-seq-and-omics-strategies-from-microorganisms-to-human-health/rna-seq-applications-and-best-practices

Should I perform normalization before DE analysis, or is that unnecessary because salmon has already normalized the data?

Can anyone suggest a paper with a step-by-step analysis for differential gene expression using the Unigenes instead of the transcripts?

The main objective is to find the differentially expressed genes in fruit at 3 different time points, and I have only two biological replicates.

Thanks


score 2 · Answer 1 · 2020-09-19

The typical pipeline for gene-level differential expression analysis (assuming you use salmon or any other tool that generates transcript abundance estimates, or let's just say transcript-level "counts") is:

1) Aggregate transcripts to the gene level. You are interested in the total gene expression rather than in each transcript, so you can use a tool such as tximport (a Bioconductor package) that sums the transcript-level abundance estimates for you, producing raw gene-level counts.
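To make the aggregation in step 1 concrete, here is a minimal Python sketch of summing transcript-level estimated counts (salmon's NumReads column) into gene-level counts. It is not tximport's implementation, only the basic idea; the function name, the transcript-to-gene mapping and the numbers are hypothetical.

```python
from collections import defaultdict


def sum_to_gene_level(transcript_counts, tx2gene):
    """Sum estimated counts of all transcripts that map to the same gene."""
    gene_counts = defaultdict(float)
    for transcript, count in transcript_counts.items():
        gene = tx2gene.get(transcript)
        if gene is None:
            continue  # transcripts without a gene assignment are skipped
        gene_counts[gene] += count
    return dict(gene_counts)


# Hypothetical estimated counts for one sample and a hypothetical mapping.
num_reads = {"tx1": 120.0, "tx2": 30.5, "tx3": 8.0}
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}

gene_level = sum_to_gene_level(num_reads, tx2gene)
print(gene_level)
```

In practice the same summation is run for every sample, and the resulting gene-by-sample count matrix is what goes into the differential expression package.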

2) Gene-level counts need to be normalized to correct for differences in library size and composition. DESeq2, for example, does this with its median-of-ratios size factors. You generally do not want to correct for gene length the way TPM does, because this decreases the counts for longer genes and hence reduces statistical power. Differential analysis should start from raw counts, not from RPKM, TPM or anything else. In fact, the statement that TPM is preferable to RPKM and can be used for DE analysis is just wrong. The way they write it suggests that TPM can be used with DESeq2, which is simply wrong. Browse support.bioconductor.org for threads where the DESeq2 author answers people asking whether TPM can be used: the answer is no.
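As an illustration of library-size normalization on raw counts, the following Python sketch computes size factors with the median-of-ratios idea that DESeq2's normalization is based on. It is a simplified stand-in, not DESeq2's code; the function name and the toy counts are hypothetical. Note that no gene-length term appears: a gene's length is the same in every sample, so it cancels out when the same gene is compared across samples.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict mapping gene -> list of raw counts, one per sample."""
    n_samples = len(next(iter(counts.values())))
    # Log geometric mean per gene; genes with a zero in any sample are left out.
    log_geo_means = {
        gene: sum(math.log(c) for c in row) / n_samples
        for gene, row in counts.items()
        if all(c > 0 for c in row)
    }
    factors = []
    for j in range(n_samples):
        # Ratio of each gene's count to its geometric mean, on the log scale.
        log_ratios = [
            math.log(counts[gene][j]) - lgm for gene, lgm in log_geo_means.items()
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors


# Toy data: sample 2 was sequenced about twice as deeply as sample 1.
raw = {
    "geneA": [100, 200],
    "geneB": [50, 100],
    "geneC": [10, 20],
    "geneD": [0, 5],
}
factors = size_factors(raw)
normalized = {
    gene: [c / f for c, f in zip(row, factors)] for gene, row in raw.items()
}
print(factors)
```

Dividing each sample's counts by its size factor puts the samples on a common scale, while the statistical model itself is still fitted to the raw counts with the size factors as an offset.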

3) Two biological replicates is not much, but technically sufficient. If the gene expression differences are strong enough and the replicates are reasonably comparable, you might still get differentially expressed genes.

4) Ignore that terrible review. Be sure to start from raw counts, using established packages. I personally use edgeR, but DESeq2 and limma plus many others are fine as well.