Question: How to perform differential gene expression analysis between two closely related organisms?
gravatar for gyulap
2.5 years ago by
gyulap20 wrote:

My experimental setup contains two Arabidopsis thaliana ecotypes (Col-0 and Bur-0) that are infected either with a virus or mock. I want to compare the response of these two genotypes to virus infection at the transcript level by RNA-seq. My problem is that although these ecotypes are closely related, they still have plenty of differences: mutations that can cause amino-acid changes in the coded proteins, rearranged/fused genes, novel alternative transcripts, etc. This means that the reference transcriptome for the alignment will be different.

Which of the following approaches do you think is the best?

  1. Make a merged reference quasi-transcriptome (i.e., mix the two sequence sets with all the alternative transcripts, collapsing the exact duplicates), and perform a kallisto alignment to this mixed transcriptome. My reasoning for this is that the sequence variations in the two ecotypes can be treated as alternative transcripts, but this also results in an almost duplicated transcriptome, and I am not sure how it affects the models that the DE programs use.

  2. Use the Col-0 reference transcriptome for the Col-0 reads and the Bur-0 transcriptome for the Bur-0 reads, perform the alignments and the differential gene expression separately and then try to compare them somehow (i.e., GO enrichment analysis). In this case, I have to relate the genes to each other in the two ecotypes (i.e., which gene a fused transcript in Bur-0 matches in Col-0).

  3. Assemble a quasi-transcriptome de novo using the reads from both ecotypes (virus removed), annotate them, and perform a DE analysis. It will be a pain in the ass to map them back to the genome(s) and get gff/gtf files with genomic coordinates. An alternative (maybe better) version is that I assemble the two transcriptomes separately (to avoid building chimeras).

  4. Use the Col-0 (and Bur-0) genome(s) to predict novel transcripts with Hisat2 and then either merge or treat them separately (see points 1 and 2).

  5. Just use the Col-0 transcriptome (or genome) for both ecotypes and perform the differential gene expression analysis in one step, missing the largely different transcripts (which can be the most interesting). Maybe the structural variants (fused transcripts) can be predicted with some tools.

  6. Some combination of the above possibilities.

I appreciate any suggestions.

rna-seq • 1.3k views
ADD COMMENTlink modified 2.5 years ago • written 2.5 years ago by gyulap20

Although this does not address your initial question, I wanted to propose a downstream QC step that can help you decide if the approach you do use was successful or not.

After finding a mapping that is acceptable (for example the 2-step assembly per ecotype approach recommended by @genomemax in his comment above) validate the accuracy and efficacy of this approach using tools found in the WGCNA package.

To explain, suppose for a moment that even when specific transcripts / isoforms change, gene co-expression patterns acting coordinately to subserve the same goals change less in related organisms. If the co-expression networks you generate after assembly are highly similar, you would have in essence provided yourself a data-driven validation that this approach produced sensible results (that you could then use to wade through various possibilities you mention in 1-6).

WGCNA can also be used to make high level comparisons regard GO ontologies etc. that you mention.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Vincent Laufer1.1k

Interesting suggestion, I will try that. I wanted to learn this WGCNA tool anyway.

ADD REPLYlink written 2.5 years ago by gyulap20

How many biological replicates are you planning to do for each ecotype? Are there time points or just one? If you have plenty of replicates you could assemble transcriptomes for each ecotype using trinity and then use them independently. Depending on time/effort you are willing to spend on this all above possibilities could be tried.

ADD REPLYlink written 2.5 years ago by GenoMax95k

Four biological replicates have been prepared (polyA-enriched, stranded, paired-end 150 bp), at just one time point, but if we consider the virus-infected and the mock as replicates for an ecotype then it makes up to eight replicates. Certainly, I will try to assemble transcriptomes but then I have to map them back to the appropriate reference genomes (both are available) or annotate them in a way that I could compare them. Unfortunately, I am not that experienced to do that easily.

ADD REPLYlink written 2.5 years ago by gyulap20

So you should have enough samples to do assembly per ecotype. Once you assemble the transcriptomes (e.g. with Trinity) you can map your reads back to the respective transcriptomes to do the DE analysis. After you identify transcripts of interest they themselves can then be mapped back on the known Arabidopsis genome (or independently analyzed by blasting etc) to identify what genes they represent.

ADD REPLYlink written 2.5 years ago by GenoMax95k

Actually, my #3 method is about this. I am planning to do the mapping with kallisto and the DE with sleuth (or DESeq2). I am just hesitating between a de novo and a genome-based assembly.

ADD REPLYlink written 2.5 years ago by gyulap20
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 976 users visited in the last hour