Question

Bulk RNA-seq differential gene expression analysis on haplotype‑resolved diploid plant assembly

0

Entering edit mode

5 months ago

Zephy • 0

I’m working on a haplotype‑resolved diploid assembly of a plant genome, where each chromosome is represented by two FASTA/GTF pairs rather than a single consensus. I want to carry out Bulk RNA-seq count‑based differential expression analysis with Bioconductor (e.g. DESeq2, edgeR) but I’m unsure how to adapt the standard workflow for this dual‑sequence setup.

Experimental Design (basic) Organism: Plant. Samples: 3 replicates of Condition Aand 3 replicates of Condition B. Data: Paired-end RNA-seq reads (150 bp, 30 millions reads for sample) aligned to a haplotype-resolved genome assembly. Goal: Identify DE genes between Conditions A and B, accounting for haplotype-specific expression.

What I would like to know:

Should I concatenate the two haplotype FASTAs (and GTFs) into one “merged” reference, or keep them separate and run two parallel alignments?
How to avoid double‑counting of genes that appear in both GTFs? Should I rename features (e.g. geneA_hap1 vs. geneA_hap2) or collapse them after quantification?

Quantification strategy

Does it make sense to use allele‑aware quantifiers (e.g. Salmon with ––keepDuplicates) to retain haplotype information?
How to handle multi‑mapping reads that span identical exons on both haplotypes?

Count matrix construction

Best practice for building the count matrix:
- Option A: Separate counts per haplotype (two columns per sample) and then sum counts for downstream DE?
- Option B: Sum at the gene level before DE and ignore haplotype origin?
- Option C: Test allele‑specific expression by including haplotype as a factor in the design?

Statistical modelling in DESeq2/edgeR

If I keep haplotypes separate, can I simply aggregate counts (geneA_hap1 + geneA_hap2) into a single count per sample? Any known caveats?
If I wish to model allele‑specific changes (e.g. hap1 vs. hap2 expression bias across conditions), what design formulas or contrasts are recommended?

Any suggestions (reference to a published/unpublished work, practical tips, pipelines) would be much appreciated! Thanks in advance.

RNA-seq • 1.0k views

ADD COMMENT • link updated 5 months ago by rfran010 ★ 1.7k • written 5 months ago by Zephy • 0

1

Entering edit mode

I've not done this type of analysis myself, but I think Kallisto or Salmon may be better suited.

I don't know how Salmon indexing works, but I believe with Kallisto this would also solve your first two questions, where each haplotype is a transcript. So, instead of geneA_hap1 / geneA_hap2, it would be geneA:tx_hap1 / geneA:tx_hap2

ADD REPLY • link 5 months ago by rfran010 ★ 1.7k

0

Entering edit mode

Thanks for the reply. From what I've seen (my online searches and the limited responses on the matter), there doesn't seem to be a standardized procedure for this type of analysis."

ADD REPLY • link 5 months ago by Zephy • 0

1

Entering edit mode

Also, an alternative method would be to quantify gene expression independently of haplotypes and just quantify variant percentage of the genes.

ADD REPLY • link 5 months ago by rfran010 ★ 1.7k

0

Entering edit mode

Yes, I don't think it's too common, but it should fit easily into isoform/allele-aware quantification workflows, no?

ADD REPLY • link 5 months ago by rfran010 ★ 1.7k