Bulk RNA-seq differential gene expression analysis on haplotype‑resolved diploid plant assembly
0
0
Entering edit mode
3 months ago
Zephy • 0

I’m working on a haplotype‑resolved diploid assembly of a plant genome, where each chromosome is represented by two FASTA/GTF pairs rather than a single consensus. I want to carry out Bulk RNA-seq count‑based differential expression analysis with Bioconductor (e.g. DESeq2, edgeR) but I’m unsure how to adapt the standard workflow for this dual‑sequence setup.

Experimental Design (basic) Organism: Plant. Samples: 3 replicates of Condition Aand 3 replicates of Condition B. Data: Paired-end RNA-seq reads (150 bp, 30 millions reads for sample) aligned to a haplotype-resolved genome assembly. Goal: Identify DE genes between Conditions A and B, accounting for haplotype-specific expression.

What I would like to know:

  • Should I concatenate the two haplotype FASTAs (and GTFs) into one “merged” reference, or keep them separate and run two parallel alignments?
  • How to avoid double‑counting of genes that appear in both GTFs? Should I rename features (e.g. geneA_hap1 vs. geneA_hap2) or collapse them after quantification?

Quantification strategy

  • Does it make sense to use allele‑aware quantifiers (e.g. Salmon with ––keepDuplicates) to retain haplotype information?
  • How to handle multi‑mapping reads that span identical exons on both haplotypes?

Count matrix construction

  • Best practice for building the count matrix:
    • Option A: Separate counts per haplotype (two columns per sample) and then sum counts for downstream DE?
    • Option B: Sum at the gene level before DE and ignore haplotype origin?
    • Option C: Test allele‑specific expression by including haplotype as a factor in the design?

Statistical modelling in DESeq2/edgeR

  • If I keep haplotypes separate, can I simply aggregate counts (geneA_hap1 + geneA_hap2) into a single count per sample? Any known caveats?
  • If I wish to model allele‑specific changes (e.g. hap1 vs. hap2 expression bias across conditions), what design formulas or contrasts are recommended?

Any suggestions (reference to a published/unpublished work, practical tips, pipelines) would be much appreciated! Thanks in advance.

RNA-seq • 808 views
ADD COMMENT
1
Entering edit mode

I've not done this type of analysis myself, but I think Kallisto or Salmon may be better suited.

I don't know how Salmon indexing works, but I believe with Kallisto this would also solve your first two questions, where each haplotype is a transcript. So, instead of geneA_hap1 / geneA_hap2, it would be geneA:tx_hap1 / geneA:tx_hap2

ADD REPLY
0
Entering edit mode

Thanks for the reply. From what I've seen (my online searches and the limited responses on the matter), there doesn't seem to be a standardized procedure for this type of analysis."

ADD REPLY
1
Entering edit mode

Also, an alternative method would be to quantify gene expression independently of haplotypes and just quantify variant percentage of the genes.

ADD REPLY
0
Entering edit mode

Yes, I don't think it's too common, but it should fit easily into isoform/allele-aware quantification workflows, no?

ADD REPLY

Login before adding your answer.

Traffic: 3415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6