I’m working on a haplotype‑resolved diploid assembly of a plant genome, where each chromosome is represented by two FASTA/GTF pairs rather than a single consensus. I want to carry out Bulk RNA-seq count‑based differential expression analysis with Bioconductor (e.g. DESeq2, edgeR) but I’m unsure how to adapt the standard workflow for this dual‑sequence setup.
Experimental Design (basic) Organism: Plant. Samples: 3 replicates of Condition Aand 3 replicates of Condition B. Data: Paired-end RNA-seq reads (150 bp, 30 millions reads for sample) aligned to a haplotype-resolved genome assembly. Goal: Identify DE genes between Conditions A and B, accounting for haplotype-specific expression.
What I would like to know:
- Should I concatenate the two haplotype FASTAs (and GTFs) into one “merged” reference, or keep them separate and run two parallel alignments?
- How to avoid double‑counting of genes that appear in both GTFs? Should I rename features (e.g. geneA_hap1 vs. geneA_hap2) or collapse them after quantification?
Quantification strategy
- Does it make sense to use allele‑aware quantifiers (e.g. Salmon with ––keepDuplicates) to retain haplotype information?
- How to handle multi‑mapping reads that span identical exons on both haplotypes?
Count matrix construction
- Best practice for building the count matrix:
- Option A: Separate counts per haplotype (two columns per sample) and then sum counts for downstream DE?
- Option B: Sum at the gene level before DE and ignore haplotype origin?
- Option C: Test allele‑specific expression by including haplotype as a factor in the design?
Statistical modelling in DESeq2/edgeR
- If I keep haplotypes separate, can I simply aggregate counts (geneA_hap1 + geneA_hap2) into a single count per sample? Any known caveats?
- If I wish to model allele‑specific changes (e.g. hap1 vs. hap2 expression bias across conditions), what design formulas or contrasts are recommended?
Any suggestions (reference to a published/unpublished work, practical tips, pipelines) would be much appreciated! Thanks in advance.
I've not done this type of analysis myself, but I think Kallisto or Salmon may be better suited.
I don't know how Salmon indexing works, but I believe with Kallisto this would also solve your first two questions, where each haplotype is a transcript. So, instead of geneA_hap1 / geneA_hap2, it would be geneA:tx_hap1 / geneA:tx_hap2
Thanks for the reply. From what I've seen (my online searches and the limited responses on the matter), there doesn't seem to be a standardized procedure for this type of analysis."
Also, an alternative method would be to quantify gene expression independently of haplotypes and just quantify variant percentage of the genes.
Yes, I don't think it's too common, but it should fit easily into isoform/allele-aware quantification workflows, no?