Hi, I'm working with transcriptomic data from a diploid species with an incomplete reference genome (likely 80% of genome represented in reference). I aligned paired-end reads with STAR then generated counts with StringTie. After running DE analyses, most of the results make sense and look good, although we found a surprisingly high number of DE genes between genetic lineages.
After visualizing SNPs/indels from the alignment files in IGV, it looks like a number of genes have alignments that indicate more than two haplotypes for a single locus, indicating potential misalignment of transcripts from lineage-specific CNVs or paralogs/pseudogenes missing from the reference genome.
Now, I would like to quantify these misalignments in a more systematic way and my thought was to extract physical phasing information per sample to see if we can detect more than two overlapping haplotypes at each locus, which would then be flagged as a potential misalignment. So far, I've tried the GATK haplotype caller pipeline for RNA-seq data (barcwiki.wi.mit.edu/wiki/SOP/CallingVariantsRNAseq), but the problem I'm running into is that phasing information isn't generated in vcf output files when ploidy is set higher than two, and running haplotype caller on samples set as diploid doesn't tell us if we have extra unexpected alleles.
Can anyone suggest another tool that can output phasing information for more than two haplotypes, or just suggest an entirely different strategy to address my original problem quantifying possible misalignments of RNA-seq data? Thanks.