I have transcriptome sequence from 44 individuals representing two stocks of fish (an inbred lab stock and mixed wild stock) from the same species. We do have a genome assembly for this species based on the lab stock, but it's not yet complete and not fully representative of sequences in the wild stock. So far, we've done differential expression analysis with a pretty standard pipeline (STAR -> StringTie -> DESeq2) and found a surprising overabundance of DE genes between stocks.
I went deeper in the SAM files to look at SNP differences between stocks. While for the most part SNPs showed expected patterns of diversity based on stock origins, we also found genes with dozens and sometimes hundreds of SNPs with relation to the reference genome, way more than I would expect based if they were just different alleles from the same gene.
Now, I want to confirm that we are not calling genes differentially expressed when they are actually an artefact resulting from overalignment of paralogs missing from the reference genome.
To this end, I've constructed de novo transcriptome assemblies from all 44 samples using only the reads that aligned to the reference genome using STAR. I was hoping to cluster these assemblies separately for each stock with some program like CD-HIT-EST to eliminate redundancy, then align non-redundant consensus contigs to the the reference genome to see how many unique sequences align from each stock.
Unfortunately I've found CD-HIT-EST is obviously not the right tool for the job, especially considering it doesn't produce super-contigs from clusters or provide positional information, which is important to determine whether multiple contigs from a single stock are aligning to the same place in a gene as opposed to being 2 unassembled gene fragments.
I can't help but think there should already be some well established pipelines for this sort of problem, but I haven't found them and I feel I might be on the wrong track. Can anyone provide any input or methods they've found that more directly differentiate alleles from unmapped paralogs?
Thanks in advance!