Very briefly: after quality trimming and adapter removal with Trimmomatic, reads are first mapped to the reference genes (bwa mem). The mapped reads are then used to build an individual assembly for each gene (SPAdes). After that, Exonerate is used to find the coding sequences. For paralog detection, the program produces a warning if it detects multiple contigs containing long coding sequences (by default, at least 75% of the length of the reference sequence).
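To make the warning rule concrete, here is a minimal sketch of how I understand it. This is not HybPiper's actual code; the function name, the input dictionaries, and the lengths are all made up for illustration, and only the 75% default comes from the description above.

```python
from typing import Dict, List


def flag_possible_paralogs(
    cds_lengths: Dict[str, List[int]],
    ref_lengths: Dict[str, int],
    min_fraction: float = 0.75,
) -> List[str]:
    """Return genes with more than one contig whose coding sequence
    is at least `min_fraction` of the reference length.

    Hypothetical helper: the input layout is an assumption, not a
    HybPiper data structure.
    """
    flagged = []
    for gene, lengths in cds_lengths.items():
        threshold = min_fraction * ref_lengths[gene]
        # Keep only contigs whose coding sequence is long enough to count.
        long_hits = [n for n in lengths if n >= threshold]
        if len(long_hits) > 1:
            flagged.append(gene)
    return flagged


# Made-up lengths (bp) for two genes.
ref_lengths = {"geneA": 1000, "geneB": 1200}
cds_lengths = {
    "geneA": [950, 800, 300],  # two contigs reach 75% of the reference
    "geneB": [500, 450, 400],  # several contigs, but all too short
}

print(flag_possible_paralogs(cds_lengths, ref_lengths))
```

Under this rule, a gene like the hypothetical `geneB` is never flagged, no matter how many contigs it has, because none of them reaches the length threshold.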
For bait design, we did not filter for single-copy loci, since that information is not known for our non-model species. We were therefore expecting our data to be highly enriched in paralogous sequences. However, after the first run of HybPiper, I did NOT find any paralog warnings at all for the sample that I tried. I have noticed that SPAdes produces short contigs for many genes (their length does not reach 75% of the reference sequence), which might be a consequence of low coverage, which in turn we thought was caused by over-enrichment of paralogous sequences.
Does that mean that there are no paralogs in that sample?
Do you think that is a good way of detecting paralogs in enriched data?