I'm building a de novo transcriptome assembly pipeline. One of the features I'm trying to implement is targeted assembly, where a user can provide reference FASTAs and the pipeline will iteratively extract the reads that map to each of them for separate assembly. For example, if a user is assembling a plant transcriptome and provides reference chloroplast and mitochondrion genomes, the pipeline will map the short RNA reads to the chloroplast genome, write the reads that align to their own file, and then do the same with the mitochondrion genome using the remaining reads.
A problem I am having is that the mapper appears to not be specific enough. That is, after extracting reads which map to one organellar genome and then attempting to map the remaining reads to the other one, the mapper only returns a few (< 20) hits.
Is this low specificity something I can fix by tuning alignment parameters? Obviously there's some genetic overlap between chloroplast and mitochondrial DNA, but surely they can't be as similar as would be indicated by the mapping results.
- I am using STAR to index the references and map the reads, with a mostly default configuration.
- Reads are single or paired-end, and are thoroughly cleaned prior to mapping.
- After mapping, 'extraction' occurs by parsing the resulting SAM file for unique hits.
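
To make the extraction step concrete, here is a minimal sketch of what parsing the SAM file for unique hits could look like. It is not the pipeline's actual code: the function name `unique_read_names` is hypothetical, and it assumes that a "unique hit" means a mapped record carrying the `NH:i:1` attribute, which STAR writes with its default SAM attributes.

```python
def unique_read_names(sam_lines):
    """Collect names of reads that are mapped and reported at exactly one locus.

    Assumption: "unique" is taken to mean the record carries the NH:i:1
    attribute (number of reported alignments for the read is 1).
    """
    names = set()
    for line in sam_lines:
        # Skip blank lines and SAM header lines, which start with '@'.
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # A valid alignment record has at least 11 mandatory columns.
        if len(fields) < 11:
            continue
        flag = int(fields[1])
        # FLAG bit 0x4 marks an unmapped read.
        if flag & 0x4:
            continue
        # Optional attributes start at column 12; look for NH:i:1.
        if "NH:i:1" in fields[11:]:
            names.add(fields[0])
    return names
```

The function takes any iterable of SAM lines, so an open file handle on the mapper's SAM output can be passed in directly. The returned set of read names would then be used to split the input FASTQ into extracted and remaining reads.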