I have RNA-Seq data from species A, for which I don't have a reference genome. I would like to map these data to the closest available genome, which is from species B.
I have read this post and am planning to use hisat2 or STAR. Do any of you have recommendations regarding the parameters to set for mapping, given that I don't have a reference genome for species A? I would expect that I should run the aligner with less "stringency" than if I could map to the species A genome. For instance, the default in STAR is --outFilterMismatchNmax 10: should I set it to 20? 30? In hisat2, the default minimum-score setting is --score-min L,0,-0.2.
My reads are 100nt.
In addition, could you recommend any reading related to this question?
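To make the hisat2 setting concrete, here is a rough sketch of what --score-min L,0,-0.2 works out to for 100 nt reads. It assumes the linear form "constant + coefficient * read length" and a maximum mismatch penalty of 6 (the documented --mp default, as far as I know); it ignores soft clipping, gaps, and base qualities, so treat the result as an approximation only. The function names are made up for illustration.

```python
import math

def hisat2_min_score(read_len, const=0.0, coef=-0.2):
    """Linear --score-min function 'L,const,coef': const + coef * read_len."""
    return const + coef * read_len

def max_mismatches(read_len, mismatch_penalty=6, const=0.0, coef=-0.2):
    """Approximate number of full-penalty mismatches that still pass the threshold.

    Assumption: every mismatch costs `mismatch_penalty` and nothing else
    (gaps, Ns, soft clipping) contributes to the score.
    """
    budget = -hisat2_min_score(read_len, const, coef)
    return math.floor(budget / mismatch_penalty)

print(hisat2_min_score(100))   # minimum alignment score for a 100 nt read
print(max_mismatches(100))     # rough mismatch budget under the defaults
```

Making the coefficient more negative (for example L,0,-0.6) enlarges that budget, which is the hisat2 analogue of raising --outFilterMismatchNmax in STAR.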
I use STAR to do this frequently with the parameter --outFilterMismatchNmax 8. You can check the alignment percentage in the Log.final.out file and adjust your mismatch threshold accordingly. You may find that reads align to many different locations; you can filter your BAM file by MAPQ score to address this. Be aware that unaligned reads may be placed in the "% of reads unmapped: too short" category even if they failed to align because of mismatches. Dobin said something about this in one of the threads on the STAR Google group.
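If you want to compare runs with different mismatch thresholds, something like the following minimal sketch can pull the percentages out of Log.final.out. It assumes the usual "label | value" layout of that file; the sample text and its numbers are invented purely for illustration, and the helper names are mine, not part of STAR.

```python
def parse_star_log(text):
    """Return {label: value} for every 'label | value' line in the log text."""
    stats = {}
    for line in text.splitlines():
        if "|" not in line:
            continue  # section headers and blank lines carry no value
        label, value = line.split("|", 1)
        stats[label.strip()] = value.strip()
    return stats

def percent(stats, label):
    """Convert a value such as '62.50%' to the float 62.5."""
    return float(stats[label].rstrip("%"))

# Invented sample mimicking the layout of Log.final.out (not real results).
SAMPLE = """\
                          Number of input reads |\t1000000
                        Uniquely mapped reads % |\t62.50%
             % of reads mapped to multiple loci |\t12.30%
                 % of reads unmapped: too short |\t24.10%
"""

stats = parse_star_log(SAMPLE)
for label in ("Uniquely mapped reads %",
              "% of reads mapped to multiple loci",
              "% of reads unmapped: too short"):
    print(f"{label}: {percent(stats, label)}")
```

For a real run, replace SAMPLE with the contents of the file, e.g. open("Log.final.out").read(), and watch how the "too short" fraction moves as you loosen the threshold.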
BBMap will align RNA-seq data to genomes with a substantially lower identity than most other aligners, particularly other splice-aware aligners. You might add the flags "maxindel=200k minid=0.7" to increase alignment rate in this case.
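As a hedged illustration, a full command line with those flags could be assembled like this. The file names are placeholders, and the in=, ref= and out= arguments reflect the usual bbmap.sh key=value syntax; check them against your installed version before running anything.

```python
import shlex

def bbmap_command(reads, ref, out, maxindel="200k", minid=0.7):
    """Build (but do not run) a bbmap.sh argument list with relaxed identity."""
    return [
        "bbmap.sh",
        f"in={reads}",
        f"ref={ref}",
        f"out={out}",
        f"maxindel={maxindel}",  # allow long deletions, i.e. introns
        f"minid={minid}",        # lower approximate identity threshold
    ]

# Placeholder file names, for illustration only.
cmd = bbmap_command("speciesA_reads.fq.gz", "speciesB_genome.fa", "mapped.sam")
print(shlex.join(cmd))
```

The list can be passed to subprocess.run(cmd, check=True) once the paths point at real files.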