Many bioinformatics tasks require selecting a reference genome and reference assembly and preparing them for use. These include alignment of NGS reads, gene prediction, homology analysis, variant annotation, variant reporting, etc.
My question comes from this conversation: C: Troubles comparing SNP called on Illumina reference human genome
I always thought that when read depth is high and the goal is to predict genotypes with a haplotype caller, masking repeats improves results, provided that soft clipping of reads and multiple alignments are allowed during alignment. My reasoning for that special case was that this approach removes many false alignments and a few true ones, at the cost of introducing a few misalignments. Since the number of these misalignments is small, I assumed they would effectively be discarded at the haplotype-calling step when coverage is high and there are many correct alignments locally. When read depth is low, the haplotype caller has no way to handle this correctly. But at >30x coverage it should be fine, shouldn't it?
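To make the depth argument concrete, here is a toy sketch. It assumes a plain binomial genotype model with a fixed per-read error rate, which is not what any real haplotype caller implements, and it treats misaligned reads as if they were independent errors (in practice they can be systematic). The function names and the 1% error rate are made up for illustration.

```python
from math import comb, log10


def genotype_log_likelihoods(n_ref, n_alt, error_rate=0.01):
    """Return log10 likelihoods of 0/0, 0/1 and 1/1 under a toy binomial model.

    Assumption: each read shows the alternate allele with probability
    error_rate (hom-ref), 0.5 (het) or 1 - error_rate (hom-alt).
    """
    n = n_ref + n_alt
    p_alt_by_genotype = {"0/0": error_rate, "0/1": 0.5, "1/1": 1.0 - error_rate}
    return {
        gt: log10(comb(n, n_alt)) + n_alt * log10(p) + n_ref * log10(1.0 - p)
        for gt, p in p_alt_by_genotype.items()
    }


def best_genotype(n_ref, n_alt, error_rate=0.01):
    """Pick the genotype with the highest likelihood."""
    likelihoods = genotype_log_likelihoods(n_ref, n_alt, error_rate)
    return max(likelihoods, key=likelihoods.get)


# Two misaligned reads carrying a false alternate allele at a hom-ref site:
high_depth_call = best_genotype(n_ref=28, n_alt=2)  # 30x coverage
low_depth_call = best_genotype(n_ref=2, n_alt=2)    # 4x coverage
print(high_depth_call, low_depth_call)
```

Under these assumptions the two stray reads are outvoted at 30x (the call stays 0/0) but produce a false heterozygous call at 4x. Whether real misalignments stay this rare after masking is exactly the open part of the question.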
I know there is almost a mantra of never masking before alignment and filtering repetitive regions only after alignment. In my opinion this holds in most cases in humans, especially with aligners that do not recognize soft masking, for WES with medium read depth, and for all RNA-seq and ChIP-seq experiments. However, I am not yet convinced that it is the only right approach for every type of NGS experiment, every organism, and every data set. For example, some plants have many repetitive and low-complexity regions. Should we use the same approach of aligning first and filtering afterward? Why?
Which reference genome and reference assembly should be used in each case, and do these references need to be prepared before use? Do we need to select the latest version? Should we use RefSeq or Ensembl for transcripts? Should we use patches, unplaced contigs and so on? Should we soft mask or hard mask repeats and Alu elements? How should we handle low-complexity regions, centromeres, and chromosome ends? Do the recommendations differ for WES, RNA-seq and ChIP-seq? How does read depth affect your decisions?
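For readers unsure about the soft versus hard masking distinction asked about above: by convention, a soft-masked FASTA writes repeats in lowercase, while a hard-masked one replaces them with N. A minimal Python sketch of the conversion follows; the helper names are hypothetical and it works on plain sequence strings, not on FASTA files.

```python
def hard_mask(soft_masked_seq):
    """Turn a soft-masked sequence (lowercase repeats) into a hard-masked one (N)."""
    return "".join("N" if base.islower() else base for base in soft_masked_seq)


def unmask(soft_masked_seq):
    """Drop soft masking entirely by upper-casing every base."""
    return soft_masked_seq.upper()


def masked_fraction(soft_masked_seq):
    """Fraction of bases that are soft-masked (lowercase)."""
    if not soft_masked_seq:
        return 0.0
    return sum(base.islower() for base in soft_masked_seq) / len(soft_masked_seq)


example = "ACGTacgtACGT"
print(hard_mask(example))        # ACGTNNNNACGT
print(unmask(example))           # ACGTACGTACGT
print(masked_fraction(example))  # 4 of 12 bases are masked
```

Hard masking destroys sequence information, so reads originating from a masked repeat can no longer align to their true location. Soft masking keeps the bases and leaves the decision to the aligner, which may or may not pay attention to case.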