Question: Masked or unmasked DNA reference for sam2vcf.pl
9.7 years ago
Pi520 wrote:


I am converting pileup to VCF using sam2vcf. The tool says it needs the reference sequence when indels are present. Whilst downloading the reference DNA sequence in FASTA format from Ensembl, I saw that the README states the DNA is available in these formats:

  * 'dna' - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA.  Interspersed repeats and low
     complexity regions are detected with the RepeatMasker tool and masked
     by replacing repeats with 'N's.

Which should I use to make the conversion more reliable, or doesn't it matter? I believe mpileup has replaced pileup and that the indel predictions with mpileup are more reliable, but I don't have the original data to rerun the analysis.


9.7 years ago
United States
Adam1.0k wrote:

You should use the unmasked reference. If you use the masked reference, then an indel in a repetitive region will have an 'N' as a reference base in the VCF, which (while not strictly invalid) is probably not what you want.
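To illustrate why, here is a minimal Python sketch. It is not how sam2vcf.pl is implemented; the function `indel_alleles` and the toy sequences are made up for this example. It only relies on the VCF convention that an indel record includes the reference base before the event, so whatever is in the FASTA at that position ends up in the REF and ALT columns.

```python
def indel_alleles(ref_seq, pos, inserted="", deleted_len=0):
    """Return (REF, ALT) for an indel anchored at 1-based position pos.

    Hypothetical helper for illustration only. The anchor base (the base
    before the event) and any deleted bases are read from the reference.
    """
    anchor = ref_seq[pos - 1]                 # 1-based pos -> 0-based index
    deleted = ref_seq[pos:pos + deleted_len]  # bases removed after the anchor
    ref_allele = anchor + deleted
    alt_allele = anchor + inserted
    return ref_allele, alt_allele


# Toy sequences (assumed): the same region, unmasked and hard-masked.
unmasked = "ACGTTTTTTTTGCA"
masked = "ACGNNNNNNNNGCA"

# A 2-base deletion following position 5, inside the repeat.
print(indel_alleles(unmasked, 5, deleted_len=2))  # ('TTT', 'T')
print(indel_alleles(masked, 5, deleted_len=2))    # ('NNN', 'N')
```

With the masked sequence the record still parses, but the alleles carry no sequence information, which makes the indel hard to compare against calls made on the unmasked genome.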

