Question

Masked or Unmasked DNA Reference for sam2vcf.pl

0

12.9 years ago

Pi ▴ 520

Hello

I am converting pileup to VCF using sam2vcf. The tool says I need the reference sequence when indels are present. Whilst downloading the reference DNA sequence in FASTA format from Ensembl, I saw that the README lists the DNA as available in these formats:

  * 'dna' - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA. Interspersed repeats and low
    complexity regions are detected with the RepeatMasker tool and masked
    by replacing repeats with 'N's.

Which should I use to make the conversion more reliable, or doesn't it matter? I believe mpileup has replaced pileup and that indel predictions with mpileup are more reliable, but I don't have the original data to rerun the analysis.

thanks

samtools pileup vcf • 5.9k views

updated 8.6 years ago by Biostar 20 • written 12.9 years ago by Pi ▴ 520

score 1 · Answer 1 · 2011-05-22

1

12.9 years ago

Adam ★ 1.0k

You should use the unmasked reference. If you use the masked reference, then an indel in a repetitive region will have an 'N' as a reference base in the VCF, which (while not strictly invalid) is probably not what you want.
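
To make the point concrete, here is a minimal Python sketch of how the REF and ALT fields of a deletion are built from the reference. It is not the sam2vcf.pl code: the function name `deletion_record` and the two toy sequences are made up for illustration. It only assumes the usual VCF convention that an indel is anchored on the base before the event.

```python
def deletion_record(chrom, pos, del_len, ref_seq):
    """Return minimal tab-separated VCF fields (CHROM, POS, ID, REF, ALT)
    for a deletion.

    pos is the 1-based position of the anchor base, i.e. the base just
    before the deleted bases. This is a hypothetical helper, not part of
    any real tool.
    """
    anchor = ref_seq[pos - 1]                # anchor base from the reference
    deleted = ref_seq[pos:pos + del_len]     # bases removed by the deletion
    ref = anchor + deleted                   # REF = anchor + deleted bases
    alt = anchor                             # ALT = anchor only
    return "\t".join([chrom, str(pos), ".", ref, alt])


# Toy example: the same region, unmasked and hard-masked with 'N's.
unmasked = "ACGTATATATGC"
masked = "ACGTNNNNNNGC"

print(deletion_record("1", 5, 2, unmasked))  # REF=ATA, ALT=A
print(deletion_record("1", 5, 2, masked))    # REF=NNN, ALT=N
```

With the masked sequence the record is still well-formed, but the REF and ALT alleles no longer tell you which bases were actually deleted.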