Masked or unmasked DNA reference for sam2vcf.pl
9.9 years ago
Pi ▴ 520

Hello

I am converting pileup output to VCF using sam2vcf.pl. The tool says it needs the reference sequence when indels are present. While downloading the reference DNA sequence in FASTA format from Ensembl, I saw that the README lists the DNA as available in these formats:

  * 'dna' - unmasked genomic DNA sequences.
  * 'dna_rm' - masked genomic DNA. Interspersed repeats and low
    complexity regions are detected with the RepeatMasker tool and masked
    by replacing repeats with 'N's.

Which should I use to make the conversion more reliable, or doesn't it matter? I believe mpileup has replaced pileup and that its indel predictions are more reliable, but I don't have the original data to rerun the analysis.

Thanks

9.9 years ago
Adam ★ 1.0k

You should use the unmasked reference. If you use the masked reference, then an indel in a repetitive region will have an 'N' as a reference base in the VCF, which (while not strictly invalid) is probably not what you want.
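
To make that concrete, here is a small Python sketch of what happens to the REF column. It is not how sam2vcf.pl works internally; the helper `vcf_ref_allele` and the two toy sequences are made up for illustration, and I'm assuming the usual VCF convention that an indel's REF allele starts at the anchor base just before the event.

```python
def vcf_ref_allele(reference: str, pos: int, deleted_len: int = 0) -> str:
    """Return the REF allele for an indel anchored at 1-based position `pos`.

    Hypothetical helper: the anchor base plus any deleted bases are read
    straight from the reference string, the way a converter would read
    them from the FASTA.
    """
    start = pos - 1  # convert 1-based VCF position to 0-based index
    return reference[start:start + 1 + deleted_len]


# Toy sequences: the same region, once unmasked and once with the
# repeat (positions 4-9) replaced by N, as in a 'dna_rm' file.
unmasked = "ACGTTTTTTGCA"
masked = "ACGNNNNNNGCA"

# A 2 bp deletion anchored at position 5, inside the repeat.
ref_unmasked = vcf_ref_allele(unmasked, pos=5, deleted_len=2)
ref_masked = vcf_ref_allele(masked, pos=5, deleted_len=2)

print(ref_unmasked)  # TTT
print(ref_masked)    # NNN
```

With the unmasked file the REF allele carries the real bases; with the masked file it is all N, so anything downstream that compares alleles against the reference has nothing to work with.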
