Reference and dbSNP incompatibility issue (MuTect2)
8.2 years ago
umn_bist ▴ 390

When I try using MuTect2 (from GATK) I get this error

Is there a link to an (old) dbSNP that is compatible with UCSC's hg19 assembly?

EDIT: I cannot post the error message because Biostar says that it isn't in English... I used the dbSNP file 00-All.vcf.gz from NCBI (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/), and I am using the ucsc.hg19.fasta reference assembly.

##### ERROR   dbsnp contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT, GL000207.1, GL000226.1, GL000229.1, GL000231.1, GL000210.1, GL000239.1, GL000235.1, GL000201.1, GL000247.1, GL000245.1, GL000197.1, GL000203.1, GL000246.1, GL000249.1, GL000196.1, GL000248.1, GL000244.1, GL000238.1, GL000202.1, GL000234.1, GL000232.1, GL000206.1, GL000240.1, GL000236.1, GL000241.1, GL000243.1, GL000242.1, GL000230.1, GL000237.1, GL000233.1, GL000204.1, GL000198.1, GL000208.1, GL000191.1, GL000227.1, GL000228.1, GL000214.1, GL000221.1, GL000209.1, GL000218.1, GL000220.1, GL000213.1, GL000211.1, GL000199.1, GL000217.1, GL000216.1, GL000215.1, GL000205.1, GL000219.1, GL000224.1, GL000223.1, GL000195.1, GL000212.1, GL000222.1, GL000200.1, GL000193.1, GL000194.1, GL000225.1, GL000192.1, NC_007605]
##### ERROR   reference contigs = [chrM, chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chr1_gl000191_random, chr1_gl000192_random, chr4_ctg9_hap1, chr4_gl000193_random, chr4_gl000194_random, chr6_apd_hap1, chr6_cox_hap2, chr6_dbb_hap3, chr6_mann_hap4, chr6_mcf_hap5, chr6_qbl_hap6, chr6_ssto_hap7, chr7_gl000195_random, chr8_gl000196_random, chr8_gl000197_random, chr9_gl000198_random, chr9_gl000199_random, chr9_gl000200_random, chr9_gl000201_random, chr11_gl000202_random, chr17_ctg5_hap1, chr17_gl000203_random, chr17_gl000204_random, chr17_gl000205_random, chr17_gl000206_random, chr18_gl000207_random, chr19_gl000208_random, chr19_gl000209_random, chr21_gl000210_random, chrUn_gl000211, chrUn_gl000212, chrUn_gl000213, chrUn_gl000214, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chrUn_gl000218, chrUn_gl000219, chrUn_gl000220, chrUn_gl000221, chrUn_gl000222, chrUn_gl000223, chrUn_gl000224, chrUn_gl000225, chrUn_gl000226, chrUn_gl000227, chrUn_gl000228, chrUn_gl000229, chrUn_gl000230, chrUn_gl000231, chrUn_gl000232, chrUn_gl000233, chrUn_gl000234, chrUn_gl000235, chrUn_gl000236, chrUn_gl000237, chrUn_gl000238, chrUn_gl000239, chrUn_gl000240, chrUn_gl000241, chrUn_gl000242, chrUn_gl000243, chrUn_gl000244, chrUn_gl000245, chrUn_gl000246, chrUn_gl000247, chrUn_gl000248, chrUn_gl000249]
ucsc.hg19.fa GATK RNA-Seq dbSNP Mutect2 • 3.9k views

Hi,

Just one addition to what Chris has already said. The mitochondrial sequence in the UCSC version differs from the one in the b37/1000G/Ensembl version. So if you stick to 1-22, X and Y only, then adding or removing the 'chr' prefix is OK.

Otherwise, take care with the mitochondrial data, and also with the alternate/unplaced contigs. Those are also different in the UCSC version.

When I analyze WES data, since the capture kit (Agilent) is not designed to capture mitochondrial DNA anyway, I just choose 1-22, X and Y. Then the UCSC data/sequence is smoothly interchangeable with b37/1000G.
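A minimal sketch of the "1-22, X and Y only" approach described above, assuming an uncompressed, tab-delimited VCF read line by line. The names `PRIMARY`, `to_ucsc_primary` and `filter_primary` are hypothetical helpers made up for this illustration, not part of GATK or any other tool. Any `##contig` header lines are passed through unchanged here and would need the same renaming.

```python
# b37-style names of the primary chromosomes that are safe to rename.
PRIMARY = {str(n) for n in range(1, 23)} | {"X", "Y"}


def to_ucsc_primary(chrom):
    """Return the UCSC-style name for a b37 primary chromosome, else None."""
    return "chr" + chrom if chrom in PRIMARY else None


def filter_primary(lines):
    """Yield header lines unchanged and records on 1-22, X, Y with a 'chr' prefix.

    Records on MT, GL* and any other contig are dropped, because those
    sequences are not interchangeable between b37 and UCSC hg19.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        chrom, sep, rest = line.partition("\t")
        new_chrom = to_ucsc_primary(chrom)
        if new_chrom is not None:
            yield new_chrom + sep + rest
```

For a real file this could be driven with something like `gzip.open(path, "rt")` for input and a plain text file for output, followed by re-compressing and re-indexing the result.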

8.2 years ago

This is the same as your previous problems. You'll either need to change the dbSNP file or change your data and reference fasta. The former is probably easier: you'll just need to add "chr" where appropriate, change "MT" to "chrM", and convert between the GL contig names (for example, GL000207.1 in the dbSNP list corresponds to chr18_gl000207_random in the reference list).
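The renaming could be sketched roughly as follows. This is an illustration under stated assumptions, not a tested pipeline: `build_contig_map` and `rename_line` are hypothetical helpers, the GL mapping is inferred purely from the accession number embedded in each UCSC name (as in the two contig lists in the error message), and the `.1` version suffix is assumed because every GL contig in that list carries it. Contigs with no counterpart (for example NC_007605, or the hg19 `_hap` contigs) are dropped. Note that mapping MT to chrM changes only the name; as pointed out in the comments, the underlying mitochondrial sequences differ.

```python
import re


def build_contig_map(ucsc_names, include_mito=True):
    """Map b37-style contig names to the UCSC hg19 names given in ucsc_names."""
    mapping = {}
    for name in ucsc_names:
        gl = re.search(r"gl(\d{6})", name)
        if gl:
            # chr18_gl000207_random -> GL000207.1 (version suffix is assumed)
            mapping["GL" + gl.group(1) + ".1"] = name
        elif re.fullmatch(r"chr(\d+|X|Y)", name):
            mapping[name[3:]] = name
        elif name == "chrM" and include_mito:
            # Name-only mapping: the sequences themselves are not identical.
            mapping["MT"] = name
    return mapping


_CONTIG_HEADER = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_line(line, mapping):
    """Return the VCF line with its contig renamed, or None if it has no mapping."""
    header = _CONTIG_HEADER.match(line)
    if header:
        new_name = mapping.get(header.group(2))
        if new_name is None:
            return None
        return header.group(1) + new_name + line[header.end():]
    if line.startswith("#"):
        return line  # other header lines pass through unchanged
    chrom, sep, rest = line.partition("\t")
    new_name = mapping.get(chrom)
    return None if new_name is None else new_name + sep + rest
```

The list of UCSC names could be taken from the reference's `.fai` or `.dict` file. GATK is generally also strict about contig order, and the hg19 reference above starts with chrM while the dbSNP file ends with MT, so the renamed VCF would probably need to be re-sorted to the reference order and re-indexed as well.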


There is now a separate dbSNP download section with "corrected" contig names: ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/GATK/


This is pretty useful. THX


Thanks for your help, Chris. Yes, this has all been little validation errors due to the main issue of not having the original reference.

I did, however, get hold of a working reference genome (ucsc.hg19) along with its corresponding dbSNP and COSMIC VCFs. But having gone through the formatting process (sorting, indexing, adding read groups) only to end up with a VCF file with no mutations detected, I think I will resort to the second-best option. Do you have any recommendations other than Mutect2 if I want to stick to a single tool? FreeBayes/VarScan2/SomaticSniper? GATK has been a very difficult, time-consuming (and eye-opening) experience thus far. Thanks again for your help.

EDIT: I find the samtools mpileup function much more comfortable to use (but it seems to be horrible for somatic variant calling).

8.2 years ago
If you're only going to run one variant caller, Mutect is probably the way to go.

Does this still hold if I have (impure) tumor samples with no matching normals? I read that MuTect2 is great for pure tumor samples because it picks up low-VAF variants, but for impure ones it can be too sensitive (many false positives). Does the fact that I have dbSNP and COSMIC VCFs ensure that MuTect is good for my use case? Thank you for your help.


No variant caller that I've seen yet is great at low-VAF calling. Impure tumors are more difficult because the signal is depressed and closer to the noise level from the error rate of the sequencer/prep. If you push too far down, you begin picking up those errors and get a huge number of false positives. My preference is always for some sort of ensemble calling, followed by filtering, but if you're going to use one caller, I still think that Mutect is a reasonable way to go here.
