Question: COSMIC vcf file compatibility for Mutect2
gravatar for erwan.scaon
4.0 years ago by
Nantes - France
erwan.scaon810 wrote:

I am running a variant calling pipeline for cancer samples. It includes Mutect2. Working on human, i started with the reference & dbsnp files contained in the GATK bundle for hg38 ( Picked the following files : Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta.gz dbsnp_146.hg38.vcf.gz dbsnp_146.hg38.vcf.gz.tbi

With Mutect2, you can feed a DB of known somatic variants using "--cosmic". Given that i started the pipeline with hg38 reference file, i picked the grch38 cosmic file ( From my understanding, hg38 <=> UCSC and GRCh38 <=> NCBI, but i thought it would be close/good enough.

Then, when i run Mutect2, i get the following error : "Input files cosmic and reference have incompatible contigs. Error details: The contig order in cosmic and reference is not the same"

I corrected chromosomes names (1->chr1, MT->chrM, etc...) in the CosmicCodingMuts.vcf file, then sorted it using Picard SortVcf. But i am still stuck with the same kind of error in Mutect2.

Question is : 1) How to modify the COSMIC.vcf to match hg38 reference ? 2) If 1) is not possible, where can retrieve compatible genome_ref + germline_snp + somatic_snp ?

cosmic mutect2 gatk vcf • 6.6k views
ADD COMMENTlink modified 3.8 years ago • written 4.0 years ago by erwan.scaon810
gravatar for erwan.scaon
4.0 years ago by
Nantes - France
erwan.scaon810 wrote:

Ok, things are working for me, i did the following to make my "grch38" cosmic.vcf compatible with human genome reference hg38 (downloaded from GATK bundle) :

--- convert cosmic contigs names from "1" -> "chr1"

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' CosmicCodingMuts.vcf > CosmicCodingMuts_chr.vcf

--- change chrMT -> chrM

sed 's/chrMT/chrM/g' CosmicCodingMuts_chr.vcf > CosmicCodingMuts_chr_M.vcf

--- sort this modified cosmic.vcf (it will create and index too)

java -jar $PICARD SortVcf I=CosmicCodingMuts_chr_M.vcf O=CosmicCodingMuts_chr_M_sorted.vcf SEQUENCE_DICTIONARY=GATK_bundle_h38/Homo_sapiens_assembly38.dict

--- Weird step, need to remove the index created by Picard (

rm CosmicCodingMuts_chr_M_sorted.vcf.idx

--- Then, when you launch Mutect2 with "--cosmic CosmicCodingMuts_chr_M_sorted.vcf", it will "regenerate" an .idx for this vcf file, and this .idx will be compatible with the reference genome

Hope this helps

ADD COMMENTlink written 4.0 years ago by erwan.scaon810

If anyone wants to do this in R, it a lot easier:


## import cosmic vcf file
cos.ens <- readVcf('./data/CosmicCodingMuts.vcf')
## check current chr names

## set UCSC compatible chromosome names
seqlevelsStyle(cos.ens) <- 'UCSC'
## check new chr names

## export new vcf
writeVcf(cos.mut.ens, 'data/cosmic_mut_ucsc.vcf')
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by benformatics2.0k

Thanks for posting your troubleshooting / solution for this problem. I know I will be looking for this when we move to hg38.

ADD REPLYlink written 4.0 years ago by dyollluap300

Thank you so much for this information. I was having trouble locating the cosmic data. By the way, is there any reason for you to skip the non-coding variants? Cosmic should serve as the white list, and I believe the more confident variants we provide, the better MuTect2 can work. Please correct me if I am wrong.

ADD REPLYlink written 3.8 years ago by DVA550

Thanks a lot for posting. I have a question. Why did you sort your 'chr added' vcf which was already sorted vcf? I mean, is not original ccosmic vcf sorted?

ADD REPLYlink written 2.2 years ago by changhan11100
gravatar for erwan.scaon
3.8 years ago by
Nantes - France
erwan.scaon810 wrote:

Indeed, you should use both coding and non-coding variants.

In our MuTect2 project, we ended up using the hg19 COSMIC files for both coding and non-coding :

Retrive files

sftp "";
get files/grch37/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz
get files/grch37/cosmic/v79/VCF/CosmicNonCodingVariants.vcf.gz

Make them "hg19-compatible"

grep '^[^#]' CosmicNonCodingVariants.vcf >> CosmicCodingMuts.vcf;
mv CosmicCodingMuts.vcf cosmic_coding_and_noncoding.vcf;
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' cosmic_coding_and_noncoding.vcf > cosmic_coding_and_noncoding_chr.vcf;
sed 's/chrMT/chrM/g' cosmic_coding_and_noncoding_chr.vcf > cosmic_coding_and_noncoding_chr_M.vcf;
java -jar $PICARD SortVcf I=cosmic_coding_and_noncoding_chr_M.vcf O=cosmic_coding_and_noncoding_chr_M_sorted.vcf SEQUENCE_DICTIONARY=../GATK_bundle_hg38/Homo_sapiens_assembly38.dict;
rm cosmic_coding_and_noncoding_chr_M_sorted.vcf.idx;
ADD COMMENTlink written 3.8 years ago by erwan.scaon810
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1084 users visited in the last hour