Question: COSMIC vcf file compatibility for Mutect2
4
gravatar for erwan.scaon
11 months ago by
erwan.scaon360
Limoges - CBRS - France
erwan.scaon360 wrote:

I am running a variant calling pipeline for cancer samples. It includes Mutect2. Working on human, i started with the reference & dbsnp files contained in the GATK bundle for hg38 (ftp://ftp.broadinstitute.org/bundle/hg38). Picked the following files : Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.gz dbsnp_146.hg38.vcf.gz dbsnp_146.hg38.vcf.gz.tbi

With Mutect2, you can feed a DB of known somatic variants using "--cosmic". Given that i started the pipeline with hg38 reference file, i picked the grch38 cosmic file (https://cancer.sanger.ac.uk/cosmic/files?data=/files/grch38/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz). From my understanding, hg38 <=> UCSC and GRCh38 <=> NCBI, but i thought it would be close/good enough.

Then, when i run Mutect2, i get the following error : "Input files cosmic and reference have incompatible contigs. Error details: The contig order in cosmic and reference is not the same"

I corrected chromosomes names (1->chr1, MT->chrM, etc...) in the CosmicCodingMuts.vcf file, then sorted it using Picard SortVcf. But i am still stuck with the same kind of error in Mutect2.

Question is : 1) How to modify the COSMIC.vcf to match hg38 reference ? 2) If 1) is not possible, where can retrieve compatible genome_ref + germline_snp + somatic_snp ?

cosmic mutect2 gatk vcf • 2.2k views
ADD COMMENTlink modified 9 months ago • written 11 months ago by erwan.scaon360
7
gravatar for erwan.scaon
11 months ago by
erwan.scaon360
Limoges - CBRS - France
erwan.scaon360 wrote:

Ok, things are working for me, i did the following to make my "grch38" cosmic.vcf compatible with human genome reference hg38 (downloaded from GATK bundle) :

--- convert cosmic contigs names from "1" -> "chr1"

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' CosmicCodingMuts.vcf > CosmicCodingMuts_chr.vcf

--- change chrMT -> chrM

sed 's/chrMT/chrM/g' CosmicCodingMuts_chr.vcf > CosmicCodingMuts_chr_M.vcf

--- sort this modified cosmic.vcf (it will create and index too)

java -jar $PICARD SortVcf I=CosmicCodingMuts_chr_M.vcf O=CosmicCodingMuts_chr_M_sorted.vcf SEQUENCE_DICTIONARY=GATK_bundle_h38/Homo_sapiens_assembly38.dict

--- Weird step, need to remove the index created by Picard (http://gatkforums.broadinstitute.org/gatk/discussion/1328/errors-about-contigs-in-bam-or-vcf-files-not-being-properly-ordered-or-sorted)

rm CosmicCodingMuts_chr_M_sorted.vcf.idx

--- Then, when you launch Mutect2 with "--cosmic CosmicCodingMuts_chr_M_sorted.vcf", it will "regenerate" an .idx for this vcf file, and this .idx will be compatible with the reference genome

Hope this helps

ADD COMMENTlink written 11 months ago by erwan.scaon360

Thanks for posting your troubleshooting / solution for this problem. I know I will be looking for this when we move to hg38.

ADD REPLYlink written 11 months ago by dyollluap300

Thank you so much for this information. I was having trouble locating the cosmic data. By the way, is there any reason for you to skip the non-coding variants? Cosmic should serve as the white list, and I believe the more confident variants we provide, the better MuTect2 can work. Please correct me if I am wrong.

ADD REPLYlink written 9 months ago by Helene390

If anyone wants to do this in R, it a lot easier:

library(VariantAnnotation)

## import cosmic vcf file
cos.ens <- readVcf('./data/CosmicCodingMuts.vcf')
## check current chr names
seqinfo(cos.ens)

## set UCSC compatible chromosome names
seqlevelsStyle(cos.ens) <- 'UCSC'
## check new chr names
seqinfo(cos.ens)

## export new vcf
writeVcf(cos.mut.ens, 'data/cosmic_mut_ucsc.vcf')
ADD REPLYlink modified 5 months ago • written 5 months ago by benformatics150
2
gravatar for erwan.scaon
9 months ago by
erwan.scaon360
Limoges - CBRS - France
erwan.scaon360 wrote:

Indeed, you should use both coding and non-coding variants.

In our MuTect2 project, we ended up using the hg19 COSMIC files for both coding and non-coding :

Retrive files

sftp "erwan.scaon@unilim.fr"@sftp-cancer.sanger.ac.uk;
get files/grch37/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz
get files/grch37/cosmic/v79/VCF/CosmicNonCodingVariants.vcf.gz

Make them "hg19-compatible"

grep '^[^#]' CosmicNonCodingVariants.vcf >> CosmicCodingMuts.vcf;
mv CosmicCodingMuts.vcf cosmic_coding_and_noncoding.vcf;
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' cosmic_coding_and_noncoding.vcf > cosmic_coding_and_noncoding_chr.vcf;
sed 's/chrMT/chrM/g' cosmic_coding_and_noncoding_chr.vcf > cosmic_coding_and_noncoding_chr_M.vcf;
java -jar $PICARD SortVcf I=cosmic_coding_and_noncoding_chr_M.vcf O=cosmic_coding_and_noncoding_chr_M_sorted.vcf SEQUENCE_DICTIONARY=../GATK_bundle_hg38/Homo_sapiens_assembly38.dict;
rm cosmic_coding_and_noncoding_chr_M_sorted.vcf.idx;
ADD COMMENTlink written 9 months ago by erwan.scaon360
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1707 users visited in the last hour