Question

COSMIC vcf file compatibility for Mutect2

4

7.3 years ago

erwan.scaon ▴ 940

I am running a variant-calling pipeline for cancer samples that includes Mutect2. Working on human data, I started with the reference and dbSNP files contained in the GATK bundle for hg38 (ftp://ftp.broadinstitute.org/bundle/hg38). I picked the following files:

Homo_sapiens_assembly38.dict
Homo_sapiens_assembly38.fasta.fai
Homo_sapiens_assembly38.fasta.gz
dbsnp_146.hg38.vcf.gz
dbsnp_146.hg38.vcf.gz.tbi

With Mutect2, you can supply a database of known somatic variants using "--cosmic". Since I started the pipeline with the hg38 reference file, I picked the GRCh38 COSMIC file (https://cancer.sanger.ac.uk/cosmic/files?data=/files/grch38/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz). From my understanding, hg38 <=> UCSC and GRCh38 <=> NCBI, but I thought it would be close/good enough.

Then, when I run Mutect2, I get the following error: "Input files cosmic and reference have incompatible contigs. Error details: The contig order in cosmic and reference is not the same"

I corrected the chromosome names (1->chr1, MT->chrM, etc.) in the CosmicCodingMuts.vcf file, then sorted it using Picard SortVcf. But I am still stuck with the same kind of error in Mutect2.

My questions are: 1) How can I modify the COSMIC VCF to match the hg38 reference? 2) If 1) is not possible, where can I retrieve a compatible set of genome_ref + germline_snp + somatic_snp files?

GATK Mutect2 COSMIC VCF • 10k views

7.0 years ago by erwan.scaon ▴ 940

score 8 · Answer 1 · 2017-01-24

8

7.3 years ago

erwan.scaon ▴ 940

OK, things are working for me. I did the following to make my "grch38" cosmic.vcf compatible with the hg38 human genome reference (downloaded from the GATK bundle):

--- convert the COSMIC contig names from "1" -> "chr1"

awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' CosmicCodingMuts.vcf > CosmicCodingMuts_chr.vcf

--- change chrMT -> chrM

sed 's/chrMT/chrM/g' CosmicCodingMuts_chr.vcf > CosmicCodingMuts_chr_M.vcf

--- sort this modified cosmic.vcf against the reference sequence dictionary (this will create an index too)

java -jar $PICARD SortVcf I=CosmicCodingMuts_chr_M.vcf O=CosmicCodingMuts_chr_M_sorted.vcf SEQUENCE_DICTIONARY=GATK_bundle_h38/Homo_sapiens_assembly38.dict

--- odd step: the index created by Picard needs to be removed (see http://gatkforums.broadinstitute.org/gatk/discussion/1328/errors-about-contigs-in-bam-or-vcf-files-not-being-properly-ordered-or-sorted)

rm CosmicCodingMuts_chr_M_sorted.vcf.idx

--- then, when you launch Mutect2 with "--cosmic CosmicCodingMuts_chr_M_sorted.vcf", it will regenerate an .idx for this VCF file, and this .idx will be compatible with the reference genome
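
For anyone adapting the renaming steps above, here is a minimal sketch, assuming a tab-delimited VCF, that rewrites only the CHROM column, so text elsewhere on a line (for example an INFO value that happens to contain "chrMT") is left untouched. The function name rename_contigs and the toy records are made up for illustration and are not part of the original commands. "##contig" header lines, if the file has any, are not renamed by this sketch.

```shell
#!/usr/bin/env bash
# Sketch only: rename_contigs is a hypothetical helper, not part of the original post.
# It prefixes CHROM with "chr" and maps MT to chrM, leaving all other columns as they are.
rename_contigs() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { print; next }                               # header lines pass through
       { $1 = ($1 == "MT") ? "chrM" : ("chr" $1); print } # data lines: edit column 1 only
      ' "$1"
}

# Toy input (made-up records) to show the behaviour.
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n' > "$demo"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo"
printf '1\t100\tCOSM1\tA\tT\t.\t.\tNOTE=chrMT_like\n' >> "$demo"
printf 'MT\t200\tCOSM2\tG\tC\t.\t.\t.\n' >> "$demo"

renamed=$(rename_contigs "$demo")
printf '%s\n' "$renamed"
rm -f "$demo"
```

On a real file this would be called as rename_contigs CosmicCodingMuts.vcf > CosmicCodingMuts_chr_M.vcf, replacing the awk and sed steps with a single pass. The sorting step is still needed afterwards.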

Hope this helps

7.3 years ago by erwan.scaon ▴ 940

1

If anyone wants to do this in R, it is a lot easier:

library(VariantAnnotation)

## import cosmic vcf file
cos.ens <- readVcf('./data/CosmicCodingMuts.vcf')
## check current chr names
seqinfo(cos.ens)

## set UCSC compatible chromosome names
seqlevelsStyle(cos.ens) <- 'UCSC'
## check new chr names
seqinfo(cos.ens)

## export new vcf (same object that was read and renamed above)
writeVcf(cos.ens, 'data/cosmic_mut_ucsc.vcf')

6.7 years ago by benformatics 3.9k

0

Thanks for posting your troubleshooting / solution for this problem. I know I will be looking for this when we move to hg38.

7.3 years ago by dyollluap ▴ 310

0

Thank you so much for this information. I was having trouble locating the cosmic data. By the way, is there any reason for you to skip the non-coding variants? Cosmic should serve as the white list, and I believe the more confident variants we provide, the better MuTect2 can work. Please correct me if I am wrong.

7.0 years ago by DVA ▴ 630

0

Thanks a lot for posting. I have a question: why did you sort the VCF after adding the "chr" prefix, given that it was already sorted? I mean, isn't the original COSMIC VCF sorted?

5.4 years ago by changhan1110 • 0
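
Whether a given VCF is already in the order the reference expects can be tested rather than assumed. Below is a minimal sketch, assuming a Picard-style .dict file with "@SQ" lines carrying "SN:" tags, that walks the VCF records and reports any contig that is missing from the dictionary or appears earlier in the dictionary than the contig before it. The function name check_contig_order and the toy files are hypothetical. This is not a GATK or Picard tool, and it only inspects record order, not the VCF header or the index.

```shell
#!/usr/bin/env bash
# Sketch only: check_contig_order is a hypothetical helper, not a GATK or Picard command.
# usage: check_contig_order file.vcf reference.dict
check_contig_order() {
  awk -F'\t' '
    FNR == NR {                          # first file: the sequence dictionary
      if ($1 == "@SQ")
        for (i = 2; i <= NF; i++)
          if ($i ~ /^SN:/) rank[substr($i, 4)] = ++n
      next
    }
    /^#/       { next }                  # skip VCF header lines
    $1 == prev { next }                  # still on the same contig
    !($1 in rank) { print "not in dict: " $1; bad = 1; prev = $1; next }
    rank[$1] < last { print "out of order: " $1; bad = 1 }
    { last = rank[$1]; prev = $1 }
    END { exit bad }
  ' "$2" "$1"
}

# Toy files (made-up contigs and records) to show the behaviour.
tmp=$(mktemp -d)
printf '@HD\tVN:1.5\n@SQ\tSN:chr1\tLN:1000\n@SQ\tSN:chr2\tLN:1000\n@SQ\tSN:chrM\tLN:1000\n' > "$tmp/ref.dict"
printf '#CHROM\tPOS\n%s\t1\n%s\t1\n%s\t1\n' chr1 chr2 chrM > "$tmp/sorted.vcf"
printf '#CHROM\tPOS\n%s\t1\n%s\t1\n%s\t1\n' chr1 chrM chr2 > "$tmp/unsorted.vcf"

sorted_report=$(check_contig_order "$tmp/sorted.vcf" "$tmp/ref.dict");     sorted_status=$?
unsorted_report=$(check_contig_order "$tmp/unsorted.vcf" "$tmp/ref.dict"); unsorted_status=$?
echo "sorted.vcf: status $sorted_status"
echo "unsorted.vcf: status $unsorted_status, $unsorted_report"
rm -rf "$tmp"
```

Running the same function on the renamed COSMIC file and Homo_sapiens_assembly38.dict, before and after SortVcf, would show whether the sort changes anything for a particular COSMIC release.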

score 5 · Answer 2 · 2017-04-12

Indeed, you should use both coding and non-coding variants.

In our MuTect2 project, we ended up using the hg19 COSMIC files for both coding and non-coding variants:

Retrieve files

sftp "erwan.scaon@unilim.fr"@sftp-cancer.sanger.ac.uk;
get files/grch37/cosmic/v79/VCF/CosmicCodingMuts.vcf.gz
get files/grch37/cosmic/v79/VCF/CosmicNonCodingVariants.vcf.gz

Make them "hg19-compatible"

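# decompress the downloaded files first
gunzip CosmicCodingMuts.vcf.gz CosmicNonCodingVariants.vcf.gz;
# append the non-coding records (no header lines) to the coding file
grep '^[^#]' CosmicNonCodingVariants.vcf >> CosmicCodingMuts.vcf;
mv CosmicCodingMuts.vcf cosmic_coding_and_noncoding.vcf;
# add the "chr" prefix, rename chrMT to chrM, then sort against the reference dictionary
awk '{if($0 !~ /^#/) print "chr"$0; else print $0}' cosmic_coding_and_noncoding.vcf > cosmic_coding_and_noncoding_chr.vcf;
sed 's/chrMT/chrM/g' cosmic_coding_and_noncoding_chr.vcf > cosmic_coding_and_noncoding_chr_M.vcf;
java -jar $PICARD SortVcf I=cosmic_coding_and_noncoding_chr_M.vcf O=cosmic_coding_and_noncoding_chr_M_sorted.vcf SEQUENCE_DICTIONARY=../GATK_bundle_hg38/Homo_sapiens_assembly38.dict;
# remove the index written by Picard so that it gets regenerated
rm cosmic_coding_and_noncoding_chr_M_sorted.vcf.idx;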
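
A small sanity check on the merge step can catch a truncated or doubled append. The sketch below, with hypothetical helper names and toy files, merges into a new file instead of appending in place, then compares record counts. Note that with this kind of merge the header of the second file is discarded, so any INFO field declared only there would be undeclared in the merged header.

```shell
#!/usr/bin/env bash
# Sketch only: both helpers are hypothetical, not part of the original commands.
# usage: merge_vcf_records first.vcf second.vcf > merged.vcf
merge_vcf_records() {
  cat "$1"               # header and records of the first file
  grep '^[^#]' "$2"      # records only from the second file
}
count_records() { grep -c '^[^#]' "$1"; }

# Toy files (made-up records) to show the behaviour.
tmp=$(mktemp -d)
printf '#CHROM\tPOS\n1\t100\n' > "$tmp/coding.vcf"
printf '#CHROM\tPOS\n1\t50\n2\t70\n' > "$tmp/noncoding.vcf"
merge_vcf_records "$tmp/coding.vcf" "$tmp/noncoding.vcf" > "$tmp/merged.vcf"

n_coding=$(count_records "$tmp/coding.vcf")
n_noncoding=$(count_records "$tmp/noncoding.vcf")
n_merged=$(count_records "$tmp/merged.vcf")
n_headers=$(grep -c '^#' "$tmp/merged.vcf")
echo "records: $n_coding + $n_noncoding = $n_merged (header lines: $n_headers)"
rm -rf "$tmp"
```

If the merged count differs from the sum of the two inputs, the append went wrong before any renaming or sorting was done.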