Hi,
I am trying to do somatic variant calling for some tumor samples, but I keep running into errors and am confused about which resource files to use.
My reads were aligned to Ensembl's GRCh38 reference genome, and I am using GATK's Mutect2 for the somatic variant calling. As recommended on their site, I used GATK's provided 1000g_pon.hg38.vcf and af-only-gnomad files for the --panel-of-normals (-pon) and --germline-resource arguments respectively. The issue is that, because my reads were aligned to Ensembl's genome, the ##contig headers of the PON/gnomAD files do not match the input BAM files (GATK errors out with a message about reference and feature contigs not matching).
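The mismatch described above can be confirmed before running Mutect2 by comparing the contig IDs declared in a VCF header with the names in the reference's .fai index. This is a minimal sketch; the function names are made up for illustration, and it assumes the header lines and index lines have already been read into lists of strings.

```python
import re

def vcf_contigs(header_lines):
    """Return the contig IDs declared in ##contig header lines, in order."""
    ids = []
    for line in header_lines:
        m = re.match(r"##contig=<ID=([^,>]+)", line)
        if m:
            ids.append(m.group(1))
    return ids

def fai_contigs(fai_lines):
    """Return contig names from .fai index lines (first tab-separated column)."""
    return [line.split("\t", 1)[0] for line in fai_lines if line.strip()]

def missing_from_reference(vcf_ids, ref_ids):
    """Return the VCF contigs that the reference does not know about."""
    ref = set(ref_ids)
    return [c for c in vcf_ids if c not in ref]

# Toy input: a UCSC-style VCF header against an Ensembl-style index.
header = ["##fileformat=VCFv4.2", "##contig=<ID=chr1,length=248956422>"]
index = ["1\t248956422\t56\t60\t61\n"]
print(missing_from_reference(vcf_contigs(header), fai_contigs(index)))
```

With the toy input, chr1 is reported as missing because the reference calls the same sequence 1.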
I've been looking for Ensembl-compatible files, and so far I have only found that Ensembl provides a 1000GENOME_phase3.vcf (from here), which I guess would be used for the --panel-of-normals parameter?
I haven't been able to find an alternative for the germline resource, though. I was looking into NCBI's common_all.vcf, which is described as a resource for common germline variants; would it work for that? At least Mutect2's manual says any VCF containing the "AF" INFO field is valid, though I don't know if there is a commonly used alternative to the gnomAD one. Alternatively, I could rename all the contigs in 1000g_pon.hg38.vcf, but I don't know if that would be more troublesome or lead to errors.
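For reference, the renaming option could look roughly like the sketch below. It assumes the GATK files use UCSC-style names (chr1 ... chr22, chrX, chrY, chrM) and that the Ensembl reference uses 1 ... 22, X, Y, MT. The mapping and the function name are illustrative and cover only the primary contigs; anything outside the mapping (alt, decoy, or unplaced contigs) is left untouched and would still need handling. Existing tools such as bcftools annotate --rename-chrs do the same job from a two-column mapping file.

```python
import re

# Assumed UCSC -> Ensembl mapping for primary contigs only.
PRIMARY = {f"chr{i}": str(i) for i in range(1, 23)}
PRIMARY.update({"chrX": "X", "chrY": "Y", "chrM": "MT"})

def rename_vcf_line(line, mapping=PRIMARY):
    """Rename the contig in a ##contig header line or in a record's CHROM column."""
    if line.startswith("##contig=<ID="):
        # re.S lets the last group keep the trailing newline.
        m = re.match(r"(##contig=<ID=)([^,>]+)(.*)", line, re.S)
        prefix, name, rest = m.groups()
        return prefix + mapping.get(name, name) + rest
    if line.startswith("#"):
        return line  # other header lines are passed through unchanged
    chrom, sep, rest = line.partition("\t")
    return mapping.get(chrom, chrom) + sep + rest

print(rename_vcf_line("chr1\t100\t.\tA\tT\t.\t.\tAF=0.1\n"), end="")
```

The function works line by line, so it can be applied while streaming a decompressed VCF; the output would then need to be compressed and indexed again before GATK can use it.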
I guess I am somewhat confused as to which files are available and can be used for each of those parameters. I've looked at some previous posts, but haven't been able to find a concrete answer on what the ideal approach is.
I have also found this on Ensembl's FTP, which seems to have per-chromosome gnomAD files but no "combined" file similar to the one included in GATK's resource bundle. The file sizes are also vastly different (GATK's af-only-gnomad is ~3GB total, whereas Ensembl's files are ~1-2GB per chromosome).
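As its name suggests, the GATK file keeps only the allele frequency annotation, which may account for part of the size gap. If the per-chromosome files were to be reduced in the same way, a sketch could look like the one below. It assumes the frequency is stored under the INFO key AF in those files, which has not been verified here; the function name is illustrative.

```python
def keep_only_af(line):
    """Reduce a VCF record's INFO column to its AF entry ('.' if absent)."""
    if line.startswith("#"):
        return line  # header lines are passed through unchanged
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return line  # not a full VCF record; leave it alone
    af = [kv for kv in fields[7].split(";") if kv.startswith("AF=")]
    fields[7] = af[0] if af else "."
    return "\t".join(fields) + "\n"

print(keep_only_af("1\t100\trs1\tA\tT\t.\tPASS\tAC=5;AF=0.01;AN=500\n"), end="")
```

The header's ##INFO definitions for the dropped keys are left in place by this sketch; they are unused after stripping but would be tidier to remove.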
Another question is whether to use gnomAD's exomes or genomes files. It seems that gnomAD v3 doesn't have exome-based data available.
Any help is appreciated!