Liftover vcf file from hs37d5 assembly to b37 assembly
10 weeks ago
nhaus ▴ 60

Hello,

I have a VCF file of mutations that was generated using the GATK variant calling workflow, with hs37d5 as the reference assembly. The problem is that all GATK reference resources use the b37 assembly, and if I simply use them, my script fails, because for some variants (less than 0.1%) there is a mismatch between the b37 and hs37d5 reference genomes. So my idea was to simply remap the variants in the VCF file to b37. I planned on using something like CrossMap, but no chain files are available for my reference assemblies.

Does anyone have an idea how I can remap the variants from my hs37d5 vcf file to the b37 assembly without the use of chain files, or any other suggestions?

I would greatly appreciate them!

Cheers

don't you just have to rename the chromosomes (if needed) and discard the chromosomes that are not present in the other reference ?
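A minimal sketch of that rename-and-discard idea, in plain Python with no VCF library. The function name `filter_vcf_contigs`, the `rename` mapping, and the toy records are hypothetical, and it assumes an uncompressed, tab-separated VCF read as text lines:

```python
def filter_vcf_contigs(vcf_lines, target_contigs, rename=None):
    """Keep only records (and ##contig headers) whose contig exists in the target reference.

    vcf_lines      -- iterable of VCF text lines
    target_contigs -- set of contig names present in the target reference
    rename         -- optional dict mapping source contig names to target names
    """
    rename = rename or {}
    prefix = "##contig=<ID="
    kept = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if line.startswith(prefix):
            # Header contig line, e.g. ##contig=<ID=1,length=249250621>
            contig = line[len(prefix):].rstrip(">").split(",")[0]
            new_name = rename.get(contig, contig)
            if new_name in target_contigs:
                kept.append(line.replace("ID=" + contig, "ID=" + new_name, 1))
            continue
        if line.startswith("#"):
            # All other header lines pass through unchanged.
            kept.append(line)
            continue
        fields = line.split("\t")
        new_name = rename.get(fields[0], fields[0])
        if new_name in target_contigs:
            fields[0] = new_name
            kept.append("\t".join(fields))
    return kept


# Toy example: drop the decoy contig, keep chromosome 1.
example_lines = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=1,length=249250621>",
    "##contig=<ID=hs37d5,length=35477943>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "1\t100\t.\tA\tG\t.\t.\t.",
    "hs37d5\t5\t.\tC\tT\t.\t.\t.",
]
filtered = filter_vcf_contigs(example_lines, {"1"})
```

This only handles contig names; it does not touch records whose REF allele disagrees with the target reference.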

Unfortunately not... Very rarely, they also differ in the nucleotide sequence. Because I am working with WGS data, these events do occur and cause my script to crash, because the "REF" in my VCF file does not match the "REF" of the genome assembly I provide.

My idea was to use a simple python script to manually change the REF nucleotides where a mismatch occurs, but it feels kinda wrong to manually change nucleotides...
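Before changing anything, it may help to just list the affected records. The following is a rough sketch, not a tested pipeline step: `find_ref_mismatches` is a hypothetical helper, and the reference is assumed to be an in-memory dict of contig name to sequence (for a real genome one would more likely fetch the bases from an indexed FASTA). It reports mismatches and deliberately does not rewrite REF.

```python
def find_ref_mismatches(vcf_lines, reference):
    """Return (chrom, pos, vcf_ref, reference_bases) for records whose REF
    allele disagrees with the reference sequence.

    reference -- dict mapping contig name to its sequence string
                 (assumption: small enough to hold in memory)
    """
    mismatches = []
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref = line.rstrip("\n").split("\t")[:4]
        start = int(pos) - 1  # VCF POS is 1-based, Python slicing is 0-based
        # Raises KeyError if the contig is missing from the reference.
        expected = reference[chrom][start:start + len(ref)].upper()
        if expected != ref.upper():
            mismatches.append((chrom, int(pos), ref, expected))
    return mismatches


# Toy example with a made-up 8 bp "chromosome".
toy_reference = {"1": "ACGTACGT"}
toy_records = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "1\t2\t.\tC\tT",   # matches the reference
    "1\t4\t.\tA\tG",   # reference has T here
    "1\t5\t.\tAC\tA",  # multi-base REF, matches
]
found = find_ref_mismatches(toy_records, toy_reference)
```

With the list in hand, the mismatching records could be dropped or written to a separate file for inspection rather than silently edited.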

"Unfortunately not... Very rarely,"

hs37d5: includes data from GRCh37, the rCRS mitochondrial sequence, Human herpesvirus 4 type 1, and the concatenated decoy sequences.

b37: includes data from GRCh37, the rCRS mitochondrial sequence, and the Human herpesvirus 4 type 1.

Could you elaborate on what you mean? I also thought that they share the same sequence for the autosomes, but there are definitely some positions where they differ (at least in the versions that I am using). I also found this on the GATK page:

For b37:

These alterations largely consist of contig name changes, however there are known sequence differences on some contigs as well.

