How to liftover SNP positions from one genome to another genome?
1
0
Entering edit mode
9 months ago

Hello,

Do you know a good way to map contigs/snp-regions from one genome to another?

At the moment, I would like to check the positions of the SNPs that I obtained from one reference genome on the new preliminary scaffolded reference genome. I tried so far a few tools I would normally use, but they fail with a genome size of 33GB. BLAST cannot handle long sequences (the chromosomes) when creating a BLAST-db. minimap2 gives an error, and bwa-mem2 gives a memory error. Do you have some suggestions?

Thank you.

liftover SNP VCF alignment • 1.3k views
ADD COMMENT
1
Entering edit mode
9 months ago
DBScan ▴ 300

There are many tools available, see for instance Crossmap and GATK LiftoverVCF.

ADD COMMENT
0
Entering edit mode

Thank you. Crossmap indeed looks promising.

ADD REPLY
0
Entering edit mode

I used CrossMap and it's pretty straightforward. There was a gotcha with chromosome names, when it was renaming them in the header, but not in the body of the VCF, so bcftools afterwards complains about undefined contigs and generally won't touch the file. I fixed it by running CrossMap again with --chromid s which worked fine for my file (it was human data lifted over between hg38 -> hg19).

Also don't bother with compression, CrossMap just calls gzip - it's better to compress your output with bgzip, so you can index it for quick random access ;) Look at tabix if you are not familiar with this.

I picked CrossMap after seeing it was used by the 1000 genomes folks to compare their latest data (high-coverage resequencing of phase 3 data mapped to hg38) with their older variant calls (most used hg19). I thought if it was good enough for them, it should be good enough for me too :D

ADD REPLY
0
Entering edit mode

I am working with plant genomes and this tool uses a chain, also GATK as well. Is there any alternative tool that you know of?

ADD REPLY
0
Entering edit mode

Hmm, first idea was minimap2 but you tried that - what error you got there btw? There is another similar tool LASTZ you may wanna try. But I think even standard approaches to compare assemblies exist.

I'd probably compare assemblies using various stats and then proceed to denovo variant calling on the new reference, rather than trying liftover - you can always lose information during liftover, while denovo calling is gonna get you the maximum information present in your data IMHO.

ADD REPLY
0
Entering edit mode

The error is related to memory usage since it is a 33GB genome. Yes, I can see the benefit of calling new SNPs on the new genome but we created and genotyped the samples using SNPs from the previous genome. This is why I am trying to use Liftover.

ADD REPLY
0
Entering edit mode

In principle, you should get the same information no matter if you call denovo or lift over - the end result should be the same. All I'm saying is that, in practice, variant callers are waaay more tested and robust than any liftover tool I've seen.

Your SNPs probably won't have the same coordinates anyway, so you cannot compare them directly. Instead, I'd probably look at higher-level information, like heterozygosity of samples, or some PCA/MDS clustering - if these look similar enough, then your new data are probably fine too.

ADD REPLY
0
Entering edit mode

I fully agree with you and I would have done the same thing probably. But I am just following a protocol and trying to achieve something through a tool that can deal with such a huge genome.

ADD REPLY
0
Entering edit mode

I get it, but I always wondered how much information got messed up by the liftover. I did MDS on both versions, which looked the same, so I moved on for a time, but I kept wondering if some issue later creeps up. Later we abandoned that dataset in favour of a different one, for mostly unrelated reasons, although I always considered a plus that the other dataset didn't go through a liftover.

I just think I would trust the output of a variant caller more.

ADD REPLY
0
Entering edit mode

Another question - is there significant benefit in moving to the new assembly? Can you just stay with the old assembly if all your samples use it?

Or do you rather just want some sort of "dictionary" to translate between the two references? For example to move samples (and some results) to the new reference when it gets good enough?

We thought about such things in my previous team but I think it was never done even after I left. I can ask though.

ADD REPLY
1
Entering edit mode

Yes, there is a significant benefit because the new assembly is scaffolded to the chromosome level with PacBIO+Hi-C technology. And we will use this new genome in future analysis.

ADD REPLY
1
Entering edit mode

But if new assembly is substantially different than the prior assembly you should reanalyze the data.

ADD REPLY
0
Entering edit mode

Ok thank you.

ADD REPLY
1
Entering edit mode

I see, it's similar to the situation in our old team. We worked for years with transcriptome-based reference and exome-capture data, publishing several papers on the data while working on whole genome sequencing in the meantime.

My supervisor wanted to be able to translate between the two references so he doesn't need to throw out all that built-up knowledge. But in the end, it needs to be repeated on the genome assembly denovo, or else you always wonder if something went wrong/suboptimal during various conversions. I don't know how far they got after I left, but I hope they just repeated necessary stuff denovo and produce new results on the new genome instead of constantly switching between the two contexts.

ADD REPLY

Login before adding your answer.

Traffic: 1647 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6