How to determine assembly of a VCF?
21 months ago
magnolia ▴ 20

Hi,

I have some VCF files that don't contain any assembly information in the header. Is there a tool or algorithm to detect the assembly?

Thank you!

vcf • 3.0k views

This is a tricky one. The VCF does contain the REF allele, so you can check it against a candidate reference genome FASTA and see whether it matches.


Yeah, but I guess that would be hard to automate, as each VCF contains different variants. What do you think?


A script could probably be made to check many of the REF column values against a given FASTA file in bulk.
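A minimal sketch of such a bulk check, in plain Python. The function names (`load_fasta`, `ref_match_counts`) are made up for illustration and are not part of bcftools or any existing tool. It assumes the candidate FASTA fits in memory and that contig names differ at most by a `chr` prefix; for a full human genome an indexed FASTA reader would likely be a better fit.

```python
from typing import Dict, Iterable, List, Tuple


def load_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Read FASTA lines into {sequence name: uppercase sequence}."""
    chunks: Dict[str, List[str]] = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Keep only the first word of the header as the sequence name.
            name = line[1:].split()[0]
            chunks[name] = []
        elif name is not None:
            chunks[name].append(line.upper())
    return {n: "".join(parts) for n, parts in chunks.items()}


def strip_chr(name: str) -> str:
    """Treat 'chr1' and '1' as the same contig (an assumption)."""
    return name[3:] if name.startswith("chr") else name


def ref_match_counts(
    vcf_lines: Iterable[str],
    genome: Dict[str, str],
    max_records: int = 100_000,
) -> Tuple[int, int]:
    """Return (matched, checked) for REF alleles compared with the FASTA."""
    by_name = {strip_chr(n): seq for n, seq in genome.items()}
    matched = checked = 0
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 4:
            continue
        chrom, pos, ref = fields[0], int(fields[1]), fields[3].upper()
        seq = by_name.get(strip_chr(chrom))
        if seq is None:
            continue  # contig not present in this FASTA
        checked += 1
        start = pos - 1  # VCF POS is 1-based
        if seq[start:start + len(ref)] == ref:
            matched += 1
        if checked >= max_records:
            break
    return matched, checked
```

Running this once per candidate FASTA and comparing the match fractions should point to the build: the right reference should agree on nearly every record, while a wrong one will still match a fair share of single-base REF alleles by chance, so only near-complete agreement is meaningful.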


That's true. I wish there was a solid tool in bcftools or something. Thank you!


There might be some code in bcftools that can sort of do this. For example, the bcftools csq command outputs an error message related to this: https://github.com/samtools/bcftools/issues/869. I'm not sure if there is a simpler way to make it do that check.


Thank you for the idea. This may work, but I have no idea how to use it. VCF content is unpredictable. A filter that works on one file may not work on another, I think.


I wrote a program that tries to do this by hand: https://github.com/cmdcolin/vcfverifier


Do you mean the genome assembly (like GRCh37/GRCh38) or the assembler used (e.g. SPAdes)?


I meant the genome assembly (GRCh37, GRCh38, etc.).


Oh right. Well, the easy way is just to pull out some SNP rsIDs and look up their positions on dbSNP or something similar; that will quickly tell you what build it's on.
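A rough sketch of how that rsID comparison could be scripted. The function name `vote_build_by_rsid` and the `known_positions` table are hypothetical: the table has to be filled in by you from dbSNP (or another source) with the position of each rsID in each build, since nothing here queries dbSNP itself.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple

Position = Tuple[str, int]  # (chromosome, 1-based position)


def strip_chr(name: str) -> str:
    """Treat 'chr1' and '1' as the same contig (an assumption)."""
    return name[3:] if name.startswith("chr") else name


def vote_build_by_rsid(
    vcf_lines: Iterable[str],
    known_positions: Dict[str, Dict[str, Position]],
) -> Counter:
    """Count, per build, how many rsIDs sit at that build's known position.

    known_positions maps build name -> {rsID: (chrom, pos)} and is supplied
    by the caller, e.g. from a handful of manual dbSNP lookups.
    """
    votes: Counter = Counter()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue
        chrom, pos = strip_chr(fields[0]), int(fields[1])
        # The VCF ID column may hold several IDs separated by ';'.
        for rsid in fields[2].split(";"):
            if not rsid.startswith("rs"):
                continue
            for build, table in known_positions.items():
                known = table.get(rsid)
                if known is not None and (strip_chr(known[0]), known[1]) == (chrom, pos):
                    votes[build] += 1
    return votes
```

The build with the most votes is the likely one. rsIDs whose position happens to be the same in both builds vote for both and carry no information, so it helps to look up a handful of variants spread over different chromosomes.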


Yeah, that makes sense. It won't cover all cases, but hopefully it will cover enough of them. Thank you!

21 months ago
4galaxy77 2.8k

Maybe someone else has a clever idea, but I'm pretty sure that unless there is some metadata in the VCF header, it's impossible to tell the assembler.
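For the genome build (as opposed to the assembler), the header is still worth scanning before ruling it out. The sketch below looks for `##reference=` lines and for `##contig=` lines carrying an `assembly=` or `length=` attribute; the function name `header_assembly_hints` is made up for illustration. It assumes a human VCF and uses only the chromosome 1 lengths of GRCh37 (249,250,621) and GRCh38 (248,956,422) as a fingerprint.

```python
import re
from typing import Iterable, List

# Chromosome 1 length in each human build, used as a simple fingerprint.
CHR1_LENGTHS = {249250621: "GRCh37", 248956422: "GRCh38"}


def header_assembly_hints(vcf_lines: Iterable[str]) -> List[str]:
    """Collect header lines or attributes that hint at the reference build."""
    hints: List[str] = []
    for line in vcf_lines:
        line = line.strip()
        if not line.startswith("##"):
            break  # the '#CHROM' line ends the meta-information header
        if line.startswith("##reference="):
            hints.append(line[2:])
        elif line.startswith("##contig="):
            assembly = re.search(r"assembly=([^,>]+)", line)
            if assembly:
                hints.append("contig assembly=" + assembly.group(1))
            is_chr1 = re.search(r"ID=(?:chr)?1[,>]", line)
            length = re.search(r"length=(\d+)", line)
            if is_chr1 and length:
                build = CHR1_LENGTHS.get(int(length.group(1)))
                if build:
                    hints.append("chr1 length matches " + build)
    return hints
```

An empty result means the header really carries no usable hint, and the REF or rsID comparisons are the remaining options.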


Ah, I hope there can be a way. Thanks anyway.

20 months ago
cmdcolin ★ 3.8k

I wrote a program, vcfverifier, that takes --fasta and --vcf arguments and tries to verify that the REF column of the VCF matches what is in the FASTA file.

It could possibly be sped up, but it takes 23 seconds to check ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz from 1000 Genomes against hs37d5.fa.

https://github.com/cmdcolin/vcfverifier


Thank you! I'll definitely check.


This is sick. Colin, do you have one of these for variants that are part of an unknown transcript isoform?


Can you elaborate? I would be curious about ways to expand the tool.


I'll build it relatively soon, probably for this project.

VAL
