I have some VCF files that doesn't contain any assembly information in the header. Is there a tool or algorithm to detect the assembly?
this is a tricky one. the VCF does contain the REF allele, so you can check the reference genome FASTA and see if it matches
Yeah, but I guess that would be hard to automatize as each VCF contain different variants. What do you think?
a script could probably be made to check many of the REF column values against a given FASTA file in bulk
That's true. I wish there was a solid tool in bcftools or something. Thank you!
there might be some code in bcftools that can kind of do this. the bcftools csq command outputs an error message related to this for example...https://github.com/samtools/bcftools/issues/869 .... not sure if there is a simpler way to make it do that check
Thank you for the idea. This may work but have no idea how to utilize it. VCF content is unpredictable. One filter may not work on the other one I think.
I wrote a program that tries to do this by hand https://github.com/cmdcolin/vcfverifier
Do you mean genome assembly (like gr37/gr38) or the assembler used (SPAdes)?
I meant assembly (grch37, grch38 etc.)
Oh right - well the easy way is just to pull out some SNP rsids and look up the position on dbSNP or something, that will quickly tell you what build it's on.
Yeah that makes sense. It won't cover all cases but hopefully, cover enough of them. Thank you!
Maybe someone else has a clever idea, but I'm pretty sure unless there is some metadata in the vcf header, it's impossible to tell the assembler.
Ah, I hope there can be a way. Thanks anyway.
I wrote a program vcfverifier that takes in a --fasta and --vcf argument and tries to verify that the REF column of the VCF matches what is in the FASTA file
Possibly it could be sped up but it takes 23 seconds to match the ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz from 1000 genomes to hs37d5.fa
Thank you! I'll definitely check.
this is sick. Colin - do you have one of these for variants that are a part of an unknown transcript isoform?
can you elaborate? would be curious about ways to expand the tool
ill build it relatively soon proably for this project.
Login before adding your answer.
Use of this site constitutes acceptance of our User Agreement and Privacy