How to extract information from a VCF file
7.4 years ago
ron ▴ 40

Hi everyone,

I am currently working through an online Galaxy tutorial and have managed to complete some of the steps. However, the last one requires working with the final VCF file to determine the number of SNPs, INDELs, MNVs, etc. I have no idea how to do that. I have already tried a few things and googled a bit, but unfortunately I haven't gotten anywhere. I also need to get the names of the genes with the largest number of polymorphic sites.

Any ideas on how to extract this information from a VCF file? I have even tried Excel! I know that isn't very professional, besides being quite difficult and impractical (not to mention that even then I couldn't manage to filter this information).

Any ideas, please? I will certainly give them a try!

Ty!

SNP MNP INDEL Galaxy VCF • 5.3k views

I don't see the problem... I paste VCF files into Excel, and it works fine for me. However, doing something like finding the names of the genes with the largest number of polymorphic sites sounds pretty random and not very useful, so it's unlikely that there are tools for it.

If you will routinely need specific pieces of information from large VCF files, it might pay to read the VCF specification and learn a scripting language like Python so that you can write custom queries. That will greatly expand your power and the scope of what you can accomplish.
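As a rough illustration of what such a custom query might look like, here is a minimal Python sketch that tallies variant types straight from the REF and ALT columns. It assumes an uncompressed, tab-delimited VCF, and it classifies purely by allele length (equal lengths above 1 are called MNP, unequal lengths INDEL), which is a simplification of what dedicated tools do. The names `classify`, `count_variant_types`, and `demo_lines` are made up for this example, and the demo records are invented.

```python
from collections import Counter


def classify(ref, alt):
    """Classify one REF/ALT pair by allele length (a simplification)."""
    # Symbolic alleles (<DEL>), spanning deletions (*), and breakends
    if alt.startswith("<") or alt == "*" or "[" in alt or "]" in alt:
        return "OTHER"
    if len(ref) == 1 and len(alt) == 1:
        return "SNP"
    if len(ref) == len(alt):
        return "MNP"
    return "INDEL"


def count_variant_types(lines):
    """Count variant types over an iterable of VCF lines."""
    counts = Counter()
    for line in lines:
        # Skip meta-information, the column header, and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        # A record may list several comma-separated ALT alleles
        for alt in alts.split(","):
            if alt != ".":
                counts[classify(ref, alt)] += 1
    return counts


# Invented records, only to show the expected input shape
demo_lines = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.",
    "chr1\t200\t.\tAT\tGC\t50\tPASS\t.",
    "chr1\t300\t.\tA\tAT,C\t50\tPASS\t.",
]
print(count_variant_types(demo_lines))
```

For a real file, the same function can be fed an open file handle instead of a list, for example `with open("my.vcf") as handle: count_variant_types(handle)`, where `my.vcf` is a placeholder name.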


Thank you for the help, Brian Bushnell! Sure, I agree that working with VCF files in Excel is not a problem; I just wanted to do it in a more automated way with a script, to simulate working with a batch of files. For that, I should improve my Python first. I managed to find my way through Excel in the end ;-)
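For the scripted route, the gene tally from the original question could be sketched roughly as follows. This only works if the VCF has already been annotated, and it assumes SnpEff-style `ANN=` entries in the INFO column, where the gene name is the fourth `|`-separated subfield; a plain VCF from a variant caller carries no gene names at all. The function names and the demo records below are invented for illustration.

```python
from collections import Counter


def genes_at_site(info):
    """Return the set of gene names in a SnpEff-style ANN entry (assumed format)."""
    for entry in info.split(";"):
        if entry.startswith("ANN="):
            genes = set()
            # One annotation per comma; subfields are separated by '|'
            for annotation in entry[4:].split(","):
                subfields = annotation.split("|")
                if len(subfields) > 3 and subfields[3]:
                    genes.add(subfields[3])
            return genes
    return set()


def polymorphic_sites_per_gene(lines):
    """Count, for each gene, the number of VCF records annotated with it."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # no INFO column to read
        # Using a set means each site counts once per gene
        counts.update(genes_at_site(fields[7]))
    return counts


# Invented annotated records, only to show the expected input shape
demo_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50\tPASS\tDP=10;ANN=G|missense_variant|MODERATE|geneA|id1",
    "chr1\t150\t.\tC\tT\t50\tPASS\tANN=T|synonymous_variant|LOW|geneA|id1",
    "chr1\t900\t.\tG\tA\t50\tPASS\tANN=A|missense_variant|MODERATE|geneB|id2",
]
print(polymorphic_sites_per_gene(demo_lines).most_common(5))
```

To process a batch of files, the same function can be called once per file and the resulting counters added together.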

7.4 years ago

bcftools has many of the tools you might need: https://samtools.github.io/bcftools/bcftools.html

In addition, I usually do these things with quick one-liners that make heavy use of awk, sed, grep, and similar tools. If you run bcftools stats, you'll get parseable output with the counts for each type of variant in your VCF file. You can also plot that information with plot-vcfstats: https://github.com/samtools/bcftools/blob/develop/plot-vcfstats
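If you'd rather pull the counts out in Python than with awk, here is a small sketch that reads the summary-number (`SN`) lines of bcftools stats output. It assumes the stats were produced from a single VCF (for example with `bcftools stats input.vcf > stats.txt`, where both file names are placeholders) and that each `SN` line is tab-separated as tag, set id, label, value. The function name and the numbers in the sample text are invented for illustration.

```python
def parse_summary_numbers(stats_text):
    """Collect the SN (summary number) lines of bcftools stats output into a dict."""
    summary = {}
    for line in stats_text.splitlines():
        # Lines starting with '#' are comments; other tags hold other tables
        if not line.startswith("SN\t"):
            continue
        _tag, _set_id, label, value = line.split("\t")[:4]
        summary[label.rstrip(":")] = int(value)
    return summary


# Invented excerpt in the assumed layout; real output contains more sections
sample_stats = (
    "# SN\t[2]id\t[3]key\t[4]value\n"
    "SN\t0\tnumber of records:\t120\n"
    "SN\t0\tnumber of SNPs:\t100\n"
    "SN\t0\tnumber of MNPs:\t5\n"
    "SN\t0\tnumber of indels:\t15\n"
)

summary = parse_summary_numbers(sample_stats)
for label in ("number of SNPs", "number of MNPs", "number of indels"):
    print(label, summary[label])
```

With a real run you would read the text from the stats file instead, for example `parse_summary_numbers(open("stats.txt").read())`.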

;)
