How to extract complete data for specific SNPs (say about 10 of them)
2.9 years ago
tinyskinn • 0

I have some datasets from which I'd like to extract the complete sequence data for some specific SNPs. Each directory contains the following files:

.dose.vcf.gz
.phased.vcf.gz


My questions are:

1. Which of the files above are relevant for the extraction?
2. From which of these file types do I extract the particular SNPs I need?
SNP sequencing Genotyping • 1.5k views

Well, it would help a lot if you explained from where you obtained your data. Also, why do you have RNA- and ChIP-seq as tags?


Sorry, the RNA-seq and ChIP-seq tags were not meant to be included; I'll remove them. The data are output from the programs SHAPEIT and Minimac3. Hope that helps a bit :)


Helps a bit, yes! Have you read the manuals to see to what these files relate?

You can immediately view the contents of these files with BCFtools (bcftools view). Once you are more familiar with BCFtools, you will also find a way to extract your SNPs of interest.
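
For a quick look without reading the whole file, a small helper along these lines may be enough. It is only a sketch using standard tools (gzip, grep, head, cut) rather than BCFtools itself, and the function name peek_vcf is made up for illustration. It relies on bgzip-compressed VCFs also being readable with gzip -dc.

```shell
# peek_vcf FILE [N]: print the "#CHROM" column-header line and the first N
# records (default 5) of a gzip/bgzip-compressed VCF, skipping "##" meta lines.
# Only the first 12 columns are shown so that wide multi-sample files stay readable.
peek_vcf() {
    gzip -dc "$1" | grep -v '^##' | head -n "$(( ${2:-5} + 1 ))" | cut -f1-12
}

# Example call (file name taken from the question):
#   peek_vcf .phased.vcf.gz 3
```

With BCFtools, bcftools view -h prints only the header and bcftools view -H prints only the records, which gives a similar view.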


Hi Kevin, thank you for the answer. I already have BCFtools installed, but I really do not know where exactly to look in the files I posted for the complete sequence data for specific SNPs. If you have any idea, I would be very grateful. Thanks

2.9 years ago

If you have your SNP IDs in a file, MySNPs.list, you can filter your input VCFs like this:

bcftools filter --include 'ID=@MySNPs.list' .phased.vcf.gz > output.vcf
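
To sanity-check what such a filter should return, the same selection can be sketched with gzip and awk. This is not BCFtools' own implementation, only an approximation under the assumption that the ID list has one ID per line and that each VCF record carries a single ID in column 3. The function name extract_by_id is hypothetical.

```shell
# extract_by_id IDS VCF: print the VCF header plus every record whose ID
# column (field 3) exactly matches one of the IDs listed in IDS.
extract_by_id() {
    gzip -dc "$2" | awk -F'\t' '
        # First input (the ID list): strip Windows line endings, remember each ID.
        NR == FNR { sub(/\r$/, ""); if ($0 != "") wanted[$0] = 1; next }
        # Second input (the VCF on stdin): keep header lines and matching records.
        /^#/ || ($3 in wanted)
    ' "$1" -
}

# Example call (file names taken from the thread):
#   extract_by_id MySNPs.list .dose.vcf.gz > output.vcf
```

The records are printed whole, so every per-sample field on a matching line is kept. If you want compressed output from BCFtools instead, -Oz -o output.vcf.gz can replace the shell redirect.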



That was really helpful. But I have one more question: what is the relationship or difference between the files .dose.vcf.gz and .phased.vcf.gz? Is extracting the SNPs from .phased.vcf.gz enough? What about .dose.vcf.gz? I'm a newbie and trying to understand it. Thanks for the answer above!


Perhaps you should show the exact commands that you used to produce the files? Genotype dosage is a very simple metric, which is explained here: http://www.internationalgenome.org/faq/what-does-genotype-dosage-mean-phase1-integrated-call-set/

You could just literally take a look within each file to see how they differ. Also reading the manual for minimac3 would help (however, in saying this, I do not know how comprehensive it is).
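
One concrete thing to compare is the FORMAT column (column 9), which names the per-sample fields each file stores. The helper below is a rough sketch with standard tools; the name format_keys and the 1000-record cut-off are arbitrary choices for illustration. As far as I know, Minimac3 dose files usually carry a DS (dosage) field next to GT while a phased file holds GT only, but check your own files rather than relying on that.

```shell
# format_keys VCF: print each distinct FORMAT string (column 9) seen in the
# first 1000 records of a gzip/bgzip-compressed VCF.
format_keys() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { next }                          # skip all header lines
        !($9 in seen) { seen[$9] = 1; print $9 }
        ++n >= 1000 { exit }                   # a sample of records is enough
    '
}

# Example calls (file names taken from the question):
#   format_keys .dose.vcf.gz
#   format_keys .phased.vcf.gz
```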

By the way, the BCFtools command that I showed (above) assumes that the IDs in your file, MySNPs.list, have the same format as those set in your VCF's ID field. Usually, these would be dbSNP rs IDs.
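
A quick way to test that assumption is to list the IDs from your file that never appear in the VCF's ID column. The sketch below does this with gzip and awk; missing_ids is a made-up name, and it assumes one ID per line and exact string matches. If it prints every ID you supplied, the two files probably use different ID styles (for example rs IDs in one and chr:pos in the other).

```shell
# missing_ids IDS VCF: print the IDs from IDS that do not occur in the ID
# column (field 3) of the VCF. The output order is not guaranteed.
missing_ids() {
    gzip -dc "$2" | awk -F'\t' '
        # First input (the ID list): strip Windows line endings, remember each ID.
        NR == FNR { sub(/\r$/, ""); if ($0 != "") wanted[$0] = 1; next }
        # Second input (the VCF on stdin): tick off every ID that is found.
        !/^#/ && ($3 in wanted) { delete wanted[$3] }
        END { for (id in wanted) print id }
    ' "$1" -
}

# Example call (file names taken from the thread):
#   missing_ids MySNPs.list .phased.vcf.gz
```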

Also be sure of what your end goal is, i.e., what do you need to do with the sequence information for your SNPs of interest?


Those are just file names that you're giving us. Like Kevin said, you could peek into the files and find out how they were produced - you'll probably need to do both to speculate on the differences.


This post was not intended; it was the same as the one above.


Hi Kevin,

I am trying to use the command you gave above, "bcftools filter --include 'ID=@MySNPs.list' .phased.vcf.gz > output.vcf", but it gives me an error saying that mySNPs.list is not in the VCF header.

I want to extract some SNPs of interest from the dose.vcf.gz file: not only the SNP IDs, but the complete dosage information for those SNPs. In my SNP list, I give the SNP IDs only.

Do you have some suggestions?

Thanks, Elena


Hey, can you show the contents of MySNPs.list, and also provide the version of BCFtools that you are using?

The functionality for extracting variants via a file listing is given here: