Question: Filtered a .vcf file based on id of a second file
0
gravatar for cerriguglielmo
8 weeks ago by
cerriguglielmo0 wrote:

I have a .vcf file of 10 GB with different columns (pos, ref, alt, quality,ecc..) and I have a column for each patient !!! I have another file containing only the id of the patients (only one column) that I want to analyze. How can I filter the 1° file deleting the columns that don't appear in the 2° file ??? I try to use bcftools: bcftools view -S sample_file.txt file.vcf > filtered.vcf, but the result contains only 2 columns and the total number of patients can be 300.

thanks in advance

snp bcftools vcf • 156 views
ADD COMMENTlink modified 8 weeks ago • written 8 weeks ago by cerriguglielmo0

I have a column for each patient !!!

Yes, that's how VCF files work :-)

Your bcftools command looks OK, can you please paste the output of the following commands as a reply to my comment:

  1. head sample_file.txt | cat -te #if this errors out, try cat -A instead of cat -te
  2. bcftools query -l file.vcf
  3. bcftools -v
ADD REPLYlink modified 8 weeks ago • written 8 weeks ago by RamRS23k

Ok, the sample file contains the id while the file contains all data: The command that generate the file.vcf give an error "subset called for sample that does not exist in header", so I used --force-samples command 1:

##fileformat=VCFv4.1$
##filedate=2019.6.21$
##source=Minimac3$
##contig=<ID=21>$
##FILTER=<ID=GENOTYPED,Description="Marker was genotyped AND imputed">$
##FILTER=<ID=GENOTYPED_ONLY,Description="Marker was genotyped but NOT imputed">$
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">$
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Estimated Alternate Allele Dosage : [P(0/1)+2*P(1/1)]">$
##FORMAT=<ID=GP,Number=3,Type=Float,Description="Estimated Posterior Probabilities for Genotypes 0/0, 0/1 and 1/1 ">$
##INFO=<ID=AF,Number=1,Type=Float,Description="Estimated Alternate Allele Frequency">$

command 2:

003_S_1057
003_S_0908
003_S_1122
136_S_0695
136_S_0873
130_S_0886
012_S_1133
003_S_1074
037_S_0501
027_S_0074
031_S_0351
051_S_1072
052_S_0951
016_S_1117
002_S_2010
128_S_2002
127_S_0622
014_S_0658
.. and so on, it's very very long

command 3:

bcftools 1.9-204-g6244ac9
Using htslib 1.9-247-g6eacc77
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
ADD REPLYlink modified 7 weeks ago by finswimmer12k • written 8 weeks ago by cerriguglielmo0

Command 1 given by RamRS should give you also a list of sample names. Instead you are showing the header of the vcf file. Have you take accidentally the wrong file?

ADD REPLYlink written 7 weeks ago by finswimmer12k

In the second file I have only the name of the patients (id) -->sample_file.txt; the command are lunched on the complete vcf with different columns about properties.

ADD REPLYlink written 7 weeks ago by cerriguglielmo0

Please refer to specific files using the filenames you've used in the command instead of "first file", "second file" etc. Like finswimmer points out, your if command-1 (head sample_file.txt | cat -te) outputs the text you've shown above, it's not a list of samples, it's a VCF file.

Please ensure that:

  1. sample_file.txt contains one sample identifier per line
  2. file.vcf is a VCF file
  3. The sample IDs in the sample_file.txt file overlap the sample IDs in the VCF.
ADD REPLYlink written 7 weeks ago by RamRS23k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1389 users visited in the last hour