[bcftools] [linux cluster] subset vcf.gz file per snp keeping headers and columns
17 months ago
mgois

Hi there!

I have a .vcf.gz file that I can only access via the command line (on a Linux cluster). The file covers chr 2 only and is quite big, so I don't know how many columns it has. I want to extract all the column information for one SNP, whose ID I know. I also need the output file to contain the same header and all columns from the original, but with the record for only this SNP (so that I can run other code on it).

So far, the only thing that worked for selecting the snp (but doesn't keep the header or other columns) was:

bcftools query -i 'ID="snp id"' -f'[%SAMPLE\t%DS\t%REF\t%ALT\n]'  file.in.vcf.gz  > file.out.vcf.gz

I also tried:

bcftools view -i 'ID="snp id"'  <file.in.vcf.gz> -o <file.out.vcf.gz>

which returned the error:

-bash: syntax error near unexpected token `newline'

I also tried this one, with the same error:

bcftools query -i 'ID="snp id"' file.in.vcf.gz > file.out.vcf.gz

Hope you can help me figure this out. I am new to bcftools; I read the manual but couldn't find anything on this.


bcftools SNP variant calling
-bash: syntax error near unexpected token `newline'

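It's a problem with how you're invoking bcftools. There is something in the context we cannot see in the snippet you provided. UNLESS... are you really typing the angle brackets, as in <file.in.vcf.gz>? Bash treats < and > as redirection operators, so a final > with no file name after it produces exactly this error.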
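If the angle brackets were indeed typed literally, a minimal sketch of the same bcftools view call without them might look like this. The helper name extract_snp and the ID rs12345 are made up for illustration. -i, -Oz and -o are regular bcftools view options, and view writes the header unless told otherwise.

```shell
# Sketch: write the header plus the record(s) whose ID column matches one SNP.
# extract_snp is a hypothetical helper name, not part of bcftools.
extract_snp() {
    snp_id=$1    # SNP ID to keep, e.g. rs12345 (made-up ID)
    in_vcf=$2    # compressed input VCF
    out_vcf=$3   # output path
    # -i keeps records matching the expression; -Oz writes compressed VCF.
    bcftools view -i "ID=\"$snp_id\"" -Oz -o "$out_vcf" "$in_vcf"
}

# Usage (file names typed as-is, without < >):
# extract_snp rs12345 file.in.vcf.gz file.out.vcf.gz
```

Unlike bcftools query, which prints only the fields named in -f, view writes complete VCF records, so all sample columns are kept.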

17 months ago
arnstrm

If you have a gzipped VCF file, you could run this simple bash command to get the SNP you want:

zcat input.vcf.gz | awk '(/^#/ || $3=="snp-id")' > outputfile.vcf

Here zcat streams the decompressed VCF and awk prints the lines that either start with # (header lines) or whose ID (3rd column) is exactly the "snp-id" you provide. I assume this is what you want to accomplish?
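To see what this kind of filter keeps, here is a small self-contained sketch on a made-up two-record VCF. The IDs rs1 and rs2, the positions, and the temporary file names are all invented for the demo. It uses gzip -dc, which behaves like zcat, and passes the ID in with awk -v rather than hard-coding it.

```shell
# Build a tiny made-up VCF (one meta line, one column header, two records).
tmpdir=$(mktemp -d)
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n2\t100\trs1\tA\tG\t.\t.\t.\n2\t200\trs2\tC\tT\t.\t.\t.\n' | gzip > "$tmpdir/demo.vcf.gz"

# Keep header lines (starting with #) and records whose 3rd column equals the ID.
gzip -dc "$tmpdir/demo.vcf.gz" | awk -F'\t' -v id="rs2" '/^#/ || $3 == id' > "$tmpdir/rs2.vcf"

# Show the result: both header lines and the single rs2 record.
cat "$tmpdir/rs2.vcf"
```

The output here is an uncompressed VCF. If the downstream tool expects a .vcf.gz, it would need to be compressed again, for example with bgzip.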


