I have downloaded the 1000 genomes dataset from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/
I have compressed it with bgzip and am doing some QC using bcftools. I removed multiallelic SNPs and indels, and then converted to plink format. However, when I scan the .bim file, there are some lines like this:
21 rs79073988 0 43190101 C T
21 esv3647070 0 43190101 <cn2> T
I.e. lines with the same position but different IDs. I couldn't find out exactly what this means, but I am guessing that they are 'encoded structural variants', specifically copy number variants? I only want the lines containing 'rs' in my .bim file.
I am aware I could just grep out the lines with 'esv' in them, but is there a more formal way to remove structural variants using bcftools?
N.B. I tried doing something like:
bcftools filter -e 'ID~"esv"' infile.vcf.gz
But you can only use == or != on the ID column.
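For reference, here is a minimal sketch of the fallback I had in mind, working on the .bim file rather than the VCF. Instead of matching on the 'esv' prefix, it flags any record with a symbolic allele (one starting with `<`, like `<cn2>`) and writes its ID to an exclusion list. The file names `demo.bim` and `sv_ids.txt` are made up for illustration, and the test data is just the two lines shown above.

```shell
# Recreate the two example .bim lines from above in a scratch file
# (demo.bim is a hypothetical name, not my real dataset).
printf '21\trs79073988\t0\t43190101\tC\tT\n' >  demo.bim
printf '21\tesv3647070\t0\t43190101\t<cn2>\tT\n' >> demo.bim

# Columns 5 and 6 of a .bim file hold the two alleles.
# Print the variant ID (column 2) of every record where either allele
# is symbolic, i.e. starts with "<".
awk '$5 ~ /^</ || $6 ~ /^</ { print $2 }' demo.bim > sv_ids.txt

cat sv_ids.txt

# The list could then be handed to plink so that the .bed/.bim/.fam files
# stay in sync (mydata and mydata_nosv are placeholder prefixes):
# plink --bfile mydata --exclude sv_ids.txt --make-bed --out mydata_nosv
```

Editing the .bim by hand would leave it out of step with the .bed file, which is why the sketch only builds an ID list. On the VCF side, something like `bcftools view -V other` might achieve the same thing before conversion, but I have not verified how it classifies these `<CN2>`-style alleles, so treat that as an assumption.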
this worked, thanks.
I realised I could also do: