Question

removing copy number variations using bcftools


6.4 years ago

ucbtsm8 ▴ 20

I have downloaded the 1000 genomes dataset from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

I have compressed it with bgzip and am doing some QC using bcftools. I removed multiallelic SNPs and indels, then converted the result to plink format. However, when I scan the .bim file, there are some lines like this:

21 rs79073988 0 43190101 C T
21 esv3647070 0 43190101 <cn2> T

I.e. lines with the same position, but different IDs. I couldn't exactly find what this means, but I am guessing that they are 'encoded structural variants', and specifically copy number variants? I only want the lines containing 'rs' in my .bim file.

I am aware I could just grep out the lines with 'esv' in them, but is there a more formal way to remove structural variants using bcftools?

N.B. I tried doing something like:

bcftools filter -e 'ID~"esv"' infile.vcf.gz

But you can only use == or != on the ID column.

genome • 2.0k views


score 2 · Answer 1 · 2019-02-15


6.4 years ago

finswimmer 16k

You can filter on the value in the ALT column. CNVs have a symbolic ALT allele containing "CN" (for example <CN2>). You can remove them with:

$ bcftools view -e 'ALT[*]~"CN"' input.vcf.gz
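To see what an ALT match on "CN" catches, here is a minimal sketch that mimics the filter with awk on a toy VCF. It is an approximation for inspection only, not the bcftools command itself. The two records reuse the IDs and position from the question; the REF, ALT and INFO values are assumed for illustration.

```shell
#!/bin/sh
# Toy VCF: one SNP and one CNV at the same position.
# IDs and position come from the question; REF/ALT/INFO are illustrative assumptions.
vcf=$(mktemp)
{
  printf '##fileformat=VCFv4.1\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '21\t43190101\trs79073988\tT\tC\t.\tPASS\tVT=SNP\n'
  printf '21\t43190101\tesv3647070\tT\t<CN2>\t.\tPASS\tVT=SV\n'
} > "$vcf"

# Keep header lines; drop records whose ALT field (column 5) contains "CN".
kept=$(awk -F'\t' '/^#/ || $5 !~ /CN/' "$vcf")
printf '%s\n' "$kept"

rm -f "$vcf"
```

Only the rs79073988 record should survive alongside the header lines. Other symbolic alleles that do not contain "CN" would pass this particular pattern, so it is worth checking which ALT values your file actually contains.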

fin swimmer


This worked, thanks.

I realised I could also do:

bcftools view -e 'VT!="SNP"' infile.vcf.gz
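Whichever filter is used, a quick sanity check on the converted .bim file can confirm that only rs IDs remain. This is a rough sketch, assuming the ID sits in column 2 as in the lines quoted in the question; the toy file here simply reuses those two lines.

```shell
#!/bin/sh
# Toy .bim containing the two lines quoted in the question.
bim=$(mktemp)
printf '21 rs79073988 0 43190101 C T\n21 esv3647070 0 43190101 <cn2> T\n' > "$bim"

# Count variants whose ID (column 2) does not start with "rs".
# On a properly filtered file this count should be 0.
non_rs=$(awk '$2 !~ /^rs/ {n++} END {print n+0}' "$bim")
echo "non-rs variants: $non_rs"

rm -f "$bim"
```

On the unfiltered toy file this reports one non-rs variant (the esv line); pointing the awk command at a real .bim after filtering shows whether anything slipped through.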

6.4 years ago by ucbtsm8 ▴ 20