Question: removing copy number variations using bcftools
18 months ago, ucbtsm8 wrote:

I have downloaded the 1000 genomes dataset from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

I have compressed it with bgzip and am doing some QC using bcftools. I have removed multiallelic SNPs and indels, and then converted to plink format. However, when I scan the .bim file, there are some lines like this:

21 rs79073988 0 43190101 C T
21 esv3647070 0 43190101 <cn2> T

That is, lines with the same position but different IDs. I couldn't find exactly what this means, but I am guessing that the 'esv' entries are 'encoded structural variants', and specifically copy number variants. I only want the lines with 'rs' IDs in my .bim file.

I am aware I could just grep out the lines with 'esv' in them, but is there a more formal way to remove structural variants using bcftools?

N.B. I tried doing something like

bcftools filter -e 'ID~"esv"' infile.vcf.gz

but it seems you can only use == or != on the ID column.
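
The grep-style approach mentioned in the question can be sketched as follows. This is only an illustration on the plink side, not a bcftools feature, and the helper name `non_rs_ids` and the file names `data.bim`, `exclude_ids.txt` and `data_rs_only` are placeholders. It lists every variant ID in the .bim that does not start with "rs", so the list can be passed to plink's `--exclude` option, which keeps the .bed, .bim and .fam files consistent with each other.

```shell
# Print the variant IDs (column 2 of a .bim file) that do not start with "rs".
# Note: this also lists unnamed variants (ID "."), not only the esv entries.
non_rs_ids() {
    awk '$2 !~ /^rs/ { print $2 }' "$1"
}

# Hypothetical usage (file names are placeholders):
#   non_rs_ids data.bim > exclude_ids.txt
#   plink --bfile data --exclude exclude_ids.txt --make-bed --out data_rs_only
```

Editing the .bim file directly with grep would leave it out of step with the .bed file, which is why the sketch builds an exclusion list instead.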

18 months ago, finswimmer wrote:

You can filter by the value in the ALT column. CNVs have an ALT value containing "CN", such as <CN2>. You can remove them with:

$ bcftools view -e 'ALT[*]~"CN"' input.vcf.gz

fin swimmer

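
As written, the `bcftools view` command in the answer prints an uncompressed VCF to standard output. A minimal sketch of the same filter that writes a bgzip-compressed file and indexes it is shown here; the function name `drop_cnv_sites` and the file names are hypothetical, and a reasonably recent bcftools is assumed.

```shell
# Drop sites whose ALT column contains a "CN" allele, write a bgzipped VCF,
# then build a tabix index for the result.
drop_cnv_sites() {
    in_vcf=$1
    out_vcf=$2
    bcftools view -e 'ALT[*]~"CN"' -Oz -o "$out_vcf" "$in_vcf" &&
        bcftools index -t "$out_vcf"
}

# Hypothetical usage (file names are placeholders):
#   drop_cnv_sites input.vcf.gz input.noCNV.vcf.gz
```

This only removes alleles matching "CN". If the file also holds other symbolic alleles that should go, `bcftools view -v snps` restricts the output to sites typed as SNPs and may be closer to what the question asks for; it is worth checking the result on your own data.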

ucbtsm8 replied (18 months ago):

This worked, thanks.

I realised I could also do:

bcftools view -e 'VT!="SNP"' infile.vcf.gz
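
Before filtering on VT, it can help to see which values the tag actually takes in the file. A minimal sketch, assuming VT is an INFO tag as in the 1000 Genomes phase 3 VCFs; the function name `count_vt` and the file name are hypothetical.

```shell
# Tabulate the values of the INFO/VT tag: one line per distinct value,
# prefixed by the number of records carrying it.
count_vt() {
    bcftools query -f '%INFO/VT\n' "$1" | sort | uniq -c
}

# Hypothetical usage (file name is a placeholder):
#   count_vt infile.vcf.gz
```

Any value other than a plain SNP in that table marks records the VT filter would be expected to drop.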