Question: removing copy number variations using bcftools
23 months ago by
ucbtsm80
ucbtsm80 wrote:

I have downloaded the 1000 genomes dataset from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

I have converted it to bgzip format and am doing some QC using bcftools. I removed multiallelic SNPs and indels, and then converted to PLINK format. However, when I scan the .bim file, there are some lines like this:

21 rs79073988 0 43190101 C T
21 esv3647070 0 43190101 <cn2> T

That is, lines with the same position but different IDs. I couldn't find exactly what this means, but I am guessing that 'esv' stands for 'encoded structural variants', and that these are specifically copy number variants? I only want the lines containing 'rs' in my .bim file.

I am aware I could just grep out the lines with 'esv' in them, but is there a more formal way to remove structural variants using bcftools?

n.b I tried doing something like

bcftools filter -e 'ID~"esv"' infile.vcf.gz

But it seems you can only use == or != on the ID column.

23 months ago by
finswimmer14k
Germany
finswimmer14k wrote:

You can filter by the value in the ALT column. CNVs have an ALT value containing "CN". You can remove them with:

$ bcftools view -e 'ALT[*]~"CN"' input.vcf.gz
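For readers without bcftools at hand, the same idea (drop every record whose ALT column holds a symbolic allele) can be sketched with plain awk on an uncompressed VCF. This is only an illustration under stated assumptions: the helper name `drop_symbolic_alt` and the two-record demo VCF are made up here, and the check simply looks for `<` in column 5, so it removes all symbolic alleles, not just the CN ones.

```shell
# Hypothetical helper (not part of bcftools): keep header lines and any record
# whose ALT column (field 5) contains no symbolic allele such as <CN0> or <CN2>.
drop_symbolic_alt() {
  awk -F'\t' '/^#/ || $5 !~ /</' "$@"
}

# Tiny made-up VCF, for illustration only.
demo_vcf="$(mktemp)"
printf '##fileformat=VCFv4.1\n' > "$demo_vcf"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$demo_vcf"
printf '21\t43190101\trs79073988\tT\tC\t.\tPASS\tVT=SNP\n' >> "$demo_vcf"
printf '21\t43190101\tesv3647070\tT\t<CN2>\t.\tPASS\tVT=SV\n' >> "$demo_vcf"

# Prints the two header lines and the rs79073988 record only.
drop_symbolic_alt "$demo_vcf"
```

On a bgzipped file the input would have to be decompressed first (for example by piping it through `zcat`), which is one reason the bcftools command above is the more convenient route.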

fin swimmer


This worked, thanks.

I realised I could also do:

bcftools view -e 'VT!="SNP"' infile.vcf.gz
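As a rough cross-check of the VT-based filter, the sketch below keeps only records whose INFO column carries exactly `VT=SNP`. It assumes an uncompressed VCF with VT stored as an INFO tag; the helper name `keep_vt_snp` and the demo records are hypothetical.

```shell
# Hypothetical helper: keep header lines and records whose INFO column
# (field 8) contains the exact key/value pair VT=SNP.
keep_vt_snp() {
  awk -F'\t' '/^#/ || $8 ~ /(^|;)VT=SNP(;|$)/' "$@"
}

# Tiny made-up VCF, for illustration only.
vt_demo="$(mktemp)"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > "$vt_demo"
printf '21\t43190101\trs79073988\tT\tC\t.\tPASS\tAC=5;VT=SNP\n' >> "$vt_demo"
printf '21\t43190101\tesv3647070\tT\t<CN2>\t.\tPASS\tVT=SV\n' >> "$vt_demo"
printf '21\t43190200\trs0000001\tTA\tT\t.\tPASS\tVT=INDEL;AC=2\n' >> "$vt_demo"

# Prints the header line and the rs79073988 record only.
keep_vt_snp "$vt_demo"
```

Note that this filter also drops indels, unlike the ALT-based one, so the two approaches only agree once indels have already been removed.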
written 23 months ago by ucbtsm80