removing copy number variations using bcftools
5.2 years ago
ucbtsm8 ▴ 20

I have downloaded the 1000 genomes dataset from http://hgdownload.cse.ucsc.edu/gbdb/hg19/1000Genomes/phase3/

I have compressed it with bgzip and am doing some QC using bcftools. I have removed multiallelic SNPs and indels, and then converted the result to plink format. However, when I scan the .bim file, there are some lines like this:

21 rs79073988 0 43190101 C T
21 esv3647070 0 43190101 <cn2> T

That is, lines with the same position but different IDs. I couldn't find out exactly what this means, but I am guessing that they are 'encoded structural variants', and specifically copy number variants? I only want the lines containing 'rs' in my .bim file.

I am aware I could just grep out the lines with 'esv' in them, but is there a more formal way to remove structural variants using bcftools?

N.B. I tried doing something like

bcftools filter -e 'ID~"esv"' infile.vcf.gz

But you can only use == or != on the ID column.
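For reference, the grep-style fallback mentioned above could be sketched as follows. This is only an illustration: the file names `example.bim`, `keep_ids.txt` and the plink prefixes are made up, and the two input lines are the ones quoted in the question. Rather than editing the .bim directly (which would leave it out of step with the .bed file), the sketch builds a list of rs IDs that plink's `--extract` option can use.

```shell
#!/bin/sh
# Two example .bim lines copied from the question (file name is hypothetical).
printf '21\trs79073988\t0\t43190101\tC\tT\n21\tesv3647070\t0\t43190101\t<cn2>\tT\n' > example.bim

# Column 2 of a .bim file holds the variant ID; keep only IDs starting with "rs".
awk '$2 ~ /^rs/ { print $2 }' example.bim > keep_ids.txt

# Show the IDs that would be retained.
cat keep_ids.txt

# The ID list can then be passed to plink so .bed/.bim/.fam stay consistent
# (prefix names are placeholders, so the command is left commented out):
# plink --bfile mydata --extract keep_ids.txt --make-bed --out mydata_rs_only
```

Filtering on the ID prefix is a convenience rather than a formal definition of a structural variant, so it is worth checking the excluded IDs before relying on it.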

genome • 1.7k views
5.2 years ago

You can filter by the value in the ALT column. In the VCF, CNVs have an ALT value containing "CN". You can remove them with:

$ bcftools view -e 'ALT[*]~"CN"' input.vcf.gz

fin swimmer
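A minimal sketch of how the filter above might be used to write a compressed, indexed output file. The toy VCF, its REF/ALT alleles, and all file names are assumptions made up for illustration (only the two IDs and the position come from the question), and the bcftools step is skipped if bcftools is not installed.

```shell
#!/bin/sh
# Toy VCF for illustration only; alleles and header lines are assumed.
{
  printf '##fileformat=VCFv4.2\n'
  printf '##contig=<ID=21>\n'
  printf '##ALT=<ID=CN2,Description="Copy number allele: 2 copies">\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n'
  printf '21\t43190101\trs79073988\tT\tC\t.\t.\t.\n'
  printf '21\t43190101\tesv3647070\tT\t<CN2>\t.\t.\t.\n'
} > toy.vcf

kept_ids=""
if command -v bcftools >/dev/null 2>&1; then
  # Exclude records whose ALT alleles match "CN"; write a bgzipped VCF (-Oz).
  bcftools view -e 'ALT[*]~"CN"' -Oz -o toy.no_cnv.vcf.gz toy.vcf
  # Build a tabix index so the output can be queried by region later.
  bcftools index -t toy.no_cnv.vcf.gz
  # List the IDs that survived the filter.
  kept_ids=$(bcftools query -f '%ID\n' toy.no_cnv.vcf.gz)
  echo "$kept_ids"
else
  echo "bcftools not found; skipping the filtering step" >&2
fi
```

Listing the surviving IDs with `bcftools query` is a quick way to confirm that no `esv` records remain before converting to plink format.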


This worked, thanks.

I realised I could also do:

bcftools view -e 'VT!="SNP"' infile.vcf.gz