Question: How to remove duplicate sites and inconsistent sites from a VCF?
12 months ago, Sharon wrote:

Hi Everyone

I am trying to use the Michigan Imputation Server. I ran checkVCF first to avoid failures on the server.

python checkVCF.py -r checkVCF/hs37d5.fa -o test chr3.vcf

It reported some duplicated sites and inconsistent reference sites.

> checkVCF.py -- check validity of VCF file for meta-analysis
> version 1.4 (20140115)
> contact zhanxw@umich.edu or dajiang@umich.edu for problems.
> Python version is [ 2.7.5.final.0 ]
> Begin checking vcfFile [ chr3.vcf ]
> Duplicated site [ 3:14187449 ]
> Duplicated site [ 3:21307401 ]
> Duplicated site [ 3:38608045 ]
> Duplicated site [ 3:39146429 ]
> Duplicated site [ 3:41912651 ]
> [ 10000 ] lines processed
> Duplicated site [ 3:48618728 ]
> Duplicated site [ 3:79399575 ]
> Duplicated site [ 3:95176677 ]
> Duplicated site [ 3:96472739 ]
> Duplicated site [ 3:99067458 ]
> [ 20000 ] lines processed
> Duplicated site [ 3:113876275 ]
> Duplicated site [ 3:120522716 ]
> Duplicated site [ 3:121633904 ]
> Duplicated site [ 3:128622922 ]
> [ 30000 ] lines processed
> Duplicated site [ 3:171926373 ]
> Duplicated site [ 3:183371250 ]
> ---------------     REPORT     ---------------
> Total [ 37146 ] lines processed
> Examine [ 33 ] VCF header lines, [ 37113 ] variant sites, [ 378 ] samples
> [ 16 ] duplicated sites
> [ 0 ] NonSNP site are outputted to [ test.check.nonSnp ]
> [ 6995 ] Inconsistent reference sites are outputted to [ test.check.ref ]
> [ 0 ] Variant sites with invalid genotypes are outputted to [ test.check.geno ]
> [ 0 ] Alternative allele frequency > 0.5 sites are outputted to [ test.check.af ]
> [ 0 ] Monomorphic sites are outputted to [ test.check.mono ]
> ---------------     ACTION ITEM     ---------------
> * Remove duplicated sites and rerun checkVCF.py
> * Read test.check.ref, for autosomal sites, make sure the you are using the forward strand
> * Upload these files to the ftp server (so we can double check): test.check.log test.check.dup test.check.noSnp test.check.ref test.check.geno test.check.af test.check.mono

How can I remove these duplicated sites and inconsistent reference sites?

I tried the following, but it seems to exclude duplicate variants rather than duplicate sites:

plink --bfile snps_filtered --list-duplicate-vars ids-only suppress-first
plink --bfile snps_filtered --exclude plink.dupvar --make-bed --out snps.DuplicatesRemoved 
plink --bfile snps_filtered --recode vcf  --snps-only just-acgt  --out snps.final

A link to where this is documented in plink would be fine too.
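One thing visible in the commands above: the third plink call reads snps_filtered again rather than snps.DuplicatesRemoved, so the exported VCF would not reflect the exclusion step. Separately, if "duplicated site" is taken to mean two or more records sharing the same CHROM and POS (an assumption about how checkVCF counts them), the removal can be sketched directly on the VCF text. The function names and the output file name below are hypothetical, and the sketch keeps the first record at each position and drops the rest, which also discards multiallelic sites that were split across several lines.

```python
def drop_duplicate_sites(lines):
    """Yield VCF lines, keeping only the first record seen at each CHROM:POS."""
    seen = set()
    for line in lines:
        if line.startswith("#"):
            yield line  # header lines are passed through untouched
            continue
        fields = line.split("\t", 2)
        if len(fields) < 3:
            yield line  # not a data record (e.g. a blank line); pass through
            continue
        site = (fields[0], fields[1])  # (CHROM, POS)
        if site in seen:
            continue  # a later record at an already-seen position is dropped
        seen.add(site)
        yield line


def dedup_vcf(in_path, out_path):
    """Write a copy of an uncompressed VCF without repeated CHROM:POS records."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(drop_duplicate_sites(src))
```

A call such as dedup_vcf("chr3.vcf", "chr3.dedup.vcf") would then produce a file to pass back through checkVCF.py. Whether keeping the first record is the right choice depends on why the duplicates exist, so it is worth looking at a few of the listed positions first.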

Thanks

Modified 10 months ago by Biostar; written 12 months ago by Sharon.
12 months ago, Kevin Blighe wrote:

I note that you are comparing to hs37d5.fa, but to which genome was the original sample aligned? It would likely be recorded in your VCF header.
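A quick way to look for that information is to pull out the header lines that usually name the reference. The sketch below assumes an uncompressed VCF and checks only the ##reference, ##contig and ##assembly meta-lines defined in the VCF specification; the function name is hypothetical, and not every tool writes these lines, so an empty result does not mean much on its own.

```python
def reference_hints(lines):
    """Return header meta-lines that usually identify the reference build."""
    hints = []
    for line in lines:
        if not line.startswith("##"):
            break  # meta-lines end at the #CHROM column header
        if line.startswith(("##reference", "##contig", "##assembly")):
            hints.append(line.rstrip("\n"))
    return hints


def print_reference_hints(vcf_path):
    """Print the reference-related header lines of an uncompressed VCF."""
    with open(vcf_path) as handle:
        for hint in reference_hints(handle):
            print(hint)
```

For the file in the question this would be called as print_reference_hints("chr3.vcf").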


Good catch. I should use GRCh37; I will check whether this removes the duplications. Thanks a lot, Kevin. Always helpful.
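To compare candidate references before rerunning the full check, the REF column can be tested against a FASTA directly. This is only a rough sketch, not what checkVCF.py does internally: it assumes an uncompressed FASTA whose record name matches the VCF chromosome name (for example "3"), loads that one chromosome into memory, and reports records whose REF allele differs from the FASTA bases at that position. Both function names are hypothetical.

```python
def load_chrom(fasta_lines, chrom):
    """Return the sequence of one FASTA record as an upper-case string."""
    seq, keep = [], False
    for line in fasta_lines:
        if line.startswith(">"):
            name = line[1:].split()  # record name is the first word after '>'
            keep = bool(name) and name[0] == chrom
        elif keep:
            seq.append(line.strip().upper())
    return "".join(seq)


def ref_mismatches(vcf_lines, chrom, seq):
    """Yield (pos, vcf_ref, fasta_ref) where the VCF REF differs from the FASTA."""
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5 or fields[0] != chrom:
            continue
        pos, ref = int(fields[1]), fields[3].upper()
        fasta_ref = seq[pos - 1 : pos - 1 + len(ref)]  # VCF POS is 1-based
        if fasta_ref != ref:
            yield pos, ref, fasta_ref
```

A large share of mismatches against one FASTA and few against another would point to a build difference, while mismatches where REF is the complement of the FASTA base would be more consistent with a strand problem.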

Reply written 12 months ago by Sharon