Question: How to remove duplicate sites and inconsistent sites from a VCF?
Asked 22 months ago by Sharon:

Hi Everyone

I am trying to use the Michigan Imputation Server. I ran checkVCF first to avoid failures on the server.

python checkVCF.py -r checkVCF/hs37d5.fa -o test chr3.vcf

It reported some duplicated sites and a large number of inconsistent reference sites:

> checkVCF.py -- check validity of VCF file for meta-analysis
> version 1.4 (20140115)
> contact zhanxw@umich.edu or dajiang@umich.edu for problems.
> Python version is [ 2.7.5.final.0 ]
> Begin checking vcfFile [ chr3.vcf ]
> Duplicated site [ 3:14187449 ]
> Duplicated site [ 3:21307401 ]
> Duplicated site [ 3:38608045 ]
> Duplicated site [ 3:39146429 ]
> Duplicated site [ 3:41912651 ]
> [ 10000 ] lines processed
> Duplicated site [ 3:48618728 ]
> Duplicated site [ 3:79399575 ]
> Duplicated site [ 3:95176677 ]
> Duplicated site [ 3:96472739 ]
> Duplicated site [ 3:99067458 ]
> [ 20000 ] lines processed
> Duplicated site [ 3:113876275 ]
> Duplicated site [ 3:120522716 ]
> Duplicated site [ 3:121633904 ]
> Duplicated site [ 3:128622922 ]
> [ 30000 ] lines processed
> Duplicated site [ 3:171926373 ]
> Duplicated site [ 3:183371250 ]
> ---------------     REPORT     ---------------
> Total [ 37146 ] lines processed
> Examine [ 33 ] VCF header lines, [ 37113 ] variant sites, [ 378 ] samples
> [ 16 ] duplicated sites
> [ 0 ] NonSNP site are outputted to [ test.check.nonSnp ]
> [ 6995 ] Inconsistent reference sites are outputted to [ test.check.ref ]
> [ 0 ] Variant sites with invalid genotypes are outputted to [ test.check.geno ]
> [ 0 ] Alternative allele frequency > 0.5 sites are outputted to [ test.check.af ]
> [ 0 ] Monomorphic sites are outputted to [ test.check.mono ]
> ---------------     ACTION ITEM     ---------------
> * Remove duplicated sites and rerun checkVCF.py
> * Read test.check.ref, for autosomal sites, make sure the you are using the forward strand
> * Upload these files to the ftp server (so we can double check): test.check.log test.check.dup test.check.noSnp test.check.ref test.check.geno test.check.af test.check.mono

How can I remove these duplicated sites and inconsistent reference sites?

I tried the following, but it seems to exclude duplicate variants rather than duplicate sites:

plink --bfile snps_filtered --list-duplicate-vars ids-only suppress-first
plink --bfile snps_filtered --exclude plink.dupvar --make-bed --out snps.DuplicatesRemoved 
plink --bfile snps_filtered --recode vcf  --snps-only just-acgt  --out snps.final

A link to where this is covered in the PLINK documentation would be fine too.

Thanks

Answer by Kevin Blighe, 22 months ago:

I note that you are comparing to hs37d5.fa, but to which genome was the original sample aligned? It would likely be recorded in your VCF header.
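One way to check this is to pull the reference-related meta lines out of the VCF header. The following is a minimal sketch, assuming the file carries `##reference`, `##contig` or `##assembly` lines (not every caller writes them); the function name `reference_header_lines` is made up for illustration.

```python
import gzip


def reference_header_lines(vcf_path):
    """Return the ##reference, ##contig and ##assembly lines of a VCF header."""
    # Handle both plain-text and gzip/bgzip-compressed VCFs.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    wanted = ("##reference", "##contig", "##assembly")
    found = []
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta lines stop at the #CHROM column header
            if line.startswith(wanted):
                found.append(line.rstrip("\n"))
    return found


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py chr3.vcf
    for header_line in reference_header_lines(sys.argv[1]):
        print(header_line)
```

If the header names a different build or FASTA than hs37d5.fa, that mismatch would be a plausible explanation for the 6995 inconsistent reference sites.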


Good catch. I should use GRCh37; I will check whether this removes the duplications. Thanks a lot, Kevin. Always helpful.

Reply written 22 months ago by Sharon
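On the duplicated sites themselves: in the PLINK commands from the question, the final `--recode vcf` step reads `snps_filtered` rather than `snps.DuplicatesRemoved`, so the exported VCF never sees the exclusion. If duplicates remain after fixing that, they can also be dropped directly from the VCF. The following is a minimal sketch, assuming checkVCF counts a site as duplicated when the same CHROM and POS appear on more than one record, and assuming an uncompressed, tab-delimited VCF. The function name `drop_duplicate_sites` and the output filename are hypothetical, and keeping the first record is a design choice, not something checkVCF prescribes.

```python
def drop_duplicate_sites(in_path, out_path):
    """Copy a VCF, keeping only the first record seen at each (CHROM, POS).

    Returns the number of records that were dropped.
    """
    seen = set()
    dropped = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)  # header lines pass through unchanged
                continue
            fields = line.split("\t", 2)
            if len(fields) < 2:
                dst.write(line)  # blank or malformed line: leave untouched
                continue
            site = (fields[0], fields[1])
            if site in seen:
                dropped += 1  # later record at an already-seen site
                continue
            seen.add(site)
            dst.write(line)
    return dropped


if __name__ == "__main__":
    import sys

    # Usage: python this_script.py chr3.vcf chr3.dedup.vcf
    n = drop_duplicate_sites(sys.argv[1], sys.argv[2])
    print("dropped", n, "duplicate records")
```

Keeping the first record is arbitrary. If the duplicated positions are really multi-allelic sites split over several lines, it may be safer to drop every record at those positions, then rerun checkVCF.py on the output as the action items suggest.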