Question: ANNOVAR is losing 886 SNPs, and I can't figure out why
devenvyas (Stony Brook) wrote, 2.3 years ago:

I have a Plink dataset of 419,102 SNPs.

I am trying to run them through ANNOVAR so I can find out which types of functional elements they fall in across the genome.

plink --bfile input --recode vcf-iid --out Ancestral_419k
convert2annovar.pl -format vcf4old Ancestral_419k.vcf -outfile Ancestral_419k.avinput

The resulting VCF file has all 419,102 SNPs (plus 28 header lines, for 419,130 lines in total).

The ANNOVAR log file states the following:

NOTICE: Read 419130 lines and wrote 417600 different variants at 418216 genomic positions (418216 SNPs and 0 indels)
NOTICE: Among 418216 different variants at 418216 positions, 111601 are heterozygotes, 305999 are homozygotes
NOTICE: Among 418216 SNPs, 340143 are transitions, 78073 are transversions (ratio=4.36)

The avinput file has 418,216 SNPs. I am not sure why 886 SNPs (419,102 minus 418,216) are being dropped in the conversion. Does anyone have an idea what is going on?

snp • 872 views
modified 2.3 years ago • written 2.3 years ago by devenvyas

show us one that's missing

written 2.3 years ago by Jeremy Leipzig
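
One way to pull out such examples is to compare positions in the VCF against positions in the avinput file. The following is only a minimal sketch: the function name `missing_sites` is hypothetical, and it assumes that chromosome names are written identically in both files and that each SNP's avinput start coordinate equals its VCF POS.

```python
def missing_sites(vcf_lines, avinput_lines):
    """Return (chrom, pos, id) for VCF records with no avinput row at that position."""
    seen = set()
    for line in avinput_lines:
        fields = line.split()
        if len(fields) >= 2:
            # avinput columns: chr, start, end, ref, alt, then optional extras
            seen.add((fields[0], fields[1]))

    missing = []
    for line in vcf_lines:
        # Skip VCF header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, snp_id = line.split("\t")[:3]
        if (chrom, pos) not in seen:
            missing.append((chrom, pos, snp_id))
    return missing


# Example use with the file names from the question:
# with open("Ancestral_419k.vcf") as vcf, open("Ancestral_419k.avinput") as av:
#     for chrom, pos, snp_id in missing_sites(vcf, av):
#         print(chrom, pos, snp_id, sep="\t")
```

If the numbers in the question hold, this should list 886 records.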

It appears that there are 886 that are completely monomorphic within my dataset, so ANNOVAR is ignoring them instead of including them...

Here are five examples: rs377583051, rs561224271, rs544889745, rs573338017, rs555351100

written 2.3 years ago by devenvyas
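
To check whether the dropped sites really are the monomorphic ones, the VCF can be scanned for records in which no sample carries a non-reference allele. This is a rough sketch, not a description of ANNOVAR's internal logic (the thread only infers that monomorphic sites are skipped). The function name `monomorphic_ids` is hypothetical, and the code assumes a standard VCF layout with a FORMAT column containing GT.

```python
def monomorphic_ids(vcf_lines):
    """Return IDs of VCF records in which no sample carries a non-reference allele."""
    ids = []
    for line in vcf_lines:
        # Skip VCF header lines and blank lines
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        snp_id = fields[2]
        gt_index = fields[8].split(":").index("GT")  # position of GT within FORMAT
        has_alt = False
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            alleles = genotype.replace("|", "/").split("/")
            # "0" is the reference allele and "." is a missing call
            if any(allele not in ("0", ".") for allele in alleles):
                has_alt = True
                break
        if not has_alt:
            ids.append(snp_id)
    return ids
```

Comparing the returned IDs with the list of sites missing from the avinput file would confirm or rule out the monomorphic explanation.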

What do you mean by monomorphic? If there is an rsID, it should correspond to at least one SNP.

written 2.3 years ago by Santosh Anand

Within my Plink dataset, all the samples carry the same allele at those sites (they are monomorphic), so ANNOVAR ignores them instead of including them in the avinput file.

Unless I add more samples that have the derived versions of those SNPs, ANNOVAR will think they are monomorphic and ignore them. There must be some way to override this.

written 2.3 years ago by devenvyas

I'm not sure if ANNOVAR looks at the sample annotation. AFAIR it derives the concordance by looking at chrom, pos, ref, and alt only.

written 2.3 years ago by Santosh Anand

Paste some lines into the main question where ANNOVAR is missing the annotation. Or better, upload part of the file somewhere and attach a link here.

written 2.3 years ago by Santosh Anand

Then something might be getting lost when Plink converts from bed/bim/fam to VCF. The bim file clearly displays both alleles (even though the derived allele is absent for 886 sites).

I created a bim for just those 886 sites, reformatted it to avinput format, and cat'ed it to the end of the actual avinput file. It's running through ANNOVAR now. Hopefully, it annotates those SNPs.

written 2.3 years ago by devenvyas
devenvyas (Stony Brook) wrote, 2.3 years ago:

So I figured out how to do this. I am writing it up as an answer so future users can refer to it if they have the same problem.

Basically, I identified the 886 SNPs that were monomorphic in my dataset (and were thus getting lost in the VCF-to-avinput conversion).

I created a bim file for the 886 SNPs and converted it to avinput format in Excel. I tacked this on to the end of the original avinput file and ran it through ANNOVAR successfully. The only issue is that the SNPs are not completely sorted by coordinate, since I tacked on those SNPs to the end.
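
The sorting issue could be handled by re-sorting the combined avinput file by chromosome and start position. This is a sketch only: the helper names `chrom_key` and `sort_avinput` are hypothetical, and the ordering rule (numeric chromosomes first in numeric order, then the remaining names alphabetically) is an assumption rather than anything ANNOVAR requires.

```python
def chrom_key(chrom):
    """Sort key: numeric chromosomes first (numerically), then others alphabetically."""
    name = chrom[3:] if chrom.lower().startswith("chr") else chrom
    if name.isdigit():
        return (0, int(name), "")
    return (1, 0, name)


def sort_avinput(lines):
    """Return non-blank avinput rows sorted by chromosome and start coordinate."""
    rows = [line.rstrip("\n") for line in lines if line.strip()]

    def row_key(row):
        fields = row.split()
        return (chrom_key(fields[0]), int(fields[1]))

    return sorted(rows, key=row_key)
```

The sorted rows can then be written back out one per line before running the annotation step.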

It may be easier in the future to just convert a bim directly into an avinput.
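
A direct bim-to-avinput conversion might look roughly like the sketch below. Several things here are assumptions, not something the thread confirms: the function name `bim_to_avinput` is hypothetical; the bim's A2 allele is treated as the reference and A1 as the alternate, which should be checked against the genome build before trusting the annotations; rows with a missing allele code ("0") or multi-base alleles are skipped; and Plink's numeric codes for sex and mitochondrial chromosomes are passed through unchanged. The rsID is appended as a sixth column, since avinput allows extra columns after the first five.

```python
def bim_to_avinput(bim_lines):
    """Convert Plink .bim rows to avinput rows: chr, start, end, ref, alt, id."""
    rows = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed rows
        # .bim columns: chromosome, variant ID, cM position, bp position, A1, A2
        chrom, snp_id, _cm, bp, a1, a2 = fields[:6]
        # Keep only simple SNPs with both alleles present (assumption: "0" = missing)
        if len(a1) != 1 or len(a2) != 1 or "0" in (a1, a2):
            continue
        # Assumption: A2 is the reference allele and A1 the alternate allele
        rows.append("\t".join([chrom, bp, bp, a2, a1, snp_id]))
    return rows
```

Because this reads alleles from the bim rather than from sample genotypes, sites that are monomorphic in the dataset are kept, as long as the bim lists both alleles.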

written 2.3 years ago by devenvyas