Question

How to convert SNP genome positions to variant identifiers and genome annotations

9.5 years ago

Tim • 0

Hi Biostars,

I would like to learn how to convert genome positions (e.g., Chr6:467841) into other useful identifiers and annotations. For example, I use vcftools to extract only SNPs in ".012" format, which also writes the site locations (i.e., genome positions) to a ".012.pos" file. I use the following command:

vcftools --vcf xxx.vcf --out SNP --remove-indels --012

This creates "SNP.012", which contains only 0, 1, and 2 values, and "SNP.012.pos", which contains the site locations, for example:

Chr1    2673
Chr1    2695
Chr1    2696
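For readers following along, here is a minimal sketch of how the two files line up: each genotype column in "SNP.012" corresponds to one line of "SNP.012.pos". It assumes the layout as I understand it from the vcftools documentation (one row per individual, a leading index column, missing genotypes written as -1). The function names and the toy data are made up for illustration.

```python
import io

def read_positions(handle):
    """Parse a .012.pos file: one 'chrom<TAB>pos' line per site."""
    sites = []
    for line in handle:
        if not line.strip():
            continue
        chrom, pos = line.split()[:2]
        sites.append((chrom, int(pos)))
    return sites

def read_genotypes(handle):
    """Parse a .012 file: one row per individual.

    Assumption: the first column is the individual's index and is dropped.
    """
    rows = []
    for line in handle:
        fields = line.split()
        if fields:
            rows.append([int(x) for x in fields[1:]])
    return rows

def genotypes_by_site(sites, rows):
    """Map each (chrom, pos) to the list of genotypes across individuals."""
    for row in rows:
        if len(row) != len(sites):
            raise ValueError("genotype row length does not match number of sites")
    return {site: [row[i] for row in rows] for i, site in enumerate(sites)}

# Toy data (hypothetical) using the three positions shown above.
pos_text = "Chr1\t2673\nChr1\t2695\nChr1\t2696\n"
geno_text = "0\t0\t1\t2\n1\t1\t1\t-1\n"

sites = read_positions(io.StringIO(pos_text))
rows = read_genotypes(io.StringIO(geno_text))
by_site = genotypes_by_site(sites, rows)
print(by_site[("Chr1", 2695)])
```

Keying the genotypes by (chromosome, position) keeps the link to the genome explicit, so any annotation found for a site can be joined straight back to its column.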

I would like to match these site locations (i.e., genome positions) to variant identifiers and genome annotations. I have had some success loading a GFF3 file (e.g., a genome annotation downloaded from NCBI) and doing left/right joins in R, but that seems somewhat ad hoc. I tried the Bioconductor packages GenomicRanges, GenomicFeatures, and biomaRt, but I couldn't find an efficient approach or established best practices. FYI, I prefer working in R/Bioconductor.
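To make the matching step concrete, here is a rough sketch of an overlap lookup between site positions and GFF3 features. It is plain Python rather than R, purely to show the logic. In Bioconductor the analogous step is usually `findOverlaps` from GenomicRanges on a GRanges object built from the positions. The function names, the toy GFF3 lines, and the choice of the `ID` attribute are assumptions for illustration, not part of any real annotation.

```python
import bisect
from collections import defaultdict

def parse_gff3(lines, feature_type="gene"):
    """Collect (start, end, attributes) per sequence ID for one feature type."""
    features = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 9 or cols[2] != feature_type:
            continue
        attrs = dict(kv.split("=", 1) for kv in cols[8].split(";") if "=" in kv)
        features[cols[0]].append((int(cols[3]), int(cols[4]), attrs))
    for seqid in features:
        features[seqid].sort(key=lambda f: f[0])
    return features

def annotate(sites, features):
    """Return {site: [IDs of features overlapping it]}.

    GFF3 coordinates are 1-based and inclusive, like VCF positions.
    """
    starts = {seqid: [f[0] for f in feats] for seqid, feats in features.items()}
    result = {}
    for chrom, pos in sites:
        feats = features.get(chrom, [])
        # Only features starting at or before pos can overlap it.
        hi = bisect.bisect_right(starts.get(chrom, []), pos)
        result[(chrom, pos)] = [f[2].get("ID", "") for f in feats[:hi] if f[1] >= pos]
    return result

# Toy annotation (hypothetical genes) and the positions from the question.
gff3_lines = [
    "##gff-version 3",
    "Chr1\ttoy\tgene\t2500\t2690\t.\t+\t.\tID=geneA;Name=toyA",
    "Chr1\ttoy\tgene\t2680\t3000\t.\t-\t.\tID=geneB;Name=toyB",
]
sites = [("Chr1", 2673), ("Chr1", 2695), ("Chr1", 2696), ("Chr6", 467841)]
hits = annotate(sites, parse_gff3(gff3_lines))
for site, ids in hits.items():
    print(site, ids)
```

Two caveats: the sequence names must match between the VCF and the annotation (NCBI GFF3 files typically use RefSeq accessions rather than names like "Chr1"), and the scan over candidate features is linear, so a proper interval index would be a better fit for a whole genome.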

Thanks!

snp vcftools genome • 2.7k views

ADD COMMENT • link updated 3.0 years ago by Ram 45k • written 9.5 years ago by Tim • 0

Ram · Answer 1 · 2016-01-18

9.5 years ago

harold.smith.tarheel ★ 5.0k

Why not use one of the available variant annotation tools, like Annovar or SnpEff, with the original VCF? Those provide information relative to known features, and have the additional advantage of classifying mutations in coding sequences (synonymous, missense, nonsense, splicing). That is impossible from your SNP.012.pos file, which lacks the nucleotide change. You can always filter the output for only SNPs.
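A small illustration of why the nucleotide change matters: the same position in the same codon can be synonymous, missense, or nonsense depending on the alternate base. This sketch assumes the standard genetic code and a codon given on the coding strand. The `classify` function is a hypothetical helper, not how Annovar or SnpEff work internally.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def classify(ref_codon, offset, alt_base):
    """Classify a single-base change at 0-based `offset` within `ref_codon`."""
    alt_codon = ref_codon[:offset] + alt_base + ref_codon[offset + 1:]
    ref_aa, alt_aa = CODON_TABLE[ref_codon], CODON_TABLE[alt_codon]
    if ref_aa == alt_aa:
        return "synonymous"
    if alt_aa == "*":
        return "nonsense"
    if ref_aa == "*":
        return "stop_lost"
    return "missense"

# Same codon (TGC, Cys), same position, three different alternate bases.
for alt in "TAG":
    print("TGC ->", "TG" + alt, classify("TGC", 2, alt))
```

A position-only file cannot distinguish these three cases, which is the reason to annotate from the original VCF, where the REF and ALT alleles are recorded.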

I had to analyze the genotype matrix ("012" format) in R and identify "important" SNPs. I simply feel there must be a straightforward way to go from a site location (genome position) to variant identifiers, gene IDs, and/or known annotations. In other words, given a list of site locations (like Chr1 2673), what's the best way to get annotations from RefSeq, Ensembl, and similar sources (downloaded in GFF3 or GTF format, or accessed via an API)? Any help would be appreciated!

Thanks for the great suggestions. I will look more into Annovar and SnpEff.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.5 years ago by Tim • 0