Question

Find SNPs in a list of genes

0

Entering edit mode

9.3 years ago

win ▴ 970

I have a list of genes such as BRCA1, BRCA2, PTEN etc and I want a list of SNPs found in each gene. How could I do that programmatically?

Thanks in advance.

SNP • 2.9k views

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by win ▴ 970

1

Entering edit mode

Could you tell us what kind of data do you have on those genes?

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by guillaume.rbt ★ 1.0k

0

Entering edit mode

Hi, I just have a list of genes like this:

BRCA1
BRCA2
PTEN

and I want a list of SNP for these genes.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by win ▴ 970

3

Entering edit mode

9.3 years ago

Alex Reynolds 35k

Programmatically, here's a demonstration of how to do this for 1000Genome SNPs on chrY.

First, start with a sorted set of genes in BED format:

$ sort-bed unsorted_genes.bed > genes.bed

(If you don't have locations of your genes, just edit your question to describe what you have, which we can probably work into a BED file.)

Then map 1000Genome SNPs to those genes:

$ wget -qO- ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/release/20130502/ALL.chrY.phase3_integrated.20130502.genotypes.vcf.gz \
    | gunzip --stdout - \
    | convert2bed -i vcf - \
    | awk '{print "chr"$0;}' - \
    | bedmap --echo --echo-map-id-uniq --chrom chrY genes.bed - \
    > answer.chrY.bed

This uses convert2bed to convert VCF to sorted BED, and BEDOPS bedmap --echo-map-id-uniq to map a unique list of SNP IDs to each gene in genes.bed.

We add --chrom chrY to bedmap in order focus file operations on the part of the file we're interested in, which improves performance.

The awk statement prepends each line of the BED output with the string "chr", to reformat the chromosome name to UCSC standards, which is the naming scheme used in GENCODE records pulled in the second answer in this post. We want to make sure the chromosome names line up when mapping genes to SNPs.

This is just for chrY. You could probably parallelize this for the other human chromosomes pretty easily (see: www.biostars.org/p/63816/ for other examples):

$ parallel "wget -qO- ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/release/20130502/ALL.{}.phase3_integrated.20130502.genotypes.vcf.gz | gunzip --stdout - | convert2bed -i vcf - | awk '{print \"chr\"$0;}' | bedmap --echo --echo-map-uniq-id --chrom {} genes.bed - > answer.{}.bed"::: {1..22} X Y

Then collate the per-chromosome results into one file:

$ bedops --everything answer.*.bed > answer.bed

ADD COMMENT • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by Alex Reynolds 35k

0

Entering edit mode

I have a list of genes as follows:

BRCA1
BRCA2
PTEN

Will I be able to create a .bed from that and use it for getting variants for each gene.

ADD REPLY • link updated 2.1 years ago by Ram 43k • written 9.3 years ago by win ▴ 970

0

Entering edit mode

See the answer provided below, which should give you a list of BED elements for each gene.

ADD REPLY • link 9.3 years ago by Alex Reynolds 35k

1

Entering edit mode

9.3 years ago

Emily 23k

You can use the Ensembl Perl API.

ADD COMMENT • link 9.3 years ago by Emily 23k

score 3 · Accepted Answer · 2014-12-17

If you first need a BED file of genes (the genomic locations of your genes), then let's say you have a file called genes.txt that contains your gene names of interest:

$ cat genes.txt
BRCA1
BRCA2
PTEN
$

You can pull GENCODE records (via wget), filter them for genes (via awk), filter again for each of your gene names (via grep -f), and convert them to sorted BED (via convert2bed):

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3=="gene"' - \
    | grep -f genes.txt - \
    | convert2bed -i gff - \
    > genes.bed

No sort-bed step is required here; the default output of convert2bed is sorted (assuming BEDOPS is installed).

Once you have genes.bed, you can use it as the input for the other answer I provided in this post.