Find SNPs in a list of genes
3
0
Entering edit mode
9.3 years ago
win ▴ 970

I have a list of genes such as BRCA1, BRCA2, PTEN etc and I want a list of SNPs found in each gene. How could I do that programmatically?

Thanks in advance.

SNP • 2.9k views
ADD COMMENT
1
Entering edit mode

Could you tell us what kind of data do you have on those genes?

ADD REPLY
0
Entering edit mode

Hi, I just have a list of genes like this:

BRCA1
BRCA2
PTEN

and I want a list of SNP for these genes.

ADD REPLY
3
Entering edit mode
9.3 years ago

If you first need a BED file of genes (the genomic locations of your genes), then let's say you have a file called genes.txt that contains your gene names of interest:

$ cat genes.txt
BRCA1
BRCA2
PTEN
$

You can pull GENCODE records (via wget), filter them for genes (via awk), filter again for each of your gene names (via grep -f), and convert them to sorted BED (via convert2bed):

$ wget -qO- ftp://ftp.sanger.ac.uk/pub/gencode/Gencode_human/release_21/gencode.v21.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3=="gene"' - \
    | grep -f genes.txt - \
    | convert2bed -i gff - \
    > genes.bed

No sort-bed step is required here; the default output of convert2bed is sorted (assuming BEDOPS is installed).

Once you have genes.bed, you can use it as the input for the other answer I provided in this post.

ADD COMMENT
3
Entering edit mode
9.3 years ago

Programmatically, here's a demonstration of how to do this for 1000Genome SNPs on chrY.

First, start with a sorted set of genes in BED format:

$ sort-bed unsorted_genes.bed > genes.bed

(If you don't have locations of your genes, just edit your question to describe what you have, which we can probably work into a BED file.)

Then map 1000Genome SNPs to those genes:

$ wget -qO- ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/release/20130502/ALL.chrY.phase3_integrated.20130502.genotypes.vcf.gz \
    | gunzip --stdout - \
    | convert2bed -i vcf - \
    | awk '{print "chr"$0;}' - \
    | bedmap --echo --echo-map-id-uniq --chrom chrY genes.bed - \
    > answer.chrY.bed

This uses convert2bed to convert VCF to sorted BED, and BEDOPS bedmap --echo-map-id-uniq to map a unique list of SNP IDs to each gene in genes.bed.

We add --chrom chrY to bedmap in order focus file operations on the part of the file we're interested in, which improves performance.

The awk statement prepends each line of the BED output with the string "chr", to reformat the chromosome name to UCSC standards, which is the naming scheme used in GENCODE records pulled in the second answer in this post. We want to make sure the chromosome names line up when mapping genes to SNPs.

This is just for chrY. You could probably parallelize this for the other human chromosomes pretty easily (see: www.biostars.org/p/63816/ for other examples):

$ parallel "wget -qO- ftp://ftp-trace.ncbi.nih.gov//1000genomes/ftp/release/20130502/ALL.{}.phase3_integrated.20130502.genotypes.vcf.gz | gunzip --stdout - | convert2bed -i vcf - | awk '{print \"chr\"$0;}' | bedmap --echo --echo-map-uniq-id --chrom {} genes.bed - > answer.{}.bed"::: {1..22} X Y

Then collate the per-chromosome results into one file:

$ bedops --everything answer.*.bed > answer.bed
ADD COMMENT
0
Entering edit mode

I have a list of genes as follows:

BRCA1
BRCA2
PTEN

Will I be able to create a .bed from that and use it for getting variants for each gene.

ADD REPLY
0
Entering edit mode

See the answer provided below, which should give you a list of BED elements for each gene.

ADD REPLY
1
Entering edit mode
ADD COMMENT

Login before adding your answer.

Traffic: 2581 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6